DMA cache coherence management

Michael picture Michael · Aug 20, 2011 · Viewed 14k times · Source

My question is this: how can I determine when it is safe to disable cache snooping when I am correctly using [pci_]dma_sync_single_for_{cpu,device} in my device driver?

I'm working on a device driver for a device which writes directly to RAM over PCI Express (DMA), and am concerned about managing cache coherence. There is a control bit I can set when initiating DMA to enable or disable cache snooping during DMA, clearly for performance I would like to leave cache snooping disabled if at all possible.

In the interrupt routine I call pci_dma_sync_single_for_cpu() and ..._for_device() as appropriate, when switching DMA buffers, but on 32-bit Linux 2.6.18 (RHEL 5) it turns out that these commands are macros which expand to nothing ... which explains why my device returns garbage when cache snooping is disabled on this kernel!

I've trawled through the history of the kernel sources, and it seems that up until 2.6.25 only 64-bit x86 had hooks for DMA synchronisation. From 2.6.26 there seems to be a generic unified indirection mechanism for DMA synchronisation (currently in include/asm-generic/dma-mapping-common.h) via fields sync_single_for_{cpu,device} of dma_map_ops, but so far I've failed to find any definitions of these operations.

Answer

Niall Douglas picture Niall Douglas · Feb 27, 2012

I'm really surprised no one has answered this, so here we go on a non-Linux specific answer (I have insufficient knowledge of the Linux kernel itself to be more specific) ...

Cache snooping simply tells the DMA controller to send cache invalidation requests to all CPUs for the memory being DMAed into. This obviously adds load to the cache coherency bus, and it scales particularly badly with additional processors as not all CPUs will have a single hop connection with the DMA controller issuing the snoop. Therefore, the simple answer to "when it is safe to disable cache snooping" is when the memory being DMAed into either does not exist in any CPU cache OR its cache lines are marked as invalid. In other words, any attempt to read from the DMAed region will always result in a read from main memory.

So how do you ensure reads from a DMAed region will always go to main memory?

Back in the day before we had fancy features like DMA cache snooping, what we used to do was to pipeline DMA memory by feeding it through a series of broken up stages as follows:

Stage 1: Add "dirty" DMA memory region to the "dirty and needs to be cleaned" DMA memory list.

Stage 2: Next time the device interrupts with fresh DMA'ed data, issue an async local CPU cache invalidate for DMA segments in the "dirty and needs to be cleaned" list for all CPUs which might access those blocks (often each CPU runs its own lists made up of local memory blocks). Move said segments into a "clean" list.

Stage 3: Next DMA interrupt (which of course you're sure will not occur before the previous cache invalidate has completed), take a fresh region from the "clean" list and tell the device that its next DMA should go into that. Recycle any dirty blocks.

Stage 4: Repeat.

As much as this is more work, it has several major advantages. Firstly, you can pin DMA handling to a single CPU (typically the primary CPU0) or a single SMP node, which means only a single CPU/node need worry about cache invalidation. Secondly, you give the memory subsystem much more opportunity to hide memory latencies for you by spacing out operations over time and spreading out load on the cache coherency bus. The key for performance is generally to try and make any DMA occur on a CPU as close to the relevant DMA controller as possible and into memory as close to that CPU as possible.

If you always hand off newly DMAed into memory to user space and/or other CPUs, simply inject freshly acquired memory in at the front of the async cache invalidating pipeline. Some OSs (not sure about Linux) have an optimised routine for preordering zeroed memory, so the OS basically zeros memory in the background and keeps a quick satisfy cache around - it will pay you to keep new memory requests below that cached amount because zeroing memory is extremely slow. I'm not aware of any platform produced in the past ten years which uses hardware offloaded memory zeroing, so you must assume that all fresh memory may contain valid cache lines which need invalidating.

I appreciate this only answers half your question, but it's better than nothing. Good luck!

Niall