I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable-length records. I've got a first implementation up and running and am now looking to improve performance, particularly by doing I/O more efficiently, since the input file gets scanned many times.
Is there a rule of thumb for using `mmap()` versus reading in blocks via C++'s `fstream` library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.
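To make that plan concrete, here's a minimal sketch of the buffered-read approach. It assumes a hypothetical record format (a 4-byte length prefix followed by the payload) and a placeholder file name; the point is the carry-over of partial records between blocks, which needs no page alignment at all:

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    // "input.dat" is a placeholder; the record format (4-byte little-endian
    // length prefix + payload) is a hypothetical stand-in for yours.
    std::ifstream in("input.dat", std::ios::binary);
    if (!in) return 1;

    constexpr std::size_t kBlockSize = 4 * 1024 * 1024;  // 4 MiB per read
    std::vector<char> chunk(kBlockSize), buf;
    std::size_t records = 0;

    while (in.read(chunk.data(), static_cast<std::streamsize>(chunk.size())) ||
           in.gcount() > 0) {
        buf.insert(buf.end(), chunk.data(), chunk.data() + in.gcount());

        // Consume every complete record currently in the buffer.
        std::size_t pos = 0;
        while (buf.size() - pos >= 4) {
            std::uint32_t len;
            std::memcpy(&len, buf.data() + pos, 4);
            if (buf.size() - pos - 4 < len) break;  // partial record; read more
            // ... process the record at buf[pos + 4 .. pos + 4 + len) ...
            pos += 4 + len;
            ++records;
        }
        // Keep only the unconsumed tail for the next block.
        buf.erase(buf.begin(), buf.begin() + pos);
    }
    std::cout << records << " records\n";
}
```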
The `mmap()` code could potentially get very messy, since `mmap`'d blocks need to lie on page-sized boundaries (my understanding) and records could potentially lie across page boundaries. With `fstream`s, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page-sized boundaries.
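That said, the alignment bookkeeping for a windowed `mmap` is mechanical rather than deep: round the requested offset down to a page boundary and offset the returned pointer by the slack. A minimal POSIX sketch (file name, offset, and length are placeholders):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("input.dat", O_RDONLY);  // placeholder file name
    if (fd < 0) return 1;

    off_t offset = 123456;  // hypothetical record start within the file
    size_t length = 4096;   // hypothetical record length

    // mmap's offset argument must be a multiple of the page size, so map
    // from the enclosing page boundary and skip the slack afterwards.
    long page = sysconf(_SC_PAGESIZE);
    off_t aligned = offset - (offset % page);
    size_t slack = static_cast<size_t>(offset - aligned);

    void* base = mmap(nullptr, length + slack, PROT_READ, MAP_PRIVATE, fd, aligned);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    const char* record = static_cast<const char*>(base) + slack;
    // ... process `length` bytes at `record` ...
    (void)record;

    munmap(base, length + slack);
    close(fd);
}
```

(Records spanning window boundaries are the genuinely messy part; mapping the whole file at once, when the address space allows it, sidesteps windowing entirely.)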
How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., `mmap()` is 2x faster) or simple tests?
I was trying to find the final word on mmap / read performance on Linux, and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why `mmap` or `read` might be faster or slower.
`mmap` has more overhead than `read` (just like `epoll` has more overhead than `poll`, which has more overhead than `read`). Changing virtual memory mappings is a quite expensive operation on some processors, for the same reasons that switching between different processes is expensive.

However, a memory map keeps its pages in the disk cache for as long as you keep using them; with `read`, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to `mlock` pages just to keep them in cache, you are trying to outsmart the disk cache, and this kind of foolery rarely helps system performance.)

The discussion of mmap/read reminds me of two other performance discussions:
- Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which makes perfect sense if you know that nonblocking I/O requires making more syscalls.
- Some other network programmers were shocked to learn that `epoll` is often slower than `poll`, which makes perfect sense if you know that managing `epoll` requires making more syscalls.
Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (`MAP_SHARED` isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real-world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.
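In that spirit, a "simple test" can be crude and still instructive. Below is a minimal sketch (POSIX, sequential checksum of a whole file both ways) of the kind of thing I mean; it's illustrative only, and note that unless you drop the page cache between runs (on Linux, `echo 3 > /proc/sys/vm/drop_caches` as root), the second pass mostly measures cache hits rather than I/O:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <vector>

// Checksum a file byte-by-byte via read().
static unsigned long sum_read(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    std::vector<unsigned char> buf(1 << 20);  // 1 MiB blocks
    unsigned long sum = 0;
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0)
        for (ssize_t i = 0; i < n; ++i) sum += buf[i];
    close(fd);
    return sum;
}

// Checksum the same file via a single whole-file mmap().
static unsigned long sum_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 0; }
    auto* p = static_cast<unsigned char*>(mmap(
        nullptr, static_cast<size_t>(st.st_size), PROT_READ, MAP_PRIVATE, fd, 0));
    if (p == MAP_FAILED) { close(fd); return 0; }
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; ++i) sum += p[i];
    munmap(p, static_cast<size_t>(st.st_size));
    close(fd);
    return sum;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    using clock = std::chrono::steady_clock;
    auto ms = [](clock::duration d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    auto t0 = clock::now();
    unsigned long a = sum_read(argv[1]);
    auto t1 = clock::now();
    unsigned long b = sum_mmap(argv[1]);
    auto t2 = clock::now();
    std::printf("read: %lu in %lld ms, mmap: %lu in %lld ms\n",
                a, (long long)ms(t1 - t0), b, (long long)ms(t2 - t1));
}
```

Even then, a sequential checksum is exactly the kind of micro-benchmark the conclusion above warns about; the only numbers that settle the question come from your actual record-processing loop.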
(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)