Fastest file reading in a multi-threaded application

sud03r · May 20, 2012 · Viewed 17.5k times

I have to read an 8192x8192 matrix into memory. I want to do it as fast as possible.
Right now I have this structure:

char inputFile[8192][8192*4]; // I know the numbers are at max 3 digits
int8_t matrix[8192][8192]; // Matrix to be populated

// Read entire file line by line using fgets
while (fgets (inputFile[lineNum++], MAXCOLS, fp));

// Populate the matrix in parallel
for (t = 0; t < NUM_THREADS; t++){
    pthread_create(&threads[t], NULL, ParallelRead, (void *)t);
}

In the function ParallelRead, I parse each line, call atoi, and populate the matrix. The parallelism is line-wise: thread t parses lines t, t + NUM_THREADS, t + 2*NUM_THREADS, and so on.

On a two-core system with 2 threads, this takes:

Loading big file (fgets) : 5.79126
Preprocessing data (Parallel Read) : 4.44083

Is there a way to optimize this any further?

Answer

Hans Passant · May 20, 2012

It's a bad idea to do it this way. Threads can get you more CPU cycles if you have enough cores, but you still have only one hard disk. So threads inevitably cannot improve the speed of reading file data.

They actually make it much worse. Reading data from a file is fastest when you access the file sequentially. That minimizes the number of reader head seeks, by far the most expensive operation on a disk drive. By splitting the reading across multiple threads, each reading a different part of the file, you are making the reader head constantly jump back and forth. Very, very bad for throughput.

Use only one thread to read file data. You might be able to overlap it with some computational cycles on the file data by starting a thread once a chunk of the file data is loaded.

Do watch out for the test effect. When you re-run your program, typically after tweaking your code somewhat, it is likely that the program can find file data back in the file system cache so it doesn't have to be read from the disk. That's very fast, memory bus speed, a memory-to-memory copy. Pretty likely on your dataset since it isn't very big and easily fits in the amount of RAM a modern machine has. This does not (typically) happen on a production machine. So be sure to clear out the cache to get realistic numbers, whatever it takes on your OS.