fseek passing negative offset and SEEK_CUR

ATL picture ATL · Aug 14, 2015 · Viewed 13.4k times · Source

I have a bad performance running fseek(..) in a very big file. Every time a call fseek function, I need to move the file pointer position backward 100 bytes:

  fseek(fp, -100, SEEK_CUR);

before, I was doing this:

  fseek(fp, (index)*100, SEEK_SET); // which makes basically the same...

My question is how fseek moves the pointer through the file and sets the file pointer in an specific position.

I thought that it takes the file pointer and moves it backward, but now I think that what it really does, is

  • get the current position (cp)

  • add the negative index (p = idx + cp)

  • and move the file pointer from the beginning of the file to that position (fseek(fp, p, SEEK_SET))

Answer

Andrew Henle picture Andrew Henle · Aug 14, 2015

First, what operating system are you using? If it's Linux, run your application under strace to see what system calls it's actually making.

Second, fopen()/fseek()/fread() are the wrong tools for this access pattern. Those calls buffer the file reads - by reading ahead. That does you no good. You fseek() to offset X, whatever data is buffered is now useless, you fread() 100 bytes, and the buffered fread() reads more - probably 8 kB. You're likely reading almost every byte of the file over 80 times. You could use use setbuf() or setvbuf() to disable buffering, but then you'll be doing 100-byte reads while going through the file backwards. It should be faster, but not as fast as you can go.

To do this about as fast as you can (without getting into multithreaded and/or asynchronous IO):

  1. Use open()/pread(). You don't need to seek - pread() reads directly from an arbitrary offset.

  2. Read larger chunks - say 8192 x 100. Or even larger. Read backwards just as before, but do the buffering yourself and start at an offset in the file that is a multiple of the large size you're reading - the first read will probably have less than 819,200 bytes in it. Process the last 100 bytes in your buffer first, then work backwards through the buffer. When you've processed the first 100 bytes in your buffer, use pread() to read the previous 819,200 bytes (or even larger) from the file.

  3. Use direct IO if available. File system optimizations might try to "optimize" your access by reading ahead and placing data into the page cache - data that you've already processed. So bypass the page cache if possible (not all operating systems support direct IO, and not all filesystems on OSes that do support direct IO implement it.)

Something like this:

#define DATA_SIZE 100
#define NUM_CHUNKS (32UL * 1024UL)
#define READ_SIZE ( ( size_t ) DATA_SIZE * NUM_CHUNKS )

void processBuffer( const char *buffer, ssize_t bytes )
{
    if ( bytes <= 0 ) return;
    // process a buffer backwards...
}

void processFile( const char *filename )
{
    struct stat sb;
    // get page-aligned buffer for direct IO
    char *buffer = valloc( READ_SIZE );
    // Linux-style direct IO
    int fd = open( filename, O_RDONLY | O_DIRECT );
    fstat( fd, &sb );    
    // how many read operations?
    // use lldiv() to get quotient and remainder in one op
    lldiv_t numReads = lldiv( sb.st_size, READ_SIZE );
    if ( numReads.rem )
    {
        numReads.quot++;
    }
    while ( numReads.quot > 0 )
    {
        numReads.quot--;
        ssize_t bytesRead = pread( fd, buffer,
            READ_SIZE, numReads.quot * READ_SIZE );
        processBuffer( buffer, bytesRead );
    }
    free( buffer );
    close( fd );
}

You'll need to add error handling to that.