Python EOF for multi byte requests of file.read()

dawg · Dec 13, 2010 · Viewed 14.5k times

The Python docs on file.read() state that "An empty string is returned when EOF is encountered immediately." The documentation further states:

Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.

I believe Guido has made his view on not adding f.eof() PERFECTLY CLEAR, so I need to use the Python way!

What is not clear to ME, however, is whether receiving fewer bytes than requested from a read (but still receiving some) is a definitive test that you have reached EOF.

i.e.:

with open(filename,'rb') as f:
    while True:
        s=f.read(size)
        l=len(s) 
        if l==0: 
            break     # it is clear that this is EOF...
        if l<size:
            break      # ? Is receiving less than the request EOF???

Is it a potential error to break if you have received less than the bytes requested in a call to file.read(size)?

Answer

the wolf · Dec 13, 2010

You are not thinking with your snake skin on... Python is not C.

First, a review:

  • st=f.read() reads to EOF, or, if opened in binary mode, to the last byte;
  • st=f.read(n) attempts to read n bytes and in no case more than n bytes;
  • st=f.readline() reads a line at a time, the line ends with '\n' or EOF;
  • st=f.readlines() uses readline() to read all the lines in a file and returns a list of the lines.
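
As a quick check of the review above, here is a minimal sketch that uses an in-memory io.BytesIO as a stand-in for a real file. The sample data and the name demo_read_methods are made up for illustration. It assumes Python 3, where binary reads return bytes, so the EOF value is b'' rather than ''.

```python
import io


def demo_read_methods():
    """Exercise read(), read(n), readline() and readlines() on sample data."""
    f = io.BytesIO(b"one\ntwo\nthree")  # hypothetical stand-in for a file

    everything = f.read()    # reads to EOF
    at_eof = f.read(4)       # already at EOF: empty bytes object

    f.seek(0)                # rewind and read piecewise
    first4 = f.read(4)       # at most 4 bytes
    line = f.readline()      # next line, including its '\n'
    rest = f.readlines()     # remaining lines; the last one ends at EOF

    return everything, at_eof, first4, line, rest


if __name__ == "__main__":
    for value in demo_read_methods():
        print(repr(value))
```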

If a file read method is at EOF, it returns '' (b'' for a binary file in Python 3). The same type of EOF test is used by other "file-like" objects such as StringIO, socket.makefile, etc. A return of fewer than n bytes from f.read(n) is most assuredly NOT a dispositive test for EOF! While that code may work 99.99% of the time, it is the times it does not work that would be very frustrating to find. Plus, it is bad Python form. The only use for n in this case is to put an upper limit on the size of the return.
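
To make the point concrete, here is a small sketch of a short read that is not EOF. It assumes Python 3 and uses an OS pipe opened unbuffered (buffering=0) as a convenient example of a stream that hands back whatever is currently available; the function name short_read_demo is made up for illustration. A buffered file object may behave differently and keep reading until it has n bytes.

```python
import os


def short_read_demo():
    """Show read(10) returning fewer than 10 bytes while the stream is still open."""
    r_fd, w_fd = os.pipe()
    reader = os.fdopen(r_fd, "rb", buffering=0)  # raw, unbuffered reader
    try:
        os.write(w_fd, b"abc")
        first = reader.read(10)    # only 3 bytes are available: short read, not EOF

        os.write(w_fd, b"defgh")
        second = reader.read(10)   # more data arrived after the short read

        os.close(w_fd)             # closing the write end is what produces EOF
        w_fd = None
        last = reader.read(10)     # empty result: this is the real EOF
    finally:
        if w_fd is not None:
            os.close(w_fd)
        reader.close()
    return first, second, last


if __name__ == "__main__":
    print(short_read_demo())
```

Breaking out of the loop after the first call would have dropped the second chunk.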

What are some of the reasons the Python file-like methods return fewer than n bytes?

  1. EOF is certainly a common reason;
  2. A network socket may time out on read yet remain open;
  3. Exactly n bytes may cause a break between logical multi-byte characters (such as \r\n in text mode and, I think, a multi-byte character in Unicode) or some underlying data structure not known to you;
  4. The file is in non-blocking mode and another process begins to access the file;
  5. Temporary non-access to the file;
  6. An underlying error condition, potentially temporary, on the file, disc, network, etc.;
  7. The program received a signal, but the signal handler ignored it.
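
Given these reasons, code that really needs exactly n bytes (a fixed-size record, say) has to keep asking until it has them or until it sees the empty result. The following is only a sketch under assumptions: read_exactly and TrickleReader are hypothetical names, the stream is assumed to be in blocking mode (a non-blocking raw stream can return None, which this loop would treat as EOF), and the data is assumed to be bytes.

```python
import io


def read_exactly(infile, n):
    """Collect n bytes from infile; return fewer only if EOF arrives first."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = infile.read(remaining)
        if not chunk:              # the empty result is the only EOF signal
            break
        chunks.append(chunk)
        remaining -= len(chunk)    # a short read just means "ask again"
    return b"".join(chunks)


class TrickleReader:
    """Hypothetical stream that never returns more than max_chunk bytes per read."""

    def __init__(self, data, max_chunk=2):
        self._buf = io.BytesIO(data)
        self._max_chunk = max_chunk

    def read(self, n):
        return self._buf.read(min(n, self._max_chunk))


if __name__ == "__main__":
    stream = TrickleReader(b"abcdefg")
    print(read_exactly(stream, 5))  # full record despite the short reads
    print(read_exactly(stream, 5))  # short result, because EOF was reached
```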

I would rewrite your code in this manner:

with open(filename,'rb') as f:
    while True:
        s=f.read(max_size)
        if not s: break

        # process the data in s...
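
The same loop can also be written with the two-argument form of the built-in iter(), which keeps calling a function until it returns a sentinel value. This is a sketch assuming Python 3 and a binary stream, so the sentinel is b''; the name iter_chunks is made up for illustration.

```python
import io
from functools import partial


def iter_chunks(infile, max_size=4096):
    """Yield successive reads of up to max_size bytes until read() returns b''."""
    # iter(callable, sentinel) stops as soon as the callable returns the sentinel.
    return iter(partial(infile.read, max_size), b"")


if __name__ == "__main__":
    # io.BytesIO stands in for a file opened with mode 'rb'.
    for chunk in iter_chunks(io.BytesIO(b"abcdefg"), 3):
        print(chunk)
```

Note that the chunk length is never inspected; only the empty result ends the loop.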

Or, write a generator:

def blocks(infile, bufsize=1024):
    while True:
        try:
            data = infile.read(bufsize)
            if data:
                yield data
            else:
                # an empty result is the EOF signal
                break
        except IOError as e:
            # this except/print syntax runs on Python 2.6+ and Python 3
            print("I/O error({0}): {1}".format(e.errno, e.strerror))
            break

with open('somefile','rb') as f:
    for block in blocks(f, 2**16):
        # process a block that COULD be up to 65,536 bytes long
        pass