Python binary EOF

mekkanizer · Aug 23, 2014

I want to read through a binary file. Googling "python binary eof" led me here.

Now, the questions:

  1. Why does the container (x in the SO answer) contain not a single (current) byte but a whole bunch of them? What am I doing wrong?
  2. If that is the expected behavior and I am doing nothing wrong, how do I read a single byte? I mean, is there any way to detect EOF while reading the file with the read(1) method?

Answer

Sylvain Leroux · Aug 23, 2014

To quote the documentation:

file.read([size])

Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.

That means (for a regular file):

  • f.read(1) will return a bytes object containing either 1 byte, or 0 bytes if EOF was reached.
  • f.read(2) will return a bytes object containing either 2 bytes, or 1 byte if EOF is reached after the first byte, or 0 bytes if EOF is encountered immediately.
  • ...
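
As a quick illustration of the cases listed above, here is a minimal sketch. It assumes Python 3 and uses io.BytesIO as an in-memory stand-in for a regular file opened in 'rb' mode; the three bytes of data are made up for the example.

```python
import io

# In-memory stand-in for a 3-byte binary file (hypothetical data).
f = io.BytesIO(b"\x01\x02\x03")

first = f.read(2)    # 2 bytes are available, so 2 bytes come back
second = f.read(2)   # only 1 byte is left before EOF
third = f.read(2)    # already at EOF: an empty bytes object

print(first, second, third)
print(bool(third))   # an empty bytes object is falsy, hence "if not b"
```

The empty result is what the loops in this answer rely on to detect EOF.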

If you want to read your file one byte at a time, you will have to read(1) in a loop and test for "emptiness" of the result:

# From answer by @Daniel
with open(filename, 'rb') as f:
    while True:
        b = f.read(1)
        if not b:
            # eof
            break
        do_something(b)

If you want to read your file in chunks of, say, 50 bytes at a time, you will have to read(50) in a loop:

with open(filename, 'rb') as f:
    while True:
        b = f.read(50)
        if not b:
            # eof
            break
        do_something(b) # <- be prepared to handle a last chunk of length < 50
                        #    if the file length *is not* a multiple of 50
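
The same chunked loop can also be written more compactly with the two-argument form of iter(), which keeps calling a function until it returns a sentinel value. This is only a sketch, assuming Python 3; read_chunks is a hypothetical helper name, and io.BytesIO stands in for a real file opened with 'rb'.

```python
import io
from functools import partial

def read_chunks(f, size=50):
    """Return an iterator over chunks of at most `size` bytes, ending at EOF."""
    # iter(callable, sentinel) calls f.read(size) repeatedly and stops
    # as soon as the call returns the sentinel b"" (i.e. at EOF).
    return iter(partial(f.read, size), b"")

# 120 zero bytes as made-up file content: not a multiple of 50.
data = io.BytesIO(bytes(120))
lengths = [len(chunk) for chunk in read_chunks(data)]
print(lengths)
```

As with the explicit loop, the last chunk may be shorter than the requested size.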

In fact, you may even break one iteration sooner:

with open(filename, 'rb') as f:
    while True:
        b = f.read(50)
        do_something(b) # <- be prepared to handle a last chunk of size 0
                        #    if the file length *is* a multiple of 50
                        #    (incl. 0 byte-length file!)
                        #    and be prepared to handle a last chunk of length < 50
                        #    if the file length *is not* a multiple of 50
        if len(b) < 50:
            break

Concerning the other part of your question:

Why does the container [..] contain [..] a whole bunch of them [bytes]?

Referring to that code:

for x in file:
    i = i + 1
    print(x)

To quote again the doc:

A file object is its own iterator, [..]. When a file is used as an iterator, typically in a for loop (for example, for line in f: print line.strip()), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit when the file is open for reading (behavior is undefined when the file is open for writing).

The code above reads a binary file line by line, that is, it stops at each occurrence of the EOL character (\n). Usually, that leads to chunks of varying length, since most binary files contain occurrences of that byte at more or less random positions.
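
To see why iteration yields "a whole bunch" of bytes at a time, here is a small sketch, assuming Python 3. The byte string is invented for the example, and io.BytesIO stands in for a file opened in 'rb' mode, which is split on \n in the same way when iterated.

```python
import io

# Made-up binary content in which the byte 0x0A (\n) happens to appear.
data = io.BytesIO(b"\x00\x01\n\x02\n\n\x03\x04\x05")

# Iterating splits after every \n byte, whatever the data means.
pieces = list(data)
print(pieces)
print([len(p) for p in pieces])
```

Each piece ends wherever a \n byte happens to fall, so the piece lengths carry no meaning for binary data.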

I wouldn't encourage you to read a binary file that way. Prefer a solution based on read(size).