Python Does Not Read Entire Text File

user1297872 picture user1297872 · Mar 28, 2012 · Viewed 20.1k times · Source

I'm running into a problem that I haven't seen anyone on StackOverflow encounter or even google for that matter.

My main goal is to be able to replace occurences of a string in the file with another string. Is there a way there a way to be able to acess all of the lines in the file.

The problem is that when I try to read in a large text file (1-2 gb) of text, python only reads a subset of it.

For example, I'll do a really simply command such as:

newfile = open("newfile.txt","w")
f = open("filename.txt","r")
for line in f:
    replaced = line.replace("string1", "string2")
    newfile.write(replaced)

And it only writes the first 382 mb of the original file. Has anyone encountered this problem previously?

I tried a few different solutions such as using:

import fileinput
for i, line in enumerate(fileinput.input("filename.txt", inplace=1)
   sys.stdout.write(line.replace("string1", "string2")

But it has the same effect. Nor does reading the file in chunks such as using

f.read(10000)

I've narrowed it down to mostly likely being a reading in problem and not a writing problem because it happens for simply printing out lines. I know that there are more lines. When I open it in a full text editor such as Vim, I can see what the last line should be, and it is not the last line that python prints.

Can anyone offer any advice or things to try?

I'm currently using a 32-bit version of Windows XP with 3.25 gb of ram, and running Python 2.7

*Edit Solution Found (Thanks Lattyware). Using an Iterator

def read_in_chunks(file, chunk_size=1000): 
   while True: 
      data = file.read(chunk_size) 
      if not data: break 
      yield data

Answer

codeape picture codeape · Mar 28, 2012

Try:

f = open("filename.txt", "rb")

On Windows, rb means open file in binary mode. According to the docs, text mode vs. binary mode only has an impact on end-of-line characters. But (if I remember correctly) I believe opening files in text mode on Windows also does something with EOF (hex 1A).

You can also specify the mode when using fileinput:

fileinput.input("filename.txt", inplace=1, mode="rb")