Python read file as stream from HDFS

Charles Menguy · Sep 19, 2012 · Viewed 67.8k times

Here is my problem: I have a file in HDFS which can potentially be huge (i.e. too big to fit entirely in memory).

What I would like to do is avoid having to cache this file in memory, and only process it line by line like I would do with a regular file:

for line in open("myfile", "r"):
    # do some processing
    pass

I am looking to see if there is an easy way to get this done right without using external libraries. I can probably make it work with libpyhdfs or python-hdfs, but I'd like, if possible, to avoid introducing new dependencies and untested libraries into the system, especially since both of these don't seem heavily maintained and state that they shouldn't be used in production.

I was thinking of doing this with the standard "hadoop" command-line tools using the Python subprocess module, but I can't seem to do what I need, since there are no command-line tools that would do my processing and I would like to execute a Python function for every line in a streaming fashion.

Is there a way to apply Python functions as right operands of the pipes using the subprocess module? Or even better, open it like a file as a generator so I could process each line easily?

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
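
In other words, what I would love to end up with is something along these lines (just a rough sketch of the interface I'm after; hdfs_lines and process_line are made-up names):

# Hypothetical interface: iterate lazily over an HDFS file, like a local
# file object, without pulling the whole thing into memory first.
for line in hdfs_lines("/path/to/myfile"):
    process_line(line)  # placeholder for my actual per-line processing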

If there is another way to achieve what I described above without using an external library, I'm also pretty open.

Thanks for any help!

Answer

Keith Randall · Sep 19, 2012

You want xreadlines; it reads lines from a file without loading the whole file into memory.
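
For a plain local file that would look something like this (a quick sketch; in Python 2 you can also just iterate over the file object directly, which does the same thing):

# xreadlines() yields lines lazily instead of reading the whole file at once.
f = open("myfile", "r")
for line in f.xreadlines():
    print line
f.close()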

Edit:

Now that I see your question, you just need to get the stdout pipe from your Popen object:

import subprocess

# Stream the output of `hadoop fs -cat` and iterate over it line by line.
cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
for line in cat.stdout:
    print line
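
If you want to be a little more defensive, it's probably worth closing the pipe and checking the exit status once the loop finishes, so a failure of hadoop fs -cat (e.g. a missing path) doesn't go unnoticed (just a sketch of one way to do it):

cat.stdout.close()
# Surface failures from `hadoop fs -cat`, such as a nonexistent HDFS path.
if cat.wait() != 0:
    raise IOError("hadoop fs -cat exited with status %d" % cat.returncode)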