How to load data from HDFS sequencefile in python

Question 1

python hadoop mapreduce hive sequencefile

Terry · Nov 13, 2015 · Viewed 7.6k times · Source

Answer

Have a look at this

Run the Python script below before your MapReduce job.
input: your sequence file
output: the text file to use as input to your MapReduce job

import sys

from hadoop.io import SequenceFile

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print('usage: SequenceFileReader <filename> <output>')
        sys.exit(1)

    reader = SequenceFile.Reader(sys.argv[1])

    # Create key/value holders of the classes the reader reports for this file.
    key_class = reader.getKeyClass()
    value_class = reader.getValueClass()
    key = key_class()
    value = value_class()

    # Write each record's value as one line of text.
    f = open(sys.argv[2], 'w')
    while reader.next(key, value):
        f.write(value.toString() + '\n')
    reader.close()
    f.close()

You won't have to change your original Python file now.
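The read loop can also be wrapped in a function so the reader is closed even if a record fails part-way through. This is only a minimal sketch: `dump_values` is a hypothetical name, and it assumes nothing about the reader beyond the methods already used in the script (`getKeyClass`, `getValueClass`, `next`, `close`).

```python
def dump_values(reader, out):
    """Write each record's value to `out`, one per line; return the record count."""
    # Holders of the key/value classes the reader reports for this file.
    key = reader.getKeyClass()()
    value = reader.getValueClass()()
    count = 0
    try:
        while reader.next(key, value):
            out.write(value.toString() + '\n')
            count += 1
    finally:
        # Close the reader even if a record fails part-way through.
        reader.close()
    return count
```

With the script's library it would presumably be called as `dump_values(SequenceFile.Reader(sys.argv[1]), f)`, where `f` is the opened output file.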

Question 2

I have a MapReduce program that reads files from HDFS as below:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar -Dmapred.reduce.tasks=1000  -file $homedir/mapper.py -mapper $homedir/mapper.py -file $homedir/reducer.py -reducer $homedir/reducer.py   -input /user/data/* -output /output/ 2> output.text

One thing to confirm: the path /user/data/* contains folders, which in turn contain files. Will /user/data/* iterate over all files under all subfolders?

The HDFS text file contains one JSON string per line, so the mapper reads it as below:

for line in sys.stdin:
    try:
        object = json.loads(line)
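For reference, a line-by-line JSON reader of this shape could look like the sketch below. The snippet above is only an excerpt, so the rest is assumed: `parse_lines` is a hypothetical helper name, and skipping blank or malformed lines is an assumption, not something the original mapper is known to do.

```python
import json


def parse_lines(lines):
    """Yield one decoded JSON object per well-formed line, skipping the rest."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except ValueError:
            # Assumption: malformed lines are skipped rather than failing the task.
            continue
```

In a streaming mapper it would be driven with `for record in parse_lines(sys.stdin): ...`.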

But the owner of the HDFS data changed the files from text to SequenceFile, and I found that the MapReduce program now outputs a lot of zero-sized files, which probably means it did not successfully read the files from HDFS.

What should I change in the code so that I can read from the SequenceFile? I also have a Hive external table that performs aggregation and sorting on the output of the MapReduce job. The table was STORED AS TEXTFILE before; should I change it to STORED AS SEQUENCEFILE?

Thanks,
