Sorting by value in Hadoop from a file

Deepika Sethi · Nov 27, 2011 · Viewed 13.7k times

I have a file containing a String, then a space and then a number on every line.

Example:

Line1: Word 2
Line2: Word1 8
Line3: Word2 1

I need to sort the lines by number in descending order and write the result to a file, assigning a rank to each number. So my output should be a file in the following format:

Line1: Word1 8 1
Line2: Word 2 2
Line3: Word2 1 3

Does anyone have an idea how I can do this in Hadoop? I am using Java with Hadoop.

Answer

Tudor · Nov 27, 2011

You could organize your map/reduce computation like this:

Map input: default (with the default TextInputFormat, each record is one line of the file)

Map output: "key: number, value: word"

Sorting phase (by key)

Here you will need to override the default sorter to sort in decreasing order.

Reduce: 1 reducer (set with job.setNumReduceTasks(1))

Reduce input: "key: number, value: word"

Reduce output: "key: word, value: (number, rank)"

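Keep a counter in the reducer. For each key-value pair, increment the counter and emit its value as the rank. This works because the single reducer receives every pair, already sorted by key.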
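To make the flow above concrete, here is a minimal plain-Java sketch that simulates the three phases without any Hadoop classes. It is only an illustration under assumptions: the class name RankSketch and the method rank are made up for this example, a reversed TreeMap stands in for the descending sort phase, and words with equal numbers simply receive consecutive ranks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map -> sort -> reduce flow; no Hadoop classes involved.
// RankSketch and rank() are hypothetical names used only for this illustration.
class RankSketch {

    static List<String> rank(List<String> lines) {
        // "Map" phase: parse each line and emit (number, word).
        // The reversed TreeMap stands in for the sort phase in descending key order.
        Map<Integer, List<String>> sorted =
                new TreeMap<>(Collections.<Integer>reverseOrder());
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            int number = Integer.parseInt(parts[1]);
            sorted.computeIfAbsent(number, k -> new ArrayList<>()).add(parts[0]);
        }

        // "Reduce" phase with a single reducer: one counter shared across all keys.
        int rank = 0;
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : sorted.entrySet()) {
            for (String word : entry.getValue()) {
                rank++;
                out.add(word + " " + entry.getKey() + " " + rank);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the three example lines from the question with their ranks.
        for (String s : rank(Arrays.asList("Word 2", "Word1 8", "Word2 1"))) {
            System.out.println(s);
        }
    }
}
```

In a real job the counter would live as a field of the single reducer instance, which is why the plan relies on exactly one reducer.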

Edit: Here is a code snippet of a custom descending sorter:

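import java.nio.ByteBuffer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

// Nested inside the job class; the imports above go at the top of that file.
// Sorts IntWritable keys in descending order by comparing their serialized bytes.
public static class IntComparator extends WritableComparator {

    public IntComparator() {
        super(IntWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
            byte[] b2, int s2, int l2) {

        // An IntWritable is serialized as a 4-byte big-endian int,
        // which matches ByteBuffer's default byte order.
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();

        // Swapping the arguments reverses the natural (ascending) order.
        return Integer.compare(v2, v1);
    }
}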
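If you want to convince yourself that the raw-byte comparison behaves as intended, the following stand-alone sketch reproduces the same decode-and-reverse logic without Hadoop on the classpath. It assumes the keys are written as 4-byte big-endian ints (what DataOutput.writeInt produces, and what IntWritable uses when it serializes itself); the names RawIntCompareSketch, encode and compareDescending are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Stand-alone check of the raw comparison logic; all names here are hypothetical.
class RawIntCompareSketch {

    // Encodes an int as 4 big-endian bytes, the layout DataOutput.writeInt produces.
    static byte[] encode(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            new DataOutputStream(bytes).writeInt(value);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for simplicity.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decodes both ints from their byte ranges, then compares them in reverse order.
    static int compareDescending(byte[] b1, int s1, int l1,
                                 byte[] b2, int s2, int l2) {
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(v2, v1);
    }

    public static void main(String[] args) {
        byte[] eight = encode(8);
        byte[] two = encode(2);
        // A negative result means 8 is ordered before 2, i.e. descending order.
        System.out.println(compareDescending(eight, 0, 4, two, 0, 4));
    }
}
```

The offset and length arguments matter because, during the sort, Hadoop passes slices of a larger buffer rather than one array per key.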

Don't forget to actually set it as the comparator for your job:

job.setSortComparatorClass(IntComparator.class);