
I have MapReduce input that looks like this:

key1 \t 4.1 \t more ...
key1 \t 10.3 \t more ...
key2 \t 6.9 \t more ...
key2 \t 3 \t more ...

I want to sort by the first column, then by the second column (numerically, in reverse order). Is there a way to achieve this with Streaming MapReduce?

My current attempt is this:

hadoop jar hadoop-streaming-1.2.1.jar -Dnum.key.fields.for.partition=1 -Dmapred.text.key.comparator.options='-k1,2rn' -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -mapper cat -reducer cat -file mr_base.py -file common.py -file mr_sort_combiner.py -input mr_combiner/2013_12_09__05_47_21/part-* -output mr_sort_combiner/2013_12_09__07_15_59/

But this sorts by the first and second parts of the key, and it compares the second part as a string rather than as a number.

Any ideas on how I can sort two fields (one numeric and one textual)?

1 Answer

You can achieve numerical sorting on multiple columns by specifying multiple -k options in mapred.text.key.comparator.options (similar to the Linux sort command).

For example, in bash:

sort -k1,1 -k2rn
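To see what the multi-key form does, here is a minimal local sketch using plain sort (not Hadoop) on the sample rows from the question. It assumes tab-separated input and a POSIX-style sort; the function name sort_sample is made up for illustration, and the "more" column is a placeholder.

```shell
# Sort by column 1 as text, then by column 2 numerically in descending order.
sort_sample() {
  printf 'key1\t4.1\tmore\nkey1\t10.3\tmore\nkey2\t6.9\tmore\nkey2\t3\tmore\n' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2nr
}

sort_sample
# Expected order of the second column: 10.3, 4.1 (key1), then 6.9, 3 (key2)
```

Writing the second key as -k2,2nr rather than -k2rn limits the numeric comparison to the second field alone; without the end position, the key runs to the end of the line.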

So for your example it would be:

hadoop jar hadoop-streaming-1.2.1.jar \
    -Dmapred.text.key.comparator.options='-k1,1 -k2rn' \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -mapper cat \
    -reducer cat \
    -file mr_base.py \
    -file common.py \
    -file mr_sort_combiner.py \
    -input mr_combiner/2013_12_09__05_47_21/part-* \
    -output mr_sort_combiner/2013_12_09__07_15_59/
  • This doesn't work for me on Hadoop 2.7.3. Any idea why? (My exact config is -D mapreduce.partition.keycomparator.options='-k1,2 -k3,3nr -k4,4nr', but it fails.)
    – refaelos
    Commented Dec 21, 2016 at 16:43
  • Try to explicitly indicate how many key columns you have, e.g. -D stream.num.map.output.key.fields=4.
    – VeLKerr
    Commented Apr 6, 2018 at 11:35
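
Following the suggestion in the last comment: by default, Streaming treats only the text before the first tab as the key, so the comparator may never see the second column unless stream.num.map.output.key.fields is set. Below is a hedged sketch of the answer's command with that option added for the two-column case. It has not been verified against the asker's cluster; INPUT_PATH and OUTPUT_PATH are placeholders, and the hypothetical helper print_sort_job only prints the command rather than submitting a job.

```shell
# Dry run: print the streaming command so it can be reviewed before use.
# INPUT_PATH and OUTPUT_PATH are placeholders, not paths from the thread.
print_sort_job() {
  cat <<'EOF'
hadoop jar hadoop-streaming-1.2.1.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.comparator.options='-k1,1 -k2,2nr' \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -mapper cat \
    -reducer cat \
    -input INPUT_PATH \
    -output OUTPUT_PATH
EOF
}

print_sort_job
```

With more than one reducer, the order is only guaranteed within each reducer's output, so partitioning on the first column would also need to be configured.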
