Hadoop MapReduce Streaming sorting on multiple columns

Question

I have mapreduce input that looks like this:

key1 \t 4.1 \t more ...
key1 \t 10.3 \t more ...
key2 \t 6.9 \t more ...
key2 \t 3 \t more ...

I want to sort by the first column, then by the second column (reverse numerical). Is there a way to achieve this with Streaming MapReduce?

My current attempt is this:

hadoop jar hadoop-streaming-1.2.1.jar \
    -Dnum.key.fields.for.partition=1 \
    -Dmapred.text.key.comparator.options='-k1,2rn' \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -mapper cat \
    -reducer cat \
    -file mr_base.py \
    -file common.py \
    -file mr_sort_combiner.py \
    -input mr_combiner/2013_12_09__05_47_21/part-* \
    -output mr_sort_combiner/2013_12_09__07_15_59/

But this sorts by the first part of the key and then by the second, treating the second as a string rather than as a number.

Any ideas on how I can sort two fields (one numeric and one textual)?

ghchoi · Accepted Answer · 2018-04-09 08:29:35Z

6

You can sort numerically on multiple columns by specifying multiple -k options in mapred.text.key.comparator.options (similar to the Linux sort command).

E.g. in bash:

sort -k1,1 -k2rn

So for your example it would be:

hadoop jar hadoop-streaming-1.2.1.jar \
    -Dmapred.text.key.comparator.options='-k1,1 -k2rn' \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -mapper cat \
    -reducer cat \
    -file mr_base.py \
    -file common.py \
    -file mr_sort_combiner.py \
    -input mr_combiner/2013_12_09__05_47_21/part-* \
    -output mr_sort_combiner/2013_12_09__07_15_59/
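
Because the comparator options are modeled on the sort command, one way to sanity-check a key specification before submitting a job is to try it locally first. The following is only a sketch: it assumes a POSIX-style sort, uses the four sample rows from the question, and the names sample, TAB, sorted_numeric and sorted_string are made up for illustration. It bounds the second key as -k2,2 so that only the second column is compared.

```shell
# Emit the four sample rows from the question, tab-separated.
sample() {
    printf 'key1\t4.1\tmore\n'
    printf 'key1\t10.3\tmore\n'
    printf 'key2\t6.9\tmore\n'
    printf 'key2\t3\tmore\n'
}

TAB=$(printf '\t')

# Column 1 as text, then column 2 as a number in descending order.
sorted_numeric=$(sample | LC_ALL=C sort -t "$TAB" -k1,1 -k2,2nr)

# For contrast: column 2 compared as a string in reverse order,
# which places 4.1 ahead of 10.3 within key1.
sorted_string=$(sample | LC_ALL=C sort -t "$TAB" -k1,1 -k2,2r)

printf 'numeric:\n%s\n' "$sorted_numeric"
printf 'string:\n%s\n' "$sorted_string"
```

If the numeric listing shows 10.3 before 4.1 for key1 and 6.9 before 3 for key2, the same key specification is a reasonable candidate for mapred.text.key.comparator.options.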

edited Apr 9, 2018 at 8:29

ghchoi

answered Oct 14, 2014 at 22:04

blanche

1

this doesn't work for me on hadoop 2.7.3 ... any idea why? (My exact config is -D mapreduce.partition.keycomparator.options='-k1,2 -k3,3nr -k4,4nr' but it fails)
– refaelos
Commented Dec 21, 2016 at 16:43
1

Try explicitly specifying how many columns the key has, e.g. -D stream.num.map.output.key.fields=4.
– VeLKerr
Commented Apr 6, 2018 at 11:35
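
Putting the two comments above together, a Hadoop 2.x invocation might look roughly like the sketch below. This is not a verified fix for the 2.7.3 failure. It combines the comparator options from the first comment with the key-field count from the second, and uses the Hadoop 2.x property name for the comparator class. The jar name and the input and output paths are hypothetical placeholders, and the leading echo makes it a dry run that only prints the command.

```shell
# Hypothetical placeholders; replace with the real jar and paths.
STREAMING_JAR=hadoop-streaming-2.7.3.jar
INPUT=input_dir
OUTPUT=output_dir

# Leading `echo` prints the command instead of running it;
# remove it to actually submit the job.
# stream.num.map.output.key.fields=4 makes the first four tab-separated
# columns part of the key, so the comparator can see fields 3 and 4.
echo hadoop jar "$STREAMING_JAR" \
    -D stream.num.map.output.key.fields=4 \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options='-k1,2 -k3,3nr -k4,4nr' \
    -mapper cat \
    -reducer cat \
    -input "$INPUT" \
    -output "$OUTPUT"
```

The -D options are placed before -mapper and the other streaming options, matching the ordering used in the answer.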
