1

I have a text file:

10  1   15
10  12  30
10  9   45
10  8   40
10  15  55
12  9   0
12  7   18
12  10  1
9   1   1
9   2   1
9   0   1
14  5   5

And I would like to get this file as an output of my MapReduce job:

9   0   1   
9   1   1   
9   2   1   
10  1   15  
10  9   40  
10  9   45  
10  12  30  
10  15  55  
12  7   18  
12  9   0   
12  10  1   
14  5   5

It means it has to be sorted by 1st, 2nd and 3rd columns.

I use this command:

#!/bin/bash

IN_DIR="/user/cloudera/temp"
OUT_DIR="/user/cloudera/temp_out"
NUM_REDUCERS=1

hdfs dfs -rmr ${OUT_DIR} > /dev/null

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.jab.name="Parsing mista pages job 1 (parsing)" \
-D stream.num.map.output.key.fields=3 \
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
-D mapreduce.partition.keycomparator.options='-k1,1n -k2,2n -k3,3n' \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-mapper 'cat' \
-reducer 'cat' \
-input ${IN_DIR} \
-output ${OUT_DIR}

hdfs dfs -cat ${OUT_DIR}/* | head -100

And get exactly what I want. BUT. When I do NUM_REDUCERS=2 I get this output:

[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/temp_out/part-00000 | head -100
9   1   1   
10  9   45  
10  12  30  
10  15  55  
12  7   18  
12  10  1   
14  5   5

[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/temp_out/part-00001 | head -100
9   0   1   
9   2   1   
10  1   15  
10  9   40  
12  9   0

Why partitioner splits my data with same keys (for example '9') to different reducers?

How can I force partitioner to split Mapper output by the key and sort it by value. For example, if I have 4 reducers the reducers input should be:

reducer 1
9   0   1   
9   1   1   
9   2   1   

reducer 2
10  1   15  
10  9   40  
10  9   45  
10  12  30  
10  15  55  

reducer 3
12  7   18  
12  9   0   
12  10  1

reducer 4:   
14  5   5

1 Answer 1

0

you can overwrite the default Partioner to put each key into diferent reduce .Set the same Nums of reduce . let each reduce to deal with only one key .

for example()

groupMap.put("9", 0);
groupMap.put("10", 1);
groupMap.put("12", 2);
groupMap.put("14", 3);

Add -partitioner argument to use your own partition in your job. I think it might works for you

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.