All Questions

1 vote
0 answers
46 views

Sort compressed data in hadoop

Here is the situation: I want to see whether applying sort (from the MapReduce examples) to a compressed file is more efficient than applying it to the original file. To run this experiment, I first launch ...
wilcoln's user avatar
  • 73
0 votes
0 answers
180 views

How does sorting work on a partitioned, skewed dataset in Hive?

I have a dataset with the hierarchy below in Hive; the datasets are terabytes in size. -Country -Year -In_stock -Zone -trans_dt I need to sort trans_dt in ascending order within Zone (one of ...
nilesh1212's user avatar
  • 1,655
0 votes
1 answer
3k views

How to sort values (with their corresponding keys) in the Hadoop MapReduce framework?

I am trying to sort the input data I have using Hadoop MapReduce. The problem is that I am only able to sort the key-value pairs by key, while I am trying to sort them by value. Each value's key was ...
Sara's user avatar
  • 11
0 votes
1 answer
520 views

How to sort a custom writable type in Hadoop

I have a custom type which contains fields of Hadoop native types (e.g. Text and IntWritable) and need to use it as a key and sort as I want during the shuffle/sort phase. There are similar questions ...
Serob_b's user avatar
  • 1,059
0 votes
1 answer
1k views

How to find top 10 elements in MapReduce

I am trying to write a Python MapReduce job on some datasets I have to find certain statistics. This is an example of the input data and the form it comes in: exchange, stock_symbol, date, ...
faboys's user avatar
  • 57
1 vote
0 answers
1k views

How to group by and sort in Python for a MapReduce job in Hadoop

I have a dataset with 100k rows and 17 columns. I would like to know how to group by and sort in a Hadoop MapReduce job using Python. Here is my mapper.py: #!/usr/bin/python import sys for line in sys.stdin:...
LonelyToh's user avatar
1 vote
1 answer
2k views

How to sort by key and value in mapreduce?

I have a text file: 10 1 15 10 12 30 10 9 45 10 8 40 10 15 55 12 9 0 12 7 18 12 10 1 9 1 1 9 2 1 9 0 1 14 5 5 And I would like to get this file as an output of my ...
Ildar Gabdrakhmanov's user avatar
0 votes
1 answer
7k views

MapReduce sort by value in descending order

I'm trying to write in pseudo code a MapReduce task that returns the items sorted in descending order. For example: for the wordcount task, instead of getting: apple 1 banana 3 mango 2 I want the ...
Shani Gamrian's user avatar
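
A minimal sketch of one common way to approach the question above, written as a plain Python simulation rather than a real Hadoop job. It assumes the word counts already exist and that a follow-up pass swaps key and value so the count becomes the sort key; the function names `swap_mapper` and `simulate_sort_by_value_desc` are hypothetical, and the in-memory `sort` call only stands in for the framework's shuffle/sort with a descending comparator.

```python
def swap_mapper(word, count):
    """Emit (count, word) so that the sort phase orders records by count."""
    yield count, word


def simulate_sort_by_value_desc(word_counts):
    # Map phase: swap key and value for every (word, count) record.
    pairs = [kv for word, count in word_counts for kv in swap_mapper(word, count)]
    # Stand-in for the shuffle/sort phase: order by key (the count), descending.
    pairs.sort(key=lambda kv: kv[0], reverse=True)
    # Reduce phase: swap back to (word, count) for the final output.
    return [(word, count) for count, word in pairs]


result = simulate_sort_by_value_desc([("apple", 1), ("banana", 3), ("mango", 2)])
print(result)  # [('banana', 3), ('mango', 2), ('apple', 1)]
```

In an actual job the ordering would come from the key type's comparator, and a single global order would need either one reducer or a total-order partitioning step.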
3 votes
1 answer
475 views

Do we really need sorting in the MapReduce framework?

I am completely new to MapReduce and just can't get my mind around the need to sort the mapper output according to the keys in each partition. Eventually all we want is that a reducer is fed a ...
hesk's user avatar
  • 327
1 vote
0 answers
202 views

Multiple column sorting hadoop streaming (EMR)

I'm trying to sort differently on each column of the mapper output. My output looks like this: xx yy 2 4 xx yy 1 5 xx yy 5 39 xx yy 8 3 So the first 2 columns are text, the last 2 columns are ...
refaelos's user avatar
  • 8,055
1 vote
2 answers
3k views

Searching between dates in HBase

I have an HBase table with rowKeys such as the following (delimiter = '#'): 0CE5C485#1481400000#A#B#C#T 00C6F485#1481600000#F#J#C#G 065ED485#1481500000#T#X#C#G ... ... The first part is actually the hex of the ...
Huga's user avatar
  • 571
3 votes
0 answers
1k views

Hadoop Mapreduce Multiple Reducer Sorting

I am using Hadoop Mapreduce to sort a large document and using the KeyFieldBasedPartitioner to partition different inputs to different reducers. The idea I have to solve this problem is to have the ...
0 votes
1 answer
604 views

Hadoop Map Reduce - how to separate grouping from sorting?

Just getting started writing Hadoop MR jobs. Hopefully we'll be switching over to Spark soon, but we're stuck doing MR for now. I'd like to group records by a hash of their value. But I'd like to ...
medloh's user avatar
  • 969
1 vote
2 answers
1k views

How to sort the output of a map-side program in MapReduce?

My question is about how I can sort the output of a mapper in a MapReduce program (PS: there are no reducers (0)). I use just the map side to filter two inputs and I want the result (the mappers' output) ...
Zoro4246's user avatar
0 votes
1 answer
1k views

MapReduce Sort By Python Tuples Numerically

I'm working with Python tuples and have a text file that looks like (1,value1) (2,value2) (3,value3) ... (100,value100) How can I configure my MapReduce job to sort by the first key in the tuple as ...
Jack's user avatar
  • 538
0 votes
1 answer
44 views

How to dedupe a file and maintain original sort order in Hive?

My data is already sorted by descending last_column and descending third_column. I want to de-duplicate the data set based on last_column while maintaining the original sort order. So for each ...
Utsav Chatterjee's user avatar
0 votes
1 answer
3k views

How to sort a column in data set in descending order using Java Hadoop map reduce?

My data file is: Utsav Chatterjee Dangerous Soccer Coldplay 4 Rodney Purtle Awesome Football Maroon5 3 Michael Gross Amazing Basketball Iron Maiden 6 Emmanuel Ezeigwe Cool ...
Utsav Chatterjee's user avatar
3 votes
1 answer
2k views

How to sort comma-separated keys in Reducer output?

I am running an RFM Analysis program using MapReduce. The OutputKeyClass is Text.class and I am emitting comma-separated R (Recency), F (Frequency), M (Monetary) as the key from the Reducer, where R=...
Punit Naik's user avatar
-1 votes
1 answer
469 views

Sort data Hadoop Mapreduce

I have the following algorithm that sorts data in alphabetical order: public void setup(Context context) throws IOException, InterruptedException { conf = context.getConfiguration(); ...
BigBosss's user avatar
0 votes
2 answers
130 views

Sort Mapreduce dataset

I'm trying to run the following project to sort a dataset. But when I execute the command Hadoop jar xx.jar /inputdir /output dir, I get the following error on the terminal: org.apache.hadoop.mapred....
BigBosss's user avatar
1 vote
0 answers
74 views

How to decide the number of reducers for a SORT BY statement in Hive?

Do we have control over what data we send to each reducer when doing a SORT BY? E.g., if you have data with 10 states (and data under each state), you set the number of reducers to 6, and then you do a ...
Nikhil vyas's user avatar
0 votes
1 answer
324 views

In a large MapReduce job with "X" mappers and "Y" reducers, how many distinct copy operations will there be in the sort/shuffle phase?

As I understand it, there will be X + Y copy operations. Correct me if I'm wrong. Thanks.
Emmanuel Ramos's user avatar
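
A small sketch for reasoning about the question above, not a measurement of any real cluster. It assumes every mapper produces output for every reducer's partition, so each reducer fetches one segment from each mapper; under that assumption the count is X × Y rather than X + Y. The function name `shuffle_copies` is hypothetical.

```python
def shuffle_copies(num_mappers, num_reducers):
    """List every (mapper, reducer) transfer, assuming each mapper
    holds a non-empty partition for each reducer."""
    return [(m, r) for m in range(num_mappers) for r in range(num_reducers)]


copies = shuffle_copies(num_mappers=3, num_reducers=2)
print(len(copies))  # 6, i.e. 3 * 2
```

If some mappers produce no records for a given partition, the number of transfers that actually move data can be lower than this upper bound.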
3 votes
1 answer
923 views

Hadoop - Properly sort by key and group by reducer

I have some data coming out of the reducer that looks like this: 9,2 3 5,7 2 2,3 0 1,5 3 6,3 0 4,2 2 7,1 1 And I would like to sort them according to the number on the second ...
Robin Dupont's user avatar
-1 votes
1 answer
159 views

Join and sort Dataset Hadoop

I'm working on a project on Hadoop using MapReduce (I have 2 datasets, KDD and DARPAA) and I'm looking for an algorithm that can group and sort those datasets into one file. The two datasets have this format: @...
BigBosss's user avatar
1 vote
2 answers
269 views

Why is MapReduce secondary sorting not done in the composite key's compareTo()?

To perform a secondary sort, we have to create a composite key which implements the WritableComparable interface and its compareTo() method. In the "Hadoop: The Definitive Guide" book and almost all the blogs ...
K246's user avatar
  • 1,107
0 votes
0 answers
58 views

What sorting algorithm does mapreduce use in Hadoop? Can I change it? [duplicate]

At first I passed my .jar file containing the WordCount program, along with the input and output destinations, on the command line. After the completion of my job, I saw the content of my output file: it ...
Udit Solanki's user avatar
1 vote
1 answer
169 views

Custom SortComparator not working in MapReduce wordcount program

I am trying to understand how MapReduce sorts the map output keys and which sort algorithm it uses. I have a text file like this: a b e f c b. How does it perform the sorting with these keys? ...
3 votes
3 answers
2k views

NullPointerException in MapReduce Sorting Program

I know that SortComparator is used to sort the map output by key. I have written a custom SortComparator to understand the MapReduce framework better. This is my WordCount class with custom ...
3 votes
3 answers
2k views

MapReduce output key in ascending order

I have written a MapReduce code for which both keys and values are integers. I am using a single Reducer. The output is like this: Key Value 1 78 128 12 174 26 2 44 2957 123 975 ...
MChirukuri's user avatar
0 votes
4 answers
2k views

Sorted Hadoop WordCount Java

I am running the WordCount program of Hadoop in Java and my first job (getting all the words and their counts) works fine. However, I run into a problem with the second job, which should sort ...
Melanie Journe's user avatar
1 vote
1 answer
1k views

What is the hadoop sort comparator class for?

I'm implementing the Hadoop sort comparator class to sort my keys. I know that it is used to compare every key, but I don't know how it works in detail. Is it true that it is used for comparison? ...
Kenny Basuki's user avatar
0 votes
1 answer
783 views

Sort in mapreduce

I am learning Hadoop MapReduce. I am trying to sort (by value) using MapReduce. Below is my code for the mapper: static String splitChar = "\t"; static int colIndexone = 0; static int colIndextwo = ...
Leo's user avatar
  • 5,225
0 votes
1 answer
367 views

Hadoop partitioner not working

public class Partitioner_2 implements Partitioner<Text,Text>{ @Override public int getPartition(Text key, Text value, int numPartitions) { int ...
Nikhil 's user avatar
  • 545
0 votes
1 answer
148 views

Hadoop sort phase taking hours

I started using Hadoop a week ago. After successfully running the examples, I wrote a MapReduce job to find the most used word, based on the WordCount example. I'm trying to run this job with 500 MB of data. ...
Ludovic S's user avatar
  • 195
1 vote
0 answers
48 views

What's the fastest approach to merging a small number of large, already sorted lists in Hadoop?

I've got a small Hadoop (CDH5.1.0, MRv2/YARN) cluster (5x nodes, 4CPU, 16GB RAM, 600GB disk) which contains a small number (~30) of ~15GB SequenceFiles. The SequenceFiles contain pairs of ...
growse's user avatar
  • 3,732
1 vote
1 answer
803 views

Is data inside mapreduce partitions sorted, if yes, how does it happen?

Is data inside mapreduce partitions sorted, if yes, how? AFAIK, it is grouped on the basis of the key. If it internally sorts, wouldn't it be an overhead to sort all the data inside all the partitions?...
MohitS's user avatar
  • 21
1 vote
0 answers
612 views

Hadoop MapReduce secondary sort: Reducer not getting called

I am trying to do a secondary sort on 4 values in my output. I referred to this tutorial. I have a 4-node cluster running Hadoop 2.2.0. I use the IntelliJ IDEA IDE for debugging locally. Following are ...
anixg33k's user avatar
0 votes
1 answer
236 views

Error during benchmarking Sort in Hadoop2 - Partitions do not match

I am trying to benchmark Hadoop2 MapReduce framework. It is NOT TeraSort. But testmapredsort. step-1 Create random data: hadoop jar hadoop/ randomwriter -Dtest.randomwrite.bytes_per_map=100 -Dtest....
eagertoLearn's user avatar
  • 10.1k
1 vote
3 answers
2k views

TotalOrderPartitioner ignores partition file location

I was trying to do a simple sort example with TotalOrderPartitioner. The input is a sequence file with IntWritable as key and NullWritable as value. I want to sort based on key. The output of is a ...
Majid Azimi's user avatar
  • 5,745
6 votes
1 answer
3k views

In-depth understanding of internal working of map phase in a Map reduce job in hadoop?

I am reading Hadoop: The Definitive Guide, 3rd edition, by Tom White. It is an excellent resource for understanding the internals of Hadoop, especially MapReduce, which I am interested in. From the ...
brain storm's user avatar
  • 31.2k
2 votes
3 answers
3k views

top-k in mapreduce when k elements do not fit in memory

What would be an efficient MapReduce algorithm to find the top-k elements from a dataset, when k is too big to fit k elements in memory? I am talking about a dataset of millions of elements and k ...
vefthym's user avatar
  • 7,462
0 votes
1 answer
2k views

Python Hadoop streaming, secondary sorting issues

Hadoop newbie here. I have some user-events logs like this, with userid and timestamp both randomly ordered: userid timestamp serviceId aaa 2012-01-01 13:12:23 4 aaa 2012-01-01 12:...
xiaolong's user avatar
  • 3,637
1 vote
2 answers
622 views

Sorting Algorithm on hadoop framework

I read a number of links on the internet; here are a few: link1, link2. But I am not able to understand what exactly they are doing. Can you please explain this algorithm in a simpler way? And, yes, next ...
devsda's user avatar
  • 4,212
0 votes
1 answer
711 views

Hadoop: Secondary sort does not work

I have implemented an algorithm in Hadoop 1.2.1, where reducer code relies on the secondary sorting. However, when I run the algorithm one reducer receives sorted tuples, but the other does not. I've ...
Krle's user avatar
  • 70
0 votes
1 answer
4k views

Hadoop WordCount sorted by word occurrences

I need to run WordCount so that it gives me all the words and their occurrences, sorted by occurrences and not alphabetically. I understand that I need to create two jobs for this and run one ...
Pini Cheyni's user avatar
  • 5,419
145 votes
8 answers
125k views

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair. What is the purpose of shuffling and sorting phase in the reducer in Map ...
1 vote
1 answer
168 views

Hadoop sorting issue (Alternate title: 1175 is not less than 119!)

I'm new to Hadoop and have finished a typical "count the IP addresses in a log" exercise. Now I'm trying to sort the output by running a second MapReduce job immediately after the first. Almost everything ...
sjohnson's user avatar
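
The title of the question above matches a familiar symptom: counts compared as text instead of as numbers. The excerpt is truncated, so this is only an assumption about the cause, illustrated in plain Python rather than in Hadoop code. The sample values are made up for illustration.

```python
# Counts read back from a text file arrive as strings.
counts_as_text = ["119", "1175", "23"]

# Lexicographic comparison goes character by character:
# "1175" sorts before "119" because '7' < '9' at the third character.
lexicographic = sorted(counts_as_text)

# Converting to int before comparing gives the numeric order.
numeric = sorted(counts_as_text, key=int)

print(lexicographic)  # ['1175', '119', '23']
print(numeric)        # ['23', '119', '1175']
```

In a Java job the analogous choice is whether the sort key is a text type or a numeric writable type, since the key type determines how keys are compared.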
2 votes
1 answer
2k views

What is the point of using a Partitioner for Secondary Sorting in MapReduce?

If you need to have the values sorted for a given key when passed to the reduce phase, such as for a moving average, or to mimic the LAG/LEAD analytic functions in SQL, you need to implement a ...
Matthew Moisen's user avatar
1 vote
1 answer
5k views

Map-Reduce/Hadoop sort by integer value (using MRJob)

This is an MRJob implementation of a simple Map-Reduce sorting functionality. In beta.py: from mrjob.job import MRJob class Beta(MRJob): def mapper(self, _, line): """ """ ...
p0lAris's user avatar
  • 4,820
0 votes
1 answer
2k views

How to control the sort order of mapper result in mapreduce before being sent to reducer

Taking a slight variation of the word count example to explain what I am trying to do. I have 3 mappers each producing a complete word count result on 3 large input files. Let us say the output is: ...
user1967879's user avatar