All Questions
76 questions
1
vote
0
answers
46
views
Sort compressed data in hadoop
So here is the situation,
I want to see if applying sort (from map reduce examples) on a compressed file is more efficient than on the original file.
To run this experiment, I first launch ...
0
votes
0
answers
180
views
How sorting works on Partitioned skewed data set in Hive
I have a dataset with the hierarchy below in Hive; the size of the datasets is in TBs.
-Country
-Year
-In_stock
-Zone
-trans_dt
I need to sort trans_dt in ascending order within Zone (one of ...
0
votes
1
answer
3k
views
How to sort values (with their corresponding key) in mapReduce Hadoop framework?
I am trying to sort the input data I have using Hadoop mapReduce. The problem is that I am only able to sort the key-value pairs by key, while I am trying to sort them by value. Each value's key was ...
0
votes
1
answer
520
views
How to sort a custom writable type in Hadoop
I have a custom type which contains fields of Hadoop native types (e.g. Text and IntWritable) and need to use it as a key and sort as I want during the shuffle/sort phase. There are similar questions ...
0
votes
1
answer
1k
views
How to find top 10 elements in MapReduce
I am trying to write a Python MapReduce job on some datasets I have to find certain statistics. This is an example of the input data and the form it comes in:
exchange, stock_symbol, date, ...
1
vote
0
answers
1k
views
How to make Group By and sort in Python for mapreducer in hadoop
I have a dataset with 100k rows with 17 cols.
I would like to know how to groupby and sort in hadoop mapreducer using python
here is my mapper.py
#!/usr/bin/python
import sys
for line in sys.stdin:...
1
vote
1
answer
2k
views
How to sort by key and value in mapreduce?
I have a text file:
10 1 15
10 12 30
10 9 45
10 8 40
10 15 55
12 9 0
12 7 18
12 10 1
9 1 1
9 2 1
9 0 1
14 5 5
And I would like to get this file as an output of my ...
0
votes
1
answer
7k
views
MapReduce sort by value in descending order
I'm trying to write in pseudo code a MapReduce task that returns the items sorted in descending order. For example: for the wordcount task, instead of getting:
apple 1
banana 3
mango 2
I want the ...
3
votes
1
answer
475
views
Do we really need sorting in the MapReduce framework?
I am completely new to MapReduce and just can't get my mind around the need to sort the mapper output according to the keys in each partition. Eventually all we want is that a reducer is fed a ...
1
vote
0
answers
202
views
Multiple column sorting hadoop streaming (EMR)
I'm trying to sort differently on each column on the mapper output. My output looks like this:
xx yy 2 4
xx yy 1 5
xx yy 5 39
xx yy 8 3
So the first 2 columns are text, the last 2 columns are ...
1
vote
2
answers
3k
views
Searching between dates in Hbase
I have an HBase table with rowKeys as such (delimiter = '#')
0CE5C485#1481400000#A#B#C#T
00C6F485#1481600000#F#J#C#G
065ED485#1481500000#T#X#C#G
...
...
The first part is actually the hex of the ...
3
votes
0
answers
1k
views
Hadoop Mapreduce Multiple Reducer Sorting
I am using Hadoop Mapreduce to sort a large document and using the KeyFieldBasedPartitioner to partition different inputs to different reducers. The idea I have to solve this problem is to have the ...
0
votes
1
answer
604
views
Hadoop Map Reduce - how to separate grouping from sorting?
Just getting started writing Hadoop MR jobs. Hopefully we'll be switching over to Spark soon, but we're stuck doing MR for now.
I'd like to group records by a hash of their value. But I'd like to ...
1
vote
2
answers
1k
views
how to sort the output of a map side program in mapreduce?
My question is about how I can sort the output of a mapper in a MapReduce program (PS: there are no reducers (0)). I use just the map side to filter two inputs and I want the result (output mappers) ...
0
votes
1
answer
1k
views
MapReduce Sort By Python Tuples Numerically
I'm working with Python tuples and have a text file that looks like
(1,value1)
(2,value2)
(3,value3)
...
(100,value100)
How can I configure my MapReduce job to sort by the first key in the tuple as ...
0
votes
1
answer
44
views
How to dedupe a file and maintain original sort order in Hive?
My data is already sorted by descending last_column and descending third_column. I want to de-duplicate the data set based on last_column while maintaining the original sort order. So for each ...
0
votes
1
answer
3k
views
How to sort a column in data set in descending order using Java Hadoop map reduce?
My data file is:
Utsav Chatterjee Dangerous Soccer Coldplay 4
Rodney Purtle Awesome Football Maroon5 3
Michael Gross Amazing Basketball Iron Maiden 6
Emmanuel Ezeigwe Cool ...
3
votes
1
answer
2k
views
How to sort comma separated keys in Reducer output?
I am running an RFM Analysis program using MapReduce. The OutputKeyClass is Text.class and I am emitting comma separated R (Recency), F (Frequency), M (Monetary) as the key from Reducer where R=...
-1
votes
1
answer
469
views
Sort data Hadoop Mapreduce
I have the following algorithm that sorts data in alphabetical order
public void setup(Context context) throws IOException,
InterruptedException {
conf = context.getConfiguration();
...
0
votes
2
answers
130
views
Sort Mapreduce dataset
I'm trying to run the following project to sort a dataset.
But when I execute the command: Hadoop jar xx.jar /inputdir /output dir, I get the following error on the terminal:
org.apache.hadoop.mapred....
1
vote
0
answers
74
views
Decide number of reducer in sort by statement in hive?
Do we have control over what data we can send to the reducer when doing a sort by, e.g. if you have data with 10 states (and data under each state) and you set the reducer to 6 and then you do a ...
0
votes
1
answer
324
views
In a large MapReduce job with "X" mappers and "Y" reducers, how many distinct copy operations will there be in the sort/shuffle phase
As I understand it, there will be X + Y copy operations; correct me if I'm wrong.
Thanks
3
votes
1
answer
923
views
Hadoop - Properly sort by key and group by reducer
I have some data coming out of the reducer which looks like this:
9,2 3
5,7 2
2,3 0
1,5 3
6,3 0
4,2 2
7,1 1
And I would like to sort them according to the number on the second ...
-1
votes
1
answer
159
views
Join and sort Dataset Hadoop
I'm working on a project on Hadoop using MapReduce (I have 2 datasets, KDD and DARPAA) and I'm looking for an algorithm which can group and sort those datasets in one file.
The two datasets have this format:
@...
1
vote
2
answers
269
views
Why mapreduce secondary sorting is not on composite key's compareTo()?
To perform a secondary sort, we have to create a composite key which implements the WritableComparable interface and its compareTo() method.
In the "Hadoop: The Definitive Guide" book and almost all the blogs ...
0
votes
0
answers
58
views
What sorting algorithm does mapreduce use in Hadoop? Can I change it? [duplicate]
At first I parsed my .jar file containing the program of WordCount along with input and output destination in the command line. After the completion of my job, I saw the content of my output file: it ...
1
vote
1
answer
169
views
Custom SortComparator not working in MapReduce wordcount program
I am trying to understand how MapReduce sorts the map output keys and which sort algorithm it uses. I have a text file like this
a b e f c b
how it performs the sorting with these keys. ...
3
votes
3
answers
2k
views
NullPointerException in MapReduce Sorting Program
I know that SortComparator is used to sort the map output by their keys. I have written a custom SortComparator to understand the MapReduce framework better. This is my WordCount class with custom ...
3
votes
3
answers
2k
views
MapReduce output key in ascending order
I have written MapReduce code for which both keys and values are integers. I am using a single Reducer. The output is like this:
Key Value
1 78
128 12
174 26
2 44
2957 123
975 ...
0
votes
4
answers
2k
views
Sorted Hadoop WordCount Java
I am running the WordCount program of Hadoop in Java and my first job (getting all the words and their count) works fine.
However, I come across a problem when I'm doing the second job, which should sort ...
1
vote
1
answer
1k
views
What is the hadoop sort comparator class for?
I'm implementing the Hadoop sort comparator class for sorting my key. I know that it is used to compare every key, but I don't know how it works in detail. Is it true that it is used to compare?
...
0
votes
1
answer
783
views
Sort in mapreduce
I am learning Hadoop MapReduce. I am trying to sort (by value) using MapReduce. Below is my code for the mapper:
static String splitChar = "\t";
static int colIndexone = 0;
static int colIndextwo = ...
0
votes
1
answer
367
views
hadoop partitioner not working
public class Partitioner_2 implements Partitioner<Text,Text>{
@Override
public int getPartition(Text key, Text value, int numPartitions) {
int ...
0
votes
1
answer
148
views
Hadoop sort phase taking hours
I started using Hadoop a week ago. After successfully running the examples, I wrote a MapReduce job to find the most used word using the WordCount example.
I'm trying to run this job with 500 MB of data.
...
1
vote
0
answers
48
views
What's the fastest approach to merging a small number of large, already sorted lists in Hadoop?
I've got a small Hadoop (CDH5.1.0, MRv2/YARN) cluster (5x nodes, 4CPU, 16GB RAM, 600GB disk) which contains a small number (~30) of ~15GB SequenceFiles. The SequenceFiles contain pairs of ...
1
vote
1
answer
803
views
Is data inside mapreduce partitions sorted, if yes, how does it happen?
Is data inside mapreduce partitions sorted, if yes, how? AFAIK, it is grouped on the basis of the key. If it internally sorts, wouldn't it be an overhead to sort all the data inside all the partitions?...
1
vote
0
answers
612
views
Hadoop MapReduce secondary sort: Reducer not getting called
I am trying to do a secondary sort on 4 values in my output. I referred to this tutorial.
I have a 4 node cluster running Hadoop 2.2.0. I use Idea IntelliJ IDE for debugging locally.
Following are ...
0
votes
1
answer
236
views
Error during benchmarking Sort in Hadoop2 - Partitions do not match
I am trying to benchmark Hadoop2 MapReduce framework. It is NOT TeraSort. But testmapredsort.
step-1
Create random data:
hadoop jar hadoop/ randomwriter -Dtest.randomwrite.bytes_per_map=100 -Dtest....
1
vote
3
answers
2k
views
TotalOrderPartitioner ignores partition file location
I was trying to do a simple sort example with TotalOrderPartitioner. The input is a sequence file with IntWritable as key and NullWritable as value. I want to sort based on key. The output of is a ...
6
votes
1
answer
3k
views
In-depth understanding of internal working of map phase in a Map reduce job in hadoop?
I am reading Hadoop: The Definitive Guide, 3rd edition, by Tom White. It is an excellent resource for understanding the internals of Hadoop, especially Map-Reduce, which I am interested in.
From the ...
2
votes
3
answers
3k
views
top-k in mapreduce when k elements do not fit in memory
What would be an efficient MapReduce algorithm to find the top-k elements from a dataset, when k is too big to fit k elements in memory? I am talking about a dataset of millions of elements and k ...
0
votes
1
answer
2k
views
Python Hadoop streaming, secondary sorting issues
Hadoop newbie here. I have some user-events logs like this, with userid and timestamp both randomly ordered:
userid timestamp serviceId
aaa 2012-01-01 13:12:23 4
aaa 2012-01-01 12:...
1
vote
2
answers
622
views
Sorting Algorithm on hadoop framework
I read a number of links on the internet. Here are a few links: link1, link2. But I am not able to understand what exactly they are doing. Can you please explain this algorithm in a simpler way?
And, yes next ...
0
votes
1
answer
711
views
Hadoop: Secondary sort does not work
I have implemented an algorithm in Hadoop 1.2.1, where reducer code relies on the secondary sorting. However, when I run the algorithm one reducer receives sorted tuples, but the other does not. I've ...
0
votes
1
answer
4k
views
Hadoop WordCount sorted by word occurrences
I need to run WordCount which will give me all the words and their occurrences, but sorted by occurrences and not alphabetically.
I understand that I need to create two jobs for this and run one ...
145
votes
8
answers
125k
views
What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair.
What is the purpose of shuffling and sorting phase in the reducer in Map ...
1
vote
1
answer
168
views
Hadoop sorting issue (Alternate title: 1175 is not less than 119!)
I'm new to Hadoop and done with a typical "count the IP addresses in a log" exercise. Now I'm trying to sort the output by running a second MapReduce job immediately after the first. Almost everything ...
2
votes
1
answer
2k
views
What is the point of using a Partitioner for Secondary Sorting in MapReduce?
If you need to have the values sorted for a given key when passed to the reduce phase, such as for a moving average, or to mimic the LAG/LEAD Analytic functions in SQL, you need to implement a ...
1
vote
1
answer
5k
views
Map-Reduce/Hadoop sort by integer value (using MRJob)
This is an MRJob implementation of a simple Map-Reduce sorting functionality. In beta.py:
from mrjob.job import MRJob
class Beta(MRJob):
    def mapper(self, _, line):
        """
        """
...
0
votes
1
answer
2k
views
How to control the sort order of mapper result in mapreduce before being sent to reducer
Taking a slight variation of the word count example to explain what I am trying to do.
I have 3 mappers each producing a complete word count result on 3 large input files.
Let us say the output is:
...