All Questions
69 questions
0
votes
0
answers
56
views
Can't reach HBase (on S3) from PySpark
I have created a cluster on EMR and I want to use HBase with PySpark. I am new to distributed systems, so I might make amateur mistakes, but connecting to HBase from PySpark feels very hard.
My ...
1
vote
0
answers
266
views
Master tasks on Core Nodes using AWS EMR Hadoop
Using EMR 6.X series, how does one ensure that master tasks run on Core nodes? Reading this page it looks like all it takes are two parameters:
yarn.node-labels.enabled: true
yarn.node-labels....
1
vote
1
answer
1k
views
'Could not find valid SPARK_HOME while searching' on AWS EMR
While running a Python script on an EMR cluster using the spark-submit command, the process got stuck at 10% (visible through yarn application --list), and when I examined the logs, all cores ...
1
vote
1
answer
2k
views
How to enable a python library over EMR core nodes to start EMR spark application step
I am trying to run an EMR (1 master and 2 core nodes) step with a very simple Python script that I uploaded to S3 to be used in an EMR Spark application step. This script reads a data.txt file in S3 and ...
0
votes
1
answer
421
views
How to submit hadoop MR job remotely on Amazon EMR cluster
Current situation: I have an EMR cluster. On the master node - I have a python program that does a subprocess call and executes the script that contains the following line. The subprocess triggers the ...
1
vote
0
answers
293
views
Importing Python files from S3 to Amazon Elastic MapReduce
I want to include compiled proto classes in my mapreduce and figured the easiest way to do so would be to zip the required python files as a tar.gz and upload them to S3.
I came across this ...
0
votes
1
answer
280
views
My python job I run on the master of EMR cluster fails, how do I troubleshoot?
I ssh to the master and run my hadoop job on the console for development purposes. My job fails in a mysterious way, with many java stack traces that make no sense to me, see below:
java.lang....
0
votes
1
answer
266
views
Configure EMR Hadoop Yarn from CLI
I am looking for an efficient way to modify both the mapred-site.xml and the yarn-site.xml in my configuration file for Hadoop on AWS EMR. I can achieve this manually using vim to edit it; however, I ...
1
vote
0
answers
195
views
Hadoop Streaming job error
Hello, I am trying to run a Hadoop Streaming job using Python on EMR 4.7.2 with the following command:
hadoop-streaming -archives s3://mybucket/scripts/HDP/python_scripts/py.tgz -mapper py.tgz/...
0
votes
0
answers
246
views
EMR job stuck at 67% without any response
I have written a simple script to analyze commoncrawl data. Following is the snippet of my mapper.
src_code = record.payload.read().replace('\r', '').split('\n\n')[1]
soup = BeautifulSoup(src_code....
0
votes
0
answers
448
views
AWS Sentiment Analysis tutorial using Naive Bayes Classifier
I am following the AWS Sentiment Analysis tutorial from here.
The problem is that the classifier never finds negative tweets. It always displays only the positive and neutral ones, like ...
0
votes
3
answers
104
views
create step spark python, amazon hadoop
I am creating a Spark step with Hadoop on Amazon, but it stays stuck the whole time. I don't know whether it's because my code is bad or because I'm passing bad arguments, but I cannot find a way out.
The code I pass:
spark-submit --...
3
votes
0
answers
562
views
Streaming Program on AWS EMR not working
I'm currently trying to set up a really simple streaming program (the wordcount example) but I can't get it to run successfully.
My mapper and reducer scripts run without any errors when I start ...
1
vote
2
answers
197
views
Mapping a range of warc.gz files, EMR
I have been running a streaming step in AWS/EMR with a mapper and reducer written in Python to map some of the archives in Common Crawl for sentiment analysis.
I am moving from the older common ...
1
vote
1
answer
236
views
How to print the result from reducer into single file
I am using Amazon EMR, and because of the way it works (in parallel) my output gets split into multiple files.
But I would like to have one file instead, in the right sequence. Is it possible to do just ...
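One common way to get a single ordered file from a finished job, offered here only as a minimal sketch, is to concatenate the part files in name order after copying them to a local directory. The `part-*` naming is an assumption based on the usual Hadoop output convention, and `merge_parts` is a hypothetical helper, not part of any EMR or Hadoop API.

```python
import glob
import os


def merge_parts(output_dir, merged_path, pattern="part-*"):
    """Concatenate Hadoop-style part files in name order into one file.

    Assumes the job output was already copied to a local directory and that
    sorting the file names gives the desired sequence
    (part-00000, part-00001, ...). Returns the list of merged part files.
    """
    parts = sorted(glob.glob(os.path.join(output_dir, pattern)))
    with open(merged_path, "wb") as merged:
        for part in parts:
            # Copy each part verbatim so record boundaries are preserved.
            with open(part, "rb") as src:
                merged.write(src.read())
    return parts
```

Name order only matches key order when the job ran with a single reducer or a total-order partitioner; otherwise the merged file is complete but not globally sorted.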
2
votes
1
answer
165
views
Hadoop returning less results than expected
I have two Python scripts, a mapper and a reducer (at this point the reducer basically just prints, nothing else), and while locally I get 4 results (strings),
on Hadoop I get 3. How does this work?
I use ...
0
votes
1
answer
309
views
Submit jobs to EMR cluster using MRJob
MRJob waits until each job completes before giving back control to the user. I broke down a large EMR step into smaller ones and would like to submit them all in one shot.
The docs talk about ...
0
votes
2
answers
537
views
MRJob determining if running inline, local, emr or hadoop
I am building on some old code from a few years back using the commoncrawl dataset with EMR using MRJob. The code uses the following inside MRJob subclass mapper function to determine whether running ...
1
vote
0
answers
526
views
AWS EMR - Python path, git repo and scripts
I am running MapReduce jobs on Hive and most of the code already resides in a git repo. I know I am able to include instructions in the bootstrap script when spawning up clusters, but is it possible ...
8
votes
2
answers
17k
views
how to find JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar
I'm following a video tutorial from Pluralsight about Amazon EMR. I am stuck and cannot proceed because I am getting this error:
Not a valid JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar
...
0
votes
1
answer
412
views
run mrjob on Amazon EMR, t2.micro not supported
I tried to run an mrjob script on Amazon EMR. It worked well when I used a c1.medium instance; however, it failed with an error when I changed the instance to t2.micro. The full error message is shown below.
C:\...
0
votes
2
answers
394
views
Loading my own python modules for Pig UDFs on Amazon EMR
I am trying to call two of my own modules from Pig.
Here's module_one.py:
import sys
print sys.path
def foo():
    pass
Here's module_two.py:
from module_one import foo
def bar():
    foo()
I ...
2
votes
0
answers
3k
views
how to run a python script present in the cluster node file system in Amazon EMR
I am looking for ways to run a python script against an EMR cluster. I went through some documentation here and here but they only list data source as S3, or DynamoDB. How can I execute a python ...
1
vote
1
answer
292
views
Error in loading a file into EMR distributed cache using elastic-mapreduce
I'm using the command below to launch a cluster.
./elastic-mapreduce --create \
--stream \
--cache s3n://bucket_name/code/totalInstallUsers#totalInstallUsers \
--input s3n://bucket_name/input \
--...
0
votes
0
answers
59
views
EMR issues with reducer.py
I'm running AWS and trying to run a simulation on the EMR setting. I know my mapper.py file is correct, but I can't figure out why my reducer.py file isn't working correctly. The idea was to ...
1
vote
0
answers
181
views
EmrResponseError: 505 HTTP Version Not Supported
I got the following error when I ran the Python file from an EC2 machine.
Hadoop version: 2.4.0
ami version : 3.5.0
Boto Version : 2.32.0
Traceback (most recent call last):
File "/home/ec2-user/...
1
vote
1
answer
237
views
installing PIG 0.14 on Amazon EMR
I need to run Python streaming UDFs from PIG on Amazon EMR using Hadoop 2.x
Based on the documentation PIG works with Hadoop 2.x since version 0.14
http://pig.apache.org/docs/r0.12.0/udf.html#python-...
7
votes
1
answer
2k
views
Pydoop gets stuck on readline from HDFS files
I am reading the first line of all the files in a directory. Locally it works fine, but on EMR this test gets stuck at around the 200th-300th file.
Also, ps -eLF shows the number of children increasing to 3000 even ...
0
votes
1
answer
557
views
MapReduce job (written in Python) runs slowly on EMR
I am trying to write a MapReduce job using python's MRJob package. The job processes ~36,000 files stored in S3. Each file is ~2MB. When I run the job locally (downloading the S3 bucket to my computer)...
1
vote
1
answer
1k
views
How to load additional JARs for a Hadoop Streaming job on Amazon EMR
TL;DR
How can I upload or specify additional JARs for a Hadoop Streaming job on Amazon Elastic MapReduce (Amazon EMR)?
Long version
I want to analyze a set of Avro files (> 2000 files) using Hadoop ...
23
votes
7
answers
43k
views
Pyspark --py-files doesn't work
I use this as the documentation suggests: http://spark.apache.org/docs/1.1.1/submitting-applications.html
Spark version 1.1.0
./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \
/...
-1
votes
1
answer
93
views
How do I create "side-effect" files using Python streaming on AWS Elastic MapReduce?
I'm running a Python streaming job on Amazon's Elastic MapReduce which needs to output multiple files from the reducer. The descriptions I've found on the web of how to do this have all been old, so ...
0
votes
1
answer
331
views
import custom function in MapReduce code on AWS EMR
I have been struggling with this for 2 hours now!
I created a mapper script in Python which imports one of my custom functions from another Python script.
#!/usr/bin/env python
import sys
...
0
votes
1
answer
370
views
Amazon EMR job with many json files as input
I am writing a hadoop streaming application in python to run on EMR. The input for the EMR job is a directory of files in an S3 bucket, each of which is a json file containing a single json object. I ...
0
votes
1
answer
139
views
Issue with using files in distributed cache in Elastic MapReduce
I'm trying to make use of an external library in my Python mapper script in an AWS Elastic MapReduce job.
However, my script doesn't seem to be able to find the modules in the cache. I archived the ...
0
votes
1
answer
591
views
Hadoop EMR using Python
I'm using Hadoop streaming to use my mapper and reducer code in python to run a Mapreduce job. I have input data in s3, and I'm trying to use that for the job. However, when I run the command like ...
3
votes
1
answer
520
views
File processing using AWS EMR
I need an architectural suggestion for this problem I'm working on. I have log files coming in every 15 minutes in a gzipped folder. Each of these has about 100,000 further files to process. I have a ...
0
votes
3
answers
325
views
How is data partitioned and distributed among datanodes in MapReduce?
I'm new to MapReduce and have the task of processing large data (lines of records). One thing I need to use is the line number of a specific record in my mapper, and then the reducer processes the line number ...
1
vote
1
answer
3k
views
Hadoop Streaming program subprocess failed with code 139
I'm running a Hadoop streaming program (written in Python) over Amazon EMR that is having some issues. It all runs fine when I do tests with a few thousand records and I've tested the program locally ...
1
vote
2
answers
2k
views
Bootstrapping libraries on EMR using python MRJob
Problem Statement:
I am trying to run a map-reduce job in Amazon EMR using python MRJob library, and I am having trouble with bootstrapping the nodes with the requisite libraries and packages.
...
3
votes
1
answer
526
views
Hadoop streaming on AWS - Sentiment Analysis Example
I am doing the AWS Big Data example: sentiment analysis using Hadoop streaming with Python code (link below):
http://blog.newitfarmer.com/anls/analytics-bi/sentiment-analysis-analytics-bi/13436/...
1
vote
4
answers
4k
views
Combine output files of MapReduce job
I have written a Mapper and Reducer in Python and have executed them successfully on Amazon's Elastic MapReduce (EMR) using Hadoop Streaming.
The final result folder contains the output in three ...
6
votes
0
answers
286
views
Processing MongoDB data using Mrjob on Amazon EMR
I know that Mrjob uses Hadoop Streaming. I also know that there is a plugin for using MongoDB with Hadoop Streaming. However, I couldn't find any examples of bringing the two together.
Is this (at least ...
1
vote
1
answer
541
views
MapReduce Amazon Python: get the line number of the input file
I have several texts and I want to know the line number and the file where a word appears.
I get the file correctly but not the line number.
This is the map
#!/usr/bin/env python
import sys
import os
...
2
votes
1
answer
3k
views
AWS Elastic mapreduce doesn't seem to be correctly converting the streaming to jar
I have a mapper and reducer that work fine when I run them in the piped version:
cat data.csv | ./mapper.py | sort -k1,1 | ./reducer.py
I used the elastic mapreducer wizard, loaded inputs, outputs, ...
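The shell pipe above can be mirrored in plain Python to check mapper and reducer logic before submitting a streaming step. This is only a minimal sketch: `word_mapper` and `count_reducer` are hypothetical stand-ins for `mapper.py` and `reducer.py`, and it assumes tab-separated key/value lines.

```python
from itertools import groupby


def word_mapper(line):
    """Hypothetical mapper: emit 'word<TAB>1' for every word in the line."""
    for word in line.split():
        yield f"{word}\t1"


def count_reducer(key, values):
    """Hypothetical reducer: sum the counts collected for one key."""
    return f"{key}\t{sum(int(v) for v in values)}"


def record_key(record):
    """Return the part of a record before the first tab."""
    return record.split("\t", 1)[0]


def run_local(lines, mapper, reducer):
    """Emulate `cat data | mapper | sort -k1,1 | reducer` in-process."""
    mapped = [out for line in lines for out in mapper(line)]
    # Stand-in for `sort -k1,1`: order records by their key.
    mapped.sort(key=record_key)
    results = []
    # groupby relies on the sort above to see each key as one contiguous run.
    for key, group in groupby(mapped, key=record_key):
        values = [rec.split("\t", 1)[1] for rec in group]
        results.append(reducer(key, values))
    return results
```

A job that passes this local check can still fail on the cluster, for example when the scripts lack a shebang line or execute permission, so it only helps rule out logic errors.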
0
votes
2
answers
546
views
Threading with Hadoop Streaming
I am making use of Hadoop streaming to write a python based HTML grabber. I find that running a single threaded python script is slow. I want to modify it to a multithreaded version. Does anyone know ...
3
votes
3
answers
2k
views
How do I write the output of an EMR streaming job to HDFS?
I see examples of people writing EMR output to HDFS, but I haven't been able to find examples of how it's done. On top of that, this documentation seems to say that the --output parameter for an EMR ...
1
vote
1
answer
1k
views
Map Reduce multiple outputs in python boto
I am trying to partition an input file using AWS EMR.
I use a streaming step to read from stdin.
I want to split this file into 2 files based on the values of specific fields from each line of stdin ...