All Questions

0 votes
0 answers
56 views

Can't reach HBase (on S3) from PySpark

I have created a cluster on EMR and I want to use HBase with PySpark. I am new to distributed systems, so I might make amateur mistakes, but connecting to HBase from PySpark feels very hard. My ...
Asfandyar Abbasi's user avatar
1 vote
0 answers
266 views

Master tasks on Core Nodes using AWS EMR Hadoop

Using EMR 6.X series, how does one ensure that master tasks run on Core nodes? Reading this page it looks like all it takes are two parameters: yarn.node-labels.enabled: true yarn.node-labels....
Stephen's user avatar
  • 146
1 vote
1 answer
1k views

'Could not find valid SPARK_HOME while searching' on AWS EMR

While running a Python script on an EMR cluster using the spark-submit command, the process got stuck at 10% (as seen through yarn application --list), and when I examined the logs, all cores ...
YonGU's user avatar
  • 694
1 vote
1 answer
2k views

How to enable a python library over EMR core nodes to start EMR spark application step

I am trying to run an EMR step (1 master and 2 core nodes) with a very simple Python script that I uploaded to S3 to be used in an EMR Spark application step. This script reads a data.txt file in S3 and ...
Yaser's user avatar
  • 29
0 votes
1 answer
421 views

How to submit hadoop MR job remotely on Amazon EMR cluster

Current situation: I have an EMR cluster. On the master node - I have a python program that does a subprocess call and executes the script that contains the following line. The subprocess triggers the ...
yguw's user avatar
  • 844
1 vote
0 answers
293 views

Importing Python files from S3 to Amazon Elastic MapReduce

I want to include compiled proto classes in my mapreduce and figured the easiest way to do so would be to zip the required python files as a tar.gz and upload them to S3. I came across this ...
M. Bloom's user avatar
0 votes
1 answer
280 views

My python job I run on the master of EMR cluster fails, how do I troubleshoot?

I ssh to the master and run my hadoop job on the console for development purposes. My job fails in a mysterious way, with many java stack traces that make no sense to me, see below: java.lang....
gae123's user avatar
  • 9,417
0 votes
1 answer
266 views

Configure EMR Hadoop Yarn from CLI

I am looking for an efficient way to modify both the mapred-site.xml and the yarn-site.xml in my configuration file for Hadoop on AWS EMR. I can achieve this manually using vim to edit it however I ...
gold_cy's user avatar
  • 14.2k
1 vote
0 answers
195 views

Hadoop Streaming job error

Hello, I am trying to run a Hadoop Streaming job using Python in EMR 4.7.2 with the following command: hadoop-streaming -archives s3://mybucket/scripts/HDP/python_scripts/py.tgz -mapper py.tgz/...
AKSHAY SHINGOTE's user avatar
0 votes
0 answers
246 views

EMR job stuck at 67% without any response

I have written a simple script to analyze commoncrawl data. Following is the snippet of my mapper. src_code = record.payload.read().replace('\r', '').split('\n\n')[1] soup = BeautifulSoup(src_code....
Hafiz Muhammad Shafiq's user avatar
0 votes
0 answers
448 views

AWS Sentiment Analysis tutorial using Naive Bayes Classifier

I am following the AWS Sentiment Analysis tutorial from here. The problem I am having is, the classifier is never finding negative tweets. It always displays only the positive and neutral ones like ...
Kemat Rochi's user avatar
0 votes
3 answers
104 views

create step spark python, amazon hadoop

I am creating a Spark step with Hadoop on Amazon, but it hangs indefinitely. I don't know whether my code is wrong or I am submitting the command incorrectly, but I cannot find a way out. The command I pass is spark-submit --...
David's user avatar
  • 119
3 votes
0 answers
562 views

Streaming Program on AWS EMR not working

I'm currently trying to set up a really simple streaming program (the wordcount example) but I can't get it to run successfully. My mapper and reducer scripts run without any errors when I start ...
mammago's user avatar
  • 247
1 vote
2 answers
197 views

Mapping a range of warc.gz files, EMR

I have been running a streaming step in AWS/EMR with a mapper and reducer written in Python to map some of the archives in Common Crawl for sentiment analysis. I am moving from the older common ...
DataGuy's user avatar
  • 1,725
1 vote
1 answer
236 views

How to print the result from reducer into single file

I am using Amazon EMR, and because of the way it works (in parallel) my output gets split into multiple files. But I would like to have one file instead, in the right sequence. Is it possible to do just ...
Petros Kyriakou's user avatar
2 votes
1 answer
165 views

Hadoop returning less results than expected

I have two Python scripts, a mapper and a reducer (at this point the reducer basically just prints, nothing else). Locally I get 4 results (strings), but on Hadoop I get 3. How does this work? I use ...
Petros Kyriakou's user avatar
0 votes
1 answer
309 views

Submit jobs to EMR cluster using MRJob

MRJob waits until each job completes before giving back control to the user. I broke down a large EMR step into smaller ones and would like to submit them all in one shot. The docs talk about ...
Pykler's user avatar
  • 14.8k
0 votes
2 answers
537 views

MRJob determining if running inline, local, emr or hadoop

I am building on some old code from a few years back using the commoncrawl dataset with EMR using MRJob. The code uses the following inside MRJob subclass mapper function to determine whether running ...
Pykler's user avatar
  • 14.8k
1 vote
0 answers
526 views

AWS EMR - Python path, git repo and scripts

I am running MapReduce jobs on Hive and most of the code already resides in a git repo. I know I am able to include instructions in the bootstrap script when spawning up clusters, but is it possible ...
intl's user avatar
  • 2,773
8 votes
2 answers
17k views

how to find JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar

I'm following a video tutorial from Pluralsight about Amazon EMR. I am stuck and cannot proceed because I am getting this error: Not a valid JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar ...
harshil bhatt's user avatar
0 votes
1 answer
412 views

run mrjob on Amazon EMR, t2.micro not supported

I tried to run an mrjob script on Amazon EMR. It worked well when I used a c1.medium instance; however, it raised an error when I changed the instance to t2.micro. The full error message is shown below. C:\...
neil ye's user avatar
0 votes
2 answers
394 views

Loading my own python modules for Pig UDFs on Amazon EMR

I am trying to call two of my own modules from Pig. Here's module_one.py: import sys print sys.path def foo(): pass Here's module_two.py: from module_one import foo def bar(): foo() I ...
MaratC's user avatar
  • 6,794
2 votes
0 answers
3k views

how to run a python script present in the cluster node file system in Amazon EMR

I am looking for ways to run a python script against an EMR cluster. I went through some documentation here and here but they only list data source as S3, or DynamoDB. How can I execute a python ...
user2966197's user avatar
  • 2,981
1 vote
1 answer
292 views

Error in loading a file into EMR distributed cache using elastic-mapreduce

I'm using below command to launch a cluster. ./elastic-mapreduce --create \ --stream \ --cache s3n://bucket_name/code/totalInstallUsers#totalInstallUsers \ --input s3n://bucket_name/input \ --...
John Knight's user avatar
  • 3,243
0 votes
0 answers
59 views

EMR issues with reducer.py

I'm running AWS and trying to run a simulation on EMR. I know my mapper.py file is correct, but I can't seem to figure out why my reducer.py file isn't working correctly. The idea was to ...
Samuel Cosby's user avatar
1 vote
0 answers
181 views

EmrResponseError: 505 HTTP Version Not Supported

I got the following error when I ran the python file from ec2 machine. Hadoop version: 2.4.0 ami version : 3.5.0 Boto Version : 2.32.0 Traceback (most recent call last): File "/home/ec2-user/...
Brisi's user avatar
  • 1,811
1 vote
1 answer
237 views

installing PIG 0.14 on Amazon EMR

I need to run Python streaming UDFs from PIG on Amazon EMR using Hadoop 2.x. Based on the documentation, PIG works with Hadoop 2.x since version 0.14 http://pig.apache.org/docs/r0.12.0/udf.html#python-...
ziky90's user avatar
  • 2,697
7 votes
1 answer
2k views

Pydoop gets stuck on readline from HDFS files

I am reading the first line of every file in a directory. Locally it works fine, but on EMR this test gets stuck at around the 200th-300th file. Also, ps -eLF shows the number of child processes increasing to 3000 even ...
maaz's user avatar
  • 4,483
0 votes
1 answer
557 views

MapReduce job (written in Python) runs slowly on EMR

I am trying to write a MapReduce job using Python's MRJob package. The job processes ~36,000 files stored in S3. Each file is ~2 MB. When I run the job locally (downloading the S3 bucket to my computer)...
DickJ's user avatar
  • 313
1 vote
1 answer
1k views

How to load additional JARs for a Hadoop Streaming job on Amazon EMR

TL;DR How can I upload or specify additional JARs for a Hadoop Streaming job on Amazon Elastic MapReduce (Amazon EMR)? Long version: I want to analyze a set of Avro files (> 2000 files) using Hadoop ...
CristianCantoro's user avatar
23 votes
7 answers
43k views

Pyspark --py-files doesn't work

I use this as the documentation suggests http://spark.apache.org/docs/1.1.1/submitting-applications.html Spark version 1.1.0 ./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \ /...
C19's user avatar
  • 758
-1 votes
1 answer
93 views

How do I create "side-effect" files using Python streaming on AWS Elastic MapReduce?

I'm running a Python streaming job on Amazon's Elastic MapReduce which needs to output multiple files from the reducer. The descriptions I've found on the web of how to do this have all been old, so ...
Tom Morris's user avatar
  • 10.5k
0 votes
1 answer
331 views

import custom function in MapReduce code on AWS EMR

I have been struggling with this for 2 hours now! I created a mapper script in Python which imports one of my custom functions from another Python script. #!/usr/bin/env python import sys ...
pan8863's user avatar
  • 733
0 votes
1 answer
370 views

Amazon EMR job with many json files as input

I am writing a hadoop streaming application in python to run on EMR. The input for the EMR job is a directory of files in an S3 bucket, each of which is a json file containing a single json object. I ...
Jay Hack's user avatar
  • 139
0 votes
1 answer
139 views

Issue with using files in distributed cache in Elastic MapReduce

I'm trying to make use of an external library in my Python mapper script in an AWS Elastic MapReduce job. However, my script doesn't seem to be able to find the modules in the cache. I archived the ...
user296554's user avatar
0 votes
1 answer
591 views

Hadoop EMR using Python

I'm using Hadoop streaming to use my mapper and reducer code in python to run a Mapreduce job. I have input data in s3, and I'm trying to use that for the job. However, when I run the command like ...
aishpr's user avatar
  • 143
3 votes
1 answer
520 views

File processing using AWS EMR

I need an architectural suggestion for a problem I'm working on. I have log files coming in every 15 minutes in a gzipped folder. Each of these has about 100,000 further files to process. I have a ...
Amey's user avatar
  • 31
0 votes
3 answers
325 views

How is data partitioned and distributed among datanodes in MapReduce?

I'm new to MapReduce, and my task is to process large data (lines of records). One thing I need in my mapper is the line number of a specific record, and then the reducer processes the line number ...
i3wangyi's user avatar
  • 2,399
1 vote
1 answer
3k views

Hadoop Streaming program subprocess failed with code 139

I'm running a Hadoop streaming program (written in Python) over Amazon EMR that is having some issues. It all runs fine when I do tests with a few thousand records and I've tested the program locally ...
acnutch's user avatar
  • 104
1 vote
2 answers
2k views

Bootstrapping libraries on EMR using python MRJob

Problem Statement: I am trying to run a map-reduce job in Amazon EMR using python MRJob library, and I am having trouble with bootstrapping the nodes with the requisite libraries and packages. ...
Shreyas's user avatar
  • 377
3 votes
1 answer
526 views

Hadoop streaming on AWS - Sentiment Analysis Example

I am doing the AWS Big data example: sentiment analysis using Hadoop streaming with Python code (link below:) http://blog.newitfarmer.com/anls/analytics-bi/sentiment-analysis-analytics-bi/13436/...
user3148246's user avatar
1 vote
4 answers
4k views

Combine output files of MapReduce job

I have written a Mapper and Reducer in Python and have executed them successfully on Amazon's Elastic MapReduce (EMR) using Hadoop Streaming. The final result folder contains the output in three ...
Arun Kumar's user avatar
6 votes
0 answers
286 views

Processing MongoDB data using Mrjob on Amazon EMR

I know that Mrjob uses Hadoop Streaming. I also know that there is a plugin for using MongoDB with Hadoop Streaming. However, I couldn't find any examples of bringing the two together. Is this (at least ...
Eser Aygün's user avatar
  • 7,984
1 vote
1 answer
541 views

MapReduce Amazon Python: get the line number of the input file

I have several texts and I want to know the line number and the file where a word appears. I get the file correctly but not the line number. This is the map: #!/usr/bin/env python import sys import os ...
Carlos S's user avatar
2 votes
1 answer
3k views

AWS Elastic mapreduce doesn't seem to be correctly converting the streaming to jar

I have a mapper and reducer that work fine when I run them in the piped version: cat data.csv | ./mapper.py | sort -k1,1 | ./reducer.py I used the elastic mapreducer wizard, loaded inputs, outputs, ...
Mittenchops's user avatar
  • 19.6k
0 votes
2 answers
546 views

Threading with Hadoop Streaming

I am using Hadoop Streaming to write a Python-based HTML grabber. I find that running a single-threaded Python script is slow, and I want to change it to a multithreaded version. Does anyone know ...
viper's user avatar
  • 2,210
3 votes
3 answers
2k views

How do I write the output of an EMR streaming job to HDFS?

I see examples of people writing EMR output to HDFS, but I haven't been able to find examples of how it's done. On top of that, this documentation seems to say that the --output parameter for an EMR ...
Abe's user avatar
  • 23.8k
1 vote
1 answer
1k views

Map Reduce multiple outputs in python boto

I am trying to partition an input file using AWS EMR. I use a streaming step to read from stdin. I want to split this file into 2 files based on the values of specific fields from each line of stdin ...
Zihs's user avatar
  • 347