Newest 'amazon-emr+python+hadoop' Questions

0 votes

0 answers

56 views

Can't reach HBase (on S3) from PySpark

I have created a cluster on EMR and I want to use HBase with PySpark. I am new to distributed systems, so I might make amateur mistakes, but connecting to HBase from PySpark feels very hard. My ...

Asfandyar Abbasi

83

asked Mar 12 at 9:03

1 vote

0 answers

266 views

Master tasks on Core Nodes using AWS EMR Hadoop

Using EMR 6.X series, how does one ensure that master tasks run on Core nodes? Reading this page it looks like all it takes are two parameters: yarn.node-labels.enabled: true yarn.node-labels....

Stephen

146

asked Nov 18, 2021 at 20:18

1 vote

1 answer

1k views

'Could not find valid SPARK_HOME while searching' on AWS EMR

While running a Python script on an EMR cluster using the spark-submit command, the process got stuck at 10% (as can be seen through yarn application --list), and when I examined the logs, all cores ...

YonGU

694

asked Oct 21, 2020 at 11:11

1 vote

1 answer

2k views

How to enable a python library over EMR core nodes to start EMR spark application step

I am trying to run an EMR (1 master and 2 core nodes) step with a very simple Python script that I uploaded to S3 to be used in an EMR Spark application step. This script reads a data.txt file in S3 and ...

Yaser

29

asked Feb 8, 2020 at 22:01

0 votes

1 answer

421 views

How to submit hadoop MR job remotely on Amazon EMR cluster

Current situation: I have an EMR cluster. On the master node - I have a python program that does a subprocess call and executes the script that contains the following line. The subprocess triggers the ...

yguw

844

asked Dec 6, 2018 at 0:23

1 vote

0 answers

293 views

Importing Python files from S3 to Amazon Elastic MapReduce

I want to include compiled proto classes in my mapreduce and figured the easiest way to do so would be to zip the required python files as a tar.gz and upload them to S3. I came across this ...

M. Bloom

43

asked Dec 5, 2018 at 19:00

0 votes

1 answer

280 views

My python job I run on the master of EMR cluster fails, how do I troubleshoot?

I ssh to the master and run my hadoop job on the console for development purposes. My job fails in a mysterious way, with many java stack traces that make no sense to me, see below: java.lang....

gae123

9,417

asked Aug 31, 2017 at 22:09

0 votes

1 answer

266 views

Configure EMR Hadoop Yarn from CLI

I am looking for an efficient way to modify both the mapred-site.xml and the yarn-site.xml in my configuration file for Hadoop on AWS EMR. I can achieve this manually by editing them with vim; however, I ...

gold_cy

14.2k

asked Apr 8, 2017 at 1:47

1 vote

0 answers

195 views

Hadoop Streaming job error

Hello, I am trying to run a Hadoop Streaming job using Python on EMR 4.7.2 with the following command: hadoop-streaming -archives s3://mybucket/scripts/HDP/python_scripts/py.tgz -mapper py.tgz/...

AKSHAY SHINGOTE

407

asked Feb 6, 2017 at 10:15

0 votes

0 answers

246 views

EMR job stuck at 67% without any response

I have written a simple script to analyze commoncrawl data. Following is the snippet of my mapper. src_code = record.payload.read().replace('\r', '').split('\n\n')[1] soup = BeautifulSoup(src_code....

Hafiz Muhammad Shafiq

8,660

asked Jan 13, 2017 at 12:19

0 votes

0 answers

448 views

AWS Sentiment Analysis tutorial using Naive Bayes Classifier

I am following the AWS Sentiment Analysis tutorial from here. The problem I am having is, the classifier is never finding negative tweets. It always displays only the positive and neutral ones like ...

Kemat Rochi

952

asked Nov 2, 2016 at 20:04

0 votes

3 answers

104 views

create step spark python, amazon hadoop

I am creating a Spark step with Hadoop on Amazon, but it hangs indefinitely. I don't know whether my code is wrong or I am submitting the command incorrectly, but I cannot find a way out. I pass the code spark-submit --...

David

119

asked Aug 25, 2016 at 7:21

3 votes

0 answers

562 views

Streaming Program on AWS EMR not working

I'm currently trying to set up a really simple streaming program (the wordcount example) but I can't get it to run successfully. My mapper and reducer scripts run without any errors when I start ...

mammago

247

asked Jul 29, 2016 at 6:46

1 vote

2 answers

197 views

Mapping a range of warc.gz files, EMR

I have been running a streaming step in AWS/EMR with a mapper and reducer written in Python to map some of the archives in Common Crawl for sentiment analysis. I am moving from the older common ...

DataGuy

1,725

asked Jul 7, 2016 at 15:51

1 vote

1 answer

236 views

How to print the result from reducer into single file

I am using Amazon EMR, and because of the way it works (in parallel) my output gets split into multiple files. But I would like to have one file instead, in the right sequence. Is it possible to do just ...

Petros Kyriakou

5,323

asked May 13, 2016 at 1:14

2 votes

1 answer

165 views

Hadoop returning less results than expected

I have two Python scripts, a mapper and a reducer (at this point the reducer basically just prints, nothing else), and while locally I get 4 results (strings), on Hadoop I get 3. How does this work? I use ...

Petros Kyriakou

5,323

asked Apr 27, 2016 at 20:16

0 votes

1 answer

309 views

Submit jobs to EMR cluster using MRJob

MRJob waits until each job completes before giving back control to the user. I broke down a large EMR step into smaller ones and would like to submit them all in one shot. The docs talk about ...

Pykler

14.8k

asked Apr 26, 2016 at 19:50

0 votes

2 answers

537 views

MRJob determining if running inline, local, emr or hadoop

I am building on some old code from a few years back using the commoncrawl dataset with EMR using MRJob. The code uses the following inside MRJob subclass mapper function to determine whether running ...

Pykler

14.8k

asked Apr 23, 2016 at 15:26

1 vote

0 answers

526 views

AWS EMR - Python path, git repo and scripts

I am running MapReduce jobs on Hive and most of the code already resides in a git repo. I know I am able to include instructions in the bootstrap script when spawning up clusters, but is it possible ...

intl

2,773

asked Nov 25, 2015 at 12:08

8 votes

2 answers

17k views

how to find JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar

I'm practicing with a video tutorial from Pluralsight about Amazon EMR. I am stuck and cannot proceed because I am getting this error: Not a valid JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar ...

harshil bhatt

152

asked Sep 12, 2015 at 21:07

0 votes

1 answer

412 views

run mrjob on Amazon EMR, t2.micro not supported

I tried to run an mrjob script on Amazon EMR. It worked well when I used a c1.medium instance; however, it raised an error when I changed the instance to t2.micro. The full error message is shown below. C:\...

neil ye

5

asked Jul 31, 2015 at 1:39

0 votes

2 answers

394 views

Loading my own python modules for Pig UDFs on Amazon EMR

I am trying to call two of my own modules from Pig. Here's module_one.py: import sys print sys.path def foo(): pass Here's module_two.py: from module_one import foo def bar(): foo() I ...

MaratC

6,794

asked Jun 14, 2015 at 12:41

2 votes

0 answers

3k views

how to run a python script present in the cluster node file system in Amazon EMR

I am looking for ways to run a python script against an EMR cluster. I went through some documentation here and here but they only list data source as S3, or DynamoDB. How can I execute a python ...

user2966197

2,981

asked Jun 8, 2015 at 20:04

1 vote

1 answer

292 views

Error in loading a file into EMR distributed cache using elastic-mapreduce

I'm using below command to launch a cluster. ./elastic-mapreduce --create \ --stream \ --cache s3n://bucket_name/code/totalInstallUsers#totalInstallUsers \ --input s3n://bucket_name/input \ --...

John Knight

3,243

asked May 8, 2015 at 0:40

0 votes

0 answers

59 views

EMR issues with reducer.py

I'm running AWS and trying to run a simulation on the EMR setting. I know my mapper.py file is correct but I can't seem to figure out why my reducer.py file isn't working correctly. The idea was to ...

Samuel Cosby

1

asked Apr 30, 2015 at 16:08

1 vote

0 answers

181 views

EmrResponseError: 505 HTTP Version Not Supported

I got the following error when I ran the python file from ec2 machine. Hadoop version: 2.4.0 ami version : 3.5.0 Boto Version : 2.32.0 Traceback (most recent call last): File "/home/ec2-user/...

Brisi

1,811

asked Mar 20, 2015 at 6:04

1 vote

1 answer

237 views

installing PIG 0.14 on Amazon EMR

I need to run Python streaming UDFs from PIG on Amazon EMR using Hadoop 2.x. Based on the documentation, PIG works with Hadoop 2.x since version 0.14 http://pig.apache.org/docs/r0.12.0/udf.html#python-...

ziky90

2,697

asked Mar 4, 2015 at 21:39

7 votes

1 answer

2k views

Pydoop gets stuck on readline from HDFS files

I am reading the first line of every file in a directory. Locally it works fine, but on EMR this test gets stuck at around the 200th-300th file. Also, ps -eLF shows the number of children increasing to 3000 even ...

maaz

4,483

asked Feb 24, 2015 at 9:48

0 votes

1 answer

557 views

MapReduce Job (written in python) run slow on EMR

I am trying to write a MapReduce job using python's MRJob package. The job processes ~36,000 files stored in S3. Each file is ~2MB. When I run the job locally (downloading the S3 bucket to my computer)...

DickJ

313

asked Feb 22, 2015 at 15:53

1 vote

1 answer

1k views

How to load additional JARs for a Hadoop Streaming job on Amazon EMR

TL;DR How can I upload or specify additional JARs for a Hadoop Streaming job on Amazon Elastic MapReduce (Amazon EMR)? Long version: I want to analyze a set of Avro files (> 2000 files) using Hadoop ...

CristianCantoro

841

asked Feb 7, 2015 at 21:17

23 votes

7 answers

43k views

Pyspark --py-files doesn't work

I use this as the documentation suggests: http://spark.apache.org/docs/1.1.1/submitting-applications.html Spark version 1.1.0 ./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \ /...

C19

758

asked Dec 25, 2014 at 5:46

-1 votes

1 answer

93 views

How do I create "side-effect" files using Python streaming on AWS Elastic MapReduce?

I'm running a Python streaming job on Amazon's Elastic MapReduce which needs to output multiple files from the reducer. The descriptions I've found on the web of how to do this have all been old, so ...

Tom Morris

10.5k

asked Nov 18, 2014 at 17:23

0 votes

1 answer

331 views

import custom function in MapReduce code on AWS EMR

I have been struggling with this for 2 hours now! I created a mapper script in python which is importing one of my custom functions in other python script. #!/usr/bin/env python import sys ...

pan8863

733

asked Nov 9, 2014 at 2:26

0 votes

1 answer

370 views

Amazon EMR job with many json files as input

I am writing a hadoop streaming application in python to run on EMR. The input for the EMR job is a directory of files in an S3 bucket, each of which is a json file containing a single json object. I ...

Jay Hack

139

asked Jul 23, 2014 at 13:36

0 votes

1 answer

139 views

Issue with using files in distributed cache in Elastic MapReduce

I'm trying to make use of an external library in my Python mapper script in an AWS Elastic MapReduce job. However, my script doesn't seem to be able to find the modules in the cache. I archived the ...

user296554

11

asked Jul 10, 2014 at 5:09

0 votes

1 answer

591 views

Hadoop EMR using Python

I'm using Hadoop streaming to use my mapper and reducer code in python to run a Mapreduce job. I have input data in s3, and I'm trying to use that for the job. However, when I run the command like ...

aishpr

143

asked Jun 20, 2014 at 0:12

3 votes

1 answer

520 views

File processing using AWS EMR

I need architectural suggestion for this problem I'm working on. I have log files coming in every 15 minutes in gzipped folder. Each of these have about 100,000 further files to process. I have a ...

Amey

31

asked May 27, 2014 at 21:29

0 votes

3 answers

325 views

How is data partitioned and distributed among datanodes in MapReduce?

I'm new to MapReduce, and my task is to process large data (lines of records). One thing I need in my mapper is the line number of a specific record, and then the reducer processes the line number ...

i3wangyi

2,399

asked May 22, 2014 at 2:24

1 vote

1 answer

3k views

Hadoop Streaming program subprocess failed with code 139

I'm running a Hadoop streaming program (written in Python) over Amazon EMR that is having some issues. It all runs fine when I do tests with a few thousand records and I've tested the program locally ...

acnutch

104

asked May 13, 2014 at 2:25

1 vote

2 answers

2k views

Bootstrapping libraries on EMR using python MRJob

Problem Statement: I am trying to run a map-reduce job in Amazon EMR using python MRJob library, and I am having trouble with bootstrapping the nodes with the requisite libraries and packages. ...

Shreyas

377

asked May 3, 2014 at 5:15

3 votes

1 answer

526 views

Hadoop streaming on AWS - Sentiment Analysis Example

I am doing the AWS Big data example: sentiment analysis using Hadoop streaming with Python code (link below:) http://blog.newitfarmer.com/anls/analytics-bi/sentiment-analysis-analytics-bi/13436/...

user3148246

31

asked Dec 31, 2013 at 4:02

1 vote

4 answers

4k views

Combine output files of MapReduce job

I have written a Mapper and Reducer in Python and have executed them successfully on Amazon's Elastic MapReduce (EMR) using Hadoop Streaming. The final result folder contains the output in three ...

Arun Kumar

101

asked Dec 14, 2013 at 8:21
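
For readers landing on this entry: one common way to end up with a single file is to concatenate the reducer part files after the job finishes (the `hadoop fs -getmerge` command does this on the cluster side). The following is only a minimal sketch of the same idea in Python, assuming the part files have already been copied to a local directory and follow the usual `part-*` naming; the function name `merge_part_files` and both paths are hypothetical and do not come from the question.

```python
import shutil
from pathlib import Path


def merge_part_files(output_dir, merged_path):
    """Concatenate part-* files from output_dir into merged_path, in name order.

    Hypothetical helper: assumes the job output was already downloaded locally.
    Returns the names of the part files that were merged.
    """
    parts = sorted(Path(output_dir).glob("part-*"))
    with open(merged_path, "wb") as merged:
        for part in parts:
            with part.open("rb") as src:
                # Stream each part so large outputs are not held in memory.
                shutil.copyfileobj(src, merged)
    return [part.name for part in parts]
```

Name order only reproduces a globally sorted result if the job's partitioning already guarantees it; otherwise the merged file still needs a final sort.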

6 votes

0 answers

286 views

Processing MongoDB data using Mrjob on Amazon EMR

I know that Mrjob uses Hadoop Streaming. I also know that there is a plugin for using MongoDB with Hadoop Streaming. However, I couldn't find any examples on bringing two together. Is this (at least ...

Eser Aygün

7,984

asked Dec 6, 2013 at 12:15

1 vote

1 answer

541 views

MapReduce Amazon Python Get the line number of the input file

I have several texts and I want to know the line number and the file where a word appears. I got the file well but not the line number. This is the map #!/usr/bin/env python import sys import os ...

Carlos S

13

asked Oct 12, 2013 at 12:14

2 votes

1 answer

3k views

AWS Elastic mapreduce doesn't seem to be correctly converting the streaming to jar

I have a mapper and reducer that work fine when I run them in the piped version: cat data.csv | ./mapper.py | sort -k1,1 | ./reducer.py I used the elastic mapreducer wizard, loaded inputs, outputs, ...

Mittenchops

19.6k

asked Sep 1, 2013 at 7:34
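
The piped local test quoted in this entry (`cat data.csv | ./mapper.py | sort -k1,1 | ./reducer.py`) can also be mimicked inside one Python process, which sometimes makes it easier to compare local behaviour with a streaming step. This is only a rough sketch: the question's actual scripts are not shown, so the word-count `mapper` and `reducer` below are hypothetical stand-ins, and `run_local` is an illustrative name.

```python
from itertools import groupby


def mapper(lines):
    """Hypothetical mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Hypothetical reducer: sum the counts of each run of identical keys."""
    keyed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Emulate `mapper | sort -k1,1 | reducer` in a single process."""
    mapped = mapper(lines)
    # Sorting on the key stands in for `sort -k1,1`, i.e. the shuffle phase.
    shuffled = sorted(mapped, key=lambda record: record.split("\t", 1)[0])
    return list(reducer(shuffled))


if __name__ == "__main__":
    for record in run_local(["a b a", "b c"]):
        print(record)
```

A pipeline that passes locally like this can still fail on the cluster for environmental reasons (interpreter path, file permissions, missing modules), so it checks the logic only.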

0 votes

2 answers

546 views

Threading with Hadoop Streaming

I am making use of Hadoop streaming to write a python based HTML grabber. I find that running a single threaded python script is slow. I want to modify it to a multithreaded version. Does anyone know ...

viper

2,210

asked Aug 6, 2013 at 17:53

3 votes

3 answers

2k views

How do I write the output of an EMR streaming job to HDFS?

I see examples of people writing EMR output to HDFS, but I haven't been able to find examples of how it's done. On top of that, this documentation seems to say that the --output parameter for an EMR ...

Abe

23.8k

asked May 8, 2013 at 4:27

1 vote

1 answer

1k views

Map Reduce multiple outputs in python boto

I am trying to partition an input file using AWS EMR. I use a streaming step to read from stdin. I want to split this file into 2 files based on the values of specific fields from each line of stdin ...

Zihs

347

asked Apr 30, 2013 at 16:09
