I am using boto 2.8.0 to create EMR jobflows over large log files stored in S3. I am relatively new to Elastic MapReduce and am still getting a feel for how to properly handle jobflows.
The log files in question are stored in S3 with keys that correspond to the dates they are emitted from the logging server, e.g. /2013/03/01/access.log. These files are very, very large. My mapreduce job runs an Apache Pig script that simply examines some of the URI paths stored in the log files and outputs generalized counts that correspond to our business logic.
My client code in boto takes datetimes as input on the CLI and schedules a jobflow with a PigStep instance for every date needed. Thus, passing something like python script.py 2013-02-01 2013-03-01 would iterate over 29 days' worth of datetime objects and create PigSteps with the respective input keys for S3. This means that the resulting jobflow could have many, many steps, one for each day in the timedelta between the from_date and to_date.
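A minimal sketch of how such a client might look with boto 2.8.0 (the bucket names, script path, output prefix, and the daterange helper are illustrative assumptions, not my actual code):

```python
# Sketch: schedule one PigStep per day with boto 2.8.0.
# Bucket names, the script path, and the output prefix are placeholder assumptions.
import sys
from datetime import datetime, timedelta

import boto.emr
from boto.emr.step import PigStep

def daterange(from_date, to_date):
    """Yield each date from from_date to to_date inclusive."""
    current = from_date
    while current <= to_date:
        yield current
        current += timedelta(days=1)

def main():
    from_date = datetime.strptime(sys.argv[1], '%Y-%m-%d').date()
    to_date = datetime.strptime(sys.argv[2], '%Y-%m-%d').date()

    steps = []
    for day in daterange(from_date, to_date):
        input_key = 's3://my-log-bucket/%04d/%02d/%02d/access.log' % (
            day.year, day.month, day.day)
        output_key = 's3://my-output-bucket/counts/%s/' % day.isoformat()
        # One PigStep per day, passing the day's input/output keys as Pig parameters.
        steps.append(PigStep(
            name='pig-%s' % day.isoformat(),
            pig_file='s3://my-script-bucket/count_uris.pig',
            pig_args=['-p', 'INPUT=%s' % input_key,
                      '-p', 'OUTPUT=%s' % output_key]))

    conn = boto.emr.connect_to_region('us-east-1')
    jobflow_id = conn.run_jobflow(
        name='log-counts %s to %s' % (from_date, to_date),
        log_uri='s3://my-log-bucket/emr-logs/',
        steps=steps,
        num_instances=2,
        master_instance_type='m1.small',
        slave_instance_type='m1.small')
    print jobflow_id

if __name__ == '__main__':
    main()
```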
My problem is that my EMR jobflow is exceedingly slow, almost absurdly so. It's been running overnight and hasn't made it even halfway through that example set. Am I doing something wrong by creating many jobflow steps like this? Should I instead generalize the Pig script to handle the different keys, rather than preprocessing them in the client code and creating a step for each date? Is this a feasible place to look for an optimization on Elastic MapReduce? It's worth mentioning that a similar job for a month's worth of comparable data passed to the AWS elastic-mapreduce CLI Ruby client took about 15 minutes to execute (that job was driven by the same Pig script).
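To illustrate what generalizing might look like, here is a rough sketch of a single step that reads a whole month of logs via an S3 glob in its INPUT parameter (again, bucket and script paths are placeholders, and this assumes the Pig script loads $INPUT):

```python
# Sketch: one PigStep over a month of logs via an S3 glob, instead of one step per day.
# Bucket names and the script path are placeholder assumptions.
import boto.emr
from boto.emr.step import PigStep

step = PigStep(
    name='pig-2013-02',
    pig_file='s3://my-script-bucket/count_uris.pig',
    pig_args=['-p', 'INPUT=s3://my-log-bucket/2013/02/*/access.log',
              '-p', 'OUTPUT=s3://my-output-bucket/counts/2013-02/'])

conn = boto.emr.connect_to_region('us-east-1')
conn.run_jobflow(
    name='log-counts 2013-02',
    log_uri='s3://my-log-bucket/emr-logs/',
    steps=[step],
    num_instances=2,
    master_instance_type='m1.small',
    slave_instance_type='m1.small')
```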
EDIT
I neglected to mention that the job was scheduled for two instances of type m1.small, which admittedly may itself be the problem.