
I am using boto 2.8.0 to create EMR jobflows over large log files stored in S3. I am relatively new to Elastic MapReduce and am using this issue to get a feel for how to handle jobflows properly.

The log files in question are stored in S3 with keys that correspond to the dates they are emitted from the logging server, e.g. /2013/03/01/access.log. These files are very, very large. My MapReduce job runs an Apache Pig script that simply examines some of the URI paths stored in the log files and outputs generalized counts that correspond to our business logic.

My boto client code takes datetimes as CLI input and schedules a jobflow with a PigStep instance for every date needed. Thus, running something like python script.py 2013-02-01 2013-03-01 would iterate over 29 days' worth of datetime objects and create PigSteps with the respective S3 input keys. This means the resulting jobflow can have many, many steps, one for each day in the timedelta between from_date and to_date.
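Roughly, the client code looks like the sketch below. The bucket name, script path, and helper names are placeholders, not the real ones; the boto 2.8.0 calls (PigStep, run_jobflow) are shown in comments because they need live AWS credentials to run:

```python
from datetime import date, timedelta

PIG_SCRIPT = 's3://my-bucket/scripts/count_uris.pig'  # placeholder path

def daily_input_keys(from_date, to_date, bucket='my-log-bucket'):
    """Yield one S3 input URI per day in [from_date, to_date], inclusive."""
    day = from_date
    while day <= to_date:
        yield 's3://%s/%04d/%02d/%02d/access.log' % (
            bucket, day.year, day.month, day.day)
        day += timedelta(days=1)

def build_pig_args(input_uri, output_uri):
    # Parameters handed to the Pig script via -p NAME=VALUE.
    return ['-p', 'INPUT=%s' % input_uri, '-p', 'OUTPUT=%s' % output_uri]

# With boto 2.8.0, this list becomes one PigStep per day:
#
#   import boto.emr
#   from boto.emr.step import PigStep
#
#   conn = boto.emr.connect_to_region('us-east-1')
#   steps = [PigStep('count %s' % uri, PIG_SCRIPT,
#                    pig_args=build_pig_args(uri, uri + '-out'))
#            for uri in daily_input_keys(date(2013, 2, 1), date(2013, 3, 1))]
#   conn.run_jobflow(name='log counts', steps=steps,
#                    num_instances=2,
#                    master_instance_type='m1.small',
#                    slave_instance_type='m1.small')

keys = list(daily_input_keys(date(2013, 2, 1), date(2013, 3, 1)))
print(len(keys))   # 29: all of February 2013 plus March 1
print(keys[0])
```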

My problem is that the EMR jobflow is exceedingly slow, almost absurdly so. It has been running for a night now and hasn't made it even halfway through that example set. Is there something wrong with creating many jobflow steps like this? Should I generalize the Pig script over the different keys instead, rather than preprocessing in the client code and creating a step for each date? Is this a feasible place to look for an optimization on Elastic MapReduce? It's worth mentioning that a similar job over a month's worth of comparable data, submitted through the AWS elastic-mapreduce Ruby CLI client, took about 15 minutes to execute (that job was driven by the same Pig script).
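For the "generalize the Pig script" alternative, one sketch (bucket name and the INPUT parameter name are assumptions) is to cover the whole date range with a single step by passing a glob pattern, since Hadoop's file-input layer accepts globs in load paths:

```python
def range_input_glob(year, month, bucket='my-log-bucket'):
    """Build one glob URI matching every daily log under a given month,
    e.g. s3://my-log-bucket/2013/02/*/access.log."""
    return 's3://%s/%04d/%02d/*/access.log' % (bucket, year, month)

glob_uri = range_input_glob(2013, 2)
print(glob_uri)

# The Pig script would then load via the parameter (names are assumptions):
#   raw = LOAD '$INPUT' USING TextLoader();
# and the client schedules a single PigStep with
#   pig_args=['-p', 'INPUT=%s' % glob_uri, '-p', 'OUTPUT=...']
```

One step over a glob lets Hadoop schedule all the daily files as one job rather than paying per-step startup overhead 29 times.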

EDIT

I neglected to mention that the job was scheduled on two instances of type m1.small, which admittedly may itself be the problem.

  • small instance is rather small. Add ganglia (as bootstrap step to EMR) and check how it behaves (CPU load, memory...).
    – Guy
    Commented Mar 20, 2013 at 6:10
  • Is it the lag between job steps that's killing you, or are the steps themselves slow? I typically see a ~30-second to 2-minute delay between steps. What type and number of hosts did your Ruby example use? Is your job map-only, or does it have reducers? If it has reducers, how many did you set it to use (Pig chooses only 1 by default)? Two instances of m1.small doesn't sound much better than running this job without Hadoop; what speed would you expect running the job in local mode?
    – DMulligan
    Commented Mar 24, 2013 at 0:46

