We are running an Oozie workflow that has a Shell action and a Spark action, i.e. a shell script and a Spark job that run in sequence.
Running a single workflow:
- Total: 3 mins
- Shell action: 50 secs
- Spark job: 2 mins
- The rest of the time goes to Oozie initialization and YARN container allocation, which is absolutely fine.
Use case: We are supposed to run 700 instances of the same workflow at once (by region, zone and area, which is a business requirement).
When running the 700 instances of the same workflow, we notice a delay in their completion even though we have scaled the cluster linearly. We expect the 700 workflows to complete in 3 mins, or at least within 5 mins, but this is not the case. It takes about 5 mins to launch all 700 workflows, which is fine too; given that, everything should complete within 10 mins, but it does not.
What exactly happens is that when the 700 workflows are submitted, Oozie takes around 5-6 mins to launch all of them (we are OK with this). The overall time to complete the 700 workflows is around 30 mins, which means some workflows that kicked off at 7:00 complete at 7:30. The time taken by the actions themselves stays the same: the shell action still takes 50 secs and the Spark job 2-3 mins. The delay is in starting the shell action and the Spark job, even though Oozie has already taken the workflow into the prep state.
What we checked so far:
- Initially we thought the problem was in Oozie and worked on its configuration.
- Later we suspected YARN and tuned some of its configuration.
- We also created separate queues: one for the shell and launcher jobs, and another for the Spark jobs.
- We have gone through the YARN and Oozie logs too.
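A queue split of this kind is typically set per action in the workflow definition. The following is a minimal sketch, assuming an Oozie 4.x style launcher property and the shell action schema `uri:oozie:shell-action:0.2`; the queue name, script name, and transition targets are hypothetical placeholders rather than the actual setup.

```xml
<action name="shell-step">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Send the launcher (which also runs the shell script) to its own queue.
                 "launchers" is a hypothetical queue name. -->
            <property>
                <name>oozie.launcher.mapred.job.queue.name</name>
                <value>launchers</value>
            </property>
        </configuration>
        <!-- prepare.sh is a hypothetical script shipped with the workflow. -->
        <exec>prepare.sh</exec>
        <file>prepare.sh</file>
    </shell>
    <ok to="spark-step"/>
    <error to="fail"/>
</action>
```

The Spark job itself can be directed to the other queue with spark-submit's `--queue` option inside the Spark action's `<spark-opts>` element.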
Can someone shed some light on this?
In client mode, the Spark driver consumes memory on your master node, which restricts how many launchers Oozie can create. Try running the Spark job in cluster mode. Read more on Spark client vs. cluster mode.
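As a rough illustration of that suggestion, a Spark action set to cluster mode could look like the sketch below. It assumes the Spark action schema `uri:oozie:spark-action:0.1`; the action name, application name, class, jar path, options, and argument are hypothetical placeholders, not values from the question.

```xml
<action name="spark-step">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- Run on YARN with the driver inside the cluster instead of client mode. -->
        <master>yarn</master>
        <mode>cluster</mode>
        <!-- Hypothetical application name, main class, and jar location. -->
        <name>region-job</name>
        <class>com.example.RegionJob</class>
        <jar>${nameNode}/apps/region-job/lib/region-job.jar</jar>
        <!-- Hypothetical sizing; tune to the actual job. -->
        <spark-opts>--executor-memory 2G --num-executors 2</spark-opts>
        <arg>${region}</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

With the mode set to cluster, the driver runs in the Spark application's own YARN ApplicationMaster container, so its memory is accounted for by YARN along with the rest of the job.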