
Is it possible to run Spark jobs, e.g. Spark SQL jobs, via Oozie?

In the past we have used Oozie with Hadoop. Since we are now using Spark SQL on top of YARN, we are looking for a way to use Oozie to schedule jobs.

Thanks.

2 Answers


Yup, it's possible. The procedure is the same as usual: you provide Oozie with a directory containing coordinator.xml, workflow.xml, and a lib directory holding your jar files.
But remember that Oozie starts the job with a java -cp command, not with spark-submit, so if you have to run it through Oozie, here is a trick.
Run your jar with spark-submit in the background, then look for that process in the process list. It will be running under a java -cp command, but with some additional jars that were added by spark-submit. Add those jars to the CLASS_PATH, and that's it. Now you can run your Spark applications through Oozie.
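The layout sketched below is an assumption based on the standard Oozie application structure; file names other than coordinator.xml, workflow.xml, and lib/ are illustrative:

```
myapp/                  <- application directory, uploaded to HDFS
├── coordinator.xml     <- schedule definition
├── workflow.xml        <- workflow (action) definition
└── lib/
    └── App.jar         <- your application jar(s)
```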

1.  nohup spark-submit --class package.to.MainClass /path/to/App.jar &
2.  ps aux | grep '/path/to/App.jar'
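As a rough sketch of step 2, the classpath can be pulled out of the ps output with sed. The ps line below is a made-up example of what a spark-submit-launched JVM might look like (the actual paths on your system will differ), so adapt the pattern to your real output:

```shell
# Hypothetical 'ps' output line for a JVM launched by spark-submit (illustrative only).
ps_line='java -cp /opt/spark/conf:/opt/spark/jars/*:/etc/hadoop/conf org.apache.spark.deploy.SparkSubmit --class package.to.MainClass /path/to/App.jar'

# Extract the value passed to -cp; these are the extra jars to add to CLASS_PATH.
classpath=$(printf '%s\n' "$ps_line" | sed -n 's/.*-cp \([^ ]*\).*/\1/p')
echo "$classpath"
```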

EDIT: You can also use the latest Oozie, which includes a Spark action.
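A minimal sketch of a workflow.xml using that Spark action, assuming an Oozie version with spark-action support; the master, names, and paths are placeholders, not values from the answer:

```xml
<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>MySparkApp</name>
            <class>package.to.MainClass</class>
            <jar>${nameNode}/path/to/App.jar</jar>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```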

  • Could you please give an example of how you triggered this from the Oozie workflow? What I tried was using the <exec> command in Oozie, and then in the shell script itself I ran nohup spark-submit --class ... /path/to/app.jar &, but that didn't seem to work. It seems Oozie did nothing but quit, so no Spark job was submitted. What I am trying to do is let Oozie submit the Spark job and then quit (marking the job as success and completed), because it consumes quite a lot of resources otherwise (2 cores and 2 GB of RAM at a minimum; I can't find a way to make it go lower). Thanks a lot!
    – RHE
    Commented Aug 15, 2016 at 16:41
  • 1
    I didn't get what you are actually trying to do; can you please elaborate?
    – Zia Kiyani
    Commented Aug 15, 2016 at 17:09
    Hi Zia, thanks for the reply. When you run a Spark job using Oozie, let's say the Spark job takes 20 minutes to finish, usually the Oozie job will finish after the Spark job finishes, in other words after 20 minutes. What I would like to do is finish the Oozie process early (i.e. by running the Spark job in the background using nohup or disown) immediately after spark-submit is run. You probably don't want to do this for a normal Spark job, but for Spark Streaming it kind of makes sense, because Spark Streaming jobs run 24/7 non-stop. Maybe I shouldn't use Oozie for Spark Streaming...
    – RHE
    Commented Aug 15, 2016 at 22:10
  • Oh, I get it now. But for Spark Streaming, why do you want to use Oozie? Spark Streaming runs continuously, whereas Oozie is used where we have to schedule jobs at intervals. Anyway, if you still want this, the best option is to run the command from your code. But for that, you have to run the <exec> command in a daemon thread, so that your command can keep running after the program terminates.
    – Zia Kiyani
    Commented Aug 16, 2016 at 9:05
  • 1
    Good. Yes, this is the perfect approach: schedule batch jobs through Oozie and run streaming jobs with spark-submit. I use the same technique.
    – Zia Kiyani
    Commented Aug 18, 2016 at 9:24

To run Spark SQL via Oozie you need to use the Oozie Spark action. You can locate the Oozie examples archive on your distribution; on Cloudera it is usually found at the path below:

]$ locate oozie.gz
/usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz

Spark SQL needs the hive-site.xml file for execution, which you need to provide in workflow.xml:

<spark-opts>--files /hive-site.xml</spark-opts>
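In context, that option sits inside the Spark action element of workflow.xml. The sketch below is a hedged example; everything except the <spark-opts> line is a placeholder, not something stated in the answer:

```xml
<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>yarn-cluster</master>
    <name>SparkSqlApp</name>
    <class>package.to.MainClass</class>
    <jar>${nameNode}/path/to/App.jar</jar>
    <!-- Ship hive-site.xml to the executors so Spark SQL can reach the metastore. -->
    <spark-opts>--files /hive-site.xml</spark-opts>
</spark>
```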
