Is it possible to run Spark jobs, e.g. Spark SQL jobs, via Oozie?
In the past we have used Oozie with Hadoop. Since we are now using Spark SQL on top of YARN, we are looking for a way to use Oozie to schedule jobs.
Thanks.
Yes, it's possible. The procedure is the same: you have to provide Oozie with a directory structure containing coordinator.xml, workflow.xml and a lib directory containing your jar files.
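The layout described above could be staged locally along these lines. This is only a sketch: the directory name my-spark-app and the HDFS target are assumptions, and the two XML files are created as empty placeholders to be filled in with real definitions.

```shell
#!/bin/sh
# Hypothetical local staging directory for the Oozie application.
APP_DIR="${APP_DIR:-./my-spark-app}"

# lib/ holds the application jar(s).
mkdir -p "$APP_DIR/lib"

# Create placeholder definition files (touch does not overwrite existing content).
touch "$APP_DIR/coordinator.xml" "$APP_DIR/workflow.xml"

# Copy the application jar if it is present at the path used in this answer.
if [ -f /path/to/App.jar ]; then
  cp /path/to/App.jar "$APP_DIR/lib/"
fi

# Show the resulting structure.
find "$APP_DIR" | sort

# Upload to HDFS afterwards (needs a Hadoop client; the target path is an assumption):
# hdfs dfs -put "$APP_DIR" "/user/$USER/"
```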
But remember that Oozie starts the job with the java -cp command, not with spark-submit, so if you have to run it with Oozie, here is a trick.
Run your jar with spark-submit in the background. Look for that process in the process list. It will be running under a java -cp command, but with some additional jars that are added by spark-submit. Add those jars to CLASS_PATH, and that's it. Now you can run your Spark applications through Oozie.
1. nohup spark-submit --class package.to.MainClass /path/to/App.jar &
2. ps aux | grep '/path/to/App.jar'
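To make the classpath easier to read off the process list, a small helper along these lines might help. It is a sketch that assumes the command line contains a -cp (or -classpath) flag followed by a colon-separated list with no spaces in it. The function name extract_classpath and the sample command line are made up for illustration.

```shell
#!/bin/sh
# Print the argument that follows -cp or -classpath in a java command line.
# $1: the full command line, e.g. as printed by ps.
extract_classpath() {
  printf '%s\n' "$1" | awk '{
    for (i = 1; i < NF; i++)
      if ($i == "-cp" || $i == "-classpath") { print $(i + 1); exit }
  }'
}

# Illustrative command line only; the real one comes from your own process list.
sample='java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --class package.to.MainClass /path/to/App.jar'

# One classpath entry per line.
extract_classpath "$sample" | tr ':' '\n'

# Against a live process (the [/] keeps grep from matching itself):
# extract_classpath "$(ps -eo args | grep '[/]path/to/App.jar' | head -n 1)" | tr ':' '\n'
```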
EDITED: You can also use the latest Oozie, which has a Spark action.
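With the Spark action, a workflow is typically submitted with a job.properties file. The sketch below writes one. The host names, ports and application path are placeholders, and enabling oozie.use.system.libpath assumes the Spark sharelib is installed on your Oozie server.

```shell
#!/bin/sh
# Write a minimal job.properties (overwrites any file of that name in the
# current directory). The quoted 'EOF' keeps the shell from expanding ${...},
# which Oozie resolves itself.
cat > job.properties <<'EOF'
# Placeholder cluster endpoints; replace with your own.
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8032
# Let the action pick up jars from the Oozie sharelib.
oozie.use.system.libpath=true
# HDFS directory that contains workflow.xml and lib/ (hypothetical path).
oozie.wf.application.path=${nameNode}/user/${user.name}/my-spark-app
EOF

cat job.properties

# Submit (needs the Oozie client; the server URL is a placeholder):
# oozie job -oozie http://oozie.example.com:11000/oozie -config job.properties -run
```

For a coordinator, oozie.coord.application.path would be used in place of oozie.wf.application.path.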
nohup spark-submit --class ... /path/to/app.jar & but that didn't seem to work. It seems Oozie just quit without doing anything, so no Spark job was submitted. What I am trying to do is let Oozie submit the Spark job and then quit (mark the job as successful and completed), because otherwise it consumes quite a lot of resources (2 cores and 2 GB of RAM as a minimum; I can't find a way to make it go lower). Thanks a lot!
spark-submit is run. You probably don't want to do this for a normal Spark job, but for Spark Streaming it kind of makes sense, because Spark Streaming jobs run 24/7 non-stop. Maybe I shouldn't use Oozie for Spark Streaming...
<exec> command in a daemon thread, so that your command can keep running after the program terminates.
Commented
Aug 16, 2016 at 9:05
spark-submit. I also use the same technique.
Commented
Aug 18, 2016 at 9:24
To run Spark SQL through Oozie you need to use the Oozie Spark action. You can locate oozie.gz on your distribution. On Cloudera, you can usually find the Oozie examples archive at the path below.
$ locate oozie.gz
/usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz
Spark SQL needs the hive-site.xml file for execution, which you need to provide in workflow.xml:
<spark-opts>--files /hive-site.xml</spark-opts>
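For context, a complete workflow.xml with a Spark action might look roughly like the one written by the sketch below. The workflow name, node names, jar location and the yarn-cluster master are assumptions that depend on your cluster and your Oozie and Spark versions. The main class is the placeholder used earlier in this thread. Note that the spark-submit option for shipping files is spelled --files.

```shell
#!/bin/sh
# Write an illustrative workflow.xml (overwrites any file of that name in the
# current directory). The quoted 'EOF' keeps ${...} intact for Oozie.
cat > workflow.xml <<'EOF'
<workflow-app name="spark-sql-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="spark-node"/>
  <action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- Assumption: running on YARN in cluster mode -->
      <master>yarn-cluster</master>
      <name>SparkSqlJob</name>
      <!-- Placeholder main class and hypothetical jar location -->
      <class>package.to.MainClass</class>
      <jar>${nameNode}/user/${wf:user()}/my-spark-app/lib/App.jar</jar>
      <!-- Ship hive-site.xml with the job so Spark SQL can reach the metastore -->
      <spark-opts>--files /hive-site.xml</spark-opts>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
EOF

# Quick sanity check that the option made it into the file.
grep -n 'spark-opts' workflow.xml
```

The oozie-examples archive mentioned above should contain a working Spark example to compare against for your exact Oozie version.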