
I have a really simple workflow.

<workflow-app name="testSparkjob" xmlns="uri:oozie:workflow:0.5">
<start to="testJob"/>

  <action name="testJob">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.compress.map.output</name>
                <value>true</value>
            </property>
        </configuration>
        <master>local[*]</master>
        <name>Spark Example</name>
        <jar>mapping.py</jar>
        <spark-opts>--executor-memory 1G --num-executors 3 --executor-cores 1</spark-opts>
        <arg>argument1</arg>
        <arg>argument2</arg>
    </spark>
    <ok to="end"/>
    <error to="killAction"/>
</action>
 <kill name="killAction">
    <message>"Killed job due to error"</message>
</kill>
<end name="end"/>
</workflow-app>

The Spark script does pretty much nothing:

import sys

if len(sys.argv) < 3:
  print('You must pass 2 parameters')
  # just for testing; later this branch will be discarded and sys.exit(1) will be used
  ext = 'testArgA'
  int = 'testArgB'
  #sys.exit(1)
else:
  print('arguments accepted')
  ext = sys.argv[1]
  int = sys.argv[2]

The script is located on HDFS in the same folder as workflow.xml.

When I run the workflow I get the following error:

Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [2]

I thought it was a permission issue, so I set the HDFS folder to chmod 777 and my local folder to chmod 777 as well. I am using Spark 1.6. When I run the script through spark-submit, everything is fine (even much more complicated scripts which read from / write to HDFS or Hive).
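
For reference, the standalone run that works looks roughly like this (an illustrative sketch; the exact spark-submit options and paths depend on the environment):

# Sketch of the standalone invocation (options here are illustrative, not exact)
spark-submit --master "local[*]" --executor-memory 1G \
  mapping.py argument1 argument2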

EDIT: I tried this shell action:

<action name="forceLoadFromLocal2hdfs">
<shell xmlns="uri:oozie:shell-action:0.3">
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <configuration>
    <property>
      <name>mapred.job.queue.name</name>
      <value>${queueName}</value>
    </property>
  </configuration>
  <exec>driver-script.sh</exec>
<!-- single -->
  <argument>s</argument>
<!-- py script -->
  <argument>load_local_2_hdfs.py</argument>
<!-- local file to be moved-->
  <argument>localFilePath</argument>
<!-- hdfs destination folder, be aware of, script is deleting existing folder! -->
  <argument>hdfsPath</argument>
  <file>${workflowRoot}driver-script.sh</file>
  <file>${workflowRoot}load_local_2_hdfs.py</file>
</shell>
<ok to="end"/>
<error to="killAction"/>
</action>

The workflow SUCCEEDED, but the file was not copied to HDFS. No errors. The script does work by itself, though. More here.

3 Answers


Unfortunately, the Oozie Spark action supports only Java artifacts, so you have to specify a main class (which is what that error message is awkwardly trying to explain). So you have two options:

  1. rewrite your code in Java/Scala
  2. use a custom action or a script like this (I did not test it); a sketch of such a wrapper follows below
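
For option 2, a hypothetical minimal wrapper (call it submit_pyspark.sh; the name and the spark-submit options are assumptions for illustration, not taken from the linked script) that an Oozie shell action could exec:

#!/bin/bash
# Hypothetical submit_pyspark.sh for an Oozie shell action (illustration only).
# Usage: submit_pyspark.sh <python-script> [script args...]
PY_SCRIPT="$1"   # Python file shipped to the action via <file>
shift

# Hand the remaining arguments straight to the PySpark job.
spark-submit --master yarn-client \
  --executor-memory 1G --num-executors 3 --executor-cores 1 \
  "$PY_SCRIPT" "$@"
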
  • Running the shell script with the Python script as an argument was my initial idea, but it was a no-go. I hope it will work now :) Thanks a lot for confirming my thoughts. Commented Jul 25, 2017 at 16:23
  • Anyway, it is weird, because in the documentation I found: "The jar element indicates a comma separated list of jars or python files." Commented Jul 25, 2017 at 16:46
Try with getopt:

In your properties file:

fuente=BURO_CONCENTRADO

In your .py code:

import getopt
import sys

try:
    parametros, args = getopt.getopt(sys.argv[1:], "f:i:", ["fuente=", "id="])
    if len(parametros) < 2:
        print("Incomplete arguments")
        sys.exit(1)
except getopt.GetoptError:
    print("Error in the arguments")
    sys.exit(2)

for opt, arg in parametros:
    if opt in ("-f", "--fuente"):
        nom_fuente = str(arg).strip()
    elif opt in ("-i", "--id"):
        id_proceso = str(arg).strip()
    else:
        print("Parameter '" + opt + "' not recognized")

In your workflow:

<jar>${nameNode}/${workflowRoot}/${appDir}/lib/GeneradorDeInsumos.py</jar>
<spark-opts>
    --queue ${queueName}
    --num-executors 40
    --executor-cores 2
    --executor-memory 8g
</spark-opts>
<arg>-f ${fuente}</arg>
<arg>-i ${wf:id()}</arg>
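
For a sanity check outside Oozie, the same script and options can be exercised with a spark-submit call along these lines (a rough sketch; the queue name and workflow id below are placeholders, not real values):

# Hypothetical standalone test of the same script and options
# (queue and workflow id are placeholders)
spark-submit --master yarn-client --queue default \
  --num-executors 40 --executor-cores 2 --executor-memory 8g \
  GeneradorDeInsumos.py -f BURO_CONCENTRADO -i test-workflow-id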

And the output, voilà:

Fuente:BURO_CONCENTRADO
Contexto:<pyspark.context.SparkContext object at 0x7efd80424090>
id_workflow:0062795-190808055847737-oozie-oozi-W

You can use the spark-action to run a Python script, but you must pass the path to the Python API for Spark as an argument. Also, the first line of your file must be:

#!/usr/bin/env python
