
I have a really simple workflow.

<workflow-app name="testSparkjob" xmlns="uri:oozie:workflow:0.5">
<start to="testJob"/>

  <action name="testJob">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.compress.map.output</name>
                <value>true</value>
            </property>
        </configuration>
        <master>local[*]</master>
        <name>Spark Example</name>
        <jar>mapping.py</jar>
        <spark-opts>--executor-memory 1G --num-executors 3 --executor-cores 1</spark-opts>
        <arg>argument1</arg>
        <arg>argument2</arg>
    </spark>
    <ok to="end"/>
    <error to="killAction"/>
</action>
 <kill name="killAction">
    <message>"Killed job due to error"</message>
</kill>
<end name="end"/>
</workflow-app>

The Spark script does pretty much nothing:

import sys

if len(sys.argv) < 3:
  print('You must pass 2 parameters')
  # just for testing; later this branch will be discarded and sys.exit(1) will be used
  ext = 'testArgA'
  int = 'testArgB'
  #sys.exit(1)
else:
  print('arguments accepted')
  ext = sys.argv[1]
  int = sys.argv[2]

The script is located on HDFS in the same folder as workflow.xml.

When I run the workflow I get the following error:

Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [2]

I thought it was a permission issue, so I set the HDFS folder to chmod 777 and my local folder to chmod 777 as well. I am using Spark 1.6. When I run the script through spark-submit, everything is fine (even much more complicated scripts which read from / write to HDFS or Hive).
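
For reference, the standalone run that works looks roughly like this (an illustrative sketch; the exact spark-submit options and paths depend on the environment):

# Sketch of the standalone invocation (options here are illustrative, not exact)
spark-submit --master "local[*]" --executor-memory 1G \
  mapping.py argument1 argument2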

EDIT: I tried this shell action:

<action name="forceLoadFromLocal2hdfs">
<shell xmlns="uri:oozie:shell-action:0.3">
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <configuration>
    <property>
      <name>mapred.job.queue.name</name>
      <value>${queueName}</value>
    </property>
  </configuration>
  <exec>driver-script.sh</exec>
<!-- single -->
  <argument>s</argument>
<!-- py script -->
  <argument>load_local_2_hdfs.py</argument>
<!-- local file to be moved-->
  <argument>localFilePath</argument>
<!-- hdfs destination folder, be aware of, script is deleting existing folder! -->
  <argument>hdfsPath</argument>
  <file>${workflowRoot}driver-script.sh</file>
  <file>${workflowRoot}load_local_2_hdfs.py</file>
</shell>
<ok to="end"/>
<error to="killAction"/>
</action>

The workflow SUCCEEDED, but the file was not copied to HDFS. No errors. The script does work by itself, though. More here.

3 Answers


Unfortunately, the Oozie Spark action supports only Java artifacts, so you have to specify a main class (which is what that error message is awkwardly trying to explain). So you have two options:

  1. rewrite your code in Java/Scala
  2. use a custom action or a script like this (I did not test it); a sketch of such a wrapper follows below
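
For option 2, a hypothetical minimal wrapper (call it submit_pyspark.sh; the name and the spark-submit options are assumptions for illustration, not taken from the linked script) that an Oozie shell action could exec:

#!/bin/bash
# Hypothetical submit_pyspark.sh for an Oozie shell action (illustration only).
# Usage: submit_pyspark.sh <python-script> [script args...]
PY_SCRIPT="$1"   # Python file shipped to the action via <file>
shift

# Hand the remaining arguments straight to the PySpark job.
spark-submit --master yarn-client \
  --executor-memory 1G --num-executors 3 --executor-cores 1 \
  "$PY_SCRIPT" "$@"
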
  • Running the shell script with the Python script as an argument was my initial idea, but it was a no-go. I hope it will work now :) Thanks a lot for confirming my thoughts. Commented Jul 25, 2017 at 16:23
  • Anyway, it is weird, because in the documentation I found: "The jar element indicates a comma separated list of jars or python files." Commented Jul 25, 2017 at 16:46
Try with getopt:

In your properties file:

fuente=BURO_CONCENTRADO

In your .py code:

import getopt
import sys

try:
    parametros, args = getopt.getopt(sys.argv[1:], "f:i:", ["fuente=", "id="])
    if len(parametros) < 2:
        print("Incomplete arguments")
        sys.exit(1)
except getopt.GetoptError:
    print("Error in the arguments")
    sys.exit(2)

for opt, arg in parametros:
    if opt in ("-f", "--fuente"):
        nom_fuente = str(arg).strip()
    elif opt in ("-i", "--id"):
        id_proceso = str(arg).strip()
    else:
        print("Parameter '" + opt + "' not recognized")

In your workflow:

<jar>${nameNode}/${workflowRoot}/${appDir}/lib/GeneradorDeInsumos.py</jar>
<spark-opts>
    --queue ${queueName}
    --num-executors 40
    --executor-cores 2
    --executor-memory 8g
</spark-opts>
<arg>-f ${fuente}</arg>
<arg>-i ${wf:id()}</arg>
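
For a sanity check outside Oozie, the same script and options can be exercised with a spark-submit call along these lines (a rough sketch; the queue name and workflow id below are placeholders, not real values):

# Hypothetical standalone test of the same script and options
# (queue and workflow id are placeholders)
spark-submit --master yarn-client --queue default \
  --num-executors 40 --executor-cores 2 --executor-memory 8g \
  GeneradorDeInsumos.py -f BURO_CONCENTRADO -i test-workflow-id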

And the output, voilà:

Fuente:BURO_CONCENTRADO
Contexto:<pyspark.context.SparkContext object at 0x7efd80424090>
id_workflow:0062795-190808055847737-oozie-oozi-W

You can use the spark-action to run a Python script, but you must pass the path to the Python API for Spark as an argument. Also, the first line of your file must be:

#!/usr/bin/env python
