I followed the same steps to set up a Spark standalone cluster and was able to debug the driver, master, worker and executor JVMs.
The master and the worker node are configured on a server-class machine with 12 CPU cores. The source code for Spark 2.2.0 has been cloned from the Spark Git repo.
STEPS:
1] Command to launch the Master JVM:
root@ubuntu:~/spark-2.2.0-bin-hadoop2.7/bin# ./spark-class -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8787 org.apache.spark.deploy.master.Master
The shell script spark-class is used to launch the master manually. The first argument is a JVM option that starts the master in debug mode; with suspend=y the JVM suspends on startup and waits for the IDE to make a remote connection on port 8787.
The screenshots below show the IDE configuration for remote debugging:
2] Command to launch the Worker JVM:
root@ubuntu:~/spark-2.2.0-bin-hadoop2.7/bin# ./spark-class -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8788 org.apache.spark.deploy.worker.Worker spark://10.71.220.34:7077
As with the master, the worker JVM is launched in debug mode; its debug port is 8788. The last argument is the address of the Spark master, and as part of its launch the worker registers with the master.
Screenshot
3] A basic Java app with a main method is compiled and packaged into an uber/fat jar, as explained in the book "Learning Spark"; an uber jar bundles the application together with all of its transitive dependencies.
The jar is built by running mvn package in the following directory:
root@ubuntu:/home/customer/Documents/Texts/Spark/learning-spark-master# mvn package
The above generates a jar under the ./target folder.
The screenshot below shows the Java application that will be submitted to the Spark cluster:
4] Command to submit the application to the standalone cluster:
root@ubuntu:/home/customer/Documents/Texts/Spark/learning-spark-master# /home/customer/spark-2.2.0-bin-hadoop2.7/bin/spark-submit \
  --master spark://10.71.220.34:7077 \
  --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8790" \
  --conf "spark.executor.extraClassPath=/home/customer/Documents/Texts/Spark/learning-spark-master/target/java-0.0.2.jar" \
  --class com.oreilly.learningsparkexamples.java.BasicMapToDouble \
  --name "MapToDouble" \
  ./target/java-0.0.2.jar \
  spark://10.71.220.34:7077

The final spark://10.71.220.34:7077 is an argument to the Java program itself (com.oreilly.learningsparkexamples.java.BasicMapToDouble).
· The above command is run from the client node, which hosts the application with the main method; the transformations, however, are executed on the remote executor JVMs.
· The --conf parameters are important: they configure the executor JVMs, which are launched at runtime by the worker JVMs.
· The first --conf parameter launches the executor JVM in debug mode and suspends it right away, listening on port 8790.
· The second --conf parameter adds the application-specific jars to the executor classpath. On a distributed setup these jars need to be available on the machine running the executor JVM.
· The last argument is used by the client app to connect to the Spark master (see the sketch below).
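For reference, a driver that picks up the master URL from its first program argument looks roughly like the following. This is a minimal sketch, not the exact code of BasicMapToDouble; the class name and the argument handling are assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal driver sketch: connects to the standalone master passed as the
// first program argument, e.g. spark://10.71.220.34:7077.
public class DriverSketch {
  public static void main(String[] args) {
    String master = (args.length > 0) ? args[0] : "local[*]";
    SparkConf conf = new SparkConf().setAppName("MapToDouble").setMaster(master);
    JavaSparkContext sc = new JavaSparkContext(conf);
    // ... create RDDs and apply transformations/actions here ...
    sc.stop();
  }
}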
To understand how the client application connects to the Spark cluster, we need to debug the client app and step through it. For that, we need to configure it to run in debug mode.
To debug the client, we need to edit the script spark-submit as follows:
Contents of spark-submit after the edit:
exec "${SPARK_HOME}"/bin/spark-class -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8789 org.apache.spark.deploy.SparkSubmit "$@"
5] After the client registers, the worker starts an executor at runtime on a separate thread.
The screenshot below shows the class ExecutorRunner.scala.
6] We now connect to the forked executor JVM using the IDE. The executor JVM runs the transformation functions in our submitted application.
JavaDoubleRDD result = rdd.mapToDouble(   // transformation function/lambda
    new DoubleFunction<Integer>() {
      public double call(Integer x) {
        double y = (double) x;
        return y * y;
      }
    });
7] The transformation function runs only when the action "collect" is invoked.
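Continuing from the snippet above, the action presumably looks something like the following; the variable name is an assumption.

// Action: triggers execution of the mapToDouble transformation on the executor.
List<Double> squares = result.collect();
System.out.println(squares);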
8] The screenshot below displays the executor view when the mapToDouble function is invoked in parallel on multiple elements of the list. The executor JVM executes the function in 12 threads because there are 12 cores; since the number of cores was not set on the command line, the worker JVM defaulted to -cores=12.
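The degree of parallelism also depends on how many partitions the RDD has, so if you want fewer concurrent tasks while stepping through the executor, you can cap the partition count when creating the RDD. A sketch, assuming sc is the JavaSparkContext from the driver sketch above:

// 2 partitions -> the stage has 2 tasks, so at most 2 of the 12 cores are used.
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
JavaRDD<Integer> rdd = sc.parallelize(data, 2);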
9] Screenshot showing the client-submitted code [mapToDouble()] running in the remote forked executor JVM.
10] After all the tasks have been executed, the executor JVM exits. Once the client app exits, the worker node is unblocked and waits for the next submission.
References
I have created a blog post that outlines these steps in more detail; hopefully it helps others:
https://sandeepmspark.blogspot.com/2018/01/spark-standalone-cluster-internals.html