Troubleshooting Spark Challenges
Part 1: TOP TEN SPARK DIFFICULTIES
“The most difficult thing is finding out why your job is failing, which …”

And Spark works somewhat differently across platforms – on-premises; on cloud-specific platforms such as AWS EMR, Azure HDInsight, and Google Dataproc; and on Databricks.
• Cloud concerns. Resources in the cloud are flexible and “pay as you go” – but as you go, you pay. So the main concern in the cloud is managing costs. (As AWS puts it, “When running big data pipelines on the cloud, operational cost optimization is the name of the game.”) This concern increases because reliability concerns in the cloud can often be addressed by “throwing hardware at the problem” – increasing reliability, but at greater cost.

• On-premises Spark vs Amazon EMR. When moving to Amazon EMR, it’s easy to do a “lift and shift” from on-premises Spark to EMR. This saves time and money on the cloud migration effort, but any inefficiencies in the on-premises environment are reproduced in the cloud, increasing costs. It’s also fully possible to refactor before moving to EMR, just as with Databricks.

• On-premises Spark vs Databricks. When moving to Databricks, most companies take advantage of Databricks’ capabilities, such as ease of starting/shutting down clusters, and do at least some refactoring as part of the cloud migration effort. This costs time and money in the cloud migration effort, but results in lower costs and, potentially, greater reliability for the refactored job in the cloud.

For more on Spark and its use, please see this piece in Infoworld. And for more depth about the problems that arise in creating and running Spark jobs, at both the job level and the cluster level, please see the links below. There is also a good introductory guide.

6. How do I size my nodes, and match them to the right servers/instance types?
7. How do I see what’s going on across the Spark stack and apps?
8. Is my data partitioned correctly for my SQL queries?
9. When do I take advantage of auto-scaling?
10. How do I get insights into jobs that have problems?

For easy access, these challenges are listed below, linked to the appropriate page in this guide:

Job-Level Challenges:
1. Executor and core allocation
2. Memory allocation
3. Data skew/small files
4. Pipeline optimization
5. Finding out whether a job is optimized

Cluster-Level Challenges:
6. Resource allocation
7. Observability
8. Data partitioning vs. SQL queries/inefficiency
9. Use of auto-scaling
10. Troubleshooting

Impacts: Resources for a given job (at the cluster level) or across clusters tend to be significantly under-allocated (causes crashes, hurting business results) or over-allocated (wastes resources and can cause other jobs to crash, both of which hurt business results).
Section 1: Five Job-Level Challenges

These challenges occur at the level of individual jobs. Fixing them can be the responsibility of the developer or data scientist who created the job, or of operations people or data engineers who work on both individual jobs and at the cluster level. However, job-level challenges, taken together, have massive implications for clusters, and for the entire data estate. One of our Unravel Data customers has undertaken a right-sizing program for resource-intensive jobs that has clawed back nearly half the space in their clusters, even though data processing volume and jobs in production have been increasing.
For these challenges, we’ll assume that the cluster your job is running in is relatively well-designed (see next section); that other jobs in the cluster are not resource hogs that will knock your job out of the running; and that you have the tools you need to troubleshoot individual jobs.

1. HOW MANY EXECUTORS AND CORES SHOULD A JOB USE?

One of the key advantages of Spark is parallelization – you run your job’s code against different data partitions in parallel workstreams, as in the diagram below. The number of workstreams that run at once is the number of executors, times the number of cores per executor. So how many executors should your job use, and how many cores per executor – that is, how many workstreams do you want running at once?

You are likely to have your own sensible starting point for your on-premises or cloud platform, the servers or instances available, and experience your team has had with similar workloads. Once your job runs successfully a few times, you can either leave it alone, or optimize it. We recommend that you optimize it, because optimization:

• Helps you save resources and money (not over-allocating)
• Helps prevent crashes, because you right-size the resources (not under-allocating)
• Helps you fix crashes fast, because allocations are roughly correct, and because you understand the job better
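To make the arithmetic concrete, here is a minimal PySpark sketch of one way to pin down such an allocation. The specific numbers (four executors with four cores each, that is, 16 parallel workstreams) and the application name are illustrative assumptions, not recommendations; on YARN or Kubernetes the same settings are often supplied to spark-submit instead.

    from pyspark.sql import SparkSession

    # Assumed starting point: 4 executors x 4 cores = 16 parallel workstreams.
    # Tune these numbers against your own cluster, data volume, and workload.
    spark = (
        SparkSession.builder
        .appName("executor-sizing-sketch")                    # hypothetical app name
        .config("spark.executor.instances", "4")              # number of executors
        .config("spark.executor.cores", "4")                  # cores (concurrent tasks) per executor
        .config("spark.dynamicAllocation.enabled", "false")   # keep sizing fixed for this sketch
        .getOrCreate()
    )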
2. HOW MUCH MEMORY SHOULD I ALLOCATE FOR EACH JOB?

Memory allocation is per executor, and the most you can allocate is the total available in the node. If you’re in the cloud, this is governed by your instance type; on-premises, by your physical server or virtual machine. Some memory is needed for your cluster manager and system resources (16GB may be a typical amount), and the rest is available for jobs.

If you have three executors on a 128GB node, and 16GB is taken up by the cluster manager and system resources, that leaves about 37GB per executor. However, a few GB will be required for executor overhead; the remainder is your per-executor memory. You will want to partition your data so it can be processed efficiently in the available memory.

This is just a starting point, however. You may need to be using a different instance type, or a different number of executors, to make the most efficient use of your node’s resources against the job you’re running. As with the number of executors (see previous section), optimizing your job will help you know whether you are over- or under-allocating memory, reduce the likelihood of crashes, and get you ready for troubleshooting when the need arises.
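As a rough sketch of the worked example above (a 128GB node, roughly 16GB reserved for the cluster manager and system, three executors), the remaining ~37GB per executor is split between heap and off-heap overhead. The exact split used below, 33g of heap plus 4g of overhead, is an assumption for illustration only:

    from pyspark.sql import SparkSession

    # Assumed split of the ~37GB available to each of three executors on a 128GB node:
    # ~33GB of executor heap plus ~4GB of off-heap overhead (shuffle buffers, JVM, etc.).
    spark = (
        SparkSession.builder
        .appName("memory-sizing-sketch")                   # hypothetical app name
        .config("spark.executor.instances", "3")
        .config("spark.executor.memory", "33g")            # heap per executor (assumed value)
        .config("spark.executor.memoryOverhead", "4g")     # off-heap overhead per executor (assumed value)
        .getOrCreate()
    )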
Both data skew and small files incur a meta-problem that’s common across Spark – when a job slows down or crashes, how do you know what the problem was? We will mention this again, but it can be particularly difficult to know this for data-related problems, as an otherwise well-constructed job can have seemingly random slowdowns or halts, caused by hard-to-predict and hard-to-detect inconsistencies across different data sets.

There are some general rules. For instance, a “bad” – inefficient – join can take hours. But it’s very hard to find where your app is spending its time, let alone whether a specific SQL command is taking a long time, and whether it can indeed be optimized.

Spark’s Catalyst optimizer does its best to optimize your queries for you. But when data sizes grow large enough, and processing gets complex enough, you have to help it along if you want your resource usage, costs, and runtimes to stay on the acceptable side.
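One common way to help Catalyst along is to inspect the query plan and, when one side of a join is small, hint a broadcast join so the large table is not shuffled. A minimal sketch, with hypothetical tables and join key:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("plan-inspection-sketch").getOrCreate()

    # Hypothetical stand-ins: a large fact table and a small dimension table.
    orders = spark.createDataFrame(
        [(1, "US", 10.0), (2, "DE", 20.0), (3, "US", 5.0)],
        ["order_id", "country_code", "amount"],
    )
    countries = spark.createDataFrame(
        [("US", "United States"), ("DE", "Germany")],
        ["country_code", "country_name"],
    )

    # Hint that the small table should be broadcast to every executor,
    # so the join avoids shuffling the large table across the cluster.
    joined = orders.join(broadcast(countries), "country_code")

    # Print the parsed, analyzed, optimized, and physical plans to see
    # what Catalyst actually chose for the join.
    joined.explain(True)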
4. HOW DO I OPTIMIZE AT THE PIPELINE LEVEL?

Spark pipelines are made up of dataframes, connected by Transformers (which calculate new data from existing data), and Estimators. Pipelines are widely used for all sorts of processing, including extract, transform, and load (ETL) jobs and machine learning. Spark makes it easy to combine jobs into pipelines, but it does not make it easy to monitor and manage jobs at the pipeline level. So it’s easy for monitoring, managing, and optimizing pipelines to appear as an exponentially more difficult version of optimizing individual Spark jobs.

[Figure: Existing Transformers create new Dataframes, with an Estimator producing the final model. (Source: Spark Pipelines: Elegant Yet Powerful, InsightDataScience.)]
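For reference, here is a minimal PySpark sketch of such a pipeline, with a Transformer (VectorAssembler) feeding an Estimator (LogisticRegression) whose fit() call produces the final model; the column names and toy data are illustrative assumptions:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

    # Hypothetical training data: two feature columns and a numeric label.
    train = spark.createDataFrame(
        [(1.0, 0.5, 1.0), (0.2, 1.5, 0.0), (2.3, 0.1, 1.0)],
        ["f1", "f2", "label"],
    )

    # Transformer: combines the feature columns into a single vector column.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")

    # Estimator: learns a model from the assembled features.
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    # The Pipeline chains the stages; fit() runs each stage in order and
    # returns a PipelineModel, which is itself a Transformer.
    model = Pipeline(stages=[assembler, lr]).fit(train)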
Section 2: Cluster-Level Challenges

Cluster-level challenges are those that arise for a cluster that runs many (perhaps hundreds or thousands) of jobs, in cluster design (how to get the most out of a specific cluster), cluster distribution (how to create a set of clusters that best meets your needs), and allocation across on-premises resources and one or more public, private, or hybrid cloud resources.

The first step toward meeting cluster-level challenges is to meet job-level challenges effectively, as described above. A cluster that’s running unoptimized, poorly understood, slowdown-prone and crash-prone jobs is impossible to optimize. But if your jobs are right-sized, cluster-level challenges become much easier to meet. (Note that Unravel Data, as mentioned in the previous section, helps you find your resource-heavy Spark jobs, and optimize those first. It also does much of the work of troubleshooting and optimization for you.)

Meeting cluster-level challenges for Spark may be a topic better suited for a graduate-level computer science seminar than for this guide, but here are some of the issues that come up, and a few comments on each: