Troubleshooting Spark Challenges


Part 1: TOP TEN SPARK DIFFICULTIES
“The most difficult thing is finding out why your job is failing, which parameters to change. Most of the time, it’s OOM errors...”
– Jagat Singh, Quora

Spark has become one of the most important tools for processing data – especially non-relational data – and deriving value from it. And Spark serves as a platform for the creation and delivery of analytics, AI, and machine learning applications, among others. But troubleshooting Spark applications is hard – and we’re here to help.

In this guide, we’ll describe ten challenges that arise frequently in troubleshooting Spark applications. We’ll start with issues at the job level, encountered by most people on the data team – operations people/administrators, data engineers, and data scientists, as well as analysts. Then, we’ll look at problems that apply across a cluster. These problems are usually handled by operations people/administrators and data engineers.

For more on Spark and its use, please see this piece in Infoworld. And for more depth about the problems that arise in creating and running Spark jobs, at both the job level and the cluster level, please see the links below. There is also a good introductory guide.

Five Reasons Why Troubleshooting Spark Applications Is Hard

Some of the things that make Spark great also make it hard to troubleshoot. Here are some key Spark features, and some of the issues that arise in relation to them:

1. Memory-resident. Spark gets much of its speed and power by using memory, rather than disk, for interim storage of source data and results. However, this can cost a lot of resources and money, which is especially visible in the cloud. It can also make it easy for jobs to crash due to lack of sufficient available memory. And it makes problems hard to diagnose – only traces written to disk survive after crashes.

2. Parallel processing. Spark takes your job and applies it, in parallel, to all the data partitions assigned to your job. (You specify the data partitions, another tough and important decision.) But when a processing workstream runs into trouble, it can be hard to find and understand the problem among the multiple workstreams running at once.

3. Variants. Spark is open source, so it can be tweaked and revised in innumerable ways. There are major differences among the Spark 1 series, Spark 2.x, and the newer Spark 3. And Spark works somewhat differently across platforms – on-premises; on cloud-specific platforms such as AWS EMR, Azure HDInsight, and Google Dataproc; and on Databricks, which is available across the major public clouds. Each variant offers some of its own challenges, and a somewhat different set of tools for solving them.

4. Configuration options. Spark has hundreds of configuration options. And Spark interacts with the hardware and software environment it’s running in, each component of which has its own configuration options. Getting one or two critical settings right is hard; when several related settings have to be correct, guesswork becomes the norm, and over-allocation of resources, especially memory and CPUs (see below), becomes the safe strategy.

5. Trial and error approach. With so many configuration options, how to optimize? Well, if a job currently takes six hours, you can change one, or a few, options, and run it again. That takes six hours, plus or minus. Repeat this three or four times, and it’s the end of the week. You may have improved the configuration, but you probably won’t have exhausted the possibilities as to what the best settings are.
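
Given how many options there are, and how expensive trial-and-error runs can be, it helps to at least record exactly which settings a job actually ran with. A minimal PySpark sketch (assuming an existing session) that dumps the resolved configuration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("config-audit-sketch").getOrCreate()

    # Print every configuration value Spark has resolved for this session,
    # so trial-and-error runs at least start from a known, recorded baseline.
    for key, value in sorted(spark.sparkContext.getConf().getAll()):
        print(key, "=", value)
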
The Spark application is the Driver Process, and the job is split up across executors. (Source: Apache Spark for the Impatient on DZone.)

Three Issues with Spark Jobs, On-Premises and in the Cloud

Spark jobs can require troubleshooting against three main kinds of issues:

• Failure. Spark jobs can simply fail. Sometimes a job will fail on one try, then work again after a restart. Just finding out that the job failed can be hard; finding out why can be harder. (Since the job is memory-resident, failure makes the evidence disappear.)

• Poor performance. A Spark job can run slower than you would like it to; slower than an external service level agreement (SLA); or slower than it would do if it were optimized. It’s very hard to know how long a job “should” take, or where to start in optimizing a job or a cluster.

• Excessive cost or resource use. The resource use or, especially in the cloud, the hard dollar cost of a job may raise concern. As with performance, it’s hard to know how much the resource use and cost “should” be, until you put work into optimizing and see where you’ve gotten to.

All of these concerns are accompanied by a distinct lack of needed information. Companies often make crucial decisions – on-premises vs. cloud, EMR vs. Databricks, “lift and shift” vs. refactoring – with only guesses available as to what different options will cost in time, resources, and money.

All of the issues and challenges described here apply to Spark across all platforms, whether it’s running on-premises, in Amazon EMR, or on Databricks (across AWS, Azure, or GCP). However, there are a few subtle differences:

• Move to cloud. There is a big movement of big data workloads from on-premises (largely running Spark on Hadoop) to the cloud (largely running Spark on Amazon EMR or Databricks). Moving to cloud provides greater flexibility and faster time to market, as well as access to built-in services found on each platform.

• Move to on-premises. There is a small movement of workloads from the cloud back to on-premises environments. When a cloud workload “settles down,” such that flexibility is less important, it may become significantly cheaper to run it on-premises instead.

• On-premises concerns. Resources (and costs) on-premises tend to be relatively fixed; there can be a lead time of months to years to significantly expand on-premises resources. So the main concern on-premises is maximizing the existing estate: making more jobs run in existing resources, and getting jobs to complete reliably and on time, to maximize the pay-off from the existing estate.

• Cloud concerns. Resources in the cloud are flexible and “pay as you go” – but as you go, you pay. So the main concern in the cloud is managing costs. (As AWS puts it, “When running big data pipelines on the cloud, operational cost optimization is the name of the game.”) This concern increases because reliability concerns in the cloud can often be addressed by “throwing hardware at the problem” – increasing reliability, but at greater cost.

• On-premises Spark vs. Amazon EMR. When moving to Amazon EMR, it’s easy to do a “lift and shift” from on-premises Spark to EMR. This saves time and money on the cloud migration effort, but any inefficiencies in the on-premises environment are reproduced in the cloud, increasing costs. It’s also fully possible to refactor before moving to EMR, just as with Databricks.

• On-premises Spark vs. Databricks. When moving to Databricks, most companies take advantage of Databricks’ capabilities, such as ease of starting/shutting down clusters, and do at least some refactoring as part of the cloud migration effort. This costs time and money in the cloud migration effort, but results in lower costs and, potentially, greater reliability for the refactored job in the cloud.

Ten Spark Challenges

Many Spark challenges relate to configuration, including the number of executors to assign, memory usage (at the driver level, and per executor), and what kind of hardware/machine instances to use. You make configuration choices per job, and also for the overall cluster in which jobs run, and these are interdependent – so things get complicated, fast.

Some challenges occur at the job level; these challenges are shared right across the data team. They include:

1. How many executors should each job use?
2. How much memory should I allocate for each job?
3. How do I find and eliminate data skew?
4. How do I make my pipelines work better?
5. How do I know if a specific job is optimized?

Other challenges come up at the cluster level, or even at the stack level, as you decide what jobs to run on what clusters. These problems tend to be the remit of operations people and data engineers. They include:

6. How do I size my nodes, and match them to the right servers/instance types?
7. How do I see what’s going on across the Spark stack and apps?
8. Is my data partitioned correctly for my SQL queries?
9. When do I take advantage of auto-scaling?
10. How do I get insights into jobs that have problems?

For easy access, these challenges are listed below, linked to the appropriate page in this guide:

Job-Level Challenges:
1. Executor and core allocation
2. Memory allocation
3. Data skew/small files
4. Pipeline optimization
5. Finding out whether a job is optimized

Cluster-Level Challenges:
6. Resource allocation
7. Observability
8. Data partitioning vs. SQL queries/inefficiency
9. Use of auto-scaling
10. Troubleshooting

Impacts: Resources for a given job (at the cluster level) or across clusters tend to be significantly under-allocated (causes crashes, hurting business results) or over-allocated (wastes resources and can cause other jobs to crash, both of which hurt business results).
Section 1: Five Job-Level Challenges

These challenges occur at the level of individual jobs. Fixing them can be the responsibility of the developer or data scientist who created the job, or of operations people or data engineers who work on both individual jobs and at the cluster level.

However, job-level challenges, taken together, have massive implications for clusters, and for the entire data estate. One of our Unravel Data customers has undertaken a right-sizing program for resource-intensive jobs that has clawed back nearly half the space in their clusters, even though data processing volume and jobs in production have been increasing.

For these challenges, we’ll assume that the cluster your job is running in is relatively well-designed (see next section); that other jobs in the cluster are not resource hogs that will knock your job out of the running; and that you have the tools you need to troubleshoot individual jobs.

1. HOW MANY EXECUTORS AND CORES SHOULD A JOB USE?

One of the key advantages of Spark is parallelization – you run your job’s code against different data partitions in parallel workstreams, as in the diagram below. The number of workstreams that run at once is the number of executors, times the number of cores per executor. So how many executors should your job use, and how many cores per executor – that is, how many workstreams do you want running at once?

A Spark job using three cores to parallelize output. Up to three tasks run simultaneously, and seven tasks are completed in a fixed period of time. (Source: Lisa Hua, Spark Overview, Slideshare.)

You want high usage of cores, high usage of memory per core, and data partitioning appropriate to the job. (Usually, partitioning on the field or fields you’re querying on.) This beginner’s guide for Hadoop suggests two to three cores per executor, but not more than five; this expert’s guide to Spark tuning on AWS suggests that you use three executors per node, with five cores per executor, as your starting point for all jobs.

You are likely to have your own sensible starting point for your on-premises or cloud platform, the servers or instances available, and experience your team has had with similar workloads. Once your job runs successfully a few times, you can either leave it alone, or optimize it. We recommend that you optimize it, because optimization:

• Helps you save resources and money (not over-allocating)
• Helps prevent crashes, because you right-size the resources (not under-allocating)
• Helps you fix crashes fast, because allocations are roughly correct, and because you understand the job better
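
To make the executors-times-cores arithmetic concrete, here is a hedged PySpark sketch. The values are placeholders in the spirit of the guidance above, not recommendations, and spark.executor.instances only takes effect under a cluster manager such as YARN or Kubernetes.

    from pyspark.sql import SparkSession

    # Placeholder sizing: e.g., two nodes with three executors each,
    # five cores per executor. Adjust for your own cluster and workload.
    spark = (
        SparkSession.builder
        .appName("executor-sizing-sketch")
        .config("spark.executor.instances", "6")
        .config("spark.executor.cores", "5")
        .getOrCreate()
    )

    # Parallel workstreams = executors x cores per executor.
    executors, cores_per_executor = 6, 5
    print("Tasks that can run at once:", executors * cores_per_executor)
    print("Parallelism Spark reports:", spark.sparkContext.defaultParallelism)
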

2. HOW MUCH MEMORY SHOULD I ALLOCATE FOR EACH JOB?

Memory allocation is per executor, and the most you can allocate is the total available in the node. If you’re in the cloud, this is governed by your instance type; on-premises, by your physical server or virtual machine. Some memory is needed for your cluster manager and system resources (16GB may be a typical amount), and the rest is available for jobs.

If you have three executors in a 128GB cluster, and 16GB is taken up by the cluster, that leaves roughly 37GB per executor. However, a few GB will be required for executor overhead; the remainder is your per-executor memory. You will want to partition your data so it can be processed efficiently in the available memory.

This is just a starting point, however. You may need to be using a different instance type, or a different number of executors, to make the most efficient use of your node’s resources against the job you’re running. As with the number of executors (see previous section), optimizing your job will help you know whether you are over- or under-allocating memory, reduce the likelihood of crashes, and get you ready for troubleshooting when the need arises.

For more on memory management, see this widely read article, Spark Memory Management, by our own Rishitesh Mishra.
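
The arithmetic above can be sketched directly. The figures below are the example’s, and the overhead fraction is an assumption (spark.executor.memoryOverhead defaults to roughly 10% of executor memory on YARN and Kubernetes), so treat this as a back-of-the-envelope check rather than a sizing rule.

    # Back-of-the-envelope memory split for the example above.
    node_memory_gb = 128
    system_and_manager_gb = 16        # cluster manager + system reserve
    executors_per_node = 3

    per_executor_gb = (node_memory_gb - system_and_manager_gb) / executors_per_node
    print(f"Raw memory per executor: {per_executor_gb:.1f} GB")   # about 37 GB

    # Leave room for off-heap executor overhead (assumed ~10% here).
    overhead_fraction = 0.10
    heap_gb = per_executor_gb / (1 + overhead_fraction)
    print(f"Heap to request per executor: about {heap_gb:.0f} GB")
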

3. HOW DO I HANDLE DATA SKEW AND SMALL FILES?

Data skew and small files are complementary problems. Data skew tends to describe large files – where one key value, or a few, have a large share of the total data associated with them. This can force Spark, as it’s processing the data, to move data around in the cluster, which can slow down your task, cause low utilization of CPU capacity, and cause out-of-memory errors which abort your job. Several techniques for handling very large files which appear as a result of data skew are given in the popular article, Data Skew and Garbage Collection, by Rishitesh Mishra of Unravel.

Small files are partly the other end of data skew – a share of partitions will tend to be small. And Spark, since it is a parallel processing system, may generate many small files from parallel processes. Also, some processes you use, such as file compression, may cause a large number of small files to appear, causing inefficiencies. You may need to reduce parallelism (undercutting one of the advantages of Spark), repartition (an expensive operation you should minimize), or start adjusting your parameters, your data, or both (see details).

Both data skew and small files incur a meta-problem that’s common across Spark – when a job slows down or crashes, how do you know what the problem was? We will mention this again, but it can be particularly difficult to know this for data-related problems, as an otherwise well-constructed job can have seemingly random slowdowns or halts, caused by hard-to-predict and hard-to-detect inconsistencies across different data sets.
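
One quick way to see whether skew is even present is to count rows per key, and compacting output into fewer, larger files is a common small-files mitigation. The sketch below is illustrative only: the paths and column name are hypothetical, and the right fix depends on your data.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("skew-check-sketch").getOrCreate()

    # Hypothetical input and key column; substitute your own.
    df = spark.read.parquet("/data/events")

    # A heavily lopsided rows-per-key distribution is the classic
    # signature of skew on a join or groupBy key.
    df.groupBy("customer_id").count().orderBy(F.desc("count")).show(20)

    # Compact output into fewer, larger partitions before writing
    # (coalesce avoids a full shuffle, unlike repartition).
    df.coalesce(64).write.mode("overwrite").parquet("/data/events_compacted")
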
4. HOW DO I OPTIMIZE AT THE PIPELINE LEVEL?

Spark pipelines are made up of dataframes, connected by Transformers (which calculate new data from existing data) and Estimators. Pipelines are widely used for all sorts of processing, including extract, transform, and load (ETL) jobs and machine learning. Spark makes it easy to combine jobs into pipelines, but it does not make it easy to monitor and manage jobs at the pipeline level. So it’s easy for monitoring, managing, and optimizing pipelines to appear as an exponentially more difficult version of optimizing individual Spark jobs.

Existing Transformers create new Dataframes, with an Estimator producing the final model. (Source: Spark Pipelines: Elegant Yet Powerful, InsightDataScience.)

Many pipeline components are “tried and trusted” individually, and are thereby less likely to cause problems than new components you create yourself. However, interactions between pipeline steps can cause novel problems.

Just as job issues roll up to the cluster level, they also roll up to the pipeline level. Pipelines are increasingly the unit of work for DataOps, but it takes truly deep knowledge of your jobs and your cluster(s) for you to work effectively at the pipeline level. This article, which tackles the issues involved in some depth, describes pipeline debugging as an “art.”
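
For readers who have not built one, here is a minimal Spark ML pipeline sketch: a Transformer feeding an Estimator, with fit() producing the final model as in the figure above. The tiny inline dataset is invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

    # Invented training data: two feature columns and a label.
    train = spark.createDataFrame(
        [(1.0, 2.0, 3.5), (2.0, 1.0, 3.0), (3.0, 4.0, 7.5)],
        ["x1", "x2", "label"],
    )

    # A Transformer (VectorAssembler) feeds an Estimator (LinearRegression);
    # fitting the Pipeline produces the final model.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(train)

    model.transform(train).select("features", "label", "prediction").show()
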

5. HOW DO I KNOW IF A SPECIFIC JOB IS OPTIMIZED?

Neither Spark nor, for that matter, SQL is designed for ease of optimization. Spark comes with a monitoring and management interface, Spark UI, which can help. But Spark UI can be challenging to use, especially for the types of comparisons – over time, across jobs, and across a large, busy cluster – that you need to really optimize a job. And there is no “SQL UI” that specifically tells you how to optimize your SQL queries.

There are some general rules. For instance, a “bad” – inefficient – join can take hours. But it’s very hard to find where your app is spending its time, let alone whether a specific SQL command is taking a long time, and whether it can indeed be optimized.

Spark’s Catalyst optimizer does its best to optimize your queries for you. But when data sizes grow large enough, and processing gets complex enough, you have to help it along if you want your resource usage, costs, and runtimes to stay on the acceptable side.
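
Short of a “SQL UI,” the closest built-in starting point is to ask Catalyst for its plan. A hedged sketch (Spark 3.x; the table paths and join key are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("explain-sketch").getOrCreate()

    # Hypothetical tables; substitute your own.
    orders = spark.read.parquet("/data/orders")
    customers = spark.read.parquet("/data/customers")

    joined = orders.join(customers, "customer_id").groupBy("country").count()

    # Catalyst's physical plan: look for expensive exchanges (shuffles),
    # whether filters are pushed down, and which join strategy was chosen.
    joined.explain(mode="formatted")
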
Section 2: Cluster-Level Challenges

Cluster-level challenges are those that arise for a cluster that runs many (perhaps hundreds or thousands) of jobs, in cluster design (how to get the most out of a specific cluster), cluster distribution (how to create a set of clusters that best meets your needs), and allocation across on-premises resources and one or more public, private, or hybrid cloud resources.

The first step toward meeting cluster-level challenges is to meet job-level challenges effectively, as described above. A cluster that’s running unoptimized, poorly understood, slowdown-prone and crash-prone jobs is impossible to optimize. But if your jobs are right-sized, cluster-level challenges become much easier to meet. (Note that Unravel Data, as mentioned in the previous section, helps you find your resource-heavy Spark jobs, and optimize those first. It also does much of the work of troubleshooting and optimization for you.)

Meeting cluster-level challenges for Spark may be a topic better suited for a graduate-level computer science seminar than for this guide, but here are some of the issues that come up, and a few comments on each:

6. ARE NODES MATCHED UP TO SERVERS OR CLOUD INSTANCES?

A Spark node – a physical server or a cloud instance – will have an allocation of CPUs and physical memory. (The whole point of Spark is to run things in actual memory, so this is crucial.) You have to fit your executors and memory allocations into nodes that are carefully matched to existing resources, on-premises or in the cloud. (You can allocate more or fewer Spark cores than there are available CPUs, but matching them makes things more predictable, uses resources better, and may make troubleshooting easier.)

On-premises, poor matching between nodes, physical servers, executors, and memory results in inefficiencies, but these may not be very visible; as long as the total physical resource is sufficient for the jobs running, there’s no obvious problem. However, issues like this can cause datacenters to be very poorly utilized, meaning there’s big overspending going on – it’s just not noticed. (Ironically, the impending prospect of cloud migration may cause an organization to freeze on-premises spending, shining a spotlight on costs and efficiency.)

In the cloud, “pay as you go” pricing shines a different type of spotlight on efficient use of resources – inefficiency shows up in each month’s bill. You need to match nodes, cloud instances, and job CPU and memory allocations very closely indeed, or incur what might amount to massive overspending. This article gives you some guidelines for running Apache Spark cost-effectively on AWS EC2 instances, and is worth a read even if you’re running on-premises, or on a different cloud provider.

You still have big problems here. In the cloud, with costs both visible and variable, cost allocation is a big issue. It’s hard to know who’s spending what, let alone what the business results that go with each unit of spending are. But tuning workloads against server resources and/or instances is the first step in gaining control of your spending, across all your data estates.
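
A rough fitting exercise for a single node looks something like the sketch below. Every figure is an assumption; the point is only that executors, cores, and memory should divide the node’s resources cleanly, with headroom reserved.

    # Illustrative node-fitting arithmetic; all figures are assumptions.
    node_cores = 16
    node_memory_gb = 128
    reserve_cores = 1            # headroom for OS and daemons
    reserve_memory_gb = 16

    executors_per_node = 3       # starting point discussed earlier
    cores_per_executor = (node_cores - reserve_cores) // executors_per_node
    memory_per_executor_gb = (node_memory_gb - reserve_memory_gb) // executors_per_node

    print(f"{executors_per_node} executors x {cores_per_executor} cores, "
          f"{memory_per_executor_gb} GB each")
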
7. HOW DO I SEE WHAT’S GOING ON IN MY CLUSTER?

“Spark is notoriously difficult to tune and maintain,” according to an article in The New Stack. Clusters need to be “expertly managed” to perform well, or all the good characteristics of Spark can come crashing down in a heap of frustration and high costs. (In people’s time and in business losses, as well as direct, hard dollar costs.)

Key Spark advantages include accessibility to a wide range of users and the ability to run in memory. But the most popular tool for Spark monitoring and management, Spark UI, doesn’t really help much at the cluster level. You can’t, for instance, easily tell which jobs consume the most resources over time. So it’s hard to know where to focus your optimization efforts. And Spark UI doesn’t support more advanced functionality – such as comparing the current job run to previous runs, issuing warnings, or making recommendations, for example.

Logs on cloud clusters are lost when a cluster is terminated, so problems that occur in short-running clusters can be that much harder to debug. More generally, managing log files is itself a big data management and data accessibility issue, making debugging and governance harder. This occurs in both on-premises and cloud environments. And, when workloads are moved to the cloud, you no longer have a fixed-cost data estate, nor the “tribal knowledge” accrued from years of running a gradually changing set of workloads on-premises. Instead, you have new technologies and pay-as-you-go billing. So cluster-level management, hard as it is, becomes critical.
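
One partial mitigation for disappearing logs is to persist Spark event logs to durable storage, so a history server can replay the UI after the cluster is gone. A minimal sketch; the bucket path is hypothetical, and managed cloud platforms often wire this up for you:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("event-log-sketch")
        # Write event logs somewhere that outlives the cluster (path is
        # hypothetical), so a Spark History Server can reconstruct the UI later.
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "s3a://my-bucket/spark-events")
        .getOrCreate()
    )
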
8. IS MY DATA PARTITIONED CORRECTLY FOR MY SQL QUERIES? (AND OTHER INEFFICIENCIES)

Operators can get quite upset, and rightly so, over “bad” or “rogue” queries that can cost way more, in resources or cost, than they need to. One colleague describes a team he worked on that went through more than $100,000 of cloud costs in a weekend of crash-testing a new application – a discovery made after the fact. (But before the job was put into production, where it would have really run up some bills.)

SQL is not designed to tell you how much a query is likely to cost, and more elegant-looking SQL queries (i.e., fewer statements) may well be more expensive. The same is true of all kinds of code you have running. So you have to do some or all of three things:

• Learn something about SQL, and about coding languages you use, especially how they work at runtime
• Understand how to optimize your code and partition your data for good price/performance
• Experiment with your app to understand where the resource use/cost “hot spots” are, and reduce them where possible

All this fits in the “optimize” recommendations from challenges 1 and 2 above. We’ll talk more about how to carry out optimization in Part 2 of this guide.
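
As a concrete illustration of partitioning data for the queries you actually run, here is a hedged sketch. The paths and column names are hypothetical, and the right partition column is whatever your queries usually filter on.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

    # Hypothetical events table; substitute your own data and columns.
    events = spark.read.parquet("/data/raw_events")

    # Partition by a column your queries filter on, so Spark can prune
    # partitions instead of scanning everything.
    events.write.partitionBy("event_date").mode("overwrite").parquet("/data/events_by_date")

    # This query should touch only the matching date partition.
    daily = (spark.read.parquet("/data/events_by_date")
             .filter(F.col("event_date") == "2021-06-01")
             .count())
    print(daily)
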

9. WHEN DO I TAKE ADVANTAGE OF AUTO-SCALING?

The ability to auto-scale – to assign resources to a job just while it’s running, or to increase resources smoothly to meet processing peaks – is one of the most enticing features of the cloud. It’s also one of the most dangerous; there is no practical limit to how much you can spend. You need some form of guardrails, and some form of alerting, to remove the risk of truly gigantic bills.

The need for auto-scaling might, for instance, determine whether you move a given workload to the cloud, or leave it running, unchanged, in your on-premises data center. But to help an application benefit from auto-scaling, you have to profile it, then cause resources to be allocated and de-allocated to match the peaks and valleys. And you have some calculations to make, because cloud providers charge you more for spot resources – those you grab and let go of, as needed – than for persistent resources that you keep running for a long time. Spot resources may cost two or three times as much as dedicated ones.

The first step, as you might have guessed, is to optimize your application, as in the previous sections. Auto-scaling is a price/performance optimization, and a potentially resource-intensive one. You should do other optimizations first.

Then profile your optimized application. You need to calculate ongoing and peak memory and processor usage, figure out how long you need each, and the resource needs and cost for each state. And then decide whether it’s worth auto-scaling the job, whenever it runs, and how to do that. You may also need to find quiet times on a cluster to run some jobs, so the job’s peaks don’t overwhelm the cluster’s resources.

To help, Databricks has two types of clusters, and the second type works well with auto-scaling. Most jobs start out in an interactive cluster, which is like an on-premises cluster; multiple people use a set of shared resources. It is, by definition, very difficult to avoid seriously underusing the capacity of an interactive cluster.

So you are meant to move each of your repeated, resource-intensive, and well-understood jobs off to its own, dedicated, job-specific cluster. A job-specific cluster spins up, runs its job, and spins down. This is a form of auto-scaling already, and you can also scale the cluster’s resources to match job peaks, if appropriate. But note that you want your application profiled and optimized before moving it to a job-specific cluster.
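
At the executor level, Spark’s own dynamic allocation is the analogue of cluster auto-scaling, and it illustrates the kind of guardrails (an explicit floor and ceiling) described above. A hedged sketch; the bounds are placeholders, and depending on your cluster manager you may need an external shuffle service rather than shuffle tracking.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-sketch")
        # Scale executors up and down with demand, within explicit guardrails.
        # The bounds below are placeholders, not recommendations.
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .getOrCreate()
    )
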
10. HOW DO I FIND AND FIX PROBLEMS?

Just as it’s hard to fix an individual Spark job, there’s no easy way to know where to look for problems across a Spark cluster. And once you do find a problem, there’s very little guidance on how to fix it. Is the problem with the job itself, or the environment it’s running in? For instance, over-allocating memory or CPUs for some Spark jobs can starve others. In the cloud, the noisy neighbors problem can slow down a Spark job run to the extent that it causes business problems on one outing – but leaves the same job to finish in good time on the next run.

The better you handle the other challenges listed in this guide, the fewer problems you’ll have, but it’s still very hard to know how to most productively spend Spark operations time. For instance, a slow Spark job on one run may be worth fixing in its own right, and may be warning you of crashes on future runs. But it’s very hard just to see what the trend is for a Spark job in performance, let alone to get some idea of what the job is accomplishing vs. its resource use and average time to complete. So Spark troubleshooting ends up being reactive, with all too many furry, blind little heads popping up for operators to play Whack-a-Mole with.

Impacts of These Challenges

If you meet the above challenges effectively, you’ll use your resources efficiently and cost-effectively. However, our observation here at Unravel Data is that most Spark clusters are not run efficiently. What we tend to see most are the following problems – at a job level, within a cluster, or across all clusters:

• Under-allocation. It can be tricky to allocate your resources efficiently on your cluster, partition your datasets effectively, and determine the right level of resources for each job. If you under-allocate (either for a job’s driver or the executors), a job is likely to run too slowly, or to crash. As a result, many developers and operators resort to…

• Over-allocation. If you assign too many resources to your job, you’re wasting resources (on-premises) or money (cloud). We hear about jobs that need, for example, 2GB of memory, but are allocated much more – in one case, 85GB.

Applications can run slowly, because they’re under-allocated – or because some apps are over-allocated, causing others to run slowly. Data teams then spend much of their time fire-fighting issues that may come and go, depending on the particular combination of jobs running that day. With every level of resource in shortage, new, business-critical apps are held up, so the cash needed to invest against these problems doesn’t show up. IT becomes an organizational headache, rather than a source of business capability.

Conclusion

To jump ahead to the end of this series a bit, our customers here at Unravel are easily able to spot and fix over-allocation and inefficiencies. They can then monitor their jobs in production, finding and fixing issues as they arise. Developers even get on board, checking their jobs before moving them to production, then teaming up with Operations to keep them tuned and humming.

One Unravel customer, Mastercard, has been able to reduce usage of their clusters by roughly half, even as data sizes and application density have moved steadily upward during the global pandemic. And everyone gets along better, and has more fun at work, while achieving these previously unimagined results.

So, whether you choose to use Unravel or not, develop a culture of right-sizing and efficiency in your work with Spark. It will seem to be a hassle at first, but your team will become much stronger, and you’ll enjoy your work life more, as a result.

You need a sort of X-ray of your Spark jobs, better cluster-level monitoring, environment information, and to correlate all of these sources into recommendations. In Troubleshooting Spark Applications, Part 2: Solutions, we will describe the most widely used tools for Spark troubleshooting – including the Spark Web UI and our own offering, Unravel Data – and how to assemble and correlate the information you need. If you would like to know more about Unravel Data now, you can download a free trial or contact Unravel.