Essays On Data Analysis
Roger D. Peng
This book is for sale at
http://leanpub.com/dataanalysisessays
This report was written in 1990 but the essence of this quo-
tation still rings true today. Pregibon was writing about his
effort to develop “expert systems” that would guide scientists
doing data analysis. His report was largely a chronicle of
failures that could be attributed to a genuine lack of under-
standing of how people analyze data.
Top to Bottom
One can think of data analysis from the “top down” or from
the “bottom up”. Neither characterization is completely cor-
rect but both are useful as mental models. In this book we
will discuss both and how they help guide us in doing data
analysis.
The top down approach sees data analysis as a product to be
designed. As has been noted by many professional designers
Summary
Time
Technology
Trustworthiness
Acceptance
very close. Ultimately, lawsuits were filed and a trial was set
to determine exactly how the vote counting should proceed.
Statisticians were called to testify for both Bush and Gore.
The statistician called to testify for the Gore team was Nicolas
Hengartner, formerly of Yale University (he was my under-
graduate advisor when I was there). Hengartner presented a
thorough analysis of the data that was given to him by the
Gore team and concluded there were differences in how the
votes were being counted across Florida and that some ballots
were undercounted.
However, on cross-examination, the lawyer for Bush was able
to catch Hengartner in a “gotcha” moment which ultimately
had to do with the manner in which the data were collected,
about which Hengartner had been unaware. Was the analysis
a success? It’s difficult to say without having been directly in-
volved. Nobody challenged the methodology that Hengartner
used in the analysis, which was by all accounts a very simple
analysis. Therefore, one could argue that it had intrinsic
validity. However, one could also argue that he should have
known about the issue with how the data were collected (and
perhaps the broader context) and incorporated that into his
analysis and presentation to the court. Ultimately, Hengart-
ner’s analysis was only one piece in a collection of evidence
presented and so it’s difficult to say what role it played in the
ultimate outcome.
Audience
Summary
double-diamond
Phase 1: Exploration
3 https://simplystatistics.org/2018/07/06/data-creators/
You might love all your children equally but you still need to
pick a favorite. The reason is nobody has the resources4 to
pursue every avenue of investigation. Furthermore, pursuing
every avenue would likely not be that productive. You are bet-
ter off sharpening and refining your question. This simplifies
the analysis in the future and makes it much more likely that
people (including you!) will be able to act on the results that
you present.
This phase of analysis is convergent and requires synthesiz-
ing many different ideas into a coherent plan or strategy. Tak-
ing the thousands of plots, tables, and summaries that you’ve
made and deciding on a problem specification is not easy,
and to my surprise, I have not seen a lot of tools dedicated
to assisting in this task. Nevertheless, the goal here is to end
up with a reasonably detailed specification of what we are
trying to achieve and how the data will be used to achieve it.
It might be something like “We’re going to fit a linear model
with this outcome and these predictors in order to answer this
question” or “We’re building a prediction model using this
collection of features to optimize this metric”.
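To make this concrete, here is a minimal sketch in R of what such a specification might look like once translated into code. The dataset, outcome, and predictors below are hypothetical placeholders invented for illustration, not drawn from any particular analysis.

```r
## Hypothetical data standing in for whatever was explored in Phase 1
set.seed(1)
dat <- data.frame(
  fev1   = rnorm(100, mean = 3, sd = 0.5),
  pm25   = rnorm(100, mean = 12, sd = 3),
  age    = sample(30:70, 100, replace = TRUE),
  smoker = rbinom(100, 1, 0.25)
)

## "We're going to fit a linear model with this outcome and these
## predictors in order to answer this question."
fit <- lm(fev1 ~ pm25 + age + smoker, data = dat)
summary(fit)$coefficients
```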
In some settings (such as in consulting) you might need to for-
mally write this specification down and present it to others. At
any rate, you will need to justify it based on your exploration
of the data in Phase 1 and whatever outside factors may be
relevant. Having a keen understanding of your audience5
becomes relevant at the conclusion of this phase.
5 https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/
out. For starters, what will the results look like? What sum-
maries do we want to produce and how will they be pre-
sented? Having a detailed specification is good, but it’s not
final. When I was a software engineer, we often got highly
detailed specifications of the software we were supposed to
build. But even then, there were many choices to make.
Thus begins another divergent phase of analysis, where we
typically build models and gauge their performance and ro-
bustness. This is the data analyst’s version of prototyping. We
may look at model fit and see how things work out relative
to our expectations set out in the problem specification. We
might consider sensitivity analyses or other checks on our
assumptions about the world and the data. Again, there may
be many tables and plots, but this time not of the data, but of
the results. The important thing here is that we are dealing
with concrete models, not the rough “sketches” done in Phase
1.
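As a rough illustration of this prototyping step, the sketch below (with simulated data and invented names, not anything from a real analysis) fits the model called for by the specification, checks the estimate against what we expected, and probes whether a handful of influential points change the answer.

```r
## Sketch of Phase 3 checks: does the fitted model behave as the
## problem specification anticipated? Simulated placeholder data.
set.seed(2)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 0.5 * d$x + rnorm(200)

fit <- lm(y ~ x, data = d)

## Compare the estimate to the effect size we anticipated in the
## problem specification (here, roughly 0.5)
coef(summary(fit))["x", ]

## Simple robustness check: does trimming influential points change much?
keep  <- cooks.distance(fit) < 4 / nrow(d)
refit <- update(fit, data = d[keep, ])
c(full = coef(fit)["x"], trimmed = coef(refit)["x"])
```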
Because the work in this phase will likely end up in some
form in the final product, we need to develop a more formal
workflow and process to track what we are doing. Things
like version control play a role, as well as scriptable data
analysis packages that can describe our work in code. Even
though many aspects of this phase still may not be used,
it is important to have reproducibility in mind as work is
developed so that it doesn’t have to be “tacked on” after the
fact (an often painful process).
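One minimal way to keep the work scriptable and reproducible is sketched below; the file names and steps are hypothetical, and in practice each step would live in its own version-controlled script.

```r
## Sketch of a minimal reproducible workflow. Each step reads from and
## writes to disk so the chain of results can be re-run and tracked
## under version control. File names are placeholders.
set.seed(4)

## Step 1: ingest data (simulated here in place of reading a raw file)
raw <- data.frame(x = rnorm(50), y = rnorm(50))
saveRDS(raw, "data_raw.rds")

## Step 2: fit the model decided on in Phase 2
dat <- readRDS("data_raw.rds")
fit <- lm(y ~ x, data = dat)
saveRDS(fit, "model_fit.rds")

## Step 3: record the computational environment for reproducibility
writeLines(capture.output(sessionInfo()), "session_info.txt")
```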
Phase 4: Narration
Implications
At this point it’s worth recalling that all models are wrong, but
some are useful. So why is this model for data analysis useful?
I think there are a few areas where the model is useful as an
explanatory tool and for highlighting where there might be
avenues for future work.
Decision Points
One thing I like about the visual shape of the model is that it
highlights two key areas–the ends of both convergent phases–
where critical decisions have to be made. At the end of Phase
2, we must decide on the problem specification, after sorting
through all of the exploratory work that we’ve done. At the
6 https://fivethirtyeight.com
Tooling
Education
Future Work
Tooling
Education
Summary
1. Recognition of problem
2. One technique used
3. Competing techniques used
4. Rough comparisons of efficacy
After thinking about all this I was inspired to draw the follow-
ing diagram.
It would seem that the message here is that the goal of data
analysis is to explore the data. In other words, data analysis
is exploratory data analysis. Maybe this shouldn’t be so sur-
prising given that Tukey wrote the book4 on exploratory data
analysis. In this paper, at least, he essentially dismisses other
goals as overly optimistic or not really meaningful.
For the most part I agree with that sentiment, in the sense
that looking for “the answer” in a single set of data is going
to result in disappointment. At best, you will accumulate
evidence that will point you in a new and promising direction.
Then you can iterate, perhaps by collecting new data, or by
4 https://en.wikipedia.org/wiki/Exploratory_data_analysis
(Note that the all caps are originally his!) Given this, it’s not
too surprising that Tukey seems to equate exploratory data
analysis with essentially all of data analysis.
Better Questions
There’s one story that, for me, totally captures the spirit of
exploratory data analysis. Legend has it that Tukey once
asked a student what were the benefits of the median polish
technique5, a technique he invented to analyze two-way tab-
ular data. The student dutifully answered that the benefit of
the technique is that it provided summaries of the rows and
columns via the row- and column-medians. In other words,
like any good statistical technique, it summarized the data by
reducing it in some way. Tukey fired back, saying that this was
incorrect—the benefit was that the technique created more
data. That “more data” was the residuals that are leftover
in the table itself after running the median polish. It is the
residuals that really let you learn about the data, discover
whether there is anything unusual, whether your question
is well-formulated, and how you might move on to the next
step. So in the end, you got row medians, column medians,
and residuals, i.e. more data.
5 https://en.wikipedia.org/wiki/Median_polish
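For readers who want to see the point in miniature, here is a small sketch using base R's medpolish() on a made-up two-way table; the residual table it returns is exactly the "more data" Tukey had in mind.

```r
## Worked sketch of Tukey's point with a small invented two-way table
tab <- matrix(c(12, 15, 14,
                18, 22, 20,
                11, 13, 30),   # one deliberately unusual cell
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("row", 1:3), paste0("col", 1:3)))

fit <- medpolish(tab)

fit$row        # row effects (the "summaries")
fit$col        # column effects
fit$residuals  # the "more data": the unusual cell stands out here
```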
Missing Data
Missing data are present in almost every dataset and the most
important question a data analyst can ask when confronted
with missing data is “Why are the data missing?” It’s impor-
tant to develop some understanding of the mechanism behind
what makes the data missing in order to develop an appropri-
ate strategy for dealing with missing data (i.e. doing nothing,
imputation, etc.). But the data themselves often provide little
information about this mechanism; often the mechanism is
coded outside the data, possibly not even written down but
3 https://simplystatistics.org/2018/05/24/context-compatibility-in-data-analysis/
4 https://simplystatistics.org/2018/06/18/the-role-of-resources-in-data-analysis/
5 https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/
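To illustrate the missing-data point above with a sketch (everything here is simulated, and no real mechanism is implied), one simple first step is to check whether missingness is related to observed covariates; if it is, the data are unlikely to be missing completely at random.

```r
## Sketch of a first look at a missing-data mechanism
set.seed(5)
n   <- 300
age <- sample(20:80, n, replace = TRUE)
bp  <- 100 + 0.5 * age + rnorm(n, sd = 10)
bp[runif(n) < plogis((age - 60) / 5)] <- NA   # older subjects more often missing

d <- data.frame(age, bp)
mean(is.na(d$bp))   # overall missingness rate

## Does missingness relate to an observed covariate?
summary(glm(is.na(bp) ~ age, data = d, family = binomial))$coefficients
```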
The data analyst will likely have to work under a set of re-
source constraints7, placing boundaries on what can be done
with the data. The first and foremost constraint is likely to be
time. One can only try so many things in the time allotted,
or some analyses may take too long to complete. Therefore,
compromises may need to be made unless more time and
resources can be negotiated. Tooling will be limited in that
certain combinations of models and software may not exist
and there may not be time to develop new tools from scratch.
A good data analyst must make an estimate of the time avail-
able and determine whether it is sufficient to complete the
analysis. If resources are insufficient, then the analyst must
either negotiate for more resources or adapt the analysis to
fit the available resources. Creativity will almost certainly be
required when there are severe resource constraints, in order
to squeeze as much productivity as possible out of what is available.
Summary
12. Should the Data Come
First?
One conversation I’ve had quite a few times revolves around
the question, “What’s the difference between science and data
science?” If I were to come up with a simple distinction, I
might say that
Once we’ve figured out the context around the data, essen-
tially retracing the history of the data, we can then ask “Are
these data appropriate for the question that I want to ask?”
Answering this question involves comparing the context sur-
rounding the original data and then ensuring that it is com-
patible with the context for your question. If there is compat-
ibility, or we can create compatibility via statistical modeling
or assumptions, then we can intelligently move forward to
analyze the data and generate evidence concerning a differ-
ent question. We will then have to make a separate argument
to some audience regarding the evidence in the data and any
conclusions we may make. Even though the data may have
been convincing for one question, it doesn’t mean that the
data will be convincing for a different question.
If we were to develop a procedure for a data science “proce-
dural” TV show, it might look like the following.
Data science often starts with the data, but in an ideal world
it wouldn’t. In an ideal world, we would ask questions and
carefully design experiments to collect data specific to those
questions. But this is simply not practical and data need to be
shared across contexts. The difficulty of the data scientist’s
job is understanding each dataset’s context, making sure it
is compatible with the current question, developing the evi-
dence from the data, and then getting an audience to accept
the results.
13. Partitioning the
Variation in Data
One of the fundamental questions that we can ask in any data
analysis is, “Why do things vary?” Although I think this is fun-
damental, I’ve found that it’s not explicitly asked as often as I
might think. The problem with not asking this question, I’ve
found, is that it can often lead to a lot of pointless and time-
consuming work. Taking a moment to ask yourself, “What do
I know that can explain why this feature or variable varies?”
can often make you realize that you actually know more than
you think you do.
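As a sketch of what that question looks like in practice (using simulated data and an invented seasonal effect), one can check how much of a feature's variation is already accounted for by something known before treating the rest as noise.

```r
## Sketch: how much variation in a feature is explained by a known factor?
set.seed(8)
season <- factor(rep(c("winter", "spring", "summer", "fall"), each = 50))
pm25   <- 10 + 4 * (season == "winter") - 2 * (season == "summer") +
          rnorm(200, sd = 2)

## Between-group (known, fixed) versus residual variation
anova(lm(pm25 ~ season))
```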
When embarking on a data analysis, ideally before you look at
the data, it’s useful to partition the variation in the data. This
can be roughly broken down into two categories of variation:
fixed and random. Within each of those categories, there can
be a number of sub-categories of things to investigate.
Fixed variation
Random variation
Is it really random?
Summary
features for which you do not have data but that you can go
out and collect. Making the effort to collect additional data
when it is warranted can save a lot of time and effort trying
to model variation as if it were random. More importantly,
omitting important fixed effects in a statistical model can
lead to hidden bias or confounding. When data on omitted
variables cannot be collected, trying to find a surrogate for
those variables can be a reasonable alternative.
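A small simulation sketch (entirely made up, with hypothetical variable names) shows the kind of bias that omitting a known fixed effect can introduce, and how including it, or a surrogate for it, recovers the intended estimate.

```r
## Sketch: omitting an important fixed effect biases an estimate
set.seed(3)
n        <- 500
site     <- rbinom(n, 1, 0.5)           # a known, fixed source of variation
exposure <- 0.8 * site + rnorm(n)       # exposure differs by site
outcome  <- 2 + 0.2 * exposure + 1.5 * site + rnorm(n)

## Ignoring the fixed effect inflates the exposure coefficient;
## adjusting for it recovers something near the true value of 0.2
c(omitted  = coef(lm(outcome ~ exposure))["exposure"],
  adjusted = coef(lm(outcome ~ exposure + site))["exposure"])
```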
14. How Data Analysts
Think - A Case Study
In episode 71 of Not So Standard Deviations1, Hilary Parker
and I inaugurated our first “Data Science Design Challenge”
segment where we discussed how we would solve a given
problem using data science. The idea with calling it a “design
challenge” was to contrast it with common “hackathon” type
models where you are presented with an already-collected
dataset and then challenged to find something interesting
in the data. Here, we wanted to start with a problem and
then talk about how data might be collected and analyzed to
address the problem. While both approaches might result in
the same end-product, they address the various problems you
encounter in a data analysis in a different order.
In this post, I want to break down our discussion of the chal-
lenge and highlight some of the issues that were discussed in
framing the problem and in designing the data collection and
analysis. I’ll end with some thoughts about generalizing this
approach to other problems.
You can download an MP3 of this segment of the episode2 (it
is about 45 minutes long) or you can read the transcript of the
segment3. If you’d prefer to stream the segment you can start
listening here4.
1 http://nssdeviations.com/
2 https://www.dropbox.com/s/yajgbr25dbh20i0/NSSD%20Episode%2071%20Design%20Challenge.mp3?dl=0
3 https://drive.google.com/open?id=11dEhj-eoh8w13dS-mWvDMv7NKWXZcXMr
4 https://overcast.fm/+FMBuKdMEI/00:30
The Brief
The general goal was to learn more about the time it takes for
each of us to commute to work. Hilary lives in San Francisco
and I live in Baltimore, so the characteristics of our commutes
are very different. She walks and takes public transit; I drive
most days. We also wanted to discuss how we might collect
data on our commute times in a systematic, but not intrusive,
manner. When we originally discussed having this segment,
this vague description was about the level of specification that
we started with, so an initial major task was to
Right off the bat, Hilary notes that she doesn’t actually do this
commute as often as she’d thought. Between working from
home, taking care of chores in the morning, making stops on
the way, and walking/talking with friends, a lot of variation
can be introduced into the data.
I mention that “going to work” and “going home”, while both
can be thought of as commutes, are not the same thing and
that we might be interested in one more than the other. Hilary
agrees that they are different problems but they are both
potentially of interest.
Question/Intervention Duality
Then she describes how she can use Wi-Fi connections (and
dis-connections) to serve as surrogates for leaving and arriv-
ing.
What exactly are the data that we will be collecting? What are
the covariates that we need to help us understand and model
the commute times? Obvious candidates are
Hilary notes that from the start/end times we can get things
like day of the week and time of day (e.g. via the lubri-
date6 package). She also notes that her system doesn’t exactly
6 https://cran.r-project.org/web/packages/lubridate/index.html
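As a small sketch of that covariate-derivation step (the timestamps below are invented), lubridate makes it straightforward to compute durations, day of week, and time of day from raw start and end times.

```r
## Sketch: deriving commute covariates from raw timestamps
library(lubridate)

commutes <- data.frame(
  start = ymd_hms(c("2018-10-01 08:05:00", "2018-10-02 08:40:00")),
  end   = ymd_hms(c("2018-10-01 08:52:00", "2018-10-02 09:20:00"))
)

commutes$duration_min <- as.numeric(difftime(commutes$end, commutes$start,
                                             units = "mins"))
commutes$day_of_week  <- wday(commutes$start, label = TRUE)
commutes$hour_of_day  <- hour(commutes$start)
commutes
```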
Hilary raises a hard truth, which is that not everyone gets the
same consideration when it comes to showing up on time. For
an important meeting, we might allow for “three standard
deviations” more than the mean travel time to ensure some
margin of safety for arriving on time. However, for a more
routine meeting, we might just provide for one standard de-
viation of travel time and let natural variation take its course
for better or for worse.
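As a toy calculation (the numbers are invented), the rule of thumb amounts to nothing more than adding a multiple of the standard deviation to the mean travel time.

```r
## Back-of-the-envelope sketch of the "standard deviations of safety" idea
mean_commute <- 35   # minutes (made up)
sd_commute   <- 8    # minutes (made up)

important_meeting <- mean_commute + 3 * sd_commute   # 59-minute buffer
routine_meeting   <- mean_commute + 1 * sd_commute   # 43-minute buffer

c(important = important_meeting, routine = routine_meeting)
```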
Towards the end I ask Hilary how much data is needed for
this project. However, before asking I should have discussed
the nature of the study itself:
Hilary suggests that it is the latter and that she will simply
collect data and make decisions as she goes. However, it’s
clear that the time frame is not “forever” because the method
of data collection is not zero cost. Therefore, at some point the
costs of collecting data will likely be too great in light of any
perceived benefit.
Discussion
What have we learned from all of this? Most likely, the prob-
lem of estimating commute times is not relevant to every-
body. But I think there are aspects of the process described
above that illustrate how the data analytic process works
before data collection begins (yes, data analysis includes parts
where there are no data). These aspects can be lifted from this
particular example and generalized to other data analyses. In
this section I will discuss some of these aspects and describe
why they may be relevant to other analyses.
Problem Framing
for you. However, you may still be able to control when you
leave home.
With the question “How long does it take to commute by
Muni?” one might characterize the potential intervention as
“Taking Muni to work or not”. However, if Muni breaks down,
then that is out of your control and you simply cannot take
that choice. A more useful question then is “How long does it
take to commute when I choose to take Muni?” This difference
may seem subtle, but it does imply a different analysis and
is associated with a potential intervention that is completely
controllable. I may not be able to take Muni every day, but I
can definitely choose to take it every day.
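A toy sketch (with a simulated commute log and an invented variable recording the choice) makes the distinction concrete: the two questions correspond to different subsets of the data and can give different answers.

```r
## Sketch: two questions, two estimands
set.seed(6)
n <- 100
chose_muni <- rbinom(n, 1, 0.7) == 1
duration   <- ifelse(chose_muni, rnorm(n, 45, 10), rnorm(n, 30, 8))

## "How long does my commute take overall?"
mean(duration)

## "How long does it take when I choose to take Muni?"
mean(duration[chose_muni])
```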
Sketch Models
Summary
I have spoken with people who argue that there is little in the way
of generalizable concepts in data analysis because every data
analysis is uniquely different from every other. However, I
think this experience of observing myself talk with Hilary
about this small example suggests to me that there are some
general concepts. Things like gauging your personal interest
in the problem could be useful in managing potential re-
sources dedicated to an analysis, and I think considering fixed
and random variation is an important aspect of any data analytic
design or analysis. Finally, developing a sketch (statistical)
model before the data are in hand can be useful for setting
expectations and for setting a benchmark for when to be
surprised or skeptical.
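For example, a sketch model for the commute problem might be nothing more than a rough distributional guess written down before any data exist; the numbers below are invented purely to show how such a guess sets expectations and a benchmark for surprise.

```r
## Sketch model written down before data collection: a rough prior
## guess for commute times (minutes), used to set expectations
set.seed(7)
prior_commutes <- rlnorm(10000, meanlog = log(35), sdlog = 0.25)

## Roughly, anything outside this range would be surprising
quantile(prior_commutes, c(0.05, 0.5, 0.95))
```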
One problem with learning data analysis is that we rarely,
as students, get to observe the thought process that occurs
at the early stages. In part, that is why I think many call
for more experiential learning in data analysis, because the
only way to see the process is to do the process. But I think
we could invest more time and effort into recording some
of these processes, even in somewhat artificial situations like
this one, in order to abstract out any generalizable concepts
and advice. Such summaries and abstractions could serve
as useful data analysis texts, allowing people to grasp the
general concepts of analysis while using the time dedicated to
experiential learning for studying the unique details of their
problem.
III Human Factors
15. Trustworthy Data
Analysis
The success of a data analysis depends critically on the au-
dience. But why? A lot has to do with whether the audience
trusts the analysis as well as the person presenting the anal-
ysis. Almost all presentations are incomplete because for any
analysis of reasonable size, some details must be omitted for
the sake of clarity. A good presentation will have a struc-
tured narrative that will guide the presenter in choosing what
should be included and what should be omitted. However,
audiences will vary in their acceptance of that narrative and
will often want to know if other details exist to support it.
The Presentation
There are many questions that one might ask before one were
to place any trust in the results of this analysis. Here are just
a few:
One might think of other things to do, but the items listed
above are in direct response to the questions asked before.
1 https://simplystatistics.org/2018/05/24/context-compatibility-in-data-analysis/
Here we have
At the end of the day, someone has to pay for data analysis,
and this person is the patron. This person might have gotten
a grant, or signed a customer, or simply identified a need and
the resources for doing the analysis. The key thing here is
that the patron provides the resources and determines the
tools available for analysis. Typically, the resources we are
concerned with are the time available to the analyst. The patron,
through the allocation of resources, controls the scope of
the analysis. If the patron needs the analysis tomorrow, the
analysis is going to be different than if they need it in a month.
A bad relationship here can lead to mismatched expectations
between the patron and the analyst. Often the patron thinks
the analysis should take less time than it really does. Con-
versely, the analyst may be led to believe that the patron is
deliberately allocating fewer resources to the data analysis
because of other priorities. None of this is good, and the
relationship between the two must be robust enough to straighten
out any disagreements or confusion.
The data analyst needs to find some way to assess the needs
and capabilities of the audience, because there is always an
audience2. There will likely be many different ways to present
2 https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/
Implications
between the analyst and the various people that have a stake
in the results. In the worst case scenario, a breakdown in
relationships can lead to serious failure3.
I think most people who analyze data would say that data
analysis is easiest when the patron, subject matter expert, the
analyst, and the audience are all the same person. The reason
is because the relationships are all completely understood
and easy to manage. Communication is simple when it only
has to jump from one neuron to another. Doing data analysis “for
yourself” is often quick and highly iterative, and it is easy to make
progress. Collaborating with others on data analysis can be
slow, rife with miscommunication, and frustrating. One com-
mon scenario is where the patron, expert, and the audience
are the same person. If there is a good relationship between
this person and the analyst, then the data analysis process
here can work very smoothly and productively.
Combining different roles into the same person can some-
times lead to conflicts:
Other Theories
There are other kinds of theory and often their role is not
to make general statements about the natural world. Rather,
their goal is to provide quasi-general summaries of what
is commonly done, or what might be typical. So instead of
making statements along the lines of “X is true”, the aim is
to make statements like “X is most common”. Often, those
statements can be made because there is a written record of
what was done in the past and the practitioners in the area
have a collective memory of what works and what doesn’t.
On War
On Music
bad”. His thinking was not that the theory of music tells
us what sounds good versus bad but rather tells us what is
commonly done versus not commonly done. One could infer
that things that are commonly done are therefore good, but
that would be an individual judgment and not an inherent
aspect of the theory.
Knowledge of music theory is useful if only because it pro-
vides an efficient summary of what is expected. You can’t
break the rules if you don’t know what the rules are. Creating
things that are surprising or unexpected or interesting relies
critically on knowing what your audience is expecting to
hear. The reason why Schoenberg’s “atonal” style of music
sounds so different is because his audiences were expecting
music written in the more common tonal style. Sometimes,
we can rely on music theory to help us avoid a specific
chord progression (e.g. parallel fifths) because that “sounds
bad”, but what we really mean is that such a pattern is not
commonly used and is perhaps unexpected. So if you’re going
to do it, feel free, but it should be for a good reason.
General Statements
Theory Principles
4 https://simplystatistics.org/2018/06/04/trustworthy-data-analysis/
Now that time has passed and I’ve had an opportunity to see
what’s going on in the world of data science, what I think
about good data scientists, and what seems to make for good
data analysis, I have a few more ideas on what makes for a
good data scientist. In particular, I think there are broadly five
“tentpoles” for a good data scientist. Each tentpole represents
a major area of activity that will to some extent be applied in
any given data analysis.
When I ask myself the question “What is data science?” I tend
to think of the following five components. Data science is
Design Thinking
Workflows
Over the past 15 years or so, there has been a growing dis-
cussion of the importance of good workflows in the data
analysis community. At this point, I’d say a critical job of a
data scientist is to develop and manage the workflows for a
given data problem. Most likely, it is the data scientist who
will be in a position to observe how the data flows through
a team or across different pieces of software, and so the data
scientist will know how best to manage these transitions. If a
data science problem is a systems problem, then the workflow
indicates how different pieces of the system talk to each other.
While the tools of data analytic workflow management are
constantly changing, the importance of the idea persists and
4 https://simplystatistics.org/2018/09/14/divergent-and-convergent-phases-of-data-analysis/
5 https://simplystatistics.org/2018/11/01/the-role-of-academia-in-data-science-education/
staying up-to-date with the best tools is a key part of the job.
In the scientific arena the end goal of good workflow man-
agement is often reproducibility of the scientific analysis. But
good workflow can also be critical for collaboration, team
management, and producing good science (as opposed to
merely reproducible science). Having a good workflow can
also facilitate sharing of data or results, whether it’s with
another team at the company or with the public more gen-
erally, as in the case of scientific results. Finally, being able
to understand and communicate how a given result has been
generated through the workflow can be of great importance
when problems occur and need to be debugged.
Human Relationships
7 https://simplystatistics.org/2018/06/18/the-role-of-resources-in-data-analysis/
8 https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/
9 https://simplystatistics.org/2018/04/30/relationships-in-data-analysis/
Statistical Methods
and to understand how the data are tied to the result. This
is a data analysis after all, and we should be able to see for
ourselves how the data inform the conclusion. As an audience
member in this situation, I’m not as interested in just trusting
the presenter and their conclusions.
Summary
Generative Model
Analytical Model
Stephanie Hicks and I have written about what are the ele-
ments of a data analysis as well as what might be the prin-
ciples7 that guide the development of an analysis. In a sep-
arate paper8, we describe and characterize the success of a
data analysis, based on a matching of principles between the
analyst and the audience. This is something I have touched
on previously, both in this book and on my podcast with
Hilary Parker9, but in a generally more hand-wavey fashion.
Developing a more formal model, as Stephanie and I have
done here, has been useful and has provided some additional
insights.
For both the generative model and the analytical model of
data analysis, the missing ingredient was a clear definition of
what made a data analysis successful. The other side of that
coin, of course, is knowing when a data analysis has failed.
The analytical approach is useful because it allows us to sep-
arate the analysis from the analyst and to categorize analyses
7 https://arxiv.org/abs/1903.07639
8 https://arxiv.org/abs/1904.11907
9 http://nssdeviations.com/