Data Analysis Essays
Roger D. Peng
This book is for sale at http://leanpub.com/dataanalysisessays
We repeat this process for each tool in the toolbox. The prob-
lem is that the end of the statement doesn’t follow from the
beginning of the statement. Just because you know all of the
characteristics of a hammer doesn’t mean that you know how
to build a house, or even a chair. You may be able to infer it,
but that’s perhaps the best we can hope for.
In graduate school, this kind of training is arguably okay be-
cause we know the students will go on to work with an advisor
who will in fact teach them data analysis. But it still raises the
question: Why can’t we teach data analysis in the classroom?
Why must it be taught one-on-one with an apprenticeship
model? The urgency of these questions has grown substan-
tially in recent times with the rise of big data, data science,
and analytics. Everyone is analyzing data now, but we have
very little guidance to give them. We still can’t tell them “what
to do”.
One phenomenon that I’ve observed over time is that stu-
dents, once they have taken our courses, are often very well-
versed in data analytic tools and their characteristics. The-
orems tell them that certain tools can be used in some sit-
uations but not in others. But when they are given a real
scientific problem with real data, they often make what we
might consider elementary mistakes. It’s not for a lack of
understanding of the tools. It’s clearly something else that is
missing in their training.
The goal of this book is to give at least a partial answer to
Michelle’s question of “What should I do?” Perhaps ironically,
for a book about data analysis, much of the discussion will
center around things that are outside the data. But ultimately,
that is the part that is missing from traditional statistical train-
ing, the things outside the data that play a critical role in
how we conduct and interpret data analyses. Three concepts
will emerge from the discussion to follow. They are context,
resources, and audience, each of which will get its own
chapter. In addition, I will discuss an expanded picture of
This report was written in 1990 but the essence of this quo-
tation still rings true today. Pregibon was writing about his
effort to develop “expert systems” that would guide scientists
doing data analysis. His report was largely a chronicle of
failures that could be attributed to a genuine lack of under-
standing of how people analyze data.
Top to Bottom
One can think of data analysis from the “top down” or from
the “bottom up”. Neither characterization is completely cor-
rect but both are useful as mental models. In this book we
will discuss both and how they help guide us in doing data
analysis.
The top down approach sees data analysis as a product to be designed. As has been noted by many professional designers and
each person involved doing their job and not talking to any-
one else. One may wish it were so, because it would be much
simpler this way, but wishing doesn’t make it so. This initial
discussion of figuring out the right problem is an important
part of designing a data analysis. But even if one has a thorough discussion, we're not done questioning the question yet.
Sometimes a problem does not become clear until we’ve at-
tempted to solve it. A data analyst’s job is to propose solu-
tions to a problem in order to explore the problem space. For
example, in the above air pollution example, we might first
express interest in looking at particulate matter air pollution1 .
But when we look at the available data we see that there are
too many missing values to do a useful analysis. So we might
switch to looking at ozone pollution instead, which is equally
important from a health perspective.
At this point it is important not to get so far into the problem that a lot of time or resources have to be invested.
Making that initial attempt to look at particulate matter prob-
ably didn’t involve more than downloading a file from a web
site, but it allowed us to explore the boundary of what might
be possible. Sometimes the proposed solution “just works”,
but more often it raises new questions and forces the analyst
to rethink the underlying problem. Initial solution attempts
should be “sketches”, or rough models, to see if things will
work. The preliminary data obtained by these sketches can be
valuable for prioritizing the possible solutions and converg-
ing on the ultimate approach.
This initial process of questioning the question can feel frus-
trating to some, particularly for those who have come to the
data analyst to get something done. Often to the collaborator,
it feels like the analyst is questioning the collaborator's knowledge of
the topic and is re-litigating old issues that have previously
been resolved. The data analyst should be sensitive to these
concerns and explain why such a discussion needs to be had.
The analyst should also understand that the collaborator is
an expert in their field and probably does know what they're talking about.
1 https://www.epa.gov/pm-pollution
I still feel the same way about these skills but my feeling now
is that actually that post made the job of the data scientist
seem easier than it is. This is because it wrapped all of these
skills into a single job when in reality data science requires
being good at four jobs. In order to explain what I mean by
this, we have to step back and ask a much more fundamental
question.
1 https://simplystatistics.org/2019/01/18/the-tentpoles-of-data-science/
In case it’s not clear, these are not the same question. For
example, in Statistics, based on the curriculum from most PhD
programs, the core of the field involves statistical methods,
statistical theory, probability, and maybe some computing.
Data analysis is generally not formally taught (i.e. in the class-
room), but rather picked up as part of a thesis or research
project. Many classes labeled “Data Science” or “Data Analy-
sis” simply teach more methods like machine learning, clus-
tering, or dimension reduction. Formal software engineering
techniques are also not generally taught, but in practice are
often used.
One could argue that data analysis and software engineering
are things that statisticians do but they're not the core of the
field. Whether that is correct or incorrect is not my point.
I’m only saying that a distinction has to be made somewhere.
Statisticians will always do more than what would be consid-
ered the core of the field.
With data science, I think we are collectively taking inventory
of what data scientists tend to do. The problem is that at the
moment it seems to be all over the map. Traditional statistics
does tend to be central to the activity, but so does computer
science, software engineering, cognitive science, ethics, com-
munication, etc. This is hardly a definition of the core of a
field but rather an enumeration of activities.
The question then is can we define something that all data
scientists do? If we had to teach something to all data science
students without knowing where they might end up after-
wards, what would it be? My opinion is that at some point,
Scientist
new data and executes the data collection. The scientist must
work with the statistician to design a system for analyzing the
data and ultimately construct a set of expected outcomes from
any analysis of the data being collected.
The scientist plays a key role in developing the system that
results in our set of expected outcomes. Components of this
system might include a literature review, meta-analysis, pre-
liminary data, or anecdotal data from colleagues. I use the
term “Scientist” broadly here. In other settings this could be
a policy-maker or product manager.
Statistician
Systems Engineer
Once the analytic system is applied to the data there are only
two possible outcomes:
For software and for data analysis alike, the challenge is that
bugs or unexpected behavior can originate from anywhere.
Any complex system is composed of multiple components,
some of which may be your responsibility and many of which
are someone else’s. But bugs and anomalies do not respect
those boundaries! There may be an issue that occurs in one
component that only becomes known when you see the data
analytic output.
So if you are responsible for diagnosing a problem, it is your
responsibility to investigate the behavior of each component
of the system. If it is something you are not that familiar with,
then you need to become familiar with it, either by learning
on your own or (more likely) talking to the person who is in
fact responsible.
A common source of unexpected behavior in data analytic
output is the data collection process, but the statistician who
analyzes the data may not be responsible for that aspect of the
project. Nevertheless, the systems engineer who identifies an
anomaly has to go back through and talk to the statistician
and the scientist to figure out exactly how each component
works.
Ultimately, the systems engineer is tasked with taking a broad
view of all the activities that affect the output from a data
analysis in order to identify any deviations from what we
2 https://twitter.com/EmmaVitz/status/1330697959156027392?s=20
Politician
Phase 1: Exploration
3 https://simplystatistics.org/2018/07/06/data-creators/
You might love all your children equally but you still need to
pick a favorite. The reason is nobody has the resources4 to
pursue every avenue of investigation. Furthermore, pursuing
every avenue would likely not be that productive. You are bet-
ter off sharpening and refining your question. This simplifies
the analysis in the future and makes it much more likely that
people (including you!) will be able to act on the results that
you present.
This phase of analysis is convergent and requires synthesiz-
ing many different ideas into a coherent plan or strategy. Tak-
ing the thousands of plots, tables, and summaries that you’ve
made and deciding on a problem specification is not easy,
and to my surprise, I have not seen a lot of tools dedicated
to assisting in this task. Nevertheless, the goal here is to end
up with a reasonably detailed specification of what we are
trying to achieve and how the data will be used to achieve it.
It might be something like “We’re going to fit a linear model
with this outcome and these predictors in order to answer this
question” or “We’re building a prediction model using this
collection of features to optimize this metric”.
In some settings (such as in consulting) you might need to
formally write this specification down and present it to others.
At any rate, you will need to justify it based on your explo-
ration of the data in Phase 1 and whatever outside factors may
be relevant. Having a keen understanding of your audience5
becomes relevant at the conclusion of this phase.
5 https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/
For starters, what will the results look like? What summaries
do we want to produce and how will they be presented? Hav-
ing a detailed specification is good, but it’s not final. When I
was a software engineer, we often got highly detailed specifi-
cations of the software we were supposed to build. But even
then, there were many choices to make.
Thus begins another divergent phase of analysis, where we
typically build models and gauge their performance and ro-
bustness. This is the data analyst’s version of prototyping. We
may look at model fit and see how things work out relative
to our expectations set out in the problem specification. We
might consider sensitivity analyses or other checks on our
assumptions about the world and the data. Again, there may
be many tables and plots, but this time not of the data, but of
the results. The important thing here is that we are dealing
with concrete models, not the rough “sketches” done in Phase
1.
Because the work in this phase will likely end up in some
form in the final product, we need to develop a more formal
workflow and process to track what we are doing. Things like
version control play a role, as well as scriptable data analysis
packages that can describe our work in code. Even though
many aspects of this phase still may not be used, it is impor-
tant to have reproducibility in mind as work is developed so
that it doesn’t have to be “tacked on” after the fact (an often
painful process).
Phase 4: Narration
Implications
At this point it’s worth recalling that all models are wrong, but
some are useful. So why is this model for data analysis useful?
I think there are a few areas where the model is useful as an
explanatory tool and for highlighting where there might be
avenues for future work.
Decision Points
One thing I like about the visual shape of the model is that it
highlights two key areas–the ends of both convergent phases–
where critical decisions have to be made. At the end of Phase
2, we must decide on the problem specification, after sorting
through all of the exploratory work that we’ve done. At the
6 https://fivethirtyeight.com
1. Recognition of problem
2. One technique used
3. Competing techniques used
4. Rough comparisons of efficacy
After thinking about all this I was inspired to draw the follow-
ing diagram.
It would seem that the message here is that the goal of data
analysis is to explore the data. In other words, data analysis
is exploratory data analysis. Maybe this shouldn’t be so sur-
prising given that Tukey wrote the book4 on exploratory data
analysis. In this paper, at least, he essentially dismisses other
goals as overly optimistic or not really meaningful.
For the most part I agree with that sentiment, in the sense
that looking for “the answer” in a single set of data is going
to result in disappointment. At best, you will accumulate ev-
idence that will point you in a new and promising direction.
Then you can iterate, perhaps by collecting new data, or by
asking different questions. At worst, you will conclude that
you’ve “figured it out” and then be shocked when someone
4 https://en.wikipedia.org/wiki/Exploratory_data_analysis
(Note that the all caps are originally his!) Given this, it’s not
too surprising that Tukey seems to equate exploratory data
analysis with essentially all of data analysis.
Better Questions
There’s one story that, for me, totally captures the spirit of ex-
ploratory data analysis. Legend has it that Tukey once asked
a student what were the benefits of the median polish tech-
nique5 , a technique he invented to analyze two-way tabu-
lar data. The student dutifully answered that the benefit of
the technique is that it provided summaries of the rows and
columns via the row- and column-medians. In other words,
like any good statistical technique, it summarized the data by
reducing it in some way. Tukey fired back, saying that this was
incorrect—the benefit was that the technique created more
data. That “more data” was the residuals that are left over
in the table itself after running the median polish. It is the
residuals that really let you learn about the data, discover
whether there is anything unusual, whether your question
is well-formulated, and how you might move on to the next
step. So in the end, you got row medians, column medians,
and residuals, i.e. more data.
If a good exploratory technique gives you more data, then
maybe good exploratory data analysis gives you more ques-
tions, or better questions. More refined, more focused, and
5 https://en.wikipedia.org/wiki/Median_polish
Missing Data
Missing data are present in almost every dataset and the most
important question a data analyst can ask when confronted
with missing data is “Why are the data missing?” It’s impor-
tant to develop some understanding of the mechanism behind
what makes the data missing in order to develop an appropri-
ate strategy for dealing with missing data (i.e. doing nothing,
imputation, etc.) But the data themselves often provide little
information about this mechanism; often the mechanism is
coded outside the data, possibly not even written down but
3 https://simplystatistics.org/2018/05/24/context-compatibility-in-data-analysis/
4 https://simplystatistics.org/2018/06/18/the-role-of-resources-in-data-analysis/
5 https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/
The data analyst will likely have to work under a set of re-
source constraints7 , placing boundaries on what can be done
with the data. The first and foremost constraint is likely to
be time. One can only try so many things in the time allotted,
or some analyses may take too long to complete. Therefore,
compromises may need to be made unless more time and
resources can be negotiated. Tooling will be limited in that
certain combinations of models and software may not exist
and there may not be time to develop new tools from scratch.
A good data analyst must make an estimate of the time avail-
able and determine whether it is sufficient to complete the
analysis. If resources are insufficient, then the analyst must
either negotiate for more resources or adapt the analysis to
fit the available resources. Creativity will almost certainly be
required when there are severe resource constraints, in order
to squeeze as much productivity as possible out of what is available.
Summary
13. Should the Data Come First?
One conversation I’ve had quite a few times revolves around
the question, “What’s the difference between science and data
science?” If I were to come up with a simple distinction, I
might say that
Once we’ve figured out the context around the data, essen-
tially retracing the history of the data, we can then ask “Are
these data appropriate for the question that I want to ask?”
Answering this question involves comparing the context sur-
rounding the original data and then ensuring that it is com-
patible with the context for your question. If there is compat-
ibility, or we can create compatibility via statistical modeling
or assumptions, then we can intelligently move forward to
analyze the data and generate evidence concerning a differ-
ent question. We will then have to make a separate argument
to some audience regarding the evidence in the data and any
conclusions we may make. Even though the data may have
been convincing for one question, it doesn’t mean that the
data will be convincing for a different question.
If we were to develop a procedure for a data science “proce-
dural” TV show, it might look like the following.
Data science often starts with the data, but in an ideal world
it wouldn’t. In an ideal world, we would ask questions and
carefully design experiments to collect data specific to those
questions. But this is simply not practical and data need to be
shared across contexts. The difficulty of the data scientist’s
job is understanding each dataset’s context, making sure it
is compatible with the current question, developing the evi-
dence from the data, and then getting an audience to accept
the results.
14. Partitioning the Variation in Data
One of the fundamental questions that we can ask in any data
analysis is, “Why do things vary?” Although I think this is
fundamental, I’ve found that it’s not explicitly asked as often
as I might think. The problem with not asking this question,
I’ve found, is that it can often lead to a lot of pointless and time-
consuming work. Taking a moment to ask yourself, “What do
I know that can explain why this feature or variable varies?”
can often make you realize that you actually know more than
you think you do.
When embarking on a data analysis, ideally before you look at
the data, it’s useful to partition the variation in the data. This
can be roughly broken down into two categories of variation:
fixed and random. Within each of those categories, there can
be a number of sub-categories of things to investigate.
Summary
features for which you do not have data but that you can go
out and collect. Making the effort to collect additional data
when it is warranted can save a lot of time and effort trying
to model variation as if it were random. More importantly,
omitting important fixed effects in a statistical model can
lead to hidden bias or confounding. When data on omitted
variables cannot be collected, trying to find a surrogate for
those variables can be a reasonable alternative.
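To make that last point concrete, here is a toy simulation (my own illustration, not an example from the text): a seasonal indicator drives both the exposure and the outcome, so a model that omits it gives a biased estimate of the exposure effect, while including the fixed effect recovers it.

```r
## Toy illustration of confounding by an omitted fixed effect (season)
set.seed(1)
n <- 500
season <- rep(c(0, 1), length.out = n)   # e.g. winter vs. summer
x <- 2 * season + rnorm(n)               # exposure varies with season
y <- 1 * x + 3 * season + rnorm(n)       # season also affects the outcome

coef(lm(y ~ x))            # omits season: slope for x is biased away from 1
coef(lm(y ~ x + season))   # includes the fixed effect: slope near the true 1
```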
15. How Data Analysts Think - A Case Study
In episode 71 of Not So Standard Deviations1 , Hilary Parker
and I inaugurated our first “Data Science Design Challenge”
segment where we discussed how we would solve a given
problem using data science. The idea with calling it a “design
challenge” was to contrast it with common “hackathon” type
models where you are presented with an already-collected
dataset and then challenged to find something interesting
in the data. Here, we wanted to start with a problem and
then talk about how data might be collected and analyzed to
address the problem. While both approaches might result in
the same end-product, they address the various problems you
encounter in a data analysis in a different order.
In this post, I want to break down our discussion of the chal-
lenge and highlight some of the issues that were discussed in
framing the problem and in designing the data collection and
analysis. I’ll end with some thoughts about generalizing this
approach to other problems.
You can download an MP3 of this segment of the episode2 (it
is about 45 minutes long) or you can read the transcript of the
segment3 . If you’d prefer to stream the segment you can start
listening here4 .
1 http://nssdeviations.com/
2 https://www.dropbox.com/s/yajgbr25dbh20i0/NSSD%20Episode%2071%20Design%20Challenge.mp3?dl=0
3 https://drive.google.com/open?id=11dEhj-eoh8w13dS-mWvDMv7NKWXZcXMr
4 https://overcast.fm/+FMBuKdMEI/00:30
The Brief
The general goal was to learn more about the time it takes for
each of us to commute to work. Hilary lives in San Francisco
and I live in Baltimore, so the characteristics of our commutes
are very different. She walks and takes public transit; I drive
most days. We also wanted to discuss how we might collect
data on our commute times in a systematic, but not intrusive,
manner. When we originally discussed having this segment,
this vague description was about the level of specification that
we started with, so an initial major task was to
Right off the bat, Hilary notes that she doesn’t actually do this
commute as often as she’d thought. Between working from
home, taking care of chores in the morning, making stops on
the way, and walking/talking with friends, a lot of variation
can be introduced into the data.
I mention that “going to work” and “going home”, while both
can be thought of as commutes, are not the same thing and
that we might be interested in one more than the other. Hilary
agrees that they are different problems but they are both
potentially of interest.
Question/Intervention Duality
Then she describes how she can use Wi-Fi connections (and
dis-connections) to serve as surrogates for leaving and arriv-
ing.
What exactly are the data that we will be collecting? What are
the covariates that we need to help us understand and model
the commute times? Obvious candidates are
Hilary notes that from the start/end times we can get things
like day of the week and time of day (e.g. via the lubridate6
package). She also notes that her system doesn't exactly produce date/time data, but rather a text sentence that includes the date/time embedded within. Thankfully, that can be systematically dealt with using simple string processing functions.
6 https://cran.r-project.org/web/packages/lubridate/index.html
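As a minimal sketch of that kind of processing (the log sentence and its wording here are made up; only the lubridate and stringr calls are real), one might pull the embedded timestamp out of the text and then derive covariates like day of week and hour from it:

```r
library(lubridate)
library(stringr)

## Hypothetical log sentence; the real system's format may differ
log_line <- "Disconnected from home Wi-Fi at 2018-09-04 08:52:13"

## Extract the embedded timestamp with simple string processing
stamp <- ymd_hms(str_extract(log_line, "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"))

wday(stamp, label = TRUE)   # day-of-week covariate
hour(stamp)                 # time-of-day covariate
```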
A question arises about whether a separate variable should be
created to capture “special circumstances” while commuting.
In the data analysis, we may want to exclude days where
we know something special happened to make the commute
much longer than we might have expected (e.g. we happened
to see a friend along the way or we decided to stop at Wal-
greens). The question here is
Thinking about what the data will ultimately be used for raises
two important statistical considerations:
Hilary raises a hard truth, which is that not everyone gets the
same consideration when it comes to showing up on time. For
an important meeting, we might allow for “three standard
deviations” more than the mean travel time to ensure some
margin of safety for arriving on time. However, for a more
routine meeting, we might just provide for one standard de-
viation of travel time and let natural variation take its course
for better or for worse.
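A rough sketch of that rule of thumb (with made-up commute times, not data from the episode): the travel-time budget is just the mean plus some multiple of the standard deviation.

```r
## Hypothetical commute times in minutes
commute_min <- c(32, 29, 41, 35, 38, 30, 44, 33)

m <- mean(commute_min)
s <- sd(commute_min)

m + 1 * s   # budget for a routine meeting (one standard deviation)
m + 3 * s   # budget for an important meeting (three standard deviations)
```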
Towards the end I ask Hilary how much data is needed for this project. However, before asking I should have discussed
the nature of the study itself:
Hilary suggests that it is the latter and that she will simply
collect data and make decisions as she goes. However, it’s
clear that the time frame is not “forever” because the method
of data collection is not zero cost. Therefore, at some point the
costs of collecting data will likely be too great in light of any
perceived benefit.
Discussion
What have we learned from all of this? Most likely, the prob-
lem of estimating commute times is not relevant to everybody.
But I think there are aspects of the process described above
that illustrate how the data analytic process works before
data collection begins (yes, data analysis includes parts where
there are no data). These aspects can be lifted from this partic-
ular example and generalized to other data analyses. In this
section I will discuss some of these aspects and describe why
they may be relevant to other analyses.
Problem Framing
you. However, you may still be able to control when you leave
home.
With the question “How long does it take to commute by
Muni?” one might characterize the potential intervention as
“Taking Muni to work or not”. However, if Muni breaks down,
then that is out of your control and you simply cannot take
that choice. A more useful question then is “How long does it
take to commute when I choose to take Muni?” This difference
may seem subtle, but it does imply a different analysis and
is associated with a potential intervention that is completely
controllable. I may not be able to take Muni every day, but I can definitely choose to take it every day.
Sketch Models
Summary
I have spoken with people who argue that there is little in the
way of generalizable concepts in data analysis because every
data analysis is uniquely different from every other. How-
ever, I think this experience of observing myself talk with
Hilary about this small example suggests to me that there
are some general concepts. Things like gauging your personal
interest in the problem could be useful in managing potential
resources dedicated to an analysis, and I think considering
fixed and random variation is an important aspect of any data
analytic design or analysis. Finally, developing a sketch (sta-
tistical) model before the data are in hand can be useful for
setting expectations and for setting a benchmark for when to
be surprised or skeptical.
One problem with learning data analysis is that we rarely, as
students, get to observe the thought process that occurs at the
early stages. In part, that is why I think many call for more
experiential learning in data analysis, because the only way
to see the process is to do the process. But I think we could
invest more time and effort into recording some of these pro-
cesses, even in somewhat artificial situations like this one, in
order to abstract out any generalizable concepts and advice.
Such summaries and abstractions could serve as useful data
analysis texts, allowing people to grasp the general concepts
of analysis while using the time dedicated to experiential
learning for studying the unique details of their problem.
III Human Factors
16. Trustworthy Data Analysis
The success of a data analysis depends critically on the au-
dience. But why? A lot has to do with whether the audience
trusts the analysis as well as the person presenting the anal-
ysis. Almost all presentations are incomplete because for any
analysis of reasonable size, some details must be omitted for
the sake of clarity. A good presentation will have a struc-
tured narrative that will guide the presenter in choosing what
should be included and what should be omitted. However,
audiences will vary in their acceptance of that narrative and
will often want to know if other details exist to support it.
The Presentation
There are many questions that one might ask before one were
to place any trust in the results of this analysis. Here are just
a few:
One might think of other things to do, but the items listed
above are in direct response to the questions asked before.
1 https://simplystatistics.org/2018/05/24/context-compatibility-in-data-analysis/
Here we have
in the data, and you may want to just put that kind of informa-
tion up front in part A. If the audience likes to have a higher
level perspective of things, you can reserve the information
for part B.
Considering the audience is useful because it can often drive
you to do analyses that perhaps you hadn’t thought to do at
first. For example, if your boss always wants to see a sensitiv-
ity analysis, then it might be wise to do that and put the results
in part B, even if you don’t think it’s critically necessary or
if it’s tedious to present. On occasion, you might find that
the sensitivity analysis in fact sheds light on an unforeseen
aspect of the data. It would be nice if there were a “global
list of things to do in every analysis”, but there isn’t and even
if there were it would likely be too long to complete for any
specific analysis. So one way to optimize your approach is to
consider the audience and what they might want to see, and
to merge that with what you think is needed for the analysis.
If you are the audience, then considering the audience’s needs
is a relatively simple task. But often the audience will be
separate (thesis committee, journal reviewers/editors, confer-
ence attendees) and you will have to make your best effort
at guessing. If you have direct access to the audience, then
a simpler approach would be to just ask them. But this is a
potentially time-consuming task (depending on how long it
takes for them to respond) and may not be feasible in the time
frame allowed for the analysis.
At the end of the day, someone has to pay for data analysis,
and this person is the patron. This person might have gotten
a grant, or signed a customer, or simply identified a need and
the resources for doing the analysis. The key thing here is
that the patron provides the resources and determines the
tools available for analysis. Typically, the resources we are
concerned with are the time available to the analyst. The patron,
through the allocation of resources, controls the scope of the
analysis. If the patron needs the analysis tomorrow, the anal-
ysis is going to be different than if they need it in a month.
A bad relationship here can lead to mismatched expectations
between the patron and the analyst. Often the patron thinks
the analysis should take less time than it really does. Con-
versely, the analyst may be led to believe that the patron is
deliberately allocating fewer resources to the data analysis
because of other priorities. None of this is good, and the rela-
tionship between the two must be robust enough in order to
straighten out any disagreements or confusion.
The data analyst needs to find some way to assess the needs
and capabilities of the audience, because there is always an
audience2 . There will likely be many different ways to present
the results of an analysis and it is the analyst's job to figure out
what might be the best way to make the results acceptable to
2 https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/
Implications
Analysis Depreciation
maintenance and support costs were nominal and did not really
need to be paid for explicitly.
Fast forward to today and the economic model has not changed
but the “business” of academic research has. Now, every pub-
lication has data and code/software attached to it which come
with maintenance and support costs that can extend for a
substantial period into the future. While any given publica-
tion may not require significant maintenance and support,
the costs for an investigator’s publications in aggregate can
add up very quickly. Even a single paper that turns out to be
popular can take up a lot of time and energy.
If you play this movie to the end, it becomes soberingly clear
that reproducible research, from an economic stand point, is
not really sustainable. To see this, it might help to use an anal-
ogy from the business world. Most businesses have capital
costs, where they buy large expensive things – machinery,
buildings, etc. These things have a long life, but are thought to
degrade over time (accountants call it depreciation). As a re-
sult, most businesses have “maintenance capital expenditure”
costs that they report to show how much money they are in-
vesting every quarter to keep their equipment/buildings/etc.
in shape. In this context, the capital expenditure is worth
it because every new building or machine that is purchased
is designed to ultimately produce more revenue. As long as
the revenue generated exceeds the cost of maintenance, the
capital costs are worth it (not to oversimplify or anything!).
In academia, each new publication incurs some maintenance
and support costs to ensure reproducibility (the “capital ex-
penditure” here) but it’s unclear how much each new publi-
cation brings in more “revenue” to offset those costs. Sure,
more publications allow one to expand the lab or get more
grant funding or hire more students/postdocs, but I wouldn’t
say that’s universally true. Some fields are just constrained by
how much total funding there is and so the available funding
cannot really be increased by “reaching more customers”.
Given that the budgets for funding agencies (at least in the
U.S.) have barely kept up with inflation and the number of
publications increases every year, it seems the goal of making
Other Theories
There are other kinds of theory and often their role is not
to make general statements about the natural world. Rather,
their goal is to provide quasi-general summaries of what is
commonly done, or what might be typical. So instead of mak-
ing statements along the lines of “X is true”, the aim is to make
statements like “X is most common”. Often, those statements
can be made because there is a written record of what was
done in the past and the practitioners in the area have a
collective memory of what works and what doesn’t.
On War
On Music
Music theory does not tell us what sounds good versus bad but rather tells us what is com-
monly done versus not commonly done. One could infer that
things that are commonly done are therefore good, but that
would be an individual judgment and not an inherent aspect
of the theory.
Knowledge of music theory is useful if only because it pro-
vides an efficient summary of what is expected. You can’t
break the rules if you don’t know what the rules are. Creating
things that are surprising or unexpected or interesting relies
critically on knowing what your audience is expecting to hear.
The reason why Schoenberg’s “atonal” style of music sounds
so different is that his audiences were expecting music
written in the more common tonal style. Sometimes, we can
rely on music theory to help us avoid a specific chord progres-
sion (e.g. parallel fifths) because that “sounds bad”, but what
we really mean is that such a pattern is not commonly used
and is perhaps unexpected. So if you’re going to do it, feel free,
but it should be for a good reason.
General Statements
The idea of reproducible research has developed over more than 20 years and has only grown
in importance. With the increase in computational power and
data collection technology, it is essentially impossible to rely
on written representations of data analysis. The only way to
truly know what has happened to produce a result is to look
at the code and perhaps run it yourself.
When I wrote my first paper on reproducibility in 2006, the
reaction was hardly one of universal agreement. But today, I
think many would see the statement above as true. What has
changed? Mostly, data analysts in the field have gained sub-
stantially more experience with complex data analyses and
have increasingly been bitten by the non-reproducibility of
certain analyses. With experience, both good and bad, we can
come to an understanding of what works and what doesn’t.
Reproducibility works as a mechanism to communicate what
was done, it isn’t too burdensome if it’s considered from the
beginning of an analysis, and as a by-product it can make data
available to others for different uses.
There is no need for a new data analyst to learn about re-
producibility “from experience”. We don’t need to lead a ju-
nior data analyst down a months-long winding path of non-
reproducible analyses until they are finally bitten by non-
reproducibility (and therefore “learn their lesson”). We can
just tell them
Theory Principles
4 https://simplystatistics.org/2018/06/04/trustworthy-data-analysis/
Now that time has passed and I’ve had an opportunity to see
what’s going on in the world of data science, what I think
about good data scientists, and what seems to make for good
data analysis, I have a few more ideas on what makes for a
good data scientist. In particular, I think there are broadly five
“tentpoles” for a good data scientist. Each tentpole represents
a major area of activity that will to some extent be applied in
any given data analysis.
When I ask myself the question “What is data science?” I tend
to think of the following five components. Data science is
Design Thinking
Workflows
Over the past 15 years or so, there has been a growing discus-
sion of the importance of good workflows in the data analysis
community. At this point, I’d say a critical job of a data sci-
entist is to develop and manage the workflows for a given
data problem. Most likely, it is the data scientist who will
be in a position to observe how the data flows through a
team or across different pieces of software, and so the data
scientist will know how best to manage these transitions. If a
data science problem is a systems problem, then the workflow
indicates how different pieces of the system talk to each other.
While the tools of data analytic workflow management are
constantly changing, the importance of the idea persists and
4 https://simplystatistics.org/2018/09/14/divergent-and-convergent-phases-of-data-analysis/
5 https://simplystatistics.org/2018/11/01/the-role-of-academia-in-data-science-education/
staying up-to-date with the best tools is a key part of the job.
In the scientific arena the end goal of good workflow man-
agement is often reproducibility of the scientific analysis. But
good workflow can also be critical for collaboration, team
management, and producing good science (as opposed to merely
reproducible science). Having a good workflow can also facili-
tate sharing of data or results, whether it’s with another team
at the company or with the public more generally, as in the
case of scientific results. Finally, being able to understand and
communicate how a given result has been generated through
the workflow can be of great importance when problems
occur and need to be debugged.
Human Relationships
7 https://simplystatistics.org/2018/06/18/the-role-of-resources-in-data-analysis/
8 https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/
9 https://simplystatistics.org/2018/04/30/relationships-in-data-analysis/
Statistical Methods
and to understand how the data are tied to the result. This
is a data analysis after all, and we should be able to see for
ourselves how the data inform the conclusion. As an audience
member in this situation, I’m not as interested in just trusting
the presenter and their conclusions.
Summary
A good data scientist can be hard to find, and part of the rea-
son is that being a good data scientist requires mastering
skills in a wide range of areas. However, these five tentpoles
are not haphazardly chosen; rather they reflect the interwo-
ven set of skills that are needed to solve complex data prob-
lems. Focusing on being good at these five tentpoles means
sacrificing time spent studying other things. To the extent that
we can coalesce around the idea of convincing people to do
exactly that, data science will become a distinct field with its
own identity and vision.
22. Generative and Analytical Models
Describing how a data analysis is created is a topic of keen
interest to me and there are a few different ways to think
about it. Two different ways of thinking about data analysis
are what I call the “generative” approach and the “analytical”
approach. Another, more informal, way that I like to think
about these approaches is as the “biological” model and the
“physician” model. Reading through the literature on the pro-
cess of data analysis, I’ve noticed that many seem to focus on
the former rather than the latter and I think that presents an
opportunity for new and interesting work.
Generative Model
Analytical Model
Stephanie Hicks and I have written about what the elements of a data analysis are, as well as what might be the prin-
ciples7 that guide the development of an analysis. In a sep-
arate paper8 , we describe and characterize the success of a
data analysis, based on a matching of principles between the
analyst and the audience. This is something I have touched
on previously, both in this book and on my podcast with
Hilary Parker9 , but in a generally more hand-wavey fashion.
Developing a more formal model, as Stephanie and I have
done here, has been useful and has provided some additional
insights.
For both the generative model and the analytical model of
data analysis, the missing ingredient was a clear definition of
what made a data analysis successful. The other side of that
coin, of course, is knowing when a data analysis has failed.
The analytical approach is useful because it allows us to sepa-
rate the analysis from the analyst and to categorize analyses
7 https://arxiv.org/abs/1903.07639
8 https://arxiv.org/abs/1904.11907
9 http://nssdeviations.com/