Intuitive Biostatistics
HARVEY MOTULSKY, M.D.
GraphPad Software, Inc.
FOURTH EDITION
New York Oxford
OXFORD UNIVERSITY PRESS
With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam
Printed by LSC Communications, Inc. United States of America
Intuitive Biostatistics is a beautiful book that has much to teach experimental bi-
ologists of all stripes. Unlike other statistics texts I have seen, it includes extensive
and carefully crafted discussions of the perils of multiple comparisons, warnings
about common and avoidable mistakes in data analysis, a review of the assump-
tions that apply to various tests, an emphasis on confidence intervals rather than
P values, explanations as to why the concept of statistical significance is rarely
needed in scientific work, and a clear explanation of nonlinear regression (com-
monly used in labs; rarely explained in statistics books).
In fact, I am so pleased with Intuitive Biostatistics that I decided to make it
the reference of choice for my postdoctoral associates and graduate students, all
of whom depend on statistics and most of whom need a closer awareness of pre-
cisely why. Motulsky has written thoughtfully, with compelling logic and wit. He
teaches by example what one may expect of statistical methods and, perhaps just
as important, what one may not expect of them. He is to be congratulated for this
work, which will surely be valuable and perhaps even transformative for many of
the scientists who read it.
—Bruce Beutler, 2011 Nobel Laureate, Physiology or Medicine
Director, Center for the Genetics of Host Defense
UT Southwestern Medical Center
I am entranced by the book. Statistics is a topic that is often difficult for many
scientists to fully appreciate. The writing style and explanations of Intuitive Bio-
statistics make the concepts accessible. I recommend this text to all researchers.
Thank you for writing it.
—Tim Bushnell, Director of Shared Resource Laboratories,
University of Rochester Medical Center
I read many statistics textbooks and have come across very few that actually ex-
plain statistical concepts well. Yours is a stand-out exception. In particular, I think
you’ve done an outstanding job of helping readers understand P values and confi-
dence intervals, and yours is one of the very first introductory textbooks to discuss
the crucial concept of false discovery rates. I have already recommended your text
to postgraduate students and postdoctoral researchers at my own institute.
—Rob Herbert
Neuroscience Research Australia
preface xxv
Who Is This Book For?
What Makes the Book Unique?
What's New?
Which Chapters Are Essential?
Who Helped?
Who Am I?
8. Types of Variables 75
Continuous Variables
Discrete Variables
Why It Matters?
Not Quite as Distinct as They Seem
Q&A
Chapter Summary
Terms Introduced in this Chapter
9. Quantifying Scatter 80
Interpreting a Standard Deviation
How it Works: Calculating SD
Why n – 1?
Situations in Which n Can Seem Ambiguous
PART J. APPENDICES 517
Appendix A: Statistics with GraphPad
Appendix B: Statistics with Excel
Appendix C: Statistics with R
Appendix D: Values of the t Distribution Needed to
Compute CIs
Appendix E: A Review of Logarithms
Appendix F: Choosing a Statistical Test
Appendix G: Problems and Answers
References 533
Index 548
If you think this book is too long, check out my other book, Essential Biostatistics,
which is about one-third the size and price of this one (Motulsky, 2015).
Statistical lingo
In Lewis Carroll’s Through the Looking Glass, Humpty Dumpty says, “When I use
a word, it means exactly what I say it means—neither more nor less” (Carroll,
1871). Lewis Carroll (the pseudonym for Charles Dodgson) was a mathematician,
and it almost seems he was thinking of statisticians when he wrote that line. But
that can’t be true because little statistical terminology had been invented by 1871.
Statistics books can get especially confusing when they use words and phrases
that have both ordinary meanings and technical meanings. The problem is that
you may think the author is using the ordinary meaning of a word or phrase,
when in fact the author is using that word or phrase as a technical term with a very
different meaning. I try hard to point out these potential ambiguities when I use
potentially confusing terms such as:
• Significant
• Error
• Hypothesis
• Model
• Power
• Variance
• Residual
• Normal
• Independent
• Sample
• Population
• Fit
• Confidence
• Distribution
• Control
• Outliers. Values far from the other values in a set are called outliers. Chap-
ter 25 explains how to think about outliers.
• Comparing the fit of alternative models. Statistical hypothesis testing is
usually viewed as a way to test a null hypothesis. Chapter 35 explains an
alternative way to view statistical hypothesis testing as a way to compare the
fits of alternative models.
• Meta-analysis as a way to reach conclusions by combining data from several
studies (Chapter 43).
• Detailed review of assumptions. All analyses are based on a set of assump-
tions, and many chapters discuss these assumptions in depth.
• Lengthy discussion of common mistakes in data analysis. Most chapters in-
clude lists (with explanations) of common mistakes and misunderstandings.
A unique organization
The organization of the book is unique.
Part A has three introductory chapters. Chapter 1 explains how common
sense can mislead us when thinking about probability and statistics. Chapter 2
briefly explains some of the complexities of dealing with probability. Chapter 3 ex-
plains the basic idea of statistics—to make general conclusions from limited data,
to extrapolate from sample to population.
Part B explains confidence intervals (CIs) in three contexts. Chapter 4 introduces
the concept of a CI in the context of CIs of a proportion. I think this is the simplest
example of a CI, since it requires no background information. Most books would start
with the CI of the mean, but this would require first explaining the Gaussian distribu-
tion, the standard deviation, and the difference between the standard deviation and
the standard error of the mean. CIs of proportions are much easier to understand.
Chapters 5 and 6 are short chapters that explain CIs of survival data and Poisson
(counted) data. Many instructors will choose to skip these two chapters.
Part C finally gets to continuous data, the concept with which most statis-
tics books start. The first three chapters are fairly conventional, explaining how to
graph continuous data, how to quantify variability, and the Gaussian distribution.
Chapter 11 is about lognormal distributions, which are common in biology but
are rarely explained in statistics texts. Chapter 12 explains the CI of the mean, and
Chapter 13 is an optional chapter that gives a taste of the theory behind CIs. Then
comes an important chapter (Chapter 14), which explains the different kinds of
error bars, emphasizing the difference between the standard deviation and the
standard error of the mean (which are frequently confused).
Part D is unconventional, as it explains the ideas of P values, statistical hy-
pothesis testing, and statistical power without explaining any statistical tests. I
think it is easier to learn the concepts of a P value and statistical significance apart
from the details of a particular test. This section also includes a chapter (unusual
for introductory books) on testing for equivalence.
Part E explains challenges in statistics. The first two chapters of this section
explain the problem of multiple comparisons. This is a huge challenge in data
analysis but a topic that is not covered by most introductory statistics books. The
next two chapters briefly explain the principles of normality and outlier tests,
topics that most statistics texts omit. Finally, Chapter 26 is an overview of deter-
mining necessary sample size.
Part F explains the basic statistical tests, including those that compare sur-
vival curves (an issue omitted from many introductory texts).
Part G is about fitting models to data. It begins, of course, with linear regres-
sion. Later chapters in the section explain the ideas of creating models and briefly
explain the ideas of nonlinear regression (a method used commonly in biological
research but omitted from most introductory texts), multiple regression, logistic
regression, and proportional hazards regression.
Part H contains miscellaneous chapters briefly introducing analysis of variance
(ANOVA), nonparametric methods, and sensitivity and specificity. It ends with a
chapter (Chapter 43) on meta-analysis, a topic covered by few introductory texts.
Part I tries to put it all together. Chapter 44 is a brief summary of the key
ideas of statistics. Chapter 45 is a much longer chapter explaining common traps
in data analysis, a reality check missing from most statistics texts. Chapter 46
goes through one example in detail as a review. Chapter 47 is a new (to this edi-
tion) chapter on statistical concepts you need to understand to follow the cur-
rent controversies about the lack of reproducibility of scientific works. Chapter
48, also new, is a checklist of things to think about when publishing (or review-
ing) statistical results.
If you don’t like the order of the chapters, read (or teach) them in a different
order. It is not essential that you read the chapters in order. Realistically, statistics
covers a lot of topics, and there is no ideal order. Every topic would be easier to
understand if you had learned something else first. There is no ideal linear path
through the material, and many of my chapters refer to later chapters. Some teach-
ers have told me that they have successfully presented the chapters in a very differ-
ent order than I present them.
WHAT’S NEW?
What was new in the second and third editions?
The second edition (published in 2010, 15 years after the first edition) was a com-
plete rewrite with new chapters, expanded coverage of some topics that were only
touched upon in the first edition, and a complete reorganization.
I substantially edited every chapter of the third edition and added new chap-
ters on probability, meta-analysis, and statistical traps to avoid. The third edition
introduced new sections in almost all chapters on common mistakes to avoid, sta-
tistical terms introduced in that chapter, and a chapter summary.
Overview of the fourth edition
In this fourth edition, I edited every chapter for clarity, to introduce new material,
and to improve the Q&A and Common Mistakes sections. I substantially rewrote
two chapters, Chapter 26 on sample size calculations and Chapter 28 about case-
control studies. I also added two new chapters. Chapter 47 discusses statistical
concepts regarding the reproducibility of scientific data. Chapter 48 is a set of
checklists to use when publishing or reviewing scientific papers.
List of new topics in the fourth edition
• Chapter 1. Two new sections were added to the list of ways that statistics
is not intuitive. One section points out that we don’t expect variability to
depend on sample size. The other points out that we let our biases deter-
mine how we interpret data.
• Chapter 2. New sections on conditional probability and likelihood. Updated
examples.
• Chapter 4. Begins with a new section to explain different kinds of variables.
New example (basketball) to replace a dated example about premature
babies. Added section on Bayesian credible intervals. Improved discussion
of “95% of what?” Took out rules of five and seven. Pie and stacked bar
graphs to display a proportion.
• Chapter 7. New Q&As. Violin plot.
• Chapter 9. How to interpret a SD when data are not Gaussian. Different
ways to report a mean and SD. How to handle data where you collect data
from both eyes (or ears, elbows, etc.) in each person.
• Chapter 11. Geometric SD factor. Mentions (in Q&As) that lognormal dis-
tributions are common (e.g., dB for sound, Richter scale for earthquakes).
Transforming to logs turns lognormal into Gaussian.
• Chapter 14. Error bars with lognormal data (geometric SD; CI of geometric
mean). How to abbreviate the standard error of mean (SEM and SE are both
used). Error bars with n = 2.
• Chapter 15. Stopped using the term assume with null hypothesis and in-
stead talk about “what if the null hypothesis were true?” Defines null versus
nil hypothesis. Manhattan plot. Advantage of CI over P. Cites the 2016
report about P values from the American Statistical Association.
• Chapter 16. Type S errors. What questions are answered by P values and CIs?
• Chapter 18. Added two examples and removed an outdated one (predni-
sone and hepatitis). Major rewrite.
• Chapter 19. Rewrote section on very high P values. Points out that a study
result can be consistent both with an effect existing and with it not existing.
• Chapter 20. Distinguishing power from beta and the false discovery rate.
When it makes sense to compute power.
• Chapter 21. Fixed 90% versus 95% confidence intervals. Two one-sided
t tests.
• Chapter 22. Introduces the phrase (used in physics) look elsewhere effect.
• Chapter 23. Two new ways to get trapped by multiple comparisons, the
garden of forking paths, and dichotomizing in multiple ways.
• Chapter 24. QQ plots. Corrected the explanation of kurtosis.
• Chapter 25. Points out that outlier has two meanings.
• Chapter 26. This chapter on sample size calculations has been entirely re-
written to clarify many topics.
• Chapter 28. This chapter on case-control studies has been substantially re-
written to clarify core concepts.
• Chapter 29. Improved definition of hazard ratio.
• Chapter 31. Added discussion of pros and cons of adjusting for pairing or
matching.
• Chapter 32. New common mistake pointed out that if you correlate a vari-
able A with the difference A − B, you expect r to be about 0.7 even if the
data are totally random. Points out that r is not a percentage.
• Chapter 33. Which variable is X, and which is Y? Misleading results if you
do one regression from data collected from two groups.
• Chapter 34. Defines the terms response variable and explanatory variable.
Discusses three distinct goals of regression.
• Chapter 39. Expanded discussion of two-way ANOVA with an example.
• Chapter 42. Removed discussion of LOD score. Added example for HIV
testing.
• Chapter 43. Added a discussion of meta-analyses using individual partici-
pant data, enlarged the discussion of funnel plots, added more Q&As.
• Chapter 45. New statistical traps: dichotomizing, confusing FDR with
significance level, finding small differences with lots of noise, overfitting,
pseudoreplication.
• Chapter 47. New chapter on reproducibility.
• Chapter 48. New chapter with checklists for reporting statistical methods.
WHO HELPED?
A huge thanks to the many people listed herein who reviewed draft chapters of the
fourth edition. Their contributions were huge and immensely improved this book:
Reviewers:
John D. Bonagura, The Ohio State University
Hwanseok Choi, University of Southern Mississippi
John D. Chovan, Otterbein University
Stacey S. Cofield, The University of Alabama at Birmingham
Jesse Dallery, University of Florida
Vincent A. DeBari, Seton Hall University
Heather J. Hoffman, George Washington University
Stefan Judex, Stony Brook University
Janet E. Kübler, California State University, Northridge
Huaizhen Qin, Tulane University
Zaina Qureshi, University of South Carolina
Emily Rollinson, Stony Brook University
Walter E. Schargel, The University of Texas at Arlington
Evelyn Schlenker, University of South Dakota Sanford School of Medicine
Guogen Shan, University of Nevada Las Vegas
William C. Wimley, Tulane University School of Medicine
Jun-Yen Yeh, Long Island University
Louis G. Zachos, University of Mississippi
Thanks also to the reviewers of past editions, as many of their ideas and correc-
tions survived into this edition: Raid Amin, Timothy Bell, Arthur Berg, Patrick
Breheny, Michael F. Cassidy, William M. Cook, Beth Dawson, Vincent DeBari,
Kathleen Engelmann, Lisa A. Gardner, William Greco, Dennis A. Johnston, Martin
Jones, Janet E. Kübler, Lee Limbird, Leonard C. Onyiah, Nancy
Ostiguy, Carol Paronis, Ann Schwartz, Manfred Stommel, Liansheng Tang, Wil-
liam C. Wimley, Gary Yellen, Jan Agosti, David Airey, William (Matt) Briggs, Peter
Chen, Cynthia J Coffman, Jacek Dmochowski, Jim Ebersole, Gregory Fant, Joe
Felsenstein, Harry Frank, Joshua French, Phillip Ganter, Cedric Garland, Steven
Grambow, John Hayes, Ed Jackson, Lawrence Kamin, Eliot Krause, James Leeper,
Yulan Liang, Longjian Liu, Lloyd Mancl, Sheniz Moonie, Arno Motulsky, Law-
rence “Doc” Muhlbaier, Pamela Ohman-Strickland, Lynn Price, Jeanette Ruby,
Soma Roychowdhury, Andrew Schaffner, Paige Searle, Christopher Sempos, Arti
Shankar, Patricia A Shewokis, Jennifer Shook, Sumihiro Suzuki, Jimmy Walker,
Paul Weiss, and Dustin White.
I would also like to express appreciation to everyone at Oxford University
Press: Jason Noe, Senior Editor; Andrew Heaton and Nina Rodriguez-Marty,
Editorial Assistants; Patrick Lynch, Editorial Director; John Challice, Publisher
and Vice President; Frank Mortimer, Director of Marketing; Lisa Grzan, Manager
In-House Production; Shelby Peak, Senior Production Editor; Michele Laseau, Art
Director; Bonni Leon-Berman, Senior Designer; Sarah Vogelsong, Copy Editor;
and Pamela Hanley, Production Editor.
WHO AM I?
After graduating from medical school and completing an internship in internal
medicine, I switched to research in receptor pharmacology (and published over 50
peer-reviewed articles). While I was on the faculty of the Department of Pharma-
cology at the University of California, San Diego, I was given the job of teaching
statistics to first-year medical students and (later) graduate students. The syllabi
for those courses grew into the first edition of this book.
I hated creating graphs by hand, so I created some programs to do it for
me! I also created some simple statistics programs after realizing that the exist-
ing statistical software, while great for statisticians, was overkill for most scien-
tists. These efforts constituted the beginnings of GraphPad Software, Inc., which
has been my full-time endeavor for many years (see Appendix A). In this role,
I exchange emails with students and scientists almost daily, which makes me
acutely aware of the many ways that statistical concepts can be confusing or
misunderstood.
I have organized this book in a unique way and have chosen an unusual set of
topics to include in an introductory text. However, none of the ideas are particu-
larly original. All the statistical concepts are standard and have been discussed in
many texts. I include references for some concepts that are not widely known, but
I don’t provide citations for methods that are in common usage.
Please email me with your comments, corrections, and suggestions for the
next edition. I’ll post errata at www.intuitivebiostatistics.com.
Harvey Motulsky
hmotulsky@graphpad.com
November 2016
Abbreviation     Definition                                 Chapter where defined
α (alpha)        Significance level                         16
ANOVA            Analysis of variance                       39
CI               Confidence interval                        4
CV               Coefficient of variation                   9
df               Degrees of freedom                         9
FDR              False discovery rate                       18
FPR              False positive rate                        18
FPRP             False positive reporting probability       18
n                Sample size                                4
OR               Odds ratio                                 28
SD or s          Standard deviation                         9
SE               Standard error                             14
SEM              Standard error of the mean                 14
p (lower case)   Proportion                                 4
P (upper case)   P value                                    15
r                Correlation coefficient                    32
ROC              Receiver operating characteristic curve    42
RR               Relative risk                              27
W                Margin of error                            4
Introducing Statistics
CHAPTER 1
Statistics and Probability Are Not Intuitive
If something has a 50% chance of happening, then 9 times out
of 10 it will.
YOGI BERRA
The word intuitive has two meanings. One is “easy to use and under-
stand,” which is my goal for this book—hence its title. The other
meaning is “instinctive, or acting on what one feels to be true even with-
out reason.” This fun (really!) chapter shows how our instincts often lead
us astray when dealing with probabilities.
WE TEND TO BE OVERCONFIDENT
How good are people at judging how confident they are? You can test your own
ability to quantify uncertainty using a test devised by Russo and Schoemaker
(1989). Answer each of these questions with a range. Pick a range that you think
has a 90% chance of containing the correct answer. Don’t use Google to find
the answer. Don’t give up and say you don’t know. Of course, you won’t know
the answers precisely! The goal is not to provide precise answers but rather to
quantify your uncertainty correctly and come up with ranges of values that
you think are 90% likely to include the true answer. If you have no idea, answer
with a super wide interval. For example, if you truly have no idea at all about
the answer to the first question, answer with the range zero to 120 years old,
which you can be 100% sure includes the true answer. But try to narrow your
responses to each of these questions to a range that you are 90% sure contains
the right answer:
• Martin Luther King Jr.’s age at death
• Length of the Nile River, in miles or kilometers
• Number of countries in OPEC
• Number of books in the Old Testament
• Diameter of the moon, in miles or kilometers
• Weight of an empty Boeing 747, in pounds or kilograms
• Year Mozart was born
• Gestation period of an Asian elephant, in days
• Distance from London to Tokyo, in miles or kilometers
• Deepest known point in the ocean, in miles or kilometers
Compare your answers with the correct answers listed at the end of this chap-
ter. If you were 90% sure of each answer, you would expect about nine intervals to
include the correct answer and one to exclude it.
Russo and Schoemaker (1989) tested more than 1,000 people and reported
that 99% of them were overconfident. The goal was to create ranges that in-
cluded the correct answer 90% of the time, but most people created narrow
ranges that included only 30% to 60% of the correct answers. Similar studies
have been done with experts estimating facts in their areas of expertise, and the
results have been similar.
Since we tend to be too sure of ourselves, scientists must use statistical meth-
ods to quantify confidence properly.
[Figure: a grid of randomly scattered X’s and dashes. Apparent clusters and streaks stand out even though the pattern is entirely random.]
important that we recognize this built-in mental bias. Statistical rigor is essential
to avoid being fooled by seeing apparent patterns among random data.
medical care in these tiny counties leads to higher rates of kidney cancer. But doesn’t
it seem strange that both the highest and the lowest incidences of kidney cancer
occur in counties with small populations?
The reason is simple, once you think about it. In large counties, there is
little variation around the average rate. Among small counties, however, there
is much more variability. Consider an extreme example of a tiny county with
only 1,000 residents. If no one in that county had kidney cancer, that county
would be among those with the lowest (zero) incidence of kidney cancer. But if
only one of those people had kidney cancer, that county would then be among
those with the highest rate of kidney cancer. In a really tiny county, it only takes
one case of kidney cancer to flip from having one of the lowest rates to having
one of the highest rates. In general, just by chance, the incidence rates will vary
much more among counties with tiny populations than among counties with large
populations. Therefore, counties with both the highest and the lowest incidences
of kidney cancer tend to have smaller populations than counties with average
incidences of kidney cancer.
Random variation can have a bigger effect on averages within small groups
than within large groups. This simple principle is logical, yet is not intuitive to
many people.
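To see this principle in action, here is a minimal simulation sketch (not from the book) in Python. It assumes a made-up true incidence of 1 case per 10,000 people that is identical in every county; only the county population differs.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_rates(population, true_rate=1e-4, n_counties=10_000):
    """Observed incidence in many simulated counties that all share
    the same true underlying rate; only chance makes them differ."""
    cases = rng.binomial(population, true_rate, size=n_counties)
    return cases / population

small = simulated_rates(1_000)      # tiny counties
large = simulated_rates(100_000)    # large counties
print(small.min(), small.max())     # wide spread: many zeros plus a few counties at 1-3 per 1,000
print(large.min(), large.max())     # narrow spread: rates hug the true 1 per 10,000
```

Both the highest and the lowest observed rates come from the small counties, exactly as described above.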
host chooses one of the other two doors to open and shows you that there is no car
behind it. He now offers you the chance to change your mind and choose the other
door (the one he has not opened).
Should you switch?
Before reading on, you should think about the problem and decide whether you
should switch. There are no tricks or traps. Exactly one door has the prize, all doors
appear identical, and the host—who knows which door leads to the new car—has a
perfect poker face and gives you no clues. There is never a car behind the door the
host chooses to open. Don’t cheat. Think it through before continuing.
When you first choose, there are three doors and each is equally likely to
have the car behind it. So your chance of picking the winning door is one-third.
Let’s separately think through the two cases: originally picking a winning door or
originally picking a losing door.
If you originally picked the winning door, then neither of the other doors has
a car behind it. The host opens one of these two doors. If you now switch doors,
you will have switched to the other losing door.
What happens if you originally picked a losing door? In this case, one of the
remaining doors has a car behind it and the other one doesn’t. The host knows
which is which. He opens the door without the car. If you now switch, you will
win the car.
Let’s recap. If you originally chose the correct door (an event that has a one-
third chance of occurring), then switching will make you lose. If you originally
picked either of the two losing doors (an event that has a two-thirds chance of oc-
curring), then switching will definitely make you win. Switching from one losing
door to the other losing door is impossible, because the host will have opened the
other losing door.
Your best choice is to switch! Of course, you can’t be absolutely sure that
switching doors will help. One-third of the time you will be switching away from
the prize. But the other two-thirds of the time you will be switching to the prize. If
you repeat the game many times, you will win twice as often by switching doors
every time. If you only get to play once, you have twice the chance of winning by
switching doors.
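If the argument still feels slippery, a short simulation settles it. The following Python sketch (not from the book) plays the game many times under the rules just described and tallies how often each strategy wins.

```python
import random

def monty_hall(switch, n_games=100_000, seed=1):
    """Fraction of games won when the contestant always switches (or always stays)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        car = rng.randrange(3)     # door hiding the car
        pick = rng.randrange(3)    # contestant's first pick
        # The host opens a losing door that is not the pick, so a switcher
        # wins exactly when the first pick was wrong.
        wins += (pick != car) if switch else (pick == car)
    return wins / n_games

print(monty_hall(switch=True))    # about 0.667
print(monty_hall(switch=False))   # about 0.333
```

Switching wins about two-thirds of the time, matching the reasoning above.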
Almost everyone (including mathematicians and statisticians) intui-
tively reaches the wrong conclusion and thinks that switching won’t be helpful
(Vos Savant, 1997).
Note that this study wasn’t really done to ask about the association between
astrological sign and disease. It was done to demonstrate the difficulty of inter-
preting statistical results when many comparisons are performed.
Chapters 22 and 23 explore multiple comparisons in more depth.
[Figure 1.1. Before and after values (plotted on a scale of roughly 80 to 140) for the simulated data discussed in the text; panels B and C show the subsets with the highest and the lowest before values.]
Now imagine the study was designed differently. You’ve made the before
measurements and want to test a treatment for high blood pressure. There is no
point in treating individuals whose blood pressure is not high, so you select the
people with the highest pressures to study. Figure 1.1B illustrates data for only
those 12 individuals with the highest before values. In every case but one, the after
values are lower. If you performed a statistical test (e.g., a paired t test; see Chap-
ter 31), the results would convince you that the treatment decreased blood pres-
sure. Figure 1.1C illustrates a similar phenomenon with the other 12 pairs, those
with low before values. In all but two cases, these values go up after treatment.
This evidence would convince you that the treatment increases blood pressure.
But these are random data! The before and after values came from the same
distribution. What happened?
This is an example of regression to the mean: the more extreme a variable is
upon its first measurement, the more likely it is to be closer to the average the second
time it is measured. People who are especially lucky at picking stocks one year are
likely to be less lucky the next year. People who get extremely high scores on one
exam are likely to get lower scores on a repeat exam. An athlete who does extremely
well in one season is likely to perform more poorly the next season. This probably
explains much of the Sports Illustrated cover jinx—many believe that appearing on
the cover of Sports Illustrated will bring an athlete bad luck (Wolff, 2002).
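The phenomenon is easy to reproduce with random numbers. Here is a minimal Python sketch (not from the book); the mean of 120 and SD of 10 are arbitrary made-up values, and the "before" and "after" values are drawn independently, so there is no treatment effect at all.

```python
import numpy as np

rng = np.random.default_rng(2)

# 24 simulated people; before and after values come from the same distribution.
before = rng.normal(120, 10, size=24)
after = rng.normal(120, 10, size=24)

highest = np.argsort(before)[-12:]   # the 12 people with the highest before values
print(before[highest].mean())        # well above 120
print(after[highest].mean())         # close to 120: an apparent "decrease" from noise alone
```

Selecting the subset with the lowest before values produces an apparent increase instead, just as in Figure 1.1C.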
Table 1.2. One of the tables that Kahan and colleagues used. After viewing this table,
people were asked to determine whether the hypothetical experiment showed that the
skin condition of people treated with the cream was more likely to “get better” or “get
worse” compared to those who were not treated.
Table 1.3. The other table that Kahan and colleagues used. After viewing this table,
people were asked to determine whether the made-up data show that cities that
enacted a ban on carrying concealed handguns were more likely to have an increase or
decrease in crime compared to cities that did not ban concealed handguns.
The experimental subjects were not asked for subtle interpretation of the data
but rather were simply asked whether or not the data support a particular hypoth-
esis. The math in Tables 1.2 and 1.3 is pretty straightforward:
• In the top row, the rash got better in 223/298 = 74.8% of the treated people,
and the crime rate went down in 74.8% of the cities that banned concealed
handguns.
• In the bottom row, the rash got better in 107/128 = 83.6% of the untreated
people, and the crime rate went down in 83.6% of the cities that did not ban
carrying concealed handguns.
• Most people had a decrease in rash, and most cities had a decrease in crime.
But did the intervention matter? Since 74.8% is less than 83.6%, the data
clearly show people who used the cream were less likely to have an improved
rash than people who did not use the cream. Cities that banned concealed hand-
guns had a smaller decrease in crime than cities that did not.
When the data were labeled to be about the effectiveness of a skin cream,
liberal Democrats and conservative Republicans (the two main political parties
in the United States) did about the same. But when the data were labeled to be
about the effectiveness of a gun safety policy, the results depended on political
orientation. Liberal Democrats tended to find that the data showed that gun safety
laws reduced crime. Conservatives tended to conclude the opposite (this study
was done in the United States, where conservatives tend to be against gun safety
legislation).
This study shows that when people have a preconceived notion about the
conclusion, they tend to interpret the data to support that conclusion.
CHAPTER SUMMARY
• Our brains do a really bad job of interpreting data. We see patterns in ran-
dom data, tend to be overconfident in our conclusions, and mangle inter-
pretations that involve combining probabilities.
• Our intuitions tend to lead us astray when interpreting probabilities and
when interpreting multiple comparisons.
• Statistical (and scientific) rigor is needed to avoid reaching invalid
conclusions.
CHAPTER 2
The Complexities of Probability
Statistical thinking will one day be as necessary for efficient
citizenship as the ability to read and write.
H. G. WELLS
BASICS OF PROBABILITY
Probabilities range from 0.0 to 1.0 (or 100%) and are used to quantify a prediction
about future events or the certainty of a belief. A probability of 0.0 means either
that an event can’t happen or that someone is absolutely sure that a statement is
wrong. A probability of 1.0 (or 100%) means that an event is certain to happen
or that someone is absolutely certain a statement is correct. A probability of 0.50
(or 50%) means that an event is equally likely to happen or not happen, or that
someone believes that a statement is equally likely to be true or false.
This chapter uses the terminology of Kruschke (2011) to distinguish two uses
of probability:
• Probability that is “out there,” or outside your head. This is probability as long-
term frequency. The probability that a certain event will happen has a definite
value, but we rarely have enough information to know that value with certainty.
• Probability that is inside your head. This is probability as strength of sub-
jective beliefs, so it may vary among people and even among different
assessments by the same person.
about probabilities is as the predictions of future events that are derived by using a
model. A model is a simplified description of a mechanism. For this example, we
can create the following simple model:
• Each ovum has an X chromosome, and none have a Y chromosome.
• Half the sperm have an X chromosome (but no Y) and half have a Y chro-
mosome (and no X).
• Only one sperm will fertilize the ovum.
• Each sperm has an equal chance of fertilizing the ovum.
• If the winning sperm has a Y chromosome, the fetus will have both an X
and a Y chromosome and so will be male. If the winning sperm has an X
chromosome, the fetus will have two X chromosomes and so will be female.
• Any miscarriage or abortion is equally likely to happen to male and female
fetuses.
If you assume that this model is true, then the predictions are easy to figure
out. Since all sperm have the same chance of being the one to fertilize the ovum
and since the sperm are equally divided between those with an X chromosome and
those with a Y chromosome, the chance that the fetus will have a Y chromosome is
50%. Thus, our model predicts that the chance that the fetus will be male is 0.50,
or 50%. In any particular group of babies, the fraction of boys might be more or
less than 50%. But in the long run, you’d expect 50% of the babies to be boys.
You can make predictions about the occurrence of future events from any
model, even if the model doesn’t reflect reality. Some models will prove to be
useful, and some won’t. The model described here is pretty close to being correct,
so its predictions are pretty useful but not perfect.
The reviews of this book are glowing and convince you that its premise is cor-
rect. Therefore, you plan to follow the book’s recommendations to increase your
chance of having a boy.
What is the chance that you’ll have a boy? If you have complete faith that
the method is correct, then you believe that the probability, as stated on the book
jacket, is 75%. You don’t have any data supporting this number, but you believe it
strongly. It is your firmly held subjective probability.
But what if you have doubts? You think the method probably works, and you
will quantify that by saying you think there is an 85% chance that the Shettles
method works (i.e., your chance of having a boy would be 75%). That leaves a
15% chance that the method is worthless and your chance of having a boy is
51.7%. What is your chance of having a boy? Calculate a weighted average of
the two predictions, weighting on your subjective assessment of the chance that
each theory is correct. So the chance is (0.850 × 0.750) + (0.150 × 0.517) =
0.715 = 71.5%.
Of course, different people have different beliefs about the efficacy of the
method. I didn’t look into the matter carefully, but I did spend a few minutes
searching and found no studies actually proving that the Shettles method works,
and one (in a prestigious journal) saying it doesn’t work (Wilcox et al., 1995).
Given that limited amount of research, I will quantify my beliefs by saying that
I believe there is only a 1% chance that the Shettles method works (and that your
chance of having a boy is 75%) and thus a 99% chance that the method does not
work (and so your chance of having a boy is 51.7%). The weighted average is thus
(0.010 × 0.750) + (0.990 × 0.517) = 0.519 = 51.9%.
Since you and I have different assessments of the probability of whether the
Shettles method works, my answer for the chance of you having a boy (51.9%)
doesn’t match your answer (71.5%). The calculations depend upon a subjective
assessment of the likelihood that the Shettles method works, so it isn’t surprising
that my answer and yours differ.
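The weighted-average calculation is simple enough to express as a one-line function. This Python sketch (not from the book) just restates the arithmetic above; the 75% and 51.7% figures come from the example.

```python
def chance_of_boy(p_method_works, p_boy_if_works=0.75, p_boy_otherwise=0.517):
    """Weighted average of the two predictions, weighted by how strongly
    you believe the Shettles method works."""
    return p_method_works * p_boy_if_works + (1 - p_method_works) * p_boy_otherwise

print(chance_of_boy(0.85))   # about 0.715, the reader's 71.5%
print(chance_of_boy(0.01))   # about 0.519, the author's 51.9%
```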
fetus is male and a 48.3% chance that it is female. That was your chance before
you got pregnant, so it still is your chance now.
But notice the change in perspective. Now you are no longer talking about
predicting the outcome of a random future event but rather are quantifying your
ignorance of an event that has already happened.
The other half of the physicians were given the same question but in natural
frequencies (whole numbers) rather than percentages:
Eight out of every 1,000 women will have breast cancer. Of those 8 women with
breast cancer, 7 will have a positive mammogram. Of the remaining 992 women
who don’t have breast cancer, about 70 will still have a positive mammogram.
Imagine a woman who has a positive mammogram. What is the probability that
she actually has breast cancer?
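Framed as natural frequencies, the calculation is a single division. A minimal Python sketch (not from the book), using the numbers in the question:

```python
true_positives = 7     # women with breast cancer and a positive mammogram
false_positives = 70   # women without breast cancer but with a positive mammogram

p_cancer_given_positive = true_positives / (true_positives + false_positives)
print(p_cancer_given_positive)   # about 0.09, i.e., roughly 9%
```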
• The fraction of studies performed under the null hypothesis in which the
P value is less than 0.05 is not the same as the fraction of studies with a
P value less than 0.05 for which the null hypothesis is true. (This will make
more sense once you read Chapter 15.)
Knowing how easy it is to mistakenly reverse a probability statement will
help you avoid that error.
LINGO
Probability vs. odds
So far, this chapter has quantified chance as probabilities. But it is also possible
to express the same values as odds. Odds and probability are two alternatives
for expressing precisely the same concept. Every probability can be expressed as
odds. Every odds can be expressed as a probability. Some scientific fields tend to
prefer using probabilities; other fields tend to favor odds. There is no consistent
advantage to using one or the other.
If you search for demographic information on the fraction of babies who are
boys, what you’ll find is the sex ratio. This term, as used by demographers, is the
ratio of males to females born. Worldwide, the sex ratio at birth in many countries
is about 1.07. Another way to say this is that the odds of having a boy versus a girl
are 1.07 to 1.00, or 107 to 100.
It is easy to convert the odds to a probability. If there are 107 boys born for
every 100 girls born, the chance that any particular baby will be male is 107/
(107 + 100) = 0.517, or 51.7%.
It is also easy to convert from a probability to odds. If the probability of
having a boy is 51.7%, then you expect 517 boys to be born for every 1,000 births.
Of these 1,000 births, you expect 517 boys and 483 girls (which is 1,000 − 517)
to be born. So the odds of having a boy versus a girl are 517/483 = 1.07 to 1.00.
The odds are defined as the probability that the event will occur divided by the
probability that the event will not occur.
Odds can be any positive value or zero, but they cannot be negative. A prob-
ability must be between zero and 1 if expressed as a fraction, or be between zero
and 100 if expressed as a percentage.
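The two conversions are each one line of arithmetic. Here is a minimal Python sketch (not from the book) using the sex-ratio example:

```python
def odds_to_prob(odds):
    """Convert odds (e.g., 1.07, meaning 107 to 100) to a probability."""
    return odds / (1 + odds)

def prob_to_odds(prob):
    """Convert a probability to odds."""
    return prob / (1 - prob)

print(odds_to_prob(1.07))    # about 0.517, the chance that a baby is a boy
print(prob_to_odds(0.517))   # about 1.07
```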
PROBABILITY IN STATISTICS
Earlier in this chapter, I pointed out that probability can be “out there” or “in
your head.” The rest of this book mostly explains confidence intervals (CIs) and
P values, which both use “out there” probabilities.

PROBABILITY            STATISTICS
General ➔ Specific     Specific ➔ General
Population ➔ Sample    Sample ➔ Population
Model ➔ Data           Data ➔ Model

This style of analyzing data
is called frequentist statistics. Prior beliefs, or prior data, never enter frequentist
calculations. Only the data from a current set are used as inputs when calculating
P values (see Chapter 15) or CIs (see Chapters 4 and 12). However, scientists often
account for prior data and theory when interpreting these results, a topic to which
we’ll return in Chapters 18, 19, and 42.
Many statisticians prefer an alternative approach called Bayesian statistics, in
which prior beliefs are quantified and used as part of the calculations. These prior
probabilities can be subjective (based on informed opinion), objective (based on
solid data or well-established theory), or uninformative (based on the belief that
all possibilities are equally likely). Bayesian calculations combine these prior
probabilities with the current data to compute probabilities and Bayesian CIs
called credible intervals. This book only briefly explains Bayesian statistics.
Q&A
CHAPTER SUMMARY
• Probability calculations can be confusing.
• There are two meanings of probability. One meaning is long-range fre-
quency that an event will happen. It is probability “out there.” The other
meaning is that probability quantifies how sure one is about the truth of a
proposition. This is probability “in your head.”
• All probability statements are based on a set of assumptions.
• It is impossible to understand a probability or frequency until you have
clearly defined both its numerator and its denominator.
• Probability calculations go from general to specific, from population to
sample, and from model to data. Statistical calculations work in the oppo-
site direction: from specific to general, from sample to population, and
from data to model.
• Frequentist statistical methods calculate probabilities from the data (P val-
ues, CIs). Your beliefs before collecting the data do not enter into the calcu-
lations, but they often are considered when interpreting the results.
• Bayesian statistical methods combine prior probabilities (which may be
either subjective or based on solid data) and experimental evidence as part
of the calculations. This book does not explain how these methods work.
CHAPTER 3
From Sample to Population
There is something fascinating about science. One gets such
a wholesale return of conjecture out of a trifling investment
of fact.
MARK TWAIN (LIFE ON THE MISSISSIPPI, 1883)
Before you can learn about statistics, you need to understand the big
picture. This chapter concisely explains what statistical calculations
are used for and what questions they can answer.
you have sampled a substantial fraction of the population (>10% or so), then you
must use special statistical methods designed for use when a large fraction of the
population is sampled.
The parameter is the half-life. In one half-life, the drug concentration goes down
to 50% of the starting value; in two half-lives, it goes down to 25%; in three half-
lives, it goes down to 12.5%; and so on.
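As a concrete illustration of a model defined by a single parameter, here is a minimal Python sketch (not from the book) of the half-life model just described, with an arbitrary starting concentration of 100:

```python
def concentration(c0, t, half_life):
    """Drug concentration after time t, assuming exponential decay
    governed by one parameter: the half-life."""
    return c0 * 0.5 ** (t / half_life)

print(concentration(100, 1, half_life=1))   # 50.0  (one half-life -> 50%)
print(concentration(100, 2, half_life=1))   # 25.0  (two half-lives -> 25%)
print(concentration(100, 3, half_life=1))   # 12.5  (three half-lives -> 12.5%)
```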
One goal of statistics is to analyze your data to make inferences about the
values of the parameters that define the model. Another goal is to compare alter-
native models to see which best explains the data. Chapters 34 and 35 discuss this
idea in more depth.
CHAPTER SUMMARY
• The goal of data analysis is simple: to make the strongest possible conclu-
sions from limited amounts of data.
• Statistics help you extrapolate from a particular set of data (your sample)
to make a more general conclusion (about the population). Statistical cal-
culations go from specific to general, from sample to population, from
data to model.
• Bias occurs when the experimental design is faulty, so that on average the
result is too high (or too low). Results are biased when the result computed
from the sample is not, in fact, the best possible estimate of the value in the
population.
• The results of statistical calculations are always expressed in terms of
probability.
• If your sample is the entire population, you only need statistical calculations
to extrapolate to an even larger population, other settings, or future times.
Introducing Confidence Intervals
CHAPTER 4
Confidence Interval of a Proportion
The first principle is that you must not fool yourself—and you
are the easiest person to fool.
RICHARD FEYNMAN
[Figure 4.1 panels: 20 free throws, each classified as Success or Missed (Total = 20).]
Figure 4.1. Two ways to plot data with two possible outcomes.
Figure 4.1 shows two ways of plotting these data. The pie chart on the left is
conventional, but a stacked bar graph is probably easier to read.
Note that there is no uncertainty about what we observed. We are absolutely
sure we counted the number of free throws and baskets correctly. Calculation of
a CI will not overcome any counting mistakes. What we don’t know is the overall
success rate of free throws for this player.
The CI for this example ranges from 63.1% to 95.6% (calculated using the
modified Wald method, explained later in this chapter). Sometimes this is written
as [63.1%, 95.6%]. You might get slightly different results depending on which
program you use.
95% CI, we accept a 5% chance that the true population value will not be included
in the range.
Before reading on, make your best guess for the 95% CI when your sample
has 100 people and 33 of those people have said they would vote for your candi-
date (so the proportion in the sample is 0.33, or 33%).
Note that there is no uncertainty about what we observed in the voter sample.
We are absolutely sure that 33.0% of the 100 people polled said they would vote
for our candidate. If we weren’t sure of that count, calculation of a CI could not
overcome any mistakes that were made in tabulating those numbers or any ambi-
guity in defining “vote.” What we don’t know is the proportion of favorable voters
in the entire population. Write down your guess before continuing.
One way to calculate the CI is explained later in this chapter. You can also use
free Web-based calculators to do the math. In this case, the 95% CI extends from
25% to 43%. Sometimes this is written as [25%, 43%]. You might get slightly dif-
ferent results depending on which algorithm your program uses. We can be 95%
confident that somewhere between 25% and 43% of the population you polled
preferred your candidate on the day the poll was conducted. The phrase “95%
confident” is explained in more detail in the following discussion.
Don’t get confused by the two different uses of percentages. The CI is for
95% confidence. That quantifies how sure you want to be. You could ask for a
99% CI if you wanted your range to have a higher chance of including the true
value. Each confidence limit (25% and 43%) represents the percentage of voters
who will vote for your candidate.
How good was your guess? Many people guess that the interval is narrower
than it is.
ASSUMPTIONS: CI OF A PROPORTION
The whole idea of a CI is to make a general conclusion from some specific data.
This generalization is only useful if certain assumptions are true. A huge part of
the task of learning statistics is to learn the assumptions upon which statistical
inferences are based.
was a second problem. Supporters of Landon were much more likely to return
replies to the poll than were supporters of Roosevelt (Squire, 1988). The poll
predicted that Landon would win by a large margin, but, in fact, Roosevelt won.
A simulation
To demonstrate this, let’s switch to a simulation in which you know exactly the
population from which the data were selected. Assume that you have a bowl of
100 balls, 25 of which are red and 75 of which are black. Mix the balls well and
choose one randomly. Put it back in, mix again, and choose another one. Repeat
this process 15 times, record the fraction of the sample that are red, and com-
pute the 95% CI for that proportion. Figure 4.2 shows the simulation of 20 such
samples. Each 95% CI is shown as a bar extending from the lower confidence
limit to the upper confidence limit. The value of the observed proportion in each
sample is shown as a line in the middle of each CI. The horizontal line shows the
true population value (25% of the balls are red). In about half of the samples, the
sample proportion is less than 25%, and in the other half the sample proportion is
higher than 25%. In one of the samples, the population value lies outside the 95%
CI. In the long run, given the assumptions previously listed, 1 of 20 (5%) of the
95% CIs will not include the population value, and that is exactly what we see in
this set of 20 simulated experiments.
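You can rerun this kind of simulation yourself. The Python sketch below (not from the book) repeats the ball-drawing experiment and counts how many of the twenty 95% CIs miss the true proportion of 25%; the CI is computed with one common form of the modified Wald method mentioned earlier in this chapter.

```python
import numpy as np

rng = np.random.default_rng(3)
z, true_p, n, n_experiments = 1.96, 0.25, 15, 20

misses = 0
for _ in range(n_experiments):
    reds = rng.binomial(n, true_p)          # red balls in one sample of 15
    p_adj = (reds + z**2 / 2) / (n + z**2)  # modified Wald adjusted proportion
    half = z * np.sqrt(p_adj * (1 - p_adj) / (n + z**2))
    if not (p_adj - half <= true_p <= p_adj + half):
        misses += 1

print(misses)   # typically 0, 1, or 2 of the 20 intervals miss the true value
```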
Figure 4.2. What would happen if you collected many samples and computed a 95% CI
for each?
Each bar shows one simulated experiment, indicating the proportion of red balls chosen
from a mixture of 25% red balls (the rest are black), with n = 15. The percentage success is
shown as a line near the middle of each bar, which extends from one 95% confidence limit
to the other. In all but one of the simulated samples, the CI includes the true population
proportion of red balls (shown as a horizontal line). The 95% CI of Sample 9, however, does
not include the true population value. You expect this to happen in 5% of samples. Because
Figure 4.2 shows the results of simulations, we know when the CI doesn’t include the true
population value. When analyzing data, however, the population value is unknown, so you
have no way of knowing whether the CI includes the true population value.
Figure 4.2 helps explain CIs, but you cannot create such a figure when you
analyze data. When you analyze data, you don’t know the actual population value.
You only have results from a single experiment. In the long run, 95% of such
intervals will contain the population value and 5% will not. But there is no way
you can know whether or not a particular 95% CI includes the population value
because you don’t know that population value (except when running simulations).
in November. Many voters changed their minds in the interim period. The 95% CI
computed from data collected in September was inappropriately used to predict
voting results two months later.
LINGO
CIs versus confidence limits
The two ends of the CI are called the confidence limits. The CI extends from one con-
fidence limit to the other. The CI is a range, whereas each confidence limit is a value.
Many scientists use the terms confidence interval and confidence limits inter-
changeably. Fortunately, mixing up these terms does not get in the way of under-
standing statistical results!
Estimate
The sample proportion is said to be a point estimate of the true population propor-
tion. The CI covers a range of values so is said to be an interval estimate.
Note that estimate has a special meaning in statistics. It does not mean an ap-
proximate calculation or an informed hunch, but rather is the result of a defined cal-
culation. The term estimate is used because the value computed from your sample
is only an estimate of the true value in the population (which you can’t know).
Confidence level
A 95% CI has a confidence level of 95%. If you generate a 99% CI, the confi-
dence level equals 99%. The term confidence level is used to describe the desired
amount of confidence. This is also called the coverage probability.
Uncertainty interval
Gelman (2010) thinks that the term confidence interval is too confusing and too
often misunderstood. He proposed using the term uncertainty interval instead. I
think that is a good idea, but this term is rarely used.
• The modified Wald method, developed by Agresti and Coull (1998), is quite
accurate and is also easy to compute by hand (see next section). Simula-
tions with many sets of data demonstrate that it works very well (Brown,
Cai, & DasGupta, 2001; Ludbrook & Lew, 2009).
You can find a Web calculator that computes both the modified Wald and the
Clopper–Pearson method at http://www.graphpad.com/quickcalcs/confInterval1/.
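Here is a minimal Python sketch (not from the book, and not necessarily the exact algorithm any particular program uses) of one common formulation of the modified Wald (Agresti–Coull) method: add z²/2 (about 2) to the numerator and z² (about 4) to the denominator, then apply the ordinary Wald formula to the adjusted proportion.

```python
import math

def modified_wald_ci(successes, total, z=1.96):
    """Approximate 95% CI of a proportion (modified Wald / Agresti-Coull)."""
    n_adj = total + z**2                     # adjusted sample size
    p_adj = (successes + z**2 / 2) / n_adj   # adjusted proportion
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

# The polling example from this chapter: 33 of 100 sampled voters.
print(modified_wald_ci(33, 100))   # roughly (0.25, 0.43)
```

Different programs use slightly different variants of this adjustment, which is why reported limits can differ in the last digit or two.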
If the numerator and denominator are equal, the sample proportion is 100%,
which is also the upper limit of the CI. The same logic applies, and there are two
ways to define the lower limit, mirroring the approach previously described to
calculate the upper limit.
about the population proportion. It can be (and is) used for any kind of data. Since
this kind of Bayesian reasoning is not standard in most fields of biology, this book
will only present this very brief introduction to this way of thinking.
Q&A
[Figures: CIs of a proportion (0% to 100%) plotted for 90%, 95%, and 99% confidence levels; for sample sizes of N = 10, 20, 40, 80, and 160; and for observed proportions of 98/100 and 8/10.]
CHAPTER SUMMARY
• This chapter discusses results that have two possible outcomes and are
summarized as a proportion. The proportion is computed by dividing the
number of times one outcome happened by the total number of trials.
• The binomial distribution predicts the distribution of results when you cre-
ate many random samples and know the overall proportion of the two pos-
sible outcomes.
• A fundamental concept in statistics is the use of a CI to analyze a single
sample of data to make conclusions about the population from which the
data were sampled.
• To compute a CI of a proportion, you only need the two numbers that form
its numerator and its denominator.
• Given a set of assumptions, 95% of CIs computed from a proportion will
include the true population value of that proportion. You’ll never know if a
particular CI is part of that 95% or not.
• There is nothing magic about 95%. CIs can be created for any degree of
confidence, but 95% is used most commonly.
• The width of a CI depends in part on sample size. The interval is narrower
when the sample size is larger.
CHAPTER 5
Confidence Interval of Survival Data
In the long run, we are all dead.
JOHN MAYNARD KEYNES
Outcomes that can only happen once (e.g., death) are often displayed
as graphs showing percentage survival as a function of time. This
chapter explains how survival curves are created and how to interpret
their confidence intervals (CIs). Even if you don’t use survival data, skim
this chapter to review the concept of CIs with a different kind of data.
This chapter explains how to analyze a single survival curve. Chapter 29
explains how to compare survival curves.
SURVIVAL DATA
The term survival curve is a bit limiting, because these kinds of graphs are used
to plot time to any well-defined end point or event. The event is often death, but
it could also be the time until occlusion of a vascular graft, first metastasis, or
rejection of a transplanted kidney. The same principles are used to plot time until
a lightbulb burns out, until a router needs to be rebooted, until a pipe leaks, and
so on. In these cases, the term failure time is used instead of survival time. The
event does not have to be dire. The event could be restoration of renal function,
discharge from a hospital, resolution of a cold, or graduation. The event must be a
one-time event. Special methods are needed when events can recur.
The methods described here (and in Chapter 29) apply when you know the
survival time of each subject (or know when the data were censored, as explained
in the next section). These methods are not appropriate for analyzing, for example,
the survival of thousands or millions of cells, because you don’t know the survival
time of each individual cell. Instead, you would simply plot the percentage survival
versus time and fit a curve to the data or connect the points with point-to-point lines.
Imagine a study of a cancer treatment that enrolled patients between 1995 and
2000 and ended in 2008. If a patient enrolled in 2000 and was still alive at the end
of the study, his survival time would be unknown but must exceed eight years.
Although the study lasted 13 years, the fate of that patient after year 8 is unknown.
During the study, imagine that some subjects dropped out—perhaps they moved
to a different city or wanted to take a medication disallowed on the protocol. If a
patient moved after two years in the study, her survival time would be unknown but
must exceed two years. Even if you knew how long she lived, you couldn’t use the
data, because she would no longer be taking the experimental drug. However, the
analysis must account for the fact that she lived at least two years on the protocol.
Information about these patients is said to be censored. The word censor has
a negative connotation. It sounds like the subject has done something bad. Not so.
It’s the data that have been censored, not the subject! These censored observations
should not be removed from the analyses—they must just be accounted for prop-
erly. The survival analysis must take into account how long each subject is known
to have been alive and following the experimental protocol, and it must not use
any information gathered after that time.
Table 5.1 presents data for a survival study that only includes seven patients.
That would be a very small study, but the size makes the example easy to follow.
The data for three patients are censored for three different reasons. One of the
censored observations is for someone who is still alive at the end of the study. We
don’t know how long he will live after that. Another person moved away from the
area and thus left the study protocol. Even if we knew how much longer she lived,
we couldn’t use the information, because she was no longer following the study
protocol. One person died in a car crash. Different investigators handle this kind of
situation differently. Some define a death to be a death, no matter what the cause
(termed all-cause mortality). Some investigators present the data two ways, first
using all subjects, called an intent-to-treat analysis, and then censoring subjects
who did not adhere to the full protocol, called an according-to-protocol analysis.
Here we will define a death from a clearly unrelated cause (such as a car crash) to
be a censored observation. We know he lived 3.67 years on the treatment and don’t
know how much longer he would have lived, because his life was cut short.
Table 5.2 demonstrates how these data are entered into a computer program
when using censoring. The codes for death (1) and censored (0) are commonly
used but are not completely standard.
Table 5.2. How the data of Table 5.1 are entered into a computer program.

YEARS   CODE
4.07    1
6.54    0
1.39    1
6.17    0
5.89    1
4.76    1
3.67    0
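If you want to see the arithmetic behind the staircase survival curve, here is a minimal Python sketch (mine, not the book's software; the variable names are my own) of the standard product-limit (Kaplan–Meier) calculation that survival programs typically apply to data entered this way.

```python
# A minimal sketch (not from the book) of the product-limit (Kaplan-Meier)
# calculation for the seven subjects in Table 5.2. Code 1 = death, 0 = censored.
years = [4.07, 6.54, 1.39, 6.17, 5.89, 4.76, 3.67]
codes = [1,    0,    1,    0,    1,    1,    0]

survival = 1.0                       # fraction surviving; starts at 100%
at_risk = len(years)                 # everyone is at risk at time zero
for time, died in sorted(zip(years, codes)):
    if died:
        # each death steps the curve down by (at_risk - 1) / at_risk
        survival *= (at_risk - 1) / at_risk
        print(f"{time:5.2f} y  death     surviving = {survival:.1%}")
    else:
        # a censored subject leaves the at-risk group without a step
        print(f"{time:5.2f} y  censored  surviving = {survival:.1%}")
    at_risk -= 1
```

Tied death times would need to be grouped into a single step, but none occur in this tiny example. Reading across at 50% reproduces the median survival of 5.89 years noted in the legend of Figure 5.1.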
Figure 5.1. Four versions of a survival curve created from the data in Tables 5.1 through 5.3.
The X-axis plots the number of years after each subject was enrolled. Note that time zero
does not have to be any particular day or year. The Y-axis plots percentage survival. The
three censored patients are shown as symbols in three of the graphs and as upward blips
in the fourth. Four of the subjects died. You can see each death as a downward step in the
curves. The 95% CIs for the populations are shown as error bars in the top two graphs and
as a shaded region (confidence bands) in the bottom-left graph. The bottom-right graph
shows the median survival time, which is the time at which half of the subjects have died
and half are still alive. Read across at 50% to determine median survival, which is 5.89 years
in this example.
Mean survival?
Mean survival is rarely computed. This is because calculating a mean survival
time requires that you know every single time of death. Thus, the mean survival
cannot be computed if any observations are censored or if any subjects are still
alive when the study ends.
In contrast, median survival can be computed when some observations are
censored and when the study ends before all subjects have died. Once half of the
subjects have died, the median survival is unambiguous, even without knowing
how long the others will live.
Five-year survival
Survival with cancer is often quantified as five-year survival. A vertical line is
drawn at X = 5 years, and the Y value that line intersects is the five-year percent-
age survival. Of course, there is nothing special about five years (rather than two
or four or six years) except tradition.
Sometimes a subject drops out of a study because he or she is too sick to continue, so the prognosis is related to the dropping out. Including these subjects (and censoring their data) violates the as-
sumption that censoring is unrelated to survival. But excluding these subjects en-
tirely can also lead to biased results. The best plan is to analyze the data both ways.
If the conclusions of the two analyses are similar, then the results are straightfor-
ward to interpret. If the conclusions of the two analyses differ substantially, then
the study results are simply ambiguous.
Q&A
Figure 5.2. A graph of percentage death, rather than percentage survival, illustrates the
same information.
When only a small percentage of subjects have died by the end of the study, this kind of
graph can be more informative (because the Y-axis doesn’t have to extend all the way to 100).
CHAPTER SUMMARY
• Survival analysis is used to analyze data for which the outcome is elapsed
time until a one-time event (often death) occurs.
• The tricky part of survival analysis is dealing with censored observations.
An observation is censored when the one-time event hasn’t happened by the
time the study ends, or when the patient stops following the study protocol.
• Survival curves can be plotted with confidence bands.
• A survival curve can be summarized by the median survival or the five-year
survival.
• Survival curves, and their confidence bands, can only be interpreted if you
accept a list of assumptions.
Confidence Interval of Counted Data
(Poisson Distribution)
Not everything that can be counted counts; and not everything
that counts can be counted.
ALBERT EINSTEIN
[Figure 6.1 appears here: two Poisson distributions, plotting relative frequency against the number of counts, with means of 1.6 counts/min (left) and 7.5 counts/min (right).]
The Poisson distribution in the left half of Figure 6.1 is notably asymmetrical. This makes sense, because the
number of radioactive background counts cannot be less than zero but has no
upper limit. The right half of Figure 6.1 shows the predictions of a Poisson dis-
tribution with an average of 7.5 counts per minute. This Poisson distribution is
nearly symmetrical and almost looks like a Gaussian distribution (see Chapter 10).
In one real example, the initial analysis of a NASA airline safety survey overestimated the number of near misses (“NASA Blows Millions on Flawed Airline Safety Survey,” 2007).
Raisins in bagels
Imagine that you carefully dissect a bagel and find 10 raisins. You assume that raisins
are randomly distributed among bagels, that the raisins don’t stick to each other to
create clumps (perhaps a dubious assumption), and that the average number of raisins
per bagel doesn’t change over time (the recipe is constant). The 95% CI (determined
using the Poisson distribution) ranges from 4.8 to 18.4 raisins per bagel. You can be
95% certain that range includes the overall average number of raisins per bagel.
Radioactive counts
Imagine that you have counted 120 radioactive counts in one minute. Radioactive
counts occur randomly and independently, and the average rate doesn’t change
(within a reasonable time frame much shorter than the isotope’s half-life). Thus,
the Poisson distribution is a reasonable model. The 95% CI for the average number
of counts per minute is 99.5 to 143.5.
Note that you must base this calculation on the actual number of radioactive
disintegrations counted. If you counted the tubes for 10 minutes, the calculation
must be based on the counts in 10 minutes and not on the calculated number of
counts in 1 minute. This point is explained in more detail later in this chapter.
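Both of these intervals can be reproduced with the widely used chi-square form of the exact Poisson CI. The sketch below is mine, not the book's; `poisson_ci` is a hypothetical helper name, and the code assumes SciPy is available.

```python
# A sketch of the exact Poisson CI via the standard chi-square relationship.
from scipy.stats import chi2

def poisson_ci(count, level=0.95):
    """CI for the average number of events, given the number actually counted."""
    alpha = 1 - level
    lower = 0.0 if count == 0 else 0.5 * chi2.ppf(alpha / 2, 2 * count)
    upper = 0.5 * chi2.ppf(1 - alpha / 2, 2 * count + 2)
    return lower, upper

print(poisson_ci(10))    # 10 raisins  -> about (4.8, 18.4) raisins per bagel
print(poisson_ci(120))   # 120 counts  -> about (99.5, 143.5) counts per minute
```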
Person-years
Exposure to an environmental toxin caused 1.6 deaths per 1,000 person-years
exposure. What is the 95% CI? To calculate the CI, you must know the exact
number of deaths that were observed in the study. This study observed 16 deaths in 10,000 person-years of observation.
Figure 6.2. The advantages of counting radioactive samples for a longer time interval.
One radioactive sample was repeatedly measured for 1-minute intervals (left) and 10-minute
intervals (right). The number of radioactive counts detected in the 10-minute samples was
divided by 10, so both parts of the graph show counts per minute. By counting longer, there
is less Poisson error. Thanks to Arthur Christopoulos for providing these data.
Divide both ends of the CI computed from the total 10-minute count by 10 to obtain the 95% CI for the average number of decays per minute, which ranges from 684 to 718. Counting for a longer period of time gives you a
more precise assessment of the average number of counts per interval, and thus
the CI is narrower.
Let’s revisit the example of raisins in bagels. Instead of counting raisins in
one bagel, imagine that you dissected seven individual bagels and found counts of
9, 7, 13, 12, 10, 9, and 10 raisins. In total, there were 70 raisins in seven bagels,
for an average of 10 raisins per bagel. The CI should be computed using the total
counted. Given 70 objects counted in a certain volume, the 95% CI for the average
number ranges from 54.57 to 88.44. This is the number of raisins per seven bagels.
Divide by seven to express these results in more useful units. The 95% CI ranges
from 7.8 to 12.6 raisins per bagel.
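The same chi-square-based calculation, applied to the total of 70 raisins and normalized only afterward, reproduces these numbers (again a hypothetical snippet of mine, assuming SciPy):

```python
from scipy.stats import chi2

lower = 0.5 * chi2.ppf(0.025, 2 * 70)        # 70 raisins actually counted
upper = 0.5 * chi2.ppf(0.975, 2 * 70 + 2)
print(lower, upper)              # about 54.6 to 88.4 raisins per seven bagels
print(lower / 7, upper / 7)      # about 7.8 to 12.6 raisins per bagel
```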
Q&A
CHAPTER SUMMARY
• When events occur randomly, independently of one another, and with an
average rate that doesn’t change over time, the number of events counted in
a certain time interval follows a Poisson distribution.
• When objects are randomly distributed and not clumped, the number of
objects counted in a certain volume follows a Poisson distribution.
• From the number of events actually observed (or the number of objects
actually counted), a CI can be computed for the average number of events
per unit of time (or number of objects per unit volume).
• When using a Poisson distribution to compute a CI, you must base the calcu-
lation on the actual number of objects (or events) counted. Any normalization
to more convenient units must be done after you have computed the CI.
Continuous Variables
CHAPTER 7
Graphing Continuous Data
When you can measure what you are speaking about and
express it in numbers, you know something about it; but when
you cannot measure it, when you cannot express it in numbers,
your knowledge is of the meager and unsatisfactory kind.
LORD KELVIN
CONTINUOUS DATA
When analyzing data, it is essential to choose methods appropriate for the kind of
data with which you are working. To highlight this point, this book began with dis-
cussions of three kinds of data. Chapter 4 discussed data expressed as two possible
outcomes, summarized as a proportion; Chapter 5 explained survival data and
how to account for censored observations; and Chapter 6 discussed data expressed
as the actual number of events counted in a certain time or as objects counted in
a certain volume.
This chapter begins our discussion of continuous data, such as blood pres-
sures, enzyme activity, weights, and temperature. In many scientific fields, con-
tinuous data are more common than other kinds of data.
Here are the twelve values of the smaller (n = 12) subset of the body temperature data, in °C: 37.0, 36.0, 37.1, 37.1, 36.2, 37.3, 36.8, 37.0, 36.3, 36.9, 36.7, 36.8.
Calculating an arithmetic mean or average is easy: add up all the values and
divide by the number of observations. The mean of the smaller (n = 12) subset is
36.77°C. If the data contain an outlier (a value far from the others), the mean won’t
be very representative (learn more about outliers in Chapter 25). For example, if
the largest value (37.1) were mistakenly entered into the computer as 371 (i.e.,
without the decimal point), the mean would increase to 64.6°C, which is larger
than all the other values.
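Here is a quick Python check of those two numbers (a sketch of mine, not from the book), using the twelve values listed above.

```python
# Verify the mean of the n = 12 subset and the effect of one data-entry error.
temps = [37.0, 36.0, 37.1, 37.1, 36.2, 37.3, 36.8, 37.0, 36.3, 36.9, 36.7, 36.8]

print(round(sum(temps) / len(temps), 2))      # 36.77

typo = temps.copy()
typo[typo.index(37.1)] = 371.0                # 37.1 mistyped as 371
print(round(sum(typo) / len(typo), 1))        # 64.6, larger than every real value
```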
The mean is one way to quantify the middle, or central tendency, of the data,
but it is not the only way. Here are some others:
• The median is the middle value. Rank the values from lowest to highest
and identify the middle one. If there is an even number of values, average
the two middle ones. For the n = 130 data set, the median is the average
of the 65th and 66th ranked values, or 36.85°C. Compared to the mean,
the median is not influenced by an outlier, and can be more useful with
skewed distributions. For example, the mean cost of a US wedding in 2011
was $27,021, but the median price was $16,886 (Oremus, 2013). A small
proportion of really expensive weddings bring up the mean cost but don’t
affect the median cost.
• The geometric mean is calculated by transforming all values into their loga-
rithms, computing the mean of these logarithms, and then taking the antilog
of that mean (logarithms and antilogarithms are reviewed in Appendix E).
Because a logarithm is only defined for values greater than zero, the geo-
metric mean cannot be calculated if any values are zero or negative. None
of the temperature values is zero or negative, so the geometric mean could
be computed. However, the geometric mean would not be useful with the
sample temperature data, because 0.0°C does not mean “no temperature” (temperature in °C is an interval variable, not a ratio variable; see Chapter 8).
Biological variability
Most of this variation is probably the result of biological variability. People (and
animals, and even cells) are different from one another, and these differences are
important! Moreover, people (and animals) vary over time because of changes in age, time of day, mood, and diet. In biological and clinical studies, much or most of the scatter is often caused by biological variation.
Precision
Compare repeated measurements to see how precise the measurements are. Pre-
cise means the same thing as repeatable or reproducible. A method is precise
when repeated measurements give very similar results. The variation among re-
peated measurements is sometimes called experimental error.
Many statistics books (especially those designed for engineers) implicitly
assume that most variability is the result of imprecision. In medical studies, biolog-
ical variation often contributes more variation than does experimental imprecision.
Bias
Biased measurements result from systematic errors. Bias can be caused by any
factor that consistently alters the results: the proverbial thumb on the scale, defec-
tive thermometers, bugs in computer programs (maybe the temperature was mea-
sured in centigrade and the program that converted the values to Fahrenheit was
buggy), the placebo effect, and so on. As used in statistics, the word bias refers to
anything that leads to systematic errors, not only the preconceived notions of the
experimenter. Biased data are not accurate.
Accuracy
A result is accurate when it is close to being correct. You can only know if a
value is correct, of course, if you can measure that same quantity using another
method known to be accurate. A set of measurements can be quite precise without
being accurate if the methodology is not working or is not calibrated properly. A
measurement, or a method to obtain measurements, can be accurate and precise,
accurate (on average) but not precise, precise but not accurate, or neither accurate
nor precise.
Error
In ordinary language, the word error means something that happened acciden-
tally. But in statistics, error is often used to refer to any source of variability, as a
synonym for scatter or variability.
PERCENTILES
You are probably familiar with the concept of a percentile. For example, a quarter (25%) of the values in a set of data are smaller than the 25th percentile, and 75% of the values are smaller than the 75th percentile (so 25% are larger).
The 50th percentile is identical to the median: it is the value in the middle. Half the values are larger than (or equal to) the median, and half are smaller than (or equal to) it. If there is an even number of values, then the median is the average of the middle two values.
Calculating the percentiles is trickier than you’d guess. Eight different equa-
tions can be used to compute percentiles (Harter, 1984). All methods compute
the same result for the median (50th percentile) but may not compute the same
results for other percentiles. With large data sets, however, the results are all
similar.
The 25th and 75th percentiles are called quartiles. The difference computed
as the 75th percentile minus the 25th percentile is called the interquartile range.
Half the values lie within this interquartile range.
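The sketch below (mine, not the book's; it assumes NumPy 1.22 or later, which exposes several of these percentile rules through the `method` argument) shows the effect on the twelve-value temperature subset.

```python
# Different percentile rules agree on the median but not on the quartiles.
import numpy as np

temps = np.array([37.0, 36.0, 37.1, 37.1, 36.2, 37.3,
                  36.8, 37.0, 36.3, 36.9, 36.7, 36.8])

for method in ("linear", "hazen", "weibull"):
    q25, q50, q75 = np.percentile(temps, [25, 50, 75], method=method)
    print(f"{method:8s} 25th={q25:.2f}  median={q50:.2f}  "
          f"75th={q75:.2f}  IQR={q75 - q25:.2f}")
```

Every rule returns a median of 36.85°C, but the 25th percentile ranges from about 36.4°C to 36.6°C depending on the rule chosen, which is why different programs can report slightly different quartiles for the same data.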
Box-and-whiskers plots
A box-and-whiskers plot gives you a good sense of the distribution of data without
showing every value (see Figure 7.2). Box-and-whiskers plots work great when
you have too many data points to show clearly on a column scatter graph but don’t
want to take the space to show a full frequency distribution.
A horizontal line marks the median (50th percentile) of each group. The
boxes extend from the 25th to the 75th percentiles and therefore contain half the
values. A quarter (25%) of the values are higher than the top of the box, and 25% of the values are below the bottom of the box.
Figure 7.2. Box-and-whiskers and violin plots.
(Left) Box-and-whiskers plots of the entire data set. The whiskers extend down to the 5th percentile and up to the 95th, with individual values showing beyond that. (Middle) The whiskers show the range of all the data. (Right) A violin plot of the same data.
The whiskers can be graphed in various ways. The first box-and-whiskers
plot in Figure 7.2 plots the whiskers down to the 5th and up to the 95th percentiles
and plots individual dots for values lower than the 5th and higher than the 95th
percentile. The other box-and-whiskers plot in Figure 7.2 plots the whiskers down
to the smallest value and up to the largest, and so it doesn’t plot any individual
points. Whiskers can be defined in other ways as well.
Violin plots
Figure 7.2 also shows a violin plot of the same data (Hintze & Nelson, 1998). The
median and quartiles are shown with black lines. The overall distribution is shown
by the violin-shaped gray area. The “violin” is thickest where there are the most
values and thinner where there are fewer values, so the shape of the “violin” gives
you a sense of the distribution of the values.
GRAPHING DISTRIBUTIONS
Frequency distributions
A frequency distribution lets you see the distribution of many values. Divide the
range of values into a set of smaller ranges (bins) and then graph the number of
values (or the fraction of values) in each bin. Figure 7.3 displays frequency distri-
bution graphs for the temperature data. If you add the height of all the bars, you’ll
get the total number of values. If the graph plots fractions or percentages instead of the number of values, then the sum of the heights of all bars will equal 1.0 or 100%.
Figure 7.3. Frequency distribution histograms of the temperature data with various bin widths.
With too few bins (top left), you don’t get a sense of how the values vary. With many bins (bottom), the graph shows too much detail for most purposes. The top-right graph shows the number of bins that seems about right. Each bar plots the number of individuals whose body temperature is in a defined range (bin). The centers of each range, the bin centers, are labeled.
The trick in constructing frequency distributions is deciding how wide to make
each bin. The three graphs in Figure 7.3 use different bin widths. The graph on the
top left has too few bins (each bin covers too wide a range of values), so it doesn’t
show you enough detail about how the data are distributed. The graph on the bottom
has too many bins (each bin covers too narrow a range of values), so it shows you
too much detail (in my opinion). The upper-right graph seems the most useful.
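A short sketch (mine, not the book's; the simulated values merely stand in for the real temperature data) shows how the choice of bin count changes a frequency distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
temps = rng.normal(36.8, 0.4, size=130)    # simulated stand-in for the 130 values

for bins in (3, 9, 45):                    # too few, about right, too many
    counts, edges = np.histogram(temps, bins=bins)
    width = edges[1] - edges[0]
    print(f"{bins:2d} bins (each {width:.2f} deg wide): {counts}")
```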
Watch out for the term histogram. It is usually defined as a frequency dis-
tribution plotted as a bar graph, as illustrated in Figure 7.3. Sometimes the term
histogram is used more generally to refer to any bar graph, even one that is not a
frequency distribution.
Cumulative frequency distributions
Figure 7.4 illustrates a cumulative frequency distribution of the temperature data. This kind of graph can be made without choosing a bin width.
Figure 7.5 shows the same graph, but the Y-axis plots the percentage (rather
than the number) of values less than or equal to each X value. The X value when
Y is 50% is the median. The right side of Figure 7.5 illustrates the same distri-
bution plotted with the Y-axis transformed in such a way that a cumulative dis-
tribution from a Gaussian distribution (see Chapter 10) becomes a straight line.
When graphing a cumulative frequency distribution, there is no need to
decide on the width of each bin. Instead, each value can be individually plotted.
[Figure 7.4 appears here: the cumulative number of people (out of 130) with body temperatures less than or equal to each value.]
[Figure 7.5 appears here: the same distribution plotted as cumulative percentages (left) and with a probability-scaled Y-axis on which a cumulative Gaussian distribution becomes a straight line (right); N = 130.]
Beware of smoothing
When plotting data that change over time, it is tempting to remove much of the
variability to make the overall trend more visible. This can be accomplished by
plotting a rolling average, also called a moving average or smoothed data. For
example, each point on the graph can be replaced by the average of it and the three
nearest neighbors on each side. The number of points averaged to smooth the data
can range from two to many. If more points are included in the rolling average (say
10 on each side, rather than 3), the curve will be smoother. Smoothing methods
differ in how neighboring points are weighted.
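For concreteness, here is a minimal rolling-average sketch (mine, not the book's); each smoothed point is the unweighted mean of a window of 2k + 1 neighboring points.

```python
import numpy as np

def rolling_average(y, k=3):
    """Average each point with its k nearest neighbors on each side."""
    window = np.ones(2 * k + 1) / (2 * k + 1)
    return np.convolve(y, window, mode="valid")   # shorter than y by 2*k points

rng = np.random.default_rng(1)
noisy = np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 0.3, 200)
smooth = rolling_average(noisy, k=3)    # larger k gives a smoother curve
```

Use the smoothed values for display only; as the next paragraph explains, they should not be fed into statistical calculations.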
Smoothed data should never be entered into statistical calculations. If you
enter smoothed data into statistics programs, the results reported by most statisti-
cal tests will be invalid. Smoothing removes information, so most analyses of
smoothed data are not useful.
Look ahead to Figure 33.4 to see an example of how analysis of smoothed
data can lead to a misleading conclusion.
Q&A
CHAPTER SUMMARY
• Many scientific variables are continuous.
• One way to summarize these values is to calculate the mean, median, mode, geometric mean, or harmonic mean.
• When graphing this kind of data, consider creating a graph that shows the
scatter of the data. Either show every value on a scatter plot or show the
distribution of values with a box-and-whiskers plot or a frequency distribu-
tion histogram.
• It is often useful to filter, adjust, smooth, or normalize the data before fur-
ther graphing and analysis. These methods can be abused. Think carefully
about whether these methods are being used effectively and honestly.
Types of Variables
Get your facts first, then you can distort them as you please.
MARK TWAIN
The past four chapters have discussed four kinds of data. This chapter
reviews the distinctions among different kinds of variables. Much
of this chapter is simply terminology, but these definitions commonly
appear in exam questions.
CONTINUOUS VARIABLES
Variables that can take on any value (including fractional values) are called con-
tinuous variables. The next six chapters deal with continuous variables. You need to
distinguish two kinds of continuous variables: interval variables and ratio variables.
Interval variables
Chapter 7 used body temperature in degrees centigrade for its examples. This kind
of continuous variable is termed an interval variable (but not a ratio variable). It
is an interval variable because a difference (interval) of 1°C means the same thing
all the way along the scale, no matter where you start.
Computing the difference between two values can make sense when using in-
terval variables. The 10°C difference between the temperatures of 100°C and 90°C
has the same meaning as the difference between the temperatures of 90°C and 80°C.
Calculating the ratio of two temperatures measured in this way is not useful.
The problem is that the definition of zero is arbitrary. A temperature of 0.0°C is
defined as the temperature at which water freezes and certainly does not mean
“no temperature.” A temperature of 0.0°F is a completely different temperature
(–17.8°C). Because the zero point is arbitrary (and doesn’t mean no temperature),
it would make no sense at all to compute ratios of temperatures. A temperature of
100°C is not twice as hot as 50°C.
Figure 8.1 illustrates average body temperatures of several species (Blumberg,
2004). The platypus has an average temperature of 30.5°C, whereas a canary has
an average temperature of 40.5°C. It is incorrect to say that a canary has a tem-
perature 33% higher than that of a platypus. If you did that same calculation using
degrees Fahrenheit, you’d get a different answer.
[Figure 8.1 appears here: the average body temperatures of the platypus, dog, human, and canary plotted three ways: (A) as bars starting at 0°C, (B) as bars starting at 25°C, and (C) as individual points.]
Figure 8.1A is misleading. The bars start at zero, inviting you to compare
their relative heights and think about the ratio of those heights. But that compari-
son is not useful. This graph is also not helpful because it is difficult to see the
differences between values.
Figure 8.1B uses a different baseline to demonstrate the differences. The bar
for the canary is about three times as high as the bar for the platypus, but this ratio
(indeed, any ratio) is not useful. Figure 8.1C illustrates the most informative way to
graph these values. The use of points, rather than bars, doesn’t suggest thinking in
terms of a ratio. A simple table might be better than any kind of graph for these values.
Ratio variables
With a ratio variable, zero is not arbitrary. Zero height is no height. Zero weight
is no weight. Zero enzyme activity is no enzyme activity. So height, weight, and
enzyme activity are ratio variables.
As the name suggests, it can make sense to compute the ratio of two ratio
variables. A weight of 4 grams is twice the weight of 2 grams, because weight is
a ratio variable. But a temperature of 100°C is not twice as hot as 50°C, because
temperature in degrees centigrade is not a ratio variable. Note, however, that tem-
perature in Kelvin is a ratio variable, because 0.0 in Kelvin really does mean (at
least to a physicist) no temperature. Temperatures in Kelvin are far removed from
temperatures that we ordinarily encounter, so they are rarely used in biology.
As with interval variables, you can compute the difference between two ratio variables. Unlike with interval variables, you can also calculate the ratio of two ratio variables.
DISCRETE VARIABLES
Variables that can only take on a limited set of possible values are called discrete variables.
Ordinal variables
An ordinal variable expresses rank. The order matters but not the exact value. For
example, pain is expressed on a scale of 1 to 10. A score of 7 means more pain than
a score of 5, which is more than a score of 3. But it doesn’t make sense to compute
the difference between two values, because the difference between 7 and 5 may
not be comparable to the difference between 5 and 3. The values simply express an
order. Another example would be movie or restaurant ratings from one to five stars.
WHY IT MATTERS?
Table 8.1 summarizes which kinds of calculations are meaningful with which
kinds of variables. It refers to the standard deviation and coefficient of variation,
both of which are explained in Chapter 9, as well as the standard error of the mean
(explained in Chapter 14).
Table 8.1. Calculations that are meaningful with various kinds of variables.
The standard deviation and coefficient of variation are explained in Chapter 9, and the standard error of the mean is explained in Chapter 14.
Q&A
CHAPTER SUMMARY
• Different kinds of variables require different kinds of analyses.
• Prior chapters have already discussed three kinds of data: proportions
(binomial variables), survival data, and counts.
• This chapter explains how continuous data can be subdivided into interval
and ratio (and ordinal) data.
• Interval variables are variables for which a certain difference (interval)
between two values is interpreted identically no matter where you start. For
example, the difference between zero and 1 is interpreted the same as the
difference between 999 and 1,000.
• Ratio variables are variables for which zero is not an arbitrary value. Weight
is a ratio variable because weight = 0 means there is no weight. Temperature
in Fahrenheit or centigrade is not a ratio variable, because 0° does not mean
there is no temperature.
• It does not make any sense to compute a coefficient of variation, or to com-
pute ratios, of continuous variables that are not ratio variables.
• The distinctions among the different kinds of variables are not quite as crisp
as they sound.
Quantifying Scatter
The average human has one breast and one testicle.
DES MCHALE
Figure 9.1. (Left) Each individual value. (Right) The mean and SD.
These steps can be shown as an equation, where $Y_i$ stands for one of the n values and $\bar{Y}$ is the mean:

$$\mathrm{SD} = \sqrt{\frac{\sum \left(Y_i - \bar{Y}\right)^2}{n - 1}} \qquad (9.1)$$
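As a quick check (a sketch of mine, not from the book), NumPy's std function exposes the denominator through its ddof argument; ddof=1 gives the sample SD defined by Equation 9.1.

```python
import numpy as np

temps = np.array([37.0, 36.0, 37.1, 37.1, 36.2, 37.3,
                  36.8, 37.0, 36.3, 36.9, 36.7, 36.8])

print(f"{np.std(temps, ddof=1):.2f}")   # 0.40  sample SD (divide by n - 1)
print(f"{np.std(temps, ddof=0):.2f}")   # 0.38  divide by n; slightly smaller
```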
WHY n – 1?
When calculating the SD, the sum of squares is divided by (n – 1). This is the
definition of the sample SD, which is the best possible estimate of the SD of the
entire population, as determined from one particular sample. Read on if you are
curious to know why the denominator is (n – 1) rather than n.
Imagine an experiment in which replicate measurements were made in each of three animals. The simplest way to analyze such data is to average the values from each
animal. For this example, there are three means (one for each animal). You would
then compute the SD (and CI) from those three means using n = 3. The results can
then be extrapolated to the population of animals you could have studied.
Representative experiments
Some investigators prefer to just show data from one “representative” experiment.
The error bars in the table or graph are calculated only from the replicates in that
one experiment. If the data were collected in triplicate, the graph might be labeled
“n = 3,” but, in fact, all the data come from one experiment, not three. It can be
useful to report the SD of replicate data within a single experiment as a way to
demonstrate the precision of the method and to spot experimental problems. But
substantive results should be reported with data from multiple experiments (Vaux,
Fidler, & Cumming, 2012).
Coefficient of variation
For ratio variables, variability can be quantified as the coefficient of variation
(CV), which equals the SD divided by the mean. If the CV equals 0.25, you know
that the SD is 25% of the mean.
Because the SD and the mean are both expressed in the same units, the CV is
a fraction with no units. Often the CV is expressed as a percentage.
For the preceding temperature example, the CV would be completely mean-
ingless. Temperature is an interval variable, not a ratio variable, because zero is
defined arbitrarily (see Chapter 8). A CV computed from temperatures measured
in degrees centigrade would not be the same as a CV computed from temperatures
measured in degrees Fahrenheit. Neither CV would be meaningful, because the
idea of dividing a measure of scatter by the mean only makes sense with ratio
variables, for which zero really means zero.
The CV is useful for comparing scatter of variables measured in different
units. You could ask, for example, whether the variation in pulse rate is greater
than or less than the variation in the concentration of serum sodium. The pulse rate
and sodium are measured in completely different units, so comparing their SDs
would make no sense. Comparing their coefficients of variation might be useful
to someone studying homeostasis.
Variance
The variance equals the SD squared and so is expressed in the same units as the
data but squared. In the body temperature example, the variance is 0.16°C squared.
Statistical theory is based on variances rather than SDs, so mathematical statisticians routinely think about variances. Scientists analyzing data rarely need to work with variances directly, not even when using analysis of variance (ANOVA), explained in Chapter 39.
Interquartile range
You probably are already familiar with the concept of percentiles. The 25th
percentile is a value below which 25% of the values in your data set lie. The
interquartile range is defined by subtracting the 25th percentile from the 75th
percentile. Because both percentile values are expressed in the same units as the
data, the interquartile range is also expressed in the same units.
For the body temperature data (n = 12 subset), the 25th percentile is 36.4°C
and the 75th percentile is 37.1°C, so the interquartile range is 0.7°C. For the full
(n = 130) data set, the 25th percentile is 36.6°C and the 75th percentile is 37.1°C,
so the interquartile range is 0.5°C.
Five-number summary
The distribution of a set of numbers can be summarized with five values, known
as the five-number summary: the minimum, the 25th percentile (first quartile), the
median, the 75th percentile (the third quartile), and the maximum.
Q&A
CHAPTER SUMMARY
• The most common way to quantify scatter is with a SD.
• A useful rule of thumb is that about two-thirds of the observations in a
population usually lie within the range defined by the mean minus 1 SD to
the mean plus 1 SD.
• Other variables used to quantify scatter are the variance (SD squared), the
CV (which equals the SD divided by the mean), the interquartile range, and
the median absolute deviation.
• While it is useful to quantify variation, it is often easiest to understand the
variability in a data set by seeing a graph of every data point or a graph of
the frequency distribution.
The Gaussian Distribution
Everybody believes in the normal approximation, the experi-
menters because they think it is a mathematical theorem, the
mathematicians because they think it is an experimental fact.
G. LIPPMAN
Variation in a clinical value might be caused by many genetic and environmental factors. When
scatter is the result of many independent additive causes, the distribution will tend
to follow a bell-shaped Gaussian distribution.
[A figure appears here: bell-shaped Gaussian distributions with the X-axis marked from −3 SD to +3 SD on either side of the mean.]
With the temperature data (a mean of about 36.8°C and an SD of about 0.4°C), you expect about two-thirds of the values to lie between 36.4°C and 37.2°C and about 95% of the values to lie between 36.0°C and 37.6°C. If you look back at Figure 9.1, you can
see that these estimates are fairly accurate.
$$z = \frac{\text{Value} - \text{Mean}}{\mathrm{SD}}$$
The variable z is the number of SD away from the mean. When z = 1, a
value is 1 SD above the mean. When z = −2, a value is 2 SD below the mean.
Table 10.1 tabulates the fraction of a normal distribution between −z and +z for
various values of z.
If you work in pharmacology, don’t confuse this use of the variable z with the
specialized Z-factor used to assess the quality of an assay used to screen drugs
(Zhang, Chung, & Oldenburg, 1999). The two are not related.
Table 10.1.

z       Percentage of standard normal distribution between −z and +z
0.67    50.00
0.97    66.66
1.00    68.27
1.65    90.00
1.96    95.00
2.00    95.45
2.58    99.00
3.00    99.73
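The table can be reproduced (to within the rounding of the tabulated z values) from the cumulative normal distribution; this sketch is mine, not the book's, and assumes SciPy.

```python
from scipy.stats import norm

for z in (0.67, 1.00, 1.65, 1.96, 2.00, 2.58, 3.00):
    fraction = norm.cdf(z) - norm.cdf(-z)      # area between -z and +z
    print(f"z = {z:.2f} -> {100 * fraction:.2f}%")
```

The listed z values of 0.67, 0.97, and 1.65 are themselves rounded (more precisely about 0.674, 0.967, and 1.645), so the computed percentages for those rows differ slightly from the round numbers in the table.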
Q&A
CHAPTER SUMMARY
• The Gaussian bell-shaped distribution is the basis for much of statistics. It
arises when many random factors create variability.
• With a Gaussian distribution, about two-thirds of the values are within 1
SD of the mean, and about 95% of the values are within 2 SD of the mean.
• The Gaussian distribution is also called a normal distribution. But this use
of normal is very different than the usual use of that word to mean ordinary
or abundant.
• The central limit theorem explains why Gaussian distributions are central
to much of statistics. Basically, this theorem says that the distribution of
many sample means will tend to be Gaussian, even if the data are not sam-
pled from a Gaussian distribution.
The Lognormal Distribution
and Geometric Mean
42.7 percent of all statistics are made up on the spot.
STEVEN WRIGHT
[Figure 11.1 appears here: the same EC50 values plotted three ways: EC50 (nM) on a linear axis, log(EC50, nM) on a linear axis, and EC50 (nM) on a logarithmic axis.]
LOGARITHMS?
Logarithms? How did logarithms get involved? Briefly, it is because the logarithm
of the product of two values equals the sum of the logarithm of the first value plus
the logarithm of the second. So logarithms convert multiplicative scatter (lognor-
mal distribution) to additive scatter (Gaussian). Logarithms (and antilogarithms)
are reviewed in Appendix E.
If you transform each value sampled from a lognormal distribution to its
logarithm, the distribution becomes Gaussian. Thus, the logarithms of the values
follow a Gaussian distribution when the raw data are sampled from a lognormal
distribution.
The middle graph in Figure 11.1 plots the logarithm of the EC50 values. Note
that the distribution is symmetrical. The graph on the right illustrates an alterna-
tive way to plot the data. The axis has a logarithmic scale. Note that every major
tick on the axis represents a value 10 times higher than the previous value. The
distribution of data points is identical to that in the middle graph, but the graph on
the right is easier to comprehend because the Y values are labeled in natural units
of the data, rather than logarithms.
If data are sampled from a lognormal distribution, convert to logarithms
before using a statistical test that assumes sampling from a Gaussian distribution,
such as a t test or ANOVA.
GEOMETRIC MEAN
The mean of the data from Figure 11.1 is 1,333 nM. This is illustrated as a hori-
zontal line in the left panel of Figure 11.1. The mean is larger than all but one of
the values, so it is not a good measure of the central tendency of the data.
The middle panel plots the logarithms of the values on a linear scale. The
horizontal line is at the mean of the logarithms, 2.71. About half of the values are
higher and half are smaller.
The right panel in Figure 11.1 uses a logarithmic axis. The values are
the same as those of the graph on the left, but the spacing of the values
on the axis is logarithmic. The horizontal line is at the antilog of the mean of
the logarithms. This graph uses logarithms base 10, so the antilog is computed
by calculating 102.71, which equals 513. The value 513 nM is called the geomet-
ric mean (GM).
To compute a GM, first transform all the values to their logarithms and then
calculate the mean of those logarithms. Finally, transform that mean of the loga-
rithms back to the original units of the data.
GEOMETRIC SD
To calculate the GM, we first computed the mean of the logarithms and then
computed the antilogarithm (power of 10) of that mean. Similar steps compute
the geometric standard deviation. First compute the standard deviation of the
logarithms, which for the example in Figure 11.1 equals 0.632. Then take the
antilogarithm (10 to the power) of that value, which is 4.29. That is the geometric
standard deviation.
When interpreting the SD of values from a Gaussian distribution, you expect about two-thirds of the values to lie in the range that goes from the mean minus SD to the mean plus SD. Remember that logarithms essentially convert
multiplication to addition, so the geometric SD must be multiplied times, or
divided into, the GM.
The GM equals 513 nM and the geometric SD is 4.29. You expect two-thirds
of the values in this distribution to lie in the range 513/4.29 to 513*4.29, which
is 120 nM to 2201 nM. This range will appear symmetrical when plotted on a
logarithmic axis but asymmetrical when plotted on a linear axis. You’ll see this in
Figure 14.3 (in the chapter on error bars).
The geometric SD has no units. Because the geometric SD is multiplied by or
divided into the GM, it is sometimes called the geometric SD factor. The geomet-
ric SD was defined by Kirkwood (1979) and is not commonly used (but should
be). Limpert and Stahel (2011) propose reporting the GM and geometric standard
deviation factor (GSD) as GM ×/ GSD, read as “times or divided by”. This con-
trasts with mean ± SD, read as “plus or minus”.
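Here is the whole recipe as a short sketch (mine, not the book's; the concentrations below are made up simply to illustrate the steps).

```python
import numpy as np

values = np.array([120.0, 350.0, 510.0, 900.0, 2400.0])   # hypothetical EC50s, nM

logs = np.log10(values)
gm  = 10 ** logs.mean()         # geometric mean, in the same units as the data
gsd = 10 ** logs.std(ddof=1)    # geometric SD factor, unitless

# Roughly two-thirds of a lognormal population lies between GM/GSD and GM*GSD.
print(f"GM = {gm:.0f} nM  */  GSD = {gsd:.2f}")
print(f"two-thirds range: {gm / gsd:.0f} to {gm * gsd:.0f} nM")
```

For the chapter's example, the same recipe applied to logarithms with a mean of 2.71 and an SD of 0.632 returns 10^2.71 ≈ 513 nM and 10^0.632 ≈ 4.29, the values quoted above.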
Q&A
CHAPTER SUMMARY
• Lognormal distributions are very common in many fields of science.
• Lognormal distributions arise when multiple random factors are multi-
plied together to determine the value. This is common in biology. In con-
trast, Gaussian distributions arise when multiple random factors are added
together.
• Lognormal distributions have a long right tail (are said to be skewed to the
right).
• The center of a lognormal distribution is quantified with a geometric mean,
measured in the same units as the data.
• The variation of a lognormal distribution is quantified with a standard devi-
ation factor, a unitless value.
• With data sampled from a Gaussian distribution, you think about the mean
plus or minus the SD. With data sampled from a lognormal distribution,
you think about the geometric mean multiplied or divided by the geometric
SD factor.
• You may get misleading results if you make the common mistake of choos-
ing analyses that assume sampling from a Gaussian distribution when in
fact your data are actually sampled from a lognormal distribution.
• In most cases, the best way to analyze lognormal data is to take the loga-
rithm of each value and then analyze those logarithms.
Confidence Interval of a Mean
It is easy to lie with statistics. It is hard to tell the truth
without it.
ANDREJS DUNKELS
INTERPRETING A CI OF A MEAN
For our ongoing n = 130 body temperature example (see Figure 7.1), any sta-
tistics program will calculate that the 95% CI of the mean ranges from 36.75˚C
to 36.89˚C. For the smaller n = 12 subset, the 95% CI of the mean ranges from
36.51˚C to 37.02˚C.
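Here is how that calculation looks as a short sketch (mine, not the book's software), using the t distribution from SciPy and the twelve-value subset listed in Chapter 7.

```python
import numpy as np
from scipy import stats

temps = np.array([37.0, 36.0, 37.1, 37.1, 36.2, 37.3,
                  36.8, 37.0, 36.3, 36.9, 36.7, 36.8])

m     = temps.mean()
sem   = stats.sem(temps)                         # SD / sqrt(n), with ddof = 1
tstar = stats.t.ppf(0.975, df=len(temps) - 1)    # 2.201 for 11 degrees of freedom

print(m - tstar * sem, m + tstar * sem)          # about 36.51 to 37.02
```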
Note that there is no uncertainty about the sample mean. We are 100% sure
that we have calculated the sample mean correctly. Any errors in recording the
data or computing the mean will not be accounted for in computing the CI of the
mean. By definition, the CI is always centered on the sample mean. The popula-
tion mean is not known and can’t be known. However, given some assumptions
outlined in the following discussion, we can be 95% sure that the calculated in-
terval contains it.
What exactly does it mean to be “95% sure”? When you have measured only
one sample, you don’t know the value of the population mean. The population mean
either lies within the 95% CI or it doesn’t. You don’t know, and there is no way to
find out. If you calculate a 95% CI from many independent samples, the population
mean will be included in the CI in 95% of the samples but will be outside of the CI
in the other 5% of the samples. Using data from one sample, therefore, you can say
that you are 95% confident that the 95% CI includes the population mean.
The correct syntax is to express the CI as “36.75 to 36.89” or as
“[36.75, 36.89].” It is considered bad form to express the CI as “36.75–36.89,”
because the hyphen would be confusing when the values are negative. Although
it seems sensible to express the CI as “36.82 ± 0.07,” that format is rarely used.
Figure 12.1. The 95% CI does not contain 95% of the values, especially when the sample
size is large.
ASSUMPTIONS: CI OF A MEAN
To interpret a CI of a mean, you must accept the following assumptions.
With the n = 12 sample, a 90% CI of the mean ranges from 36.56°C to 36.98°C. There is a 5% chance that the upper limit (36.98) is less than the population mean and a 5% chance that the lower limit (36.56) is greater than the population mean. That leaves a 90% (100% minus 5% minus 5%) chance that the interval 36.56°C to 36.98°C includes the true population mean.
But what if we are clinically interested only in fevers and want to know only
the upper confidence limit? Because there is a 5% chance that the population
mean is greater than 36.98°C (see previous paragraph), that leaves a 95% chance
that the true population mean is less than 36.98°C. This is a one-sided 95% CI.
More precisely, you could say that the range from minus infinity to 36.98°C is
95% likely to contain the population mean.
CI of the SD
A CI can be determined for nearly any value you calculate from a sample of data.
With the n = 12 body temperature data, the SD was 0.40°C. It isn’t done often, but
it is possible to compute a 95% CI of the SD itself. In this example, the 95% CI of
the SD extends from SD = 0.28°C to SD = 0.68°C (calculated using a Web-based
calculator at www.graphpad.com/quickcalcs/CISD1/).
The interpretation is straightforward. Given the same assumptions as those
used when interpreting the CI of the mean, we are 95% sure that the calculated
interval contains the true population SD. With the n = 130 body temperature data,
the SD is 0.41°C. Because the sample size is so much larger, the sample SD is a
more precise estimate of the population SD, and the CI is narrower, ranging from
SD = 0.37°C to SD = 0.47°C.
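The chi-square-based calculation behind such a CI can be sketched in a few lines (this is my sketch; the web calculator may differ in its details).

```python
import numpy as np
from scipy.stats import chi2

n, sd = 12, 0.40
df = n - 1
lower = sd * np.sqrt(df / chi2.ppf(0.975, df))
upper = sd * np.sqrt(df / chi2.ppf(0.025, df))
print(lower, upper)          # about 0.28 to 0.68, matching the interval above
```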
Table 12.2 presents the CI of a standard deviation for various sample sizes, expressed as multiples of the sample SD.

n       95% CI of SD
2       0.45 to 31.9 SD
3       0.52 to 6.29 SD
5       0.60 to 2.87 SD
10      0.69 to 1.83 SD
25      0.78 to 1.39 SD
50      0.84 to 1.25 SD
100     0.88 to 1.16 SD
500     0.94 to 1.07 SD
1,000   0.96 to 1.05 SD

CI of a geometric mean
Chapter 11 explained how to compute the geometric mean. This process is easily extended to computing the 95% CI of the geometric mean. The first step is to transform every value to its logarithm and compute the 95% CI of the mean of those logarithms. Then transform both confidence limits back (compute their antilogarithms) to the original units of the data; the result is the 95% CI of the geometric mean.
The margin of error (W) of the CI of a mean is computed from the critical value of the t distribution (t*), the sample SD (s), and the sample size (n):

$$W = \frac{t^* \cdot s}{\sqrt{n}}$$

Because the SEM equals s divided by the square root of n, this is equivalent to

$$W = t^* \cdot \mathrm{SEM}$$

The CI of the mean extends from the sample mean minus W to the sample mean plus W.
Q&A
Figure 12.2. If you want to be more confident that a CI contains the population
mean, you must make the interval wider.
The 99% interval is wider than the 95% interval, which is wider than the 90% CI.
CHAPTER SUMMARY
• A CI of the mean shows you how precisely you have determined the popu-
lation mean.
• If you compute 95% CIs from many samples, you expect that 95% will
include the true population mean and 5% will not. You’ll never know
whether a particular CI includes the population mean.
• The CI of a mean is computed from the sample mean, the sample SD, and
the sample size.
• The CI does not display the scatter of the data. In most cases, the majority
of the data values will lie outside the CI.
• If you desire more confidence (99% rather than 95%), the CI will be wider.
• Larger samples have narrower CIs than smaller samples with the same SD.
• Interpreting the CI of a mean requires accepting a list of assumptions.
The Theory of Confidence Intervals
Confidence is what you have before you understand the
problem.
WOODY ALLEN
Because these are simulated data from a hypothetical population, the value of
μ is known and constant, as is n. For each sample, compute m and s and then use
Equation 13.1 to compute t.
For each random sample, the sample mean is equally likely to be larger or
smaller than the population mean, so the t ratio is equally likely to be positive or
negative. It will usually be fairly close to zero, but it can also be far away. How
far? It depends on sample size (n), the SD (s), and chance. Figure 13.1 illustrates
the distribution of t computed for a sample size of 12. Of course, there is no need
to actually perform these simulations. Mathematical statisticians have derived the
distribution of t using calculus.
Sample size, n, is included in the equation that defines t, so you might expect
the distribution of t to be the same regardless of sample size. In fact, the t distribu-
tion depends on sample size. With small samples, the curve in Figure 13.1 will
be wider; with large samples, the curve will be narrower. With huge samples, the
curve in Figure 13.1 will become indistinguishable from the Gaussian distribution
shown in Figure 10.1.
The flip!
Here comes the most important step, in which the math is flipped around to
compute a CI.
[Figure 13.1 appears here: the distribution of the t ratio for samples of n = 12, with the horizontal axis (t) running from −4 to +4.]
In any one sample, the sample mean (m) and SD (s) are known, as is the
sample size (n). What is not known is the population mean. That is why we want
to generate a CI.
Rearrange Equation 13.1 to solve for μ (i.e., put μ on the left side of the
equation):
$$\mu = m \pm t^* \cdot \frac{s}{\sqrt{n}}$$
For the example, n = 12, m = 36.77°C, and s = 0.40°C. The t ratio has a 95%
probability of being between –2.201 and +2.201, so t* = 2.201. Plug in those
values and compute the population mean twice. The first time use the minus sign,
and the second time do the calculations again using the plus sign. The results are
the limits of the 95% CI, which ranges from 36.52°C to 37.02°C.
How it works
The t distribution is defined by assuming the population is Gaussian and investi-
gating the variation among the means of many samples. The math is then flipped to
make inferences about the population mean from the mean and SD of one sample.
This method uses probability calculations that are logically simple and flips
them to answer questions about data analysis. Depending on how you look at it, this
flip is either really obvious or deeply profound. Thinking about this kind of inverse
probability is tricky and has kept statisticians and philosophers busy for centuries.
Create 500 new pseudosamples by randomly choosing values, with replacement, from the original sample. For each of the 500 new samples, compute the mean (because we want to know the CI of the population mean). Next, determine the 2.5th and 97.5th percentiles
of that list of means. For this example, the 2.5th percentile is 36.55°C and the
97.5th percentile is 36.97°C. Because the difference between 97.5 and 2.5 is 95,
we can say that 95% of the means of the pseudosamples are between 36.55°C
and 36.97°C.
The flip!
Statistical conclusions require flipping the logic from the distribution of multiple
samples to the CI of the population mean. With the resampling approach, the flip
is simple. The range of values that contains 95% of the resampled means (36.55°C
to 36.97°C) is also the 95% CI of the population mean.
For this example, the resampled CI is nearly identical to the CI computed
using the conventional method (which assumes a Gaussian distribution). But the
resampling method does not require the assumption of a Gaussian distribution and
is more accurate when that assumption is not justified. The only assumption used
in the resampling approach is that the values in the sample vary independently and
are representative of the population.
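As a concrete sketch (mine, not the book's), the whole percentile-based resampling calculation fits in a few lines; the exact limits vary slightly from run to run because the pseudosamples are random.

```python
import numpy as np

rng = np.random.default_rng(0)
temps = np.array([37.0, 36.0, 37.1, 37.1, 36.2, 37.3,
                  36.8, 37.0, 36.3, 36.9, 36.7, 36.8])

boot_means = [rng.choice(temps, size=temps.size, replace=True).mean()
              for _ in range(500)]                 # 500 resampled pseudosamples
print(np.percentile(boot_means, [2.5, 97.5]))      # close to 36.55 and 36.97
```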
That’s it?
This resampling method seems too simple to be useful! It is surprising that ran-
domly resampling from the same set of values gives useful information about the
population from which that sample was drawn. But it does create useful results.
Plenty of theoretical and practical (simulation) work has validated this approach,
which some statisticians think should be widely used.
Learn more
To learn more about the resampling approach—also called the bootstrapping or
computer-intensive method—start by reading books by Wilcox (2010) and Manly
(2006). The method explained here is called a percentile-based resampling confi-
dence interval. Fancier methods create slightly more accurate CIs.
A huge advantage of the resampling approach is that it is so versatile. It can
be used to obtain the CI of the median, the interquartile range, or almost any
other parameter. Resampling approaches are extensively used in the analysis of
genomics data.
Figure 13.2 illustrates the answer. If 33% of the entire population of voters
is in favor of your candidate, then if you collect many samples of 100 voters, in
95% of those samples, between 24% and 42% of people will be in favor of your
candidate.
That range tells you about multiple samples from one population. We want
to know what can be inferred about the population from one sample. It turns out
that that same range is the CI. Given our one sample, we are 95% sure the true
population value is somewhere between 24% and 42%.
With continuous data, the resampling approach is more versatile than the t
distribution approach, because it doesn’t assume a Gaussian (or any other) distri-
bution. With binomial data, there is no real advantage to the resampling approach,
except that it is a bit easier to understand than the approach based on the binomial
distribution (see the next section).
[Figure 13.2 appears here: the distribution of sample proportions, with 2.5% of samples below 24% and 2.5% above 42%, so the 95% CI spans 24% to 42%.]
The lower 95% CI (L) can be determined by an indirect approach. We can use
probability equations to answer the question, If the population proportion equals
L, what is the probability that a sample proportion (n = 100) will equal 0.33 (as
we observed in our sample) or more? To compute a 95% CI, each tail of the dis-
tribution must be 2.5%, so we want to find the value of L that makes the answer
to that question 2.5%.
Determining the upper 95% confidence limit of a proportion (U) uses a simi-
lar approach. If the population proportion equals U, what is the probability that
a sample proportion (n = 100) will equal 0.33 or less? We want the answer to be
2.5% and solve for U.
Using the binomial distribution to solve for L and U is not a straightfor-
ward task. A brute force approach is to try lots of possible values of L or U to
find values for which the cumulative binomial distribution gives a result of 2.5%.
Excel’s Solver can automate this process (Winston, 2004).
For this example, L equals 0.24, or 24%. If fewer than 24% of voters were truly
in favor of your candidate, there would be less than a 2.5% chance of randomly
choosing 100 subjects of which 33% (or more) are in favor of your candidate.
For the example, U equals 0.42, or 42%. If more than 42% of voters really were
in favor of your candidate, there would be less than a 2.5% chance of randomly
choosing 100 subjects of which 33% (or less) were in favor of your candidate.
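A root finder automates the brute-force search just described. The sketch below is mine (not the book's Excel approach); it asks SciPy to find the values of L and U that make each tail probability equal 2.5%.

```python
from scipy.optimize import brentq
from scipy.stats import binom

n, k = 100, 33
# L: the proportion for which P(X >= 33) equals 2.5%
L = brentq(lambda p: binom.sf(k - 1, n, p) - 0.025, 1e-6, 1 - 1e-6)
# U: the proportion for which P(X <= 33) equals 2.5%
U = brentq(lambda p: binom.cdf(k, n, p) - 0.025, 1e-6, 1 - 1e-6)
print(L, U)    # roughly 0.24 and 0.43
```

Exact limits computed this way can differ by a percentage point or so from the rounded 24% and 42% used in the example, depending on the method and rounding.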
So far, we have been asking about which sample proportions are likely given
known population proportions (L and U). Now we want to make inferences about
the population from one sample. That requires a flip in reasoning. Here it comes.
Calculate 100% – 2.5% – 2.5%, which leaves 95%. Therefore, there is a 95%
chance that the true percentage of voters in favor of your candidate is between
24% and 42%.
Q&A
How can resampling from one sample tell you anything useful about the population
from which that sample was drawn?
It is amazing that resampling works. It is also called bootstrapping, which comes
from the phrase “pulling oneself up by the bootstrap,” which is impossible. It is
not obvious that bootstrap/resample methods produce valid inferences about
the population the sample was drawn from, but this has been proven by both
mathematical proofs and simulations.
Is it correct to say that there is a 95% chance that the population mean (or proportion)
lies within the computed CI?
No. The population mean (or proportion) has a fixed value (which we never
know). So it is not appropriate to ask about the chance that the population
mean (or proportion) has any particular value. It is what it is with no chance in-
volved. In contrast, the computed CIs depend on which sample of data you happened to choose, so they vary from sample to sample based on random sampling. It
is correct to say that there is a 95% chance that a 95% CI includes the population
mean (or proportion).
Why the focus on parameters like the mean? I don’t want to know the precision of the mean. I want to know about the distribution of values in the population.
The 95% prediction interval answers your question. It is the range of values that
is 95% sure to contain 95% of the values in the entire population. I’m not sure
why, but prediction intervals are not often used in most fields of biology. Predic-
tion intervals are much wider than CIs.
CHAPTER SUMMARY
• You can understand CIs without understanding how they are computed.
This chapter gives you a peek at how the math works. You don’t need to
understand this chapter to understand how to interpret CIs.
• The math works by flipping around (solving) equations that predict samples
from a known population to let you make inferences about the population
from a single sample.
• Depending on your frame of mind, this flip can seem really obvious or
deeply profound.
• An alternative approach is to use resampling (also called bootstrapping)
methods. The idea is that analyzing a set of bootstrapped samples cre-
ated from the original data will generate a CI for the population mean or
proportion.
Error Bars
The only role of the standard error . . . is to distort and conceal
the data. The reader wants to know the actual span of the data;
but the investigator displays an estimated zone for the mean.
A. R. FEINSTEIN
SD VERSUS SEM
What is the SD?
The standard deviation, abbreviated SD or s, was explained in Chapters 9 and 10.
It is expressed in the same units as the data and quantifies variation among the
values. If the data are sampled from a Gaussian distribution, you expect about
two-thirds of the values to lie within 1 SD of the mean.
The SEM quantifies how precisely you know the population mean
Imagine taking many samples of sample size n from your population. These
sample means will not be identical, so you could quantify their variation by com-
puting the SD of the set of means. The SEM computed from one sample is your
best estimate of what the SD among sample means would be if you collected an
infinite number of samples of a defined size.
Chapter 12 defined the margin of error (W) of a CI of a mean as follows:

W = t* · SD / √n

Substituting the definition of the SEM (SEM = SD / √n),

W = t* · SEM

Rearranging the definition of the SEM gives Equation 14.1:

SD = SEM · √n          (14.1)
For the n = 12 body temperature sample, the SEM equals 0.1157°C. If that was
all you knew, you could compute the SD. Multiply 0.1157 times the square root of
12, and the SD equals 0.40°C. This is an exact calculation, not an approximation.
Figure 14.1. (Left) The actual data. (Right) Five kinds of error bars, each representing a
different way to portray variation or precision.
Figure 14.2. Four different styles for plotting the mean and SD.
Figure 14.1 (left) illustrates the raw data plotted as a column scatter plot. Figure 14.1 (right) shows the same data with several kinds of error bars representing the SD, SEM, 95% CI, range, and interquartile range. When you create graphs with error bars, be sure to state clearly how they were computed.
Figure 14.2 shows that error bars (in this case, representing the SD) can be plotted with or without horizontal caps, and (with bar graphs) in one direction or both directions. These four styles all plot the same values, so there is no real reason to prefer one over another; the choice is a matter of preference and style. When error bars with caps are
placed on bars and only extend above the bar (bar C in Figure 14.2), the resulting plot
is sometimes called a dynamite plunger plot, sometimes shortened to dynamite plot.
If you increase the sample size, is the SD expected to get larger, get
smaller, or stay about the same?
The SD quantifies the scatter of the data. Whether increasing the size of the sample
increases or decreases the scatter in that sample depends on the random selection
of values. The SD, therefore, is equally likely to get larger or to get smaller as the
sample size increases.
That statement, however, has some fine print: for all sample sizes, the sample
variance (i.e., the square of the sample SD) is the best possible estimate of the popu-
lation variance. If you were to compute the sample variance from many samples
from a population, on average, it would be correct. Therefore, the sample variance is
said to be unbiased, regardless of n. In contrast, when n is small, the sample SD tends
to slightly underestimate the population SD. Increasing n, therefore, is expected to
increase the SD a bit. Because statistical theory is based on the variance rather than
the SD, it is not worth worrying about (or correcting for) the bias in the sample SD.
Few statistics books even mention it. The expected increase in the sample SD as you
increase sample size is tiny compared to the expected decrease in the SEM.
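A quick simulation makes the last two paragraphs concrete (an illustrative sketch assuming NumPy, not code from this book): as the sample size grows, the average sample SD creeps up only slightly toward the population SD, while the SEM keeps shrinking.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (5, 20, 100, 1000):
        # 10,000 samples of size n from a Gaussian population with SD = 1
        samples = rng.normal(loc=0.0, scale=1.0, size=(10_000, n))
        sds = samples.std(axis=1, ddof=1)       # sample SDs (n - 1 in the denominator)
        sems = sds / np.sqrt(n)                 # SEM = SD / sqrt(n)
        print(f"n={n:5d}   mean SD = {sds.mean():.3f}   mean SEM = {sems.mean():.3f}")
    # The mean SD rises slightly toward 1.0 as n increases (the small-sample SD is
    # biased a bit low), while the mean SEM drops sharply.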
Consider the frequency distribution of the number of citations received by 500 papers (Colquhoun, 2003). The first (leftmost) bar shows that there are 71 (Y value) papers
that had between zero and 20 citations (since the X value is centered at 10 and each
bar represents a bin width of 20 citations). The last (rightmost) bar shows that a single
paper (of the 500 studied) had 2,364 citations. This frequency distribution is easy to
understand even though it is very asymmetrical, even more asymmetrical than a log-
normal distribution (see Chapter 11). With a glance at the graph, you can see that many
papers are only cited a few times (or not at all), while only a few are heavily cited.
The mean number of citations is approximately 115, with a SD of approxi-
mately 157 (“approximately” because I only have access to the frequency distribu-
tion, not the raw data). Viewing a graph showing that the mean is 115 and the SD
is 157 would not help you to understand the data. Seeing that the mean is 115 and
the SEM is 7 would be less helpful and might be quite misleading.
The median number of citations is approximately 65, and the interquartile
range runs from about 20 to 130. Reporting the median with that range would be
more informative than reporting the mean and SD or SEM. However, it seems to
me that the best way to summarize these data is to show a frequency distribution.
No numerical summary really does a good job.
[Figure: frequency distribution of the number of citations per paper (Y-axis: # of papers).]
[Figure: Data Sets A, B, C, and D plotted with SD error bars.]
C and D are extremely different from the distribution of values in Data Set B. It
can be useful to summarize data as mean and SD (or SEM), but it can also be
misleading. Look at the raw data when possible.
When graphing your own data, keep in mind that it is easy to show the actual
data on dot plots (column scatter graphs) with up to 100 or even a few hundred
values. With more points, it is easy to plot a box-and-whiskers graph, a violin plot,
or a frequency distribution histogram. Don’t plot mean and an error bar without
thinking first about alternative ways to plot the data.
Mistake: Plotting a mean and error bar without defining how the error
bars were computed
If you make a graph showing means and error bars (or a table with means plus or
minus error values), it is essential that you state clearly what the error bars are—
SD, SEM, CI, range, or something else? Without a clear legend or a statement in
the methods section, the graph or table is ambiguous.
Q&A
CHAPTER SUMMARY
• The SD quantifies variation.
• The SEM quantifies how precisely you know the population mean. It does
not quantify variability. The SEM is computed from the SD and the sample
size.
• The SEM is always smaller than the SD. Both are measured in the same
units as the data.
• Graphs are often plotted as means and error bars, which are usually either
the SD or the SEM.
• Consider plotting each value, or a frequency distribution, before deciding
to graph a mean and error bar.
• You can calculate the SD from the SEM and n using Equation 14.1.
• Show the SD when your goal is to show variability and show the SEM (or,
better, the confidence interval) when your goal is to show how precisely
you have determined the population mean.
• The choice of SD or SEM is often based on traditions in a particular scien-
tific field. Often, SEMs are chosen simply because they are smaller.
• Graphs with error bars (or tables with error values) should always contain
labels, whether the error bars are SD, SEM, or something else.
P Values and Statistical
Significance
CHAPTER 15
Introducing P Values
Imperfectly understood CIs are more useful and less dangerous
than incorrectly understood P values.
HOENIG AND HEISEY (2001)
Many statistical analyses are reported with P values, which are ex-
plained in this essential chapter. P values are often misunderstood
because they answer a question you probably never thought to ask. This
chapter explains what P values are and the many incorrect ways in which
they are interpreted.
INTRODUCING P VALUES
You’ve probably already encountered P values. They appear in many statistical
results, either as a number (“P = 0.0432”) or as an inequality (“P < 0.01”). They
are also used to make conclusions about statistical significance, as is explained
in Chapter 16.
P values are difficult to understand and are often misinterpreted. This chapter
starts with several examples to explain what P values are and then identifies the
common incorrect beliefs many have about P values.
coin had landed on tails 16 or more times (and thus on heads 4 or fewer times).
Putting all this together, we want to answer the following question: If the coin
tosses are random and the answers are recorded correctly, what is the chance that
when you flip the coin 20 times, you’ll observe either 16 or more heads or 4 or
fewer heads (meaning 16 or more tails)?
The values in the third column can also be computed using the Excel func-
tion Binom.dist. For example, the value for 10 heads is computed as =BINOM.
DIST(10, 20, 0.5, FALSE). The first value is the number of “successes” (heads
in this case); the second value is the number of tries (20 in this case); the third
number is the probability of success on each try (0.5 for a coin flip); and the last value is FALSE because you don’t want a cumulative probability. Older versions of Excel
offer the function BinomDist, which works the same way but is said to be less
accurate.
Table 15.1 shows all the possible probabilities when you flip a fair coin
20 times. This book doesn’t explain how to compute these answers, but the
table legend points you to a Web page and an Excel function that does these
calculations.
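For readers who prefer code to a spreadsheet, the same binomial probabilities can be obtained with a few lines of Python (a sketch assuming SciPy; the Excel functions described above do the same job).

    from scipy.stats import binom

    n, p = 20, 0.5
    print(binom.pmf(10, n, p))   # same as =BINOM.DIST(10, 20, 0.5, FALSE); about 0.176

    # Chance of 16 or more heads, or 4 or fewer heads, if the coin is fair:
    two_tailed = binom.sf(15, n, p) + binom.cdf(4, n, p)
    print(two_tailed)            # about 0.012 (each tail is about 0.0059, or 0.59%)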
the risk in those who receive an inactive ointment, so any discrepancy in the inci-
dence of infections observed in this study is the result of chance. Then we can ask
the following question: If the risk of infection overall is identical in the two groups
and the experiment was performed properly, what is the chance that random sam-
pling would lead to a difference in incidence rates equal to, or greater than, the
difference actually observed in this study?
The authors did the calculation and published the answer, which is 0.010.
Since the study was well designed, the P value was low, and the hypothesis that
an antibiotic would prevent infection is quite sensible, the investigators concluded
that the antibiotic worked to prevent wound infection.
provisional assumption about the population (the null hypothesis) and asks
about possible data samples. It’s backwards! Thinking about P values seems
quite counterintuitive, except maybe to lawyers or Talmudic scholars used
to this sort of argument by contradiction.
• The P value answers a question that few scientists would ever think to ask!
[Figure: the probability of each number of heads (0 to 20) in 20 flips of a fair coin. Panel A marks both tails (probability = 0.59% each), panel B marks one tail (0.59%), and panel C marks the central region (99.41%).]
sampling would lead to a 40% reduction (or more) in infection rate in those treated
with antibiotics? The answer is 0.005, which is half the two-tailed P value.
What if you had predicted that there would be more infections in the patients
treated with the antibiotic than in those treated with the inactive ointment? Sure,
that is a pretty unlikely prediction for this example, but go with it. Now the one-
tailed P value answers this question: If the null hypothesis were true, what is the
chance that random sampling would lead to a 40% reduction (or less) in infection
rate in those treated with the inactive ointment? Since the prediction runs opposite
to the data, this one-tailed P value, 0.995, is high.
Once the data are collected, the appeal of a one-tailed P value is clear. Assuming you predicted the direction of the effect correctly, a one-tailed P value equals one-half of the two-tailed P value (this is not always exactly true, but it is true for most statistical tests).
The results were a surprise to the cardiologists who designed the study.
Patients treated with anti-arrhythmic drugs turned out to be four times more likely
to die than patients given a placebo. The results went in the direction opposite to
what was predicted.
[Figure: results of four experiments comparing a Control and a Treated group (Result on the Y-axis), with P values of 0.1338, 0.8964, 0.0107, and 0.0006.]
[Figure: frequency distribution of P values from many experiments (number of experiments on the Y-axis; P value on a logarithmic axis from 0.00001 to 1), with a line marking P = 0.05.]
Rothman (2016) pointed out a huge distinction between P values and CIs. P
values are computed by combining two separate values: the size of the effect and
the precision of the effect (partly determined by sample size). In contrast, a CI
shows the two separately. The location of the center of the CI tells you about the
observed effect size. It answers this question: How big is the observed effect in
this study? The width of the CI tells you about precision. It answers this question:
How precisely has this study determined the size of the effect? Both questions are
important, and the CI answers both in an understandable way. In contrast, the P
value is computed by combining the two concepts and answers a single question
that is less interesting: If there were no effect in the overall population (if the null
hypothesis were actually true), what is the chance of observing an effect this large
or larger?
When reading papers, I suggest that you don’t let yourself get distracted by P
values. Instead, look for a summary of how large the effect was and its CI.
When writing a paper, first make sure that the results are shown clearly. Then
compute CIs to quantify the precision of the effect sizes. Consider stopping there.
Only add P values to your manuscript when you have a good reason to think that
they will make the paper easier to understand.
The P value is not the probability that the result was due to sampling
error
The P value is computed from figuring out what results you would see if the
null hypothesis is true. In other words, the P value is computed from the results
you would see if observed differences were only due to randomness in selecting
subjects—that is, to sampling error. Therefore, the P value cannot tell you the
probability that the result is due to sampling error.
The P value is not the probability that the null hypothesis is true
The P value is computed from the set of results you would see if the null hypoth-
esis is true, so it cannot calculate the probability that the null hypothesis is true.
The probability that the alternative hypothesis is true is not 1.0 minus
the P value
If the P value is 0.03, it is very tempting to think that if there is only a 3%
probability that my difference would have been caused by random chance, then
there must be a 97% probability that it was caused by a real difference. But this
is wrong!
What you can say is that if the null hypothesis were true, then 97% of ex-
periments would lead to a difference smaller than the one you observed and 3%
of experiments would lead to a difference as large or larger than the one you
observed.
Calculation of a P value is predicated on the assumption that the null hypoth-
esis is true. P values cannot tell you whether this assumption is correct. Instead,
the P value tells you how rarely you would observe a difference as large or larger
than the one you observed if the null hypothesis were true.
The probability that the results will hold up when the experiment is
repeated is not 1.0 minus the P value
If the P value is 0.03, it is tempting to think that this means there is a 97% chance
of getting similar results in a repeated experiment. Not so. The P value does not
itself quantify reproducibility.
A high P value does not prove that the null hypothesis is true
A high P value means that if the null hypothesis were true, it would not be surpris-
ing to observe the treatment effect seen in a particular experiment. But that does
not prove that the null hypothesis is true. It just says the data are consistent with
the null hypothesis.
Q&A
Shouldn’t P values always be presented with a conclusion about whether the results
are statistically significant?
No. A P value can be interpreted on its own. In some situations, it can make
sense to go one step further and report whether the results are statistically
significant (as will be explained in Chapter 16). But this step is optional.
Should it be P or p? Italicized or not? Hyphenated or not?
Different books and journals use different styles. This book uses “P value,” but
“p-value” is probably more common.
I chose to use a one-tailed P value, but the results came out in the direction opposite to
my prediction. Can I report the one-tailed P value calculated by a statistical program?
Probably not. Most statistical programs don’t ask you to specify the direction of
your hypothesis and report the one-tailed P value assuming you correctly pre-
dicted the direction of the effect. If your prediction was wrong, then the correct
one-tailed P value equals 1.0 minus the P value reported by the program.
Why do some investigators present the negative logarithm of P values?
When presenting many tiny P values, some investigators (especially those
doing genome-wide association studies; see Chapter 22) present the negative
logarithm of the P value to avoid the confusion of dealing with tiny fractions.
For example, if the P value is 0.01, its logarithm (base 10) is –2, and the nega-
tive logarithm is 2. If the P value is 0.00000001, its logarithm is –8, and the
negative logarithm is 8. When these are plotted on a bar graph, it is called
a Manhattan plot (Figure 15.3). Why that name? Because the arrangement
of high and low bars makes the overall look vaguely resemble the skyline of Manhattan, New York (maybe you need to have a few drinks before seeing that resemblance).
[Figure 15.3. A Manhattan plot: one bar per treatment (A through Z), with -log(P value) on the left axis and the corresponding P value on the right axis.]
Is the null hypothesis ever true?
Rarely.
Is it possible to compute a CI of a P value?
No. CIs are computed for parameters like the mean or a slope. The idea is to
express the likely range of values that includes the true population value for the
mean or slope etc. The P value is not an estimate of a value from the popula-
tion. It is computed for that one sample. Since it doesn’t make sense to ask
what the overall P value is in the population, it doesn’t make sense to compute
a CI of a P value.
If the null hypothesis is really true and you run lots of experiments, will you expect
mostly large P values?
No. If the null hypothesis is true, the P value is equally likely to have any value.
You’ll find just as many less than 0.10 as greater than 0.90.
This chapter takes the point of view that P values are often misunderstood and over-
emphasized. Is this a mainstream opinion of statisticians?
Yes. The American Statistical Association released a report in 2016 expressing
many of the same concerns as this chapter (Wasserstein & Lazar, 2016). But
there is plenty of disagreement on the nuances, as you can see from the fact
that the report came with 20 accompanying commentaries!
CHAPTER SUMMARY
• P values are frequently reported in scientific papers, so it is essential that
every scientist understand exactly what a P value is and is not.
• All P values are based on a null hypothesis. If you cannot state the null
hypothesis, you can’t understand the P value.
• A P value answers the following general question: If the null hypothesis is
true, what is the chance that random sampling would lead to a difference
(or association, etc.) as large or larger than that observed in this study?
• P values are calculated values between 0.0 and 1.0. They can be reported
and interpreted without ever using the word “significant.”
• When interpreting published P values, note whether they are calculated
for one or two tails. If the author doesn’t say, the result is somewhat
ambiguous.
• If you repeat an experiment, expect the P value to be very different. P val-
ues are much less reproducible than you would guess.
• One-tailed P values have their advantages, but you should always use a
two-tailed P value unless you have a really good reason to use a one-tailed
P value.
• There is much more to statistics than P values. When reading scientific
papers, don’t get mesmerized by P values. Instead, focus on what the inves-
tigators found and how large the effect size is.
CHAPTER 16
Statistical Significance and
Hypothesis Testing
For the past eighty years, it appears that some of the sciences have
made a mistake by basing decisions on statistical significance.
ZILIAK AND MCCLOSKEY (2008)
A juror starts with the presumption that the defendant is innocent. A scientist
starts with the presumption that the null hypothesis of “no difference” is true.
A juror bases his or her decision only on factual evidence presented in the
trial and should not consider any other information, such as newspaper stories.
A scientist bases his or her decision about statistical significance only on data from
one experiment, without considering what other experiments have concluded.
A juror reaches the verdict of guilty when the evidence is inconsistent with
the assumption of innocence. Otherwise, the juror reaches a verdict of not guilty.
When performing a statistical test, a scientist reaches a conclusion that the results
are statistically significant when the P value is small enough to make the null
hypothesis unlikely. Otherwise, a scientist concludes that the results are not sta-
tistically significant.
A juror does not have to be convinced that the defendant is innocent to reach
a verdict of not guilty. A juror reaches a verdict of not guilty when the evidence
is consistent with the presumption of innocence, when guilt has not been proven.
A scientist reaches the conclusion that results are not statistically significant
whenever the data are consistent with the null hypothesis. The scientist does not
have to be convinced that the null hypothesis is true.
A juror can never reach a verdict that a defendant is innocent. The only
choices are guilty or not guilty. A statistical test never concludes the null hypoth-
esis is true, only that there is insufficient evidence to reject it (but see Chapter 21).
A juror must try to reach a conclusion of guilty or not guilty and can’t con-
clude “I am not sure.” Similarly, each statistical test leads to a crisp conclusion
of statistically significant or not statistically significant. A scientist who strictly
follows the logic of statistical hypothesis testing cannot conclude “Let’s wait for
more data before deciding.”
Scientist as journalist
Jurors aren’t the only people who evaluate evidence presented at a criminal trial.
Journalists also evaluate evidence presented at a trial, but they have very different
goals than jurors. A journalist’s job is not to reach a verdict of guilty or not guilty
but rather to summarize the proceedings.
Scientists in many fields are often more similar to journalists than jurors. If
you don’t need to make a clear decision based on one P value, you don’t need to use
the term statistically significant or use the rubric of statistical hypothesis testing.
Most scientists are less rigid and refer to very significant or extremely signifi-
cant results when the P value is tiny. When showing P values on graphs, investiga-
tors commonly use asterisks to create a scale such as that used in Michelin Guides
or movie reviews (see Table 16.1). When you read this kind of graph, make sure
that you look at the key that defines the symbols, because different investigators
can use different threshold values.
Type I error
When there really is no difference (or association or correlation) between the pop-
ulations, random sampling can lead to a difference (or association or correlation)
large enough to be statistically significant. This is a Type I error. It occurs when
you decide to reject the null hypothesis when in fact the null hypothesis is true.
It is a false positive.
Type II error
When there really is a difference (or association or correlation) between the popu-
lations, random sampling (and small sample size) can lead to a difference (or as-
sociation or correlation) small enough to be not statistically significant. This is a
Type II error. It occurs when you decide not to reject the null hypothesis when in
fact the null hypothesis is false. It is a false negative.
• A P value is computed from the data. You reject the null hypothesis (and con-
clude that the results are statistically significant) when the P value from a
particular experiment is less than the significance level α set in advance.
The trade-off
A result is deemed statistically significant when the P value is less than a signifi-
cance level (α) set in advance. So you need to choose that level. By tradition, α
is usually set to equal 0.05, but you can choose whatever value you want. When
choosing a significance level, you confront a trade-off.
If you set α to a very low value, you will make few Type I errors. That means
that if the null hypothesis is true, there will be only a small chance that you will
mistakenly call a result statistically significant. However, there is also a larger
chance that you will not find a significant difference, even if the null hypothesis is
false. In other words, reducing the value of α will decrease your chance of making
a Type I error but increase the chance of a Type II error.
If you set α to a very large value, you will make more Type I errors. If the null hy-
pothesis is true, there is a large chance that you will mistakenly conclude that the effect
is statistically significant. But there is a small chance of missing a real difference. In
other words, increasing the value of α will increase your chance of making a Type I
error but decrease the chance of a Type II error. The only way to reduce the chances
of both a Type I error and a Type II error is to collect bigger samples (see Chapter 26).
Table 16.3. Type I and Type II errors in the context of spam filters.
Table 16.4. Type I and Type II errors in the context of trial by jury in a criminal case.
P < 0.005?
Benjamin (2017) and 71 colleagues urge scientists to use a stricter threshold: “We propose to change the default P value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005.” Their goal is to reduce the False Positive Report Probability (FPRP), as will be explained in Chapter 18.
The tradeoff, of course, is that statistical power will be reduced unless sample size
is increased. They note that increasing sample size by about 70% can maintain
statistical power while changing the significance level from 0.05 to 0.005. John-
son (2013) made a similar suggestion.
P < 0.0000003?
A large international group of physicists announced in 2012 the discovery of the
Higgs boson. The results were not announced in terms of a P value or statistical
significance. Instead, the group announced that two separate experiments met the
five-sigma threshold (Lamb, 2012).
Meeting the five-sigma threshold means that the results would occur by
chance alone as rarely as a value sampled from a Gaussian distribution would be 5
SD (sigma) away from the mean. How rare is that? The probability is 0.0000003,
or 0.00003%, or about 1 in 3.5 million, as computed with the Excel function “=1-NORMSDIST(5)”. In other words, the one-tailed P value is less than 0.0000003. If
the Higgs boson doesn’t exist, that probability is the chance that random sampling
of data would give results as striking as the physicists observed. Two separate
groups obtained data with this level of evidence.
The standard of statistical significance in most fields is that the two-tailed
P is less than 0.05. About 5% of a Gaussian distribution is more than 2 SD away
from the mean (in either direction). So the conventional definition of statistical
significance can be called a two-sigma threshold.
The tradition in particle physics is that the threshold to report “evidence of a
particle” is P < 0.003 (three sigma), and the standard to report a “discovery of a
particle” is P < 0.0000003 (five sigma). These are one-tailed P values.
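The sigma-to-probability conversions quoted in this section are easy to check (a brief sketch assuming SciPy; the Excel formula shown above does the same calculation for five sigma).

    from scipy.stats import norm

    for sigma in (2, 3, 5):
        tail = norm.sf(sigma)    # one-tailed area beyond sigma SDs; like =1-NORMSDIST(sigma)
        print(f"{sigma} sigma: one-tailed P = {tail:.7f}")
    # Five sigma gives about 0.0000003 (roughly 1 in 3.5 million). Doubling the
    # two-sigma value gives the familiar two-tailed threshold of about 5%
    # (more precisely, 4.6%).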
Masicampo and Lalande (2012) found similar results. They reviewed papers
in three highly regarded, peer-reviewed psychology journals. The distribution of
3,627 P values shows a clear discontinuity at 0.05. There are too many P values just smaller than 0.05 and too few just greater than 0.05. They concluded, “The
number of P values in the psychology literature that barely meet the criterion for
statistical significance (i.e., that fall just below .05) is unusually large, given the
number of P values occurring in other ranges.”
Some of the investigators probably did the analyses properly and simply didn’t
write up results when the P value was just a bit higher than 0.05. And some may have
had the borderline results rejected for publication. You can read more about publica-
tion bias in Chapter 43. It also is likely, I think, that many investigators cheat a bit by:
• Tweaking. The investigators may have played with the analyses. If one analy-
sis didn’t give a P value less than 0.05, then they tried a different one. Perhaps
they switched between parametric and nonparametric analysis or tried includ-
ing different variables in multiple regression models. Or perhaps they re-
ported only the analyses that gave P values less than 0.05 and ignored the rest.
• Dynamic sample size. The investigators may have analyzed their data several
times. Each time, they may have stopped if the P value was less than 0.05 but
collected more data when the P value was above 0.05. This approach would
yield misleading results, as it is biased toward stopping when P values are small.
• Slice and dice. The investigators may have analyzed various subsets of the
data and only reported the subsets that gave low P values.
• Selectively report results of multiple outcomes. If several outcomes were
measured, the investigators may have chosen to only report the ones for
which the P value was less than 0.05.
• Play with outliers. The investigators may have tried using various defini-
tions of outliers, reanalyzed the data several times, and only reported the
analyses with low P values.
Simmons, Nelson, and Simonsohn (2012) coined the term P-hacking to refer to
attempts by investigators to lower the P value by trying various analyses or by
analyzing subsets of data.
Q&A
But isn’t the whole point of statistics to decide when an effect is statistically
significant?
No. The goal of statistics is to quantify scientific evidence and uncertainty.
Why is statistical hypothesis testing so popular?
There is a natural aversion to ambiguity. The crisp conclusion “The results are
statistically significant” is more satisfying to many than the wordy conclusion
“Random sampling would create a difference this big or bigger in 3% of
experiments if the null hypothesis were true.”
Who invented the threshold of P < 0.05 as meaning statistically significant?
That threshold, like much of statistics, came from the work of Ronald Fisher.
Are the P value and α the same?
No. A P value is computed from the data. The significance level α is chosen by
the experimenter as part of the experimental design before collecting any data.
A difference is termed statistically significant if the P value computed from the
data is less than the value of α set in advance.
Is α the probability of rejecting the null hypothesis?
Only if the null hypothesis is true. In some experimental protocols, the null
hypothesis is often true (or close to it). In other experimental protocols, the null
hypothesis is almost certainly false. If the null hypothesis is actually true, α is the
probability that random sampling will result in data that will lead you to reject
the null hypothesis, thus making a Type I error.
If I perform many statistical tests, is it true that the conclusion “statistically significant”
will be incorrect 5% of the time?
No! That would only be true if the null hypothesis is, in fact, true in every single
experiment. It depends on the scientific context.
My two-tailed P value is not low enough to be statistically significant, but the one-
tailed P value is. What do I conclude?
Stop playing games with your analysis. It is only OK to compute a one-tailed
P value when you decided to do so as part of the experimental protocol
(see Chapter 15).
Isn’t it possible to look at statistical hypothesis testing as a way to choose between
alternative models?
Yes. See Chapter 35.
What is the difference between Type I and Type II errors?
Type I errors reject a true null hypothesis. Type II errors accept a false null
hypothesis.
My P value to four digits is 0.0501. Can I round to 0.05 and call the result statistically
significant?
No. The whole idea of statistical hypothesis testing is to apply a strict criterion (usually P = 0.05) for deciding between rejecting and not rejecting the null hypothesis.
Your P value is greater than α, so you cannot reject the null hypothesis and call
the results statistically significant.
My P value to nine digits is 0.050000000. Can I call the result statistically significant?
Having a P value equal 0.05 (to many digits) is rare, so this won’t come up often.
It is just a matter of definition. But I think most would reject the null hypothesis
when a P value is exactly equal to α .
CHAPTER SUMMARY
• Statistical hypothesis testing reduces all findings to two conclusions,
“statistically significant” or “not statistically significant.”
• The distinction between the two conclusions is made purely based on
whether the computed value of a P value is less than or not less than an
arbitrary threshold, called the significance level.
• A conclusion of statistical significance does not mean the difference is large
enough to be interesting, does not mean the results are intriguing enough
to be worthy of further investigations, and does not mean that the finding is
scientifically or clinically significant.
• The concept of statistical significance is often overemphasized. The whole
idea of statistical hypothesis testing is only useful when a crisp decision
needs to be made based on one analysis.
• Avoid the use of the term significant when you can. Instead, talk about
whether the P value is low enough to reject the null hypothesis and
whether the effect size (or difference or association) is large enough to
be important.
• Some scientists use the phrase very significant or extremely significant
when a P value is tiny and borderline significant when a P value is just a
little bit greater than 0.05.
• While it is conventional to use 0.05 as the threshold P value (called α)
that separates statistically significant from not statistically significant, this
value is totally arbitrary and some scientists use a much smaller threshold.
• Ideally, one should choose α based on the consequences of Type I (false
positive) and Type II (false negative) errors.
• It is important to distinguish the P value from the significance level α. A P
value is computed from data, while the value of α is (or should be) decided
as part of the experimental design.
• There is more—a lot more—to statistics than deciding if an effect is statis-
tically significant or not statistically significant.
CHAPTER 17
Comparing Groups with Confidence
Intervals and P Values
Reality must take precedence over public relations, for Nature
cannot be fooled.
RICHARD FEYNMAN
This book has presented the concepts of CIs and statistical hypothesis
testing in separate chapters. The two approaches are based on the
same assumptions and the same mathematical logic. This chapter explains
how they are related.
Figure 17.1. Comparing observed body temperatures (n = 12) with the hypothetical
mean value of 37°C.
The top bar shows the range of results that are not statistically significant (P > 0.05). The
bottom bar shows the 95% CI of the mean and is centered on the sample mean. The lengths
of the two bars are identical. Because the 95% CI contains the null hypothesis, the zone of
statistically not significant results must include the sample mean.
The two bars have identical widths but different centers. In this example, the
95% CI contains the null hypothesis (37°C). Therefore, the zone of statistically
not significant results must include the sample result (in this case, the mean).
Figure 17.2. Comparing observed body temperatures (n = 130) with the hypothetical
mean value of 37°C.
Because the sample size is larger, the CI is narrower than it is in Figure 17.1, as is the zone
of possible results that would not be statistically significant. Because the 95% CI does not
contain the null hypothesis, the zone of not statistically significant results cannot contain
the sample mean. Whether the two boxes overlap is not relevant.
The bottom bar in Figure 17.2 shows the 95% CI of the mean, computed by multiplying the SEM by the critical value of the t distribution from Appendix
D (for 95% confidence). This bar is much shorter than the corresponding bar in
Figure 17.1. The main reason for this is that the SEM is much smaller because
of the much larger sample size. Another reason is that the critical value from
the t distribution is a bit smaller (1.98 vs. 2.20) because the sample size (and
number of dfs) is larger.
The top bar in Figure 17.2 shows the range of results that would not be sta-
tistically significant (P > 0.05). It is centered on the null hypothesis, in this case,
that the mean body temperature is 37°C. It extends in each direction exactly the
same distance as the other bar, equal to the product of the SEM times the critical
value of the t distribution.
In this example, the 95% CI does not contain the null hypothesis. There-
fore, the zone of statistically not significant results cannot include the sample
mean.
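The symmetry between the two bars in Figures 17.1 and 17.2 is easy to verify numerically (a sketch with made-up temperature readings, assuming NumPy and SciPy; these are not the data used in the figures). Both bars extend a distance W = t*·SEM, one centered on the sample mean and the other on the null hypothesis, so the CI excludes 37°C exactly when the P value is less than 0.05.

    import numpy as np
    from scipy import stats

    temps = np.array([36.9, 37.1, 36.8, 37.3, 36.7, 37.0,
                      36.6, 37.2, 36.9, 37.1, 36.8, 36.5])   # hypothetical n = 12 sample
    null_value = 37.0

    n = temps.size
    mean = temps.mean()
    sem = temps.std(ddof=1) / np.sqrt(n)
    t_star = stats.t.ppf(0.975, df=n - 1)
    W = t_star * sem                                   # the same half-width for both bars

    ci = (mean - W, mean + W)                          # centered on the sample mean
    not_sig_zone = (null_value - W, null_value + W)    # centered on the null hypothesis

    p = stats.ttest_1samp(temps, popmean=null_value).pvalue
    print(ci, not_sig_zone, p)
    # The CI excludes 37.0 exactly when the sample mean falls outside the
    # not-significant zone, which is exactly when p < 0.05.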
CI
The thromboembolism recurred in 8.8% of the patients receiving the placebo
(73/829) but in only 1.7% of the people receiving apixaban (14/840). So patients
who received apixaban were a lot less likely to have a recurrent thromboembo-
lism. The drug worked!
As explained in Chapter 4, we could compute the CI for each of those propor-
tions (8.8% and 1.7%), but the results are better summarized by computing the
ratio of the two proportions and its CI. This ratio is termed the risk ratio, or the
relative risk (to be discussed in Chapter 27). The ratio is 1.7/8.8, which is 0.19.
Patients receiving the drug had 19% the risk of a recurrent thromboembolism
compared to patients receiving the placebo.
That ratio is an exact calculation for the patients in the study. A 95% CI
generalizes the results to the larger population of patients with thromboembo-
lism similar to those in the study. This book won’t explain how the calculation is
done, but many programs will report the result. The 95% CI of this ratio extends
from 0.11 to 0.33. If we assume the patients in the study are representative of
the larger population of adults with thromboembolism, then we are 95% confi-
dent that treatment with apixaban will reduce the incidence of disease progres-
sion to somewhere between 11% and 33% of the risk in untreated patients. In
other words, patients in the study taking apixaban had about one-fifth the risk of
another clot than those taking placebo did, and if we studied a much larger group
of patients (and accepted some assumptions), we could be 95% confident that
the risk would be somewhere between about one-ninth and one-third the risk of
those taking placebo.
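Although this book doesn't show the computation, one common large-sample approach (the log relative-risk method, sketched below in Python; an illustration, not necessarily the method the investigators used) gives a CI essentially identical to the one quoted here.

    import math

    events_drug, n_drug = 14, 840          # apixaban group
    events_placebo, n_placebo = 73, 829    # placebo group

    risk_drug = events_drug / n_drug
    risk_placebo = events_placebo / n_placebo
    rr = risk_drug / risk_placebo                       # about 0.19

    # Standard error of log(RR), then back-transform the CI to the ratio scale.
    se_log_rr = math.sqrt(1/events_drug - 1/n_drug + 1/events_placebo - 1/n_placebo)
    lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
    upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
    print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")   # about 0.19 (0.11 to 0.33)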
Is reducing the risk of thromboembolism to one-fifth its control value a large,
important change? That is a clinical question, not a statistical one, but I think
anyone would agree that is a substantial effect. The 95% CI goes as high as 0.33.
If that were the true effect, a reduction in risk down to only one-third of its control
value, it would still be considered substantial. When interpreting any results, you
also have to ask whether the experimental methods were sound (I don’t see any
problem with this study) and whether the results are plausible (yes, it makes sense
that extending treatment with a blood thinner might prevent thromboembolism).
Therefore, we can conclude with 95% confidence that apixaban substantially
reduces the recurrence of thromboembolism in this patient population.
P value
To interpret the P value, we first must tentatively assume that the risk of
thromboembolism is the same in patients who receive an anticoagulant as in those
who receive placebo and that the discrepancy in the incidence of thromboembolism
observed in this study was the result of chance. This is the null hypothesis. The
P value answers the following question:
If the risk of thromboembolism overall is identical in the two groups, what is the
chance that random sampling (in a study as large as ours) would lead to a ratio
of incidence rates as far or farther from 1.0 as the ratio computed in this study?
(Why 1.0? Because a ratio of 1.0 implies no treatment effect.)
The authors did the calculation using a test called the Fisher exact test and pub-
lished the answer as P < 0.0001. When P values are tiny, it is traditional to simply
state that they are less than some small value—here, 0.0001. Since this P value is
low and since the hypothesis that an anticoagulant would prevent thromboembolism
is quite sensible, the investigators concluded that the drug worked to prevent recur-
rent thromboembolism. This example is discussed in more detail in Chapter 27.
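The published P value is easy to check with any program that offers the Fisher exact test. For instance (a sketch assuming SciPy), the 2 × 2 table of recurrences and non-recurrences gives a two-tailed P value far below 0.0001.

    from scipy.stats import fisher_exact

    table = [[14, 840 - 14],    # apixaban: recurrences, no recurrence
             [73, 829 - 73]]    # placebo:  recurrences, no recurrence
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(p_value)              # far smaller than 0.0001, consistent with P < 0.0001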
Statistical significance
The P value is less than 0.05 and the CI of the relative risk does not include 1.0
(no increased risk). So the results can be deemed to be statistically significant.
CI
The values for the two groups of rats overlap considerably, but the two means are
distinct. The mean maximum response in old rats is 23.5% lower than the mean
maximum response in young rats. That value is exactly correct for our sample
of data, but we know the true difference in the populations is unlikely to equal
23.5%. To make an inference about the population of all similar rats, look at the
95% CI of the difference between means. This is done as part of the calculations
of a statistical test called the unpaired t test (not explained in detail in this book).
This CI (for the mean value measured in the young rats minus the mean value
observed in the old rats) ranges from 9.3 to 37.8, and it is centered on the differ-
ence we observed in our sample. Its width depends on the sizes (number of rats)
and the variability (standard deviation) of the two samples and on the degree of
confidence you want (95% is standard).
Because the 95% CI does not include zero, we can be at least 95% confident
that the mean response in old rats is less than the mean response in young ones.
Beyond that, the interpretation needs to be made in a scientific context. Is a differ-
ence of 23.5% physiologically trivial or large? How about 9.3%, the lower limit of
the CI? These are not statistical questions. They are scientific ones. The investiga-
tors concluded this effect is large enough (and is defined with sufficient precision)
to perhaps explain some physiological changes with aging.
P value
The null hypothesis is that both sets of data are randomly sampled from popula-
tions with identical means. The P value answers the following question:
If the null hypothesis is true, what is the chance of randomly observing a differ-
ence as large as or larger than the one observed in this experiment of this size?
The P value depends on the difference between the means, on the SD of each
group, and on the sample sizes. A test called the unpaired t test (also called the
two-sample t test) reports that the P value equals 0.003. If old and young rats have
the same maximal relaxation of bladder muscle by norepinephrine overall, there
is only a 0.3% chance of observing such a large (or a larger) difference in an
experiment of this size. This example will be discussed in more detail in Chapter 31.
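The unpaired t test itself takes one line in most statistics environments. The sketch below (Python with NumPy and SciPy; the values are hypothetical, since the raw data are not reproduced here) shows the mechanics of getting both the P value and the 95% CI of the difference between means.

    import numpy as np
    from scipy import stats

    young = np.array([45.5, 55.0, 60.7, 61.5, 61.1, 65.5, 42.9, 37.5])   # hypothetical
    old = np.array([20.8, 2.8, 50.0, 33.3, 29.4, 38.9, 29.4, 52.6])      # hypothetical

    print(stats.ttest_ind(young, old).pvalue)    # two-tailed P value (pooled variances)

    # 95% CI of the difference between means, using the pooled SD from the t test.
    diff = young.mean() - old.mean()
    n1, n2 = young.size, old.size
    pooled_var = ((n1 - 1) * young.var(ddof=1) + (n2 - 1) * old.var(ddof=1)) / (n1 + n2 - 2)
    se_diff = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_star = stats.t.ppf(0.975, df=n1 + n2 - 2)
    print(diff - t_star * se_diff, diff + t_star * se_diff)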
Statistical significance
The P value is less than 0.05 and the CI of the difference between means does not
include 0.0. So the results can be deemed to be statistically significant.
in the risk of autism. That is a lot of risk. The investigators therefore do not give
a firm negative conclusion but instead conclude (p. 2406), “On the basis of the
upper boundary of the CI, our study could not rule out a relative risk up to 1.6, and
therefore the association warrants further study.”
Q&A
If the 95% CI just barely reaches the value that defines the null hypothesis, what can
you conclude about the P value?
If the 95% CI includes the value that defines the null hypothesis, you can
conclude that the P value is greater than 0.05. If the 95% CI excludes the null
hypothesis value, you can conclude that the P value is less than 0.05. So if the
95% CI ends right at the value that defines the null hypothesis, then the P value
must equal 0.05.
If the 95% CI is centered on the value that defines the null hypothesis, what can you
conclude about the P value?
The observed outcome equals the value that defines the null hypothesis. In this
case, the two-tailed P value must equal 1.000.
The 99% CI includes the value that defines the null hypothesis, but the P value is
reported to be < 0.05. How is this possible?
If the 99% CI includes the value that defines the null hypothesis, then the P
value must be greater than 0.01. But the P value was reported to be less than
0.05. You can conclude that the P value must be between 0.01 and 0.05.
The 99% CI includes the value that defines the null hypothesis, but the P value is
reported to be < 0.01. How is this possible?
It is inconsistent. Perhaps the CI or P value was calculated incorrectly. Or perhaps
the definition of the null hypothesis included in the 99% CI is not the same definition used
to compute the P value.
Two of the examples came from papers that reported CIs, but not P values or conclu-
sions about statistical significance. Isn’t this incomplete reporting?
No. In many cases, knowing the P value and a conclusion about statistical
significance really adds nothing to understanding the data. Just the opposite.
Conclusions about statistical significance often act to reduce careful thinking
about the size of the effect.
CHAPTER SUMMARY
• CIs and statistical hypothesis testing are closely related.
• If the 95% CI includes the value stated in the null hypothesis, then the
P value must be greater than 0.05.
• If the 95% CI does not include the value stated in the null hypothesis, then
the P value must be less than 0.05.
• The same rule works for 99% CIs and P < 0.01, or 90% CIs and P < 0.10.
CHAPTER 18
Interpreting a Result That Is
Statistically Significant
Facts do not “speak for themselves,” they are read in the light of
theory.
STEPHEN JAY GOULD
average is the true effect size. Some of those experiments will reach a conclusion
that the result is “statistically significant” and others won’t. The first set will, on
average, show effect sizes larger than the true effect size, and these are the ex-
periments that are most likely to be published. Other experiments that wound up
with smaller effects may never get written up. Or if they are written up, they may
not get accepted for publication. So published “statistically significant” results
tend to exaggerate the effect being studied (Gelman & Carlin, 2014).
A Type S (sign) error occurs when the sign of the actual overall effect is opposite to the results you happened to observe in one experiment. Type S errors occur rarely but are not impossible (Gelman & Carlin, 2014). They are also referred to as Type III errors.
The FPRP and the significance level are not the same
The significance level and the FPRP are the answers to distinct questions, so
the two are defined differently and their values are usually different. To under-
stand this conclusion, look at Table 18.1 which shows the results of many hypo-
thetical statistical analyses that each reach a decision to reject or not reject the
null hypothesis. The top row tabulates results for experiments in which the null
hypothesis is true. The second row tabulates results for experiments in which the
null hypothesis is not true. When you analyze data, you don’t know whether the
null hypothesis is true, so you could never actually create this table from an actual
series of experiments.
                           DECISION: REJECT       DECISION: DO NOT REJECT
                           NULL HYPOTHESIS        NULL HYPOTHESIS            TOTAL
Null hypothesis is true    A (Type I error)       B                          A + B
Null hypothesis is false   C                      D (Type II error)          C + D
Total                      A + C                  B + D                      A + B + C + D
Table 18.1. The results of many hypothetical statistical analyses to reach a decision to
reject or not reject the null hypothesis.
In this table, A, B, C, and D are integers (not proportions) that count numbers of analyses
(number of P values). The total number of analyses equals A + B + C + D. The significance
level is defined to equal A/(A + B). The FPRP is defined to equal A/(A + C).
The FPRP only considers analyses that reject the null hypothesis so only deals
with the left column of the table. Of all these experiments (A + C), the number in
which the null hypothesis is true equals A. So the FPRP equals A/(A + C).
The significance level only considers analyses where the null hypothesis is
true so only deals with the top row of the table. Of all these experiments (A + B),
the number of times the null hypothesis is rejected equals A. So the significance
level is expected to equal A/(A + B).
Since the values denoted by B and C in Table 18.1 are unlikely to be equal, the
FPRP is usually different than the significance level. This makes sense because
the two values answer different questions.
Table 18.2. The false positive report probability (FPRP) depends on the prior probability
and P value.
Examples 2 to 4 assume an experiment with sufficient sample size to have a power of 80%.
All examples assume that the significance level was set to the traditional 5%. The FPRP is computed defining “discovery” in two ways: as all results where the P value is less than 0.05, and as only those results where the P value is between 0.045 and 0.050 (based on Colquhoun, 2014).
• Of the 1,000 drugs screened, we expect that 10 (1%) will really work.
• Of the 10 drugs that really work, we expect to obtain a statistically signifi-
cant result in 8 (because our experimental design has 80% power).
• Of the 990 drugs that are really ineffective, we expect to obtain a statisti-
cally significant result in 5% (because we set α equal to 0.05). In other
words, we expect 5% × 990, or 49, false positives.
• Of 1,000 tests of different drugs, we therefore expect to obtain a statisti-
cally significant difference in 8 + 49 = 57. The FPRP equals 49/57 = 86%.
So even if you obtain a P value less than 0.05, there is an overwhelming chance
that the result is a false positive. This kind of experiment is not worth doing
unless you use a much stricter value for α (say 0.1% instead of 5%).
• Of the 1,000 drugs screened, we expect that 500 (50%) will really work.
• Of the 500 drugs that really work, we expect to obtain a statistically significant result in 400 (because our experimental design has 80% power).
• Of the 500 drugs that are really ineffective, we expect to obtain a statistically significant result in 5% (because we set α equal to 0.05). In other words, we expect 5% × 500, or 25, false positives.
• Of 1,000 tests of different drugs, therefore, we expect to obtain a statistically significant difference in 400 + 25 = 425. The FPRP equals 25/425 = 5.9% (the general calculation is sketched below).
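The arithmetic in these screening examples boils down to one small formula: among all the tests, the expected false positives are α × (1 - prior) and the expected true positives are power × prior, and the FPRP is the fraction of the "significant" results that are false. A short sketch in Python (illustrative, not code from this book):

    def fprp(prior, power=0.80, alpha=0.05):
        # Expected fractions of all tests that are true positives and false positives.
        true_positives = power * prior
        false_positives = alpha * (1 - prior)
        return false_positives / (false_positives + true_positives)

    print(f"{fprp(prior=0.01):.0%}")   # about 86%, matching the 1% prior-probability example
    print(f"{fprp(prior=0.50):.1%}")   # about 5.9%, matching the 50% prior-probability example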
The FPRP when the P value is just a tiny bit less than 0.05
So far, we have been defining a result to be a “discovery” when the P value is less
than 0.05. That includes P values barely less than 0.05 as well as P values that are
really tiny. Conclusions based on really tiny P values are less likely to be false positives than conclusions based on P = 0.049. The right column of Table 18.2 com-
putes the FPRP defining a result to be a discovery when the P value is between
0.045 and 0.050 (based on Colquhoun, 2014). The FPRP values are much higher
than they are when you define a discovery as being all P values less than 0.05.
Example 4 is for a situation where, before seeing any data, it is equally likely
that there will be or will not be a real effect. The prior probability is 50%. If you
observe a P value just barely less than 0.05 in this situation, the FPRP is 27%. If
you do a more exploratory experiment where the prior probability is only 10%,
then the FPRP is 78%! These FPRPs are far higher than the 5% that many people mistakenly expect them to be.
This is a really important point. A P value just barely less than 0.05 provides
little evidence against the null hypothesis.
• If P= 0.001 and you want the FPRP to be less than 5%, the prior probability
must be greater than 16%.
Note these calculations were done for the exact P value specified, not for all pos-
sible P values less than the specified one.
BAYESIAN ANALYSIS
These calculations use a simplified form of a Bayesian approach, named after
Thomas Bayes, who first published work on this problem in the mid-18th century.
(Bayes was previously mentioned in Chapter 2.) The big idea of Bayesian analysis
is to analyze data by accounting for prior probabilities.
There are some situations in which the prior probabilities are well defined,
for example, in analyses of genetic linkage. The prior probability that two genetic
loci are linked is known, so Bayesian statistics are routinely used in analysis of
genetic linkage. There is nothing controversial about using Bayesian inference
when the prior probabilities are known precisely.
In many situations, including the previous drug examples, the prior prob-
abilities are little more than a subjective feeling. These feelings can be expressed
as numbers (e.g., “99% sure” or “70% sure”), which are then treated as prior prob-
abilities. Of course, the calculated results (the FPRPs) are no more accurate than
are your estimates of the prior probabilities.
The examples in this chapter are a simplified form of Bayesian analysis. The
rows of Table 18.2 show only two possibilities. Either the null hypothesis is true,
or the null hypothesis is false and the true difference (or effect) is of a defined
size. A full Bayesian analysis would consider a range of possible effect sizes,
rather than only consider two possibilities.
than the usual threshold of 0.05 but not by very much. I have a choice
of believing that the results occurred by a coincidence that will happen
1 time in 25 under the null hypothesis or of believing that the experimental
hypothesis is true. Because the experimental hypothesis is so unlikely to
be correct, I think that the results are the result of coincidence. The null
hypothesis is probably true.
• This study tested a hypothesis that makes no biological sense and has not
been supported by any previous data. I’d be amazed if it turned out to be
true. The P value is incredibly low (0.000001). I’ve looked through the
details of the study and cannot identify any biases or flaws. These are repu-
table scientists, and I believe that they’ve reported their data honestly. I have
a choice of believing that the results occurred by a coincidence that will
happen 1 time in a million under the null hypothesis, or of believing that the
experimental hypothesis is true. Although the hypothesis seems crazy to me,
the data force me to believe it. The null hypothesis is probably false.
Focus on the last row of Table 18.3. The effect is tiny, 11.2% versus 10.0% of
the subjects had the outcome being studied. Even though the P value is small, it is
unlikely that such a small effect would be worth pursuing. It depends on the scien-
tific or clinical situation, but there are few fields where such a tiny association is
important. With huge sample sizes, it takes only a tiny effect to produce such a small
P value. In contrast, with tiny samples, it takes a huge effect to yield a small P value.
It is essential when reviewing scientific results to look at more than just the P value.
COMMON MISTAKES
Mistake: Believing that a “statistically significant” result proves
an effect is real
As this chapter points out, there are lots of reasons why a result can be statistically
significant.
Mistake: Not realizing that the FPRP depends on the scientific context
The FPRP depends upon the scientific context, as quantified by the prior probability.
Mistake: Thinking that a P value just barely less than 0.05 is strong
evidence against the null hypothesis
Table 18.2 tabulated the FPRP associated with P values just a tiny bit less than
0.05. When the prior probability is 50%, the FPRP is 27%. If the prior probability
is lower than 50% (as it often is in exploratory research), the FPRP is even higher.
Q&A
Is this chapter trying to tell me that it isn’t enough to determine whether a result is, or
is not, statistically significant? Instead, I actually have to think?
Yes!
How is it possible for an effect to be statistically significant but not scientifically signifi-
cant (important)?
You reach the conclusion that an effect is statistically significant when the P
value is less than 0.05. With large sample sizes, this can happen even when
the effect is tiny and irrelevant. The small P value tells us that the effect would
not often occur by chance but says nothing about whether the effect is large
enough to care about.
When is an effect large enough to care about, to be scientifically significant
(important)?
It depends on what you are measuring and why you are measuring it. This
question can only be answered by someone familiar with your particular field of
science. It is a scientific question, not a statistical one.
Does the context of the experiment (the prior probability) come into play when decid-
ing whether a result is statistically significant?
Only if you take into account prior probability when deciding on a value for
α. Once you have chosen α, the decision about when to call a result “statisti-
cally significant” depends only on the P value and not on the context of the
experiment.
Is the False Positive Report Probability (FPRP) the same as the False Positive Risk (FPR)
and the False Discovery Rate (FDR)?
Yes. The FPRP and FPR mean the same thing. The FDR is very similar, but usually is
used in the context of multiple comparisons rather than interpreting a single P value.
Does the context of the experiment (the prior probability) come into play when calcu-
lating the FPRP?
Yes. See Table 18.2.
Does your choice of a value for α influence the calculated value of the FPRP?
Yes. Your decision for a value of α determines the chance that a result will end
up in the first or second column of Table 18.1.
When a P value is less than 0.05 and thus the comparison is deemed to be statistically
significant, can you be 95% sure that the effect is real?
No! It depends on the situation, on the prior probability.
The FPRP answers the question: If a result is statistically significant, what is the chance
that it is a false positive? What is the complement of the FPRP called, the value that
answers the question: If a result is statistically significant, what is the chance that it is a
true positive?
One name is the posterior probability. That name goes along with prior prob-
ability, used earlier in this chapter. Posterior is the opposite of prior, so it refers to
a probability computed after collecting evidence. You estimate the prior prob-
ability based on existing data and theory. Then after collecting new evidence,
you compute the posterior probability by combining the prior probability and
the evidence. A synonym is the positive predictive value. The positive predictive
value or posterior probability is computed as 1-FPRP (if the FPRP is expressed as
a fraction), or 100%-FPRP% (if the FPRP is expressed as a percentage).
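To make the arithmetic concrete, here is a minimal sketch (in Python, which is not otherwise used in this book) of how the prior probability, α, power, the FPRP, and the posterior probability fit together. It treats every P value below α as "statistically significant"; the α of 0.05 and power of 80% are illustrative assumptions, not fixed choices.

def fprp(prior, alpha=0.05, power=0.80):
    # Of all results with P < alpha, what fraction are expected to be
    # false positives (results for which the null hypothesis is actually true)?
    false_positives = (1 - prior) * alpha   # null true, yet P < alpha
    true_positives = prior * power          # null false, effect detected
    return false_positives / (false_positives + true_positives)

def posterior(prior, alpha=0.05, power=0.80):
    # Posterior probability (positive predictive value) = 1 - FPRP
    return 1 - fprp(prior, alpha, power)

print(round(fprp(0.50), 3), round(posterior(0.50), 3))  # 0.059 and 0.941
print(round(fprp(0.01), 3), round(posterior(0.01), 3))  # 0.861 and 0.139

The same calculation reappears in Chapter 26, where it is used when choosing a sample size.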
CHAPTER SUMMARY
• P values depend on the size of the difference or association as well as
sample size. You’ll see a small P value either with a large effect observed in
tiny samples or with a tiny effect observed in large samples.
• A conclusion of statistical significance does not mean the difference is large
enough to be interesting, does not mean the results are intriguing enough
to be worthy of further investigation, and does not mean that the finding is
scientifically or clinically significant.
• Don’t mix up the significance level α (the threshold P value defining sta-
tistical significance) with the P value. You must choose α as part of the
experimental design. The P value, in contrast, is computed from the data.
• Also, don’t mix up the significance level α with the FPRP, which is the
chance that a statistically significant finding is actually due to a coinci-
dence of random sampling.
• The FPRP depends on the context of the study. In other words, it depends
on the prior probability that your hypothesis is true (based on prior data
and theory).
• Even if you don’t do formal Bayesian calculations, you should consider
prior knowledge and theory when interpreting data.
• A P value just barely less than 0.05 does not provide strong evidence against
the null hypothesis.
• The conclusion that a result is “statistically significant” sounds very
definitive. But in fact, there are many reasons for this designation.
Interpreting a Result That Is Not
Statistically Significant
Extraordinary claims require extraordinary proof.
CARL SAGAN
Explanation 1: The drug did not affect the enzyme you are studying
Scenario: The drug did not induce or activate the enzyme you are studying, so
the enzyme’s activity is the same (on average) in treated and control cells.
Discussion: This is, of course, the conclusion everyone jumps to when they see
the phrase “not statistically significant.” However, four other explanations are
possible.
A high P value does not prove the null hypothesis. Deciding not to reject
the null hypothesis is not the same as believing that the null hypothesis is
definitely true. The absence of evidence is not evidence of absence (Altman &
Bland, 1995).
                                            CONTROLS    HYPERTENSION
Number of subjects                          17          18
Mean receptor number (receptors/platelet)   263         257
SD                                          87          59
Interpreting the results requires knowing the 95% CI for the relative risk,
which a computer program can calculate. For this example, the 95% CI ranges
from 0.88 to 1.17.
Our data are certainly consistent with the null hypothesis, because the CI in-
cludes 1.0. This does not mean that the null hypothesis is true. Our CI tells us that
the data are also consistent (within 95% confidence) with relative risks ranging
from 0.88 to 1.17.
Here are three approaches to interpreting the results:
• The CI is centered on 1.0 (no difference) and is quite narrow. These data
convincingly show that routine use of ultrasound is neither helpful nor
harmful.
• The CI is narrow but not all that narrow. It certainly makes clinical sense
that the extra information provided by an ultrasound will help obstetricians
manage the pregnancy and might decrease the chance of a major problem.
The CI goes down to 0.88, a risk reduction of 12%. If I were pregnant, I’d
certainly want to use a risk-free technique that reduces the risk of a sick
or dead baby by as much as 12% (from 5.0% to 4.4%)! The data certainly
don’t prove that a routine ultrasound is beneficial, but the study leaves open
the possibility that routine ultrasound might reduce the rate of awful events
by as much as 12%.
• The CI goes as high as 1.17. That is a 17% relative increase in problems
(from 5.0% to 5.8%). Without data from a much bigger study, these data do
not convince me that ultrasounds are helpful and make me worry that they
might be harmful.
Statistics can’t help to resolve the differences among these three mindsets.
It all depends on how you interpret the relative risk of 0.88 and 1.17, how worried
you are about the possible risks of an ultrasound, and how you combine the data in
this study with data from other studies (I have no expertise in this field and have
not looked at other studies).
In interpreting the results of this example, you also need to think about ben-
efits and risks that don’t show up as a reduction of adverse outcomes. The ultra-
sound picture helps reassure parents that their baby is developing normally and
gives them a picture to bond with and show relatives. This can be valuable regard-
less of whether it reduces the chance of adverse outcomes. Although statistical
analyses focus on one outcome at a time, you must consider all the outcomes
when evaluating the results.
Q&A
If a P value is greater than 0.05, can you conclude that you have disproven the null
hypothesis?
No.
If you conclude that a result is not statistically significant, it is possible that you are
making a Type II error as a result of missing a real effect. What factors influence the
chance of this happening?
The probability of a Type II error depends on the significance level (α) you have
chosen, the sample size, and the size of the true effect.
By how much do you need to increase the sample size to make a CI half as wide?
A general rule of thumb is that increasing the sample size by a factor of 4 will cut
the expected width of the CI by a factor of 2. (Note that 2 is the square root of 4.)
What if I want to make the CI one-quarter as wide as it is?
Increasing the sample size by a factor of 16 will be expected to reduce the width
of the CI by a factor of 4. (Note that 4 is the square root of 16.)
Can a study result be consistent both with an effect existing and with it not
existing?
Yes! Clouds are not only consistent with rain but also with no rain. Clouds, like
noisy results, are inconclusive (Simonsohn, 2016).
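The sample-size rule of thumb from the Q&A above (quadruple the sample size to roughly halve the CI) can be checked numerically. This is a sketch of my own, not one of the book's examples; the SD of 10 and the sample sizes are made up, and it assumes a CI for a mean based on the t distribution (requires Python with NumPy and SciPy).

import numpy as np
from scipy import stats

def ci_half_width(sd, n, confidence=0.95):
    # Margin of error for a mean: t* multiplied by the standard error
    t_star = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    return t_star * sd / np.sqrt(n)

print(round(ci_half_width(10.0, 20), 2))   # about 4.7
print(round(ci_half_width(10.0, 80), 2))   # about 2.2, roughly half as wide

The shrinkage is slightly more than a factor of two, because the t multiplier also decreases a little as the sample size grows.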
CHAPTER SUMMARY
• If a statistical test computes a large P value, you should conclude that the
findings would not have been unusual if the null hypothesis were true.
• You should not conclude that the null hypothesis of no difference (or asso-
ciation, etc.) has been proven.
• When interpreting a high P value, the first thing to do is look at the size of
the effect.
• Also look at the CI of the effect.
• If the CI includes effect sizes that you would consider to be scientifically
important, then the study is inconclusive.
CHAPTER 20
Statistical Power
There are two kinds of statistics, the kind you look up, and the
kind you make up.
REX STOUT
The answer, called the power of the experiment, depends on four values:
• The sample size
• The amount of scatter (if comparing values of a continuous variable) or
starting proportion (if comparing proportions)
• The size of the effect you hypothesize exists
• The significance level you choose
Given these values, power is the fraction of experiments in which you would
expect to find a statistically significant result. By tradition, power is usually ex-
pressed as a percentage rather than a fraction.
                            DECISION: REJECT     DECISION: DO NOT REJECT
                            NULL HYPOTHESIS      NULL HYPOTHESIS            TOTAL
Null hypothesis is true     A                    B                          A + B
Null hypothesis is false    C                    D                          C + D
Table 20.1 (which repeats Table 18.2) shows the results of many hypothetical
statistical analyses, each analyzed to reach a conclusion of “statistically signifi-
cant” or “not statistically significant.” Assuming the null hypothesis is not true,
power is the fraction of experiments that reach a statistically significant conclu-
sion. So power equals C/(C + D).
value can’t really be calculated without knowing the prior probability and using
Bayesian thinking (see Chapter 18). Instead, let’s ask a different question: If the
tool really is in the basement, what is the chance your child would have found it?
The answer, of course, is: it depends. To estimate the probability, you’d want to
know three things:
• How long did he spend looking? If he looked for a long time, he is more
likely to have found the tool than if he looked for a short time. The time
spent looking for the tool is analogous to sample size. An experiment with
a large sample size has high power to find an effect, while an experiment
with a small sample size has less power.
• How big is the tool? It is easier to find a snow shovel than the tiny screw-
driver used to fix eyeglasses. The size of the tool is analogous to the size of
the effect you are looking for. An experiment has more power to find a big
effect than a small one.
• How messy is the basement? If the basement is a real mess, he was less
likely to find the tool than if it is carefully organized. The messiness is
analogous to experimental scatter. An experiment has more power when the
data are very tight (little variation), and less power when the data are very
scattered.
If the child spent a long time looking for a large tool in an organized base-
ment, there is a high chance that he would have found the tool if it were there. So
you could be quite confident of his conclusion that the tool isn’t there. Similarly,
an experiment has high power when you have a large sample size, are looking for
a large effect, and have data with little scatter (small SD). In this situation, there
is a high chance that you would have obtained a statistically significant effect if
the effect existed.
If the child spent a short time looking for a small tool in a messy basement,
his conclusion that the tool isn’t there doesn’t really mean very much. Even if the
tool were there, he probably would not have found it. Similarly, an experiment has
little power when you use a small sample size, are looking for a small effect, and
the data have lots of scatter. In this situation, there is a high chance of obtaining a
conclusion of “not statistically significant” even if the effect exists.
Table 20.2 summarizes this analogy and its connection to statistical analyses.
Table 20.2. The analogy between searching for a tool and statistical power.
[Figure: power (Y axis, 0% to 100%) plotted against the hypothetical difference between means (left panel, X axis 0 to 100) and against the relative risk (right panel, X axis 0.50 to 1.00).]
Figure 20.2. The relationship between P value and observed power when the signifi-
cance level (alpha) is set to 0.05.
Hoenig and Heisey (2001) derived the equation used to create this curve. The dotted lines
show that when the P value equals 0.05, the observed power equals 50%.
Hoenig and Heisey (2001) pointed out that the observed power can be com-
puted from the observed P value as well as the value of α you choose (usually
0.05). The observed power conveys no new information. Figure 20.2 shows the
relationship between P value and observed power, when α is set to 0.05.
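Here is a minimal sketch of that relationship for a two-sided z-test. This is an approximation of my own for illustration, not necessarily the exact equation behind Figure 20.2, but it reproduces the key feature that a P value equal to α implies about 50% observed power (requires Python with SciPy).

from scipy.stats import norm

def observed_power(p, alpha=0.05):
    # Post hoc power implied by a two-sided P value, assuming a z-test
    z_obs = norm.ppf(1 - p / 2)        # |z| corresponding to the P value
    z_crit = norm.ppf(1 - alpha / 2)   # critical value for alpha
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

print(round(observed_power(0.05), 2))  # 0.50, as the dotted lines show
print(round(observed_power(0.01), 2))  # about 0.73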
Q&A
In Figure 20.2, why is the observed power 50% when the P value equals 0.05?
If the P value is 0.05 in one experiment, that is your best guess for what it will be
in repeat experiments. You expect half the P values to be higher and half to be
lower. Only the half with lower P values will lead to the conclusion that the result
is statistically significant, so the power is 50%.
If I want to use a program to compute power, what questions will it ask?
To calculate power, you must enter the values for α, the expected SD, the
planned sample size, and the size of the hypothetical difference (or effect) you
are hoping to detect.
Why do I have to specify the effect size for which I am looking? I want to detect any
size effect.
All studies will have a small power to detect tiny effects and a large power to
detect enormous effects. You can’t calculate power without specifying the effect
size for which you are looking.
When does it make sense to do calculations involving power?
In two situations. When planning a study, you need to decide how large a
sample size to use. Those calculations will require you to specify how much
power you want to detect some hypothetical effect size. After completing a
study that has results that are not statistically significant, it can make sense to
ask how much power that study had to detect some specified effect size.
When calculating sample size, what value for desired power is traditionally used?
Most sample size calculations are done for 80% power. Of course, there is
nothing special about that value except tradition.
Can one do all sample size/power calculations using standard equations?
Usually. But in some cases it is necessary to run computer simulations to
compute the power of a proposed experimental design.
Power analyses traditionally set α = 0.05 and β = 0.20 (because power is set to 80%,
and 100% – 80% = 20% = 0.20). Using these traditional values implies that you accept
a chance of a Type II error that is four times the chance of a Type I error (because
0.20/0.05 = 4). Is there any justification for this ratio?
No. Since the relative “costs” of Type I and Type II errors depend on the scientific
context, so should the choices of α and β.
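To see how the four inputs listed earlier in this Q&A fit together, here is a sketch using the statsmodels library (an assumption of mine; the book itself does not use this software). All of the numbers, a difference of 8 units with an SD of 10 and 25 subjects per group, are made up for illustration.

from statsmodels.stats.power import TTestIndPower

alpha, sd, n_per_group, difference = 0.05, 10.0, 25, 8.0

power = TTestIndPower().power(effect_size=difference / sd,  # Cohen's d
                              nobs1=n_per_group,
                              alpha=alpha,
                              alternative='two-sided')
print(round(power, 2))   # roughly 0.79 for these illustrative inputs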
CHAPTER SUMMARY
• The power of an experimental design is the probability that you will
obtain a statistically significant result assuming a certain effect size in the
population.
• Beta (β) is defined to equal 1.0 minus power.
• The power of an experiment depends on sample size, variability, the choice
of α to define statistical significance, and the hypothetical effect size.
• For any combination of sample size, variability, and α, there will be a high
power to detect a huge effect and a small power to detect a tiny effect.
• Power should be computed based on the minimum effect size that would
be scientifically worth detecting, not on an effect observed in a prior
experiment.
• Once the study is complete, it is not very helpful to compute the power of
the study to detect the effect size that was actually observed.
Testing for Equivalence
or Noninferiority
The problem is not what you don’t know, but what you know
that ain’t so.
WILL ROGERS
Figure 21.2. Three drug formulations for which the ratio of peak concentrations are
within the equivalent zone.
For Drug A, the CI extends outside of the equivalent zone, so the results are inconclusive.
The results are consistent with the two drugs being equivalent or not equivalent. In Drugs B
and C, the 90% CIs lie completely within the equivalent zone, so the data demonstrate that
the two drugs are equivalent to the standard drug.
This is the case for Drugs B and C. Those 90% CIs lie entirely within the
equivalence zone, so the data demonstrate that Drugs B and C are equivalent to the
standard drug with which they are being compared.
In contrast, the 90% CI for Drug A is partly in the equivalence zone and
partly outside of it. The data are inconclusive.
Figure 21.3. Results of three drugs for which the mean ratio of peak concentrations is in
the not-equivalent zone.
With Drugs D and E, the CI includes both equivalent and not-equivalent zones, so the data
are not conclusive. The 90% CI for Drug F is entirely outside the equivalence zone, proving
that it is not equivalent to the standard drug.
The 90% CIs for Drugs D and E are partly within the equivalent zone and partly
outside it. The data are not conclusive.
The 90% CI for Drug F lies totally outside the equivalence zone. These data
prove that Drug F is not equivalent to the standard drug.
Figure 21.4. Applying the idea of statistical hypothesis testing to equivalence testing.
A conclusion of equivalence requires a statistically significant finding (P < 0.10) from two
different tests of two different null hypotheses, shown by the vertical lines. Each null
hypothesis is tested with a one-sided alternative hypothesis, shown by the arrows. Two
drugs are considered equivalent when the ratio of peak concentrations is significantly
greater than 80% and significantly less than 125%.
Juggling two null hypotheses and two P values (each one sided or one-tailed;
the two terms are synonyms) is not for the statistical novice. The results are the
same as those obtained using the CI approach described in the previous discus-
sion. The CI approach is much easier to understand.
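As a sketch of the CI approach, the snippet below computes a 90% CI for the geometric mean ratio of peak concentrations and asks whether it lies entirely within the 80%–125% equivalence zone. Analyzing the logarithms of the ratios is an assumption of this sketch (a common convention for concentration ratios), not necessarily the calculation behind the figures in this chapter. The simulated data are made up (requires Python with NumPy and SciPy).

import numpy as np
from scipy import stats

def equivalent_by_ci(log_ratios, low=0.80, high=1.25, confidence=0.90):
    # 90% CI for the geometric mean ratio (new drug / standard drug)
    log_ratios = np.asarray(log_ratios, dtype=float)
    n = len(log_ratios)
    t_star = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    margin = t_star * stats.sem(log_ratios)
    ci_low = np.exp(log_ratios.mean() - margin)
    ci_high = np.exp(log_ratios.mean() + margin)
    # Equivalent only if the whole CI sits inside the equivalence zone
    return (ci_low, ci_high), (ci_low >= low and ci_high <= high)

rng = np.random.default_rng(1)
simulated = rng.normal(loc=0.02, scale=0.08, size=24)  # log ratios, one per subject
print(equivalent_by_ci(simulated))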
NONINFERIORITY TESTS
Equivalence trials attempt to prove that a new treatment or drug works about the
same as the standard treatment. Noninferiority trials attempt to prove that a new
treatment is not worse than the standard treatment.
To prove equivalence, all parts of the CI must be within the equivalence zone.
To prove noninferiority, all parts of the CI must be to the right of the left (lower)
border of the equivalence zone. The entire CI, therefore, is in a range that either
shows the new drug is superior or shows that the new drug is slightly inferior but
still in the zone defined to be practically equivalent. In Figure 21.2, Drugs B and C
are noninferior. In Figure 21.3, all three drugs (D, E, and F) are noninferior. For all
five of these drugs, the lower confidence limit exceeds 80%.
Table 21.1 (adapted from Walker and Nowacki, 2010) summarizes the differ-
ences between testing for differences, equivalence and noninferiority.
TEST FOR . . .    NULL HYPOTHESIS                        ALTERNATIVE HYPOTHESIS
                                                         (CONCLUSION IF P VALUE IS SMALL)
Difference        No difference between treatments       Nonzero difference
Equivalence       A difference large enough to matter    Either no difference or a difference
                                                         too small to matter
Noninferiority    Experimental treatment is worse        Experimental treatment is either
                  than the standard treatment            equivalent to the standard or better.
                                                         It is not worse.
Q&A
Why isn’t a conclusion that a difference is not statistically significant enough to prove
equivalence?
The P value from a standard statistical test, and thus the conclusion about
whether an effect is statistically significant, is based entirely on analyzing the
data. A conclusion about equivalence has to take the context into account.
What is equivalent for one variable in one situation is not equivalent for another
variable in another context.
Is it possible for a difference to be statistically significant but for the data to prove
equivalence?
Surprisingly, yes. The conclusion that the difference is statistically significant just
means the data convince you that the true difference is not zero. It doesn’t tell you
that the difference is large enough to care about. It is possible for the entire CI to
include values that you consider to be equivalent. Look at Drug C in Figure 21.2.
Why 90% CIs instead of 95%?
Tests for equivalence use 90% CIs (I had this wrong in the prior edition). But the
conclusions are for 95% confidence. The reason is complicated. Essentially you
are looking at two one-sided CIs. So using a 90% CI yields a 95% confidence. Yes,
this is confusing and not at all obvious.
How is testing for noninferiority different than testing for superiority?
Although it might initially appear so, the double negatives are not all that
confusing. When testing for noninferiority (in the example presented in this
chapter), you are asking if the data prove that a drug is not worse than a
standard drug. You will conclude that Drug A is “not worse” than Drug B when
the two drugs are equivalent or when Drug A is better.
CHAPTER SUMMARY
• In many scientific and clinical investigations, the goal is not to find out
whether one treatment causes a substantially different effect than another
treatment. Instead, the goal is to find out whether the effects of a new treat-
ment are equivalent to (or not inferior to) that of a standard treatment.
• The usual approach of statistical hypothesis testing does not test for equiva-
lence. Determining whether a difference between two treatments is statisti-
cally significant tells you nothing about whether the two treatments are
equivalent.
• There is no point testing whether a new treatment is equivalent to a standard
one unless you are really sure that the standard treatment works.
• Equivalence trials attempt to prove that a new treatment or drug works
about the same as the standard treatment. Noninferiority trials attempt to
prove that a new treatment is not worse than the standard treatment.
Challenges in Statistics
CHAPTER 22
Multiple Comparisons Concepts
If you torture your data long enough, they will tell you what-
ever you want to hear.
MILLS (1993)
[Figure 22.1: the chance of obtaining one or more "statistically significant" results (Y axis) as a function of the number of independent comparisons (X axis, 0 to 60), when all null hypotheses are true.]
With more than 13 comparisons, it is more likely than not that one or more
conclusions will be statistically significant just by chance. With 100 independent
null hypotheses that are all true, the chance of obtaining at least one statistically
significant P value is 99%.
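The numbers quoted above follow from simple arithmetic: with k independent comparisons and every null hypothesis true, the chance of at least one P value below 0.05 is 1 − 0.95^k. A quick check (this sketch is mine, not the book's code):

def chance_of_at_least_one(k, alpha=0.05):
    # Probability of at least one 'significant' result among k independent
    # comparisons when every null hypothesis is true
    return 1 - (1 - alpha) ** k

for k in (1, 13, 14, 100):
    print(k, round(chance_of_at_least_one(k), 3))
# 1: 0.05   13: 0.487   14: 0.512   100: 0.994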
Corrections for multiple comparisons are not essential when you have
clearly defined one outcome as primary and the others as secondary
Many clinical trials clearly define, as part of the study protocol, that one out-
come is primary. This is the key prespecified outcome on which the conclusion
of the study is based. The study may do other secondary comparisons, but those
are clearly labeled as secondary. Correction for multiple comparisons is often not
used with a set of secondary comparisons.
18,000 people. Half received a statin drug to lower LDL cholesterol and half
received a placebo.
The investigators’ primary goal (planned as part of the protocol) was to com-
pare the number of end points that occurred in the two groups, including deaths
from a heart attack or stroke, nonfatal heart attacks or strokes, and hospitalization
for chest pain. These events happened about half as often to people treated with
the drug compared with people taking the placebo. The drug worked.
The investigators also analyzed each of the end points separately. Those
people taking the drug (compared with those taking the placebo) had fewer deaths,
fewer heart attacks, fewer strokes, and fewer hospitalizations for chest pain.
The data from various demographic groups were then analyzed separately.
Separate analyses were done for men and women, old and young, smokers and
nonsmokers, people with hypertension and those without, people with a family
history of heart disease and those without, and so on. In each of 25 subgroups,
patients receiving the drug experienced fewer primary end points than those
taking the placebo, and all of these effects were statistically significant.
The investigators made no correction for multiple comparisons for all these
separate analyses of outcomes and subgroups, because these were planned as
secondary analyses. The reader does not need to try to informally correct for
multiple comparisons, because the results are so consistent. The multiple com-
parisons each ask the same basic question, and all the comparisons lead to the
same conclusion—people taking the drug had fewer cardiovascular events than
those taking the placebo. In contrast, correction for multiple comparisons would
be essential if the results showed that the drug worked in a few subsets of patients
but not in other subsets.
NUMBER OF SIGNIFICANT
COMPARISONS            NO CORRECTION    BONFERRONI
Zero                   35.8%            95.1%
One                    37.7%             4.8%
Two or more            26.4%             0.1%
Table 22.1. How many significant results will you find in 20 comparisons?
This table assumes you are making 20 comparisons, that all 20 null hypotheses are true, and
that α is set to its conventional value of 0.05. If there is no correction for multiple compari-
sons, there is only a 36% chance of observing no statistically significant findings. With the
Bonferroni correction, this probability goes up to 95%.
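The percentages in Table 22.1 can be reproduced from the binomial distribution, because each of the 20 comparisons is an independent "trial" with a fixed chance of being declared significant. A sketch (my code, not the book's; requires SciPy):

from scipy.stats import binom

n_comparisons, alpha = 20, 0.05
for label, threshold in (("No correction", alpha),
                         ("Bonferroni", alpha / n_comparisons)):
    p_zero = binom.pmf(0, n_comparisons, threshold)
    p_one = binom.pmf(1, n_comparisons, threshold)
    print(label, round(p_zero, 3), round(p_one, 3), round(1 - p_zero - p_one, 3))
# No correction: 0.358 0.377 0.264    Bonferroni: 0.951 0.048 0.001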
to over 100,000 nurses in 1980. From the questionnaires, they determined the
participants’ intake of vitamins A, C, and E and divided the women into quintiles
for each vitamin (i.e., the first quintile contains the 20% of the women who con-
sumed the smallest amount). They then followed these women for eight years to
determine the incidence rate of breast cancer. Using a test called the chi-square
test for trend, the investigators calculated a P value to test the null hypothesis that
there is no linear trend between vitamin-intake quintile and the incidence of breast
cancer. There would be a linear trend if increasing vitamin intake was associated
with increasing (or decreasing) incidence of breast cancer. There would not be a
linear trend if (for example) the lowest and highest quintiles had a low incidence
of breast cancer compared with the three middle quintiles. The authors deter-
mined a different P value for each vitamin. For Vitamin C, P = 0.60; for Vitamin
E, P = 0.07; and for Vitamin A, P = 0.001.
Interpreting each P value is easy: if the null hypothesis is true, the P value
is the chance that random selection of subjects will result in as large (or larger) a
linear trend as was observed in this study. If the null hypothesis is true, there is a 5%
chance of randomly selecting subjects such that the trend is statistically significant.
If no correction is made for multiple comparisons, there is a 14% chance
of observing one or more significant P values, even if all three null hypotheses
are true. The Bonferroni method sets a stricter significance threshold by dividing
the significance level (0.05) by the number of comparisons (three), so a differ-
ence is declared statistically significant only when its P value is less than 0.050/3,
or 0.017. According to this criterion, the relationship between Vitamin A intake
and the incidence of breast cancer is statistically significant, but the intakes of
Vitamins C and E are not significantly related to the incidence of breast cancer.
The terminology can be confusing. The significance level is still 5%, so α still
equals 0.05. But now the significance level applies to the family of comparisons.
The lower threshold (0.017) is used to decide whether each particular comparison
is statistically significant, but α (now the familywise error rate) remains 0.05.
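For the three vitamin comparisons, the Bonferroni calculation is just a threshold comparison. A minimal sketch (the P values are the ones quoted above):

p_values = {"Vitamin C": 0.60, "Vitamin E": 0.07, "Vitamin A": 0.001}
alpha = 0.05
threshold = alpha / len(p_values)    # 0.05 / 3, about 0.017
for vitamin, p in p_values.items():
    verdict = "significant" if p < threshold else "not significant"
    print(vitamin, p, verdict)       # only Vitamin A falls below the threshold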
controls are not well matched and have different ancestry (say the patients are
largely of Italian ancestry and the controls are largely Jewish), you’d expect ge-
netic differences between the two groups that have nothing to do with the disease
being studied (see “The Challenge of Case-Control Studies” in Chapter 28).
Lingo: FDR
This approach does not use the term statistically significant but instead uses the
term discovery. A finding is deemed to be a discovery when its P value is lower
than a certain threshold. A discovery is false when the null hypothesis is actually
true for that comparison. The FDR is the answer to these two equivalent questions
(this definition actually defines the positive FDR, or pFDR, but the distinction
between the pFDR and the FDR is subtle and won’t be explained in this book):
• If a comparison is classified as a discovery, what is the chance that the null
hypothesis is true?
• Of all discoveries, what fraction is expected to be false?
Q (the desired FDR) to 5%. If all the null hypotheses were true, you’d expect that
the smallest P value would be about 1/100, or 1%. Multiply that value by Q. So
you declare the smallest P value to be a discovery if its P value is less than 0.0005.
You’d expect the second-smallest P value to be about 2/100, or 0.02. So you’d call
that comparison a discovery if its P value is less than 0.0010. The threshold for
the third-smallest P value is 0.0015, and so on. The comparison with the largest
P value is called a discovery only if its value is less than 0.05. This description is
a bit simplified but does provide the general idea behind the method. This is only
one of several methods used to control the FDR.
If you set Q for the FDR method to equal alpha in the conventional method,
note these similarities. For the smallest P value, the threshold used for the FDR
method is α/k, which is the same threshold used by the Bonferroni method. For
the largest P value, the threshold for the FDR method is α, which is equivalent to
not correcting for multiple comparisons. P values between the smallest and largest
are compared to a threshold that is a blend of the two methods.
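Here is a sketch of the step-up procedure just described; this particular version is usually attributed to Benjamini and Hochberg. The code is mine, written to mirror the verbal description (the multipletests function in statsmodels, with method='fdr_bh', performs an equivalent calculation), and the example P values are made up.

import numpy as np

def fdr_discoveries(p_values, q=0.05):
    # Step-up rule: compare the i-th smallest of m P values with (i/m)*Q, then
    # call every P value up to the largest one under its threshold a discovery.
    p = np.sort(np.asarray(p_values, dtype=float))
    m = len(p)
    thresholds = q * np.arange(1, m + 1) / m
    under = np.nonzero(p <= thresholds)[0]
    if under.size == 0:
        return np.array([])                 # no discoveries
    return p[: under.max() + 1]

example = [0.0001, 0.0004, 0.0012, 0.011, 0.02, 0.049, 0.1, 0.2, 0.35, 0.6]
print(fdr_discoveries(example, q=0.05))     # the five smallest are discoveries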
Table 22.2. This table (identical to Table 18.2) shows the results of many statistical
analyses, each analyzed to reach a decision to reject or not reject the null hypothesis.
The top row tabulates results for experiments for which the null hypothesis is really true.
The second row tabulates experiments for which the null hypothesis is not true. When you
analyze data, you don’t know whether the null hypothesis is true, so you could never create
this table from an actual series of experiments. A, B, C, and D are integers (not proportions)
that count the number of analyses.
Q&A
If you make 10 independent comparisons and all null hypotheses are true, what is the
chance that none will be statistically significant?
If you use the usual 5% significance level, the probability that each test will be
not statistically significant is 0.95. The chance that all 10 will be not significant is
0.9510, or 59.9%.
Is the definition of a family of comparisons always clear?
No, it is a somewhat arbitrary definition, and different people seeing the same
data could apply that definition differently.
If you make only a few planned comparisons, why don’t you have to correct for those
comparisons?
It makes sense to correct for all comparisons that were made, whether planned
or not. But some texts say that if you only plan a few comparisons, you
should get rewarded by not having to correct for multiple comparisons. This
recommendation doesn’t really make sense to me.
When using the Bonferroni correction, does the variable α refer to the overall family-
wide significance level (usually 5%) or the threshold used to decide when a particular
comparison is statistically significant (usually 0.05/K, where K is the number of
comparisons)?
The significance level α refers to the overall familywide significance level (usually 5%).
Will you know when you make a false discovery?
No. You never know for sure whether the null hypothesis is really true, so you
won’t know when a discovery is false. Similarly, you won’t know when you make
Type I or Type II errors with statistical hypothesis testing.
Are there other ways to deal with multiple comparisons?
Yes. One way is to fit a multilevel hierarchical model, but this requires special
expertise and is not commonly used, at least in biology (Gelman, 2012).
Physicists refer to the “look elsewhere effect.” How does that relate to multiple
comparisons?
It is another term for the same concept. When physicists search a tracing for a
signal, they won’t get too excited when they find a small one, knowing that they
also looked elsewhere for a signal. They need to account for all the places they
looked, which is the same as accounting for multiple comparisons.
CHAPTER SUMMARY
• The multiple comparisons problem is clear. If you make lots of compari-
sons (and make no special correction for the multiple comparisons), you
are likely to find some statistically significant results just by chance.
• Coping with multiple comparisons is one of the biggest challenges in data
analysis.
• One way to deal with multiple comparisons is to analyze the data as usual
but to fully report the number of comparisons that were made and then let
the reader account for the number of comparisons.
• If you make 13 independent comparisons with all null hypotheses true, there
is about a 50:50 chance that one or more P values will be less than 0.05.
• The most common approach to multiple comparisons is to define the sig-
nificance level to apply to an entire family of comparisons, rather than to
each individual comparison.
• A newer approach to multiple comparisons is to control or define the FDR.
The Ubiquity of Multiple
Comparisons
If the fishing expedition catches a boot, the fishermen should
throw it back, and not claim that they were fishing for boots.
JAMES L. MILLS
OVERVIEW
Berry (2007, p. 155) concisely summarized the importance and ubiquity of mul-
tiple comparisons:
Most scientists are oblivious to the problems of multiplicities. Yet they are every-
where. In one or more of its forms, multiplicities are present in every statistical
application. They may be out in the open or hidden. And even if they are out in
the open, recognizing them is but the first step in a difficult process of inference.
Problems of multiplicities are the most difficult that we statisticians face. They
threaten the validity of every statistical conclusion.
This chapter points out some of the ways in which the problem of multiple com-
parisons affects the interpretation of many statistical results.
the data as if the two random groups were actually given two distinct treatments.
As expected, the survival of the two groups was indistinguishable.
They then divided the patients in each group into six subgroups depending on
whether they had disease in one, two, or three coronary arteries and whether the
heart ventricle contracted normally. Because these are variables that are expected
to affect survival of the patients, it made sense to evaluate the response to “treat-
ment” separately in each of the six subgroups. Whereas they found no substantial
difference in five of the subgroups, they found a striking result among patients
with three-vessel disease who also had impaired ventricular contraction. Of these
patients, those assigned to Treatment B had much better survival than those as-
signed to Treatment A. The difference between the two survival curves was statis-
tically significant, with a P value less than 0.025.
If this were an actual study comparing two alternative treatments, it would
be tempting to conclude that Treatment B is superior for the sickest patients and
to recommend Treatment B to those patients in the future. But in this study, the
random assignment to Treatment A or Treatment B did not alter how the patients
were actually treated. Because the two sets of patients were treated identically, we
can be absolutely, positively sure that the survival difference was a coincidence.
It is not surprising that the authors found one low P value out of six com-
parisons. Figure 22.1 shows that there is a 26% chance that at least one of six
independent comparisons will have a P value of less than 0.05, even if all null
hypotheses are true.
If the difference between treatments had not been statistically significant in
any of the six subgroups used in this example, the investigators probably would
have kept looking. Perhaps they would have gone on to separately compare Treat-
ments A and B only in patients who had previously had a myocardial infarction
(heart attack), only in those with abnormal electrocardiogram findings, or only
those who had previous coronary artery surgery. Since you don’t know how
many comparisons were possible, you cannot rigorously account for the multiple
comparisons.
For analyses of subgroups to be useful, it is essential that the study design
specify all subgroups analyses, including methods to be used to account for mul-
tiple comparisons.
incidence of heart failure among all others (people born under all other 11 signs
combined into one group). Taken at face value, this comparison showed that the
people born under Pisces have a statistically significant higher incidence of heart
failure than do people born under the other 11 signs (the P value was 0.026).
The problem is that the investigators didn’t really test a single hypothesis;
they implicitly tested 12. They only focused on Pisces after looking at the inci-
dence of heart failure for people born under all 12 astrological signs. So it isn’t
fair to compare that one group against the others without considering the other
11 implicit comparisons.
Let’s do the math. If you do 12 independent comparisons, what is the chance
that every P value will be greater than 0.026? Since the probabilities of indepen-
dent comparisons are independent of each other, we can multiply them. The prob-
ability of getting a P value greater than 0.026 in one comparison is 1.0 − 0.026, or
0.974. The probability of two P values both being greater than 0.026 in two inde-
pendent comparisons is 0.974 × 0.974 = 0.949. The chance that all 12 independent
comparisons would have P values greater than 0.026 is 0.97412 = 0.729. Subtract
that number from 1.00 to find the probability that at least one P value of 12 will be
less than 0.026. The answer is 0.271. In other words, there is a 27.1% chance that
one or more of the 12 comparisons will have a P value less than or equal to 0.026
just by chance, even if all 12 null hypotheses are true. Once you realize this, the
association between astrological sign and heart failure does not seem impressive.
Freedman then selected the 15 input variables that happened to have the
lowest P values (less than 0.25) and refit the multiple regression model using only
those 15 input variables, rather than all 50 as before. Now the overall P value was
tiny (0.0005). Of the 15 variables now in the model, 6 had an associated P value
smaller than 0.05.
Anyone who saw only this analysis and didn’t know the 15 input variables
were selected from a larger group of 50 would conclude that the outcome variable
is related to, and so can be predicted from, the 15 input variables. Because these
are simulated data, we know that this conclusion would be wrong. There is, in fact,
no consistent relationship between the input variables and the outcome variable.
By including so many variables and selecting a subset that happened to be
predictive, the investigators were performing multiple comparisons. Freedman
knew this, and that is the entire point of his paper. But the use of this kind of
variable selection happens often in analyzing large studies. When multiple re-
gression analyses include lots of opportunity for variable selection, it is easy
to be misled by the results, especially if the investigator doesn’t explain exactly
what she has done.
If you analyze your data many ways—perhaps first with a t test, then with a
nonparametric Mann–Whitney test, and then with a two-way ANOVA (adjusting
for another variable)—you are performing multiple comparisons.
looking only at subgroups, removing outliers, analyzing the data after transform-
ing to logarithms, switching to nonparametric tests, and so on. These methods
would increase the false positive rate even further.
If investigators use even a few researcher degrees of freedom (i.e., allow
themselves to p-hack), the chance of finding a bogus (false positive) “statisti-
cally significant” result increases. Statistical results can only be interpreted at
face value if the analyses were planned in advance and the investigators don’t
try alternative analyses once the data are collected. When the number of possible
analyses is not defined in advance (and is almost unlimited), results simply cannot
be trusted.
Q&A
Why does this chapter have only one Q&A when the other chapters have many?
Comparing chapters is a form of multiple comparisons you should avoid.
CHAPTER SUMMARY
• The problem of multiple comparisons shows up in many situations:
◆◆ Multiple end points
◆◆ Multiple time points
◆◆ Multiple subgroups
◆◆ Multiple geographical areas
◆◆ Multiple predictions
◆◆ Multiple geographical groups
◆◆ Multiple ways to select variables in multiple regression
◆◆ Multiple methods of preprocessing data
Normality Tests
You need the subjunctive to explain statistics.
DAVID COLQUHOUN
[Figure: histograms of body temperature (°C, X axis from 35 to 38) versus number of people (Y axis). The top row shows four random samples of n = 12; the bottom row shows four random samples of n = 130.]
QQ PLOTS
When investigators want to show that the data are compatible with the assumption
of sampling from a Gaussian distribution, they sometimes show a QQ plot, like
the ones in Figure 24.3. The Y axis plots the actual values in a sample of data.
The X axis plots ideal predicted values assuming the data were sampled from a
Gaussian distribution.
If the data are truly sampled from a Gaussian distribution, all the points will
be close to the line of identity (shown in a dotted line in Figure 24.3).
[Figure 24.3: QQ plots. The actual data (Y axis) are plotted against the values predicted from a Gaussian distribution (X axis) for two samples (left and right panels), with the line of identity shown as a dotted line.]
The data
in the left panel of the figure are sampled from a Gaussian distribution, and the
points lie pretty close to the line of identity.
Systematic deviation of points from the line of identity is evidence that the
data are not sampled from a Gaussian distribution. But it is hard to say how much
deviation is more than you’d expect to see by chance. It takes some experience to
interpret a QQ plot. The data in the right panel of Figure 24.3 were not sampled
from a Gaussian distribution (but rather from a lognormal distribution), and you
can see that the points systematically deviate from the line of identity.
Here is a brief explanation of how a QQ plot is created. First, the program
computes the percentile of each value in the data set. Next, for each percentile, it
calculates how many SDs from the mean you need to go to reach that percentile
on a Gaussian distribution. Finally, using the actual mean and SD computed from
the data, it calculates the corresponding predicted value.
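The recipe just described can be written out directly. This sketch is my own (scipy.stats.probplot produces similar plots); the percentile convention used for the ideal quantiles is one common choice among several.

import numpy as np
from scipy import stats

def qq_points(values):
    # Predicted (X) and actual (Y) values for a QQ plot, following the
    # three steps in the text: percentiles, SDs from the mean, then rescale.
    actual = np.sort(np.asarray(values, dtype=float))
    n = len(actual)
    percentiles = (np.arange(1, n + 1) - 0.5) / n          # one common convention
    z = stats.norm.ppf(percentiles)                        # SDs from the mean
    predicted = actual.mean() + z * actual.std(ddof=1)
    return predicted, actual

rng = np.random.default_rng(2)
predicted, actual = qq_points(rng.normal(37.0, 0.4, size=130))
# Plot actual against predicted; Gaussian data should hug the line of identity.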
Why the strange name QQ? Q is an abbreviation for quantile, which is the
same as a percentile but expressed as a fraction. The 95th percentile is equivalent
to a quantile of 0.95. The 50th percentile (median) is the same as a quantile of
0.5. The name QQ plot stuck even though quantiles don’t actually appear on most
versions of QQ plots.
always report a small P value, even if the distribution only mildly deviates from
a Gaussian distribution.
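A quick simulation makes the point. The distribution below deviates only mildly from Gaussian, yet a normality test (here the D'Agostino–Pearson test in SciPy; the choice of test and all of the numbers are illustrative assumptions) reports a tiny P value once the sample is large.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
z = rng.normal(size=5000)
mildly_skewed = z + 0.05 * z**2              # slight, barely visible skew

stat_big, p_big = stats.normaltest(mildly_skewed)           # n = 5000
stat_small, p_small = stats.normaltest(mildly_skewed[:50])  # n = 50
print(p_big)    # tiny: the mild deviation is 'statistically significant'
print(p_small)  # usually large: the same shape goes undetected in a small sample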
ALTERNATIVES TO ASSUMING
A GAUSSIAN DISTRIBUTION
If you don’t wish to assume that your data were sampled from a Gaussian
distribution, you have several choices:
• Identify another distribution from which the data were sampled. In many cases,
you can then transform your values to create a Gaussian distribution. Most
common, if the data come from a lognormal distribution (see Chapter 11), you
can transform all values to their logarithms.
• Ignore small departures from the Gaussian ideal. Statistical tests tend to be
quite robust to mild violations of the Gaussian assumption.
• Identify and remove outliers (see Chapter 25).
• Switch to a nonparametric test that doesn’t assume a Gaussian distribution
(see Chapter 41).
Don’t make the mistake of jumping directly to the fourth option, using a
nonparametric test. It is very hard to decide when to use a statistical test based on
a Gaussian distribution and when to use a nonparametric test. This decision really
is difficult, requiring thinking, perspective, and consistency. For that reason, the
decision should not be automated.
Q&A
CHAPTER SUMMARY
• The Gaussian distribution is an ideal that is rarely achieved. Few, if any,
scientific variables completely follow a Gaussian distribution.
• Data sampled from Gaussian distributions don’t look as Gaussian as many
expect. Random sampling is more random than many appreciate.
• Normality tests are used to test for deviations from the Gaussian ideal.
• A small P value from a normality test only tells you that the deviations
from the Gaussian ideal are more than you’d expect to see by chance. It tells
you nothing about whether the deviations from the Gaussian ideal are large
enough to affect conclusions from statistical tests that assume sampling
from a Gaussian distribution.
Outliers
There are liars, outliers, and out-and-out liars.
ROBERT DAWSON
Figure 25.1. No outliers here. These data were sampled from Gaussian distributions.
All of these data sets were computer generated and sampled from a Gaussian distribution.
But when you look at them, some points just seem too far from the rest to be part of the same
distribution. They seem like real outliers—but they are not. The human brain is very good at
seeing patterns and exceptions from patterns, but it is poor at recognizing random scatter.
It would seem that the presence of outliers would be obvious. If this were
the case, we could deal with outliers informally. But identifying outliers is much
harder than it seems.
Figure 25.1 shows the problem with attempting to identify outliers informally.
It shows 18 data sets all sampled from a Gaussian distribution. Half of the samples
have five values and half have 24 values. When you look at the graph, some points
just seem to be too far from the rest. It seems obvious they are outliers. But, in
fact, all of these values were sampled from the same Gaussian distribution.
One problem with ad hoc removal of outliers is our tendency to see too many
outliers. Another problem is that the experimenter is almost always biased. Even if
you try to be fair and objective, your decision about which outliers to remove will
probably be influenced by the results you want to see.
missing value (or configured the program incorrectly), that code might
appear to be an outlier.
• Did you notice a problem during the experiment? If so, don’t bother with
outlier tests. Eliminate values if you noticed a problem with a value during
the experiment.
• Could the extreme values be a result of biological variability? If each value
comes from a different person or animal, the outlier may be a correct value.
It is an outlier not because of an experimental mistake, but rather because
that individual is different from the others. This may be the most exciting
finding in your data!
• Is it possible the distribution is not Gaussian? If so, it may be possible to
transform the values to make them Gaussian. Most outlier tests assume that
the data (except the potential outliers) come from a Gaussian distribution.
OUTLIER TESTS
The question an outlier test answers
If you’ve answered no to all five of the previous questions, two possibilities remain:
• The extreme value came from the same distribution as the other values and
just happened to be larger (or smaller) than the rest. In this case, the value
should not be treated specially.
• The extreme value was the result of a mistake. This could be something
like bad pipetting, a voltage spike, or holes in filters. Or it could be a mis-
take in recording a value. These kinds of mistakes can happen and are not
always noticed during data collection. Because including an erroneous
value in your analyses will give invalid results, you should remove it. In
other words, the value comes from a different population than the other
values and is misleading.
The problem, of course, is that you can never be sure which of these pos-
sibilities is correct. Mistake or chance? No mathematical calculation can tell you
for sure whether the outlier came from the same or a different population than the
others. An outlier test, however, can answer this question: If the values really were
all sampled from a Gaussian distribution, what is the chance one value would be
as far from the others as the extreme value you observed?
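As an illustration, here is a minimal sketch of the Grubbs test (mentioned with Figure 25.2), which answers that question for the single most extreme value by comparing a test statistic to a critical value at α = 0.05. The data are made up, and this sketch reports only a yes/no verdict rather than a P value.

import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    # Two-sided Grubbs test for a single outlier; assumes the remaining
    # values were sampled from a Gaussian distribution.
    x = np.asarray(values, dtype=float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)        # test statistic
    t = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit, g > g_crit     # True: flag the most extreme value

print(grubbs_test([9.8, 10.1, 10.3, 9.9, 10.0, 14.2]))  # 14.2 is flagged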
same distribution as the others. All you can say is that there is no strong evidence
that the value came from a different distribution.
of data analysis that are not much affected by the presence of outliers are called
robust. You don’t need to decide when to eliminate an outlier, because the method
is designed to accommodate them. Outliers just automatically fade away.
The simplest robust statistic is the median. If one value is very high or very
low, the value of the median won’t change, whereas the value of the mean will
change a lot. The median is robust; the mean is not.
To learn more about robust statistics, start with the book by Huber (2003).
LINGO: OUTLIER
The term outlier is confusing because it gets used in three contexts:
When you see the term outlier used, make sure you understand whether the
term is used informally as a value that seems a bit far from the rest, or is used
formally as a point that has been identified by an outlier test.
Figure 25.2. No outliers here. These data were sampled from lognormal distributions.
(Left) The four data sets were randomly sampled by a computer from a lognormal distribu-
tion. A Grubbs outlier test found a significant (P < 0.05) outlier in three of the four data sets.
(Right) Graph of the same values after being transformed to their logarithms. No outliers
were found. Most outlier tests are simply inappropriate when values are not sampled from a
Gaussian distribution.
outlier test would find an outlier in that set of data. Don’t mistakenly interpret the
5% significance level to mean that each value has a 5% chance of being identified
as an outlier.
Q&A
CHAPTER SUMMARY
• Outliers are values that lie very far from the other values in the data set.
• The presence of a true outlier can lead to misleading results.
• If you try to remove outliers manually, you are likely to be fooled. Random
sampling from a Gaussian distribution creates values that often appear to
be outliers but are not.
• Before using an outlier test, make sure the extreme value wasn’t simply a
mistake in data entry.
• Don’t try to remove outliers if it is likely that the outlier reflects true bio-
logical variability.
• Beware of multiple outliers. Most outlier tests are designed to detect one
outlier and do a bad job of detecting multiple outliers. In fact, the presence
of a second outlier can cause the outlier test to miss the first outlier.
• Beware of lognormal distributions. Very high values are expected in log-
normal distributions, but these can easily be mistaken for outliers.
• Rather than removing outliers, consider using robust statistical methods
that are not much influenced by the presence of outliers.
Choosing a Sample Size
There are lots of ways to mismanage experiments, but using
the wrong sample size should not be one of them.
PAUL MATHEWS
Many experiments and clinical trials are run with a sample size that is
too small, so even substantial treatment effects may go undetected.
When planning a study, therefore, you must choose an appropriate sample
size. This chapter explains how to understand a sample size calculation
and how to compute sample size for simple experimental designs. Make
sure you understand the concept of statistical power in Chapter 20 before
reading this chapter. Sample size calculations for multiple and logistic
regression are briefly covered in Chapters 37 and 38, respectively. Sample
size for nonparametric tests is discussed in Chapter 41.
[Figure: two panels plotted against sample size per group (X axis, 0 to 100). The left panel's Y axis is in percent (the panel is annotated σ = 10); the right panel's Y axis runs from 0 to 15 (annotated w = 5).]
the same calculations, programs, or tables as those used for computing sample size
needed to achieve statistical significance. Here are the similarities and differences:
• Instead of entering the smallest difference you wish to detect, enter the
largest acceptable margin of error, which equals half the width of the widest
CI you’d find acceptable.
• Calculate α as 100% minus the confidence level you want. For 95% confi-
dence intervals, enter α as 5% or 0.05.
• Estimate SD the same as you would for the usual methods.
• Enter power as 50%. But see the next paragraph to understand what this
means.
When calculating sample size to achieve a specified margin of error, it is
customary to do the calculations so the expected average margin of error will
equal the value you specified. About half the time, the margin of error will be
narrower than that specified value, and the other half of time the margin of error
will be wider. Although the term power is often not used, these calculations are
the same as those used to calculate sample size for 50% power. If you chose,
say, 80% power instead, then you’d need a larger sample size and the margin of
error would be expected to be narrower than the specified value 80% of the time
(Metcalfe, 2011).
I prefer this approach to determining sample size, because it puts the focus
on effect size rather than statistical significance. I think it is more natural to think
about how precisely you want to determine the effect size (the widest acceptable
confidence interval), rather than about the smallest effect size you can detect. But the
two methods are equivalent.
What power?
Statistical power is explained in Chapter 20. The power of an experimental design
is the answer to this question: If there truly is an effect of a specified size, what is
the chance that your result will be statistically significant?
The example study chose a power of 80%. That means that if the effect the
investigators wanted to see really exists, there is an 80% chance that the study
would have resulted in a statistically significant result, leaving a 20% chance of
missing the real difference.
If you want lots of statistical power, then you’ll need a huge sample size. If
moderate power is adequate, then the sample size can be smaller.
LINGO: POWER
The word “power” tends to get emphasized when discussing sample size calcula-
tions. The calculations used to compute necessary sample size are commonly re-
ferred to as a power analysis. When discussing the effect size the experiment was
designed to detect, some investigators say the experiment was powered to detect
the specified effect size. That is just slang for choosing a large enough sample size
for the specified effect size, significance level, and power. When the sample size
is too small to detect a specified effect size, a study is sometimes referred to as
being underpowered.
Table 26.1. Expected FPRP as a function of choices for α and power, and an estimate of
the prior probability.
Each value was computed in Excel by this formula where Prior and Power are percentages
and Alpha is a fraction:
=100*((1-Prior/100)*Alpha) / ( ((1-Prior/100)*Alpha) + ((Prior/100)*(Power/100)) )
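For readers who prefer code to spreadsheet formulas, the same calculation is shown below as a small Python function (the function and argument names are mine). The two example calls correspond to the 1% and 50% prior probabilities discussed next.

# The Excel formula above, translated into Python. Prior and power are
# entered as percentages, alpha as a fraction; the result is the FPRP
# expressed as a percentage.
def fprp(prior_pct, power_pct, alpha):
    false_pos = (1 - prior_pct / 100) * alpha          # H0 true, yet P < alpha
    true_pos = (prior_pct / 100) * (power_pct / 100)   # H1 true and detected
    return 100 * false_pos / (false_pos + true_pos)

print(fprp(prior_pct=1, power_pct=80, alpha=0.05))     # about 86.1
print(fprp(prior_pct=50, power_pct=80, alpha=0.05))    # about 5.9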
prior data, but rather on a wild hunch. The corresponding FPRP is 86.1%.
If such an experiment resulted in a P value less than 0.05, there would be
an 86% chance that this is a false positive. The problem is that almost
5% of the experiments will be false positives (5% of the 99% of the experi-
ments where the null hypothesis is true), while fewer than 1% of the experi-
ments will be true positives (80% of the 1% of the experiments where the
experimental hypothesis is true). There is no point reading the rest of the
paper in this case. The study design makes no sense. If the prior probability
is low, α must be set to a lower value (and power to a higher value).
• The right column sets the prior probability to 50%. If the study was based
on solid theory and/or prior data, this is a reasonable estimate of the prior
probability and the FPRP is 5.9%. If such an experiment resulted in a
P value less than 0.05, there would be only about a 6% chance that it is a
false positive. That is not unreasonable. This is a well-designed study.
These two extremes demonstrate why experimental context (here quantified as prior
probability and FPRP) must be considered when choosing a sample size for a study.
Of course, one can never know the prior probability exactly, but it is essential to esti-
mate it when choosing sample size or when interpreting a sample size statement. The
usual choices of power (80%) and α (0.05) are reasonable for testing well-grounded
hypotheses, but are really not helpful when testing unlikely hypotheses.
One problem is that you might care about effects that are smaller than the
published (or pilot) effect size. You should compute the sample size required to
detect the smallest effect worth detecting, not the sample size required to detect
an effect size someone else has published (or you determined in a pilot study).
The second problem is that the published difference or effect size is likely to be
inflated relative to the true effect (Ioannidis, 2008; Zollner & Pritchard, 2007). If many studies
are performed, the average of the effects detected in these studies should be close
to the true effect. Some studies will happen to find larger effects, and some studies
will happen to find smaller effects. The problem is that the studies that find small
effects are much less likely to get published than the studies that find large effects.
Therefore, published results tend to overestimate the actual effect size.
What happens if you use the published effect size to compute the necessary
sample size for a repeat experiment? You’ll have a large enough sample size to
have 80% power (or whatever power you specify) to detect that published effect
size. But if the real effect size is smaller, the power of the study to detect that real
effect will be less than the power you chose.
Don’t try to compute sample size based on the effect size you expect to see
Sample size calculations should be based on the smallest effect size worth detect-
ing, based on scientific or clinical considerations. Sample size calculations should
not be based on the effect size you expect to see. If you knew the effect size to
expect, it wouldn’t really be an experiment! That is what you are trying to find out.
n 95% CI OF SD
2 0.45 × SD to 31.9 × SD
3 0.52 × SD to 6.29 × SD
5 0.60 × SD to 2.87 × SD
10 0.69 × SD to 1.83 × SD
25 0.78 × SD to 1.39 × SD
50 0.84 × SD to 1.25 × SD
100 0.88 × SD to 1.16 × SD
500 0.94 × SD to 1.07 × SD
1000 0.96 × SD to 1.05 × SD
samples, note how wide the confidence interval is. For example, with n = 3 the popu-
lation SD can be (within 95% confidence) 6.29 times larger than the SD computed
from the sample. If you compute a sample size with a SD that is lower than the true
population SD, the computed sample size will also be too low. The required sample
size is proportional to the square of the SD. So if the SD you obtain in a pilot experi-
ment is half the true population SD, your computed sample size will be one quarter of
what it should be. Consequently, your power will be a whole lot less than you think it is.
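The multipliers in the table above come from the chi-square distribution (and assume the data were sampled from a Gaussian population). Here is a minimal sketch in Python, using SciPy, that reproduces them; the function name is mine.

# A sketch: 95% CI for the population SD, given a sample SD and sample size,
# assuming the values were sampled from a Gaussian population.
from scipy import stats

def sd_confidence_interval(sample_sd, n, confidence=0.95):
    alpha = 1 - confidence
    df = n - 1
    lower = sample_sd * (df / stats.chi2.ppf(1 - alpha / 2, df)) ** 0.5
    upper = sample_sd * (df / stats.chi2.ppf(alpha / 2, df)) ** 0.5
    return lower, upper

print(sd_confidence_interval(sample_sd=1.0, n=3))    # roughly (0.52, 6.29), as in the table
print(sd_confidence_interval(sample_sd=1.0, n=12))   # lower limit roughly 0.71 (see the pilot-study Q&A)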
EXAMPLES
Example 1: Comparing weights of men and women
You want to compare the average weight of adult men and women. How big a
sample size do you need to test the null hypothesis that the men and women weigh
the same on average? Of course, this is a silly question, as you know the answer
already, but going through this example may help clarify the concepts of sample
size determination. You need to answer four questions:
• How big an effect are you looking for? For this example, the answer is
already known. In an actual example, you have to specify the smallest dif-
ference you’d care about. Here we can use known values. In the US, the
mean difference between the weight of men and women (for 30 year olds,
but it doesn’t vary that much with age) is 16 kg (Environmental Protection
Agency, 2011, p. 36 NHANES-III data for 30-31-year-olds). An alternative
way to look at this is to ask what is the widest confidence interval for the difference
between the two mean weights that you are willing to accept. Here we are looking
for a margin of error of 16 kg, which means the entire confidence interval extends
16 kg in each direction and so has a total width of 32 kg.
• What power do you want? Let’s choose the standard 80% power.
• What significance level? Let’s stick with the usual definition of 5%.
• How much variability do you expect? The table with the average weights
also states the SD of weights of 30-year-old men is 17 kg, and the SD for
other ages is about the same.
Figure 26.2 shows how to use the program G*Power to compute sample size
for this example. G*Power is free for Mac and Windows at www.gpower.hhu.de,
and is described by Faul (2007). I used version 3.1.9.2. If you use a different ver-
sion, the screen may not look quite the same. To use G*Power:
1. Choose a t test for the difference between two independent means.
2. Choose two-tail P values.
3. Enter the effect size, d, as 1.0. This is the smallest difference between
means you wish to detect divided by the expected SD.
4. Set alpha to 0.05 and power to 0.80. Enter both as fractions not percentages.
5. Enter 1.0 for the allocation ratio because you plan to use the same sample
size for each group.
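If you don't use G*Power, the same calculation can be reproduced in other software. Here is a sketch using Python's statsmodels package (my choice for illustration; this chapter itself uses G*Power, and the book elsewhere uses GraphPad and Excel).

# A sketch of the same power calculation with statsmodels.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=1.0,            # d = smallest difference / SD = 16/17, entered as 1.0
    alpha=0.05,                 # significance level, as a fraction
    power=0.80,                 # desired power, as a fraction
    ratio=1.0,                  # equal allocation to the two groups
    alternative='two-sided')
print(n_per_group)              # roughly 17 per group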
These examples are silly in that you would never design an experiment to
make this comparison. But they are helpful for appreciating the need for
large samples. Look around. It is obvious that men, on average, weigh more than
women. But to prove that (with only 1% chance of error in either direction) re-
quires measuring the weight of about 100 people.
When you actually do studies, you usually can’t look up the SD as we did
here. And even if you can look it up, it may not be correct. The EPA reference actu-
ally lists two different tables of weights. The other one (Environmental Protection
Agency, 2011, p. 38, NHANES-IV data for 30-31-year-olds) is quite different: the
SD is 22 kg and the mean difference between men and women is 6 kg. If you based
your sample size calculation on these values with an effect size of 6/22 = 0.27, the
required sample size per group is 217 (α = 0.05; power = 80%) or 661 (α = 0.01;
power = 99%). If you planned the study expecting the SD to be about 16 kg, but in
fact it was about 22 kg, you would not have the statistical power you were expecting.
are having substantially better outcomes than those in the other. This is a form
of multiple comparisons, so the investigators need to take a statistical “hit” for
peeking at the data while a trial is ongoing. For example, if you use a significance
level of 0.01 for stopping the trial early, then the P value of the final analysis has
to be less than about 0.049 to be deemed statistically significant at the 5% level.
Newer adaptive techniques are used to decide when to stop adding new pa-
tients to the study, when to stop the study, and what proportion of new subjects
should be assigned to each alternative treatment. These decisions must be based
on unambiguous protocols that are established before analyzing the data (though
not necessarily before collecting them) and applied while the study remains blinded.
Also, the interim analysis results must be presented only as a summary of each
group, with each group designated by a coded label that does not identify which
treatment or experimental group is which. These precautions ensure that the final
statistical results remain meaningful.
In some cases, the adaptive design will lead investigators to end a study early
when it becomes clear that one treatment is much better than the other or when it
becomes clear that the difference between the treatments is at best trivial. In other
cases, the adaptive design will prolong a study (increase the sample size) until it
reaches a crisp conclusion.
The methods used to adapt a study to interim findings are not straightfor-
ward, have some drawbacks (Fleming, 2006) and are not yet entirely standard. The
investigators must choose the strategy of data analysis when planning the study
and must lock in that strategy before looking at any data. Despite the name, you
can’t adapt your strategy after doing preliminary analyses. All adaptations must
follow a preplanned strategy.
Simulations
Simulations provide a versatile approach to optimizing sample size for any kind
of experimental design. Run multiple sets of computer simulations, each with
various assumptions about variability, effect size, and sample sizes. Of course,
you’ll include a random factor so each simulation has different data. For each set
of simulations, tabulate whatever outcome makes the most sense—often the frac-
tion of the simulated results that have a P value less than 0.05 (the power). Then
you can review all the simulations and see what advantages increasing sample size
gives you. Knowing the cost and effort of the experiment, you can then choose a
sample size that makes sense.
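Here is a minimal sketch of that simulation approach in Python (the specific numbers echo the weight example earlier in this chapter; everything else is my own illustration, not a prescribed method):

# Estimate power by simulation: for each candidate sample size, simulate many
# experiments and tabulate the fraction with P < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)      # the seed makes the random results reproducible

def simulated_power(n_per_group, diff=16, sd=17, n_sim=10_000, alpha=0.05):
    hits = 0
    for _ in range(n_sim):
        a = rng.normal(0, sd, n_per_group)       # control-like group
        b = rng.normal(diff, sd, n_per_group)    # group shifted by the assumed effect size
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sim

for n in (5, 10, 20, 40):
    print(n, simulated_power(n))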
[Figure: P value (log scale, 0.01 to 1) plotted against cumulative sample size (0 to 75), with the regions above and below 0.05 labeled as not statistically significant and statistically significant.]
increase sample size if the P value isn’t as low as desired, calculate the P value
again, and so on. If you could extend the sample size out to infinity, you’d always
eventually see a P value less than 0.05, even if the null hypothesis were true. In
practice, the investigator would eventually run out of money, time, or patience.
Even so, the chance of hitting a P value less than 0.05 under the null hypothesis
is way higher than 5%.
Someone viewing these results would have been very misled if the investiga-
tor had stopped collecting data the first time the P value dropped below 0.05 (at
n = 21) and had simply reported that the P value was 0.0286 and the sample size
was 21 in each group. It is impossible to interpret results if an ad hoc method is
used to choose sample size.
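A small simulation makes the problem vivid. The sketch below (my own illustration, assuming NumPy and SciPy) repeatedly "runs" an experiment in which the null hypothesis is true, adds five subjects per group whenever the P value is not yet below 0.05, and records how often the investigator would eventually declare significance.

# Simulate ad hoc sequential testing when the null hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

def ever_significant(start_n=5, step=5, max_n=75, alpha=0.05):
    a = list(rng.normal(0, 1, start_n))
    b = list(rng.normal(0, 1, start_n))           # both groups from the same population
    while len(a) <= max_n:
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True                           # stop and declare "significance"
        a.extend(rng.normal(0, 1, step))
        b.extend(rng.normal(0, 1, step))
    return False

n_sim = 2_000
print(sum(ever_significant() for _ in range(n_sim)) / n_sim)   # well above 0.05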
Mistake: Expecting there to be one sample size that is exactly right for
your proposed study
Sample size calculations are often based on tradition (choices for α and power),
a guess (SD), and an almost arbitrary decision (effect size). Essentially, the cal-
culations use solid theory to compute sample size from nearly arbitrary inputs.
Since the inputs are often arbitrary, the computed sample sizes are also somewhat
arbitrary. Different knowledgeable people facing the same situation may come
up with different estimates and decisions and so would compute different sample
sizes. Sample size statements in journals tend to make sample size determination
sound far more exact than it is.
Mistake: Confusing the sample size needed for each group with the
total needed sample size
Most programs compute the sample size needed for each group, in which case the
total sample size required is twice the value reported (assuming you are compar-
ing two groups). If you mistakenly think the program reported the total sample
size when it actually reported the size for each group, you will use half the re-
quired sample size and your study will have much less power than you intended.
Q&A
Does 80% power mean that 80% of the participants would be improved by the treat-
ment and 20% would not?
No. Power refers to the probability that a proposed study would reach a conclu-
sion that a given difference (or effect size) is statistically significant. Power has
nothing to do with the fraction of participants who benefit from a treatment.
How much power do I need?
Sample size calculations require you to choose how much power the study will
have. If you want more power, you’ll need a larger sample size. Often power is
set to a standard value of 80%. Ideally, the value should be chosen to match the
experimental setting, the goals of the experiment, and the consequences of
making a Type II error (see Chapter 16).
What are Type I and Type II errors?
See Table 16.2. A Type I error occurs when the null hypothesis is true but your
experiment gives a statistically significant result. A Type II error occurs when
the alternative hypothesis is true but your experiment yields a result that is not
statistically significant.
My program asked me to enter β. What’s that?
β is defined as 100% minus power. A power of 80% means that if the effect size
is what you predicted, your experiment has an 80% chance of producing a sta-
tistically significant result and thus a 20% chance of obtaining a not-significant
result. In other words, there is a 20% chance that you’ll make a Type II error (miss-
ing a true effect of a specified size). β, therefore, equals 20%, or 0.20.
Why don’t the equations that compute sample size to achieve a specified margin of
error require you to enter a value for power?
The equations you’ll find in other books for calculating sample size to obtain a
specified margin of error (half-width of the CI) often don’t require you to choose
power. That is because they assume 50% power. If the assumptions are all true
and you use the computed sample size, there is a 50% chance that the com-
puted margin of error will be less than the desired value and a 50% chance that
it will be larger.
Can a study ever have 100% power?
No.
Why does power decrease if you set a smaller value for α (without changing sample
size)?
When you decide to make α smaller, you set a stricter criterion for finding a sig-
nificant difference. The advantage of this decision is that it decreases the chance
of making a Type I error. The disadvantage is that it will now be harder to declare
a difference significant, even if the difference is real. By making α smaller, you
increase the chance that a real difference will be declared not significant and
thus decrease statistical power.
Why does power increase if you choose a larger n?
If you collect more evidence, conclusions will be more certain. Collecting data
from a larger sample decreases the standard error and thus increases statistical
power.
How can I do a sample size calculation if I have no idea what I expect the SD to be?
You can’t. You’ll need to perform a pilot study to find out.
How large a sample size do I need for that pilot study?
Julious (2005) suggests that 12 is a good number. But using the equation shown
in the legend for Table 26.1, the 95% CI for the SD extends down to 0.7 times the
actual population SD. Since sample size is proportional to the square of SD, this
means the computed sample size (based on a SD from a pilot study with n = 12)
could be half the size you actually need.
Is it useful to compute power of a completed study?
When a study reaches a conclusion that an effect is not statistically significant, it
can be useful to ask what power that experimental design had to detect various
effects. However, it is rarely helpful to compute the power to detect the effect
actually observed. See Chapter 20.
CHAPTER SUMMARY
• Sample size calculations require that you specify the desired significance
level and power, the minimum effect size for which you are looking, and the
scatter (SD) of the data or the expected proportion.
• You need larger samples when you are looking for a small effect, when the
SD is large, and when you desire lots of statistical power.
• It is often more useful to ask what you can learn from a proposed sample
size than to compute a sample size from the four variables listed in the
previous bullet.
• It is important to decide the magnitude of the effect for which you are look-
ing. Using standard effect sizes or published effect sizes doesn’t really help.
• Instead of determining sample size to detect a specified effect size with
a specified power, you can compute the sample size needed to make the
margin of error of a CI have a desired size.
• Ad hoc sequential sample size determination is commonly done but not
recommended. You can’t really interpret P values when the sample size was
determined that way.
• Adaptive trials do not require you to stick with the sample size chosen at
the beginning of the study. They allow you to modify sample size based on
data collected during the study using algorithms chosen as part of the study
protocol.
• If you can estimate the prior probability that your hypothesis is correct,
you can compute the estimated FPRP of the study you are designing. If
the FPRP is too high, it makes sense to design a larger study with a lower
significance level and more power.
Statistical Tests
CHAPTER 27
Comparing Proportions
No one believes an hypothesis except its originator, but everyone
believes an experiment except the experimenter.
W. I. B. BEVERIDGE
This chapter and the next one explain how to interpret results
that compare two proportions. This chapter explains analyses of
cross-sectional, prospective, and experimental studies in which the
results are summarized as the difference or ratio of two incidence or
prevalence rates.
TREATMENT                        RECURRENT THROMBOEMBOLISM    NO RECURRENCE    TOTAL
Placebo                                      73                    756           829
Apixaban (2.5 mg twice a day)                14                    826           840
Total                                        87                  1,582         1,669
Table 27.1. Data from the Agnelli et al. (2012) apixaban study. Each value is an actual
number of patients.
Table 27.2. Statistical results of the Agnelli et al. (2012) apixaban study.
P value
All P values test a null hypothesis, which in this case is that apixaban does not alter
the risk of a recurrent thromboembolism. The P value answers the question, If the
null hypothesis were true, what is the chance that random sampling of subjects would
result in incidence rates as different from (or more different than) what we observed?
The P value depends on sample size and on how far the relative risk is from
1.0. Two statistical methods can be used to compute the P value. Most would agree
that the best test to use is Fisher’s exact test. With large sample sizes (larger than
used here), Fisher’s test is mathematically unwieldy, so a chi-square test is used
instead. With large samples, the two compute nearly identical P values.
The P value (calculated by a computer using either test) is tiny, less than
0.0001. Interpreting the P value is straightforward: if the null hypothesis is true,
there is less than a 0.01% chance of randomly picking subjects with such a large
(or larger) difference in incidence rates.
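As a sketch of the calculations (the investigators used dedicated statistics software; SciPy is simply one convenient alternative), the P value, relative risk, and NNT for Table 27.1 can be reproduced as follows:

# Reproduce the main calculations for Table 27.1 (apixaban study).
from scipy import stats

placebo_events, placebo_no = 73, 756
apixaban_events, apixaban_no = 14, 826

odds_ratio, p_value = stats.fisher_exact([[placebo_events, placebo_no],
                                          [apixaban_events, apixaban_no]])
print(p_value)                                  # far below 0.0001

risk_placebo = placebo_events / (placebo_events + placebo_no)       # about 8.8%
risk_apixaban = apixaban_events / (apixaban_events + apixaban_no)   # about 1.7%
print(risk_apixaban / risk_placebo)             # relative risk, about 0.19
print(1 / (risk_placebo - risk_apixaban))       # NNT, about 14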
Other results
The results discussed above convincingly show that the lower dose (of two tested)
of apixaban prevents recurrence of thromboembolism. Although statistical tests
focus on one result at a time, you must integrate various results before reaching an
overall conclusion. Here the investigators also showed the following:
• The results with the higher dose (5 mg twice a day) were very similar to the
results summarized above for 2.5 mg twice a day.
• The main potential risk of an anticoagulant is bleeding, so it is reassur-
ing that patients receiving apixaban (an anticoagulant) did not have more
bleeding events than those treated with the placebo.
• The reduction in the number of recurrent thromboembolisms was similar in
old and young, and was also similar in men and women.
ASSUMPTIONS
The ability to interpret the results of a prospective or experimental study depends
on the following assumptions.
PHENOTYPE             OBSERVED NUMBER OF SEEDS    EXPECTED      EXPECTED NUMBER
                      IN THIS EXPERIMENT          PROPORTION    IN THIS EXPERIMENT
Round and yellow               315                   9/16             312.75
Round and green                108                   3/16             104.25
Angular and yellow             101                   3/16             104.25
Angular and green               32                   1/16              34.75
Total                          556                  16/16             556.00
Table 27.3 shows one of his experiments (from Cramer, 1999). In this experi-
ment, he recorded whether peas were round or angular and also whether they were
yellow or green. The yellow and round traits were dominant. Table 27.3 also
shows the expected proportions of the four phenotypes and the expected number
in an experiment with 556 peas. These expected numbers are not integers, but that
is OK, because they are the average expectations if you did many experiments.
Is the discrepancy between the observed and expected distributions greater
than we would expect to see by chance? The test that answers this is called the
chi-square test. It computes a P value that answers the question, If the theory that
generated the expected distribution is correct, what is the chance that random
sampling would lead to a deviation from expected as large as, or larger than, that
observed in this experiment?
The P value is 0.93 (computed here using GraphPad Prism, but lots of pro-
grams can do the calculations). With such a high P value, there is no reason to
doubt that the data follow the expected distribution. Note that the large P value
does not prove the theory is correct, only that the deviations from that theory are
small and consistent with random variation. (Many of Mendel’s P values were
high, which makes some wonder whether the data were fudged, as was briefly
discussed in Chapter 19.)
χ² = ∑ (Observed − Expected)² / Expected
For the Mendel example, χ2 = 0.470. The relationship between χ2 and P value
depends on how many categories there are. The number of degrees of freedom
(df) is defined to equal the number of categories minus one. The example data
have four categories and so have 3 df. This makes sense. Once you know the total
number of peas and the number in three of the phenotype categories, you can
figure out the number of peas in the remaining category. Knowing that χ2 = 0.470
and df = 3, the P value can be determined by computer or using a table.
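A sketch of the same goodness-of-fit calculation in Python follows (using SciPy; the book used GraphPad Prism, and any program should give the same result):

# Chi-square comparison of Mendel's observed counts with the expected
# 9:3:3:1 distribution (Table 27.3).
from scipy import stats

observed = [315, 108, 101, 32]
expected = [312.75, 104.25, 104.25, 34.75]

chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2)   # about 0.470, with df = 4 categories - 1 = 3
print(p)      # about 0.93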
The test is based on some approximations, which are only reasonably ac-
curate when all the expected values are fairly large. If any expected value is less
than 5, the results are dubious. This matters less when there are lots of categories
(rows) and matters the most when there are only two categories (in which case the
expected values should be 10 or higher).
Binomial test
The chi-square test described in the previous section is an approximation. When
there are only two categories, the binomial test computes the exact P value, with-
out any approximation or worry about sample size.
The coin-flipping example of Chapter 15 used the binomial test. To compute
the binomial test, you must enter the total number of observations, the fraction
that had one of the two outcomes, and the fraction expected (under the null hy-
pothesis) to have that outcome. In the coin-flipping example, the expected pro-
portion was 50%, but this is not always the case.
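As an illustration (the counts below are made up for the sketch, not the actual numbers from Chapter 15), a binomial test in Python looks like this:

# Exact binomial test: 16 "heads" in 20 flips, with an expected proportion of 0.5.
from scipy import stats

result = stats.binomtest(k=16, n=20, p=0.5)
print(result.pvalue)        # exact two-sided P value, no large-sample approximation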
The second part of this chapter explained how the chi-square test can be used
to compare an observed distribution with a distribution expected by theory. To do
this analysis using a computer program, you must enter both an observed and an
expected count for each category. The expected counts must come from theory or
external data and cannot be derived from the data being analyzed.
Q&A
What if there are more than three groups or more than three outcomes?
It won’t be possible to compute an attributable risk or relative risk, but it is
possible to compute a P value. Fisher’s test is limited to use on tables with two
rows and two columns, but the chi-square test can analyze any size contingency
table. Some programs offer exact tests for this situation.
If there are more than three rows or columns, does it matter in what order they
are placed?
The usual chi-square test pays no attention to the order of rows or columns.
If your table has two columns and three or more rows in which the order
matters (e.g., doses or ages), the chi-square test for trend questions whether
there is a significant trend between row number and the distribution of
the outcomes.
What is Yates’s correction?
When you use a program to analyze contingency tables, you might be
asked about Yates’s correction. The chi-square test used to analyze a con-
tingency table can be computed in two ways. Yates’s correction increases
the resulting P value to adjust for a bias in the usual chi-square test, but it
overcorrects.
Are special analyses available for paired data in which each subject is measured before
and after an intervention?
Yes. McNemar’s test is explained in Chapter 31.
Is it better to express results as the difference between two proportions or the ratio of
the two proportions?
It is best, of course, to express the results in multiple ways. If you have to sum-
marize results as a single value, I think the NNT is often the most useful way to
do so. Imagine a vaccine that halves the risk of a particular infection so that the
relative risk is 0.5. Consider two situations. In Situation A, the risk in unexposed
people is 2 in 10 million, so halving the risk brings it down to 1 in 10 million.
The difference is 0.0000001, so NNT = 10,000,000. In Situation B, the risk in
unexposed people is 20%, so halving the risk would bring it down to 10%. The
difference in risks is 0.10, so the NNT = 10. The relative risk is the same in the
two situations, but in Situation A the risk difference is a tiny fraction that is hard
to say and remember. The NNT, in contrast, is clear. In the first situation, you'd treat 10 million
people to prevent one case of disease. In the second situation, you’d treat
10 people to prevent one case of disease. Expressed as NNT, the results are
easier to understand and explain.
The NNT in the apixaban example is 14. Does that mean the drug doesn't work in
13 out of 14 patients who receive it?
No. The drug works as an anticoagulant, and presumably all the patients were
anticoagulated. The NNT of 14 means that during the year’s course of the study,
you need to treat 14 patients to prevent one thromboembolism (the outcome
measured in the study).
Can a relative risk be zero?
If no one in the treated group (the group whose risk is the numerator of the ratio) had the outcome, the relative risk will be zero.
Can a relative risk be negative?
No.
Is there a limit to how high a relative risk can be?
The maximum possible value depends on the risk in the control group. For
example, if the risk in the control group is 25%, the relative risk cannot be
greater than 4.0, because that would make the risk in the treated group be 100%
(the maximum possible risk).
CHAPTER SUMMARY
• A contingency table displays the results of a study with a categorical out-
come. Rows represent different treatments or exposures. Columns repre-
sent different outcomes. Each value is an actual number of subjects.
• When there are two treatments and two outcomes, the results can be sum-
marized as the ratio (the relative risk), the difference between two incidence
rates (the attributable risk), or the reciprocal of that difference (NNT).
• Evidence against the null hypothesis that the outcome is not related to the
treatment is expressed as a P value, computed using either Fisher’s exact
test or the chi-square test.
• An observed distribution can be compared against a theoretical distribution
using the chi-square test. The result is expressed as a P value testing the
null hypothesis that the data were in fact sampled from a population that
followed the theoretical distribution. When there are only two categories,
the binomial test is preferred.
Case-Control Studies
It is now proven beyond doubt that smoking is one of the lead-
ing causes of statistics.
FLETCHER KNEBEL
                      CASES    CONTROLS
Received vaccine        10         94
No vaccine              33         78
Total                   43        172
Interpreting a P value
When you analyze these data with a computer program, you will be asked which
test to use to compute the P value. Most would agree that Fisher's exact test is the best choice.
Odds ratio
They analyzed the data for each disease (ulcerative colitis and Crohn’s disease)
separately and also analyzed the two diseases combined. The combined data
are shown in Table 28.2. Of the 1,960 cases of people with IBD, 25 (1.28%)
had taken isotretinoin prior to their diagnosis. Among cases, the odds of having
received isotretinoin is 25/1935, or 0.0129. Of the 19,419 controls, 213 (1.10%)
had taken isotretinoin. Among controls, the odds of having received isotreti-
noin are 213/19216, or 0.0111. The odds ratio equals the odds of the cases
having been exposed to the possible risk factor divided by the odds of the con-
trols having been exposed. In this example, the odds ratio is 0.0129/0.0111, or
1.17, with a 95% CI ranging from 0.77 to 1.75. Since that interval contains 1.0,
the investigators concluded there was no evidence of an association between
isotretinoin use and IBD. Even though the samples are huge, the CI is quite
wide. This is because only a small fraction (about 1%) of either group took the
drug.
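Here is a sketch of that arithmetic, with an approximate CI computed by the common log-odds (Woolf) method. The investigators may have used a slightly different CI method, so the limits need not match theirs exactly.

# Odds ratio for the isotretinoin/IBD case-control data, with an approximate 95% CI.
import math

exposed_cases, unexposed_cases = 25, 1935            # cases (IBD)
exposed_controls, unexposed_controls = 213, 19216    # controls

odds_ratio = (exposed_cases / unexposed_cases) / (exposed_controls / unexposed_controls)

se_log_or = math.sqrt(1 / exposed_cases + 1 / unexposed_cases +
                      1 / exposed_controls + 1 / unexposed_controls)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(odds_ratio)       # about 1.17
print(lower, upper)     # roughly 0.77 to 1.77 (the study reported 0.77 to 1.75)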
P value
The researchers presented the results as an odds ratio with CI but did not also
present a P value or statements of statistical significance. I think this was a wise
decision. The odds ratio with its CI summarizes the results very clearly.
Even though the P value wasn’t presented, we can infer something about it.
Since the 95% CI contains 1.0 (the value that denotes no association), the P value
must be greater than 0.05 (since 100% − 95% = 5% = 0.05). Since 1.0 is not very
near either end of the interval, the P value must be quite a bit greater than 0.05.
In fact, it is 0.43.
time, and matching gender and age is a strength of the study. But, there are some
potential problems with this approach:
• The only way that controls became part of the study was to be home when the
investigators visited. This method thus selected for people who stay home a
lot. This selection criterion was not applied to the cases. This would cause
bias if the incidence of cholera was lower in people who stay home a lot.
• The controls were chosen among people who live very near the cases. This
could be a problem if the vaccination effort was more effective in some neigh-
borhoods than others. If that were true, then a case who was vaccinated would be
more likely to get matched to a control who was vaccinated, and a case who
was not vaccinated would be more likely to be matched to a control who was
not vaccinated. This would bias the results of the case-control study toward
not finding a link between vaccination and infection.
These kinds of problems are inherent in all case-control studies where the
controls were chosen to match the cases. When you match for variables, you might
inadvertently also match (indirectly) for exposure.
The investigators in this study were very aware of these potential problems.
To test whether these biases affected their results, they ran a second case-control
study. Here, the cases were patients with bloody diarrhea that turned out to not
be caused by cholera. The controls were selected in the same way as in the first
study. The second study, which had almost the same set of biases as the first, did
not detect an association between diarrhea and cholera vaccination. The odds ratio
was 0.64, with a 95% CI extending from 0.34 to 1.18. Because the CI spans 1.0,
the study shows no association between cholera vaccination and bloody diarrhea
not caused by cholera. The negative results of the second study suggest that the
association reported in the main study really was caused by vaccination, rather
than by any of the previously listed biases.
Population controls
Rather than matching controls to cases, some studies are designed so controls are
chosen from the same population from which the cases are identified, with an
effort to choose controls at about the same time that cases are identified. Many
epidemiologists think this kind of design is much better, as it avoids the problems
of matching. Sometimes population controls are selected with some matching to
get controls of about the same age as the cases.
EPIDEMIOLOGY LINGO
Bias
Bias refers to anything that causes the results to be incorrect in a systematic way.
Contingency table
Contingency tables (first introduced in Chapter 27) show how the outcome is
contingent on the treatment or exposure. The rows represent exposure (or lack of
exposure) to alternative treatments or possible risk factors. Each subject belongs
to one row based on exposure or treatment. Columns denote alternative outcomes.
Each subject also belongs to one column based on outcome. Therefore, each cell
in the table is the number of people (or animals or lightbulbs…) that were in one
particular exposure (or treatment) group and had one particular outcome.
Not all tables are contingency tables. Contingency tables always show the
actual number of people (or some other experimental unit) in various categories.
Thus, each number must be a positive integer or zero. Tables of fractions, propor-
tions, percentages, averages, changes, or durations are not contingency tables, nor
are tables of rates, such as number of cases per 1,000 people. Table 27.3 compares
observed and expected counts and therefore is not a contingency table.
Mistake: Thinking that a case-control study can prove cause and effect
An observational study can show an association between two variables. Depend-
ing on the situation, you may be able to make some tentative conclusions about
causation. But, in general, association does not imply causation. Check out the
cartoons in Chapter 45 if you are not convinced.
Q&A
Are case-control studies “quick and dirty” ways to advance medical knowledge?
It is easy to design and perform a poor-quality case-control study. It is far from
easy or quick to design a high-quality case-control study. But once designed,
a case-control study only requires data collection and analysis. In contrast, an
experimental or prospective study also requires waiting for the results of the
treatment or exposure to become apparent over time.
Is it possible for the same person to be both a case and a control?
Surprisingly, it is possible. Say you identify the controls at the beginning of the
study and then collect cases as people in that population develop the disease or
condition. It is possible that one of your controls then also becomes a case.
Since case-control studies are retrospective, is there any point in following a prespeci-
fied analysis plan?
Yes. If investigators didn’t create and follow an analysis plan, people reading the
study never know how many variations on the study they ran until they got a
P value small enough to publish.
How is a case-control study analyzed when the investigator wants to adjust for other
variables?
Using logistic regression as explained in Chapter 38.
What are the advantages and disadvantages of hospital-based case-control studies?
Sometimes case-control studies define cases as people coming to the hospital
with one disease and controls as people coming to the same hospital with a
different disease or condition. One advantage is that the logistics of this kind of
study are easier to arrange than most other kinds of studies. Another advantage
is that hospitalized controls tend to be more likely to agree to participate in a
research study than are people who are not hospitalized. The disadvantage is
that the cases and controls may come from very different populations. This is
not much of an issue when the hospital is in a small city or is part of a health
maintenance organization where all members go to the same hospital for all
conditions. But in larger cities where people have a choice of hospitals, the hos-
pitalized patients with one disease may have very different characteristics than
patients who come to that hospital with another disease.
Do case-control studies always use an outcome variable with only two possible values?
Case-control studies are usually done with binary outcome variables, so are
usually analyzed by calculation of an odds ratio using Fisher’s test or logistic
regression. But a case-control design could be used for a study that collects con-
tinuous data to be analyzed with a t test, or collects survival data to be analyzed
by the log-rank test or Cox regression (see Chapter 29).
CHAPTER SUMMARY
• Case-control studies start by identifying people with and without a disease
or condition and then compare exposure to a possible risk factor or to a
treatment.
• The advantage of a case-control study compared to a cohort study is that
a case-control study can be performed more quickly, can sometimes use
existing data, and requires smaller sample sizes.
Comparing Survival Curves
A statistician is a person who draws a mathematically
precise line from an unwarranted assumption to a foregone
conclusion.
UNKNOWN AUTHOR
PREDNISOLONE CONTROL
2 2
6 3
12 4
54 7
56 (left study) 10
68 22
89 28
96 29
96 32
125* 37
128* 40
131* 41
140* 54
141* 61
143 63
145* 71
146 127*
148* 140*
162* 146*
168 158*
173* 167*
181* 182*
Table 29.1. Sample survival data from Kirk et al. (1980), with raw
data from Altman and Bland (1998).
Patients with chronic active hepatitis were treated with prednisolone
or placebo, and their survival was compared. The values are survival
time in months. The asterisks denote patients still alive at the time the
data were analyzed. These values are said to be censored.
Figure 29.1. Kaplan-Meier survival curves created from the data in Table 29.1.
Each drop in the survival curve shows the time of one (or more) patient’s death. The blips
show the time of censoring. One subject was censored when he left the study. The data
for the others were censored because they were alive when the data collection ended. The
graph on the right shows how median survival times are determined.
the average survival of all patients in the “with metastases” group will probably
also improve.
Changing the diagnostic method paradoxically increases the average survival
of both groups! Feinstein, Sosin, and Wells (1985) termed this paradox the Will
Rogers phenomenon from a quote attributed to the humorist Will Rogers: “When
the Okies left Oklahoma and moved to California, they raised the average intel-
ligence in both states.”
This phenomenon also causes a problem when making historical compari-
sons. The survival experience of people diagnosed with early stage breast cancer
is much better now than it was prior to widespread use of screening mammogra-
phy. But does that mean that screening mammography saves lives? That is a hard
question to answer, since the use of screening mammography has nearly doubled
the number of diagnoses of early breast cancer while barely changing the number
of diagnoses of advanced breast cancer. Nearly a third of the people who are
diagnosed with breast cancer by mammography have abnormalities that probably
would never have caused illness (Bleyer & Welch, 2012). Comparing the survival
curves of people diagnosed with early breast cancer then and now is pointless, as
the diagnostic criteria have changed so much.
[Figure: percent survival (0%–100%) plotted against years (0 to 5).]
start, of course, with 100% and end at 0%. At any time along the curve, the slope
of one curve is twice the other. The hazards are proportional.
The assumption of proportional hazards means that the hazard ratio is consis-
tent over time, and any differences are caused by random sampling. This assump-
tion would be true if, at all times, patients from one treatment group are dying
at about half the rate of patients from the other group. It would not be true if the
death rate in one group is much higher at early times but lower at late times. This
situation is common when comparing surgery (high initial risk, lower later risk)
with medical therapy (less initial risk, higher later risk). In this case, a hazard ratio
would essentially average together information that the treatment is worse at early
time points and better at later time points, so would be meaningless.
If two survival curves cross, the assumption of proportional hazards is un-
likely to be true. The exception would be when the curves only cross at late time
points when few patients are still being followed.
The assumption of proportional hazards needs to be reasonable to interpret
the hazard ratio (and its CI) and the CI for the ratio of median survival times.
Hazard ratio
If the assumption of proportional hazards is reasonable, the two survival curves
can be summarized with the hazard ratio, which is essentially the same as the rela-
tive risk (see Chapter 27). For the example data, the hazard ratio is 0.42, with a
95% CI ranging from 0.19 to 0.92. In other words, the treated patients are dying at
42% of the rate of control patients, and we have 95% confidence that the true ratio
is between 19% and 92%. More specifically, this means that on any particular day
or week, a treated patient has 42% of the chance of dying as a control patient.
If the two survival curves are identical, the hazard ratio would equal 1.0.
Because the 95% CI of the sample data does not include 1.0, we can be at least
95% certain that the two populations have different survival experiences.
This assumption does not seem unreasonable for the sample data. For these data,
the ratio of median survivals is 3.61, with a 95% CI ranging from 3.14 to 4.07.
In other words, we are 95% sure that treatment with prednisolone triples to qua-
druples the median survival time.
Confidence bands
Figure 29.3 shows the example survival curves with their 95% confidence bands.
The curves overlap a bit, so plotting both curves together creates a very clut-
tered graph. Instead, the two are placed side by side. Both curves start at 100%,
of course, so they overlap somewhat, but note that they don’t overlap at all for
many months. This graph should be enough to convince you that the prednisolone
worked.
Note the difference between the confidence band, which is an area plotted on
a graph, and a CI, which is a range of values.
P value
The comparison of two survival curves can be supplemented with a P value. All
P values test a null hypothesis. When comparing two survival curves, the null
hypothesis is that the survival curves of the two populations are identical and any
observed discrepancy is the result of random sampling error. In other words, the
null hypothesis is that the treatment did not change survival overall and that any
difference observed was simply the result of chance.
The P value answers the question, If the null hypothesis is true, what is the
probability of randomly selecting subjects whose survival curves are as different
(or more different) than what was actually observed?
The calculation of the P value is best left to computer programs. The log-
rank method, also known as the Mantel–Cox method (and nearly identical to the
Mantel–Haenszel method), is used most frequently.
For the sample data, the two-tailed P value is 0.031. If the treatment was
really ineffective, it is possible that the patients who were randomly selected to
receive one treatment just happened to live longer than the patients who received
the other treatment. The P value tells us that the chance of this happening is only
3.1%. Because that percentage is less than the traditional significance cut-off of
5%, we can say that the increase in survival with treatment by prednisolone is
statistically significant.
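For readers who want to reproduce the comparison, here is a sketch using the Python lifelines package (my choice; the book does not use it). Durations are the months from Table 29.1; event is 1 for a death and 0 for a censored patient, treating the patient who left the study at 56 months as censored.

# Log-rank comparison of the prednisolone and control survival data in Table 29.1.
from lifelines.statistics import logrank_test

pred_months = [2, 6, 12, 54, 56, 68, 89, 96, 96, 125, 128, 131, 140,
               141, 143, 145, 146, 148, 162, 168, 173, 181]
pred_event  = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0,
               0, 1, 0, 1, 0, 0, 1, 0, 0]
ctrl_months = [2, 3, 4, 7, 10, 22, 28, 29, 32, 37, 40, 41, 54,
               61, 63, 71, 127, 140, 146, 158, 167, 182]
ctrl_event  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 0, 0, 0, 0, 0, 0]

result = logrank_test(pred_months, ctrl_months,
                      event_observed_A=pred_event, event_observed_B=ctrl_event)
print(result.p_value)    # should be about 0.03, matching the value reported above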
An alternative method to compute the P value is known as the Gehan–
Breslow–Wilcoxon method. Unlike the log-rank test, which gives equal weight
to all time points, the Gehan–Breslow–Wilcoxon method gives more weight to
deaths at early time points. This often makes a lot of sense, but the results can be
misleading when a large fraction of patients are censored at early time points. The
Gehan–Breslow–Wilcoxon test does not require a consistent hazard ratio, but it
does require that one group consistently have a higher risk than the other.
For the sample data, the P value computed by the Gehan–Breslow–Wilcoxon
method is 0.011.
100%
Percent survival
50%
P = 0.0441
0%
0 5 10 15 20
Years
Five-year survival
The success of cancer therapies is often expressed as a five-year survival. This tells
you what fraction of the patients are still alive after five years. Five years is an arbi-
trary but standard length of time to tabulate survival. Figure 29.4 shows why focus-
ing only on that one number can be misleading. The two curves in that figure have
the same five-year survival (72%) but diverge a lot after five years. Even though the
five-year survival is the same for both treatments, the 10-year survival is very different.
INTENTION TO TREAT
When comparing the survival of people randomly assigned to receive alternative
treatments, you will often hit a snag. Some patients don’t actually get the treat-
ment they were randomized to receive. Others stop following the protocol.
It seems sensible to just exclude from analysis all data from people who
didn’t receive the full assigned treatment. This is called the per protocol approach,
because it requires that you only compare survival of subjects who got the full
treatment according to the study protocol. But this approach will lead to biased
results if the noncompliance has any association with disease progression and
treatment. Hollis and Campbell (1999) review an example of this. The study in
question randomized patients with severe angina to receive a surgical or medical
treatment. Some of the patients randomly assigned to the surgical group died soon
after being randomized so never actually received the surgery. It would seem sen-
sible to exclude them from the analysis, since they never received the treatment
to which they were assigned. But that would exclude the earliest deaths from one
assigned treatment (surgery) but not the other (medical). The results would then
not be interpretable. The surgery group would have better survival, simply be-
cause people who died early were excluded from the analysis.
Q&A
CHAPTER SUMMARY
• Survival analysis is used to analyze data for which the outcome is elapsed
time until a one-time event (often death) occurs.
• Comparisons of survival curves can only be interpreted if you accept a list
of assumptions.
• The hazard ratio is one way to summarize the comparison of two survival
curves. If the hazard ratio is 2.0, then subjects given one treatment have
died at twice the rate of subjects given the other treatment. Hazard ratios
should be reported with a 95% CI.
• Another way to summarize the comparison of two survival curves is to com-
pute the ratio of the median survival times, along with a CI for that ratio.
• A P value tests the null hypothesis that survival is identical in the two popula-
tions and the difference you observed is due to random assignment of patients.
• It is rarely helpful to compute the mean time until death for two reasons.
One reason is that you can’t compute the mean until the last person has
died. The other reason is that survival times are rarely Gaussian.
• Computing the median survival is more useful than computing the mean
survival, but that approach doesn’t replace the need to view the survival
curves.
• Most investigators analyze survival data from randomized trials using the
ITT principle. In this approach, you analyze each person’s data according to
the treatment group to which they were assigned, even if they didn’t actu-
ally get that treatment.
Comparing Two Means:
Unpaired t Test
Researchers often want Bioinformaticians to be Biomagicians,
people who can make significant results out of non-significant
data, or Biomorticians, people who can bury data that disagree
with the researcher’s prior hypothesis.
DAN MASYS
This chapter explains the unpaired t test, which compares the means
of two groups, assuming the data were sampled from a Gaussian
population. Chapter 31 explains the paired t test, Chapter 39 explains the
nonparametric Mann–Whitney test (and computer-intensive bootstrapping
methods), and Chapter 35 shows how an unpaired t test can be viewed as a
comparison of the fit of two models.
(given some assumptions) that the mean response in old animals is less than the
mean response in young ones.
The width of the CI depends on three values:
• Variability. If the data are widely scattered (large SD), then the CI will be
wider. If the data are very consistent (low SD), the CI will be narrower.
• Sample size. Everything else being equal, larger samples generate narrower
CIs and smaller samples generate wider CIs.
• Degree of confidence. If you wish to have more confidence (i.e., 99%
rather than 95%), the interval will be wider. If you are willing to accept less
confidence (i.e., 90% confidence), the interval will be narrower.
OLD YOUNG
20.8 45.5
2.8 55.0
50.0 60.7
33.3 61.5
29.4 61.1
38.9 65.5
29.4 42.9
52.6 37.5
14.3
Figure 30.1. Maximal relaxation of muscle strips of old and young rat
bladders stimulated with high concentrations of norepinephrine.
More relaxation is shown as larger numbers. Each symbol represents a
measurement from one rat. The horizontal lines represent the means.
UNPAIRED T TEST
P value 0.0030
P value summary **
Are means significantly different? (P < 0.05) Yes
One- or two-tailed P value? Two-tailed
t, df t = 3.531, df = 15
HOW BIG IS THE DIFFERENCE?
Mean ± SEM of Column A (old) 30.17 ± 5.365, n = 9
Mean ± SEM of Column B (young) 53.71 ± 3.664, n = 8
Difference between means 23.55 ± 6.667
95% CI 9.335 to 37.76
R2 0.4540
F TEST TO COMPARE VARIANCES
F, DFn, Dfd 2.412, 8, 7
P value 0.2631
P value summary ns
Are variances significantly different? No
[Figure: the Old and Young values plotted alongside the difference between means (Mean B − Mean A) with its 95% CI, which extends from about 9.3 to 37.8.]
P value
The null hypothesis is that both sets of data are randomly sampled from popula-
tions with identical means. The P value answers the question, If the null hypoth-
esis were true, what is the chance of randomly observing a difference as large or
larger than that observed in this experiment?
R2
Not all programs report R2 with t test results, so you can ignore this value if it
seems confusing. Chapter 35 will explain the concept in more detail. For these
data, R2 = 0.45. This means that a bit less than half (45%) of the variation among
all the values is due to differences between the group means, and a bit more than
half of the variation is due to variation within the groups.
t ratio
The P value is calculated from the t ratio, which is computed from the difference
between the two sample means and the SD and sample size of each group. The
t ratio has no units. In this example, t = 3.53. This number is part of the calcula-
tions of the t test, but its value is not very informative.
16.09 and 10.36. The old animals have an SD 1.55 times larger than that of the
young animals. The square of that ratio (2.41) is called an F ratio.
The t test assumes that both groups are sampled from populations with equal
variances. Are the data consistent with that assumption? To find out, let’s compute
another P value.
If that assumption were true, the distribution of the F ratio would be known, so a
P value could be computed. This calculation depends on knowing the df for the numer-
ator and denominator of the F ratio, abbreviated in Table 30.2 as DFn and DFd. Each
df equals one of the sample sizes minus 1. The P value equals 0.26. If the assumption
of equal variances were true, there would be a 26% chance that random sampling
would result in this large a discrepancy between the two SD values as observed here
(or larger still). The P value is large, so the data are consistent with the assumption.
Don’t mix up this P value, which tests the null hypothesis that the two popula-
tions have the same SD, with the P value that tests the null hypothesis that the two
populations have the same mean.
SD error bars
Figure 30.3 graphs the mean and SD of the sample rat data. The graph on the
left is typical of graphs you’ll see in journals, with the error bar only sticking
up. But, of course, the scatter goes in both directions. The graph on the right
shows this.
The two error bars overlap. The upper error bar for the data from the old
rats is higher than the lower error bar for data from the young rats. What can you
conclude from that overlap? Not much, because the t test also takes into account
sample size (in this example, the number of rats studied). If the samples were
larger (more rats), with the same means and same SDs, the P value would be much
smaller. If the samples were smaller (fewer rats), with the same means and same
SDs, the P value would be larger.
When the difference between two means is statistically significant (P < 0.05),
the two SD error bars may or may not overlap. Knowing whether SD error bars
overlap, therefore, does not let you conclude whether the difference between the
means is statistically significant.
Table 30.3. Rules of thumb for interpreting error bars that overlap or do not overlap.
These rules only apply when comparing two means with an unpaired t test and when the
two sample sizes are equal or nearly equal.
[Figures 30.3 and 30.4: the same % Emax data for the Old and Young groups plotted as mean ± SD (upper pair of panels) and as mean ± SEM (lower pair of panels).]
The SEM error bars quantify how precisely you know the mean, taking into
account both the SD and the sample size. Looking at whether the error bars over-
lap, therefore, lets you compare the difference between the means with the preci-
sion of those means. This sounds promising. But, in fact, you don’t learn much
by looking at whether SEM error bars overlap. By taking into account sample
size and considering how far apart two error bars are, Cumming, Fidler, and Vaux
(2007) came up with some rules for deciding when a difference is significant.
However, these rules are hard to remember and apply.
Here is a rule of thumb that can be used when the two sample sizes are equal
or nearly equal. If two SEM error bars overlap, the P value is (much) greater than
0.05, so the difference is not statistically significant. The opposite rule does not
apply. If two SEM error bars do not overlap, the P value could be less than or
greater than 0.05.
CI error bars
The error bars in Figure 30.5 show the 95% CI of each mean. Because the two 95%
CI error bars do not overlap and the sample sizes are nearly equal, the P value must
be a lot less than 0.05—probably less than 0.005 (Payton, Greenstone, & Schen-
ker, 2003; Knol, Pestman, & Grobbee, 2011). But it would be a mistake to look at
overlapping 95% CI error bars and conclude that the P value is greater than 0.05.
That relationship, although commonly believed, is just not true. When two 95% CIs
overlap, the P value might be greater than 0.05, but it also might be less than 0.05.
What you can conclude is that the P value is probably greater than 0.005.
Figure 30.5. The same data plotted as means with 95% CIs.
Compare this figure with Figure 30.4. The 95% CI error bars are always longer
than SEM error bars. The two error bars do not overlap, which means the P
value must be much less than 0.05 and is probably less than 0.005.
CI
The CI for the difference between the two population means is centered on the dif-
ference between the means of the two samples. The margin of error, the distance
the CI extends in each direction, equals the standard error of the difference (see the
previous section) times a critical value from the t distribution (see Appendix D).
For the sample data, the margin of error equals 6.67 (see previous section) times
2.1314 (the critical value from the t distribution for 95% CI and 15 df), or 14.22.
The observed difference between means is 23.55, so the 95% CI extends from
23.55 − 14.22 to 23.55 + 14.22, or from 9.33 to 37.76.
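As a quick check, this sketch (Python with SciPy, assumed rather than taken from the book) reproduces the margin-of-error calculation from the rounded values quoted above.

```python
from scipy import stats

diff, se, df = 23.55, 6.67, 15

t_star = stats.t.ppf(0.975, df)       # critical value for a 95% CI, about 2.1314
margin = t_star * se                  # margin of error, about 14.2
ci = (diff - margin, diff + margin)   # about 9.3 to 37.8 (rounding explains tiny differences)
print(t_star, margin, ci)
```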
t ratio
To determine the P value, first compute the t ratio, which equals the difference be-
tween the two sample means (23.5) divided by the standard error of that difference
(6.67). In the rat example, t = 3.53. The numerator and denominator have the same
units, so the t ratio has no units. Programs that compute the t test always report the
t ratio, even though it is not particularly informative by itself.
Don’t mix up the t ratio computed from the data with the critical value from
the t distribution (see previous section) used to compute the CI.
P value
The P value is computed from the t ratio and the number of df, which equals the
total number of values (in both groups) minus 2.
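The same rounded values give the t ratio and the two-tailed P value. Again, this is only a sketch of the arithmetic.

```python
from scipy import stats

t_ratio = 23.55 / 6.67                     # about 3.53
df = 15                                    # total number of values in both groups minus 2
p = 2 * stats.t.sf(abs(t_ratio), df)       # two-tailed P value, about 0.003
print(t_ratio, p)
```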
Mistake: If the P value is larger than 0.05, try other tests to see whether
they give a lower P value
If the P value from an unpaired t test is “too high,” it is tempting to try a nonpara-
metric test or to run the t test after removing potential outliers or transforming the
values to logarithms. If you try multiple tests and choose to only report the results
you like the most, those results will be misleading. You can’t interpret the P value
at face value when it was chosen from a menu of P values computed using differ-
ent methods. You should pick one test before collecting data and use it.
[Figure: Response in the control and treated groups.]
Q&A
Can a t test be computed from the mean, SD, and sample size of each group?
Yes. You don’t need the raw data to compute an unpaired t test. It is enough to
know the mean, SD or SEM, and sample size of each group.
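For example, SciPy can run an unpaired t test directly from summary statistics. The numbers below are placeholders for illustration, not the values from the rat example.

```python
from scipy import stats

result = stats.ttest_ind_from_stats(mean1=53.3, std1=14.0, nobs1=9,
                                    mean2=29.7, std2=12.0, nobs2=8)
print(result.statistic, result.pvalue)
```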
Why is it called an unpaired test?
As its name suggests, this test is used when the values in the two groups are
not matched. It should not be used when you compare two measurements in
each subject (perhaps before and after an intervention) or if each measurement
in one group is matched to a particular value in the other group. These kinds of
data are analyzed using the paired t test, which is explained in Chapter 31.
Is the unpaired t test sometimes called Student’s t test because it is the first statistical
test inflicted on students of statistics?
No. W. S. Gossett developed the t test when he was employed by a brewery
that required him to publish anonymously to keep as a trade secret the fact that
they were using statistics. Gossett’s pen name was “Student.” He is no longer
anonymous, but the name Student is still attached to the test.
How is the z test different than the t test?
The t test is computed using the t ratio, which is computed from the difference
between the two sample means and the sample size and SD of the two
groups. Some texts also discuss using the z test, based on the z ratio, which
requires knowing the SD of the population from which the data are sampled.
This population’s SD cannot be computed from the data at hand but rather
must be known precisely from other data. Since this situation (knowing the
population SD precisely) occurs rarely (if ever), the z test is rarely useful.
Does it matter whether or not the two groups have the same numbers of
observations?
No. The t test does not require equal n. However, the t test is more robust to
violations of the Gaussian assumption when the sample sizes are equal or nearly
equal.
Why is the t ratio sometimes positive and sometimes negative?
The sign (positive or negative) of t depends on which group has the larger mean
and in which order they are entered into the statistical program. Because the order
in which you enter the groups into a program is arbitrary, the sign of the t ratio is
irrelevant. In the example data, the t ratio might be reported as 3.53 or –3.53.
It doesn’t matter, because the CI and P value will be the same either way.
What would a one-tailed P value be in this example?
As explained in Chapter 15, the P value can be one- or two-sided. Here, the
P value is two-sided (two-tailed). If the null hypothesis were true, the probability
of observing a difference as large as that observed here with the young rats
having the larger mean is 0.0015. That’s a one-tailed P value. The chance of
observing a difference as large as observed here with the old rats having the
larger mean is also 0.0015. That’s the other one-tailed P value. The sum of those
two one-tailed P values equals the two-tailed P value (0.0030).
How is the t ratio reported with the t test different than the critical value of the t distri-
bution used to calculate the CI?
The t ratio is computed from the data and used to determine the P value. The
critical value of the t distribution, abbreviated t* in this book, is computed from
the sample size and the confidence level you want (usually 95%) and is used
to compute the CI. The value of t* does not depend on the data (except for
sample size).
If you perform an unpaired t test with data that has very unequal variances, will the
P value be too big or too small?
It can go either way depending on the data set (Delacre, Lakens, and Leys, 2017).
If you are not sure if the populations have equal variances or not, should you choose
the regular t test because the variances might be the same, or should one choose
Welch’s t test because the variances might be different?
Ruxton (2006) and Delacre, Lakens, and Leys (2017) make a strong case to use
the Welch’s test routinely.
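In SciPy, for instance, Welch's test is simply the unpaired t test with equal_var set to False. The two arrays below are hypothetical stand-ins for raw data.

```python
from scipy import stats

group_old   = [61, 42, 55, 70, 48, 39, 66, 52, 47]   # hypothetical values
group_young = [28, 35, 22, 31, 40, 25, 33, 24]       # hypothetical values

classic = stats.ttest_ind(group_old, group_young)                   # assumes equal SDs
welch   = stats.ttest_ind(group_old, group_young, equal_var=False)  # Welch's t test
print(classic.pvalue, welch.pvalue)
```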
CHAPTER SUMMARY
• The unpaired t test compares the means of two unmatched groups.
• The results are expressed both as a CI for the difference between means
and as a P value testing the null hypothesis that the two population means
are identical.
• The P value can be tiny even if the two distributions overlap substantially.
This is because the t test compares only the means.
• In addition to the usual assumptions, the unpaired t test assumes that the
two populations follow Gaussian distributions with the same SD in each
population.
• Looking at whether error bars overlap or not is not very helpful. The one
rule of thumb worth remembering is that if two SEM error bars overlap, the
P value is greater than 0.05.
Comparing Two Paired Groups
To call in the statistician after the experiment is done may
be no more than asking him to perform a postmortem
examination: he may be able to say what the experiment
died of.
R. A. FISHER
This chapter explains the paired t test, which compares two matched or
paired groups when the outcome is continuous, and the McNemar’s
test, which compares two matched or paired groups when the outcome
is binomial. Chapter 30 explained the unpaired t test and Chapter 41
explains the nonparametric Wilcoxon test (and computer-intensive boot-
strapping methods).
[Figure: Plant height (inches) of the cross-fertilized and self-fertilized plants, plotted in two panels.]
PAIRED t TEST
P value 0.0497
P value summary *
Are means significantly different? (P < 0.05) Yes
One- or two-tailed P value? Two-tailed
t, df t = 2.148, df = 14
Number of pairs 15
HOW BIG IS THE DIFFERENCE?
Mean of differences 2.617
95% CI 0.003639 to 5.230
HOW EFFECTIVE WAS THE PAIRING?
Correlation coefficient (r) –0.3348
P value (one-tailed) 0.1113
P value summary ns
Was the pairing significantly effective? No
CI
A paired t test looks at the difference in measurements between two matched
subjects (as in our example; see Figure 31.3) or a measurement made before and
after an experimental intervention. For the sample data, the average difference in
plant length between cross- and self-fertilized seeds is 2.62 inches. To put this in
perspective, the self-fertilized plants have an average height of 17.6 inches, so this
difference is approximately 15%.
The 95% CI for the difference ranges from 0.003639 inches to 5.230 inches.
Because the CI does not include zero, we can be 95% confident that the cross-
fertilized seeds grow more than the self-fertilized seeds. But the CI extends from
a tiny difference (especially considering that plant height was only measured to
the nearest one-eighth of an inch) to a fairly large difference.
The width of the CI depends on three values:
• Variability. If the observed differences are widely scattered, with some
pairs having a large difference and some pairs a small difference (or a dif-
ference in the opposite direction), then the CI will be wider. If the data are
very consistent, the CI will be narrower.
• Sample size. Everything else being equal, a sample with more pairs will gen-
erate narrower CIs, and a sample with fewer pairs will generate wider CIs.
• Degree of confidence. If you wish to have more confidence (i.e., 99%
rather than 95%), the interval will be wider. If you are willing to accept less
confidence (i.e., 90% confidence), the interval will be narrower.
[Figure: Difference in height (crossed - self), shown for each pair and as the mean difference with its SEM.]
P value
The null hypothesis is there is no difference in the height of the two kinds of
plants. In other words, the differences that we measure are sampled from a popula-
tion for which the average difference is zero. By now you should be able to easily
state the question the P value answers: If the null hypothesis were true, what is the
chance of randomly observing a difference as large as or larger than the difference
observed in this experiment?
The P value is 0.0497. If the null hypothesis were true, 5% of random samples
of 15 pairs would have a difference this large or larger. Using the traditional defini-
tion, the difference is statistically significant, because the P value is less than 0.05.
As explained in Chapter 15, the P value can be one- or two-sided. Here, the
P value is two-sided (two-tailed). Assuming the null hypothesis is true, the prob-
ability of observing a difference as large as that observed here with the self-
fertilized plants growing more than the cross-fertilized plants is 0.0248. The
chance of observing a difference as large as that observed here with the cross-
fertilized plants growing more is also 0.0248. The sum of those two probabilities
equals the two-sided P value (0.0497).
The size of the P value depends on the following:
• Mean difference. Everything else being equal, the P value will be smaller
when the average of the differences is far from zero.
• Variability. If the observed differences are widely scattered, with some
pairs having a large difference and some pairs having a small difference,
then the P value will be higher. If the data are very consistent, the P value
will be smaller.
• Sample size. Everything else being equal, the P value will be smaller when
the sample has more data pairs.
Assumptions
When assessing any statistical results, it is essential to review the list of assump-
tions. The paired t test is based on a familiar list of assumptions (see Chapter 12).
• The paired values are randomly sampled from (or at least are representative of)
a population of paired samples.
• In that population, the differences between the matched values follow a
Gaussian distribution.
• Each pair is selected independently of the others.
[Figure 31.3. The height of each self-fertilized plant plotted against the height of its paired cross-fertilized plant; r = –0.33.]
The CI for the mean difference is calculated exactly as explained in Chapter 12,
except that the values used for the calculations are the set of differences.
The t ratio is computed by dividing the average difference (2.62 inches) by
the SEM of those differences (1.22 inches). The numerator and denominator have
the same units, so the t ratio is unitless. In the example, t = 2.15.
The P value is computed from the t ratio and the number of df, which equals
the number of pairs minus 1.
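Here is a minimal sketch of that arithmetic in Python with SciPy, using the summary values quoted above (mean difference 2.62 inches, SEM 1.22 inches, 15 pairs). It is a check on the calculation, not the program used in the book.

```python
from scipy import stats

mean_diff, sem_diff, n_pairs = 2.62, 1.22, 15
df = n_pairs - 1

t_ratio = mean_diff / sem_diff                 # about 2.15
p = 2 * stats.t.sf(abs(t_ratio), df)           # about 0.05
t_star = stats.t.ppf(0.975, df)
ci = (mean_diff - t_star * sem_diff,
      mean_diff + t_star * sem_diff)           # about 0.004 to 5.2 inches
print(t_ratio, p, ci)
```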
Table 31.3.
CONTROL    TREATED
   24         52
    6         11
   16         28
    5          8
    2          4
Figure 31.5. Enzyme activity in matched sets of control and treated cells.
The mean of the differences (treated minus control) is 10.0. The 95% CI for the average
differences ranges from –3.4 to 23.4. Because that CI includes zero, you can’t be sure (with
95% confidence) whether the treatment had any effect on the activity. The P value (two-tailed,
from a paired t test) is 0.107, which is not low enough to be sure that the difference between
control and treated is real; the difference could easily just be the result of random sampling.
With only five pairs and inconsistent data, the data only lead to a very fuzzy conclusion.
Figure 31.6. The same data as those shown in Figure 31.5 and Table
31.3, plotted on a logarithmic axis.
The right axis shows the logarithms. The left axis is labeled with the
original values, logarithmically spaced. Both sets of data are plotted on
both axes, which plot the same values but are expressed differently.
                        CONTROLS
CASES                   Exposed    Not exposed
Exposed                    13          25
Not exposed                 4          92
The 13 pairs for which both cases and controls were exposed to the risk
factor provide no information about the association between risk factor and dis-
ease. Similarly, the 92 pairs for which both cases and controls were not exposed
to the risk factor provide no useful information. The association between risk
factor and disease depends on the other two values in the table, and their ratio is
the odds ratio.
For the example, the odds ratio is the number of pairs for which the case was
exposed to the risk factor but the control was not (25), divided by the number of
pairs for which the control was exposed to the risk factor but the case was not (4),
which equals 6.25.
McNemar’s test computes the 95% CI of the odds ratio and a P value from
the two discordant values (25 and 4 for this example). Not all statistics programs
calculate McNemar’s test, but it can be computed using a free Web calculator in
the QuickCalcs section of graphpad.com.
The 95% CI for the odds ratio ranges from 2.16 to 24.71. Assuming the disease
is fairly rare, the odds ratio can be interpreted as a relative risk (see Chapter 28).
These data show that exposure to the risk factor increases one’s risk 6.25-fold and
that there is 95% confidence that the range from 2.16 to 24.7 contains the true
population odds ratio.
McNemar’s test also computes a P value of 0.0002. To interpret a P value, the
first step is to define the null hypothesis, which is that there really is no associa-
tion between risk factor and disease, so the population odds ratio is 1.0. If that
null hypothesis were true, the chance of randomly selecting 134 pairs of cases and
controls with an odds ratio so far from 1.0 is only 0.02%.
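If your program doesn't offer McNemar's test, the chi-square approximation (with a continuity correction) is easy to compute from the two discordant counts. The sketch below is an assumed illustration; GraphPad may use a slightly different (exact) method, so small discrepancies are expected.

```python
from scipy import stats

b, c = 25, 4                                   # the two discordant counts
odds_ratio = b / c                             # 6.25

chi_square = (abs(b - c) - 1) ** 2 / (b + c)   # continuity-corrected statistic
p = stats.chi2.sf(chi_square, 1)               # about 0.0002
print(odds_ratio, chi_square, p)
```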
Mistake: Using the paired t test when the ratios, not the differences,
are a more consistent measure of treatment effects
The ratio t test, as explained in the previous discussion, is not commonly used, but
in many situations it is the best choice.
Q&A
Can I decide how the subjects should be matched after collecting the data?
No. Pairing must be part of the experimental protocol that is decided before
the data are collected. The decision about pairing is a question of experimental
design and should be made long before the data are analyzed.
When computing the difference for each pair, in which order is the subtraction done?
It doesn’t matter, so long as you are consistent. For each pair in the example, the
length of the self-fertilized plant was subtracted from the length of the cross-
fertilized plant. If the calculation were done the other way (if the length of the
cross-fertilized plant was subtracted from the length of the self-fertilized plant),
all the differences would have the opposite sign, as would the t ratio. The P value
would be the same. It is very important, of course, that the subtraction be done the
same way for every pair. It is also essential that the program doing the calculations
doesn’t lose track of which differences are positive and which are negative.
What if one of the values for a pair is missing?
The paired t test analyzes the differences between each set of pairs. If one value
for a pair is missing, that pair can contribute no information at all, so it must be
excluded from the analysis. A paired t test program will ignore pairs that lack
one of the two values.
Can a paired t test be computed if you know only the mean and SD of the two
treatments (or two time points) and the number of pairs?
No. These summary data tell you nothing about the pairing.
Can a paired t test be computed if you know only the mean and SD of the set of
differences (as well as the number of pairs)? What about the mean and SEM?
Yes. All you need to compute a paired t test is the mean of the differences, the
number of pairs, and the SD or SEM of the differences. You don’t need the raw data.
Does it matter whether the two groups are sampled from populations that are not
Gaussian?
No. The paired t test only analyzes the set of paired differences, which it assumes
is sampled from a Gaussian distribution. This doesn’t mean that the two
individual sets of values need to be Gaussian. If you are going to run a normality
test on paired t test data, it only makes sense to test the list of differences (one
value per pair). It does not make sense to test the two sets of data separately.
Will a paired t test always compute a lower P value than an unpaired test?
No. With the Darwin example, an unpaired test computes a lower P value than a
paired test. When the pairing is effective (i.e., when the set of differences is more
consistent than either set of values), the paired test will usually compute a lower
P value.
CHAPTER SUMMARY
• The paired t test compares two matched or paired groups when the outcome
is continuous.
• The paired t test only analyzes the set of differences. It assumes that this
set of differences was sampled from a Gaussian population of differences.
• The paired t test is used when the difference between the paired measure-
ments is a consistent measure of change. Use the ratio t test when the ratio
of the paired measurements is a more consistent measure of the change or
effect.
• McNemar’s test compares two matched or paired groups when the outcome
is binomial (categorical with two possible outcomes).
Correlation
The invalid assumption that correlation implies cause is prob-
ably among the two or three most serious and common errors
of human reasoning.
STEPHEN JAY GOULD
[Figure: Insulin sensitivity (mg/m2/min) plotted against %C20–22 fatty acids.]
The graph shows a clear relationship between the two variables. Individu-
als whose muscles have more C20–22 polyunsaturated fatty acids tend to have
greater sensitivity to insulin. The two variables vary together—statisticians say
that there is a lot of covariation or correlation.
Correlation results
The direction and magnitude of the linear correlation can be quantified with a
correlation coefficient, abbreviated r. Its value can range from –1 to 1. If the cor-
relation coefficient is zero, then the two variables do not vary together at all.
CORRELATION
r 0.77
95% CI 0.3803 to 0.9275
r2 0.5929
P (two-tailed) 0.0021
Number of XY pairs 13
If the correlation coefficient is positive, the two variables tend to increase or de-
crease together. If the correlation coefficient is negative, the two variables are
inversely related—that is, as one variable tends to decrease, the other one tends to
increase. If the correlation coefficient is 1 or –1, the two variables vary together
completely—that is, a graph of the data points forms a straight line.
The results of a computer program (GraphPad Prism 6) are shown in Table 32.2.
r2
The square of the correlation coefficient is an easier value to interpret than r. For
the example, r = 0.77, so r2 (referred to as “r squared”) = 0.59. Because r is always
between –1 and 1, r2 is always between zero and 1. Note that when you square a
positive fraction, the result is a smaller fraction.
If you can accept the assumptions listed in the next section, r2 is the fraction
of the variance shared between the two variables. In the example, 59% of the vari-
ability in insulin tolerance is associated with variability in lipid content. Knowing
the lipid content of the membranes lets you explain 59% of the variance in the
insulin sensitivity. That leaves 41% of the variance to be explained by other fac-
tors or by measurement error. X and Y are symmetrical in correlation analysis,
so you can also say that 59% of the variability in lipid content is associated with
variability in insulin tolerance.
P value
The P value is 0.0021. To interpret any P value, first define the null hypothesis.
Here, the null hypothesis is that there is no correlation between insulin sensitiv-
ity and the lipid composition of membranes in the overall population. The two-
tailed P value answers the question, If the null hypothesis were true, what is the
chance that 13 randomly picked subjects would have an r greater than 0.77 or
less than –0.77?
Because the P value is so low, the data provide evidence to reject the null
hypothesis.
ASSUMPTIONS: CORRELATION
You can calculate the correlation coefficient for any set of data, and it may be a
useful descriptor of the data. However, interpreting the CI and P value depends on
the following assumptions.
You will get the same value for r2 and the P value. However, the CI of r cannot be
interpreted if the experimenter controlled the value of X.
Assumption: No outliers
Calculation of the correlation coefficient can be heavily influenced by one out-
lying point. Change or exclude that single point and the results may be quite
different. Outliers can influence all statistical calculations, but they have an especially
large effect on the correlation coefficient. You should look at graphs of the data before reaching
any conclusion from correlation coefficients. Don’t instantly dismiss outliers as
bad points that mess up the analysis. It is possible that the outliers are the most
interesting observations in the study!
LINGO: CORRELATION
Correlation
When you encounter the word correlation, distinguish its strict statistical meaning
from its more general usage. As used by statistics texts and programs, correlation
quantifies the association between two continuous (interval or ratio) variables.
However, the word correlation is often used much more generally to describe the
association of any two variables, even if either (or both) of the variables are not
continuous variables. It is not possible to compute a correlation coefficient to
help you figure out whether survival times are correlated with choice of drug or
whether antibody levels are correlated with gender.
Coefficient of determination
Coefficient of determination is just a fancy term for r2. Most scientists and statisti-
cians just call it r square or r squared.
HOW IT WORKS: CALCULATING THE CORRELATION COEFFICIENT
1. Calculate the average of all X values. Also calculate the average of all
Y values. These two averages define a point at “the center of gravity” of
the data.
2. Compare the position of each point with respect to that center. Subtract
the average X value from each X value. The result will be positive for
points to the right of the center and negative for points to the left. Simi-
larly, subtract the average Y value from each Y value. The result will be
positive for points above the center and negative for points below.
3. Standardize those X distances by dividing by the SD of all the X values.
Similarly, divide each Y distance by the SD of all Y values. Dividing a
distance by the SD cancels out the units, so these ratios are fractions
without units.
4. Multiply the two standardized distances for each data point. The product
will be positive for points that are northeast (product of two positive
numbers) or southwest (product of two negative numbers) of the center.
The product will be negative for points that are to the northwest or south-
east (product of a negative and a positive number).
5. Add up all the products computed in Step 4.
6. Account for sample size by dividing the sum by (n – 1), where n is the
number of XY pairs.
If X and Y are not correlated, then the positive products in Step 4 will ap-
proximately balance out the negative ones, and the correlation coefficient will be
close to zero. If X and Y are correlated, the positive and negative products will
not balance each other out, and the correlation coefficient will be far from zero.
The nonparametric Spearman correlation (see Chapter 41) adds one step to
the beginning of this process. First, you would separately rank the X and Y values,
with the smallest value getting a rank of 1. Then you would calculate the correla-
tion coefficient for the X ranks and the Y ranks, as explained in the previous steps.
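The six steps translate directly into code. The sketch below (Python with NumPy and SciPy) uses hypothetical X and Y values, since the 13 data pairs of the example are not listed in this section, and confirms that the step-by-step result matches a library routine.

```python
import numpy as np
from scipy import stats

x = np.array([18.0, 19.2, 20.1, 20.9, 21.5, 22.3, 23.0, 23.8])   # hypothetical
y = np.array([190., 220., 210., 280., 260., 320., 310., 380.])   # hypothetical

zx = (x - x.mean()) / x.std(ddof=1)    # Steps 1-3: standardized X distances
zy = (y - y.mean()) / y.std(ddof=1)    # Steps 1-3: standardized Y distances
r = np.sum(zx * zy) / (len(x) - 1)     # Steps 4-6

print(r, stats.pearsonr(x, y)[0])      # the two values agree

# Spearman correlation: rank each variable first, then correlate the ranks
print(stats.spearmanr(x, y)[0])
```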
You can never rule out the last possibility, but the P value tells you how
unlikely the coincidence would be. In this example, you would observe a correla-
tion that strong (or stronger) in 0.21% (the P value) of experiments if there was no
correlation in the overall population.
You cannot decide which of the first four possibilities is correct by analyzing
only these data. The only way to figure out which is true is to perform additional
experiments in which you manipulate the variables. Remember that this study
simply measured both values in a set of subjects. Nothing was experimentally
manipulated.
The authors, of course, want to believe the first possibility based on their
knowledge of physiology. That does not mean they believe that the lipid composi-
tion is the only factor that determines insulin sensitivity, only that it is one factor.
Most people immediately think of the first two possibilities but ignore the
rest. Correlation does not necessarily prove simple causality, however. Two vari-
ables can be correlated because both are influenced by the same third variable.
Infant mortality in various countries is probably negatively correlated with the
number of telephones per capita, but buying telephones will not make kids live
longer. Instead, increased wealth (and thus increased purchases of telephones) is
related to better plumbing, better nutrition, less crowded living conditions, more
vaccinations, and so on.
Table 32.3. Correlation between a measure of intelligence and three measures of sperm
quality.
Data from Arden et al. (2008), with n = 425.
tiny and that all correlations go in the same direction helps make the data
convincing.
• The P values are all quite small. If there really was no relationship between
sperm and intelligence, there would only be a tiny chance of obtaining a
correlation this strong.
• The r values are all small. It is easier to interpret the r2 values, which show
that only 2% to 3% of the variation in intelligence in this study is associated
with variation in sperm count and motility.
The last two points illustrate a common situation. With large samples, data
can show small effects (in this case, r2 values of 2%–3%) and still have tiny
P values. The fact that the P values are small tells you that the effect is unlikely to
be a coincidence of random sampling. But the size of the P value does not tell you
how large the effect is. The effect size is quantified by r and r2. Here, the effect
size is tiny.
Is an r2 value of 2% to 3% large enough to be considered interesting and worth
pursuing? That is a scientific question, not a statistical one. The investigators of
this study seemed to think so. Other investigators might conclude that such a tiny
effect is not worthy of much attention. I doubt that many newspaper reporters who
wrote about the supposed link between sperm count and intelligence understood
how weak the reported relationship really is.
If two variables A and B are completely independent with zero correlation, you’d
expect the correlation coefficient of A versus A-B to equal about 0.7. See Figure 33.5
in the next chapter on linear regression to see why this is a problem.
Q&A
CHAPTER SUMMARY
• The correlation coefficient, r, quantifies the strength of the linear relation-
ship between two continuous variables.
• The value of r is always between –1.0 and 1.0; it has no units.
• A negative value of r means that the trend is negative—that is, Y tends to
go down as X goes up.
• A positive value of r means that the trend is positive—that is, X and Y go
up together.
• The square of r equals the fraction of the variance that is shared by X and Y.
This value, r2, is sometimes called the coefficient of determination.
• Correlation is used when you measure both X and Y variables, but it is not
appropriate if X is a variable you manipulate.
• Note that correlation and linear regression are not the same. In particular,
note that the correlation analysis does not fit or plot a line.
• You should always view a graph of the data when interpreting a correlation
coefficient.
• Correlation does not imply causation. The fact that two variables are cor-
related does not mean that changes in one cause changes in the other.
Fitting Models to Data
CHAPTER 33
Simple Linear Regression
In the space of 176 years, the Lower Mississippi has shortened
itself 242 miles. This is an average of a trifle over one mile and
a third per year. Therefore . . . any person can see that seven
hundred and forty-two years from now, the lower Mississippi
will be only a mile and three-quarter long.
MARK TWAIN, LIFE ON THE MISSISSIPPI
One way to think about linear regression is that it fits the “best line”
through a graph of data points. Another way to look at it is that linear
regression fits a simple model to the data to determine the most likely
values of the parameters that define that model (slope and intercept).
Chapter 34 introduces the more general concepts of fitting a model to data.
BEST-FIT VALUES
Slope 37.21 ± 9.296
Y intercept when X = 0.0 –486.5 ± 193.7
X intercept when Y = 0.0 13.08
1/slope 0.02688
95% CI
Slope 16.75 to 57.67
Y intercept when X = 0.0 –912.9 to –60.17
X intercept when Y = 0.0 3.562 to 15.97
GOODNESS OF FIT
R² 0.5929
Sy.x 75.90
IS SLOPE SIGNIFICANTLY NONZERO?
F 16.02
DFn, DFd 1.000, 11.00
P value 0.0021
Deviation from zero? Significant
DATA
Number of X values 13
Maximum number of Y replicates 1
Total number of values 13
Number of missing values 0
We can’t possibly know the one true population value for the intercept or the
one true population value for the slope. Given the sample of data, our goal is to
find those values for the intercept and slope that are most likely to be correct and
to quantify the imprecision with CIs. Table 33.1 shows the results of linear regres-
sion. The next section will review all of these results.
It helps to think of the model graphically (see Figure 33.1). A simple way
to view linear regression is that it finds the straight line that comes closest to the
points on a graph. That’s a bit too simple, however. More precisely, linear regression
finds the line that best predicts Y from X. To do so, it only considers the vertical
distances of the points from the line and, rather than minimizing those distances,
it minimizes the sum of the square of those distances. Chapter 34 explains why.
Figure 33.1. The best-fit linear regression line along with its 95%
confidence band (shaded).
The CI is an essential (but often omitted) part of the analysis. The 95% CI of
the slope ranges from 16.7 to 57.7 mg/m2/min. Although the CI is fairly wide, it
does not include zero; in fact, it doesn’t even come close to zero. This is strong
evidence that the observed relationship between lipid content of the muscles and
insulin sensitivity is very unlikely to be a coincidence of random sampling.
The range of the CI is reasonably wide. The CI would be narrower if the
sample size were larger.
Some programs report the standard errors of the slope instead of (or in addi-
tion to) the CIs. The standard error of the slope is 9.30 mg/m2/min. CIs are easier
to interpret than standard errors, but the two are related. If you read a paper that re-
ports the standard error of the slope but not its CI, compute the CI using these steps:
1. Look in Appendix D to find the critical value of the t distribution. The
number of df equals the number of data points minus 2. For this example,
there were 13 data points, so there are 11 df, and the critical t ratio for a
95% CI is 2.201.
2. Multiply the value from Step 1 by the standard error of the slope reported
by the linear regression program. For the example, multiply 2.201 times
9.30. The margin of error of the CI equals 20.47.
3. Add and subtract the value computed in Step 2 from the best-fit value
of the slope to obtain the CI. For the example, the interval begins at
37.2 − 20.5 = 16.7. It ends at 37.2 + 20.5 = 57.7.
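Those three steps are easy to script. This sketch (Python with SciPy, assumed rather than from the book) uses the values quoted above: slope 37.21, standard error 9.30, and 13 data points.

```python
from scipy import stats

slope, se, n = 37.21, 9.30, 13
df = n - 2                               # 11

t_star = stats.t.ppf(0.975, df)          # Step 1: about 2.201
margin = t_star * se                     # Step 2: about 20.5
ci = (slope - margin, slope + margin)    # Step 3: about 16.7 to 57.7
print(ci)
```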
The intercept
A line is defined by both its slope and its Y-intercept, which is the value of the
insulin sensitivity when the %C20–22 equals zero.
For this example, the Y-intercept is not a scientifically relevant value. The
range of %C20–22 in this example extends from about 18% to 24%. Extrapolat-
ing back to zero is not helpful. The best-fit value of the intercept is –486.5, with
a 95% CI ranging from –912.9 to –60.17. Negative values are not biologically
possible, because insulin sensitivity is assessed as the amount of glucose needed
to maintain a constant blood level and so must be positive. These results tell us
that the linear model cannot be correct when extrapolated way beyond the range
of the data.
Graphical results
Figure 33.1 shows the best-fit regression line defined by the values for the slope
and the intercept, as determined by linear regression.
The shaded area in Figure 33.1 shows the 95% confidence bands of the
regression line, which combine the CIs of the slope and the intercept. The best-
fit line determined from this particular sample of subjects (solid black line) is
unlikely to really be the best-fit line for the entire (infinite) population. If the
assumptions of linear regression are true (which is discussed later in this chapter),
you can be 95% sure that the overall best-fit regression line lies somewhere within
the shaded confidence bands.
The 95% confidence bands are curved but do not allow for the possibility
of a curved (nonlinear) relationship between X and Y. The confidence bands are
computed as part of linear regression, so they are based on the same assumptions
as linear regression. The curvature simply is a way to enclose possible straight
lines, of which Figure 33.2 shows two.
The 95% confidence bands enclose a region that you can be 95% confident
includes the true best-fit line (which you can’t determine from a finite sample
of data). But note that only six of the 13 data points in Figure 33.2 are included
within the confidence bands. If the sample was much larger, the best-fit line would
be determined more precisely, so the confidence bands would be narrower and a
smaller fraction of data points would be included within the confidence bands.
Note the similarity to the 95% CI for the mean, which does not include 95% of
the values (see Chapter 12).
R2
The R2 value (0.5929) means that 59% of all the variance in insulin sensitiv-
ity can be accounted for by the linear regression model and the remaining 41%
of the variance may be caused by other factors, measurement errors, biological
variation, or a nonlinear relationship between insulin sensitivity and %C20–22.
Chapter 35 will define R2 more rigorously. The value of R2 for linear regression
ranges between 0.0 (no linear relationship between X and Y) and 1.0 (the graph of
X vs. Y forms a perfect line).
P value
Linear regression programs report a P value. To interpret any P value, the null
hypothesis must be stated. With linear regression, the null hypothesis is that
there really is no linear relationship between insulin sensitivity and %C20–22. If
the null hypothesis were true, the best-fit line in the overall population would be
horizontal with a slope of zero. In this example, the 95% CI for slope does not
include zero (and does not come close), so the P value must be less than 0.05. In
fact, it is 0.0021. The P value answers the question, If that null hypothesis were
true, what is the chance that linear regression of data from a random sample of
subjects would have a slope as far (or farther) from zero as that which is actu-
ally observed?
In this example, the P value is tiny, so we conclude that the null hypothesis is
very unlikely to be true and that the observed relationship is unlikely to be caused
by a coincidence of random sampling.
With this example, it makes sense to analyze the data both with correlation
(see Chapter 32) and with linear regression. The two are related. The null hy-
pothesis for correlation is that there is no correlation between X and Y. The null
hypothesis for linear regression is that a horizontal line is correct. Those two null
hypotheses are essentially equivalent, so the P values reported by correlation and
linear regression are identical.
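You can see this equivalence with any data set. The sketch below uses simulated (hypothetical) X and Y values; correlation and linear regression report the same P value, and r squared equals R squared.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(18, 24, 13)                       # hypothetical X values
y = 37.0 * x - 486 + rng.normal(0, 75, size=13)   # hypothetical Y values

r, p_corr = stats.pearsonr(x, y)
fit = stats.linregress(x, y)

print(p_corr, fit.pvalue)       # the same P value
print(r ** 2, fit.rvalue ** 2)  # the same value for r squared and R squared
```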
The linear regression equation defines a line that extends infinitely in both
directions. For any value of X, no matter how high or how low, the equation can
predict a Y value. Of course, it rarely makes sense to believe that a model can
extend infinitely. But the model can be salvaged by using the predictions of the
model that only fall within a defined range of X values. Thus, we only need to
assume that the relationship between X and Y is linear within that range. In the
example, we know that the model cannot be accurate over a broad range of X
values. At some values of X, the model even predicts that Y will be negative, a
biological impossibility. But the linear regression model is useful within the range
of X values actually observed in the experiment.
Parameters
The goal of linear regression is to find the values of the slope and intercept that
make the line come as close as possible to the data. The slope and the intercept
are called parameters.
Regression
Why the strange term regression? In the 19th century, Sir Francis Galton studied the
relationship between parents and children. Children of tall parents tended to be shorter
than their parents. Children of short parents tended to be taller than their parents. In
each case, the height of the children reverted, or “regressed,” toward the mean height of
all children. Somehow, the term regression has taken on a much more general meaning.
Residuals
The vertical distances of the points from the regression line are called residuals.
A residual is the discrepancy between the actual Y value and the Y value predicted
by the regression model.
Least squares
Linear regression finds the value of the slope and intercept that minimizes the
sum of the squares of the vertical distances of the points from the line. This goal
gives linear regression the alternative name linear least squares.
Linear
The word linear has a special meaning to mathematical statisticians. It can be
used to describe the mathematical relationship between model parameters and the
outcome. Thus, it is possible for the relationship between X and Y to be curved but
for the mathematical model to be considered linear.
[Figure: Number of hurricanes each year, 1985–2005. Left panel (raw data): R2 = 0.12, P = 0.13. Right panel (rolling averages): R2 = 0.76, P < 0.0001.]
[Figure: Systolic blood pressure (mmHg). Top panels: after versus before values, with the best-fit regression (R2 < 0.0001, P = 0.9789). Bottom panel: change in blood pressure (mmHg, after minus before) plotted against the before value.]
random, and the two data sets (top-left graph) look about the same. Therefore, a best-
fit regression line (top-right graph) is horizontal. Although blood pressure levels
varied between measurements, there was no systematic effect of the treatment.
The graph on the bottom shows the same data. But now the vertical axis shows the
change in blood pressure (after–before) plotted against the starting (before) blood
pressure. Note the striking linear relationship. Individuals who initially had low
pressures tended to see an increase; individuals with high pressures tended to see a
decrease. This is entirely an artifact of data analysis and tells you nothing about the
effect of the treatment, only about the stability of the blood pressure levels between
treatments. Chapter 1 gave other examples of regression to the mean.
Graphing a change in a variable versus the initial value of the variable is quite
misleading. Attributing a significant correlation on such a graph to an experimen-
tal intervention is termed the regression fallacy. Such a plot should not be analyzed
by linear regression, because these data (as presented) violate one of the assump-
tions of linear regression, that the X and Y values are not intertwined.
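A quick simulation (an assumed illustration, not from the book) shows how strong this artifact is. Even when the before and after values are drawn independently, so the treatment has no effect at all, the change correlates strongly and negatively with the before value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.normal(120, 10, size=24)   # simulated baseline pressures
after  = rng.normal(120, 10, size=24)   # same distribution: no real treatment effect

print(stats.pearsonr(before, after)[0])            # close to zero
print(stats.pearsonr(before, after - before)[0])   # roughly -0.7, purely an artifact
```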
Figure 33.7. Beware of extrapolating models.
(Top) Data that fit a linear regression model very well. What happens at later time points? The three graphs on the bottom show three alternative
predictions. (Left) The prediction of linear regression. (Middle and right) Predictions of two other models that actually fit the data slightly better
than a straight line does. Predictions of linear regression that extend far beyond the data may be very wrong.
[Figure: Y versus X for n = 500 data points; R2 = 0.0094, P = 0.0302.]
Mistake: Using linear regression when the data are complete with no
sampling error
Figure 33.11 shows the number of high school students in the United States who
took the advanced placement (AP) exam in statistics each year from 1998 to 2015
(with data from two years missing). The graph shows the total number of stu-
dents who took the exam. There is no sampling from a population and no random
error. The exam is given once a year, so there are no values between the points to
interpolate.
The graph shows the linear regression line, but it is not very helpful. The stan-
dard error and CI of the slope would be meaningless. Interpolating between years
would be meaningless as well, because the exam is only offered once each year.
Computing a P value would be meaningless because there is no random sampling.
The numbers really do speak for themselves, and the results of linear regression
would not add any insights and might even be misleading.
What if you want to extrapolate to predict the number of students who will
take the exam the next year (2016)? Using a linear regression line would not
be unreasonable. But it might be just as reasonable to simply predict that the
increase from 2015 to 2016 (which you are trying to predict) will equal the in-
crease from 2014 to 2015 (which you know). Any prediction requires assuming a
model, and there really is no reason to assume that the increase will be linear in-
definitely. In fact, it looks as though the increase from year to year is increasing
in recent years. You could also use a second- or third-order polynomial model
(or some other model) to predict future years. Or maybe only fit to data from the
last few years.
Figure 33.11. No need for linear regression.
The number of high school students who took the advanced placement test in statistics
each year from 1998 to 2015. These data are not sampled from a larger population. The
graph shows the total number of U.S. students who took the advanced placement statistics
exam each year. The dotted line shows the best-fit linear regression line, but it would not be
useful to describe the data.
Source: Data from Wikipedia (http://en.wikipedia.org/wiki/AP_Statistics).
You could also look deeper. The number of students taking the exam depends
on how many schools offer an AP statistics course, how many students (on aver-
age) take that course, and what fraction of the students who take the course end up
taking the AP exam. If you wanted to predict future trends, it would make sense to
learn how these three different factors have changed over the years.
[Figure panels: left, the two groups fit separately (R2 = 0.002 and R2 = 0.007); right, all points fit with a single regression line (R2 = 0.69).]
Figure 33.12. Combining two groups into one regression can mislead by creating a
strong linear relationship.
[Figure panels: left, the two groups fit separately (R2 = 0.83 and R2 = 0.92); right, all points fit with a single regression line (R2 = 0.08).]
Figure 33.13. Combining two groups into one regression can mislead by hiding a trend.
Both groups (closed and open symbols) show a strong linear trend (left), but when
combined there is no trend (right).
Fit separately, each group has an R2 greater than 0.8 (left panel of Figure 33.13). But when you fit one line to all the data, as shown
in the graph on the right with all data shown as squares, the regression line is
horizontal, showing essentially no trend, with R2 equal to 0.08.
Q&A
Do the X and Y values have to have the same units to perform linear regression?
No, but they can. In the example, X and Y are in different units.
Can linear regression work when all X values are the same? When all Y values are the
same?
No. The whole point of linear regression is to predict Y based on X. If all X values
are the same, they won’t help predict Y. If all Y values are the same, there is
nothing to predict.
Can linear regression be used when the X values are actually categories?
If you are comparing two groups, you can assign the groups to X = 1 and
X = 0 and use linear regression. This is identical to performing an unpaired t test,
as Chapter 35 explains. If there are more than two groups, using simple linear
regression only makes sense when the groups are ordered and equally spaced
and thus can be assigned sensible numbers. If you need to use a categorical
variable with more than two possible values, read about indicator variables (also
called dummy variables) in a text about multiple linear regression.
Is the standard error of the slope the same as the SEM?
No. The standard error is a way to express the precision of a computed value
(parameter). The first standard error encountered in this book happened to be
the standard error of the mean (see Chapter 14). The standard error of a slope
is quite different. Standard errors can also be computed for almost any other
parameter.
What does the variable Ŷ (used in other statistics books) mean?
Mathematical books usually distinguish between the Y values of the data you
collected and the Y values predicted by the model. The actual Y values are called
Yi, where i indicates the value to which you are referring. For example, Y3 is the
actual Y value of the third subject or data point. The Y values predicted by the
model are called Ŷ, pronounced “Y hat.” So Ŷ3 is the value the model predicts for
the third participant or data point. That value is predicted from the X value for
the third data point but doesn’t take into account the actual value of Y.
Will the regression line be the same if you exchange X and Y?
Linear regression fits a model that best predicts Y from X. If you swap the
definitions of X and Y, the regression line will be different unless the data points
line up perfectly so that every point is on the line. However, swapping X and Y
will not change the value of R2.
Can R2 ever be zero? Negative?
R2 will equal zero if there is no trend whatsoever between X and Y, so the best-fit
line is exactly horizontal. R2 cannot be negative with standard linear regression,
but Chapter 36 explains that R2 can be negative with nonlinear regression.
Do you need more than one Y value for each X value to calculate linear regression?
Does it help?
Linear regression does not require more than one Y value for each X value. But
there are three advantages to using replicate Y values for each value of X:
• With more data points, you’ll determine the slope and intercept with more
precision.
• Additional calculations (not detailed here) can test for nonlinearity. The idea
is to compare the variation among replicates to the distances of the points
from the regression line. If your points are “too far” from the line (given the
consistency of the replicates), then you can conclude that a straight line does
not really describe the relationship between X and Y.
• You can test the assumption that the variability in Y is the same at all values
of X.
If you analyze the same data with linear regression and correlation (see Chapter 32),
how do the results compare?
If you square the correlation coefficient (r), the value will equal R2 from
linear regression. The P value testing the null hypothesis that the population
correlation coefficient is zero will match the P value testing the null hypothesis
that the population slope is zero.
R2 or r2?
With linear regression, both forms are used and there is no distinction. With
nonlinear and multiple regression, the convention is to always use R2.
Can R2 be expressed as a percentage? How about r?
Yes. R2 is a fraction, so it makes sense to express it as a percentage (but this is
rarely done). Note, however, that the correlation coefficient r is not a fraction, so it
should not be expressed as a percentage.
Does linear regression depend on the assumption that the X values are sampled from
a Gaussian distribution? Y values?
No. The results of linear regression are based on the assumption that the residuals
(the vertical distances of the data points from the regression line) are Gaussian.
Linear regression does not assume that either the X values or the Y values are Gaussian.
CHAPTER SUMMARY
• Linear regression fits a model to data to determine the value of the slope
and intercept that makes a line best fit the data.
• Linear regression also computes the 95% CI for the slope and intercept and
can plot a confidence band for the regression line.
• Goodness of fit is quantified by R2.
• Linear regression reports a P value, which tests the null hypothesis that the
population slope is horizontal.
Introducing Models
A mathematical model is neither a hypothesis nor a theory.
Unlike scientific hypotheses, a model is not verifiable directly
by an experiment. For all models are both true and false. . . .
The validation of a model is not that it is “true” but that it
generates good testable hypotheses relevant to important
problems.
R. LEVINS (1966)
Random component
A mathematical model must specify both the ideal predictions and how the data
will be randomly scattered around those predictions. The following two equivalent,
crude equations point out that a full model must specify a random component
(noise) as well as the ideal component (signal).
Data = Ideal + Random
Response = Signal + Noise
Estimate
When regression is used to fit a model to data, it reports values for each parameter
in the model. These can be called best-fit values or estimated values. Note that this
use of the term estimate is very different from the word’s conventional use to mean
an informed guess. The estimates provided by regression are the result of calcula-
tions. The best-fit value is called a point estimate. The CI for each parameter is
called an interval estimate.
When you sample values from a population that follows a Gaussian distribu-
tion, each value can be defined by this simple model:
Y = μ + ε
where Y is the dependent variable, which is different for each value (data point);
μ (the population mean) is a parameter with a single value, which you are trying
to find out (it is traditional to use Greek letters to denote population parameters
and regular letters to denote sample statistics); and ε (random error) is different
for each data point, randomly drawn from a Gaussian distribution centered at zero.
Thus, ε is equally likely to be positive or negative, so each Y is equally likely to be
higher or lower than μ. The random variable ε is often referred to as error. As used
in this statistical context, the term error refers to any random variability, whether
caused by experimental imprecision or by biological variation.
This kind of model is often written like this:
Yi = μ + εi
The subscript i tells you that each Y and ε has a different value, but the popu-
lation parameter μ has only a single value.
Note that the right side of this model equation has two parts. The first part
computes the expected value of Y. In this case, it is always equal to the parameter
μ, the population mean. More complicated models would have more complicated
calculations involving more than one parameter. The second part of the model
takes into account the random error. In this model, the random scatter follows a
Gaussian distribution and is centered at zero. This is a pretty standard assumption
but is not the only model of scatter.
Now you have a set of values, and you are willing to assume that this model
is correct. You don’t know the population value of μ, but you want to estimate
its value from the sample of data. What value of the parameter μ is most likely
to be correct? Mathematicians have proven that if the random scatter follows a
Gaussian distribution, then the value of μ that is most likely to be correct is the
sample mean or average. To use some mathematical lingo, the sample mean is the
maximum likelihood estimate of μ.
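A short simulation (an assumed illustration, not part of the book) makes the point numerically: among many candidate values of mu, the one that minimizes the sum of squared deviations is the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(10, 2, size=50)            # data sampled from a Gaussian population

candidates = np.linspace(8, 12, 2001)     # trial values of mu
ss = [np.sum((y - mu) ** 2) for mu in candidates]

best = candidates[np.argmin(ss)]
print(best, y.mean())                     # agree, to the resolution of the grid
```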
What I’ve done so far is to take the simple idea of computing the average and
turn it into a complicated process using statistical jargon and Greek letters. This
doesn’t help you understand the idea of a sample mean, but it warms you up for
understanding more complicated models, for which the jargon really is helpful.
Mistake: Saying that you will fit the data to the model
Regression does not fit data to a model, a phrase that implies fudging the data to
make them comply with the predictions of the model. Rather, the model is fit to
the data or the data are fit by the model.
Think of models as shoes in a shoe store and data as the pair of feet that
you bring into the store. The shoe salesman fits a set of shoes of varying styles,
lengths, and widths (parameter estimates) to your feet to find a good, comfortable
fit. The salesman does not bring over one pair of universal shoes along with surgi-
cal tools to fit your feet to the shoes.
One way to evaluate such a model is to see how well it predicts future data. When the
data sets are huge and the model has a huge number of parameters, this is
called machine learning.
CHAPTER SUMMARY
• A mathematical model is an equation that describes a physical, chemical,
or biological process.
• Models are simplified views of the world. As a scientific field progresses,
it is normal for the models to get more complicated.
• Statistical techniques fit a model to the data. It is wrong to say that you fit
data to the model.
• When you fit a model to data, you obtain the best-fit values of each param-
eter in the model along with a CI showing you how precisely you have
determined its value.
• If the scatter (error) is Gaussian, fitting the model entails finding the
parameter values that minimize the sum of the squared differences between
the data and the predictions of the model. These are called least-squares
methods.
Comparing Models
Beware of geeks bearing formulas.
WARREN BUFFETT
At first glance, comparing the fit of two models seems simple: just
choose the model with predictions that come closest to the data.
In fact, choosing a model is more complicated than that, and it requires
accounting for both the number of parameters and the goodness of fit.
The concept of comparing models provides an alternative perspective for
understanding much of statistics.
The meaning of R2
The linear regression model fits the data better. The variability of points around the
regression line (right side of Figure 35.1) is less than the variability of points around
the null hypothesis horizontal line (left side of Figure 35.1). How much less?
Table 35.1 compares the sum of squares (SS). The first row shows the sum of
the squared distances of points from the horizontal line that is the null hypothesis.
The second row shows the sum of the squared distances of points from the linear
regression line.
Figure 35.1. Fitting the sample data with two alternative models. In both panels, the
horizontal axis is %C20–22 fatty acids.
Figure 35.2. Residuals (distance from line) of each point from both models. The Y-axis
has the same units as the Y-axis of Figure 35.1.
Table 35.1. Comparing the fit of a horizontal line versus the best-fit linear regression line.
The points are closer to the regression line, so the sum of squares is lower.
The regression line fits the data better than a horizontal line, so the sum of
squares is lower. The bottom row in Table 35.1 shows the difference between the
fit of the two models. It shows how much better the linear regression model fits
than the alternative null hypothesis model.
The fourth column shows the two sums of squares as a percentage of the total.
Scatter around the regression line accounts for 40.7% of the variation. Therefore,
the linear regression model itself accounts for 100% − 40.7% = 59.3% of the
variation. This is the definition of R2, which equals 0.593.
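As a quick arithmetic check (a sketch using the sums of squares quoted in the text, not output from any particular program):

ss_total = 155642.0      # sum of squares around the horizontal null-hypothesis line
ss_residual = 63361.0    # sum of squares around the best-fit regression line

print(ss_residual / ss_total)       # about 0.407: fraction of variation not explained
print(1 - ss_residual / ss_total)   # about 0.593: R2, the fraction explained by the model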
P value
Table 35.2 is rearranged and relabeled to match other statistics books and
programs. The focus is no longer on comparing the fit of two models but on
dividing the total sum of squares into its components. The first row quanti-
fies how well the linear regression model explains the data. The second row
quantifies the scatter of data around the predictions of the model. The third row
quantifies the scatter of the data from the predictions (horizontal line) of the
null hypothesis.
The important points are that overall variation among data points is quanti-
fied with the sum of the squared distances between the point and a prediction
SOURCE OF VARIATION    SUM OF SQUARES    DF    MS          F RATIO
Regression                     92,281     1    92,281.0       16.0
+ Random                       63,361    11     5,760.1
= Total                       155,642    12
Table 35.2. Comparing the fit of a horizontal line versus the best-fit linear regression line.
The points are closer to the regression line, so the F ratio is high. Table 35.2 uses the format
of ANOVA, which will be explained in Chapter 39.
of a model, and that the sum of squares can be divided into various sources
of variation.
The third column of Table 35.2 shows the number of df. The bottom row
shows the sum of squares of the distances from the fit of the null hypothesis
model. There are 13 data points and only one parameter is fit (the mean), which
leaves 12 df. The next row up shows the sum of squares from the linear regression
line. Two parameters are fit (slope and intercept), so there are 11 df (13 data points
minus two parameters). The top row shows the difference. The linear regression
model has one more parameter than the null hypothesis model, so there is only one
df in this row. The df, like the sums of squares, can be partitioned so the bottom
row is the sum of the values in the rows above.
The fourth column of Table 35.2 divides the sums of squares by the number of
df to compute the mean square (MS) values, which can also be called variances.
Note that it is not possible to add the MS in the top two rows to obtain the MS in
the bottom row.
Even if the null hypothesis were correct, you’d expect the sum of squares
around the regression line to be a bit smaller than the sum of squares around the
horizontal null hypothesis line. Dividing by the number of df accounts for this dif-
ference. If the null hypothesis were true, the two MS values would be expected to
have similar values, so their ratio would be close to 1.0. In fact, for this example,
the ratio equals 16.0.
This ratio of the two MS values is called the F ratio, named after the pioneer-
ing statistician Ronald Fisher. The distribution of F ratios is known when the null
hypothesis is true. So for any value of F, and for particular values of the two df
values, a P value can be computed. When using a program to find the P value from
F, be sure to distinguish the df of the numerator (1 in this example) and the df for
the denominator (11 in this example). If you mix up those two df values, you’ll
get the wrong P value.
The P value, which you already saw in Chapter 33, is 0.0021. From the point
of view of probability distributions, this P value answers the following question:
If the null hypothesis were true and given an experimental design with one and
11 df, what is the chance that random sampling would result in data with such a
strong linear trend that the F ratio would be 16.0 or higher?
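If you want to reproduce that calculation yourself, one option (a sketch, not the book's software) is to use the survival function of the F distribution with the two df values distinguished explicitly:

from scipy.stats import f

# P(F >= 16.0) for 1 numerator df and 11 denominator df
p = f.sf(16.0, dfn=1, dfd=11)
print(p)   # approximately 0.0021 (the exact digits depend on rounding of the F ratio)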
(Figure: % Emax plotted for the old (X = 0.0) and young (X = 1.0) groups.)
HYPOTHESIS       SCATTER FROM    SUM OF SQUARES    PERCENTAGE OF VARIATION
Null             Grand mean               5,172                      100.0
– Alternative    Group means              2,824                       54.6
= Difference     Improvement              2,348                       45.4    R2 = 0.454
Figure 35.4. How well the t test example data are fit by the two models.
(Left) The fit of the model defined by the null hypothesis that both groups share the same
population mean, showing the difference between each value and the grand mean. (Right)
How well the data fit the alternative model that the groups have different population means.
This part of the graph shows the difference between each value and the mean of its group.
Most (54.6%) of the total variation is the result of scatter within the groups, and 45.4%
of the total variation is caused by a difference between the two group means. Therefore,
R2 = 0.454.
P value
Determining the P value requires more than partitioning the variance into its com-
ponents. It also requires accounting for the number of values and the number of
parameters fit by each model.
The third column of Table 35.4 shows the number of df. The bottom row
shows the fit of the null hypothesis model. There are 17 data points and only one
parameter is fit (the grand mean), which leaves 16 df. The next row up quantifies
the fit of the alternative model. Two parameters are fit (the mean of each group),
so there are 15 df (17 data points minus 2 parameters). The top row shows the dif-
ference. The alternative model (two distinct means) has one more parameter than
the null hypothesis model (one mean for both groups), so there is only 1 df in this
row. The df, like the sum of squares, can be partitioned so the bottom row is the
sum of the values in the two rows above. The fourth column divides the sum of
squares by the number of df to compute the MS, which could also be called vari-
ance. Note that it is not possible to add the MS in the top two rows to obtain the
MS in the bottom row.
If the null hypothesis were correct, you’d expect the sum of squares around
the individual means to be a bit smaller than the sum of squares around the grand
mean. But after dividing by df, the MS values would be expected to be about the
same if the null hypothesis were in fact true. Therefore, if the null hypothesis were
true, the ratio of the two MS values would be expected to be close to 1.0. In fact,
for this example, the ratio (called the F ratio) equals 12.47.
The F distribution under the null hypothesis is known, and so the P value can
be computed. The P value is 0.0030. It is the answer to the following question: If
the simpler (null hypothesis) model were correct, what is the chance that randomly
chosen values would have group means far enough apart to yield an F ratio of 12.47
or higher?
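The same calculation can be repeated for this example (a sketch using the values from Table 35.4, not output from the book's software):

from scipy.stats import f

f_ratio = 2348.0 / 188.3            # MS between groups / MS within groups
p = f.sf(f_ratio, dfn=1, dfd=15)
print(f_ratio, p)                   # roughly 12.47 and 0.0030

# With 1 numerator df, F is the square of the familiar t statistic, so this is the same
# P value an unpaired t test reports (t = sqrt(12.47), about 3.5 with 15 df).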
Summary
The t test data can be viewed as comparing how well the data are fit by two models.
One model is the null hypothesis—that the two groups of data are sampled from two
populations with the same mean. Viewed in terms of linear regression, this model is a
horizontal line with a slope equal to zero. The alternative model is that the population
means differ. Viewed as linear regression, the slope is not zero. The goodness of fit of
the two models is compared to see whether there is substantial evidence to reject the
simpler (null hypothesis) model and accept the more complicated alternative model.

Table 35.4. Comparing the fits of the two models for the t test example, in ANOVA format.

SOURCE OF VARIATION    SUM OF SQUARES    DF    MS         F RATIO    P VALUE
Between groups                  2,348     1    2,348.0      12.47     0.0030
+ Within groups                 2,824    15      188.3
= Total                         5,172    16
Mistake: Comparing the fits of models that don’t make scientific sense
Only use a statistical approach to compare models that are scientifically sensible.
It rarely makes sense to blindly test a huge number of models. If a model isn’t
scientifically sensible, it probably won’t be useful—no matter how high its R2 may
be for a particular data set.
At larger X values (X > 15), the predictions of the models diverge consider-
ably. If you had reason to compare those models, it would be essential to collect
data at X values (times) for which the predictions of the models differ.
Q&A
Is it OK to fit lots of models and choose the one that fits best?
There are two problems with this approach. First, in many cases you want to
only compare models that make scientific sense. If the model that fits best
makes no scientific sense, it won’t be helpful in many contexts. You certainly
won’t be able to interpret the parameters. The second problem is if the best
model has too many parameters, you may be overfitting. This means that the
model is unlikely to work as well with future data. It is usually better to only
compare the fits of two (or a few) models that make scientific sense.
What if I want to compare the fits of two models with the same number of parameters?
The examples presented in this chapter compare the fits of models with
different numbers of parameters, so you need to look at the trade-off of better
fit versus more complexity in the model. If both your models have the same
number of parameters, there is no such trade-off. Pick the one that fits best
unless it makes no scientific sense.
CHAPTER SUMMARY
• Much of statistics can be thought of as comparing the fits of two models.
• This chapter looked at examples from Chapter 33 (linear regression) and
Chapter 30 (unpaired t test) and recast the examples as comparisons of models.
• Viewing basic statistical tests (t test, regression, ANOVA) as a comparison
of two models is a bit unusual, but it makes statistics more sensible for
some people.
• Every P value can be viewed as the result of comparing the fit of the null
hypothesis with the fit of a more general alternative hypothesis. The P value
answers this question: If the null hypothesis were true, what is the chance
that the data would fit the alternative model so much better? Interpreting
P values is easy once you identify the two models being compared.
• It is only fair to compare the fits of two models to data when the data are
exactly the same in each fit. If one of the fits has fewer data points or trans-
formed data, the comparison will be meaningless.
Nonlinear Regression
Models should be as simple as possible, but not more so.
A. EINSTEIN
LOG[NOREPINEPHRINE, M] % RELAXATION
–8.0 2.6
–7.5 10.5
–7.0 15.8
–6.5 21.1
–6.0 36.8
–5.5 57.9
–5.0 73.7
–4.5 89.5
–4.0 94.7
–3.5 100.0
–3.0 100.0
Table 36.1. Bladder muscle relaxation data for one young rat.
Figure 36.1. Bladder muscle relaxation data for one young rat.
The circles show the data from Table 36.1. The curve was fit by nonlinear regression.
relaxation and the concentration of norepinephrine that relaxes the muscle half
that much (EC50).
Table 36.1 and Figure 36.1 show the data from one young rat. Note that the
X-axis of Figure 36.1 is logarithmic. Going from left to right, each major tick
on the axis represents a concentration of norepinephrine 10-fold higher than the
previous major tick.
The model
The first step in fitting a model is choosing a model. In many cases, like this one, a
standard model will work fine. Pharmacologists commonly model dose-response
(or concentration-effect) relationships using the following equation:
Y = Bottom + (Top − Bottom) / (1 + 10^((LogEC50 − X) × HillSlope))
To enter this equation into a computer program, use the following syntax.
Note that the asterisk (*) denotes multiplication and the caret (^) denotes
exponentiation.
Y = Bottom + (Top − Bottom) / (1+ 10 ^ ((LogEC50 − X ) * HillSlope))
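As one illustration of how such a fit might be run with a general-purpose curve-fitting routine (a sketch, not the book's method; the starting values are my own guesses, and Bottom is held at zero because the results reported later show 8 df for 11 data points, which implies three fitted parameters):

import numpy as np
from scipy.optimize import curve_fit

# Data from Table 36.1
x = np.array([-8.0, -7.5, -7.0, -6.5, -6.0, -5.5, -5.0, -4.5, -4.0, -3.5, -3.0])
y = np.array([2.6, 10.5, 15.8, 21.1, 36.8, 57.9, 73.7, 89.5, 94.7, 100.0, 100.0])

def dose_response(x, top, log_ec50, hill_slope, bottom=0.0):
    return bottom + (top - bottom) / (1 + 10 ** ((log_ec50 - x) * hill_slope))

# p0 gives rough starting values (my guesses) for Top, logEC50, and HillSlope; because
# only three starting values are supplied, bottom keeps its default of 0.
popt, pcov = curve_fit(dose_response, x, y, p0=[100.0, -6.0, 1.0])
print(popt)   # should land near the reported values: Top ~ 104, logEC50 ~ -5.6, slope ~ 0.6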
CI of the parameters
The program also reports the standard error and the 95% CI of each parameter. The
CIs are much easier to interpret and are an essential part of the nonlinear regression
results. Given all the assumptions of the analysis (listed later in the chapter), you
can be 95% confident that the interval contains the true parameter value.
BEST-FIT VALUES
Bottom 0.0
Top 104
LogEC50 –5.64
HillSlope 0.622
EC50 2.30e–006
STANDARD ERRORS
Top 2.06
LogEC50 0.0515
HillSlope 0.0358
95% CIs
GOODNESS OF FIT
df 8
R2 0.997
Absolute sum of squares 43.0
The 95% CI for the logEC50 ranges from –5.76 to –5.52. Pharmacologists are
accustomed to thinking in log units of concentration. Most people prefer to see the
values in concentration units. The 95% CI of the EC50 ranges from 1.75 to 3.02 μM.
The CI is fairly narrow and shows us that the EC50 has been determined to within
a factor of two. That is more than satisfactory for this kind of experiment.
If the CI had been very wide, then you would not have determined the param-
eter very precisely and would not be able to interpret its value. How wide is too
wide? It depends on the context and goals of the experiment.
Some programs don’t report the CIs but instead report the standard error of
each parameter. Sometimes this is simply labeled the error, and sometimes it is
labeled the SD of the parameter. The CI of each parameter is computed from its
standard error. When you have plenty of data, the 95% CI extends about two stan-
dard errors in each direction.
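Here is a sketch of that computation for the logEC50, using the best-fit value, standard error, and df reported above. With only 8 df, the multiplier is a t value of about 2.31 rather than exactly 2.

from scipy.stats import t

estimate, se, df = -5.64, 0.0515, 8
margin = t.ppf(0.975, df) * se     # t multiplier for a 95% CI with 8 df, about 2.31
low, high = estimate - margin, estimate + margin
print(low, high)                   # about -5.76 to -5.52, the CI of the logEC50
print(10 ** low, 10 ** high)       # roughly 1.7e-6 to 3.0e-6 M; the chapter's 1.75 to 3.02 uM
                                   # was computed from unrounded values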
R2
R2 is the fraction of the total variance of Y that is explained by the model. In this
example, the curve comes very close to all the data points, so the R2 is very high,
at 0.997.
When R2 = 0.0, the best-fit curve fits the data no better than a horizontal line
at the mean of all Y values. When R2 = 1.0, the best-fit curve fits the data per-
fectly, going through every point. If you happen to fit a really bad model (maybe
(Figure: relaxation and residuals plotted against [Norepinephrine, M].)
Figure 36.4 makes the comparison of the size of the residuals easier to see.
Tables 36.3 and 36.4 compare the two fits using a format similar to that used
in examples in Chapter 35. The null hypothesis is that the simpler model (fixed
slope, one fewer parameter to fit) is correct. In fact, the alternative model fits
much better (lower sum of squares) but has one fewer df. The calculations balance
the difference in df with the difference in sum of squares.
Figure 36.4. Residuals of each point from the variable slope and fixed slope fits.
Table 36.3. The variable slope model fits the data much better, so the sum of squares is
much smaller.
Table 36.4. Computing an F ratio and P value from the fits of the two models.
The F ratio is high, so the P value is tiny.
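The sums of squares for this example are not reproduced here, but the calculation behind Tables 36.3 and 36.4 is the same extra-sum-of-squares comparison used in Chapter 35. A generic sketch (the function name and arguments are mine, not the book's):

from scipy.stats import f

def compare_nested_fits(ss_simple, df_simple, ss_complex, df_complex):
    # F test comparing a simpler model (fewer parameters, more df) with a more
    # complex model fit to the same data; returns the F ratio and its P value.
    f_ratio = ((ss_simple - ss_complex) / (df_simple - df_complex)) / (ss_complex / df_complex)
    p_value = f.sf(f_ratio, dfn=df_simple - df_complex, dfd=df_complex)
    return f_ratio, p_value

# Checked against the Chapter 35 regression example (Table 35.2): F = 16.0, P = 0.0021.
print(compare_nested_fits(155642.0, 12, 63361.0, 11))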
Tip: Make sure you know the meaning and units of X and Y
Here Y is enzyme activity, which can be expressed in various units, depend-
ing on the enzyme. X is the substrate concentration expressed in units of
concentration.
Q&A
CHAPTER SUMMARY
• Like linear regression, nonlinear regression fits a model to your data to
determine the best-fit values of parameters.
• As the name suggests, nonlinear regression fits nonlinear models. For this
reason, the details of how the method works are far more complicated than
those of linear regression.
• Using a nonlinear regression program is not much harder than using a lin-
ear regression program.
• The most important result from nonlinear regression is a graph of the data
with the best-fit curve superimposed.
• The most important numerical results are the best-fit values of the param-
eters and the CIs of each parameter.
Multiple Regression
An approximate answer to the right problem is worth a good
deal more than an exact answer to an approximate problem.
JOHN TUKEY
to generate an equation that can predict the risk for individual patients (as
explained in the previous point). But another goal might be to understand
the contributions of each risk factor to aid public health efforts and help
prioritize future research projects.
Different regression methods are available for different kinds of data. This
chapter explains multiple linear regression, in which the outcome variable is con-
tinuous. The next chapter explains logistic regression (binary outcome) and pro-
portional hazards regression (survival times).
LINGO
Variables
A regression model predicts one variable Y from one or more other variables X. The
Y variable is called the dependent variable, the response variable, or the outcome
variable. The X variables are called independent variables, explanatory variables,
or predictor variables. In some cases, the X variables may encode variables that the
experimenter manipulated or treatments that the experimenter selected or assigned.
Each independent variable can be:
• Continuous (e.g., age, blood pressure, weight).
• Binary. An independent variable might, for example, code for gender by
defining zero as male and one as female. These codes, of course, are arbi-
trary. When there are only two possible values for a variable, it is called a
dummy variable.
• Categorical, with three or more categories (e.g., four medical school classes
or three different countries). Consult more advanced books if you need to
do this, because it is not straightforward, and it is easy to get confused.
Several dummy variables are needed.
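As one illustration of the dummy-variable coding just described (a sketch, not the book's procedure; the column name and categories are made up), pandas can create the needed indicator columns:

import pandas as pd

df = pd.DataFrame({"school_class": ["first", "second", "third", "fourth", "second"]})

# drop_first=True keeps k - 1 dummy columns for k categories; the dropped category
# becomes the reference level that the fitted coefficients are compared against.
dummies = pd.get_dummies(df["school_class"], prefix="class", drop_first=True, dtype=int)
print(dummies)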
Parameters
The multiple regression model defines the dependent variable as a function of the
independent variables and a set of parameters, or regression coefficients. Regression
methods find the values of each parameter that make the model predictions come as
close as possible to the data. This approach is analogous to linear regression, which
determines the values of the slope and intercept (the two parameters or regression
coefficients of the model) to make the model predict Y from X as closely as possible.
Methods that simultaneously analyze more than one outcome (Y) variable are called
multivariate methods, and they include factor analysis, cluster analysis, principal
components analysis, and multiple ANOVA (MANOVA). These methods contrast
with univariate methods, which deal with only a single Y variable.
Note that the terms multivariate and univariate are sometimes used inconsis-
tently. Sometimes multivariate is used to refer to multivariable methods for which
there is one outcome and several independent variables (i.e., multiple and logistic
regression), and sometimes univariate is used to refer to simple regression with
only one independent variable.
Mathematical model
This book is mostly nonmathematical. I avoid explaining the math of how meth-
ods work and only provide such explanation when it is necessary to understand the
Table 37.1. The multiple regression model.

    β0
  + β1 × log(serum lead)
  + β2 × Age
  + β3 × Body mass index
  + β4 × log(GGT)
  + β5 × Diuretics?
  + ε (Gaussian random)
  = Creatinine clearance
questions the statistical methods answer. But you really can’t understand multiple
regression at all without understanding the model that is being fit to the data.
The multiple regression model is shown in Table 37.1. The dependent (Y)
variable is creatinine clearance. The model predicts its value from a baseline value
plus the effects of five independent (X) variables, each multiplied by a regression
coefficient, also called a regression parameter.
The X variables were the logarithm of serum lead, age, body mass, logarithm
of γ-glutamyl transpeptidase (a measure of liver function), and previous exposure to
diuretics (coded as zero or 1). This last variable is designated as the dummy variable
(or indicator variable) because those two particular values were chosen arbitrarily
to designate two groups (people who have not taken diuretics and those who have).
The final component of the model, ε, represents random variability (error).
Like ordinary linear regression, multiple regression assumes that the random
scatter (individual variation unrelated to the independent variables) follows a
Gaussian distribution.
The model of Table 37.1 can be written as an equation:

Yi = β0 + β1·Xi,1 + β2·Xi,2 + β3·Xi,3 + β4·Xi,4 + β5·Xi,5 + εi

Table 37.2. Units of the variables used in the multiple regression examples.
• The regression coefficients (β1 to β5). These will be fit by the regression
program. One of the goals of regression is to find the best-fit value of
each regression coefficient, along with its CI. Each regression coefficient
represents the average change in Y when you change the corresponding X
value by 1.0. For example, β5 is the average difference in creatinine clearance
between those who have taken diuretics (Xi,5 = 1) and those who have not
(Xi,5 = 0).
• The intercept, β0. This is the predicted average value of Y when all the
X values are zero. In this example, the intercept has only a mathematical
meaning, but no practical meaning because setting all the X values to zero
means that you are looking at someone who is zero years old with zero
weight! β0 is fit by the regression program.
• ε. This is a random variable that is assumed to follow a Gaussian distri-
bution. Predicting Y from all the X variables is not a perfect prediction
because there also is a random component, designated by ε.
Goals of regression
Multiple regression fits the model to the data to find the values for the parameters
that will make the predictions of the model come as close as possible to the actual
data. Like simple linear regression, it does so by finding the values of the param-
eters (regression coefficients) in the model that minimize the sum of the squares
of the discrepancies between the actual and predicted Y values. Like simple linear
regression, multiple regression is a least-squares method.
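A bare-bones sketch of what that least-squares fitting means in practice (not the study's analysis; the X and Y values below are made-up stand-ins): build a design matrix with a leading column of ones for the intercept, then solve for the coefficients that minimize the summed squared residuals.

import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 5))                        # five made-up independent variables
true_coef = np.array([2.0, -1.5, 0.5, 3.0, -2.0])  # invented "true" effects
y = 10.0 + X @ true_coef + rng.normal(scale=2.0, size=n)

design = np.column_stack([np.ones(n), X])          # leading column of ones fits the intercept
coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # least-squares estimates
print(coef)                                        # intercept followed by the five coefficients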
After taking into account differences in the other variables, an increase in log(lead) of one unit (so that the
lead concentration increases tenfold, since they used common base 10 logarithms)
is associated with a decrease in creatinine clearance of 9.5 ml/min. The 95%
CI ranges from –18.1 to –0.9 ml/min.
Understanding these values requires some context. The study participants’
average creatinine clearance was 99 ml/min. So a tenfold increase in lead con-
centrations is associated with a reduction of creatinine clearance (reduced renal
function) of about 10%, with a 95% CI ranging from about 1% to 20%. Figure 37.1
illustrates this model.
The authors also report the values for all the other parameters in the model. For
example, the β5 coefficient for the X variable “previous diuretic therapy” was –8.8
ml/min. That variable is coded as zero if the patient had never taken diuretics and
1 if the patient had taken diuretics. It is easy to interpret the best-fit value. On av-
erage, after taking into account differences in the other variables, participants who
had taken diuretics previously had a mean creatinine clearance that was 8.8 ml/min
lower than that of participants who had not taken diuretics.
Statistical significance
Multiple regression programs can compute a P value for each parameter in the
model testing the null hypothesis that the true value of that parameter is zero. Why
zero? When a regression coefficient (parameter) equals zero, then the correspond-
ing independent variable has no effect in the model (because the product of the
independent variable times the coefficient always equals zero).
The authors of this paper did not include P values and reported only the
best-fit values with their 95% CIs. We can figure out which ones are less than
0.05 by looking at the CIs.
The CI of β1 runs from a negative number to another negative number and
does not include zero. Therefore, you can be 95% confident that increasing lead
concentration is associated with a drop in creatinine clearance (poorer kidney
function). Since the 95% CI does not include the value defining the null hypoth-
esis (zero), the P value must be less than 0.05. The authors don’t quantify it more
accurately than that, but most regression programs would report the exact P value.
R2 versus adjusted R2
R2 is commonly used as a measure of goodness of fit in multiple linear regression,
but it can be misleading. Even if the independent variables are completely unable to
predict the dependent variable, R2 will be greater than zero. The expected value of R2
increases as more independent variables are added to the model. This limits the use-
fulness of R2 as a way to quantify goodness of fit, especially with small sample sizes.
In addition to reporting R2, which quantifies how well the model fits the
data being analyzed, most programs also report an adjusted R2, which estimates
how well the model is expected to fit new data. This measure accounts for the
number of independent variables and is always smaller than R2. How much
smaller depends on the relative numbers of participants and variables. This study
has far more participants (965) than independent variables (5), so the adjusted
R2 is only a tiny bit smaller than the unadjusted R2, and the two are equal to two
decimal places (0.27).

Figure 37.1. Predicted versus actual creatinine clearance (R2 = 0.27; P < 0.0001).
The authors did not post the raw data, so this graph does not accurately represent the data
in the example. Instead, I simulated some data that look very much like what the actual
data would have looked like. Each of the 965 points represents one man in the study. The
horizontal axis shows the actual creatinine clearance for each participant. The vertical
axis shows the creatinine clearance computed by the multiple regression model from
that participant's lead level, age, body mass, log γ-glutamyl transpeptidase, and previous
exposure to diuretics. The prediction is somewhat useful, because generally people who
actually have higher creatinine clearance levels are predicted to have higher levels. However,
there is a huge amount of scatter. If the model were perfect, each predicted value would be
the same as the actual value, all the points would line up on a 45-degree line, and R2 would
equal 1.00. Here the predictions are less accurate, and R2 is only 0.27.
ASSUMPTIONS
Assumption: Sampling from a population
This is a familiar assumption of all statistical analyses. The goal in all forms of
multiple regression is to analyze a sample of data to make more general conclu-
sions (or predictions) about the population from which the data were sampled.
(Recall that increasing log(lead) by 1.0 means that the lead concentration increases tenfold, since the common or base 10 logarithm of 10 is 1.) The assump-
tion of linear effects means that increasing the logarithm of lead concentration by
2.0 will have twice the impact on creatinine clearance as increasing it by 1.0 and
that the decrease in creatinine clearance by a certain concentration of lead does
not depend on the values of the other variables.
One approach to choosing which variables to include is to fit the model with every
possible combination of independent variables and then find the one that is the best.
With many variables and large data sets, the
computer time required for this approach is prohibitive. To conserve computer
time when working with huge data sets, other algorithms use a stepwise approach.
One approach (called forward-stepwise selection or a step-up procedure) is to
start with a very simple model and add new X variables one at a time, always
adding the X variable that most improves the model’s ability to predict Y. Another
approach (backward-stepwise selection or a step-down procedure) is to start with
the full model (including all X variables) and then sequentially eliminate those X
variables that contribute the least to the model.
Table 37.3. Study designs with correlated observations that require specialized analysis
methods (adapted from Katz, 2006).

NAME                                 EXAMPLE
Longitudinal or repeated measures    Multiple observations of the same participant at different times.
Crossover                            Each participant first gets one treatment, then another.
Bilateral                            Measuring from both knees in a study of arthritis or both ears in a
                                     study of tinnitus, and entering the two measurements into the
                                     regression separately.
Cluster                              A study pools results from three cities. Patients from the same city
                                     are more similar to each other than they are to patients from
                                     another city.
Hierarchical                         A clinical study of a surgical procedure uses patients from three
                                     different medical centers. Within each center, several different
                                     surgeons may do the procedure. For each patient, results might be
                                     collected at several time points.
To include interaction between age and the logarithm of serum lead concen-
tration, add a new term to the model equation with a new parameter multiplied by
the product of age (X2) times the logarithm of lead (X1):
Y = β0 + β1·X1 + β2·X2 + β3·X3 + β4·X4 + β5·X5 + β1,2·X1·X2 + ε
If the CI for the new parameter (β1,2) does not include zero, then you will con-
clude that there is a significant interaction between age and log(lead). This means
that the effects of lead change with age. Equivalently, the effect of age depends on
the lead concentrations.
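In terms of the design-matrix sketch shown earlier, adding the interaction is nothing more than appending a column equal to the product X1 × X2 (again with made-up stand-in values):

import numpy as np

rng = np.random.default_rng(1)
n = 100
X1 = rng.normal(size=n)                 # stand-in for log(lead)
X2 = rng.normal(size=n)                 # stand-in for age
X_other = rng.normal(size=(n, 3))       # stand-ins for the remaining variables
y = rng.normal(size=n)                  # made-up outcome

design = np.column_stack([np.ones(n), X1, X2, X_other, X1 * X2])  # last column is the interaction
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef[-1])                         # estimate of the interaction coefficient (beta 1,2)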
Correlated observations
One of the assumptions of multiple regression is that each observation is inde-
pendent. In other words, the deviation from the prediction of the model is entirely
random. Table 37.3 (adapted from Katz, 2006) is a partial list of study designs that
violate this assumption and so require specialized analysis methods.
When you fit the data many different ways and then report only the model
that fits the data best, you are likely to come up with conclusions that are not valid.
This is essentially the same problem as choosing which variables to include in the
model, previously discussed.
For example, you could enter each patient's actual blood pressure or convert the blood pressure to a binary variable that encodes whether the patient has hyperten-
sion (high blood pressure) or not. One problem with the latter approach is that
it requires deciding on a somewhat arbitrary definition of whether someone has
hypertension. Another problem is that it treats patients with mild and severe hy-
pertension as the same, and so information is lost.
Q&A
Do you always have to decide which variable is the outcome (dependent variable)
and which variables are the predictors (independent variables) at the time of data
collection?
No. In some cases, the independent and dependent variables may not be distinct
at the time of data collection. The decision is sometimes made only at the time of
data analysis. But beware of these analyses. The more ways you analyze the data,
the more likely you are to be fooled by overfitting and multiple comparisons.
Does it make sense to compare the value of one best-fit parameter with another?
No. The units of each parameter are different, so they can’t be directly compared.
If you want to compare, read about standardized parameters in a more
advanced book. Standardizing rescales the parameters so they become unitless
and can then be compared. A variable with a larger standardized parameter has
a more important impact on the dependent variable.
CHAPTER SUMMARY
• Multiple variable regression is used when the outcome you measure is
affected by several other variables.
• This approach is used when you want to assess the impact of one variable
after correcting for the influences of others, to predict outcomes from
several variables, or to try to tease apart complicated relationships among
variables.
• Multiple linear regression is used when the outcome (Y) variable is con-
tinuous. Chapter 38 explains methods used when the outcome is binary.
• Beware of the term multivariate, which is used inconsistently.
• Automatic variable selection is appealing but the results can be misleading.
It is a form of multiple comparisons.
Logistic and Proportional Hazards
Regression
The plural of anecdote is not data.
ROGER BRINNER
This chapter briefly explains regression methods used when the de-
pendent variable is not continuous. Logistic regression is used when
the dependent variable is binary (i.e., has two possible values). Propor-
tional hazards regression is used when the dependent variable is survival
time. Read Chapter 37 (multiple regression) before reading this one.
LOGISTIC REGRESSION
When is logistic regression used?
As was discussed in Chapter 37, methods that fit regression models with two or
more independent variables are called multiple regression methods. Multiple regres-
sion is really a family of methods, with the specific type of regression used depend-
ing on what kind of outcome is being measured (see Table 38.1). Logistic regression
is used when the outcome (dependent variable) has two possible values. Because it is
almost always used with two or more independent variables, it should be called mul-
tiple logistic regression. However, the word multiple is often omitted but assumed.
The first variable encodes whether a person belongs to the lower middle class (1 = yes; 0 = no). The second
variable encodes membership in the upper middle class, and the third variable
encodes membership in the higher economic class. Lower-class status is encoded
by setting all three of those variables to zero. This is one of several ways to encode
categorical variables with more than two possible outcomes.
The odds of an event equal the probability that the event will occur divided by the
probability that it will not occur. Every probability can be expressed as odds.
Every odds can be expressed as a probability.
The first independent variable has two possible values (rural = 0; urban = 1)
and so is a binary, or dummy, variable. The corresponding odds ratio is 1.0 for
rural women and 2.13 for urban women. This means that a woman who lives in a
city has a bit more than twice the odds of being obese as someone who lives in a
rural area but shares other attributes (age, education, married or not, social class).
The 95% CI ranges from 1.9 to 2.4.
The next independent variable is age. The odds ratio for this variable is 1.02.
Every year, the odds of obesity goes up about 2%. Don’t forget that odds ratios
multiply. The odds ratio for a 40-year-old compared to a 25-year-old is 1.02^15, or
1.35. The exponent is 15 because 40 – 25 = 15. That means a 40-year-old person
has about 35% greater odds of being obese than a 25-year-old.
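That age calculation is just repeated multiplication of the per-year odds ratio:

per_year_odds_ratio = 1.02
years = 40 - 25
print(per_year_odds_ratio ** years)   # about 1.35: roughly 35% greater odds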
The next independent variable is education. The odds ratio is 0.98. Since that
is less than 1.0, having more education (in this population) makes one slightly less
likely to be obese. For each additional year of formal education, the odds of being
obese decline by 2% (since we multiply by the OR of 0.98).
An example
Rosman and colleagues (1993) investigated whether diazepam (also known as
Valium) would prevent febrile seizures in children. They recruited about 400 chil-
dren who had had at least one febrile seizure. Their parents were instructed to give
medication to the children whenever they had a fever. Half were given diazepam
and half were given a placebo (because there was no standard therapy).
The researchers asked whether treatment with diazepam would delay the time
of first seizure. Their first analysis was simply to compare the two survival curves
with a log-rank test (see Chapter 29). The P value was 0.064, so the difference
between the two survival curves was not considered to be statistically significant.
They did not report the hazard ratio for this analysis.
Next, the researchers did a more sophisticated analysis that adjusted for
differences in age, number of previous febrile seizures, the interval between
the last seizure and entry into the study, and developmental problems. The pro-
portional hazards regression, which accounts for all such differences among
kids in the study, found that the relative risk was 0.61, with a 95% CI extend-
ing from 0.39 to 0.94. This means that at any time during the two-year study
period, kids treated with diazepam had only 61% of the risk of having a febrile
seizure compared with those treated with a placebo. This reduction was statisti-
cally significant, with a P value of 0.027. If diazepam were truly ineffective,
there would have been only a 2.7% chance of seeing such a low relative risk
in a study of this size. This example shows that the results of proportional
hazards regression are easy to interpret, although the details of the analysis are
complicated.
labeled the same way. Don’t take 10 to the power of the log(odds ratios). That
would work if the logarithms were common (base 10) logs, but logistic regression
always uses natural logs.
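In code, the conversion looks like this (the coefficient value is hypothetical, chosen only for illustration):

import math

log_odds_ratio = 0.756                # hypothetical logistic regression coefficient
print(math.exp(log_odds_ratio))       # about 2.13, the correct odds ratio (natural antilog)
print(10 ** log_odds_ratio)           # about 5.7, the wrong answer you get by using base 10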
Q&A
CHAPTER SUMMARY
• Logistic regression is used when the outcome is binary (has two possible
values).
• Proportional hazards regression is used when the outcome is survival time.
• These methods are used for the same purposes for which multiple regres-
sion is used: to assess the impact of one variable after correcting for influ-
ences of others, to predict outcomes from several variables, or to try to
tease apart complicated relationships among variables.
• The assumptions and traps of logistic and proportional hazards regression
are about the same as those for multiple regression.
The Rest of Statistics
CHAPTER 39
Analysis of Variance
Let my dataset change your mindset.
HANS ROSLING
Figure 39.1. Data from Table 39.1 shown as a graph of the mean with
SD (left) or SEM (right). The axis shows log(LH).
Table 39.2. Comparing the fits of the null and alternative models.

HYPOTHESIS       SCATTER FROM    SUM OF SQUARES    PERCENTAGE OF VARIATION
Null             Grand mean               17.38                      100.0
– Alternative    Group means              16.45                       94.7
= Difference                               0.93                        5.3    R2 = 0.053
Table 39.3. The same comparison in ANOVA format.

SOURCE OF VARIATION                  SUM OF SQUARES    DF    MS       F RATIO    P VALUE
Between groups                                 0.93     2    0.46        5.69     0.0039
+ Within groups (error, residual)             16.45   202    0.081
= Total                                       17.38   204
Most (94.7%) of the total variation is the result of scatter within the groups, leaving
5.3% of the total variation as the result of differences
between the group means.
To determine the P value requires more than dividing the variance into its
components. As you’ll see, it also requires accounting for the number of values
and the number of groups.
Determining P from F
The third column of Table 39.3 shows the number of df. The bottom row, la-
beled “Total,” is for the null hypothesis model. There are 205 values and only
one parameter was estimated (the grand mean), which leaves 204 df. The next
row up shows the sum of squares from the group means. Three parameters
were fit (the mean of each group), so there are 202 df (205 data points minus
three parameters). The top row shows the difference. The alternative model
(three distinct means) has two more parameters than the null hypothesis model
(one grand mean), so there are two df in this row. The df, like the sums of
squares, can be partitioned so that the bottom row is the sum of values in the
rows above.
The fourth column divides the sum of squares by the number of df to compute
the mean square (MS), which could also be called the variance. Note that it is not
possible to add the MSs of the top two rows to obtain a meaningful MS for the
bottom row. Because the MS for the null hypothesis is not used in further calcula-
tions, it is left blank.
To compute a P value, you must take into account the number of values and
the number of groups. This is done in the last column of Table 39.3.
If the null hypothesis were correct, each MS value would estimate the vari-
ance among values, so the two MS values would be similar. The ratio of those two
MS values is called the F ratio, after Ronald Fisher, the pioneering statistician
who invented ANOVA and much of statistics.
If the null hypothesis were true, F would be likely to have a value close to 1.
If the null hypothesis were not true, F would probably have a value greater than 1.
The probability distribution of F under the null hypothesis is known for various df
and can be used to calculate a P value. The P value answers this question: If the
null hypothesis were true, what is the chance that randomly selected data (given
the total sample size and number of groups) would lead to an F ratio this large
or larger? For this example, F = 5.690, with 2 df in the numerator, 202 df in the
denominator, and P = 0.0039.
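That P value can be reproduced from the F ratio and the two df values (a sketch, not the output of the book's software):

from scipy.stats import f

p = f.sf(5.690, dfn=2, dfd=202)
print(p)   # approximately 0.0039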
R2
Look at the first column of Table 39.2, which partitions the total sum of squares
into its two component parts. Divide the sum of squares resulting from differences
between groups (0.93) by the total sum of squares (17.38) to determine the frac-
tion of the sum of squares resulting from differences between groups, 0.053. This
is called eta squared, η2, which is interpreted in the same way as R2. Only 5.3%
of the total variability in this example is the result of differences between group
means. The remaining 94.7% of the variability is the result of differences within
the groups.
The low P value means that the differences among group means would be
very unlikely if in fact all the population means were equal. The low R2 means that
the differences among group means are only a tiny fraction of the overall variabil-
ity. You could also reach this conclusion by looking at the left side of Figure 39.1,
where the SD error bars overlap so much.
Figure 39.2. Sample two-way ANOVA data. The response is plotted for short and long
durations.
If male and female animals were both included in the study, you’d need three-
way ANOVA. Now each value would be categorized in three ways: treatment,
duration, and gender.
If you analyze data with repeated measures two-way ANOVA, make very
sure the program knows which factor is repeated, or if both are. If you set up the
program incorrectly, the results will be meaningless.
Q&A
CHAPTER SUMMARY
• One-way ANOVA compares the means of three or more groups.
• The P value tests the null hypothesis that all the population means are
identical.
• ANOVA assumes that all data are randomly sampled from populations that
follow a Gaussian distribution and have equal SDs.
• The difference between ordinary and repeated-measures ANOVA is the
same as the difference between the unpaired and paired t tests. Repeated-
measures ANOVA is used when measurements are made repeatedly for
each subject or when subjects are recruited as matched sets.
• Two-way ANOVA, also called two-factor ANOVA, determines how a
response is affected by two factors. For example, you might measure a
response to three different drugs in both men and women.
Multiple Comparison Tests
after ANOVA
When did “skeptic” become a dirty word in science?
MICHAEL CRICHTON (2003)
Figure 40.1 plots the 95% CI for the difference between each mean and every
other mean. The original data were expressed as the logarithm of LH concentra-
tion, so these units are also used for the confidence intervals. These are tabulated
in Table 40.2.
These are multiple comparisons CIs, so the 95% confidence level applies to the
entire family of comparisons, rather than to each individual interval. Given the as-
sumptions of the analysis (listed in Chapter 39), there is a 95% chance that all three
of these CIs include the true population value, leaving only a 5% chance that any
one or more of the intervals does not include the population value. Because the 95%
confidence level applies to the entire set of intervals, it is impossible to correctly
interpret any individual interval without seeing the entire set.
CIs as ratios
For this example, the data were entered into the ANOVA program as the logarithm
of the concentration of LH (shown in Chapter 39). Therefore, Figure 40.1 and
Table 40.2 show differences between two logarithms. Many find it easier to think
about these results without logarithms, and it is easy to convert the data to a more
intuitive format.
The trick is to note that the difference between the logarithms of two values
is mathematically identical to the logarithm of the ratio of those two values (loga-
rithms and antilogarithms are reviewed in Appendix E). Transform each of the dif-
ferences (and each confidence limit) to its antilogarithm, and the resulting values
can be interpreted as the ratio of two LH levels. Written as equations:
log(A) − log(B) = log(A/B)

10^(log(A) − log(B)) = A/B
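A two-line check of that identity with arbitrary values of A and B:

import math

A, B = 8.0, 2.0
print(math.log10(A) - math.log10(B))            # 0.602..., the same as log10(A / B)
print(10 ** (math.log10(A) - math.log10(B)))    # 4.0, which equals A / B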
Each row in Table 40.3 shows the ratio of the mean LH level in one group
divided by the mean in another group, along with the 95% CI of that ratio. These
are plotted in Figure 40.2.
Table 40.3. Multiple comparisons CIs, expressed as ratios rather than differences.
Figure 40.2. 95% CIs for Tukey’s multiple comparisons test, expressed as ratios of LH
levels rather than as differences between mean log(LH) concentrations.
The values correspond to Table 40.3.
STATISTICALLY SIGNIFICANT?
Statistical significance
If a 95% CI for the difference between two means includes zero (the value spec-
ified in the null hypothesis), then the difference is not statistically significant
(P > 0.05). Two of the three 95% CIs shown in Table 40.2 and Figure 40.1 include
zero, and so these comparisons are not statistically significant at the 5% signifi-
cance level. The CI for the remaining comparison (nonrunners vs. recreational
runners) does not include zero, so that difference is statistically significant.
The values in Table 40.2 are the differences between two logarithms, which
are mathematically identical to the logarithms of the ratios. Transforming to anti-
logarithms creates the table of ratios and CIs of ratios, shown in Table 40.3.
The null hypothesis of identical populations corresponds to a ratio of 1.0.
Only one comparison (nonrunners vs. recreational runners) does not include 1.0,
so that comparison is statistically significant.
The 5% significance level is a familywise significance level, meaning that it ap-
plies to the entire set, or family, of comparisons (previously defined in Chapter 22).
If the overall null hypothesis (values from all groups were sampled from populations
with identical means) is true, there is a 5% chance that one or more of the compari-
sons will be statistically significant and a 95% chance that none of the comparisons
will be statistically significant.
Table 40.4 shows the conclusions about statistical significance at three differ-
ent significance levels.
The significance levels (α) apply to the entire family of three comparisons,
but the yes/no conclusions apply to each comparison individually.
Figure 40.3. Number of possible pairwise comparisons between group means as a function
of the number of groups.
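The curve in Figure 40.3 is simply the number of ways to choose 2 groups out of k, which is k(k − 1)/2:

from math import comb

for k in (3, 4, 5, 10, 20):
    print(k, comb(k, 2))   # 3, 6, 10, 45, and 190 possible pairwise comparisons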
Multiple comparisons tests use data from all groups, even when only
comparing two groups
The result of most multiple comparison tests depends on all the data in all the
groups, not just the two groups being compared. To see why, read about the use of
the pooled SD in the following “How It Works” section.
Dunnett’s test: Compare each mean with the mean of a control group
Dunnett’s test compares the mean of each group to the mean of a control group with-
out comparing the other groups among themselves. For example, it would be used
in an experiment that tests the effects of six different drugs with a goal of defining
which drugs have any effect but not of comparing the drugs with each other.
Because Dunnett’s test makes fewer comparisons than Tukey’s method, it
generates narrower CIs and has more power to detect differences. It is very useful.
The decision to use Dunnett’s test (along with a definition of which group is
the control group) should be part of the experimental design. It isn’t fair to first
do Tukey’s test to make all comparisons and then switch to Dunnett’s test to get
more power.
Bonferroni's test should not be used to compare every group mean with every other
group mean, because for that purpose Tukey's test has more power. Similarly, it should
not be used to compare each group
against a control, because Dunnett’s test has more power for that purpose.
Bonferroni’s multiple comparisons test should be used when the experiment
design requires comparing only selected pairs of means. By making a limited
set of comparisons, you get narrower CIs and more statistical power to detect
differences.
It is essential that you select those pairs as part of the experimental design,
before collecting the data. It is not fair to first look at the data and then decide
which pairs you want to compare. By looking at the data first, you have implicitly
compared all groups.
One recommendation is to report every comparison you made so that the people reading the research can informally correct for mul-
tiple comparisons. This recommendation is sensible but not mainstream.
Q&A
What is the difference between a multiple comparisons test and a post hoc test?
The term multiple comparisons test applies whenever several comparisons are
performed at once with a correction for multiple comparisons. The term post hoc
test refers to situations in which you can decide which comparisons you want to
make after looking at the data. Often, however, the term is used informally as a
synonym for multiple comparisons test. Posttest is an informal, but ambiguous,
term. It can refer to either all multiple comparisons tests or only to post hoc tests.
If one-way ANOVA reports a P value less than 0.05, are multiple comparisons tests sure
to find a significant difference between group means?
Not necessarily. The low P value from the ANOVA tells you that the null hypothesis
that all data were sampled from one population with one mean is unlikely to be
true. However, the difference might be a subtle one. It might be that the mean
of Groups A and B is significantly different than the mean of Groups C, D, and E.
Scheffe’s posttest can find such differences (called contrasts), and if the overall
ANOVA is statistically significant, Scheffe’s test is sure to find a significant contrast.
The other multiple comparisons tests compare group means. Finding that the
overall ANOVA reports a statistically significant result does not guarantee that
any of these multiple comparisons will find a statistically significant difference.
If one-way ANOVA reports a P value greater than 0.05, is it possible for a multiple com-
parisons test to find a statistically significant difference between some group means?
Yes. Surprisingly, this is possible.
Are the results of multiple comparisons tests valid if the overall P value for the ANOVA
is greater than 0.05?
It depends on which multiple comparisons test you use. Tukey’s, Dunnett’s, and
Bonferroni’s tests mentioned in this chapter are valid even if the overall ANOVA
yields a conclusion that there are no statistically significant differences among
the group means.
Does it make sense to only focus on multiple comparisons results and ignore the
overall ANOVA results?
It depends on the scientific goals. ANOVA tests the overall null hypothesis that
all the data come from groups that have identical means. If that is your experi-
mental question—do the data provide convincing evidence that the means
are not all identical—then ANOVA is exactly what you want. If the experimental
questions are more focused and can be answered by multiple comparisons
tests, you can safely ignore the overall ANOVA results and jump right to the
results of multiple comparisons.
Note that the multiple comparisons calculations all use the mean square
result from the ANOVA table. Consequently, even if you don’t care about the
value of F or the P value, the multiple comparisons tests still require that the
ANOVA table be computed.
Can I assess statistical significance by observing whether two error bars overlap?
If two standard error bars overlap, you can be sure that a multiple comparisons
test comparing those two groups will find no statistical significance. However, if
two standard error bars do not overlap, you can’t tell whether a multiple com-
parisons test will or will not find a statistically significant difference. If you plot
SD, rather than SEM, error bars, the fact that they do (or don’t) overlap will not let
you reach any conclusion about statistical significance.
Do multiple comparisons tests take into account the order of the groups?
No. With the exception of the test for trend mentioned in this chapter, multiple
comparisons tests do not consider the order in which the groups were entered
into the program.
Do all CIs between means have the same length?
If all groups have the same number of values, then all the CIs for the difference
between means will have identical lengths. If the sample sizes are unequal, then
the standard error of the difference between means depends on sample size.
The CI for the difference between two means will be wider when sample size is
small and narrower when sample size is larger.
Three groups of data (a, b, c) are analyzed with one-way ANOVA followed by Tukey
multiple comparisons to compare all pairs of means. Now another group (d) is added,
and the ANOVA is run again. Will the comparisons for a–b, a–c, and b–c change?
Probably. When comparing the difference between two means, that difference is
compared to a pooled SD computed from all the data, which will change when
you add another treatment group. Also, the increase in number of comparisons
will lower the threshold for a P value to be deemed statistically significant and
widen confidence intervals.
Why wasn’t the Newman–Keuls test used for the example?
Like Tukey’s test, the Newman–Keuls test (also called the Student–Newman–
Keuls test) compares each group mean with every other group mean. Some
prefer it because it has more power. I prefer Tukey’s test, because the
Newman–Keuls test does not really control the error rate as it should (Seaman,
Levin, & Serlin, 1991) and cannot compute CIs.
Chapter 22 explains the concept of controlling the FDR. Is this concept used in mul-
tiple comparisons after ANOVA?
This is not a standard approach to handling multiple comparisons after ANOVA,
but some think it should be.
CHAPTER SUMMARY
• Multiple comparisons tests follow ANOVA to find out which group means
differ from which other means.
• To prevent getting fooled by bogus statistically significant conclusions, the
significance level is usually defined to apply to an entire family of compari-
sons rather than to each individual comparison.
• Most multiple comparisons tests can report both CIs and statements about
statistical significance. Some can also report multiplicity-adjusted P values.
• Because multiple comparisons tests correct for multiple comparisons, the
results obtained by comparing two groups depend on the data in the other
groups and the number of other groups in the analysis.
• To choose a multiple comparison test, you must articulate the goals of
the study. Do you want to compare each group mean to every other group
mean? Each group mean to a control group mean? Only compare a small
set of pairs of group means? Different multiple comparison tests are used
for different sets of comparisons.
Nonparametric Methods
Statistics are like a bikini. What they reveal is suggestive, but
what they conceal is vital.
AARON LEVENSTEIN
Many of the methods discussed in this book are based on the as-
sumption that the values are sampled from a Gaussian distribu-
tion. Another family of methods makes no such assumption about the
population distribution. These are called nonparametric methods. The
nonparametric methods used most commonly work by ignoring the
actual data values and instead analyzing only their ranks. Computer-
intensive resampling and bootstrapping methods also do not assume a
specified distribution, so they are also nonparametric.
1. Rank all of the values from both groups, from low to high, paying no attention
to which group (old or young rats) each value belongs to. The ranks are listed
below; tied values share the average of the ranks they would otherwise occupy.

   OLD     YOUNG
   3.0     10.0
   1.0     13.0
   11.0    14.0
   6.0     16.0
   4.5     15.0
   8.0     17.0
   4.5     9.0
   12.0    7.0
   2.0

[Figure: the ranks (Y-axis labeled "Rank," 0 to 20) plotted separately for the old and young rats.]
2. Sum the ranks in each group. In the example data, the sum of the ranks
of the old rats is 52 and the sum of the ranks of the young rats is 101.
The values for the younger rats tend to be larger and thus tend to have
higher ranks.
3. Calculate the mean rank of each group.
4. Compute a P value for the null hypothesis that the distribution of ranks
is totally random. If there are ties, different programs might calculate a
different P value.
Under the null hypothesis, either group is equally likely to have the larger mean
rank, and the two mean ranks are more likely to be
close together than far apart. Based on this null hypothesis, the P value is computed
by answering the question, If the distribution of ranks between two groups were
distributed randomly, what is the probability that the difference between the mean
ranks would be this large or even larger? The answer is 0.0035. It certainly is not
impossible that random sampling of values from two identical populations would lead
to sums of ranks this far apart, but it would be very unlikely. Accordingly, we conclude
that the difference between the young and old rats is statistically significant.
Although the Mann–Whitney test makes no assumptions about the distribu-
tion of values, it is still based on some assumptions. Like the unpaired t test, the
Mann–Whitney test assumes that the samples are randomly sampled from (or rep-
resentative of) a larger population and that each value is obtained independently.
But unlike the t test, the Mann–Whitney test does not assume anything about the
distribution of values in the populations from which the data are sampled.
The Mann–Whitney test is equivalent to a test developed by Wilcoxon, so the
same test is sometimes called the Wilcoxon rank-sum test. Don’t mix this test up
with the nonparametric test for paired data discussed in the next section.
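As a concrete illustration, here is a minimal sketch of how this comparison might be run in Python with the SciPy library (my choice of tool; the book itself does not use Python, so treat this purely as an illustration). Because the Mann–Whitney test uses only ranks, feeding it the ranks listed above gives the same test statistic as feeding it the raw measurements.

```python
from scipy import stats

# Ranks of the 17 measurements (9 old rats, 8 young rats) from the example above.
# The Mann-Whitney test uses only ranks, so analyzing the ranks directly gives the
# same test statistic as analyzing the raw measurements.
old_ranks   = [3.0, 1.0, 11.0, 6.0, 4.5, 8.0, 4.5, 12.0, 2.0]   # sum = 52
young_ranks = [10.0, 13.0, 14.0, 16.0, 15.0, 17.0, 9.0, 7.0]    # sum = 101

result = stats.mannwhitneyu(old_ranks, young_ranks, alternative="two-sided")
print(result.statistic, result.pvalue)
# The P value should be well below 0.05, consistent with the 0.0035 reported in the
# text; the exact value depends on whether the program uses an exact or an
# approximate method and on how it handles the tied ranks.
```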
Nonparametric correlation
One nonparametric method for quantifying correlation is called Spearman’s rank
correlation. Spearman’s rank correlation is based on the same assumptions as
ordinary (Pearson) correlations, outlined in Chapter 32, with two exceptions.
Rank correlation does not assume Gaussian distributions and does not assume a
linear relationship between the variables. However, Spearman’s (nonparametric)
rank correlation does assume that any underlying relationship between X and Y is
monotonic (i.e., either always increasing or always decreasing).
Spearman’s correlation separately ranks the X and Y values and then com-
putes the correlation between the two sets of ranks. For the insulin sensitivity ex-
ample of Chapter 32, the nonparametric correlation coefficient, called rS, is 0.74.
The P value, which tests the null hypothesis that there is no rank correlation in the
overall population, is 0.0036.
An easy way to think about how the two kinds of correlation are distinct is
to recognize that Pearson correlation quantifies the linear relationship between X
and Y, while Spearman quantifies the monotonic relationship between X and Y.
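To see that distinction in action, here is a minimal sketch using Python's SciPy library on made-up data with a monotonic but clearly nonlinear relationship (neither the data nor the library come from this book).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 30)
y = np.exp(x / 2) + rng.normal(0, 0.2, x.size)   # monotonic but far from linear

r_pearson, p_pearson = stats.pearsonr(x, y)       # quantifies the LINEAR relationship
r_spearman, p_spearman = stats.spearmanr(x, y)    # quantifies the MONOTONIC relationship (uses ranks)

print(r_pearson, r_spearman)
# Spearman's rS should be close to 1.0 because the relationship is monotonic,
# while Pearson's r should be noticeably lower because the relationship is not linear.
```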
Nonparametric ANOVA
The nonparametric test analogous to one-way ANOVA is called the Kruskal–Wallis
test. The nonparametric test analogous to repeated-measures one-way ANOVA is
called Friedman’s test. These tests first rank the data from low to high and then
analyze the distribution of the ranks among groups.
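Here is a minimal sketch of both tests on hypothetical data, again using Python's SciPy library (neither the data nor the library come from this book).

```python
from scipy import stats

# Hypothetical measurements from three groups (five values per group)
group_a = [12.1, 14.3, 11.8, 13.5, 12.9]
group_b = [15.2, 16.8, 14.9, 17.1, 15.7]
group_c = [11.0, 10.4, 12.2, 11.7, 10.9]

# Kruskal-Wallis: nonparametric analogue of ordinary one-way ANOVA
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)

# Friedman: nonparametric analogue of repeated-measures one-way ANOVA.
# Each argument is one treatment; values at the same position come from the same subject.
chi2_stat, p_fr = stats.friedmanchisquare(group_a, group_b, group_c)

print(p_kw, p_fr)
```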
Nonparametric tests are less powerful when the data are Gaussian
Because nonparametric tests consider only the ranks and not the actual values, they
essentially throw away some information and so have less power than the corresponding
parametric tests when the data really are sampled from a Gaussian distribution.
Using a parametric test with a small data set sampled from a non-
Gaussian population
The central limit theorem (discussed in Chapter 10) doesn’t apply to small sam-
ples, so the P value may be inaccurate.
Using a parametric test with a large data set sampled from a non-
Gaussian population
The central limit theorem ensures that parametric tests work well with large sam-
ples even if the data are sampled from non-Gaussian populations. In other words,
parametric tests are robust to mild deviations from Gaussian distributions, so long
as the samples are large. But there are two snags:
• It is impossible to say how large is large enough, because it depends on the
nature of the particular non-Gaussian distribution. Unless the population
distribution is really weird, you are probably safe choosing a parametric test
when there are at least two-dozen data points in each group.
• If the population is far from Gaussian, you may not care about the mean or
differences between means. Even if the P value provides an accurate answer
to a question about the difference between means, that question may be
scientifically irrelevant.
Summary
Large data sets present no problems. It is usually easy to tell whether the data are
likely to have been sampled from a Gaussian population (and normality tests can
help), but it doesn’t matter much, because in this case, nonparametric tests are so
powerful and parametric tests are so robust.
Small data sets present a dilemma. It is often difficult to tell whether the data
come from a Gaussian population, but it matters a lot. In this case, nonparametric
tests are not powerful and parametric tests are not robust.
• When analyzing data from a series of experiments, all data should be ana-
lyzed the same way (unless there is some reason to think they aren’t compa-
rable). In this case, results from a single normality test should not be used
to decide whether to use a nonparametric test.
• Data sometimes fail a normality test because the values were sampled from
a lognormal distribution (see Chapter 11). In this case, transforming the
data to logarithms will create a Gaussian distribution. In other cases, trans-
forming data to reciprocals or using other transformations can often con-
vert a non-Gaussian distribution to a Gaussian distribution.
• Data can fail a normality test because of the presence of an outlier (see Chapter
25). In some cases, it might make sense to analyze the data without the outlier
using a conventional parametric test rather than a nonparametric test.
• The decision of whether to use a parametric or nonparametric test is most
important with small data sets (because the power of nonparametric tests
is so low). But with small data sets, normality tests have little power, so an
automatic approach would give you false confidence.
The decision of when to use a parametric test and when to use a nonparamet-
ric test really is a difficult one, requiring thinking and perspective. As a result, this
decision should not be automated.
Q&A
CHAPTER SUMMARY
• ANOVA, t tests, and many statistical tests assume that you have sampled
data from populations that follow a Gaussian bell-shaped distribution.
Sensitivity, Specificity, and Receiver
Operating Characteristic Curves
We’re very good at recognizing patterns in randomness but we
never recognize randomness in patterns.
DANIËL LAKENS
T his chapter explains how to quantify false positive and false nega-
tive results from laboratory tests. Although this topic is not found
in all basic statistics texts, deciding whether a clinical laboratory result
is normal or abnormal relies on logic very similar to that used in de-
ciding whether a finding is statistically significant or not. Learning the
concepts of sensitivity and specificity presented here is a great way to
review the ideas of statistical hypothesis testing and Bayesian logic ex-
plained in Chapter 18.
Table 42.1. The results of many hypothetical lab tests, each analyzed to reach a decision
to call the results normal or abnormal.
The top row tabulates results for patients without the disease, and the second row tabulates
results for patients with the disease. You can’t actually create this kind of table from a group
of patients unless you run a “gold standard” test that is 100% accurate.
The specificity measures how well the test identifies those who don't have the disease, that
is, how specific it is. If a test has very high specificity, it won't mistakenly
give a positive result to many people without the disease.
Negative Predictive Value = True negatives / All negative results = D / (C + D)
The sensitivity and specificity are properties of the test. In contrast, the posi-
tive predictive value and negative predictive value are determined by the char-
acteristics of the test and the prevalence of the disease in the population being
studied. The lower the prevalence of the disease, the lower the ratio of true
positives to false positives. This is best understood by example.
What is the probability that a patient with fewer than 98 units of enzyme
activity has porphyria? In other words, what is the positive predictive value of the
test? The answer depends on who the patient is. We’ll work through two examples.
Table 42.2. Expected results of screening 1 million people with the porphyria test from a
population with a prevalence of 0.01%.
Most of the abnormal test results are false positives.
has the disease. That is the positive predictive value of the test. Because only about
1 in 500 of the people with positive test results have the disease, the other 499 of
500 positive tests are false positives.
Of the 962,922 negative test results, only 18 are false negatives. The predic-
tive value of a negative test is 99.998%.
Table 42.3. Expected results of testing 1,000 siblings of someone with porphyria.
In this group, the prevalence will be 50% and few of the abnormal test results will be false
positives.
These examples demonstrate that the fraction of the positive tests that are false posi-
tives depends on the prevalence of the disease in the population you are testing.
Table 42.4. Expected results of testing a million people with an HIV test in a population
where the prevalence of HIV is 0.1%.
Even though the test seems to be quite accurate (sensitivity = 99.9%; specificity = 99.6%),
80% of the abnormal test results will be false positives (3,996/4,995 = 80%).
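The arithmetic behind Table 42.4 is simple enough to check with a few lines of code. This sketch (plain Python; the inputs come straight from the table's caption) reproduces the 80% figure.

```python
# Reproduce the arithmetic behind Table 42.4: one million people tested,
# prevalence 0.1%, sensitivity 99.9%, specificity 99.6%.
n = 1_000_000
prevalence, sensitivity, specificity = 0.001, 0.999, 0.996

diseased = n * prevalence                 # 1,000 people with HIV
healthy = n - diseased                    # 999,000 people without HIV

true_pos = diseased * sensitivity         # 999
false_neg = diseased - true_pos           # 1
true_neg = healthy * specificity          # 995,004
false_pos = healthy - true_neg            # 3,996

positive_predictive_value = true_pos / (true_pos + false_pos)   # about 0.20
fraction_false_positives = false_pos / (true_pos + false_pos)   # 3,996 / 4,995
print(positive_predictive_value, fraction_false_positives)      # about 0.20 and 0.80
```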
Table 42.5. False positive and false negative lab results are similar to Type I and Type II
errors in statistical hypothesis testing.
Table 42.6. Relationship between sensitivity and power and between specificity
and alpha.
[Figure: an ROC curve, plotting sensitivity (0% to 100%) on the Y-axis against 100% minus specificity (0% to 100%) on the X-axis.]
The bottom left of the ROC curve is one silly extreme when the test never,
ever returns a diagnosis that the person has the disease. At this extreme, every
patient is incorrectly diagnosed as healthy (sensitivity = 0%) and every control is
correctly diagnosed as healthy (specificity = 100%). At the other extreme, which
is in the upper-right corner, the test always returns a diagnosis that the person
tested has the disease. Every patient is correctly diagnosed (sensitivity = 100%)
and every control is incorrectly diagnosed (specificity = 0%).
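To see how each possible cut-off generates one point along the ROC curve, here is a minimal sketch in Python using entirely hypothetical enzyme-activity values (low values suggest disease, as in the porphyria example; the numbers and the use of NumPy are my own assumptions, not from the book).

```python
import numpy as np

# Hypothetical enzyme-activity values; as in the porphyria example, LOW values
# suggest disease, so a result below the cut-off is called "abnormal."
patients = np.array([44, 58, 67, 75, 83, 90, 97, 105])      # people who have the disease
controls = np.array([82, 91, 99, 108, 115, 121, 128, 140])  # people who do not

for cutoff in range(40, 160, 10):
    sensitivity = np.mean(patients < cutoff)     # fraction of patients correctly called abnormal
    specificity = np.mean(controls >= cutoff)    # fraction of controls correctly called normal
    print(cutoff, sensitivity, 1.0 - specificity)

# Plotting sensitivity against (1 - specificity) for every possible cut-off traces the
# ROC curve: a very low cut-off gives the bottom-left extreme (sensitivity = 0%,
# specificity = 100%); a very high cut-off gives the upper-right extreme (100%, 0%).
```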
BAYES REVISITED
Interpreting clinical laboratory tests requires combining what you know about the
clinical context and what you learn from the lab test. This is simply Bayesian logic
at work. Bayesian logic has already been discussed in Chapter 18.
Odds = Probability / (1 − Probability)

Probability = Odds / (1 + Odds)
If the probability is 0.50, or 50%, then the odds are 50:50, or 1:1. If you repeat
the experiment often, you will expect to observe the event (on average) in one of
every two trials (probability = 1/2). That means you’ll observe the event once for
every time it fails to happen (odds = 1:1).
If the probability is 1/3, the odds equal (1/3)/(1 − 1/3) = 0.5, or 1:2. On average,
you’ll observe the event once in every three trials (probability = 1/3). That means
you’ll observe the event once for every two times it fails to happen (odds = 1:2).
Bayes as an equation
Bayes's equation for clinical diagnosis can be written in two forms:

Posttest odds = Pretest odds × [Sensitivity / (1 − Specificity)]

Posttest odds = Pretest odds × Likelihood ratio

The likelihood ratio of a positive test equals the sensitivity divided by (1 − specificity),
so the two forms are equivalent.
The posttest odds are the odds that a patient has the disease, taking into ac-
count both the test results and your prior knowledge about the patient. The pretest
odds are the odds that the patient has the disease as determined from information
you know before running the test.
Table 42.7 reworks the two examples with intermittent porphyria, which has a
likelihood ratio of 22.2, using the second equation shown previously in this section.
                       PRETEST                         POSTTEST
Who was tested?        Probability     Odds            Odds        Probability
Random screen          0.0001          0.0001          0.0022      0.0022
Sibling                0.50            1.0000          22.2        0.957
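Both rows of the table can be reproduced with the conversion equations above. This sketch (plain Python; the likelihood ratio of 22.2 is taken from the text) works through the two porphyria examples.

```python
def prob_to_odds(p):
    return p / (1.0 - p)

def odds_to_prob(odds):
    return odds / (1.0 + odds)

likelihood_ratio = 22.2   # sensitivity / (1 - specificity) for the porphyria test

for who, pretest_prob in [("Random screen", 0.0001), ("Sibling", 0.50)]:
    pretest_odds = prob_to_odds(pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    posttest_prob = odds_to_prob(posttest_odds)
    print(who, round(pretest_odds, 4), round(posttest_odds, 4), round(posttest_prob, 4))
# Random screen: posttest probability about 0.0022
# Sibling:       posttest odds 22.2, posttest probability about 0.957
```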
COMMON MISTAKES
Mistake: Automating the decision about which point on an ROC curve
to use as a cut-off.
The ROC curve plots the trade-offs between sensitivity and specificity. Which
combination is the best to define a critical value of a lab test? It depends on the
consequences of making a false positive or a false negative diagnosis. That deci-
sion needs to be made in a clinical (or in some situations, scientific) context and
should not be automated.
Q&A
CHAPTER SUMMARY
• When you report a test result as positive or negative, you can be wrong in
two ways: A positive result can be a false positive, and a negative result can
be a false negative.
• The sensitivity of a test is the fraction of all those with a disease who cor-
rectly get a positive result.
• The specificity is the fraction of those without the disease who correctly get
a negative test result.
• The positive predictive value of a test answers the question, If the test is
positive (abnormal test result, suggesting the presence of disease), what is
the chance that the patient really has the disease? The answer depends in
part on the prevalence of the disease in the population you are testing.
• The negative predictive value of a test answers the question, If the test is
negative (normal test result, suggesting the absence of disease), what is the
chance that the patient really does not have the disease? The answer depends
in part on the prevalence of the disease in the population you are testing.
Meta-analysis
For all meta-analytic tests (including those for bias): If it looks
bad, it’s bad. If it looks good, it’s not necessarily good.
DANIËL LAKENS
INTRODUCING META-ANALYSIS
What is meta-analysis?
Meta-analysis is used to combine evidence from multiple studies, usually clinical
trials testing the effectiveness of therapies or tests. The pooled evidence from multiple
studies can give much more precise and reliable answers than any single study can.
A large part of most meta-analyses is a descriptive review that points out
the strengths and weaknesses of each study. In addition, a meta-analysis pools
the data to report an overall effect size (perhaps a relative risk or odds ratio), an
overall CI, and an overall P value. Meta-analysis is often used to summarize the
results of a set of clinical trials.
CHALLENGE: Investigators may use inconsistent criteria to define a disease and so may include different groups of patients in different studies.
SOLUTION: Define which patient groups are included.

CHALLENGE: Studies asking the same clinical question may use different outcome variables. For example, studies of cardiovascular disease often tabulate the number of patients who have experienced new cardiac events. Some studies may only count people who have documented myocardial infarctions (heart attacks) or who have died. Others may include patients with new angina or chest pain.
SOLUTION: Define which clinical outcomes are included.

CHALLENGE: Some studies, especially those that don't show the desired effect, are not published. Omitting these studies from the meta-analysis causes publication bias (discussed later in this chapter).
SOLUTION: Seek unpublished studies; review registries of study protocols.

CHALLENGE: Some studies are of higher quality than others.
SOLUTION: Define what makes a study of high enough quality to be included.

CHALLENGE: Some relevant studies may be published in a language other than English.
SOLUTION: Arrange for translation.

CHALLENGE: Published studies may not include enough information to perform a proper meta-analysis.
SOLUTION: Estimate data from figures; obtain unpublished details from the investigators or Web archives.

CHALLENGE: Published data may be internally inconsistent. For example, Garcia-Berthou and Alcaraz (2004) found that 11% of reported P values were not consistent with the reported statistical ratios (t, F, etc.) and dfs.
SOLUTION: Resolve inconsistencies without bias; request details from original investigators.

CHALLENGE: Data from some patients may be included in multiple publications. Including redundant analyses in a meta-analysis would be misleading.
SOLUTION: When multiple studies are published by the same investigators, ask them to clarify which patients were included in more than one publication.
PUBLICATION BIAS
The problem
Scientists like to publish persuasive findings. When the data show a large and
statistically significant effect, it is easy to keep up the enthusiasm necessary to
write up a paper, create the figures, and submit the resulting manuscript to a
journal. If the effect is large, the study is likely to be accepted for publication, as
editors prefer to publish papers that report results that are statistically significant.
If the study shows a small effect, especially if it is not statistically significant, it
is much harder to stay enthusiastic through this process and stick it out until the
work is published.
Many people have documented that this really happens. Studies that find
large, statistically significant effects are much more likely to get published, while
studies showing small effects tend to be abandoned by the scientists or rejected by
journals. This tendency is called publication bias.
The rest of a meta-analysis is quantitative. The results of each study are sum-
marized by one value, called the effect size, along with its CI. The effect size is
usually a relative risk or odds ratio, but it could also be some other measure of
treatment effect. In some cases, different studies use different designs, and so
the meta-analyst has to do some conversions so that the effect sizes are compa-
rable. Combining all the studies, the meta-analysis computes the pooled treatment
effect, its CI, and a pooled P value.
The results for the individual studies and the pooled results are plotted on a
graph known as a forest plot or blobbogram. Figure 43.1 is an example. It shows
part of a meta-analysis performed by Eyding et al. (2010) on the effectiveness of
the drug reboxetine, a selective inhibitor of norepinephrine re-uptake, marketed
in Europe as an antidepressant. The researchers tabulated two different effects for
this drug: the odds ratio for remission and the odds ratio for any response. They
computed these odds ratios comparing reboxetine against either placebo or other
antidepressants.
[Figure 43.1: forest plot showing, for each of seven trials (identified by trial ID) and for the pooled total, the odds ratio for remission with reboxetine versus placebo and its 95% CI.]
Figure 43.1 shows the odds ratio for remission comparing reboxetine to
placebo. An odds ratio of 1.0 means no effect, an odds ratio greater than 1.0
means that reboxetine was more effective than placebo, and an odds ratio less than
1.0 means that reboxetine was less effective than placebo. The graph shows 95%
CIs for seven studies identified with numbers (which can be used to look up de-
tails presented in the published meta-analysis). The bottom symbol shows the
overall odds ratio and its 95% CI.
In six of the seven studies, the 95% CI includes 1.0. With 95% confidence,
you cannot conclude from these six studies that reboxetine works better than pla-
cebo. Accordingly, a P value would be greater than 0.05. In only one of the studies
(the one shown at the top of the graph and identified as Study 14) does the 95%
CI not include 1.0. The data from this study, but not the others, would
lead you to conclude with 95% confidence that reboxetine worked better than
placebo and that the effect is statistically significant (with P < 0.05). As explained
in Chapter 22, even if the drug were entirely ineffective, it would not be too surprising
for one of seven studies to find a statistically significant effect just by chance.
The bottom part of the graph shows the total, or pooled, effect, as computed by
the meta-analysis. The 95% CI ranges from 0.98 to 1.56. The CI contains 1.0, so the
P value must be greater than 0.05. From this CI alone, it might be hard to know what to
conclude. The overall odds ratio is 1.24. That is a small effect, but not a tiny one. The
95% CI includes 1.0 (no effect), but just barely. The paper also compares reboxetine
to other antidepressants and concludes that reboxetine is less effective than the others.
ASSUMPTIONS OF META-ANALYSIS
The calculations used in a meta-analysis are complicated and beyond the scope of
this book. But you should know that there are two general methods that are used,
each of which is based on different assumptions:
• Fixed effects. This model assumes that all the subjects in all the studies
were really sampled from one large population. Thus, all the studies are
estimating the same effect, and the only difference between studies is due
to random selection of subjects.
• Random effects. This model assumes that each study population is unique.
The difference among study results is due both to differences between the
populations and to random selection of subjects. This model is more realis-
tic and is used more frequently.
Q&A
How does the meta-analysis combine the individual P values from each study?
It doesn’t! A meta-analysis pools weighted effect sizes, not P values. It calculates
a pooled P value from the pooled effect size.
Why not just average the P values?
A meta-analysis accounts for the relative sizes of each study. This is done by
computing a weighted average of the effect size. Averaging P values would be
misleading.
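To make the idea of a weighted average concrete, here is a minimal sketch of fixed-effect, inverse-variance pooling on hypothetical odds ratios (the study values, their standard errors, and the use of Python's NumPy library are all my own assumptions; real meta-analysis software also assesses heterogeneity and offers the random-effects model described above).

```python
import numpy as np

# Hypothetical per-study odds ratios and the standard errors of their logarithms
odds_ratios = np.array([1.10, 0.85, 1.40, 1.05, 0.95])
se_log_or   = np.array([0.20, 0.25, 0.30, 0.15, 0.22])

log_or = np.log(odds_ratios)
weights = 1.0 / se_log_or ** 2               # larger (more precise) studies get more weight

pooled_log_or = np.sum(weights * log_or) / np.sum(weights)
pooled_se = 1.0 / np.sqrt(np.sum(weights))

pooled_or = np.exp(pooled_log_or)
ci_low = np.exp(pooled_log_or - 1.96 * pooled_se)
ci_high = np.exp(pooled_log_or + 1.96 * pooled_se)
print(pooled_or, ci_low, ci_high)            # pooled effect size with its 95% CI
```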
[Figure: forest plot of six studies (A through F) and the pooled meta-analysis result, plotted as relative risk on a logarithmic axis from 0.25 to 4; values left of 1.0 favor statins, values right of 1.0 favor placebo. The pooled 95% CI is 0.75 to 0.97.]
Figure 43.2. Meta-analysis for the risk of strokes after taking statins.
Each line shows the 95% confidence interval for the relative risk of stroke comparing those
who take statins with controls. The vertical line is at a relative risk of 1.0, which represents
equal risk in both groups. The six lines on top, one for each study, cross that line and so show
no significant association of statins and stroke. The meta-analysis computed an overall 95%
confidence interval, shown at the bottom. It does not cross 1.0, so the meta-analysis shows
a statistically significant association of taking statins and fewer strokes (P = 0.02), even
though none of the individual studies do. From Thavendiranathan and Bagai (2006).
CHAPTER SUMMARY
• A meta-analysis pools multiple studies and reports an overall effect size
(perhaps a relative risk or odds ratio), an overall CI, and an overall P value.
• Studies that show positive results are far more likely to be published than
studies that reach negative or ambiguous conclusions. Selective publication
(publication bias) makes it impossible to properly interpret the published
literature and can make the results of meta-analyses suspect. A good meta-
analyst has to work hard to find relevant unpublished data.
• The main results of a meta-analysis are shown in a forest plot, also known
as a blobbogram. This kind of graph shows the effect size of each study
with its CI, as well as the pooled (overall) effect with its CI.
Putting It All Together
CHAPTER 44
The Key Concepts of Statistics
If you know twelve concepts about a given topic you will look
like an expert to people who only know two or three.
SCOTT ADAMS
“Not significantly different” does not mean the effect is absent, small,
or scientifically irrelevant
If a difference is not statistically significant, you can conclude that the observed
results are not inconsistent with the null hypothesis. Note the double negative.
You cannot conclude that the null hypothesis is true. It is quite possible that the
null hypothesis is false and that there really is a small difference between the
populations. All you can say is that the data are not strong or consistent enough to
persuade you to reject the null hypothesis.
Chapter 19 lists many reasons for obtaining a result that is not statistically
significant.
[Figure: a quantity labeled "disproven conclusively" (0% to 50%) plotted against P values ranging from 0.01 to 0.10.]
Corrections for multiple comparisons reduce the chance of finding false but statistically
significant results, but these methods also make it harder to find true effects.
Multiple comparisons can be insidious. To correctly interpret statistical analy-
ses, all analyses must be planned (before collecting data), and all planned analyses
must be conducted and reported. However, these simple rules are widely broken.
What about the published findings that are not false positives? As pointed
out in Chapter 26, these studies tend to inflate the size of the difference or effect.
The explanation is simple. If many studies were performed, you’d expect the av-
erage of the effects detected in these studies to be close to the true effect. By
chance, some studies will happen to find larger effects and some studies will
happen to find smaller effects. However, studies with small effects tend not to get
published. On average, therefore, the studies that do get published tend to have
effect sizes that overestimate the true effect (Ioannidis, 2008). This is called the
winner’s curse (Zollner and Pritchard, 2007). This term was coined by economists
to describe why the winner of an auction tends to overpay.
Statistical Traps to Avoid
When the data don’t make sense, it’s usually because you have
an erroneous preconception about how the system works.
ERNEST BEUTLER
and then publishes the results so it appears that the hypothesis was stated before
the data collection began. As Chapter 23 noted, Kerr (1998) coined the term
HARKing: hypothesizing after the results are known. Kriegeskorte and colleagues
(2009) call this double dipping.
It is impossible to evaluate the results from such studies unless you know ex-
actly how many hypotheses were actually tested. You will be misled if the results
are published as if only one hypothesis was tested. An XKCD cartoon points out the
folly of this approach (Figure 45.1).
Also beware of studies that don’t truly test a hypothesis. Some investiga-
tors believe their hypothesis so strongly (and may have stated the hypothesis
so vaguely) that no conceivable data would lead them to reject that hypothesis.
No matter what the data showed, the investigator would find a way to conclude
that the hypothesis is correct. Such hypotheses “are more ‘vampirical’ than
‘empirical’—unable to be killed by mere evidence” (Gelman & Weakliem, 2009
citing Freese, 2008).
Figure 45.1. Nonsense conclusions via HARKing (hypothesizing after the results are
known).
Source: https://xkcd.com.
[Figure 45.2: histogram of the frequency of published P values between 0.01 and 0.10, with a marked excess just below P = 0.05.]
Figure 45.2. Too many published P values are just a tiny bit less than 0.05.
Masicampo and Lalande (2012) collected P values from three respected and peer-reviewed
journals in psychology and then tabulated their distribution. The figure shows a “peculiar
prevalence” of P values just below 0.05. This figure was made from the raw list of 3,627 P
values kindly sent to me by Masicampo, matching a graph published by Wasserman (2012).
The correlation between a country's chocolate consumption and its number of Nobel
Prizes (Figure 45.3) is amazingly strong, with r = 0.79. The P value testing the null
hypothesis of no real correlation is tiny, less than 0.0001.
Of course, these data don’t prove that eating chocolate helps people win a
Nobel Prize. Nor does it prove that increasing chocolate imports into a country
will increase the number of Nobel Prizes that residents of that country will win.
When two variables are correlated, or associated, it is possible that changes
in one of the variables causes the other to change. But it is also likely that
both variables are related to a third variable that influences both. There are
many variables that differ among the countries shown in the graph, and some
of those probably correlate with both chocolate consumption and number of
Nobel Prizes.
This point is often summarized as “Correlation does not imply causation,” but
it is more accurate to say that correlation does not prove causation.
When each data point represents a different year, it is even easier to find
silly correlations. For example, Figure 45.4 shows a very strong negative corre-
lation between the total number of pirates in the world and one measure of global
average temperature. But correlation (pirates would say carrrrelation) does not
prove causation. It is quite unlikely that the lack of pirates caused global warm-
ing or that global warming caused the number of pirates to decrease. More
likely, this graph simply shows that both temperature and the number of pirates
have changed over time. Time is a confounding variable here: it influences both
of the plotted variables and so creates the apparent association.
[Figure: scatter plot of Nobel laureates per 10 million population against per capita chocolate consumption (kg/year) by country, with r = 0.79; Switzerland and Sweden fall at the top right, China and Brazil near the origin.]
Figure 45.3. Correlation between average chocolate consumption by country and the
number of Nobel Prize winners from that country.
Data are from Messerli (2012). The Y-axis plots the total number of Nobel Prizes won by
citizens of each country. The X-axis plots chocolate consumption in a recent year (different
years for different countries, based on the availability of data). Both X and Y values are
normalized to the country’s current population. The correlation is amazingly strong, but this
doesn’t prove that eating chocolate will help you win a Nobel Prize.
[Figure: global average temperature (°C) plotted against the number of pirates worldwide, with points labeled by year from 1820 to 2000; rs = –0.995, P = 0.0008.]
Figure 45.4. Correlation between the number of pirates worldwide and average
world temperature.
Adapted from Henderson (2005). Correlation does not prove causation!
The cartoons in Figure 45.5 drive home the point that correlation does not
prove causation. But note a problem with the XKCD cartoon. Since the outcome
(understanding or not) is binary, this cartoon really points out that association
does not imply causation.
TREATMENT                    SURROGATE OR PROXY VARIABLE        IMPORTANT VARIABLE(S)
Two anti-arrhythmic drugs    • Fewer premature heart beats      • More deaths
                             • Conclusion: Good treatment       • Conclusion: Deadly treatment
Table 45.1. Results using proxy or surrogate variables can produce incorrect conclusions.
anti-arrhythmic drugs was known to reduce the number of extra beats. So it made
sense that taking those drugs would extend life. The evidence was compelling
enough that the FDA approved use of these drugs for this purpose. But a ran-
domized study to directly test the hypothesis that anti-arrhythmic drugs would
reduce sudden death showed just the opposite. Patients taking two specific anti-
arrhythmic drugs had fewer extra beats (the proxy variable) but were more likely
to die (CAST Investigators, 1989). Fisher and VanBelle (1993) summarize the
background and results of this trial.
Another example is the attempt to prevent heart attacks by using drugs to
raise HDL levels. Low levels of HDL (“good cholesterol”) are associated with an
increased risk of atherosclerosis and heart disease. Pfizer Corporation developed
torcetrapib, a drug that elevates HDL, with great hope that it would prevent heart
disease. Barter and colleagues (2007) gave the drug to thousands of patients with
a high risk of cardiovascular disease. LDL (“bad cholesterol”) decreased 25% and
HDL (“good cholesterol”) increased 72%. The CIs were narrow, and the P values
were tiny (<0.001). If the goal was to improve cholesterol levels, the drug was a
huge success. Unfortunately, however, treatment with torcetrapib also increased
the number of heart attacks by 21% and increased the number of deaths by 58%.
Similarly, niacin increases HDL but doesn’t decrease heart attacks (The HPS2-
THRIVE Collaborative Group, 2014).
The take-home message is clear: treatments that improve results of lab tests
may not improve health or survival (see Table 45.1). Svennson (2013) lists 14 ad-
ditional examples.
                                                            RESULTS FROM . . .
INTERVENTION                      INCIDENCE OF              OBSERVATIONAL STUDIES     EXPERIMENT
Hormone replacement therapy       Cardiovascular events     Decrease                  Increase
  after menopause
Megadose vitamin E                Cardiovascular events     Decrease                  No change
Low-fat diet                      Cardiovascular events     Decrease                  No change
                                    and cancer
Calcium supplementation           Fractures and cancer      Decrease                  No change
Vitamins to reduce homocysteine   Cardiovascular events     Decrease                  No change
Vitamin A                         Lung cancer               Decrease                  Increase
Table 45.2. Six hypotheses suggested by observational studies proven not to be true
by experiment.
people without diabetes who were similar in other ways and then compared the
vitamin D levels in blood samples drawn before disease onset. They found
that people with average 25(OH)D levels greater than 100 nmol/L had a much
lower risk of developing diabetes than those whose average 25(OH)D levels
were less than 75 nmol/L. The risk ratio was 0.56, with a 95% CI ranging from
0.35 to 0.90 (P = 0.03).
Intriguing data! Do these findings mean that taking vitamin D supple-
ments will prevent diabetes? No. The association of low vitamin D levels and
onset of diabetes could be explained many ways. Sun exposure increases vita-
min D levels. Perhaps sunlight exposure also creates other hormones (which
we may not have yet identified) that decrease the risk of diabetes. Perhaps
the people who are exposed more to the sun (and thus have higher vitamin D
levels) also exercise more, and it is the exercise that helps prevent diabetes.
Perhaps the people with higher vitamin D levels drink more fortified milk, and
it is the calcium in the milk that helps prevent diabetes. The only way to find
out for sure whether ingestion of vitamin D prevents diabetes is to conduct
an experiment comparing people who are given vitamin D supplements with
those who are not.
Table 45.2 lists six examples where the results of experiments were opposite
to the results from observational studies. Although observational studies are much
easier to conduct than experiments, data from experiments are more definitive.
With observational studies, it is difficult to deal with confounding variables, and
so it is nearly impossible to convincingly untangle cause and effect. The cartoon
of Figure 45.6 adds two more examples.
The first five rows are adapted from Spector and Vesell (2006a). “Cardiovas-
cular events” include myocardial infarction, sudden death, and stroke. The bottom
row is from Omenn and colleagues (1996).
[Figure: the percentage who voted for Romney plotted against household income (thousands of US dollars), once using individual-level data and once using statewide data; the two panels show correlations of opposite sign (r = –0.44, P = 0.0013 in one panel; rs = 0.92, P = 0.0025 in the other).]
The individual and state-level data tell opposite stories: people with higher income were
more likely to vote for Romney, even though states with higher average income tended to
have fewer Romney voters.
What accounts for the discrepancy? There are lots of differences between
states. A correlation of statewide data does not tell you about the individuals in
those states (Gelman & Feller, 2012). Mistakenly using relationships among groups
to make inferences about individuals is called the ecological fallacy. Another ex-
ample is the data on Nobel Prizes and chocolate presented earlier in this chapter.
equipment have all become more consistent. Since the mean hasn’t changed and
the SD is much smaller than it was in earlier times, a batting average greater than
0.400 is now incredibly rare. Gould wasn’t able to understand the changes in base-
ball until he investigated changes in variability (SD).
Variability in biological or clinical studies often reflects real biological diver-
sity (rather than experimental error). Appreciate this diversity! Don’t get mesmer-
ized by comparing averages. Pay attention to variation and to the extreme values.
Nobel Prizes have been won from studies of people whose values were far from
the mean.
[Figure 45.9: transmitter uptake (0 to 100) at baseline and after stimulation for wild-type cells (difference labeled "ns") and mutant cells (difference labeled "**").]
Figure 45.9. The difference between statistically significant and not statistically signifi-
cant is not itself statistically significant.
Table 45.3. Admissions to the graduate programs in Berkeley, 1973. Pooled data.
At first glance, the data seem to provide evidence of sexism, but that is only because the
data pool admission rates from multiple graduate programs, each of which makes its own
admissions decisions.
[Figures: scatter plots of rectangle area (cm²) against perimeter (cm), including panels in which the points are fit to models with R² = 0.90 and R² = 0.96.]
Figure 45.12 adds data from more rectangles. Now it seems that those two
outliers were really not so unusual. Instead, it seems that there are two distinct
categories of rectangles. The right side of Figure 45.11 tentatively identifies
the two types of rectangles with open and closed circles and fits each to a
different model.
While this process may seem like real science, it is not. Two rectangles with
the same perimeter can have vastly different areas depending on their shapes.
It simply is not possible to predict the area of a rectangle from its perimeter.
The area must be computed from both height and width (or, equivalently, from
perimeter and either height or width, or from perimeter and the ratio of height/
width). An important variable (either height, width, or their ratio) encoding the
shape of the rectangle was missing from our analysis. Understanding these data
required simple thinking to identify the missing variable, rather than fancy statisti-
cal analyses. A missing variable that influences both the dependent and indepen-
dent variables is called a lurking variable.
                             DECISION: REJECT        DECISION: DO NOT REJECT
                             NULL HYPOTHESIS         NULL HYPOTHESIS             TOTAL
Null hypothesis is true      A (Type I error)        B                           A + B
Null hypothesis is false     C                       D (Type II error)           C + D
Total                        A + C                   B + D                       A + B + C + D
Table 45.5. The results of many hypothetical statistical analyses to reach a decision to
reject or not reject the null hypothesis.
A, B, C, and D are integers (not proportions) that count numbers of analyses (number of P
values). The total number of analyses equals A + B + C + D. This is identical to Table 18.1.
                             STATISTICALLY SIGNIFICANT:     NOT STATISTICALLY SIGNIFICANT:
                             REJECT NULL HYPOTHESIS         DO NOT REJECT NULL HYPOTHESIS      TOTAL
No real effect (null         495                            9,405                              9,900
  hypothesis true)
Effect is real               50                             50                                 100
Total                        545                            9,455                              10,000
Table 45.6. Results of 10,000 comparisons with 50% power, a 5% significance level,
and a prior probability of 1%.
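The counts in Table 45.6 follow directly from the stated power, significance level, and prior probability. The sketch below (plain Python) reproduces them and the resulting false discovery proportion.

```python
# Reproduce the arithmetic behind Table 45.6: 10,000 comparisons, a 1% prior
# probability that the effect is real, 50% power, and a 5% significance level.
n_comparisons, prior, power, alpha = 10_000, 0.01, 0.50, 0.05

real = n_comparisons * prior            # 100 comparisons where the effect is real
null = n_comparisons - real             # 9,900 comparisons where the null hypothesis is true

true_pos = real * power                 # 50 real effects detected
false_pos = null * alpha                # 495 "significant" results with no real effect
true_neg = null - false_pos             # 9,405
false_neg = real - true_pos             # 50 real effects missed

total_significant = true_pos + false_pos          # 545
print(false_pos / total_significant)              # about 0.91: most "significant" results are false
```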
Their effect size is tiny and their measurement error is huge. At some point, a
set of measurements is so noisy that biases in selection and interpretation over-
whelm any signal and, indeed, nothing useful can be learned from them. My best
analogy is that they are trying to use a bathroom scale to weigh a feather—and
the feather is resting loosely in the pouch of a kangaroo that is vigorously jump-
ing up and down.
CHAPTER SUMMARY
• Don’t get distracted by P values and conclusions about statistical signifi-
cance without also thinking about the size of difference or correlation and
its precision (as assessed by its CI).
• Beware of HARKing (hypothesizing after results are known). Testing a
hypothesis proposed as part of the experimental design is quite distinct
from “testing” a hypothesis suggested by the data.
• Beware of p-hacking. You cannot interpret P values at face value if the
investigator analyzed the data many different ways.
• Correlation does not prove causation.
• Beware of surrogate or proxy variables. A treatment can improve laboratory
results while harming patients.
• If you want to know if an intervention makes a difference, you need to
intervene. Observational studies can be useful but rarely are definitive.
• Don’t fall for the ecological fallacy. If the data were collected for groups,
the conclusion can only be applied to the set of groups, not to the individu-
als within the groups.
• Ask about more than averages. Be on the lookout for differences in variability.
• The difference between “significant” and “not significant” is not itself sta-
tistically significant.
• If the study ignores a crucial variable, it doesn’t really matter how fancy
the analyses are.
• False positive conclusions are much more common than most people realize.
• Avoid dichotomizing outcome variables.
Capstone Example
I must confess to always having viewed studying statistics as
similar to a screening colonoscopy; I knew that it was impor-
tant and good for me, but there was little that was pleasant or
fun about it.
ANTHONY N. DEMARIA
T his chapter was inspired by Bill Greco, who is its narrator and
coauthor. One definition of capstone is “a final touch,” and this
chapter reviews many of the statistical principles presented throughout
this book. It also demonstrates the usefulness and versatility of simula-
tions. Although based on a true situation, many of the elements of the
storyline, people’s names, and drug names have been changed to protect
the innocent.
Each of the eight experiments had taken one week to conduct. All eight experi-
ments were conducted with the same passage of the KB cell line over the six-
month period preceding the submission of the manuscript. The IC50s for TMQ and
MTX of 19 and 1.9 nM, respectively, were determined from a joint experiment
with shared controls. The other six IC50 determinations were made at different
times and preceded this last paired experiment. The overall theme of the manu-
script was a comparison of the two DHFR inhibitors against the single KB human
cell line regarding antiproliferative potency, cellular uptake, metabolism, and in-
hibition of the target enzyme DHFR.
As part of the usual process of peer review, the journal’s editor sent the manu-
script to several anonymous reviewers. The paper was rejected largely because of
the following comment by one reviewer: “This reviewer performed an unpaired
Student’s t test on the data [see Table 46.1], and found that there is no significant
difference between the mean IC50 of TMQ and MTX (P = 0.092) against KB cells.
The data do not support your conclusion.”
Not believing the reviewer, Dr. Bentham asked me to check the calculations.
Using the formulas and tables in my biostatistics text and a scientific calculator,
I performed a standard unpaired two-tailed Student’s t test (with the usual as-
sumption of equal error variance for the two populations). I found that t = 2.01
and P = 0.092, in total agreement with the reviewer. Because P > 0.05, I informed
the investigator that the reviewer was correct: the difference in potency was not
statistically significant.
Because the P value was 0.092, the investigator asked, doesn’t that mean
there is a (1.000 − 0.092) = 0.908, or 90.8%, chance that the findings are real? I
explained to him that a P value cannot be interpreted that way. The proper inter-
pretation of the P value is that even if the two drugs really have the same potency
(the null hypothesis), there is a 9.2% chance of observing this large a discrepancy
(or larger still) by chance. If I had had a crystal ball, perhaps I could have shown
him Chapter 19 of this text!
I took the analysis one step further and computed the CI of the difference. The
difference in observed IC50s was −18.5 nM (the order of subtraction was arbitrary,
so the difference could have been +18.5 nM; it is important only to be consistent).
The 95% CI for the difference was from −41.0 to +4.06 nM. I pointed out that the
lower confidence limit was negative (MTX was more potent) and the upper confi-
dence limit was positive (TMQ was more potent), so the 95% CI included the pos-
sibility of no difference in potency (equal IC50 values; the null hypothesis). This was
consistent with the P value (see Chapter 17). When a P value is greater than 0.05, the
95% CI must include the value specified by the null hypothesis (zero, in this case).
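Table 46.1 is not reproduced in this excerpt, but its IC50 values can be recovered as the antilogs of the logarithms listed later in Table 46.3. The sketch below (Python with SciPy, my choice of tool rather than the hand calculations described in the story) repeats the reviewer's analysis; note that the sign of the difference and of its CI depends only on the order of subtraction.

```python
import numpy as np
from scipy import stats

# Table 46.1 is not reproduced in this excerpt; these IC50 values (nM) are the
# antilogs of the logarithms listed in Table 46.3, so they should match it.
tmq = 10 ** np.array([0.740, 1.079, 1.279, 1.672])     # roughly 5.5, 12, 19, 47 nM
mtx = 10 ** np.array([-0.081, 0.041, 0.279, 0.763])    # roughly 0.83, 1.1, 1.9, 5.8 nM

# Standard unpaired two-tailed t test assuming equal variances
t_stat, p_value = stats.ttest_ind(tmq, mtx)
print(t_stat, p_value)                                  # about t = 2.0, P = 0.09

# 95% CI for the difference between means, from the pooled SD (df = 6)
diff = tmq.mean() - mtx.mean()                          # about +18.5 nM (TMQ minus MTX)
pooled_var = (3 * tmq.var(ddof=1) + 3 * mtx.var(ddof=1)) / 6
se_diff = np.sqrt(pooled_var * (1 / 4 + 1 / 4))
margin = stats.t.ppf(0.975, df=6) * se_diff
print(diff - margin, diff + margin)                     # about -4 to +41 nM
```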
The investigator became enraged. He thought the data in the table would con-
vince any competent pharmacologist that TMQ is, beyond any reasonable doubt,
less potent than MTX. Statistics weren’t needed, he thought, spouting off half-
remembered famous quotes: “When the data speaks for itself, don’t interrupt.”
“There are lies, damned lies and statistics.” “If you need statistics, you’ve done
the wrong experiment.”
The investigator sent the manuscript unaltered to another journal, which
accepted it.
Now
It is Friday, January 23, 2009. It is a chilly winter day in Buffalo, New York.
Barack Obama has just become the president of the United States. There are
small, powerful PCs on the desks of virtually all scientists. Powerful statistical
PC software is available, at reasonable or no cost, to all scientists. The Internet
is a major part of professional and personal life. I am the instructor of an applied
biostatistics course. I use Intuitive Biostatistics as the main textbook in the course.
I just retrieved the published paper on TMQ and MTX. As I stare at the IC50 table
reproduced here as Table 46.1, it stares back at me.
The mismatch of statistical reasoning and common sense gnawed at me. I
knew the statistical calculations were correct. But I also felt the senior and experi-
enced investigator’s scientific intuition and conclusion were also probably correct.
Something was wrong, and this chapter is the culmination of decades of angst. I
must resolve this 34-year-old quandary. How should I have analyzed these data
back in 1975? How could I analyze these data now in 2009?
Before reading on, think about how you would analyze these data.
• What are the details of the cell proliferation assay? Which details provide
the opportunity for experimental artifacts? Were cells actually counted?
Or did they measure some proxy variable (perhaps protein content) that
is usually proportional to cell count? Over what period of time was cell
growth measured? Has the relationship between cell growth and time been
well characterized in previous experiments? Does cell growth slow down
because of cell-to-cell contact inhibition (or depletion of nutrients) as the
flasks fill with cells? Are the different drug treatments randomly assigned
to the flasks? Is it possible that cell growth depends on where each flask is
placed in the incubator?
• How was 100% cell growth defined—as growth with no drug? What if
growth was actually a bit faster in the presence of a tiny concentration of
drug? Should the 100% point be defined by averaging the controls or by
fitting a curve and letting the curve-fitting program extrapolate to define
the top plateau?
• How was 0% cell growth defined? Is 0% defined to be no cell growth at all
or cell growth with the largest concentration of drug used? What if an even
larger concentration had been used? Or was 0% defined by a curve-fitting
program that extrapolated the curve to infinite drug concentration? The
answer to this seemingly simple technical question can have a large influ-
ence on the estimation of drug potency. The determination of the IC50 can be
no more accurate than the definitions of 0% and 100%.
• How was the IC50 determined? By hand with a ruler? By hand with a French
curve? By linearizing the raw data and then using linear regression? By fit-
ting the four-parameter concentration-effect model (see Chapter 36) to the
raw data using nonlinear regression? If so, with what weighting scheme?
Here, we will skip all those questions and assume that the values in Table 46.1
are reliable. But do note that many problems in data analysis actually occur while
planning or performing the experiment or while wrangling the data into a form
that can be entered into a statistics program.
[Figure: the four IC50 values (nM, 0 to 50) for each drug, plotted separately for TMQ and MTX.]
smaller. But the other end of the CI is positive, which means the IC50 of TMQ is
smaller. Like before, this method concludes that the data are consistent with no
difference in IC50. Because the 95% CI includes zero, the P value must be higher
than 0.05. In fact, it is 0.104.
In summary, this alternative logical approach lengthens the CI and increases
the P value! I don’t seem to be making any real progress.
ratio for this last confirmatory experiment, with a statement that this experiment
was representative of six previous individual exploratory experiments with the
two agents.
This seems to be a scientifically ethical approach to the dilemma. However,
because I like to think of myself as a stubborn, honest, and thorough scientist, I
suspect that there is a better solution—one that involves calculating, rather than
avoiding, statistics. I want to honestly showcase and analyze all of the experimen-
tal work.
Studying TMQ and MTX in paired experiments with common controls to
assess their potency ratio is a good idea. Pairing (see Chapter 31) is usually rec-
ommended because it reduces variability and so increases power. This example
combines data from paired experiments with data from unpaired experiments,
which makes the analysis awkward.
Analyzing lognormal data is simple. I first calculated the (base 10) loga-
rithms of the eight IC50s. The results are shown in Table 46.3 and Figure 46.2.
The mean log(IC50) for TMQ is 1.19 and that for MTX is 0.251. The variabil-
ity of the two sets of log(IC50) values is very similar, with nearly identical sample
SDs (0.389 and 0.373, respectively). Thus, there is no problem accepting the as-
sumption of the standard two-sample unpaired t test that the two samples come
from populations with the same SDs.
Reversing the logarithmic transform (i.e., taking the antilog; see Appendix E)
converts the values back to their original units. Taking the antilog is the same as
taking 10 to that power, and taking the antilog of a mean of logarithms computes
the geometric mean (see Chapter 11). The corresponding geometric mean for the
IC50 for TMQ is 15.6 nM and that for MTX is 1.78 nM.
TMQ MTX
0.740 −0.081
1.079 0.041
1.279 0.279
1.672 0.763
Table 46.3. The logarithms of the IC50 values of TMQ and MTX.
These are the logarithms of the values (in nM) shown in Table 46.1.
[Figure 46.2: the log(IC50) values for TMQ and MTX, plotted both in log10 units and on a logarithmic IC50 (nM) axis from 0.1 to 100.]
The difference between the two mean log(IC50) values is 1.19 − 0.251 = 0.942
(recall that the difference between the logarithms of two values equals the logarithm
of the ratio of those two values). The antilog is 10 to that power, 10^0.942, which
equals about 8.75. In other words, MTX is almost nine times more potent than TMQ.
It requires one-ninth as much MTX to have the same effect as TMQ.
95% CI
The 95% CI for the difference in mean log(IC50) ranges from 0.282 to 1.60. Trans-
form each limit to its antilog (10 to that power) to determine the 95% confidence
limits of the population potency ratio, 1.91 to 39.8. This interval does not encom-
pass 1.0 (the value of the null hypothesis of equal drug potency), so it is consistent
with a P value of less than 0.05. With 95% confidence, we can say that MTX is
somewhere between twice as potent as TMQ and 40 times as potent. Analyzed this
way, the intuition of the investigator is proven to be correct.
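Here is a minimal sketch (Python with SciPy, my choice of tool rather than the chapter's) of the log-scale analysis just described, ending with the potency ratio and its CI.

```python
import numpy as np
from scipy import stats

log_tmq = np.array([0.740, 1.079, 1.279, 1.672])      # log10(IC50) values from Table 46.3
log_mtx = np.array([-0.081, 0.041, 0.279, 0.763])

t_stat, p_value = stats.ttest_ind(log_tmq, log_mtx)    # unpaired t test on the logarithms
diff = log_tmq.mean() - log_mtx.mean()                 # about 0.942 log units

pooled_var = (3 * log_tmq.var(ddof=1) + 3 * log_mtx.var(ddof=1)) / 6
se_diff = np.sqrt(pooled_var * (1 / 4 + 1 / 4))
margin = stats.t.ppf(0.975, df=6) * se_diff            # 95% CI half-width, df = 6

print(10 ** diff)                                      # about 8.7: MTX is roughly ninefold more potent
print(10 ** (diff - margin), 10 ** (diff + margin))    # about 1.9 to 40, matching the text
print(p_value)                                         # well below 0.05
```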
The pooled SD of the log(IC50) values in the two groups was 0.38, so we enter into
the sample size formula a conservative 0.40 as s. How large a difference in log(IC50)
are we looking for? Let's say we are look-
ing for an eightfold difference in potency, so the log(IC50) values differ by log(8),
which equals about 0.9. Set w = 0.9. The effect size is 0.9/0.4 = 2.25. Change the
effect size in G*Power to 2.25 (while keeping the settings for two-tailed signifi-
cance threshold of 0.05 and a power of 80%). The required sample size is four in
each group.
This might be stated in a paper as follows: a sample size of four in each
group was computed to have 80% power to detect as statistically significant (P <
0.05) an eightfold difference in potency, assuming that log(IC50) values follow a
Gaussian distribution with a SD of 0.40.
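G*Power itself is not scripted here, but a rough check of the same calculation can be done in a few lines (Python with SciPy; this uses the standard normal-approximation formula for an unpaired two-sample comparison, so it will not match G*Power's exact noncentral-t calculation to the decimal).

```python
import numpy as np
from scipy import stats

sd = 0.40                    # assumed SD of the log(IC50) values
delta = np.log10(8)          # an eightfold difference in potency, about 0.9 log units
d = delta / sd               # standardized effect size, about 2.25

alpha, power = 0.05, 0.80
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)

# Normal-approximation formula for an unpaired two-sample comparison;
# exact calculations (e.g., G*Power's noncentral t) give a similar answer.
n_per_group = 2 * ((z_alpha + z_beta) / d) ** 2
print(np.ceil(n_per_group))  # about four per group, as stated above
```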
What would have happened if we had analyzed the IC50 values rather than the log(IC50) values? Our one example gives
a hint of the answer, but only a hint. To find out a fuller answer, I performed Monte
Carlo simulations.
Table 46.4 shows the simulated results. The first row in Table 46.4 is the result
of 1 million simulated experiments. For each, four values were randomly generated
to simulate an experimental measurement of potency for one drug, and four other
values were randomly generated to simulate the potency of the other drug. All in
all, 8 million values were randomly generated to create the first row of Table 46.4.
The data for one drug were simulated with an IC50 of 2 nM, so the logarithm
of 2 nM, 0.301, was entered as the mean into the simulation program. The data for
the other drug were simulated using an IC50 of 20 nM, so the logarithm of 20 nM,
1.301, was entered into the simulation program. Each log(IC50)—all 8 million of
them—was chosen using a random number method that simulates sampling from
a Gaussian distribution, assuming a mean of 0.301 or 1.301 and a SD of 0.40.
These simulations mimic the data presented at the beginning of this chapter.
The potency of the two drugs differs by a factor of 10, the log(IC50) values vary
randomly according to a Gaussian distribution, the SD of the logarithms is 0.4,
and n = 4 for each group. The log(IC50) values from each simulated experiment
were compared with an unpaired t test, and the P value was tabulated.
The last column of Table 46.4 shows the percentage of the simulated experi-
ments in which the P value is less than 0.05, the usual definition of statistical sig-
nificance. In other words, that last column shows the power of the experimental
design (see Chapter 20). For this design, the power is 83.6%. In the remaining
simulated experiments (16.4% of them), the P values are greater than 0.05 and so
result in the conclusion that the results were not statistically significant. This hap-
pens when the means happen to be close together or the scatter among the values
in that simulated experiment happens to be large.
The second row of Table 46.4 shows analyses of IC50 values. Each IC50 was
computed by taking the antilog (10 to that power) of a log(IC50) generated as
explained for the first row. The first row shows the results of a simulation of sam-
pling log(IC50) values from a Gaussian distribution, and the second row shows the
results of a simulation of sampling IC50 values from a lognormal distribution. The
IC50 values from each simulated experiment were compared with an unpaired t
test, and the P value was tabulated.
In only 45.6% of these experiments was the P value less than 0.05. Running the
standard t test on untransformed IC50 values violates its assumptions in three ways
(the values follow a lognormal rather than a Gaussian distribution, the two groups
have unequal variances, and the test examines differences when the scientific question
concerns ratios), so this approach has much less statistical power (45.6% vs. 83.6%).
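The Monte Carlo comparison just described is easy to reproduce on a smaller scale. The sketch below (Python with NumPy and SciPy, my choice of tools) uses 20,000 simulated experiments instead of 1 million; the two power estimates should land near the 83.6% and 45.6% reported in Table 46.4.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)            # arbitrary seed, for reproducibility
n_sims, n = 20_000, 4                     # far fewer simulations than the chapter's 1 million
mean_a, mean_b, sd = 0.301, 1.301, 0.40   # log10(2 nM), log10(20 nM), SD of the logarithms

hits_log = hits_raw = 0
for _ in range(n_sims):
    log_a = rng.normal(mean_a, sd, n)
    log_b = rng.normal(mean_b, sd, n)
    # First row of Table 46.4: t test on the log(IC50) values
    if stats.ttest_ind(log_a, log_b).pvalue < 0.05:
        hits_log += 1
    # Second row of Table 46.4: t test on the untransformed IC50 values
    if stats.ttest_ind(10 ** log_a, 10 ** log_b).pvalue < 0.05:
        hits_raw += 1

print(f"Power analyzing log(IC50): {100 * hits_log / n_sims:.1f}%")   # about 84%
print(f"Power analyzing IC50:      {100 * hits_raw / n_sims:.1f}%")   # about 46%
```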
Table 46.5. Simulations in which the null hypothesis is true (drugs have equal
potencies).
The simulations were done as explained in Table 46.4, except that these simulations used
equal potencies for the two drugs (here, 20 nM, but nearly identical results were obtained
when using 2 nM).
CHAPTER SUMMARY
• Lognormal distributions are common in biology. Transforming data to
logarithms may seem arcane, but it is actually an important first step for
analyzing many kinds of data.
• This chapter was designed to review as many topics as possible, not to
be as realistic as possible. It rambles a bit. However, it realistically shows
that the search for the best method of analyzing data can be complicated.
Ideally, this search should occur before collecting the data or after collect-
ing preliminary data but before doing the experiments that will be reported.
If you analyze final data in many different ways in a quest for statistical
significance, you are likely to be misled.
• Statistical methods are all based on assumptions. The holy grail of biostatis-
tics is to find methods that match the scientific problem and for which the
assumptions are likely to be true (or at least not badly violated). The solu-
tion to this chapter's problem (using logarithms) was optimal not because
it yielded the smallest P value but rather because it melded the statistical
analysis with the scientific background and goal. It was a nice translation
of the scientific question to a statistical model.
• Simulating data is a powerful tool that can reveal deep insights into data
analysis approaches. Learning to simulate data is not hard, and it is a skill
well worth acquiring if you plan to design experiments and analyze scien-
tific data.
• Statistical textbooks and software aren’t always sufficient to help you ana-
lyze data properly. Data analysis is an art form that requires experience as
well as courses, textbooks, and software. Until you have that experience,
collaborate or consult with others who do. Learning how to analyze data is
a journey, not a destination.
CHAPTER 47
Statistics and Reproducibility
Truthiness is a quality characterizing a “truth” that a person
making an argument or assertion claims to know intuitively
“from the gut” or because it “feels right” without regard to
evidence, logic, intellectual examination or facts.
STEPHEN COLBERT
Why?
Why can so few findings be reproduced? Undoubtedly, there are many reasons.
Part of the problem is that many experiments are complicated, and it is difficult to
repeat all the steps exactly. Different labs that think they are working on the same
cell line may actually be working with somewhat different cell lines. Reagents or
antibodies thought to be identical may differ in subtle, but important, ways. Dif-
ferences in experimental method thought to be inconsequential (such as choosing
plastic or glass test tubes) may actually explain discrepancies between labs. In some
cases, exploring reasons for inconsistent results can lead to new scientific insights.
Reality check
The real goal, of course, is not to have multiple studies that give similar reproducible
results. The real goal is to obtain results that are correct. When the results are incor-
rect, it doesn’t really matter much whether another study can reproduce the findings.
To be sure the results are correct, it is not enough to repeat the experiment exactly. It
is also necessary to run similar experiments to make sure the general conclusions can
be confirmed, not just that the results of that exact experiment can be reproduced.
and editors tend to reject such papers. So the papers that end up being published
are in part selected because they have large effect sizes and small P values. In other
words, published papers tend to exaggerate differences between groups or effects
of treatments. Publication bias was discussed in more detail in Chapter 43.
Note that the number that matters is the number of independent variables you
begin the analysis with. You can’t reduce the problem of overfitting by reducing
the number of variables through stepwise regression or by preselecting only vari-
ables that appear to be correlated with the outcome.
Table 49.2. The FPRP depends on the prior probability and P value.
The calculations assume that the sample size was sufficient to give 80% power; the FPRP
values would be higher if the power were lower. The results in the last column were deter-
mined by simulation using methods similar to Colquhoun (2014).
size was chosen to give a power of 80%). Chapter 18 explained how to combine
these values to compute the FPRP. The table shows that there is an 86% chance
that a finding with any P value less than 0.05 is a false positive, which leaves only
a 14% chance that it is real. And if the P value is just barely smaller than 0.05,
there is about a 97% chance that the finding is a false positive, leaving only a 3% chance
that it is a real finding. With the FPRP so high, there really is no point in running a
speculative experiment with the conventional significance level of 5%. And if you
do that kind of experiment and find a “significant” result, you shouldn’t expect a
repeat experiment to reproduce that conclusion.
Now let’s focus on experiments where the prior probability is 50%. If you
look at all experiments with P < 0.05, the FPRP is only 5.9%. But if you look only
at experiments with P values just barely less than 0.05 (I used the range 0.045 to
0.050 in the simulations), the FPRP is 27%. In other words, in this very common
situation, more than a quarter of results with a P value just a bit smaller than 0.05
will be false positives.
In situations where false positive results are likely, it doesn’t make a lot of
sense to worry about whether a particular finding can be replicated.
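If you want to check these FPRP values yourself, the calculation for "any P value less than 0.05" takes only a few lines of R. The function below is my own shorthand for the logic explained in Chapter 18; the "just barely below 0.05" rows require simulation, as noted in the legend to Table 49.2.

fprp <- function(prior, power, alpha = 0.05) {
  false.positives <- (1 - prior) * alpha   # the null is true, yet P < alpha
  true.positives  <- prior * power         # the effect is real and the study detects it
  false.positives / (false.positives + true.positives)
}
fprp(prior = 0.01, power = 0.80)   # about 0.86, matching the 86% above
fprp(prior = 0.50, power = 0.80)   # about 0.059, matching the 5.9% above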
spanning more than three orders of magnitude! Boos and Stefanski (2011) have
shown this much variability in many situations.
Lazzeroni, Lu, and Belitskaya-Lévy (2014) have derived an equation for
computing the 95% prediction interval for a P value that answers this question: If
you obtain a certain P value in one experiment, what is the range of P values you
would expect to see in 95% of repeated experiments using the same sample size?
It turns out that this interval does not depend on sample size (but does depend on
whether the repeat experiment uses the same sample size). If the P value in your
first experiment equals 0.05, you can expect 95% of P values in repeat experi-
ments (with the same sample size) to be between 0.000005 and 0.87. If the first
P value is 0.001, you expect 95% of P values in repeat experiments to be between
0.000000002 and 0.38.
P values are nowhere near as reproducible as most people expect!
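These intervals can be reproduced with a short calculation. The R sketch below is my simplification, not necessarily the exact equation of Lazzeroni, Lu, and Belitskaya-Lévy (2014): it converts the P value to a normal deviate (z) and assumes the z of a same-size repeat experiment scatters around the first with a standard deviation equal to the square root of 2.

p.prediction.interval <- function(p, level = 0.95) {
  z    <- qnorm(p, lower.tail = FALSE)                      # observed P value expressed as a z score
  zlim <- z + c(1, -1) * qnorm((1 + level) / 2) * sqrt(2)   # prediction limits for the repeat z
  pnorm(zlim, lower.tail = FALSE)                           # convert back to the P value scale
}
p.prediction.interval(0.05)    # roughly 0.000005 to 0.87, as quoted above
p.prediction.interval(0.001)   # roughly 0.000000002 to 0.38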
Table 49.3. Probability of statistical significance in a second experiment given that you
know the P value in the first experiment as computed in two papers by different methods.
The calculations assume that both experiments are performed perfectly and both comply
with the assumptions of the analysis.
that the repeat experiment will have a P value less than 0.05. It turns out that that
probability is only 78% or 91% (the two values differ because they were computed
by two different methods).
SUMMARY
There are many reasons why the results of many experiments or studies cannot be
reproduced. This chapter reviewed statistical reasons:
• Many studies are designed in a way to exaggerate effect sizes and so report
P values that are small. When the study results are exaggerated, you can’t
expect a repeat experiment to get the same results.
• Even with perfectly performed studies, the FPRP can be high, especially
when the prior probability is low or when the sample size, and thus the
power is low. If the reported result is a false positive, you can’t expect a
repeat experiment to reproduce the results.
• Many experiments are designed with low power to find the effect the exper-
imenters are looking for. In this case, one may not expect a repeat experi-
ment to find the same results.
• It is important to focus on how large the effects are and not just on P values
and conclusions about statistical significance.
• P values are far less reproducible than most people appreciate.
• Meta-analysis is a tool to combine results from several studies. When stud-
ies that appear to be done properly reach conflicting results, it can make
sense to combine the results with meta-analysis.
• Science is hard to do properly.
CHAPTER 48
Checklists for Reporting Statistical Methods and Results
“Surprising, newsworthy, statistically significant, and wrong: it
happens all the time.”
ANDREW GELMAN
Many people reading this book will never publish statistical results.
But for those who do, the following checklists are an opinionated
guide about how to present data and analyses. You may also want to con-
sult similar guides by Curtis and colleagues (2015) and by Altman and
colleagues (1983).
three repeat counts in a gamma counter from a preparation made from one
run of an experiment . . . ?
• State whether you chose the sample size in advance or adjusted it as you
saw the results accumulate. If the latter, explain whether you followed
preplanned rules.
• If the sample sizes of the groups are not equal, explain why.
• If you started with one sample size and ended with another sample size,
explain exactly what happened. State whether these decisions were based
on a preset protocol or were decided during the course of the experiment.
GRAPHING DATA
General principles
• Every figure and table should present the data clearly and not be exagger-
ated in a way to emphasize your conclusion.
• Every figure should be reported with enough detail about the methods used
so that no reader needs to guess at what was actually done.
• Consider posting the raw data files in a standard format, so others can
repeat your analyses or analyze the data differently. This is required in some
scientific fields.
• State whether any data points are off scale and therefore not shown on the graph.
Graphing variability
• When possible, graph the individual data, not a summary of the data. If
there are too many values to show in scatter plots, consider box-and-whisker
plots, violin plots, or frequency distributions.
• If you choose to plot means with error bars, graph SD error bars, which
show variability, rather than SEM error bars, which do not. Be sure to state
in the figure legend what the error bars represent.
• Plot confidence intervals of differences or ratios to graphically present the
effect size.
• Don’t try to hide the variability among your data. That is an important part
of your data, and showing expected variability in measurements of physi-
ological or molecular variables is a way to reassure readers the experiment
was properly done.
Reporting P values
• Don’t report a P value unless you really think it will help the reader inter-
pret the results.
• When possible, report the P value as a number rather than as an inequality.
For example, say “the P value was 0.0234” rather than “P < 0.05.”
• If there is any possible ambiguity, clearly state the null hypothesis the
P value tests. If you don’ know the null hypothesis, then you shouldn’t
report a P value (since every P value tests a null hypothesis)!
• When comparing two groups, state if the P value is one- or two-sided (which
is the same as one- or two-tailed). If one-sided, state that you predicted the
direction of the effect before collecting data and recorded that decision and
prediction. If you didn’t make this decision and prediction before collecting
data, you should not report a one-sided P value.
Appendices
APPENDIX A
Statistics with GraphPad
GRAPHPAD PRISM
All the figures in this book were created with GraphPad Prism, and most of the
analyses mentioned in this book can be performed with Prism (although not lo-
gistic regression).
GraphPad Prism, available for both Windows and Macintosh computers,
combines scientific graphing, curve fitting, and basic biostatistics. It differs from
other statistics programs in many ways:
• Statistical guidance. Prism helps you make the right choices and make
sense of the results. If you like the style of this book, then you’ll appreciate
the help built into Prism.
• Analysis checklists. After completing any analysis in Prism, click the clip-
board icon to see an analysis checklist. Prism shows you a list of questions
to ask yourself to make sure you’ve picked an appropriate analysis.
• Nonlinear regression. Some statistics programs can’t fit curves with
nonlinear regression and others provide only the basics. Nonlinear
regression is one of Prism’s strengths, and it provides many options (e.g.,
remove outliers, compare models, compare curves, interpolate standard
curves, etc.).
• Automatic updating. Prism remembers the links among data, analysis
choices, results, and graphs. When you edit or replace the data, Prism auto-
matically updates the results and graphs.
• Analysis choices can be reviewed and changed at any time.
• Error bars. You don’t have to decide in advance. Enter raw data and then
choose whether to graph each value or the mean with SD, SEM, or CI. Try
different ways to present the data.
Prism is designed so you can just plunge in and use it without reading in-
structions. But avoid the temptation to sail past the Welcome dialog so you can go
straight to the analyses. Instead, spend a few moments understanding the choices
on the Welcome dialog (see Figure A1). This is where you start a new data table
and graph, open a Prism file, or clone a graph from one you’ve made before.
The key to using Prism effectively is to choose the right kind of table for
your data, because Prism’s data tables are arranged differently than those used by
most statistics programs. For example, if you want to compare three means with
one-way ANOVA, you enter the three sets of data into three different columns in
a data table formatted for entry of column data. With some other programs, you’d
enter all the data into one column and also enter a grouping variable into another
column to define the group to which each value belongs. Prism does not use
grouping variables.
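For comparison, here is what the grouping-variable arrangement looks like in R (the numbers are made up): all the measurements go into one column, and a second column tells the program which group each value belongs to.

df <- data.frame(value = c(3.1, 2.8, 3.4, 4.0, 4.2, 3.9, 5.1, 4.8, 5.3),   # made-up measurements
                 group = rep(c("A", "B", "C"), each = 3))                   # grouping variable
summary(aov(value ~ group, data = df))   # one-way ANOVA comparing the three group means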
To understand how the various tables are organized, start by choosing some
of the sample data sets built into Prism. These sample data sets come with ex-
planations about how the data are organized and how to perform the analysis
you want. After experimenting with several sample data sets, you’ll be ready to
analyze your own data.
You can download a free 30-day trial from www.graphpad.com.
APPENDIX B
Statistics with Excel
• The Help within Excel does not always provide useful information about
statistical functions. Google for other sources of information.
• Excel provides an Analysis ToolPak, which can perform some statistical
tests. You need to use the Add-in Manager to install it before you can use
it. Unlike Excel equations, the results of the ToolPak are not linked to the
data. If you edit the data, the results will remain fixed until you run the
analysis again.
• Beware of Excel’s RANK() function. Nonparametric tests assign all tied
values the average of the ranks for which they tie. Excel’s RANK() function
assigns all tied values the lowest of the ranks for which they tie, and the
other ranks are not used.
• Beware of the NORMDIST(z) function (Gaussian distribution). It looks
like it ought to be similar to the TDIST(t) function (Student t distribution),
but the two work very differently. Experiment with sample data before
using this function.
• The excellent book by Pace (2008) gives many details about using Excel to
do statistical calculations. It can be purchased as either a printed book or
as a pdf download.
• Starting with Excel 2010 (Windows) and 2011 (Mac), Excel includes a new
set of statistical functions. When you have a choice, use the newer ones,
which all are named using several words separated by periods. For example,
the old function NORMINV (which still exists) has been replaced with a
function called NORM.INV (which is more accurate).
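To see the tied-rank issue concretely, here is a short illustration using the rank() function of R (the program of Appendix C): its default averages tied ranks, as nonparametric tests require, while its "min" option mimics the behavior of Excel's RANK().

x <- c(10, 20, 20, 30)
rank(x)                        # 1.0 2.5 2.5 4.0 -- tied values share the average rank
rank(x, ties.method = "min")   # 1 2 2 4 -- tied values get the lowest rank, like Excel's RANK()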
APPENDIX C
Statistics with R
APPENDIX D
Values of the t Distribution Needed to Compute CIs
The margin of error of many CIs equals the standard error times a critical value of
the t distribution tabulated in Table D1. This value depends on the desired confi-
dence level and the number of df, which equals n minus the number of parameters
estimated. For example, when computing a CI of a mean, df = (n – 1), because you
are only estimating one parameter (the mean); when computing a CI for a slope
obtained by linear regression, df = (n – 2), because you estimated two parameters
(the slope and the intercept).
These values were computed using the following formula from Excel 2011 (C
is the desired confidence level in percentage):
= T.INV.2T (1 − 0.01 * C, df)
If you have an older version of Excel (prior to 2010) use the TINV function
instead of T.INV. Don’t use Excel’s CONFIDENCE() function, which is based on
the z (normal) distribution rather than the t distribution and so has limited utility.
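If you prefer R to a spreadsheet, the same critical values can be computed with the qt() function; the wrapper below is simply my shorthand for the formula above.

t.critical <- function(C, df) qt(1 - (1 - 0.01 * C) / 2, df)   # C = desired confidence level, in percent
t.critical(95, df = 9)   # about 2.262; multiply by the standard error to get the margin of error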
APPENDIX E
A Review of Logarithms
OTHER BASES
The logarithms shown in the previous section are called base 10 logarithms, be-
cause the computations take 10 to some power. These are also called common
logarithms.
You can compute logarithms using any base. Mathematicians prefer natural
logarithms, using base e (2.7183 . . .). By convention, natural logarithms are used
in logistic and proportional hazards regression (see Chapter 37).
Biologists sometimes use base 2 logarithms, often without realizing it. The
base 2 logarithm is the number of doublings it takes to reach a value. So the log
base 2 of 16 is 4, because if you start with 1 and double it four times (2, 4, 8, and
16), the result is 16. Immunologists often serially dilute antibodies by factors of 2,
so they often graph data on a log2 scale. Cell biologists use base 2 logarithms to
convert cell counts to number of doublings.
Logarithms using different bases are proportional to each other. Conse-
quently, converting from natural logs to common logs is sort of like changing
units. Divide a natural logarithm by 2.303 to compute the common log of the same
value. Multiply a common log by 2.303 to obtain the corresponding natural log.
NOTATION
Unfortunately, the notation is used inconsistently.
The notation “log(x)” usually means the common (base 10) logarithm, but
some computer languages use it to mean the natural logarithm.
The notation “ln(x)” always means natural logarithm.
The notation “log10(x)” clearly shows that the logarithm uses base 10.
ANTILOGARITHMS
The antilogarithm (also called an antilog) is the inverse of the logarithm trans-
form. Because the logarithm (base 10) of 1,000 equals 3, the antilogarithm of 3 is
1,000. To compute the antilogarithm of a base 10 logarithm, take 10 to that power.
To compute the antilogarithm of a natural logarithm, take e to that power. The
natural logarithm of 1,000 is 6.908. So the antilogarithm of 6.908 is e^6.908, which is
1,000. Spreadsheets and computer languages use the notation exp(6.908).
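All of the relationships in this appendix are easy to verify in R (or with any scientific calculator):

log10(1000)         # 3, the common (base 10) logarithm
log2(16)            # 4, the number of doublings
log(1000)           # 6.908, the natural logarithm
log(1000) / 2.303   # about 3: a natural log divided by 2.303 gives the common log
10^3                # 1,000, the antilog (base 10) of 3
exp(6.908)          # about 1,000, the antilog (base e) of 6.908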
APPENDIX F
Choosing a Statistical Test
If you are not sure which statistical test to use, the tables in this appendix may help
you decide. Reviewing these tables is also a good way to review your understand-
ing of statistics. Of course, this appendix can’t include every possible statistical test
(and some kinds of data require developing new statistical tests).
OUTCOME: BINOMIAL
Examples:
• Cure (yes/no) of acute myeloid leukemia within a specific time period
• Success (yes/no) of preventing motion sickness
• Recurrence of infection (yes/no)
Describe one sample:
• Proportion
Make inferences about one population:
• CI of proportion
• Binomial test to compare observed distribution with a theoretical (expected) distribution
Compare two unmatched (unpaired) groups:
• Fisher's exact test
Compare two matched (paired) groups:
• McNemar's test
Compare three or more unmatched (unpaired) groups:
• Chi-square test
• Chi-square test for trend
Compare three or more matched (paired) groups:
• Cochran's Q
Explain/predict one variable from one or several others:
• Logistic regression
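For readers who analyze such data in R, here is one rough mapping of the choices above onto base R functions. The numbers in this sketch are made up, Cochran's Q is not included in base R, and other functions or packages could serve equally well.

binom.test(7, 20, p = 0.5)                         # CI of a proportion; compare with a theoretical value
fisher.test(matrix(c(9, 1, 4, 6), nrow = 2))       # two unmatched groups (2 x 2 table)
mcnemar.test(matrix(c(20, 5, 12, 8), nrow = 2))    # two matched groups
chisq.test(matrix(c(9, 1, 4, 6, 3, 7), nrow = 2))  # three or more unmatched groups
# Logistic regression (one binary outcome explained by one or more predictors):
# glm(outcome ~ predictor1 + predictor2, family = binomial, data = mydata)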
APPENDIX G
Problems and Answers
The first three editions contained a chapter of problems and another chapter with
extensive discussion of the answers. These have not been updated for the fourth
edition, but the problems and answers for the third edition are available online at
www.oup.com/us/motulsky-4e
REFERENCES
Agnelli, G., Buller, H. R., Cohen, A., Curto, M., Gallus, A. S., Johnson, M., Porcari, et al.
(2012). Apixaban for extended treatment of venous thromboembolism. New England
Journal of Medicine, 368, 699–708.
Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation
of binomial proportions. American Statistician, 52, 119–126.
Alere. (2014). DoubleCheckGoldTM HIV 1&2. Retrieved May 14, 2014, http://www.alere.
com/ww/en/product-details/doublecheckgold-hiv-1-2.html.
Altman, D. G. (1990). Practical statistics for medical research. London: Chapman & Hall/
CRC.
Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence.
BMJ, 311, 485.
———. (1998). Time to event (survival) data. BMJ, 317, 468–469.
Altman, D. G., Gore, S. M., Gardner, M. J., & Pocock, S. J. (1983). Statistical guidelines for
contributors to medical journals. BMJ, 286, 1489–1493.
Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27, 17–21.
Arad, Y., Spadaro, L. A., Goodman, K., Newstein, D., & Guerci, A. D. (2000). Prediction of
coronary events with electron beam computed tomography. Journal of the American
College of Cardiology, 36, 1253–1260.
Arden, R., Gottfredson, L. S., Miller, G., & Pierce, A. (2008). Intelligence and semen qual-
ity are positively correlated. Intelligence, 37, 277–282.
Austin, P. C., & Goldwasser, M. A. (2008). Pisces did not have increased heart failure:
Data-driven comparisons of binary proportions between levels of a categorical
variable can result in incorrect statistical significance levels. Journal of Clinical
Epidemiology, 61, 295–300.
Austin, P. C., Mamdani, M. M., Juurlink, D. N., & Hux, J. E. (2006). Testing multiple sta-
tistical hypotheses resulted in spurious associations: A study of astrological signs and
health. Journal of Clinical Epidemiology, 59, 964–969.
Babyak, M. A. (2004). What you see may not be what you get: A brief, nontechnical in-
troduction to overfitting in regression-type models. Psychosomatic Medicine, 66,
411–421.
Bailar, J. C. (1997). The promise and problems of meta-analysis. New England Journal of
Medicine, 337, 559–561.
Bakhshi, E., Eshraghian, M. R., Mohammad, K., & Seifi, B. (2008). A comparison of two
methods for estimating odds ratios: Results from the National Health Survey. BMC
Medical Research Methodology, 8, 78.
Barter, P. J., Caulfield, M., Eriksson, M., Grundy, S. M., Kastelein, J. J., Komajda, M.,
Lopez-Sendon, J., et al. (2007). Effects of torcetrapib in patients at high risk for coro-
nary events. New England Journal of Medicine, 357, 2109–2122.
Bausell, R. B. (2007). Snake oil science: The truth about complementary and alternative
medicine. Oxford: Oxford University Press.
Begley, C. G. C., & Ellis, L. M. L. (2012). Drug development: Raise standards for preclini-
cal cancer research. Nature, 483, 531–533.
Benjamin, D. J., Berger, J., Johannesson, M., et al. (2017). Redefine statistical significance.
Retrieved from psyarxiv.com/mky9j
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical
and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57,
290–300.
Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2011). Neural correlates of
interspecies perspective taking in the post-mortem Atlantic salmon: An argument for
proper multiple comparisons correction. Journal of Serendipitous and Unexpected
Results, 1, 1–5.
Bernstein, C. N., Nugent, Z., Longobardi, T., & Blanchard, J. F. (2009). Isotretinoin is not
associated with inflammatory bowel disease: A population-based case–control study.
American Journal of Gastroenterology, 104, 2774–2778.
Berry, D. A. (2007). The difficult and ubiquitous problems of multiplicities. Pharmaceuti-
cal Statistics, 6, 155–160.
Bhatt, D.L., and Mehta, C. (2016). Adaptive Designs for Clinical Trials. New England
Journal of Medicine 375, 65–74.
Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex bias in graduate admissions:
Data from Berkeley. Science, 187, 398–404.
Bishop, D. (2013) Interpreting unexpected significant results. BishopBlog (blog), June 7.
Accessed July 2, 2013, http://www.deevybee.blogspot.co.uk/2013/06/interpreting-
unexpected-significant.html.
Bland, J. M. J., & Altman, D. G. D. (2011). Comparisons against baseline within random-
ized groups are often used and can be highly misleading. Trials, 12, 264–264.
Blumberg, M. S. (2004). Body heat: Temperature and life on earth. Cambridge, MA:
Harvard University Press.
Bleyer, A. & Welch, H. G. (2012). Effect of three decades of screening mammography on
breast-cancer incidence. New England Journal of Medicine, 367, 1998–2005.
Boos, D. D., & Stefanski, L. A. (2011). P-value precision and reproducibility. American
Statistician, 65, 213–221.
Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2009). Introduction to meta-analysis.
Chichester, UK: Wiley.
Borkman, M., Storlien, L. H., Pan, D. A., Jenkins, A. B., Chisholm, D. J., & Campbell,
L. V. (1993). The relation between insulin sensitivity and the fatty-acid composition
of skeletal-muscle phospholipids. New England Journal of Medicine, 328, 238–244.
Briggs, W. M. (2008a). On the difference between mathematical ability between boys and
girls. William M. Briggs (blog), July 25. Accessed June 21, 2009, http://www.wmbriggs.
com/blog/?p=163/.
———. (2008b). Do not calculate correlations after smoothing data. William M. Briggs
(blog), February 14. Accessed June 21, 2009, wmbriggs.com/blog/?p=86/.
———. (2008c). Stats 101, Chapter 5: R. William M. Briggs (blog), May 20. Accessed
June 21, 2009, http://www.wmbriggs.com/public/briggs_chap05.pdf.
Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion.
Statistical Science, 16, 101–133.
Burnham, K., & Anderson, D. (2003). Model selection and multi-model inference. 2d ed.
New York: Springer.
Burns, P. (2005). A guide for the unwilling S user. Accessed January 26, 2009, http://www.
burns-stat.com/pages/Tutor/unwilling_S.pdf.
Bryant, F. B., & Brockway, J. H. (1997). Hindsight bias in reaction to the verdict in the
O. J. Simpson criminal trial. Basic and Applied Social Psychology, 19, 225–241.
Campbell, M. J. (2006). Statistics at square two. 2d ed. London: Blackwell.
Cantor, W. J., Fitchett, D. L., Borgundvagg, B., Ducas, J., Heffernam, M., Cohen, E. A.,
Morrison, L. J., et al. (2009). Routine early angioplasty after fibrinolysis for acute
myocardial infarction. New England Journal of Medicine, 360, 2705–2718.
Cardiac Arrhythmia Suppression Trial (CAST) Investigators. (1989). Preliminary report:
Effect of encainide and flecainide on mortality in a randomized trial of arrhythmia sup-
pression after myocardial infarction. New England Journal of Medicine, 321, 406–412.
Carroll, L. (1871). Through the looking glass. Accessed December 19, 2012, http://www.
gutenberg.org/ebooks/12.
Casadevall, A., Fang, F. C. (2014). Diseased science. Microbe, 9, 390–392.
Central Intelligence Agency. (2012). The world factbook. Accessed August 26, 2012, http://
www.cia.gov/library/publications/the-world-factbook/fields/2018.html.
Chan, A. W., Hrobjartsson, A., Haahr, M. T., Gotzsche, P. C., & Altman, D. G. (2004).
Empirical evidence for selective reporting of outcomes in randomized trials: Com-
parison of protocols to published articles. Journal of the American Medical Associa-
tion, 291, 2457–2465.
Chang, M., & Balser, J. (2016). Adaptive design-recent advancement in clinical trials.
Journal of Bioanalysis and Biostatistics, 1, 1–14.
Chao, T. A., Liu, C. J., Chen, S. J., Wang, K. L., Lin, Y. J., Chang, S. L., Lo, L. W., et
al. (2012). The association between the use of non-steroidal anti-inflammatory drugs
and atrial fibrillation: a nationwide case-control study. International Journal of
Cardiology, 168, 312–316.
Clopper, C. J., & Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated
in the case of the binomial. Biometrika, 26, 404–413.
Cochrane Methods (2017) About IPD meta-analysis. Accessed June 18, 2017, http://methods.
cochrane.org/ipdma/about-ipd-meta-analyses.
Cochrane Handbook (2017). Detecting reporting biases. Accessed June 20, 2017, http://
handbook.cochrane.org/index.htm#chapter_10/10_4_1_funnel_plots.htm
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. 2d ed. Hillsdale,
NJ: Erlbaum.
Cohen, T. J., Goldner, B. G., Maccaro, P. C., Ardito, A. P., Trazzera, S., Cohen, M. B., &
Dibs, S. R. (1993). A comparison of active compression–decompression cardiopul-
monary resuscitation with standard cardiopulmonary resuscitation for cardiac arrests
occurring in the hospital. New England Journal of Medicine, 329, 1918–1921.
Colquhoun, D. D. (2003). Challenging the tyranny of impact factors. Nature, 423, 479–480.
————. (2014). An investigation of the false discovery rate and the misinterpretation of
p-values. Royal Society Open Science, 1, 140216–140216.
————. (2017). The reproducibility of research and the misinterpretation of P values.
bioRxiv preprint 144337; doi: https://doi.org/10.1101/144337. Accessed Aug. 24, 2017.
Eyding, D. D., Lelgemann, M., Grouven, U., Härter, M., Kromp, M., Kaiser, T., Kerekes,
M. F., Gerken, M., & Wieseler, B. (2010). Reboxetine for acute treatment of major de-
pression: Systematic review and meta-analysis of published and unpublished placebo
and selective serotonin reuptake inhibitor controlled trials. BMJ, 341. doi: 10.1136/
bmj.c4737.
Faul, F., Erdfelder, E., Lang, A.-G., and Buchner, A. (2007). G*Power 3: A flexible sta-
tistical power analysis program for the social, behavioral, and biomedical sciences.
Behavior Research Methods 39: 175–191.
Federal Election Commission. (2012). Official 2012 presidential general election results.
Accessed June 2, 2013, http://www.fec.gov/pubrec/fe2012/2012presgeresults.pdf.
Feinstein, A. R., Sosin, D. M., & Wells, C. K. (1985). The Will Rogers phenomenon. Stage
migration and new diagnostic techniques as a source of misleading statistics for sur-
vival in cancer. New England Journal of Medicine, 312, 1604–1608.
Fisher, R. A. (1935). The design of experiments. New York: Hafner.
———. (1936). Has Mendel’s work been rediscovered? Annals of Science, 1, 115–137.
Fisher, L. D, & Van Belle, G. (1993). Biostatistics. A methodology for the health sciences.
New York: Wiley Interscience.
Fleming, T. R. (2006). Standard versus adaptive monitoring procedures: A commentary.
Statistics in Medicine, 25, 3305–3512; discussion 3313–3314, 3326–3347.
Flom, P. L., & Cassell, D. L. (2007). Stopping stepwise: Why stepwise and similar se-
lection methods are bad, and what you should use. Paper presented at the NESUG
2007, Baltimore. Accessed January 2, 2017, http://www.lexjansen.com/pnwsug/2008/
DavidCassell-StoppingStepwise.pdf.
Frazier, E. P., Schneider, T., & Michel, M. C. (2006). Effects of gender, age and hypertension
on beta-adrenergic receptor function in rat urinary bladder. Naunyn-Schmiedeberg’s
Archives of Pharmacology, 373, 300–309.
Freedman, D. (1983). A note on screening regression equations. American Statistician, 37,
152–155.
———. (2007). Statistics. 4th ed. New York: Norton.
Freese, J. (2008). The problem of predictive promiscuity in deductive applications of
evolutionary reasoning to intergenerational transfers: Three cautionary tales. In
Intergenerational caregiving, ed. A. Booth, A. C. Crouter, S. M. Bianchi, & J. A. Selt-
zer. Rowman & Littlefield Publishers, Inc
Fung, K. (2011). The statistics of anti-doping. Big Data, Plainly Spoken (blog), Accessed
December 9, 2012, at http://www.junkcharts.typepad.com/numbersruleyourworld/
2011/03/the-statistics-of-anti-doping.html.
Gabriel, S. E., O’Fallon, W. M., Kurland, L. T., Beard, C. M., Woods, J. E., & Melton, L. J.
(1994). Risk of connective-tissue diseases and other disorders after breast implanta-
tion. New England Journal of Medicine, 330, 1697–1702.
García-Berthou, E., & Alcaraz, C. (2004). Incongruence between test statistics and P values
in medical papers. BMC Medical Research Methodology, 4, 13–18.
Garnæs, K. K., Mørkved, S., Salvesen, Ø., & Moholdt, T. (2016). Exercise training and
weight gain in obese pregnant women: A randomized controlled trial (ETIP Trial).
PLoS Medicine, 13(7), e1002079–18. doi: 10.1371/journal.pmed.1002079
Gelman, A. (1998). Some class-participation demonstrations for decision theory and
Bayesian statistics. American Statistician, 52, 167–174.
———. (2009). How does statistical analysis differ when analyzing the entire popula-
tion rather than a sample? Statistical Modeling, Causal Inference, and Social Science
(blog), July 3. Accessed December 11, 2012, http://www.andrewgelman.com/2009/07/
how_does_statis/.
———. (2010). Instead of “confidence interval,” let’s say “uncertainty interval.” Accessed
November 5, 2016, andrewgelman.com/2010/12/21/lets_say_uncert/.
———. (2012). Overfitting. Statistical Modeling, Causal Inference, and Social Science
(blog), July 27. Accessed November 15, 2012, http://www.andrewgelman.com/2012/
07/27/15864/.
———. (2013). Don’t let your standard errors drive your research agenda. Statistical
Modeling, Causal Inference, and Social Science (blog), February 1. Accessed Febru-
ary 8, 2013, http://www.andrewgelman.com/2013/02/dont-let-your-standard-errors-
drive-your-research-agenda/.
———. (2015)The feather, the bathroom scale, and the kangaroo. Statistical Modeling, Causal
Inference, and Social Science (blog).AccessedAugust 2, 2016, http://andrewgelman.com/
2015/04/21/feather-bathroom-scale-kangaroo/.
Gelman, A., & Feller, A. (2012) Red versus blue in a new light. New York Times,
September 12. Accessed January 2013, http://www.campaignstops.blogs.nytimes.
com/2012/11/12/ red-versus-blue-in-a-new-light/.
Gelman, A., & Carlin, J. (2014). Beyond power calculations assessing Type S (sign) and
Type M (magnitude) errors. Perspectives on Psychological Science, 9, 641–51.
Gelman, A., & Loken, E. (2014). The statistical crisis in science: Data-dependent
analysis—a “garden of forking paths”— explains why many statistically significant
comparisons don’t hold up. American Scientist, 102, 460. doi: 10.1511/2014.111.460
Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant”
is not itself statistically significant. American Statistician, 60, 328–331.
Gelman, A., & Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single
and multiple comparison procedures. Computational Statistics, 15, 373–390.
Gelman, A., & Weakliem, D. (2009). Of beauty, sex and power: Statistical challenges in
estimating small effects. American Scientist, 97, 310–311.
Gigerenzer, G. (2002). Calculated risks. New York: Simon and Schuster.
Glantz, S. A., Slinker, B. K., & Neilands, T. B. (2016). Primer of applied regression and
analysis of variance. 2d ed. New York: McGraw-Hill.
Glickman, M. E., Rao, S. R., & Schultz, M. R. (2014). False discovery rate control is a
recommended alternative to Bonferroni-type adjustments in health studies. Journal of
Clinical Epidemiology, 67, 850–857.
Goddard, S. (2008). Is the earth getting warmer, or cooler? The Register, May 2. Accessed June
13, 2008, http://www.theregister.co.uk/2008/05/02/a_tale_of_two_thermometers/.
Goldacre, B. (2013) Bad pharma. London: Faber & Faber.
Goldacre, B., Drysdale, H., Dale, A., Hartley, P., Milosevic, I., Slade, E., Mahtani, K.,
Heneghan, C., & Powell-Smith, A. (2016). Tracking switched outcomes in clinical
trials. Centre for Evidence Based Medicine Outcome Monitoring Project (COMPare).
http://www.COMPare-trials.org.
Goldstein, D. (2006). The difference between significant and not significant is not statisti-
cally significant. Decision Science News. Accessed February 8, 2013, http://www.
decisionsciencenews.com/2006/12/06/the-difference-between-significant-and-not-
significant-is-not-statistically-significant/.
Good, P. I., & Hardin, J. W. (2006). Common errors in statistics (and how to avoid them).
Hoboken, NJ: Wiley.
Goodman, S. (2008). A dirty dozen: Twelve p-value misconceptions. Seminars in
Hematology, 45, 135–140.
Goodman, S. N. (1992). A comment on replication, p-values and evidence. Statistics in
Medicine, 11, 875–879.
Goodman, S. N. (2016). Aligning statistical and scientific reasoning. Science, 352, 1180–1181.
Goodwin, P. (2010). Why hindsight can damage foresight. Foresight, 17, 5–7.
Gotzsche, P. C. (2006). Believability of relative risks and odds ratios in abstracts: Cross
sectional study. BMJ, 333. doi: 10.1136/bmj.38895.410451.79.
Gould, S. J. (1997). Full house: The spread of excellence from Plato to Darwin. New York:
Three Rivers.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., &
Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A
guide to misinterpretations. European Journal of Epidemiology, 31, 337–350.
Greenland, S., Thomas, D. C., & Morgenstern, H. (1986). The rare-disease assumption
revisited. A critique of “estimators of relative risk for case-control studies.” American
Journal of Epidemiology, 124, 869–883.
Grobbee, D. E. & Hoes, A. W. (2015). Clinical epidemiology. 2d ed. Burlington, MA: Jones
and Bartlett.
Hankins, M. (2013). Still not significant. Psychologically Flawed (blog). April 21.
Accessed June 28, 2013, http://www.mchankins.wordpress.com/2013/04/21/still-not-
significant-2/.
Hanley, J. A., & Lippman-Hand, A. (1983). If nothing goes wrong, is everything alright?
Journal of the American Medical Association, 259, 1743–1745.
Harmonic mean. (2017). Wikipedia. Accessed Jan 20, 2017 at https://en.wikipedia.org/
wiki/Harmonic_mean.
Harrell, F. (2015). Regression modeling strategies: With applications to linear models,
logistic and ordinal regression, and survival analysis. 2d ed. Cham, Switzerland:
Springer International.
Harter, H. L. (1984). Another look at plotting positions. Communications in Statistics:
Theory and Methods, 13, 1613–1633.
Hartung, J. (2005). Statistics: When to suspect a false negative inference. In American
Society of Anesthesiology 56th annual meeting refresher course lectures, Lecture 377,
1–7. Philadelphia: Lippincott.
Heal, C. F., Buettner, P. G., Cruickshank, R., & Graham, D. (2009). Does single application
of topical chloramphenicol to high risk sutured wounds reduce incidence of wound in-
fection after minor surgery? Prospective randomised placebo controlled double blind
trial. BMJ, 338, 211–214.
Henderson, B. (2005). Open letter to Kansas School Board. Church of the Flying Spaghetti
Monster (blog). Accessed December 8, 2012, http://www.venganza.org/about/
open-letter/.
Hetland, M. L., Haarbo, J., Christiansen, C., & Larsen, T. (1993). Running induces men-
strual disturbances but bone mass is unaffected, except in amenorrheic women.
American Journal of Medicine, 95, 53–60.
Hintze, J. L., & Nelson, R. D. (1998). Violin plots: A box plot-density trace synergism.
American Statistician, 52, 181–184.
Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power.
Calculations for data analysis. American Statistician, 55, 1–6.
Hollis, S., & Campbell, F. (1999). What is meant by intention to treat analysis? Survey of
published randomized controlled trials. BMJ, 319, 670–674.
HPS2-THRIVE Collaborative Group. (2014). Effects of extended-release niacin with laro-
piprant in high-risk patients. New England Journal of Medicine, 371, 203–12.
Hsu, J. (1996). Multiple comparisons: Theory and methods. Boca Raton, FL: Chapman &
Hall/CRC.
Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F., & Baker, C. I. (2009). Circular
analysis in systems neuroscience: The dangers of double dipping. Nature Neuroscience,
12, 535–540.
Kruschke, J. K. (2015), Doing Bayesian data analysis: A tutorial with R and BUGS. 2d ed.
New York: Academic/Elsevier.
Kuehn, B. (2006). Industry, FDA warm to “adaptive” trials. Journal of the American
Medical Association, 296, 1955–1957.
Lakens, D.L. (2017). Equivalence Tests. Social Psychological and Personality Science 5: in
press. Preprint at: http://dx.doi.org/10.1177%2F1948550617697177
Lamb, E. L. (2012). 5 sigma: What’s that? Scientific American (blog), July 17. Accessed
November 16, 2012, http://www.blogs.scientificamerican.com/observations/2012/07/17/
five-sigmawhats-that/.
Lanzante, J. R. (2005). A cautionary note on the use of error bars. Journal of Climate, 18,
3699–3703.
Larsson, S. C., & Wolk, A. (2006). Meat consumption and risk of colorectal cancer: A
meta-analysis of prospective studies. International Journal of Cancer, 119, 2657–2664.
Lazzeroni, L. C., Lu, Y., & Belitskaya-Lévy, I. (2014). P-values in genomics: Apparent
precision masks high uncertainty. Molecular Psychiatry, 19, 1336–40.
Lazic, S. E. (2010). The problem of pseudoreplication in neuroscientific studies: Is it af-
fecting your analysis? BMC Neuroscience, 11(5). doi: 10.1186/1471-2202-11-5
Laupacis, A., Sackett, D. L., & Roberts, R. S. (1988). An assessment of clinically useful
measures of the consequences of treatment. New England Journal of Medicine, 318,
1728–1733.
Lee, K. L., McNeer, J. F., Starmer, C. F., Harris, P. J., & Rosati, R. A. (1980). Clinical
judgment and statistics. Lessons from a simulated randomized trial in coronary artery
disease. Circulation, 61, 508–515.
Lehman, E. (2007). Nonparametrics: Statistical methods based on ranks. New York:
Springer.
Lenth, R. V. (2001). Some practical guidelines for effective sample size determination.
American Statistician, 55, 187–193.
Levine, M., & Ensom, M. H. (2001). Post hoc power analysis: An idea whose time has
passed? Pharmacotherapy, 21, 405–409.
Levins, R. (1966). The strategy of model building in population biology. American
Scientist, 54, 421–431.
Lewin, T. (2008). Math scores show no gap for girls, study finds. New York Times, July 25.
Accessed July 26, 2008, http://www.nytimes.com/2008/07/25/education/25math.html.
Limpert, E., Stahel, W. A., & Abbt, M. (2001). Log-normal distributions across the sci-
ences: Keys and clues. BioScience, 51, 341–352.
Limpert, E., and Stahel, W.A. (2011). Problems with Using the Normal Distribution–and
Ways to Improve Quality and Efficiency of Data Analysis. PLoS ONE 6: e21403–8.
Lucas, M. E. S., Deen, J. L., von Seidlein, L., Wang, X., Ampuero, J., Puri, M., Ali, M.,
et al. (2005). Effectiveness of mass oral cholera vaccination in Beira, Mozambique.
New England Journal of Medicine, 352, 757–767.
Ludbrook, L., & Lew, M. J. (2009). Estimating the risk of rare complications: Is the “rule
of three” good enough? Australian and New Zealand Journal of Surgery, 79, 565–570.
Mackowiak, P. A., Wasserman, S. S., & Levine, M. M. (1992). A critical appraisal of 98.6
degrees F, the upper limit of the normal body temperature, and other legacies of Carl
Reinhold August Wunderlich. Journal of the American Medical Association, 268,
1578–1580.
Macrae, D., Grieve, R., Allen, E., Sadique, Z., Morris, K., Pappachan, J., Parslow, R., et al.
(2014). A randomized trial of hyperglycemic control in pediatric intensive care. New
England Journal of Medicine, 370, 107–118.
Manly, B. F. J. (2006). Randomization, bootstrap and Monte Carlo methods in biology, 3d
ed. London: Chapman & Hall/CRC.
Marsupial, D. (2013). Comment on: Interpretation of p-value in hypothesis testing.
Cross Validated, January 3. Accessed January 5, 2013, at stats.stackexchange.com/
questions/46856/interpretation-of-p-value-in-hypothesis-testing.
Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just below
.05. Quarterly Journal of Experimental Psychology, 65, 2271–2279.
Mathews, P. (2010). Sample size calculations: Practical methods for engineers and scien-
tists. Mathews, Malnar and Bailey. ISBN 978-0-615-32461-6
McCullough, B. D., & Heiser, D. A. (2008). On the accuracy of statistical procedures in
Microsoft Excel 2007. Computational Statistics and Data Analysis, 52, 4570–4578.
McCullough, B. D., & Wilson, B. (2005). On the accuracy of statistical procedures in
Microsoft Excel 2003. Computational Statistics and Data Analysis, 49, 1244–1252.
Mendel, G. J. (1865). Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschen-
den Vereines in Brünn, 4, 3–47.
Messerli, F. H. (2012). Chocolate consumption, cognitive function, and Nobel laureates.
New England Journal of Medicine, 367, 1562–1564.
Metcalfe, C. (2011). Holding onto power: Why confidence intervals are not (usually) the
best basis for sample size calculations. Trials, 12(Suppl. 1), A101. doi: 10.1186/1745-
6215-12-S1-A101.
Meyers, M. A. (2007). Happy accidents: Serendipity in modern medical breakthroughs.
New York: Arcade.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures.
Psychological Bulletin, 105, 156–166.
Microsoft Corporation. (2006). Description of improvements in the statistical functions in
Excel 2003 and in Excel 2004 for Mac. Accessed December 16, 2008, http://www.
support.microsoft.com/kb/828888.
Mills, J. L. (1993). Data torturing. New England Journal of Medicine, 329, 1196–1199.
Montori, V. M., & Guyatt, G. H. (2001). Intention-to-treat principle. Canadian Medical
Association Journal, 165, 1339–1341.
Montori, V. M., Kleinbart, J., Newman, T. B., Keitz, S., Wyer, P. C., Moyer, V., & Guyatt, G.
(2004). Tips for learners of evidence-based medicine: 2. Measures of precision (con-
fidence intervals). Canadian Medical Association Journal, 171, 611–615.
Moser, B. K., & Stevens, G. R. (1992). Homogeneity of variance in the two-sample means
test. American Statistician, 46, 19–21.
Motulsky, H. J. (2015). Essential biostatistics. New York: Oxford University Press.
Motulsky, H. J. (2016). GraphPad curve fitting guide. Accessed July 1, 2016, http://www.
graphpad.com/guides/prism/7/curve-fitting/.
Motulsky, H. J., & Brown, R. E. (2006). Detecting outliers when fitting data with nonlinear
regression: A new method based on robust nonlinear regression and the false discov-
ery rate. BMC Bioinformatics, 7(123). doi: 10.1186/1471-2105-7-123
Motulsky, H., & Christopoulos, A. (2004). Fitting models to biological data using linear
and nonlinear regression: A practical guide to curve fitting. New York: Oxford
University Press.
Motulsky, H. J., O’Connor, D. T., & Insel, P. A. (1983). Platelet alpha 2-adrenergic recep-
tors in treated and untreated essential hypertension. Clinical Science, 64, 265–272.
Moyé, L. A., & Tita, A. T. N. (2002). Defending the rationale for the two-tailed test in clini-
cal research. Circulation, 105, 3062–3065.
Munger, K. L, Levin, L. I., Massa, J., Horst, R., Orban, T., and Ascherio, A. (2013).
Preclinical serum 25-hydroxyvitamin D levels and risk of Type 1 diabetes in a cohort
of US military personnel. American Journal of Epidemiology, 177, 411–419.
NASA blows millions on flawed airline safety survey. (2007). New Scientist, November
10. Accessed May 25, 2008, http://www.newscientist.com/channel/opinion/
mg19626293.900-nasa-blows-millions-on-flawed-airline-safety-survey.html.
Newport, F. (2012). Romney has support among lowest income voters. Gallup,
September 12. http://www.gallup.com/poll/157508/romney-support-among-lowest-
income-voters. aspx.
Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of
interactions in neuroscience: A problem of significance. Nature Neuroscience, 14,
1105–1107.
Omenn, G. S., Goodman, G. E., Thornquist, M. D., Balmes, J., Cullen, M. R., Glass, A.,
Keogh, J. P., et al. (1996). Risk factors for lung cancer and for intervention effects
in CARET, the Beta-Carotene and Retinol Efficacy Trial. Journal of the National
Cancer Institute, 88, 1550–1559.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological sci-
ence. Science, 349. doi: 10.1126/science.aac4716.
Oremus, W. (2013). The wedding industry’s pricey little secret. Slate, June 12. Accessed
June 13, 2013, http://www.slate.me/1a4KLuH.
Pace, L. A. (2008). The Excel 2007 data and statistics cookbook. 2d ed. Anderson, SC:
TwoPaces.
Parker, R. A., & Berman, N. G. (2003). Sample size: More than calculations. American
Statistician, 57, 166–170.
Paulos, J. A. (2008). Irreligion: A mathematician explains why the arguments for God just
don’t add up. New York: Hill and Wang.
Payton, M. E., Greenstone, M. H., & Schenker, N. (2003). Overlapping confidence in-
tervals or standard error intervals: What do they mean in terms of statistical signifi-
cance? Journal of Insect Science, 3, 34–40.
Pielke, R. (2008). Forecast verification for climate science, part 3. Prometheus, January 9.
Accessed April 20, 2008, at sciencepolicy.colorado.edu/prometheus/archives/climate_
change/001315forecast_verificatio.html.
Pocock, S. J., & Stone, G. W. (2016a). The primary outcome fails: What next? New England
Journal of Medicine, 375, 861–870.
Prinz, F. F., Schlange, T. T., & Asadullah, K. K. (2011). Believe it or not: How much can
we rely on published data on potential drug targets? Nature Reviews Drug Discovery,
10(712). doi: 10.1038/nrd3439-c1
Price, A. L., Zaitlen, N. A., Reich, D., & Patterson, N. (2010). New approaches to popula-
tion stratification in genome-wide association studies. Nature Reviews Genetics, 11,
459–463.
Pukelsheim, F. (1990). Robustness of statistical gossip and the Antarctic ozone hole. IMS
Bulletin, 4, 540–545.
Pullan, R. D., Rhodes, J., Gatesh, S., Mani, V., Morris, J. S., Williams, G. T., Newcombe,
R. G., et al. (1994). Transdermal nicotine for active ulcerative colitis. New England
Journal of Medicine, 330, 811–815.
Ridker, P. M., Danielson, E., Fonseca, F. A. H., Genest, J., Gotto, A. M., Jr., Kastelein,
J. J. P., Koening, W., et al. (2008). Rosuvastatin to prevent vascular events in men
and women with elevated C-reactive protein. New England Journal of Medicine, 359,
2195–2207.
Riley, R.D., Lambert, P.C., and Abo-Zaid, G. (2010). Meta-analysis of individual partici-
pant data: rationale, conduct, and reporting. BMJ, 340, c221–c228.
Roberts, S. (2004). Self-experimentation as a source of new ideas: Ten examples about
sleep, mood, health, and weight. Behavioral and Brain Sciences, 27, 227–262; discus-
sion 262–287.
Robinson, A. (2008) icebreakeR. Accessed January 26, 2009, from https://cran.r-project.
org/doc/contrib/Robinson-icebreaker.pdf.
Rosman, N. P., Colton, T., Labazzo, J., Gilbert, P. L., Gardella, N. B., Kaye, E. M., Van
Bennekom, C., & Winter, M. R. (1993). A controlled trial of diazepam administered
during febrile illnesses to prevent recurrence of febrile seizures. New England Journal
of Medicine, 329, 79–84.
Rothman, K. J., Greenland, S., & Lash, T. L. (2008). Modern epidemiology. 3d ed.
Philadelphia: Wolters Kluwer.
Rothman, K. J. (1990). No adjustments are needed for multiple comparisons. Epidemiol-
ogy, 1, 43–46.
Rothman, K. J. (2016). Disengaging from statistical significance. European Journal of
Epidemiology, 31, 443–444.
Russo, J. E., & Schoemaker, P. J. H. (1989). Decision traps. The ten barriers to brilliant
decision-making and how to overcome them. New York: Simon & Schuster.
Ruxton, G.D. (2006). The unequal variance t-test is an underused alternative to Student’s
t-test and the Mann-Whitney U test. Behavioral Ecology 17: 688–690.
Samaniego, F. J. A (2008). Conversation with Myles Hollander. Statistical Science, 23,
420–438.
Sawilowsky, S. S. (2005). Misconceptions leading to choosing the t test over the Wilcoxon
Mann–Whitney test for shift in location parameter. Journal of Modern Applied Statis-
tical Methods, 4, 598–600.
Schmidt, M., Christiansen, C. F., Mehnert, F., Rothman, K. J., & Sorensen, H. T. (2011).
Non-steroidal anti-inflammatory drug use and risk of atrial fibrillation or flutter: pop-
ulation based case-control study. BMJ, 343. doi: 10.1136/bmj.d3450
Schmidt, M., & Rothman, K. J. (2014). Mistaken inference caused by reliance on and
misinterpretation of a significance test. International Journal of Cardiology, 177,
1089–1090.
Schoemaker, A. L. (1996). What’s normal? Temperature, gender, and heart rate. Journal of
Statistics Education, 4(2). Accessed May 5, 2007, http://www.amstat.org/publications/
jse/v4n2/datasets.shoemaker.html.
Seaman, M. A., Levin, J. R., & Serlin, R. C. (1991). New developments in pairwise mul-
tiple comparisons: Some powerful and practicable procedures. Psychological Bulletin,
110, 577–586.
Sedgwick, P. (2015). How to read a funnel plot in a meta-analysis. BMJ, 351, h4718.
Senn, S. (2003). Disappointing dichotomies. Pharmaceutical Statistics, 2, 239–240.
Sheskin, D. J. (2011). Handbook of parametric and nonparametric statistical procedures,
5th ed. New York: Chapman & Hall/CRC.
Shettles, L. B. (1996). How to choose the sex of your baby. Monroe, WA: Main Street
Books.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology:
Undisclosed flexibility in data collection and analysis allows presenting anything as
significant. Psychological Science, 22, 1359–1366.
———. (2013). P-curve: A key to the file drawer. Journal of Experimental Psychology:
General, 143, 534–547
Simon, S. (2005). Stats: Standard deviation versus standard error. PMean (blog), May 16.
Accessed December 20, 2012, http://www.pmean.com/05/StandardError.html.
Simonsohn, U. (2016) Evaluating replications: 40% full ≠ 60% empty. Data Colada (blog),
March 3. Accessed May 8, 2016, http://datacolada.org/47.
Snapinn, S. M. (2000). Noninferiority trials. Current Controlled Trials in Cardiovascular
Medicine, 1, 19–21.
Sparling, B. (2001). Ozone history. NASA Advanced Supercomputing Division, May 30.
Accessed June 13, 2008, http://www.nas.nasa.gov/About/Education/Ozone/history.html
Spector, R., & Vesell, E. S. (2006a). The heart of drug discovery and development: Rational
target selection. Pharmacology, 77, 85–92.
———. (2006b). Pharmacology and statistics: Recommendations to strengthen a produc-
tive partnership. Pharmacology, 78, 113–122.
Squire, P. (1988). Why the 1936 Literary Digest poll failed. Public Opinion Quarterly, 52,
125–133.
Staessen, J. A., Lauwerys, R. R., Buchet, J. P., Bulpitt, C. J., Rondia, D., Vanrenterghem, Y.,
Amery, A., & the Cadmibel Study Group. (1992). Impairment of renal function with
increasing blood lead concentrations in the general population. New England Journal
of Medicine, 327, 151–156.
Statwing. (2012). The ecological fallacy. Statwing (blog), December 20. Accessed
February 8, 2013, at blog.statwing.com/the-ecological-fallacy/.
Svensson, S., Menkes, D. B., & Lexchin, J. (2013). Surrogate outcomes in clinical trials: a
cautionary tale. JAMA Internal Medicine,173, 611–612.
Thavendiranathan, P., & Bagai, A. (2006). Primary prevention of cardiovascular diseases
with statin therapy: A meta-analysis of randomized controlled trials. Archives of Inter-
nal Medicine, 166, 2307–2313.
Taubes, G. (1995). Epidemiology faces its limits. Science, 269, 164–169.
Thomas, D., Radji, S., and Benedetti, A. (2014). Systematic review of methods for in-
dividual patient data meta- analysis with binary outcomes. BMC Medical Research
Methodology 14: 79.
Thun, M. J., & Sinks, T. (2004). Understanding cancer clusters. CA: A Cancer Journal for
Clinicians, 54, 273–280.
Tierney, J. (2008). A spot check of global warming. New York Times, January 10. Accessed
April 20, 2008, at https://nyti.ms/2pLmpTT.
Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective
publication of antidepressant trials and its influence on apparent efficacy. New England
Journal of Medicine, 358, 252–260.
Twain, M. (1883). Life on the Mississippi. London: Chatto & Windus.
U.S. Census (2011). Accessed January 30, 2012, http://www.census.gov/hhes/www/
income/data/historical/household/2011/H08_2011.xls.
U.S. Food and Drug Administration. (2012). Orange book preface. Approved drug products
with therapeutic equivalence evaluations. Accessed December 23, 2012, http://www.
fda.gov/Drugs/DevelopmentApprovalProcess/ucm079068.htm.
Van Belle, G. (2008). Statistical rules of thumb. 2d ed. New York: Wiley Interscience.
Vandenbroucke, J. P., & Pearce, N. (2012). Case-control studies: basic concepts.
International Journal of Epidemiology, 41, 1480–1489.
Vaux, D. L., Fidler, F., & Cumming, G. (2012). Replicates and repeats: What is the differ-
ence and is it significant? EMBO Reports, 13, 291–296.
546 REFERENCES
Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are
misleading. American Statistician, 47, 65–72.
Vera-Badillo, F. E., Shapiro, R., Ocana, A., Amir, E., & Tannock, I. F. (2013). Bias in re-
porting of end points of efficacy and toxicity in randomized, clinical trials for women
with breast cancer. Annals of Oncology, 24, 1238–1244.
Vickers, A. J. (2006). Shoot first and ask questions later: How to approach statistics like a
real clinician. Medscape Business of Medicine, 7(2). Accessed June 19, 2009, http://
www.medscape.com/viewarticle/540898.
———. (2010). What is a P-value anyway? Boston: Addison-Wesley.
Vittinghoff, E., Glidden, D. V., Shiboski, S. C., & McCulloch, C. E. (2007). Regression
methods in biostatistics: Linear, logistic, survival, and repeated measures models.
New York: Springer.
von Hippel, P.T. (2005). Mean, median, and skew: Correcting a textbook rule. Journal of
Statistics Education, 13(2). ww2.amstat.org/publications/jse/v13n2/vonhippel.html.
Vos Savant, M. (1997). The power of logical thinking: Easy lessons in the art of reasoning . . .
and hard facts about its absence in our lives. New York: St. Martin’s Griffin.
Walker, E., & Nowacki, A. S. (2010). Understanding equivalence and noninferiority test-
ing. Journal of General Internal Medicine, 26, 192–196.
Wallach J. D., Sullivan P. G. , Trepanowski J. F. , Sainani K. L. , Steyerberg E. W., Ioannidis
J. P. A. (2017). Evaluation of Evidence of Statistical Support and Corroboration of
Subgroup Claims in Randomized Clinical Trials. JAMA Intern Med.177(4):554–560.
Walsh, M., Srinathan, S. K., McAuley, D. F., Mrkobrada, M., Levine, O., Ribic, C., Molnar,
A. O., et al. (2014). The statistical significance of randomized controlled trial results
is frequently fragile: A case for a fragility index. Journal of Clinical Epidemiology,
67, 622–628.
Wasserman, L. (2012). P values gone wild and multiscale madness. Normal Deviate
(blog), August 16. Accessed December 10, 2012, http://normaldeviate.wordpress.
com/2012/08/16/p-values-gone-wild-and-multiscale-madness/.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: context, pro-
cess, and purpose. American Statistician, 70, 129–133.
Wellek, S. (2002). Testing statistical hypotheses of equivalence. Boca Raton, FL: Chapman
& Hall/CRC.
Westfall, P. H. (2014). Kurtosis as Peakedness, 1905–2014. R.I.P. American Statistician,
68, 191–195.
Westfall, P., Tobias, R., Rom, D., Wolfinger, R., & Hochberg, Y. (1999) Multiple compari-
sons and multiple tests using the SAS system. Cary, NC: SAS.
Wilcox, R. R. (2001). Fundamentals of modern statistical methods: Substantially improv-
ing power and accuracy. New York: Springer-Verlag.
———. (2010). Fundamentals of modern statistical methods: Substantially improving
power and accuracy. 2d ed. New York: Springer.
Wilcox, A. J., Weinberg, C. R., & Baird, D. D. (1995). Timing of sexual intercourse in rela-
tion to ovulation. Effects on the probability of conception, survival of the pregnancy,
and sex of the baby. New England Journal of Medicine, 333, 1517–1521.
Winston, W. (2004). Introduction to optimization with the Excel Solver tool. Accessed March
6, 2013, at https://support.office.com/en-us/article/An-introduction-to-optimization-
with-the-Excel-Solver-tool-1F178A70-8E8D-41C8-8A16-44A97CE99F60%20
#office.
Wolff, A. (2002). That old black magic. Sports Illustrated, January 21. Accessed May 25,
2008, http://sportsillustrated.cnn.com/2003/magazine/08/27/jinx/.
Index

  paired t test results from, 309, 309t
  unpaired t test results from, 294, 296f
GraphPad QuickCalcs, 521, 521f
groups
  comparing, 157–164
  comparing two paired groups, 433–434
  comparing two unpaired groups, 431–433, 432f, 432t
  defining, 286–287
  multiple definitions of, 218–219
  pairwise comparisons between means, 422, 422t, 423f
Grubbs outlier test, 235

Hall, Monty, 6–7
HARK (hypothesizing after results are known), 218, 468–469, 470f
harmonic mean, 65
hazard ratio, 287–288
hazards, 287–288, 400
  proportional, 287–288, 287f, 393, 396t, 399–402
hierarchical studies, 390, 390t
Hill slope, 371
hindsight bias, 218
histograms, 69, 69f
HIV testing (example), 445–446, 446t
Hollander, Myles, 12, 463
Holm’s test, 425
Holm–Sidak test, 425
homoscedasticity, 297–298, 336
hospital-based case-control studies, 282
Humpty Dumpty, xxvi
hyperglycemia, 163
hyperglycemic control, tight, 163
hypoglycemia, 163
hypotheses
  HARK (hypothesizing after results are known), 218, 468–469, 470f
  multiple, 218
  vampirical, 469
hypothesis testing, xxviii, 145–156
  CIs and, 157–159
  common mistakes with, 152–154
  equivalence testing, 196–197, 196f, 197
  interpreting tests, 151, 152t
  questions answered by, 151, 152t
  traps to avoid, 468–469

ignorance: quantifying, 16–17
I-knew-it-all-along effect, 218
impossible values, 71
incidence, 278
income studies (example), 476–477, 476f–477f
independent measurements, 84
independent observations, 35, 103, 267, 321, 386, 401
independent subjects, 51
independent variables, 351, 379
  assumptions, 371
  common mistakes with, 391
  interactions among, 389–390
  multiple, 515
  P values for, 399
  regression with one independent variable, 514
individual participant or patient data (IPD), 456–457
inflammatory bowel disease (IBD), 275
insulin sensitivity: equation for, 331
intention to treat (ITT), 47, 291–292
interactions, 403
  among independent variables, 389–390
  assumptions about, 386, 401
intercept, 333
interpreting data, 467
interquartile range, 66, 85–86
intertwined comparisons, 423
interval estimates, 38, 351
interval variables, 75–77, 76f, 77t
intuition, 3, 6
isotretinoin, 275
ITT (intention to treat), 291–292

journalists, 146
jurors, 145–146
jury trials, 150

Kaplan–Meier method, 48, 48t–49t
Kaplan–Meier survival curves, 284, 285f