Modeling Count Data

Contents

Preface
Chapter 1   Varieties of Count Data
Chapter 7   Problems with Zeros
    Some Points of Discussion
    7.1   Counts without Zeros: Zero-Truncated Models
          7.1.1   Zero-Truncated Poisson (ZTP)
          7.1.2   Zero-Truncated Negative Binomial (ZTNB)
          7.1.3   Zero-Truncated Poisson Inverse Gaussian (ZTPIG)
          7.1.4   Zero-Truncated NB-P (ZTNBP)
          7.1.5   Zero-Truncated Poisson Log-Normal (ZTPLN)
          7.1.6   Zero-Truncated Model Summary
    7.2   Two-Part Hurdle Models
          7.2.1   Poisson and Negative Binomial Logit Hurdle Models
          7.2.2   PIG-Logit and Poisson Log-Normal Hurdle Models
          7.2.3   PIG-Poisson Hurdle Model
    7.3   Zero-Inflated Mixture Models
          7.3.1   Overview and Guidelines
          7.3.2   Fit Tests for Zero-Inflated Models
          7.3.3   Fitting Zero-Inflated Models
          7.3.4   Good and Bad Zeros
          7.3.5   Zero-Inflated Poisson (ZIP)
          7.3.6   Zero-Inflated Negative Binomial (ZINB)
          7.3.7   Zero-Inflated Poisson Inverse Gaussian (ZIPIG)
    7.4   Summary: Finding the Optimal Model
Chapter 8   Modeling Underdispersed Count Data: Generalized Poisson
Chapter 9   Complex Data: More Advanced Models
Bibliography
Index
Preface
Modeling Count Data is written for the practicing researcher who has a
reason to analyze and draw sound conclusions from modeling count data.
More specifically, it is written for an analyst who needs to construct a count
response model but is not sure how to proceed.
A count response model is a statistical model for which the dependent, or
response, variable is a count. A count is understood as a nonnegative discrete
integer ranging from zero to some specified greater number. This book aims
to be a clear and understandable guide to the following points:
- How to recognize the characteristics of count data
- Understanding the assumptions on which a count model is based
- Determining whether data violate these assumptions (e.g., overdispersion)
There is indeed a lot to consider when selecting the best-fitted model for
your data. I will do my best in these pages to clarify the foremost concepts
and problems unique to modeling counts. If you follow along carefully, you
should have a good overview of the subject and a basic working knowledge
needed for constructing an appropriate model for your study data. I focus
on understanding the nature of the most commonly used count models and
address: http://works.bepress.com/joseph_hilbe/. The book's page with Cambridge University Press is at www.cambridge.org/9781107611252.
The data files used for examples in the book are real data. The rwm1984 and
medpar data sets are used extensively throughout the book. Other data
sets used include titanic, heart, azcabgptca, smoking, fishing,
fasttrakg, rwm5yr, nuts, and azprocedure. The data are defined
where first used. The medpar, rwm5yr, and titanic data are used more
than other data in the book. The medpar data are from the 1991 Arizona
Medicare files for two diagnostic groups related to cardiovascular procedures.
I prepared medpar in 1993 for use in workshops I gave at the time. The
rwm5yr data consist of 19,609 observations from the German Health Reform
data covering the five-year period of 1984–1988. Not all patients were in the
study for all five years. The count response is the number of visits made by
a patient to the doctor during that calendar year. The rwm1984 data were
created from rwm5yr, with only data from 1984 included: one patient, one
observation. The well-known titanic data set is from the 1912 Titanic ship
disaster survival data. It is in grouped format with survived as the response.
The predictors are age (adult vs. child), gender (male vs. female), and class
(1st-, 2nd-, and 3rd-class passengers). Crew members have been excluded.
I advise the reader that there are parts of Chapter 3 that use or adapt
text from the first edition of Negative Binomial Regression (Hilbe 2007a),
which is now out of print, as it was superseded by Hilbe (2011). Chapter 2
incorporates two tables that were also used in the first edition. I received very
good feedback regarding these sections and found no reason to change them
for this book. Now that the original book is out of print, these sections would
be otherwise lost.
I wish to acknowledge five eminent colleagues and friends in the truest
sense who in various ways have substantially contributed to this book, either
indirectly while working together on other projects or directly: James Hardin,
director of the Biostatistics Collaborative Unit and professor, Department of
Statistics and Epidemiology, University of South Carolina School of Medicine;
Andrew Robinson, director, Australian Centre of Excellence for Risk Analysis (ACERA), Department of Mathematics and Statistics, University of Melbourne, Australia; Alain Zuur, senior statistician and director of Highland
Statistics Ltd., UK; Peter Bruce, CEO, Institute for Statistics Education (Statistics.com); and John Nelder, late Emeritus Professor of Statistics, Imperial
College, UK. John passed away in 2010, just shy of his eighty-sixth birthday;
our many discussions over a 20-year period are sorely missed. He definitely
CHAPTER 1
Varieties of Count Data
- What are the parameters of a statistical model? Where do they come from,
y = β0 + β1X1 + β2X2 + ··· + βnXn + ε          (1.2)
Statisticians usually convert equation (1.2) to one that has the left-hand side
being the predicted or expected mean value of the response, based on the sum
of the predictors and coefficients. Each associated coefficient and predictor is
called a regression term:
ŷ = β0 + β1X1 + β2X2 + ··· + βnXn          (1.3)

or

μ̂ = β0 + β1X1 + β2X2 + ··· + βnXn          (1.4)
Notice that the error became part of the expected or predicted mean response.
The hat over y and μ (mu) indicates that this is an estimated value. From
this point on, I use the symbol μ to refer to the predicted value, without a hat.
Understand, though, that when we are estimating a parameter or a statistic,
a hat should go over it. The true unknown parameter, on the other hand, has
no hat. You will also at times see the term E(y) used to mean estimated y. I
will not use it here.
In matrix form, where the individual terms of the regression are expressed
in a single term, we have
μ = Xβ          (1.5)
Let's look at example data (smoking). Suppose that we have a six-observation model consisting of the following variables:
sbp:      systolic blood pressure (the response)
male:     1 = male; 0 = female
smoker:   1 = smoker; 0 = nonsmoker
age:      age in years
------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------
        male |   4.048601   .2507664    16.14    0.004      2.96964    5.127562
      smoker |   6.927835   .1946711    35.59    0.001     6.090233    7.765437
         age |   .4698085     .02886    16.28    0.004     .3456341     .593983
       _cons |   104.0059   .7751557   134.17    0.000     100.6707    107.3411
------------------------------------------------------------------------
Continuing with Stata, we may obtain the predicted value, μ, which is the
estimated mean systolic blood pressure, and display the predictor values
together with μ (mu) as
. predict mu
. l
// l is an abbreviation for list
     +--------------------------------------+
     | sbp   male   smoker   age         mu |
     |--------------------------------------|
  1. | 131      1        1    34   130.9558 |
  2. | 132      1        1    36   131.8954 |
  3. | 122      1        0    30   122.1488 |
  4. | 119      0        0    32   119.0398 |
  5. | 123      0        1    26   123.1488 |
  6. | 115      0        0    23   114.8115 |
     +--------------------------------------+
To see exactly what this means, we sum the terms of the regression. The
intercept term is also summed, but its values are set at 1. The _b[] term
captures the coefficient from the results saved by the software. For the intercept, _b[_cons] adds the intercept term, slope[1], to the other values. The
term xb is also commonly referred to as the linear predictor.
. gen xb = _b[male]*male + _b[smoker]*smoker + _b[age]*age + _b[_cons]
. l

     +-------------------------------------------------+
     | sbp   male   smoker   age         mu         xb |
     |-------------------------------------------------|
  1. | 131      1        1    34   130.9558   130.9558 |
  2. | 132      1        1    36   131.8954   131.8954 |
  3. | 122      1        0    30   122.1488   122.1488 |
  4. | 119      0        0    32   119.0398   119.0398 |
  5. | 123      0        1    26   123.1488   123.1488 |
  6. | 115      0        0    23   114.8115   114.8115 |
     +-------------------------------------------------+
Using R, we may obtain the same results with the following code:
R CODE
sbp    <- c(131,132,122,119,123,115)
male   <- c(1,1,1,0,0,0)
smoker <- c(1,1,0,0,1,0)
age    <- c(34,36,30,32,26,23)
summary(reg1 <- lm(sbp ~ male + smoker + age))
# results not displayed
As was done with the Stata code, we may calculate the linear predictor, which
is the same as μ, by first abstracting the coefficients
cof <- reg1$coef
cof
(Intercept)        male      smoker         age
104.0058910   4.0486009   6.9278351   0.4698085
and then the linear predictor, xb. Each coefficient can be identified with [ ].
The values are identical to mu.
xb <- cof[1] + cof[2]*male + cof[3]*smoker + cof[4]*age
xb
[1] 130.9558 131.8954 122.1487 119.0398 123.1487 114.8115
Notice the closeness of the observed response and predicted values. The
differences are
mu   <- fitted(reg1)        # fitted values from the linear model, identical to xb
diff <- sbp - mu
diff
          1           2           3           4           5           6
 0.04418262  0.10456554 -0.14874816 -0.03976436 -0.14874816  0.18851252
When the predicted or expected values are close to the observed values of the
response, we call the model well fitted.
[Figure: histogram of systolic blood pressure, with density on the y-axis and sbp ranging from roughly 100 to 200]

For the Poisson distribution the mean, μ, and the variance, σ², are equal.
This criterion of the Poisson distribution is referred to as the equidispersion
criterion. The problem is that when modeling real data, the equidispersion
criterion is rarely satisfied. Analysts usually must adjust their Poisson model
in some way to account for any under- or overdispersion that is in the data.
Overdispersion is by far the foremost problem facing analysts who use Poisson
regression when modeling count data.
I should be clear about the meaning of overdispersion since it is central
to the modeling of count data and therefore plays an important role in this
book. Overdispersion almost always refers to excess variability or correlation
in a Poisson model, but it also needs to be considered when modeling other
count models as well. Keep in mind, however, that when the term overdispersion is used, most analysts are referring to Poisson overdispersion (i.e.,
overdispersion in a Poisson model).
Simply put, Poisson overdispersion occurs in data where the variability of
the data is greater than the mean. Overdispersion also is used to describe data
in a slightly more general sense, as when the observed, or actual, variance of
the count response is greater than the variance of the predicted or expected
counts. This latter type of variance is called expected variance. Again, if the
observed variance of the response is greater than the expected variance, the
data are overdispersed. A model that fails to properly adjust for overdispersed
data is called an overdispersed model. As such, its standard errors are biased
and cannot be trusted. The standard errors associated with model predictors
may appear from the model to significantly contribute to the understanding
of the response, but in fact they may not. Many analysts have been deceived
into thinking that they have developed a well-fitted model.
Unfortunately, statistical software at times fails to provide an analyst with
the information needed to determine if a Poisson model is overdispersed or
underdispersed. We discuss in some detail exactly how we can determine
whether a model is overdispersed. More properly perhaps, this book will provide guidelines to help you decide whether a Poisson model is equidispersed.
Probably the most popular method of dealing with apparent Poisson
overdispersion is to model the data using a negative binomial model. The
negative binomial distribution has an extra parameter, referred to as the
negative binomial dispersion parameter. Some books and articles call the dispersion parameter the heterogeneity parameter or ancillary parameter. These
are appropriate names as well. The dispersion parameter is a measure of the
adjustment needed to accommodate the extra variability, or heterogeneity, in
the data. However, the term dispersion parameter has become the standard
name for the second parameter of the negative binomial distribution.
The negative binomial, which we discuss in more detail later, allows more
flexibility in modeling overdispersed data than does a single-parameter Poisson model. The negative binomial is derived as a Poisson-gamma mixture
model, with the dispersion parameter being distributed as gamma shaped.
The gamma PDF is pliable and allows for a wide variety of shapes. As a
consequence, most overdispersed count data can be appropriately modeled
using a negative binomial regression. The advantage of using the negative
binomial rests with the fact that when the dispersion parameter is zero (0),
the model is Poisson.2 Values of the dispersion parameter greater than zero
indicate that the model has adjusted for correspondingly greater amounts of
2 I term this the direct parameterization of the negative binomial. Unlike most commercial statistical software, R's glm and glm.nb functions employ an inverted
relationship of the dispersion parameter, theta, so that a Poisson model results
when theta approaches infinity. Most subsequent R functions have followed glm
and glm.nb. I maintain the direct relationship for all count models in this volume
and discuss the differences between the two parameterizations in some detail later
in the book.
Model                         Mean     Variance
Poisson                       μ        μ
Negative binomial (NB1)       μ        μ(1 + α) = μ + αμ
Negative binomial (NB2)       μ        μ(1 + αμ) = μ + αμ^2
Poisson inverse Gaussian      μ        μ(1 + αμ^2) = μ + αμ^3
Negative binomial-P           μ        μ(1 + αμ^(ρ−1)) = μ + αμ^ρ
Generalized Poisson           μ        μ(1 + αμ)^2 = μ + 2αμ^2 + α^2μ^3
[Figures: comparative probability distributions of the Poisson, NB1, NB2, PIG, and generalized Poisson models]

Care must be taken when interpreting figures such as these to assure that the entire distribution is being displayed or that a notation is provided when the distribution
appears to be truncated but is not. I advise checking the actual range of values
that are expected for a given distribution.
A sixth type of count model that will prove to be of considerable value
when modeling count data is the heterogeneous negative binomial, or NBH.
The heterogeneous negative binomial allows for the parameterization of the
dispersion parameter. That is, the dispersion parameter will have associated
predictor coefficients indicating which ones significantly contribute to the
overdispersion in the model. This feature of the NBH assists the analyst in
determining the possible source for both Poisson and negative binomial
overdispersion.
Other models will be discussed in the book as well, including truncated and zero-inflated models, hurdle models, other generalized models
with three parameters, panel models, quantile models, exact Poisson models,
Bayesian models, and others. However, the Poisson, negative binomial, and
PIG models will play primary roles when we first attempt to identify the model
that best fits the count data to be analyzed. Those models are considered
ln(μ) = β0 + β1X1 + β2X2 + ··· + βnXn          (1.6)
To isolate the predicted mean count on the left side of equation (1.6), both
sides of the equation are exponentiated, giving
μ = e^(β0 + β1X1 + β2X2 + ··· + βnXn)          (1.7)
We will later discover that both of these expressions will be important when
defining terms in count models. Notice that there is not a linear relationship
between μ and the predictors as there was for the linear model. The linear
relationship is between the natural log of μ and the predictors.
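As a small illustration of this relationship, the following R sketch fits a Poisson model and confirms that exponentiating the linear predictor reproduces the predicted mean counts. It assumes the rwm1984 data described in the Preface are loaded, with docvis, outwork, and age as used later in Chapter 2; the model choice here is illustrative only.

poi <- glm(docvis ~ outwork + age, family = poisson, data = rwm1984)
xb  <- predict(poi, type = "link")        # linear predictor, xb = ln(mu)
mu  <- predict(poi, type = "response")    # predicted mean counts
all.equal(as.numeric(exp(xb)), as.numeric(mu))   # TRUE: mu = exp(xb)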
Recall that earlier in this section I referred to the sum of regression terms
as the linear predictor. In the case of linear regression, the linear predictor is
the same as the predicted or expected value. Statisticians typically symbolize
the summation of the terms of the linear predictor for each observation in a
model as
(xβ)i = Σ(i=1 to n) [β0 + β1X1i + ··· + βjXji]          (1.8)
with i indicating the observation number in the model data and j the number
of predictors in the model. Notice that I used the standard mathematical
Σ (Sigma) symbol for summation in equation (1.8). The summation starts at
the quantity indicated below Sigma and ends with the value at its top. Here
we have observation number i, starting at 1 representing the first observation in the data, and finishing with n, indicating the last observation in the
data being modeled. At times, statisticians choose not to use the summation
sign, or product sign for probabilities and likelihoods, or to use subscripts if
they believe that their readers understand that this is included in the equation even if not displayed. Subscripts and summation symbols are generally
not displayed if we are referring to the equation as a whole and not to the
individual components. Each author has their individual preference.
The relationship of the predicted or fitted statistic, μ, and the linear predictor, xb, is the same for Poisson, negative binomial, and PIG regressions.3
The term log(μ) is called the link function since it links the linear predictor
and predicted value:

log(μi) = Σ(i=1 to n) [β0 + β1X1i + ··· + βjXji]          (1.9)
LINK RELATIONSHIP
------------------------------------------------------------------------
linear regression      identity:  xb = μ        Gaussian or normal
major count models     log:       xb = ln(μ)    Poisson, negative binomial, PIG
------------------------------------------------------------------------
An important feature of having the natural log link for count models is that
it guarantees that the predicted values will always be positive (i.e., μ > 0).
Using a linear regression when modeling counts cannot make such a guarantee. What about the formerly common practice of logging the response
and modeling it as a linear regression (i.e., as a Gaussian model)? The result
is that predicted values are positive but substantially lower than Poisson or
negative binomial predicted values. Ideally, the predicted counts should be
close to the values of the actual counts: y ≈ μ. When the count response is
logged and modeled using linear regression, its predicted values are nearly
always distant from the actual or observed counts. Fit tests that match up
observed versus expected counts rarely show a significant fit. The caveat
here is
Reject the temptation to use linear regression to model a logged count
variable.
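A brief simulated comparison of the two approaches is sketched below. The data-generating values are assumptions chosen only for illustration; they are not an example from the book.

set.seed(123)
x <- runif(500)
y <- rpois(500, exp(1 + 0.8 * x))

poi <- glm(y ~ x, family = poisson)     # Poisson model of the counts
lin <- lm(log(y + 1) ~ x)               # linear model of the logged counts

c(observed  = mean(y),
  poisson   = mean(fitted(poi)),        # predicted counts from the Poisson model
  logged_lm = mean(exp(fitted(lin))))   # back-transformed predictions from the logged model

Comparing the three means shows how far each set of predicted counts sits from the observed counts.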
At the very start of the modeling process, it's imperative to become intimately
familiar with the data you intend to model. Given that the count models
discussed in this book have a single response variable that is being modeled, it
3
Given the way that the NB and PIG models have been parameterized here.
is important to understand the points given in Table 1.3 about your proposed
model.
These are just the foremost items that you need to know about the data you
intend to model. Aside from the data itself, though, we also must know the
context of the model we wish to develop. What's the point? This is perhaps
a more important criterion than is at first realized. But as you will find
when traversing the book, understanding the purpose of the model, or the
purpose of the research, can help greatly when determining which predictors
to use in the model. It's not just a matter of discarding all predictors with
p-values greater than 0.05. It may be important to retain certain predictors
in a model even though they do not appear to contribute significantly to an
understanding of the response.
make predictions and classifications about real data: data that belong to the
population of data from which the model data are considered to be a sample.
If this characterization of the relationship of probability and model data
is unclear now, I am confident that it will be clear to you by the end of the
book. It is an important distinction that is basic to the frequency-based interpretation of statistics. Essentially, in selecting the most appropriate model for
given data, the analyst is selecting a probability distribution, or mixtures of
probability distributions, that best describe the population data of which the
data being modeled are a random sample. I have mentioned this before, but
it is an extremely important fact to understand about modeling.
As indicated in the last section, there are three primary probability functions that will initially concern us in this book: the Poisson, negative binomial,
and Poisson inverse Gaussian models. Many of the other models we discuss
are variations of these models, in particular variations of the Poisson and
negative binomial models.
Data come to us in a variety of ways. They certainly do not always perfectly
match the Poisson or negative binomial PDF. For example, the Poisson, negative binomial, and PIG distributions each assume the possibility of zero counts
even if there may not in fact be any. If zero counts are not a possibility for the
data being modeled, for instance hospital length-of-stay data, then the underlying PDF may need to be amended to adjust for the excluded zero counts.
Zero-truncated (ZT) models are constructed for exactly that purpose. The
underlying PDF is adjusted so that zero counts are excluded, but the probabilities in the function still sum to 1.0, as is required for a probability function.
Having data with an excessive number of zero counts is another problem
for many count models. For example, given a specific mean value for a Poisson
distribution of counts, the probability of having zero counts is defined by the
PDF. When the mean is 2 or 3, the probability of having zero counts is quite
good. But for a mean of 10, for instance, the Poisson PDF specifies that the
probability of a zero count is very near zero. If your data have a mean value
of 5 and some 30% of the count observations consist of zeros, there is a
problem. The expected percentage of zero counts on the basis of the Poisson
PDF is well under 1%. An adjustment must be made. Typically, analysts use
either a two-part hurdle model or a mixture model, such as zero-inflated
Poisson (ZIP) or zero-inflated negative binomial (ZINB). We will also use a
zero-inflated PIG (ZIPIG), as well as other models.
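The zero-count arithmetic mentioned above is easy to verify with R's dpois function:

dpois(0, lambda = 2)     # about 0.135 when the mean is 2
dpois(0, lambda = 3)     # about 0.050 when the mean is 3
dpois(0, lambda = 5)     # about 0.0067: well under 1% zeros expected for a mean of 5
dpois(0, lambda = 10)    # essentially zero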
Hurdle models are nearly always constructed as a two-part 0,1 response
logistic or probit regression and a zero-truncated count model. The logistic component models the probability of obtaining a nonzero count. After
separating the data into two components, the software creates a binary variable where all counts greater than zero (0) are assigned the value of one
(1). Zeros in the count model are zeros in the logit component. The count
component truncates or drops observations with zero values for the original
counts and, for example, models the data as a zero-truncated Poisson. The
model I describe here is a Poisson-logit hurdle model. Be aware, though, that
the binary component of the hurdle model need not itself be from a binary
model. An analyst can use a censored Poisson model, with right censoring at
1, and employ it as the binary component. This may not make much sense
at this point, but when we address censored models later in the book, it will
be clear. Most hurdle models use a logistic or probit regression for the binary
component of the hurdle.
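For reference, a Poisson-logit hurdle model of the kind just described can be fit in R with the pscl package. This is a sketch only; the variables from the rwm1984 data are used for illustration, and the book's own hurdle examples appear in Chapter 7.

library(pscl)
# logit model for zero vs. nonzero; zero-truncated Poisson for the positive counts
hpl <- hurdle(docvis ~ outwork + age, data = rwm1984,
              dist = "poisson", zero.dist = "binomial", link = "logit")
summary(hpl)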
Zero-inflated models are not simply two-part models that can be estimated
separately, as are hurdle models. Rather, zero-inflated models are mixture
models. They use logistic or probit regression for the binary component, but
both components the binary and count include the same zero counts
when being estimated. The overlap of zero counts means that the mixture
of Bernoulli (the distribution used in binary logistic regression) and Poisson
distributions must be adjusted so that the resulting PDF sums to one. The
statistical software takes care of this for you, but this overlap is important
to remember since some statisticians especially in ecology interpret the
binary zeros differently from the count zeros. I discuss this in more detail
in Chapter 7. It should also be understood that zero-inflated models, unlike
hurdle models, structure the binary component so that it models zeros instead
of ones.
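The corresponding zero-inflated Poisson fit can be sketched with pscl as well; as noted above, the binary component models the zeros. The predictors chosen here are again illustrative.

library(pscl)
# count component | zero-inflation (binary) component
zip <- zeroinfl(docvis ~ outwork + age | outwork + age,
                data = rwm1984, dist = "poisson", link = "logit")
summary(zip)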
The count models described here are the only ones known by many analysts. However, several three-parameter count models have been developed
that can be used on count data that fail to fit any of the standard count probability distributions, including mixtures of distributions. We will focus our
attention on one of these models: the generalized NBP negative binomial
model. The NBP model parameterizes the exponent on the second term of the
negative binomial variance. Recall that the negative binomial variance function is μ + αμ^2. We may symbolize the parameter as ρ (rho), representing
the power in μ + αμ^ρ. μ, α, and ρ are all parameters to be estimated.
Actually, there is another model we have only alluded to in our discussion
thus far, the so-called NB1 model. It has a variance function of μ + αμ^1,
or simply μ + αμ. Since it has a linear form, it is called a linear negative
binomial; the traditional negative binomial is sometimes referred to as the
quadratic negative binomial because of the square exponent. In any case,
foremost use of the NBP model is to have it determine whether the data
prefer NB1 or NB2. After estimating the NBP model, if ρ is close to 2, the
analyst should use NB2 rather than NB1. I suggest letting the model speak
for itself. If alpha is 0.5 and ρ is 1.8, then that should be the reported model.
The NBP will be discussed in Chapter 5.
There are several other types of count models that are important in the analyst's statistical toolbox. At times you will have data that have been truncated
or censored. We have already mentioned zero-truncated models. However, it
is possible that the data theoretically come from population data that cannot
have count values below 3, or perhaps above 10, or even to either side of two
counts (e.g., 7 to 10). If values are truncated at the low end of the counts, the
model is said to be left truncated; if they cannot exist higher than some cut
point, the model is right truncated. Interval truncation exists when counts
only exist between specific count values.
Censoring occurs where counts can possibly exist, but because of the study
design or other factors they are not observed for these data. Left, right, and
interval censoring can occur.
Finite mixture models are seldom used but can be exactly what you are
looking for in a model. What if counts from more than one source are occurring? In the probabilistic sense, what if counts are coming into the data you
are modeling from more than one data-generating mechanism? Perhaps 25%
of the counts in your data are distributed as Poisson with a specific mean
and the rest as Poisson with another mean value. A finite mixture model can
ferret out the percentage distributions inherent in count data and provide
means and coefficients/standard errors for each distributional component in
the data. This is currently an underutilized class of models that can allow an
analyst to have a much better understanding of various types of data situations than if they were modeled using standard Poisson or negative binomial
methods.
I mentioned at the outset that the majority of count models to be discussed
in this book are based on probability distributions. Nonparametric models,
which include models with smoothers applied to specific continuous data
in your model, can assist in fitting a model. Generalized additive models
(GAMs) are used to assess the linearity of continuous predictors with respect
to the response and provide information concerning what type of transform
is needed to effect linearity. GAMs are not usually employed as a model to be
used to predict or classify data.
Quantile count models are also nonparametric models but are used to
describe the empirical distribution underlying one's data. Quantile count
This section and Section 1.4.3 are optional and require a knowledge of basic
calculus. They are included for those who wish to obtain an overview of count
model estimation, but they are not required for reading the subsequent chapters. I
do suggest reading the first part of this section, though, to learn about calculating
Poisson probabilities.
f(y; μ, α)          (1.10)

L(μ, α; y)          (1.11)
We can clarify the difference between the probability and likelihood functions
as follows:
A probability function generates data on the basis of known parameters.
A likelihood function determines parameter values on the basis of known
data.
Given that we do know the data we are modeling, and given the fact that
we are attempting to estimate the parameters of the probability function
that (theoretically) generate the data we are modeling, we need to employ a
likelihood function in order to determine the parameters that make the model
data most likely to be the case. Expressed in another way, when modeling we
are asking what parameter values of a given PDF most likely generated the
data we have to model.
To make these concepts better understood, recall that the Poisson probability distribution has a single parameter and the negative binomial has two.
Each of these parameters can theoretically have an infinite number of parameter values. Once we select a Poisson probability, for example, to describe our
data, we need to refine the description by determining the location parameter.
We do this by maximizing, or optimizing, the Poisson likelihood function.
The method we use to maximize the likelihood function in question is referred
to as maximum likelihood estimation.
Because of numerical considerations, statisticians maximize the log of the
likelihood rather than the likelihood function itself. We will find throughout this book that the log-likelihood is one of the most well-used statistics
in modeling. Maximum likelihood estimation (MLE) is in fact maximum
log-likelihood estimation, although we rarely hear the term used in this
manner.
Maximization of the log-likelihood function involves taking the partial
derivatives of the function, setting the resulting equation to 0, and solving
for parameter values. The first derivative of the log-likelihood function, with
respect to the coefficients, is called the score function, U. The second derivative is a matrix called the Hessian matrix. The standard errors of the predictors
in the model are obtained by taking the square root of the diagonal terms of
the negative inverse Hessian, −H⁻¹. I refer you to Hilbe (2011) and Hilbe and
Robinson (2013), which provide an in-depth evaluation of both maximum
likelihood and IRLS methodology.
Before leaving this overview of maximum likelihood estimation, I'll give an
example of how this works, beginning with the Poisson probability distribution. It is the basic count model. If you understand the logic of estimation, it
will be easier to understand when and why things can go wrong in modeling.
In its simplest form, the Poisson probability distribution can be expressed
as
f(y; μ) = e^(−μi) μi^yi / yi!          (1.12)
Mon = 4;   Tue = 2;   Wed = 0;   Thu = 3;   Fri = 1;   Sat = 2
The mean number of counts over the six days is 2. If a Poisson probability
distribution with a mean, , of 2 truly describes this list or vector of counts,
we can calculate the probability of no (0) customers being in line, or any
other number of customers, using the Poisson probability function:
CUSTOMERS
Prob(0):
. di exp(-2)*(2^0)/exp(lnfactorial(0))
.13533528
or simply
. di exp(-2)
.13533528
To calculate the probability of one person being in line and then two, three,
and four, we have
1  . di exp(-2)*(2^1)/exp(lnfactorial(1))  =  .27067057
2  . di exp(-2)*(2^2)/exp(lnfactorial(2))  =  .27067057
3  . di exp(-2)*(2^3)/exp(lnfactorial(3))  =  .18044704
4  . di exp(-2)*(2^4)/exp(lnfactorial(4))  =  .09022352
Note: Stata's poisson(mu,y) function provides a cumulative Poisson probability for a given mean and count term, y. poissonp(mu, y) gives a specific probability for a mean and y value; for example, for a mean and count (y)
of 2
. di poissonp(2,2)
.27067057
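The R counterparts of these Stata functions are dpois (point probability) and ppois (cumulative probability):

dpois(2, lambda = 2)    # 0.2706706, the probability of a count of exactly 2
ppois(2, lambda = 2)    # 0.6766764, the cumulative probability P(y <= 2)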
In place of a constant value for the mean, mu, let's vary it, giving it four
different values:
mu = {0.5, 1, 3, 5}
Plotting the values allows us to observe the differences in the shapes of the
distributions. This type of graph will prove to be of use later when we discuss
[Figure: Poisson distributions for means of 0.5, 1, 3, and 5]
Note that mean values under 1 are shaped like negative exponential distributions. The greater the mean, the more normal the shape of its appearance.
When we wish to adjust the values of the probability of customers waiting
by such factors as "Is it raining or snowing?", "Is it the first of the month
directly after customers may have been paid?", "Is there an epidemic in the
m <- c(0.5, 1, 3, 5)            # Poisson means
y <- 0:11                       # observed counts
layout(1)
for (i in 1:length(m)) {
  p <- dpois(y, m[i])           # Poisson pdf
  if (i==1) {
    plot(y, p, col=i, type="l", lty=i)
  } else {
    lines(y, p, col=i, lty=i)
  }
}
In terms of μ, the Poisson log-likelihood function is

L(μ; y) = Σ(i=1 to n) {yi ln(μi) − μi − ln(yi!)}          (1.13)
Recall that the linear predictor of the Poisson model is xb, or xβ. Recall
also that xβ = ln(μ). This entails that μ = exp(xβ). exp(xβ) is called the
inverse link function, which defines μ. exp(xβ) also defines μ for the negative
binomial model, as well as for most models based on the Poisson and negative
binomial that are used in research. Because of this relationship, μ = exp(xβ)
is also referred to as the exponential mean function. If we substitute exp(xβ)
for μ, the log-likelihood may be expressed as

L(β; y) = Σ(i=1 to n) {yi(xiβ) − exp(xiβ) − ln(yi!)}          (1.14)

Its first derivative with respect to β, the score function, is

∂L(β; y)/∂β = Σ(i=1 to n) (yi − exp(xiβ)) xi          (1.15)
Setting equation (1.15) to zero (0) provides for the solution of the parameter
estimates.
The Hessian matrix is calculated as the second derivative of the log-likelihood function and is negative definite for β. For the Poisson, it may
be expressed as

∂²L(β; y)/∂β∂β′ = −Σ(i=1 to n) exp(xiβ) xi xi′          (1.16)
The square roots of the respective terms on the diagonal of the negative inverse
Hessian are the values of predictor standard errors. A Newton–Raphson type
algorithm (see Table 1.6) can be used for the maximum likelihood estimation
of the parameters:

β(r+1) = β(r) − H⁻¹ g          (1.18)
likelihood estimated values. Other iteration criteria have been employed, but
all generally result in the same estimates. The scheme for maximum likelihood estimation follows. It is the basis for models being estimated using this
method. o indicates estimation at the old value of the parameters (coefficients); n is the new, updated value. Likewise for the likelihood, L. tol
indicates the level of tolerance needed for convergence, which is typically set
at 10⁻⁶.
Refer to Section 2.4 for examples of both Stata and R maximum likelihood
code. Example code is provided for a Poisson regression estimated using
maximum likelihood. Real data are used for the no-frills Stata and R programs,
but model coefficients and associated statistical values are calculated and
displayed.
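For readers who want to see the mechanics now, here is a minimal R sketch of Newton–Raphson estimation for a Poisson regression on simulated data. It is illustrative only (the data-generating values and variable names are assumptions); the book's full Stata and R maximum likelihood programs appear in Section 2.4.

set.seed(1)
n  <- 1000
x1 <- runif(n); x2 <- runif(n)
X  <- cbind(1, x1, x2)                      # design matrix with intercept
y  <- rpois(n, exp(1 + 0.5*x1 - 0.25*x2))   # simulated Poisson counts

beta <- rep(0, ncol(X))                     # starting values
tol  <- 1e-6                                # convergence tolerance
repeat {
  mu   <- as.vector(exp(X %*% beta))        # inverse link: mu = exp(xb)
  U    <- t(X) %*% (y - mu)                 # score (first derivative)
  H    <- -t(X) %*% (mu * X)                # Hessian (second derivative)
  beta_new <- beta - solve(H, U)            # equation (1.18): beta - H^(-1) g
  if (max(abs(beta_new - beta)) < tol) break
  beta <- beta_new
}
cbind(coef = as.vector(beta_new),
      se   = sqrt(diag(solve(-H))))         # SEs from the negative inverse Hessian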
1.5 SUMMARY
There are several important points to take from this chapter. The first point
we have made is that most analysts are aware that Poisson and negative
binomial regressions are used to model count response data, but they have
little knowledge of other count models or under what conditions they may
be desirable. They may have heard of overdispersion but not be fully aware
of what it is or how it relates to the Poisson model. We have emphasized,
however, that overdispersion is a very real problem for Poisson regression
and occurs when the variance of the Poisson model is greater than the mean.
Overdispersion also occurs more generally when there is more variability in
the model data than is assumed on the basis of the probability distribution
on which the model is based. Negative binomial regression, having a second parameter, is now a standard way of modeling overdispersed Poisson
data.
Negative binomial regression may not adequately adjust for the amount
of overdispersion in a Poisson model, depending on what gives rise to the
overdispersion or underdispersion. A number of other models based on the
Poisson and negative binomial models have been designed to appropriately
compensate for overdispersion.
Another important point to take from this opening chapter is that when
we are modeling data we are in fact modeling an underlying probability
distribution (PDF) or mixture of distributions. When modeling, the analyst
selects a PDF or mixture of PDFs that is believed to best describe the data to
be modeled. Once a selection is made, the analyst attempts to estimate the
parameters that make the data most likely to be the case. Analysts generally
do this using maximum likelihood estimation. IRLS estimation, which is the
algorithm typically used in generalized linear model (GLM) software, is a
variety of maximum likelihood. It is simplified because GLM models are
members of the single-parameter exponential family of distributions.
The data that are being modeled using maximum likelihood techniques are
also thought of as a random sample from a greater population of data, so the
model is in fact of a distribution that theoretically generates the population
data from which the model data are a random sample. Table 1.8 provides
a rough summary of these relationships. A summary of the relationship of
modeling to probability can be expressed as
The foremost goal of modeling is the identification of a probability distribution, or mixture of distributions, and estimation of its unknown
parameter or parameters, that can be said to have generated the population of data from which we are modeling a random sample.
This is not normally how we think of the modeling process, but based on the
predominant frequency-based interpretation of modeling, it makes sense. At
its core, it is consistent with the view that when modeling we are attempting
to discover the unbiased parameter values of a probability distribution, or
mixture of distributions, that make the data we are modeling most likely to
be the case. In the following chapter, we discuss in more detail how we make
this discovery.
CHAPTER 2
Poisson Regression
- What are marginal effects, partial effects, and discrete change with respect
  to count models?
Poisson regression is fundamental to the modeling of count data. It was the
first model specifically used to model counts, and it still stands at the base
of the many types of count models available to analysts. However, it was
emphasized in the last chapter that because of the Poisson distributional
assumption of equidispersion, using a Poisson model on real study data is
usually unsatisfactory. It is sometimes possible to make adjustments to the
Poisson model that remedy the problem of under- or overdispersion, but
one (1). However, if zero is excluded as a possibility, then the sum of probabilities will always be less than 1. In order to properly model the count data
having a low mean value, a zero-truncated Poisson distribution must be used
for modeling. We discuss this in more detail in Chapter 7.
3. Two ways in which independence may be tested are:
a. Check to determine whether the data are structured in panels (i.e., check
whether the data are clustered or are a collection of longitudinal data).
If they are, then they are not independent. When we model otherwise
panel data as though they are independent, we say that the data have
been pooled (i.e., we assume that the panel effect is not enough to
violate the independence criterion).
b. Check the difference between the model SEs and the SEs adjusted by
i. employing a robust sandwich estimator to the SEs, or
ii. bootstrapping the SEs, or
iii. checking SEs scaled by the dispersion statistic (i.e., model SEs multiplied by the square root of the Pearson Chi2 dispersion).
If the SEs differ considerably, then the data are correlated. I suggest applying
all three adjustments to the model SEs (a brief R sketch of all three follows
this list). If all differ from the model SEs, it is highly likely that the
independence criterion has been violated.
4. This criterion may be tested by calculating the percentage of zeros in the
empirical distribution (i.e., the distribution of the count response variable).
Compare that with the frequency of zero counts expected or predicted
based on a Poisson PDF with the mean determined for the observed distribution. If the compared frequencies substantially differ, then a violation
exists in the distributional assumption. An example was given in Section
1.4.2, where we calculated the probability of six observations with count
values of 0 through 4. The mean of the six counts was calculated to be 2.
The probability of a 0 count for a Poisson distribution with a mean of 2 is
.135, or 13.5%. If we have a count variable with a mean of 2 but 25% of
the counts are 0, this criterion has been violated. It is generally permissible
to have small variations in the differences between actual and expected
counts, but it is usually wise to check using a Chi2 test, which is described
later in this chapter. The relationship of the probability of zero counts and
the value of μ will be explored in greater detail in Section 7.1, where we
discuss truncated zeros.
5. Calculate the mean and variance of the empirical (observed) count
response. Determine whether the two values differ.
6. The Pearson Chi2 test is performed as follows:
a. Tabulate the observed variance of the count variable being modeled.
Compare it with the predicted variance. If the two values differ, the
model is extradispersed. If the observed variance is greater than the
expected or predicted variance, the model is Poisson overdispersed;
if it is less, the model is Poisson underdispersed. Most count data are in
fact Poisson overdispersed.
b. Estimate the desired full Poisson model, checking the value of the
Pearson Chi2 dispersion. If the value of the dispersion varies from 1.0,
the model is Poisson extradispersed. A boundary likelihood ratio test
can be used to assess overdispersion. A generalized Poisson model can
test for the statistical significance of either under- or overdispersion.
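As promised under item 3b, here is an R sketch of the three standard-error comparisons: model-based, robust (sandwich), scaled, and bootstrapped SEs. It assumes the rwm1984 data and the sandwich package are available; the predictor choice is illustrative.

library(sandwich)
poi <- glm(docvis ~ outwork + age, family = poisson, data = rwm1984)

se_model  <- sqrt(diag(vcov(poi)))                   # model-based SEs
se_robust <- sqrt(diag(vcovHC(poi, type = "HC0")))   # robust sandwich SEs

# scaled SEs: model SEs times the square root of the Pearson dispersion
disp      <- sum(residuals(poi, type = "pearson")^2) / poi$df.residual
se_scaled <- se_model * sqrt(disp)

# bootstrapped SEs: refit the model on resampled observations
set.seed(2014)
bcoef   <- replicate(200, coef(glm(docvis ~ outwork + age, family = poisson,
                     data = rwm1984[sample(nrow(rwm1984), replace = TRUE), ])))
se_boot <- apply(bcoef, 1, sd)

round(cbind(se_model, se_robust, se_scaled, se_boot), 4)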
When the preceding criteria are violated, the model generally appears as Poisson extradispersed and in particular as Poisson overdispersed. As such, the
criteria are all related to some degree, some more than others. An indication of
intercept == 1.00
X1 == 0.75
X2 == -1.25
X3 == 0.50
         py |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      4,579        9.16        9.16
          1 |      9,081       18.16       27.32
          2 |     10,231       20.46       47.78
          3 |      8,687       17.37       65.16
          4 |      6,481       12.96       78.12
          . |          .           .           .
         14 |         16        0.03       99.98
         15 |          7        0.01       99.99
         16 |          3        0.01      100.00
------------+-----------------------------------
      Total |     50,000      100.00
The mean and median of the Poisson response are 3.0. The displayed output
has been amended:
. sum py, detail

          50%        3          Mean       3.00764
. glm py x1 x2 x3, fam(poi) nolog        // non-numeric mid-header

Generalized linear models                        No. of obs      =      50000
Optimization     : ML                            Residual df     =      49996
                                                 Scale parameter =          1
Deviance         =  54917.73016                  (1/df) Deviance =   1.098442
Pearson          =  49942.18916                  (1/df) Pearson  =   .9989237
                                                 AIC             =   3.744693
Log likelihood   = -93613.32814                  BIC             =  -486027.9

  output deleted

------------------------------------------------------------------------
             |                 OIM
          py |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------
          x1 |   .7502913   .0090475    82.93   0.000     .7325586     .768024
          x2 |  -1.240165   .0092747  -133.71   0.000    -1.258343   -1.221987
          x3 |    .504346   .0089983    56.05   0.000     .4867096    .5219825
       _cons |   .9957061     .00835   119.25   0.000     .9793404    1.012072
------------------------------------------------------------------------

. abic
AIC Statistic   =   3.744693          AIC*n       =  187234.66
BIC Statistic   =   3.744755          BIC(Stata)  =  187269.94
We set a seed value of 4590, which I just provided by chance. Using this same
seed will guarantee that the same data and model is displayed. Dropping
the seed will result in each run giving slightly different output. If we ran a
Monte Carlo simulation of the data with perhaps 100 iterations, the parameter
estimates would equal the values we assigned them, and the Pearson dispersion statistic, defined as the Pearson statistic divided by the model degrees
of freedom, would equal 1.0. Note that the Pearson dispersion statistic in
the preceding model is 0.9989237, with the parameter estimates approximating the values we specified. Note also that the deviance-based dispersion
statistic is some 10% greater than 1. After hundreds of simulations, I have
found that simulations of a true Poisson model always result in a Pearson
dispersion statistic approximating 1. The deviance appears to vary between
1.03 and 1.13. I have also displayed a user command I wrote called abic,
which provides two types each of AIC (Akaike Information Criterion) and
       py     Freq   Percent       Cum
 1      0        .         .         .
 2      1        .         .         .
 3      2    10190    20.380    47.982
 4      3     8814    17.628    65.610
 .      .        .         .         .
15     14       15     0.030    99.980
16     15        8     0.016    99.996
17     16        2     0.004   100.000
summary(py)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.000   3.000       .       .  16.000
...

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept)  0.985628   0.008415   117.13   <2e-16 ***
x1           0.748436   0.009091    82.33   <2e-16 ***
x2          -1.243313   0.009298  -133.72   <2e-16 ***
x3           0.518207   0.009018    57.46   <2e-16 ***
---

    Null deviance:  . . .  on 49999  degrees of freedom
Residual deviance:  . . .  on 49996  degrees of freedom
0.5358818
to obtain
# by-hand method
The model results are nearly the same, regardless of whether they are generated using Stata or R. Even with different seed values, the models are nearly
identical. Note that the same seeds for Stata and R models result in slightly
different values. The R dispersion statistic is 0.9958121.
To determine whether the single Poisson models we just ran are providing
values that approximate the parameter values we assigned to the algorithm,
we use what is called a Monte Carlo algorithm, in which no seed is provided
to the code so that each run is a bit different. (The Stata code is given in
Table 2.5, and the R code is provided in Table 2.6.) Here we execute the
/* poix_sim.ado */
version 10
drop _all
set obs 50000
gen x1 = runiform()
gen x2 = runiform()
gen x3 = runiform()
gen py = rpoisson(exp(1 + 0.75*x1 - 1.25*x2 + .5*x3))
glm py x1 x2 x3, nolog fam(poi)
return scalar sx1 = _b[x1]          // x1
return scalar sx2 = _b[x2]          // x2
return scalar sx3 = _b[x3]          // x3
return scalar sc  = _b[_cons]       // intercept
run 500 times, keeping the values of the coefficients and both dispersion
statistics. At the conclusion, there are 500 observations of data in memory.
Obtaining the mean of the vector of mean values for each model statistic
provides us with the Monte Carlo results. The simulate command calls
and runs poix_sim:
. simulate mx1 = r(sx1)
    Variable |    Obs        Mean    Std. Dev.        Min         Max
-------------+--------------------------------------------------------
         mx1 |    500    .7499434    .0086789    .7273185    .7806724
         mx2 |    500   -1.249551    .0089773   -1.275919   -1.223709
         mx3 |    500    .4993433    .0091402    .4705572    .5229337
        mcon |    500     1.00001    .0082247    .9790608    1.021834
         mdd |    500    1.104367    .0066602    1.083313    1.124828
         mpd |    500     1.00027     .006106    .9778052    1.020143
The Monte Carlo results demonstrate that we are modeling a true Poisson
model and that the dispersion statistic is 1.0. The deviance dispersion is 1.10,
some 10% greater than 1. The results for the value of the Pearson-based dispersion are identical for any true Poisson model we test using this method.
Monte Carlo results using R are similar to those for Stata. The dispersion
statistic is 0.9999323, nearly identical to 1. Coefficients are nearly correct to
the ten-thousandths place. If we had replicated the model 500 times as we
did for Stata instead of 100 times, our results would have been even closer:
TABLE 2.6. R: Monte Carlo Poisson Code
mysim <- function()
{
  nobs <- 50000
  x1 <- runif(nobs)
  x2 <- runif(nobs)
  x3 <- runif(nobs)
  py <- rpois(nobs, exp(2 + .75*x1 - 1.25*x2 + .5*x3))
  poi <- glm(py ~ x1 + x2 + x3, family=poisson)
  pr <- sum(residuals(poi, type="pearson")^2)
  prdisp <- pr/poi$df.residual
  beta <- poi$coef
  list(beta, prdisp)
}
B <- replicate(100, mysim())
apply(matrix(unlist(B[1,]), 4, 100), 1, mean)   # 4 coefficients: intercept, x1-x3
     docvis |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,611       41.58       41.58
          1 |        448       11.56       53.15
          2 |        440       11.36       64.51
          3 |        353        9.11       73.62
          4 |        213        5.50       79.12
          5 |        168        4.34       83.45
          6 |        141        3.64       87.09
          7 |         60        1.55       88.64
        ... |        ...         ...         ...
Notice the 41.58% zero (0) counts. With a mean of 3.163 and variance of
39.39 (variance = standard deviation squared), the variance far exceeds the
mean:
. sum docvis

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      docvis |      3874    3.162881    6.275955          0        121
We expect that only 4.23% of the observations in the model have a zero count,
but there are over 41% zero counts.
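These percentages are simple to confirm in R (assuming the rwm1984 data are loaded):

mean(rwm1984$docvis == 0)                  # observed proportion of zeros, about 0.416
dpois(0, lambda = mean(rwm1984$docvis))    # Poisson-expected proportion, about 0.042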
The model we are working with is seriously Poisson overdispersed. In fact,
it would be rather surprising if the data came from a Poisson distribution;
or alternatively, surprising if the data could be appropriately modeled using
Poisson regression.
The two other predictor variables we are using are outwork
. tab outwork

      1=not |
   working; |
  0=working |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,454       63.35       63.35
          1 |      1,420       36.65      100.00
------------+-----------------------------------
      Total |      3,874      100.00
and age
. sum age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      3874    43.99587     11.2401         25         64

                  Observations
    Variable |      total    distinct
-------------+------------------------
         age |       3874          40
Centering is the process where the mean of the variable is subtracted from
every value of the variable. For example, centering provides that
age - mean(age)
for each age in the model. Using the center command, we have named it
cage. It is important to remember that centering changes only the value of the
intercept in the model. All other predictor coefficients and standard errors,
and fit statistics, stay the same. We interpret age differently when centered,
though, as will be seen:
. glm docvis outwork cage, fam(poisson) nolog

Generalized linear models                        No. of obs      =       3874
Optimization     : ML                            Residual df     =       3871
                                                 Scale parameter =          1
Deviance         =  24190.35807                  (1/df) Deviance =   6.249124
Pearson          =  43909.61512                  (1/df) Pearson  =   11.34322
                                                 AIC             =   8.074027
Log likelihood   = -15636.39048                  BIC             =    -7792.0
------------------------------------------------------------------------
             |                 OIM
      docvis |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------
     outwork |   .4079314   .0188447    21.65   0.000     .3709965    .4448663
        cage |   .0220842   .0008377    26.36   0.000     .0204423    .0237261
       _cons |   .9380964   .0127571    73.54   0.000      .913093    .9630999
------------------------------------------------------------------------

. abic
AIC Statistic   =   8.074027          AIC*n       =  31278.781
BIC Statistic   =    8.07418          BIC(Stata)  =  31297.566
Other predictors in the data could have been used in the preceding model
as well, but I selected a model with a single binary and single continuous
predictor to keep it simple.
Using the Stata glm command, the nolog option is used so that the iteration
log is not displayed. The intercept value .938 is the linear predictor of all
observations in the model when both outwork and age have values of 0. If age
were not centered, the intercept would have a value of −.03351697.
It is of primary importance to notice that the Pearson-based dispersion
statistic has a value of 11.34, far higher than the ideal value for a Poisson model
of 1.0. I determined elsewhere that the model is not apparently overdispersed.
Adding an interaction, for example, or squaring the cage variable, and so forth,
did not result in an elimination of overdispersion. That would have solved
the issue upfront. Given the fact that
- there are excessive 0 counts in docvis given its mean,
- the variance of docvis far exceeds its mean, and
- the dispersion statistic (11.34) is much greater than 1,
5:1 ratio or smaller may at times produce a well-fitted model. See Vittinghoff
and McCulloch (2006) for details. We explore alternatives in subsequent
chapters.
Let us add additional predictors to our model and view the summary
statistics:
ADDED BINARY PREDICTORS:  female, married, and kids
CATEGORICAL PREDICTOR:    Educational level (edlevel), a 4-category predictor.
STATA CODE
. tab edlevel

   Level of |
  education |      Freq.     Percent        Cum.
------------+-----------------------------------
Not HS grad |      3,152       81.36       81.36
    HS grad |        203        5.24       86.60
  Coll/Univ |        289        7.46       94.06
Grad School |        230        5.94      100.00
------------+-----------------------------------
      Total |      3,874      100.00
I initially used the term i.edlevel to tell Stata that edlevel is categorical, with
the first level as the default reference level, and therefore is excluded from
estimation. Displayed in Stata and R,
. glm docvis outwork cage female married kids i.edlevel, fam(poi) nolog
or
summary(tst1 <- glm(docvis ~ outwork + cage + female + married + kids
    + factor(edlevel), family=poisson, data=rwm1984))
I found, however, that the second level was not significantly different from
the reference level (i.e., its p-value was substantially greater than 0.05). I
then generated four indicator or dummy variables, excluding both levels
1 and 2 from estimation. This means that a combination of levels 1 and 2 is
the reference level (patients who have not attended college or university). A
model can then be developed as
. tab edlevel, gen(edlevel)
Deviance         =   23894.1626                  No. of obs      =       3874
Pearson          =  43855.49193                  Residual df     =       3866
                                                 Scale parameter =          1
                                                 (1/df) Deviance =    6.18059
                                                 (1/df) Pearson  =   11.34389
                                                 AIC             =   8.000151
Log likelihood   = -15488.29274                  BIC             =  -8046.895
------------------------------------------------------------------------
             |                 OIM
      docvis |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------
     outwork |   .2725941   .0215047    12.68   0.000     .2304456    .3147426
        cage |   .0194905   .0009708    20.08   0.000     .0175878    .0213933
      female |   .2596274   .0211399    12.28   0.000     .2181938    .3010609
     married |  -.0937657    .022653    -4.14   0.000    -.1381648   -.0493666
        kids |  -.1247957   .0222546    -5.61   0.000    -.1684139   -.0811776
    edlevel3 |    -.19083   .0398098    -4.79   0.000    -.2688558   -.1128042
         ... |
------------------------------------------------------------------------

. abic
AIC Statistic   =   8.000151          AIC*n       =  30992.586
BIC Statistic   =   8.004609          BIC(Stata)  =  31042.682
     elevel |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      3,355       86.60       86.60
          2 |        289        7.46       94.06
          3 |        230        5.94      100.00
------------+-----------------------------------
      Total |      3,874      100.00
. glm docvis outwork cage female married kids i.elevel, fam(poi) nolog
not displayed
levels(rwm1984$edlevel)                     # levels of edlevel
elevel <- rwm1984$edlevel                   # new variable
levels(elevel)[2] <- "Not HS grad"          # assign level 1 to 2
levels(elevel)[1] <- "HS"                   # rename level 1 to "HS"
levels(elevel)                              # levels of elevel
[1] "HS"           "Coll/Univ"    "Grad School"

summary(tst2 <- glm(docvis ~ outwork + cage + female + married + kids
    + factor(elevel), family=poisson, data=rwm1984))
It's clear that even with five significant predictors added to the model, the
dispersion and AIC and BIC statistics are nearly identical to those of the
reduced model with outwork and cage as predictors. In such a case, Occam's
razor maintains. Occam's dictum in this context can be addressed as:
If two models have the same explanatory power,
the simpler model is preferred.
Prediction is one of the most important features of a statistical model. We
address it in Section 2.6 of this chapter.
errors, and z-values? How do we interpret the statistics that are displayed in
Stata, R, and SAS output?
The coefficient βj in general is the change in the log-count of the response
for a one-unit change in the predictor. For a binary predictor such as outwork,
β is the change in the log-count value of the response when the value of the
predictor changes from 0 to 1. For a continuous predictor, the change with
respect to the predictor is from the lower of two contiguous values to the next
higher; for example, from age 45 to age 46. Note, though, that the change
in the response is in terms of log-values (i.e., log-physician visits), not the
original values of the response. When a continuous predictor is centered, the
reference is its mean value. When a centered variable is part of an interaction,
it helps to reduce correlation and is therefore in general desirable. Predictions
are not affected. A continuous variable may also be scaled, which entails that
the centered variable be standardized.
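In R, centering and standardizing a continuous predictor are one-line operations. This is a sketch only; the object name sage is illustrative, while cage mirrors the name used in the book.

cage <- rwm1984$age - mean(rwm1984$age)    # centered age
sage <- as.numeric(scale(rwm1984$age))     # centered and standardized (scaled) age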
Let's take another look at the coefficient table for the German health data
we have been evaluating. This time, however, I will not center age but leave
it as it comes in the data as discrete values from 25 to 64. Stata allows the
user to display only the coefficient table by using the nohead option:
. glm docvis outwork age, fam(poi) nolog nohead
------------------------------------------------------------------------
             |                 OIM
      docvis |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------
     outwork |   .4079314   .0188447    21.65   0.000     .3709965    .4448663
         age |   .0220842   .0008377    26.36   0.000     .0204423    .0237261
       _cons |   -.033517   .0391815    -0.86   0.392    -.1103113    .0432773
------------------------------------------------------------------------
Its obvious why most analysts prefer not to interpret Poisson coefficients.
Its rather difficult to decipher the import of a log-count. I provided two
alternative ways of expressing the generic meaning of the coefficient (see
Table 2.9), and other versions can be given as well, but they are all difficult
to comprehend or really visualize.
A coefficient provides a way of quantifying the relationship of the predictor
to the response based on the model. But its value is only as good as its statistical significance. It has become fairly standard to define .05 as the criterion
of statistical significance. In preliminary studies, this criterion is typically
relaxed, and in subjects like physics and nanotechnology the relationship is
57
tightened to .01 or even .001. But this is rare. For our purposes, well maintain
the generally accepted significance level of .05.
I provided the formula for the Poisson Hessian matrix in Chapter 1 and
mentioned that the standard errors of the model parameter estimates are
obtained as the square root of the diagonal terms of the inverse negative of
the Hessian. This can be simplified by having the software being used to
model data display the variancecovariance matrix of the model coefficients.
Thats the negative inverse Hessian.
Following estimation of a model, the variancecovariance matrix can be
obtained as
STATA CODE
. matrix list e(V)
symmetric e(V)[3,3]
docvis:
outwork
docvis:outwork
docvis:age
docvis:_cons
.00035512
-4.485e-06
.00003565
. di sqrt(.00035512)
docvis:
age
7.018e-07
-.00003104
//
docvis:
_cons
.00153519
.01884463
. di sqrt(7.018e-07)
.00083774
. di sqrt(.00153519)
.0391815
//
58
POISSON REGRESSION
The values we calculate match the standard errors displayed in the coefficient
table. These values are referred to as the model standard errors. We define and
use other types of standard errors later in this chapter.
A z-value is simply the ratio of the coefficient and associated standard
error. For the outwork predictor, we have
. di 0.4079314 /0.0188447
21.64701
(2.1)
1.96* .0188447
//
. di .4079314 +
.44486701
1.96* .0188447
//
59
0.0204444 0.02372828
# Traditional Model-based SE
confint.default(pyq)
2.5 %
97.5 %
The standard errors differ little in this likelihood profiling, but this is not
always the case. Profiling may mean the difference between accepting or
rejecting a predictor. In general, standard errors based on profile likelihood
are preferable to traditional model-based standard errors. We will later find
that determining the significance of a predictor by means of a likelihood
ratio test is usually preferable to the traditional model-based method. In
addition, I will suggest using another type of adjustment to model-based
standard errors the use of robust or sandwich variance estimators (refer to
Section 3.4.3).
60
POISSON REGRESSION
method, which in this case amounts to exp()*SE , where the second term is
the standard error of the original standard model. First, we exponentiate the
coefficients:
STATA CODE
# exponentiated coefficient: rate ratio outwork
. di
exp( _b[outwork])
1.5037041
# exponentiated coefficient: rate ratio age
. di
exp( _b[age])
1.0223299
* Delta Method: SE outwork rate ratio
. di
exp( _b[outwork]) * _se[outwork]
.02833683
* Delta Method: SE age rate ratio
. di
exp( _b[age]) * _se[age]
.00085643
The header statistics of the exponentiated model are identical to the standard
table of coefficients. Using R, we can obtain exponentiated coefficients by
running exp(coef(poi1)).
The incidence rate ratio (IRR) indicates the ratio of the rate of counts
between two ascending contiguous levels of the response. For this example,
we interpret the IRRs for each predictor as in Table 2.10.
61
Other ways of expressing the same thing as in the table may also be given.
In particular, rate ratios can be expressed as probabilities or likelihoods in
the sense that we can also interpret outwork as
Out-of-work patients were some one-and-a-half times more likely (or
more probable) to visit a physician in 1984 than were working patients.
If we think of our data as a random sample from a greater population of
data that can be described (or was generated) by a distribution with specific
unknown parameters, it is possible to interpret outwork as:
Out-of-work patients are about one-and-a-half times more likely to
visit a doctor than are working patients;
or
Those with a job see a doctor half as much as those who are out of
work.
For the predictor age, we can express the rate ratio as:
For each year older, Germans are likely to see a doctor some two-anda-half percent more often.
Most analysts prefer to exponentiate coefficients and interpret parameter
estimates as rate ratios. Always remember, though, that such an interpretation
always includes a comparison of two levels of a predictor. For a binary
62
POISSON REGRESSION
Using R, we may obtain the same results by running the code in Table 2.11
in the R script editor.
63
a particular type of product occur each month over the course of a year. The
rate of counts, , is calculated as the number of events counted divided by
the period of time that counting occurs, and likewise for counts per area. We
may count 10 events occurring in area A and only 7 in area B, but if area A
is half the size of area B, the rate of counts of events is greater in area B. The
same is the case for volumes of space: how many supernovae are counted in
an area of 100 cubic light-years in our Milky Way galaxy. We may have to
adjust over both time and area, or time and volume, but most studies do not
entail such a double adjustment.
Statisticians use an offset with a model to adjust for counts of events
over time periods, areas, and volumes. The model is sometimes referred
to as a proportional intensity model.
We briefly addressed the rate parameterization of the Poisson model in
Chapter 1. Although is sometimes said to be an intensity or rate parameter,
it is such only when thought of in conjunction with a constant coefficient, t.
The rate parameterization of the Poisson PDF can be expressed as
e t (t) y
(2.2)
f (y; ) =
y!
where t represents the length of time, or exposure, during which events or
counts uniformly occur. t can also be thought of as an area or volume in which
events uniformly occur, each associated with a specific count. For instance,
when using a Poisson model with disease data, t* can be considered the rate
of disease incidence in specified geographic areas, each of which may differ
from other areas in population. Likewise, the incidence rate of hospitalized
bacterial pneumonia patients can be compared across counties within the
state. A count of such hospitalizations divided by the population size of
the county, or by the number of total hospitalizations for all diseases, results
in the incidence rate ratio (IRR) for that county. When t = 1, the model is
understood to apply to individual counts without a consideration of size.
Technically, basic Poisson distribution assumes that each count in the distribution occurs over a small interval of time so small that it consists of a
single count. Size or period is not a consideration. If the periods, areas, or
volumes in which events occur are the same for the entire study, an offset
is not needed. On the other hand, where unequal periods of time, area, or
volume (TAV) occur in the model, an offset must be given. The key notion
is that events can be considered as entering or being in an area or period of
time independently of other events. They are uniformly distributed in each t.
64
POISSON REGRESSION
(2.3)
,
t
= t exp(x)
(2.4)
65
66
POISSON REGRESSION
No. of obs
15
Optimization
: ML
Deviance
10.93195914
Residual df
Scale parameter
(1/df) Deviance
=
=
=
9
1
1.214662
Pearson
12.60791065
(1/df) Pearson
AIC
=
=
1.400879
4.93278
Log likelihood
= -30.99584752
BIC
= -13.44049
-----------------------------------------------------------------------|
OIM
die |
IRR
Std. Err.
z
P|z|
[95% Conf. Interval]
-----------+-----------------------------------------------------------anterior | 1.963766
.3133595
4.23
0.000
1.436359
2.684828
hcabg |
kk2 |
kk3 |
1.937465
2.464633
3.044349
kk4 |
_cons |
ln(cases) |
12.33746
.0170813
1
.6329708
.4247842
.7651196
2.02
5.23
4.43
0.043
0.000
0.000
1.021282
1.75811
1.86023
3.675546
3.455083
4.982213
3.384215
9.16
.0024923 -27.89
(exposure)
0.000
0.000
7.206717
.0128329
21.12096
.0227362
-----------------------------------------------------------------------. abic
AIC Statistic
=
4.93278
AIC*n
= 73.991692
BIC Statistic
5.566187
BIC(Stata) = 78.239998
The dispersion at first may appear as relatively low at 1.40, but given a total
observation base of 5388, the added 40% overdispersion may represent a lack
of model fit. We delay this discussion until later in this chapter.
2.7 PREDICTION
We have previously described how predicted Poisson probabilities may be
generated. However, one of the foremost uses of prediction rests with the
ability of predicting expected counts based on predictor values. The values
do not have to be the same as those that actually were used to estimate the
model parameters. For example, using the German health data example, we
67
2.7 Prediction
may calculate the expected number of doctor visits for a working 50-year-old
patient based on the table of coefficient values displayed as
glm docvis outwork age, nolog fam(poi) nohead
-----------------------------------------------------------------------|
OIM
docvis |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-----------+-----------------------------------------------------------outwork |
age |
_cons |
.4079314
.0220842
-.033517
.0188447
.0008377
.0391815
21.65
26.36
-0.86
0.000
0.000
0.392
.3709965
.0204423
-.1103113
.4448663
.0237261
.0432773
------------------------------------------------------------------------
Given that a working patient has an outwork value of 0 and coefficient values
are signified as _b[ ], we may obtain the linear predictor for a 50-year-old
working patient as
. di _b[outwork]*0 + _b[age]*50 - .033517
1.0706929
68
POISSON REGRESSION
R
myglm - glm(docvis ~ outwork + age, family=poisson, data=rwm1984)
lpred - predict(myglm, newdata=rwm1984, type="link", se.fit=TRUE
up - lpred + 1.96*lpred$se.fit); lo - lpred - 1.96*lprd$se.fit)
eta - lpred$fit ; mu - myglm$family$linkinv(eta)
upci - myglm$family$link(up); loci - myglm$family$link(lo)
In the next chapter, we extend the analysis of predictions to making comparisons between observed and predicted, or expected, counts.
dictors.
I believe that these statistical methods are important to research in other
disciplines as well, so it is wise to be aware of what they do.
A marginal effect relates a continuous predictor to the predicted probability
of the response variable. Other predictors in the model are held at their mean,
TABLE 2.14. R: Marginal Effects at Mean
==============================================
library(COUNT)
data(rwm5yr); rwm1984 - subset(rwm5yr, year==1984)
summary(pmem - glm(docvis ~ outwork + age, family=poisson,
data=rwm1984))
mout - mean(rwm1984$outwork); mage - mean(rwm1984$age)
xb - coef(pmem)[1] + coef(pmem)[2]*mout + coef(pmem)[3]*mage
dfdxb - exp(xb) * coef(pmem)[3]
mean(dfdxb)
==============================================
mean(dfdxb)
[1] 0.06552846
69
Number of obs
Model VCE
: OIM
Expression
3874
.3665462 (mean)
age
=
43.99587 (mean)
-----------------------------------------------------------------------|
Delta-method
|
dy/dx
Std. Err.
z
P|z|
[95% Conf. Interval]
-----------+-----------------------------------------------------------age | .0655285
.0024232
27.04
0.000
.060779
.0702779
------------------------------------------------------------------------
70
POISSON REGRESSION
Or:
At the average of age, an extra year of age is associated with a 0.066
increase in doctor visits per year when outwork is set at its mean value.
There is no standard R function yet that calculates marginal effects at the
mean, or average marginal effects. We must therefore write R script to determine the results (see Table 2.14).
mean(dfdxb)
[1] 0.06552846
This is the same value obtained using the Stata margins command.
Number of obs
3874
: OIM
Expression
: Predicted mean docvis, predict()
dy/dx w.r.t. : age
-----------------------------------------------------------------------|
Delta-method
|
dy/dx
Std. Err.
z
P|z|
[95% Conf. Interval]
----+------------------------------------------------------------------age |
.0698497
.0027237
25.64
0.000
.0645113
.0751881
------------------------------------------------------------------------
71
Using R, we need to execute the following code. Note that the calculated
statistic is identical to what we calculated using the margins command:
R CODE
mean(rwm1984$docvis) * coef(pmem)[3]
age
0.06984968
fam(poi)
Number of obs
3874
Expression
: Predicted mean docvis, predict()
dy/dx w.r.t. : 1.outwork
at
: 0.outwork
1.outwork
age
=
=
=
.6334538 (mean)
.3665462 (mean)
43.99587 (mean)
-----------------------------------------------------------------------|
Delta-method
|
dy/dx
Std. Err.
z
P|z|
[95% Conf. Interval]
--------+--------------------------------------------------------------1.outwork | 1.287021 .0625526
20.58
0.000
1.16442
1.409622
-----------------------------------------------------------------------Note: dy/dx for factor levels is the discrete change from the base level.
R CODE
summary(pmem - glm(docvis ~ outwork + age, family=poisson,
data=rwm1984))
mu0 - exp(pmem$coef[1] + pmem$coef[3]*mage)
72
POISSON REGRESSION
Average partial effects or discrete change uses the same method as for average
marginal effects, except for the factoring of outwork:
STATA CODE
. margins, dydx(outwork)
Average marginal effects
Number of obs
3874
Model VCE
: OIM
Expression
: Predicted mean docvis, predict()
dy/dx w.r.t. : 1.outwork
-----------------------------------------------------------------------|
Delta-method
|
dy/dx
Std. Err.
z
P|z|
[95% Conf. Interval]
----------+------------------------------------------------------------1.outwork | 1.327171
.0635453
20.89
0.000
1.202624
1.451718
-----------------------------------------------------------------------Note: dy/dx for factor levels is the discrete change from the base level.
R CODE
summary(pmem - glm(docvis ~ outwork + age, family=poisson,
data=rwm1984))
bout = coef(pmem)[2]
mu = fitted.values(pmem)
xb = pmem$linear.predictors
pe_out = 0
pe_out = ifelse(rwm1984$outwork == 0, exp(xb + bout)-exp(xb), NA)
pe_out = ifelse(rwm1984$outwork == 1, exp(xb)-exp(xb-bout),pe_out)
mean(pe_out)
[1] 1.327171
There are several other ways to relate predictors and the response; for example, as elasticities and semielasticities. If you intend to use marginal effects
with your modeling project, I suggest you check the possibility of parameterizing the relationship of continuous predictors to the response as elasticities
or even semielasticities. See Hilbe (2011) for a more extensive analysis of
marginal and partial effects, and alternative statistics.
2.9 Summary
73
2.9 SUMMARY
We have covered quite a bit of statistical material in this chapter, starting
from the distributional assumptions on which the Poisson model is based
and then identifying and testing for apparent overdispersion as distinct from
real overdispersion. We then discussed constructing and interpreting Poisson
models. A true synthetic Poisson model was created, identifying aspects of
the model output that characterize well-fitted models. This discussion was
followed by modeling a poorly fitted real-data Poisson model, also evaluating its characteristics. Finally, we explained the meaning of two varieties of
marginal and partial effects and described how they are created as well as
how they are interpreted.
A Poisson model is usually a good place to start when modeling count
data, which may have a number of explanatory predictors. The following
chapter expands on the meaning of overdispersion and testing for it. We also
address standard fit tests as they relate to count models. By the conclusion of
the chapter, you should have a good understanding of the basics of modeling
count data. What will remain are models that have been designed to adjust for
specific types of extradispersion; for instance, negative binomial regression,
Poisson inverse Gaussian regression, and so forth.
CHAPTER 3
Testing Overdispersion
is overdispersed?
75
n
{L (yi ; yi ) L(i ; yi )}
(3.1)
i=1
n
yi log () i log (yi !)
(3.2)
i=1
76
TESTING OVERDISPERSION
Note that log(y!), the normalization term that provides for the function to
sum to 1, cancels. In fact, the normalization term for every GLM model
PDF cancels when calculating the respective deviance statistic. Since the
normalization terms can get rather complicated for some distributions, we
can see why the deviance was preferred as the basis of algorithm convergence
especially in the 1970s and 1980s, when computing speed was slow.
The deviance goodness-of-fit (GOF) test is based on the view that the
deviance is distributed as Chi2. The Chi2 distribution has two parameters
the mean and scale. For the deviance GOF, this is the deviance statistic and
residual degrees of freedom, which is the number of observations less predictors in the model, including the intercept and interactions. If the resulting
Chi2 p-value is less than 0.05, the model is considered well fit. Employing the
deviance GOF test on the modeled German health data yields the following
model:1
. glm docvis outwork age, fam(poi) nolog // non-numeric part of header deleted
Generalized linear models
No. of obs
3874
Optimization
Residual df
3871
Scale parameter
6.249124
: ML
Deviance
24190.35807
(1/df) Deviance
Pearson
43909.61512
(1/df) Pearson
11.34322
AIC
8.074027
BIC
-7792.01
Log likelihood
= -15636.39048
---------------------------------------------------------------------------|
docvis |
OIM
Coef.
Std. Err.
P|z|
----------+----------------------------------------------------------------outwork |
.4079314
.0188447
21.65
0.000
.3709965
.4448663
age |
.0220842
.0008377
26.36
0.000
.0204423
.0237261
_cons |
-.033517
.0391815
-0.86
0.392
-.1103113
.0432773
---------------------------------------------------------------------------.(scalar dev=e(deviance)
. scalar df=e(df)
. di " deviance GOF "" D="dev " df="df " p-value= " chiprob(df, dev)
deviance GOF
For pedagogical purposes, I am only using a single binary and continuous predictor.
There are more predictors in the data that may significantly contribute to the fit
of the model. In fact, the added predictors do not eradicate the overdispersion
inherent in the data. Later, we will further discuss the strategy of adding predictors
to effect an optimally fitted model.
77
Coefficients:
(Intercept)
outwork
age
-0.03352
0.40793
0.02208
3871 Residual
25790
AIC: 31280
dev-deviance(Model); df-df.residual(Model)
p_value-1-pchisq(dev,df)
print(matrix(c("Deviance GOF"," ","D",round(dev,4),"df",df, "p_value",
p_value), ncol=2))
[,1]
[,2]
"24190.3581"
[3,] "df"
"3871"
[4,] "p-value"
"0"
78
TESTING OVERDISPERSION
the data. Recall that a p-value of 0.05 accepts that we will be mistaken 1 out
of 20 times on average.
If the value of D is very large, then we can generally be safe in rejecting the
goodness of the model fit:
Deviance is in effect a measure of the distance between the most full or
complete (saturated) model we can fit and the proposed model we are
testing for fit.
The smaller the distance, or deviance, between them, the better the fit. The
test statistic evaluates whether the value of the deviance, for a specific size
of model, is close enough to that of the saturated model that it cannot be
rejected as not fitting.
Many older texts on generalized linear models use a deviance test to compare nested models. A nested model is one that has one or more predictors
dropped from the main model. The p-value for the difference in deviance
between the two models tells us whether the predictors we excluded are statistically important to the modeling of the data. A p-value under 0.05 advises
us that they are important. Parameters may be dropped as well, comparing
similar models that have and do not have that parameter. But the test is usually given to models with and without predictors. In either case, the smaller
model is said to be nested within the larger model. We can also view the test
as evaluating if we should add a predictor to the model. If the difference in
the deviance statistic is minimal when adding a predictor or predictors, we
may conclude that they will contribute little of statistical value to the model
and therefore should not be incorporated into the final model. The logic of
this type of reasoning is well founded, but there is no well-established p-value
to quantify the significance of the differences in deviances. There is, though,
for comparing log-likelihoods, and it is called a likelihood ratio test.
Many analysts use the Pearson Chi2 statistic, Pearson Chi2/rdof, in place of
the deviance GOF statistic. We will not consider using the Pearson Chi2 statistic for a GOF test, though. It appears to produce biased results. However, its
true value comes from its use in defining overdispersion, in particular Poisson overdispersion. The reason for this is that the Pearson Chi2 statistic is
nothing other than the squared residuals weighted or adjusted by the model
variance, and summed across all observations in the model:
n
(yi i )2
=
V (i )
i=1
2
(3.4)
79
The Pearson statistic may also be viewed as the sum of the squared Pearson
residuals as defined later in this chapter. The dispersion statistic of the Poisson
model is defined as the Pearson Chi2 statistic divided by the residual degrees
of freedom. To reiterate from the previous discussion, the residual degrees
of freedom is the number of observations in the model less the number of
predictors, including the intercept and interactions.
The sum of squared residuals is an absolute raw measure (squaring eliminates negative values) of the difference in observed versus predicted model
counts, adjusted by both the variance and size of the model. Adjustment
is made by dividing the squared residuals by the product of the variance
and residual degrees of freedom. The result is the dispersion statistic, which
should have a value of 1 if there is no unaccounted variability in the model
other than what we expect based on the variance function. Values greater than
1 indicate an overdispersed model; values less than 1 are underdispersed.
We discuss the dispersion statistic in much greater detail throughout this
book.
Some statisticians have used the deviance dispersion as the basis for scaling
standard errors. However, as we will find in this book, simulation studies
indicate that
when scaling standard errors, the Pearson dispersion statistic better
captures the excess variability in the data, adjusting model standard
errors in such a manner as to reflect what the standard errors would
be if the excess variability were not present in the data.
In R, Poisson models with scaled standard errors are called quasipoisson:
A Pearson dispersion in excess of 1.0 indicates likely Poisson model
overdispersion. Whether the overdispersion is significant depends on
(1) the value of the dispersion statistic, (2) the number of observations
in the model, and (3) the structure of the data; for example, if the data
are highly unbalanced.
A command exists in Stata to calculate the deviance and Pearson Chi2 statistics and associated p-values for the associated GOF tests. However, it must
be used following the poisson command, which is a full maximum likelihood estimation algorithm that can be used for Poisson models in place
80
TESTING OVERDISPERSION
of glm. Ill run the poisson command quietly (no displayed results), then
use the estat gof command to produce both GOF tests:
STATA CODE
. qui poisson docvis outwork age
. estat gof
Deviance goodness-of-fit =
Prob chi2(3871)
=
24190.36
0.0000
=
=
43909.61
0.0000
Pearson goodness-of-fit
Prob chi2(3871)
# calc p-value
# calc p-vl
81
For example, to load the function from where it is saved in the c:\\Rfiles
directory or folder, type the following:
source("c:\\Rfiles\\P__disp.r")
P__disp(mymod)
CODE
library(COUNT)
data(rwm5yr)
rwm1984 - subset(rwm5yr, year==1984)
mymod -glm(docvis ~ outwork + age, family=poisson, data=rwm1984)
P__disp(mymod)
Pearson Chi2 =
Dispersion
=
43909.62
11.34322
82
TESTING OVERDISPERSION
83
No. of obs
50000
Optimization
: ML
Deviance
73496.70743
Residual df
=
Scale parameter =
(1/df) Deviance =
49997
1
1.470022
Pearson
68655.76474
Log likelihood
= -102902.8168
(1/df) Pearson
AIC
BIC
= 1.373198
= 4.116233
= -467459.7
-----------------------------------------------------------------------|
OIM
py |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-------+---------------------------------------------------------------x1 |
.7513276
.0090462
83.05
0.000
.7335973
.7690579
x3 |
.4963757
.0090022
55.14
0.000
.4787317
.5140196
_cons |
.444481
.0075083
59.20
0.000
.429765
.4591969
------------------------------------------------------------------------
The predictor coefficients are close to their true values; that is, close to the
Poisson distribution parameter values we specified when creating the data.
The intercept is about half its true value. However, we would not normally
know that the value of the intercept differs from the parameters used in
creating the data. The fact that the predictor p-values all approximate 0.000
appears to indicate that the model is well fitted. Many researchers are mislead
in this manner.
84
TESTING OVERDISPERSION
85
(y )2 y
(3.5)
The test is post hoc (i.e., it is executed after the data have been modeled). Using
the rwm1984 data as earlier defined, we first model the data using Poisson
86
TESTING OVERDISPERSION
.4079314
.0220842
-.033517
.0188447
.0008377
.0391815
21.65
26.36
-0.86
0.000
0.000
0.392
.3709965
.0204423
-.1103113
.4448663
.0237261
.0432773
-----------------------------------------------------------------------. predict mu
. gen double z=((docvis-mu)2-docvis)/ (mu*sqrt(2))
. regress z
Source |
SS
df
MS
-----------+------------------------------
Number of obs =
F( 0, 3873) =
3874
0.00
Model |
0
0
.
Residual |
16012865 3873
4134.4862
-----------+------------------------------
Prob F
=
R-squared
=
Adj R-squared =
.
0.0000
0.0000
Total |
16012865 3873
4134.4862
Root MSE
=
64.3
-----------------------------------------------------------------------z |
Coef.
Std. Err.
t
P|t|
[95% Conf. Interval]
-----------+-----------------------------------------------------------_cons | 7.30804
1.033073
7.07
0.000
5.282622
9.333459
------------------------------------------------------------------------
The z score test is 7.3, with a t-probability of 0.0005. z tests the hypothesis
that the Poisson model is overdispersed; it evaluates whether the data are
Poisson or negative binomial. This example indicates that the hypothesis of
no overdispersion is rejected (i.e., that it is likely that real overdispersion
exists in the data).
The test is based on two assumptions:
r The data set on which the test is used is large.
r z is t-distributed.
The preceding z-test result indicates that there is overdispersion in the data.
Other z tests have been constructed, and you will find a variety of alternatives
in the literature. The most popular alternative is perhaps an auxiliary regression test, for whichthe equation is the same as equation (3.5) but without
the adjustment of 2, or simply 1.4142 (which is easy to remember using a
87
n
2
i
i=1 i n y
2
n
=
2 i=1
i2
(3.6)
with one degree of freedom. Again, using Stata commands to calculate the
statistic, we have
STATA CODE
. summ docvis, meanonly
With one degree of freedom, the test appears to be significant the hypothesis
of no overdispersion is again rejected:
A Chi2 statistic of 3.84 has a p-value of 0.05.
Higher values of Chi2 result in lower p-values.
For example, a Chi2 of 7.5 has a p of 0.00617.
The preceding inset should be memorized. Chi2 tests permeate statistics.2
2
88
TESTING OVERDISPERSION
Count of Patients
500
1000
1500
10
15
Number Visits to Physician
Expected days
20
25
Observed days
. di normal(1.96)
.9750021
pnorm(1.96)
[1] 0.9750021
. di -invnormal(.025)
1.959964
-qnorm(.025)
[1] 1.959964
89
poissonp(mu, i)
/* (docvis==i) */
90
TESTING OVERDISPERSION
# Pearson Chi2
# dispersion statistic
poi1$aic / (poi1$df.null+1)
# AIC/n
exp(coef(poi1))
# IRR
exp(coef(poi1))*sqrt(diag(vcov(poi1)))
# delta method
exp(confint.default(poi1))
# CI of IRR
modelfit(poi1)
sd(rwm1984$docvis)2
# observed variance
xbp - predict(poi1)
mup - exp(xbp)
mean(mup)
# mean docvis
observe
expect
diff |
|----------------------------------------|
1. |
1611
264.7923
2. |
448
627.102
1346.208 |
-179.102 |
3. |
440
796.0545
-356.0545 |
4. |
353
731.5981
-378.5981 |
5. |
213
554.6024
-341.6024 |
|----------------------------------------|
6. |
168
373.3741
-205.3741 |
7. |
141
232.9783
-91.97826 |
8. |
60
137.5237
-77.52374 |
9. |
85
77.19441
7.805588 |
10. |
47
41.09616
5.903843 |
20. |
19
.
6
.0028615
5.997139 |
|----------------------------------------|
21. |
20
.0008157
3.999184 |
91
# observed variance
[1] 39.38
mean(mup)
# expected variance:
mean=variance
[1] 3.16
rbind(obs=table(rwm1984$docvis)[1:18], exp = round(sapply(0:17,
function(x) sum(dpois(x, fitted(poi1))))))
0
9 10 11 12 13 14 15 16 17
60 85 47 45 33 43 25 13 21 13 11
For the Poisson model, the observed variance is 39.38, whereas the expected
variance is 3.16, indicating substantial overdispersion. In addition, note the
vastly greater number of observed zero counts compared with what we expect
given the mean of docvis. We will see why we need to explicitly adjust the
model as a result of this fact. Without adjustment by the fitting of explanatory predictors, we should have expected 164 of the 3874 patients not to
visit a physician during 1984, based on the Poisson distribution. A fitted
model predicts that 265 patients failed to visit a physician. In fact, 1611
(41.58%) patients never went to a physician during the year. There are far
more zero visits in the data than allowed using the Poisson model, which
without adjustment assumes that only 164 (4.2%) patients, given a mean
of 3.16 days, did not need to visit a physician. Another model needs to
be used to better fit the number of physician visits during 1984. Again,
an excess of zero counts is a common reason for having an overdispersed
Poisson model. The Chi2 test is sometimes used as a fit test to determine
whether the predicted number of counts come from the same population as
the observed values for each count, from 0 to a specified upper value. Here
the null hypothesis cannot be rejected and the two cannot be significantly
separated.
92
TESTING OVERDISPERSION
Using Stata, we can determine the observed and expected values for
0 counts as follows:
STATA CODE
. count
3874
. count if docvis==0
1611
. di 1611/3874
.41584925
// number of 0 counts
//
predicted
(3.7)
with the inverse square root of the dispersion statistic. Scaling by the Pearson
dispersion statistic entails estimating the model, abstracting the dispersion
statistic, and multiplying the model standard errors by the square root of
the dispersion, and then running one additional iteration of the algorithm,
but as
= (X Wd X)1 X Wd z
(3.8)
Scaling in effect adjusts the model standard errors to the value that would
have been calculated if the dispersion statistic had originally been 1.0. The
Pearson-based dispersion statistic should always be used to assess count
93
Response:
Predictors:
los
hmo
length of stay
1 = member of a Health Maintenance Organization
(HMO); 0 = private pay
1 = identifies as white; 0 = other
1 = elective admission (reference level)
1 = urgent admission
1 = emergency admission
white
type1
type2
type3
No. of obs
1495
Optimization
: ML
Deviance
8142.666001
Residual df
=
Scale parameter =
(1/df) Deviance =
1490
1
5.464877
Pearson
9327.983215
Log likelihood
= -6928.907786
(1/df) Pearson
AIC
BIC
= 6.260391
= 9.276131
= -2749.057
-----------------------------------------------------------------------|
OIM
los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-------+---------------------------------------------------------------hmo | -.0715493
.023944
-2.99
0.003
-.1184786
-.02462
white |
type2 |
type3 |
-.153871
.2216518
.7094767
.0274128
.0210519
.026136
-5.61
10.53
27.15
0.000
0.000
0.000
-.2075991
.1803908
.6582512
-.100143
.2629127
.7607022
_cons |
2.332933
.0272082
85.74
0.000
2.279606
2.38626
------------------------------------------------------------------------
94
TESTING OVERDISPERSION
standard error for hmo of .023944, we may calculate a scaled standard error as
sqrt(6.260391) * .023944 = .05990974. Scaled standard errors are calculated
for this model as
. glm los hmo white type2 type3, fam(poi) nolog scale(x2)
Generalized linear models
No. of obs
1495
Optimization
Residual df
1490
: ML
Scale parameter =
1
5.464877
Deviance
8142.666001
(1/df) Deviance =
Pearson
9327.983215
(1/df) Pearson
6.260391
AIC
9.276131
BIC
= -2749.057
= -6928.907786
Log likelihood
-------------------------------------------------------------------------|
los |
OIM
Coef.
Std. Err.
P|z|
[95% Conf.
Interval]
-------+-----------------------------------------------------------------hmo |
-.0715493
.0599097
-1.19
0.232
-.1889701
.0458715
white |
-.153871
.0685889
-2.24
0.025
-.2883028
-.0194393
type2 |
.2216518
.0526735
4.21
0.000
.1184137
.3248899
type3 |
.7094767
.0653942
10.85
0.000
.5813064
.837647
_cons |
2.332933
.0680769
34.27
0.000
2.199505
2.466361
quasipoisson
=======================================================
summary(poiql - glm(los ~ hmo + white + hmo + factor(type),
family=quasipoisson, data=medpar))
=======================================================
95
data=medpar))
confint(poi)
# profile confidence interval
pr - sum(residuals(poi,type=pearson)2)
# Pearson statistic
# dispersion
# model SE
# OR
poiQL - glm(los ~ hmo + white + factor(type), family=quasipoisson,
data=medpar)
coef(poiQL); confint(poiQL)
modelfit(poiQL)
# AIC,BIC statistics
statistics. The eform option generates such output using Stata glm command,
and the irr option with the poisson and nbreg commands. The R function
exp(coef(model_name)) is used following glm and glm.nb. A Stata table of
rate ratios, with scaled standard errors, can be displayed using the code
. glm los hmo white type2 type3, nolog fam(poi) eform scale(x2) nohead
----------------------------------------------------------------------------|
los |
OIM
IRR
Std. Err.
P|z|
[95% Conf.
Interval]
-------+--------------------------------------------------------------------hmo |
.9309504
.0557729
= -1.19
0.232
.8278113
1.04694
white |
.8573826
.0588069
-2.24
0.025
.7495346
.9807484
type2 |
1.248137
.0657437
4.21
0.000
1.12571
1.383878
type3 |
2.032927
.1329416
10.85
0.000
1.788373
2.310923
Recall that I mentioned earlier that the standard error of the incidence rate
ratio (IRR) is not directly based on a model variancecovariance matrix.
Rather, standard errors for IRRs are calculated using the delta method. Applied
to this situation, the scaled standard errors of the IRR are calculated by
multiplying the IRR by the scaled model SE. For hmo, we have .9309504 *
.0599097 = .05577296, which we observe is correct.
It needs to be emphasized that the parameter estimates remain unaffected
when standard errors are scaled. When a Poisson (or negative binomial)
96
TESTING OVERDISPERSION
97
backward to the implied log-likelihood function. Since this implied loglikelihood is not derived from a probability function, we call it quasilikelihood or quasi-log-likelihood instead. The quasi-likelihood, or the
derived quasi-deviance function, is then used, for example, in an IRLS algorithm to estimate parameters just as for GLMs when the mean and variance
functions are those from a specific member of the exponential family.
The quasi-likelihood is defined as
Q (y; ) =
y
y
V ()
(3.9)
=
=
(IRLS EIM)
1300.664128
1490.00008
Scale parameter =
(1/df) Deviance =
(1/df) Pearson =
BIC
6.260391
.8729289
1
= -9591.059
98
TESTING OVERDISPERSION
-----------------------------------------------------------------------|
EIM
los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-------------+---------------------------------------------------------hmo | -.0715493
.0095696
-7.48
0.000
-.0903054
-.0527932
white |
-.153871
.010956
-14.04
0.000
-.1753444
-.1323977
type2 |
type3 |
_cons |
.2216518
.7094767
2.332933
.0084138
.0104457
.0108742
26.34
67.92
214.54
0.000
0.000
0.000
.2051611
.6890035
2.31162
.2381424
.7299499
2.354246
------------------------------------------------------------------------
which is the standard error of hmo for the quasi-likelihood Poisson model,
as we just observed. Note that the Pearson-dispersion value of this QL model
is now 1.0. Also, compare to scaling, for which the model standard error is
multiplied by the square root of the dispersion.
Compare the summary statistics of this model with the standard Poisson
model applied to the same data and the model scaled by the dispersion. The
deviance statistic is substantially less than that of the standard and scaled models, which will always have the same values except for the standard errors and
related statistics (p, z, CI). The deviance of the standard Poisson is 8242.666,
and for the quasi-likelihood model its 1300.664, indicating a much better fit.
By fit in this case we mean more accurate predictor standard errors and pvalues. The quasi-likelihood model is not a true likelihood model, and hence
the standard errors are not based on a correct model-based Hessian matrix.
(R code for quasi-likelihood Poisson standard errors is given in Table 3.10.)
TABLE 3.10. R: Quasi-likelihood Poisson Standard Errors
poiQL - glm(los ~ hmo+white+type2+type3, family=poisson, data=medpar)
summary(poiQL)
pr -sum(residuals(poiQL, type="pearson")2 )
disp - pr/poiQL$df.residual
# Pearson dispersion
se -sqrt(diag(vcov(poiQL)))
QLse - se/sqrt(disp); QLse
99
Note that with the Stata command, the irls option is required to use this
method of adjusting standard errors, which, again, are much tighter than
scaled standard errors.
This method is rarely used for making adjustments, but the method of
multiplying the Poisson variance by some value runs through much of this
book. It is important to keep it in mind.
100
TESTING OVERDISPERSION
# Clustering
poi - glm(los ~ hmo + white + factor(type), family=poisson, data=medpar)
library(haplo.ccs)
sandcov(poi, medpar$provnum)
sqrt(diag(sandcov(poi, medpar$provnum)))
No. of obs
1495
Optimization
: ML
Deviance
8142.666001
Residual df
=
Scale parameter =
(1/df) Deviance =
1490
1
5.464877
Pearson
9327.983215
(1/df) Pearson
AIC
BIC
= 6.260391
= 9.276131
= -2749.057
-----------------------------------------------------------------------|
Robust
los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
------+----------------------------------------------------------------hmo | -.0715493
.0517323
-1.38
0.167
-.1729427
.0298441
white |
-.153871
.0833013
-1.85
0.065
-.3171386
.0093965
101
|
type |
2 |
.2216518
.0528824
4.19
0.000
.1180042
.3252993
3 |
.7094767
.1158289
6.13
0.000
.4824562
.9364972
_cons |
2.332933
.0787856
29.61
0.000
2.178516
2.48735
------------------------------------------------------------------------
No. of obs
Residual df
=
=
1495
1490
102
TESTING OVERDISPERSION
=
=
Deviance
Pearson
Scale parameter =
(1/df) Deviance =
(1/df) Pearson =
8142.666001
9327.983215
1
5.464877
6.260391
AIC
= 9.276131
Log pseudolikelihood = -6928.907786
BIC
= -2749.057
(Std. Err. adjusted for 54 clusters in provnum)
-----------------------------------------------------------------------|
Robust
los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
------+----------------------------------------------------------------hmo | -.0715493
.0527299
-1.36
0.175
-.1748979
.0317993
white |
type |
2 |
-.153871
.0729999
-2.11
0.035
-.2969482
-.0107939
.2216518
.0609139
3.64
0.000
.1022626
.3410409
3 |
.7094767
.202999
3.49
0.000
.311606
1.107347
_cons |
2.332933
.0669193
34.86
0.000
2.201774
2.464093
------------------------------------------------------------------------
R CODE
pnorm(-1.96)*2
[1] 0.04999579
nolog nohead
-----------------------------------------------------------------------|
OIM
los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-------+---------------------------------------------------------------hmo | -.0715493
.023944
-2.99
0.003
-.1184786
-.02462
103
white |
type |
2 |
-.153871
.0274128
-5.61
0.000
-.2075991
-.100143
.2216518
.0210519
10.53
0.000
.1803908
.2629127
3 |
.7094767
.026136
27.15
0.000
.6582512
.7607022
_cons |
2.332933
.0272082
85.74
0.000
2.279606
2.38626
----------------------------------------------------------------------- summary(poi1 - glm(los ~ hmo+white+factor(type), family=poisson,
data=medpar))
104
TESTING OVERDISPERSION
57
52
442 |
765 |
499
817
-----------+----------------------+---------Total |
109
1,207 |
1,316
I do not show the calculations here, but the standard error for the risk of death
based on age for those in the Titanic disaster is .134. Regardless of whether
we use model or empirical standard errors with a Poisson regression, though
105
(or negative binomial), the logic of the table analysis given here is the basis
for interpreting coefficients in terms of probabilities. For example, here we
may conclude that the likelihood or probability of death for adults is some
33% greater than it is for children. See Hilbe (2011) for a full estimation of
risk ratio and risk difference.
=
=
=
=
1495
1490
Scale parameter =
(1/df) Deviance =
(1/df) Pearson =
1
5.464877
6.260391
No. of obs
Residual df
8142.666001
9327.983215
AIC
= 9.276131
Log likelihood
= -6928.907786
BIC
= -2749.057
-----------------------------------------------------------------------|
Observed
Bootstrap
Normal-based
los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-------+---------------------------------------------------------------hmo |
white |
type2 |
-.0715493
-.153871
.2216518
.0527641
.0815228
.0536495
-1.36
-1.89
4.13
0.175
0.059
0.000
-.174965
-.3136529
.1165006
.0318664
.0059108
.3268029
type3 |
.7094767
.1178019
6.02
0.000
.4785893
.9403641
_cons |
2.332933
.077869
29.96
0.000
2.180313
2.485554
------------------------------------------------------------------------
106
TESTING OVERDISPERSION
Standard errors indicate that hmo and white are problematic. Bootstrapping
may be done on parts of a model other than standard errors. With respect to
standard errors, though, bootstrapping has become a popular way of attempting to discover optimal standard errors for model coefficients. In fact, bootstrapping and using robust or empirical standard errors is an excellent way
to determine whether a model is overdispersed, or extradispersed in general:
If the values of bootstrapped or robust standard errors differ substantially from model standard errors, this is evidence that the count model
is extradispersed. Use the bootstrapped or robust standard errors for
reporting your model, but check for reasons why the data are overdispersed and identify an appropriate model to estimate parameters.
A note should be made regarding profile likelihood confidence intervals.
Rs glm function uses them as the default for calculating confidence intervals. Stata has a user-authored logprof command for constructing profile
confidence intervals for logistic regression models only. Recall that I have
said that using the likelihood ratio test (discussed in Section 4.2) for determining the inclusion or exclusion of predictors in a model is preferred over
the standard Wald method, which is another way of saying regular predictor
p-values. The profile likelihood method is based on the likelihood ratio test
but is inverted to move from the p-value to the confidence interval rather than
have a confidence interval and then seek to determine whether it includes 0.
Except in R, the method is rarely used, and it has limited software support.
But it is an attractive method of presenting confidence intervals, and I suspect
that it will be more commonly used in the next decade. I refer you to Hilbe
and Robinson (2013) for a more detailed examination of profile likelihood,
together with R code for its implementation.
3.5 SUMMARY
Overdisperson is likely the foremost problem with count data. Overdispersion also occurs with grouped logistic regression, as well as grouped binomial models as a class. It also is a concern for ordered, partially ordered, and
unordered or multinomial regression models as well. But when a statistician
mentions that he or she is working with overdispersed data, the usual supposition is that they are count data. This is because most count models are
(Poisson) overdispersed (i.e., the variance of the count response variable is
greater than the mean).
3.5 Summary
107
In this chapter, we looked at what overdispersion means and the difference between apparent and real overdispersion. We also discussed various
statistical tests that can be given a Poisson model to determine whether it is
overdispersed or underdispersed. I also provided an overview and code that
can be adapted to the data for determining the relationship between the count
values that are in fact observed and those that are predicted on the basis of
a statistical model. Ideally they should be close. A Chi2 test can be used to
determine how well they fit, but such a test is not as feasible when many of
the counts are skewed far to the right.
Finally, we discusssed strategies for handling overdispersion in particular
light overdispersion. A surprisingly good way to adjust for extradispersion
is to scale the standard errors by the square root of the Pearson dispersion,
which is known as the quasipoisson family in R. This is popular among R
users. Most specialists in count models, however, espouse using robust or
sandwich adjusted standard errors when modeling count data even as a
default. If there is no correlation in the data to be adjusted, the SEs of the
sandwich estimator will be the same as model SEs. Some statisticians prefer
to use bootstrapped SEs in place of sandwich SEs, but bootstrapping takes
longer and the results are typically the same as for the robust or sandwich
approaches.
CHAPTER 4
Assessment of Fit
r What is a boundary likelihood ratio test? How does it differ from a regular
r What are information criterion tests? How does one select the most appro-
109
Poisson
Rp = (y )/sqrt(V)
(y ) /
(4.2)
Poisson
Rd = sgn(y )*sqrt(deviance)
d
R = sign (y ) 2 y log y (y )
Divide residual by sqrt(1hat), which aims
to make the variance constant. hat = stdp2 * V
Rp
1h
Divide standardized residual by scale, .
Rp
(1 h)
(4.3)
Pearson
Deviance
Standardized
Poisson
Studentized
Poisson
StandardizedStudentized
Poisson
(4.4)
(4.5)
an age variable ranging from 25 to 64), the intercept will not equal the linear
predictor. In any case, each observation in the model has a linear predictor
value, x or . For members of the GLM family, including Poisson and
negative binomial regressions, a link function converts a linear predictor
to a fitted or predicted value. We have previously seen that for Poisson
models as well as for the traditional negative binomial, = ln(), so that
= exp() or, for full maximum likelihood models, = exp(x). Simply
put, exponentiating the linear predictor produces the fitted, predicted, or
expected value of the count response. The linear predictor and fit are essential
components of all residuals.
The basic or raw residual is defined as the difference between the observed
response and the predicted or fitted response. When y is used to identify the
response, y or is commonly used to characterize the fit. Hence
Raw residual = y y or y or y E(y)
Other standard residuals used in the analysis of count response models
include those in Table 4.1.
In the preceding formulae, we indicated the model variance function as V,
the hat matrix diagonal as hat or h, and the standard error of the prediction
110
ASSESSMENT OF FIT
as stdp. A scale value, , is user defined and is employed based on the type of
data being modeled.
We mentioned earlier that the Anscombe residual (Anscombe 1953) has
values close to those of the standardized deviance. There are times, however,
when this is not the case, and the Anscombe residual performs better than Rd .
Anscombe residuals attempt to normalize the residual so that heterogeneity
in the data, as well as outliers, become easily identifiable.
Anscombe residuals use the model variance function. The variance functions for the three primary count models, as well as the PIG model, are
Poisson
V=
Geometric V = (1 + )
NB2
V = + 2 or (1 + )
PIG
V = + 3
The geometric distribution is the negative binomial with the scale or dispersion parameter, , equal to 1. It is rarely estimated as a geometric model
since true geometric data would have a dispersion parameter value of 1 when
modeled using a negative binomial regression.
Anscombe defined the residual that later became known under his name
as
A (y) A ()
(4.6)
RA =
A () V
here A () = V 1/3 .
The calculated Anscombe residuals for the Poisson model are
Poisson: 3(y2/3 2/3 )/(21/6 )
and for the negative binomial
3
+ y)2/3 (1 + )2/3 + 3 y2/3 2/3
(1
1/6
2 2 +
(4.7)
(4.8)
The Anscombe negative binomial has also been calculated in terms of the
hypergeometrix2F1 function. See Hilbe (1993c) and Hardin and Hilbe (2012)
for a complete discussion.
y2/3 H (2/3, 1/3, 5/3,y/) 2/3 H (2/3, 1/3, 5/3, /)
(4.9)
(4.10)
111
The Pearson Chi2 statistic is calculated as the sum of the squared Pearson
residuals. Put in one line, we have
pchi2 - sum(residuals(pexp, type="pearson")2)
# Pearson Chi2
What are we looking for in graphing residuals for count models? We look for
two things:
r evidence of poor fit, and
r nonrandom patterns.
Patterns usually indicate overdispersion and/or misspecification. An alternative count model may be required:
r all model predictors.
Try including the excluded predictors, but they may need rescaling or are
part of an interaction term:
r a time variable (for longitudinal and time series count models).
112
ASSESSMENT OF FIT
43909.62
11.34322
mu - predict(rwm)
grd - par(mfrow = c(2,2))
plot(x=mu, y= rwm$docvis, main = "Response residuals")
plot(x=mu, y= presid, main = "Pearson residuals")
Figures not displayed
(4.11)
// full model
113
LR chi2(1) =
Prob chi2 =
Likelihood-ratio test
(Assumption: B nested in A)
715.98
0.0000
The test confirms that age is a significant predictor in the model and should
be retained. The likelihood ratio drop1 test is a useful likelihood ratio test
in which one predictor at a time is dropped and the models are checked
in turn for comparative goodness-of-fit. In Stata, a user-developed command
(Wang 2000) called lrdrop1 can be used following logit, logistic, and
poisson to find the predictors that together best fit the model. lrdrop1 can
rather easily be adapted for use with other regression models. It is particularly
useful when a model has quite a few potential predictors. lrdrop1 must be
installed using the command .ssc install lrdrop1 in order to use it
the first time:
STATA OUTPUT
. qui poisson docvis outwork age
. lrdrop1
Likelihood Ratio Tests: drop 1 term
poisson regression
number of obs = 3874
---------------------------------------------------------------------docvis
Df
Chi2
PChi2
-2*log ll
Res. Df
AIC
---------------------------------------------------------------------Original Model
-outwork
1
464.77
0.0000
31272.78
31737.55
3871
3870
31278.78
31741.55
-age
1
715.98
0.0000
31988.76
3870
31992.76
---------------------------------------------------------------------Terms dropped one at a time in turn.
R users are familiar with both lrtest and especially drop1 (see Table 4.2).
The latter uses the deviance statistic rather than the log-likelihood as the basis
for model comparison. Recall that the deviance is itself derived as a variety
of likelihood ratio test and is defined as 2{LLs LLm}, where LLm is the
log-likelihood of the model and LLs is the saturated log-likelihood. For LLs,
the response term, y, is substituted for each in the log-likelihood equation.
The p-value of 2.2e-16 indicates that outwork and age are significant predictors in the model.
The drop1 test found in R is a likelihood ratio test of each nested predictor
in a model. Deviance and AIC statistics are produced for each alternative.
114
ASSESSMENT OF FIT
Chisq Pr(Chisq)
3 -15636
2 -15994 -1 715.98
2.2e-16 ***
--Signif. codes:
drop1(poi1,test="Chisq")
Single term deletions
Model:
docvis ~ outwork + age
Df Deviance
none
AIC
LRT
Pr(Chi)
24190 31279
outwork
age
--Signif. codes:
The test is used by many analysts as a substitute for the standard likelihood
ratio test.
For the purposes of modeling counts, the foremost use of the test is when
it is used as a boundary test.
115
(4.12)
with L symbolizing the log-likelihood function. The resulting value is measured by an upper tail Chi2 distribution with one (1) degree of freedom.
Since the distribution being tested can go no lower than 0, that is the boundary. Only one half of the full distribution is used. Therefore the Chi2 test
is divided by 2. For a standard Chi2 test with no boundary, with one
degree of freedom, a BLR test statistic of 3.84 is at the .05 significance
level:
. di chi2tail(1,3.84)
.05004352
As a key to remember, given the value of 2 times the difference in loglikelihood values, if the difference in Poisson and negative binomial loglikelihoods is less than 1.352, the model is Poisson; if it is greater, the data
need to be modeled other than Poisson presumably negative binomial.
We will discover that there are quite a few count models that are nested
within another, for which a likelihood ratio test is appropriate (e.g., zerotruncated Poisson and zero-truncated negative binomial, censored Poisson
and censored negative binomial).
116
ASSESSMENT OF FIT
(4.13)
and the version with the main AIC terms divided by n, the number of observations in the model
AIC =
2L + 2k
= 2 (L k)/n
n
(4.14)
117
(4.15)
or
2k (k + 1)
(4.16)
nk1
where k is the number of parameters in the model. This statistic has also
been referred to as AICC for AIC-corrected, but I will use that designation for
equation (4.18). Note that AICFS has a greater penalty for additional parameters compared with the standard AIC statistic. Note also that AICFS AIC
for models with large numbers of observations. Hurvich and Tsai (1989) first
developed the finite-sample AIC for use with time series and autocorrelated
data; however, others have considered it preferable to the AIC for models
with noncorrelated data as well, particularly for models with many parameters, and/or for models with comparatively few observations. When used for
longitudinal and time series data, the n in equation (4.16) is summed across
panels, iN n i . A variety of finite-sample AIC statistics have been designed
since 1989.
A Consistent Akaike Information Criterion (CAIC) test and Corrected AIC
have also enjoyed some popularity. Defined as
AICFS = AIC +
CAIC = 2L + k log(n)
(4.17)
and
2 kn
(4.18)
nk1
these tests are sometimes found in commercial statistical software, such as
the AICC in SAS. It is wise to be aware of their existence and meaning since
you will find them being used in journal articles and books on modeling.
Recently, the AIC statistic and alternatives have been the focus of considerable journal article research, particularly with respect to correlated data. Hin
AICC = 2L +
118
ASSESSMENT OF FIT
and Wang (2008) and Barnett et al. (2010) even demonstrated that the AIC
is superior to the QIC statistic, which is an AIC-type fit statistic designed for
generalized estimating equation (GEE) longitudinal models by Pan (2001)
and revised by Hardin and Hilbe (2002). The QIC statistic has itself been
revised in light of these objections, most recently by Hardin and Hilbe (2013)
and Shults and Hilbe (2014). These applications are for longitudinal data, or
data structured in panels.
The traditional AIC statistic can be enhanced to provide a superior general comparative-fit test for maximum likelihood models. Simulation tests
performed by Hardin and Hilbe (2013) appear to confirm that the following
version of AIC, which we may tentatively call AICH , produces a test statistic
that is aimed at optimally selecting the best fitted among competing maximum
likelihood models. The statistic is robust to violations of MLE assumptions,
such as the requirement of the independence of observations. There are three
adjusters in the statistic the number of observations in the model, number
of parameters (coefficients), and number of distributional parameters for
example, the mean or location and scale parameters. In special circumstances,
a scale parameter can become a dispersion parameter:
AICH = 2L +
4( p 2 pk 2 p)( p + k + 1)( p + k + 2)
n pk2
(4.19)
with
n = number of observations in the model;
p = number of slopes in the model;
k = number of model location and scale parameters.
This test statistic is still a work in progress but appears to correctly select the
best fitted among competing maximum likelihood estimated models. I will
show the results of using it together with other more traditional versions in
subsequent example model output.
Lets return to the traditional AIC statistic currently provided in the majority of study results. How do we decide whether one value of the AIC is
significantly superior to another; that is, whether there is a statistically significant difference between two values of the AIC? Hilbe (2009a) devised a table
based on simulation studies that can assist in deciding whether the difference
between two AIC statistic values is significant. Table 4.3 is taken from that
source and is based on equation (4.13) (i.e., the AIC without division by n).
119
Result if
Models A and B
A B
---------------------------------------------0.0 & = 2.5
No difference in models
2.5 & = 6.0
6.0 & = 9.0
9.0
Prefer A if n 256
Prefer A if n 64
Prefer A
(4.20)
with k indicating the number of predictors, including the intercept, and n the
number of observations in the model. The statistic gives a higher weight to
the adjustment term, k*log(n), than does the AIC statistic, where only 2k is
employed for adjusting 2 times the model log-likelihood function. It is on
this account that many statisticians prefer the Schwarz BIC to the traditional
AIC, but in general the AIC is used more frequently by analysts when testing
and reporting research studies.
Another parameterization of the BIC, which is used in the Limdep software
program, was developed by Hannan and Quinn (1979). It is given as
BICHQ = 2(L k ln(k))/n
(4.21)
120
ASSESSMENT OF FIT
which has results that approximate AIC/n in equation (4.14) for most models.
If the values differ substantially, this is an indication that the model is poorly
fit and is likely misspecified. This test is not definitive, though, since many
poorly fit models have similar AIC/n and BICHQ values. Many statisticians,
including the author, prefer the use of BICHQ to the Schwarz parameterization.
The Stata abic command and the R modelfit function (in the COUNT
package) display the values of the most used parameterizations of the AIC and
BIC statistics. The BIC statistic given in the lower-left corner of the output
is BICHQ , whereas the lower-right BIC statistic is BICSC , or Schwarz BIC.
The top-left AIC is equation (4.14) and the top right is equation (4.13). The
way that the labeling reads is that the AIC by default is (4.13). Multiplying
equation (4.13) by n yields equation (4.14). Rs modelfit.r function is
provided in Table 4.4. The AICH command was added during the writing of
this book and is displayed in selected model output.
121
An example will perhaps help clarify things. Using the medpar data, we
model and use the abich (abic + AICH ) and modelfit statistics for Stata
and R output, respectively:
. glm los hmo white type2 type3, fam(poi) nolog nohead
-----------------------------------------------------------------------|
OIM
los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-------+---------------------------------------------------------------hmo | -.0715493
.023944
-2.99
0.003
-.1184786
-.02462
white |
-.153871
.0274128
-5.61
0.000
-.2075991
-.100143
type2 |
type3 |
_cons |
.2216518
.7094767
2.332933
.0210519
.026136
.0272082
10.53
27.15
85.74
0.000
0.000
0.000
.1803908
.6582512
2.279606
.2629127
.7607022
2.38626
-----------------------------------------------------------------------. abich
AIC Statistic
=
9.276131
AIC*n
= 13867.815
BIC Statistic
AICH Statistic
=
=
9.280208
9.274205
BIC(Stata) = 13894.365
AICH*n
= 13854.255
Note that in the Stata glm command output the BIC statistic displayed in
the header statistics is commonly referred to as the Raftery BIC, authored in
1986 by Adrian Raftery of the University of Washington. It is based on the
deviance function, not the log-likelihood as are other AIC and BIC statistics.
The reason for this is Rafterys desire to implement an information criterion
fit test for generalized linear models, which were at the time all based on
the deviance, with the log-likelihood rarely even calculated. Rafterys BIC is
seldom used at this time. I wrote Statas initial glm command in late 1992,
when the deviance was still used as the basic fit test in GLM softare, as
122
ASSESSMENT OF FIT
well as the criterion for convergence. The Raftery BIC test statistic is defined
as
BICR = D (d f ) ln(n)
(4.22)
The AIC and BIC statistics are now considered as basic fit tests and should normally accompany statistical model output. The traditional AIC and Schwarz
BIC are assumed when a reference is made to AIC and BIC statistics. If you
select another definition, be certain to mention it in a research report.
123
Both of these methods of constructing and implementing a validation sample help prevent an analyst from developing an overfitted model. Overfitted
models are such that they cannot be used for classification, or for prediction of nonmodel data from a greater population. An overfitted model is so
well fitted that it is applicable only to the data being modeled. Coefficient
values and other summary statistics may not be extrapolated to other data.
For most research situations, such a model is of little value. Unfortunately,
though, many analysts attempt to fit a model as perfectly as possible so that
overfitting does become a problem.
It is easy to overfit a model. When an analyst incorporates a large number
of predictors into the model and then attempts to fine-tune it with highly
specific transformations and statistical manipulations, it is likely that the
resultant model is overfitted. Testing a model against validation data assists
in minimizing this possibility and helps assure us of the stability of the
parameter estimates.
Remember that even if we are confident that a model is well fitted, it is
nevertheless dangerous to make predictions outside the sample space of the
model. If the modeled data are truly representative of the population on which
the modeled data is a sample, it is safer to make out-of-sample predictions,
but its not wise to make them too far from the models sample space.
How far is too far? Let experience and common sense be a guide. Statistics
is in many respects an art, not simply a recipe to be followed.
124
ASSESSMENT OF FIT
When used in these circumstances, the information criterion tests (AIC, BIC,
etc.) are likely valid indicators of which model is better fit, but only if there
is a wide separation in AIC/BIC values between the models being compared.
I caution you not to rely solely on them for assessing comparative model fit
when the data are clearly correlated.
When modeling, it is important to mention that analysts should not focus
on predictor p-values. Rather, emphasis should be given to confidence intervals. Moreover, unless the distributional assumptions of the model are met in
full, it is preferred to employ robust or sandwich-based confidence intervals.
Profile and bootstrapped confidence intervals are also preferred. If the lower
and upper 95% confidence limits of a coefficient exclude 0, or if a risk ratio
excludes 1, we may be 95% confident that the true value of the coefficient or
risk ratio lies somewhere within the confidence limits.
A predictor p-value relates to the value of the predictor z-statistic, which
itself is the ratio of the coefficient to the standard error. It is not the probability
that the predictor fits the model, or of how influential a predictor is in the
model. Unfortunately many analysts interpret a p-value in this manner. A
more full discussion of p-values and confidence intervals takes us beyond the
scope of this guidebook. I refer you to Vickers (2010) and Ellis (2010) for
more details. It is advised, though, to give preference to confidence intervals
when considering a predictors worth to a model.
At this point, we start our examination of models that extend the base
Poisson model. Most are designed to adjust for overdispersion in general,
whereas others adjust for a specific distributional abnormality. Several models
we discuss are also capable of adjusting for underdispersion.
First, however, I would like to provide a summary of what we have thus
far discussed regarding the modeling process.
125
and median are nearly the same. If they are, the variable may be normally
distributed (i.e., bell shaped). If a continuous variable such as age begins at
40 and ends at 65, it is likely best to center it. This means that the variable, as
a predictor in a model, will be interpreted differently from normal. Check also
for outliers and missing values. This is all preliminary to actually modeling
the data.
For count models such as Poisson or negative binomial regression, I suggest
using robust standard errors as a default method. If there is no excessive
correlation in the data, the robust standard errors will reduce to the values
of model standard errors. If there is correlation in the data, which is typically
the case, then the robust standard errors help adjust for the extradispersion.
Bootstrapping does the same thing but takes longer, and the results are nearly
identical.
If overdispersion is not adjusted by simple post hoc standard error adjustment, the majority of analysts turn to negative binomial regression. Of course,
if ones study data are structured such that no zero (0) counts are allowed, or if
there are far more zero (0) counts than allowed by the distributional assumptions of the Poisson model, then using a zero-truncated or zero-inflated Poisson model might be the preferred approach. Well discuss this scenario later.
On the other hand, if an analyst is not sure of the cause of model overdispersion, a good procedure is to model the data using a negative binomial model,
testing it using a boundary likelihood ratio test. There are other options as
well, but this is a good start.
To reiterate, we next address a model that can be used for generic overdispersion one for which we may not know the source of extra correlation in
the data. Of course, for the example rwm1984 data we have been modeling,
I strongly suspect that the excessive zero counts found in the data result in
overdispersion. For the medpar data, the response term, los, has no possibility of having zero counts. That situation also gives rise to overdispersion
but for a different reason. Remember also that a model may have multiple sources of overdispersion. Adjusting for one source will seldom adjust
for other sources. For now, however, we will overlook these facts and will
assume that we have no idea why our example models are overdispersed.
CHAPTER 5
Negative Binomial
Regression
do they relate?
r What is NB-P, and how can it help select the best fit between NB1 and
NB2 models?
negative binomial?
127
128
model. For the Poisson PDF, the exponential form of the log-likelihood is
equation (1.14):
n
yi log (i ) i log(yi !)
L (; y) =
i=1
-------link
------cumulant
---------normalization
Recall that the exponential family form for count models is {y b()
c(.)}, where y is the response count variable being modeled, is the link, b()
the cumulant, and c(.) the normalization term, which guarantees that the
distribution sums to 1. For the Poisson model, the canonical link is log()
and the cumulant . The first derivative of the cumulant with respect to
is the mean; the second derivative is the variance. For the Poisson, the
result is for both the first and second derivatives with respect to . This is
solved using the chain rule in calculus. Because the Poisson link is log(),
it is the canonical function. For the traditional NB2 negative binomial, the
link is also log(). We want that to be the case so that it can properly model
overdispersed Poisson data.
I refer you to equation (5.2), which is the negative binomial log-likelihood
function in exponential family form. The canonical link is log(/(1 + )),
which clearly is not log(). Given the fact that the NB2 negative binomial
is a noncanonical model, its standard errors are based on what are called
the expected information matrix (EIM). Standard errors of canonical models
are based on the observed information matrix (OIM). Standard errors from the
OIM are more accurate than those from the EIM when a model has less than
30 observations. Note that because of this problem, the Stata glm command
and SAS Genmod procedure by default estimate noncanonical models using
OIM but allow an option to estimate a model using EIM, or a combination
of both. See Hilbe (2011) for a complete discussion of OIM, EIM, and their
relationship. Also see the same source for a discussion on the canonical
negative binomial. It is to my knowledge the only source on the subject. Our
discussion has taken us into the topic of the following section, to which we
now turn.
129
(1 + )
(1 + )
.5
.5
.5
.5
1
1
1
1
.5
1
2
5
.5
1
2
5
0.625
0.75
1.0
1.75
1.5
2.0
3.0
6.0
5
5
5
5
10
10
10
10
.5
1
2
5
.5
1
2
5
17.5
30
55
130
60
110
210
510
130
NB1 is similar to the quasi-likelihood Poisson in that can take the term
1. The difference, however, is that, for the NB1 model, is estimated
as a second parameter as is the NB2 negative binomial. It is not a constant.
If it were, the model would be quasi-likelihood. It is important to remember,
though, that statisticians also use the term quasi-likelihood when referring
to a GLM model that does not specifically derive from an underlying probability function. Poisson or negative binomial models with scaled or robust
standard errors, for example, are also referred to as quasi-likelihood models
since the standard errors are not directly based on the second derivative of the
model log-likelihood, which are called model standard errors. The use of
the term quasi-likelihood in a specific situation must relate to the context
in which it is being applied.
Our attention now will focus on the NB2 negative binomial since it is
by far the most widespread understanding of negative binomial regression.
The negative binomial probability distribution function and associated loglikelihood function are considerably more complex than the Poisson. The
probability distribution can be expressed in a variety of ways, with a common
parameterization appearing as
f (y; , ) =
yi +
1
1
1
1
1
1 + i
1
i
1 + i
yi
(5.1)
i
i=1
1
(5.2)
log (yi + 1) log
L(; y, ) =
exp(xi )
1
log 1 + exp(xi )
1 + exp(xi )
i=1
1
1
log (yi + 1) log
(5.3)
+ log yi +
n
yi log
131
132
POISSON : UNIFORM
---------------------------------------------------------------------------------------------|
|
|
|
|
|
|
|
|
|
|
|
|
|
| events enter cell in a random
|
|
|__ _ |
|
| |
| uniform manner
|
|
| |
|
|
|
|
|
|
|
|
|
|
|
----------------------------------------------------------------------------------------------
FIGURE 5.1. How events can enter or exist within panels of areas or time periods for the
Poisson distribution.
|
|
|
|
|
|
|
|
|_ __ |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
---------------------------------------------------------------------------------------------------------
|
| events enter cell in
| with the same gamma
| shape each cell
|
FIGURE 5.2. Same information as in Figure 5.1, but for the negative binomial distribution.
133
r
r
r
134
medpar
los white+
hmo+type2+
type3
los white+
hmo
value with +
dispersion
Poi dispersion
DIRECT NB dispersion
NB alpha
INDIRECT NB dispersion
NB theta
6.260405
1.091024
0.445757
0.254205
2.243376
7.739524
1.328627
0.484603
0.359399
2.063547
135
uniform manner. The negative binomial then gave a gamma shape to how
events enter into periods or areas in y. But the limitation of the negative binomial is its assumption that events enter every period or area in the data in the
same way. Events enter with the same type of gamma shape. A three-parameter
negative binomial allows for different gamma shapes for each period or area
in y. It adjusts or describes the distribution of y in more detail.
Table 5.2 shows that the Poisson dispersion statistic rises when there is
increased variability added to the data. Here we dropped two statistically
significant predictors, which will increase the model variability. When a negative binomial model is applied to both the full and reduced models, both the
negative binomial dispersion statistic and parameter (alpha) values increase
as well. For the inverted parameterization of the NB dispersion parameter, the NB dispersion statistic increases but the dispersion parameter, theta,
decreases. It is the inverse value of the direct relationship. With increased
variability, the negative binomial value for theta decreases. This seems rather
counterintuitive.
Why would an analyst prefer to use an indirect relationship between the
mean and dispersion parameter for the negative binomial? Essentially the
reason comes from the relationship of the variance and gamma variance. The
Poisson variance is and the gamma variance 2 /v. Adding the two together
produces + 2 . Of course, one can argue that the second term can also
appear as 1 2 , so that = 1/v, with being the dispersion parameter. In
effect, though, there is no mathematical reason for not inverting so that the
dispersion parameter coheres with the direction of variability in the model. I
have to admit that I have had to correct a misinterpretation of the dispersion
parameter in numerous manuscripts that I have refereed where R was used
by the authors for modeling negative binomial data. Misinterpretations also
commonly occur when an analyst who uses Stata, SAS, or another of the major
commercial packages evaluates or comments on negative binomial results
based on R. However, there are R users such as myself who use glm.nb but
invert theta to alpha for interpreting results. Care must be taken when doing
this, though, since the Pearson residuals differ as well and they cannot be
simply inverted to correspond to a direct relationship.
As a result, I will use the nbinomial function when using a negative
binomial model for examples in the text. It is a function both in the msme
and COUNT packages on CRAN. Note that nbinomial has an option that
allows estimation using the indirect relationship as well. The output appears
almost identical to that of glm.nb, but it also provides a summary of Pearson
136
residuals, the Pearson Chi2 statistic, and dispersion statistic in the output. A
likelihood ratio test with a boundary test option is also provided.
137
Freq.
Percent
Cum.
------------+----------------------------------Not HS grad |
3,152
81.36
81.36
HS grad |
Coll/Univ |
Grad School |
203
289
230
5.24
7.46
5.94
86.60
94.06
100.00
The default reference level for both Stata and R is the first level. For SAS,
it is the highest level, level 4. We use the default, which has over 81% of
the observations. In such a case, level 1 is the reasonable reference level
since it is both intuitively preferred, being at the base of an ordered group of
educational levels, and has by far the most patients. As a caveat, the second
level is not significant for most of the models we run on the data, indicating
that there is no statistically significant difference in the two groups and their
propensity to visit their physician. Visits generally indicate sickness, but not
always. If we assume that it does, we cannot differentiate instances of new
sickness from treatment for continuing problems. Its best to maintain the
visit interpretation.
Next we should look at the distribution of the response variable, docvis,
which indicates how many visits were made by a patient to the doctor in
1984. We did this for visits 07 in Section 2.4; you are invited to look back
at that section. To summarize, though, there are 1611 subjects in the study
with an observed frequency of 0 counts, which is 41.58% of the data set. We
calculated the predicted value in Section 2.4 as well, which, given a Poisson
distribution, is 4.2%, nearly ten times less than observed. To calculate what
is expected, we do the following:
. sum docvis
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------docvis |
3874
3.162881
6.275955
0
121
138
Robust
Coef.
Std. Err.
P|z|
---------+----------------------------------------------------------------outwork |
.26473
.0934473
2.83
0.005
.0815767
age |
.0221169
.002909
7.60
0.000
.0164153
.4478834
.0278185
female |
.2615599
.0914541
2.86
0.004
.0823132
.4408065
.0419602
married |
-.128839
.0871441
-1.48
0.139
-.2996382
edlevel2 |
-.0743016
.1043152
-0.71
0.476
-.2787557
.1301524
edlevel3 |
-.1825212
.0972583
-1.88
0.061
-.373144
.0081016
edlevel4 |
-.2644094
.1556778
-1.70
0.089
-.5695323
.0407135
_cons |
.0156058
.1801753
0.09
0.931
-.3375313
.3687428
---------------------------------------------------------------------------
139
140
Count
Actual
Predicted
|Diff|
Pearson
-----------------------------------------------not displayed
-----------------------------------------------Sum
0.952
0.999
0.778 7304.642
NBRM: Predicted and actual probabilities
Count
Actual
Predicted
|Diff|
Pearson
-----------------------------------------------0
0.416
0.411
0.005
0.251
1
0.116
0.156
0.040
39.735
2
0.114
0.096
0.017
12.287
3
0.091
0.067
0.024
32.969
4
0.055
0.050
0.005
2.077
5
0.043
0.038
0.005
2.673
6
0.036
0.030
0.006
5.213
7
0.015
0.024
0.009
11.730
8
0.022
0.019
0.002
1.225
9
0.012
0.016
0.004
3.528
10
0.012
0.013
0.002
0.720
11
0.009
0.011
0.002
2.143
12
0.011
0.009
0.002
1.494
-----------------------------------------------Sum
0.952
0.940
0.124
116.046
Tests and Fit Statistics
PRM
BIC= -936.032 AIC=
8.007 Prefer Over Evidence
------------------------------------------------------------------------vs NBRM
BIC=-15320.139 dif= 14384.107 NBRM
PRM
Very strong
AIC=
4.293 dif=
3.715 NBRM
PRM
LRX2=14392.370 prob=
0.000 NBRM
PRM
p=0.000
------------------------------------------------------------------------NBRM
BIC=-15320.139 AIC=
4.293 Prefer Over Evidence
The note on the upper left-hand side of Figure 5.3 informs us that positive
deviations show underpredictions. The figure is a graph of Poisson and NB
model residuals. Imagine a horizontal line from 0 to 12 at the zero (0) point.
Residuals close to the zero line are well fitted. Both Poisson and negative
binomial models fit the data well, starting with the sixth visit. Before that, the
negative binomial does much better. This same information can be obtained
numerically by viewing the actual and predicted probability tables which
are outstanding diagnostic tools.
Although most information we need is provided here, for comparison purposes with later modeling Ill provide a negative binomial model of the data.
141
-.1 -.05 0
Observed-Predicted
.05 .1
6
7
Count
PRM
10
11
12
NBRM
We find that the dispersion is 1.43, indicating that there is more variability
in the data than can be adjusted for using a negative binomial model. Recall
that we still have the excessive zero counts we have not adjusted for. Note
that the dispersion parameter is 2.25, which is fairly high. We should also
compare it with the NB1 linear negative binomial model:
NB2 - NEGATIVE BINOMIAL (TRADITIONAL)
. glm docvis $xvar, fam(nb ml) vce(robust) nolog
Generalized linear models
No. of obs
3874
Optimization
Residual df
3866
: ML
Scale parameter =
Deviance
3909.433546
(1/df) Deviance =
1.011235
Pearson
5517.070211
(1/df) Pearson
1.427075
[Neg. Binomial]
AIC
BIC
= -28031.62
4.292374
142
--------------------------------------------------------------------------|
Robust
docvis |
Coef.
Std. Err.
P|z|
---------+----------------------------------------------------------------outwork |
.2733504
.0812069
3.37
0.001
.1141879
.4325129
age |
.0232448
.0026847
8.66
0.000
.0179828
.0285068
female |
.3164156
.0758391
4.17
0.000
.1677737
.4650576
married |
-.1906226
.0855496
-2.23
0.026
-.3582968
-.0229485
edlevel2 |
-.1139377
.1051687
-1.08
0.279
-.3200647
.0921893
edlevel3 |
-.1948105
.098729
-1.97
0.048
-.3883158
-.0013051
edlevel4 |
-.3628498
.1321642
-2.75
0.006
-.6218868
-.1038127
_cons |
-.013249
.1520918
-0.09
0.931
-.3113435
.2848456
---------------------------------------------------------------------------
vce(robust) nolog
. abich
AIC Statistic
4.29289
BIC Statistic
4.298453
BIC(Stata) = 16687.016
AICH Statistic
4.291909
AICH*n
AIC*n
= 16630.656
= 16605.557
This code captures the log-likelihood from the model, storing it in llnb2:
. scalar llnb2 = e(ll)
It will help in assessing the fit of the model if we can see the relationship of
the fit and observed probabilities of docvis. Stata code for constructing Figure
5.3 is provided. It is an adaptation of the code in Hilbe (2011). This code can
be adapted for any negative binomial model using the nbreg command.
Caveat: recall that the glm command enters the dispersion parameter into
the estimation algorithm as a constant. As such, it does not add the dispersion
as an extra parameter when calculating the AIC and BIC degrees of freedom.
The full maximum likelihood command, nbreg, estimates the dispersion,
adding it to the degrees of freedom. There is therefore a slight difference
in AIC and BIC values between the two methods. The dispersion parameter
should be added to the calculation. I calculated the correct values by quietly
running nbreg before abich.
To run the code in Table 5.3, I suggest downloading the file from the
books web site, typing doedit on the Stata command line, and pasting the
code into the editor. When ready to run, click Tools on the menu and then
click on Execute (do). The figure will be in the graphics screen.
If the option disp(constant) is placed at the end of the second line of
code in Table 5.3, an NB1 model will be produced. You can compare the
143
- lngamma(i+1) - lngamma(1/alpha))
local i = i + 1
}
quietly gen cnt = .
quietly gen obpr = .
quietly gen prpr = .
local i 0
while i =15 {
local obs = i + 1
replace cnt = i in obs
tempvar obser
gen obser = (e(depvar)==i)
sum obser
replace obpr = r(mean) in obs
sum pri
replace prpr = r(mean) in obs
local i = i + 1
}
gen byte count = cnt
label var prpr "NB2 - Predicted"
label var obpr "NB2 - Observed"
label var count "Count"
}
twoway scatter prpr obpr count, c(l l) ms(T d) title("docvis Observed vs
Predicted Probabilities") sub("Negative Binomial") ytitle(Probability of
Physician Visits)
144
Negative Binomial
10
15
Count
NB2 - Predicted
NB2 - Observed
FIGURE 5.4. Negative binomial model: docvis observed vs. predicted probabilities.
predicted values for NB1 and NB2 by typing the following on the command
line:
. rename prpr prnb2
. rename obpr obnb2
. drop mu*
. drop count
then rerun the code in the dofile editor with the changes you have made
to it
You may adapt the code in Table 5.3 for use with other models. However,
you should type ereturn list to check the e() macros that are saved following
the execution of the regression to determine if their names are different. If
you use the correct macro names, you can adapt the code in Table 5.3 to
compare observed versus predicted probabilities across a number of different
regression models. R users can usually find the names of saved statistics in
the help files.
We have examined the relationship of Poisson and NB2 models, and the
observed versus predicted values of the response variable, docvis (see Figure
5.4). A remaining comparison can be made between NB2 and NB1:
145
Dispersion
Number of obs
3874
Wald chi2(7)
312.30
Prob chi2
0.0000
-------------------------------------------------------------------------|
Robust
docvis |
Coef.
Std. Err.
P|z|
---------+---------------------------------------------------------------outwork |
.1861926
.0489325
3.81
0.000
.0902868
age |
.01871
.0018376
10.18
0.000
.0151084
.2820985
.0223116
female |
.3081968
.0481739
6.40
0.000
.2137777
.4026158
.0525076
married |
-.0430265
.0487428
-0.88
0.377
-.1385605
edlevel2 |
.0533671
.0844128
0.63
0.527
-.112079
.2188132
edlevel3 |
-.0041829
.0772246
-0.05
0.957
-.1555403
.1471746
edlevel4 |
-.2383558
.0989037
-2.41
0.016
-.4322035
-.0445081
_cons |
.1013333
.1021209
0.99
0.321
-.0988199
.3014865
---------+---------------------------------------------------------------/lndelta |
1.976527
.0561385
1.866498
2.086556
---------+---------------------------------------------------------------delta |
7.217632
.405187
6.465611
8.057122
-------------------------------------------------------------------------. abich
AIC Statistic
4.280869
AIC*n
BIC Statistic
4.286432
BIC(Stata) = 16640.445
= 16584.086
AICH Statistic
4.279888
AICH*n
= 16558.986
Note that the AIC and BIC statistics are better for the NB1 model. Aside
from the AIC and BIC comparative-fit tests, the optimal tool for deciding
between NB1 and NB2 is the generalized NB-P negative binomial, discussed
in the final section of this chapter. I leave the model diagnostics to you.
R code in Tables 5.4 and 5.5 provides a means to replicate some of the output obtained from using countfit. Additional code will be made available
on the books web site as it is produced.
The interpretation of the Poisson and negative binomial models is the
same. Any model with a log link function will have the same interpretation.
When the coefficients are exponentiated, both regressions will be able to be
interpreted as incidence rate ratios, as discussed in Chapter 2, on Poisson
regression. You are referred there for the following model:
146
vce(robust) irr
Dispersion
Number of obs
3874
Wald chi2(7)
231.08
Prob chi2
0.0000
-------------------------------------------------------------------------|
Robust
docvis |
IRR
Std. Err.
P|z|
---------+---------------------------------------------------------------outwork |
1.314361
.106737
3.37
0.001
1.12096
1.54113
age |
1.023517
.0027479
8.66
0.000
1.018146
1.028917
female |
1.372201
.1039725
4.18
0.000
1.182828
1.591892
married |
.8264441
.0707537
-2.23
0.026
.6987797
.9774323
edlevel2 |
.8923135
.0938496
-1.08
0.279
.7260923
1.096587
edlevel3 |
.8229906
.0812487
-1.97
0.048
.6782051
.9986852
edlevel4 |
.6956908
.0919757
-2.74
0.006
.5368845
.9014707
_cons |
.9868381
.150062
-0.09
0.931
.7325028
1.329482
---------+---------------------------------------------------------------/lnalpha |
.8112091
.037307
.7380888
.8843295
---------+---------------------------------------------------------------alpha |
2.250628
.0839641
2.091934
2.42136
--------------------------------------------------------------------------
147
I will work through the interpretations of the preceding table, with the
understanding that there are alternative ways of expressing essentially the
same thing. If you are interpreting a model based on coefficient values, it is
wise to exponentiate them before attempting an interpretation. Exponentiated
coefficients are called incidence rate ratios. When the response is binary, an
TABLE 5.6. Interpretation of German Health Care Parameter Estimates
Patients who are out of work visit the doctor some 31% more often than working
patients.
For each one-year increase in age, a patient is expected to go to the doctor a little
over 2% more often.
Females see the doctor some 37% more often than males.
Single patients are expected to go to the doctor some 21% more often than married
patients. Note that 1/.8264441 = 1.21000. Or we can affirm that married patients
are expected to see the doctor some 17.5% less often than single patients. I prefer
to interpret the relationship in a positive manner (i.e., more than).
Patients who have a college or university education are expected to go to the doctor
some 17.5% less often than those who have not been to college. Because edlevel2
is not significant, edlevel1 and edlevel2 have been merged. Edlevel12 is the
reference level.
Patients with at least some graduate school education were about 30% less likely to
see a doctor (in 1984) than patients who never went to college.
148
1,134
265
96
75.85
17.73
6.42
75.85
93.58
100.00
----------------+----------------------------------Total |
1,495
100.00
No. of obs
1495
Optimization
Residual df
1490
Scale parameter =
: ML
149
Deviance
8142.666001
(1/df) Deviance =
5.464877
Pearson
9327.983215
(1/df) Pearson
6.260391
AIC
9.276131
BIC
= -2749.057
-------------------------------------------------------------------------|
Robust
los |
Coef.
Std. Err.
P|z|
--------+----------------------------------------------------------------hmo |
-.0715493
.0517323
-1.38
0.167
-.1729427
.0298441
white |
-.153871
.0833013
-1.85
0.065
-.3171386
.0093965
type2 |
.2216518
.0528824
4.19
0.000
.1180042
.3252993
type3 |
.7094767
.1158289
6.13
0.000
.4824562
.9364972
_cons |
2.332933
.0787856
29.61
0.000
2.178516
2.48735
-------------------------------------------------------------------------. abich
AIC Statistic
9.276131
AIC*n
BIC Statistic
9.280208
BIC(Stata) = 13894.365
= 13867.815
AICH Statistic
9.274205
AICH*n
= 13854.255
Note the high dispersion statistic at 6.26. Also, robust or empirical standard
errors were given the model because of the overdispersion. The model without
robust SEs had all p-values as statistically significant. It is noteworthy that
the data do not have 0 counts. We will come back to this when we evaluate
zero-truncated models.
We next tabulate los. Note the comparatively greater number of 1 counts.
Some 8.5% of patients stayed only a single night in the hospital:
. tab los
hospital |
Length of |
Stay |
Freq.
Percent
Cum.
------------+----------------------------------1 |
126
8.43
8.43
2 |
3 |
4 |
71
75
104
4.75
5.02
6.96
13.18
18.19
25.15
5 |
123
.
8.23
.
33.38
.
74 |
91 |
116 |
1
1
1
0.07
0.07
0.07
99.87
99.93
100.00
150
No. of obs
1495
Optimization
: ML
Residual df
=
Scale parameter =
1490
1
Deviance
=
1568.14286
Pearson
= 1624.538251
Variance function: V(u) = u+(.4458)u2
(1/df) Deviance =
(1/df) Pearson =
[Neg. Binomial]
1.052445
1.090294
AIC
= 6.424718
Log pseudolikelihood = -4797.476603
BIC
= -9323.581
-----------------------------------------------------------------------|
Robust
los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-------+---------------------------------------------------------------hmo |
white |
type2 |
-.0679552
-.1290654
.221249
.0513265
.0710282
.0530361
-1.32
-1.82
4.17
0.186
0.069
0.000
-.1685533
-.2682782
.1173001
.0326429
.0101473
.3251978
type3 |
.7061588
.1157433
6.10
0.000
.4793061
.9330116
_cons |
2.310279
.0689142
33.52
0.000
2.17521
2.445348
-----------------------------------------------------------------------. abich
AIC Statistic
6.426055
AIC*n
BIC Statistic
AICH Statistic
=
=
6.432411
6.42592
BIC(Stata) = 9638.8125
AICH*n
= 9589.0547
= 9606.9531
= mean
Number of obs
1495
LR chi2(2)
118.46
Prob chi2
0.0000
Pseudo R2
0.0122
151
-------------------------------------------------------------------------los |
Coef.
Std. Err.
P|z|
---------+---------------------------------------------------------------_hat |
-.2952803
1.986614
-0.15
0.882
-4.188973
3.598412
_hatsq |
.2598939
.3982465
0.65
0.514
-.5206548
1.040443
_cons |
1.59086
2.446458
0.65
0.516
-3.20411
6.38583
---------+---------------------------------------------------------------/lnalpha |
-.8083432
.0444598
-.8954828
-.7212036
---------+---------------------------------------------------------------alpha |
.4455957
.0198111
.4084104
.4861668
0.000
The _hatsq predictor is not significant, indicating that the model is not
misspecified. An NB1 test is attempted next.
To check the NB1 parameterization, we run a model, check the AIC and
BIC statistics, and run a linktest. To save space, we will not display the
regression results but only the summary statistics of interest. Placing a qui in
front of a Stata command suppresses the display to screen:
. qui nbreg los hmo white type2 type3, nolog disp(const)
. abich
AIC Statistic
BIC Statistic
=
=
6.471913
6.478269
AIC*n
= 9675.5107
BIC(Stata) = 9707.3701
AICH Statistic
6.471778
AICH*n
= 9657.6123
. linktest
-----------------------------------------------------------------------los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-------+---------------------------------------------------------------_hat |
-8.003606
4.769589
-1.68
0.093
-17.35183
1.344617
_hatsq |
_cons |
1.987063
10.15396
.9933492
5.700143
2.00
1.78
0.045
0.075
.0401342
-1.018113
3.933992
21.32604
_hatsq |
1.987063
.9933492
2.00
0.045
.0401342
3.933992
The linktest fails, indicating that the NB1 model is misspecified. In addition, the AIC and BIC statistics are greater than for NB2, and of course very
much greater than with Poisson.
At this point, it is clear that we need to determine whether the lack of
0 counts in the model affects the dispersion. Its greater than it should be, but
not bad.
152
NB1
NB2
-----------------------------------------------------------------------AIC*n
13867.815
9675.5107
9606.9531
BIC
13894.365
9707.3701
9638.8125
AICH*n
Variable |
13854.255
Obs
9657.6123
Mean
Std. Dev.
9589.0547
Min
Max
---------+-----------------------------------------------------los |
1495
9.854181
8.832906
1
116
The mean of los is 9.85, indicating that, on average, patients were in the
hospital some 10 days. Because the mean is so high, it is unlikely that lack of
0 counts will make a difference. With a mean of 9, the percentage of 0 counts
expected is near 0. Also look at the standard deviation, 8.83. The variance
is the square of the standard deviaton 78. This value clearly violates the
Poisson assumption of equidispersion. We will explore other alternatives for
these data in subsequent chapters as we look at more complex modeling
strategies.
153
interval for the Poisson dispersion). In Section 4.2.2, we demonstrated the use
of a one-sided boundary likelihood ratio test to be used following estimation
of a negative binomial model, testing whether the dispersion parameter, ,
is statistically different from 0. You are referred to that discussion. We also
presented several tests for overdispersion in Section 4.3. You are also referred
there.
In the following two subsections, we test negative binomial models by
employing different types of regression models. Both of the extended negative binomial models we consider here have been referred to as generalized
negative binomial models. The first, NB-P, is a generalized negative binomial
model since it adds an additional parameter to the base NB2/NB1 model.
The second, NB-H, is not a generalized negative binomial in the usual sense,
although it is called that by Stata.
Both models test the base negative binomial for different purposes. The
NB-P model is primarily used to decide between using an NB2 model or an
NB1 model. It does this by parameterizing the negative binomial exponent.
The heterogeneous negative binomial, NB-H, parameterizes the dispersion
parameter, with the goal of determining which predictors add or adjust for
excess variability in the data. This in turn produces overdispersion in the data.
Although Greene was the first to actually develop software for parameterizing the
exponent of the negative binomial variance, this parameterization is mentioned
in Cameron and Trivedi (1998), Winkelmann (2008), and described in Hilbe and
Greene (2008).
154
To explain, recall that the variance functions of the NB1 and NB2 models
are, respectively,
(5.4)
NB1 + or + 1
and
(5.5)
NB2 + 2
The difference rests in the value of the exponent. To choose between them,
Greene let the exponent become a parameter to be estimated, called , for
power. is the Greek letter rho. The model is called, appropriately, NB-P:
NB-P +
(5.6)
Number of obs
3874
Wald chi2(7)
186.34
Prob chi2
0.0000
---------------------------------------------------------------------------|
docvis |
Robust
Coef.
Std. Err.
P|z|
-----------+---------------------------------------------------------------outwork |
.2329032
.0632728
3.68
0.000
.1088908
age |
.0221351
.0024555
9.01
0.000
.0173225
.3569157
.0269477
female |
.3420894
.0594544
5.75
0.000
.225561
.4586178
married |
-.0799971
.064053
-1.25
0.212
-.2055387
.0455445
edlevel2 |
.0163448
.0993362
0.16
0.869
-.1783506
.2110403
edlevel3 |
-.0555854
.0937499
-0.59
0.553
-.2393319
.1281611
edlevel4 |
-.2940122
.1187276
-2.48
0.013
-.526714
-.0613103
_cons |
-.0660212
.1287343
-0.51
0.608
-.3183357
.1862934
-----------+---------------------------------------------------------------/P |
1.363081
.1248427
/lntheta |
1.545855
.1509675
10.92
0.000
1.118394
1.607769
1.249964
1.841746
-----------+---------------------------------------------------------------theta |
4.69198
.7083367
3.490217
6.307539
chi2 =
0.0014
chi2 =
0.0000
155
. abich
AIC Statistic
4.278763
AIC*n
BIC Statistic
4.285488
BIC(Stata) = 16638.549
= 16575.928
The AIC and BIC values are only a little less than for the NB1 model but
are significantly less than in NB2. However, this finding needs to be checked
using a likelihood ratio test. The likelihood ratio (LR) test between NB-P and
NB2 can be given as follows:
LIKELIHOOD RATIO TEST
. di -2*(llnb2 - llnbp)
56.728519
P-VALUE FOR LR TEST
. di chi2tail(1, 56.72852)
5.003e-14
The likelihood ratio test of the hypothesis that the model having a of
1.36 is statistically the same as NB2, with a = 2, is rejected. Recall that
with 1 degree of freedom, any Chi2 test statistic having a value greater than
3.84 will reject the hypothesis that the models being compared are statistically
the same. The same holds when comparing NBP and NB1, with = 1. These
tests inform us that the model is neither NB1 nor NB2.
Recall also that Greene provides us with another test statistic one to
determine whether NB2 is preferable to NB1. It is based on the powers. The
resultant statistic is distributed as a reverse cumulative upper tail Students t
test with 1 degree of freedom:
nbp nb2
nbpse
(5.7)
Here we have
. di (1.363081 -2)/.1248427
-5.1017721
. di ttail(1, -5.1017721)
.93838908
Values of p greater than 0.05 are not significantly different. For our models,
the data can be modeled using either NB1 or NB2. However, given the lower
value of the AIC and BIC statistics, NB1 should likely be preferred.
156
There are some, including myself, who will take the NB-P model results
as valuable in themselves, and not merely as a test between NB2 or NB1. If
an NB-P test result comes with a substantially lower information criterion
value, then I suggest using NB-P as the model of choice. In this case, no such
determination can be made between NB1 and NB2, so NB-P should be used
as the preferred model.
As of this writing, no NB-P function exists in R. As soon as a function is
developed, I will provide information on this books web site regarding how
it may be downloaded.
157
sheight
scover
---
0.2004
0.5280
0.1734
0.2439
1.156
2.164
0.2536
0.0355 *
The dispersion statistic is 13.68, indicating substantial Poisson overdispersion. Only the standardized tree height predictor is not significant, but
because of the overdispersion we have no idea of the significance status of
any of the predictors. A table of the response, SqCones, shows that the mean
number of cones stripped is 16.49. Five 0 counts is excessive, but the data
are fairly uniformly distributed after 5. The excess zero counts may be a
reason for overdispersion, but we really have no idea whether that is the
case:
table(nut$cones)
0 1 2 3 4
5 1 3 4 5
48 55 57 60 61
1 1 1 1 1
5
3
6
1
7
1
9 11 12 14 15 16 17 18 20 21 22 25 27 32 35 42 44 47
2 2 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
158
summary(nut$cones )
Min. 1st Qu. Median
0.00
3.50
11.00
Max.
61.00
0.306 0.161
0.158 0.159
0.536 0.179
1.90
0.99
3.00
(Intercept)_s
0.927 0.199
4.65 3.37e-06
on
LCL
UCL
2.34948 2.905
49 d.f.
scale.link = "log_s"))
Deviance Residuals:
Min. 1st Qu.
Median
0.4419
Max.
1.5670
159
Pearson Residuals:
Median
Mean
3rd Qu.
Max.
Min.
1st Qu.
0.03088
0.31550
3.20200
(Intercept)
Estimate
1.1e-73
2.3325 2.897
sntrees
0.2731 0.110
2.478
0.0132
0.0571 0.489
sheight
0.0744 0.144
0.516
0.5217 0.158
3.300 0.000967
scover
SE
LCL
UCL
(Intercept)_s
sntrees_s
sheight_s
0.3312 0.321
1.033
scover_s
0.2723 0.417
0.652
on
on
49 d.f.
on
43 d.f.
49 d.f.
on
43 d.f.
Dispersion: 1.088876
AIC:
385.7116
exp(coef(HNB))
(Intercept)
sntrees
sheight
13.6637970
1.3140450
1.0772100
sntrees_s
sheight_s
scover_s
0.6815513
1.3925859
1.3130309
scover (Intercept)_s
1.6849189
0.8228193
in the data.
160
5.5 SUMMARY
The negative binomial model is commonly used by statisticians and analysts
to model overdispersed Poisson data. Typically, negative binomial regression
is used as a catch-all model for overdispersed count data. Many times
perhaps most of the time we dont know the reason for overdispersion. We
may have suspicions, such that there are an excessive number of zeros in the
data, that the data are correlated, or even that there are no zero counts in
the data. There are a host of reasons why data can be overdispersed. There are
also many models to address the variety of overdispersion we find in study
data. But if we have no clear idea as to what is causing overdispersion in a
Poisson model, analysts generally turn to negative binomial regression.
Negative binomial regression is a two-parameter model that is estimated
using either maximum likelihood estimation or by IRLS estimation as a member family of generalized linear models (GLMs). When the negative binomial
is estimated as a GLM, the dispersion parameter is entered into the IRLS estimating algorithm as a constant. Some IRLS routines exist that estimate the
dispersion parameter, , using an outside maximum likelihood procedure,
which then sends the result back into the IRLS fitting algorithm. R, Stata,
SAS, and Limdep use this method for negative binomial estimation.
Two models were discussed that extend the basic negative binomial
model the three-parameter NB-P and the heterogeneous negative binomial, NB-H. The NB-P model parmeterizes the negative binomial variance
exponent, allowing analysts to determine whether the data are better modeled using NB1 or NB2. The NB-H model allows analysts to determine which
5.5 Summary
161
CHAPTER 6
Why hasnt the PIG model been widely used before this?
What types of data are best modeled using a PIG regression?
How do we model data with a very high initial peak and long right skew?
How do we know whether a PIG is a better-fitted model than negative
binomial or Poisson models?
163
Mean
Variance
Inv Variance
POI
NB1
NB2
PIG
NB-P
+
+
+ 2
+ 2
+ 3
+ 3
Poisson and negative binomial (both NB2 and NB1) models have log links.
Recall that the negative binomial is a mixture of the Poisson and gamma
distributions, with variances of and 2 /v, respectively. We inverted so
that there is a direct relationship between the mean, dispersion, and variance function. Likewise, the inverse Gaussian is a mixture of Poisson and
inverse Gaussian distributions, with an inverse Gaussian variance of 3 .
The meanvariance relationship of the Poisson, negative binomial, PIG, and
NB-P, with accepted symbols for the inversely parameterized dispersion, is
given in Table 6.1.
I parameterize the PIG distribution and associated regression model as the
negative binomial is traditionally parameterized, with a direct relationship
between the mean and dispersion parameter. Doing this directly relates the
mean and amount of overdispersion in the data. The dispersion for the PIG
model will also be given the name alpha, . The dispersion is often given as nu,
, or phi, . I will use to refer to the dispersion parameterized indirectly with
the mean such that = 1. This is similar to the relationship we discussed
regarding the negative binomial, where theta is the inverse of : = 1. We
must make a clear distinction, though, in how is understood here. GLM
theory symbolizes the exponential family link function as . R, however, has
chosen to use the same symbol for the dispersion parameter as well. My caveat
is to keep the distinction in mind when dealing with this class of models.
Recall Table 5.1, in which we compared the relationship of the mean and
variance for several values of the Poisson and negative binomial distributions.
The PIG variance can be added to the table for comparative purposes.
Table 6.2 shows us that with greater mean values of the PIG distribution
come increasingly greater values of the variance. With a dispersion parameter
greater than 1, greater values of the mean of the response variable in a
PIG regression provide for adjustment of greater amounts of overdispersion
than does the negative binomial model. Simply put, PIG models can better
deal with highly overdispersed data than can negative binomial regression,
particularly with data clumped heavily at 1 and 2.
164
+ 2
+ 3
+ 2
+ 3
.5
.5
.5
.5
1
1
1
1
.5
1
2
5
5
1
2
5
0.625
0.75
1.0
1.75
1.5
2.0
3.0
6.0
.0563
.063
.755
1.125
1.5
2.0
3.0
6.0
5
5
5
5
10
10
10
10
.5
1
2
5
.5
1
2
5
17.5
30
55
130
60
110
210
510
67.5
130
255
380
510
1010
2010
5010
The foremost assumptions of the PIG model are similar to those underlying
the negative binomial model. The key difference is the following:
PIG regression is used to model count data that have a high initial peak
and that may be skewed to the far right as well as data that are highly
Poisson overdispersed.
The PIG probability distribution, as a variety of Sichel distribution, can be
given as
(y )2
(6.1)
exp
f (y; , ) =
2y3
22 y
with {y, , } 0.
Again, a value of the model rests in the fact that it can capture the modeling
of count data that are sharply peaked and skewed to the far right (e.g., hospital
data where most patients are discharged within the first few days and then the
numbers taper off quickly, with a few perhaps lingering on for a long time).
The negative binomial model may do a poor job of modeling such data, as
will other count models. The PIG model is the only model we describe that
can deal with this form of data.
The preceding PDF can be used to calculate PIG probabilities, as we did
for the Poisson and NB2 distributions. Parameterizing = 1 for a direct
165
6.2.2 Examples
I begin by modeling the German health data using the rwm1984 file that was
used earlier in the book (see Table 6.3). The reduced model data will be used,
166
which includes outwork and age as predictors. Recall that outwork is binary
and age is continuous, ranging from 25 to 64. Earlier, we centered the variable, which technically is the appropriate manner of handling a continuous
predictor that does not start near 0 or 1. For pedagogical purposes, Ill leave
age in its natural form. docvis ranges from 0 to 121, but there are only a few
patients visiting a physician this much during the year. We will test the PIG
model against NB2. Remember that the data have far more zero counts than
allowed by the distributional assumptions:
. use rwm1984
NB2 MODEL
. glm docvis outwork age, nolog
not displayed
. abic
AIC Statistic
BIC Statistic
=
=
AIC*n
= 16673.525
BIC(Stata) = 16698.572
4.303956
4.304753
. predict munb
(option n assumed; predicted number of events)
. linktest
_hatsq |
.0551027
.1963277
0.28
0.779
-.3296926
.439898
Number of obs
3874
167
Wald chi2(2)
=
1527.35
Log pseudolikelihood = -8378.5553
Prob chi2
=
0.0000
-----------------------------------------------------------------------|
Robust
docvis |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
------------+----------------------------------------------------------outwork |
age |
.5277197
.0268643
.019414
.0010644
27.18
25.24
0.000
0.000
.489669
.0247781
.5657705
.0289506
=
=
4.327597
4.328395
AIC*n
= 16765.111
BIC(Stata) = 16790.158
You may compare the predicted counts for the NB2 (munb) and PIG (mupig)
models as we have for other models:
. predict mupig
. linktest
_hatsq |
.3137388
.1355676
2.31
0.021
.0480311
.5794465
. pigreg , irr
Poisson-Inverse Gaussian regression
Number of obs
Wald chi2(2)
=
=
3874
1527.35
1.695063
1.027228
.7515428
.0329079
.0010934
.0418261
27.18
25.24
-5.13
0.000
0.000
0.000
1.631776
1.025088
.6738779
1.760804
1.029374
.8381587
------------+----------------------------------------------------------/lnalpha |
1.344595
.0152366
1.314732
1.374459
------------+----------------------------------------------------------alpha |
3.836634
.0584573
3.723753
3.952936
------------------------------------------------------------------------
168
Std. Error
0.10801
t value
-2.646
Pr(|t|)
8.186e-03
outwork
0.52763
0.05563
9.485 4.096e-21
age
0.02686
0.00241
11.146 2.020e-28
------------------------------------------------------------------Sigma link function: log
Sigma Coefficients:
Estimate Std. Error
(Intercept)
1.344
0.04166
t value
Pr(|t|)
32.26
1.792e-202
exp(1.344)
[1] 3.83435
We next model the medpar data using a PIG model. Recall that the type
categorical variable refers to the type of admission to a hospital. Type1, the reference level, is an elective admission, type2 is urgent, and type3 is emergency.
We use a robust variance estimator since we found Poisson overdispersion in
the data:
169
=
=
1495
6047.35
.87023
1.218342
.0054969
.0060711
-22.01
39.63
0.000
0.000
.8595228
1.206501
.8810706
1.2303
type3 |
1.688646
.0149556
59.16
0.000
1.659586
1.718214
6.430156
AIC*n
BIC Statistic
AICH Statistic
=
=
6.436512
6.430021
BIC(Stata) = 9644.9424
AICH*n
= 9595.1846
= 9613.084
. linktest
-----------------------------------------------------------------------los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
------------+----------------------------------------------------------_hat | -1.085516
3.213062 -0.34
0.735
-7.383002
5.211969
_hatsq | .4303325
.6625117
0.65
0.516
-.8681666
1.728832
_cons | 2.505929
3.868191
0.65
0.517
-5.075586
10.08744
------------+----------------------------------------------------------/lnalpha | -.6035841
.0571029
-.7155038 -.4916645
------------+----------------------------------------------------------alpha | .5468481
.0312266
.4889457
.6116076
-----------------------------------------------------------------------Likelihood-ratio test of alpha=0:
Prob=chibar2 = 0.000
chibar2(01) = 4241.09
The AIC and BIC tests are statistically the same as for the negative binomial
model. The linktest indicates that the model is not misspecified (_hatsq =
.516). The negative binomial model linktest fails, though, with a hat-squared
170
value of 0.036. It appears that the PIG model is to be preferred over the
negative binomial model.
171
10
15
Count
NB2 - Predicted
NB2 - Observed
FIGURE 6.1. Poisson inverse Gaussian model: docvis observed vs. predicted probabilities.
test model for the medpar data is a zero-truncated model. We address both
of these models in the following chapter.
I should mention that it is possible to program a Stata command like
countfit to incorporate pigreg as an optional model. It does not exist
at present. A graphic like Figure 5.4 for observed versus predicted PIG can
nevertheless be made by substituting
qui gen newvar = exp((-(i-mu)2)/(alpha*2*i*mu2)) *
(sqrt(1/(alpha*2*_pi*i3)))
in place of the negative binomial PDF in Table 5.3. Running the code produces
Figure 6.1.
R code for the same figure may be created by substituting the R PIG PDF
for the R code used to generate the negative binomial version of this figure.
Graphics such as these are valuable ways in which an analyst can assess the
fit of the model they believe best describes their study data.
CHAPTER 7
and that the number of zero counts will be .04978707 * 100 = 4.978707,
or 5.
172
173
174
and
I should mention at this point that zero-truncated models have another use
besides adjusting the underlying PDF of a model and the log-likelihood of its
estimation algorithm so that they can model data that preclude the possibility
of 0s. I am referring to their use as a component of hurdle models. We will
discuss hurdle models in the following section.
Also to be mentioned is the fact that some statisticians have encouraged
the use of shifted Poisson models when there are structurally no possible zero
counts. That is, the counts are shifted left so that 1 counts become 0s, 2s
become 1s, and so forth. A problem is that the resultant model is not based
on data as they in fact exist. My caveat is that if a shifted model is employed
on study data, it must be reported in study results.
175
this difference. That is, the Poisson PDF is rescaled to exclude zero counts
by dividing the PDF by 1 Pr(y = 0), which is 1 exp(). This in effect
apportions the withdrawal of zero counts from each element of the Poisson
PDF:
e i i i
f (y; ) =
(1 exp(i ))yi !
y
(7.1)
n
i=1
(7.2)
The model is no longer a GLM and cannot be estimated from within its
umbrella. It is nearly always estimated using a full maximum likelihood
algorithm. We therefore parameterize the log-likelihood in terms of xb and
not .
Perhaps the most important thing to remember about zero truncation is
that it is not really needed unless the mean of the distribution of counts to
be modeled is low. Remember the Poisson PDF earlier in this chapter. We
saw that, for a mean of 3, there should be only 5% 0s in the data. For a
100-observation data set, only 5 out of 100 counts should be 0s. However,
for a mean of 5, we expect 2/3 of 1% of the counts to be 0s (i.e., about 7 out
of 1000):
. di exp(-5) * 50 / exp(lnfactorial(0))
.00673795
If we have a model for which the count response variable has a mean
greater than 5, we simply should not expect any 0 counts at all. So:
r For a Poisson model, if the mean is 5 or more, expect less than 1% 0s in
176
r Check the mean and number of observations of the data first before becom-
As an example, we use the medpar data. The response, los, has a mean of
. sum los
Variable |
Obs
Mean
Std. Dev.
Min
Max
------------+----------------------------------------------------------los |
1495
9.854181
8.832906
1
116
177
-.153871
.0274128
-5.61
0.000
-.2075991
-.100143
hmo | -.0715493
type2 | .2216518
type3 | .7094767
.023944
.0210519
.026136
-2.99
10.53
27.15
0.003
0.000
0.000
-.1184786
.1803908
.6582512
-.02462
.2629127
.7607022
_cons | 2.332933
.0272082
85.74
0.000
2.279606
2.38626
-----------------------------------------------------------------------. abic
AIC Statistic
BIC Statistic
=
=
AIC*n
= 13867.815
BIC(Stata) = 13894.365
9.276131
9.280208
.0274166
.0239636
-5.61
-2.99
0.000
0.003
-.2076792
-.1186164
-.1002081
-.0246807
type2 |
type3 |
_cons |
.0210563
.0261385
.0272121
10.53
27.15
85.73
0.000
0.000
0.000
.180511
.6583857
2.279526
.2630502
.7608467
2.386195
.2217806
.7096162
2.33286
-----------------------------------------------------------------------. abic
AIC Statistic
=
9.275884
AIC*n
= 13867.447
BIC Statistic
9.279961
BIC(Stata) = 13893.996
There is in fact very little difference in the models. The guideline is that if
the mean of the count response variable is high perhaps over 4 there is
typically no need to model the data using a zero-truncated model.
(7.3)
178
Subtracting equation (7.3) from 1, logging the result and dividing the reparameterized negative binomial log-likelihood by the result, we have
n
1/
L(; y|y 0) =
LNB2 ln 1 1 + exp xi
(7.4)
i=1
where the first term in the braces is the standard NB2 log-likelihood function,
given in equation (5.3). Note that since the log-likelihood is on the log scale,
the term 1 Pr(0) must also be logged; i.e., log(1 Pr(0)).
As an example, we use the same medpar data:
. ztnb los white hmo type2 type3, nolog
. .
.
-----------------------------------------------------------------------los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-----------+-----------------------------------------------------------white | -.1345583
hmo | -.0726664
type2 | .2344362
.0757438
.0589298
.0559015
-1.78
-1.23
4.19
0.076
0.218
0.000
-.2830134
-.1881666
.1248713
.0138968
.0428339
.3440011
type3 | .7355978
.084056
8.75
0.000
.5708511
.9003445
_cons | 2.272518 .0752203
30.21
0.000
2.125089
2.419947
-----------+-----------------------------------------------------------/lnalpha | -.6007157
.0549884
-.708491
-.4929403
-----------+-----------------------------------------------------------alpha |
.548419
.0301567
.4923866
.6108277
-----------------------------------------------------------------------Likelihood-ratio test of alpha=0: chibar2(01) = 4354.66
Prob=chibar2 = 0.000
. abic
AIC Statistic
6.364409
AIC*n
BIC Statistic
6.370764
BIC(Stata) = 9546.6514
. linktest
...
_hatsq |
.2424067
.4024168
0.60
0.547
= 9514.792
-.5463157
1.031129
179
The AIC statistics for the ZTNB are substantially lower than those associated
with the ZTP, indicating a much better fit.
A standard NB2 model is displayed for comparison. Note that the coefficients differ more than the ZTP did compared with the Poisson model.
Note also that the AIC and BIC values are substantially lower for the zerotruncated model compared with the standard model. Recall that the values
for the Poisson and ZTP were nearly identical. This is because more 0 counts
were predicted for the negative binomial with the mean of los, and therefore
more adjustment had to be made to adjust the truncated model PDF to sum
to 1:
. glm los white hmo type2 type3, fam(nb ml) nolog nohead
-----------------------------------------------------------------------|
OIM
los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
---------+-------------------------------------------------------------white |
hmo |
-.1290654
-.0679552
.0685416
.0532613
-1.88
-1.28
0.060
0.202
-.2634046
-.1723455
.0052737
.0364351
type2 |
type3 |
_cons |
.221249
.7061588
2.310279
.0505925
.0761311
.0679472
4.37
9.28
34.00
0.000
0.000
0.000
.1220894
.5569446
2.177105
.3204085
.8553731
2.443453
-----------------------------------------------------------------------. abic
AIC Statistic
=
6.424718
AIC*n
= 9604.9531
BIC Statistic
6.428794
BIC(Stata) = 9631.5029
The model is not misspecified, passing the linktest (p = 0.547). The likelihood ratio test tells us that the zero-truncated model is preferable to a
standard negative binomial model.
Calculating the probability of zero counts for a negative binomial distribution is important when deciding if a negative binomial model would benefit by
using a zero-truncated negative binomial. To determine the number of zeros
that is expected for a model whose mean is 2 and alpha is 1, use the code in
the following table and multiply the result by the number of observations in
the model.
Stata: Calculate NB2 expected 0s for given and
=======================================================
. gen a = 1
. gen mu = 2
. gen y=0
180
In the negative binomial model for the medpar data, the response, los,
has a mean of 9.854181. Inserting that into the preceding formulae gives
us a predicted probability of a zero count of 0.09213039. Multiplying by
1495 gives us 137.735 as the expected number of zero counts. So a sizable
adjustment must have been made to the distribution since the zeros have
been truncated from the model, as mentioned earlier.
181
(7.5)
where the dispersion statistic is given a direct relationship with the mean (i.e.,
). Dividing the PIG PDF by 1 equation (7.5) results in a zero-truncated
PDF:
. ztpig los hmo white type2 type3, nolog
Zero-truncated PIG regression
Number of obs
LR chi2(4)
=
=
1495
63.10
Prob chi2
=
0.0000
Log likelihood = -4778.6977
Pseudo R2
=
0.0066
-----------------------------------------------------------------------los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
-----------+-----------------------------------------------------------hmo | -.0643421
.0593155
-1.08 0.278
-.1805983
.0519142
white | -.1478711
.0747265
-1.98
0.048
-.2943324
-.0014098
type2 |
type3 |
.0554261
.0793471
3.79
6.84
0.000
0.000
.1011741
.3872638
.3184405
.6982988
.2098073
.5427813
_cons | 2.324182
.0741818
31.33 0.000
2.178788
2.469576
-----------+-----------------------------------------------------------/lnalpha | -.5109055
.0608871
-.630242
-.391569
-----------+-----------------------------------------------------------alpha | .5999521
.0365293
.5324629
.6759954
-----------------------------------------------------------------------Likelihood-ratio test of delta=0: chibar2(01) = 4300.05
Prob=chibar2 = 0.000
. abich
AIC Statistic
6.400933
AIC*n
BIC Statistic
6.407289
BIC(Stata) = 9601.2549
AICH Statistic
6.400798
AICH*n
= 9569.3955
= 9551.4971
The model is ideal for highly skewed data and highly overdispersed data
that cannot have zero counts. No R code is currently available for this model.
Note that a linktest failed to prove the model as misspecified. However,
comparing AIC and BIC values, the model does not fit the data as well as the
negative binomial.
182
Note: The azsurgical data on this bookss web site is ideal for modeling
with a ZTPIG. It is appendectomy data modeling los on procedure. The task
is to discover the extent to which LOS relates to having open or laprascopic
surgery.
Number of obs
1495
Wald chi2(4)
=
50.61
Log likelihood = -4740.3872
Prob chi2
=
0.0000
-----------------------------------------------------------------------los |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
------------+----------------------------------------------------------hmo | -.0646661 .0483063
-1.34 0.181 -.1593447
.0300126
white |
type2 |
type3 |
-.062624
.2034899
.7023193
.0656351
.0507239
.1081974
-0.95
4.01
6.49
0.340
0.000
0.000
-.1912665
.1040729
.4902563
.0660186
.3029069
.9143823
=
=
=
6.35102
6.359878
6.353812
AIC*n
= 9494.7744
BIC(Stata) = 9531.9434
AICH*n
= 9471.6875
This is the best-fitted model, if we are to believe the AIC and BIC statistics.
We discussed this model in Chapter 4 as a three-parameter model aimed at
advising the analyst whether NB2 or NB1 better fits the data. It appears that
a PIG model is preferable to NB1 or NB2 since the NB-P power parameter
is 3.18. It would be a mistake to interpret in this manner, though, since
NB-P is based on a negative binomial distribution, not a PIG. However,
given that the ZTNBP model is better fitted than ZTPIG, it appears likely
that there are other causes of overdispersion. Keep in mind that a model
with excessive zero counts will almost always be overdispersed, but there
may be other sources of overdispersion as well. The NB-P model may in
effect be adjusting for multiple sources of overdispersion. Recall that, for
the NB and PIG models, the dispersion statistic has a single value. If there is
183
heterogeneity in the dispersion, then they are not suitable models to adjust for
it. Heterogeneous NB and NB-P are suitable models, though, as are the threeparameter generalized negative binomial models we discuss in Chapter 9.
A linktest was given to the model, indicating that it is not misspecified. Recall that the ZTPIG model also passed the linktest and appears well
specified.
Number of obs
Wald chi2(4)
Prob chi2
=
=
=
1495
56.68
0.0000
-----------------------------------------------------------------------los |
Coef.
Std. Err.
z
P|z| [95% Conf. Interval]
------------+----------------------------------------------------------hmo | -.0604598
white | -.1407904
type2 | .2025758
.0573828
.0736062
.0544102
-1.05
-1.91
3.72
0.292
0.056
0.000
-.1729281
-.2850559
.0959339
.0520084
.0034751
.3092177
type3 |
.0828975
6.23
0.000
.3539716
.6789238
.5164477
38.00
-3.19
0.000
.6663518
Prz = 0.9993
. abich
AIC Statistic
6.390244
AIC*n
BIC Statistic
AICH Statistic
=
=
6.3966
6.390109
BIC(Stata) = 9585.2744
AICH*n
= 9535.5166
.7387676
= 9553.415
This model is also not misspecified. A Vuong test against the ZTNB is not
good, though, telling us that the ZTNB is the preferred model. Given the
differences in AIC and BIC statistics for each model, this is not surprising.
184
185
have the hurdle at crossing 0, which is the form we discuss in this section. In
any case, the two processes are conjoined using the log-likelihood
L = ln( f (0)) + {ln[1 f (0)] + ln P (t)}
(7.6)
where f (0) represents the probability of a zero count, and P(t) represents the
probability of a positive count. Equation (7.6) reads that the hurdle model
log-likelihood is the log of the probability of y = 0 plus the log of y = 1 plus
the log of y being a positive count.
In the case of a logit model, the probability of zero is
f (0) = P (y = 0; x) =
1
1 + exp(xi b )
(7.7)
and 1 f (0) is
exp(xi b )
1 + exp(xi b )
(7.8)
186
Number of obs
Wald chi2(2)
=
=
3874
155.20
.0030821
.1351158
8.55
-7.35
0.000
0.000
.0202966
-1.258377
.0323782
-.7287326
------------+----------------------------------------------------------poisson
|
outwork |
age |
.2134047
.0116587
.0191964
.0008481
11.12
13.75
0.000
0.000
.1757804
.0099965
.251029
.013321
_cons |
1.04142
.0397268 26.21 0.000
.9635565
1.119283
-----------------------------------------------------------------------AIC Statistic =
6.282
. abich
AIC Statistic
BIC Statistic
AICH Statistic
=
=
=
6.283534
6.285986
6.281605
AIC*n
= 24342.41
BIC(Stata) = 24379.982
AICH*n
= 24328.146
Next, lets model the two-part hurdle model separately. We can use the
following code guidelines (see Table 7.4):
r Create a variable visit equal to 1 if docvis 0 and 0 otherwise.
r Model visit on predictors the binary component of the hurdle model.
187
First, for the binary component, we have the following logistic regression
model:
. gen visit=docvis0
. logit visit outwork age, nolog
Logistic regression
Number of obs
3874
LR chi2(2)
Prob chi2
=
=
164.64
0.0000
=
=
=
1.316883
1.317036
1.315514
AIC*n
= 5101.605
BIC(Stata) = 5120.3911
AICH*n
= 5095.2573
188
It is the same as the hurdle model. Next the count component is modeled:
. ztp docvis outwork age if docvis0, nolog
Zero-truncated Poisson regression
Number of obs
LR chi2(2)
=
=
2263
443.77
Prob chi2
=
0.0000
Log likelihood = -9617.4025
Pseudo R2
=
0.0226
-----------------------------------------------------------------------docvis |
Coef.
Std. Err.
z
P|z|
[95% Conf. Interval]
--------------+--------------------------------------------------------outwork | .2134047
.0191964
11.12
0.000
.1757804
.251029
age |
_cons |
.0116587
1.04142
.0008481
.0397268
13.75
26.21
0.000
0.000
.0099965
.9635565
.013321
1.119283
-----------------------------------------------------------------------. abich
AIC Statistic
BIC Statistic
=
=
8.502344
8.502605
AIC*n
= 19240.805
BIC(Stata) = 19257.979
AICH Statistic
8.500219
AICH*n
= 19234.209
189
reason to believe that the binary component is not asymmetric about .5,
perhaps you should try an NB-complementary loglog hurdle model. I will
not explore these models here, but you should know that they are available
on the books web site.
Look at the changes in the model coefficients and particularly in the standard errors and model p-values. Also check the AIC and BIC statistics to
determine whether the negative binomial component has contributed to a
significant reduction or elimination of overdispersion:
. hnblogit docvis outwork age, nolog
Negative Binomial-Logit
Hurdle Regression
Number of obs
Wald chi2(2)
Prob chi2
=
=
=
3874
155.20
0.0000
-----------------------------------------------------------------------|
Coef. Std. Err.
z
P|z| [95% Conf. Interval]
--------------+--------------------------------------------------------logit
|
outwork |
age |
.524712
.0263374
.0718972
.0030821
7.30
8.55
0.000
0.000
.3837961
.0202966
.6656279
.0323782
.2844499
.0158616
.3522567
.0621091
.0027161
.1336561
4.58
5.84
2.64
0.000
0.000
0.008
.1627184
.0105382
.0902956
.4061815
.021185
.6142178
--------------+--------------------------------------------------------/lnalpha |
.7514482 .0926998
8.11 0.000
.5697601
.9331364
--------------+--------------------------------------------------------alpha |
2.120068 .1965298
1.767843
2.542471
-----------------------------------------------------------------------AIC Statistic =
4.290
. abich
AIC Statistic
4.292372
AIC*n
BIC Statistic
4.295791
BIC(Stata) = 16672.484
AICH Statistic
4.290557
AICH*n
= 16628.65
= 16611.166
The information criterion tests (AIC and BIC) are significantly lower,
indicating a better fit. We can determine whether robust standard errors are
required for the count component, and we can parameterize the coefficients
190
.0031672
.0502134
8.54
-7.33
0.000
0.000
1.020498
.2838357
1.032914
.4829948
------------+----------------------------------------------------------negbinomial |
outwork | 1.329031
.1087
3.48
0.001
1.132182
1.560105
age | 1.015988
.0033321
4.84
0.000
1.009478
1.02254
_cons | 1.422274
.2124444
2.36
0.018
1.061303
1.906017
------------+----------------------------------------------------------/lnalpha | .7514482
.1196007
6.28
0.000
.5170351
.9858614
------------+----------------------------------------------------------alpha | 2.120068
.2535617
1.677048
2.68012
------------------------------------------------------------------------
Robust standard errors differ from the model standard errors for the count
component; we therefore want to keep them in a model.
We have not yet discussed obtaining predicted values from a hurdle model.
We can separate models and obtain predictions for each component, or predict for a specified component in a multicomponent model. The latter is the
best course to take. The following Stata code provides predicted values for
the negative binomial model. If eq(#2) is not specified, the default is the first
logistic model component. We call the fitted or predicted value mu:
. hnblogit_p mu, eq(#2) irr
. l mu docvis outwork age in 1/5
+-----------------------------------+
|
mu
docvis
outwork
age |
|-----------------------------------|
1. | 3.349409
1
0
54 |
2. | 3.798543
0
1
44 |
3. | 4.74305
4. | 3.925132
5. | 3.042121
0
7
6
1
0
1
58 |
64 |
30 |
+-----------------------------------+
191
Marginal effects may be obtained for hurdle models as well. See Hilbe
(2011) for guidelines and example code on how to construct and interpret
hurdle model marginal effects. Briefly, however, for both Stata and R (see
Table 7.5) it is easiest to obtain separate marginal effects for each component. For example, given the NB2-logit model we just reviewed, to obtain
marginal effects at means for the negative binomial component, we do as
follows:
STATA CODE
. ztnb docvis i.outwork age if docvis0
. margins, dydx(*)
atmeans noatlegend
192
binary component (equation #1). Either type the same command as done
previously or one for each component,
. margins, dydx(*) atmeans noatlegend predict(equation(#1))
. margins, dydx(*) atmeans noatlegend predict(equation(#2))
Note that if an older command is used that does not support the i. prefix
for factor variables, use the tabulate command as
. tab outwork, gen(outwork)
Number of obs
LR chi2(2)
Prob chi2
=
=
=
3874
73.90
0.0000
193
outwork | .2933067
age | .0147178
_cons | .6937237
.0604065
.0026478
.1221457
4.86
5.56
5.68
0.000
0.000
0.000
.1749122
.0095282
.4543225
.4117011
.0199073
.9331248
-------------+---------------------------------------------------------/lnalpha | .4436434
.0637793
.3186383
.5686485
-------------+---------------------------------------------------------alpha | 1.558375
.099392
1.375254
1.765879
-----------------------------------------------------------------------Likelihood-ratio test of alpha=0:
Prob=chibar2 = 0.000
chibar2(01) = 7838.96
. abich
AIC Statistic
BIC Statistic
=
=
5.039261
5.040627
AIC*n
= 11403.847
BIC(Stata) = 11426.745
AICH Statistic
5.036855
AICH*n
= 11394.568
.2933067
.0147178
.6937237
.0604065
.0026478
.1221457
4.86
5.56
5.68
0.000
0.000
0.000
.1749122
.0095282
.4543225
.4117011
.0199073
.9331248
194
The Poisson log-normal hurdle may be calculated in the same manner. The
preceding logit model will be the same model used here as well:
. ztpnm docvis outwork age if docvis0,
Zero-truncated Poisson normal mixture model
Number of obs
2263
Wald chi2(2)
Prob chi2
=
=
73.22
0.0000
-----------------------------------------------------------------------docvis |
Coef.
Std. Err.
z
P|z| [95% Conf. Interval]
-------------+---------------------------------------------------------outwork |
age |
_cons |
.2398099
.0117142
.5241504
.0496063
.0021539
.099247
4.83
5.44
5.28
0.000
0.000
0.000
.1425833
.0074927
.3296299
.3370365
.0159358
.7186708
-------------+---------------------------------------------------------/lnsigma | -.0650437
.022556 -2.88
0.004 -.1092526 -.0208348
-----------------------------------------------------------------------sigma |
.9370265
.0211355
44.33
0.000
.896504
. abich
AIC Statistic
BIC Statistic
=
=
5.039033
5.040399
AIC*n
= 11403.333
BIC(Stata) = 11426.23
AICH Statistic
5.036628
AICH*n
. di 11403.333+5101.605
.9793807
= 11394.055
16504.938
The AICn value is nearly identical to that for the PIG-logit hurdle model.
I also tested the generalized NB-Plogit hurdle, with the result that its AIC
value (16629.17) is substantially higher than for the PIG or log-normal hurdle
models.
195
Number of obs
Wald chi2(2)
=
=
3874
171.15
.0178382
.0020181
8.84
0.000
.0138827
.0217937
_cons | -1.043595
.0915753 -11.40
0.000 -1.223079 -.8641103
-----------------------------------------------------------------------AIC Statistic =
1.316
AICn
5098.184
The overall PIG-Poisson hurdle AIC can be calculated just as for the
previous hurdle models. The AICn for the right censored at 1 Poisson is
5098.184, and from before we know that the AICn of the zero-truncated PIG
is 11403.847. The sum is
. di 5098.184+11403.847
16502.031
/// AIC
which is the lowest AICn value we have calculated. Perhaps a PIG-NB hurdle
would be better, or a PIG-PIG hurdle. I will let you determine the result.
What I hope that you learned from this section is that there are various ways
to model count data.
We could have tried a wide variety of models, such as Poissonlognormal
negative binomial hurdle, or any variety of binary model, such as logit, probit,
complementary loglog, or loglog, or a right censored at 1 count model. The
choice to be used is the one that best fits the data.
Rs pscl package supports the Poisson and negative binomial hurdle
models but no others. I hope to develop R functions that match the Stata
commands used with this book.
196
when the data have excessive zero counts. For example, it may be that
a negative binomial or a hurdle model will be all that is needed to best
understand such data or some other model. To appropriately employ a
zero-inflated model on data with excessive zero counts, the analyst should
have a theory as to why there are a class of observations having both
observed and expected zero counts.
r Zero-inflated models can be thought of as finite mixture models (i.e., where
there are supposed two data-generating mechanisms, one generating 0s
and one generating the full range of counts).
197
r There are two different types of 0s in the data one generated by a binary
198
1
1 + exp(x)
(7.10)
199
The idea is that 0 counts represent errors. For example, when counting the
number of birdcalls made by a type of bird, zero counts can occur because of
r birds being quiet during the time they were being recorded and
r those for which you didnt record the calls because you were sitting in the
wrong place or were there at the wrong time to observe the calls. Perhaps
you even failed to show up for the recording session.
In this case, the first reason for zero counts represents good counts and
comes from the count component of the model. The second reason, because
no one shows up to do the counting, or because someone had just shot at the
birds and they flew away, or someone shot and killed them, represents bad
zeros, which form the binary component.
There is really no way to tell where the lack of counts came from, but many
ecologists have considered it important to make this type of distinction. I
mention this here since you may read about it in the literature, especially if
researchers from outside ecology start to use this distinction. See Zuur (2012)
for a complete guide to this method.
vuong
Number of obs
3874
Nonzero obs
Zero obs
=
=
2263
1611
.2132031
.0116248
1.043192
.0192091
.0008472
.0396523
11.10
13.72
26.31
0.000
0.000
0.000
.1755539
.0099643
.9654749
.2508522
.0132852
1.120909
------------+-----------------------------------------------------------
200
inflate
|
outwork | -.5145616
age | -.025714
.0724867
.0031096
-7.10
-8.27
0.000
0.000
-.6566328
-.0318087
-.3724904
-.0196192
19.24
19.19
. abic
AIC Statistic
6.2837
BIC Statistic
6.286153
Prz = 1.0000
Prz = 0.0000
Prz = 1.0000
Prz = 0.0000
Prz = 1.0000
= 24343.055
AIC*n
BIC(Stata) = 24380.627
The rate ratios for the count component of the preceding table of model
statistics can be displayed as
. zip, irr
Zero-inflated Poisson regression
Number of obs
Nonzero obs
Zero obs
=
=
=
3874
2263
1611
1.237636
1.011693
2.838262
.0237739
.0008571
.1125436
11.10
13.72
26.31
0.000
0.000
0.000
1.191906
1.010014
2.626034
1.28512
1.013374
3.067641
-------------+---------------------------------------------------------inflate
|
outwork | -.5145616 .0724867
-7.10
0.000 -.6566328 -.3724904
age | -.025714 .0031096
-8.27
0.000 -.0318087 -.0196192
_cons | .9453518 .1365165
6.92
0.000
.6777844
1.212919
-----------------------------------------------------------------------Vuong test of zip vs. standard Poisson: z =
19.26
Prz = 0.0000
The Vuong test, displayed directly under the Stata model output, aims to
test whether ZIP is preferable to the traditional Poisson model. A resulting
p-value of under 0.05 provides evidence that the ZIP model is preferred
(i.e., the number of zero counts exceeds Poisson distributional assumptions).
201
# expected counts
# observed counts
Two correction tests for the Vuong statistic are displayed by using the zipcv
command (Desmarais and Harden 2013). The Vuong statistic is biased toward
the zero-inflated model because the same data are used to estimate both the
binary and count component parameters. AIC- and BIC-based correction
factors adjust for the extra parameters in the zero-inflated component. The
AIC correction factor tends to prefer the zero-inflated model when it is correct,
but it does not always reject the zero-inflated model when it is not correct;
the BIC correction factor tends to prefer the single-equation model when it
is correct. Interpret the tests with care. I refer you to Desmarais and Harden
(2013) for details. The correction factors are not yet available in R or SAS. For
this model, the standard Vuong and both correction tests provide support for
the ZIP model.
Next, calculate the predicted counts of the count component and predicted probability of the binary component, which we have specified as a
binary logistic model. Recall that zero counts are modeled in both the binary
and count components, resulting in a mixture of distributions. This mixture of zeros from both the binary and count model components must be
accommodated when calculating the predicted counts.
The predicted probability of the binary logistic component of the ZIP
model is calculated straightforwardly using the inverse logistic link, 1(1 +
exp(xb)) or exp(xb)(1 + exp(xb)):
STATA CODE
. predict prob
/* predicted count */
202
The predicted counts for the count component, however, reflect the mixture of zeros. The Poisson inverse link, exp(xb), defines the predicted counts,
but each count observation must be adjusted by 1 minus the predicted probability of being in the binary component:
. gen prcnt=exp(xb_c)*(1-pr0)
. su docvis prob prcnt pr0
Variable |
Obs
Mean
Std. Dev.
Min
Max
-----------+--------------------------------------------------------docvis |
3874
3.162881
6.275955
0
121
prob |
prcnt |
pr0 |
3874
3874
3874
3.160131
3.160131
.4116941
1.140611
1.140611
.0980708
1.612873
1.612873
.2288347
5.70035
5.700351
.5750543
Note that the AIC and BIC statistics are 6.28. These values are substantially lower than the Poisson model AIC and BIC of 8.07 but much higher
than the negative binomial value of 4.30. A negative binomial model of the
data, regardless of the excessive zero counts, is better than either standard
Poisson or ZIP models. Given that ZIP is preferable to Poisson, however, it is
reasonable to assume that a ZINB model will be preferable to an NB model.
A table of expected counts can be given for the hurdle model, which can
be compared with the observed counts:
R CODE
pred - round(colSums(predict(zip, type="prob") [,1:13]))
obs - table(rwm1984$docvis)[1:13]
rbind(obs, pred)
obs
0
1
2
3
4
5
6
1611 448 440 353 213 168 141
pred 1611
7
60
8
85
9 10 11 12
47 45 33 43
Number of obs
3874
Nonzero obs
Zero obs
=
=
2263
1611
203
1.334501
1.018774
.0839373
.0028387
4.59
6.68
0.000
0.000
1.179723
1.013226
1.509585
1.024353
_cons |
1.356756
.1847742
2.24 0.025
1.038911
1.771844
------------+----------------------------------------------------------inflate
|
outwork |
age |
_cons |
-1.312531
-.0318529
-.2309631
.6455665
.0119638
.5240838
-2.03
-2.66
-0.44
0.042
0.008
0.659
-2.577818
-.0553015
-1.258149
-.0472438
-.0084043
.7962223
------------+----------------------------------------------------------/lnalpha |
.5832367
.0677161
8.61 0.000
.4505156
.7159578
------------+----------------------------------------------------------alpha |
1.791829
.1213356
1.569121
2.046146
-----------------------------------------------------------------------Likelihood-ratio test of alpha=0: chibar2(01) = 7697.35
Pr=chibar2 = 0.0000
Vuong test of zinb vs. standard negative binomial: z =
2.61
Prz = 0.0046 =
with AIC (Akaike) correction: z =
2.11
. abic
AIC Statistic
4.297291
BIC Statistic
4.30071
AIC*n
0.58
Prz = 0.9954
Prz = 0.0172
Prz = 0.9828
Prz = 0.2824
Prz = 0.7176
= 16647.707
BIC(Stata) = 16691.541
The predicted counts for the count component reflect the mixture of zeros.
The negative binomial inverse link, exp(xb), defines the predicted counts, but
each count observation must be adjusted by 1 minus the predicted probability
204
# observed counts
3874
3874
3874
3.161081
3.161082
.1299529
1.162163
1.162163
.073266
1.590592
1.590592
.0270649
5.792892
5.792892
.2636108
R code is given in Table 7.7, and partial output follows the table:
R OUTPUT
Count model coefficients (negbin with log link):
(Intercept)
outwork
age
Log(theta)
0.002786
0.067736
6.676
-8.610
2.45e-11 ***
2e-16 ***
0.52445
0.64614
0.01196
-0.440
-2.031
-2.664
0.65966
0.04222 *
0.00772 **
---
Downloaded from Cambridge Books Online by IP 155.54.213.67 on Fri Jun 05 18:50:54 BST 2015.
http://dx.doi.org/10.1017/CBO9781139236065.008
Cambridge Books Online Cambridge University Press, 2015
205
Signif. codes:
Theta = 0.5581
Number of iterations in BFGS optimization: 26
Log-likelihood: -8317 on 7 Df
print(vuong(zinb,nb2))
Vuong Non-Nested Hypothesis Test-Statistic: 2.60572
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishable)
in this case:
model1 model2, with p-value 0.00458407
exp(coef(zinb))
count_(Intercept)
1.3567548
count_outwork
1.3345011
zero_outwork
0.2691399
zero_age
0.9686489
count_age
1.0187743
zero_(Intercept)
0.7937695
ZINB models may be compared with ZIP models using the boundary likelihood test, with p-values less than 0.05 indicating that the ZINB model is
preferable to ZIP. ZINB models may be compared with NB (NB2) models by
using the Vuong test. p-values less than 0.05 indicate that the ZINB model is
preferable to NB. The test statistic is normally distributed, with large positive
values favoring the ZINB model and large negative values favoring the NB
model. An insignificant test result favors neither model.
The likelihood ratio test assesses the relationship of ZINB to ZIP. Here the
test clearly prefers ZINB. The uncorrected Vuong and AIC-corrected tests
show that the ZINB model is preferable to the standard negative binomial
model (p = 0.0046 and p = 0.0172, respectively). The BIC-corrected Vuong
prefers neither (p = 0.2824). The standard negative binomial AIC and BIC
statistics are 16673.525 and 16698.572, which gives a 26-point drop in the
AIC statistic for the ZINB model compared with the standard NB model,
but only a 7-point drop in the BIC statistic. Given the AIC-corrected Vuong
statistic preferring the ZI model and the BIC-corrected statistic preferring
neither, it is likely reasonable to give an overall slight preference to ZINB.
The difference in test results should be acknowledged in a study report.
206
Note that the binomial (logistic) component that models the zero counts
does not exponentiate the coefficients, generating odds ratios. I recommend
that this be done. The standard errors of the odds ratios are determined by
multiplying each respective odds ratio by its associated model SE. Confidence
intervals are exponentiated, as are the coefficients.
3874
=
=
=
2263
1611
83.16
1.343885
.0824651
4.82
0.000
1.191597
1.515635
.1153151
-4.36
0.000
-.7290476
-.2770207
207
z =
z =
8.61
7.99
Prz = 0.0000
Prz = 0.0000
. abic
AIC Statistic
4.260713
AIC*n
BIC Statistic
4.264132
BIC(Stata) = 16549.836
= 16506.002
3874
3874
3.167142
.2987763
1.266035
.0992726
1.37824
.1334178
6.192106
.489595
R
======================================================
library(gamlss) ; data(rwm1984); attach(rwm1984)
summary(zpig - gamlss(docvis ~ outwork + age, nu.fo= ~ outwork + age,
family=ZIPIG, data=rwm1984))
# code for calculating vuong, LR test, etc. on books web site
=======================================================
The AIC and BIC values are some 140 points lower than that of the zeroinflated negative binomial (ZINB), and 50 points lower than the zero-inflated
generalized Poisson (ZIGP) (examined in the following chapter). Moreover,
the Vuong test favors the ZIPIG model over the PIG. The test statistic is
normally distributed, with large positive values favoring the ZIPIG model
and large negative values favoring the PIG model. The likelihood ratio test is
also significant, comparing the ZIPIG with the zero-inflated Poisson (ZIP).
The test clearly prefers ZIPIG to any of the models we have thus far used. To
be sure, the excess of zero counts significantly impacts model overdispersion.
208
209
BIC
AICHn
Model
ZIP
ZINB
NB-L hurdle
24343.055
16647.707
16628.650
24380.627
16691.541
16672.484
24328.791
16630.223
16611.166
ZINBP
ZIHNB
16625.041
16624.355
16675.139
16680.713
16603.963
16599.256
ZIGP
16559.494
16603.328
16542.010
ZIPIG
16506.002
16549.836
16488.520
Pig-P hurdle
16502.031
16547.136
16489.821
ZI Poisson
ZI negative binomial
negative binomial-logit
hurdle
ZI NB-P
ZI heterogeneous negative
binomial
ZI generalized Poisson
(from Chap. 8)
ZI Poisson inverse
Gaussian (PIG)
PIG-Poisson hurdle
CHAPTER 8
Modeling Underdispersed
Count Data Generalized
Poisson
models?
r Is the generalized Poisson a mixture model? If so, what are the component
211
Percent
5
10
10
Mean
15
20
LOS
25
30
35
leading an analyst to believe that predictors are significant when in fact they
are not.
The generalized Poisson probability function is based on Consul (1989)
and Famoye (1993), and this parameterization on Harris, Yang, and Hardin
(2012):
f (y; , ) =
i (i + yi ) yi 1 e i yi
, yi = 0, 1, 2, . . .
yi !
(8.1)
The types of count data that are underdispersed consist of data that are lumped
more tightly together than should be expected based on Poisson and negative
binomial distributional assumptions. Consider the azprocedure data that is
on the books web site. This data set is a record of how long patients were
in the hospital following either a CABG (1) or PTCA (percutaneous transluminal coronary angioplasty) (0) procedure. Other predictors are sex (1 =
male; 0 = female) and admit (1 = emergency/urgent; 0 = elective). We have
212
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------------los |
3589
8.830872
6.926239
83
. hist los if los40, title("LOS for full Heart Procedure Data") discrete
xlab( 5 8.83 "mean" 10 15 20 25 30 35) percent
POISSON REGRESSION
. glm los procedure sex admit, nolog fam(poi) vce(robust) nohead nolog eform
---------------------------------------------------------------------------|
Robust
los |
IRR
Std. Err.
P|z|
----------+----------------------------------------------------------------procedure |
2.604873
.0567982
43.91
0.000
2.495896
2.718608
sex |
.8779003
.0195083
-5.86
0.000
.8404854
.9169806
admit |
1.395248
.0297964
15.60
0.000
1.338054
1.454887
_cons |
4.443334
.1174981
56.40
0.000
4.218908
4.679698
--------------------------------------------------------------------------. abich
AIC Statistic
6.264283
AIC*n
BIC Statistic
6.265144
BIC(Stata) = 22507.256
. di e(dispers_p)
= 22482.514
// Dispersion
3.2323804
. di e(N)
3589
Ill next drop all lengths of stay greater than 8 days, which amounts to
1607 patients. That leaves 982 remaining, with a new mean LOS of 4.46 days.
The dispersion has dropped to 0.79, indicating underdispersion:
. drop if los8
. sum los
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+---------------------------------------------------------los |
1982
4.465691
2.30982
A Poisson model on the data clearly demonstrates that the data are underdispersed:
213
No. of obs
1982
Optimization
: ML
Deviance
1574.440006
Residual df
=
Scale parameter =
(1/df) Deviance =
1978
1
.7959757
Pearson
1562.747889
(1/df) Pearson
AIC
BIC
= .7900647
= 4.020972
= -13442.26
-----------------------------------------------------------------------|
Robust
los |
IRR
Std. Err.
z
P|z|
[95% Conf. Interval]
----------+------------------------------------------------------------procedure | 2.077086
.0344479
44.07
0.000
2.010655
2.145712
sex | .9333992
.0185912
-3.46
0.001
.8976632
.9705579
admit |
_cons |
1.363049
3.276827
.0249235
.0690439
16.94
56.33
0.000
0.000
1.315064
3.14426
1.412784
3.414984
-----------------------------------------------------------------------. abich
AIC Statistic
=
4.020972
AIC*n
= 7969.5674
BIC Statistic
4.022532
BIC(Stata) = 7991.9346
Now, lets see if the generalized Poisson model picks up on the underdispersion. Remember, a negative binomial model cannot model data that are
underdispersed. If it does succeed in estimating parameters, they are at the
values of a Poisson model:
. gpoisson los procedure sex admit, nolog
vce(robust)
Number of obs
Wald chi2(3)
=
=
1982
1891.88
Prob chi2
Pseudo R2
=
=
=
0.0000
0.0000
0.1052
Prob chi2
Dispersion
= -.1195314
Log pseudolikelihood = -3956.3794
-----------------------------------------------------------------------|
Robust
los |
IRR
Std. Err.
z
P|z|
[95% Conf. Interval]
-----------+-----------------------------------------------------------procedure | 2.051253
.0343166
42.94
0.000
1.985085
2.119628
sex |
.934638
.0186265
-3.39
0.001
.8988346
.9718675
admit | 1.366056
.0255772
16.66
0.000
1.316835
1.417118
_cons | 3.281648
.0706452
55.20
0.000
3.146066
3.423072
-----------+------------------------------------------------------------
214
/tanhdelta | -.1201056
.0150796
-.1496611 -.0905502
-----------+-----------------------------------------------------------delta | -.1195314
.0148641
-.1485536 -.0903035
-----------------------------------------------------------------------Likelihood-ratio test of delta=0: chi2(1) = 48.81 Prob=chi2 = 0.0000
. abich
=
=
AIC Statistic
BIC Statistic
AIC*n
= 7922.7588
BIC(Stata) = 7950.7183
3.997356
4.000431
The AIC and BIC statistics are reduced, and the value of the dispersion
parameter, delta, is negative, indicating the dispersion parameter has adjusted
for underdispersion. Note that /tanhdelta is the inverse hyperbolic tangent
function of delta, with a definition of
1+
1
(8.2)
tanhdelta = log
2
1
The dispersion parameter for the generalized Poisson is
=
1
(1 )2
(8.3)
Number of obs
3874
Regression link:
Nonzero obs
2263
Zero obs
1611
LR chi2(2)
84.55
Prob chi2
0.0000
-------------------------------------------------------------------------docvis |
exp(b)
Std. Err.
P|z|
------------+------------------------------------------------------------docvis
outwork |
1.369381
.1089162
3.95
0.000
1.171716
1.600391
215
age |
1.014631
.0032109
4.59
0.000
1.008357
1.020944
_cons |
1.79139
.2550683
4.09
0.000
1.355162
2.36804
------------+------------------------------------------------------------inflate
outwork | -.3160915
.2510973
-1.26
0.208
-.8082332
.1760502
age | -.0194765
.0093451
-2.08
0.037
-.0377926
-.0011604
_cons | -.3925934
.398819
-0.98
0.325
-1.174264
.3890774
------------+------------------------------------------------------------atanhdelta
_cons |
.7741937
.0171378
45.17
0.000
.7406042
.8077831
------------+------------------------------------------------------------delta |
.6493614
.0099113
.6295101
.6683655
------------+------------------------------------------------------------X2 =
7785.56
PrX2= 0.0000
z =
3.89
Prz = 0.0000
z =
3.50
Prz = 0.0002
z =
2.28
Prz = 0.0114
. abic
AIC Statistic
4.274521
AIC*n
BIC Statistic
4.277939
BIC(Stata) = 16603.328
= 16559.494
Predicted counts and zeros are obtained by using the gpois_p and predict commands.
. gpois_p prgc, n
. predict prg0, pr
. sum docvis pr*
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------docvis |
prcg |
prc0 |
3874
3874
3874
3.162881
3.930725
.2074883
6.275955
1.020594
.0466571
0
2.575682
.1239893
121
6.214907
.2932782
216
negative values favoring the GP model. In both cases, there is strong evidence
for preferring the ZIGP model. It is obvious that excessive zero counts need
to be adjusted. The AIC and BIC statistics have been reduced to 4.275 and
4.278, respectively, demonstrating that the ZIGP model is preferable to a
ZPNB model for these data.
SUMMARY
The generalized Poisson model can be used to adjust for overdispersion in an
otherwise Poisson model, in a similar manner to negative binomial regression.
It can also be used to model otherwise underdispersed Poisson data. Hurdle
models and 3-parameter count models can also be used for undispersed
count data, but a generalized Poisson model can be interpreted more easily
than these other methods. When the dispersion parameter, delta, is positive,
the model has adjusted for overdispersed data; when it is negative, it has
adjusted for underdispersed data. The extent of over or under dispersion can
be roughly evaluated by the magnitude of the value of delta.
Underdispersion, in the sense of Poisson underdispersion, is not commonly found in real data situations. A common, ad hoc manner of dealing
with underdispersion has been to scale the standard errors of the underdispersed Poisson model. This produces standard errors that would be the case
if the data were not underdispersed. This method, however, does not affect
coefficients, only standard errors. It is preferable to assume that the underdispersed data has been generated, or can be best described, by the generalized
Poisson PDF, which specifically allows for Poisson underdispersed data. More
complex models can also be used in such situations, but typically the generalized Poisson model is sufficient to use when faced with underdispersed
data.
CHAPTER 9
In this final chapter, I briefly discuss models that have been developed for
data that are generally more complex than what we have thus far observed.
The sections are meant to introduce you to these methods if you have not
already read about them or used them.
The following list shows the types of data situations with which you may be
confronted, together with a type of model that can be used in such a situation
(given in parentheses).
217
218
models)
A very brief overview of the Bayesian modeling of count data will be presented
in Section 9.8.
219
11
23
18
17.46
36.51
28.57
17.46
53.97
82.54
4 |
5 |
6 |
4
4
3
6.35
6.35
4.76
88.89
95.24
100.00
------------+----------------------------------Total |
63
100.00
34
26 |
60
CABG |
3
0 |
3
-----------+----------------------+---------Total |
37
26 |
63
1=CABG; 0=PTCA
PTCA
CABG |
Total
-----------+----------------------+---------1 |
11
0 |
11
2 |
23
0 |
23
220
3 |
4 |
5 |
17
4
3
1 |
0 |
1 |
18
4
4
6 |
2
1 |
3
-----------+----------------------+---------Total |
60
3 |
63
There are few observations for CABG patients, but it does appear that
PTCA patients stay in the hospital postsurgery for less time on average than
CABG patients:
. sort procedure
. by procedure: sum los
PTCA
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------los |
60
2.516667
1.214205
CABG
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------los |
3
4.666667
1.527525
3
6
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------los |
63
2.619048
1.300478
1
6
Since los is a count, the data are modeled using Poisson regression. The
model is parameterized so that incidence rate ratios are displayed in place of
coefficients:
R
==============================================================
library(COUNT); library(COUNT)
data(azcabgptca); attach(azcabgpca)
table(los); table(procedure, type); table(los, procedure)
summary(los)
summary(c91a - glm(los ~ procedure+ type, family=poisson, data=azcabgptca))
modelfit(c91a)
summary(c91b - glm(los ~ procedure+ type, family=quasipoisson, data=azcabgptca))
221
modelfit(c91b)
library(sandwich); sqrt(diag(vcovHC(c91a, type="HC0")))
==============================================================
. glm los procedure type, fam(poi) eform nolog
Generalized linear models
No. of obs
63
Optimization
Residual df
60
: ML
Scale parameter =
Deviance
30.23003933
(1/df) Deviance =
.503834
Pearson
29.12460512
(1/df) Pearson
.4854101
[Poisson]
Link function
: g(u) = ln(u)
Log likelihood
= -102.0780388
[Log]
AIC
3.335811
BIC
-218.358
--------------------------------------------------------------------------------|
OIM
los |
IRR
Std. Err.
P|z|
-------------+------------------------------------------------------------------procedure |
2.144144
.6249071
2.62
0.009
1.21108
type |
1.360707
.2215092
1.89
0.058
.989003
3.796078
1.872111
_cons |
2.176471
.2530096
6.69
0.000
1.733016
2.733399
--------------------------------------------------------------------------------. abic
AIC Statistic
3.335811
AIC*n
BIC Statistic
3.345202
BIC(Stata) = 216.58548
= 210.15608
222
----------------------------------------------------------------------------|
OIM
los |
IRR
Std. Err.
P|z|
-----------+----------------------------------------------------------------procedure |
2.144144
.4353814
3.76
0.000
1.440165
3.19224
type |
1.360707
.1543285
2.72
0.007
1.08949
1.699441
_cons |
2.176471
.1762753
9.60
0.000
1.857004
2.550896
The predictor statistics are nearly the same as for the scaled model. This
is a common result. Most statisticians tend to prefer using robust adjustment
rather than scaling, but the results are typically the same.
We should check the zero-truncated Poisson model results since the data
are such that zero counts are excluded and the mean of the response variable is
low. The ZTP model may fit the data better than the traditional Poisson model:
R
===========================================================
library(gamlss.tr)
gen.trun(0,"POI", type="left", name="leftr")
summary(c91c - gamlss(los~ procedure+type, data=az
ptca, family=POleftr))
===========================================================
. ztp los procedure type ,
----------------------------------------------------------------------------|
Robust
los |
IRR
Std. Err.
P|z|
------------+---------------------------------------------------------------procedure |
2.530652
.4607842
5.10
0.000
1.771106
3.615932
type |
1.521065
.2296154
2.78
0.005
1.131496
2.044761
_cons |
1.825901
.1480402
7.43
0.000
1.557628
2.14038
----------------------------------------------------------------------------. abic
AIC Statistic
3.131446
AIC*n
BIC Statistic
3.140837
BIC(Stata) = 203.71049
= 197.28108
223
The bottom _hatsq coefficient approximates zero; the p-value is 1. The ZIP
model appears to be misspecified. This likely results from the fact that the
model excludes zero counts yet is rather severely underdispersed.
Using an exact statistics procedure on these data produces the following
coefficients and p-values. Remember that the p-values and confidence intervals are not based on asymptotic statistics. They are of the same nature as the
exact Fisher test when evaluating tables. We see that procedure is significant
at the 95% level but type is not. The standard Poisson model has p-values
of 0.09 and 0.058 for procedure and type, respectively. When scaled and
when a robust sandwich estimator is applied the p-values are 0.008 and
0.005, respectively. We usually think when there is extradispersion in the
data that scaling (quasipoisson) or applying a robust estimator produces
more accurate standard errors, and hence p-values/confidence intervals. But
here it is clear that they do not. The unadjusted Poisson standard errors
and p-values are closer to the exact values than are the scaled and sandwich
results. The difference is important since the exact values inform us that type
is not a significant predictor:
. expoisson los procedure type, irr
Exact Poisson regression
Number of obs =
63
-----------------------------------------------------------------------los |
IRR
Suff. 2*Pr(Suff.)
[95% Conf. Interval]
----------+------------------------------------------------------------procedure |
2.144144
14
0.0224
1.118098
3.828294
type |
1.360707
77
0.0701
.9760406
1.898079
------------------------------------------------------------------------
224
Parameter
Estimates
Point Estimate
95
%CI
2*1-sided
Model Term
Type
Beta
SE(Beta)
Type
Lower
Upper
p-Value
PROCEDURE
TYPE
CMLE
CMLE
0.7627
0.308
0.2914
0.1628
Exact
Exact
0.1116
0.02425
1.342
0.6408
0.02237 ==
0.07008 ==
I have run a LogXact model (see Table 9.1) on the same data. Note that
the conditional maximum likelihood (exact) estimates are the same for both
procedure and type. Exponentiate the LogXact coefficients and confidence
intervals to obtain the same values. For example, exp(.7627) = 2.144 and
exp(.308) = 1.3607.
At the time of this writing, no suitable R code exists for exact Poisson
regression. Exact Poisson regression is recommended for all models that are
unbalanced as well as for data with few observations. Memory is a problem
with larger models, but exact statistics software now provides alternative
models that generally approximate exact values; these include median unbiased estimates (Stata and SAS) and Monte Carlo estimation (LogXact). As
computers become faster, exact methods will likely find more widespread
use.
225
to how response values beyond user-defined cut points are handled. Truncated models eliminate the values altogether; censored models revalue them
to the value of the cut point. In both cases, the probability function and
log-likelihood functions must be adjusted to account for the change in the
distribution of the response. We begin by considering truncation.
Prob (Y = y) =
y = 0, 1, . . .
Prob (Y = (y = 0, 1)) =
y = 2, 3, . . .
(9.1)
226
The same logic applies to each greater integer or count from the left. Left
refers to the number line beginning with 0 and moving one integer to the
right with each higher count:
------------------------------------------------- --------------
...
n-1
Prob (Y = y|Y C ) =
for y = 0, 1, . . . , C 1
(9.3)
An example will make it clear how to actually use the model. The same
German health reform data set, rwm1984, is used for our example data.
Employing the user-created Stata treg command (Hardin and Hilbe 2014b),
I first display the results of estimating a left-truncated at 3 Poisson model (see
R code in Table 9.2). With a left cut point at 3, counts in the resultant model
begin at 4, or C + 1 (see distribution in header). Limdep and the gamlss
package in R are the only other software programs I know of that provide the
capability of modeling truncated count data, although they are both limited to
the Poisson and negative binomial distributions. Counts 3 and above are being
modeled. Note that treg allows the user to model left-, right-, and intervaltruncated Poisson, negative binomial, generalized Poisson, PIG, NB-P, and
three-parameter generalized Famoye negative binomial regressions. The latter
four models are new additions with the treg command. The truncated PIG
227
# alternative method
Number of obs
1022
LR chi2(2)
69.68
Log pseudolikelihood =
Prob chi2
0.0000
-4802.67
-----------------------------------------------------------------------------|
docvis |
Robust
Coef.
Std. Err.
P|z|
-------------+---------------------------------------------------------------outwork |
.0735456
.0733219
1.00
0.316
-.0701626
.2172539
age |
.006175
.0029573
2.09
0.037
.0003789
.0119712
_cons |
1.923062
.129157
14.89
0.000
1.669919
2.176205
------------------------------------------------------------------------------
For a truncation from the right side (i.e., at the higher end of the distribution), I display an example of a right-truncated at 10 PIG. With right
truncation at 10, only counts ranging from 0 to 9 are included in the model.
Note the distribution shown in the header information:
. treg docvis outwork age if docvis10, dist(pig)
rtrunc(10)
nolog vce(robust)
Number of obs
3566
LR chi2(2)
176.44
Prob chi2
0.0000
228
------------------------------------------------------------------------------|
docvis |
Robust
Coef.
Std. Err.
P|z|
-------------+----------------------------------------------------------------outwork |
.415145
.0506344
8.20
0.000
.3159033
age |
.018404
.0022195
8.29
0.000
.0140538
.5143867
.0227542
_cons |
-.3912701
.1007169
-3.88
0.000
-.5886716
-.1938686
-------------+----------------------------------------------------------------/lnalpha |
.5445644
.052255
.4421464
.6469824
-------------+----------------------------------------------------------------alpha |
1.723857
.0900802
1.556044
1.909769
-------------------------------------------------------------------------------
Finally, suppose we wish to model visits to the doctor, but only for patients
who have in fact visited a physician during the calendar year 1984. Moreover, for our example, suppose also that visits were not recorded for more
than 18 visits. We then model the data with a left truncation at 0 and right
truncation at 19. This is called interval truncation. This time a truncated NB-P
model is used to model the interval data, parameterized so that estimates are
in terms of incidence rate ratios. This can be achieved with the following
Stata code, but it is not yet possible using R or SAS:
. treg docvis outwork age if docvis0 & docvis19, dist(nbp) ltrunc(0)
rtrunc(19) vce(robust) eform
Truncated neg. bin(P) regression
Number of obs
2172
LR chi2(2)
Prob chi2
=
=
52.70
0.0000
229
-----------------------------------------------------------------------|
Robust
docvis |
exp(b)
Std. Err.
z
P|z|
[95% Conf. Interval]
---------+-------------------------------------------------------------outwork | 1.330828
.0819098
4.64
0.000
1.179593
1.501453
age | 1.010227
.0027296
3.77
0.000
1.004891
1.015591
_cons | 1.840426
.2360925
4.76
0.000
1.431281
2.366528
---------+-------------------------------------------------------------/P | .539376
.2738796
1.97
0.049
.0025817
1.07617
/lnalpha | 1.593174
.3356481
.9353162
2.251033
---------+-------------------------------------------------------------alpha | 4.91934
1.651167
2.548019
9.497538
------------------------------------------------------------------------
230
Number of obs
3874
Wald chi2(2)
868.53
Prob chi2
0.0000
------------------------------------------------------------------------docvis |
Coef.
Std. Err.
P|z|
--------+---------------------------------------------------------------outwork |
.2824314
.0176317
16.02
0.000
.247874
.3169888
age |
.0153682
.0007858
19.56
0.000
.0138281
.0169084
_cons |
.6101677
.0367009
16.63
0.000
.5382354
.6821001
------------------------------------------------------------------------AIC Statistic =
5.009
Right censoring is a more common application when using censored Poisson or negative binomial regressions, whereas left truncation is more commonly used with truncation models. cpoissone is used here to model the
right-censored at 10 model:
. replace cenvar=1
. replace cenvar=-1 if docvis=10
. cpoissone docvis outwork age, censor(cenvar) cright(10) nolog
Censored Poisson Regression
Log likelihood = -10463.973
Number of obs
3874
Wald chi2(2)
864.81
Prob chi2
0.0000
-------------------------------------------------------------------------docvis |
Coef.
Std. Err.
P|z|
--------+----------------------------------------------------------------outwork |
.3530813
.0213676
16.52
0.000
.3112017
age |
.0178009
.0009444
18.85
0.000
.0159499
.3949609
.0196519
_cons |
-.045936
.04368
-1.05
0.293
-.1315473
.0396752
------------------------------------------------------------------------AIC Statistic =
5.404
231
232
Number of obs
3874
Wald chi2(2)
393.14
Prob chi2
0.0000
-------------------------------------------------------------------------docvis |
Coef.
Std. Err.
P|z|
---------+---------------------------------------------------------------outwork |
.3250068
.0287805
11.29
0.000
.2685981
age |
.0164991
.0012707
12.98
0.000
.0140087
.3814156
.0189896
_cons |
-.4033441
.0580651
-6.95
0.000
-.5171495
-.2895387
-------------------------------------------------------------------------AIC Statistic =
2.886
Freq.
Percent
Cum.
---------+-------------------------------------0 |
2,499
64.51
64.51
1 |
1,375
35.49
100.00
---------+-------------------------------------Total |
3,874
100.00
OIM
Coef.
Std. Err.
P|z|
---------+---------------------------------------------------------------outwork |
.5701073
.0717644
7.94
0.000
.4294516
.710763
age |
.0287807
.0031698
9.08
0.000
.022568
.0349933
_cons |
-2.10176
.1438936
-14.61
0.000
-2.383786
-1.819733
--------------------------------------------------------------------------
233
As an example, we'll start with an NB2-NB2 mixture. This means that both components in the data are distributed as NB2, but with differing parameters. We will use the fishing data (adapted from Zuur, Hilbe, and Ieno 2013) to determine whether the data appear to be generated from more than one generating mechanism. The data are adapted from Bailey et al. (2008), who were interested in how certain deep-sea fish populations were impacted when commercial fishing began in locations with deeper water than in previous years. There are 147 researched sites. The model is of the total number of fish counted per site (totabund) on the mean water depth per site (meandepth), adjusted by the area swept at each site (sweptarea), the log of which serves as the model offset.
. use fishing
. fmm totabund meandepth, exposure(sweptarea) components(2) mixtureof(negbin2)
2 component Negative Binomial-2 regression      Number of obs   =        147
                                                Wald chi2(2)    =     167.04
Log likelihood = -878.91748                     Prob > chi2     =     0.0000
------------------------------------------------------------------------------
    totabund |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
component1   |
   meandepth |  -.0008543   .0001498    -5.70   0.000    -.0011479   -.0005606
       _cons |  -4.144922   .5319592    -7.79   0.000    -5.187543   -3.102301
ln(swepta~a) |  (exposure)
-------------+----------------------------------------------------------------
component2   |
   meandepth |  -.0011734   .0000919   -12.77   0.000    -.0013535   -.0009932
       _cons |  -2.816184   .3340012    -8.43   0.000    -3.470815   -2.161554
ln(swepta~a) |  (exposure)
-------------+----------------------------------------------------------------
 /imlogitpi1 |     .06434   1.077807     0.06   0.952    -2.048124    2.176804
   /lnalpha1 |  -.4401984   .2589449    -1.70   0.089    -.9477211    .0673244
   /lnalpha2 |  -1.488282   .4407883    -3.38   0.001    -2.352212    -.624353
-------------+----------------------------------------------------------------
      alpha1 |   .6439087   .1667369                      .3876234    1.069642
      alpha2 |   .2257601   .0995124                      .0951585    .5356078
         pi1 |   .5160794   .2691732                      .1142421     .898147
         pi2 |   .4839206   .2691732                       .101853    .8857579
------------------------------------------------------------------------------
    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+-----------------------------------------------------------
 component 1 |       147    150.9075    94.06647   19.26319   415.0464
 component 2 |       147    335.3274    269.7885   15.90274   1128.203
The second component has a mean value of fish per site that is some two-and-a-quarter times that of the first component (335/151), even though it accounts for only 48% of the overall population. The overall mean fish per site can be calculated as
. di .5160794 * 150.9075 + .4839206 * 335.3274
240.15209
If the data are partitioned into three components, the probability values
for being in each component are
         |      Coef.   Std. Err.                   [95% Conf. Interval]
---------+----------------------------------------------------------------
     pi1 |   .2230491   .2346794                     .0197984    .8031637
     pi2 |   .7169981   .2227915                     .2275204    .9561276
     pi3 |   .0599529   .0351927                    -.0090236    .1289294
This time, the second component has 72% of the data, the first has 22%, and the third 6%. The third component, however, is not significant (confidence interval includes 0). Greater partitions also fail to show more than two significant components. Interestingly, the model used is pooled over pre-commercial fishing (1977–1989) and commercial fishing (2000–2002) years, although there is some overlap in period and meandepth:
. tab period

 0=1977-89;1 |
      =2000+ |      Freq.     Percent        Cum.
-------------+-----------------------------------
   1977-1989 |         97       65.99       65.99
   2000-2002 |         50       34.01      100.00
-------------+-----------------------------------
       Total |        147      100.00
Modeling the data using negative binomial regression but including the binary predictor, period, produces a well-fitted model. Using a robust variance estimator yields standard errors that are nearly identical to those of the following standard model:
. nbreg totabund meandepth period, exposure(sweptarea) nolog
-------------------------------------------------------------------------------
     totabund |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
    meandepth |  -.0010168   .0000505   -20.14   0.000    -.0011158   -.0009179
       period |  -.4269806   .1260863    -3.39   0.001    -.6741052   -.1798559
        _cons |  -3.307003   .1395975   -23.69   0.000    -3.580609   -3.033397
 ln(swepta~a) |  (exposure)
--------------+----------------------------------------------------------------
     /lnalpha |  -.6697869   .1118123                     -.8889351   -.4506388
--------------+----------------------------------------------------------------
        alpha |   .5118176   .0572275                      .4110933     .637221
-------------------------------------------------------------------------------
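A comparable negative binomial (NB2) fit can also be run in R. The following is only a sketch (it assumes the fishing data frame is loaded in R with the same variable names used above) and should reproduce the nbreg results closely:

library(MASS)

# NB2 model of total fish counted per site, with log swept area as the offset
fish.nb2 <- glm.nb(totabund ~ meandepth + period + offset(log(sweptarea)),
                   data = fishing)
summary(fish.nb2)
exp(coef(fish.nb2))    # incidence rate ratios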
Finite mixture models may have multiple components and be structured such that the response is composed of more than one distribution.
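For R users, a finite mixture along the lines of the fmm fit above can be sketched with the flexmix package. The sketch below uses two Poisson components rather than NB2 components, since base flexmix supplies a Poisson GLM driver; it is an illustration rather than the book's own code:

library(flexmix)

# Two-component Poisson mixture for totabund, offset by log swept area
set.seed(123)
fish.mix <- flexmix(totabund ~ meandepth + offset(log(sweptarea)),
                    data = fishing, k = 2,
                    model = FLXMRglm(family = "poisson"))
summary(fish.mix)        # component sizes (prior weights) and fit statistics
parameters(fish.mix)     # coefficients for each component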
by a sum of smoothed functions, s(X). The method is used to discover nonlinear covariate effects that may not be detectable using traditional statistical
techniques.
GAM methodology was originally developed by Stone (1985) but was later
popularized by Hastie and Tibshirani (1986, 1990). The authors even developed a software package called GAIM, for generalized additive interactive
modeling, which stimulated its inclusion in several of the major commercial
packages.
The key concept in GAM modeling is that the partial residuals of continuous predictors in a model are smoothed using a cubic spline, loess smoother,
or another type of smoother while being adjusted by the other predictors
in the model. The parameters of the smooths are related to the bandwidth
that was used for the particular smooth. The relationship that is traditionally
given for the GAM distribution is
y = \beta_0 + \sum_{j} f_j(X_j) + \varepsilon                    (9.4)
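A model such as the pgam object summarized below can be fit with R's mgcv package. The following is a sketch only (it assumes the rwm1984 data are loaded with the variable names shown in the output; the book's own code may differ slightly):

library(mgcv)

# Poisson GAM: parametric terms plus a penalized regression spline in age
pgam <- gam(docvis ~ outwork + female + married + edlevel2 + edlevel3 +
              edlevel4 + s(age),
            family = poisson, data = rwm1984)
summary(pgam)
plot(pgam)    # displays the fitted smooth s(age)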
summary(pgam)

Parametric coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   0.97150     0.02398   40.512   < 2e-16 ***
outwork       0.29554     0.02216   13.338   < 2e-16 ***
female        0.25287     0.02137   11.831   < 2e-16 ***
married      -0.12594     0.02246   -5.607  2.06e-08 ***
edlevel2     -0.06771     0.04232   -1.600      0.11
edlevel3     -0.19530     0.03997   -4.886  1.03e-06 ***
edlevel4     -0.24543     0.04813   -5.099  3.41e-07 ***
---
R-sq.(adj) = 0.0388                 n = 3874
plot(pgam)

[Figure: the fitted smooth s(age, 7.19) plotted against age (ticks at 30-60), with the vertical axis running from roughly -0.4 to 0.4.]
If the coefficients and standard errors are the same when modeling with
fewer iterations (e.g., 100), then an analyst can be more confident that the
model is working correctly.
Marginal effects may be calculated using the command qcount_mfx. The
model should be compared with a negative binomial model on the same data.
Most models give results similar to those of negative binomial and PIG. R
does not support the quantile count model at this time.
Other structures also exist (stationary, nonstationary, Markov, family) but are rarely used in research. Stata's xtgee command is a full GEE package. R has several GEE functions: geepack, glmgee, and yags. The first two are used most often for GEE modeling among R users. The Genmod procedure in SAS is a GLM procedure but has an option for modeling GEEs. I can recommend Hardin and Hilbe (2013b), the standard text on the subject. It has both Stata and R examples throughout the book, as well as supplementary SAS code.
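As a sketch of the geepack route (an illustration only; the data, variable, and cluster names follow the medpar example developed next):

library(geepack)

# Poisson GEE with an exchangeable working correlation, clustered on hospital
gee.pa <- geeglm(los ~ hmo + white + age80 + type2 + type3,
                 id = provnum, data = medpar,
                 family = poisson, corstr = "exchangeable")
summary(gee.pa)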
library(gee)    # medpar assumed already loaded, e.g., from the COUNT package
summary(gee(los ~ hmo + white + age80 + type2 + type3, data=medpar,
    id=medpar$provnum, corstr="exchangeable", family=poisson))
To give a brief example, we'll use the U.S. Medicare data called medpar.
Length of stay is the response, with hmo, white, age80, and types of admission
as predictors. Stata requires that the cluster variable be numeric. The encode
command converts a string to numeric. An exchangeable correlation structure
was selected since we do not know anything about the data other than that
they are overdispersed and come in panels (i.e., patients within hospitals).
The idea is that patients are likely to be treated more alike within hospitals
than between hospitals, adding overdispersion to the data. A Poisson GEE
model is used for the example. (See Table 9.8.)
. use medpar
. encode provnum, gen(hospital)
. xtgee los hmo white age80 type2 type3, i(hospital) c(exch) vce(robust)
fam(poi) eform
GEE population-averaged model                   Number of obs      =      1495
                                                Number of groups   =        54
Link:             log                           Obs per group: avg =      27.7
Family:           Poisson                                      max =        92
Correlation:      exchangeable                  Wald chi2(5)       =     27.51
Scale parameter:  1                             Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |               Robust
         los |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         hmo |  -.0876095   .0584452    -1.50   0.134      -.20216     .026941
       white |  -.1078697   .0732792    -1.47   0.141    -.2514942    .0357549
       age80 |  -.0583864   .0611035    -0.96   0.339    -.1781471    .0613743
       type2 |   .2295517   .0602295     3.81   0.000      .111504    .3475993
       type3 |   .5402621   .1974048     2.74   0.006     .1533559    .9271684
       _cons |   2.310611     .074003   31.22   0.000     2.165568    2.455654
------------------------------------------------------------------------------
Coefficients:
               Estimate  Naive S.E.     Naive z  Robust S.E.   Robust z
(Intercept)              0.07328592  31.5298719
hmo                      0.05788583  -1.5128541
white                    0.07260164  -1.4878908
age80                    0.06053692  -0.9643112
type2        0.22951223  0.05663849    4.052231   0.05967494  3.8460402
type3        0.54096501  0.08222260    6.579274   0.19567264  2.7646431

Estimated Scale Parameter:  6.504578

Exponentiated coefficients (incidence rate ratios):
(Intercept)        hmo      white      age80      type2      type3
 10.0814351  0.9161522  0.8976067  0.9432948  1.2579863  1.7176636
                                                Number of obs      =      1495
                                                Number of groups   =        54
                                                Obs per group: avg =      27.7
                                                               max =        92
Integration points =                            Wald chi2(4)       =    109.81
                                                Prob > chi2        =    0.0000
------------------------------------------------------------------------------
         los |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         hmo |  -.0907326   .0259212    -3.50   0.000    -.1415373   -.0399279
       white |  -.0275537   .0302745    -0.91   0.363    -.0868906    .0317832
       type2 |   .2319144   .0247704     9.36   0.000     .1833654    .2804634
       type3 |   .1226025   .0488912     2.51   0.012     .0267775    .2184275
       _cons |   2.191951   .0649116    33.77   0.000     2.064726    2.319175
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters |   Estimate   Std. Err.     [95% Conf. Interval]
----------------------------+--------------------------------------------------
provnum: Identity           |
                  sd(_cons) |   .4116571   .0457474      .3310866    .5118346
------------------------------------------------------------------------------
                                                          chibar2(01) =
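The corresponding random-intercept Poisson model can also be fit in R. A minimal sketch with the lme4 package (assuming the medpar data frame with the variable names used above) is:

library(lme4)

# Random-intercept Poisson model: hospitals (provnum) as the grouping variable
me.poi <- glmer(los ~ hmo + white + type2 + type3 + (1 | provnum),
                data = medpar, family = poisson)
summary(me.poi)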
              Estimate  Std. Error  t value    Pr(>|t|)
(Intercept)    2.11707     0.02839   74.563   0.000e+00
hmo           -0.07501     0.02398   -3.128   1.763e-03
white         -0.03080     0.02774   -1.110   2.670e-01
type2          0.21709     0.02113   10.275   9.996e-25
type3          0.26166     0.03098    8.447   3.115e-17
z              0.50154     0.01642   30.549  7.335e-202
-------------------------------------------------------------------
No. of observations in the fit:   29900
Degrees of Freedom for the fit:   6
      Residual Deg. of Freedom:   1489
               Global Deviance:   13086.16
                           AIC:   13098.16
                           SBC:   13130.02
        year |      Freq.     Percent        Cum.
-------------+-----------------------------------
        1984 |      3,874       19.76       19.76
        1985 |      3,794       19.35       39.10
        1986 |      3,792       19.34       58.44
        1987 |      3,666       18.70       77.14
        1988 |      4,483       22.86      100.00
-------------+-----------------------------------
       Total |     19,609      100.00
. meglm docvis outwork age female married
Mixed-effects GLM                               Number of obs      =    19,609
Family:                 Poisson
Link:                       log
--------------------------------------------------------------
                |     No. of       Observations per Group
 Group Variable |     Groups    Minimum    Average    Maximum
----------------+---------------------------------------------
             id |       6127          1        3.2          5
           year |      19609          1        1.0          1
--------------------------------------------------------------
Integration method: mvaghermite                 Integration points =
                                                Wald chi2(4)       =    585.20
                                                Prob > chi2        =    0.0000
--------------------------------------------------------------------------
     docvis |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------+-------------------------------------------------------------
    outwork |   .1522959   .0318875     4.78   0.000     .0897975    .2147943
        age |    .025162   .0014824    16.97   0.000     .0222565    .0280674
     female |   .4215679     .03611    11.67   0.000     .3507936    .4923422
    married |   -.048628   .0367921    -1.32   0.186    -.1207391    .0234831
      _cons |  -1.108713   .0705329   -15.72   0.000    -1.246955   -.9704716
------------+-------------------------------------------------------------
id          |
  var(_cons)|    1.11585     .035832                     1.047785    1.188337
------------+-------------------------------------------------------------
id>year     |
  var(_cons)|   .8704313    .0202643                     .8316065    .9110686
--------------------------------------------------------------------------
LR test vs. Poisson regression:  chi2(2) = 70423.53    Prob > chi2 = 0.0000
For those who are interested in following up on this class of models, the Bibliography has articles by the researchers listed. R software is available for the listed models as well, and Stata commands exist for the top two (Harris, Hilbe, and Hardin 2013). However, to give a sense of how these types of models work, I'll outline the generalized Waring model. The model is essentially akin to generalized Poisson regression, but with an additional parameter.
The model is named after Edward Waring (1736–1798), who first proposed a distribution from which the currently parameterized distribution is derived. The distribution was advanced by Newbold (1927), Irwin (1968), and Xekalaki (1983), who developed its present form. The PDF is defined by Rodríguez-Avi et al. (2009) as
P(Y = y) = \frac{\Gamma(a+\rho)\,\Gamma(k+\rho)}{\Gamma(\rho)\,\Gamma(a+k+\rho)} \, \frac{(a)_y (k)_y}{(a+k+\rho)_y} \, \frac{1}{y!},  with a, k, \rho > 0        (9.6)

where (c)_y = \Gamma(c+y)/\Gamma(c) denotes the ascending factorial.
The three parameters are a, k, and \rho, which reflect the variability in the data. The PDF posits that the variability in the data can be partitioned into three additive components:
r randomness,
r proneness: internal differences between individuals,
r liability: the presence of outside sources of variability not included directly in the model.
Xekalaki (1983) developed the three components of variability for this distribution and was instrumental, along with Rodríguez-Avi et al. (2009), in developing an R package for estimation of a model based on the generalized Waring distribution. The model statistics (mean, variance, etc.) and examples of the use of this model can be found in Hilbe (2011). The Stata command that follows is from Harris, Hilbe, and Hardin (2013). The PDF appears different from equation (9.6) but is in fact the same. The zero-inflated version of the model is used here since the data being modeled are substantially overdispersed because of excessive zero counts. The response term, docvis, has far more zeros than would be expected given its mean value of 3.1:
. global xvar "outwork age female married"
. zinbregw docvis $xvar, vce(robust) nolog inflate($xvar)
Zero-inflated gen neg binomial-W regression     Number of obs   =       3874
Regression link:                                Nonzero obs     =       2263
                                                Zero obs        =       1611
                                                Wald chi2(4)    =      87.78
                                                Prob > chi2     =     0.0000
-----------------------------------------------------------------------------
          |               Robust
   docvis |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------+------------------------------------------------------------------
docvis    |
  outwork |   .2033453   .0696412     2.92   0.004     .0668511    .3398396
      age |   .0144888   .0025891     5.60   0.000     .0094143    .0195633
   female |   .1779968   .0632358     2.81   0.005     .0540569    .3019366
  married |    -.08596   .0719044    -1.20   0.232    -.2268901    .0549701
    _cons |   .6758925   .1368147     4.94   0.000     .4077405    .9440444
----------+------------------------------------------------------------------
inflate   |
  outwork |  -.1881244   .1430329    -1.32   0.188    -.4684638     .092215
      age |  -.0253405   .0052679    -4.81   0.000    -.0356654    -.0150156
   female |  -.5850628   .1244194    -4.70   0.000    -.8289203    -.3412052
  married |   .0060488   .1471382     0.04   0.967    -.2823369     .2944345
    _cons |   .4636771   .2425741     1.91   0.056    -.0117594     .9391136
----------+------------------------------------------------------------------
 /lnrhom2 |   .1981729   .0867388                        .028168    .3681778
     /lnk |   1.068476   .0799546                        .911768    1.225184
----------+------------------------------------------------------------------
      rho |   3.219173   .1057496                       3.028568    3.445099
        k |    2.91094   .2327431                       2.488719    3.404794
-----------------------------------------------------------------------------
z =  0.92                         Pr>z = 0.1777
and more so for the variance parameters. (R code for the generalized Waring
regression is given in Table 9.10.)
GWRM.display(war)
$Table
       covars    estimates           se         z            p
1 (Intercept) -0.24507662  0.121905501 -2.010382 4.439078e-02
2     outwork  0.26839550  0.064308985  4.173530 2.999156e-05
3
4
5

$betaII
  par       coef     value
1   k -0.5201369 0.5944391
2  ro  1.0447706 3.8427462

$Fit
  log-likelihood      AIC      BIC   df
1      -8273.765 16561.53 16605.36 3867
Bayesian estimation has become popular only in the last two decades. Actually, though, it is likely more accurate to state that it has become popular since the mid-2000s. There were certainly Bayesians before this, but the vast majority of statisticians adhered to what is called the frequentist interpretation of statistics. The statistical models we have thus far discussed in this book are based on a frequentist interpretation. Frequentists assume that there are fixed but unknown true parameters of a probability distribution that best describe the data we are modeling. The model data are regarded as a random sample from a greater population of data, which is distributed according to the parameters of a specific probability distribution. Again, the model parameters are fixed; the analyst's goal is to model the data so that we derive an estimate as close as possible to the true parameter values. Maximum likelihood estimation is the method used most often for estimating the parameters.
Bayesians, on the other hand, argue that parameters are themselves random
variables, and each has its own distribution in a model. Each predictor coefficient is a random variable and can be entirely different from other coefficients
in the model. We may initially assume that a binary variable is best described
by a Bernoulli or a more general binomial distribution, a count described by
a Poisson or negative binomial distribution, a continuous variable described
by a normal or Gaussian distribution, or a positive-only continuous variable
described by a gamma distribution. For a binary variable, for instance, we
assume that it is best described by a Bernoulli distribution, which we call a
posterior distribution. It is mathematically represented as a likelihood, not a log-likelihood.
A key feature of Bayesian statistics is that we may have information about the data that might bear on the parameter values. This is called prior information, and it is described mathematically as a distribution, a prior distribution. The posterior distribution for each predictor is updated by the respective prior distribution at each iteration in the overall estimation process. The likelihood is multiplied by the prior, resulting in an updated posterior distribution:

p(\theta \mid y) \propto L(\theta)\,\pi(\theta)                    (9.7)
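The Bayesian Poisson results shown next for the medpar data can be produced along the following lines. This is a sketch only (it assumes the MCMCpack package and its default vague priors) and is not the book's own code, whose JAGS version is available on the book's web site:

library(MCMCpack)    # provides MCMCpoisson for Bayesian Poisson regression

# Frequentist GLM for comparison
poi <- glm(los ~ hmo + white + type2 + type3, family = poisson, data = medpar)
confint.default(poi)

# Bayesian Poisson regression via random-walk Metropolis sampling
bpoi <- MCMCpoisson(los ~ hmo + white + type2 + type3, data = medpar,
                    burnin = 5000, mcmc = 50000, thin = 10)
summary(bpoi)    # posterior means, SDs, time-series SEs, and quantiles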
# GLM POISSON
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   2.33293     0.02721   85.744   < 2e-16 ***
hmo          -0.07155     0.02394   -2.988   0.00281 **
white        -0.15387     0.02741   -5.613  1.99e-08 ***
type2         0.22165     0.02105   10.529   < 2e-16 ***
type3         0.70948     0.02614   27.146   < 2e-16 ***
confint.default(poi)
              2.5 %      97.5 %
type2     0.1803908  0.26291271
type3     0.6582513  0.76070204
             Time-series SE
(Intercept)       0.0003650
hmo               0.0003381
white             0.0003764
type2             0.0002661
type3             0.0004182
          2.5%      25%      50%     75%    97.5%
type2   0.1801  0.20782  0.22194  0.2361  0.26160
type3   0.6578  0.69171  0.70929  0.7265  0.76031
There are two basic MCMC methods in common use at this time. The first, and the first to be developed for this class of models, is the Metropolis-Hastings algorithm. The second is Gibbs sampling. Both have had numerous variations developed from the originals.
The foremost software packages used for Bayesian modeling are WinBUGS/OpenBUGS, JAGS, SAS Genmod, SAS MCMC, and MLwiN. Cytel's LogXact can also perform MCMC-based modeling. Analysts often use R directly to create Bayesian models, but it is much easier to employ software that has already-developed, built-in simulation routines with error checks, graphics, and so forth. WinBUGS is popular and is the subject of numerous books, but the developers have stopped development and are putting their help and development tools into OpenBUGS. JAGS is an acronym for Just Another Gibbs Sampler. It can be run through the R environment by using the R2jags package on CRAN.
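As an illustration of the R2jags route (a sketch only, with vague normal priors; the object and node names here are hypothetical, and the JAGS code the book supplies for the medpar model is on the book's web site):

library(R2jags)

# JAGS model for a Bayesian Poisson regression, written as an R function
pois.model <- function() {
  for (i in 1:N) {
    y[i] ~ dpois(mu[i])
    log(mu[i]) <- b0 + b1*hmo[i] + b2*white[i] + b3*type2[i] + b4*type3[i]
  }
  b0 ~ dnorm(0, 0.0001)    # vague normal priors on the regression coefficients
  b1 ~ dnorm(0, 0.0001)
  b2 ~ dnorm(0, 0.0001)
  b3 ~ dnorm(0, 0.0001)
  b4 ~ dnorm(0, 0.0001)
}

jags.data <- with(medpar, list(y = los, hmo = hmo, white = white,
                               type2 = type2, type3 = type3, N = nrow(medpar)))
jfit <- jags(data = jags.data,
             parameters.to.save = c("b0", "b1", "b2", "b3", "b4"),
             model.file = pois.model, n.chains = 3,
             n.iter = 12000, n.burnin = 2000)
print(jfit)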
SAS is also an excellent tool for Bayesian modeling. I used SAS's Genmod for displaying examples of Bayesian negative binomial models in Hilbe (2011). For those interested in Bayesian modeling and its relationship to frequency-based modeling, I recommend Zuur, Hilbe, and Ieno (2013). The book compares fully worked-out examples using both traditional frequency-based modeling and Bayesian modeling of both generalized linear models (GLMs) and generalized linear mixed models (GLMMs), including normal, logistic, beta-binomial, Poisson, negative binomial, and gamma regression. R and JAGS code are used throughout. I also recommend Hilbe and Robinson (2013), in which R code for writing a Metropolis-Hastings algorithm is provided in annotated form. JAGS code for the medpar model in this section is provided on the book's web site.
9.9 SUMMARY
This chapter has covered various more advanced count models used by analysts. However, I have only provided a brief overview of the subject area. I have placed additional material for many of these models on the book's web site in an electronic book titled Extensions to Modeling Count Data. I intend to provide timely updates to the e-book at http://works.bepress.com/joseph hilbe/.
I began the chapter discussing exact Poisson modeling, which is recommended when a dataset is small and unbalanced; e.g., a highly unequal ratio of 1s and 0s in binary variables, a highly unequal number of observations per level in categorical predictors, or highly skewed, bifurcated, or multimodal data for continuous predictors. The p-values are exact, not asymptotic.
Truncated and censored count models are important models to have in one's tool chest. Neither has been available in major commercial statistical packages. Limdep was the only software offering Poisson and negative binomial truncated and censored models until they were added to the R gamlss package (Stasinopoulos, Rigby, and Akantziliotou 2008) by Bob Rigby for Hilbe (2011). Stata's new treg command (Hardin and Hilbe 2014b) provides truncated models for Poisson, NB2, PIG, generalized Poisson, NB-P, and NB-F. Censored models for these distributions will be available later in 2014.
We also discussed finite mixture models, where we find that a count variable can be composed of counts from different data-generating mechanisms.
I believe that this model will have increased use in the future as well.
A brief look at the nonparametric modeling of count data was given in Sections 9.4 and 9.5. Space prohibited a longer exposition. See Hilbe (2011) for
more details, as well as extensions to MCD. I highly recommend Zuur (2012)
for dealing with GAM analysis of count data. This chapter also examined
longitudinal count models, three-parameter models, and the newer Bayesian
modeling of count data that is only now becoming popular. I hope that the
overview of count models provided in this book encourages you to read more
about this class of models and to incorporate them into your research.
APPENDIX
SAS Code
Additional SAS code for models, graphs, and tables is located at http://works.bepress.com/joseph hilbe/.
My appreciation goes out to Sachin Sobale, team leader for statistical and
SAS programming, and Ashik Chowdhury, senior biostatistician, both at Cytel
Corporation, India, for their good help in preparing the majority of the
SAS code in this appendix. My appreciation is also given to Yang Liu, Sam
Houston State University, for providing the SAS code to generate observed
versus predicted Poisson and NB2 tables and figures (final Appendix section
code). I thank Valerie Troiano and Kuber Dekar of the Institute for Statistics Education (Statistics.com) for arranging Cytel's assistance in this project.
POISSON
/*To call the data from .csv (excel) file*/
proc import OUT= WORK.rwm5yr
DATAFILE= "D:\sas code from R and stata\rwm5yr.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
run;
/*Select data for year = 1984*/
data rwm1984;
set rwm5yr;
where year = 1984;
run;
255
/*Create the fasttrakg grouped data set (used on page 53)*/
data fasttrakg;
input die cases anterior hcabg killip kk1 kk2 kk3 kk4;
datalines;
5 19 0 0 4 0 0 0 1
10 83 0 0 3 0 0 1 0
15 412 0 0 2 0 1 0 0
28 1864 0 0 1 1 0 0 0
1 1 0 1 4 0 0 0 1
0 3 0 1 3 0 0 1 0
1 18 0 1 2 0 1 0 0
2 70 0 1 1 1 0 0 0
10 28 1 0 4 0 0 0 1
9 139 1 0 3 0 0 1 0
39 443 1 0 2 0 1 0 0
50 1374 1 0 1 1 0 0 0
1 6 1 1 3 0 0 1 0
3 16 1 1 2 0 1 0 0
2 27 1 1 1 1 0 0 0
;
run;
/*data shown on page 53 */
proc print data= fasttrakg;
run;
/*Create offset variable*/
data fasttrakg;
set fasttrakg;
lncase = log(cases);
run;
/*Poisson model of die on the predictors, with log(cases) as offset;
  the parameter estimates are saved for post-processing below*/
proc genmod data = fasttrakg;
model die = anterior hcabg kk2 kk3 kk4 /
dist = poisson
link = log
offset = lncase;
ods output ParameterEstimates = estimate;
run;
/*To get the actual result from the preceding model by
exponentiating the estimates*/
data fast;
length parameter $15 IRR 8 StdErr 8 lowerci 8 upperci 8
ProbChiSq 8;
set estimate;
where parameter ne "Scale";
IRR = exp(estimate);
lowerci = exp(LowerWaldCL);
upperci = exp(UpperWaldCL);
StdErr=sqrt(exp(estimate)**2*exp(StdErr**2)*(exp(StdErr**2) - 1));
keep parameter IRR StdErr lowerci upperci ProbChiSq;
run;
/*To get the output similar to STATA on page 53 */
proc print data = fast;
run;
/*Take the average of age and outwork*/
proc means data = rwm1984 noprint;
var outwork age;
output out = summ(drop = _type_ _freq_) mean = outwork age;
run;
data full_;
set summ_; set full (where = (parameter = "age"));
dfdxb_avg = docvis*estimate;
run;
/* To get the output similar to STATA on page 57 */
/*Average marginal effects*/
proc print data = full_;
var dfdxb_avg;
run;
ZERO-INFLATED POISSON
/*To get the output of STATA on page 145 */
/*To get the IRR and corresponding CIs exponentiate the estimate and
CIs derived from the following model*/
proc genmod data = rwm1984;
model docvis = outwork age /dist=zip;
zeromodel outwork age /link = logit ; /*zeromodel option is
available in SAS 9.2 or later version*/
run;
ZERO-TRUNCATED POISSON
/*Zero truncated poisson output of STATA on page 129 */
proc nlmixed data=Medpar;
xb = intercept + hmo*HMO_readmit_+ white*__White
+Type2*Urgent_Admit
+Type3*Emergency_Admit;
ll = Length_of_Stay*xb - exp(xb) - lgamma(Length_of_Stay + 1)
- log(1-exp(-exp(xb)));
model Length_of_Stay ~ general(ll);
run;
do i= 0 to 20;
if estimate=. then py=.;
else py=pdf('poisson',i,exp(estimate)); output; end;
keep i py;
run;
/* Obtain the average of the above probabilities */
proc means data=pred;
class i;
output out=means mean(py)=;
run;
/* Obtain the expected (predicted) counts */
data expect; set means;
drop _type_ _freq_;
if i=. then delete;
expect=py*&number;
run;
/* Join data sets and create table */
proc sql;
create table exptobs as
select i as count,observe,expect,observe-expect as diff
from expect as l left join
obs as r
on l.i=r.docvis;
run;
/* Get proportion */
data prop; set exptobs;
obsprop=observe/&number*100;
preprop=expect/&number*100;
diffprop=diff/&number*100;
drop observe expect diff;
run;
/*Calculate the difference between observed and expected proportion
and then divide the result by expected proportion (def. Chi
square = sum((square(o-e))/e)*/
data count_; set prop;
diff_sq = diffprop*diffprop;
if preprop ne 0 then
div = diff_sq/preprop;
run;
Bibliography
Desmarais, B. A., and J. J. Harden. 2013. Testing for Zero Inflation in Count Models: Bias Correction for the Vuong Test. Stata Journal 13 (no. 4): 810–835.
Dohoo, I., W. Martin, and H. Stryhn. 2012. Methods in Epidemiologic Research. Charlottetown: VER Publishing.
Efron, B. 1986. Double Exponential Families and Their Use in Generalized Linear Regression. Journal of the American Statistical Association 81: 709–721.
Ellis, P. D. 2010. The Essential Guide to Effect Sizes. Cambridge: Cambridge University Press.
Faddy, M., and D. Smith. 2012. Analysis of Count Data with Covariate Dependence in Both Mean and Variance. Journal of Applied Statistics 38: 2683–2694.
Famoye, F. 1993. Restricted Generalized Poisson Regression Model. Communications in Statistics, Theory and Methods 22: 1335–1354.
Famoye, F., and K. Singh. 2006. Zero-Truncated Generalized Poisson Regression Model with an Application to Domestic Violence. Journal of Data Science 4: 117–130.
Farbmacher, H. 2011. Estimation of Hurdle Models for Overdispersed Count Data. Stata Journal 11 (no. 1): 82–94.
Farbmacher, H. 2013. Extensions of Hurdle Models for Overdispersed Count Data. Health Economics 22 (no. 11): 1398–1404.
Faraway, J. J. 2006. Extending the Linear Model with R. Boca Raton, FL: Chapman & Hall/CRC.
Flaherty, S., G. Patenaude, A. Close, and P. W. W. Lutz. 2012. The Impact of Forest Stand Structure on Red Squirrel Habitat Use. Forestry 85: 437–444.
Geedipally, S. R., D. Lord, and S. S. Dhavala. 2013. A Caution about Using Deviance Information Criterion While Modeling Traffic Crashes. Unpublished manuscript.
Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.
Goldberger, A. S. 1983. Abnormal Selection Bias, in Studies in Econometrics, Time Series, and Multivariate Statistics, ed. S. Karlin, T. Amemiya, and L. A. Goodman, pp. 67–85. New York: Academic Press.
Greene, W. H. 2003. Econometric Analysis, fifth edition. New York: Macmillan.
Greene, W. H. 2006. LIMDEP Econometric Modeling Guide, Version 9. Plainview, NY: Econometric Software Inc.
Greene, W. H. 2008. Functional Forms for the Negative Binomial Model for Count Data. Economics Letters 99 (no. 3): 585–590.
Hamilton, L. C. 2013. Statistics with Stata, Version 12. Boston: Brooks/Cole.
Hannan, E. J., and B. G. Quinn. 1979. The Determination of the Order of an Autoregression. Journal of the Royal Statistical Society Series B 41: 190–195.
Hurvich, C. M., and C. Tsai. 1989. Regression and Time Series Model Selection in Small Samples. Biometrika 76 (no. 2): 297–307.
Irwin, J. O. 1968. The Generalized Waring Distribution Applied to Accident Theory. Journal of the Royal Statistical Society Series A 131 (no. 2): 205–225.
Lawless, J. F. 1987. Negative Binomial and Mixed Poisson Regression. Canadian Journal of Statistics 15 (no. 3): 209–225.
Leisch, F., and B. Gruen. 2010. Flexmix: Flexible Mixture Modeling. CRAN.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Long, J. S., and J. Freese. 2006. Regression Models for Categorical Dependent Variables Using Stata, second edition. College Station, TX: Stata Press.
Lord, D., S. E. Guikema, and S. R. Geedipally. 2007. Application of the Conway-Maxwell-Poisson Generalized Linear Model for Analyzing Motor Vehicle Crashes. Unpublished manuscript.
Machado, J., and J. M. C. Santos Silva. 2005. Quantiles for Counts. Journal of the American Statistical Association 100: 1226–1237.
Maindonald, J., and J. Braun. 2007. Data Analysis and Graphics Using R. Cambridge: Cambridge University Press.
McCullagh, P. 1983. Quasi-likelihood Functions. Annals of Statistics 11: 59–67.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models, second edition. New York: Chapman & Hall.
Miranda, A. 2013. Un modelo de valla doble para datos de conteo y su aplicación en el estudio de la fecundidad en México, in Aplicaciones en Economía y Ciencias Sociales con Stata, ed. A. Mendoza. College Station, TX: Stata Press.
Morel, J. G., and N. K. Neerchal. 2012. Overdispersion Models in SAS. Cary, NC: SAS Press.
Muenchen, R., and J. M. Hilbe. 2010. R for Stata Users. New York: Springer.
Mullahy, J. 1986. Specification and Testing of Some Modified Count Data Models. Journal of Econometrics 33: 341–365.
Nelder, J., and D. Pregibon. 1987. An Extended Quasi-likelihood Function. Biometrika 74: 221–232.
Newbold, E. M. 1927. Practical Applications of the Statistics of Repeated Events, Particularly to Industrial Accidents. Journal of the Royal Statistical Society 90: 487–547.
Pan, W. 2001. Akaike's Information Criterion in Generalized Estimating Equations. Biometrics 57: 120–125.
Rabe-Hesketh, S., and A. Skrondal. 2005. Multilevel and Longitudinal Modeling Using Stata. College Station, TX: Stata Press.
Wedderburn, R. W. M. 1974. Quasi-likelihood Functions, Generalized Linear Models and the Gauss-Newton Method. Biometrika 61: 439–447.
Westfall, P. H., and K. S. S. Henning. 2013. Understanding Advanced Statistical Methods. Boca Raton, FL: Chapman & Hall/CRC.
White, H. 1980. A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica 48 (no. 4): 817–838.
Winkelmann, R. 2008. Econometric Analysis of Count Data, 5th ed. New York: Springer.
Xekalaki, E. 1983. The Univariate Generalized Waring Distribution in Relation to Accident Theory: Proneness, Spells or Contagion? Biometrics 39 (no. 3): 39–47.
Zhu, F. 2012. Modeling Time Series of Counts with COM-Poisson INGARCH Models. Mathematical and Computer Modelling 56 (no. 9): 191–203.
Zou, Y., S. R. Geedipally, and D. Lord. 2013. Evaluating the Double Poisson Generalized Linear Model. Unpublished manuscript.
Zuur, A. 2012. A Beginner's Guide to Generalized Additive Models with R. Newburgh: Highlands Statistics.
Zuur, A. F., J. M. Hilbe, and E. N. Ieno. 2013. A Beginner's Guide to GLM and GLMM with R: A Frequentist and Bayesian Perspective for Ecologists. Newburgh: Highlands Statistics.
Zuur, A. F., A. A. Saveliev, and E. N. Ieno. 2012. Zero Inflated Models and Generalized Linear Mixed Models with R. Newburgh: Highlands Statistics.
Index
confidence intervals
bootstrap, 105
exact, 223–234
Poisson, 58
profile likelihood, 106
risk ratio, 59, 206
consistent estimator, 34
Consul, P. C., 211, 269
Conway, R. W., 245, 269
correlation
between panels, 241
within panels, 132
correlation structures
autoregressive, 239
exchangeable, 230–240
independence, 239
unstructured, 239
count-related distributions
inverse Gaussian, 162–166, 170, 172, 192, 206, 245
lognormal, 195
Poisson inverse Gaussian, 161–171
shifted Poisson, 174
uniform, 42, 63, 132, 134, 135, 157, 250
Wald, 106
See also negative binomial; Poisson
covariate pattern, 64
Cox regression, 52
Cragg, J. C., 184, 269
CRAN, xiii, 11, 48, 80, 93, 135, 185, 247,
250, 252
credible interval, 22, 250, 251
cumulant, 31, 32, 128
Cytel, 218, 224, 252, 255
Dean, C., 269
degrees of freedom, 43, 48, 76, 77, 79
delta, 59, 82, 95, 142, 214
method, 59, 60, 62, 95
parameter, 145, 181, 214, 215
Demétrio, C. G. B., 272
Desmarais, B. A., 201, 270
deviance
dispersion, 46–48, 79
function, 96, 97, 121
goodness-of-fit test, 77
residuals, 158
statistic, 75, 76, 78, 98, 113
Dhavala, S. S., 245, 270
discrete change, 35, 68, 71–72
dispersion
equi-, 9, 35, 82, 84, 129, 134, 152, 153
extra-, 39, 41, 73, 84, 103, 107, 125, 132, 158, 161, 196, 221, 223
over-, 9–12, 74–107, 123–125, 156–160
Pearson, 43, 46, 47, 79, 81, 92, 93, 97, 98, 107, 196, 221, 257
under-, x, 11, 12, 33, 48, 83, 124, 136, 156, 212–214, 221, 245
dispersion statistic, 35, 37, 39–41, 43–47, 51, 79–84, 92, 93, 97, 126, 127, 131–137, 149–153, 157, 181, 196, 221, 245
distributions, probability
Bernoulli, 20, 249
beta, 4, 46, 92, 111, 224, 248, 252
binomial, 41, 92, 206
binomial-logit, 189, 209
chi-square (chi2), 37, 38, 76–94, 107, 111, 115, 131, 136, 155, 156, 262
complementary loglog (cloglog), 184, 189, 195
double Poisson, 208, 245
exponential, 23, 31, 96, 127, 128, 163
frequency, 8, 19, 34, 38, 122, 137, 250, 252
gamma, 10, 11, 129, 132, 133, 135, 162, 163, 249
generalized Famoye negative binomial, 226
generalized Waring negative binomial, 209, 245–248
generalized Poisson, 210–215, 226–246
geometric, 110, 129, 184
inverse Gaussian, 162–166, 170, 172, 192, 206, 245
lognormal, 195
Poisson inverse Gaussian, 161–171
shifted Poisson, 174
uniform, 42, 63, 132, 134, 135, 157, 250
Wald, 106
See also negative binomial; Poisson
Dobson, A., 269
Poisson (cont.)
zero-truncated, 20, 37, 115, 174, 176, 196, 222, 225
population-averaged models, 239
posterior distribution, 23, 249–251
Pregibon, D., 96, 273
Priede, I. G., 269
prior, 249–250
probability distribution. See distributions, probability
profiling, 59
probit, 19–20, 184, 195, 198
proneness, 82, 111, 246–248
QIC, 118
quadrature
estimation (in general), 23, 31, 243, 251
Gaussian-Hermite, 183
quantile count models, 21, 218, 237, 239
quasi-deviance, 97
quasi-likelihood, 75, 96–98, 101, 129, 130, 239
quasi-likelihood information criterion (QIC). See QIC
quasi-Poisson, 223
Quinn, B. G., 270
Rabe-Hesketh, S., 273
rate parameter, 63, 64
rate ratio. See incidence rate ratio
real overdispersion, 73, 85, 86, 106
relative rate ratio (RRR). See risk ratio
residual, 110, 111
rho, 12, 20, 154, 245, 247
Rigby, R., 11, 165, 253, 273, 274
risk ratio, 100, 105, 147
Robinson, A., xiii, xiv, 25, 106–107, 252, 272
robust estimator, 99, 101, 159, 222, 223, 235
robust standard errors, 98, 99, 103, 106, 125, 130, 148, 150, 176, 189, 199
Rodríguez-Avi, J., 274
Rouse, D. M., 274
Ruppert, D., 269
WINBUGS, 252
Winkelmann, R., 275