Mixed Model Selection via the Information-Theoretic Approach using SAS

Jesse A. Canchola, California State University, East Bay


Torsten B. Neilands, University of California, San Francisco

ABSTRACT
Model-building via the hypothesis-testing method (i.e., variable selection) has been the de facto standard in the statistical analyst's toolbox for some time. Other methods, including Bayesian approaches, cross-validation and the bootstrap, are excellent competitors but can be complex and resource-intensive. Superior predictive algorithms such as neural networks, random forests or support vector machines can also be used, but they suffer from interpretability limitations (Breiman, 2001).

The information-theoretic approach to model selection, as codified by Burnham and Anderson (2002), involves averaging
over a class of eligible models to provide robust inference. The IT approach side-steps the (sometimes controversial) issue
of judging whether individual variables are important or not based on the traditional hypothesis testing dichotomy.

We utilize the SAS COMPMIX macro to carry out the IT paradigm. We illustrate with an example from AIDS research. A
rationale for the strategies is given and we discuss the limitations.

INTRODUCTION

In traditional statistical model-building, stepwise approaches (including true stepwise, backward elimination and forward addition) rely on testing the hypothesis of whether to keep or remove predictors (most often singly) at a specified significance level (typically, 0.05 to remove and 0.10 to enter the model). This method can produce a multiplicity of data models; that is, it may yield multiple models that are equally viable yet each tell a different, important story. The challenge for the investigator is to choose the model that best reflects reality. We can use the IT approach to help choose the best model as determined by three selection measures developed below (also from Burnham and Anderson).

THEORY

“Distance” Between Models. Kullback and Leibler (1951) introduced the notion of model “distance”, which can also be taken to mean “information” between models^1. They developed the idea as follows.

In the context of probability and statistics, assume we have two known continuous distributions^2, F(•) and G(•), where F(•) is the “true” distribution and G(•) is taken to be the probability distribution approximating the truth, viz., F(•). Then the expression

    I[F(X) | G(X)] = ∫_{−∞}^{+∞} F(X) log[ F(X) / G(X | Θ) ] dX        (1)

where I[F|G] ≥ 0 (always nonnegative^3), can easily be interpreted to mean the information lost when approximating the fixed truth, F(•), using G(•)^4; the best approximation is found by minimizing I[F(•)|G(•)] over G(•). Here, G(X | Θ) is the approximating model chosen over a space of models indexed by Θ. It is straightforward to see that I[F|G] = 0 if and only if F(•) = G(•).

The formula in (1) holds irrespective of the logarithm base but is typically expressed in base e (the natural logarithm) or base 2.

The K-L distance formula goes by several other names: relative entropy, information divergence, K-L divergence and information gain.

1. The “distance” analogy is not entirely correct because the expression is not symmetric (i.e., I[F|G] ≠ I[G|F]) and I[F|G] does not satisfy the triangle inequality.
2. We concentrate on continuous probability distributions here, though the theory extends easily to the discrete case.
3. A result also known as the Gibbs inequality.
4. So the “I” in I[F|G] can stand for both “Information lost” and “Inefficiency”, the latter due to Cover and Thomas (1991).
The theory proposed by K-L turned out to be a re-derivation of the famous physicist Ludwig Eduard Boltzmann’s entropy
(1877; with negative valence).

Subsequently, Akaike (1973) introduced the idea of using the K-L distance/information for model selection. Under the K-L information rubric, however, F(•) and G(•) are known functions. Akaike developed a rigorous methodology for approximating the K-L information that involves optimizing the log-likelihood function. He showed that model selection via this approximation amounts to minimizing the expected estimated K-L “distance” over the models under consideration, as opposed to minimizing the known K-L distance. That is, the estimate of the expression

    E_Y E_X { log[ G(X | Θ̂(Y)) ] }        (2)

is the target of model selection approaches. Akaike (1973, 1974, 1985, 1994) found that an approximately unbiased
estimator of (2) under large sample statistical theory is

    log[L(Θ̂ | Y)] − K        (3)

where log[L(Θ̂ | Y)] is the maximized log-likelihood, Y = y is taken to be the empirical data, and K (a bias-correction term^5) is the number of estimable parameters in the model, equivalent to

    C − Ê_θ̂ { I[ F(•) | Ĝ(• | Θ̂) ] }        (4)

where C is a constant.

For historical reasons^6, Akaike multiplied his final “Information Criterion” by −2, so that (3) becomes the famous and well-known Akaike information criterion (AIC):

    AIC = −2 log[L(Θ̂ | Y)] + 2K        (5)

which, roughly speaking, can be thought of as an estimated expected relative “distance” between the model fitted to the empirical data and the model that perhaps “truly” generated those data. The “2K” portion is generally discussed as a “penalty term” exacted as the number of parameters in the model increases, though it was originally derived as a bias-correction term (see above).

The AIC will find the best model among a set of candidate models. Given that the true model under K-L is unknown, it is important that the investigator have a well-defined set of candidate models: a poor candidate set will yield a poor AIC-selected model.

Sugiura (1978) and Sakamoto et al. (1986) showed that (5) performs poorly when the ratio of the number of parameters to the sample size is high (call this ratio R). Sugiura, in fact, derived a corrected form of the AIC that was further refined by Hurvich and Tsai (1989) for small samples. They applied a second-order bias adjustment to the penalty term of (5). The result is the corrected AIC (AICc):

    AICc = −2 log[L(Θ̂ | Y)] + 2K [ n / (n − K − 1) ]        (6)

where n is the sample size and K and L(•) are as before. Burnham and Anderson (2002) recommend this as the measure of information whenever R is not large. We will use the AICc as the default for estimating the best model(s) from a candidate set in the sections that follow.
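To make (6) concrete, here is a minimal Python sketch (Python is used for illustration only; the paper's analysis is in SAS, and the numeric example values are hypothetical):

```python
def aicc(log_likelihood, k, n):
    """Corrected Akaike information criterion, per equation (6).

    log_likelihood: maximized log-likelihood of the fitted model
    k: number of estimable parameters
    n: sample size
    """
    return -2.0 * log_likelihood + 2.0 * k * (n / (n - k - 1.0))

# Hypothetical example: log-likelihood of -100 with K = 3 parameters, n = 30.
# The small-sample factor n/(n-K-1) = 30/26 inflates the usual 2K penalty.
print(aicc(-100.0, 3, 30))  # 206.923076...
```

As n grows with K fixed, the factor n/(n − K − 1) tends to 1 and the AICc converges to the ordinary AIC of (5).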

5. Akaike (1973) showed that the estimate of (2) using the log-likelihood is biased by a factor of approximately K.
6. Using the result that −2 log(A/B) is asymptotically distributed as χ², where A and B are maximized likelihood values.
To help us make sense of the relative candidate model importance we will use three simple and interpretable measures that
incorporate the AICc.

1. AICc Differences^7.

    Δ_i = AICC_i − AICC_min        (7)

where AICC is equivalent notation for AICc and AICC_min is the smallest of the AICCs of the candidate models.

The larger the value of Δ_i, the less likely Model i is to be the optimal K-L model. In fact, a general rule of thumb on the relative importance of the difference Δ_i (slightly different from the one Burnham and Anderson give) is shown in Table 1:

Table 1. Rule of thumb for evaluating AICC change.

    Δ_i = AICC_i − AICC_min    Evidence that Model i is the best model^8
    0–2                        Highly likely
    3–10                       Moderately likely
    >10                        Highly unlikely
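The differences in (7) and the Table 1 rule of thumb can be sketched in Python (an illustration, not part of the original SAS workflow; the exact boundary handling between the bands is an assumption). The AICc values used here are the ones reported later in Table 4:

```python
def aicc_deltas(aiccs):
    """Equation (7): differences from the smallest AICc in the candidate set."""
    best = min(aiccs)
    return [a - best for a in aiccs]

def rule_of_thumb(delta):
    """Table 1 evidence bands for a model's delta (boundary handling assumed)."""
    if delta <= 2:
        return "Highly likely"
    elif delta <= 10:
        return "Moderately likely"
    return "Highly unlikely"

# AICc values from Table 4 of this paper
aiccs = [653.7, 655.4, 657.6, 661.1]
for d in aicc_deltas(aiccs):
    print(round(d, 1), rule_of_thumb(d))
```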

2. Akaike Weights (wi) or Model Probabilities.

If we let L(G_i | X) be the likelihood of model i given the empirical evidence (i.e., the data), we get the proportionality:

    L(G_i | X) ∝ exp(−Δ_i / 2)        (8)

which, when normalized over the space of all T models, gives us a set of weights, W_i, such that

    W_i = exp(−Δ_i / 2) / Σ_{t=1}^{T} exp(−Δ_t / 2)        (9)

which is easily interpretable as the probability that the ith model is the correct K-L model amongst the candidate models (assuming the candidate set contains the K-L best model).
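A Python sketch of (8) and (9), for illustration only: feeding it the “Diff” column reported later in Table 4 recovers the published weights to the printed precision.

```python
import math

def akaike_weights(deltas):
    """Equations (8)-(9): relative likelihoods exp(-delta/2), normalized to sum to 1."""
    rel = [math.exp(-d / 2.0) for d in deltas]  # eq. (8), up to proportionality
    total = sum(rel)
    return [r / total for r in rel]

# Deltas taken from the "Diff" column of Table 4 of this paper
weights = akaike_weights([0.0, 1.73406, 3.87687, 7.36484])
print([round(w, 5) for w in weights])  # ≈ [0.62921, 0.26439, 0.09056, 0.01583]
```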

3. Evidence Ratios

If we array the models in descending order of weight, we can use evidence ratios (ER) to compare the model with the highest weight to each of the following models in the candidate set in a pairwise manner. The ER can be defined as:

    ER_ij = W_i / W_j ≡ exp(−Δ_i / 2) / exp(−Δ_j / 2)        (10)

which can be taken to be an odds^9. Given the latter definition, we can interpret the ER as the odds against Model j being the correct K-L model as compared to the base (highest-weight) model in the array.
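Equation (10) in Python (illustrative only): because the normalizing sum in (9) cancels in the ratio, the ER needs only the two deltas.

```python
import math

def evidence_ratio(delta_i, delta_j):
    """Equation (10): odds against model j relative to model i."""
    return math.exp(-delta_i / 2.0) / math.exp(-delta_j / 2.0)

# Odds against the delta = 1.73406 model versus the best (delta = 0) model,
# reproducing the Table 4 "Odds" column entry for Model 2
print(round(evidence_ratio(0.0, 1.73406), 4))  # 2.3798
```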

7. Of course, this formula generalizes to other forms of the AIC you may encounter.
8. …among the set of candidate models.
Traditional Hypothesis Testing Approach

The traditional hypothesis testing approach to model-building is well-known and space limitations preclude further
discussion of this methodology. Pardoe (2006) and Myers (1986), for example, provide readable treatments of this approach
to variable selection.

METHODS & RESULTS

Setting. We first exemplify the IT approach and then compare it with the traditional hypothesis-testing method of model-building. For the IT approach, we use the COMPMIX SAS macro by Russell Wolfinger (1999). For the hypothesis testing approach (HTA), we use the modified stepwise regression SAS macro swRegAll_mod, originally written by Alan Bostrom at the University of California, San Francisco (in its original form called RegAll_mod). It includes his StepMix SAS macro subroutine for stepwise mixed modeling. We also perform the HTA “by hand” to add to the traditional mix of model-building strategies. All HTA runs use a p-to-remove and p-to-enter threshold of 0.05.

Example. We use data from the FRAM study (Fat Redistribution And Metabolic changes in HIV infection) cohort. In this
example, we are investigating the predictors of human leptin (a hormone that plays a key role in regulating energy intake and
energy expenditure, including the regulation of appetite and metabolism). The demographic predictors forced into the model
are (with variable names in all uppercase with “(r)” being the referent category): continuous age (adjusted to decades;
AGE10); gender (male(r), female, transgender excluded; GENDER1); ethnicity (White(r), African-American, Hispanic,
Other; ETHNIC1); log of “total fat divided by height squared” (continuous; LNSVATTOT). The candidate predictors are a physical activity measure (inactive, some activity, moderate activity, active, very active(r); PHYSACT); enough food to eat (enough food(r), enough but not always, sometimes not enough, often not enough; A05FOOD); current cocaine & crack use (current vs. not current(r); CUR_COCA_CRACK); current crack use (current vs. not current(r); CUR_CRACK); and current marijuana use (current vs. not current(r); CUR_MARIJ).

Automatic Stepwise Mixed Model.

The SAS code for running the stepwise mixed model using the swRegAll_mod SAS macro is shown in Table 2.

Table 2. swRegAll_mod SAS macro:


ods rtf body = "c:\FRAM\LeptinMultivariates_HIVonly_&sysdate..rtf" ;
/* STEP 5a: Final Model Stepwise */ ;
title4 "Step 5a: Final Model" ;
%include 'c:\sasmacro\swRegAll_Mod.sas' ;
%include 'c:\sasmacro\stepmix_mod.sas';
%swRegAll_mod(dsname =data.hiv_malesfemales4 ,
outcome = lnHuman_Leptin ,
forced = age10 gender1 ethnic1 lnsvattot ,
candidates = physact a05food cur_coca_crack
cur_crack cur_marij ,
classvar = gender1 ethnic1 physact a05food
cur_coca_crack cur_crack cur_marij ,
method = ml ,
order = formatted
);
ods rtf close ;

The final model given by the stepwise regression macro (swRegAll_mod.sas) includes the predictors: age in decades,
gender, ethnicity, log of total fat, physical activity and current marijuana use.

Using the manual method (the more traditional approach in mixed modeling), we arrive at a final model with exactly the same predictors as the much faster stepwise algorithm (swRegAll.sas): age in decades, gender, ethnicity, log of total fat, physical activity and current marijuana use. Until the development of the unpublished SAS macro swRegAll.sas and its underlying StepMix_mod.sas companion algorithm, there was no widely available algorithm for stepwise mixed modeling, most likely because of the well-known dangers of automatic methods for model selection.

9. Technically, given a success probability π, the odds is defined as π/(1 − π).
However, used wisely and with care, automatic stepwise methods can be of use as a perturbing tool for examining viable
model behaviors.

We next utilize the COMPMIX SAS macro with code shown in Table 3. One must first define the candidate set of models as
indicated below (see also the reference manual for COMPMIX.) Our candidate set includes the model found by the
automatic stepwise and manual stepwise approaches above.

Table 3. COMPMIX SAS macro


%INCLUDE "C:\Sasmacro\COMPMIX.SAS" ;
%LET model1 = %str(
CLASS gender1 ethnic1 physact a05food
cur_coca_crack cur_crack cur_marij ;
MODEL lnHuman_Leptin = age10 gender1 ethnic1
lnsvattot physact a05food
cur_coca_crack cur_crack cur_marij;
);

%LET model2 = %str(
CLASS gender1 ethnic1 physact a05food
      cur_marij ;
MODEL lnHuman_Leptin = age10 gender1 ethnic1
      lnsvattot physact a05food
      cur_marij ;
);

%LET model3 = %str(
CLASS gender1 ethnic1 physact a05food
      cur_coca_crack cur_marij ;
MODEL lnHuman_Leptin = age10 gender1 ethnic1
      lnsvattot physact a05food
      cur_coca_crack cur_marij ;
);

%LET model4 = %str(
CLASS gender1 ethnic1 physact
      cur_marij ;
MODEL lnHuman_Leptin = age10 gender1 ethnic1
      lnsvattot physact
      cur_marij ;
);

/* ******* COMPARE MODELS USING ML ********* */ ;


%COMPMIX(DATA = data.hiv_malesfemales4 ,
MODELS = 4 ,
METHOD = ML ,
IC = AICC ,
OPTIONS = PLOT
);
RUN ;

The results of the COMPMIX run are shown in Table 4. Note the compactness of the resulting output (slightly edited in Table 4). The models (“Model” column) are arrayed in descending order of Akaike weight (w_i). The “Parms” column indicates the number of parameters in the model (for both continuous and categorical variables). The “AICC” column gives the corrected AIC, while the “Diff” column gives the AICC differences (i.e., Δ_i). The “Odds” column shows the evidence ratios (ER_ij). Note that the weights are cumulated in the last column (“Cum Weight”) to exhibit the nature of the weight as a probability and the total probability rule in a closed set (i.e., the probabilities must sum to 1.0).

Table 4. Results of the COMPMIX SAS macro using the FRAM data (edited for compactness).

    AICC Values
    Model  Parms   AICC    Diff      Odds      Weight    Cum Weight
    1      18      653.7   0.00000   1.0000    0.62921   0.62921
    2      16      655.4   1.73406   2.3798    0.26439   0.89361
    3      17      657.6   3.87687   6.9479    0.09056   0.98417
    4      13      661.1   7.36484   39.7425   0.01583   1.00000
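As an illustrative consistency check (a Python sketch, not COMPMIX output), the Odds, Weight, and Cum Weight columns of Table 4 can be reproduced from the Diff column alone via equations (8)–(10):

```python
import math

diffs = [0.0, 1.73406, 3.87687, 7.36484]  # "Diff" column of Table 4
rel = [math.exp(-d / 2.0) for d in diffs]
total = sum(rel)
weights = [r / total for r in rel]                          # "Weight" column
odds = [rel[0] / r for r in rel]                            # "Odds" column (vs. best model)
cum = [sum(weights[: i + 1]) for i in range(len(weights))]  # "Cum Weight" column

for row in zip(odds, weights, cum):
    print([round(x, 5) for x in row])
```

The recomputed values agree with the published table to the precision shown.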

DISCUSSION

From Table 4, the “winning” model would seem to be Model 1 if we use the Table 1 rule of thumb, under which models with Δ_i (the “Diff” column) of two or less are “highly likely” to be the best model. In terms of weight, the probability that Model 1 is the correct model amongst the candidate set is 63%.

Of course, a winning or correct model supposes we have the K-L true model in our candidate model mix. Let us assume that
this is so in this case.

Model 1 includes the predictors: age in decades, gender, ethnicity, log of total fat, physical activity, enough food to eat,
current cocaine/crack use, current crack use, current marijuana use.

Note that Model 2 is the model found by both the automatic stepwise algorithm and the manual approach. For this model, Table 4 shows that the odds are about 2.4 to 1 against this “next best” model being the correct one, as compared with the best model in the candidate set. In terms of weight (w_i), the probability that it is the correct model is only 26%.

Clearly the hypothesis testing approaches have pared the omnibus model to be parsimonious. However, it is not the best
model in the sense of K-L information.

Of the three methods, in terms of time to task completion for choosing the “best” model, the manual method took the longest and carried the obvious potential for human error. The automatic stepwise algorithm, once set up correctly for the problem at hand, took the next most time. Finally, the COMPMIX.sas macro was the easiest to set up and run, so it took the least time from start to finish.

CONCLUSIONS

We have shown that model-building can be a perilous endeavor. Depending on the tool the investigator uses, they may arrive at different answers driven by p-values rather than by the available (and valuable) “information,” as our example shows. The analyst should use the information-theoretic approach as a primary tool in the model-selection process (not as an adjunct). The COMPMIX SAS macro provides a simple-to-use method for effecting the IT paradigm.

ACKNOWLEDGMENTS

Many thanks to Carl Grunfeld at the VA San Francisco for limited use of his HIV data and Alan Bostrom for the use of his
unpublished swRegAll.sas and StepMix.sas macros. We wish to thank Alison J. Canchola at the Northern California Cancer
Center for editing the final manuscript. However, responsibility for remaining errors and omissions belong to the authors.

CONTACT INFORMATION

Jesse A. Canchola, MS
Currently at: Bayer Healthcare, Diagnostics Division
Berkeley, California
Phone: 510.705.5855; eFax: 415.276.9342
e-mail (first author): jesse.canchola.b@bayer.com
REFERENCES

Akaike H (1973). “Information theory as an extension of the maximum likelihood principle.” Second International
Symposium on Information Theory. Akademiai Kiado, Budapest.

------ (1974). “A new look at the statistical model identification.” IEEE Transactions on Automatic Control AC 19, 716-
723.

------ (1985). “Prediction and entropy.” A celebration of Statistics. Springer, New York, NY.

------ (1994). “Implications of the informational point of view on the development of statistical science.” Engineering and
Scientific Applications. Vol. 3, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An
Informational Approach. Kluwer Academic Publishers, Dordrecht, the Netherlands.

Boltzmann, LE (1877). “Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respektive den Sätzen über das Wärmegleichgewicht.” Wiener Berichte 76, 373-435.

Bostrom A (1995). swRegAll SAS macro. University of California, San Francisco.

Breiman, L (2001). “Statistical Modeling: The Two Cultures.” Statistical Science 16(3), 199-231.

Burnham KP and Anderson DR (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic
Approach. 2nd Ed. Springer. 488 pp.

Cover TM and Thomas JA (1991). Elements of information theory. John Wiley and Sons, New York, NY. 542 pp.

Hurvich CM, Tsai CL (1989). Regression and time series model selection in small samples. Biometrika 76, 297-307.

Kullback S and Leibler RA (1951). “On information and sufficiency.” Annals of Mathematical Statistics 22, 79-86.

Myers RH (1986). Classical and Modern Regression With Applications. Duxbury Press, Boston, MA. 359 pp.

Pardoe I (2006). Applied Regression Modeling: A Business Approach. John Wiley and Sons, New York, NY. 320 pp.

Sakamoto Y, Ishiguro M, Kitagawa G (1986). Akaike information criterion statistics. KTK Scientific Publishers, Tokyo,
Japan.

Sugiura N (1978). “Further analysis of the data by Akaike’s information criterion and the finite corrections.”
Communications in Statistics, Theory and Methods A7, 13-26.

Wolfinger R (1999). “COMPMIX macro” downloadable with documentation and examples at URL:
http://support.sas.com/ctx/samples/index.jsp?sid=496&tab=downloads as of 07/21/2006.

SAS, SAS/Stat are registered trademarks or trademarks of The SAS Institute Inc. in the USA and other countries. ® indicates
USA registration. All other trademarks, registered trademarks, and service marks are the property of their respective owners.
