Loglinear Models: Angela Jeansonne


Loglinear Models

Angela Jeansonne

This page last updated

Brief History

Until the late 1960s, contingency tables - two-way tables formed by cross-classifying categorical
variables - were typically analyzed by calculating chi-square values testing the hypothesis of
independence.  When tables consisted of more than two variables, researchers would compute
the chi-squares for two-way tables and then again for multiple sub-tables formed from them in
order to determine if associations and/or interactions were taking place among the variables. In
the 1970s, the analysis of cross-classified data changed quite dramatically with the publication of
a series of papers on loglinear models by L.A. Goodman.  Several books appeared around
that time building on Goodman’s work (Bishop, Fienberg & Holland 1975; Haberman 1975). 
Now researchers were introduced to a wide variety of models that could be fitted to cross-
classified data.  Thus, the introduction of the loglinear model provided them with a formal and
rigorous method for selecting a model or models for describing associations between variables.

Overview

When to use loglinear models:

The loglinear model is one of the specialized cases of generalized linear models for Poisson-
distributed data. Loglinear analysis is an extension of the two-way contingency table where the
conditional relationship between two or more discrete, categorical variables is analyzed by
taking the natural logarithm of the cell frequencies within a contingency table. Although
loglinear models can be used to analyze the relationship between two categorical variables (two-
way contingency tables), they are more commonly used to evaluate multiway contingency tables
that involve three or more variables. The variables investigated by loglinear models are all
treated as “response variables”. In other words, no distinction is made between independent and
dependent variables. Therefore, loglinear models only demonstrate association between
variables.  If one or more variables are treated as explicitly dependent and others as independent,
then logit or logistic regression should be used instead.  Also, if the variables being investigated
are continuous and cannot be broken down into discrete categories, logit or logistic regression
would again be the appropriate analysis.  For a complete discussion on logit and logistic
regression consult Agresti (1996) or Tabachnick and Fidell (1996).

Example of data appropriate for loglinear models:

Suppose we are interested in the relationship between sex, heart disease and body weight.  We
could take a sample of 200 subjects and determine the sex, approximate body weight, and who
does and does not have heart disease. The continuous variable, body weight, is broken down into
two discrete categories: not overweight and overweight. The contingency table containing the
data may look like this:

                               Heart Disease
Body Weight       Sex        Yes      No     Total
Not overweight    Male        15       5        20
                  Female      40      60       100
                  Total       55      65       120
Overweight        Male        20      10        30
                  Female      10      40        50
                  Total       30      50        80

In this example, if we had designated heart disease as the dependent variable and sex and body
weight as the independent variables, then logit or logistic regression would have been the
appropriate analysis.

Basic Strategy and Key Concepts:

The basic strategy in loglinear modeling involves fitting models to the observed frequencies in
the cross-tabulation of categorical variables.  The models can then be represented by a set of
expected frequencies that may or may not resemble the observed frequencies.  Models will vary
in terms of the marginals they fit, and can be described in terms of the constraints they place on
the associations or interactions that are present in the data. The pattern of association among
variables can be described by a set of odds and by one or more odds ratios derived from them. 
Once expected frequencies are obtained, we then compare models that are hierarchical to one
another and choose a preferred model, which is the most parsimonious model that fits the data. 
It’s important to note that a model is not chosen if it bears no resemblance to the observed data. 
The choice of a preferred model is typically based on a formal comparison of goodness-of-fit
statistics associated with models that are related hierarchically (models containing higher order
terms also implicitly include all lower order terms).  Ultimately, the preferred model should
distinguish between the pattern of the variables in the data and sampling variability, thus
providing a defensible interpretation.

The Loglinear Model

The following model refers to the traditional chi-square test where two variables, each with two
levels (2 x 2 table), are evaluated to see if an association exists between the variables. 

$$\ln(F_{ij}) = \mu + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$$

$\ln(F_{ij})$ = the natural log of the expected cell frequency of the cases for cell ij in the contingency table

$\mu$ = the overall mean of the natural log of the expected frequencies

$\lambda$ = terms that each represent “effects” which the variables have on the cell frequencies

A and B = the variables

i and j = the categories within the variables

Therefore:

$\lambda_i^A$ = the main effect for variable A

$\lambda_j^B$ = the main effect for variable B

$\lambda_{ij}^{AB}$ = the interaction effect for variables A and B

The above model is considered a Saturated Model because it includes all possible one-way and
two-way effects.  Given that the saturated model has as many effects as there are cells in the
contingency table, the expected cell frequencies will always exactly match the observed
frequencies, with no degrees of freedom remaining (Knoke and Burke, 1980).  For example, in a
2 x 2 table there are four cells and in a saturated model involving two variables there are four
effects, $\mu$, $\lambda_i^A$, $\lambda_j^B$, $\lambda_{ij}^{AB}$; therefore the expected cell frequencies will exactly match the observed
frequencies.  Thus, in order to find a more parsimonious model that will isolate the effects best
demonstrating the data patterns, a non-saturated model must be sought.  This can be achieved by
setting some of the effect parameters to zero.  For instance, if we set the effects parameter $\lambda_{ij}^{AB}$ to
zero (i.e. we assume that variable A has no effect on variable B, or vice versa) we are left with
the unsaturated model.

$$\ln(F_{ij}) = \mu + \lambda_i^A + \lambda_j^B$$


This particular unsaturated model is titled the Independence Model because it lacks an
interaction effect parameter between A and B.  Implicitly, this model holds that the variables are
unassociated.  Note that the independence model is analogous to the chi-square analysis, testing
the hypothesis of independence.
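
As a minimal sketch of the saturated model described above, the following Python computes its parameters for a 2 x 2 table. It assumes effect (sum-to-zero) coding, which the text does not state explicitly, and borrows the voter turnout counts from Knoke and Burke (1980) that appear later in this article. The variable names are illustrative only.

```python
import math

# Observed 2 x 2 counts (Knoke and Burke, 1980).
# Variable A (rows): voted, not voted.
# Variable B (columns): one or more memberships, none.
observed = [[689, 298],
            [232, 254]]

log_f = [[math.log(f) for f in row] for row in observed]

# Assumption: effect (sum-to-zero) coding, so mu is the mean of the log
# frequencies and each lambda is a deviation from that mean.
mu = sum(sum(row) for row in log_f) / 4
lam_a = [sum(log_f[i]) / 2 - mu for i in range(2)]
lam_b = [(log_f[0][j] + log_f[1][j]) / 2 - mu for j in range(2)]
lam_ab = [[log_f[i][j] - mu - lam_a[i] - lam_b[j] for j in range(2)]
          for i in range(2)]

# Saturated model: the full set of effects reproduces every observed cell.
saturated = [[math.exp(mu + lam_a[i] + lam_b[j] + lam_ab[i][j])
              for j in range(2)] for i in range(2)]

print("exp(mu):", round(math.exp(mu), 2))
print("saturated fit:", [[round(f, 2) for f in row] for row in saturated])
```

Under this coding the effects for each variable sum to zero across its categories, and the saturated fit returns the observed counts, which is why no degrees of freedom remain.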

Hierarchical Approach to Loglinear Modeling

The following equation represents a 2 x 2 x 2 multiway contingency table with three variables,
each with two levels, exactly like the heart disease table shown earlier in this article.  Here, this
equation is being used to illustrate the hierarchical approach to loglinear modeling.

$$\ln(F_{ijk}) = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$$

A hierarchy of models exists whenever a complex multivariate relationship present in the data
necessitates inclusion of less complex interrelationships (Knoke and Burke, 1980). 

For example, in the above equation if a three-way interaction is present (ABC), the equation for
the model must also include all two-way effects (AB, AC, BC) as well as the single variable
effects (A, B, C) and the grand mean ($\mu$).  In other words, less complex models are nested
within the higher-order model {ABC}. Note the shorter notation used here to describe models. 
Each set of letters within the braces indicates a highest order effect parameter included in the
model and by virtue of the hierarchical requirement, the set of letters within braces also reveals
all lower order relationships which are necessarily present (Knoke and Burke, 1980).

SPSS uses this hierarchical approach to generate the most parsimonious model; however, some programs use a
non-hierarchical approach to loglinear modeling.  Reverting back to the previous notation, a
non-hierarchical model would look like the following: $\ln(F_{ij}) = \mu + \lambda_i^A + \lambda_{ij}^{AB}$.  Notice that the main
effect term $\lambda_j^B$ is not included in the model, therefore violating the hierarchical requirement.  The
use of non-hierarchical modeling is not recommended, because it provides no statistical
procedure for choosing from among potential models.

Choosing a model to Investigate

Typically, either theory or previous empirical findings should guide this process.  However, if an
a priori hypothesis does not exist, there are two approaches that one could take:

1.      Start with the saturated model and begin to delete higher order interaction terms until the fit
of the model to the data becomes unacceptable based on the probability standards adopted by
the investigator.

2.      Start with the simplest model (independence model) and add more complex interaction terms
until an acceptable fit is obtained which cannot be significantly improved by adding further
terms.

Fitting Loglinear Models

Once a model has been chosen for investigation the expected frequencies need to be tabulated. 
For two variable models, the following formula can be used to compute the direct estimates for
non-saturated models. 

(column total) * (row total)/grand total

For larger tables, an iterative proportional fitting algorithm (Deming-Stephan algorithm) is used
to generate expected frequencies.  This procedure uses marginal tables fitted by the model to
ensure that the expected frequencies sum across the other variables to equal the corresponding
observed marginal tables (Knoke and Burke, 1980).
For Example: In the following contingency tables, the observed marginal table totals (each
column and row) are equal to the expected marginal table totals, even though the actual expected
frequencies are different from the observed frequencies.

Observed Frequencies

                        Membership
Vote Turnout    One or More      None     Total
Voted                   689       298       987
Not Voted               232       254       486
Total                   921       552      1473

Expected Frequencies

                        Membership
Vote Turnout    One or More      None     Total
Voted                617.13    369.87       987
Not Voted            303.87    182.13       486
Total                   921       552      1473

(Note: The above contingency tables were taken from Knoke and Burke, 1980 and represent data
collected on voluntary membership association and voter turnout in the 1972 and 1976
Presidential elections in the United States.)
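
The direct estimate (column total) * (row total)/grand total can be checked against the expected frequencies above. This short Python sketch applies the formula to the observed turnout counts; the variable names are illustrative.

```python
# Observed counts: rows are voted / not voted,
# columns are one or more memberships / none.
observed = [[689, 298],
            [232, 254]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Direct estimate for each cell under the independence model.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])
```

The expected rows and columns sum to the same marginal totals as the observed table, even though the individual cells differ.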

The iterative proportional fitting process generates maximum likelihood estimates of the
expected cell frequencies for a hierarchical model.  In short, preliminary estimates of the
expected cell frequencies are successively adjusted to fit each of the marginal sub-tables
specified in the model.  For example, in the model AB, BC, ABC, the initial estimates are
adjusted to fit AB then BC and finally to equal the ABC observed frequencies.  The previous
adjustments become distorted with each new fit, so the process starts over again with the most
recent cell estimate.  This process continues until an arbitrarily small difference exists between
the current and previous estimates. Consult Christensen (1997) for a numerical explanation of the
iterative computation of estimates.
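
The fitting cycle described above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the routine used by any particular package: tables are stored as dictionaries keyed by cell index, all starting estimates are 1.0, and the names `margin`, `ipf`, `tol` and `max_cycles` are hypothetical. The three-way example fits all two-way margins (no three-way interaction) to the heart disease table from the overview.

```python
def margin(table, axes):
    """Sum a {cell_index_tuple: count} table down to the given axes."""
    totals = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        totals[key] = totals.get(key, 0.0) + value
    return totals


def ipf(observed, fitted_margins, tol=1e-8, max_cycles=1000):
    """Iterative proportional fitting for the listed marginal tables."""
    fitted = {cell: 1.0 for cell in observed}   # preliminary estimates
    for _ in range(max_cycles):
        largest_change = 0.0
        for axes in fitted_margins:
            target = margin(observed, axes)     # observed marginal table
            current = margin(fitted, axes)      # current fitted marginal
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                updated = fitted[cell] * target[key] / current[key]
                largest_change = max(largest_change,
                                     abs(updated - fitted[cell]))
                fitted[cell] = updated
        if largest_change < tol:                # estimates stopped moving
            break
    return fitted


# Two-way check: fitting both one-way margins gives the independence model.
turnout = {(0, 0): 689, (0, 1): 298, (1, 0): 232, (1, 1): 254}
fitted_turnout = ipf(turnout, [(0,), (1,)])

# Three-way table, axes = (body weight, sex, heart disease).
heart = {(0, 0, 0): 15, (0, 0, 1): 5, (0, 1, 0): 40, (0, 1, 1): 60,
         (1, 0, 0): 20, (1, 0, 1): 10, (1, 1, 0): 10, (1, 1, 1): 40}
fitted_heart = ipf(heart, [(0, 1), (0, 2), (1, 2)])

print({cell: round(v, 2) for cell, v in fitted_turnout.items()})
print({cell: round(v, 2) for cell, v in fitted_heart.items()})
```

For the two-way table the cycle settles after a single pass and reproduces the direct estimates; for the three-way model no closed form exists, so several cycles are needed.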

Parameter Estimates

Once estimates of the expected frequencies for the given model are obtained, these numbers are
entered into appropriate formulas to produce the effect parameter estimates ($\lambda$’s) for the
variables and their interactions (Knoke and Burke, 1980).  The effect parameter estimates are
related to odds and odds ratios.  Odds are described as the ratio between the frequency of being
in one category and the frequency of not being in that category.  For example, in the above
contingency table for observed frequencies, the odds that a person voted are 987/486 = 2.03.  The
odds ratio is one conditional odds divided by another for a second variable, such as the odds of
having voted for the second variable Membership.  Based on the same contingency table, the
conditional odds of having voted for those belonging to one or more groups are 2.97 (689/232), and
the conditional odds of having voted for those not belonging to any groups are 1.17 (298/254).  Then
the odds ratio for voting, comparing people belonging to one or more groups with people belonging to none, is
2.97/1.17 = 2.54 (2.53 when computed from the unrounded odds).  This is also called the “cross-product ratio” and in a 2x2 table can be
computed by dividing the product of the main diagonal cells (689*254) by the product of the off
diagonal cells (232*298).  An odds ratio above 1 indicates a positive association among
variables, while odds ratios smaller than one indicate a negative association.  Odds ratios
equaling 1 demonstrate that the variables have no association (Knoke and Burke, 1980).  Note
that odds and odds ratios are highly dependent on a particular model.  Thus, the associations
illustrated by evaluating the odds ratios of a given model are informative only to the extent that
the model fits well.
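
The odds arithmetic above can be reproduced with a short Python sketch that uses the observed turnout counts. The variable names are illustrative; the point is that the ratio of conditional odds and the cross-product ratio are the same quantity.

```python
# Observed counts from the turnout table.
voted_member, voted_none = 689, 298
not_voted_member, not_voted_none = 232, 254

# Marginal odds of having voted.
odds_voted = (voted_member + voted_none) / (not_voted_member + not_voted_none)

# Conditional odds of having voted within each membership category.
odds_member = voted_member / not_voted_member
odds_none = voted_none / not_voted_none

# Odds ratio two ways: ratio of conditional odds, and cross-product ratio.
odds_ratio = odds_member / odds_none
cross_product = (voted_member * not_voted_none) / (not_voted_member * voted_none)

print(round(odds_voted, 2), round(odds_member, 2), round(odds_none, 2))
print(round(odds_ratio, 2), round(cross_product, 2))
```

Because the odds ratio exceeds 1, membership and turnout are positively associated in the observed table.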

Testing for Fit

Once the model has been fitted, it is necessary to decide which model provides the best fit.  The
overall goodness-of-fit of a model is assessed by comparing the expected frequencies to the
observed cell frequencies for each model.  The Pearson chi-square statistic or the likelihood
ratio ($L^2$) can be used to test a model’s fit. However, $L^2$ is more commonly used because it is
the statistic that is minimized in maximum likelihood estimation and can be partitioned uniquely
for more powerful tests of conditional independence in multiway tables (Knoke and Burke,
1980).  The formula for the $L^2$ statistic is as follows:

$$L^2 = 2 \sum_{i} \sum_{j} f_{ij} \ln(f_{ij} / F_{ij})$$

where $f_{ij}$ is the observed frequency and $F_{ij}$ the expected frequency for cell ij.

$L^2$ follows a chi-square distribution with the degrees of freedom (df) equal to the number of
lambda terms set equal to zero.  Therefore, the $L^2$ statistic tests the residual frequency that is not
accounted for by the effects in the model (the $\lambda$ parameters set equal to zero). The larger the $L^2$
relative to the available degrees of freedom, the more the expected frequencies depart from the
actual cell entries.  Therefore, larger $L^2$ values indicate that the model does not fit the data
well and thus, the model should be rejected. Consult Tabachnick and Fidell (1996) for a full
explanation on how to compute the $L^2$ statistic.
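
As a worked check of the likelihood ratio formula, this Python sketch computes it for the independence model fitted to the turnout table. The function name `likelihood_ratio` is illustrative; treating cells with an observed count of zero as contributing nothing is an assumption of the sketch (it is the usual limiting convention).

```python
import math


def likelihood_ratio(observed, expected):
    """Two times the sum of observed * ln(observed / expected) over cells."""
    return 2.0 * sum(f * math.log(f / big_f)
                     for f, big_f in zip(observed, expected) if f > 0)


# Cells in the order: voted/member, voted/none, not voted/member, not voted/none.
observed_cells = [689, 298, 232, 254]

# Expected frequencies under independence: row total * column total / N.
rows, cols, n = [987, 486], [921, 552], 1473
expected_cells = [r * c / n for r in rows for c in cols]

l2_independence = likelihood_ratio(observed_cells, expected_cells)
l2_saturated = likelihood_ratio(observed_cells, observed_cells)

print(round(l2_independence, 2), round(l2_saturated, 2))
```

The independence model has one lambda term set to zero, so its statistic is referred to a chi-square distribution with 1 df; the saturated model fits exactly and yields zero.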

It is often found that more than one model provides an adequate fit to the data as indicated by the
non-significance of the likelihood ratio.  At this point, the likelihood ratio can be used to
compare an overall model with a smaller, nested model (i.e. comparing a saturated model with
one interaction or main effect dropped to assess the importance of that term).  The equation is as
follows:

$$L^2_{\text{comparison}} = L^2_{\text{model1}} - L^2_{\text{model2}}$$

Model 1 is the model nested within model 2.  The degrees of freedom (df) are calculated by
subtracting the df of model 2 from the df of model 1.

If the $L^2$ comparison statistic is not significant, then the nested model (1) is not significantly
worse than the more general model (2).  Therefore, choose the more parsimonious (nested) model.

Following is a table that is often created to aid in the comparison of models.  Based on the above
equation, if we wanted to compare the independence model (model 11 in the table) with the
saturated model (model 1 in the table), we would compute $L^2$
comparison = 66.78 - 0.00, which yields an $L^2$ comparison of 66.78.  The df would be computed
by subtracting 0 from 1, yielding a df of 1.  The $L^2$ comparison figure is significant; therefore, we
cannot eliminate the interaction effect term VM from the model.  Thus, the best fitting model in
this case is the saturated model.

Comparisons Among Models

                              Effect Parameters                 Likelihood Ratio
Model  Fitted Marginals   η        τ_1^V   τ_1^M   τ_11^VM      L2       d.f.   p
1      {VM}               331.66   1.37    0.83    0.80           0.00   0      -
11     {V}{M}             335.25   1.43    0.77    1.00*         66.78   1      .001
12     {V}                346.3    1.43    1.00*   1.00*        160.22   2      .001
13     {M}                356.51   1.00*   1.29    1.00*        240.63   2      .001
14     { }                368.25   1.00*   1.00*   1.00*        334.07   3      .001

* Set to 1.00 by hypothesis  (Note: Table is taken from Knoke and Burke, 1980)
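
The comparison of nested models can be sketched as follows. The likelihood ratio and df values are copied from the table above; `chi2_sf` is a hypothetical helper written for this example that evaluates the upper-tail chi-square probability for whole-number df using only the Python standard library, so it stands in for whatever routine a statistics package would provide.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    y = x / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # ... then step the shape parameter up by 1 until it reaches df / 2.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


# (likelihood ratio, df) pairs from the comparison table.
nested_model = (66.78, 1)     # table model 11: independence
general_model = (0.00, 0)     # table model 1: saturated

l2_comparison = nested_model[0] - general_model[0]
df_comparison = nested_model[1] - general_model[1]
p_value = chi2_sf(l2_comparison, df_comparison)

print(l2_comparison, df_comparison, p_value < 0.001)
```

A significant result, as here, means the dropped term cannot be removed without a real loss of fit.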

Loglinear Residuals

In order to further investigate the quality of fit of a model, one could evaluate the individual cell
residuals.  Residual frequencies can show why a model fits poorly or can point out the cells that
display a lack of fit in a generally good-fitting model (Tabachnick and Fidell, 1996).  The
process involves standardizing the residuals for each cell by dividing the difference between
frequencies observed and frequencies expected by the square root of the frequencies expected,
$(F_{obs} - F_{exp}) / \sqrt{F_{exp}}$.  The cells with the largest residuals show where the model is least
appropriate.  Therefore, if the model is appropriate for the data, the residual frequencies should
consist of both negative and positive values of approximately the same magnitude that are
distributed evenly across the cells of the table.
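
As an illustration, the standardized residuals for the independence model fitted to the turnout table can be computed as below. The variable names are illustrative.

```python
import math

observed = [[689, 298],
            [232, 254]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequencies under the independence model.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Standardized residual: (observed - expected) / sqrt(expected).
residuals = [[(observed[i][j] - expected[i][j]) / math.sqrt(expected[i][j])
              for j in range(2)] for i in range(2)]

for row in residuals:
    print([round(value, 2) for value in row])
```

Every cell here has a large residual, which is consistent with the poor fit of the independence model to these data.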

Limitations to Loglinear Models

Interpretation

The inclusion of so many variables in loglinear models often makes interpretation very difficult.

Independence

Only a between subjects design may be analyzed.  The frequency in each cell is independent of
frequencies in all other cells.

Adequate Sample Size

With loglinear models, you need to have at least 5 times as many cases as cells in your
data.  For example, if you have a 2x2x3 table (12 cells), then you need to have at least 60 cases.  If you do not
have the required number of cases, then you need to increase the sample size or eliminate one or
more of the variables.

Size of Expected Frequencies

For all two-way associations, the expected cell frequencies should be greater than one, and no
more than 20% should be less than five.  Upon failing to meet this requirement, the Type I error
rate usually does not increase, but the power can be reduced to the point where analysis of the
data is worthless.  If low expected frequencies are encountered, the following could be done:

1.      Accept the reduced power for testing effects associated with low expected frequencies.

2.      Collapse categories for variables with more than two levels, meaning you could combine two
categories to make one “new” category.  However, if you do this, associations between the
variables can be lost, resulting in a complete loss of power for testing those
associations.  Therefore, nothing has been gained.

3.      Delete variables to reduce the number of cells, but in doing so you must be careful not to
delete variables that are associated with any other variables.

4.      Add a constant to each cell (.5 is typical).  This is not recommended because power will
drop, and Type I error rate only improves minimally.

      Note:  Some packages such as SPSS add a .5 continuity correction by default.
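
The two rules of thumb above (at least 5 cases per cell; expected frequencies greater than one with no more than 20% below five) can be turned into a simple screening helper. This is a sketch only: the function name and the returned keys are hypothetical, and it checks a single list of expected frequencies rather than every two-way association in a larger table.

```python
def check_sample_requirements(expected, n_cases):
    """Screen one table's expected frequencies against the rules of thumb."""
    n_cells = len(expected)
    n_small = sum(1 for value in expected if value < 5)
    return {
        "enough_cases": n_cases >= 5 * n_cells,
        "all_greater_than_one": all(value > 1 for value in expected),
        "at_most_20pct_below_five": n_small <= 0.2 * n_cells,
    }


# Expected frequencies for the turnout table under independence.
turnout_expected = [617.13, 369.87, 303.87, 182.13]
turnout_check = check_sample_requirements(turnout_expected, n_cases=1473)

# A 2x2x3 table has 12 cells, so 60 cases is the minimum under the first rule.
# The expected values of 5.0 per cell are hypothetical.
minimum_check = check_sample_requirements([5.0] * 12, n_cases=60)

print(turnout_check)
print(minimum_check)
```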

Introduction to Neural Networks


Neural networks are the preferred tool for many predictive data mining applications
because of their power, flexibility, and ease of use. Predictive neural networks are
particularly useful in applications where the underlying process is complex, such as:

•  Forecasting consumer demand to streamline production and delivery costs.

•  Predicting the probability of response to direct mail marketing to determine which
households on a mailing list should be sent an offer.

•  Scoring an applicant to determine the risk of extending credit to the applicant.

•  Detecting fraudulent transactions in an insurance claims database.

Neural networks used in predictive applications, such as the multilayer perceptron
(MLP) and radial basis function (RBF) networks, are supervised in the sense that the
model-predicted results can be compared against known values of the target variables.
The Neural Networks option allows you to fit MLP and RBF networks and save the
resulting models for scoring.

Multilayer Perceptron
The Multilayer Perceptron (MLP) procedure produces a predictive model for one or more
dependent (target) variables based on the values of the predictor variables.

Examples. Following are two scenarios using the MLP procedure:

A loan officer at a bank needs to be able to identify characteristics that are indicative of
people who are likely to default on loans and use those characteristics to identify good
and bad credit risks. Using a sample of past customers, she can train a multilayer
perceptron, validate the analysis using a holdout sample of past customers, and then use
the network to classify prospective customers as good or bad credit risks.

A hospital system is interested in tracking costs and lengths of stay for patients admitted
for treatment of myocardial infarction (MI, or "heart attack"). Obtaining accurate
estimates of these measures allows the administration to properly manage the available
bed space as patients are treated. Using the treatment records of a sample of patients
who received treatment for MI, the administrator can train a network to predict both cost
and length of stay.

Radial Basis Function


The Radial Basis Function (RBF) procedure produces a predictive model for one or more
dependent (target) variables based on values of predictor variables.

Example. A telecommunications provider has segmented its customer base by service
usage patterns, categorizing the customers into four groups. An RBF network using
demographic data to predict group membership allows the company to customize offers
for individual prospective customers.


This procedure pastes RBF command syntax.

Classify:

Choosing a Procedure for Clustering


Cluster analyses can be performed using the TwoStep, Hierarchical, or K-Means Cluster
Analysis procedure. Each procedure employs a different algorithm for creating clusters,
and each has options not available in the others.

TwoStep Cluster Analysis. For many applications, the TwoStep Cluster Analysis
procedure will be the method of choice. It provides the following unique features:

•  Automatic selection of the best number of clusters, in addition to measures for
choosing between cluster models.

•  Ability to create cluster models simultaneously based on categorical and continuous
variables.

•  Ability to save the cluster model to an external XML file and then read that file and
update the cluster model using newer data.

Additionally, the TwoStep Cluster Analysis procedure can analyze large data files.

Hierarchical Cluster Analysis. The Hierarchical Cluster Analysis procedure is limited to
smaller data files (hundreds of objects to be clustered) but has the following unique
features:

•  Ability to cluster cases or variables.

•  Ability to compute a range of possible solutions and save cluster memberships for each
of those solutions.

•  Several methods for cluster formation, variable transformation, and measuring the
dissimilarity between clusters.

As long as all the variables are of the same type, the Hierarchical Cluster Analysis
procedure can analyze interval (continuous), count, or binary variables.

K-Means Cluster Analysis. The K-Means Cluster Analysis procedure is limited to
continuous data and requires you to specify the number of clusters in advance, but it has
the following unique features:

•  Ability to save distances from cluster centers for each object.

•  Ability to read initial cluster centers from and save final cluster centers to an external
IBM® SPSS® Statistics file.

Additionally, the K-Means Cluster Analysis procedure can analyze large data files.

TwoStep Cluster Analysis


The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural
groupings (or clusters) within a dataset that would otherwise not be apparent. The
algorithm employed by this procedure has several desirable features that differentiate it
from traditional clustering techniques:

• Handling of categorical and continuous variables. By assuming variables to be
independent, a joint multinomial-normal distribution can be placed on categorical and
continuous variables.

• Automatic selection of number of clusters. By comparing the values of a model-choice
criterion across different clustering solutions, the procedure can automatically
determine the optimal number of clusters.

• Scalability. By constructing a cluster features (CF) tree that summarizes the records,
the TwoStep algorithm allows you to analyze large data files.

Example. Retail and consumer product companies regularly apply clustering techniques
to data that describe their customers' buying habits, gender, age, income level, etc.
These companies tailor their marketing and product development strategies to each
consumer group to increase sales and build brand loyalty.

Distance Measure. This selection determines how the similarity between two clusters is
computed.

• Log-likelihood. The likelihood measure places a probability distribution on the
variables. Continuous variables are assumed to be normally distributed, while
categorical variables are assumed to be multinomial. All variables are assumed to be
independent.

• Euclidean. The Euclidean measure is the "straight line" distance between two clusters.
It can be used only when all of the variables are continuous.

Number of Clusters. This selection allows you to specify how the number of clusters is
to be determined.

• Determine automatically. The procedure will automatically determine the "best"
number of clusters, using the criterion specified in the Clustering Criterion group.
Optionally, enter a positive integer specifying the maximum number of clusters that
the procedure should consider.

• Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive
integer.

Count of Continuous Variables. This group provides a summary of the continuous
variable standardization specifications made in the Options dialog box. See the
topic TwoStep Cluster Analysis Options for more information.

Clustering Criterion. This selection determines how the automatic clustering algorithm
determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the
Akaike Information Criterion (AIC) can be specified.

Dimension Reduction: Factor analysis

Factor Analysis
Factor analysis attempts to identify underlying variables, or factors, that explain the
pattern of correlations within a set of observed variables. Factor analysis is often used in
data reduction to identify a small number of factors that explain most of the variance that
is observed in a much larger number of manifest variables. Factor analysis can also be
used to generate hypotheses regarding causal mechanisms or to screen variables for
subsequent analysis (for example, to identify collinearity prior to performing a linear
regression analysis).

The factor analysis procedure offers a high degree of flexibility:

• Seven methods of factor extraction are available.

• Five methods of rotation are available, including direct oblimin and promax for
nonorthogonal rotations.

• Three methods of computing factor scores are available, and scores can be saved as
variables for further analysis.

Example. What underlying attitudes lead people to respond to the questions on a
political survey as they do? Examining the correlations among the survey items reveals
that there is significant overlap among various subgroups of items--questions about taxes
tend to correlate with each other, questions about military issues correlate with each
other, and so on. With factor analysis, you can investigate the number of underlying
factors and, in many cases, identify what the factors represent conceptually. Additionally,
you can compute factor scores for each respondent, which can then be used in
subsequent analyses. For example, you might build a logistic regression model to predict
voting behavior based on factor scores.

Statistics. For each variable: number of valid cases, mean, and standard deviation. For
each factor analysis: correlation matrix of variables, including significance levels,
determinant, and inverse; reproduced correlation matrix, including anti-image; initial
solution (communalities, eigenvalues, and percentage of variance explained); Kaiser-
Meyer-Olkin measure of sampling adequacy and Bartlett's test of sphericity; unrotated
solution, including factor loadings, communalities, and eigenvalues; and rotated solution,
including rotated pattern matrix and transformation matrix. For oblique rotations: rotated
pattern and structure matrices; factor score coefficient matrix and factor covariance
matrix. Plots: scree plot of eigenvalues and loading plot of first two or three factors.

Scale:

Reliability Analysis
Reliability analysis allows you to study the properties of measurement scales and the
items that compose the scales. The Reliability Analysis procedure calculates a number of
commonly used measures of scale reliability and also provides information about the
relationships between individual items in the scale. Intraclass correlation coefficients can
be used to compute inter-rater reliability estimates.

Example. Does my questionnaire measure customer satisfaction in a useful way? Using
reliability analysis, you can determine the extent to which the items in your questionnaire
are related to each other, you can get an overall index of the repeatability or internal
consistency of the scale as a whole, and you can identify problem items that should be
excluded from the scale.

Statistics. Descriptives for each variable and for the scale, summary statistics across
items, inter-item correlations and covariances, reliability estimates, ANOVA table,
intraclass correlation coefficients, Hotelling's $T^2$, and Tukey's test of additivity.

Models. The following models of reliability are available:

• Alpha (Cronbach). This model is a model of internal consistency, based on the
average inter-item correlation.

• Split-half. This model splits the scale into two parts and examines the correlation
between the parts.

• Guttman. This model computes Guttman's lower bounds for true reliability.

• Parallel. This model assumes that all items have equal variances and equal error
variances across replications.

• Strict parallel. This model makes the assumptions of the Parallel model and also
assumes equal means across items.
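
As a rough illustration of the Alpha model, the sketch below computes the standardized form of Cronbach's alpha, k * r / (1 + (k - 1) * r), where k is the number of items and r is the average inter-item correlation. The item scores are hypothetical, the helper names are invented for this example, and a statistics package may report a covariance-based alpha that differs from this standardized value.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardized_alpha(items):
    """Standardized alpha from the average inter-item correlation."""
    k = len(items)
    pairs = list(combinations(items, 2))
    r_bar = sum(pearson(x, y) for x, y in pairs) / len(pairs)
    return k * r_bar / (1 + (k - 1) * r_bar)


# Hypothetical scores: three questionnaire items answered by five respondents.
items = [[4, 3, 5, 2, 4],
         [5, 3, 4, 2, 5],
         [4, 2, 5, 3, 4]]

alpha = standardized_alpha(items)
print(round(alpha, 3))
```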

NonParametric Tests:

One-Sample Nonparametric Tests


One-sample nonparametric tests identify differences in single fields using one or more
nonparametric tests. Nonparametric tests do not assume your data follow the normal
distribution.

What is your objective? The objectives allow you to quickly specify different but
commonly used test settings.

• Automatically compare observed data to hypothesized. This objective applies the
Binomial test to categorical fields with only two categories, the Chi-Square test to all
other categorical fields, and the Kolmogorov-Smirnov test to continuous fields.

• Test sequence for randomness. This objective uses the Runs test to test the
observed sequence of data values for randomness.

• Custom analysis. When you want to manually amend the test settings on the Settings
tab, select this option. Note that this setting is automatically selected if you
subsequently make changes to options on the Settings tab that are incompatible with
the currently selected objective.

This procedure pastes NPTESTS command syntax.

Forecasting:

•  Introduction to Time Series
•  Time Series Modeler
•  Apply Time Series Models
•  Seasonal Decomposition
•  Spectral Plots

Introduction to Time Series

A time series is a set of observations obtained by measuring a single variable
regularly over a period of time. In a series of inventory data, for example, the
observations might represent daily inventory levels for several months. A series
showing the market share of a product might consist of weekly market share taken
over a few years. A series of total sales figures might consist of one observation per
month for many years. What each of these examples has in common is that some
variable was observed at regular, known intervals over a certain length of time.
Thus, the form of the data for a typical time series is a single sequence or list of
observations representing measurements taken at regular intervals.
Daily inventory time series

Time   Week   Day         Inventory level
t1     1      Monday      160
t2     1      Tuesday     135
t3     1      Wednesday   129
t4     1      Thursday    122
t5     1      Friday      108
t6     2      Monday      150
...
t60    12     Friday      120

 One of the most important reasons for doing time series analysis is to try to
forecast future values of the series. A model of the series that explained the past
values may also predict whether and how much the next few values will increase or
decrease. The ability to make such predictions successfully is obviously important
to any business or scientific field.

Time Series Modeler


The Time Series Modeler procedure estimates exponential smoothing, univariate
Autoregressive Integrated Moving Average (ARIMA), and multivariate ARIMA (or transfer
function) models for time series, and produces forecasts. The procedure includes
an Expert Modeler that automatically identifies and estimates the best-fitting ARIMA or
exponential smoothing model for one or more dependent variable series, thus eliminating
the need to identify an appropriate model through trial and error. Alternatively, you can
specify a custom ARIMA or exponential smoothing model.
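
For readers who have not met exponential smoothing before, the sketch below shows its simplest form: the forecast is a running level that moves a fraction `alpha` of the way toward each new observation. This is only an illustration and not the Expert Modeler's algorithm. The function name is hypothetical, the smoothing parameter is fixed by hand here (a modeling procedure would estimate it), and the data are the first five inventory levels from the table earlier in this section.

```python
def simple_exponential_smoothing(series, alpha):
    """One-step-ahead forecast from simple exponential smoothing.

    Assumption: the level is initialised at the first observation.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for y in series[1:]:
        # Move the level part of the way toward the newest observation.
        level = alpha * y + (1 - alpha) * level
    return level

inventory = [160, 135, 129, 122, 108]   # Monday to Friday of week 1
forecast = simple_exponential_smoothing(inventory, alpha=0.5)
print(forecast)
```

A value of `alpha` near 1 makes the forecast track the most recent observation closely, while a value near 0 averages over a long history. Simple smoothing has no trend or seasonal term, so it would not capture the weekly pattern in the inventory series.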

Example. You are a product manager responsible for forecasting next month's unit sales
and revenue for each of 100 separate products, and have little or no experience in
modeling time series. Your historical unit sales data for all 100 products is stored in a
single Excel spreadsheet. After opening your spreadsheet in IBM® SPSS® Statistics, you
use the Expert Modeler and request forecasts one month into the future. The Expert
Modeler finds the best model of unit sales for each of your products, and uses those
models to produce the forecasts. Since the Expert Modeler can handle multiple input
series, you only have to run the procedure once to obtain forecasts for all of your
products. Choosing to save the forecasts to the active dataset, you can easily export the
results back to Excel.

Statistics. Goodness-of-fit measures: stationary R-square, R-square (R²), root mean
square error (RMSE), mean absolute error (MAE), mean absolute percentage error
(MAPE), maximum absolute error (MaxAE), maximum absolute percentage error
(MaxAPE), normalized Bayesian information criterion (BIC). Residuals: autocorrelation
function, partial autocorrelation function, Ljung-Box Q. For ARIMA models: ARIMA orders
for dependent variables, transfer function orders for independent variables, and outlier
estimates. Also, smoothing parameter estimates for exponential smoothing models.
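
Several of the goodness-of-fit measures listed above are simple functions of the observed and fitted values. The sketch below computes five of them under textbook definitions. The function name and the sample numbers are hypothetical, and the mean squared error is divided by the number of observations here, whereas a statistics package may instead adjust the divisor for the number of estimated parameters.

```python
import math

def fit_measures(observed, fitted):
    """Error-based fit measures; assumes no observed value is zero."""
    errors = [o - f for o, f in zip(observed, fitted)]
    abs_errors = [abs(e) for e in errors]
    pct_errors = [100 * abs(e) / abs(o) for e, o in zip(errors, observed)]
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs_errors) / n,
        "MAPE": sum(pct_errors) / n,
        "MaxAE": max(abs_errors),
        "MaxAPE": max(pct_errors),
    }

measures = fit_measures(observed=[100, 200, 400], fitted=[110, 190, 400])
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

RMSE, MAE, and MaxAE are in the units of the series, so they are hard to compare across series of different scale. MAPE and MaxAPE are percentages and are better suited to comparisons such as the 100-product example above.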

Plots. Summary plots across all models: histograms of stationary R-square, R-square
(R²), root mean square error (RMSE), mean absolute error (MAE), mean absolute
percentage error (MAPE), maximum absolute error (MaxAE), maximum absolute
percentage error (MaxAPE), normalized Bayesian information criterion (BIC); box plots of
residual autocorrelations and partial autocorrelations. Results for individual models:
forecast values, fit values, observed values, upper and lower confidence limits, residual
autocorrelations and partial autocorrelations.

Seasonal Decomposition
The Seasonal Decomposition procedure decomposes a series into a seasonal component,
a combined trend and cycle component, and an "error" component. The procedure is an
implementation of the Census Method I, otherwise known as the ratio-to-moving-average
method.

Example. A scientist is interested in analyzing monthly measurements of the ozone level
at a particular weather station. The goal is to determine if there is any trend in the data.
In order to uncover any real trend, the scientist first needs to account for the variation in
readings due to seasonal effects. The Seasonal Decomposition procedure can be used to
remove any systematic seasonal variations. The trend analysis is then performed on a
seasonally adjusted series.

Statistics. The set of seasonal factors.
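
The ratio-to-moving-average idea can be sketched in a few lines for the multiplicative case: smooth the series with a centered moving average spanning one full period, divide each observation by that average, and average the ratios season by season. This is a simplified sketch rather than the procedure's actual implementation. The function names are hypothetical, a plain mean of the ratios is assumed, and the example series is synthetic.

```python
def seasonal_factors(series, period):
    """Multiplicative seasonal factors by ratio to centered moving average."""
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2
    ratios = [[] for _ in range(period)]
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: the two end points each get half weight.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        ratios[t % period].append(series[t] / (total / period))
    means = [sum(r) / len(r) for r in ratios]
    scale = period / sum(means)         # normalise so factors average to 1
    return [m * scale for m in means]

def seasonally_adjust(series, factors):
    """Divide each observation by the factor for its season."""
    return [x / factors[t % len(factors)] for t, x in enumerate(series)]

# Synthetic quarterly series: constant level 100 with a fixed seasonal pattern.
quarterly = [80, 100, 120, 100] * 3
factors = seasonal_factors(quarterly, period=4)
adjusted = seasonally_adjust(quarterly, factors)
print([round(f, 3) for f in factors])
```

Once the seasonal factors are divided out, whatever remains in `adjusted` is the combined trend, cycle, and error, which is the series a trend analysis such as the ozone example would then work on.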

Spectral Plots
The Spectral Plots procedure is used to identify periodic behavior in time series. Instead
of analyzing the variation from one time point to the next, it analyzes the variation of the
series as a whole into periodic components of different frequencies. Smooth series have
stronger periodic components at low frequencies; random variation ("white noise")
spreads the component strength over all frequencies.

Series that include missing data cannot be analyzed with this procedure.
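
A periodogram shows how the variation in a series is distributed across frequencies. The sketch below computes one with a direct discrete Fourier transform so that it needs nothing beyond the standard library. The function name and the synthetic sine series are hypothetical, and the scaling (squared magnitude divided by the series length) is only one of several conventions, so the values need not match any particular package.

```python
import cmath
import math

def periodogram(series):
    """Return (frequency, power) pairs for frequencies k/n, k = 1 .. n//2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]   # remove the mean (frequency 0)
    result = []
    for k in range(1, n // 2 + 1):
        coef = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                   for t, x in enumerate(centered))
        result.append((k / n, abs(coef) ** 2 / n))
    return result

# Synthetic series that repeats every 8 observations.
series = [math.sin(2 * math.pi * t / 8) for t in range(32)]
peak_freq, peak_power = max(periodogram(series), key=lambda fp: fp[1])
print(peak_freq, 1 / peak_freq)   # frequency and the corresponding period
```

The peak frequency is the reciprocal of the cycle length in observations. A series with a strong annual cycle in monthly data, such as the housing starts example below, would be expected to show a peak near frequency 1/12.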

Example. The rate at which new houses are constructed is an important barometer of
the state of the economy. Data for housing starts typically exhibit a strong seasonal
component. But are there longer cycles present in the data that analysts need to be
aware of when evaluating current figures?

Statistics. Sine and cosine transforms, periodogram value, and spectral density estimate
for each frequency or period component. When bivariate analysis is selected: real and
imaginary parts of cross-periodogram, cospectral density, quadrature spectrum, gain,
squared coherency, and phase spectrum for each frequency or period component.

Plots. For univariate and bivariate analyses: periodogram and spectral density. For
bivariate analyses: squared coherency, quadrature spectrum, cross amplitude, cospectral
density, phase spectrum, and gain.

Life Tables
There are many situations in which you would want to examine the distribution of times
between two events, such as length of employment (time between being hired and
leaving the company). However, this kind of data usually includes some cases for which
the second event isn't recorded (for example, people still working for the company at the
end of the study). This can happen for several reasons: for some cases, the event simply
doesn't occur before the end of the study; for other cases, we lose track of their status
sometime before the end of the study; still other cases may be unable to continue for
reasons unrelated to the study (such as an employee becoming ill and taking a leave of
absence). Collectively, such cases are known as censored cases, and they make this
kind of study inappropriate for traditional techniques such as t tests or linear regression.

A statistical technique useful for this type of data is called a follow-up life table. The
basic idea of the life table is to subdivide the period of observation into smaller time
intervals. For each interval, all people who have been observed at least that long are
used to calculate the probability of a terminal event occurring in that interval. The
probabilities estimated from each of the intervals are then used to estimate the overall
probability of the event occurring at different time points.
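
The interval-by-interval calculation described above can be sketched as follows. The code assumes the common actuarial convention that a case censored during an interval counts as exposed for half of it. That convention, the function name, the dictionary keys, and the tiny data set are all assumptions made for illustration.

```python
def life_table(times, events, width):
    """Actuarial life table.

    times:  follow-up time for each case
    events: True if the terminal event occurred, False if censored
    width:  length of each time interval
    """
    n_intervals = int(max(times) // width) + 1
    entering = len(times)
    cum_surviving = 1.0
    rows = []
    for i in range(n_intervals):
        start = i * width
        leaving = [e for t, e in zip(times, events) if start <= t < start + width]
        terminal = sum(leaving)                 # events in this interval
        withdrawn = len(leaving) - terminal     # censored in this interval
        exposed = entering - withdrawn / 2      # assumption: half exposure
        prop_terminating = terminal / exposed if exposed > 0 else 0.0
        cum_surviving *= 1 - prop_terminating
        rows.append({
            "start": start,
            "entering": entering,
            "withdrawn": withdrawn,
            "exposed": exposed,
            "terminal": terminal,
            "prop_terminating": prop_terminating,
            "cum_surviving": cum_surviving,
        })
        entering -= len(leaving)
    return rows

# Eight hypothetical employees: months employed and whether they left.
months = [1, 2, 3, 4, 5, 6, 7, 8]
left = [True, False, True, True, False, True, False, False]
table = life_table(months, left, width=4)
for row in table:
    print(row["start"], row["exposed"], round(row["cum_surviving"], 4))
```

Censored cases still contribute to every interval they completed, which is how the life table uses information that a t test on the completed durations alone would discard.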

Example. Is a new nicotine patch therapy better than traditional patch therapy in
helping people to quit smoking? You could conduct a study using two groups of smokers,
one of which received the traditional therapy and the other of which received the
experimental therapy. Constructing life tables from the data would allow you to compare
overall abstinence rates between the two groups to determine if the experimental
treatment is an improvement over the traditional therapy. You can also plot the survival
or hazard functions and compare them visually for more detailed information.

Statistics. Number entering, number leaving, number exposed to risk, number of
terminal events, proportion terminating, proportion surviving, cumulative proportion
surviving (and standard error), probability density (and standard error), and hazard rate
(and standard error) for each time interval for each group; median survival time for each
group; and Wilcoxon (Gehan) test for comparing survival distributions between groups.
Plots: function plots for survival, log survival, density, hazard rate, and one minus
survival.

MULTIPLE RESPONSE:

MISSING VALUE:

Missing Value Analysis


The Missing Value Analysis procedure performs three primary functions:

• Describes the pattern of missing data. Where are the missing values located? How
extensive are they? Do pairs of variables tend to have values missing in multiple
cases? Are data values extreme? Are values missing randomly?

• Estimates means, standard deviations, covariances, and correlations for different
missing value methods: listwise, pairwise, regression, or EM (expectation-
maximization). The pairwise method also displays counts of pairwise complete cases.

• Fills in (imputes) missing values with estimated values using regression or EM
methods; however, multiple imputation is generally considered to provide more
accurate results.

Missing value analysis helps address several concerns caused by incomplete data. If
cases with missing values are systematically different from cases without missing values,
the results can be misleading. Also, missing data may reduce the precision of calculated
statistics because there is less information than originally planned. Another concern is
that the assumptions behind many statistical procedures are based on complete cases,
and missing values can complicate the theory required.
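
The difference between the listwise approach and approaches that use every available value is easy to see on a small example. The sketch below is only an illustration of the idea and not the Missing Value Analysis algorithms. The function names are hypothetical, missing values are represented as `None`, and the data are made up.

```python
def available_case_means(rows):
    """Mean of each variable using every nonmissing value of that variable."""
    means = []
    for j in range(len(rows[0])):
        values = [r[j] for r in rows if r[j] is not None]
        means.append(sum(values) / len(values))
    return means

def listwise_means(rows):
    """Means computed only from cases with no missing values at all."""
    complete = [r for r in rows if all(v is not None for v in r)]
    return available_case_means(complete)

def pairwise_counts(rows):
    """Number of cases in which both variables of each pair are present."""
    n_vars = len(rows[0])
    return [[sum(1 for r in rows if r[i] is not None and r[j] is not None)
             for j in range(n_vars)]
            for i in range(n_vars)]

# Four hypothetical cases measured on two variables.
data = [(1.0, 10.0), (2.0, None), (None, 30.0), (4.0, 40.0)]
print(listwise_means(data))
print(available_case_means(data))
print(pairwise_counts(data))
```

Listwise deletion keeps only two of the four cases here, which illustrates the loss of precision mentioned above. The two sets of means also differ, and when the incomplete cases differ systematically from the complete ones such gaps can be misleading.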

Example. In evaluating a treatment for leukemia, several variables are measured.
However, not all measurements are available for every patient. The patterns of missing
data are displayed, tabulated, and found to be random. An EM analysis is used to
estimate the means, correlations, and covariances. It is also used to determine that the
data are missing completely at random. Missing values are then replaced by imputed
values and saved into a new data file for further analysis.

Statistics. Univariate statistics, including number of nonmissing values, mean, standard
deviation, number of missing values, and number of extreme values. Estimated means,
covariance matrix, and correlation matrix, using listwise, pairwise, EM, or regression
methods. Little's MCAR test with EM results. Summary of means by various methods. For
groups defined by missing versus nonmissing values: t tests. For all variables: missing
value patterns displayed cases-by-variables.

Analyze Patterns (Multiple Imputation)


Analyze Patterns provides descriptive measures of the patterns of missing values in the
data, and can be useful as an exploratory step before imputation. This is a Multiple
Imputation procedure.

Example. A telecommunications provider wants to better understand service usage
patterns in its customer database. They have complete data for services used by their
customers, but the demographic information collected by the company has a number of
missing values. Analyzing the patterns of missing values can help determine next steps
for imputation.

Complex Samples Plan


Complex Samples analysis procedures require analysis specifications from an analysis or
sample plan file in order to provide valid results.

Plan. Specify the path of an analysis or sample plan file.

Joint Probabilities. In order to use Unequal WOR estimation for clusters drawn using a
PPS WOR method, you need to specify a separate file or an open dataset containing the
joint probabilities. This file or dataset is created by the Sampling Wizard during sampling.

Control Charts
Control Charts allows you to make selections that determine the type of chart you obtain.
Select the icon for the chart type you want and then select the option under Data
Organization that best describes your data.

You can click on the icons below to browse examples of available chart types. If you find
an example that looks like the chart you want, click on the How To button next to that
example for specific instructions on how to create that chart.

X-bar, R, s

Individuals, Moving Range

p, np

c, u

ROC Curves
This procedure is a useful way to evaluate the performance of classification schemes in
which there is one variable with two categories by which subjects are classified.

Example. It is in a bank's interest to correctly classify customers into those customers
who will and will not default on their loans, so special methods are developed for making
these decisions. ROC curves can be used to evaluate how well these methods perform.

Statistics. Area under the ROC curve with confidence interval and coordinate points of
the ROC curve. Plots: ROC curve.

Methods. The estimate of the area under the ROC curve can be computed either
nonparametrically or parametrically using a binegative exponential model.
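
The nonparametric estimate of the area under the ROC curve has a simple interpretation: it is the proportion of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The sketch below computes it directly from that definition. The function name and the scores are hypothetical, and the confidence interval that the procedure reports is not reproduced here.

```python
def roc_auc(scores, labels):
    """Nonparametric AUC: share of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one case in each category")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0      # positive case ranked above negative case
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical default-risk scores; label 1 marks customers who defaulted.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 4))
```

An area of 0.5 corresponds to a classification scheme that ranks cases no better than chance, and 1.0 to one that separates the two categories perfectly.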
