Data Screening Checklist

A Practical Guide to the Use of Selected Multivariate Statistics

This document is set up to allow the user of multivariate statistics to get assistance in the
choice of multivariate technique to use and the state the data must be in to use the desired
technique. A series of links enable the user to view areas of interest in relation to
multivariate statistical theory, use, and application.
Because the scientists who took part in the project are anglophone, A Practical Guide to the Use of Selected Multivariate Statistics is offered in English only.

Mike Wulder
Research Scientist
Canadian Forest Service
Pacific Forestry Centre
506 West Burnside Road
Victoria, BC V8Z 1M5

Main Index:
1. Introduction to this Document

2. Summary of Multivariate Statistics

3. Data Screening

4. Multiple Correlation and Regression

5. Principal Components Analysis and Factor Analysis

6. Discriminant Analysis


7. Cluster Analysis


8. Spatial Autocorrelation


1. Census data.


2. Correlation Coefficient and Coefficient of Variation


3. Homoscedasticity


4. Linearity


5. Multivariate General Linear Hypothesis (MGLH)


6. Missing Data


7. Multicollinearity and singularity


8. Normality


9. Orthogonality


10. Outliers


11. Residual Plots


12. Data Transformations


13. References


1: Introduction to this document

The objective of Graduate Geography 616, Multivariate Statistics, with Dr. Barry Boots,
was to familiarize the participant with a range of multivariate statistical procedures and
their geographical applications. This was accomplished in two ways, through lectures
which provided an introduction to the underlying theories, and through subsequent
application of the theory on a typical geographic data set. Stress was placed upon the
limitations and validity of individual procedures in sample empirical contexts and the
ability to interpret the results of any particular analysis. The multivariate statistical
techniques examined were all applications of the multivariate general linear hypothesis.
The four principal techniques which were examined, and which will be assessed, are:

- multiple correlation and regression,
- principal components and factor analysis,
- discriminant analysis, and
- cluster analysis.

To properly apply the general linear model, the data input to such analysis must meet
specific constraints and criteria, such as the rules of normality, linearity,
homoscedasticity, and non-multicollinearity. To accomplish this, a guide to data screening
is presented. Spatial autocorrelation also affects many geographic data sets and may need
to be addressed.
The goal of this document is to present a summary of the statistical techniques presented
and to demonstrate each with examples and illustrations. Consult the references list for
greater depth than is appropriate for a document of this scope. The data set which is used
for this study is an excerpt of Canada Census data for 1991 (see appendix 1).

2: Summary of Multivariate Statistics

Multivariate statistics provide the ability to analyze complex sets of data. They allow
analysis where there are many independent variables (IVs) and possibly several
dependent variables (DVs), all correlated with one another to varying degrees. The
ready availability of software application programs which can handle the complexity of
large multivariate data sets has increased and popularized the use of multivariate
statistics. Current scientific methodology increasingly seeks the complex
relationships between variables in an attempt to provide more holistic, inclusive
studies and models. Assessment of results is iterative and stochastic.

For analysis involving multivariate statistics, an appropriate data set is composed of
values related to a number of variables for a number of subjects. Accordingly,
appropriate data sets may be organized as a data matrix, a correlation matrix, a
variance-covariance matrix, a sum-of-squares and cross-products matrix, or a sequence of
residuals (Tabachnick and Fidell, 1989).

3: Data Screening to Assess Validity/Quality of Input

Checklist for screening data. Prior to the main analysis, it is important to examine
the data, because characteristics of the data may affect the results. Table 1,
after Tabachnick and Fidell (1989), provides an appropriate sequence for screening
the proposed data. The order of the screening is important, as decisions at the earlier steps
influence decisions to be taken at later steps. For example, if the data are both non-normal
and contain outliers, a choice must be made between deleting values and transforming the data. If
transformation is undertaken first, there are likely to be fewer outliers; if the outliers
are deleted or modified first, there are likely to be fewer variables with non-normality.
Transformation of the variable is usually preferred, as it typically reduces the number of
outliers and is more likely to produce normality, linearity, and homoscedasticity.
Screening of the input data will help assess whether a particular data set is appropriate
for use. Screening will aid in the isolation of data peculiarities and allow the data to be
adjusted in advance of further multivariate analysis. The checklist isolates key decision
points which need to be assessed to prevent analysis problems induced by poor data.
Consideration and resolution of problems encountered in the screening of a data set is
necessary to ensure a robust statistical assessment.
Table 1. Checklist for Screening Data
1. Inspect univariate descriptive statistics for accuracy of input
   a. out-of-range values (be aware of measurement scales)
   b. plausible means and standard deviations
   c. coefficient of variation
2. Evaluate amount and distribution of missing data; deal with problem
3. Independence of variables
4. Identify and deal with nonnormal variables
   a. check skewness and kurtosis, probability plots
   b. transform variables (if desirable)
   c. check results of transformations
5. Identify and deal with outliers
   a. univariate outliers
   b. multivariate outliers
6. Check pairwise plots for nonlinearity and heteroscedasticity
7. Evaluate variables for multicollinearity and singularity
8. Check for spatial autocorrelation

How to deal with the results of screening the data, and what to look for.
1. Inspect univariate descriptive statistics for accuracy of input
Generate summary statistics (such as mean, median, standard deviation, minimum, and
maximum) for all the variables, and bivariate correlations between all variables.
(See transformations page for an example.)
Check the range of each variable and ensure that no values fall outside the appropriate
range. In the context of the given data set, assess whether the means and standard
deviations are plausible, and use the coefficient of variation to compare variability
across variables. Inspect the bivariate correlations,
especially if a correlation matrix is to be used as the input data form for
multivariate analysis. Correlations may be inflated, deflated, or inaccurately calculated.
Inflated correlations may be due to repetition of a variable; deflated correlations may be
due to the restricted range of a variable. Observing the bivariate relationships may also
reveal variables which are multicollinear or singular, problems that are dealt with in a
subsequent section.
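The first screening step can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the original course: the variable names (`PCT_A`, `VAR_B`), the valid ranges, and the toy values are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def screen_univariate(data, names, valid_range):
    """Summary statistics per variable, plus a count of out-of-range values."""
    report = {}
    for j, name in enumerate(names):
        col = data[:, j]
        lo, hi = valid_range[name]
        mean = col.mean()
        sd = col.std(ddof=1)  # sample standard deviation
        report[name] = {
            "mean": mean,
            "median": float(np.median(col)),
            "sd": sd,
            "min": col.min(),
            "max": col.max(),
            # coefficient of variation: SD relative to the mean
            "cv": sd / mean if mean != 0 else float("nan"),
            "n_out_of_range": int(np.sum((col < lo) | (col > hi))),
        }
    return report

# Hypothetical toy data: five cases, two variables.
data = np.array([[10.0, 1.0],
                 [12.0, 2.0],
                 [11.0, 3.0],
                 [13.0, 4.0],
                 [150.0, 5.0]])  # 150 is a deliberate data-entry error
names = ["PCT_A", "VAR_B"]
valid = {"PCT_A": (0.0, 100.0), "VAR_B": (0.0, 10.0)}

report = screen_univariate(data, names, valid)
corr = np.corrcoef(data, rowvar=False)  # bivariate correlation matrix
```

The out-of-range count flags the erroneous 150 in the first variable, which would otherwise distort both its mean and its correlations.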
2. Evaluate amount and distribution of missing data: deal with problem.
3. Independence of variables: check data for orthogonality.
4. Identify and deal with nonnormal variables: assess for normality, skewness, and kurtosis.
5. Identify and deal with outliers.
6. Check pairwise plots for nonlinearity and heteroscedasticity.
7. Evaluate variables for multicollinearity and singularity.
8. Check for spatial autocorrelation.
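The check for nonnormal variables, and the effect of a transformation on it, can be illustrated as follows. This is a sketch under stated assumptions: the simulated `income` variable is hypothetical, the moment-based skewness and kurtosis formulas are the simple population versions (statistical packages often apply small-sample corrections), and the log transform is only one of several possible transformations.

```python
import numpy as np

def skewness(x):
    """Moment-based skewness (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def excess_kurtosis(x):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

# Hypothetical positively skewed variable (simulated, not census data).
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=0.8, size=2000)

skew_raw = skewness(income)          # strongly positive
skew_log = skewness(np.log(income))  # close to zero after transformation
kurt_log = excess_kurtosis(np.log(income))
```

Re-checking the statistics after the transformation corresponds to the "check results of transformations" item in Table 1.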

4: Multiple Correlation and Regression

Regression analyses are a set of statistical techniques which allow one to assess the
relationship between one dependent variable (DV) and several independent variables
(IVs). Multiple regression is an extension of bivariate regression in which several IVs are
combined to predict the DV. Regression may be assessed in a variety of manners, such as:
1. partial regression and correlation
- Isolates the specific effect of a particular independent variable while controlling for the
effects of the other independent variables; that is, the relationship between pairs of variables
while recognizing the relationship with other variables.
2. multiple regression and correlation
- The combined effect of all the variables acting on the dependent variable, for a net,
combined effect. The resulting R2 value provides an indication of the goodness of
fit of the model. The multivariate regression equation is of the form:
Y = A + B_1 X_1 + B_2 X_2 + ... + B_k X_k + E
where
Y = the predicted value on the DV,
A = the Y intercept, the value of Y when all Xs are zero,
X_1 ... X_k = the various IVs,
B_1 ... B_k = the various coefficients assigned to the IVs during the regression,
E = an error term.
Accordingly, a different Y value is derived for each different case of the IVs. The goal of the
regression is then to derive the B values, the regression coefficients, or beta coefficients.
The beta coefficients allow the computation of reasonable Y values with the regression
equation, such that the calculated values are close to the actual measured values.
Computation of the regression coefficients provides two major results:
1. minimization of deviations (residuals) between predicted and obtained Y values
for the data set,
2. optimization of the correlation between predicted and obtained Y values for the
data set.
As a result, the correlation between the obtained and predicted values for Y relates the
strength of the relationship between the DV and IVs.
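The two results described above can be seen in a small least-squares sketch. It assumes NumPy and uses simulated data with two hypothetical IVs; it is an illustration of the general technique, not a reproduction of the SYSTAT analysis discussed later.

```python
import numpy as np

# Simulated data: two hypothetical IVs and a DV with known coefficients.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# A leading column of ones lets the solver estimate the intercept A.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # minimizes squared residuals
A, B = coef[0], coef[1:]

y_hat = design @ coef      # predicted Y values
residuals = y - y_hat      # obtained minus predicted

# Multiple R is the correlation between obtained and predicted Y;
# R2 is the proportion of variation in the DV explained by the IVs.
R = np.corrcoef(y, y_hat)[0, 1]
R2 = 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
```

For a least-squares fit with an intercept, squaring the multiple R gives the same value as R2 computed from the residuals, which is why the two are reported together.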
Although regression analyses reveal relationships between variables this does not imply
that the relationships are causal. Demonstration of causality is not a statistical problem,
but an experimental and logical problem.
The ratio of cases to independent variables must be large to avoid a meaningless (perfect)
solution: with more IVs than cases, a regression solution may be found which
perfectly predicts the DV for each case. As a rule of thumb there should be approximately
20 times more cases than IVs for good results, yet at a bare minimum 5 times more cases
than IVs may be used. Be aware that cases with missing values are generally deleted in
the calculation by default. Extreme cases (outliers) have a strong effect on the regression
solution and should be dealt with. Calculation of the regression coefficients requires
matrix inversion, which is possible only when the variables are not multicollinear or
singular. The examination of residual plots will assist in assessing whether the results
meet the assumptions of normality, linearity, and homoscedasticity between predicted
DV scores and errors of prediction. The assumptions of the analysis are:

- that the residuals (the differences between predicted and obtained scores) are
normally distributed,
- that the residuals have a straight line relationship with predicted DV scores, and
- that the variance of the residuals about the predicted scores is the same for all predicted
scores.

Prior to processing of the data as input to a multiple regression model, the data should be
screened. The simple mathematics involved and the ubiquity of programs capable of
computing regression result in the misuse of regression procedures (Wetherill, 1986).
Sample beta coefficient table for the interpretation of regression results
(Sample table generated with SYSTAT, 1992)

How to interpret regression results:

- The dependent Y variable should be stated, along with the number of cases (e.g.
AVGINC, N = 290).
- Parameters: the intercept, or constant value (e.g. 47170).
- Partial regression coefficients, or "b" values (e.g. POP91 = 0.044).
- Standard error of the "b" values (e.g. POP91 = 0.009).
- A significance test, such as a t-test, of the partial regression coefficients, and
a "p" value, the probability associated with the test. (Significance tests and p values are
generated for all the beta coefficients and for the model itself.)
- Significance of the model: look for a significant F-ratio or t-test and the
corresponding p (probability) value. In the ANOVA section of the sample table,
the F statistic is significant, as is the p value; therefore the model is significant.
- To find which of the independent variables are significant, look again at the t and p
values associated with each independent variable. Significance generally requires
a high t value and a low p value. (Generally an absolute t value > 2 is
considered significant.)
- The beta values, the partial regression coefficients, represent the importance of the
variable. As can be seen with MUNEMP, the high beta value, which also has the highest
standard error of beta, corresponds to significant p and t values.
- The standard error values relate the range of uncertainty of the beta values.
- The different R values summarize the fit of the model. The multiple R
is highest, at 0.649, and relates the correlation between predicted and obtained
values. When squared to reflect the proportion of variation in the DV that the IVs
explain, the value is reduced to 0.421. When controlling for sample size with the
adjusted R2, the model accounts for 0.41 of the variation in average family
income. The slight change between the R2 and adjusted R2 values reflects the
large sample size of the data set. In short, 42% of the variation in average family income,
the dependent Y variable, can be explained by the combination of these seven
independent variables.
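The R values quoted above can be checked with a few lines of arithmetic. The sketch below assumes the common adjustment formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), and assumes that N = 290 is the number of cases and that there are seven IVs; the source does not state which formula SYSTAT applied.

```python
def adjusted_r2(r2, n_cases, n_ivs):
    """Common adjusted R2 formula (assumed; not confirmed by the source)."""
    return 1.0 - (1.0 - r2) * (n_cases - 1) / (n_cases - n_ivs - 1)

R = 0.649           # multiple R reported in the sample table
r2 = R ** 2         # proportion of variation explained
adj = adjusted_r2(r2, n_cases=290, n_ivs=7)

print(round(r2, 3), round(adj, 2))
```

Under these assumptions the squared value rounds to 0.421 and the adjusted value to 0.41, matching the figures in the text; the adjustment is small because 290 cases is large relative to seven IVs.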

Some packages, such as SPSS, generate a VIF and a tolerance value for each independent variable.

- The VIF, or variance inflation factor, reflects the presence or absence of
multicollinearity. The VIF ranges from 1 to infinity: a value of 1 indicates that the
variable is uncorrelated with the other IVs, and the further the VIF rises above 1, the
more the variable may be affected by multicollinearity.
- Tolerance is the reciprocal of the VIF and ranges from zero to one. The closer the
tolerance value is to zero, the greater the level of multicollinearity.
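The relationship between VIF and tolerance can be illustrated directly. This sketch relies on the identity that the diagonal of the inverse correlation matrix of the IVs equals 1 / (1 - R_j^2), where R_j^2 comes from regressing variable j on the other IVs. The three simulated variables are hypothetical, and the code is not a reproduction of the SPSS output.

```python
import numpy as np

def vif_and_tolerance(X):
    """VIF_j = 1 / (1 - R_j^2); tolerance_j = 1 - R_j^2 = 1 / VIF_j."""
    corr = np.corrcoef(X, rowvar=False)
    vif = np.diag(np.linalg.inv(corr))  # diagonal of the inverse correlation matrix
    return vif, 1.0 / vif

rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = rng.normal(size=500)               # unrelated to a
c = a + 0.1 * rng.normal(size=500)     # nearly a copy of a (multicollinear)

vif, tol = vif_and_tolerance(np.column_stack([a, b, c]))
```

Here the VIF values for `a` and `c` are very large and their tolerances are near zero, while `b` has a VIF close to 1. Perfectly singular variables would make the correlation matrix non-invertible, which is the matrix inversion problem mentioned earlier in this section.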

As mentioned above, the results of the regression should be assessed to reflect the quality
of the model, especially if the data were not screened. See transformations, linearity,
homoscedasticity, residual plot, multivariate normality, and multicollinearity or
singularity.
5: Principal Components Analysis and Factor Analysis

Principal components analysis (PCA) and factor analysis (FA) are statistical techniques
applied to a single set of variables to discover which variables in the set form
coherent subsets that are relatively independent of one another. Variables that are
correlated with one another, and which are also largely independent of other subsets of
variables, are combined into factors. The factors which are generated are thought to be
representative of the underlying processes that have created the correlations among
variables.
PCA and FA can be exploratory in nature; FA is used as a tool in attempts to reduce a
large set of variables to a smaller, more meaningful set of variables. As both FA and PCA
are sensitive to the magnitude of correlations, robust correlations are needed to ensure
the quality of the analysis. Accordingly, PCA and FA are sensitive to outliers, missing
data, and poor correlations between variables due to poorly distributed variables. (See
normality link for more information on distributions.) As a result, data transformations
have a large impact upon FA and PCA.
Correlation coefficients tend to be less reliable when estimated from small sample sizes.
In general, at least five cases are needed for each observed variable. Missing
data need to be dealt with to provide the best possible relationships between variables.
Fitting missing data through regression techniques is likely to overfit the data, making
correlations unrealistically high, and may as a result manufacture factors.
Normality provides for an enhanced solution, but some inference may still be derived
from nonnormal data. Multivariate normality also implies that the relationships between
variables are linear. Linearity is required to ensure that correlation coefficients are
generated from appropriate data, meeting the assumptions necessary for the use of the
general linear model. Univariate and multivariate outliers need to be screened out due to
their heavy influence upon the calculation of correlation coefficients, which in turn has a
strong influence on the calculation of factors. In PCA multicollinearity is not a problem,
as matrix inversion is not required, yet for most forms of FA singularity and
multicollinearity are a problem. If the determinant of R and the eigenvalues associated with
some factors approach zero, multicollinearity or singularity may be present. Deletion of
singular or multicollinear variables is required.
Uses of Principal Components Analysis and Factor Analysis
Direct Uses:

identification of groups of inter-related variables,

reduction of number of variables,

Indirect Uses:

a method of transforming data. Transformation of data through rewriting the data

with properties the original data did not have. The data may be efficiently
simplified prior to a classification while also removing artifacts such as

Theory of Common Factor Analysis (PCA and FA)

The key idea underlying Common Factor Analysis (PCA and FA) is that the chosen
variables can be transformed into linear combinations of an underlying set of
hypothesized or unobserved components (factors). Factors may either be associated with
2 or more of the original variables (common factors) or associated with an individual
variable (unique factors). Loadings relate the specific association between factors and
original variables. Therefore, it is necessary to find the loadings, then solve for the
factors, which will approximate the relationship between the original variables and
underlying factors. The loadings are derived from the magnitude of the eigenvalues
associated with individual variables.

The difference between PCA and FA is that, for the purposes of matrix
computations, PCA assumes that all variance is common, with all unique factors set equal
to zero, while FA assumes that there is some unique variance. The level of unique
variance is dictated by the FA model which is chosen. Accordingly, PCA is a model of a
closed system, while FA is a model of an open system.
Rotation attempts to put the factors in a simpler position with respect to the original
variables, which aids in the interpretation of factors. Rotation places the factors into
positions such that only the variables which are distinctly related to a factor will be
associated with it. Varimax, quartimax, and equimax are all orthogonal rotations, while
oblique rotations are non-orthogonal. The varimax rotation maximizes the variance of the
loadings, and is also the most commonly used.
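A varimax rotation can be sketched with the widely used SVD-based iteration. This is an illustrative implementation under stated assumptions, not the routine used by any particular package: the loading matrix is hypothetical, no Kaiser normalization is applied, and the function name and parameters are chosen here for the example.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation.
        grad = loadings.T @ (
            rotated ** 3 - rotated * (np.sum(rotated ** 2, axis=0) / p)
        )
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt              # nearest orthogonal matrix
        new_criterion = s.sum()
        if new_criterion < criterion * (1.0 + tol):
            break                      # no meaningful improvement
        criterion = new_criterion
    return loadings @ rotation, rotation

# Hypothetical unrotated loadings: four variables on two factors.
loadings = np.array([[0.7, 0.5],
                     [0.6, 0.6],
                     [0.5, -0.6],
                     [0.6, -0.5]])
rotated, rotation = varimax(loadings)

# Communalities (row sums of squared loadings) are unchanged by rotation.
communalities_before = np.sum(loadings ** 2, axis=1)
communalities_after = np.sum(rotated ** 2, axis=1)
```

Because the rotation matrix is orthogonal, the factors remain uncorrelated and each variable's communality is preserved; only the distribution of loadings across factors changes.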

To run a PCA or FA
To analyze data with either PCA or FA 3 key decisions must be made:

the factor extraction method,

the number of factors to extract, and
the transformation method to be used.

Interpretation of a Factor Analysis

Determination of number of factors to extract

- Significance test: the assumptions required for significance tests are difficult to meet,
therefore the following heuristics are used.
- Magnitude of eigenvalues: assess the amount of original variance accounted for. Retain
factors whose eigenvalues are greater than 1. (Ignore those with eigenvalues less than one,
as the factor is accounting for less variance than an original variable.) The figure
relates that five factors are significant.
- Substantive importance: an absolute test of eigenvalues in a proportional sense. Retain
any eigenvalue that accounts for at least 5% of the variance.
- Scree test: plot the magnitude of the eigenvalues (Y axis) versus components (X axis), and
retain factors which are above the inflection point of the slope.
- A battery of tests, where the above heuristics may all be applied: assess the magnitude
of the eigenvalues, substantive importance, and a scree test.
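The first two heuristics in the list above are easy to compute once the eigenvalues of the correlation matrix are known. The sketch below uses a hypothetical four-variable correlation matrix (not the census data) and assumes NumPy; the scree test is left to visual inspection of the sorted eigenvalues.

```python
import numpy as np

def retained_factors(corr):
    """Apply the eigenvalue > 1 and the 5%-of-variance heuristics."""
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest first
    proportion = eigvals / eigvals.sum()               # share of total variance
    return {
        "eigenvalues": eigvals,
        "proportion": proportion,
        "eigenvalue_gt_1": int(np.sum(eigvals > 1.0)),
        "five_percent": int(np.sum(proportion >= 0.05)),
    }

# Hypothetical correlation matrix with two pairs of correlated variables.
corr = np.array([[1.0, 0.9, 0.0, 0.0],
                 [0.9, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.6],
                 [0.0, 0.0, 0.6, 1.0]])
result = retained_factors(corr)
```

For this matrix the eigenvalues are 1.9, 1.6, 0.4, and 0.1, so the eigenvalue rule retains two factors while the 5% rule retains three. Disagreement of this kind is why the battery-of-tests approach is suggested.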

Which variables are best summarized by the model?

interpret communalities (final estimates of communalities),

high = most important,
low = least important.

The communalities relate the overall effect of the factors.

Naming of Factors

Look at individual factor scores and see which variables have the highest factor
scores. Also look at the factor scores to see if the initial interpretations are
confirmed by them. (Factor scores are normally distributed only when
the input variables are normally distributed. Therefore, when interpreting factors
the greatest concern is with the tail values. The normal distribution of factor
scores also acts as a data transformation and prepares the data for other
multivariate analyses.)


What is meant by an Ill-conditioned Correlation Matrix

An ill-conditioned correlation matrix is a manifestation of multicollinearity. FA is
sensitive to an ill-conditioned matrix while PCA is not. To solve the
characteristic equation in FA, matrix inversion is required, which is not possible
with a singular matrix. To address this problem, see the multicollinearity link.

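To assess the value of input variables to the model

Assess the Kaiser-Meyer-Olkin measure of sampling adequacy (KMO), which ranges
from 0 to 1; the heuristic labels below cover values from 0.5 to 0.9.

For example: MINES is a less valid variable than POPDEN in this model.
A value near 1 indicates that the partial correlations among the variables are small
relative to their simple correlations, so the variables share enough common variance to
be factored. The range which is provided as a heuristic is:
0.9 - marvelous,
0.8 - meritorious,
0.7 - middling,
0.6 - mediocre, or
0.5 - miserable.
Values below 0.5 are generally considered unacceptable for factoring.
Bartlett's test of sphericity evaluates whether the correlation matrix differs
significantly from an identity matrix, that is, from the spherical scatter in n-dimensional
space that uncorrelated variables would produce. A significant p value indicates that the
variables are sufficiently correlated to proceed. (See figure above, where the p value is
significant, and does not fit in the allocated space.)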
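Both measures can be computed from the correlation matrix alone. The sketch below assumes the standard definitions: the overall KMO compares squared correlations with squared partial correlations (obtained from the inverse correlation matrix), and Bartlett's statistic is -(n - 1 - (2p + 5)/6) ln|R| with p(p - 1)/2 degrees of freedom. The three-variable matrix and the sample size of 100 are hypothetical.

```python
import numpy as np

def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    inv = np.linalg.inv(corr)
    d = np.diag(inv)
    partial = -inv / np.sqrt(np.outer(d, d))   # partial correlations
    off = ~np.eye(corr.shape[0], dtype=bool)   # off-diagonal mask
    r2 = np.sum(corr[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)

def bartlett_sphericity(corr, n_cases):
    """Chi-square statistic and degrees of freedom for Bartlett's test."""
    p = corr.shape[0]
    chi2 = -(n_cases - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(corr))
    dof = p * (p - 1) // 2
    return chi2, dof

# Hypothetical matrix: three variables, all pairwise correlations 0.6.
corr = np.array([[1.0, 0.6, 0.6],
                 [0.6, 1.0, 0.6],
                 [0.6, 0.6, 1.0]])
kmo_value = kmo(corr)                                  # about 0.72, "middling"
chi2, dof = bartlett_sphericity(corr, n_cases=100)     # about 101.5 on 3 df
```

The chi-square statistic is compared with the critical value for the given degrees of freedom (7.815 for 3 degrees of freedom at the 0.05 level), so the hypothetical matrix above departs clearly from an identity matrix.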


6: Discriminant Analysis
The main use of discriminant analysis is to predict group membership from a set of
predictors. Discriminant function analysis consists of finding a transform which gives the
maximum ratio of the difference between a pair of group multivariate means to the
multivariate variance within the two groups (Davis, 1986). Accordingly, an attempt is
made to delineate groups by maximizing between-group variance while minimizing
within-group variance. The predictors' characteristics are related to form groups based
upon similarities of distribution in n-dimensional space, which are then compared to the
groups input by the user as truth. This enables the user to test the validity of groups
based upon actual data, to test groups which have been created, or to put objects into
groups. Discriminant analysis (DA) may act as a univariate regression and is also related
to ANOVA (Wesolowsky, 1976). The relationship to ANOVA is such that DA may be
considered a multivariate version of ANOVA. The underlying assumptions of DA are:

- the observations are a random sample,
- each group is normally distributed (DA is relatively robust to departures from
normality),
- the variance/covariance matrix for each group is the same, and
- each of the observations in the initial classification is correctly classified (training
data).

Prior probabilities may be assigned to deal with unequal sample sizes, although the size
of the smallest group should still be larger than the number of predictor variables, at a
minimum. The assumption of multivariate normality holds: that the scores on predictors
are randomly sampled, and that the sampling distribution of any linear combination
of predictors is normally distributed. Discriminant analysis is relatively robust to
nonnormality due to skewness, yet not to that which is due to outliers; it
is highly sensitive to outliers. Variables with significant outliers necessitate
transformation prior to analysis. Linearity is also assumed for discriminant analysis. The
inclusion of redundant variables in the computation of discriminant analysis normally
results in automatic exclusion of multicollinear and singular variables, since most
statistical application programs undertake a tolerance test to assess the viability of all
independent variables prior to analysis.
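The two-group transform described at the start of this section can be sketched as Fisher's linear discriminant. This is a minimal illustration under stated assumptions: two simulated groups with equal covariance (matching the assumptions listed above), equal prior probabilities, and NumPy available. It does not reproduce the SPSS procedure discussed below.

```python
import numpy as np

def fisher_discriminant(group_a, group_b):
    """Two-group linear discriminant: weights, cutoff, and Mahalanobis D2."""
    mean_a, mean_b = group_a.mean(axis=0), group_b.mean(axis=0)
    n_a, n_b = len(group_a), len(group_b)
    # Pooled within-group variance/covariance matrix.
    pooled = ((n_a - 1) * np.cov(group_a, rowvar=False) +
              (n_b - 1) * np.cov(group_b, rowvar=False)) / (n_a + n_b - 2)
    # Weights maximize between-group separation relative to within-group variance.
    weights = np.linalg.solve(pooled, mean_a - mean_b)
    cutoff = weights @ (mean_a + mean_b) / 2.0   # midpoint between group means
    d2 = float((mean_a - mean_b) @ weights)      # Mahalanobis D2 between groups
    return weights, cutoff, d2

# Simulated, well-separated groups (hypothetical data).
rng = np.random.default_rng(3)
group_a = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
group_b = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(100, 2))

weights, cutoff, d2 = fisher_discriminant(group_a, group_b)
scores_a = group_a @ weights
scores_b = group_b @ weights
accuracy = (np.sum(scores_a > cutoff) + np.sum(scores_b <= cutoff)) / 200.0
```

A large D2 indicates that the discriminant function separates the groups well, which is the test of significance described in the output section that follows.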
Analysis of Output:
Importance of variables to model:

- Test of significance, D2: if the discriminant functions discriminate well, the D2
will be large; if the D2 is small, the discriminant functions do not discriminate well.
- In the context of a one-way ANOVA, the Wilks' Lambda is a reflection of a
variable's importance. The smaller the Wilks' Lambda, the more important the
variable. An F-test and significant p value are also provided as measures of the
importance of a variable. Therefore, the smallest Wilks' Lambda and the greatest
significance relate the most important variable. This relationship is demonstrated
in the following graphic, where CDAREA has the lowest Wilks' Lambda, the
highest F-ratio, and a significant p value, reflecting the importance of this variable
to the analysis.

SPSS Output, Key Features

- The groups are defined by the number of cases in each group, and the new weights
associated with the variable to create or account for biases due to sample size or
research aims.
- Wilks' Lambda, F-ratio, and significance for each of the variables.
- Variables which failed the tolerance test will be presented.
- The Canonical Discriminant Functions table relates the number of important
functions through a variety of tests:
o eigenvalues which are greater than 1 are significant,
o a percentage of variance greater than 5 is significant,
o cumulative percentage, up to approximately 75%, is significant,
o a canonical correlation of greater than 0.6 is significant, and
o a probability (p) value.

The preceding criteria are all presented on the following graphic. What may be seen is
that each of the tests has a different cut-off point for the number of significant functions;
accordingly, use discretion and apply a value which is appropriate given the results and
nature of the input data.

- Structure matrix, which presents the discriminant functions related to each
variable. The largest absolute correlation between each variable and the discriminant
functions is denoted, and this assists in the naming of significant functions.
- A territorial map is also presented, which relates the delineation of groups around
centroids; the quality of the groups is presented in this 2-dimensional display.
- List of cases, with membership probabilities. For example:

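[Table: SPSS list of cases with columns for actual group, highest probability group, P(D|G), and P(G|D). The cell values did not survive intact; the remaining fragments are 0.8677, 3, 0.6968, -1.685 and 0.6819, 4, 0.1873, -1.008.]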

Case 1 is moderately well specified by the training data, lying on the periphery of
group 1, and its actual group is the same as the group with the highest
probability. P(D|G) is the conditional probability of the discriminant score D given
membership in group G, a measure of how central an object is to the group. P(G|D) is
the posterior probability: given a discriminant score D, the probability that the object
is in group G. For case number 6, by contrast, the
group with the highest probability based upon the input data
is not the same as the class provided as truth; the actual class
was not even the second highest group. This table enables users to assess the validity of a
grouping structure and also to screen for problem cases in an analysis. Problem cases
which have similar characteristics may require a new variable to assist in the delineation
of the groups.
Classification results: a table of actual group (Y) versus predicted group membership (X)
which tallies, over all cases, where each case was grouped. Ideally, with a 100%
accurate classification, every case would fall on the matrix diagonal (the trace). With a
non-perfect matrix, the classes into which misclassified cases are being placed are visible.
The table is a tally, in percentage form, of the results of the previous case placement table.
As can be expected, there is more than one way to calculate the accuracy of the
classification. Congalton (1991) presents two methods of calculating classification
accuracy: user's accuracy and producer's accuracy. Both divide the number of correctly
classed cases for a group, taken from the diagonal of the confusion matrix, by a marginal
total. The diagonal holds the number of cases which were both trained
and predicted to the same group. With actual groups as rows and predicted groups as
columns, the row total is the number of cases trained to be in the class, and the column
total is the number of cases predicted to be in that class.

Producer's accuracy divides the diagonal value by the number of cases trained to be in the
class (here the row total). It relates how well the training data were discerned and
indicates errors of omission. User's accuracy divides the diagonal value by the number of
cases predicted to be in the class (here the column total). It relates how reliable the
model's predictions are and indicates errors of commission. Congalton's own layout places
the classified data in rows and the reference data in columns, so there the row total gives
user's accuracy and the column total gives producer's accuracy. The two results are valid
in different situations, and the purpose of an individual study will dictate
which measure of accuracy should be applied. (Commission error relates to user's accuracy;
omission error relates to producer's accuracy.)
The preceding graphic demonstrates the profoundly different accuracy results which will
be calculated based upon use of either users or producers accuracy. Of significant note
are the cases or groups with small sample sizes.
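The two accuracy measures can be computed from any confusion matrix once its orientation is fixed. The sketch below follows Congalton's (1991) definitions: producer's accuracy divides the correctly classed count by the number of cases actually in the class, and user's accuracy divides it by the number of cases predicted into the class. The matrix itself is hypothetical, and its orientation (rows = actual group, columns = predicted group) is an assumption stated in the code.

```python
import numpy as np

# Hypothetical confusion matrix.
# Assumed orientation: rows = actual (training) group, columns = predicted group.
confusion = np.array([[40,  5,  5],
                      [10, 30, 10],
                      [ 0,  5, 45]])

correct = np.diag(confusion)               # cases trained and predicted alike
actual_totals = confusion.sum(axis=1)      # cases trained to be in each class
predicted_totals = confusion.sum(axis=0)   # cases predicted into each class

producers = correct / actual_totals        # 1 - omission error
users = correct / predicted_totals         # 1 - commission error
overall = correct.sum() / confusion.sum()  # overall proportion correct
```

For the second class, 30 of the 50 training cases were recovered (producer's accuracy 0.60), yet 30 of the 40 cases predicted into it were correct (user's accuracy 0.75), which shows how the two measures can diverge for the same class.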

7: Cluster Analysis
Cluster analysis (CA) is a multivariate procedure for detecting natural groupings in data.
Cluster analysis classification is based upon the placing of objects into more or less
homogeneous groups, in a manner such that the relationship between groups is revealed.
CA lacks an underlying body of statistical theory and is heuristic in nature. Cluster
analysis requires decisions to be made by the user relating to the calculation of clusters,
decisions which have a strong influence on the results of the classification. CA is useful
to classify groups or objects and is more objective than subjective.
Clustering methods may be top down and employ logical division, or bottom up and
undertake aggregation. Aggregation procedures, which are based upon combining cases
through assessment of similarities, are the most common and popular and will be the focus of
this section.
Care should be taken that groups (classes) are meaningful in some fashion and are not
arbitrary or artificial. To this end, the clustering techniques attempt to give each object
more in common with its own group than with other groups, through minimization of
internal variation while maximizing variation between groups. Homogeneous and distinct
groups are delineated based upon assessment of distances or, in the case of Ward's
method, an F-test (Davis, 1986).
Steps to Cluster Analysis
The two key steps within cluster analysis are the measurement of distances between
objects and the grouping of the objects based upon the resultant distances (linkages). The
distances provide a measure of similarity between objects and may be measured in a
variety of ways, such as Euclidean and Manhattan metric distance. The criteria used to
then link (group) the objects may also be chosen in a variety of manners; as a result,
significant variation in results may be seen. Linkages are based upon how the association
between groups is measured. For example, single linkage, or nearest neighbor distance,
measures the distance to the nearest object in a group, while complete linkage, or
furthest neighbor linkage, measures the distance between the furthest objects. These
linkages are both based upon single data values within groups, whereas average between
group linkage is based upon the average distance to all objects in a group. Centroid
linkage computes a new value, representing the group centroid, which is compared to the
ungrouped point to weigh inclusion. Ward's method is variance based, with each group's
variance assessed to enable clustering. The group which sees the smallest increase in
variance with the iterative inclusion of a case will receive the case. Ward's is a popular
default linkage which produces compact groups of well distributed size.
Standardization of variables is undertaken to enable the comparison of variables and to
minimize the bias in weighting which may result from differing measurement scales and
ranges. Z score format removes differences between mean values and rescales each
variable to a standard deviation of one. Multicollinearity will bias the clusters, because
highly correlated variables effectively weight the same information more than once.
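
The distance measures, Z score standardization, and linkage criteria described above can be
sketched in a few lines of Python. This is a minimal illustration, not the procedure of any
particular package; the function names and the two small groups of cases are hypothetical,
and the standardization assumes no variable is constant (its standard deviation is not zero).

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Straight-line distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: the sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def zscore_columns(cases):
    """Standardize each variable (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*cases))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # assumes no constant column
    return [tuple((v - m) / s for v, m, s in zip(case, means, sds)) for case in cases]


def linkage_distance(group_a, group_b, method="single", dist=euclidean):
    """Distance between two groups under single, complete, or average linkage."""
    pairwise = [dist(a, b) for a in group_a for b in group_b]
    if method == "single":      # nearest neighbor
        return min(pairwise)
    if method == "complete":    # furthest neighbor
        return max(pairwise)
    if method == "average":     # mean over all between-group pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Hypothetical cases measured on two variables.
group_a = [(0.0, 0.0), (1.0, 0.0)]
group_b = [(4.0, 0.0), (6.0, 0.0)]
single = linkage_distance(group_a, group_b, "single")      # 3.0
complete = linkage_distance(group_a, group_b, "complete")  # 6.0
average = linkage_distance(group_a, group_b, "average")    # 4.5
```

The same two groups are 3.0, 6.0, or 4.5 units apart depending on the linkage chosen, which
is why the choice of linkage can change the resulting classification.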

Choosing number of groups

The ideal number of groups to establish may be assessed graphically or numerically.
Graphically, the number of groups may be assessed with an icicle plot or dendrogram.
The dendrogram is bisected at a chosen point, and the cases are divided into clusters
based upon the groupings formed up to the point where the bisection occurred. Numerically,
the number of groups may be assessed on the agglomeration schedule, by counting up from
the bottom to where a significant break in slope (numbers) occurs. This is similar to a
visual interpretation of a scree plot.
The optimal number of groups may also be assessed a priori based upon knowledge of the
data set. A scree plot, which converts a dendrogram to a profile curve, will have an
extreme inflection point where the number of groups significantly changes. The number
of groups above the inflection point is an appropriate number of groups. Optimality of
classes may be assessed by how "natural" the classes appear. Low within class variation
in comparison to the between class variation reflects an appropriate class structure.
Discriminant analysis may also be employed to assess optimality and efficiency of
computed groups, by inputting the cluster analysis derived classes for analysis with the
original data.
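
Reading an agglomeration schedule for a break can be illustrated with a small sketch. It
assumes single linkage and Euclidean distance, and it treats the largest jump between
successive merge distances as the "significant break"; both are simplifying assumptions,
and the function names and sample points are hypothetical.

```python
import math


def agglomeration_schedule(points):
    """Single-linkage agglomerative clustering.

    Returns a list of (merge_distance, groups_remaining) in merge order.
    """
    clusters = [[p] for p in points]
    schedule = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the nearest members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        schedule.append((d, len(clusters)))
    return schedule


def suggest_group_count(schedule):
    """Number of groups present just before the largest jump in merge distance."""
    best_jump, groups = 0.0, 1
    for (d_prev, groups_prev), (d_next, _) in zip(schedule, schedule[1:]):
        if d_next - d_prev > best_jump:
            best_jump, groups = d_next - d_prev, groups_prev
    return groups


# Hypothetical cases: two tight clumps and one isolated case.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
schedule = agglomeration_schedule(points)
n_groups = suggest_group_count(schedule)
```

The suggestion is only a starting point; as noted above, the groups should still be checked
for how natural and meaningful they appear.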

8: Spatial Autocorrelation
Spatial autocorrelation is an assessment of the correlation of a variable in reference to
the spatial location of the variable. If the values are interrelated, and there is a
spatial pattern to the correlation, there is spatial autocorrelation. Spatial autocorrelation
measures the level of interdependence between the values of a variable at different
locations, and the nature and strength of the interdependence. Spatial autocorrelation may
be classified as either positive or negative. Positive spatial autocorrelation has similar
values appearing together, while negative spatial autocorrelation has dissimilar values
appearing in close association.
Spatial autocorrelation is related to the scale of the data, as a periodicity of elements is
assessed. Negative spatial autocorrelation is more sensitive to changes in scale. In
geographic applications there is usually positive spatial autocorrelation.
Uses of assessment of spatial autocorrelation:

identification of patterns which may reveal an underlying process,
description of a spatial pattern for use as evidence, such as a diagnostic tool for the
nature of residuals in a regression analysis,
as an inferential statistic to buttress assumptions about the data,
as a data interpolation technique.

How to Compute Spatial Autocorrelation:

the measurement scale dictates the type of measure,
assign weights to the cases,
create a matrix representing the relationships between cases and input it into software
such as Anaspace. Anaspace, by Michael Tiefelsdorf, of Wilfrid Laurier University,
will compute a measure of the spatial autocorrelation from the input data
matrix. In the case of raster data, when building a contiguity matrix be cognizant
of the relationships between cases: are neighbors determined on an eight directional
Queen's case or a non-diagonal, four directional Rook's case?
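
The difference between the Rook's and Queen's cases can be made concrete with a short
sketch that builds a binary contiguity matrix for a small raster. The function name and
the row-by-row cell numbering are assumptions made for illustration.

```python
def contiguity_matrix(n_rows, n_cols, case="rook"):
    """Binary contiguity matrix for a raster of n_rows x n_cols cells.

    Cells are numbered row by row; entry [i][j] is 1 when cells i and j are neighbors.
    """
    if case == "rook":    # four directions, no diagonals
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif case == "queen":  # eight directions, diagonals included
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    else:
        raise ValueError("case must be 'rook' or 'queen'")
    n = n_rows * n_cols
    w = [[0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    w[r * n_cols + c][rr * n_cols + cc] = 1
    return w


rook = contiguity_matrix(3, 3, "rook")
queen = contiguity_matrix(3, 3, "queen")
center_rook_neighbors = sum(rook[4])    # center cell of the 3 x 3 raster
center_queen_neighbors = sum(queen[4])
```

On a 3 x 3 raster the center cell has four neighbors in the Rook's case and eight in the
Queen's case, so the two definitions can give noticeably different weights.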


How to Measure Spatial Autocorrelation:

Moran's I
Computation of Moran's I is achieved by division of the spatial covariation by the total
variation. Resultant values are in the range from approximately -1 to 1. A positive sign
represents positive spatial autocorrelation, while a negative sign represents negative
spatial autocorrelation. A result near zero represents no spatial autocorrelation.
Geary's C
Computation of Geary's C results in a value within the range of 0 to +2. Zero represents
strong positive spatial autocorrelation, a value near 1 represents no spatial
autocorrelation, and 2 represents strong negative spatial autocorrelation.
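
Both measures can be computed directly from a set of values and a weights matrix. The
sketch below uses the usual textbook forms of Moran's I and Geary's C with a binary
contiguity matrix; the four-cell transect and its values are hypothetical, and no
significance testing is attempted.

```python
def morans_i(values, w):
    """Moran's I: spatial covariation divided by total variation."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    s0 = sum(sum(row) for row in w)  # sum of all weights
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


def gearys_c(values, w):
    """Geary's C: based on squared differences between neighboring values."""
    n = len(values)
    m = sum(values) / n
    s0 = sum(sum(row) for row in w)
    sq_diff = sum(w[i][j] * (values[i] - values[j]) ** 2
                  for i in range(n) for j in range(n))
    return ((n - 1) / (2 * s0)) * sq_diff / sum((v - m) ** 2 for v in values)


# Four cells along a transect; each cell neighbors the adjacent cell(s).
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
trend = [1, 2, 3, 4]        # similar values side by side
alternating = [1, 4, 1, 4]  # dissimilar values side by side

i_trend, c_trend = morans_i(trend, w), gearys_c(trend, w)
i_alt, c_alt = morans_i(alternating, w), gearys_c(alternating, w)
```

The smoothly increasing values give a positive I and a C below 1, while the alternating
values give a negative I and a C above 1, matching the interpretation given above.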
How to Correct for Spatial Autocorrelation in Regression:

spatial autocorrelation indicates an incomplete model; there may be a missing variable.
Therefore add an additional variable which may change the data pattern.
incorrect model specification. The data may not be appropriate for a linear fit, or a
non-spatial effect may be manifest in the residuals (nuisance spatial
autocorrelation). Substantive spatial autocorrelation may also arise from
dominant or extreme cases (outliers) which should have been found at the data
screening stage.
systematic measurement error in response variable (non-random). A case in which
error increases as values increase, or vice versa.
regression model is inappropriate, reflecting the need for an explicitly spatial
model, such as a spatially autoregressive model which incorporates a spatial lag operator
into the regression computation. The approach for the implementation of spatial
autoregressive models is as follows:
o establish nature of spatial dependency,
o use information to choose appropriate model form,
o fit model using maximum likelihood estimation,
o calculate residuals from model,
o test residuals, and
o adjust model based upon residuals. (after Haining, 1990)


2. Correlation Coefficient and Coefficient of Variation

Correlation Coefficient
The correlation coefficient measures the strength of the linear association
between two interval/ratio scale variables. (Bivariate relationships are denoted with a
small r.) It does not distinguish explanatory from response variables and is not
affected by changes in the unit of measurement of either or both variables (Moore and
McCabe, 1993).
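
These two properties, symmetry between the variables and independence from the unit of
measurement, can be checked with a short sketch of the Pearson coefficient. The function
name and the sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length variables."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations.
x_km = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
x_m = [v * 1000 for v in x_km]  # same variable expressed in metres

r_km = pearson_r(x_km, y)
r_m = pearson_r(x_m, y)         # unchanged by the change of unit
r_swapped = pearson_r(y, x_km)  # unchanged by swapping the variables
```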
Multiple correlation coefficients, R
-1 <= r <= 1, whereas 0 <= R <= 1
(or the multiple coefficient of determination, 0 <= R2 <= 1)
the proportion of variation in the dependent variable (Y) that can be attributed to the
combined effects of all the X independent variables acting together.
- for net effects (multivariate), assess R, R2,
- for individual effects (bivariate) assess r, r2.
R - simple correlation between the observed and predicted values of the dependent variable,
R2 - denotes the percentage of variation in the dependent variable accounted for by the
independent predictor variables.
Adjusted R-Squared
An adjusted R-squared takes the size of the sample into account. Use it when there is a
need to compare the results of models which had a differing number of observations or
independent variables, or to temper the results of an analysis with suspect results due to a
small number of observations.
adjusted R2 = R2 - [(k - 1) / (n - k)] * (1 - R2)
n = # of observations,
k = # of estimated coefficients (the independent variables plus the intercept),
smaller n, decreases adjusted R2 value,
larger n, increases adjusted R2 value,
smaller k, increases adjusted R2 value,
larger k, decreases adjusted R2 value.
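
The effect of n and of the number of independent variables can be seen with a small
worked sketch. It assumes the common form 1 - (1 - R2)(n - 1)/(n - p - 1), where p is the
number of independent variables in a model with an intercept; this is algebraically the
same as the formula above when k is counted as p + 1. The function name and the input
values are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p independent variables.

    Assumes the model also contains an intercept.
    """
    if n - p - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical model with R2 = 0.80.
small_sample = adjusted_r2(0.80, n=30, p=3)      # about 0.777
large_sample = adjusted_r2(0.80, n=300, p=3)     # about 0.798
many_predictors = adjusted_r2(0.80, n=30, p=10)  # about 0.695
```

With the same R2, the penalty shrinks as the number of observations grows and increases
as independent variables are added.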

3. Homoscedasticity
Homoscedasticity is the assumption that the variability in scores for one variable is
roughly the same at all values of the other variable. It is related to normality: when
the assumption of multivariate normality is met, the relationships between variables are
homoscedastic.

Heteroscedasticity is caused by nonnormality of one of the variables, an indirect
relationship between variables, or the effect of a data transformation.
Heteroscedasticity is not fatal to an analysis; the analysis is weakened, not invalidated.
Heteroscedasticity is detected with scatterplots and is rectified through transformation.


4. Linearity
Linearity is the assumption that there is a straight line relationship between variables.
Examples of non-linear distributions:

Linearity is essential for calculation of multivariate statistics because they are based
upon the general linear model, and because the assumption of multivariate normality
implies that there is linearity between all pairs of variables, with significance tests
based upon that assumption. Non-linearity may be diagnosed from bivariate scatterplots
between pairs of variables or from a residual plot, with predicted values of the dependent
variable versus the residuals. Residual plots may demonstrate: assumptions met, failure of
normality, nonlinearity, and heteroscedasticity. When both variables are normally
distributed and linearly related, the scatterplot is oval shaped; if one of the variables is
nonnormal then the scatterplot is not oval.


Refer to the homoscedasticity page for further information.

5. Multivariate General Linear Hypothesis (MGLH)

The Multivariate general linear hypothesis can estimate and test any univariate or
multivariate general linear model, such as multiple regression, analysis of variance,
discriminant analysis, or principal components analysis. All these procedures have their
genesis in the same linear model. Linear models are based upon lines, more generally,
they are based on linear planes or surfaces. Linear models are applied with wide
acceptance as lines or planes often are able to describe relations among events in the real
world. Accordingly, linearity of variables is very important when applying the general
linear model. Linearity is the assumption of a straight line fit between variables
(SYSTAT, 1992): pairs of variables are understood to have a linear relationship with each
other, such that the relationship between the variables is adequately represented by a
straight line.
Additivity is also important to the general linear model: as one set of variables may be
predicted from another set of variables, the effects of the variables within the data set are
additive in the prediction equation. The second variable in the set adds predictability
to the first, the third variable adds predictability to the first two, and so on.
Accordingly, in multivariate solutions, the equation which relates the sets of variables is
composed of a series of weighted terms added together. The assumption of linearity does
not prevent the use of variables with non-linear, curvilinear, or multiplicative
relationships. The data may be made suitable for analysis through transformation.


6. Missing Data
If the data set is large and a few random points are missing the problem is not serious, yet
in a smaller data set with a non-random distribution of missing values the problem may
be serious. Dealing with the problem:

deleting cases: if a case is missing values it may be deleted. Deletion is often the
default option with statistical software packages.
estimating missing data: estimate the missing values and use these values during
subsequent analysis. Estimates may be based upon prior knowledge, inserting
mean values, or regression.
a missing data correlation matrix may be computed. The matrix is computed using
only cases which have values for both variables, i.e. if two variables share 15
complete cases out of 20, the correlation matrix value will be computed from
those 15 common cases.
spatially autoregressive models.
treating missing data as data. In sociological studies it is potentially the case that a
failure to respond may be indicative of some form of behavior.
checking results with and without missing data: assess the results of each analysis; if
they are markedly different, attempt to discern the reason for the difference.
Attempt to evaluate which result more closely approximates reality (Tabachnick
and Fidell, 1989).
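
Two of the options above, the missing data correlation matrix and the insertion of mean
values, can be sketched as follows. Missing values are marked with None, which is an
assumption of this illustration, and the function names and data are hypothetical.

```python
import math


def pairwise_complete(x, y):
    """Keep only the cases where both variables have a value (None marks missing)."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]


def pairwise_r(x, y):
    """Correlation computed from the common cases only; also returns their count."""
    pairs = pairwise_complete(x, y)
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy), len(pairs)


def impute_mean(x):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in x if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in x]


# Hypothetical variables, each with one missing value.
x = [1, 2, None, 4, 5, 6]
y = [2, 1, 3, None, 4, 5]
r, n_common = pairwise_r(x, y)  # correlation from the 4 common cases
x_filled = impute_mean(x)       # the gap is filled with 3.6
```

Running the analysis both ways, as suggested in the last item above, shows how much the
treatment of missing values influences the result.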

7. Multicollinearity and Singularity

Multicollinearity and singularity are issues which derive from having a
correlation matrix with correlations between variables that are too high. Multicollinearity
is when variables are highly correlated (0.90 and above), and singularity is when the
variables are perfectly correlated. Multicollinearity and singularity expose the
redundancy of variables and the need to remove variables from the analysis.
Multicollinearity and singularity can cause both logical and statistical problems.
Logically, redundant variables weaken the analysis (except in the case of factor analysis),
through reduction of the error degrees of freedom. Accordingly, unless doing a factor
analysis, avoid variables with a bivariate correlation of greater than 0.70 in the same
analysis. The statistical problems related to singularity and multicollinearity are related to
matrix stability and the ability to invert the matrix. With multicollinearity the effects are
additive: the independent variables are inter-related, yet affect the dependent variable
differently. The higher the multicollinearity, the greater the difficulty in partitioning out
the individual effects of independent variables. Accordingly, the partial regression
coefficients are unstable and unreliable.
Most programs appear to automatically screen for multicollinearity and singularity by
computing the squared multiple correlation of a variable. The squared multiple
correlation is computed where a variable is compared to all the rest of the included
variables; if the results show a high correlation the variable is multicollinear. If the
variable is perfectly related to the other variables then singularity is present. Large
standard errors due to multicollinearity result in both a lessened probability of rejecting
the null hypothesis and wide confidence intervals.
To identify multicollinearity:

look at pairwise relationships between variables: if absolute r values are greater than
0.80 the variables are strongly inter-related and should not be used.
tolerance of variable: a value near one indicates independence; if the tolerance
value is close to zero, the variables are multicollinear.
VIF, variance inflation factor: if highly collinear a high value is calculated.
eigenvalues: there is no multicollinearity if eigenvalues are
approximately the same size. If some are much larger than others this is indicative
of the related variables loading together. Under independence the eigenvalues are
of similar size, with equal importance, yet under multicollinearity the
distribution has a few high eigenvalues and many low ones, reflecting an uneven
importance of variables.
For analysis of multicollinearity in regression analysis results, some packages,
such as SPSS, generate a VIF and tolerance value.
the VIF, or variance inflation factor, will reflect the presence or absence of
multicollinearity. With a high VIF, well above one, the variable may be affected by
multicollinearity. The VIF has a range of 1 to infinity.
tolerance has a range from zero to one. The closer the tolerance value is to zero,
the greater the level of multicollinearity.
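
Tolerance and VIF can be computed from the correlation matrix of the independent
variables. The sketch below relies on two standard identities: the VIF of a variable is
the matching diagonal element of the inverse correlation matrix, and tolerance is the
reciprocal of the VIF (one minus the squared multiple correlation of that variable with
the others). The function names and data are hypothetical, and the small matrix inverse
is written out only to keep the example self-contained.

```python
import math


def correlation_matrix(columns):
    """Pearson correlations between every pair of variables (columns)."""
    def corr(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / math.sqrt(sxx * syy)
    return [[corr(a, b) for b in columns] for a in columns]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting; raises on a singular matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: a variable is perfectly "
                             "predicted by the others")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


def vif_and_tolerance(columns):
    """Return (VIFs, tolerances), one value per variable."""
    inverse = invert(correlation_matrix(columns))
    vifs = [inverse[j][j] for j in range(len(columns))]
    return vifs, [1.0 / v for v in vifs]


# Hypothetical independent variables correlated at r = 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vifs, tolerances = vif_and_tolerance([x1, x2])  # VIF about 2.78, tolerance 0.36
```

A singular correlation matrix, where one variable is perfectly predicted by the others,
cannot be inverted at all, which is the matrix inversion problem described above.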

To Solve for Multicollinearity

turn variables to rates,

reduce data set, remove variables which are redundant due to a very high
relationship with one another.

8. Normality
The underlying assumption of most multivariate analyses and statistical tests is the
assumption of multivariate normality. Multivariate normality is the assumption that all
variables and all combinations of the variables are normally distributed. When the
assumption is met the residuals are normally distributed and independent, the differences
between predicted and obtained scores (the errors) are symmetrically distributed around a
mean of zero, and there is no pattern to the errors. Screening for normality may be
undertaken using either statistical or graphical methods.


How to assess and deal with problems:


examine skewness and kurtosis. When a distribution is normal both skewness and
kurtosis are zero. Kurtosis is related to the peakedness of a distribution, either too
peaked or too flat. Skewness is related to the symmetry of the distribution and the
location of the mean of the distribution; a skewed variable is a variable whose
mean is not in the center of the distribution. Tests of significance for skewness
and kurtosis test the obtained value against a null hypothesis of zero. Although
normality of all linear combinations is desirable to ensure multivariate normality,
it is often not testable. Therefore, normality assessed through skewness and
kurtosis of individual variables may indicate variables which may require


view distributions of the data and compare to the distributions above.

probability plots, where scores are ranked and sorted, and an expected normal value is
compared with the actual value for each case. If a distribution is normal, the
points for all cases fall along the line running diagonally from lower left to upper
right. Deviations from normality shift the points away from the diagonal.
examine the residuals, plot expected values vs. obtained scores (predicted vs.

If nonnormality is found in the residuals or the actual variables, transformation may be
considered. Univariate normality does not ensure multivariate normality, but does
increase the likelihood. Transformations are recommended as a remedy for outliers,
breaches in normality, non-linearity, and lack of homoscedasticity. Although
transformations are recommended, be aware of the change to the data, and the adaptation
to the change which must be implemented for interpretation of results. If the scale is
arbitrary, interpretation will only be hindered marginally, yet if the scale is meaningful
the transformation may cause confusion. Transformations are appropriate when the
non-linearity is monotonic; if the distribution plot is non-monotonic rethink the variable.
Sample distributions and appropriate transformations to produce normality can be found
on the transformations link. Remember to check the transformed variable for normality
after application.
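
Skewness and kurtosis can be screened with a few lines of code. The sketch below uses the
simple moment-based forms, with kurtosis expressed as excess kurtosis so that a normal
distribution scores zero; statistical packages may apply small-sample corrections, so
their values can differ slightly. The function names and sample data are hypothetical.

```python
def central_moment(x, order):
    """Average of (value - mean) raised to the given power."""
    m = sum(x) / len(x)
    return sum((v - m) ** order for v in x) / len(x)


def skewness(x):
    """Zero for a symmetric distribution; positive for a long right tail."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5


def excess_kurtosis(x):
    """Zero for a normal distribution; negative when too flat, positive when too peaked."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0


# Hypothetical variables.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

skew_symmetric = skewness(symmetric)        # 0.0
skew_right = skewness(right_skewed)         # positive
kurt_symmetric = excess_kurtosis(symmetric)  # negative: flatter than normal
```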

9. Orthogonality
Orthogonality relates to independence, where there should be non-association between
variables (an r = 0; see the correlation coefficient section).
Orthogonality is perfect non-association between variables. Independence of variables is
desired so that each addition of an independent variable adds to the prediction of the
dependent variable. If the relationships among the independent variables are
orthogonal, the overall effect of an independent variable may be partitioned into effects
on the dependent variable in an additive fashion. In orthogonal experimental designs,
with random assignment, causality may be assigned to various main effects and

10. Outliers
Outliers are extreme cases on one variable, or a combination of variables, which have a
strong influence on the calculation of statistics. The following box plot provides an
example of outliers.


It is also possible to check for the presence of outliers using scatterplots. Variables
may be plotted both in univariate and bivariate combinations. Below is a bivariate
scatterplot of some multivariate regression results which reveals an outlier.

Reasons for outliers:

incorrect data entry,

failure to specify missing value codes,
case not a member of intended sample population,
actual distribution of the population has more extreme cases than a normal

Reducing the influence of outliers:

check the data for the case, ensure proper data entry
check if one variable is responsible for most of the outliers, consider deletion of
the variable,
delete the case if it is not part of the population,
if cases are from the population, and are to remain in the analysis, transform the
variable to reduce influence.

Graphical methods for identifying multivariate outliers:

1. Observe the residuals to identify cases for which there is a poor fit between
obtained and predicted DV scores. A residuals plot will isolate cases which are
outliers. Multivariate outliers will produce points outside the general swarm of
points.
2. Leverage on the X axis versus residuals on the Y axis. Leverage is a distance
measure, such as Mahalanobis distance. Outliers will again be exposed by lying
outside the general swarm of points.
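
A distance measure of this kind can be sketched for the simplest multivariate situation,
two variables, where the covariance matrix can be inverted by hand. The function name and
the sample cases are hypothetical, and the restriction to two variables is an assumption
made to keep the example short.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) times the inverse covariance matrix times (dx, dy) transposed
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical cases: five follow a linear trend, the last does not.
cases = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (3, 9.0)]
d2 = mahalanobis_sq_2d(cases)
most_extreme = d2.index(max(d2))  # index of the case furthest from the swarm
```

The last case is unremarkable on the first variable alone, yet it has the largest distance
because its combination of values lies outside the general swarm of points.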


The outlier which is present above will cause problems with linearity. As a comparison
the same plot is produced with the outlying value removed. (removal of outlier, case
4619000, Division 19, Province. 7 (Manitoba)). An improved linear relationship is seen
with the removal of the outlier.

11. Residual Plots

Residuals are the difference between an observed value of the response variable and the
value predicted by the model (Moore and McCabe, 1993).
therefore: residual = observed y - predicted y
The mean of the residuals is always zero. As a result plotting of residuals enables the data
to be viewed from a standard orientation point. Residual plots show the deviation from
the expected value for each x value in the model.
Plots of predicted values versus error (residual) values
The plots demonstrate, (a) assumptions met, (b) failure of normality, (c) nonlinearity, and
(d) heteroscedasticity.
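
The definition of a residual, and the fact that residuals from a least squares line with an
intercept average zero, can be illustrated with a minimal sketch of a one-variable
regression. The function names and data are hypothetical.

```python
def fit_line(x, y):
    """Least squares intercept and slope for one independent variable."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(x, y):
    """residual = observed y - predicted y, for each case."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
res = residuals(x, y)            # about 0.6, -1.2, 1.0, -0.8, 0.4
mean_residual = sum(res) / len(res)
```

Plotting these residuals against the predicted values gives the kind of residual plot
described above.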


Residual plots, raw and standardized from the census data model.
Raw value residuals

Standardized value residuals


These plots demonstrate that the residuals are from a model where the assumptions were
met. The effect of some outliers may need to be assessed.

12. Data Transformation

Transformations are a remedy for outliers and for failures of normality, linearity, and
homoscedasticity. Yet, caution must still be employed in the usage of transformations due
to the increased difficulty of interpretation of transformed variables. The scale of the data
influences the utility of transformations: if a scale is arbitrary, transformations are more
effective, while when the scale is meaningful the difficulty of interpretation increases.
Sample data forms with recommended transformations:


As with many statistical techniques, transformation is an iterative process which requires
post calculation evaluation. Check to see that a variable is normally or near-normally
distributed after transformation; if not, redo with a more appropriate transformation.
Continue attempting transformations until skewness and kurtosis values are nearest zero,
or there are the fewest outliers. The suggested transformations above are intended to
bring the distribution closer to normal.
Principal component analysis and factor analysis also serve to transform data in particular
Below is an example of data transformation:
Summary Statistics generated in SYSTAT:

Observe the problems which can be seen in the summary data, such as: scales which
vary, ranges which differ, skewness and kurtosis not near zero, means and standard
deviations which further reflect the skewed data. Further steps to be taken when viewing
the univariate descriptive statistics:

number of cases all equal, no missing data.

compare minimum and maximum values to known range.
assess data for outliers, large range in reference of the mean location and standard
deviation, ie. POP91.


Box plot of POP91

as opposed to that of MUNEMP, which is still skewed, yet not as badly.

Further to demonstrate the skewed distribution of POP91 is a stem and leaf plot:
MEDIAN IS: 37621.
MAXIMUM IS: 2275771.
Accordingly, POP91 ought to be transformed. After consulting the distributions at the
beginning of this document, which show the different distributions, a logarithm
transformation will be implemented.


an improved distribution of values is seen with the log transformed population 1991 data.

observe mean values to see if plausible given the context of the variable, ie
AVEINC - average income - approximately $44360 per year, is a reasonable
get sense of the distribution by comparing mean to median values, the closer
together the more symmetrical the distribution.
skewness and kurtosis, the closer the distribution is to normal the closer the values
of skewness and kurtosis are to zero.
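
The effect of a logarithm transformation on a strongly right skewed variable, such as a
population count, can be checked numerically. The values below are hypothetical and are
not the census data used in this document; the skewness measure is the simple
moment-based form.

```python
import math


def skewness(x):
    """Moment-based skewness: zero for a symmetric distribution."""
    m = sum(x) / len(x)
    m2 = sum((v - m) ** 2 for v in x) / len(x)
    m3 = sum((v - m) ** 3 for v in x) / len(x)
    return m3 / m2 ** 1.5


# Hypothetical population counts with one very large division.
population = [12000, 25000, 37000, 41000, 90000, 350000, 2200000]
log_population = [math.log10(v) for v in population]

skew_raw = skewness(population)      # strongly positive
skew_log = skewness(log_population)  # closer to zero
```

The transformed variable should still be re-examined, since a log transformation reduces
the skew without necessarily removing it.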

Raw vs. rated values, reduce influence of outliers




X3 = GOV
X7 = POP91
The varying ranges for each of the variables reflect the difficulty in comparing the
variables. The larger values of variables such as POP91 will dominate the model.
RATED - why? To reduce the influence of any given variable.

X1 = NONNAT/POP91 (%)
X4 = LOWED/POP91 (%)
X5 = HIGHED/POP91 (%)
POP91 will be dealt with in the transformation coming up.
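
Conversion of a raw count to a rate, as in the X1, X4, and X5 definitions above, is a simple
division by population. The sketch below uses hypothetical counts and populations; only the
form of the calculation is taken from this document.

```python
def to_rate(count, population):
    """Express a raw count as a percentage of the population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * count / population


# Hypothetical census divisions of very different size.
small_division = to_rate(count=500, population=10000)      # 5.0 %
large_division = to_rate(count=50000, population=1000000)  # 5.0 %
```

The raw counts differ by a factor of 100, yet the rates are identical, so the larger
division no longer dominates simply because of its size.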



Confined data (list) multiplied by 10 so that all data are within the same dynamic range.

X1 = rated and transformed NONNAT
X2 = rated and transformed MINES
X3 = rated and transformed GOV
X4 = rated LOWED
X5 = rated and transformed HIGHED
X6 = raw MUNEMP
X7 = transformed POP91


13. References
Congalton, 1991; A Review of Assessing the Accuracy of Classifications of Remotely
Sensed Data, Remote Sensing of Environment, Vol. 37, pp. 35-46.
Davis, J., 1986; Statistics and Data Analysis in Geology, (John Wiley & Sons, Toronto,
Moore, D., and G. McCabe, 1993; Introduction to the Practice of Statistics, (W.H.
Freeman and Company, New York, 854p.)
SYSTAT, 1992; Statistics, Version 5.2 Edition, (SYSTAT, Inc., Evanston, IL., 724p.)
Tabachnick, B., and L. Fidell, 1989; Using Multivariate Statistics, (Harper & Row
Publishers, New York, 746p.)
Weslowsky, G., 1976; Multiple Regression and Analysis of Variance, (John Wiley &
Sons, Toronto, 292p.)
Wetherill, G., 1986; Regression Analysis with Applications, (Chapman and Hall, New
York, 311p.)

Graduate Geography 616, Suggested Readings List

Clark, W., and P. Hosking, 1986; Statistical Methods for Geographers.
Johnston, R., 1978; Multivariate Statistical Analysis in Geography.
Haggett, P., et al. 1977; Locational Methods.
Haining, R., 1990; Spatial Data Analysis in the Social and Environmental Sciences
King, L., 1969; Statistical Analysis in Geography.
Shaw, G., and D. Wheeler, 1985; Statistical Techniques in Geographical Analysis.
Taylor, P., 1977; Quantitative Methods in Geography.
Webster, R., and M. Oliver, 1990; Statistical Methods in Soil and Land Resource Theory.

