
UNIT-II PREPARING FOR MULTIVARIATE ANALYSIS 9

Conceptualization of research problem – Identification of technique – Examination of variables and data – Measurement of variables and collection of data – Measurement of errors – Statistical significance of errors. Missing data – Approaches for dealing with missing data – Testing the assumptions of multivariate analysis – Incorporating non-metric data with dummy variables.
Conceptualization of research problem: Conceptualization means specifying exactly what we
mean, and do not mean, by the terms we use in our research. The term concept (also referred
to as 'construct') refers to the end product of conceptualization. A concept may be a word, or a
complex set of events or ideas referred to by that word; it is a word or symbol used to represent
a meaningful whole. The words we use to form the description of a concept are themselves
concepts, so to fully understand the description of a given concept, each concept in that
definition must also be understood. In addition to organizing observations into meaningful
wholes, concepts are also needed to distinguish separate phenomena from one another.
Factors to consider when conceptualizing a research problem
• Empiricism: Empiricism is the belief that people should rely on practical experience
and experiments, rather than on theories, as a basis for knowledge. The researcher
must ensure that the specific problem identified in the statement of the problem is
researchable. A good research problem should be able to withstand empirical test.
Research problem should be able to generate appropriate terminologies that can be
used to generate expected data.
• The research problem should be written clearly to capture the interest of the
reader: The researcher should avoid all forms of ambiguity through operationalization
of research variables.
• The scope of the research problem should be indicated: The scope of the study should
come out clearly. It addresses the extent to which the research attempts to tackle the
research problem.
• Importance of the study in adding new knowledge: The research problem should
generate information that adds new knowledge in the relevant area of study. This is
because one of the main functions of research is to discover new knowledge.
• The problem statement must give the purpose of the research: The research problem
must have goals and objectives that need to be accomplished. It will therefore lay
ground for the formulation of the study objectives.
• Feasibility: The research problem should be feasible, shaping a study that can
realistically be conducted within a reasonable period of time, taking into
consideration the resources available.
Identification of appropriate technique: Multivariate (or multidimensional) datasets are
data tables containing more than two variables (usually stored in columns)
measured on more than two statistical units (individuals, patients, sites, and so on), usually stored in
rows. Multidimensional data analysis techniques are used to extract interesting

information in large datasets that can hardly be read in their raw format. Those tools are
often referred to as data mining tools.
The choice of an appropriate data mining method depends on the type of question to be
investigated using the data (exploratory or decisional) as well as on the structure of the data.
The questions can be divided into two types:
 Exploratory questions allow the investigation of multivariate datasets without
considering any particular hypothesis to validate. Exploratory multivariate data
analysis tools often imply a reduction of the dimensionality of large datasets making
data exploration more convenient.
 Decisional questions imply testing the relationship between two sets of variables
(correlation), or explaining a variable or a set of variables by another set (causality).

A Four-Step Process For Identifying Missing Data And Applying Remedies
Step 1: Determine the Type of Missing Data
The first step in any examination of missing data is to determine the type of missing data
involved. Here the researcher is concerned whether the missing data are part of the research
design and under the control of the researcher or whether the “causes” and impacts are truly
unknown. Also, researchers should understand the “levels” of missingness present in their
data so that the most effective missing data strategies can be developed.

Ignorable Missing Data


Many times missing data are expected and part of the research design. In these instances, the
missing data are termed ignorable missing data, meaning that specific remedies for missing
data are not needed because the allowances for missing data are inherent in the technique
used. There are three instances in which a researcher most often encounters ignorable missing
data.
A Sample as Missing Data: The first example encountered in almost all surveys and most
other datasets is the ignorable missing data process resulting from taking a sample of the
population rather than gathering data from the entire population. In these instances, the
missing data are those observations in a population that are not included when taking a
sample. The purpose of multivariate techniques is to generalize from the sample observations
to the entire population, which is really an attempt to overcome the missing data of
observations not in the sample. The researcher makes these missing data ignorable by using
probability sampling to select respondents. Probability sampling enables the researcher to
specify that the missing data process leading to the omitted observations is random and that
the missing data can be accounted for as sampling error in the statistical procedures. Thus,
the missing data of the non-sampled observations are ignorable.
Part of Data Collection: A second instance of ignorable missing data is due to the specific
design of the data collection process. Certain non-probability sampling plans are designed for
specific types of analysis that accommodate the nonrandom nature of the sample. Much more
common are missing data due to the design of the data collection instrument, such as through
skip patterns where respondents skip sections of questions that are not applicable. For
example, in examining customer complaint resolution, it might be appropriate to require that
individuals make a complaint before asking questions about how complaints are handled.
Respondents who did not make a complaint do not answer the questions on the process
and thus create missing data. The researcher is not concerned about these missing data,
because they are part of the research design, and it would be inappropriate to attempt to remedy them.
Censored Data: A third type of ignorable missing data occurs when the data are censored.
Censored data are observations not complete because of their stage in the missing data
process. A typical example is an analysis of the causes of death. Respondents who are still
living cannot provide complete information (i.e., cause or time of death) and are thus
censored. Another interesting example of censored data is found in the attempt to estimate the
heights of the U.S. general population based on the heights of armed services recruits. The
data are censored because in certain years the armed services had height restrictions that

varied in level and enforcement. Thus, the researchers face the task of estimating the heights
of the entire population when it is known that certain individuals (i.e., all those below the
height restrictions) are not included in the sample. In both instances the researcher’s
knowledge of the missing data process allows for the use of specialized methods, such as
event history analysis, to accommodate censored data.
In each instance of an ignorable missing data process, the researcher has an explicit means of
accommodating the missing data into the analysis. It should be noted that it is possible to
have both ignorable and non-ignorable missing data in the same data set when two different
missing data processes are in effect.

Missing Data Processes That Are Not Ignorable
Missing data that cannot be classified as ignorable occur for many reasons and in many
situations. In general, these missing data fall into two classes based on their source: known
versus unknown processes.
Known Processes Many missing data processes are known to the researcher in that they can
be identified due to procedural factors, such as errors in data entry that create invalid codes,
disclosure restrictions (e.g., small counts in US Census data), failure to complete the entire
questionnaire, or even the morbidity of the respondent. In these situations, the researcher has
little control over the missing data processes, but some remedies may be applicable if the
missing data are found to be random.
Unknown Processes These types of missing data processes are less easily identified and
accommodated. Most often these instances are related directly to the respondent. One
example is the refusal to respond to certain questions, which is common in questions of a
sensitive nature (e.g., income or controversial issues) or when the respondent has no opinion
or insufficient knowledge to answer the question. The researcher should anticipate these
problems and attempt to minimize them in the research design and data collection stages of
the research. However, they still may occur, and the researcher must now deal with the
resulting missing data. But all is not lost. When the missing data occur in a random pattern,
remedies may be available to mitigate their effect.
In most instances, the researcher faces a missing data process that cannot be classified as
ignorable. Whether the source of this non-ignorable missing data process is known or
unknown, the researcher must still proceed to the next step of the process and assess the
extent and impact of the missing data.
Levels Of Missingness
In addition to the distinction between ignorable and not ignorable missing data, the researcher
should understand what forms of missing data are likely to impact the research. Note that the
missing data process refers to whether a case has a missing value or not, but does not relate to
the actual value that is missing. Thus, missingness is concerned with the absence or presence
of a missing/valid value. Determining how that missing data value might be imputed is
addressed once the type of missing data process is determined. Newman proposed three
levels of missingness described below that follow a hierarchical arrangement:
Item-level The level of missingness first encountered, this is when a value is not available
(i.e., a respondent does not answer a question, a data field is missing in a customer record,

etc.). This is the level at which remedies for missing data (e.g., imputation) are identified and
performed.
Construct-level This level of missingness occurs when item-level missing data act to create a
missing value for an entire construct of interest. A common example is when a respondent has
missing data on all of the items for a scale, although it can also apply to single-item scales.
Since constructs are the level of interest in most research questions, the missing data
become impactful on the results through their effects at the construct level.
Person-level This final level is when a participant does not provide responses to any part of
the survey. Typically known as non-response, it potentially represents influences from
both characteristics of the respondent (e.g., general reluctance to participate) and
possible data collection errors (e.g., a poorly designed or administered survey instrument).
While most missing data analysis occurs at the item level, researchers should still be aware of
the impact at the construct level (e.g., the impact on scale scores when using only valid data)
and of the factors driving person-level missingness and how they might be reflected in
item-level and even construct-level missing data. For example, person-level factors may
make individuals unresponsive to all items of a particular construct, so while we might think
of these as item-level issues, they are actually of a different order.

Step 2: Determine the Extent of Missing Data


Given that some of the missing data are not ignorable and we understand the levels of
missingness in our data, the researcher must next examine the patterns of the missing data
and determine the extent of the missing data for individual variables, individual cases, and
even overall (e.g., by person). The primary issue in this step of the process is to determine
whether the extent or amount of missing data is low enough to not affect the results, even if it
operates in a nonrandom manner. If it is sufficiently low, then any of the approaches for
remedying missing data may be applied. If the missing data level is not low enough, then we
must first determine the randomness of the missing data process before selecting a remedy
(step 3). The unresolved issue at this step is this question: What is low enough? In making the
assessment as to the extent of missing data, the researcher may find that the deletion of cases
and/or variables will reduce the missing data to levels that are low enough to allow for
remedies without concern for creating biases in the results.
Assessing The Extent And Patterns Of Missing Data The most direct means of assessing
the extent of missing data is by tabulating (1) the percentage of variables with missing data
for each case and (2) the number of cases with missing data for each variable. This simple
process identifies not only the extent of missing data, but any exceptionally high levels of
missing data that occur for individual cases or observations. The researcher should look for
any nonrandom patterns in the data, such as concentration of missing data in a specific set of
questions, attrition in not completing the questionnaire, and so on. Finally, the researcher
should determine the number of cases with no missing data on any of the variables, which
will provide the sample size available for analysis if remedies are not applied.
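As an illustration, the tabulation described above can be sketched in a few lines of Python with pandas; the DataFrame and variable names below are hypothetical, not part of the original material.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "V1": [3.2, np.nan, 4.1, 5.0, 2.8],
    "V2": [1.0, 2.0, np.nan, 4.0, 5.0],
    "V3": [np.nan, 7.5, 6.1, np.nan, 5.9],
})

# (1) percentage of cases with missing data for each variable
pct_missing_per_variable = df.isna().mean() * 100

# (2) percentage of variables with missing data for each case
pct_missing_per_case = df.isna().mean(axis=1) * 100

# number of cases with no missing data on any variable
n_complete_cases = len(df.dropna())

print(pct_missing_per_variable)
print(pct_missing_per_case)
print("Complete cases available for analysis:", n_complete_cases)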
With this information in hand, the important question is: Is the missing data so high as to
warrant additional diagnosis? At issue is the possibility that either ignoring the missing data

or using some remedy for substituting values for the missing data can create a bias in the data
that will markedly affect the results. Even though most missing data require researcher
judgment, the two guidelines below apply:
● 10 percent or less generally acceptable. Cases or observations with up to 10 percent
missing data are generally acceptable and amenable to any imputation strategy. Notable
exceptions are when nonrandom missing data processes are known to be operating and then
they must be dealt with.
● Sufficient minimum sample. Be sure that the minimum sample with complete data (i.e., no
missing data across all the variables), is sufficient for model estimation.
If it is determined that the extent is acceptably low and no specific nonrandom patterns
appear, then the researcher can employ any of the imputation techniques (step 4) without
biasing the results in any appreciable manner. If the level of missing data is too high, then the
researcher must consider specific approaches to diagnosing the randomness of the missing
data processes (step 3) before proceeding to apply a remedy.
Deleting Individual Cases And/Or Variables Before proceeding to the formalized methods
of diagnosing randomness in step 3, the researcher should consider the simple remedy of
deleting offending case(s) and/or variable(s) with excessive levels of missing data. The
researcher may find that the missing data are concentrated in a small subset of cases and/or
variables, with their exclusion substantially reducing the extent of the missing data.
Moreover, in many cases where a nonrandom pattern of missing data is present, this solution
may be the most efficient.

Step 3: Diagnose the Randomness of the Missing Data Processes


Having determined that the extent of missing data is substantial enough to warrant action, the
next step is to ascertain the degree of randomness present in the missing data, which then
determines the appropriate remedies available. Assume for the purposes of illustration that
information on two variables (X and Y) is collected. X has no missing data, but Y has some
missing data. A nonrandom missing data process is present between X and Y when significant
differences in the values of X occur between cases that have valid data for Y versus those
cases with missing data on Y. Any analysis must explicitly accommodate any nonrandom
missing data process (i.e., missingness) between X and Y or else bias is introduced into the
results.
Levels Of Randomness Of The Missing Data Process Missing data processes can be
classified into one of three types. Two features distinguish the three types: (a) the randomness
of the missing values among the values of Y and (b) the degree of association between the
missingness of one variable (in our example Y) and other observed variable(s) in the dataset
(in our example X). Figure 2.6 provides a comparison between the various missing data
patterns. Using Figure 2.6 as a guide, examine these three types of missing data processes.
Missing Data at Random (MAR) Missing data are termed missing at random (MAR) if the
missing values of Y depend on X, but not on Y. In other words, the observed Y values
represent a random sample of the actual Y values for each value of X, but the observed data
for Y do not necessarily represent a truly random sample of all Y values. In Figure 2.6, the

missing values of Y are random (i.e., spread across all values), but having a missing value on
Y does relate to having low values of X (e.g., only values 3 or 4 of X correspond to missing
values on Y). Thus, X is associated with the missingness of Y, but not the actual values of Y
that are missing. Even though the missing data process is random in the sample, its values are
not generalizable to the population. Most often, the data are missing randomly within
subgroups, but differ in levels between subgroups. The researcher must determine the factors
determining the subgroups and the varying levels between groups.
Missing Completely at Random (MCAR) A higher level of randomness is termed missing
completely at random (MCAR). In these instances the observed values of Y are truly a
random sample of all Y values, with no underlying association to the other observed
variables, characterized as “purely haphazard missingness”. In Figure 2.6, the missing values
of Y are random across all Y values and there is no relationship between missingness on Y
and the X variable (i.e., missing Y values occur at all different values of X). Thus, MCAR is a
special condition of MAR since the missing values of Y are random, but it differs in that there
is no association with any other observed variable(s). This also means that the cases with no
missing data are simply a random subset of the total sample. In simple terms, the cases with
missing data are indistinguishable from cases with complete data, except for the presence of
missing data.
Missing Not at Random (MNAR) The third type of missing data process is missing not at
random (MNAR), which as the name implies, has a distinct nonrandom pattern of missing
values. What distinguishes MNAR from the other two types is that the nonrandom pattern is
among the Y values and the missingness of the Y values may or may not be related to the X
values. This is the most problematic missing data process for several reasons. First, it is
generally undetectable empirically and only becomes apparent through subjective analysis. In
Figure 2.6, all of the missing values of Y were the lowest values (e.g., values 1, 2, 3, and 4).
Unless we knew from other sources, we would not suspect that valid values extended below
the lowest observed value of 5. Only researcher knowledge of the possibility of values lower
than 5 might indicate that this was a nonrandom process. Second, there is no objective
method to empirically impute the missing values. Researchers should be very careful when
faced with MNAR situations as biased results can be substantial and threats to
generalizability are serious.
Defining The Type of Missing Data Process Two of the types exhibit levels of randomness
for the missing data of Y. One type requires special methods to accommodate a nonrandom
component (MAR) while the second type (MCAR) is sufficiently random to accommodate
any type of missing data remedy [62, 31, 37, 79]. Although both types seem to indicate that
they reflect random missing data patterns, only MCAR allows for the use of any remedy
desired. The distinction between these two types is in the generalizability to the population in
their original form. The third type, MNAR, has a substantive nonrandom pattern to the
missing data that precludes any direct imputation of the values. Since MNAR requires
subjective judgment to identify, researchers should always be aware of the types of variables
(e.g., sensitive personal characteristics or socially desirable responses) that may fall into this
type of missing data pattern.
DIAGNOSTIC TESTS FOR LEVELS OF RANDOMNESS As previously noted, the
researcher must ascertain whether the missing data process occurs in a completely random

manner (MCAR) or with some relationship to other variables (MAR). When the dataset is
small, the researcher may be able to visually see such patterns or perform a set of simple
calculations (such as in our simple example at the beginning of the chapter). However, as
sample size and the number of variables increases, so does the need for empirical diagnostic
tests. Some statistical programs add techniques specifically designed for missing data
analysis (e.g., Missing Value Analysis in IBM SPSS), which generally include one or both
diagnostic tests.
t Tests of Missingness The first diagnostic assesses the missing data process of a single
variable Y by forming two groups: observations with missing data for Y and those with valid
values of Y. The researcher can create an indicator variable with a value of 1 if there is a
missing value for Y and a zero if Y has a valid value. Thus, the indicator value just measures
missingness—presence or absence. Statistical tests are then performed between the
missingness indicator and other observed variables—t tests for metric variables and chi-
square tests for nonmetric variables. Significant differences between the two groups indicate
a relationship between missingness and the variable being tested—an indication of a MAR
missing data process.
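A minimal sketch of this diagnostic in Python, assuming a pandas DataFrame with an illustrative metric variable X, a nonmetric variable group, and a variable Y containing missing values; the names and the simulated data are assumptions for illustration only.

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X": rng.normal(50, 10, 200),
    "group": rng.choice(["A", "B"], 200),
})
df["Y"] = rng.normal(100, 15, 200)
df.loc[df["X"] < 42, "Y"] = np.nan        # missingness related to X (a MAR-like pattern)

# indicator of missingness: 1 = Y missing, 0 = Y valid
missing = df["Y"].isna().astype(int)

# t test: do X values differ between cases with and without missing Y?
t_stat, p_metric = stats.ttest_ind(df.loc[missing == 1, "X"],
                                   df.loc[missing == 0, "X"])

# chi-square test for a nonmetric variable
crosstab = pd.crosstab(missing, df["group"])
chi2, p_nonmetric, dof, _ = stats.chi2_contingency(crosstab)

print(f"t test p = {p_metric:.4f}, chi-square p = {p_nonmetric:.4f}")
# small p-values indicate missingness on Y is related to the tested variable (MAR)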
Little’s MCAR Test A second approach is an overall test of randomness that determines
whether the missing data can be classified as MCAR. This test analyzes the pattern of
missing data on all variables and compares it with the pattern expected for a random missing
data process. If no significant differences are found, the missing data can be classified as
MCAR. If significant differences are found, however, the researcher must use the approaches
described previously to identify the specific missing data processes that are nonrandom.
Web Reference : https://www.youtube.com/watch?v=22aR9ruSig4

Step 4: Select the Imputation Method


At this step of the process, the researcher must select the approach used for accommodating
missing data in the analysis. This decision is based primarily on whether the missing data are
MAR or MCAR, but in either case the researcher has several options for imputation.
Imputation is the process of estimating the missing value based on valid values of other
variables and/or cases in the sample. The objective is to employ known relationships that can
be identified in the valid values of the sample to assist in estimating the missing values.
However, the researcher should carefully consider the use of imputation in each instance
because of its potential impact on the analysis.
The imputation techniques are divided into two classes: those that require an MCAR missing
data process and those appropriate when facing a MAR situation.

IMPUTATION OF MCAR USING ONLY VALID DATA


Complete Case Approach The simplest and most direct approach for dealing with missing
data is to include only those observations with complete data, also known as the complete
case approach. This method, also known as the LISTWISE method in IBM SPSS, is available
in all statistical programs and is the default method in many programs. Yet the complete case

approach has two distinct disadvantages. First, it is most affected by any nonrandom missing
data processes, because the cases with any missing data are deleted from the analysis. Thus,
even though only valid observations are used, the results are not generalizable to the
population. Second, this approach also results in the greatest reduction in sample size,
because missing data on any variable eliminates the entire case. It has been shown that with
only two percent randomly missing data, more than 18 percent of the cases will have some
missing data. Thus, in many situations with even very small amounts of missing data, the
resulting sample size is reduced to an inappropriate size when this approach is used. As a
result, the complete case approach is best suited for instances in which the extent of missing
data is small, the sample is sufficiently large to allow for deletion of the cases with missing
data, and the relationships in the data are so strong as to not be affected by any missing data
process. But even in these instances, most research suggests avoiding the complete case
approach if at all possible.
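A sketch of the complete case (listwise) approach with pandas; the small DataFrame is purely illustrative.

import pandas as pd
import numpy as np

df = pd.DataFrame({"V1": [1.0, np.nan, 3.0, 4.0],
                   "V2": [2.0, 2.5, np.nan, 4.5],
                   "V3": [5.0, 6.0, 7.0, 8.0]})

complete_cases = df.dropna()          # listwise deletion: keep only rows with no missing values
print("Cases before:", len(df), "after:", len(complete_cases))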
Using All-Available Data The second imputation method using only valid data also does not
actually replace the missing data, but instead imputes the distribution characteristics (e.g.,
means or standard deviations) or relationships (e.g., correlations) from every valid value. For
example, assume that there are three variables of interest (V1, V2, and V3). To estimate the
mean of each variable, all of the valid values are used for each respondent. If a respondent is
missing data for V3, the valid values for V1 and V2 are still used to calculate the means.
Correlations are calculated in the same manner, using all valid pairs of data. Assume that one
respondent has valid data for only V1 and V2, whereas a second respondent has valid data for
V2 and V3. When calculating the correlation between V1 and V2, the values from the first
respondent will be used, but not for correlations of V1 and V3 or V2 and V3. Likewise, the
second respondent will contribute data for calculating the correlation of V2 and V3, but not
the other correlations.
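The all-available (pairwise) approach can be illustrated with pandas, whose default behaviour for means and correlations is to use every valid value or valid pair; the data below are hypothetical.

import pandas as pd
import numpy as np

df = pd.DataFrame({"V1": [1.0, 2.0, np.nan, 4.0],
                   "V2": [2.0, 2.5, 3.5, 4.5],
                   "V3": [np.nan, 6.0, 7.0, 8.0]})

means = df.mean()            # each mean uses all valid values of that variable
corr_pairwise = df.corr()    # each correlation uses all valid pairs for those two variables
print(means)
print(corr_pairwise)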
IMPUTATION OF MCAR BY USING KNOWN REPLACEMENT VALUES
The second form of imputation for MCAR missing data processes involves replacing missing
values with estimated values based on other information available in the sample. The
principal advantage is that once the replacement values are substituted, all observations are
available for use in the analysis. The options vary from the direct substitution of values to
estimation processes based on relationships among the variables.
Hot or Cold Deck Imputation In this approach, the researcher substitutes a value from
another source for the missing values. In the “hot deck” method, the value comes from
another observation in the sample that is deemed similar. Each observation with missing data
is paired with another case that is similar on a variable(s) specified by the researcher. Then,
missing data are replaced with valid values from the similar observation. Recent advances in
computer software have advanced this approach to more widespread use. “Cold deck”
imputation derives the replacement value from an external source (e.g., prior studies, other
samples). Here the researcher must be sure that the replacement value from an external
source is more valid than an internally generated value. Both variants of this method provide
the researcher with the option of replacing the missing data with actual values from similar
observations that may be deemed more valid than some calculated value from all cases, such
as the mean of the sample.
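A simplified hot deck sketch in Python, where "similar" cases are defined as cases sharing the same value on a single matching variable; the variable names and the grouping rule are illustrative assumptions, not a prescribed procedure.

import pandas as pd
import numpy as np

df = pd.DataFrame({"segment": ["A", "A", "A", "B", "B", "B"],
                   "income":  [40.0, np.nan, 42.0, 75.0, 80.0, np.nan]})

rng = np.random.default_rng(1)

def hot_deck(group):
    donors = group.dropna()
    # replace each missing value with a randomly chosen donor value from the same segment
    return group.apply(lambda v: rng.choice(donors.values) if pd.isna(v) else v)

df["income_imputed"] = df.groupby("segment")["income"].transform(hot_deck)
print(df)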

Case Substitution In this method, entire observations with missing data are replaced by
choosing another non-sampled observation. A common example is to replace a sampled
household that cannot be contacted or that has extensive missing data with another household
not in the sample, preferably similar to the original observation. This method is most widely
used to replace observations with complete missing data, although it can be used to replace
observations with lesser amounts of missing data as well. At issue is the ability to obtain
these additional observations not included in the original sample.
IMPUTATION OF MCAR BY CALCULATING REPLACEMENT VALUES
The second basic approach involves calculating a replacement value from a set of
observations with valid data in the sample. The assumption is that a value derived from all
other observations in the sample is the most representative replacement value. These
methods, particularly mean substitution, are more widely used due to their ease in
implementation versus the use of known values.
Mean Substitution One of the most widely used methods, mean substitution replaces the
missing values for a variable with the mean value of that variable calculated from all valid
responses. The rationale of this approach is that the mean is the best single replacement
value. This approach, although it is used extensively, has several disadvantages. First, it
understates the variance estimates by using the mean value for all missing data. Second, the
actual distribution of values is distorted by substituting the mean for the missing values.
Third, this method depresses the observed correlation because all missing data will have a
single constant value. It does have the advantage, however, of being easily implemented and
providing all cases with complete information. A variant of this method is group mean
substitution, where observations with missing data are grouped on a second variable, and then
mean values for each group are substituted for the missing values within the group. It is many
times the default missing value imputation method due to its ease of implementation, but
researchers should be quite cautious in its use, especially as the extent of missing data
increases.
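A sketch of mean substitution and group mean substitution with pandas; the column names are illustrative.

import pandas as pd
import numpy as np

df = pd.DataFrame({"gender": ["F", "F", "M", "M", "F", "M"],
                   "age":    [23.0, np.nan, 31.0, np.nan, 27.0, 35.0]})

# simple mean substitution: every missing value gets the overall mean
df["age_mean_sub"] = df["age"].fillna(df["age"].mean())

# group mean substitution: missing values get the mean of their own group
df["age_group_mean_sub"] = df["age"].fillna(
    df.groupby("gender")["age"].transform("mean"))
print(df)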
Regression Imputation In this method, regression analysis is used to predict the missing
values of a variable based on its relationship to other variables in the dataset. First, a
predictive equation is formed for each variable with missing data and estimated from all cases
with valid data. Then, replacement values for each missing value are calculated from that
observation’s values on the variables in the predictive equation. Thus, the replacement value
is derived based on that observation’s values on other variables shown to relate to the missing
value. Although it has the appeal of using relationships already existing in the sample as the
basis of prediction, this method also has several disadvantages. First, it reinforces the
relationships already in the data. As the use of this method increases, the resulting data
become more characteristic of the sample and less generalizable. Second, unless stochastic
terms are added to the estimated values, the variance of the distribution is understated. Third,
this method assumes that the variable with missing data has substantial correlations with the
other variables. If these correlations are not sufficient to produce a meaningful estimate, then
other methods, such as mean substitution, are preferable. Fourth, the sample must be large
enough to allow for a sufficient number of observations to be used in making each prediction.
Finally, the regression procedure is not constrained in the estimates it makes. Thus, the

predicted values may not fall in the valid ranges for variables (e.g., a value of 11 may be
predicted for a 10-point scale) and require some form of additional adjustment.
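A sketch of deterministic regression imputation with scikit-learn; the variables, the simulated relationship, and the optional range adjustment are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["V1", "V2", "V3"])
df["V3"] = 0.6 * df["V1"] - 0.4 * df["V2"] + rng.normal(0, 0.5, 100)
df.loc[rng.choice(100, 15, replace=False), "V3"] = np.nan   # introduce missing values

valid = df["V3"].notna()
model = LinearRegression().fit(df.loc[valid, ["V1", "V2"]], df.loc[valid, "V3"])

# replace each missing value with the value predicted from the other variables
df.loc[~valid, "V3"] = model.predict(df.loc[~valid, ["V1", "V2"]])
# if needed, clip predictions to the valid range of the scale, e.g. df["V3"].clip(1, 10)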
IMPUTATION OF A MAR MISSING DATA PROCESS
If a nonrandom or MAR missing data pattern is found, the researcher should apply only one
remedy—the modeling approach specifically designed to deal with this. Application of any
other method introduces bias into the results. This set of procedures explicitly incorporates
the MAR missing data process into the analysis and exemplifies what has been termed the
“inclusive analysis strategy” which also includes auxiliary variables into the missing data
handling procedure.
Maximum Likelihood and EM The first approach involves maximum likelihood estimation
techniques that attempt to model the processes underlying the missing data and to make the
most accurate and reasonable estimates possible. Maximum likelihood is not a technique, but
a fundamental estimation methodology. However, its application in missing data analysis has
evolved based on two approaches. The first approach is the use of maximum likelihood
directly in the estimation of the means and covariance matrix as part of the model estimation
in covariance-based SEM. In these applications missing data estimation and model estimation
are combined in a single step. There is no imputation of missing data for individual cases, but
the missing data process is accommodated in the “imputed” matrices for model estimation.
The primary drawback to this approach is that imputed datasets are not available and it takes
more specialized software to perform.
A variation of this method employs maximum likelihood as well, but in an iterative process.
The EM method is a two-stage method (the E and M stages) in which the E stage makes the
best possible estimates of the missing data and the M stage then makes estimates of the
parameters (means, standard deviations, or correlations) assuming the missing data were
replaced. The process continues going through the two stages until the change in the
estimated values is negligible and they replace the missing data. One notable feature is that
this method can produce an imputed dataset, although it has been shown to underestimate the
standard errors in estimated models.
Multiple Imputation The procedure of multiple imputation is, as the name implies, a process
of generating multiple datasets with the imputed data differing in each dataset, to provide,
in the aggregate, both unbiased parameter estimates and correct estimates of the standard errors.
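One way to sketch multiple imputation in Python is to generate several differently imputed datasets with scikit-learn's IterativeImputer (sample_posterior=True introduces the random draws that make the imputations differ) and then pool the estimates across datasets; this is an illustration under assumed data, not the only possible implementation.

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["V1", "V2", "V3"])
df.loc[rng.choice(100, 20, replace=False), "V3"] = np.nan

m = 5  # number of imputed datasets
imputed_datasets = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    imputed_datasets.append(imputed)

# simple example of pooling: average an estimate (here the mean of V3) across the m datasets
pooled_mean_v3 = np.mean([d["V3"].mean() for d in imputed_datasets])
print("Pooled mean of V3:", pooled_mean_v3)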

Testing The Assumptions Of Multivariate Analysis


The final step in examining the data involves testing for the assumptions underlying the
statistical bases for multivariate analysis.
The need to test the statistical assumptions is increased in multivariate applications because
of two characteristics of multivariate analysis. First, the complexity of the relationships,

owing to the typical use of a large number of variables, makes the potential distortions and
biases more potent when the assumptions are violated, particularly when the violations
compound to become even more detrimental than if considered separately. Second, the
complexity of the analyses and results may mask the indicators of assumption violations
apparent in the simpler univariate analyses. In almost all instances, the multivariate
procedures will estimate the multivariate model and produce results even when the
assumptions are severely violated. Thus, the researcher must be aware of any assumption
violations and the implications they may have for the estimation process or the interpretation
of the results.
Four Important Statistical Assumptions
Multivariate techniques and their univariate counterparts are all based on a fundamental set of
assumptions representing the requirements of the underlying statistical theory. Although
many assumptions or requirements come into play in one or more of the multivariate
techniques we discuss in the text, four of them potentially affect every univariate and
multivariate statistical technique.
1. Normality The most fundamental assumption in multivariate analysis is normality,
referring to the shape of the data distribution for an individual metric variable and its
correspondence to the normal distribution, the benchmark for statistical methods. If the
variation from the normal distribution is sufficiently large, all resulting statistical tests are
invalid, because normality is required to use the F and t statistics.
Univariate Versus Multivariate Normality Univariate normality for a single variable is
easily tested, and a number of corrective measures are possible. In a simple sense,
multivariate normality (the combination of two or more variables) means that the individual
variables are normal in a univariate sense and that their combinations are also normal. Thus,
if a variable is multivariate normal, it is also univariate normal. However, the reverse is not
necessarily true (two or more univariate normal variables are not necessarily multivariate
normal). Thus, a situation in which all variables exhibit univariate normality will help gain,
although not guarantee, multivariate normality. Multivariate normality is more difficult to test,
but specialized tests are available in the techniques most affected by departures from
multivariate normality. In most cases assessing and achieving univariate normality for all
variables is sufficient, and we will address multivariate normality only when it is especially
critical. Even though large sample sizes tend to diminish the detrimental effects of
nonnormality, the researcher should always assess the normality for all metric variables
included in the analysis.
Assessing The Impact Of Violating The Normality Assumption The severity of non-
normality is based on two dimensions: the shape of the offending distribution and the sample
size. The researcher must not only judge the extent to which the variable’s distribution is non-
normal, but also the sample sizes involved. What might be considered unacceptable at small
sample sizes will have a negligible effect at larger sample sizes.
Impacts Due to the Shape of the Distribution How can we describe the distribution if it
differs from the normal distribution? The shape of any distribution can be described by two
measures: kurtosis and skewness. Kurtosis refers to the “peakedness” or “flatness” of the
measures: kurtosis and skewness. Kurtosis refers to the “peakedness” or “flatness” of the
distribution compared with the normal distribution. Distributions that are taller or more

peaked than the normal distribution are termed leptokurtic, whereas a distribution that is
flatter is termed platykurtic. Whereas kurtosis refers to the height of the distribution,
skewness is used to describe the balance of the distribution; that is, is it unbalanced and
shifted to one side (right or left) or is it centered and symmetrical with about the same shape
on both sides? If a distribution is unbalanced, it is skewed. A positive skew denotes a
distribution whose bulk is shifted to the left (with a longer tail to the right), whereas a
negative skew reflects a bulk shifted to the right (with a longer tail to the left).
Once we know how to describe the distribution, the next issue is how to determine the
extent to which it departs from normality on these characteristics. Both skewness and kurtosis have
empirical measures that are available in all statistical programs. In most programs, the
skewness and kurtosis of a normal distribution are given values of zero. Then, values above
or below zero denote departures from normality. For example, negative kurtosis values
indicate a platykurtic (flatter) distribution, whereas positive values denote a leptokurtic
(peaked) distribution. Likewise, positive skewness values indicate a distribution concentrated to
the left (tail extending to the right), and negative values denote a concentration to the right
(tail extending to the left).
Impacts Due to Sample Size Even though it is important to understand how the distribution
departs from normality in terms of shape and whether these values are large enough to
warrant attention, the researcher must also consider the effects of sample size. The sample
size has the effect of increasing statistical power by reducing sampling error. Larger sample
sizes reduce the detrimental effects of non-normality. In small samples of 50 or fewer
observations, and especially if the sample size is less than 30 or so, significant departures
from normality can have a substantial impact on the results. For sample sizes of 200 or more,
however, these same effects may be negligible. Moreover, when group comparisons are
made, such as in ANOVA, the differing sample sizes between groups, if large enough, can
even cancel out the detrimental effects. Thus, in most instances, as the sample sizes become
large, the researcher can be less concerned about non-normal variables, except as they might
lead to other assumption violations that do have an impact in other ways.
Tests Of Normality Researchers have a number of different approaches to assess normality,
but they primarily can be classified as either graphical or statistical. Graphical methods were
developed to enable normality assessment without the need for complex computations. They
provide the researcher with a more “in depth” perspective of the distributional characteristics
than a single quantitative value, but they are also limited in making specific distinctions since
graphical interpretations are less precise than statistical measures.
Graphical Analyses The simplest diagnostic test for normality is a visual check of the
histogram that compares the observed data values with a distribution approximating the
normal distribution. Although appealing because of its simplicity, this method is problematic
for smaller samples, where the construction of the histogram (e.g., the number of categories
or the width of categories) can distort the visual portrayal to such an extent that the analysis is
useless. A more reliable approach is the normal probability plot, which compares the
cumulative distribution of actual data values with the cumulative distribution of a normal
distribution. The normal distribution forms a straight diagonal line, and the plotted data
values are compared with the diagonal. If a distribution is normal, the line representing the
actual data distribution closely follows the diagonal.
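A sketch of both graphical checks in Python, using an illustrative right-skewed variable; the simulated data and plot layout are assumptions for demonstration.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)
x = rng.gamma(shape=2.0, scale=2.0, size=200)   # an illustrative, right-skewed variable

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# histogram compared against a normal curve with the same mean and standard deviation
axes[0].hist(x, bins=20, density=True, alpha=0.6)
grid = np.linspace(x.min(), x.max(), 200)
axes[0].plot(grid, stats.norm.pdf(grid, x.mean(), x.std()))
axes[0].set_title("Histogram vs. normal curve")

# normal probability plot: points should follow the diagonal if the data are normal
stats.probplot(x, dist="norm", plot=axes[1])
axes[1].set_title("Normal probability plot")
plt.tight_layout()
plt.show()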
Statistical Tests In addition to examining the normal probability plot, one can also use
statistical tests to assess normality. A simple test is a rule of thumb based on the skewness and

kurtosis values (available as part of the basic descriptive statistics for a variable computed by
all statistical programs). The statistic value (z) for the skewness value is calculated as:

z_skewness = skewness / sqrt(6 / N)

where N is the sample size. A z value can also be calculated for the kurtosis value using the
following formula:

z_kurtosis = kurtosis / sqrt(24 / N)

If either calculated z value exceeds the specified critical value, then the distribution is non-
normal in terms of that characteristic. The critical value is from a z distribution, based on the
significance level we desire. The most commonly used critical values are ±2.58 (.01
significance level) and ±1.96, which corresponds to a .05 error level. With these simple tests,
the researcher can easily assess the degree to which the skewness and peakedness of
the distribution vary from the normal distribution.
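A sketch of these z statistics in Python with scipy; the simulated variable is illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.gamma(2.0, 2.0, size=200)
N = len(x)

z_skew = stats.skew(x) / np.sqrt(6 / N)
z_kurt = stats.kurtosis(x) / np.sqrt(24 / N)   # excess kurtosis (normal distribution = 0)

print(f"z_skewness = {z_skew:.2f}, z_kurtosis = {z_kurt:.2f}")
# compare against +/-1.96 (.05 level) or +/-2.58 (.01 level)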
Specific statistical tests for normality are also available in all the statistical programs. The
two most common are the Shapiro-Wilk test and a modification of the Kolmogorov–
Smirnov test. Each calculates the level of significance for the differences from a normal
distribution. The researcher should always remember that tests of significance are less useful
in small samples (fewer than 30) and quite sensitive in large samples (exceeding 1,000
observations).
Thus, the researcher should always use both the graphical plots and any statistical tests to
assess the actual degree of departure from normality.
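A sketch of the formal tests in Python, using scipy's Shapiro-Wilk test and the Lilliefors-corrected Kolmogorov–Smirnov test from statsmodels; the simulated variable is illustrative.

import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(6)
x = rng.gamma(2.0, 2.0, size=200)

sw_stat, sw_p = stats.shapiro(x)
ks_stat, ks_p = lilliefors(x, dist="norm")

print(f"Shapiro-Wilk p = {sw_p:.4f}, Lilliefors K-S p = {ks_p:.4f}")
# small p-values indicate a significant departure from normality, subject to the
# small-sample and large-sample caveats noted above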
2. Homoscedasticity The next assumption is related primarily to dependence relationships
between variables. Homoscedasticity refers to the assumption that dependent variable(s)
exhibit equal levels of variance across the range of predictor variable(s). Homoscedasticity is
desirable because the variance of the dependent variable being explained in the dependence
relationship should not be concentrated in only a limited range of the independent values. In
most situations, we have many different values of the dependent variable at each value of the
independent variable. For this relationship to be fully captured, the dispersion (variance) of
the dependent variable values must be relatively equal at each value of the predictor variable.
If this dispersion is unequal across values of the independent variable, the relationship is said
to be heteroscedastic.
Tests For Homoscedasticity As found for normality, there are a series of graphical and
statistical tests for identifying situations impacted by heteroscedasticity. The researcher
should employ both methods where the graphical methods provide a more in-depth
understanding of the overall relationship involved and the statistical tests provide increased
precision.
Graphical Tests of Equal Variance Dispersion The test of homoscedasticity for two
metric variables is best examined graphically. Departures from an equal dispersion are shown

by such shapes as cones (small dispersion at one side of the graph, large dispersion at the
opposite side) or diamonds (a large number of points at the center of the distribution). The
most common application of graphical tests occurs in multiple regression, based on the
dispersion of the dependent variable across the values of the metric independent
variables.
Boxplots work well to represent the degree of variation between groups formed by a
categorical variable. The length of the box and the whiskers each portray the variation of data
within that group. Thus, heteroscedasticity would be portrayed by substantial differences in
the length of the boxes and whiskers between groups representing the dispersion of
observations in each group.
Statistical Tests for Homoscedasticity The statistical tests for equal variance dispersion
assess the equality of variances within groups formed by nonmetric variables. The most
common test, the Levene test, is used to assess whether the variances of a single metric
variable are equal across any number of groups. If more than one metric variable is being
tested, so that the comparison involves the equality of variance/covariance matrices, the
Box’s M test is applicable. The Box’s M test is available in both multivariate analysis of
variance and discriminant analysis and is discussed in more detail in later chapters pertaining
to these techniques.
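A sketch of the Levene test with scipy, using three illustrative groups; Box's M is not included in scipy and is usually obtained from the MANOVA or discriminant analysis routines of a statistics package.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(50, 5, 80)      # groups formed by a nonmetric (categorical) variable
group_b = rng.normal(50, 9, 80)      # this group deliberately has a larger variance
group_c = rng.normal(50, 5, 80)

stat, p_value = stats.levene(group_a, group_b, group_c)
print(f"Levene statistic = {stat:.2f}, p = {p_value:.4f}")
# a significant result indicates heteroscedasticity (unequal variances across groups)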
3. Linearity An implicit assumption of all multivariate techniques based on correlational
measures of association, including multiple regression, logistic regression, factor analysis,
and structural equation modeling, is linearity. Because correlations represent only the linear
association between variables, nonlinear effects will not be represented in the correlation
value. This omission results in an underestimation of the actual strength of the relationship. It
is always prudent to examine all relationships to identify any departures from linearity that
may affect the correlation.
Identifying Nonlinear Relationships The most common way to assess linearity is to
examine scatterplots of the variables and to identify any nonlinear patterns in the data. Many
scatterplot programs can show the straight line depicting the linear relationship, enabling the
researcher to better identify any nonlinear characteristics. An alternative approach is to run a
simple regression analysis and to examine the residuals. The residuals reflect the unexplained
portion of the dependent variable; thus, any nonlinear portion of the relationship will show up
in the residuals. A third approach is to explicitly model a nonlinear relationship by testing
alternative model specifications (also known as curve fitting) that reflect the nonlinear
elements.
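A sketch of the residual-based check with statsmodels and matplotlib, using a deliberately nonlinear illustrative relationship.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 150)
y = 2 + 0.5 * x**2 + rng.normal(0, 2, 150)    # a deliberately nonlinear relationship

model = sm.OLS(y, sm.add_constant(x)).fit()   # fit the (misspecified) linear model
residuals = model.resid

# a curved pattern in this plot indicates that a nonlinear component remains unexplained
plt.scatter(model.fittedvalues, residuals, alpha=0.6)
plt.axhline(0, linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residual plot: curvature indicates nonlinearity")
plt.show()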
4. Absence of Correlated Errors: Predictions in any of the dependence techniques are not
perfect, and we will rarely find a situation in which they are. The researcher must, however,
ensure that the prediction errors are uncorrelated with one another. Correlated errors most
often arise from the data collection process (for example, when observations are grouped or
collected over time); if such a pattern is identified, the responsible factor must be incorporated
into the analysis or the errors otherwise adjusted for, because correlated errors bias the results
of the statistical tests.

Incorporating nonmetric data with dummy variables


In many instances nonmetric data must be used as independent variables. The researcher has
available a method for using dichotomous variables, known as dummy variables, which act as
replacement variables for the nonmetric variable. Any nonmetric variable with k categories can
be represented as k-1 dummy variables. In constructing dummy variables, two approaches

can be used to represent the categories, and more importantly, the category that is omitted,
known as the reference category or comparison group.
1. The first approach is known as indicator coding. An important consideration is the
reference category, the category that receives all zeros on the dummy variables. The
deviations represent the differences between the dependent variable mean score of each
category and that of the comparison group. This form is most appropriate when there is a
logical comparison group.
2. An alternative method is effects coding. It is the same as indicator coding except that the
comparison group is given a score of -1 instead of 0 on the dummy variables.
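A sketch of both coding schemes with pandas; the category names are illustrative, and the omitted category ("North") serves as the reference group.

import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "West", "South", "North", "West"]})

# indicator coding: drop_first omits one category ("North"), which becomes the reference group,
# leaving k - 1 = 2 dummy variables for k = 3 categories
indicator = pd.get_dummies(df["region"], prefix="region", drop_first=True).astype(int)

# effects coding: identical, except the reference category is coded -1 on every dummy variable
effects = indicator.copy()
effects.loc[df["region"] == "North", :] = -1

print(pd.concat([df, indicator.add_suffix("_ind"), effects.add_suffix("_eff")], axis=1))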

UNIT 3

Introduction – Multiple Linear Regression Analysis – Basic concepts – Multiple linear
regression model – Least square estimation – Inferences from the estimated regression
function
https://www.youtube.com/watch?v=DKAv2X09dvE

FOR CLASS
Missing completely at random (MCAR)
Missing completely at random, or MCAR, is missingness that has no
association with any data you have observed, or not observed. In other
words, the cause of the missingness can be considered truly random,
and unrelated to observed or unobserved variables meaningful to the
data and your analyses.
For example, imagine you are a tornado researcher. You are
determined to deploy small devices into a tornado that, when
suspended in the tornado, will record windspeeds and dynamics (yes -
this is the plot of the classic film Twister starring Bill Paxton). One
day while driving to try to launch your devices, your car runs out of
gas, and you are unable to obtain windspeed readings. Those
unrecorded windspeeds show up as NA in the dataset for that tornado.
In this case, the cause of the missingness (running out of gas)
is unrelated to tornado windspeeds - it can be considered a truly
“random” cause of missingness, or missing completely at random
(MCAR).
An important distinction: MCAR does not mean there is “no
reason” for missingness. In this example, windspeed is missing for
this tornado because you ran out of gas. It is still MCAR because the
cause of missingness is unrelated to tornado windspeed in a
meaningful way.
Critical thinking: Imagining that you are the tornado researcher in
the example above, what other hypothetical causes may result in
tornado windspeeds being missing completely at random (MCAR)?
How might MCAR appear in data?
A hypothetical example of how we might want MCAR to appear for
the max_windspeed variable is shown below:

tornado_id   max_windspeed   notes
T-101        214             devices deployed successfully
T-102        NA              car ran out of gas; devices not deployed
T-103        183             devices deployed successfully
T-104        NA              road closed for unrelated construction
In the dataset above, we see two missing values (NA) in
the max_windspeed column. For each, the comments in
the notes column describe reasons for missingness that are unrelated
to tornado windspeeds, and can thus be considered MCAR.

Missing at random (MAR)


Missing at random (MAR) occurs when missingness depends on data
you have observed, but not on unobserved data.
Returning to our Twister tornado example: Imagine that you are again
driving to release your wind speed devices into a tornado. Due to
heavy rainfall, however (for which you do have data), several river
crossings are flooded and you are unable to safely approach the
tornado. Therefore, missingness in wind speed is due to another
recorded variable in the data (rainfall, recorded as daily_precip_mm).
In this case, wind speed is Missing at Random because it is dependent
on another recorded variable.
Missing not at random (MNAR)
MNAR explanation
If missingness within a variable is related to unobserved data
(including values of the missing variable itself), the missingness is
missing not at random (MNAR).
Let’s again envision that we are Bill Paxton, driving out to a tornado to
release our devices that record wind speed. In this scenario, the
tornado wind speeds are so high that upon approaching the tornado
our truck is tipped over, thwarting our efforts to release the devices.
Therefore, we are missing wind speed data for the tornado because
the wind speeds were so high.
Because the missingness in wind speed depends on the unrecorded
high values of wind speed, the values are missing not at random.

Missing completely at random (MCAR)
As it says, values are randomly missing from your dataset. Missing
data values do not relate to any other data in the dataset and there is
no pattern to the actual values of the missing data themselves.
For instance, when smoking status is not recorded in a random subset
of patients.
This is easy to handle, but unfortunately, data are almost never
missing completely at random.
Missing at random (MAR)
This is confusing and would be better stated as missing conditionally
at random. Here, missing data do have a relationship with other
variables in the dataset. However, the actual values that are missing
are random.

For example, smoking status is not documented in female patients because the doctor was
too shy to ask. Yes, OK, not that realistic!
Missing not at random (MNAR)
The pattern of missingness is related to other variables in the dataset,
but in addition, the values of the missing data are not random.
For example, when smoking status is not recorded in patients
admitted as an emergency, who are also more likely to have worse
outcomes from surgery.
Missing not at random data are important, can alter your conclusions,
and are the most difficult to diagnose and handle. They can only be
detected by collecting and examining some of the missing data. This
is often difficult or impossible to do.
 MCAR (Missing Completely At Random)

Missing data are distributed completely at random: the missing
values in a feature are not related to the values of other features, and no
pattern can be found in the observations where the values are missing.
An example is when some forms fail to be submitted completely
because of sudden, temporary damage to some of the servers. It has been
said that this type of missingness is not very realistic and is rare in practice.
 MAR (Missing At Random)
Missing data have a relationship with other features in the dataset;
however, the actual values that are missing are random. An example is when the
client's age is not recorded by the teller in a random subset of the data
(there is no systematic reason for the missingness). In this case we
might be able to calculate or predict the missing ages from the other
features, such as date of birth.
 MNAR (Missing Not At Random)
This means the missing data have a relationship with the subject of the
dataset, and the features with missing values also have relationships with
other features. This type of missing data is considered "not
ignorable". It is important to investigate the reason for the missingness
(as opposed to the other two types, which are "ignorable"). An example
is again the missing ages for some of the patients who are
admitted as an emergency. We might need to investigate why the age is not
recorded for some patients; those observations can reveal a new trend in our dataset.
https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/

https://www.slideshare.net/slideshow/missing-data-and-non-response-pdf/70564448

https://www.ques10.com/p/66524/multiple-linear-regression-problem-1/#google_vignette

