Data Preparation and Analysis
Unit 4
Chi-Square
Goodness of fit – tests whether the occurrences in the sample are similar to the occurrences in the population (e.g. a card game with Win–Loss–Tie outcomes).
For finding association: Nominal vs Nominal (test of independence).
Equality of proportions: e.g. comparing car customers' satisfaction across groups.
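A minimal sketch of both chi-square uses with SciPy, assuming made-up win/loss/tie counts and a hypothetical satisfaction-by-brand table (not data from the slides):

```python
from scipy import stats

# Goodness of fit: are observed Win-Loss-Tie counts consistent with the
# population proportions (here assumed equal)?
observed = [18, 12, 10]                      # hypothetical counts from 40 games
chi2, p = stats.chisquare(observed)          # expected defaults to equal frequencies
print(f"Goodness of fit: chi2={chi2:.2f}, p={p:.3f}")

# Association between two nominal variables: car brand vs satisfaction level
# (hypothetical cross-tabulated counts).
table = [[30, 20, 10],                       # Brand A: satisfied / neutral / dissatisfied
         [25, 25, 15]]                       # Brand B
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"Test of independence: chi2={chi2:.2f}, dof={dof}, p={p:.3f}")
```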
T-test
One-sample t-test – tests whether the mean of the sample is equal to the hypothesized mean of the population.
Paired-sample t-test – used when:
1. Samples are selected in pairs for the study, e.g. mother and child, husband and wife.
2. The same subjects are measured before and after training, to measure effectiveness.
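A minimal sketch of both t-tests with SciPy, using hypothetical values (not data from the slides):

```python
from scipy import stats

# One-sample t-test: is the sample mean equal to a hypothesized population mean?
incomes = [22, 27, 31, 24, 29, 26, 33, 25, 28, 30]     # hypothetical incomes ('000s)
t, p = stats.ttest_1samp(incomes, popmean=25)
print(f"One-sample: t={t:.2f}, p={p:.3f}")

# Paired-sample t-test: before vs after training for the same subjects.
before = [62, 70, 58, 65, 73, 68, 60, 66]
after  = [66, 74, 57, 70, 75, 71, 63, 69]
t, p = stats.ttest_rel(before, after)
print(f"Paired: t={t:.2f}, p={p:.3f}")
```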
The Data Preparation Process
Data Editing
Data Coding
Data Classification
Data Tabulation
Coding examples:
26. Family size (variable X26): One to two = 1, Three to five = 2, Six & more = 3
Age group categories: 20-25, 26-30, 31-35, 36-40, 41-45, 46 & Above
Exploratory Data Analysis…
Bar Charts
[Bar chart: frequency distribution of Age Group (20-25, 26-30, 31-35, 36-40, 41-45, 46 & Above)]
Exploratory Data Analysis…
Histograms
[Histogram of purchase in gms: Mean = 18.36, Std. Dev. = 6.56, N = 15]
Exploratory Data Analysis
Stem and Leaf Diagrams
A stem-and-leaf diagram shows the individual data values in each set, as against the histogram, which presents only group aggregates.

Stem | Leaf
13   | 1 3 3 9
15   | 6 6 8
16   | 3 6
17   | 3 7
18   | 2
22   | 2
31   | 0
35   | 6
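A minimal sketch that builds a stem-and-leaf display like the one above, assuming illustrative purchase values with leaves taken as tenths of a gram (the raw data behind the slide are not reproduced):

```python
from collections import defaultdict

# Hypothetical purchase values (in gms); stems are the integer parts,
# leaves are the first decimal digits, mirroring the display above.
values = [13.1, 13.3, 13.3, 13.9, 15.6, 15.6, 15.8, 16.3, 16.6,
          17.3, 17.7, 18.2, 22.2, 31.0, 35.6]

stems = defaultdict(list)
for v in sorted(values):
    stem, leaf = divmod(round(v * 10), 10)   # 13.1 -> stem 13, leaf 1
    stems[stem].append(str(leaf))

for stem in sorted(stems):
    print(f"{stem:>3} | {' '.join(stems[stem])}")
```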
Qualitative Vs Quantitative
Data Analyses
Qualitative Data Analysis
Qualitative research is ‘any kind of research
that produces findings not arrived at by means
of statistical procedures or other means of
quantification’
Here, the phenomenon of interest unfolds naturally and the researcher does not attempt to manipulate it.
Qualitative research seeks understanding and exploration and seldom seeks 'causality'.
Quantitative Data Analysis
Quantitative research employs experimental methods and quantitative measures to test hypothetical generalizations.
Here, the emphasis is on facts and causes of behavior.
The information is in the form of numbers
that can be quantified and summarized
Quantitative analysis uses mathematical
and statistical techniques for analyzing
and interpreting numerical data
Univariate, Bivariate and
Multivariate Statistical
Techniques
Meaning of Uni- Bi- &
Multivariate Analysis of Data
Univariate Analysis – One variable is analyzed at a
time
Bivariate Analysis – Two variables are analyzed
together and examined for any possible association
between them
Multivariate Analysis – More than two variables are
analyzed at a time
The type of statistical techniques used for analyzing
univariate and bivariate data depends upon the level
of measurements of the questions pertaining to those
variables
Further, the data analysis could be of two types,
namely, descriptive and inferential.
Descriptive vs Inferential Analysis
Descriptive analysis deals with summary measures
relating to the sample data
The common ways of summarizing data are by
calculating average, range, standard deviation, frequency
and percentage distribution
The first thing to do when data analysis is taken up is
to describe the sample
Examples of Descriptive Analysis:
What is the average income of the sample?
What is the standard deviation of incomes in the sample?
What percentage of sample respondents are married?
What is the median age of the sample respondents?
Is there any association between the frequency of purchase of
product and income level of the consumers?
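A minimal descriptive-analysis sketch with pandas, using a hypothetical sample (not data from the slides):

```python
import pandas as pd

# Hypothetical sample of respondents
df = pd.DataFrame({
    "income":  [18000, 22000, 25000, 31000, 27000, 40000, 21000, 35000],
    "age":     [24, 31, 29, 45, 38, 52, 27, 41],
    "married": ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"],
})

print(df["income"].mean(), df["income"].std())        # average and SD of income
print(df["age"].median())                             # median age
print(df["married"].value_counts(normalize=True))     # percentage married / not married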
Descriptive vs Inferential Analysis …
Types of Descriptive Analysis
The table below presents the type of descriptive analysis that is applicable under each form (level) of measurement. [Table not reproduced; the applicable measures are listed in the slides on univariate descriptive analysis that follow.]
Descriptive vs Inferential Analysis
Inferential Analysis:
Under inferential statistics, inferences are drawn on
population parameters based on sample results
The researcher tries to generalize the results to the
population based on sample results
Examples of Inferential Analysis:
Is the average income of population significantly
greater than 25,000 per month?
Is the job satisfaction of unskilled workers
significantly related with their pay packet?
Are consumption expenditure and disposable
income of households significantly correlated?
Do urban and rural households differ significantly in
terms of average monthly expenditure on food?
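One of these inferential questions (significance of the correlation between consumption expenditure and disposable income) can be sketched with SciPy, assuming hypothetical household data:

```python
from scipy import stats

# Hypothetical household data: disposable income vs consumption expenditure ('000s per month)
income      = [20, 25, 32, 40, 28, 35, 45, 50, 30, 38]
consumption = [15, 18, 24, 30, 20, 27, 33, 37, 23, 28]

r, p = stats.pearsonr(income, consumption)
print(f"r = {r:.2f}, p = {p:.4f}")      # small p => the two are significantly correlated
```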
Descriptive Analysis of Univariate Data
Measures of Central Tendency
Arithmetic mean (appropriate for
Interval and Ratio scale data)
Median (appropriate for Ordinal,
Interval and Ratio scale data)
Mode (appropriate for Ordinal, Interval
and Ratio scale data)
Descriptive Analysis of Univariate Data …
Measures of Dispersion
Range (appropriate for Interval and Ratio
scale data)
Variance and Standard Deviation
(appropriate for interval and ratio scale data)
Coefficient of variation (appropriate for Ratio
scale data)
Relative and absolute frequencies
(appropriate for Nominal scale data)
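A minimal sketch of these dispersion measures in NumPy, assuming a small hypothetical set of ratio-scale values:

```python
import numpy as np

data = np.array([12.0, 15.5, 14.0, 18.2, 20.1, 16.4, 22.3, 19.0])  # hypothetical ratio-scale data

print("range:", data.max() - data.min())
print("variance:", data.var(ddof=1))                          # sample variance
print("std dev:", data.std(ddof=1))
print("coeff. of variation:", data.std(ddof=1) / data.mean())  # meaningful only for ratio data
```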
Descriptive Analysis of Bivariate Data
Preparation of cross-tables
Interpretation of cross-tables
First, identify the dependent and the independent variable.
Percentages should be computed in the direction of the independent variable.
The dependent and independent variables can be placed either in the rows or in the columns (a minimal cross-tab sketch follows below).
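A minimal cross-tabulation sketch with pandas, assuming hypothetical data in which income level is the independent variable:

```python
import pandas as pd

# Hypothetical data: income level (independent) vs purchase frequency (dependent)
df = pd.DataFrame({
    "income_level":  ["low", "low", "high", "high", "low", "high", "low", "high"],
    "purchase_freq": ["rare", "rare", "often", "often", "often", "often", "rare", "rare"],
})

# Percentages computed in the direction of the independent variable (here the columns)
table = pd.crosstab(df["purchase_freq"], df["income_level"], normalize="columns") * 100
print(table.round(1))
```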
Illustration: Cross-tabulation [table not reproduced] (Ans: 0.915)
More on Analysis of Data
Calculating summarized rank order
The rankings of attributes while choosing
a restaurant for dinner for 32 respondents
can be presented in the form of
frequency distribution in the table below:
More on Analysis of Data …
To calculate a summary rank ordering, the attribute with the
first rank is given the lowest number (1) and the least
preferred attribute is given the highest number (5)
The summarized rank order is obtained by computing, for each attribute, the mean and the standard deviation (SD) of the assigned ranks.
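The frequency table from the slide is not reproduced, so a minimal sketch with illustrative counts for one attribute (32 respondents, as in the example) shows how the mean rank and SD would be computed:

```python
import numpy as np

# Hypothetical frequency distribution of ranks (1 = most preferred ... 5 = least preferred)
# given by 32 respondents to one attribute, e.g. "food quality".
ranks  = np.array([1, 2, 3, 4, 5])
counts = np.array([12, 8, 6, 4, 2])          # sums to 32

mean_rank = (ranks * counts).sum() / counts.sum()
sd_rank   = np.sqrt(((ranks - mean_rank) ** 2 * counts).sum() / (counts.sum() - 1))
print(f"mean rank = {mean_rank:.2f}, SD = {sd_rank:.2f}")
# The attribute with the lowest mean rank is the most preferred overall.
```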
Run Test for Randomness
The hypothesis to be tested is:
H0: The sequence of observations is random
H1: The sequence of observations is not random
n1 = number of women = 8
n2 = number of men = 9
r = number of runs = 9
Runs tests are always two-tailed tests because
the question to be answered is whether there are
too many or too few runs
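A minimal sketch of the runs test for these values, using the standard normal approximation for the number of runs (the mean and variance formulas below are the usual ones, not reproduced on the slide):

```python
import math
from scipy import stats

n1, n2, r = 8, 9, 9          # women, men, and observed number of runs (from the slide)

mu_r    = 2 * n1 * n2 / (n1 + n2) + 1
sigma_r = math.sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) /
                    ((n1 + n2) ** 2 * (n1 + n2 - 1)))
z = (r - mu_r) / sigma_r
p = 2 * (1 - stats.norm.cdf(abs(z)))          # two-tailed, as noted above
print(f"z = {z:.2f}, p = {p:.3f}")            # large p => randomness is not rejected
```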
Two-Sample Sign Test
This test is based upon the sign of a pair of observations, i.e. on the direction of the difference and not on its magnitude.
Suppose a sample of respondents is selected and
their views on the image of a company are sought.
After some time, these respondents are shown an
advertisement, and thereafter, the data is again
collected on the image of the company.
For respondents whose image of the company has improved, a positive sign is assigned; for those whose image has declined, a negative sign is assigned; and where there is no change, the corresponding observation is dropped from the analysis and the sample size is reduced accordingly.
Two-Sample Sign Test …
The key concept underlying the test is that if
the advertisement is not effective in improving
the image of the company, the number of
positive signs should be approximately equal
to the number of negative signs
For small samples, a binomial distribution
could be used, whereas for a large sample,
the normal approximation to the binomial
distribution could be used.
Illustration: Two Sample Sign Test
40 College Juniors evaluate the effectiveness
of two types of classes
Long Lectures by full professors
Short lectures by graduate assistants
Respondent No.   1  2  3  4  5  6  7  8  9  10
Score for LL     2  1  4  4  3  3  4  2  4  1
Score for SL     3  2  2  3  4  2  2  1  3  1
Sign (LL − SL)   -  -  +  +  -  +  +  +  +  0
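For this illustration there are 6 positive signs, 3 negative signs and 1 tie (dropped). A minimal sketch of the sign test using SciPy's exact binomial test (the sample is small, so the binomial rather than the normal approximation is appropriate):

```python
from scipy import stats

# Signs of (LL - SL) from the table: 6 positive, 3 negative, 1 tie dropped
plus, minus = 6, 3
result = stats.binomtest(plus, n=plus + minus, p=0.5)   # exact two-sided binomial test
print(f"p-value = {result.pvalue:.3f}")   # well above 0.05 => no significant difference
```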
Wilcoxon Signed-Rank Test: unlike the sign test, it uses the magnitude as well as the direction of the differences and is, therefore, a more powerful test than the sign test.
Illustration: WSRTPS
A Sample of 16 salesmen was selected in
an organization and their score on
performance appraisal was noted.
The salesmen were sent for a 3-week
training and in the next appraisal, their
scores were noted again.
The appraisal scores before and after the
training are given in the next slide
Use a 5% level of significance to test the
hypothesis that the training has not caused
any change in the performance appraisal
score
Illustration: WSRTPS …
Salesman 1 2 3 4 5 6 7 8
Scores Before 85 76 64 59 72 68 43 54
Scores After 82 79 68 52 75 69 40 53
Salesman 9 10 11 12 13 14 15 16
Scores Before 57 61 71 82 39 51 54 57
Scores After 50 67 74 83 54 59 51 58
Solution:
Ho: There is no difference in the appraisal
score because of training
H1: There is a difference in the appraisal score because of training
S.No  Score Before  Score After  Diff  |Diff|  Rank of |Diff|  –ve Rank  +ve Rank
 1        85            82        -3      3        7.5           7.5
 2        76            79        +3      3        7.5                      7.5
 3        64            68        +4      4       11.0                     11.0
 4        59            52        -7      7       13.5          13.5
 5        72            75        +3      3        7.5                      7.5
 6        68            69        +1      1        2.5                      2.5
 7        43            40        -3      3        7.5           7.5
 8        54            53        -1      1        2.5           2.5
 9        57            50        -7      7       13.5          13.5
10        61            67        +6      6       12.0                     12.0
11        71            74        +3      3        7.5                      7.5
12        82            83        +1      1        2.5                      2.5
13        39            54       +15     15       16.0                     16.0
14        51            59        +8      8       15.0                     15.0
15        54            51        -3      3        7.5           7.5
16        57            58        +1      1        2.5                      2.5
Sum of ranks:                                                  T– = 52   T+ = 84
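A minimal check of this illustration with SciPy (because of the tied differences SciPy may warn and fall back to a normal approximation for the p-value):

```python
from scipy import stats

before = [85, 76, 64, 59, 72, 68, 43, 54, 57, 61, 71, 82, 39, 51, 54, 57]
after  = [82, 79, 68, 52, 75, 69, 40, 53, 50, 67, 74, 83, 54, 59, 51, 58]

stat, p = stats.wilcoxon(before, after)
print(f"T = {stat}, p = {p:.3f}")   # T should equal the smaller rank sum (52 here)
```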
Wilcoxon Signed-Rank Test for
Paired Samples
The test procedure is outlined in the following steps:
i. Let di denote the difference in the score for the ith
matched pair. Retain signs, but discard any pair for
which d = 0.
ii. Ignoring the signs of difference, rank all the di’s from the
lowest to highest. In case the differences have the
same numerical values, assign to them the mean of the
ranks involved in the tie.
iii. To each rank, prefix the sign of the difference.
iv. Compute the sum of the absolute value of the negative
and the positive ranks to be denoted as T– and T+
respectively.
v. Let T be the smaller of the two sums found in step iv.
Wilcoxon Signed-Rank Test for
Paired Samples
When the number of pairs of observations (n) for which the difference is not zero is greater than 15, the T statistic follows an approximately normal distribution.
The mean μT and standard deviation σT of T are given by:
μT = n(n + 1) / 4
σT = √[ n(n + 1)(2n + 1) / 24 ]
Z = (T − μT) / σT
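Applying these formulas to the 16 salesmen above (T = 52, the smaller rank sum), a quick check:

```python
import math

n, T = 16, 52
mu_T    = n * (n + 1) / 4                              # 68.0
sigma_T = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)    # about 19.34
z = (T - mu_T) / sigma_T                               # about -0.83
print(round(z, 2))   # |z| < 1.96, so H0 (no change due to training) is not rejected at 5%
```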
Wilcoxon Signed-Rank Test for
Paired Samples
For a given level of significance α, the absolute
sample Z should be greater than the absolute Zα /2 to
reject the null hypothesis.
For a one-sided upper tail test, the null hypothesis is
rejected if the sample Z is greater than Zα and for a
one-sided lower tail test, the null hypothesis is
rejected if sample Z is less than – Zα.
The Kruskal-Wallis Test
The Kruskal-Wallis test is, in fact, a non-parametric counterpart to the one-way ANOVA.
The test is an extension of the Mann-Whitney U
test.
Both of them require that the scale of the
measurement of a sample value should be at
least ordinal.
The hypotheses to be tested in the Kruskal-Wallis test are:
H0: The k populations have identical probability distributions.
H1: At least two of the populations differ in location.
The Kruskal-Wallis Test
The procedure for the test is listed below:
i. Obtain random samples of size n1, ..., nk from
each of the k populations. Therefore, the total
sample size is
n = n1 + n2 + ... + nk
ii. Pool all the samples and rank them, with the
lowest score receiving a rank of 1. Ties are to
be treated in the usual fashion by assigning an
average rank to the tied positions.
iii. Compute the Kruskal-Wallis statistic
H = [12 / (n(n + 1))] Σ (Ri² / ni) − 3(n + 1),
where Ri is the sum of the ranks in the ith sample; H follows a χ² distribution with k − 1 degrees of freedom.
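A minimal sketch with SciPy, assuming three hypothetical samples measured on at least an ordinal scale:

```python
from scipy import stats

# Hypothetical scores from k = 3 independent samples
group1 = [27, 31, 25, 35, 30]
group2 = [22, 29, 24, 26, 28]
group3 = [33, 38, 36, 32, 40]

h, p = stats.kruskal(group1, group2, group3)
print(f"H = {h:.2f}, p = {p:.3f}")   # H is referred to chi-square with k - 1 = 2 df
```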
Dependency vs Interdependency Techniques
FACTOR ANALYSIS
(An Interdependence Technique)
How to Increase Admissions @ FoM
Variable No. Variable
X1 Placements
X2 Teaching soft skills
X3 Allowing mobile usage
X4 Reduced Fee
X5 Providing Wifi Internet access
X6 Live Projects/ Internships
X7 Study Notes
X8 Free transport / food
X9 Outside classroom learning
X10 Language skills (Hindi…)
X11 Guest Lectures
Introduction to Factor Analysis
Factor analysis is a multivariate statistical technique in which
there is no distinction between dependent and independent
variables.
In factor analysis, all variables under investigation are analyzed
together to extract the underlying factors.
Factor analysis is a data reduction method. It is a very useful
method to reduce a large number of variables resulting in data
complexity to a few manageable factors.
These factors explain most part of the variations of the original
set of data.
A factor is a linear combination of variables.
It is a construct that is not directly observable but that needs to
be inferred from the input variables.
The factors are statistically independent.
Conditions for a Factor Analysis
The following conditions must be
ensured before executing the technique:
Factor analysis exercise requires metric data.
This means the data should be either interval or
ratio scale in nature.
The variables for factor analysis are identified
through exploratory research
As the responses to different statements are
obtained through different scales, all the
responses need to be standardized. The
standardization helps in comparison of different
responses from such scales.
Conditions for a Factor Analysis …
The size of the sample respondents should be at least four
to five times more than the number of variables (number of
statements).
The basic principle behind the application of factor analysis
is that the initial set of variables should be highly
correlated.
If the correlation coefficients between all the variables are small, factor analysis may not be an appropriate technique.
The significance of the correlation matrix is tested using Bartlett's test of sphericity. The hypothesis to be tested is:
H0: The correlation matrix is an identity matrix, i.e. the variables are uncorrelated in the population.
Responses obtained on different scales are standardized as:
X*i = (Xi − X̄i) / SD(Xi)
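A minimal sketch of Bartlett's test of sphericity computed from a correlation matrix, using the standard statistic −[(n − 1) − (2p + 5)/6]·ln|R| referred to χ² with p(p − 1)/2 df; the data matrix here is a placeholder, not the survey data from the slides:

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's test of sphericity: H0 - the correlation matrix is an identity matrix."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -((n - 1) - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return chi2, df, stats.chi2.sf(chi2, df)

# Placeholder: replace with the standardized survey responses (n respondents x p statements).
X = np.random.default_rng(0).normal(size=(80, 7))
print(bartlett_sphericity(X))   # with real survey data, a small p => variables are correlated
                                # enough for factor analysis to be appropriate
```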
Steps in a Factor Analysis …
Example: In a study to analyze the
investment behavior of employees of PSUs
80 respondents were asked to rate on a 5-
point Likert scale, their level of agreement on
the following parameters:
1. Risk averseness
2. Returns
3. Insurance cover
4. Tax rebate
5. Maturity time
6. Credibility of the financial institution
7. Easy accessibility
Steps in a Factor Analysis …
The principal component methodology involves
searching for those values of Wi so that the first
factor explains the largest portion of total variance.
This is called the first principal factor.
This explained variance is then subtracted from the
original input matrix so as to yield a residual matrix.
A second principal factor is extracted from the
residual matrix in a way such that the second factor
takes care of most of the residual variance.
One point that has to be kept in mind is that the
second principal factor has to be statistically
independent of the first principal factor. The same
principle is then repeated until there is little variance
to be explained.
Steps in a Factor Analysis …
To decide on the number of factors to be extracted
Kaiser Guttman methodology is used which states
that the number of factors to be extracted should be
equal to the number of factors having an eigenvalue
of at least 1.
                A: Unrotated Factors            B: Rotated Factors
Variable          I        II        h²           I         II
A               0.70     -0.40      0.65        0.79       0.15
B               0.60     -0.50      0.61        0.75       0.03
C               0.60     -0.35      0.48        0.68       0.10
D               0.50      0.50      0.50        0.06       0.70
E               0.60      0.50      0.61        0.13       0.77
F               0.60      0.60      0.72        0.07       0.85
Eigenvalue      2.18      1.39
Percent of variance   36.3    23.2
Cumulative percent    36.3    59.5

Note: An eigenvalue is the sum of the squared loadings (variances explained) of the variables on that factor; for factor I the eigenvalue is .70² + .60² + .60² + .50² + .60² + .60² = 2.18.
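The eigenvalues, communalities (h²) and percentages of variance in the table can be reproduced directly from the unrotated loadings; a minimal check in NumPy:

```python
import numpy as np

# Unrotated loadings of variables A-F on factors I and II (from the table above)
L = np.array([[0.70, -0.40],
              [0.60, -0.50],
              [0.60, -0.35],
              [0.50,  0.50],
              [0.60,  0.50],
              [0.60,  0.60]])

eigenvalues   = (L ** 2).sum(axis=0)             # approx. [2.18, 1.39]
communalities = (L ** 2).sum(axis=1)             # the h2 column
pct_variance  = eigenvalues / L.shape[0] * 100   # approx. 36.3% and 23.2%
print(eigenvalues.round(2), communalities.round(2), pct_variance.round(1))
```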
Steps in a Factor Analysis …
Step 2 : Rotation of factors:
The second step in the factor analysis exercise is the
rotation of initial factor solutions
The initial solution is rotated so as to yield a solution
that can be interpreted easily.
The varimax rotation method is used.
Steps in a Factor Analysis …
A
_____Unrotated Factors_____
B
__Rotated Factors__
The varimax rotation method maximizes the variance of
Variable the loadings within
I eachIIfactor. h2 I II
A 0.70 -.40 0.65 0.79 0.15
The variance of the factor is largest when its smallest
B 0.60 -.50 0.61 0.75 0.03
loading tends towards zero and its largest loading tends
C 0.60 -.35 0.48 0.68 0.10
towards unity.
D 0.50 0.50 0.50 0.06 0.70
EThe basic idea
0.60of rotation
0.50 is to get
0.61some factors
0.13 that0.77
have
Fa few variables
0.60 that 0.60
correlate
A high
0.72 with 0.07
that factor
B 0.85 and
some that correlate
Eigenvalue 2.18 poorly
_____Unrotated with that factor.
1.39Factors_____ __Rotated Factors__
Percentof variance 36.3 23.2
Similarly, there are other factors that correlate high with
Cumulative percent
Variable those 36.3
I
variables 59.5
II
with which h2
the other factorsI do not have
II
Asignificant correlation.
0.70 -.40 0.65 0.79 0.15
B 0.60 -.50 0.61 0.75 0.03
Therefore, the rotation is carried out in such way that the
C 0.60 -.35 0.48 0.68 0.10
factor loadings as in the first step are close to unity or zero.
D 0.50 0.50 0.50 0.06 0.70
E 0.60 0.50 0.61 0.13 0.77
F 0.60 0.60 0.72 0.07 0.85
Eigenvalue 2.18 1.39
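A minimal varimax sketch in NumPy (the standard iterative SVD formulation), applied to the unrotated loadings above; it should reproduce, approximately and up to sign flips of a factor, the rotated columns of the table:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-6):
    """Orthogonally rotate a (variables x factors) loading matrix by the varimax criterion."""
    p, k = L.shape
    R = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
        R = u @ vt
        crit = s.sum()
        if crit < crit_old * (1 + tol):   # stop once the criterion no longer improves
            break
        crit_old = crit
    return L @ R

L = np.array([[0.70, -0.40], [0.60, -0.50], [0.60, -0.35],
              [0.50,  0.50], [0.60,  0.50], [0.60,  0.60]])
print(varimax(L).round(2))   # loadings should be close to the rotated columns shown above
```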
Uses of Factor Analysis
Scale construction: Factor analysis could be used to
develop concise multiple item scales for measuring
various constructs.
Discriminant function coefficients (illustration):
Variable      Unstandardized   Standardized
X1                0.36084         0.65927
X2                2.61192         0.57958
X3                0.53028         0.97505
Constant         12.89685

D = 0.659·X1 + 0.580·X2 + 0.975·X3
Discriminant Analysis: Example [illustrative output not reproduced]
Objectives of Discriminant Analysis
The objectives of discriminant analysis are the
following:
To find a linear combination of variables that
discriminate between categories of dependent
variable in the best possible manner
To find out which independent variables are relatively
better in discriminating between groups
To determine the statistical significance of the
discriminant function
To develop a procedure for assigning new objects, firms or individuals, whose profiles but not group identities are known, to one of the two groups
To evaluate the accuracy of classification
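A minimal sketch of these objectives with scikit-learn's linear discriminant analysis, assuming hypothetical two-group data (not the example from the slides):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: 3 predictor variables, 2 groups (e.g. buyers vs non-buyers)
X = np.array([[5, 1, 7], [6, 2, 8], [7, 1, 9], [2, 6, 3], [1, 7, 2], [3, 5, 4]])
y = np.array([1, 1, 1, 0, 0, 0])

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_)                     # weights of the linear discriminant function
print(lda.predict([[4, 3, 6]]))      # assign a new object of known profile to a group
print(lda.score(X, y))               # classification accuracy on the sample
```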
Uses of Discriminant Analysis
Some of the uses of Discriminant Analysis
are:
Scale construction: Discriminant analysis is used
to identify the variables/statements that are
discriminating and on which people with diverse
views will respond differently.
Cluster Analysis: define variables; compute similarities.
Illustration: Buyer of a Sedan – cluster membership by number of clusters [table not reproduced]
Dependency techniques: Multiple Regression, Discriminant Analysis, MANOVA, Conjoint Analysis
Uses of Multiple Regression
Variable selection methods:
Backward (remove the variable whose removal changes R² the least; a sketch follows below)
Stepwise
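A minimal sketch of the backward idea described above, using statsmodels OLS and a hypothetical R² threshold (the threshold and the stopping rule are assumptions for illustration):

```python
import pandas as pd
import statsmodels.api as sm

def backward_by_r2(X, y, min_r2_loss=0.01):
    """Drop, one at a time, the predictor whose removal reduces R^2 the least."""
    cols = list(X.columns)
    current = sm.OLS(y, sm.add_constant(X[cols])).fit()
    while len(cols) > 1:
        # R^2 lost by removing each remaining predictor
        losses = {c: current.rsquared -
                     sm.OLS(y, sm.add_constant(X[[k for k in cols if k != c]])).fit().rsquared
                  for c in cols}
        worst = min(losses, key=losses.get)
        if losses[worst] >= min_r2_loss:        # every remaining variable matters enough
            break
        cols.remove(worst)
        current = sm.OLS(y, sm.add_constant(X[cols])).fit()
    return cols, current

# Usage (hypothetical): kept, model = backward_by_r2(df[["X1", "X2", "X3"]], df["Y"])
```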
MDS output (Metric or Non-Metric)