Data Preparation and Analysis


DATA PREPARATION AND ANALYSIS
Unit 4
Chi-Square
 Goodness of fit – Whether the sequence/occurrence in the sample is similar to the occurrences in the population. (Card game – Win/Loss/Tie)
 For finding association: Nominal vs Nominal (Association)
 Equality of proportions: Car customers' satisfaction.
T-test
 One sample t-test – Whether the mean of the sample is equal to the mean of the population.
 Paired sample t-test –
1. When samples are selected for the study in pairs – Ex: (Mom and Child, Husband and Wife)
2. Before and after training – for measuring effectiveness
 Independent samples t-test (Difference) – When there are two groups and we are asked to find whether they differ significantly.
IV – 2 categories, in nominal scale.
DV – Scale variable (Interval or Ratio scale)
Ex: Gender vs Satisfaction
ANOVA –
 IV – Nominal (more than 2 categories)
 DV – Scale (Interval or Ratio)
Correlation: (Relationship effect)
 IV – Scale
 DV – Scale
Simple Regression (Cause and Effect)
 IV – Scale
 DV – Scale
Multiple Regression:
IV – More than one IV (Scale)
DV – Scale
Discriminant Analysis:
DV – Nominal, with two or more categories (High/Low, Good/Bad, etc.)
IV – Scale
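As an illustrative sketch (not part of the original slides), the test-selection rules above map directly onto standard `scipy.stats` calls; all data below are hypothetical.

```python
# Hypothetical data only; the scipy.stats function names are standard.
from scipy import stats

satisfaction_m = [3.1, 3.8, 4.0, 2.9, 3.5]   # scale DV, group 1 (nominal IV: gender)
satisfaction_f = [4.2, 3.9, 4.5, 4.1, 3.7]   # scale DV, group 2

# One-sample t-test: is the sample mean equal to a claimed population mean?
t1, p1 = stats.ttest_1samp(satisfaction_m, popmean=3.0)

# Independent-samples t-test: nominal IV with 2 categories vs a scale DV
t2, p2 = stats.ttest_ind(satisfaction_m, satisfaction_f)

# Paired-sample t-test: before/after training scores for the same subjects
before = [62, 58, 71, 66, 54]
after = [66, 61, 70, 72, 59]
t3, p3 = stats.ttest_rel(before, after)

# One-way ANOVA: nominal IV with more than 2 categories vs a scale DV
group_c = [2.8, 3.2, 3.0, 2.6, 3.1]
f, p4 = stats.f_oneway(satisfaction_m, satisfaction_f, group_c)
```
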
Data Preparation

The Data Preparation Process
Data Editing

Data Coding

Data Classification

Data Tabulation

Exploratory Data Analysis


Data Editing
 The process of cleaning the raw data
to make it free from inconsistencies
and incompleteness is called data
editing
 In other words, editing is the process
of examining the data collected through
various methods to detect errors and
omissions and correct them to enable
further analysis
Objectives of Data Editing
 To ensure accuracy of data
 To establish consistency of data
 To determine whether the data is complete
 To ensure the coherence of aggregated data
Stages in the Editing Process
 Initial Screening
Whether legible, consistent and complete
 Establishing Response Categories
ex: response to open-ended questions
 Field Editing
 Central Editing
Types of Data Editing
 Field Editing
Usually done by the field investigators at the end of
every field day
The investigator(s) review the filled-in forms for any
inconsistencies, non-response, illegible responses or
incomplete questionnaires
 Centralized in-house Editing
Editing performed by one or more centralized office
staff
Follows a standard process discussed in the next
slide.
Types of Data Editing… …
 Backtracking
Backtracking involves returning to the field and to
the respondents, so as to follow up the
unsatisfactory responses
 Allocating Missing Values
Here the researcher might assign a missing value to
the blanks if such blanks are few and not on
important parameters
 Plug Value
When the missing data relates to the key variable,
then an average or neutral value is plugged
 Discarding Unsatisfactory Responses
If the response sheet has too many blank, illegible or unsatisfactory responses, it is not worth editing and the whole questionnaire is discarded
Data Coding
 Coding involves assigning numbers or other symbols to responses
 Coding helps to group the responses into a
limited number of categories
 Codes are organized in:
 Fields (ex: gender)
 Records
 Files
 Data matrix
 Both closed and open questions must be coded
 A codebook contains each variable in the study
and specifies the application of coding rules to
the variable
Sample Code Book Extract

Q. No. | Variable Name | Coding Instruction | Symbol
1. | Buy ready to eat food products | Yes = 1, No = 0 | X1
2. | Use ready to eat food products | Yes = 1, No = 0 | X2
22. | Age | Less than 20 yrs = 1, 21 to 26 years = 2, 27 to 35 years = 3, 36 to 45 years = 4, More than 45 years = 5 | X22
23. | Gender | Male = 1, Female = 2 | X23
24. | Marital status | Single = 1, Married = 2, Divorced/widow = 3 | X24
25. | No. of children | Exact no. to be written | X25
Sample Code Book Extract …

26. | Family size | One to two = 1, Three to five = 2, Six & more = 3 | X26
27. | Monthly household income | Rs.20000 to Rs.34999 = 1, Rs.35000 to Rs.50000 = 2, Rs.50001 to Rs.74999 = 3, Rs.75000 & above = 4 | X27
28. | Education | Less than graduation = 1, Graduation = 2, Post graduation & above = 3 | X28
29. | Occupation | Student = 1, Businessman = 2, Professional = 3, Service = 4, Housewife = 5, Others = 6 | X29
Classification and
Tabulation of Data
 Reducing the information into
homogeneous categories is called
classification of data
 Classification by attributes is mostly
categorical
 Classification by class intervals could be
exclusive or inclusive
 Tabulation is arrangement of data into
rows and columns to enable statistical
analysis
Exploratory Data Analysis
 Sample characteristics: age group of the
sample
Age groups Frequency Percent
20-25 27 27.0
26-30 37 37.0
31-35 9 9.0
36-40 22 22.0
41-45 3 3.0
46 & above 2 2.0
Total 100 100.0
Exploratory Data Analysis…
Pie Charts

[Pie chart of the age group distribution: 20-25, 26-30, 31-35, 36-40, 41-45, 46 & above]
Exploratory Data Analysis…
Bar Charts

[Bar chart of frequency by age group: 20-25, 26-30, 31-35, 36-40, 41-45, 46 & above]
Exploratory Data Analysis…
Histograms

[Histogram of purchase in gms: Mean = 18.36, Std. Dev. = 6.56, N = 15]
Exploratory Data Analysis
Stem and Leaf Diagrams
 It shows individual data values in each set, as against the histogram which presents only group aggregates

13 | 1 3 3 9
15 | 6 6 8
16 | 3 6
17 | 3 7
18 | 2
22 | 2
31 | 0
35 | 6
Qualitative Vs Quantitative
Data Analyses
Qualitative Data Analysis
 Qualitative research is ‘any kind of research
that produces findings not arrived at by means
of statistical procedures or other means of
quantification’
 Here, the phenomenon of interest unfolds naturally and the research does not attempt to manipulate it
 Qualitative research seeks understanding and exploration and seldom seeks 'causality'
Quantitative Data Analysis
 Quantitative research employs experimental methods and quantitative measures to test hypothetical generalizations
 Here, the emphasis is on facts and causes of behavior
 The information is in the form of numbers
that can be quantified and summarized
 Quantitative analysis uses mathematical
and statistical techniques for analyzing
and interpreting numerical data
Univariate, Bivariate and
Multivariate Statistical
Techniques
Meaning of Uni- Bi- &
Multivariate Analysis of Data
 Univariate Analysis – One variable is analyzed at a
time
 Bivariate Analysis – Two variables are analyzed
together and examined for any possible association
between them
 Multivariate Analysis – More than two variables are
analyzed at a time
 The type of statistical techniques used for analyzing
univariate and bivariate data depends upon the level
of measurements of the questions pertaining to those
variables
 Further, the data analysis could be of two types,
namely, descriptive and inferential.
Descriptive vs Inferential Analysis
 Descriptive analysis deals with summary measures
relating to the sample data
 The common ways of summarizing data are by
calculating average, range, standard deviation, frequency
and percentage distribution
 The first thing to do when data analysis is taken up is
to describe the sample
 Examples of Descriptive Analysis:
 What is the average income of the sample?
 What is the standard deviation of incomes in the sample?
 What percentage of sample respondents are married?
 What is the median age of the sample respondents?
 Is there any association between the frequency of purchase of
product and income level of the consumers?
Descriptive vs Inferential Analysis …
Types of Descriptive Analysis
 The table below presents the type of
descriptive analysis that is applicable
under each form of measurement
Descriptive vs Inferential Analysis
Inferential Analysis:
Under inferential statistics, inferences are drawn on
population parameters based on sample results
The researcher tries to generalize the results to the
population based on sample results
Examples of Inferential Analysis:
Is the average income of population significantly
greater than 25,000 per month?
Is the job satisfaction of unskilled workers
significantly related with their pay packet?
Are consumption expenditure and disposable
income of households significantly correlated?
Do urban and rural households differ significantly in
terms of average monthly expenditure on food?
Descriptive Analysis of Univariate Data
Measures of Central Tendency
 Arithmetic mean (appropriate for
Interval and Ratio scale data)
 Median (appropriate for Ordinal,
Interval and Ratio scale data)
 Mode (appropriate for Ordinal, Interval
and Ratio scale data)
Descriptive Analysis of Univariate Data …
 Measures of Dispersion
Range (appropriate for Interval and Ratio
scale data)
Variance and Standard Deviation
(appropriate for interval and ratio scale data)
Coefficient of variation (appropriate for Ratio
scale data)
Relative and absolute frequencies
(appropriate for Nominal scale data)
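A minimal sketch of the measures above, on hypothetical income data, using only Python's standard library:

```python
# Hypothetical incomes; each measure below corresponds to the lists above.
import statistics

incomes = [22000, 35000, 28000, 41000, 35000, 30000, 52000, 27000]

mean_income = statistics.mean(incomes)      # central tendency (interval/ratio data)
median_income = statistics.median(incomes)  # robust to extreme values
mode_income = statistics.mode(incomes)      # most frequent value
value_range = max(incomes) - min(incomes)   # range
sd = statistics.stdev(incomes)              # sample standard deviation
cv = sd / mean_income                       # coefficient of variation (ratio data)
```
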
Descriptive Analysis of Bivariate Data

Preparation of cross-tables
Interpretation of cross-tables
First, it is required to identify the dependent and independent variables
Percentages should be computed in the direction of the independent variable
The dependent and independent variables can be taken either in rows or in columns
Illustration: Cross-tabulation

 With Third Variable:


Descriptive Analysis of Bivariate Data
 Spearman’s rank order correlation
measures the association between two
variables when the data is on ordinal
scale
 Ex: Suppose in a beauty contest two
judges are asked to rank ten female
participants
 A correlation coefficient between the ranks awarded by the two judges would show how consistent they are in awarding the ranks.
Descriptive Analysis of Bivariate Data

Spearman's rank correlation coefficient is given by:

rs = 1 – (6 Σ di²) / (n(n² – 1))

where di is the difference between the ranks of the ith pair and n is the number of pairs. The rank correlation coefficient takes a value between –1 and +1.
Example: Rank Correlation
Participant | Ranking by Judge 1 | Ranking by Judge 2 | di | di²
A | 10 | 9 | 1 | 1
B | 1 | 3 | -2 | 4
C | 5 | 4 | 1 | 1
D | 2 | 1 | 1 | 1
E | 8 | 8 | 0 | 0
F | 3 | 2 | 1 | 1
G | 4 | 6 | -2 | 4
H | 6 | 5 | 1 | 1
I | 7 | 7 | 0 | 0
J | 9 | 10 | -1 | 1
Total | | | | 14

Ans: 0.915
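The worked answer can be cross-checked with scipy, using the two judges' ranks from the table above:

```python
# The judges' ranks from the table; with no ties, spearmanr is equivalent
# to 1 - 6*sum(d^2) / (n*(n^2 - 1)).
from scipy.stats import spearmanr

judge1 = [10, 1, 5, 2, 8, 3, 4, 6, 7, 9]
judge2 = [9, 3, 4, 1, 8, 2, 6, 5, 7, 10]

rho, p = spearmanr(judge1, judge2)
print(round(rho, 3))  # 0.915, matching the worked answer
```
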
More on Analysis of Data
Calculating summarized rank order
The rankings of attributes while choosing
a restaurant for dinner for 32 respondents
can be presented in the form of
frequency distribution in the table below:
More on Analysis of Data …
 To calculate a summary rank ordering, the attribute with the first rank is given the lowest number (1) and the least preferred attribute is given the highest number (5)
 The summarized rank order for each attribute is obtained by multiplying each rank value by its frequency and totalling the products
 The attribute with the lowest total score gets the first preference ranking. The results show the following rank ordering:
(1) Food quality (2) Service (3) Ambience (4) Menu variety (5) Location
Parametric Vs Non-parametric Tests
Parametric Tests:
 The population mean (μ), standard
deviation (σ) and proportion (P) are
called the parameters of a distribution.
 Tests of hypotheses concerning the
mean and proportion are based on the
assumption that the population from
where the sample is drawn is normally
distributed
 Tests based on the above parameters
are called parametric tests.
Parametric Vs. Non-parametric Tests
Non-Parametric Tests:-
 There are situations where the populations under study are not normally distributed
 The data collected from these populations is skewed.
 The option is to use a non-parametric test. These
tests are called distribution-free tests as they do
not require any assumption regarding the shape
of the population distribution.
 These tests could also be used for small sample
sizes where the normality assumption does not
hold true.
Advantages of Non-Parametric Tests
 They can be applied to many situations
 They can be applied where a numeric
observation is difficult to obtain but a
rank value is not
 Can be applied to nominal and ordinal
data
 Involve simple computations compared to parametric tests.
Disadvantages of Non-Parametric Tests
 A lot of information is wasted. For Ex:
The increase or the gain is denoted by a
plus sign whereas a decrease or loss is
denoted by a negative sign. No
consideration is given to the quantity of the
gain or loss.
 Non-parametric methods are less
powerful than parametric tests when the
basic assumptions of parametric tests
are valid
 The null hypothesis in a non-parametric test is loosely defined as compared to the parametric tests
Difference between Parametric &
Non-parametric Tests
Chi-Square Tests
Properties of Chi-Square
 For the use of a chi-square test, data is
required in the form of frequencies
 Unlike the normal and t distribution, chi-
square distribution is not symmetric
 The values of chi-square are greater
than or equal to zero
 The shape of a chi-square distribution
depends on the degrees of freedom
 With the increase in degrees of
freedom, the distribution tends to
normal
Applications of Chi-square
 Chi-square test for the goodness of fit
 Chi-square test for the independence of
variables
 Chi-square test for the equality of more than two population proportions.
Chi-Square Test Procedure
 State the null and the alternative hypothesis about a population
 Specify a level of significance
 Compute the expected frequencies under
the assumption that the null hypothesis is
true
 Make a note of the observed counts of the
data points falling in different cells
 Compute the chi-square value given by the formula:
χ² = Σ (Oi – Ei)² / Ei
where Oi and Ei are the observed and expected frequencies.
Chi-Square Test Procedure … …

Compare the sample value of the statistic as obtained in the previous step with the critical value at a given level of significance and make the decision.
Chi-square test for goodness of fit
 The hypothesis to be tested in this case is:
H0 : Probabilities of the occurrence of events E1, E2, ..., Ek are given by the specified probabilities p1, p2, ..., pk
H1 : Probabilities of the k events are not the pi stated in the null hypothesis
 The procedure has already been explained.


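A sketch of the goodness-of-fit test on the card-game example from the start of the unit, with made-up counts and assumed H0 probabilities:

```python
# Hypothetical win/loss/tie counts; the H0 probabilities p1 = 0.5, p2 = 0.3,
# p3 = 0.2 are assumed purely for illustration.
from scipy.stats import chisquare

observed = [45, 35, 20]                       # wins, losses, ties in 100 games
expected = [0.5 * 100, 0.3 * 100, 0.2 * 100]  # expected counts under H0

chi2, p = chisquare(f_obs=observed, f_exp=expected)
# chi2 = (45-50)^2/50 + (35-30)^2/30 + (20-20)^2/20 = 1.33, with k-1 = 2 df
```
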
Chi-square Test for Independence of Variables
 The chi-square test can be used to test the
independence of two variables each having at least
two categories
 The test makes use of contingency tables also
referred to as cross-tabs with the cells corresponding
to a cross classification of attributes or events
 A contingency table with three rows and four columns
(as an example) is as shown below.
Chi-square Test for Independence of Variables …

 The hypothesis test for independence is:
H0 : Row and column variables are independent of each other
H1 : Row and column variables are not independent.
 The hypothesis is tested using a chi-square test statistic for independence given by:
χ² = Σi Σj (Oij – Eij)² / Eij
Chi-square Test for Independence of Variables …

The degrees of freedom for the chi-square statistic are given by (r – 1)(c – 1).
The expected frequency in the cell corresponding to the ith row and the jth column is given by:
Eij = (Ri × Cj) / n
where Ri is the ith row total, Cj is the jth column total and n is the total sample size.
For a given level of significance α, the sample value of the chi-square is compared with the critical value for (r – 1)(c – 1) degrees of freedom to make a decision.
Chi-square Test for Equality of More Than Two
Population Proportions
 The analysis is carried out exactly in the same way as was done for the other two cases
 The formula for a chi-square analysis remains the same
 The hypothesis to be tested is as under:
 H0 : The proportion of people satisfying a particular
characteristic is the same in all populations.
 H1 : The proportion of people satisfying a particular
characteristic is not the same in all populations
 The expected frequency formula and the
decision procedure remains the same.
Illustration: Uty. Question June 2010
 In a study concerned with preferences for four different types of check-in accounts offered by a bank, 100 people in a higher-income group and 200 people in a lower-income group were interviewed. The results of their choices are shown in the accompanying table.

Preference | Higher-Income Group | Lower-Income Group
Account A | 36 | 84
Account B | 39 | 51
Account C | 16 | 44
Account D | 9 | 21
Illustration: Uty. Question June 2010 ….
 Set up this study in formal statistical
terms and draw appropriate
conclusions. Use α = 0.05.
Higher-Income Lower-Income
Preference
Group Group
Account A 36 84
Account B 39 51
Account C 16 44
Account D 9 21
H0 : Account preference and income group are independent
H1 : Account preference and income group are NOT independent
R C
Expected Frequency Eij
i j

n
Preference | Higher-Income Group | Lower-Income Group | Row Total (Ri)
Account A | 36 | 84 | 120
Account B | 39 | 51 | 90
Account C | 16 | 44 | 60
Account D | 9 | 21 | 30
Col. Total (Cj) | 100 | 200 | 300
R C
Expected Frequency Eij
i j

n
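The bank-account illustration can be verified with scipy's contingency-table test:

```python
# The June 2010 bank-account table, run through scipy's chi-square test
# for independence.
from scipy.stats import chi2_contingency

table = [[36, 84],   # Account A
         [39, 51],   # Account B
         [16, 44],   # Account C
         [9, 21]]    # Account D

chi2, p, dof, expected = chi2_contingency(table)
# Expected cell (A, higher) = 120*100/300 = 40; chi2 works out to 6.0 with
# (4-1)(2-1) = 3 degrees of freedom. The critical value at alpha = 0.05 is
# 7.815, so H0 (preference independent of income group) is not rejected.
```
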
Some More Non-Parametric
Tests:
1. One Sample Runs Test for Randomness
2. One Sample Sign Test
3. Mann-Whitney U Test for Independent
Samples
4. Wilcoxon Signed-Rank Test for Paired
Samples
5. Kruskal-Wallis Test
Runs Test for Randomness
 Runs test is used to test the randomness of a sample.
 Run: A run is defined as a sequence of like elements that
are preceded and followed by different elements or no
elements at all.
 M,W,W,M,M,M,M,W,W,W,M,M,W,M,W,W,M
 Let n = Total size of the sample
n1 = number of occurrences of type 1
n2 = number of occurrences of type 2
r = Number of runs
 For large samples, either n1 > 20 or n2 > 20, the distribution of runs (r) is approximately normal with:
Mean: μr = (2 n1 n2) / (n1 + n2) + 1
SD: σr = sqrt[ 2 n1 n2 (2 n1 n2 – n1 – n2) / ((n1 + n2)² (n1 + n2 – 1)) ]
Run Test for Randomness
The hypothesis to be tested is:
H0 : The pattern of sequence is random.
H1 : The pattern of sequence is not random.
For a large sample the test statistic is given by:
z = (r – μr) / σr
For a given level of significance, if the absolute value of computed z is greater than the absolute value of tabulated z, the null hypothesis is rejected.
Illustration: One Sample Runs Test
 Let us consider the pattern of arrivals of applicants for a job interview:
M,W,W,M,M,M,M,W,W,W,M,M,W,M,W,W,M
n1 = number of women = 8
n2 = number of men = 9
r = number of runs = 9
 Runs tests are always two-tailed tests because
the question to be answered is whether there are
too many or too few runs
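The computation for this arrival sequence can be sketched in a few lines; note that the slide applies the large-sample statistic to this small sample purely for illustration:

```python
# The arrival sequence from the illustration; mean and SD of r use the
# large-sample formulas given earlier (no library runs test is assumed).
from math import sqrt

seq = "MWWMMMMWWWMMWMWWM"
n1 = seq.count("W")                                # 8 women
n2 = seq.count("M")                                # 9 men
r = 1 + sum(a != b for a, b in zip(seq, seq[1:]))  # 9 runs

mean_r = 2 * n1 * n2 / (n1 + n2) + 1
var_r = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
z = (r - mean_r) / sqrt(var_r)
# |z| is about 0.24 < 1.96: do not reject H0, the pattern appears random
```
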
Two-Sample Sign Test
 This test is based upon the sign of a pair of
observations i.e. based on the direction and not on
their magnitude.
 Suppose a sample of respondents is selected and
their views on the image of a company are sought.
 After some time, these respondents are shown an
advertisement, and thereafter, the data is again
collected on the image of the company.
 For those respondents where the image has improved, a positive sign is assigned; for those where it has declined, a negative sign; and where there is no change, the corresponding observation is dropped from the analysis and the sample size reduced accordingly.
Two-Sample Sign Test …
The key concept underlying the test is that if
the advertisement is not effective in improving
the image of the company, the number of
positive signs should be approximately equal
to the number of negative signs
For small samples, a binomial distribution
could be used, whereas for a large sample,
the normal approximation to the binomial
distribution could be used.
Illustration: Two Sample Sign Test
 40 College Juniors evaluate the effectiveness
of two types of classes
Long Lectures by full professors
Short lectures by graduate assistants
PM No.       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Score for LL | 2 | 1 | 4 | 4 | 3 | 3 | 4 | 2 | 4 | 1
Score for SL | 3 | 2 | 2 | 3 | 4 | 2 | 2 | 1 | 3 | 1
Sign for LL  | - | - | + | + | - | + | + | + | + | 0
(Only the first 10 of the 40 respondents are shown.)

 No. of + signs = 19; minus signs = 11; zeros = 10
 N = (40 – 10) = 30; p̄ = 0.633, proportion of successes in the sample; q̄ = 0.367, proportion of failures in the sample
Illustration: Two Sample Sign Test…
H0 : p = 0.5 (there is no difference between the two types of classes)
H1 : p ≠ 0.5 (there is a difference between the two types of classes)
 We use the normal distribution to approximate the binomial distribution:
σp = sqrt(pq/n) = sqrt((0.5)(0.5)/30) = 0.091
z = (p̄ – pH0) / σp = (0.633 – 0.5) / 0.091 = 1.462
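The sign-test computation above, written as code:

```python
# Large-sample sign test on the class-evaluation counts from the slides.
from math import sqrt

plus, minus, zeros = 19, 11, 10
n = 40 - zeros                 # ties dropped: n = 30
p_bar = plus / n               # 0.633, proportion of + signs
sigma_p = sqrt(0.5 * 0.5 / n)  # SD of the proportion under H0: p = 0.5
z = (p_bar - 0.5) / sigma_p    # about 1.46 < 1.96: H0 not rejected at 5%
```
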
Mann-Whitney U Test for
Independent Samples
 One of the Rank Sum Tests (the other being Kruskal-
Wallis Test) which is based on the ranks of the
sample observations
 MWU test is used when only two populations are
involved and KW test is used if more than two
populations are involved
 A two-tailed hypothesis for a Mann-Whitney test could be written as:
H0 : The two samples come from identical populations (OR) the two populations have identical probability distributions.
H1 : The two samples come from different populations (OR) the two populations differ in location.
Illustration: Mann-Whitney U Test
 Suppose that the board of a state
university wants to test the hypothesis
that the mean SAT scores of student at
two branches of the state university are
equal
 A random sample of 15 students from
each branch has produced the data
shown in the table in next slide
Illustration: Mann-Whitney U Test …
Branch A: 1,000 1,100 800 750 1,300 950 1,050 1,250 1,400 850 1,150 1,200 1,500 600 775
Branch B: 920 1,120 830 1,360 650 725 890 1,600 900 1,140 1,550 550 1,240 925 500

 To apply the MWU test to this problem, we begin by ranking ALL the scores in order from lowest to highest, indicating the symbol of the branch.
 This is shown in the table in the next slide
Illustration: Mann-Whitney U Test …
Rank Score Branch Rank Score Branch
1 500 S 16 1,000 A
2 550 S 17 1,050 A
3 600 A 18 1,100 A
4 650 S 19 1,120 S
5 725 S 20 1,140 S
6 750 A 21 1,150 A
7 775 A 22 1,200 A
8 800 A 23 1,240 S
9 830 S 24 1,250 A
10 850 A 25 1,300 A
11 890 S 26 1,360 S
12 900 S 27 1,400 A
13 920 S 28 1,500 A
14 925 S 29 1,550 S
15 950 A 30 1,600 S
Mann-Whitney U Test …
 The following steps are used in conducting this test:
 The two samples are combined (pooled) into one large
sample and then we determine the rank of each
observation in the pooled sample
 If two or more sample values in the pooled samples are
identical, i.e., if there are ties, the sample values are each
assigned a rank equal to the mean of the ranks that
would otherwise be assigned.
 We determine the sum of the ranks of each sample
 Let R1 and R2 represent the sum of the ranks of the first
and the second sample whereas n1 and n2 are the
respective sample sizes
 For convenience, choose n1 as the smaller sample size if they are unequal, so that n1 ≤ n2
 A significant difference between R1 and R2 implies a
significant difference between the samples.
Illustration: Calculating R1 and R2
Branch A Score Rank Branch B Score Rank
1,000 16 920 13
1,100 18 1,120 19
800 8 830 9
750 6 1,360 26
1,300 25 650 4
950 15 725 5
1,050 17 890 11
1,250 24 1,600 30
1,400 27 900 12
850 10 1,140 20
1,150 21 1,550 29
1,200 22 550 2
1,500 28 1,240 23
600 3 925 14
775 7 500 1
Total of Ranks 247 Total of Ranks 218
Mann-Whitney U Test for
Independent Samples …
U = n1n2 + n1(n1 + 1)/2 – R1
 If n1 or n2 > 10, a Z test would be appropriate
 For this purpose, either of U1 or U2 could be
used for testing a one-tailed or a two-tailed
test.
Mann-Whitney U Test for
Independent Samples …
Under the assumption that the null hypothesis is true, the U statistic follows an approximately normal distribution with mean μU = n1n2/2 and standard deviation σU = sqrt[n1n2(n1 + n2 + 1)/12].
 Assuming the level of significance equal to α, if the absolute sample value of Z is greater than the absolute critical value of Z, i.e., Zα/2, the null hypothesis is rejected.
 A similar procedure is used for a one-tailed test.
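The rank sums and U statistic from the SAT illustration can be computed directly; note that `scipy.stats.mannwhitneyu` reports the complementary form U' = R1 – n1(n1+1)/2, where U + U' = n1·n2:

```python
# The SAT scores from the illustration; U is computed with the slide's formula.
branch_a = [1000, 1100, 800, 750, 1300, 950, 1050, 1250,
            1400, 850, 1150, 1200, 1500, 600, 775]
branch_b = [920, 1120, 830, 1360, 650, 725, 890, 1600,
            900, 1140, 1550, 550, 1240, 925, 500]

pooled = sorted(branch_a + branch_b)
rank = {score: i + 1 for i, score in enumerate(pooled)}  # no ties in this data

r1 = sum(rank[s] for s in branch_a)    # 247, matching the table
r2 = sum(rank[s] for s in branch_b)    # 218
n1, n2 = len(branch_a), len(branch_b)  # 15 each, so a Z test is appropriate

u = n1 * n2 + n1 * (n1 + 1) // 2 - r1  # U = 225 + 120 - 247 = 98
```
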
Wilcoxon Signed-Rank Test for
Paired Samples
 In the two-sample sign test, only the sign of the difference (positive or negative) was taken into account and no weightage was assigned to the magnitude of the difference.
 The paired samples are dependent (before & after, husband & wife, brother & sister, etc.)
 The Wilcoxon matched-pair signed rank test takes care of this limitation and attaches a greater weightage to the matched pair with a larger difference.
 The test, therefore, incorporates and makes use of more information than the sign test.
 It is, therefore, a more powerful test than the sign test.
Illustration: WSRTPS
A Sample of 16 salesmen was selected in
an organization and their score on
performance appraisal was noted.
 The salesmen were sent for a 3-week
training and in the next appraisal, their
scores were noted again.
 The appraisal scores before and after the
training are given in the next slide
 Use a 5% level of significance to test the
hypothesis that the training has not caused
any change in the performance appraisal
score
Illustration: WSRTPS …
Salesman 1 2 3 4 5 6 7 8
Scores Before 85 76 64 59 72 68 43 54
Scores After 82 79 68 52 75 69 40 53

Salesman 9 10 11 12 13 14 15 16
Scores Before 57 61 71 82 39 51 54 57
Scores After 50 67 74 83 54 59 51 58

 Solution:
Ho: There is no difference in the appraisal
score because of training
H1: There is a difference in the appraisal score because of training
S.No | Before | After | Diff | Abs Diff | Rank of Abs Diff | -ve Rank | +ve Rank
1 | 85 | 82 | -3 | 3 | 7.5 | 7.5 |
2 | 76 | 79 | 3 | 3 | 7.5 | | 7.5
3 | 64 | 68 | 4 | 4 | 11.0 | | 11.0
4 | 59 | 52 | -7 | 7 | 13.5 | 13.5 |
5 | 72 | 75 | 3 | 3 | 7.5 | | 7.5
6 | 68 | 69 | 1 | 1 | 2.5 | | 2.5
7 | 43 | 40 | -3 | 3 | 7.5 | 7.5 |
8 | 54 | 53 | -1 | 1 | 2.5 | 2.5 |
9 | 57 | 50 | -7 | 7 | 13.5 | 13.5 |
10 | 61 | 67 | 6 | 6 | 12.0 | | 12.0
11 | 71 | 74 | 3 | 3 | 7.5 | | 7.5
12 | 82 | 83 | 1 | 1 | 2.5 | | 2.5
13 | 39 | 54 | 15 | 15 | 16.0 | | 16.0
14 | 51 | 59 | 8 | 8 | 15.0 | | 15.0
15 | 54 | 51 | -3 | 3 | 7.5 | 7.5 |
16 | 57 | 58 | 1 | 1 | 2.5 | | 2.5
Totals: T– = 52, T+ = 84
Wilcoxon Signed-Rank Test for
Paired Samples
 The test procedure is outlined in the following steps:
i. Let di denote the difference in the score for the ith
matched pair. Retain signs, but discard any pair for
which d = 0.
ii. Ignoring the signs of difference, rank all the di’s from the
lowest to highest. In case the differences have the
same numerical values, assign to them the mean of the
ranks involved in the tie.
iii. To each rank, prefix the sign of the difference.
iv. Compute the sum of the absolute value of the negative
and the positive ranks to be denoted as T– and T+
respectively.
v. Let T be the smaller of the two sums found in step iv.
Wilcoxon Signed-Rank Test for
Paired Samples
 When the number of pairs of observations (n) for which the difference is not zero is greater than 15, the T statistic follows an approximately normal distribution.
 The mean μT and standard deviation σT of T are given by:
μT = n(n + 1)/4
σT = sqrt[n(n + 1)(2n + 1)/24]
 The test statistic is given by:
Z = (T – μT) / σT
Wilcoxon Signed-Rank Test for
Paired Samples
 For a given level of significance α, the absolute
sample Z should be greater than the absolute Zα /2 to
reject the null hypothesis.
 For a one-sided upper tail test, the null hypothesis is
rejected if the sample Z is greater than Zα and for a
one-sided lower tail test, the null hypothesis is
rejected if sample Z is less than – Zα.
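The steps above can be sketched on the 16 salesmen's appraisal scores from the illustration:

```python
# Steps i-v on the illustration data; the large-sample Z uses the mu_T and
# sigma_T formulas given in the slides.
from math import sqrt

before = [85, 76, 64, 59, 72, 68, 43, 54, 57, 61, 71, 82, 39, 51, 54, 57]
after = [82, 79, 68, 52, 75, 69, 40, 53, 50, 67, 74, 83, 54, 59, 51, 58]

# Step i: signed differences, dropping zero differences (none occur here)
diffs = [a - b for a, b in zip(after, before) if a != b]

# Step ii: rank the absolute differences, averaging ranks over ties
abs_sorted = sorted(abs(d) for d in diffs)
def avg_rank(v):
    positions = [i + 1 for i, x in enumerate(abs_sorted) if x == v]
    return sum(positions) / len(positions)

# Steps iii-v: signed rank sums, then T = the smaller of the two sums
t_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)   # 84.0
t_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)  # 52.0
t = min(t_plus, t_minus)

n = len(diffs)                  # 16 non-zero pairs, so normal approximation
mu_t = n * (n + 1) / 4          # 68.0
sigma_t = sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (t - mu_t) / sigma_t        # about -0.83; |z| < 1.96, H0 not rejected
```
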
The Kruskal-Wallis Test
 The Kruskal-Wallis test is in fact a non-parametric counterpart to the one-way ANOVA.
 The test is an extension of the Mann-Whitney U test.
 Both of them require that the scale of measurement of a sample value should be at least ordinal.
 The hypothesis to be tested in the Kruskal-Wallis test is:
H0 : The k populations have identical probability distributions.
H1 : At least two of the populations differ in location.
The Kruskal-Wallis Test
 The procedure for the test is listed below:
i. Obtain random samples of size n1, ..., nk from
each of the k populations. Therefore, the total
sample size is

n = n1 + n2 + ... + nk

ii. Pool all the samples and rank them, with the
lowest score receiving a rank of 1. Ties are to
be treated in the usual fashion by assigning an
average rank to the tied positions.

iii. Let Ri = the total of the ranks from the ith sample.
The Kruskal-Wallis Test
 The Kruskal-Wallis test uses the χ² distribution to test the null hypothesis. The test statistic is given by:

H = [12 / (n(n + 1))] × Σ (Ri² / ni) – 3(n + 1)

which follows a χ² distribution with k – 1 degrees of freedom.
Where, k = number of samples
n = total number of elements in the k samples
Ri = sum of the ranks in the ith sample, ni = size of the ith sample
The null hypothesis is rejected if the computed χ² is greater than the critical value of χ² at the level of significance α.
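A sketch on three hypothetical samples, using scipy's implementation of the same H statistic:

```python
# Hypothetical scores for three groups; kruskal computes H and the
# chi-square p-value with k-1 = 2 degrees of freedom.
from scipy.stats import kruskal

s1 = [68, 72, 75, 80, 83]
s2 = [60, 65, 70, 74, 78]
s3 = [55, 58, 62, 66, 71]

h, p = kruskal(s1, s2, s3)
# h is about 6.48 > 5.991 (critical chi-square at alpha = 0.05, 2 df),
# so H0 of identical population distributions is rejected
```
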
Unit 4 – Data Preparation and Analysis
 Data Preparation
Editing
Coding
Data Entry
Validity of Data
 Qualitative Vs Quantitative Data Analyses
 Bivariate and Multivariate Statistical
Techniques
 Factor Analysis
 Discriminant Analysis
 Cluster Analysis
Unit 4 – Data Preparation and Analysis …
 Multiple Regression and Correlation
 Multidimensional Scaling
 Application of statistical software for data
analysis
Multivariate Statistical
Techniques
Classifying Multivariate Techniques

Dependency techniques:           Interdependency techniques:
1. Multiple regression           1. Factor analysis
2. MANOVA                        2. Cluster analysis
3. Discriminant analysis         3. Multidimensional scaling
FACTOR ANALYSIS
(An interdependency Technique)
How to Increase Admissions @ FoM
Variable No. Variable
X1 Placements
X2 Teaching soft skills
X3 Allowing mobile usage
X4 Reduced Fee
X5 Providing Wifi Internet access
X6 Live Projects/ Internships
X7 Study Notes
X8 Free transport / food
X9 Outside classroom learning
X10 Language skills (Hindi…)
X11 Guest Lectures
Introduction to Factor Analysis
 Factor analysis is a multivariate statistical technique in which
there is no distinction between dependent and independent
variables.
 In factor analysis, all variables under investigation are analyzed
together to extract the underlying factors.
 Factor analysis is a data reduction method. It is a very useful
method to reduce a large number of variables resulting in data
complexity to a few manageable factors.
 These factors explain most part of the variations of the original
set of data.
 A factor is a linear combination of variables.
 It is a construct that is not directly observable but that needs to
be inferred from the input variables.
 The factors are statistically independent.
Conditions for a Factor Analysis
 The following conditions must be
ensured before executing the technique:
 Factor analysis exercise requires metric data.
This means the data should be either interval or
ratio scale in nature.
 The variables for factor analysis are identified
through exploratory research
 As the responses to different statements are
obtained through different scales, all the
responses need to be standardized. The
standardization helps in comparison of different
responses from such scales.
Conditions for a Factor Analysis …
 The size of the sample respondents should be at least four
to five times more than the number of variables (number of
statements).
 The basic principle behind the application of factor analysis
is that the initial set of variables should be highly
correlated.
 If the correlation coefficients between all the variables are small, factor analysis may not be an appropriate technique.
 The significance of the correlation matrix is tested using Bartlett's test of sphericity. The hypothesis to be tested is:
H0 : Correlation matrix is insignificant
H1 : Correlation matrix is significant.
Conditions for a Factor Analysis

 The test converts the correlation matrix into a chi-square statistic with degrees of freedom equal to k(k – 1)/2, where k is the number of variables on which factor analysis is applied.
 The significance of the correlation matrix ensures that
a factor analysis exercise could be carried out.
 The value of Kaiser-Meyer-Olkin (KMO) statistics
which takes a value between 0 and 1 should be
greater than 0.5 for the application of factor analysis.
 A small value of KMO shows that correlation between
variables cannot be explained by other variables.
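A sketch of Bartlett's test on a hypothetical correlation matrix; the chi-square approximation used below, χ² = –[(n – 1) – (2k + 5)/6]·ln|R|, is the standard form of the test:

```python
# Hypothetical correlation matrix R for k = 3 variables, n = 80 respondents.
import numpy as np

n = 80
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
k = R.shape[0]

# Bartlett's sphericity statistic and its degrees of freedom k(k-1)/2
chi2 = -((n - 1) - (2 * k + 5) / 6) * np.log(np.linalg.det(R))
df = k * (k - 1) // 2
# chi2 is about 58.3 with 3 df, far above the 0.05 critical value of 7.815,
# so H0 (insignificant correlation matrix) is rejected and factor analysis
# can proceed.
```
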
Steps in a Factor Analysis
There are basically two steps that are required in a factor analysis
exercise.

Step 1: Extraction of factors:
The first and foremost step is to decide on how many factors are to be extracted from the given set of data. The principal component method is discussed very briefly here. As factors are linear combinations of the variables, which are supposed to be highly correlated, the mathematical form could be written as:

F = W1 X*1 + W2 X*2 + ... + Wk X*k

where each variable is first standardized as X*i = (Xi – X̄i) / SD(Xi)
Steps in a Factor Analysis …
 Example: In a study to analyze the investment behavior of employees of PSUs, 80 respondents were asked to rate, on a 5-point Likert scale, their level of agreement on the following parameters:
1. Risk averseness
2. Returns
3. Insurance cover
4. Tax rebate
5. Maturity time
6. Credibility of the financial institution
7. Easy accessibility
Steps in a Factor Analysis …
 The principal component methodology involves
searching for those values of Wi so that the first
factor explains the largest portion of total variance.
This is called the first principal factor.
 This explained variance is then subtracted from the
original input matrix so as to yield a residual matrix.
 A second principal factor is extracted from the
residual matrix in a way such that the second factor
takes care of most of the residual variance.
 One point that has to be kept in mind is that the
second principal factor has to be statistically
independent of the first principal factor. The same
principle is then repeated until there is little variance
to be explained.
Steps in a Factor Analysis …
 To decide on the number of factors to be extracted,
the Kaiser-Guttman methodology is used, which states
that the number of factors to be extracted should be
equal to the number of factors having an eigenvalue
of at least 1.

 Step 2: Rotation of factors:
Factor rotation will be explained later. First we will
look at the PCA output from SPSS.
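The extraction step together with the Kaiser-Guttman rule can be sketched in a few lines. This is an illustrative NumPy version (the function name and simulated data are my own assumptions, not the deck's SPSS procedure):

```python
import numpy as np


def extract_factors(data):
    """Principal component extraction: eigen-decompose the correlation
    matrix and keep components whose eigenvalue is at least 1
    (the Kaiser-Guttman rule). Returns eigenvalues and loadings."""
    R = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]           # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    n_keep = int(np.sum(eigvals >= 1.0))        # Kaiser-Guttman cut-off
    # loadings = eigenvectors scaled by the square root of the eigenvalue
    loadings = eigvecs[:, :n_keep] * np.sqrt(eigvals[:n_keep])
    return eigvals, loadings
```

Since the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above 1 means the factor explains more variance than a single original variable.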
Factor Analysis Output

                    ___Unrotated Factors___    __Rotated Factors__
Variable               I       II      h2         I       II
A                    0.70    -0.40    0.65      0.79    0.15
B                    0.60    -0.50    0.61      0.75    0.03
C                    0.60    -0.35    0.48      0.68    0.10
D                    0.50     0.50    0.50      0.06    0.70
E                    0.60     0.50    0.61      0.13    0.77
F                    0.60     0.60    0.72      0.07    0.85
Eigenvalue           2.18     1.39
Percent of variance  36.3     23.2
Cumulative percent   36.3     59.5

 The entries under each factor are the correlation coefficients
between the factor and the variables; these are also called
factor loadings.
 h2 gives the communalities, or estimates of the variance in
each variable that is explained by the two factors.
 Eigenvalues are the sums of the squared factor loadings (for
factor I the eigenvalue is .70² + .60² + .60² + .50² + .60² +
.60² = 2.18).
Steps in a Factor Analysis …
Step 2 : Rotation of factors:
The second step in the factor analysis exercise is the
rotation of initial factor solutions
The initial solution is rotated so as to yield a solution
that can be interpreted easily.
The varimax rotation method is used.
Steps in a Factor Analysis …
 The varimax rotation method maximizes the variance of
the loadings within each factor.
 The variance of a factor is largest when its smallest
loading tends towards zero and its largest loading tends
towards unity.
 The basic idea of rotation is to get some factors that
have a few variables that correlate highly with that factor
and some that correlate poorly with that factor.
 Similarly, there are other factors that correlate highly with
those variables with which the remaining factors do not
have significant correlation.
 Therefore, the rotation is carried out in such a way that the
factor loadings from the first step are close to unity or zero.
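For reference, the classic Kaiser varimax iteration can be sketched in a few lines of NumPy. This is an illustrative implementation, not the deck's SPSS routine; because the rotation is orthogonal, the communalities (row sums of squared loadings) are left unchanged:

```python
import numpy as np


def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Classic varimax rotation (Kaiser): find an orthogonal rotation of
    the loading matrix that maximizes the variance of the squared
    loadings within each factor."""
    p, k = loadings.shape
    R = np.eye(k)                 # accumulated orthogonal rotation matrix
    var_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # SVD of the gradient of the varimax criterion
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0))))
        R = u @ vt
        var_new = np.sum(s)
        if var_new - var_old < tol:
            break
        var_old = var_new
    return loadings @ R
```

Feeding in the unrotated loadings from the output table yields a rotated solution whose communalities match the unrotated ones, as the deck notes.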
Steps in a Factor Analysis …
 To interpret the results, a cut-off point on the factor
loading is selected.
 There is no hard and fast rule to decide on the cut-off
point. However, it is generally taken to be greater
than 0.5.
 Once the cut-off point is decided, all the variables
attached to a factor are used for naming the factor.
This is a very subjective procedure, and different
researchers may name the same factors differently.
 A variable which appears in one factor should not
appear in any other factor. This means that a variable
should have a high loading on only one factor and a
low loading on the other factors.
Steps in a Factor Analysis …
 If that is not the case, it implies that the question has
not been understood properly by the respondent or it
may not have been phrased clearly.
 Another possible cause could be that the respondent
may have more than one opinion about a given item
(statement).
 The total variance explained by the principal component
method and varimax rotation is the same. However, the
variance explained by each factor could be different.
 The communalities of each variable remain
unchanged under both methods.
Exhibit 20-16: Orthogonal Factor Rotations
(plot of the unrotated loadings of variables A–F on factors I and II)

Variable      I       II
A           0.70    -0.40
B           0.60    -0.50
C           0.60    -0.35
D           0.50     0.50
E           0.60     0.50
F           0.60     0.60
Factor Analysis Output

A B
_____Unrotated Factors_____ __Rotated Factors__

Variable I II h2 I II
A 0.70 -.40 0.65 0.79 0.15
B 0.60 -.50 0.61 0.75 0.03
C 0.60 -.35 0.48 0.68 0.10
D 0.50 0.50 0.50 0.06 0.70
E 0.60 0.50 0.61 0.13 0.77
F 0.60 0.60 0.72 0.07 0.85
Eigenvalue 2.18 1.39
Percent of variance 36.3 23.2
Cumulative percent 36.3 59.5
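The eigenvalue and communality definitions can be checked directly against the figures in this table (plain Python; the loadings are taken from the slide):

```python
# Unrotated loadings (factor I, factor II) from the factor-analysis output table
loadings = {
    "A": (0.70, -0.40), "B": (0.60, -0.50), "C": (0.60, -0.35),
    "D": (0.50, 0.50),  "E": (0.60, 0.50),  "F": (0.60, 0.60),
}

# Eigenvalue of a factor = sum of its squared loadings across variables
eigen_I = sum(l1**2 for l1, _ in loadings.values())

# Communality (h2) of a variable = sum of its squared loadings across factors
h2_A = sum(l**2 for l in loadings["A"])
```

Running this reproduces the table: the factor I eigenvalue is 2.18 and the communality of variable A is 0.65.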
Uses of Factor Analysis
 Scale construction: Factor analysis could be used to
develop concise multiple item scales for measuring
various constructs.

 Establish antecedents: This method reduces multiple


input variables into grouped factors. Thus, the
independent variables can be grouped into broad factors.

 Psychographic profiling: Different independent


variables are grouped to measure independent factors.
These are then used for identifying personality types.

 Segmentation analysis: Factor analysis could also be


used for segmentation. For example, there could be
different sets of two-wheelers-customers owning two-
wheelers because of different importance they give to
factors like prestige, economy consideration and
functional features.
Uses of Factor Analysis …
 Marketing studies: The technique has extensive use
in the field of marketing and can be successfully used
for new product development; product acceptance
research, developing of advertising copy, pricing
studies and for branding studies.
For example we can use it to:
 identify the attributes of brands that influence
consumers’ choice;
 get an insight into the media habits of various
consumers;
 identify the characteristics of price-sensitive
customers.
Key terms used in Factor Analysis
 Factor Scores – The composite scores estimated for each
respondent on the extracted factors.
 Factor Loading – The correlation coefficient between the
factor score and a variable included in the study is called
the factor loading.
 Factor Matrix (Component Matrix) – It contains the factor
loadings of all the variables on all the extracted factors.
 Eigenvalue – The percentage of variance explained by
each factor can be computed using the eigenvalue. The
eigenvalue of any factor is obtained by taking the sum of
squares of the factor loadings of the variables on that factor.
 Communality – It indicates how much of each variable is
accounted for by the underlying factors taken together. In
other words, it is a measure of the percentage of a
variable’s variation that is explained by the factors.
Applications of Factor Analysis
in other Multivariate
Techniques
1. Multiple regression – Factor scores can be used in place of
independent variables in a multiple regression estimation. This
way we can overcome the problem of multicollinearity.
2. Simplifying the discrimination solution – A number of
independent variables in a discriminant model can be replaced
by a set of manageable factors before estimation.
3. Simplifying the cluster analysis solution - To make the data
manageable, the variables selected for clustering can be
reduced to a more manageable number using a factor analysis
and the obtained factor scores can then be used to cluster the
objects/cases under study.
4. Perceptual mapping in multidimensional scaling - Factor
analysis that results in factors can be used as dimensions with
the factor scores as the coordinates to develop attribute-based
perceptual maps where one is able to comprehend the
placement of brands or products according to the identified
factors under study.
Unit 4 – Data Preparation and Analysis
 Data Preparation
Editing
Coding
Data Entry
Validity of Data
 Qualitative Vs Quantitative Data Analyses
 Bivariate and Multivariate Statistical
Techniques
 Factor Analysis
 Discriminant Analysis
 Cluster Analysis
Unit 4 – Data Preparation and Analysis …
 Multiple Regression and Correlation
 Multidimensional Scaling
 Application of statistical software for data
analysis
DISCRIMINANT ANALYSIS
(A Dependency Technique)
What is Discriminant Analysis?
 Discriminant analysis is used to predict group
membership
 Thistechnique is used to classify individuals /
objects into one of the alternative groups on
the basis of a set of predictor variables.
 The dependent variable in discriminant
analysis is categorical whereas the
independent variables are either interval or
ratio scale in nature.
 When we have more than two groups
(categories) of dependent variables, it is a
case of multiple discriminant analysis.
What is Discriminant Analysis? …
 DV = Nominal Scaled
Buyers – Non Buyers
Creditworthy – Not Creditworthy
Superior – Average – Poor Products
 IV
may be one or more interval/ratio
scaled
 Once the discriminant equation is found
it can be used to predict the
classification of a new observation
Discriminating Variables: Example
A wool mfr. is interested in ascertaining the
relative importance of following yarn
characteristics between prospective buyers /
non-buyers:
Durability
Lightness in weight
Low investment in conversion facilities
Rot resistance
 Prospects rate the product on these
characteristics on, say, a 11 point scale
 Discriminant model can be developed to
identify relative importance of variables in
discriminating between the two groups
Discriminant Function
 Thediscriminant equation is a linear
function of the form
Di = d0 + d1X1 + d2X2 + … + dpXp
Where
Di is the score on discriminant function i
The di’s are weighting coefficients
d0 is a constant
X’s are values of the discriminating
variables
Discriminant Analysis Model
 The method of estimating the di’s is based on the principle that
the ratio of:
 between group sum of squares to
 within group sum of squares
be maximized
 This will make the groups differ as much as possible on
the values of the discriminant function.
 After having estimated the model, the di coefficients (also
called discriminant coefficient) are used to calculate Di, the
discriminant score by substituting the values of Xs in the
estimated discriminant model.
 The discriminant function with a constant term is called
unstandardized, whereas the one without the constant term
is known as the standardized discriminant function
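The estimation principle above — maximize the ratio of between-group to within-group sum of squares — has a closed form for the two-group case (Fisher's classic solution). The sketch below is illustrative; the function name and data are my own, not from the slides:

```python
import numpy as np


def fisher_discriminant(X1, X2):
    """Two-group discriminant coefficients: the weights w maximize the
    ratio of between-group to within-group sum of squares. A case is
    assigned to group 1 when its score X @ w exceeds the cutoff."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-group scatter matrix
    Sw = (np.cov(X1, rowvar=False) * (len(X1) - 1)
          + np.cov(X2, rowvar=False) * (len(X2) - 1))
    w = np.linalg.solve(Sw, m1 - m2)    # discriminant coefficients (the di's)
    cutoff = w @ (m1 + m2) / 2          # midpoint between the group mean scores
    return w, cutoff
```

Once estimated, the same weights can classify a new observation, as the deck describes.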
DA Method
 If the categorization calls for 2 groups a
single discriminant equation is required
 If three groups are involved in the
classification, it requires two
discriminant equations
 If more categories are called for in the
dependent variable, one needs N – 1
discriminant functions (N = number of groups)
DA Example:
 KDL, a media firm, is hiring MBAs for its
“Account Executives” position
 You begin by gathering data on 30 MBAs
who have been hired in recent years.
 Fifteen of these have been successful
employees, while other 15 have been
unsatisfactory.
 You need to develop a procedure to improve
the process.
 The files provide the following information:
DA Example …
 X1 = years of prior work experience
 X2 = GPA in graduate program
 X3 = employment test scores
 Discriminant analysis determines how
well these three independent variables
will correctly classify those who are
judged successful from those judged
unsuccessful
SPSS - Discriminant Analysis
Output

                              Predicted Group
Actual Group         Cases       0         1
Unsuccessful (0)       15       13         2
                              86.70%    13.30%
Successful (1)         15        3        12
                              20.00%    80.00%

Note: Percent of “grouped” cases correctly classified:
83.33%
SPSS - Discriminant Analysis
Output

            Unstandardized   Standardized
X1              .36084          .65927
X2             2.61192          .57958
X3              .53028          .97505
Constant      12.89685

D = 0.659X1 + 0.580X2 + 0.975X3
Discriminant Analysis: Example
Discriminant Analysis: Example …
Discriminant Analysis: Example …
Objectives of Discriminant Analysis
 The objectives of discriminant analysis are the
following:
 To find a linear combination of variables that
discriminate between categories of dependent
variable in the best possible manner
 To find out which independent variables are relatively
better in discriminating between groups
 To determine the statistical significance of the
discriminant function
 To develop the procedure for assigning new objects,
firms or individuals, whose profile but not the group
identity is known, to one of the two groups.
 To evaluate the accuracy of classification
Uses of Discriminant Analysis
 Some of the uses of Discriminant Analysis
are:
Scale construction: Discriminant analysis is used
to identify the variables/statements that are
discriminating and on which people with diverse
views will respond differently.

Perceptual mapping: The technique is also used


extensively to create attribute-based spatial maps
of the respondent’s mental positioning of brands.

Segment discrimination: To understand what are


the key variables on which two or more groups
differ from each other
Uses of Discriminant Analysis …
 Questions to which one may seek
answers are as follows:
What are the demographic variables on which
potentially successful / unsuccessful salesmen
differ?
What are the variables on which users/non-users
of a product can be differentiated?
What are the variables on which the buyers of
local/national brand of a product be differentiated?
Definitions of Key Terms used
in Discriminant Analysis
 Eigenvalue - The basic principle in the estimation of a
discriminant function is that the variance between the
groups relative to the variance within the group should
be maximized. The ratio of between group variance to
within group variance is called Eigenvalue.

 Canonical Correlation - Canonical correlation is the


simple correlation coefficient between the discriminant
score and the group membership. (0, 1 or 1,2 etc.)

 Wilks’ Lambda – It is given by the ratio of within-group
sum of squares to total sum of squares. Wilks’
lambda takes a value between 0 and 1, and the lower the
value of Wilks’ lambda, the higher the significance of
the discriminant function
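The Wilks' lambda ratio just defined is straightforward to compute from a set of discriminant scores split by group (an illustrative sketch; the function name and sample scores are my own):

```python
import numpy as np


def wilks_lambda(groups):
    """Wilks' lambda = within-group SS / total SS for discriminant
    scores split by group; values near 0 mean strong separation,
    values near 1 mean the groups barely differ."""
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()
    ss_within = sum(((g - g.mean())**2).sum() for g in groups)
    ss_total = ((all_scores - grand_mean)**2).sum()
    return ss_within / ss_total
```

Well-separated groups drive the within-group share of variance, and hence lambda, towards zero.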
CLUSTER ANALYSIS
(An Interdependency Technique)
What is Cluster Analysis?
 Unlike techniques for analyzing relationships
among variables, Cluster Analysis is a set of
techniques for grouping similar objects,
people, entities, products, etc.
 Ex: Benefit segmentation of customers into
groups based on the benefits they seek from
the product category
 Cluster Analysis is similar to factor analysis,
especially when FA is applied to people
instead of variables
What is Cluster Analysis? …
 It differs from discriminant analysis in that DA
begins with well-defined groups composed of
two or more distinct sets of characteristics, in
search of a set of variables to separate them
 CA starts with an undifferentiated group of
people, events, or objects and attempts to
reorganize them into homogeneous
subgroups
 In CA similarity is based on multiple variables
 CA measures proximity between study
variables
What is Cluster Analysis? …
 The advantage of the technique is that it is
applicable to both metric and non-metric
data
 The grouping can be done post hoc , i.e. after
the primary data survey is over
 Objects that are grouped in one cluster are
homogeneous as compared to those in other
clusters
Basic Steps in Cluster Analysis

Select sample to cluster
(e.g. buyers, medical patients, inventory, products, employees)

Define the variables
(e.g. market segment characteristics, symptom classes, product
definitions, financial status, political affiliation, competition,
productivity attributes)

Compute similarities
(computation of similarities among the entities is done through
correlation, Euclidean distances and other techniques)

Select mutually exclusive clusters
(maximization of with-in cluster similarity and between-cluster
differences)

Compare and validate the clusters
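The steps above can be sketched with a minimal k-means pass — a non-hierarchical method in which seeds are drawn from the data, Euclidean distance measures similarity, and points are grouped around the nearest seed. This is an illustrative Python version, not a procedure from the slides:

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means: cluster seeds are random rows of X, each point
    joins the nearest seed (Euclidean distance), and seeds move to
    their cluster means until assignments stabilize."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # distance from every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

On clearly separated data the pass recovers the underlying groups regardless of which rows serve as seeds.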
Cluster Analysis: Example
(perceptual map of car buyers: potential minivan or sport-utility
vehicle buyers; luxury car buyers; a sport and performance car
segment; buyers of a sedan)
Cluster Membership
________Number of Clusters ________

Film Country Genre Case 5 4 3 2


Cyrano de Bergerac France DramaCom 1 1 1 1 1
Il y a des Jours France DramaCom 4 1 1 1 1
Nikita France DramaCom 5 1 1 1 1
Les Noces de Papier Canada DramaCom 6 1 1 1 1
Leningrad Cowboys . . . Finland Comedy 19 2 2 2 2
Storia de Ragazzi . . . Italy Comedy 13 2 2 2 2
Conte de Printemps France Comedy 2 2 2 2 2
Tatie Danielle France Comedy 3 2 2 2 2
Crimes and Misdem . . . USA DramaCom 7 3 3 3 2
Driving Miss Daisy USA DramaCom 9 3 3 3 2
La Voce della Luna Italy DramaCom 12 3 3 3 2
Che Hora E Italy DramaCom 14 3 3 3 2
Attache-Moi Spain DramaCom 15 3 3 3 2
White Hunter Black . . . USA PsyDrama 10 4 4 3 2
Music Box USA PsyDrama 8 4 4 3 2
Dead Poets Society USA PsyDrama 11 4 4 3 2
La Fille aux All . . . Finland PsyDrama 18 4 4 3 2
Alexandrie, Encore . . . Egypt DramaCom 16 5 3 3 2
Dreams Japan DramaCom 17 5 3 3 2
Dendrogram
Usage of Cluster Analysis
 Market segmentation – customers /
potential customers can be split into
smaller more homogenous groups by
using the method.
 Segmenting industries – the same
grouping principle can be applied for
industrial consumers.
 Segmenting markets – cities or regions
with similar or common traits can be
grouped on the basis of climatic or socio-
economic conditions.
Usage of Cluster Analysis …
 Career planning and training analysis –
for human resource planning people can
be grouped into clusters on the basis of
their educational/experience or aptitude
and aspirations.
 Segmenting financial sector /
instruments – different factors like raw
material cost, financial allocations,
seasonality and other factors are being
used to group sectors together to
understand the growth and performance of
a group of industries.
Key Concepts in Cluster Analysis
 Cluster seeds: Initial points from which one
starts. Then the clusters are created around
these seeds.
 Cluster membership: This indicates the
address or the cluster to which a particular
person/object belongs.
 Dendrogram: This is a tree like diagram that
is used to graphically present the cluster
results. The vertical axis represents the
objects and the horizontal represents the
inter-respondent distance. The figure is to be
read from left to right.
Key Concepts in Cluster Analysis …
 Entropy group: The individuals or small groups
that do not seem to fit into any cluster.
 Hierarchical methods: A step-wise process that
starts with the most similar pair and formulates a
tree-like structure composed of separate clusters.
 Non-hierarchical methods: Cluster seeds or
centres are the starting points and one builds
individual clusters around it based on some pre-
specified distance of the seeds.
Unit 4 – Data Preparation and Analysis
 Data Preparation
Editing
Coding
Data Entry
Validity of Data
 Qualitative Vs Quantitative Data Analyses
 Bivariate and Multivariate Statistical
Techniques
 Factor Analysis
 Discriminant Analysis
 Cluster Analysis
Unit 4 – Data Preparation and Analysis …
 Multiple Regression and Correlation
 Multidimensional Scaling
 Application of statistical software for data
analysis
MULTIPLE REGRESSION
(A Dependency Technique)
Dependency Techniques

Multiple Regression

Discriminant Analysis

MANOVA

Structural Equation Modeling (SEM)

Conjoint Analysis
Uses of Multiple Regression
 Develop a self-weighting estimating equation to
predict values for a DV
 Control for confounding variables
 Test and explain causal theories
Simple Regression
 Simple linear regression equation can be
presented as
Y = α + βX + ε
 Where,
ε = Stochastic error term
α, β = Parameters to be estimated

 The equation is estimated using the ordinary


least squares (OLS) method of estimation.
 The OLS method of estimation states that the
regression line should be drawn in such a
way so as to minimize the error sum of
squares.
Simple Regression Analysis
The OLS method of estimation would result in the
following two normal equations:

ΣY  = nα + βΣX
ΣXY = αΣX + βΣX²

Solving the above normal equations results in:

β̂ = (nΣXY – ΣX·ΣY) / (nΣX² – (ΣX)²)

Once β̂ is estimated, the value of α may be
computed as

α̂ = Ȳ – β̂X̄
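The closed-form solution of the normal equations translates directly into code (a minimal sketch; the function name is my own):

```python
def ols_fit(x, y):
    """Estimate alpha and beta for Y = alpha + beta*X by ordinary least
    squares, using the closed-form solution of the normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # beta-hat = (n*SXY - SX*SY) / (n*SXX - SX^2)
    beta = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    # alpha-hat = Ybar - beta-hat * Xbar
    alpha = sy / n - beta * sx / n
    return alpha, beta
```

On data that lie exactly on a line, the estimates reproduce the line's intercept and slope.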
Generalized Multiple Regression Equation

Y = α + β1X1 + β2X2 + … + βkXk + ε
Multiple Regression: Example
‘Hybrid-mail’ Business
 Type a document on a PC and email to a
distant international terminal near the
addressee
 Itwill be printed and delivered via local postal
service
 Links the world’s ‘wired’ with the ‘non-wired’
(ex: Afghanistan & Iraq)
 Key Drivers of Customer Usage:
1. Cost Vs. Speed Valuation
2. Security (document privacy)
3. Reliability
4. Impact / Emotional value
‘Hybrid-mail’ Business
 Let us consider the first three variables
 Measurements of customer perception
on all the 3 variables are done on 5-point
scales
 For this equation:
Y = Customer Usage
X1 = Cost / Speed
X2 = Security
X3 = Reliability
Methods of Selecting Variables for
the Equation
Forward
(Add largest R2 Variables)

Backward
(Remove variables that change
R2 the least)

Stepwise
SPSS Output

Y = - 0.093 + 0.448X1 + 0.315X2 + 0.254X3


Evaluating and Dealing with
Multicollinearity
 Collinearity exists when two IVs are highly
correlated
 A Variance Inflation Factor (VIF) > 10
suggests collinearity
(the VIF column of the example’s SPSS collinearity
statistics: 1.000, 2.289, 2.289, 2.748, 3.025, 3.067)
 Remedies:
Choose one of the variables and delete the other,
or
Create a new variable that is a composite of the
others
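The VIF for each IV is obtained by regressing it on the remaining IVs and taking 1 / (1 − R²). A minimal NumPy sketch (function name and simulated data are my own):

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X: regress the
    column on the remaining columns and compute 1 / (1 - R^2)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])     # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1 / (1 - r2))
    return out
```

Independent predictors give VIFs near 1; a predictor that is nearly a linear combination of the others blows past the 10 threshold.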
Dummy Variables in Regression
Analysis
 In regression analysis, the dependent variable
is generally metric in nature and it is most
often influenced by other metric variables
 However, there could be situations where the
dependent variable may be influenced by the
qualitative variables like gender, marital
status, profession, geographical region, color,
or religion.
 The question arises how to quantify qualitative
variables
Dummy Variables in Regression
Analysis …
 In situations like this, the dummy variables
come to our rescue. They are used to quantify
the qualitative variables.
 The number of dummy variables required in
the regression model is equal to the number
of categories of data less one
 Dummy variables usually assume either of the
two values 0 or 1.
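The "categories less one" rule can be sketched as follows (a hypothetical `make_dummies` helper in plain Python; the first category is treated as the reference level, an assumption of this sketch):

```python
def make_dummies(values):
    """Encode a categorical variable with k categories as k-1 dummy
    (0/1) columns; the omitted first category is the reference level."""
    categories = sorted(set(values))
    reference, rest = categories[0], categories[1:]
    # one 0/1 column per non-reference category
    return {c: [1 if v == c else 0 for v in values] for c in rest}
```

With three regions, for example, only two dummy columns are produced; a row of all zeros identifies the reference region.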
CORRELATION
(An Interdependency Technique)
What is Correlation?
 Correlation
measures the degree of
association between two or more
variables
 When we are dealing with two
variables, we are talking in terms of
simple correlation
 When more than two variables are
involved, the subject matter of
interest is called multiple correlation.
Types of Correlation
 Positive correlation - When two
variables X and Y move in the same
direction, the correlation between the
two is positive.
 Negative correlation: When two
variables X and Y move in the opposite
direction, the correlation is negative.
 Zero correlation: The correlation
between two variables X and Y is zero
when the variables move in no
connection with each other.
Graphical Presentation of Positive
Correlation
Graphical Presentation of
Negative Correlation
Graphical Presentation of Zero
Correlation
Quantitative Estimate of a
Linear Correlation
 A quantitative estimate of a linear correlation between
two variables X and Y is given by Karl Pearson as:

r = Σ(X – X̄)(Y – Ȳ) / √[ Σ(X – X̄)² · Σ(Y – Ȳ)² ]
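Pearson's coefficient is simple to compute from first principles (illustrative sketch; the function name is my own):

```python
import math


def pearson_r(x, y):
    """Karl Pearson's coefficient of linear correlation between X and Y:
    the sum of cross-deviations divided by the product of the root
    sums of squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)
```

Variables moving together in the same direction give r = +1, in opposite directions r = −1, and with no connection r near 0, matching the three types of correlation above.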
Unit 4 – Data Preparation and Analysis
 Data Preparation
Editing
Coding
Data Entry
Validity of Data
 Qualitative Vs Quantitative Data Analyses
 Bivariate and Multivariate Statistical
Techniques
 Factor Analysis
 Discriminant Analysis
 Cluster Analysis
Unit 4 – Data Preparation and Analysis …
 Multiple Regression and Correlation
 Multidimensional Scaling
 Application of statistical software for data
analysis
MULTI DIMENSIONAL SCALING (MDS)
(An Interdependency Technique)
MDS - Basics
 Multidimensional scaling (MDS) creates a
spatial description of a respondent’s
perception about a product, service, or
other object of interest on a perceptual map
 Helps to understand difficult-to-measure
constructs (eg. Product quality or product
desirability)
 Many constructs are cognitively mapped in
different ways by individuals, unlike
variables that can be measured
 With MDS items perceived to be similar will
fall close together on the perceptual map.
MDS – Basics …
 There are three types of attribute space,
each representing a multidimensional
map.
 There is objective space in which an object can
be positioned in terms of its measurable
attributes (eg. weight, nutritional value)
 There is subjective space where perceptions of
the object’s weight, nutritional value, etc. may
be positioned
 With a third map, we can describe respondents’
preferences using the object’s attributes. This
represents their ideal.
MDS – Basics …
 Ideally, objective and subjective attribute
assessments must coincide.
 A comparison of the two allows us to
judge how accurately an object is being
perceived
 Also, a person’s perception varies over
time and in different circumstances
 Such measurements are valuable to
gauge the impact of advertising
programs
MDS Example: Study of Restaurants
 Similarities
among the 16 restaurants were
measured by asking patrons questions on a
5-point metric scale about different
dimensions of service quality and price
 The matrix of similarities is shown in the next
slide
 Higher
numbers reflect the items that are
more dissimilar
A computer program analyzed the data matrix
and generated a perceptual map
Similarities Matrix: Multidimensional Scaling
 The most similar pair (restaurants 3, 6) must be located
closer together in the multidimensional space than any
other pair
 The least similar pair (restaurants 14, 15) must be
farthest apart
Positioning of Selected Restaurants
(perceptual map: the most similar pair, restaurants 3 and 6,
plot closest together; the least similar pair, restaurants 14
and 15, plot farthest apart)
MDS Example …
 All distances between pairs of points
closely correspond to the original matrix
 The goal is to secure the structure that
provides a good fit with the fewest
dimensions
 MDS is best understood using two or at
most three dimensions
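One standard way to derive such a low-dimensional map from a (dis)similarity matrix is classical (Torgerson) MDS. The deck does not specify the algorithm its software used, so this NumPy sketch is purely illustrative:

```python
import numpy as np


def classical_mds(D, dims=2):
    """Classical (Torgerson) MDS: double-center the squared distance
    matrix and use the top eigenvectors, scaled by the square roots
    of their eigenvalues, as map coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D**2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dims]   # keep the largest dimensions
    coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))
    return coords
```

When the input distances are exactly Euclidean, the recovered configuration reproduces them (up to rotation), which is why similar items land close together on the map.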
Uses of Multidimensional Scaling
 Scale construction: The dimensions can be
reproduced as attributes in questionnaire to validate
their existence

 Brand image analysis: To measure the gap or match


between brand positioning and brand perception.

 New product development: To identify quadrants that


are less crowded and where a launch opportunity
exists.

 Pricing studies: Spatial maps with and without the


price dimension can be made to assess the relevance
of price/benefit trade off.

 Communication effectiveness: Before and after spatial


maps can be made to measure new advertising
impact or repositioning exercise.
Uses of Multidimensional Scaling …
 The method of multidimensional scaling
is used under two conditions:
 For an exploratory study to decipher the
probable underlying attributes or causes of
certain observed patterns of behavior.
 For descriptive research studies when
comparative evaluations of objects,
individuals or brands need to be done in
the consumer’s mind space.
MDS Procedure

Formulate the research objectives
(identify objects to be compared; identify the unit of analysis)

Collect similarity data or preference data
(ordinal / interval)

MDS output
(metric or non-metric)

Identify the number of dimensions

Interpret the solution

Establish the strength of the MDS solution
Establishing Strength of MDS Solution
 The Kruskal Stress Score, i.e. the discrepancy
scores obtained between the derived distances
on a configured map and the actual distance as
indicated by the respondents’ choice.
 The ideal representation would be a stress value
of 0%. However, it is acceptable to consider a
solution till a 20% stress between the actual and
the derived configuration.
 The R-square value: measures the proportion
of the variance of the final scaled solution that
can be accounted for by the MDS procedure.
 The ideal would be 1. However, an R-square
value of 0.6 or above is acceptable.