CHAPTER-12-Data Analysis Using SPSS
ONE-SAMPLE TESTS
• One-sample t test
• Binomial test
• Chi square test for goodness-of-fit
ASSOCIATION TESTS
• Pearson correlation
• Spearman correlation
• Partial correlation
• Chi square test for association
• Loglinear analysis
PREDICTIVE TECHNIQUES
• Simple regression
• Multiple regression
• Multiple regression with dummy variables
• Sequential (hierarchical) regression
• Binomial regression
• Multinomial regression
• Ordinal regression
SCALING TECHNIQUES
• Reliability analysis
• Multidimensional scaling
DATA REDUCTION
• Principal component analysis
• Correspondence analysis
GROUPING METHODS
• Cluster analysis
• Discriminant analysis
Employee data.sav
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid Clerical   363         76.6      76.6            76.6
• ..\2.practice Datasets\contacts.sav
$$ z = \frac{\text{Kurtosis}}{\text{Std. error}} = \frac{-0.286}{0.566} = -0.505 $$
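The z computation above can be reproduced with a minimal Python sketch. The ±1.96 cut-off is an assumption (the conventional 5% two-sided threshold); the excerpt does not state which cut-off it uses.

```python
def z_score(statistic, std_error):
    """Divide a skewness or kurtosis statistic by its standard error."""
    return statistic / std_error

z = z_score(-0.286, 0.566)   # kurtosis and standard error from the text

# Assumed rule of thumb: |z| < 1.96 suggests no significant departure from normality.
looks_normal = abs(z) < 1.96
print(round(z, 3), looks_normal)
```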
2. The second solution is to run a nonparametric test if our variables are
not normal (for example, the Spearman correlation or the Mann-Whitney test).
3. The third solution is to run the analysis regardless. Many parametric tests are fairly
robust to deviations from normality, especially when the samples are large.
There are two basic solutions for dealing with genuine outliers:
transformation:
$$ y = \sqrt{x} $$
For strongly positively skewed variables, you can choose between the following
two transformations:
$$ y = \log_{10} x \quad \text{or} \quad y = \frac{1}{x} $$
..\1. Training datasets\Employee data.sav
$$ y = \sqrt{x_{max} + 1 - x} $$
For strongly negatively skewed variables, you can choose between the following
two transformations:
$$ y = \log_{10}(x_{max} + 1 - x) \quad \text{or} \quad y = \frac{1}{x_{max} + 1 - x} $$
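A short Python sketch of these transformations may help. The function name `transform`, its parameters and the sample scores are hypothetical. The square-root option is included as the usual choice for moderate skew, which is an assumption here; negatively skewed variables are first reflected with x_max + 1 − x.

```python
import math

def transform(values, kind, reflect=False):
    """Apply a normalizing transformation; reflect=True is meant for negative skew."""
    x_max = max(values)
    out = []
    for x in values:
        v = x_max + 1 - x if reflect else x   # reflection: x_max + 1 - x
        if kind == "sqrt":
            out.append(math.sqrt(v))
        elif kind == "log10":
            out.append(math.log10(v))
        elif kind == "reciprocal":
            out.append(1 / v)
        else:
            raise ValueError(f"unknown kind: {kind}")
    return out

scores = [1, 2, 4, 10, 100]                      # hypothetical positive values
print(transform(scores, "log10"))                # strong positive skew
print(transform(scores, "log10", reflect=True))  # strong negative skew
```

The log and reciprocal transformations need strictly positive inputs, so variables containing zeros or negative values would have to be shifted first.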
ONE-SAMPLE TESTS
• One-sample t test
• Binomial test
• Chi square test for goodness-of-fit
Magazine Preference   Number of Readers (observed frequency)
A                     184
B                     167
C                     200
D                     149
Total                 700
Test Statistics
              Magazine preferences of readers
Chi-Square    8.263^a
df            3
Asymp. Sig.   .041
a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 175.0.
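The chi-square value in this output can be checked by hand. The sketch below assumes equal expected frequencies (700 / 4 = 175 readers per magazine), as the footnote indicates. The helper `chi2_sf_df3` is a hypothetical name; it uses a closed-form tail probability that holds only for 3 degrees of freedom.

```python
import math

observed = {"A": 184, "B": 167, "C": 200, "D": 149}   # observed frequencies
n = sum(observed.values())
expected = n / len(observed)                           # equal preferences: 175 each

# Sum of (observed - expected)^2 / expected over the four magazines.
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

def chi2_sf_df3(x):
    """Upper-tail probability of the chi-square distribution, valid for df = 3 only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_sf_df3(chi_square)
print(round(chi_square, 3), df, round(p_value, 3))
```

For other degrees of freedom a statistics library would be the more practical choice.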
Exercise
• In the database insurance_claims.sav, do a chi
square test for goodness-of-fit using the
variable townsize. (In the theoretical
distribution all frequencies will be equal.)
ASSOCIATION TESTS
• Pearson correlation
• Spearman correlation
• Partial correlation
2. The variable values are pairs (there are two values, two
measurements for each sample case).
$$ r^2 = 0.807^2 = 0.651 $$
                                                                     Educational Level (years)   Employment Category
Spearman's rho   Educational Level (years)   Correlation Coefficient   1.000                       .484**
$$ \rho^2 = 0.688^2 = 0.473 $$
The variation in the education level
statistically explains 47% of the
variation in the current salary
claim_type.
5. The variances of the dependent variable in the two groups are equal
(in other words, we have homogeneity of variances).
Dr. Abenet Yohannes, JJU
…
The null and alternative hypotheses of the Levene test:
…
The independent variable is categorical, with three or more categories.
4. The dependent variable does not present significant outliers in any group.
5. The dependent variable has equal variances in all groups. If this condition
is not met, we have to use a robust version of the F test, called the Welch
test.
…
Example: new vitamin test
                                      Kolmogorov-Smirnov^a          Shapiro-Wilk
                    Dose of vitamin   Statistic   df    Sig.        Statistic   df    Sig.
Effort resistance   Placebo           .054        120   .200*       .992        120   .748
• the employees in the second group will take the vitamin in low dose
• the employees in the third group will take the vitamin in high dose.
H0: the group means for the factor A are equal in the
total population
H1: the group means for the factor A are different in the
total population
4. The dependent variable does not present significant outliers in any group.
• the employees in the second group will take the vitamin in low dose
• the employees in the third group will take the vitamin in high dose.
We want to know whether the three factors (dose, gender and type of employee) have a
combined influence on the effort resistance. In other words, we want to
detect the interaction effect of these variables.
H0: the group means (for the composite variable) are equal in the
population
3. The number of cases in each group is greater than the number of dependent variables.
5. The relationships between the dependent variables are approximately linear in all groups.
6. There is no significant multicollinearity – the dependent variables are not strongly correlated
with each other.
9. There is homogeneity of variances – the group variances are equal for each dependent variable.
• the employees in the second group will take the vitamin in low
dose
• the employees in the third group will take the vitamin in high
dose.
1.
…
The independent variable is categorical with three or more groups.
2. The dependent variable is continuous.
3. The covariate is continuous.
4. The residual values of the dependent variable are normally distributed.
5. The dependent variable does not present significant outliers.
6. The relationship between the dependent variable and the covariate is
approximately linear in all factor groups.
7. There is no correlation between the independent variable (factor) and
the covariate. This is the most important condition of all.
8. There is homogeneity of variances – the variances of the dependent
variable are equal in all factor groups.
9. There is homoscedasticity – the variance of the residuals is equal for all
the predicted values of the dependent variable.
Initial and adjusted
… group means
Group Initial mean Adjusted mean
Placebo 12.09 13.21
Low dose 13.10 13.43
High dose 16.44 14.66
• type as a factor
• price as a covariate.
H1: at least one mean differs from the others in the total
population
After a pause, the diet starts again, this time combined with physical
exercises. The weights are measured again, at the same moments:
• weight_beg_ex – weight_beg
• weight_mid_ex – weight_mid
• weight_end_ex – weight_end
…
The simple main effects for the moment factor consist of the differences between
the subjects' weights at the three moments of the dieting period, for the same
level of the exercise factor. So for the level 1 (without exercise), the differences to
compute will be:
• weight_mid – weight_beg
• weight_end – weight_mid
• weight_end – weight_beg
For the level 2 (with exercise) the differences will be:
• weight_mid_ex – weight_beg_ex
• weight_end_ex – weight_mid_ex
• weight_end_ex – weight_beg_ex
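The two sets of differences can be sketched in Python as follows. The weights are hypothetical values for a single subject; only the variable names come from the text. The sketch computes raw differences and does not perform the significance tests.

```python
# Hypothetical weights (kg) for one subject; the variable names follow the text.
subject = {
    "weight_beg": 90.0, "weight_mid": 87.5, "weight_end": 86.0,
    "weight_beg_ex": 89.0, "weight_mid_ex": 85.0, "weight_end_ex": 82.0,
}

def moment_differences(row, suffix=""):
    """Return the three pairwise differences between moments for one exercise level."""
    beg, mid, end = (row[f"weight_{m}{suffix}"] for m in ("beg", "mid", "end"))
    return {"mid - beg": mid - beg, "end - mid": end - mid, "end - beg": end - beg}

level_1 = moment_differences(subject)          # level 1: without exercise
level_2 = moment_differences(subject, "_ex")   # level 2: with exercise
print(level_1)
print(level_2)
```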
Our sample is divided into two independent groups, men and women.
4. The distributions of scores in the two groups have the same shape. If
this assumption is met, we use the test results to evaluate the
difference between the group medians. If the distributions have
different shapes, we report and interpret the difference between the
mean ranks instead.
groups.
condition is not satisfied, we can use the median test instead of Kruskal-
Wallis.
$$ \chi^2 = \frac{(|B - C| - 1)^2}{B + C} $$
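As a hedged illustration, the statistic above can be computed directly. The sketch assumes that B and C are the two discordant cell counts of a 2×2 paired table and that the difference is taken in absolute value (the usual continuity-corrected McNemar form); the excerpt does not define the symbols, and the counts are hypothetical.

```python
def mcnemar_chi_square(b, c):
    """Continuity-corrected statistic (|b - c| - 1)^2 / (b + c) for discordant counts."""
    if b + c == 0:
        raise ValueError("no discordant pairs")
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical counts: 15 cases changed in one direction, 5 in the other.
chi_square = mcnemar_chi_square(15, 5)
print(chi_square)
```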
PREDICTIVE TECHNIQUES
• Simple regression
• Multiple regression
• Multiple regression with dummy variables
• Sequential (hierarchical) regression
• Binomial regression
• Multinomial regression
• Ordinal regression
𝒚 = 𝒃𝟎 + 𝒃𝟏 𝒙 + 𝜺
Where y is the dependent variable, x is the
independent variable, b1 is the regression
coefficient, and b0 is the constant (intercept).
$$ \varepsilon = y - (b_0 + b_1 x) $$
or
$$ \varepsilon = y - \hat{y} $$
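A minimal sketch of how b0, b1 and the residuals could be obtained, assuming ordinary least squares estimation. The data and the function name `fit_simple_regression` are hypothetical.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares estimates of b0 and b1 for y = b0 + b1*x + e."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx                 # regression coefficient (slope)
    b0 = mean_y - b1 * mean_x      # constant (intercept)
    return b0, b1

x = [1, 2, 3, 4, 5]                # hypothetical independent variable
y = [2, 4, 5, 4, 5]                # hypothetical dependent variable
b0, b1 = fit_simple_regression(x, y)

# Residual for each case: observed y minus predicted y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(b0, 3), round(b1, 3))
```

With an intercept in the model, least squares residuals sum to zero, which gives a quick sanity check on the fit.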
1.
…
Both variables (dependent and independent) are continuous.
5. The dependent variable has the same variance for all the
values of the independent variable (there is homoscedasticity).
For example:
$$ y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k + \varepsilon $$
5. The dependent variable has the same variance for all the values of the
independent variables (the assumption of homoscedasticity).
7. The independent variables are not strongly correlated with one another
(there is no substantial multicollinearity).
For example:
$$ \ln \frac{p}{1 - p} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k $$
$$ \frac{p}{1 - p} = e^{b_0} e^{b_1 x_1} \cdots e^{b_k x_k} $$
$$ p = \frac{o}{1 + o} $$
$$ p = \frac{2.950}{1 + 2.950} \approx 0.747 $$
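The conversion from odds to probability can be sketched in a few lines of Python. The function names are hypothetical; `predicted_odds` assumes the logit model in which the odds equal the exponential of the linear predictor.

```python
import math

def odds_to_probability(odds):
    """Convert odds o into a probability p = o / (1 + o)."""
    return odds / (1 + odds)

def predicted_odds(b0, coefs, xs):
    """Odds implied by the logit model: exp(b0 + b1*x1 + ... + bk*xk)."""
    return math.exp(b0 + sum(b * x for b, x in zip(coefs, xs)))

p = odds_to_probability(2.950)   # odds value taken from the text
print(round(p, 4))
```

Odds of 1 correspond to a probability of 0.5, so odds above 1 always give a probability above 50%.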
$$ p = \frac{1.004}{1 + 1.004} \approx 0.50 = 50\% $$
• for high income subjects: 51%
• Daily News
• National Politics
• Free Tribune
Daily News 1 1 0
National Politics 2 0 1
Free Tribune 3 0 0
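The coding scheme in this table can be generated programmatically. This is a sketch with a hypothetical helper, `dummy_code`; it assumes, as the rows of zeros suggest, that Free Tribune is the reference category.

```python
def dummy_code(category, categories, reference):
    """Return one 0/1 indicator per non-reference category, in the given order."""
    if category not in categories:
        raise ValueError(f"unknown category: {category}")
    return [int(category == c) for c in categories if c != reference]

newspapers = ["Daily News", "National Politics", "Free Tribune"]
for paper in newspapers:
    # The reference category gets 0 on every dummy variable.
    print(paper, dummy_code(paper, newspapers, reference="Free Tribune"))
```

A variable with c categories needs c − 1 dummy variables, which is why three newspapers produce two indicator columns.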
$$ p = \frac{o}{1 + o} $$
• the chance that a left-wing person reads Daily News is about 92%
lower than the chance that he or she reads Free Tribune (1 − 0.083 = 0.917),
or about 12 times lower (1/0.083 = 12.04).
• the chance that a right-wing person reads Daily News is about 52%
lower than the chance that he or she reads Free Tribune (1 − 0.485 = 0.515),
or about 2 times lower (1/0.485 = 2.06).
If a right-wing subject should choose from these two newspapers only:
B       Exp(B)
0.303   1.354
• the chance that an aged subject reads Daily News is 1.3 times
higher than the chance that the same subject reads the Free
Tribune
Categories                 B        Exp(B)
political=1 (left-wing)    -2.171   0.114
political=2 (right-wing)   0.944    2.570
political=3 (center)       1.227    3.410
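Each Exp(B) value is simply the antilogarithm of the corresponding B coefficient, which can be verified with a short sketch. Because the B values in the table are rounded to three decimals, the recomputed Exp(B) may differ from the table in the last digit.

```python
import math

coefficients = {                       # B values from the table above
    "political=1 (left-wing)": -2.171,
    "political=2 (right-wing)": 0.944,
    "political=3 (center)": 1.227,
}

# Exp(B) is the factor by which the odds are multiplied for that category.
exp_b = {label: math.exp(b) for label, b in coefficients.items()}

for label, value in exp_b.items():
    print(f"{label}: B = {coefficients[label]}, Exp(B) = {value:.3f}")
```

A negative B always gives Exp(B) below 1 (lower odds), and a positive B gives Exp(B) above 1 (higher odds).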
$$ p = \frac{o}{1 + o} $$
• the chance that a right-wing person reads National Politics is about 5.3
times higher than the chance that he or she reads Daily News.
• the chance that a center person reads National Politics is about 86%
lower than the chance that he or she reads Daily News (1 − 0.137 = 0.863) –
B        Exp(B)
-0.278   0.757
2) the second variable will separate the “not at all satisfied” and “not really satisfied”
customers (1 and 2) from the others (3 and 4)
3) the third variable will separate the “not at all satisfied”, “not really satisfied” and
“satisfied” customers (1, 2 and 3) from the others (4)
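The splitting logic can be sketched as a set of cumulative binary indicators. Two assumptions are made here: the first variable (not shown in this excerpt) separates level 1 from levels 2 to 4, and the lower group is the one coded 1. The function name is hypothetical.

```python
def cumulative_splits(level, n_levels=4):
    """Return n_levels - 1 indicators; indicator j is 1 when level <= j."""
    if not 1 <= level <= n_levels:
        raise ValueError("level out of range")
    return [int(level <= j) for j in range(1, n_levels)]

# Satisfaction levels 1 (not at all satisfied) to 4, one row per level.
for level in (1, 2, 3, 4):
    print(level, cumulative_splits(level))
```

A scale with four levels therefore yields three binary variables, one per cut point.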
$$ \ln \frac{p_j}{1 - p_j} = b_{0j} + b_{1j} x_1 + b_{2j} x_2 + \cdots + b_{kj} x_k $$
1. …
Using the PLUM procedure (PLUM stands for Polytomous
Universal Model). This procedure has a downside: it does not
automatically compute the antilogarithms of the regression
coefficients.
Not important 1 1 0
Somewhat important 2 0 1
Very important 3 0 0
•
…
If the antilog is higher than 1 for a certain category, the subjects
in that category have higher levels of satisfaction than the
subjects in the reference category
Level                    B        Exp(B)
Not important (1)        0.037    1.038
Somewhat important (2)   1.161    3.194
Very important (3)       -1.198   0.301
SCALING TECHNIQUES
• Reliability analysis
• Multidimensional scaling
• Cronbach’s alpha
• Cohen’s kappa
• Kendall’s W
Cronbach’s alpha
Cronbach’s alpha helps us determine whether the
items of a scale measure the same underlying
dimension, that is, the same construct.
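For illustration, alpha can be computed from item scores with the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of total scores), where k is the number of items. The formula is not given in this excerpt, and the data below are hypothetical.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, all lists covering the same respondents."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]      # total score per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical answers of 5 respondents to 3 items.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 4, 4],
    [1, 3, 3, 5, 5],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))
```

Items that rise and fall together across respondents push alpha towards 1; unrelated items push it towards 0.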
5. When I need a home appliance, the brand X is the first one I think of.
The rating scale is categorical (nominal or ordinal) with disjoint categories (each
rater gives each subject one and only one score).
1. The evaluation takes place on the same sample, every subject receiving two
scores, one for every rater.
2. Each rater uses the same scale, with the same number of levels.
4. The two evaluators are predetermined (fixed), i.e. they are specifically
chosen to take part in the study. They are not selected at random, and we
cannot make generalizations for other evaluators.
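Under these conditions, Cohen's kappa can be sketched as below, using the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The definition is not spelled out in this excerpt, and the ratings are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same subjects on the same scale."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same subjects")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal proportions.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of 10 subjects on a three-level scale.
rater_a = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]
rater_b = [1, 1, 2, 3, 3, 3, 1, 2, 2, 1]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

The sketch does not handle the degenerate case where both raters use a single category for every subject, in which the denominator is zero.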
W Concordance
<0.20 Very low
0.21 – 0.40 Low
0.41 – 0.60 Moderate
0.61 – 0.80 Strong
0.81 – 1 Very strong
1 Perfect
H0: W=0
H1: W>0
• Exercise #2
DATA REDUCTION
• Principal component analysis
• Correspondence analysis
3. The number of cases in the sample is large enough (at least 10 cases for
each variable).
GROUPING METHODS
• Cluster analysis
• Discriminant analysis
𝒚 = 𝒃𝟎 + 𝒃𝟏 𝒙𝟏 + 𝒃𝟐 𝒙𝟐 + ⋯ + 𝒃𝒌 𝒙𝒌
where y is the dependent variable, the x’s
are the explanatory (independent) variables and the b’s are the
coefficients
Assumptions:
…
The dependent variable is categorical, with disjoint groups.
min(c-1, k)
where c is the number of categories in the
dependent variable, and k is the number of
independent variables.
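The rule above is easy to express in code. The function name is hypothetical, and the sketch assumes the expression gives the number of discriminant functions that can be extracted.

```python
def n_discriminant_functions(c, k):
    """min(c - 1, k): c categories in the dependent variable, k independent variables."""
    if c < 2 or k < 1:
        raise ValueError("need at least 2 categories and 1 independent variable")
    return min(c - 1, k)

# Hypothetical case: 3 groups and 4 predictors give 2 functions.
print(n_discriminant_functions(3, 4))
```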
Multiple discriminant analysis
The first function discriminates the younger
employees, with higher education and big salaries, from
the older employees, with lower salaries and lower
education levels
o Soccer
o Baseball
o Tennis
o Football
o Swimming
o Adidas
o Nike
o Reebok
o Asics
4. Select the variables that form the set and move them to the Variables in
Set section
5. In Counted value, enter the value that represents a positive response (usually you are
interested in how many cases chose the option)
7. Click on Add
8. Click on Close