
CHAPTER-12-Data Analysis Using SPSS


JIGJIGA UNIVERSITY

SCHOOL OF GRADUATE STUDIES


COLLEGE OF BUSINESS AND ECONOMICS
DEPARTMENT OF ACCOUNTING AND FINANCE
Advanced Research Methods For Accounting And Finance

DATA ANALYSIS USING


STATISTICAL PACKAGE FOR SOCIAL SCIENCE

Dr. Abenet Yohannes (Ph.D.)

Sunday, 21 November 2021 4:59 PM


Goal of the training:
• Provide a practical guide to performing the most common statistical analyses with the SPSS package.

11/21/2021 Dr, Abenet, Jigjiga University 2


SECTION 1
The Basics
• Working with SPSS files
• Defining variables in SPSS
• Variable recoding
• Dummy variables
• Selecting cases
• File splitting
• Weighting cases



SECTION 2

CREATING CHARTS IN SPSS


• Column charts
• Line charts
• Scatterplot charts
• Boxplot diagrams



SECTION 3

SIMPLE ANALYSIS TECHNIQUES


• The Frequencies procedure
• The Descriptives procedure
• The Explore procedure
• The Means procedure
• The Crosstabs procedure



SECTION 4

ASSUMPTION CHECKING. DATA TRANSFORMATIONS

• Checking for normality


• Detecting outliers
• Data transformations



SECTION 5

ONE-SAMPLE TESTS
• One-sample t test
• Binomial test
• Chi square test for goodness-of-fit



SECTION 6

ASSOCIATION TESTS
• Pearson correlation
• Spearman correlation
• Partial correlation
• Chi square test for association
• Loglinear analysis



SECTION 7

TESTS FOR MEAN DIFFERENCE


• Independent samples t test
• Paired samples t test
• One-way ANOVA
• Two-way ANOVA
• Three-way ANOVA
• Multivariate ANOVA
• Analysis of covariance



SECTION 7

TESTS FOR MEAN DIFFERENCE (continued)


• Repeated measures ANOVA
• Within-within subjects ANOVA
• Mixed ANOVA
• Mann-Whitney test
• Wilcoxon test
• Kruskal-Wallis test
• Friedman test
• McNemar test



SECTION 8

PREDICTIVE TECHNIQUES
• Simple regression
• Multiple regression
• Multiple regression with dummy variables
• Sequential (hierarchical) regression
• Binomial regression
• Multinomial regression
• Ordinal regression



SECTION 9

SCALING TECHNIQUES
• Reliability analysis
• Multidimensional scaling



SECTION 10

DATA REDUCTION
• Principal component analysis
• Correspondence analysis



SECTION 11

GROUPING METHODS
• Cluster analysis
• Discriminant analysis



SECTION 12

• Multiple Response Analysis



SECTION 1
The Basics
• Purpose and use of SPSS
• Open SPSS
• What is a variable?
• Defining variables in SPSS
• Entering data
• Open and save data files
• Import data from Excel
• Handling missing data
• Selecting cases
• File splitting
• Weighting cases



• SPSS is a statistical software package.
• SPSS is a tool:
✓ It only does what it is "told" to do.
✓ It does not think for you.
✓ It is not a black box.

• You need to know the correct statistics for your research BEFORE using SPSS.
• If you understand the statistics, then you are ready to do analysis in SPSS.

SECTION 2

CREATING CHARTS IN SPSS


• Column charts
• Line charts
• Scatterplot charts
• Boxplot diagrams

The column chart is used when our independent variable has discrete values (i.e. it is categorical, either nominal or ordinal).

Employee data.sav

Line charts are generally used to represent time series, but they can also be an alternative to bar charts when we want to visualize the differences between the levels of a categorical variable.
..\1. Training datasets\insurance_claims.sav

The scatterplot charts are used to
represent the relationship between
two continuous variables.
Employee data.sav

SECTION 3

SIMPLE ANALYSIS TECHNIQUES


• The Frequencies procedure
• The Descriptives procedure
• The Explore procedure
• The Means procedure
• The Crosstabs procedure


The Frequencies procedure generates
the frequency table for the variable of
interest, as well as a good number of
important statistics (if our variable is
continuous).
Research Question #1
What kind of computer do people prefer to own?
..\1. Training datasets\Frequencies.sav



Employment Category

                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Clerical          363      76.6            76.6                 76.6
        Custodial          27       5.7             5.7                 82.3
        Manager            84      17.7            17.7                100.0
        Total             474     100.0           100.0
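
The Percent and Cumulative Percent columns of a frequency table follow directly from the raw counts. The sketch below illustrates that arithmetic in plain Python rather than in SPSS; the function name `frequency_table` is an illustrative assumption, and the counts are the ones shown in the Employment Category table.

```python
def frequency_table(counts):
    """Return (category, frequency, percent, cumulative percent) rows."""
    total = sum(counts.values())
    rows = []
    running = 0
    for category, freq in counts.items():
        running += freq
        rows.append((category,
                     freq,
                     round(100 * freq / total, 1),      # Percent
                     round(100 * running / total, 1)))  # Cumulative Percent
    return rows

# Counts taken from the Employment Category table
counts = {"Clerical": 363, "Custodial": 27, "Manager": 84}
table = frequency_table(counts)
for row in table:
    print(row)
```

Because there are no missing values in this example, Percent and Valid Percent coincide.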

Exercise

• Using the data set contacts.sav, perform the following tasks:

• - build a frequency table for the variable size

• - build a frequency table and generate the most important statistical indicators for the variable sale

• ..\2.practice Datasets\contacts.sav


The Descriptives procedure is useful for
analyzing continuous variables. Just like
the Frequencies procedure, it computes
the main statistical metrics for our
variable.

Exercise

• Using the data set advert.sav, compute the main statistics for the variables advert and sales with the Descriptives procedure.



Descriptive Statistics

                     N     Range      Minimum   Maximum    Mean         Std. Deviation   Variance
Current Salary       474   $119,250   $15,750   $135,000   $34,419.57   $17,075.661      291578214.453
Beginning Salary     474   $70,980    $9,000    $79,980    $17,016.09   $7,870.638       61946944.959
Valid N (listwise)   474
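
Each column of a Descriptive Statistics table corresponds to a standard formula. As a rough illustration outside SPSS, the Python sketch below computes the same indicators with the standard library; the function name `describe` and the small data series are hypothetical, and the standard deviation and variance use the sample (n - 1) formulas.

```python
import statistics

def describe(values):
    """Compute the indicators shown in a Descriptive Statistics table."""
    return {
        "n": len(values),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.mean(values),
        "std_deviation": statistics.stdev(values),  # sample SD (n - 1)
        "variance": statistics.variance(values),    # sample variance (n - 1)
    }

# Hypothetical data series, for illustration only
scores = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(scores)
print(summary)
```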


The Explore procedure computes a list of statistical indicators for our variable. Moreover, it allows us to run the analysis not only for the entire population, but also for subgroups or strata of this population.




The Explore procedure often involves two types of variables:

• dependent continuous variables: the variables we want to perform the analysis for
• independent categorical variables (also called factors): used to define the subgroups or subcategories of the population

Exercise
• Open the data set contacts.sav and use the Explore procedure to compute the statistical indicators and the boxplot diagram for the variable sales.



Exercise
• Using the data set advert.sav, generate the main
statistical indicators for the variables advert and sales
with the help of the Means procedure.

The Crosstabs procedure is used to create cross tables, also called contingency tables. These tables are useful for examining the relationship between two categorical variables.

Basically, a crosstab contains the number of cases for every possible combination of the values of the two variables.
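
The counting behind a cross table can be sketched in a few lines of Python. This is only an illustration of the idea, not the SPSS procedure; the function name `crosstab` and the five cases are hypothetical.

```python
from collections import Counter

def crosstab(row_values, column_values):
    """Count the cases for every observed (row value, column value) pair."""
    return Counter(zip(row_values, column_values))

# Hypothetical cases, one entry per respondent
computer = ["Apple", "IBM", "IBM", "Apple", "IBM"]
color = ["black", "beige", "beige", "white", "black"]

table = crosstab(computer, color)
print(table[("IBM", "beige")])    # cases that are both IBM and beige
print(table[("Apple", "beige")])  # combinations never observed count as 0
```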

Research Question #2
What color do people prefer for their computer?
..\1. Training datasets\Frequencies.sav

Computer Owned * color Crosstabulation
Count
                                     color
Computer Owned       beige   black   gray   white    5   Total
Toshiba                  2       0      1       0    0       3
Apple                    3       2      0       2    3      10
IBM or Compatible       16      13      5       5   10      49
Other                    3       0      0       0    0       3
None                     2       2      1       2    1       8
Total                   26      17      7       9   14      73
Research Question #3
Is computer color preference different between genders?

Exercise
• Using the data set cereal.sav, create a cross
table to visualize the relationship between the
variables agecat and active. Next, examine the
data in your crosstab and tell whether you can
notice a relationship between these variables or
not.

SECTION 4

ASSUMPTION CHECKING. DATA TRANSFORMATIONS

• Checking for normality


• Detecting outliers
• Data transformations

Most statistical procedures require us to verify at least two assumptions:
• the normality assumption
• the assumption of absence of outliers




There are two categories of methods
used to check the normality assumption:
• Numerical methods
• Graphical methods




Numerical methods:
• Skewness and kurtosis indicators
• The Shapiro-Wilk normality test
Graphical methods:
• The histogram
• The Q-Q plot
..\1. Training datasets\insure.sav

$$z = \frac{\text{Skewness}}{\text{Std. error}} = \frac{0.362}{0.287} = 1.261$$

$$z = \frac{\text{Kurtosis}}{\text{Std. error}} = \frac{-0.286}{0.566} = -0.505$$

We treat the variable as normally distributed at the 99% confidence level if both standard scores (the z for skewness and the z for kurtosis) lie in the interval (-2.58, 2.58).
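
The numerical check for skewness and kurtosis is a simple division followed by a comparison with the critical value. The Python sketch below reproduces it with the figures from the example; the function names `z_score` and `within_bounds` are illustrative assumptions.

```python
def z_score(statistic, std_error):
    """Standard score of a skewness or kurtosis statistic."""
    return statistic / std_error

def within_bounds(z_values, critical=2.58):
    """True if every z lies strictly inside (-critical, critical)."""
    return all(abs(z) < critical for z in z_values)

# Skewness, kurtosis and their standard errors from the example
z_skewness = z_score(0.362, 0.287)
z_kurtosis = z_score(-0.286, 0.566)
looks_normal = within_bounds([z_skewness, z_kurtosis])

print(round(z_skewness, 3), round(z_kurtosis, 3), looks_normal)
```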

The hypotheses of the Shapiro-Wilk test:

H0: the variable is normally distributed

H1: the variable is not normally distributed

We retain the null hypothesis if Sig. > 0.05.

We retain the null hypothesis because Sig. > 0.05.


Graphical methods:
• The histogram
• The Q-Q plot

The Q-Q plot for a positively skewed distribution

The Q-Q plot for a negatively skewed distribution

The Q-Q plot for a leptokurtic distribution

The Q-Q plot for a platykurtic distribution


What to do if the normality assumption is violated?
1. The first solution consists of applying a transformation to the variable so that, hopefully, the new variable is normally distributed.

2. The second solution consists of running a nonparametric test if our variables are not normal (for example, the Spearman correlation or the Mann-Whitney test).

3. The third solution: run the analysis regardless. Many parametric tests are fairly robust to deviations from normality, especially when the samples are large.

4. The fourth solution consists of performing a so-called "sensitivity analysis". In other words, you would run two tests: a parametric test and a nonparametric one. If both tests lead to the same conclusion, the normality violations are not an issue. If the conclusions are totally different, the deviations from normality seem to be a problem.


Exercise
• Open the data set creditpromo.sav and check the normality
for the variable dollars, using both numerical and graphical
methods.

Outliers, or extreme values, can endanger the analysis because they directly affect the mean and the standard deviation. That is why we should detect, and in some cases remove, the outliers before running tests that rely on these statistics.

There are two methods for identifying the outliers:

1. A numerical method, based on the standardized values
2. A graphical method, based on the boxplot chart
..\1. Training datasets\insure.sav
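
The numerical method can be sketched in Python as follows. This is a minimal illustration, not the SPSS procedure: the function names, the data series and the cutoff of 3 standard deviations are all assumptions (the slides do not state a cutoff, and other conventions exist).

```python
import statistics

def standardized_values(data):
    """Convert each value to a z score using the sample mean and SD."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in data]

def flag_outliers(data, threshold=3.0):
    """Return the values whose absolute z score exceeds the threshold."""
    z_values = standardized_values(data)
    return [x for x, z in zip(data, z_values) if abs(z) > threshold]

# Hypothetical data series with one extreme value
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]
outliers = flag_outliers(values)
print(outliers)
```

Note that in very small samples no value can reach a large z score, so the graphical (boxplot) method is a useful complement.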

Example 2: checking for outliers
• ..\1. Training datasets\Employee data.sav

How to manage the outliers?
There are three kinds of outliers, depending on their source:

• Data entry errors, due to lack of attention, negligence, tiredness, etc.
• Measurement or data collection errors, due either to human mistakes or to equipment malfunction.
• Real non-typical, unusual values in your population. These are the so-called genuine outliers.

How to manage the outliers? (continued)

There are two basic solutions for dealing with genuine outliers:

• Remove the outliers from the data series

• Keep the extreme values in the data series

If we decide to keep the outliers, we have four possible routes to choose from:

1. Run a nonparametric test, because these tests are less sensitive to outliers.

2. Replace the outliers with values closer to the rest of the data. Let's suppose that our data series looks like this:

2.7 2.2 5.9 3.4 3.0 3.7 2.8

3. Run the parametric test regardless, being aware of the possible effects of the outliers.

4. Perform a so-called sensitivity analysis: run both the parametric and the nonparametric test. If the results are similar, we can conclude that the outliers do not affect our findings.


Exercise
• Open the data set creditpromo.sav and check for the
extreme values in the variable dollars, using both the
numerical and the graphical method.


Data transformations are mostly used for
converting a non-normal variable into a
new variable which is approximately
normally distributed.

Drawbacks of data transformation:

1. It is not always successful.

2. It can be very difficult to interpret the output of the analysis if we work with transformed variables.

3. If our variable is not normal in some subgroups of the population and normal in other subgroups, we must transform it for all the subgroups.

Before doing a transformation, we must build a histogram for our variable, in order to see:

• How it is skewed (positively or negatively)

• How much it is skewed (moderately or strongly)

Types of transformations
For moderately positively skewed data you can try to apply a square root transformation:

$$y = \sqrt{x}$$

For strongly positively skewed variables, you can choose between the following two transformations:

$$y = \log_{10} x \quad \text{or} \quad y = \frac{1}{x}$$
..\1. Training datasets\Employee data.sav

Types of transformations
For moderately negatively skewed data you can try to apply the following transformation:

$$y = \sqrt{x_{max} + 1 - x}$$

For strongly negatively skewed variables, you can choose between the following two transformations:

$$y = \log_{10}(x_{max} + 1 - x) \quad \text{or} \quad y = \frac{1}{x_{max} + 1 - x}$$
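
The transformations for positively and negatively skewed data can be sketched together in Python. The function name `transform`, its parameters and the sample values are illustrative assumptions; for negative skew the values are first reflected as x_max + 1 - x, then the same square root, logarithm or reciprocal is applied.

```python
import math

def transform(values, kind, negative_skew=False):
    """Apply a square root, log10 or reciprocal transformation.

    For negatively skewed data the values are first reflected
    as x_max + 1 - x, so the smallest reflected value is 1.
    """
    if negative_skew:
        x_max = max(values)
        values = [x_max + 1 - x for x in values]
    functions = {
        "sqrt": math.sqrt,           # moderate skew
        "log10": math.log10,         # strong skew
        "reciprocal": lambda x: 1 / x,  # strong skew
    }
    return [functions[kind](x) for x in values]

# Hypothetical positive values, for illustration only
data = [1, 4, 9, 100]
print(transform(data, "sqrt"))
print(transform(data, "log10"))
print(transform(data, "sqrt", negative_skew=True))
```

The square root and logarithm require non-negative and strictly positive inputs respectively, which is one reason a transformation is not always possible without first shifting the data.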

Exercise
• Open the data set insurance_claims.sav and check the normality assumption for the variable claim_amount using the Shapiro-Wilk test. You will notice that this variable is not normally distributed.

• Next, apply all the possible transformations to this variable and check the normality assumption again for each of the new, transformed variables.



SECTION 5

ONE-SAMPLE TESTS
• One-sample t test
• Binomial test
• Chi square test for goodness-of-fit

• One-sample t test

The one-sample t test compares the mean score of our sample with a known value, usually the population mean. The sample mean is the observed average, while the population mean is the expected average.
This test is useful when we want to know whether our sample comes from a particular population.

The null and alternative hypotheses of the one-sample t test:

H0: there is no significant difference between the sample mean and the population mean

H1: there is a significant difference between the sample mean and the population mean

We will reject the null hypothesis if Sig. < 0.05.


Assumptions:

1. Our variable of interest is continuous.

2. The sample units are randomly drawn from the population.

3. Our variable is normally distributed.

4. There are no significant outliers in our data.


What to report:
• The t test value
• The degrees of freedom
• The p value (Sig. column)
• The mean difference between the
sample mean and the population mean
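
Three of the four reported quantities can be computed by hand from the sample. The Python sketch below shows the arithmetic; the function name `one_sample_t`, the data and the test value of 50 are hypothetical. The p value is left out because it requires the t distribution, which SPSS supplies in the Sig. column.

```python
import math
import statistics

def one_sample_t(sample, population_mean):
    """Return (t value, degrees of freedom, mean difference)."""
    n = len(sample)
    mean_difference = statistics.mean(sample) - population_mean
    std_error = statistics.stdev(sample) / math.sqrt(n)  # SD uses n - 1
    return mean_difference / std_error, n - 1, mean_difference

# Hypothetical sample and test value, for illustration only
sample = [52, 48, 55, 51, 49, 53, 50, 54]
t_value, df, mean_difference = one_sample_t(sample, 50)
print(round(t_value, 3), df, mean_difference)
```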

Example
One Sample t-test.sav

Procedure
Menu selection: Analyze > Compare Means > One-Sample T Test
Exercise

• Using the data set customer_dbase.sav, perform a one-sample t test to verify whether the average household income (variable income) is higher than 50.

• Please do not forget to check the one-sample t test assumptions first.

The binomial test is used when our variable of interest is dichotomous. It compares an observed distribution with a theoretical one and tells us whether there are significant differences between the two.

The observed distribution is generally the sample distribution, while the theoretical distribution is the population distribution.



The hypotheses of the binomial test:

H0: there are no significant differences between the sample distribution and the population distribution

H1: there are significant differences between the sample distribution and the population distribution

We will reject the null hypothesis if Sig. < 0.05.


Assumptions:
1. Our variable is dichotomous.
2. The two categories are mutually exclusive (i.e. each case belongs to one group and only one).
..\1. Training datasets\cereal.sav




The male percentage in the total U.S.
population is 47.9%; so the female
percentage is 52.1%.
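
To make the idea of comparing an observed split with a theoretical proportion concrete, here is a minimal Python sketch of an exact binomial test. It is an illustration under stated assumptions, not the SPSS implementation: the function names and the counts (8 cases in one category out of 10) are hypothetical, and the two-sided p value is defined here as the total probability of all outcomes that are no more likely than the observed one, which is one common definition. The significance reported by SPSS may be defined differently.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k cases in the first category out of n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_test_two_sided(k, n, p):
    """Exact two-sided p value: sum of outcomes no more likely than k."""
    observed = binomial_pmf(k, n, p)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    total = sum(prob
                for prob in (binomial_pmf(i, n, p) for i in range(n + 1))
                if prob <= observed * tolerance)
    return min(1.0, total)

# Hypothetical sample: 8 of 10 cases in one category, test proportion 0.5
p_value = binomial_test_two_sided(8, 10, 0.5)
print(p_value)
```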

Exercise
• In the data set customer_dbase.sav, check whether the male/female proportion (variable gender) in the total population is 50/50. (Hint: use the binomial test.)

The chi square test for goodness-of-fit is
used when our variable is nominal with
at least three categories. It compares an
observed distribution with a theoretical
distribution.

The hypotheses of the chi square test for goodness-of-fit:

H0: there is no significant difference between the observed and the theoretical distribution

H1: there is a significant difference between the observed and the theoretical distribution

We will reject the null hypothesis if Sig. < 0.05.


Assumptions:
• Our variable is nominal with three or more categories.

• The categories are mutually exclusive: each sample case belongs to one category and only one.

• For the theoretical distribution, the number of cases in each category must be at least five.


Chi-square goodness of fit
A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical variable differ from hypothesized proportions. For example, let's suppose a magazine distributor is interested in determining readers' preferences among four different national magazines (A, B, C, D). A survey was taken in which 700 persons were randomly chosen and asked to choose their favourite from among the four magazines. The following results were obtained:
megazin prefrence ( Goodness-of fit-test) chi-square.sav

Magazine Preference   Number of Readers (Observed Frequency)
A                     184
B                     167
C                     200
D                     149
Total                 700

So can we conclude that magazine C is preferred over the other magazines?

Menu selection: Analyze > Nonparametric Tests > Chi-Square > magazine preference > OK
Chi-square goodness of fit

Magazine Preferences Of Readers
             Observed N   Expected N   Residual
Magazine A          184        175.0        9.0
Magazine B          167        175.0       -8.0
Magazine C          200        175.0       25.0
Magazine D          149        175.0      -26.0
Total               700

Test Statistics
              Magazine preferences of readers
Chi-Square    8.263a
df            3
Asymp. Sig.   .041
a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 175.0.

Since Asymp. Sig. (.041) is less than .05, we reject the null hypothesis and conclude that readers' preferences differ among the four magazines (chi-square with three degrees of freedom = 8.263, p = 0.041). Magazine C is the most preferred in this sample: it has the largest positive residual (25.0).
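
The chi-square statistic and its significance for the magazine example can be checked by hand. The Python sketch below is an illustration, not the SPSS computation itself: the function names are assumptions, and the p value function uses a closed-form expression for the upper tail of the chi-square distribution that is valid only for 3 degrees of freedom.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p_value_df3(x):
    """Upper-tail probability of the chi-square distribution, 3 df only."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

# Observed counts for magazines A, B, C, D from the survey
observed = [184, 167, 200, 149]
# Equal expected frequencies: 700 / 4 = 175 per magazine
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_square_statistic(observed, expected)
p_value = chi_square_p_value_df3(statistic)
print(round(statistic, 3), round(p_value, 3))
```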
Exercise
• In the database insurance_claims.sav, do a chi
square test for goodness-of-fit using the
variable townsize. (In the theoretical
distribution all frequencies will be equal.)

SECTION 6

ASSOCIATION TESTS
• Pearson correlation
• Spearman correlation
• Partial correlation

The Pearson correlation coefficient (r) measures the strength and direction of the linear relationship between two continuous variables. It can take values in the [-1, 1] range.


The null and alternative hypotheses for r:

H0: r = 0 (the correlation coefficient is equal to zero in the total population)

H1: r ≠ 0 (the correlation coefficient is not equal to zero in the total population)

We will reject the null hypothesis if Sig. < 0.05.


Values of Pearson's correlation coefficient: interpretation

Value of r (in absolute value)   Strength of correlation
Up to 0.30                       Small (weak correlation)
Between 0.30 and 0.70            Medium (moderate correlation)
Over 0.70                        Large (strong correlation)


Sign of Pearson's correlation coefficient: interpretation

• r > 0: direct or positive correlation

• r < 0: inverse or negative correlation


Assumptions:
1. The two variables are continuous.

2. The variable values are pairs (there are two values, two measurements for each sample case).

3. The relationship between the variables is approximately linear. The Pearson correlation coefficient is sometimes called the linear correlation coefficient.

4. The variables are approximately normally distributed.

5. There are no significant outliers among the data.


The coefficient of determination

$$r^2 = 0.807^2 = 0.651$$

The variation in the subjects' height statistically explains 65% of the variation in the subjects' weight.
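
For readers who want to see where r and the coefficient of determination come from, the following Python sketch computes both from paired values. The function name `pearson_r` and the five data pairs are hypothetical; only the value 0.807 comes from the example above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two paired series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2  # coefficient of determination
print(round(r, 4), round(r_squared, 3))

# The value from the example: r = 0.807
print(round(0.807 ** 2, 3))
```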



Example
..\1. Training datasets\insure.sav

• Exercise #1
• Using the data set advert.sav, perform a Pearson correlation to determine if there is a relationship between the variables advert and sales.
• Exercise #2
• Using the data set advert.sav, perform a Pearson correlation to determine if there is a relationship between sales volume (variable sales) and age of store location (variable ageloc).
• (Note: you should check the assumptions before running the correlation procedure.)

The Spearman correlation (ρ) measures the
relationship between two ordinal variables,
or between an ordinal and a continuous
variable. We can also use it when our
variables are continuous, but they fail to
meet some conditions (they are not normally
distributed or their relationship is not linear).

The null and alternative hypotheses for the Spearman correlation:
H0: ρ = 0 (the correlation coefficient is equal to zero in the total population)

H1: ρ ≠ 0 (the correlation coefficient is not equal to zero in the total population)

We will reject the null hypothesis if Sig. < 0.05.


Values of Spearman's correlation coefficient: interpretation

Value of ρ (in absolute value)   Strength of correlation
Up to 0.30                       Small (weak correlation)
Between 0.30 and 0.70            Medium (moderate correlation)
Over 0.70                        Large (strong correlation)


Assumptions:

1. Both variables are ordinal and/or continuous (we cannot use Spearman if one of the variables is nominal).

2. The variable values are pairs (we have two measurements for every sample case, one for each variable).

3. The relationship between the variables is monotonic (increasing or decreasing).

..\1. Training datasets\Employee data.sav
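
The Spearman coefficient can be obtained by replacing the values of each variable with their ranks and then computing the Pearson coefficient on the ranks. The Python sketch below illustrates this; the function names and the data are hypothetical, and tied values receive the average of their ranks.

```python
import math

def average_ranks(values):
    """Rank the values from 1 upward; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation coefficient of two paired series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical paired observations, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
rho = spearman_rho(x, y)
print(round(rho, 4))
```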


Correlations (Spearman's rho)

                                                      Educational Level (years)   Employment Category
Educational Level (years)   Correlation Coefficient   1.000                       .484**
                            Sig. (2-tailed)           .                           .000
                            N                         474                         474
Employment Category         Correlation Coefficient   .484**                      1.000
                            Sig. (2-tailed)           .000                        .
                            N                         474                         474

**. Correlation is significant at the 0.01 level (2-tailed).


The coefficient of determination

$$\rho^2 = 0.688^2 = 0.473$$

The variation in the education level statistically explains 47% of the variation in the current salary.


Exercise
• Using the data set advert.sav, perform a
Spearman correlation to determine if there is
an association between the variables creddebt
and othdebt.

The null and alternative hypotheses for r:

H0: r = 0 (the correlation coefficient is equal to zero in the total population)

H1: r ≠ 0 (the correlation coefficient is not equal to zero in the total population)

We will reject the null hypothesis if Sig. < 0.05.


The partial correlation is the correlation between two continuous variables, controlled for a set of external variables called "controlling variables".
We use this correlation when we want to remove the effect of the controlling variables.



$r_{ij.k}$: the partial correlation between the variables i and j, controlled for the variable k

Depending on the number of controlling variables we can have:

- first order partial correlation (one controlling variable)

- second order partial correlation (two controlling variables)

- third order partial correlation (three controlling variables), and so on.

The simple Pearson correlation (without controlling variables) is also called zero order correlation.
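
A first order partial correlation can be computed from the three zero order correlations with the standard formula $r_{ij.k} = (r_{ij} - r_{ik} r_{jk}) / \sqrt{(1 - r_{ik}^2)(1 - r_{jk}^2)}$. The Python sketch below applies it; the function name and the correlation values are hypothetical and chosen only to show how a sizeable bivariate correlation can shrink once the controlling variable is taken into account.

```python
import math

def partial_correlation(r_ij, r_ik, r_jk):
    """First order partial correlation of i and j, controlling for k."""
    numerator = r_ij - r_ik * r_jk
    denominator = math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))
    return numerator / denominator

# Hypothetical zero order correlations, for illustration only
r_ij = 0.50  # correlation between the two main variables
r_ik = 0.60  # correlation of i with the controlling variable
r_jk = 0.80  # correlation of j with the controlling variable

r_partial = partial_correlation(r_ij, r_ik, r_jk)
print(round(r_partial, 4))
```

When the controlling variable is uncorrelated with both main variables, the formula returns the bivariate correlation unchanged.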




There are four possible scenarios
of relationship between the partial
correlation and the Pearson
bivariate correlation.



Scenario 1
rij.k = rij
The controlling variable has no
effect on the relationship between
the main variables.

Scenario 2
$|r_{ij}| > 0$, $r_{ij.k} = 0$
The correlation between the main variables i and j is spurious: it disappears once the controlling variable is taken into account.


Scenario 3
$r_{ij} > r_{ij.k} > 0$

There is a correlation between the main variables, but that correlation is not as high as it appears (the controlling variable makes it seem higher).


Scenario 4
$|r_{ij.k}| > |r_{ij}|$
The correlation between the main variables is stronger than it seems (the controlling variables hide a real correlation).


Example
• ..\1. Training datasets\heart-attacks-icecream.sav

Correlations

                                  icecream   attacks
icecream   Pearson Correlation    1          .870**
           Sig. (2-tailed)                   .000
           N                      24         24
attacks    Pearson Correlation    .870**     1
           Sig. (2-tailed)        .000
           N                      24         24
**. Correlation is significant at the 0.01 level (2-tailed).

The result is statistically significant, with r = 0.87 and p < 0.05. But logically there may not be a direct relationship between heart attacks and ice cream, so there may be other factors that affect this relationship. Let us try a partial correlation using temperature as a controlling variable.

The result is not statistically significant, with r = 0.182 and p = .405, which is greater than 0.05. So it is temperature that created a spurious relationship between ice cream and heart attacks in the previous bivariate correlation.


Exercise

• In the data set property_assess.sav, perform a partial correlation between the variables saleval and lastval, using time as a controlling variable.


Exercise

• Using the data set insurance_claims.sav, perform a

chi square test for association to check whether there

is a relationship between the variables gender and

claim_type.

Dr. Abenet Yohannes,JJU


JIGJIGA UNIVERSITY
SCHOOL OF GRADUATE STUDIES
COLLEGE OF BUSINESS AND ECONOMICS
DEPARTMENT OF ACCOUNTING AND FINANCE
Advanced Research Methods For Accounting And Finance

DATA ANALYSIS USING


STATISTICAL PACKAGE FOR SOCIAL SCIENCE

Dr. Abenet Yohannes (Ph.D.)

Sunday, 21 November 2021 5:01 PM


Dr. Abenet Yohannes,JJU
SECTION 7

TESTS FOR MEAN DIFFERENCE


7.1. Independent samples t test
7.2. Paired samples t test
7.3. One-way ANOVA
7.4. Two-way ANOVA
7.5. Three-way ANOVA
7.6. Multivariate ANOVA
7.7. Analysis of covariance



The independent samples t test is used to find out whether a difference exists between the means of two independent groups on a continuous variable. More precisely, this test lets us determine whether the difference between the means is statistically significant.
Dr. Abenet Yohannes,JJU
The null and alternative hypotheses of the independent samples t test:
H0: there is no difference between the means of the two groups in the total population
H1: there is a significant difference between the means of the two groups in the total population
We will reject the null hypothesis if the p value is lower than 0.05.
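
As a rough illustration of what the test computes, the following Python sketch derives the t value and the degrees of freedom for the equal-variances version of the test. The function name `independent_t` and the two small samples are hypothetical; the p value is left to SPSS, because the Python standard library has no t distribution.

```python
import math
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Return (t, df) for the equal-variances independent samples t test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(group_a) - mean(group_b)) / std_error
    return t, df

# Hypothetical scores for two independent groups.
t_value, df = independent_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_value, 3), df)
```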

Dr. Abenet Yohannes,JJU


Assumptions:

The independent variable is categorical (nominal or ordinal). If this variable
has three or more categories, we must decide, before running the test,
which categories we want to compare.

1. The dependent variable is continuous.

2. There is independence of observations; in other words, there is no connection between the observations (measurements) in the two groups.

3. The dependent variable is normally distributed in both groups.

4. The dependent variable has no significant outliers in either group.

5. The variances of the dependent variable in the two groups are equal
(in other words, we have homogeneity of variances).
Dr. Abenet Yohannes,JJU

The null and alternative hypotheses of the Levene test:

H0: the variances of the groups are not significantly different in the total population
H1: the variances of the groups are significantly different in the total population
We will reject the null hypothesis if the p value is lower than 0.05.
Dr. Abenet Yohannes,JJU
What to report:
• the t test value
• the degrees of freedom
• the p value

Example:
• ..\1. Training datasets\spanish-course.sav

Dr. Abenet Yohannes,JJU


Exercise
• Using the information in the data set testmarket.sav, perform an independent samples t test to determine whether there is a significant difference in average sales volume (variable sales) for different market sizes (variable mktsize).
Dr. Abenet Yohannes,JJU

The paired samples t test is useful when we want to determine whether the mean difference between two variables, measured on the same subjects at two different moments, is statistically significant.

In order to run the paired samples t test, we must have two paired measurements for each subject.

Dr. Abenet Yohannes,JJU


The null and alternative hypotheses of the paired samples t test:

H0: the mean difference between the variable scores is equal to zero in the total population

H1: the mean difference between the variable scores is different from zero in the total population

We will reject the null hypothesis if the p value is lower than 0.05.
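
The paired samples t test reduces to a one-sample t test on the differences between the two measurements. The Python sketch below shows that calculation; the function name `paired_t` and the five pairs of scores are hypothetical, and the p value and confidence interval are left to SPSS.

```python
import math
from statistics import mean, stdev

def paired_t(first, second):
    """Return (t, df, mean difference) for the paired samples t test."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = mean(diffs)
    # Standard error of the mean difference.
    std_error = stdev(diffs) / math.sqrt(n)
    return mean_diff / std_error, n - 1, mean_diff

# Hypothetical scores of the same five subjects at two moments.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 15]
t_value, df, mean_diff = paired_t(before, after)
print(round(t_value, 3), df, mean_diff)
```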

Dr. Abenet Yohannes,JJU


Assumptions:

The dependent variable is continuous, and it is measured twice on
the same sample of subjects (so in our database we'll have two
variables, corresponding to those measurements).

1. The differences between the scores of the variables are normally distributed.

2. The differences between the scores of the variables do not present significant outliers.

Dr. Abenet Yohannes,JJU



What to report:
• the test value
• the degrees of freedom
• the p value
• the mean difference
• the confidence interval of the difference.

Example:
• ..\1. Training datasets\math-test.sav

Dr. Abenet Yohannes,JJU


Menu selection: Analyze > Compare Means > Paired-Samples T Test

Dr. Abenet Yohannes,JJU


Exercise
Using the data set dietstudy.sav, run a paired samples t test to determine whether there is a significant difference between the average initial weight and the average final weight of the subjects (variables wgt0 and wgt4, respectively).
Dr. Abenet Yohannes,JJU
The one-way analysis of variance (or ANOVA) can help us determine whether there are significant differences between the means of three or more groups, for a continuous variable.

In order to use this test, we must have two variables:

1. a categorical independent variable with three or more categories

2. a continuous dependent variable.


Dr. Abenet Yohannes,JJU
The null and alternative hypotheses of the one-way ANOVA:

H0: the group means are equal in the total population
H1: at least one group mean is different in the total population
We will reject the null hypothesis if the p value of the Fisher (F) test is lower than 0.05.
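
The F value compares the variability between the group means with the variability inside the groups. The Python sketch below shows that ratio; the function name `one_way_f` and the three small groups are hypothetical, and the p value is left to SPSS.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical scores for three groups.
f_value, df_between, df_within = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_value, 3), df_between, df_within)
```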

Dr. Abenet Yohannes,JJU


Assumptions:


The independent variable is categorical, with three or more categories.

1. The dependent variable is continuous.

2. There is independence of observations. In other words, there is no relationship between the members of the groups.

3. The dependent variable is normally distributed in all groups.

4. The dependent variable does not present significant outliers in any group.

5. The dependent variable has equal variances in all groups. If this condition
is not met, we have to use a robust version of the F test, called the Welch
test.

Example: new vitamin test

1) The employees in the first group will receive a placebo (this is the control group).

2) The employees in the second group will take the vitamin in low dose.

3) The employees in the third group will take the vitamin in high dose.

The effort resistance is measured on a continuous scale from 1 to 30.



Example:

..\1. Training datasets\vitamin-oneway.sav

Dr. Abenet Yohannes,JJU


Assumption check

Dr. Abenet Yohannes,JJU


Tests of Normality

                                     Kolmogorov-Smirnov(a)       Shapiro-Wilk
                    Dose of vitamin  Statistic  df   Sig.    Statistic  df   Sig.
Effort resistance   Placebo          .054       120  .200*   .992       120  .748
                    Low dose         .039       135  .200*   .995       135  .896
                    High dose        .066       100  .200*   .990       100  .674
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Dr. Abenet Yohannes,JJU


The null and alternative hypotheses of the Levene test:

H0: the variances of the groups are equal in the total population

H1: the variances of the groups are significantly different in the total population

We will reject the null hypothesis if the p value is lower than 0.05.

Dr. Abenet Yohannes,JJU


If we have n groups in our factor, the number of possible paired comparisons is n(n-1)/2.

In our case we have three groups, so the number of paired comparisons is 3:

• “placebo” vs. “low dose”

• “placebo” vs. “high dose”

• “low dose” vs. “high dose”
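
The count n(n-1)/2 can be checked by listing the pairs explicitly. This short Python sketch uses the three dose groups from the example; it only enumerates the comparisons and does not run the post hoc tests themselves.

```python
from itertools import combinations

groups = ["placebo", "low dose", "high dose"]
pairs = list(combinations(groups, 2))

n = len(groups)
expected = n * (n - 1) // 2  # formula from the slide

for first, second in pairs:
    print(f"{first} vs. {second}")
print(len(pairs), expected)
```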

Dr. Abenet Yohannes,JJU


Multiple comparisons (summary table)

Comparison             Mean Difference   Sig.    Confidence Interval (95%)
Low dose - Placebo     1.0089            0.049   0.004 – 2.014
High dose - Placebo    4.3522            0.000   3.268 – 5.437
High dose - Low dose   3.3433            0.000   2.287 – 4.400

Dr. Abenet Yohannes,JJU


Levene test. H0: the variances of the groups are equal in the total population.

Since the p value is greater than 0.05, we fail to reject the H0 of the Levene test, so assumption # 5 is satisfied.

ANOVA. H0: the group means are equal in the total population. Since the p value is less than 0.05, H0 is rejected.

Dr. Abenet Yohannes,JJU


Exercise
• Using the data set property_assess.sav, run a one-way ANOVA to study the relationship between sale value of a house (variable saleval) and township (variable town).

Dr. Abenet Yohannes,JJU


The two-way analysis of variance is used to measure the combined influence of two factors on a dependent variable. The factors (independent variables) are categorical, while the dependent variable is continuous.

Dr. Abenet Yohannes,JJU


Example: new vitamin test

• the employees in the first group will receive a placebo (this is the control group)

• the employees in the second group will take the vitamin in low dose

• the employees in the third group will take the vitamin in high dose.

The effort resistance is measured on a continuous scale from 1 to 30.

Moreover, we have information about each subject’s gender (male or female).

We are interested to know if there is a combined influence of the two factors, dose and gender, on the effort resistance. In other words, we want to detect the interaction effect of these variables.
Dr. Abenet Yohannes,JJU
Groups to compare:

Gender/Dose   Placebo   Low dose   High dose
Male          Group 1   Group 2    Group 3
Female        Group 4   Group 5    Group 6

Dr. Abenet Yohannes,JJU


The two way ANOVA studies two types of effects:

• the main effects, meaning the separate influence
of each factor. We have two main effects here:
the dose effect and the gender effect.

• the interaction effect, meaning the combined action of the two factors. We have only one interaction effect in this model: dose*gender.

Dr. Abenet Yohannes,JJU


The analysis always starts with the study of the interaction effect. Depending on the result, the analysis will continue in one of the following ways:

1) if the interaction effect is not significant, we analyze the main effects, if they are interesting for our research (if not, we can simply stop the analysis).

2) if the interaction effect is statistically significant, we recommend the analysis of the simple main effects.

Dr. Abenet Yohannes,JJU


The null and alternative hypotheses of the two-way ANOVA (for the main effects):

H0: the group means for the factor A are equal in the total population
H1: the group means for the factor A are different in the total population
We will reject the null hypothesis if the p value of the Fisher (F) test is lower than 0.05.
Dr. Abenet Yohannes,JJU
The null and alternative hypotheses of the two-way ANOVA (for the interaction effect):

H0: the sum of all the interaction effects between the factors A and B is equal to zero

H1: the sum of all the interaction effects between the factors A and B is different from zero

We will reject the null hypothesis if the p value of the Fisher (F) test is lower than 0.05.

Dr. Abenet Yohannes,JJU


Assumptions:

1) The two independent variables are categorical, each having at least two categories.
2) The dependent variable is continuous.
3) There is independence of observations; in other words, there is no relationship between the subjects in our groups.
4) The dependent variable is normally distributed in all groups.
5) The dependent variable does not present significant outliers in any group.
6) The dependent variable has equal variances in all groups (there is homogeneity of variances).

Dr. Abenet Yohannes,JJU


The null and alternative hypotheses of the Levene test:
H0: the variances of the groups are equal in the total population

H1: the variances of the groups are significantly different in the total population

We will reject the null hypothesis if the p value is lower than 0.05.

Dr. Abenet Yohannes,JJU


Tests of Normality(a)

                    Kolmogorov-Smirnov(b)      Shapiro-Wilk
                    Statistic  df  Sig.    Statistic  df  Sig.
Effort resistance   .065       90  .200*   .991       90  .808
*. This is a lower bound of the true significance.
a. Dose of vitamin = Placebo, Subject's gender = Male
b. Lilliefors Significance Correction

After you have checked all the assumptions, remove the split file and run the two-way ANOVA.

Dr. Abenet Yohannes,JJU


Levene's Test of Equality of Error Variances(a)
Dependent Variable: Effort resistance

F      df1   df2   Sig.
.372   5     534   .868

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept + dose + gender + dose * gender

Dr. Abenet Yohannes,JJU


The simple main effects represent the influences of one factor at each level of the other factor. In other words, we keep a factor constant and make the other factor vary.

Dr. Abenet Yohannes,JJU


The simple main effects for the factor “dose” represent the effect of dose at every level of gender. We must compute two sets of differences here:

• the differences between the average effort resistance for the “placebo”, “low dose” and “high dose” levels, for male subjects

• the differences between the average effort resistance for the “placebo”, “low dose” and “high dose” levels, for female subjects
Dr. Abenet Yohannes,JJU
The simple main effects for the factor “gender” represent the effect of gender at every level of the dose factor. We must compute three differences here:

1) the difference between the average male and female effort resistance, at the “placebo” level

2) the difference between the average male and female effort resistance, at the “low dose” level

3) the difference between the average male and female effort resistance, at the “high dose” level

Dr. Abenet Yohannes,JJU


Final conclusions:

• overall, the vitamin does increase the effort resistance

• a high dose is significantly more effective than a low dose

• at the same dose, the vitamin has a stronger effect on male than on female employees.
Dr. Abenet Yohannes,JJU
Exercise
Using the data set carpet.sav, run a two-way ANOVA to check whether the customer preferences (variable pref) depend on package design (package) and brand name (brand).

Dr. Abenet Yohannes,JJU


We use the three-way analysis of variance to measure the influence of three independent variables on a dependent variable. The independent variables (also called factors) are categorical, while the dependent variable is continuous.

Dr. Abenet Yohannes,JJU



If the factors have k, l and q levels, respectively,
the total number of groups to compare is k*l*q.
The F test will tell us if the group means for the
dependent variable are equal or different.

Dr. Abenet Yohannes,JJU



The three way ANOVA studies two types of effects:

• the main effects, meaning the separate influence of each factor.
• the interaction effects, meaning the combined action of the factors.

Dr. Abenet Yohannes,JJU



There are seven effects in a three way ANOVA:

• three main effects: A, B and C (the separate factor effects)
• three second order interaction effects: A*B, A*C and B*C
• one third order interaction effect: A*B*C.

Dr. Abenet Yohannes,JJU



If we have n factors, the total number of effects is 2^n - 1.
This number grows exponentially:
• 4 factors mean 15 effects
• 5 factors mean 31 effects
• 6 factors mean 63 effects etc.
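
The count of effects can be verified by listing every non-empty combination of factors. The Python sketch below does this; the function name `anova_effects` is a hypothetical helper used only for illustration.

```python
from itertools import combinations

def anova_effects(factors):
    """List all main and interaction effects for the given factors."""
    effects = []
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            effects.append("*".join(combo))
    return effects

effects = anova_effects(["A", "B", "C"])
print(effects)       # ['A', 'B', 'C', 'A*B', 'A*C', 'B*C', 'A*B*C']
print(len(effects))  # 7
```

Each factor is either in an effect or not, which gives 2^n subsets; removing the empty subset leaves 2^n - 1 effects.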

Dr. Abenet Yohannes,JJU


The null and alternative hypotheses of the three-way ANOVA (for the main effects):

H0: the group means for the factor A are equal in the total population

H1: the group means for the factor A are different in the total population

We will reject the null hypothesis if the p value of the Fisher (F) test is lower than 0.05.
Dr. Abenet Yohannes,JJU
The null and alternative hypotheses of the three-way ANOVA (for the second order interaction effects):
H0: the sum of all the interaction effects between factors A and B is equal to zero
H1: the sum of all the interaction effects between factors A and B is different from zero
We will reject the null hypothesis if the p value of the Fisher (F) test is lower than 0.05.

Dr. Abenet Yohannes,JJU



The null and alternative hypotheses of the three-way ANOVA (for the third order interaction effect):
H0: the sum of all the interaction effects between factors A, B and C is equal to zero
H1: the sum of all the interaction effects between factors A, B and C is different from zero
We will reject the null hypothesis if the p value of the Fisher (F) test is lower than 0.05.

Dr. Abenet Yohannes,JJU



The analysis always starts with the study of the highest order interaction effect (the third order interaction effect). If this effect is statistically significant, we continue the analysis as follows:

• we study the simple second order interaction effects – the interaction effects of two factors at each level of the third factor

• if some of the simple second order interaction effects are significant, we go on and examine the simple main effects

• if at least one simple main effect is significant, we can compute and interpret the simple comparisons between various factor levels.
Dr. Abenet Yohannes,JJU

If the third order interaction effect is not significant, we continue the analysis this way:

• we inspect the second order interaction effects. If some of them are significant, we can compute the simple main effects
• if none of the second order interaction effects is significant, we can either finish the analysis, or examine the main effects (if they hold interest).
Dr. Abenet Yohannes,JJU
Assumptions:

The three independent variables are categorical, each having at least two
categories.

1. The dependent variable is continuous.

2. There is independence of observations; in other words, there is no relationship between the subjects in our groups.

3. The dependent variable is normally distributed in all groups.

4. The dependent variable does not present significant outliers in any group.

5. The dependent variable has equal variances in all groups (there is homogeneity of variances).

Dr. Abenet Yohannes,JJU


Example: new vitamin test

• the employees in the first group will receive a placebo (this is the control group)

• the employees in the second group will take the vitamin in low dose

• the employees in the third group will take the vitamin in high dose.

The effort resistance is measured on a continuous scale from 1 to 30.

Moreover, for each employee we have information about:

• gender (male or female)

• type (blue collar or white collar)

We are interested to know if there is a combined influence of the three factors, dose, gender and type of employee, on the effort resistance. In other words, we want to detect the interaction effect of these variables.

Dr. Abenet Yohannes,JJU


Dependent variable:
• effort resistance
Factors:
• dose of vitamin, with 3 levels (placebo, low, high)
• gender, with 2 levels (male, female)
• type of employee, with 2 levels (blue collar, white collar)

The number of groups in our model is 12 (3*2*2).

Dr. Abenet Yohannes,JJU


The null and alternative hypotheses of Bartlett’s test:
H0: the variances of the groups are equal in the total population

H1: the variances of the groups are significantly different in the total population

We will reject the null hypothesis if the p value is lower than 0.05.
Dr. Abenet Yohannes,JJU

The simple second order interaction effects represent the
interaction effects of two factors for each level of the
third factor. In our case, that means:

• the interaction of dose and gender for each type of employee (blue collar and white collar)
• the interaction of gender and type of employee for
each dose level (placebo, low dose and high dose)
• the interaction of dose and type of employee for each
gender (male and female).

Dr. Abenet Yohannes,JJU


We are going to compute two separate interaction effects:

• dose * type for male employees
• dose * type for female employees

Dr. Abenet Yohannes,JJU


The following second order interaction effects are statistically significant:

• dose * type for male employees
• dose * type for female employees

Dr. Abenet Yohannes,JJU



The simple main effects of any factor represent the influences of that factor when the levels of the other factors remain unchanged.

Dr. Abenet Yohannes,JJU



For the factor dose (placebo, low dose, high dose) we
have 4 simple main effects, as follows:

• the dose effect on the blue collar, male employees


• the dose effect on the blue collar, female employees
• the dose effect on the white collar, male employees
• the dose effect on the white collar, female employees

Dr. Abenet Yohannes,JJU



For the type of employee (blue or white collar) factor we have 6
simple main effects:

• the type effect on the male employees who took a placebo


• the type effect on the male employees who took a low dose
• the type effect on the male employees who took a high dose
• the type effect on the female employees who took a placebo
• the type effect on the female employees who took a low dose
• the type effect on the female employees who took a high dose
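
The numbers of simple main effects listed above (4 for dose, 6 for type) follow from combining the levels of the other two factors. A minimal Python sketch, in which the dictionary `levels` and the helper `simple_main_effect_cells` are hypothetical names used for illustration:

```python
from itertools import product

levels = {
    "dose": ["placebo", "low dose", "high dose"],
    "gender": ["male", "female"],
    "type": ["blue collar", "white collar"],
}

def simple_main_effect_cells(factor):
    """Combinations of the other factors' levels at which `factor` is tested."""
    others = [name for name in levels if name != factor]
    combos = product(*(levels[name] for name in others))
    return [dict(zip(others, combo)) for combo in combos]

for factor in levels:
    cells = simple_main_effect_cells(factor)
    print(factor, len(cells))
```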

Dr. Abenet Yohannes,JJU



The simple comparisons between the dose
levels represent the average differences
between the three groups of the factor dose (for
the dependent variable), for every combination
of the other two variables.

Dr. Abenet Yohannes,JJU



Contrasts for the factor dose

Contrast   Difference             Placebo   Low dose   High dose
1          Low dose – Placebo     -1        1          0
2          High dose – Placebo    -1        0          1
3          High dose – Low dose   0         -1         1
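
A contrast is applied by multiplying each group mean by its coefficient and adding the results. The Python sketch below uses the coefficients from the table; the three group means are hypothetical values chosen for illustration, not results from the vitamin data set.

```python
# Contrast coefficients in the order: placebo, low dose, high dose.
contrasts = {
    "Low dose - Placebo": (-1, 1, 0),
    "High dose - Placebo": (-1, 0, 1),
    "High dose - Low dose": (0, -1, 1),
}

# Hypothetical group means (placebo, low dose, high dose).
group_means = (10.0, 12.5, 16.0)

def contrast_value(coefficients, means):
    """Weighted sum of the group means."""
    return sum(c * m for c, m in zip(coefficients, means))

differences = {name: contrast_value(c, group_means) for name, c in contrasts.items()}
for name, value in differences.items():
    print(name, value)
```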

Dr. Abenet Yohannes,JJU



In total, we have 12 simple comparisons – 3
contrasts multiplied by 4 groups:

• blue collar, male employees


• blue collar, female employees
• white collar, male employees
• white collar, female employees
Dr. Abenet Yohannes,JJU

Simple comparisons for the male employees, blue collar

Contrast   Comparison (coefficients)         Difference   t value   Sig.
1          Low dose – Placebo (-1 1 0)       7.03         18.852    0.000
2          High dose – Placebo (-1 0 1)      9.56         25.639    0.000
3          High dose – Low dose (0 -1 1)     2.53         6.787     0.000

Dr. Abenet Yohannes,JJU



Simple comparisons for the male employees, white collar

Contrast   Comparison (coefficients)         Difference   t value   Sig.
1          Low dose – Placebo (-1 1 0)       3.08         8.272     0.000
2          High dose – Placebo (-1 0 1)      6.70         17.969    0.000
3          High dose – Low dose (0 -1 1)     3.61         9.697     0.000

Dr. Abenet Yohannes,JJU


Find out whether there is a difference in effort resistance between the male and female employees who do manual labor and took a placebo.

That means doing a simple comparison for the gender factor when the levels of the other factors are “placebo” and “blue collar”.
Dr. Abenet Yohannes,JJU
Exercise
• Using the database carpet.sav, perform a three-way
ANOVA to check whether the customer preferences
(variable pref) depend on package design (package),
brand name (brand) and the existence of a money-back
guarantee (money).

Dr. Abenet Yohannes,JJU


The multivariate analysis of variance (MANOVA) is a generalization of the univariate analysis of variance that we studied in the previous guides. It is used when we have two or more dependent variables in our model.

The multivariate ANOVA tests the differences between the factor groups for a new variable, called the “composite variable”. This composite variable is a linear combination of the dependent variables.

Dr. Abenet Yohannes,JJU



The null and alternative hypotheses of one-way MANOVA:

H0: the group means (for the composite variable) are equal in the total population

H1: at least one group mean is different in the total population

We will reject the null hypothesis if the p value of the F test is lower than 0.05.

Dr. Abenet Yohannes,JJU


Assumptions:

The independent variable is categorical.

1. The dependent variables are continuous.



2. There is independence of observations, i.e. there is no relationship between the group members.

3. The number of cases in each group is greater than the number of dependent variables.

4. The dependent variables are normally distributed in all factor groups.

5. The relationships between the dependent variables are approximately linear in all groups.

6. There is no significant multicollinearity – the dependent variables are not strongly correlated
with each other.

7. The dependent variables do not present significant outliers in any group.

8. There are no significant multivariate outliers.

9. There is homogeneity of variances – the group variances are equal for each dependent variable.

10. There is homogeneity of variance-covariance matrices.

Dr. Abenet Yohannes,JJU


Example: new vitamin test

• the employees in the first group will receive a placebo (the
control group)
• the employees in the second group will take the vitamin in
low dose
• the employees in the third group will take the vitamin in
high dose.

The two dependent variables, effort resistance and stress resistance, are measured on a continuous scale from 1 to 30.

Dr. Abenet Yohannes,JJU




Critical values for the Mahalanobis distance

Number of dependent variables   Critical value
2                               13.82
3                               16.27
4                               18.47
5                               20.52
6                               22.46
7                               24.32
8                               26.13
9                               27.88
10                              29.59
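
These critical values appear to be chi-square quantiles at a significance level of 0.001, with degrees of freedom equal to the number of dependent variables. The Python sketch below checks this for the even rows of the table, where the chi-square upper-tail probability has a simple closed form. The function name `chi2_sf_even_df` is a hypothetical helper; a statistics library would normally be used instead.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (even df only)."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half = x / 2
    # For df = 2m: P(X > x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms

# Table entries for 2, 4, 6 and 8 dependent variables.
table = {2: 13.82, 4: 18.47, 6: 22.46, 8: 26.13}
tail_probs = {df: chi2_sf_even_df(value, df) for df, value in table.items()}
for df, p in tail_probs.items():
    print(df, round(p, 4))
```

A case whose Mahalanobis distance exceeds the critical value for the relevant number of dependent variables would be flagged as a multivariate outlier.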

Dr. Abenet Yohannes,JJU


Exercise
• Using the data set car_sales.sav, perform a
one-way multivariate ANOVA with the
variables sales and resale as dependent
variables and the variable type (vehicle type)
as factor.

Dr. Abenet Yohannes,JJU


The analysis of covariance (ANCOVA) studies the relationship between a continuous dependent variable and one or more categorical independent variables, controlled for one or more continuous variables called covariates.

ANCOVA helps us control for the effect of the covariates on the dependent variable.
Dr. Abenet Yohannes,JJU

Example: new vitamin test

• the employees in the first group received a placebo (this is the control group)
• the employees in the second group took the vitamin in low dose
• the employees in the third group took the vitamin in high dose.

We want to measure the vitamin effect on the employees’ effort resistance, controlled for the employees’ age (the covariate).

Dr. Abenet Yohannes,JJU


Assumptions:

1. The independent variable is categorical with three or more groups.
2. The dependent variable is continuous.
3. The covariate is continuous.
4. The residual values of the dependent variable are normally distributed.
5. The dependent variable does not present significant outliers.
6. The relationship between the dependent variable and the covariate is
approximately linear in all factor groups.
7. There is no correlation between the independent variable (factor) and
the covariate. This is the most important condition of all.
8. There is homogeneity of variances – the variances of the dependent
variable are equal in all factor groups.
9. There is homoscedasticity – the variance of the residuals is equal for all
the predicted values of the dependent variable.
Dr. Abenet Yohannes,JJU
Initial and adjusted group means

Group       Initial mean   Adjusted mean
Placebo     12.09          13.21
Low dose    13.10          13.43
High dose   16.44          14.66
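
An adjusted mean is the group mean corrected for how far the group's covariate mean lies from the overall covariate mean. The Python sketch below shows the usual adjustment formula. All numbers in it (the slope and the age means) are hypothetical, so it does not reproduce the table above; the function name `adjusted_mean` is also an assumption.

```python
def adjusted_mean(group_mean, slope, covariate_group_mean, covariate_grand_mean):
    """Group mean adjusted to the grand mean of the covariate.

    slope is the pooled within-group regression slope of the
    dependent variable on the covariate.
    """
    return group_mean - slope * (covariate_group_mean - covariate_grand_mean)

# Hypothetical inputs: a group that is younger than average (30 vs. 40 years),
# with effort resistance falling by 0.2 points per year of age.
adjusted = adjusted_mean(
    group_mean=16.0,
    slope=-0.2,
    covariate_group_mean=30.0,
    covariate_grand_mean=40.0,
)
print(adjusted)
```

In this sketch the group's raw mean is lowered after adjustment, because part of its advantage is attributed to the younger age of its members.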

Dr. Abenet Yohannes,JJU


Exercise
• Using the data set car_sales.sav, perform an analysis of
covariance with the following variables:

• - sales as a dependent variable

• - type as a factor

• - price as a covariate.

Dr. Abenet Yohannes,JJU



The repeated measures analysis of variance
(ANOVA) is used when a dependent continuous
variable was measured three or more times on
the same sample of subjects, and we want to
know whether the means of these
measurements are different in the total
population.

Dr. Abenet Yohannes,JJU


Example: the effect of a diet on the subjects’ weight
The weight is measured on the same sample of subjects, at three different moments:
• before the diet
• during the diet
• after the diet.

Dr. Abenet Yohannes,JJU



The null and alternative hypotheses of the repeated measures ANOVA are expressed as follows:

H0: the means of the dependent variables are equal in the total population

H1: at least one mean differs from the others in the total population

We will reject the null hypothesis if the p value of the F test is lower than 0.05.
Dr. Abenet Yohannes,JJU
Assumptions:
1. The dependent variable is continuous, and it is measured
three or more times on the same subjects.
2. The dependent variables are normally distributed.
3. The dependent variables do not present significant
outliers.
4. The variances of the differences between measurements
are equal (the sphericity assumption).
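
The sphericity assumption can be explored informally by computing the variance of the difference scores for each pair of measurements. The Python sketch below does this for hypothetical weights of four subjects; it is a descriptive check only, not the formal test of sphericity, and the names `measurements` and `difference_variances` are assumptions.

```python
from itertools import combinations
from statistics import variance

# Hypothetical weights of the same four subjects at three moments.
measurements = {
    "before": [80, 75, 90, 85],
    "during": [78, 74, 87, 83],
    "after": [75, 72, 85, 80],
}

def difference_variances(data):
    """Variance of the difference scores for each pair of measurements."""
    result = {}
    for first, second in combinations(data, 2):
        diffs = [a - b for a, b in zip(data[first], data[second])]
        result[(first, second)] = variance(diffs)
    return result

diff_vars = difference_variances(measurements)
for pair, value in diff_vars.items():
    print(pair, round(value, 3))
```

Under sphericity these variances should be roughly equal in the population; large discrepancies suggest the assumption may be violated.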

Dr. Abenet Yohannes,JJU


Exercise
• Open the data set dietstudy.sav and perform
a repeated-measures analysis of variance to
study the mean difference between the
following variables that represent weight
measurements at different moments in time:
wgt0, wgt1, wgt2, wgt3, wgt4.

Dr. Abenet Yohannes,JJU



The within-within-subjects analysis of variance is
used when a variable is measured three or more
times on the same sample of subjects, and each
measurement is repeated two or more times,
under different circumstances.

Dr. Abenet Yohannes,JJU



Example: the effect of a diet on the subjects’ weight

The weight is measured in three different periods:

• before the diet


• during the diet
• after the diet.

After a pause, the diet starts again, this time combined with physical
exercises. The weights are measured again, at the same moments:

• before the diet


• during the diet
• after the diet.

Dr. Abenet Yohannes,JJU



We have two factors of influence on the variable weight:

1. time, with three levels (beginning, middle, end)
2. physical exercises, with two levels (with and without physical exercises)

The total number of measurements is the product of the numbers of levels of the two factors (3*2=6).

Dr. Abenet Yohannes,JJU


The analysis follows the same steps as in the two-way ANOVA:

• if there is a significant interaction effect, we are going to study the simple main effects of the factors
• if the interaction effect is not significant, we will examine the main effect for each factor.

Dr. Abenet Yohannes,JJU


Assumptions:

1. The dependent variables are approximately normally distributed.

2. The dependent variables do not present significant outliers.

3. The variances of the score differences are equal for each factor (the sphericity assumption). This assumption only makes sense for the factors that have at least three levels.
Dr. Abenet Yohannes,JJU

The simple main effects for the exercise factor consist of the
differences between the subjects’ weights with and without
exercise, at the same moment of the dieting period (beginning,
middle, end). So we have to compute three differences:

• weight_beg_ex – weight_beg

• weight_mid_ex – weight_mid

• weight_end_ex – weight_end
Dr. Abenet Yohannes,JJU

The simple main effects for the moment factor consist of the differences between
the subjects' weights at the three moments of the dieting period, for the same
level of the exercise factor. So for the level 1 (without exercise), the differences to
compute will be:
• weight_mid – weight_beg
• weight_end – weight_mid
• weight_end – weight_beg
For the level 2 (with exercise) the differences will be:
• weight_mid_ex – weight_beg_ex
• weight_end_ex – weight_mid_ex
• weight_end_ex – weight_beg_ex

Dr. Abenet Yohannes,JJU


Exercise
• Open the data set dietstudy.sav and perform a mixed
analysis of variance with the following variables:

• - wgt0, wgt1, wgt2, wgt3, wgt4, that represent weight measurements at different moments in time (so the within-subjects factor is time)

• - gender as the between-subjects factor.

Dr. Abenet Yohannes,JJU



The mixed analysis of variance is a combination of the within-subjects ANOVA and the between-subjects ANOVA. It is used when a dependent variable is measured two or more times on the same sample of subjects, and the sample is divided into two or more independent groups.

Dr. Abenet Yohannes,JJU



So we have two factors in the mixed ANOVA:
• a within-subjects factor and
• a between-subjects factor.

The purpose of the analysis is to determine whether there is an interaction effect between the factors.
Dr. Abenet Yohannes,JJU

• If the interaction is significant, we will
compute the simple main effects for each
factor.
• If the interaction is not significant, we will
report and interpret the main effects.

Dr. Abenet Yohannes,JJU



Example: the effect of a diet on the subjects’ weight

The weight is measured at three different moments:

• at the beginning of the diet
• in the middle of the diet
• at the end of the diet.

Our sample is divided into two independent groups, men and women.




• gender is the between-subjects factor, with
two levels: male and female.
• time is the within-subjects factor, with three
levels: beginning, middle and end of the diet
period.




Assumptions:

1. The dependent variables are normally distributed in every factor group.


2. The dependent variables do not present significant outliers in any group.
3. The variances of the dependent variables are equal in all groups (there is
homogeneity of variances).
4. The covariances of the dependent variables are equal in all groups.
5. There is sphericity – for the within-subjects factor, the variances of the
differences between the dependent variable scores are equal. This
assumption only makes sense if the within-subjects factor has three or
more levels.




Possible solutions if the covariances are not equal:

• Drop the mixed ANOVA and run separate within-subjects ANOVAs for each
gender group (female and male).
• Ignore the interaction effect and analyze only the main
effect of the within-subjects factor (moment).
• Run the mixed ANOVA regardless and report the violation
of this assumption.




The simple main effects for the between-
subjects factor (gender) consist of the average
weight differences between male and female
subjects, at each moment of the dieting period -
beginning, middle, end (in other words, the
effect of gender at every moment of the diet).




The simple main effects for the within-subjects
factor (moment) consist of the average weight
differences between the three moments of the
diet (beginning, middle, end) separately for each
gender category (male and female).



Exercise
• Open the data set dietstudy.sav and perform a mixed
analysis of variance with the following variables:

• - wgt0, wgt1, wgt2, wgt3, wgt4, that represent weight measurements at
different moments in time (so the within-subjects factor is time)

• - gender as the between-subjects factor.



The Mann-Whitney test (U) is used to evaluate the difference between two
independent groups for continuous and ordinal variables. It is a possible
alternative to the independent samples t test, when one or both variables
of interest are ordinal.



Assumptions:
1. The independent variable is nominal and dichotomous (if it is
multinomial, we have to choose the two groups we want to compare).

2. The dependent variables are continuous or ordinal.

3. There is independence of observations.

4. The distributions of scores in the two groups have the same shapes. If
this assumption is met, we will use the test results to evaluate the
difference between the group medians. If it is not met, if the
distributions have different shapes, we will report and interpret the
difference between the mean ranks.
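
The quantities the U test works with (ranks, rank sums, mean ranks and U) can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the SPSS procedure: the function names and the two small samples are hypothetical, and no p-value is computed.

```python
def rank_with_ties(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # average of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Compute U and the mean rank of each group from the pooled ranking."""
    ranks = rank_with_ties(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    rank_sum_b = sum(ranks[n_a:])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = rank_sum_b - n_b * (n_b + 1) / 2
    return {
        "U": min(u_a, u_b),
        "mean_rank_a": rank_sum_a / n_a,
        "mean_rank_b": rank_sum_b / n_b,
    }


# Hypothetical initial weights for two independent groups (not from dietstudy.sav)
men = [82, 90, 77, 95, 88]
women = [61, 70, 77, 66, 73]
result = mann_whitney_u(men, women)
print(result)
```

When the two distributions have different shapes, the mean ranks returned here are the quantities that would be reported and compared.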




The null and alternative hypotheses of the U
test if the distributions have the same shape:
H0: there is no significant difference between
the medians of the two groups
H1: there is a significant difference between
the medians of the two groups




The null and alternative hypotheses of the U test
if the distributions do not have the same shape:

H0: there is no significant difference between the mean ranks of the two
distributions

H1: there is a significant difference between the mean ranks of the two
distributions


Exercise
• In the database dietstudy.sav, use the Mann-Whitney test to determine if
there is a significant difference (on average) between genders, with respect
to the initial weight (variable wgt0).


The Wilcoxon test is used when we want to
evaluate the difference between the medians of
two variables measured on the same sample of
subjects, in two different moments. This test is
an alternative to the paired sample t test.




The null and alternative hypotheses of the Wilcoxon test:

H0: the median of the differences between the variable scores is equal to
zero in the total population
H1: the median of the differences between the variable scores is not equal
to zero in the total population


Assumptions:

1. The dependent variable is continuous or ordinal, measured twice on the same
sample of subjects.

2. The distribution of the differences between the variable scores is
approximately symmetric. If this assumption is not met, we can use the
sign test.


Exercise
• Using the data set dietstudy.sav, perform a Wilcoxon test to
determine whether there is a significant difference between the
variables final weight (wgt4) and initial weight (wgt0).

• Then perform a sign test to the same effect.



The Kruskal-Wallis test is an alternative to the one-way analysis of
variance, when the variables of interest are ordinal or they are
continuous, but violate some important ANOVA assumptions. The independent
variable (factor) must have three or more groups.




The null and alternative hypotheses of the Kruskal-
Wallis test:

H0: the medians of the factor groups for the dependent variable are equal

H1: at least one group has a different median for the dependent variable


Assumptions:
1. The dependent variable is continuous or ordinal.

2. The independent variable is categorical, having three or more disjoint
groups.

3. The distribution shapes are approximately similar in all groups. If this
condition is not satisfied, we can use the median test instead of
Kruskal-Wallis.



The null and alternative hypotheses of the median test:

H0: the medians of the factor groups for the dependent variable are equal

H1: at least one group has a different median for the dependent variable


Exercise
• Using the data set carpet.sav, run a Kruskal-Wallis test to see
whether there is a significant difference between the customers'
preferences (pref) in the three groups of the variable package
design (package).

• Next, perform a median test to the same effect.


The Friedman test is employed when we have to evaluate the differences between the
medians of three or more variables measured on the same sample of subjects, in
different moments. So, in our database we will have three or more dependent
variables, each variable corresponding to one measurement.

The Friedman test can be considered an extension of the Wilcoxon test, or an
alternative to the repeated measures analysis of variance.



The null and alternative hypotheses of the Friedman test:

H0: the medians of the dependent variables are equal in the total
population

H1: at least one median is different from the others in the total
population


Exercise
• Using the data set dietstudy.sav, run a Friedman test to check whether
there are significant differences, on average, between the variables wgt0,
wgt1, wgt2, wgt3 and wgt4.


The McNemar test compares the distributions of
two related groups when the dependent
variable is binomial (dichotomous). So this test
can be considered a particular case of the
Wilcoxon test.



Assumptions:
1. The independent variable (factor) has two levels
(usually, before and after the experimental treatment).

2. The dependent variable has two levels. Usually, it is a
success/failure variable, for example: the subject quit/did not quit
smoking, was cured/was not cured of an ulcer, passed/did not pass the
exam etc.



The null and alternative hypotheses of the McNemar
test:
H0: the distributions of the dependent
variable are the same on both levels of
the independent variable

H1: the distributions of the dependent variable are different on the two
levels of the independent variable

A college management board is worried about the high percentage of
students who fail the Math exams – over 25%.
The managers are studying the opportunity of
implementing a special Math preparation
program to reduce the failure rate.



The number of students who passed and did not pass the Math exam, before
and after the program (sample size: 100 students)

                                        AFTER THE PROGRAM
BEFORE THE PROGRAM         Passed the exam    Did not pass the exam
Passed the exam                   71                    2
Did not pass the exam             10                   17


                                        AFTER THE PROGRAM
BEFORE THE PROGRAM         Passed the exam    Did not pass the exam
Passed the exam                   A                     B
Did not pass the exam             C                     D

The McNemar statistic formula:

$$\chi^2 = \frac{(|B - C| - 1)^2}{B + C}$$
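
As a rough check of the formula, the statistic can be computed for the Math exam table (B = 2, C = 10). The sketch below assumes the continuity-corrected form of the statistic and a chi-square reference distribution with 1 degree of freedom; the function names are hypothetical and the output may not match what SPSS prints for this table.

```python
import math


def mcnemar_statistic(b, c):
    """Continuity-corrected McNemar chi-square from the two discordant cells."""
    return (abs(b - c) - 1) ** 2 / (b + c)


def chi2_upper_tail_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# Discordant cells from the example table:
# B = passed before / failed after, C = failed before / passed after
b, c = 2, 10
chi2 = mcnemar_statistic(b, c)
p_value = chi2_upper_tail_1df(chi2)
print(round(chi2, 3), round(p_value, 3))
```

Only the cells B and C enter the statistic, because the students whose result did not change (A and D) carry no information about the effect of the program.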



Exercise



SECTION 8

PREDICTIVE TECHNIQUES
• Simple regression
• Multiple regression
• Multiple regression with dummy variables
• Sequential (hierarchical) regression
• Binomial regression
• Multinomial regression
• Ordinal regression



The simple linear regression studies the relationship between two
continuous variables, in order to predict the values of the dependent
variable based on the values of the independent variable.



The simple linear regression equation writes as follows:

𝒚 = 𝒃𝟎 + 𝒃𝟏 𝒙 + 𝜺
Where y is the dependent variable, x is the
independent variable, b1 is the regression
coefficient, and b0 is the constant (intercept).



$$y = b_0 + b_1 x + \varepsilon$$

The epsilon term ($\varepsilon$) is called the residual value or, more
simply, the error. This variable captures all the influences that are not
explained by the independent variable x.

We can rewrite this equation as follows:

$$\varepsilon = y - (b_0 + b_1 x)$$

or

$$\varepsilon = y - \hat{y}$$

where $\hat{y} = b_0 + b_1 x$ is the value predicted by the regression
equation.



The regression analysis has two types of goals:
1. predicting the values of the response variable
for different values of the independent variable.
2. determining whether the variations of the
independent variable can explain the variations
in the dependent variable, and to what extent.



Assumptions:

1. Both variables (dependent and independent) are continuous.

2. The relationship between the variables is approximately linear.

3. There are no significant outliers in the data series.

4. The errors are independent (there is no relationship between the
residual variable and the independent variable).

5. The dependent variable has the same variance for all the values of the
independent variable (there is homoscedasticity).

6. The residual variable is approximately normally distributed.


Example ..\1. Training datasets\simple-regression.sav

If this value is between 1.5 and 2.5, there is no relationship between the
residual variable and the independent variable, and assumption 4 is
satisfied.

H0: b1 = 0. Since the sig. value is below 0.05, the null hypothesis is
rejected.


The null hypothesis that the residuals are normally distributed is not
rejected, because the sig. value is 0.668, which is greater than 0.05.
Therefore assumption 6 (the residual variable is approximately normally
distributed) is satisfied.

score = −7.857 + 0.148 ∗ 𝑖𝑞

For example:

−7.857 + 0.148 ∗ 100 = 6.94
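
The prediction step can be sketched in Python, using the rounded coefficients shown above. The function names and the observed score of 8.0 are hypothetical and serve only to illustrate how a predicted value and a residual are obtained.

```python
def predict_score(iq, intercept=-7.857, slope=0.148):
    """Predicted score from the fitted simple regression equation."""
    return intercept + slope * iq


def residual(observed_score, iq):
    """Residual = observed value minus predicted value."""
    return observed_score - predict_score(iq)


predicted = predict_score(100)
error = residual(8.0, 100)  # hypothetical student with IQ 100 who scored 8.0
print(round(predicted, 2), round(error, 2))
```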



Exercise
• In the data set car_sales.sav, run a
simple regression to determine whether
the price of a vehicle (variable price) is
influenced by the engine size (variable
engine_s).



The multiple linear regression studies the relationship between a
dependent variable and two or more independent variables.
..\1. Training datasets\multiple-regression.sav


The multiple linear regression equation writes as
follows:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + \varepsilon$$

where y is the dependent variable, the x's are the independent variables,
the b's are the regression coefficients, and b0 is the constant
(intercept).

Assumptions:
1. All the variables are continuous.
2. The relationships between the dependent variable and the independent
variables are linear, both for each independent variable and globally.

3. There are no significant outliers in the data series.

4. There is independence of errors (there is no relationship between the
independent variables and the residual variable).

5. The dependent variable has the same variance for all the values of the
independent variables (the assumption of homoscedasticity).

6. The residual variable is approximately normally distributed.

7. The independent variables are not strongly correlated with one another
(we don’t have important multicollinearity).



The results indicate that the overall
model is statistically significant
(F = 2.136408, p < 0.05).
Furthermore, all of the predictor
variables are statistically significant


score = −17.395 + 0.156 ∗ 𝑖𝑞 + 0.419 ∗ ℎ𝑜𝑢𝑟𝑠

For example:

−17.395 + 0.156 ∗ 100 + 0.419 ∗ 22 = 7.42
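
The same calculation can be written for any number of predictors. The sketch below assumes the rounded coefficients from the equation above; the dictionary layout and the function name are hypothetical choices made for illustration.

```python
INTERCEPT = -17.395
COEFFICIENTS = {"iq": 0.156, "hours": 0.419}  # rounded values from the equation above


def predict(values, intercept=INTERCEPT, coefficients=COEFFICIENTS):
    """Intercept plus the sum of coefficient * predictor value."""
    return intercept + sum(coefficients[name] * values[name] for name in coefficients)


predicted = predict({"iq": 100, "hours": 22})
print(round(predicted, 3))
```

Keeping the coefficients in a dictionary means that adding a predictor only requires one more entry, not a change to the function.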



Exercise
• In the data set car_sales.sav, run a multiple
regression to determine whether the price of a
vehicle (variable price) is influenced by the
engine size (variable engine_s), the power
(variable horsepow) and the fuel capacity
(variable fuel_cap).




score = −15.983 + 0.147 ∗ 𝑖𝑞 + 0.399 ∗ ℎ𝑜𝑢𝑟𝑠 − 0.286 ∗ 𝑔𝑒𝑛𝑑𝑒𝑟

The coefficient of the dummy variable (gender) represents the mean
difference between the scores obtained by the two categories, male and
female, when the other variables remain unchanged.



Example: two students, a boy and a girl, who have the
same intelligence quotient (110) and studied the same
number of hours (21).

Regression equation for the boy:


−15.983 + 0.147 ∗ 110 + 0.399 ∗ 21 − 0.286 ∗ 1 = 8.32

Regression equation for the girl:


−15.983 + 0.147 ∗ 110 + 0.399 ∗ 21 − 0.286 ∗ 0 = 8.60
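
A short sketch makes the role of the dummy variable visible: with iq and hours held fixed, the two predictions differ by exactly the gender coefficient. The coding 1 = boy, 0 = girl is the one implied by the two equations above; the function name is hypothetical. Because the coefficients are rounded, the individual predictions may differ from the slide values in the second decimal.

```python
def predict_score(iq, hours, gender):
    """Fitted equation with a dummy variable; gender: 1 = boy, 0 = girl."""
    return -15.983 + 0.147 * iq + 0.399 * hours - 0.286 * gender


boy = predict_score(110, 21, gender=1)
girl = predict_score(110, 21, gender=0)
difference = boy - girl  # equals the coefficient of the dummy variable
print(round(boy, 2), round(girl, 2), round(difference, 3))
```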



Exercise
• In the data set car_sales.sav, run a multiple
regression to determine whether the price of a
vehicle (variable price) is influenced by the
engine size (variable engine_s), the power
(variable horsepow), the fuel capacity (variable
fuel_cap) and the type of vehicle (variable type).

The sequential or hierarchical regression is a particular type of multiple
regression. It allows us to enter the independent variables sequentially,
by blocks or sets of variables, in any order we choose.

The main goal of the sequential regression analysis is to measure the
exact contribution of each independent variable (or set of variables) to
the variance of the response variable.

The sequential regression procedure generates a number of regression
equations equal to the number of variable blocks entered by the analyst.

For each separate equation, the program will compute the change in the
coefficient of determination (R square), as well as a significance test
for this change.

Block 1: 𝑠𝑐𝑜𝑟𝑒 = 𝑏0 + 𝑏1 ∗ 𝑖𝑞 + 𝜀

Block 2: 𝑠𝑐𝑜𝑟𝑒 = 𝑏0 + 𝑏1 ∗ 𝑖𝑞 + 𝑏2 ∗ ℎ𝑜𝑢𝑟𝑠 + 𝜀

Block 3: 𝑠𝑐𝑜𝑟𝑒 = 𝑏0 + 𝑏1 ∗ 𝑖𝑞 + 𝑏2 ∗ ℎ𝑜𝑢𝑟𝑠 + 𝑏3 ∗ 𝑔𝑒𝑛𝑑𝑒𝑟 + 𝜀

Block 1: 𝑠𝑐𝑜𝑟𝑒 = −7.857 + 0.148 ∗ 𝑖𝑞 + 𝜀

Block 2: 𝑠𝑐𝑜𝑟𝑒 = −17.395 + 0.156 ∗ 𝑖𝑞 + 0.419 ∗ ℎ𝑜𝑢𝑟𝑠 + 𝜀

Block 3: 𝑠𝑐𝑜𝑟𝑒 = −15.983 + 0.147 ∗ 𝑖𝑞 + 0.399 ∗ ℎ𝑜𝑢𝑟𝑠 − 0.286 ∗ 𝑔𝑒𝑛𝑑𝑒𝑟 + 𝜀


Exercise
• In the data set car_sales.sav, run a sequential regression to
determine whether the price of a vehicle (variable price) is
influenced by the following variables:

• - Engine Size (variable engine_s) and power (variable horsepow)


– block 1

• - fuel capacity (variable fuel_cap) – block 2

• - type of vehicle (variable type) – block 3.



The Binomial Logistic Regression
..\1. Training datasets\binomial-logistic-regression.sav



The binomial logistic regression is a predictive technique which is used
when the dependent variable is dichotomous, and the independent variables
are continuous, ordinal or nominal.

The logistic regression allows us to determine the probability for a
given case to be in a particular category of the dependent variable.
For example:

• estimate the probability that a person buys a life insurance depending
on influencing variables like age, gender, income, marital status, health
status etc.

• predict the probability that a person contracts a heart disease based
on variables like age, gender, weight, whether they smoke, whether they
drink alcohol, whether they drink coffee etc.


p – the probability of success, i.e. the probability that the predicted
event happens (for example, the probability to buy a health insurance, or
to contract a heart disease).

1-p – the probability of failure, i.e. the probability that the event does
not happen.

The ratio p/(1-p) represents the odds that the predicted event happens.


The equation of the logistic regression:

$$\ln\frac{p}{1-p} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$


$$\frac{p}{1-p} = e^{b_0} e^{b_1 x_1} \cdots e^{b_k x_k}$$
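
The multiplicative form explains why exp(b) is read as an odds ratio: raising one predictor by one unit multiplies the odds by exp(b) for that predictor. A minimal sketch, in which the coefficients and predictor values are hypothetical and not taken from any data set in this course:

```python
import math


def odds(intercept, coefficients, values):
    """Odds p/(1-p) computed as e^b0 * e^(b1*x1) * ... * e^(bk*xk)."""
    result = math.exp(intercept)
    for b, x in zip(coefficients, values):
        result *= math.exp(b * x)
    return result


# Hypothetical coefficients and predictor values, for illustration only
b0 = -1.0
b = [0.5, 0.02]
x = [1, 30]

odds_before = odds(b0, b, x)
odds_after = odds(b0, b, [x[0] + 1, x[1]])  # first predictor raised by one unit
odds_ratio = odds_after / odds_before       # equals exp(b[0])
print(round(odds_before, 3), round(odds_ratio, 3), round(math.exp(b[0]), 3))
```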



Assumptions:
1. The response variable is dichotomous, with disjoint categories.

2. There are no significant outliers in our data series.

3. There is no important multicollinearity (the independent variables are
not strongly correlated with each other).

How to interpret the coefficients exp(b) – the antilogarithms – of the
categorical independent variables:
• If the value of exp(b) is higher than 1 for a certain category, the subjects in
that category have higher odds of success than the subjects in the
reference category
• If the value of exp(b) is lower than 1 for a certain category, the subjects in
that category have lower odds of success than the subjects in the
reference category
• If the value of exp(b) is equal to 1 for a certain category, the subjects in
that category have the same odds of success as the subjects in the
reference category



Exp(b) for the variable where: 2.950

• the odds that a person who surfs at the office has mobile Internet are
2.95 times the odds for a person who surfs at home

• in other words, the odds that a person who surfs at the office has
mobile Internet are 195% higher than for a person who surfs at home


The probability of an event to happen (p), depending on the odds (o):

$$p = \frac{o}{1 + o}$$


So, the probability that a person who surfs at the office uses mobile
Internet is about 75%:

$$p = \frac{2.950}{1 + 2.950} = 0.747$$

For a person who surfs at home, the probability is about 25% (1-0.75).
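
The conversion from odds to probability is a one-line function. The sketch below follows the slide in using exp(b) = 2.950 directly as the odds o, which is an assumption of this worked example; the function and variable names are hypothetical.

```python
def probability_from_odds(odds):
    """Convert odds o into a probability p = o / (1 + o)."""
    return odds / (1 + odds)


p_office = probability_from_odds(2.950)  # exp(b) used as the odds, as on the slide
p_home = 1 - p_office
print(round(p_office, 2), round(p_home, 2))
```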



How to interpret the coefficients exp(b) – the antilogarithms – of the
continuous independent variables:
1) If the value of exp(b) is higher than 1, the subjects who present higher
values of the variable have greater odds of success than the subjects who
present lower values
2) If the value of exp(b) is lower than 1, the subjects who present higher
values of the variable have lower odds of success than the subjects who
present lower values
3) If the value of exp(b) is equal to 1, the subjects who present higher values
of the variable have the same odds of success as the subjects who present
lower values (in other words, that independent variable does not have a
significant influence on the response variable).
For the continuous variables, the values of reference are the lower values.
Exp(b) for the variable income: 1.004

• the chance that a person with a higher income has mobile Internet is
only 0.4% higher than for a person with a lower income

Exp(b) for the variable hours: 1.691

• the chance that a person who surfs many hours a day uses mobile
Internet is 69% higher than for a person who surfs fewer hours a day


The probabilities of success for the variable income:
• for high income subjects:

$$p = \frac{1.004}{1 + 1.004} = 0.501 = 50.1\%$$

• for low income subjects: 49.9% (1-0.501)


The probabilities of success for the variable hours:
• for subjects who surf many hours weekly:

$$p = \frac{1.691}{1 + 1.691} = 0.63 = 63\%$$

• for subjects who surf few hours weekly: 37% (1-0.63)

Conclusion: the person with the greatest chances to use the mobile
Internet has a high income, surfs many hours a week and uses the Web
primarily at the office.


Exercise
• Using the data set customer_dbase.sav, perform a binomial regression in
order to compute the probability that a customer is a union member
(variable union) depending on the following variables: ed, jobsat,
polview.

The Multinomial Regression



The multinomial logistic regression studies the relationship between a
nominal multinomial variable, on the one hand, and one or more continuous,
ordinal or nominal variables, on the other hand.

The dependent variable in a multinomial regression is a nominal variable
with 3 or more categories. The independent variables can be of any type:
continuous or categorical.


The goal of the multinomial regression is to predict the probability (or
the odds) that a given case belongs to one or another category of the
response variable.


The multinomial regression operates in the following way:

• a reference category is chosen for the
dependent variable, then

• several binomial regressions are run.

By these binomial regressions, each of the remaining categories is
compared with the reference category.


Assumptions:

1. The dependent variable is nominal with three or
more categories.
2. There are no significant outliers among the
data.
3. There is no important multicollinearity (the
independent variables are not strongly
correlated with each other).




We are conducting a study about the preferences
for three newspapers:

• Daily News

• National Politics

• Free Tribune




Recoding the newspaper variable – transformation table:

Initial categories    Initial codes    Variable dnews    Variable natpol
Daily News                  1                 1                 0
National Politics           2                 0                 1
Free Tribune                3                 0                 0
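
The transformation table can be expressed as a small recoding function. This is a sketch of the dummy coding only (in SPSS the same result would be obtained with a recode procedure); the function name is hypothetical, while the variable names dnews and natpol come from the table. Free Tribune, coded 0 on both dummies, acts as the reference category.

```python
def recode_newspaper(code):
    """Map the initial newspaper code (1, 2 or 3) to the dummies dnews and natpol."""
    if code not in (1, 2, 3):
        raise ValueError("newspaper code must be 1, 2 or 3")
    return {"dnews": int(code == 1), "natpol": int(code == 2)}


recoded = [recode_newspaper(code) for code in (1, 2, 3)]
print(recoded)
```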




Recoding the political variable – transformation table:

Initial categories    Initial codes    Variable left    Variable right
Left-wing                   1                 1                0
Right-wing                  2                 0                1
Center                      3                 0                0


Comparison between Daily News and Free Tribune

Coefficients and antilogarithms for the variable political orientation:

Categories                       B         Exp(B)
political=1 (left-wing)       -2.489        0.083
political=2 (right-wing)      -0.724        0.485
political=3 (center)           3.123       24.853


Comparison between Daily News and Free Tribune

The probability of an event to happen (p), depending on the odds or
chance (o):

$$p = \frac{o}{1 + o}$$


Comparison between Daily News and Free Tribune
For the left-wing subjects (political=1, exp(B)=0.083):

• the chance that a left-wing person reads Daily News is about 92% lower
than the chance that he or she reads Free Tribune (1-0.083) – or about
12 times lower (1/0.083=12.05).

If a left-wing subject should choose from these two newspapers only:

• the probability that they choose Daily News is 7.7%: 0.083/(1+0.083)

• the probability that they choose Free Tribune is 92.3%.
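
The three readings of exp(B) used in these comparisons (percent change, "times higher or lower", and the two-newspaper probabilities) can be collected in one helper. This is a sketch under the same convention as the slides, where exp(B) is used directly as the odds of the first newspaper against the reference one; the function and key names are hypothetical.

```python
def interpret_exp_b(exp_b):
    """Summarise an odds ratio exp(B) in the three ways used on the slides."""
    return {
        "percent_change": (exp_b - 1) * 100,    # negative means lower odds
        "times_factor": max(exp_b, 1 / exp_b),  # "about N times higher/lower"
        "p_first": exp_b / (1 + exp_b),         # probability of the first newspaper
        "p_second": 1 / (1 + exp_b),            # probability of the reference newspaper
    }


left_wing = interpret_exp_b(0.083)   # Daily News vs Free Tribune, political=1
center = interpret_exp_b(24.853)     # Daily News vs Free Tribune, political=3
print({k: round(v, 3) for k, v in left_wing.items()})
print({k: round(v, 3) for k, v in center.items()})
```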



Comparison between Daily News and Free Tribune
For the right-wing subjects (political=2, exp(B)=0.485):

• the chance that a right-wing person reads Daily News is about 52% lower
than the chance that he or she reads Free Tribune (1-0.485) – or about
2 times lower (1/0.485=2.06).

If a right-wing subject should choose from these two newspapers only:

• the probability that they choose Daily News is 32.7%: 0.485/(1+0.485)

• the probability that they choose Free Tribune is 67.3%.


Comparison between Daily News and Free Tribune
For the center subjects (political=3, exp(B)=24.853):

• the chance that a center person reads Daily News is about 24.9 times
higher than the chance that he or she reads Free Tribune.

If a center subject should choose from these two newspapers only:

• the probability that they choose Daily News is 96.1%: 24.853/(1+24.853)

• the probability that they choose Free Tribune is 3.9%.


Comparison between Daily News and Free Tribune
Coefficient and antilogarithm for the variable age:

B        Exp(B)
0.303    1.354

The chance of reading Daily News instead of Free Tribune increases when
the age increases.

Comparison between Daily News and Free Tribune

For the variable age (exp(B)=1.354):

• the chance that an older subject reads Daily News is about 1.35 times
higher than the chance that the same subject reads the Free Tribune

If an older subject should choose from these two newspapers only:

• the probability that they choose Daily News is 57.5%: 1.354/(1+1.354)

• the probability that they choose Free Tribune is 42.5%

For a young subject these probabilities are reversed.


Comparison between National Politics and Free Tribune

Coefficients and antilogarithms for the variable political orientation:

Categories                       B         Exp(B)
political=1 (left-wing)       -2.171       0.114
political=2 (right-wing)       0.944       2.570
political=3 (center)           1.227       3.410


Comparison between National Politics and Free Tribune

The probability of an event to happen (p), depending on the odds or
chance (o):

$$p = \frac{o}{1 + o}$$


Comparison between National Politics and Free Tribune
For the left-wing subjects (political=1, exp(B)=0.114):

• the chance that a left-wing subject reads National Politics is about
89% lower than the chance that he or she reads Free Tribune
(1-0.114=0.886) – or about 8.8 times lower (1/0.114=8.77).

If a left-wing subject should choose from these two newspapers only:

• the probability that they choose National Politics is 10.2%:
0.114/(1+0.114)

• the probability that they choose Free Tribune is 89.8%.


Comparison between National Politics and Free Tribune
For the right-wing subjects (political=2, exp(B)=2.570):

• the chance that a right-wing person reads National Politics is about
2.6 times higher than the chance that he or she reads Free Tribune.

If a right-wing subject should choose from these two newspapers only:

• the probability that they choose National Politics is 72.0%:
2.570/(1+2.570)

• the probability that they choose Free Tribune is 28.0%.


Comparison between National Politics and Free Tribune
For the center subjects (political=3, exp(B)=3.410):

• the chance that a center person reads National Politics is about 3.4
times higher than the chance that he or she reads Free Tribune.

If a center subject should choose from these two newspapers only:

• the probability that they choose National Politics is 77.3%:
3.410/(1+3.410)

• the probability that they choose Free Tribune is 22.7%.


Comparison between National Politics and Free Tribune

Coefficient and antilogarithm for the variable age:

B        Exp(B)
0.024    1.025

The chance of reading National Politics instead of Free Tribune slightly
increases when the age increases.


Comparison between National Politics and Free Tribune
For the variable age (exp(B)=1.025):

• the chance that an older subject reads National Politics is about 1.02
times higher than the chance that the same person reads the Free Tribune

If an older subject should choose from these two newspapers only:

• the probability that they choose National Politics is 50.6%:
1.025/(1+1.025)

• the probability that they choose Free Tribune is 49.4%.

For a young subject these probabilities are reversed.


Comparison between National Politics and Daily News

Coefficients and antilogarithms for the variable political orientation:

Categories                       B         Exp(B)
political=1 (left-wing)        0.318       1.374
political=2 (right-wing)       1.668       5.301
political=3 (center)          -1.986       0.137


Comparison between National Politics and Daily News
For the left-wing subjects (political=1, exp(B)=1.374):

• the chance that a left-wing person reads National Politics is about
1.4 times higher than the chance that he or she reads Daily News

If a left-wing subject should choose from these two newspapers only:

• the probability that they choose National Politics is 57.9%:
1.374/(1+1.374)

• the probability that they choose Daily News is 42.1%.


Comparison between National Politics and Daily News

The probability of an event to happen (p), depending on the odds or
chance (o):

$$p = \frac{o}{1 + o}$$


Comparison between National Politics and Daily News
For the right-wing subjects (political=2, exp(B)=5.301):

• the chance that a right-wing person reads National Politics is about
5.3 times higher than the chance that he or she reads Daily News.

If a right-wing subject should choose from these two newspapers only:

• the probability that they choose National Politics is 84.1%:
5.301/(1+5.301)

• the probability that they choose Daily News is 15.9%.


Comparison between National Politics and Daily News
For the center subjects (political=3, exp(B)=0.137):

• the chance that a center person reads National Politics is about 86%
lower than the chance that he or she reads Daily News (1-0.137=0.863) –
or about 7.3 times lower (1/0.137).

If a center subject should choose from these two newspapers only:

• the probability that they choose National Politics is 12%:
0.137/(1+0.137)

• the probability that they choose Daily News is 88%.


Comparison between National Politics and Daily News

Coefficient and antilogarithm for the variable age:

B         Exp(B)
-0.278    0.757

The chance of reading National Politics instead of Daily News decreases
when the age increases.


Comparison between National Politics and Daily News
For the variable age (exp(B)=0.757):

• the chance that an older subject reads National Politics is about 24%
lower than the chance that the same subject reads Daily News
(1-0.757=0.243) – or 1.3 times lower (1/0.757)

If an older subject should choose from these two newspapers only:

• the probability that they choose National Politics is 43%:
0.757/(1+0.757)

• the probability that they choose Daily News is 57%.

For a young subject these probabilities are reversed.


Exercise
• Using the data set customer_dbase, run a multinomial logistic
regression having the variable card (primary credit card) as a
response variable, and the following independent variables:
age, gender, income.


The ordinal logistic regression is used to determine the relationship between an ordinal dependent variable and one or more independent variables that can be continuous, ordinal or nominal.



The dependent variable must be measured on an ordinal scale (for instance, Likert or Osgood). For example:

• the customer satisfaction with a hotel's services: not at all satisfied – not satisfied – satisfied – very satisfied

• the level of agreement or disagreement with a tax increase: strongly disagree – slightly disagree – slightly agree – strongly agree.




There is one more condition that must
be absolutely satisfied for applying
the ordinal regression analysis: the
assumption of proportional odds.



Any ordinal regression represents a succession of binomial logistic regressions. If the dependent variable has k categories, it is cumulatively split into k-1 binomial (dichotomous) variables.

These binomial variables are defined as follows:

• the first binomial variable separates the cases in the first category of the dependent variable (1) from the others (2, 3, …, k)

• the second binomial variable separates the cases in the first two categories of the dependent variable (1, 2) from the others (3, 4, …, k)

• finally, the last binomial variable (k-1) separates the cases in the first k-1 categories of the dependent variable (1, 2, 3, …, k-1) from the last one (k)

To generalize: the binomial variable #j separates the cases in the first j categories of the dependent variable (1, 2, 3, …, j) from the others (j+1, j+2, …, k).



Example: let’s suppose that our dependent variable measures the

satisfaction with hotel services, on a four level scale:
• 1 – not at all satisfied
• 2 – not really satisfied
• 3 – satisfied
• 4 – very satisfied
The algorithm will define 3 dichotomous variables (4-1):
1) the first variable will separate the “not at all satisfied” customers (1) from the
others (2, 3 and 4)

2) the second variable will separate the “not at all satisfied” and “not really satisfied”
customers (1 and 2) from the others (3 and 4)

3) the third variable will separate the “not at all satisfied”, “not really satisfied” and
“satisfied” customers (1, 2 and 3) from the others (4)
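
The cumulative split described above can be sketched in Python. The function name `cumulative_splits` and the sample responses are hypothetical; SPSS performs this step internally.

```python
def cumulative_splits(values, k):
    """Return k-1 lists of 0/1 indicators.

    Indicator j (j = 1..k-1) is 1 when the case falls in the first j
    categories and 0 when it falls in the others.
    """
    return [[1 if v <= j else 0 for v in values] for j in range(1, k)]


# Hypothetical satisfaction scores coded 1..4
satisfaction = [1, 2, 3, 4, 2, 4]
splits = cumulative_splits(satisfaction, k=4)

for j, split in enumerate(splits, start=1):
    print(f"binomial variable {j}: {split}")
```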



For each binomial variable j, the binomial logistic regression is written as follows:

$$\ln\frac{p_j}{1-p_j} = b_{0j} + b_{1j}x_1 + b_{2j}x_2 + \cdots + b_{kj}x_k$$

where p_j is the probability of success for the variable j, and the b's are the regression coefficients.

We will end up with a family of (k-1) equations like the one above.

The assumption of proportional odds:
The coefficients b1j, b2j up to bkj are equal for all the (k-1) logistic regressions. In other words, the slopes of the regression lines defined by this family of equations are equal.
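
The following sketch illustrates what the assumption implies: the k-1 equations share one slope and differ only in their intercepts. The intercepts, the slope and the reading of p_j as the cumulative probability P(Y ≤ j) are assumptions made for illustration; they do not come from any SPSS output, and SPSS may use a different sign convention for its estimates.

```python
import math


def logistic(z):
    """Inverse of the logit: turns ln(p / (1 - p)) back into p."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(intercepts, slope, x):
    """Category probabilities for one predictor under proportional odds.

    intercepts: increasing list of k-1 intercepts, one per equation
    slope: single coefficient shared by all k-1 equations
    """
    cumulative = [logistic(b0 + slope * x) for b0 in intercepts]
    cumulative.append(1.0)  # P(Y <= k) is always 1
    probabilities = []
    previous = 0.0
    for c in cumulative:
        probabilities.append(c - previous)
        previous = c
    return probabilities


# Hypothetical values for a 4-category dependent variable
probs = category_probabilities(intercepts=[-1.0, 0.5, 2.0], slope=0.8, x=1.0)
print([round(p, 3) for p in probs])
```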



Conditions (assumptions):

1. The dependent variable is ordinal.
2. The assumption of proportional odds
is satisfied.
3. There is no important multicollinearity
(the independent variables are not
strongly correlated with each other).
In SPSS, the ordinal regression can be accessed in two ways:

1. Using the PLUM procedure (PLUM stands for Polytomous Universal Model). This procedure has a downside: it does not automatically compute the antilogarithms of the regression coefficients.

2. Using the GENLIN procedure. This procedure has its downsides too: it does not allow us to test the assumption of proportional odds, it does not compute the pseudo R square indicators and it does not save the predicted probabilities in a convenient way.



Transformation table for the variable imprice:

Initial categories     Initial codes   Variable price1   Variable price2
Not important          1               1                 0
Somewhat important     2               0                 1
Very important         3               0                 0
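
The transformation table can be expressed as a small lookup. This is a sketch: the names `DUMMY_CODES` and `encode_imprice` and the sample responses are hypothetical.

```python
# Mapping copied from the transformation table:
# initial code -> (price1, price2); code 3 gets zeros on both dummies.
DUMMY_CODES = {1: (1, 0), 2: (0, 1), 3: (0, 0)}


def encode_imprice(code):
    """Return the pair (price1, price2) for an imprice code of 1, 2 or 3."""
    return DUMMY_CODES[code]


imprice = [3, 1, 2, 2]  # hypothetical responses
encoded = [encode_imprice(code) for code in imprice]
print(encoded)
```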



The interpretation rules for the antilogs of the factors:

• If the antilog is higher than 1 for a certain category, the subjects in that category have higher levels of satisfaction than the subjects in the reference category

• If the antilog is lower than 1 for a certain category, the subjects in that category have lower levels of satisfaction than the subjects in the reference category

• If the antilog is equal to 1 for a certain category, the subjects in that category have the same levels of satisfaction as the subjects in the reference category



Coefficients of the factor type of traveler:

Category        B         Exp(B)
Pleasure (0)    0.665     1.944
Business (1)    -0.665    0.514



For the category pleasure traveler (0), exp(B)=1.944:

• the odds that a pleasure traveler is satisfied with the hotel are 1.9 times higher than the same odds for a business traveler



The probability of an event happening (p), depending on the odds (o):

$$p = \frac{o}{1+o}$$




The probability that a pleasure traveler is satisfied is 66%: 1.944/(1+1.944)

The probability that a pleasure traveler is not satisfied is 34%.



For the category business traveler (1), exp(B)=0.514:

• the odds that a business traveler is satisfied with the hotel are 1.9 times lower than the same odds for a pleasure traveler (1/0.514), or about 49% lower (1-0.514)



The probability that a business traveler is satisfied is 34%: 0.514/(1+0.514)

The probability that a business traveler is not satisfied is 66%.




Coefficients of the factor importance of price:

Level                     B         Exp(B)
Not important (1)         0.037     1.038
Somewhat important (2)    1.161     3.194
Very important (3)        -1.198    0.301




For the level not important (1), exp(B)=1.038:

• the odds that a subject who considers the price not important is satisfied are 1.04 times higher than the same odds for a subject who considers the price very important

The probability that a traveler who considers the price not important is satisfied is 50.9%: 1.038/(1+1.038)

The probability that the same traveler is not satisfied is 49.1%.



For the level somewhat important (2), exp(B)=3.194:

• the odds that a subject who considers the price somewhat important is satisfied are 3.2 times higher than the same odds for a subject who considers the price very important



The probability that a traveler who considers the price somewhat important is satisfied is 76.2%: 3.194/(1+3.194)

The probability that the same traveler is not satisfied is 23.8%.

For the level very important (3), exp(B)=0.301:

The probability that a traveler who considers the price very important is satisfied is 23.1%: 0.301/(1+0.301)

The probability that the same traveler is not satisfied is 76.9%.

The interpretation rule for the antilogs of the covariates:

the greater the antilogarithm, the higher the odds that the subjects with greater values of the variable are satisfied.

So for a continuous variable, the reference values are considered to be the lowest values.



The coefficient for the variable age: 0.242

Exp(B) = 1.274

the odds that an older subject is satisfied are about 1.3 times higher than the odds that a younger subject is satisfied



The probability that an older traveler is satisfied is 56%: 1.274/(1+1.274)

The probability that the same traveler is not satisfied is 44%.

For a young traveler, these probabilities are reversed.



Total number of cases: 192

Correctly classified cases: 107

Percentage of correctly classified cases: 55.7%
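
The percentage of correctly classified cases is simply the ratio of the two counts above. A minimal sketch, with illustrative variable names:

```python
total_cases = 192      # total number of cases in the classification table
correct_cases = 107    # cases whose predicted category equals the observed one

accuracy = correct_cases / total_cases
print(f"{accuracy:.1%}")
```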



Exercise
• Using the data set customer_dbase,
perform an ordinal logistic regression
having the variable jobsat (job
satisfaction) as a response variable, and
the following independent variables: age,
gender, income, ed.
JIGJIGA UNIVERSITY
SCHOOL OF GRADUATE STUDIES
COLLEGE OF BUSINESS AND ECONOMICS
DEPARTMENT OF ACCOUNTING AND FINANCE
Advanced Research Methods For Accounting And Finance

DATA ANALYSIS USING


STATISTICAL PACKAGE FOR SOCIAL SCIENCE

Dr. Abenet Yohannes (Ph.D.)

Sunday, 21 November 2021 5:02 PM


SECTION 9

SCALING TECHNIQUES
• Reliability analysis
• Multidimensional scaling




We are going to study three reliability indicators:

• Cronbach’s alpha

• Cohen’s kappa

• Kendall’s W
Cronbach’s alpha
Cronbach’s alpha helps us determine whether the items of a scale measure the same underlying dimension, the same construct.

So this indicator is used when we have a construct composed of multiple items measured on an interval scale and want to know whether the construct is reliable (consistent).



Cronbach’s alpha
Example: a fictitious study where the subjects were asked to express their

opinions on a brand of home appliances, using a 5-item scale:

1. The products of the brand X are high quality.

2. The company X is sincerely interested in the customers’ satisfaction.

3. If I had a problem with a product of the brand X, the company

representatives would do everything to fix it.

4. The advertisements of the brand X are trustworthy.

5. When I need a home appliance, the brand X is the first one I think of.



Cronbach’s alpha
Critical values

Alpha                    Internal consistency
Greater than 0.90        Excellent
Between 0.80 and 0.90    Very good
Between 0.70 and 0.80    Good
Between 0.60 and 0.70    Acceptable (for exploratory studies only)
Between 0.50 and 0.60    Poor
Lower than 0.50          Unacceptable
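
For readers who want to see what lies behind the indicator, the sketch below computes Cronbach's alpha with the standard formula alpha = k/(k-1) × (1 − sum of item variances / variance of the total score). The formula is not stated on the slides, and the scores are hypothetical; in the workshop the value is obtained from SPSS.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (same items for all)."""
    k = len(rows[0])                           # number of items in the scale
    items = list(zip(*rows))                   # one tuple of scores per item
    item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variances / total_variance)


# Hypothetical answers of 5 respondents to a 5-item scale (1..5)
scores = [
    [4, 5, 4, 4, 5],
    [3, 3, 3, 4, 3],
    [5, 5, 4, 5, 5],
    [2, 2, 3, 2, 2],
    [4, 4, 4, 3, 4],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))
```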

..\1. Training datasets\reliability-alpha.sav

Cohen’s Kappa
Cohen’s kappa measures the degree of agreement (or the level of concordance) between two evaluators, when the evaluation uses a categorical scale.

This indicator compares the observed percentage of agreements with the probability that an agreement appears by simple chance.
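
The comparison between observed and chance agreement can be sketched as follows, using the usual definition kappa = (p_o − p_e) / (1 − p_e). The formula is not given on the slides, and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters scoring the same subjects."""
    n = len(rater1)
    # observed proportion of agreements
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # agreement expected by chance, from the marginal frequencies
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = set(rater1) | set(rater2)
    expected = sum(counts1[c] * counts2[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical scores given by two raters to four subjects
kappa = cohen_kappa([1, 1, 2, 2], [1, 2, 2, 2])
print(kappa)
```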



Cohen’s Kappa
Critical Values

k               Concordance
≤ 0.20          Very low
0.21 – 0.40     Low
0.41 – 0.60     Moderate
0.61 – 0.80     Strong
0.81 – 0.99     Very strong
1               Perfect



Cohen’s Kappa
Assumptions:

1. The rating scale is categorical (nominal or ordinal) with disjoint categories (each rater gives any subject one score and only one).

2. The evaluation takes place on the same sample, every subject receiving two scores, one from each rater.

3. Each rater uses the same scale, with the same number of levels.

4. The evaluations are independent.

5. The two evaluators are predetermined (fixed), i.e. they are specifically chosen to take part in the study. They are not selected at random, and we cannot make generalizations for other evaluators.



Cohen’s Kappa
The null and alternative hypotheses:
H0: k=0
H1: k≠0



Kendall’s W
The Kendall’s W coefficient is used to
measure the inter-rater reliability when we
have three or more raters and the
evaluation uses a continuous or an ordinal
scale.

It ranges between 0 and 1.
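
A minimal sketch of the coefficient, assuming each rater ranks the same n subjects without ties and using the usual formula W = 12·S / (m²·(n³ − n)), where m is the number of raters and S is the sum of squared deviations of the rank sums from their mean. The formula is not stated on the slides and the ranks are hypothetical.

```python
def kendalls_w(ranks):
    """ranks: one list per rater holding the ranks 1..n of the same n subjects."""
    m = len(ranks)        # number of raters
    n = len(ranks[0])     # number of subjects
    rank_sums = [sum(rater[i] for rater in ranks) for i in range(n)]
    mean_sum = sum(rank_sums) / n
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three hypothetical raters in perfect agreement on four subjects
w_perfect = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
# Two hypothetical raters with exactly opposite rankings
w_opposite = kendalls_w([[1, 2, 3], [3, 2, 1]])
print(w_perfect, w_opposite)
```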


Kendall’s W
Critical Values

W               Concordance
≤ 0.20          Very low
0.21 – 0.40     Low
0.41 – 0.60     Moderate
0.61 – 0.80     Strong
0.81 – 0.99     Very strong
1               Perfect



Kendall’s W
The null and alternative hypotheses:

H0: W=0
H1: W>0



Kendall’s W
Assumptions:

1. The evaluation scale is continuous or ordinal.

2. The evaluation takes place on the same sample, every subject receiving as many scores as there are raters.

3. The evaluations are independent.

4. The evaluators are fixed, i.e. they are specifically selected to take part in the study. They are not selected at random, and we cannot make generalizations for other evaluators.



Exercise #1
• Using the database anorectic.sav, compute and interpret the Cronbach’s alpha for a construct made up of the following variables: mens, fast, binge, vomit, purge, hyper, preo, body.

Exercise #2
• Using the database site.sav, compute and interpret the Cohen’s kappa statistic to determine whether there is a concordance between the scores given by the two consultants, cons1 and cons2.



Exercise #3
• Using the database judges.sav, compute and interpret the Kendall’s W indicator to find out whether there is a concordance between the scores given by all the judges that evaluated the gymnasts (also run the pairwise comparisons to see where the concordances and discordances are).


The multidimensional scaling (MDS) is a technique used to visualize the level of similarity of the individual objects in a dataset. It places these objects in an n-dimensional space, the coordinates of which are formed by a series of hidden or underlying attributes. The purpose of the MDS is to identify those attributes, compute the coordinates of each object and represent the objects in space.
The MDS procedure starts from a single object attribute and tries to discover the underlying dimensions behind that attribute.

Ways of Collecting Data
1. Recording the individual scores given by the subjects for
every object in the set. Based on these scores, the program
will compute the distances between objects.
For example, let’s suppose that we ask the subjects to assess
the car brands in a certain country. Each brand gets a score
from 1 to 7, 1 meaning “it’s a very poor car” and 7 meaning “it’s
an excellent car”. Using these scores, the algorithm will
determine the similar and dissimilar car brands.
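
To make the idea of "distances computed from scores" concrete, the sketch below builds a distance matrix between brands from the subjects' scores. Euclidean distance is assumed here as one common choice, and the scores are hypothetical; the measure actually applied depends on the options selected in SPSS.

```python
import math

# Hypothetical scores: one row per subject, one column per car brand (A, B, C)
scores = [
    [7, 6, 2],
    [6, 6, 1],
    [7, 5, 3],
]


def euclidean_distance_matrix(rows):
    """Distance between every pair of objects (columns) across all subjects."""
    columns = list(zip(*rows))  # one tuple of scores per brand
    return [[math.dist(a, b) for b in columns] for a in columns]


distances = euclidean_distance_matrix(scores)
for row in distances:
    print([round(d, 2) for d in row])
```

Brands with a small distance received similar scores from the subjects; brands with a large distance were scored differently.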



Ways of Collecting Data
2. Recording the distances between objects, as
they are perceived by the subjects. In this case,
the subjects are presented with all the possible
pairs of comparisons. They give a score to each
pair, according to the level of similarity between
the objects.



Ways of Collecting Data
• If the higher scores correspond to similar objects, and the lower scores correspond to dissimilar objects, then we get a similarity or proximity matrix.

• If the higher scores correspond to dissimilar objects, and the lower scores correspond to similar objects, then we get a dissimilarity or discrepancy matrix.



ALSCAL and PROXSCAL
We are going to present two procedures of multidimensional
scaling:

1. The ALSCAL procedure


2. The PROXSCAL procedure



ALSCAL procedure, data are not distances

Example: 315 subjects were invited to give a score to 11 brands of orange juice.


ALSCAL procedure, data are not distances
Critical values of stress

Stress (phi)             Model quality
Lower than 0.10          Excellent
Between 0.10 and 0.20    Good
Greater than 0.20        Poor


ALSCAL procedure, data are distances
Example: a survey where the subjects were
invited to compare seven travel destinations
(A-G) by pairs. A scale from 1 to 10 was
used, where 1 means “very similar” and 10
means “very different” (so the resulting
matrix is a dissimilarity matrix).
PROXSCAL procedure, data are not distances

Example: 315 subjects were invited to give a score to 11 brands of orange juice.


PROXSCAL procedure, data are not distances

Critical values of stress

Stress (phi)             Model quality
Lower than 0.10          Excellent
Between 0.10 and 0.20    Good
Greater than 0.20        Poor


PROXSCAL procedure, data are distances

Example: a survey where the subjects were invited to compare seven travel destinations (A-G) by pairs. A scale from 1 to 10 was used, where 1 means “very similar” and 10 means “very different” (so the resulting matrix is a dissimilarity matrix).


Exercise
• The data set brands.sav contains the perceptions of 315
respondents on 15 fictitious car brands. Build a
perceptual map for these brands using the
multidimensional scaling.



JIGJIGA UNIVERSITY
SCHOOL OF GRADUATE STUDIES
FIVE DAYS WORKSHOP
ON
CITATION MANAGEMENT USING MENDELEY SOFTWARE
&
DATA ANALYSIS USING
Statistical Package for Social Science
.
Dr Abenet Yohannes (Ph.D.)
Assistant Professor
College of Business and Economics
Department of Accounting and Finance

Sunday, 21 November 2021 5:02 PM


Section 10

Data Reduction
• Principal component analysis
• Correspondence analysis




The principal component analysis is used to reduce a large set of variables into a smaller set of variables called principal components (or factors). The initial variables are grouped based on the correlations between them. The principal components synthesize the information contained in those variables.

For short, the principal component analysis is called PCA.
The PCA is a mixture of science and art, a mixture of objective and subjective techniques. The program identifies a number of essential factors, but afterwards the researcher’s analytical skills come into play: he or she must give each factor a relevant interpretation.
Assumptions:

1. The variables are continuous (although, in practice, ordinal variables are often used).

2. The relationships between all the variables are linear.

3. The principal components do not present significant outliers. Obviously, this assumption is to be checked after running the PCA procedure and saving the components.

4. The number of cases in the sample is big enough (at least 10 cases for each variable).

5. The initial variables should be strongly correlated with each other. Running a PCA is justified only if the correlations between the initial variables are sufficiently high.

The main steps of the PCA procedure are:

1. Testing the overall correlation between the variables.

2. Extracting the principal components (the factors).

3. Determining the “meaningful” or “relevant” factors that

can be retained (finding the final solution of the analysis).

4. Computing and saving the factor scores.

5. Interpreting the final solution and reporting the results.


Example
..\1. Training datasets\World95.sav
KMO – critical values

KMO                      Sampling adequacy
Greater than 0.90        Very good
Between 0.80 and 0.90    Good
Between 0.70 and 0.80    Medium
Between 0.60 and 0.70    Mediocre
Between 0.50 and 0.60    Poor
Lower than 0.50          Unacceptable


Kaiser criterion: only the components that have eigenvalues higher than 1 will be retained.

Benzecri criterion: the components that should be retained for the final solution are those that together explain at least 70% of the total variance.

Evrard criterion: the component that corresponds to the inflexion point (in the scree plot) will be the last included in the final solution.
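
The first two criteria can be applied mechanically to a list of eigenvalues, as in the sketch below. The eigenvalues are hypothetical (they do not come from World95.sav) and the function names are illustrative; the third criterion relies on a visual reading of the scree plot and is not coded here.

```python
def kaiser_count(eigenvalues):
    """Number of components with an eigenvalue higher than 1."""
    return sum(1 for ev in eigenvalues if ev > 1)


def cumulative_variance_count(eigenvalues, threshold=0.70):
    """Smallest number of components that together explain the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev / total
        if cumulative >= threshold:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues for an analysis with 5 components
eigenvalues = [4.0, 2.0, 1.2, 0.5, 0.3]
n_kaiser = kaiser_count(eigenvalues)
n_variance = cumulative_variance_count(eigenvalues)
print(n_kaiser, n_variance)
```

The two criteria do not have to agree, which is one reason the final choice is left to the researcher's judgment.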
Exercise
• Using the information in the database car_sales.sav, do a principal component analysis using the following variables: engine_s, horsepow, wheelbas, width, length, curb_wgt, fuel_cap, mpg, price.



The correspondence analysis is used to visualize the relationship between two categorical variables. Unlike the PCA, which groups variables, the correspondence analysis groups the variable categories (levels).

Both variables must be categorical, their categories must be disjoint and not coded with negative numbers.

Every level or category is also called a profile. The program will compute the distances between profiles and represent them in a 2-dimensional chart.

For building a 2-dimensional map, each variable should have at least 3 categories.
Exercise
• Using the data set productsales.sav, perform a
correspondence analysis and build a perceptual map in order
to analyze the relationship between the products and the
countries they were sold in.
SECTION 11

GROUPING METHODS
• Cluster analysis
• Discriminant analysis



The cluster analysis groups the population members in homogeneous classes or clusters (or groups). The researcher does not know the groups in advance; they will be created during the clustering procedure.
The cluster analysis technique computes the distances between cases and then builds a similarity or proximity matrix based on those distances. The cases that are close to one another are included in the same cluster.

The members of every cluster should be similar to one another and different from the members of the other clusters.
There are two main types of cluster analysis:

1. The hierarchical cluster. This technique determines how many clusters best suit the data, depending on the definition of the distance and the linking method.

2. The K-means cluster. This clustering method allows the analyst to specify the number of clusters in advance, and then assigns the cases to the clusters.

Assumptions:

1. The observations (measurements)


are independent.

2. There are no significant outliers.


Hierarchical Cluster
The hierarchical cluster technique is
preferred when the number of cases is
low (usually under 200).
The initial variables for a hierarchical
cluster can be continuous, ordinal or
nominal.
K-Means Cluster
The K-means cluster (also called disjoint cluster)
can be used when we have a big number of
sample cases.

For this analysis, the researcher establishes the


number of clusters from the beginning and the
program generates one final solution with that
number of classes.
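
The logic of the method can be sketched in a few lines: each case is assigned to the nearest cluster center, then the centers are recomputed, and the two steps are repeated until nothing changes. This is a simplified illustration, not the SPSS implementation; the data and the initial centers are hypothetical and are passed in explicitly to keep the example reproducible.

```python
import math


def k_means(points, centers, iterations=10):
    """Return (labels, centers) after alternating assignment and update steps."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every case
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # the solution is stable
        centers = new_centers
    return labels, centers


# Hypothetical cases measured on two variables, with k = 2 clusters
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = k_means(points, centers=[(0, 0), (10, 10)])
print(labels, centers)
```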
K-Means Cluster
The K-means cluster can only work with
continuous or ordinal variables.
Furthermore, it is very advisable to
standardize the initial variables before
running the procedure, to make the
interpretation easier.
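
Standardizing a variable means converting it to z-scores (subtracting the mean and dividing by the standard deviation). A minimal sketch with hypothetical values; the function name is illustrative and the sample standard deviation is assumed.

```python
from statistics import mean, stdev


def standardize(values):
    """Convert raw values to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


horsepower = [100, 150, 200]  # hypothetical values of one variable
z_scores = standardize(horsepower)
print(z_scores)
```

After this step every variable has mean 0 and standard deviation 1, so variables measured in large units no longer dominate the distances.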
EXAMPLE
..\..\1. Training datasets\car-models-2002.sav
Exercise
• Using the data set car_sales.sav,
perform a k-means cluster analysis for
the car models in the database, with
the following variables: engine_s,
horsepow, wheelbas, width, length,
curb_wgt, fuel_cap, mpg.

The discriminant analysis (DA) is both a grouping method
and a predictive technique. It is used to predict the values of
a grouping variable (which is the dependent variable) based
on the values of the independent variables.

The grouping variable is categorical (usually dichotomous,


but it can be also multinomial) and the independent
variables should be continuous (however, ordinal variables
are also used).
The DA procedure creates the following discriminant function:

$$y = b_0 + b_1x_1 + b_2x_2 + \cdots + b_kx_k$$

where y is the dependent variable, the x’s are the explainers and the b’s are the coefficients
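
Evaluating the discriminant function for one case is a weighted sum, as the sketch below shows. The coefficients and the case values are hypothetical; in practice they are taken from the SPSS output.

```python
def discriminant_score(b0, coefficients, x):
    """Compute y = b0 + b1*x1 + b2*x2 + ... + bk*xk for one case."""
    return b0 + sum(b * xi for b, xi in zip(coefficients, x))


# Hypothetical constant, coefficients and values of two explainers
score = discriminant_score(b0=1.0, coefficients=[2.0, 3.0], x=[1.0, 1.0])
print(score)
```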
Assumptions:

1. The dependent variable is categorical, with disjoint groups.

2. The independent variables are continuous (or at least ordinal).

3. The measurements are independent (the subjects are randomly selected).

4. The independent variables do not present significant outliers.

5. The independent variables are normally distributed in all groups of the dependent variable.

6. The variances of the independent variables are equal in all groups.

7. The covariances of the independent variables are equal in all groups.

8. There is no important multicollinearity (the independent variables are not strongly correlated with each other).
There are two types of discriminant analysis:

• simple, when the dependent
variable is binomial

• multiple, when the dependent


variable is multinomial
Multiple discriminant analysis
The number of discriminant functions created by
the algorithm:

min(c-1, k)
where c is the number of categories in the
dependent variable, and k is the number of
independent variables.
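
The rule is easy to verify for a given design. A small sketch with an illustrative function name and hypothetical designs:

```python
def n_discriminant_functions(n_categories, n_predictors):
    """Number of discriminant functions: min(c - 1, k)."""
    return min(n_categories - 1, n_predictors)


# 3 groups and 5 independent variables
print(n_discriminant_functions(3, 5))
# 4 groups but only 2 independent variables
print(n_discriminant_functions(4, 2))
```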
Multiple discriminant analysis
• The first function discriminates the younger employees, with higher education and big salaries, from the older employees, with lower salaries and lower education levels

• The second function discriminates the older employees with lower education from the younger employees with higher education.
Exercise
Using the data set car_sales.sav, perform a discriminant analysis to determine which of the following variables discriminate between the two groups of the variable type (automobile and truck): engine_s, horsepow, wheelbas, width, length, curb_wgt, fuel_cap, mpg, price.
SECTION 12

• Multiple Response Analysis


• Use datasets
• sports-(Frequency).sav
sports-shoes( cross tab).sav




What sports do you practice? (Check all that apply.)

o Soccer
o Baseball
o Tennis
o Football
o Swimming




What is your preferred sport shoes brand? (Check all that apply.)

o Adidas
o Nike
o Reebok
o Asics



Frequency or cross table of multiple response set
1. To create the set:
   1. In the menu bar, click Analyze
   2. Click Multiple Response
   3. Click Define Variable Sets
   4. Select the variables that form the set and move them to the Variables in Set section
   5. At Counted value, enter the value that represents a positive answer (usually you are interested in how many cases chose the option)
   6. Enter a name and a label for the set
   7. Click Add
   8. Click Close



Frequency or cross table of multiple response set
2. To create a frequency table from the set:
   1. In the menu bar, click Analyze
   2. Click Multiple Response, then Frequencies
   3. Move the multiple response set(s) to the Table(s) for section
   4. Click OK
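
The logic behind a multiple response frequency table can be sketched in Python: every checked option is counted, and the counts are expressed both as a share of all responses and as a share of the cases that checked at least one option. The data, the function name and the choice of 1 as the counted value are hypothetical.

```python
# Hypothetical answers: 1 = sport checked, 0 = not checked
responses = [
    {"soccer": 1, "baseball": 0, "tennis": 1},
    {"soccer": 1, "baseball": 1, "tennis": 0},
    {"soccer": 0, "baseball": 0, "tennis": 1},
    {"soccer": 0, "baseball": 0, "tennis": 0},
]


def multiple_response_frequencies(rows, counted_value=1):
    """Return {option: (count, share of responses, share of valid cases)}."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value == counted_value)
    total_responses = sum(counts.values())
    # A case is valid when at least one option holds the counted value
    valid_cases = sum(
        any(v == counted_value for v in row.values()) for row in rows
    )
    return {
        name: (n, n / total_responses, n / valid_cases)
        for name, n in counts.items()
    }


table = multiple_response_frequencies(responses)
for name, (n, pct_responses, pct_cases) in table.items():
    print(f"{name}: {n}  {pct_responses:.1%}  {pct_cases:.1%}")
```

Because a respondent can check several options, the percentages of cases can add up to more than 100%.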


To create a cross table from the set:
1. In the menu bar, click Analyze
2. Click Multiple Response, then Crosstabs
3. Click on the multiple response set and move it to the Row(s)
4. Click on the variable you want to cross with and move it to the Column(s)
5. Click Define Ranges
6. Specify the range of codes for the values you want to show
7. Click Continue
8. In case you want to see percentages:
   1. Click Options
   2. Tick the type of percentages you want (based on row totals, column totals and/or the overall total)
   3. Click Continue
9. Click OK
