STA3022Test2 2023 v2

UNIVERSITY OF CAPE TOWN
STATISTICAL SCIENCES DEPARTMENT

STA3022F: Multivariate Analysis
CLASS TEST 2
8 May 2023
TIME: 1 ½ hours Total Marks: 50
Answer ALL questions. (5 pages – 7 questions)

Marks are allocated for intermediate calculations. Round all your calculations to 4 decimal
places. Do not answer on the QUESTION PAPER! All your answers must be written in the answer
books provided.
QUESTION 1: [Total: 8 marks]
This question deals with aspects of multiple correspondence analysis (MCA).
a) Describe the type of data or observations that are suitable for the application of MCA. (3)
b) What is the algebraic relationship between the indicator matrix, X, of MCA and the Burt
matrix? (2)
c) Give one advantage of using the Burt matrix over the indicator matrix when carrying out
MCA. (1)
d) Name an alternative technique to MCA and state which matrix it uses as input.
Mention any advantage it has over MCA. (2)
QUESTION 2: [Total: 9 marks]
In a certain company, 100 employees were rated on their suitability for an advanced training course
in computer programming based on eight ratings given by their manager (rated 1=low to 5=high) on
the following:
i. Intellect.
ii. Interest in doing the course.
iii. Experience of computer programming.
iv. Likelihood of them staying with the company.
v. Commitment to the company.
vi. Loyalty to their team.
vii. Number of GCSEs.
viii. Score on a computer programming aptitude test.
Suppose you were told that items iv, v and vi are meant to measure the concept “loyalty”.
a) What technique or method would you use to analyse the items? (1)
b) What is meant by internal consistency or reliability of the group of items? (1)
c) Name the formula you would use to calculate the internal consistency or reliability of a
group of items. (1)
d) Write down the formula named in (c) and define each term used in the formula. (3)
e) What other two techniques are used to improve the reliability of a group of items if the
technique named in (c) shows that the group of items is not reliable? (2)
f) Suppose you established that the group of items were reliable using the techniques above.
What does being reliable actually mean in this context in relation to the “concept” being
measured? (1)

QUESTION 3: [Total: 5 marks]
For the next three questions, consider a dataset containing six one-dimensional points:
{2, 4, 7, 8, 12, 14}.
After three iterations of Hierarchical Agglomerative Clustering using Euclidean distance
between points, we get the 3 clusters:
C1 = {2, 4},
C2 = {7, 8} and
C3 = {12, 14}.
a) What is the distance between clusters C1 and C2 using Single Linkage? (1)
b) What is the distance between clusters C1 and C2 using Complete Linkage? (1)
c) What clusters are merged at the next iteration using Single Linkage? (2)
d) Consider the dendrogram below. If this dendrogram is cut to create 3 clusters, what would the
clusters be? (1)

QUESTION 4: [Total: 5 marks]
This question uses the USArrests dataset, which is part of the stats R package. The data set contains
statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in
1973. Also given is the percent of the population living in urban areas. The following is a snippet of
the data.
a) What might motivate us to scale the data before using it? (1)
b) K-means was used to cluster the data with the number of clusters, k, set to 1, 2, …, 14. The
scree plot below is a graph of the within-groups sum of squares (y-axis) against
the number of clusters, k (x-axis). From the plot, what is the ‘optimal’
number of clusters to use? Explain. (2)
c) Give two disadvantages of using K-means for clustering. (2)
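The scree plot described in (b) could be produced along the following lines. This is only a sketch: the test paper does not show the code that was used, so the scaling step, the seed and the nstart value are all assumptions.

```r
# Assumption: the variables are standardised first; the paper does not say
# whether the plotted solution used scaled or raw data.
scaled <- scale(USArrests)

set.seed(1)  # arbitrary seed so that the random starts are reproducible
ks <- 1:14

# Total within-groups sum of squares for each candidate number of clusters
wss <- sapply(ks, function(k) {
  kmeans(scaled, centers = k, nstart = 25)$tot.withinss
})

# Scree plot: within-groups sum of squares against k
plot(ks, wss, type = "b",
     xlab = "Number of clusters, k",
     ylab = "Within-groups sum of squares")
```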
QUESTION 5: [Total: 7 marks]
a) What type of matrix is used as input to an MDS algorithm? (1)
b) When carrying out principal component analysis (PCA), the goal is to preserve variance in a
low-dimensional space. What do we try to preserve in low dimensions when MDS is carried
out? (1)
c) Describe in point form the five steps of the algorithm for performing metric (not classical)
MDS. (5)
QUESTION 6 [Total: 6 marks]
The BUPA Liver Disorders dataset was created by BUPA Medical Research and Development Ltd.
(BMRDL) during the 1980s as part of a larger health screening database. There are 345
observations each corresponding to one human subject. The first 5 columns (x1-x5) are
integer-valued and represent the results of various blood tests which are thought to be sensitive to
liver disorders that might arise from excessive alcohol consumption. The sixth column (y_num) is
real-valued and represents the number of drinks (equivalent of half pints of beer) taken per day by
the subject, self-reported. The dataset does not contain any variable representing presence or
absence of a liver disorder.
Ref: http://archive.ics.uci.edu/ml/datasets/liver+disorders
The numerical y_num variable (number of half-pint equivalents of alcoholic beverages drunk per
day) has been dichotomised (converted into the binary variable y_cat_disc) using the following rule:
y_cat_disc = “light drinker” if y_num < 3
y_cat_disc = “heavy drinker” if y_num ≥ 3
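The dichotomisation rule can be sketched in R as follows. The helper name dichotomise is hypothetical (it does not appear in the test paper), and the y_num values are those of the first six respondents in the head(disorder) output.

```r
# Hypothetical helper, not part of the test paper: applies the stated rule
# ("light drinker" if y_num < 3, otherwise "heavy drinker").
dichotomise <- function(y_num) {
  ifelse(y_num < 3, "light drinker", "heavy drinker")
}

# y_num values of the first six respondents, copied from head(disorder)
y_num <- c(3.0, 4.0, 0.5, 6.0, 1.0, 5.0)
y_cat_disc <- factor(dichotomise(y_num))

print(data.frame(y_num, y_cat_disc))
```

Note that a subject reporting exactly 3 drinks per day falls in the “heavy drinker” group, because the boundary uses ≥.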
The researchers applied stepwise discriminant analysis to the data and found the following results:
Data for the first 6 respondents
> head(disorder)
ID   x1  x2  x3  x4  x5  y_num  y_cat_disc
 1   86  48  20  20   6    3.0  heavy drinker
 2   83  45  19  21  13    4.0  heavy drinker
 3   82  48  27  15  12    0.5  light drinker
 4   92  53  51  33  92    6.0  heavy drinker
 5  103  75  19  30  13    1.0  light drinker
 6   94  71  25  26  31    5.0  heavy drinker
> formulaStepwise=y_cat_disc ~ x1+x4
> fit <- lda(formulaStepwise,data=disorder, method="moment")
> fit
Call:
lda(formulaStepwise, data = disorder, method = "moment")

Prior probabilities of groups:
heavy drinker light drinker
     0.510145      0.489855

Group means:
                   x1      x4
heavy drinker 91.1193 26.7045
light drinker 89.1598 22.4970

Coefficients of linear discriminants:
         LD1
Cons 15.7794
x1   -0.1574
x4   -0.0644

> centroidLightDrinker
[1] ?
> centroidHeavyDrinker
[1] -0.2839
Questions:
a) After the stepwise method was applied, the discriminant analysis was fitted using
only the “x1” and “x4” variables. Write down the discriminant function. (1)
b) Calculate the cut-off value and clearly indicate the classification rule. (2.5)
c) Data has been gathered for a person with the following profile:
x1 = 83
x4 = 15
What would be the predicted level of alcohol consumption for this person? Justify your answer.
(Hint: Obtain a predicted discriminant score.) (2.5)
QUESTION 7 [Total: 10 marks]
The data was further explored by using classification trees for predicting the level of drinking using
all the variables:
> rpart(y_cat_disc~.,data = disorder[,c(1:5,7)])
n= 345

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 345 169 heavy drinker (0.5101449 0.4898551)
   2) x5>=43.5 90 25 heavy drinker (0.7222222 0.2777778)
     4) x4>=31 41 5 heavy drinker (0.8780488 0.1219512) *
     5) x4< 31 49 20 heavy drinker (0.5918367 0.4081633)
      10) x3< 29.5 26 4 heavy drinker (0.8461538 0.1538462) *
      11) x3>=29.5 23 7 light drinker (0.3043478 0.6956522) *
   3) x5< 43.5 255 111 light drinker (0.4352941 0.5647059)
     6) x5>=19.5 120 58 heavy drinker (0.5166667 0.4833333)
      12) x4>=20.5 83 35 heavy drinker (0.5783133 0.4216867)
        24) x1>=90.5 40 10 ? (0.7500000 0.2500000) *
        25) x1< 90.5 43 18 light drinker (0.4186047 0.5813953)
          50) x3< 23 11 2 heavy drinker (0.8181818 0.1818182) *
          51) x3>=23 32 9 light drinker (0.2812500 0.7187500) *
      13) x4< 20.5 37 14 light drinker (0.3783784 0.6216216) *
     7) x5< 19.5 ? ? light drinker (0.3629630 0.6370370)
      14) x2< 51 20 7 heavy drinker (0.6500000 0.3500000) *
      15) x2>=51 115 36 light drinker (0.3130435 0.6869565) *
Questions:
a) Explain what a diversity index is and in what way(s) a diversity index is used in classification
trees. (2)
b) Calculate the diversity index at the root node and explain what the value indicates. (2)
c) What is the purpose of Bonsai and Pruning techniques? Why do we apply any of these
techniques? (1)
d) Determine the predicted category of node 24. Justify your answer. (1)
24) x1>=90.5 40 10 ? (0.7500000 0.2500000) *
e) Obtain the following missing frequencies (?) for node 7: (2)
7) x5< 19.5 ? ? light drinker (0.3629630 0.6370370)
f) Define an appropriate rule or rules for people who have x1 levels greater than 80, and x2 levels
less than 50, and x3 levels greater than 30, and x4 levels greater than 35 and x5 levels greater
than 50. Give the probability that the rule(s) makes an incorrect classification (give the rule’s
error rate). (2)
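$DI_{parent} - WADI$, where $WADI = \dfrac{n_{child1}\,DI_{child1} + n_{child2}\,DI_{child2}}{n_{child1} + n_{child2}}$
$DI = 1 - \sum_{i=1}^{q} \rho_i^{2}$
$F = \left(\dfrac{(n-1-p)\,n_1 n_2}{p\,(n-2)\,(n_1+n_2)}\right) d^{2}$
Cut-off value $= \dfrac{n_1 Z_2 + n_2 Z_1}{n_1 + n_2}$
$F_{2,4845,0.05} = 2.997$
$F_{2,4845,0.01} = 4.610$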
END OF TEST2
