STA3022Test2 2023 v2
a) Describe the type of data or observations that are suitable for the application of MCA. (3)
b) What is the algebraic relationship between the indicator matrix, X, of MCA and the Burt
matrix? (2)
c) Give one advantage of using the Burt matrix over the indicator matrix when carrying out
MCA. (1)
d) Name an alternative technique to MCA and state which matrix it uses as input.
Mention one advantage it has over MCA. (2)
In a certain company 100 employees were rated on their suitability for an advanced training course
in computer programming based on eight ratings given by their manager (rated 1=low to 5=high) on
the following:
i. Intellect.
ii. Interest in doing the course.
iii. Experience of computer programming.
iv. Likelihood of them staying with the company.
v. Commitment to the company.
vi. Loyalty to their team.
vii. Number of GCSEs.
viii. Score on a computer programming aptitude test.
Suppose you were told that items iv, v and vi are meant to measure the concept “loyalty”.
a) What technique or method would you use to analyse the items? (1)
b) What is meant by internal consistency or reliability of the group of items? (1)
c) Name the formula you would use to calculate the internal consistency or reliability of a
group of items. (1)
d) Write down the formula named in (c) and define each term used in the formula. (3)
e) What two other techniques can be used to improve the reliability of a group of items if the
technique named in (c) shows that the group of items is not reliable? (2)
f) Suppose you established that the group of items was reliable using the techniques above.
What does being reliable actually mean in this context in relation to the “concept” being
measured? (1)
a) What is the distance between clusters C1 and C2 using Single Linkage? (1)
b) What is the distance between clusters C1 and C2 using Complete Linkage? (1)
c) What clusters are merged at the next iteration using Single Linkage? (2)
d) Consider the dendrogram below. If this dendrogram is cut to create 3 clusters, what would the
clusters be? (1)
a) What might motivate us to scale the data before using it? (1)
b) K-means was used to cluster the data with the number of clusters, k, set to 1, 2, …, 14. The
scree plot below is a graph of the within-group sum of squares plotted on the y-axis against
the number of clusters, k, plotted on the x-axis. From the plot, what should be the ‘optimal’
number of clusters to use? Explain. (2)
c) Give two disadvantages of using K-means for clustering. (2)
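The quantity plotted in the scree plot described in (b) can be sketched as follows. This is a minimal illustration only: it uses a basic Lloyd's algorithm on synthetic two-dimensional data, not the test data, and the names `kmeans_wss`, `squared_distance` and `mean_point` are hypothetical helpers rather than functions from any particular package.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_wss(points, k, n_iter=50, seed=0):
    """Run a basic Lloyd's algorithm and return the sum of squared
    distances from each point to its nearest final centre."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # k observations as starting centres
    for _ in range(n_iter):
        # Assignment step: each point joins the group of its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centres[j]))
            groups[nearest].append(p)
        # Update step: move each centre to its group mean (an empty group keeps its centre).
        new_centres = [mean_point(g) if g else centres[j] for j, g in enumerate(groups)]
        if new_centres == centres:
            break
        centres = new_centres
    return sum(min(squared_distance(p, c) for c in centres) for p in points)


# Synthetic data (an assumption for illustration): three well-separated groups.
rng = random.Random(1)
true_centres = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
data = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in true_centres for _ in range(20)]

# Within-group sum of squares for k = 1, 2, ..., 14, as in the scree plot.
wss_by_k = {k: kmeans_wss(data, k) for k in range(1, 15)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

Plotting `wss_by_k` against k gives the scree plot; a single random start is used here, so the values for larger k need not decrease strictly.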
The BUPA Liver Disorders dataset was created by BUPA Medical Research and Development Ltd.
(BMRDL) during the 1980s as part of a larger health screening database. There are 345
observations each corresponding to one human subject. The first 5 columns (x1-x5) are
integer-valued and represent the results of various blood tests which are thought to be sensitive to
liver disorders that might arise from excessive alcohol consumption. The sixth column (y_num) is
real-valued and represents the number of drinks (equivalent of half pints of beer) taken per day by
the subject, self-reported. The dataset does not contain any variable representing presence or
absence of a liver disorder.
Ref: http://archive.ics.uci.edu/ml/datasets/liver+disorders
The numerical y_num variable (number of half-pint equivalents of alcoholic beverages drunk per
day) has been dichotomised (converted into the binary variable y_cat_disc) using the following rule:
y_cat_disc =
“heavy drinker” if y_num ≥ 3
“light drinker” if y_num < 3
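As a minimal sketch of the dichotomisation, the Python snippet below applies the threshold of 3 drinks per day to the y_num values of the six respondents printed in the head(disorder) listing. The function name `dichotomise` is hypothetical; values below 3 are labelled "light drinker", matching the labels shown in that listing.

```python
def dichotomise(y_num, threshold=3.0):
    """Return the drinking category for a daily number of drinks."""
    # At or above the threshold: "heavy drinker".
    # Below the threshold: "light drinker" (label taken from the data listing).
    return "heavy drinker" if y_num >= threshold else "light drinker"


# y_num for respondents 1 to 6, copied from the head(disorder) listing.
y_num_values = [3.0, 4.0, 0.5, 6.0, 1.0, 5.0]
y_cat_disc = [dichotomise(v) for v in y_num_values]

for respondent_id, (v, label) in enumerate(zip(y_num_values, y_cat_disc), start=1):
    print(respondent_id, v, label)
```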
The researchers applied stepwise discriminant analysis to the data and found the following results:
Data for the first 6 respondents
> head(disorder)
ID  x1 x2 x3 x4 x5 y_num y_cat_disc
 1  86 48 20 20  6   3.0 heavy drinker
 2  83 45 19 21 13   4.0 heavy drinker
 3  82 48 27 15 12   0.5 light drinker
 4  92 53 51 33 92   6.0 heavy drinker
 5 103 75 19 30 13   1.0 light drinker
 6  94 71 25 26 31   5.0 heavy drinker
Group means:
                   x1      x4
heavy drinker 91.1193 26.7045
light drinker 89.1598 22.4970
> centroidLightDrinker
[1] ?
> centroidHeavyDrinker
[1] -0.2839
Questions:
a) After applying a stepwise method, the discriminant analysis at the next step is applied using
only the variables x1 and x4. Write the discriminant function. (1)
b) Calculate the cut-off value and clearly indicate the classification rule. (2.5)
c) Data has been gathered for a person with the following profile:
x1 = 83
x4 = 15
What would be the predicted level of alcohol consumption for this person? Justify your answer.
(Hint: Obtain a predicted discriminant score.) (2.5)
The data was further explored by using classification trees for predicting the level of drinking using
all the variables:
Questions:
a) Explain what a diversity index is and in what way(s) a diversity index is used in classification
trees. (2)
b) Calculate the diversity index at the root node and explain what the value indicates. (2)
c) What is the purpose of bonsai and pruning techniques? Why do we apply any of these
techniques? (1)
d) Determine the predicted category of node 24 and justify your answer. (1)
e) Obtain the following missing frequencies (?) for node 7: (2)
f) Define an appropriate rule or rules for people who have x1 levels greater than 80, x2 levels
less than 50, x3 levels greater than 30, x4 levels greater than 35 and x5 levels greater
than 50. Give the probability that the rule(s) make an incorrect classification (the rule’s
error rate). (2)
$$= \frac{(n-1-p)\,n_1 n_2}{p\,(n-2)(n_1+n_2)}\,d^2$$
$$= \frac{n_1 Z_2 + n_2 Z_1}{n_1+n_2}$$
$$= 1 - \sum_{i=1}^{q} \rho_i^2$$
$$F_{2,\,4845,\,0.05} = 2.997 \qquad F_{2,\,4845,\,0.01} = 4.610$$
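A minimal sketch of how two of the formulae above might be evaluated, under stated assumptions: the first is read as an F statistic built from the squared distance d² between two groups with n = n1 + n2 and p variables, and the second as a cut-off score weighted by group sizes, where Z1 and Z2 are the group centroids. The function names and all numeric inputs are hypothetical and are not taken from the test data.

```python
def f_statistic(d_squared, n1, n2, p):
    """(n - 1 - p) n1 n2 / (p (n - 2) (n1 + n2)) * d^2, assuming n = n1 + n2."""
    n = n1 + n2  # assumption: n is the total sample size
    return ((n - 1 - p) * n1 * n2) / (p * (n - 2) * (n1 + n2)) * d_squared


def weighted_cutoff(z1, z2, n1, n2):
    """(n1 * Z2 + n2 * Z1) / (n1 + n2), with Z1 and Z2 the group centroids."""
    return (n1 * z2 + n2 * z1) / (n1 + n2)


# Hypothetical inputs chosen only to exercise the functions.
print(f_statistic(d_squared=1.0, n1=50, n2=50, p=2))
print(weighted_cutoff(z1=-1.0, z2=3.0, n1=30, n2=10))  # 2.0
print(weighted_cutoff(z1=-1.0, z2=1.0, n1=50, n2=50))  # 0.0, the midpoint for equal group sizes
```

With equal group sizes the weighted cut-off reduces to the midpoint of the two centroids; with unequal sizes it moves towards the centroid of the smaller group.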
END OF TEST2