Final Assignment (60 points)

Problem 1 (15 points)

Here is a short dataset consisting of a binary outcome variable Y and two independent variables, X1 and X2. The independent variables are already normalized, so there is no need to normalize them further.

X1 X2 Y
3 5 1
1 4 0
3 2 0
2 2 1
4 1 1

a) Suppose you are asked to predict the outcome for (X1, X2) = (4, 4). Use KNN with k = 3 to
predict this outcome. You can use Euclidean distance as the distance measure.
b) Predict the outcome with k = 5.

c) Use k = 3 with Manhattan distance and re-evaluate the prediction. Should the prediction
with k = 5 change? Why or why not? Is there another name for the prediction with k = 5?
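The neighbour computations in parts (a)-(c) can be checked with a short script. This is a minimal sketch in plain Python; knn_predict is an illustrative helper, not part of any required library, and its tie-breaking simply follows row order.

```python
from collections import Counter

# Dataset from Problem 1: (X1, X2, Y)
data = [(3, 5, 1), (1, 4, 0), (3, 2, 0), (2, 2, 1), (4, 1, 1)]

def knn_predict(query, k, metric="euclidean"):
    """Majority vote over the k nearest neighbours of `query`."""
    def dist(row):
        dx, dy = row[0] - query[0], row[1] - query[1]
        if metric == "euclidean":
            return (dx * dx + dy * dy) ** 0.5
        return abs(dx) + abs(dy)  # Manhattan (L1) distance

    neighbours = sorted(data, key=dist)[:k]
    votes = Counter(y for _, _, y in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((4, 4), k=3))                       # part (a), Euclidean
print(knn_predict((4, 4), k=5))                       # part (b): k = n, every record votes
print(knn_predict((4, 4), k=3, metric="manhattan"))   # part (c): note the ties at distance 3
```

Note that with k = 5 every record participates in the vote regardless of its distance to the query, which is worth keeping in mind for part (c)'s naming question; and under Manhattan distance three records are tied at distance 3, so the k = 3 answer depends on how ties are broken.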
Problem 2 (30 points)

Consider the same dataset as in Problem 1.

X1 X2 Y
3 5 1
1 4 0
3 2 0
2 2 1
4 1 1

a) We would now like to fit a classification tree to the above dataset. Consider the split X1 =
2.5. Compute the weighted Gini index of this split.
b) Suppose we change the split to X1 = 3.5. Provide an argument for why this split is better or
worse.
c) Based on your judgment, introduce a split on X2 in addition to the previous split on X1
(either 2.5 or 3.5). Show that this split improves fit and draw the corresponding decision tree.
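As a sanity check on parts (a) and (b), the weighted Gini index of a threshold split can be computed directly. This is a minimal sketch; weighted_gini is an illustrative helper, and the convention used here sends rows with feature value at or below the threshold to the left child.

```python
# Dataset from Problem 2: (X1, X2, Y)
data = [(3, 5, 1), (1, 4, 0), (3, 2, 0), (2, 2, 1), (4, 1, 1)]

def gini(labels):
    """Gini impurity 1 - p0^2 - p1^2 of a list of binary labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def weighted_gini(rows, feature, threshold):
    """Size-weighted average impurity of the two children of a split.

    feature: 0 for X1, 1 for X2; row[2] is the label Y.
    """
    left = [row[2] for row in rows if row[feature] <= threshold]
    right = [row[2] for row in rows if row[feature] > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(weighted_gini(data, 0, 2.5))  # split X1 = 2.5 -> 7/15, about 0.467
print(weighted_gini(data, 0, 3.5))  # split X1 = 3.5 -> 0.4 (lower means purer children)
```

For part (c), the same weighted_gini call can be reapplied to the rows of one child node with feature=1 (X2) to check whether a second split lowers the impurity further.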
Problem 3 (15 points)

For the same dataset, ignore the Y variable and consider only the X variables:

Record  X1  X2
R1      3   5
R2      1   4
R3      3   2
R4      2   2
R5      4   1
We are now interested in a clustering exercise.

a) Fill in a distance matrix giving the distance from each record to every other record.
Use either Euclidean or Manhattan distance, whichever is more convenient.

R1 R2 R3 R4 R5
R1
R2
R3
R4
R5
b) Construct 2 clusters based on the distance matrix above. Can you improve these clusters?
How would you measure this improvement?
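Parts (a) and (b) can be cross-checked with a short script. This is a minimal sketch using Manhattan distance and single-linkage agglomerative merging; both the merging rule and the within-cluster score are illustrative choices, not the only valid ones.

```python
from itertools import combinations

# Records from Problem 3
points = {"R1": (3, 5), "R2": (1, 4), "R3": (3, 2), "R4": (2, 2), "R5": (4, 1)}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Part (a): pairwise distance matrix, one entry per unordered record pair
dist = {(i, j): manhattan(points[i], points[j]) for i, j in combinations(points, 2)}

# Part (b): single-linkage agglomerative merging down to 2 clusters
clusters = [{r} for r in points]
while len(clusters) > 2:
    # find the pair of clusters whose closest members are nearest
    a, b = min(combinations(range(len(clusters)), 2),
               key=lambda p: min(manhattan(points[i], points[j])
                                 for i in clusters[p[0]] for j in clusters[p[1]]))
    clusters[a] |= clusters[b]
    del clusters[b]

# One way to measure improvement: total within-cluster pairwise distance
def within(cl):
    return sum(manhattan(points[i], points[j])
               for c in cl for i, j in combinations(sorted(c), 2))

print(clusters)          # the resulting 2 clusters
print(within(clusters))  # lower total means tighter clusters
```

Any alternative 2-way partition can be scored with the same within() function, which gives a concrete way to compare candidate clusterings for part (b); cohesion measures such as within-cluster sums of distances (or a silhouette-style score) are standard choices for this.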
