Assignment 6 (Sol.)
Introduction to Machine Learning
Prof. B. Ravindran
1. Assume that you are given a data set and a neural network model trained on the data set. You
are asked to build a decision tree model with the sole purpose of understanding/interpreting
the built neural network model. In such a scenario, which among the following measures would
you concentrate most on optimising?
(a) Accuracy of the decision tree model on the given data set
(b) F1 measure of the decision tree model on the given data set
(c) Fidelity of the decision tree model, which is the fraction of instances on which the neural
network and the decision tree give the same output
(d) Comprehensibility of the decision tree model, measured in terms of the size of the corre-
sponding rule set
Sol. (c)
Here the aim is not the traditional one of modelling the data well; rather, it is to build a
decision tree model that is as close to the existing neural network model as possible, so that
we can use the decision tree to interpret the neural network model. Hence, we optimise the
fidelity measure.
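To make the fidelity measure concrete, here is a minimal sketch in Python (assuming, purely as an illustration, that both models follow the scikit-learn predict() convention; the names tree_model and nn_model are placeholders):

    import numpy as np

    def fidelity(tree_model, nn_model, X):
        # Fraction of instances on which the decision tree agrees with
        # the neural network's predictions (not with the true labels).
        tree_preds = tree_model.predict(X)
        nn_preds = nn_model.predict(X)
        return np.mean(tree_preds == nn_preds)

Note that, to maximise fidelity, such a tree is typically trained on the network's predictions for the training inputs rather than on the original labels.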
2. Which of the following properties are characteristic of decision trees?
(b) It has the potential to affect the performance of the resultant regression/classification
model
(c) It is computationally infeasible
(a) Identify the best partition of the input space and response per partition to minimise sum
of squares error
(b) Identify the best approximation of the above by the greedy approach (to identifying the
partitions)
(c) Identify the model which gives the best performance using the greedy approximation
(option (b)) with the smallest partition scheme
(d) Identify the model which gives performance close to the best performance (option (a))
with the smallest partition scheme
(e) Identify the model which gives performance close to the best greedy approximation per-
formance (option (b)) with the smallest partition scheme
Sol. (e)
As discussed in class, we use a greedy approximation to identify the partitions, and we
typically apply pruning techniques, which yield a smaller tree with possibly some degradation
in performance.
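To make the greedy approximation concrete, the sketch below (illustrative, not from the course material) scores every candidate threshold on a single feature by the sum of squared errors around the per-branch means and keeps the best one; a full regression tree repeats this over all features at each node and recurses:

    import numpy as np

    def best_split(x, y):
        # Greedy search for the threshold on one feature that minimises
        # the total sum of squares, each side predicting its own mean.
        x, y = np.asarray(x, float), np.asarray(y, float)
        order = np.argsort(x)
        x, y = x[order], y[order]
        best_sse, best_t = np.inf, None
        for i in range(1, len(x)):
            if x[i] == x[i - 1]:
                continue  # no threshold fits between identical values
            left, right = y[:i], y[i:]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_sse, best_t = sse, (x[i - 1] + x[i]) / 2
        return best_t, best_sse

Pruning then trades away a little of this greedy performance for a smaller tree, which is exactly the trade-off described in option (e).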
5. Having built a decision tree, we are using reduced error pruning to reduce the size of the
tree. We select a node to collapse. For this particular node, on the left branch, there are 3
training data points with the following outputs: 5, 7, 9.6 and for the right branch, there are
four training data points with the following outputs: 8.7, 9.8, 10.5, 11. What were the original
responses for data points along the two branches (left & right respectively) and what is the
new response after collapsing the node?
Sol. (c)
Original responses:
Left: (5 + 7 + 9.6)/3 = 21.6/3 = 7.2
Right: (8.7 + 9.8 + 10.5 + 11)/4 = 40/4 = 10
New response: 7.2 ∗ (3/7) + 10 ∗ (4/7) = 61.6/7 = 8.8
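The same computation in a few lines of Python, with the outputs from the question hard-coded:

    left = [5, 7, 9.6]
    right = [8.7, 9.8, 10.5, 11]

    # Responses before collapsing: the mean of each branch
    left_mean = sum(left) / len(left)      # 7.2
    right_mean = sum(right) / len(right)   # 10.0

    # After collapsing, the node predicts the mean of all seven points,
    # i.e. the size-weighted average of the two branch means
    new_response = (sum(left) + sum(right)) / (len(left) + len(right))  # 8.8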
6. Given that we can select the same feature multiple times during the recursive partitioning of
the input space, is it always possible to achieve 100% accuracy on the training data (given
that we allow for trees to grow to their maximum size) when building decision trees?
(a) Yes
(b) No
Sol. (b)
Consider a pair of data points with identical input features but different class labels. Such
points can occur in the training data, but since they necessarily fall into the same partition
of the input space, no decision tree can classify both of them correctly.
7. Suppose on performing reduced error pruning, we collapsed a node and observed an improve-
ment in the prediction accuracy on the validation set. Which among the following statements
are possible in light of the performance improvement observed?
(a) The collapsed node helped overcome the effect of one or more noise affected data points
in the training set
(b) The validation set had one or more noise affected data points in the region corresponding
to the collapsed node
(c) The validation set did not have any data points along at least one of the collapsed branches
(d) The validation set did have data points adversely affected by the collapsed node
8. Considering ’profitable’ as the binary-valued attribute we are trying to predict, which of
the attributes would you select as the root in a decision tree with multi-way splits using the
cross-entropy impurity measure?
(a) price
(b) maintenance
(c) capacity
(d) airbag
Sol. (c)
cross entropy_price(D) = (4/11)(−(2/4) log2(2/4) − (2/4) log2(2/4)) + (4/11)(−(2/4) log2(2/4) − (2/4) log2(2/4)) + (3/11)(−(2/3) log2(2/3) − (1/3) log2(1/3)) = 0.9777
cross entropy_maintenance(D) = (2/11)(−(2/2) log2(2/2) − (0/2) log2(0/2)) + (4/11)(−(2/4) log2(2/4) − (2/4) log2(2/4)) + (5/11)(−(2/5) log2(2/5) − (3/5) log2(3/5)) = 0.8050
cross entropy_capacity(D) = (3/11)(−(1/3) log2(1/3) − (2/3) log2(2/3)) + (6/11)(−(3/6) log2(3/6) − (3/6) log2(3/6)) + (2/11)(−(2/2) log2(2/2) − (0/2) log2(0/2)) = 0.7959
cross entropy_airbag(D) = (5/11)(−(3/5) log2(3/5) − (2/5) log2(2/5)) + (6/11)(−(3/6) log2(3/6) − (3/6) log2(3/6)) = 0.9868
Since capacity yields the lowest cross-entropy, it is selected as the root.
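These weighted entropies can be verified mechanically; the sketch below reads the class counts per attribute value directly off the fractions above:

    import numpy as np

    def split_entropy(groups):
        # Weighted cross-entropy of a multi-way split; each group holds
        # the class counts for one value of the attribute.
        n = sum(sum(g) for g in groups)
        total = 0.0
        for g in groups:
            p = np.array(g) / sum(g)
            p = p[p > 0]  # by convention, 0 * log2(0) = 0
            total += (sum(g) / n) * -(p * np.log2(p)).sum()
        return total

    print(split_entropy([[2, 2], [2, 2], [2, 1]]))  # price: 0.9777
    print(split_entropy([[2, 0], [2, 2], [2, 3]]))  # maintenance: 0.8050
    print(split_entropy([[1, 2], [3, 3], [2, 0]]))  # capacity: 0.7959
    print(split_entropy([[3, 2], [3, 3]]))          # airbag: 0.9868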
9. For the same data set, suppose we decide to construct a decision tree using binary splits and the
Gini index impurity measure. Which among the following feature and split point combinations
would be the best to use as the root node assuming that we consider each of the input features
to be unordered?
Sol. (c)
gini_price({low,med}|{high})(D) = (8/11) ∗ 2 ∗ (4/8) ∗ (4/8) + (3/11) ∗ 2 ∗ (2/3) ∗ (1/3) = 0.4848
gini_maintenance({high}|{med,low})(D) = (5/11) ∗ 2 ∗ (2/5) ∗ (3/5) + (6/11) ∗ 2 ∗ (4/6) ∗ (2/6) = 0.4606
gini_maintenance({high,med}|{low})(D) = (9/11) ∗ 2 ∗ (4/9) ∗ (5/9) + (2/11) ∗ 2 ∗ 1 ∗ 0 = 0.4040
gini_capacity({2}|{4,5})(D) = (3/11) ∗ 2 ∗ (1/3) ∗ (2/3) + (8/11) ∗ 2 ∗ (5/8) ∗ (3/8) = 0.4621
Since the split maintenance({high,med}|{low}) yields the lowest Gini index, it is the best choice for the root.
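A similar check for the Gini computations; for two classes, the Gini index of a group with counts (a, b) is 2 ∗ (a/m) ∗ (b/m) with m = a + b, matching the 2 ∗ p ∗ q form used above:

    def split_gini(groups):
        # Weighted Gini index of a binary split; each group holds the
        # two class counts on one side of the split.
        n = sum(a + b for a, b in groups)
        total = 0.0
        for a, b in groups:
            m = a + b
            total += (m / n) * 2 * (a / m) * (b / m)
        return total

    print(split_gini([(4, 4), (2, 1)]))  # price {low,med}|{high}: 0.4848
    print(split_gini([(2, 3), (4, 2)]))  # maintenance {high}|{med,low}: 0.4606
    print(split_gini([(4, 5), (2, 0)]))  # maintenance {high,med}|{low}: 0.4040
    print(split_gini([(1, 2), (5, 3)]))  # capacity {2}|{4,5}: 0.4621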
10. Consider building a spam filter for distinguishing between genuine e-mails and unwanted spam
e-mails. Assuming spam to be the positive class, which among the following would be more
important to optimise?
(a) Precision
(b) Recall
Sol. (a)
If we optimise recall, we may be able to capture more spam e-mails, but in the process, we
may also increase the number of genuine e-mails being predicted as spam. On the other hand,
if we optimise precision, we may not capture as many spam e-mails as in the previous
approach, but a high percentage of the e-mails classified as spam will actually be spam.
Of the two possible trade-offs, we would prefer the latter (optimising precision), since in the
former approach, we risk more genuine e-mails being classified as spam, which is a costlier
error to make in a spam filtering application (as compared to missing a few spam e-mails,
which the user can deal with manually).
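For reference, precision and recall with spam as the positive class can be computed from the confusion counts as in the following minimal sketch:

    def precision_recall(y_true, y_pred, positive="spam"):
        # Precision = TP / (TP + FP): of the mails flagged as spam,
        # how many really are spam. Recall = TP / (TP + FN): of the
        # true spam, how much was caught.
        tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
        fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
        fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

A filter that optimises precision keeps false positives (genuine mail flagged as spam) low, at the cost of more false negatives (spam that slips through).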
Weka-based assignment questions
In this assignment, we will use the UCI Mushroom data set available here.
We will be using the J48 decision tree algorithm which can be found in Weka under classi-
fiers/trees.
We will consider the following to be the default parameter settings:
[Screenshot of the default J48 parameter settings]
Note the following:
The class for which the prediction model is to be learned is named ’class’ and is the first
attribute in the data.
We will use the default ’Cross-validation’ test option with ’Folds = 10’.
Once a decision tree model has been built, you can right click on the corresponding entry in the
’Result list’ pane on the bottom left, and select ’Visualize tree’ to see a visual representation
of the learned tree.
11. How many levels does the unpruned tree contain considering multi-way and binary splits
respectively, with the other parameters remaining the same as above?
(a) 6, 8
(b) 6, 7
(c) 5, 7
(d) 5, 8
Sol. (b)
Multi-way split: [tree visualization]
Binary split: [tree visualization]
12. How many levels does the pruned tree (unpruned = false, reducedErrorPruning = false) contain
considering multi-way and binary splits respectively?
(a) 6, 6
(b) 6, 7
(c) 5, 7
(d) 5, 6
Sol. (a)
Multi-way split: [tree visualization]
Binary split: [tree visualization]
13. Consider the effect of the parameter minNumObj. Try modifying the parameter value and
observe the changes in the performance values. For example, considering multi-way and
binary splits respectively, with the other parameters set to the defaults discussed above,
what is the maximum value of the parameter at which we still observe zero error?
(a) 11, 7
(b) 11, 6
(c) 10, 6
(d) 10, 7
Sol. (c)
This can be observed by trying out the listed values with the default parameters. The idea is
for you to appreciate the significance of the minNumObj parameter.
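If you want to reproduce this effect outside of Weka, a rough analogue (an assumption on our part: J48 implements C4.5 while scikit-learn's tree is CART with binary splits only, so the numbers will not match Weka exactly) is the min_samples_leaf parameter:

    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier

    # Fetch the UCI mushroom data from OpenML (dataset name assumed to
    # be "mushroom") and encode the categorical attributes numerically.
    data = fetch_openml("mushroom", version=1, as_frame=True)
    X = OrdinalEncoder().fit_transform(data.data.astype(str))
    y = data.target

    # Larger leaf-size constraints force smaller trees, analogous to
    # increasing minNumObj in J48.
    for m in (2, 6, 10, 20):
        clf = DecisionTreeClassifier(min_samples_leaf=m, random_state=0)
        print(m, cross_val_score(clf, X, y, cv=10).mean())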
14. Which among the following pairs of attributes seem to be the most important for this particular
classification task?
Sol. (d)
From the multiple models built using binary and multi-way splits, with and without pruning,
we observe that the attributes odor and spore-print-color consistently appear at the higher
levels of the trees, indicating their importance for this classification task.