Homework 2
Instructions
• Late homework policy: Homework is worth full credit if submitted before the due date, half credit
during the next 48 hours, and zero credit after that. You must turn in at least n−1 of the n homeworks
to pass the class, even if for zero credit.
• Collaboration policy: Homeworks must be done individually, except where otherwise noted in the
assignments. “Individually” means each student must hand in their own answers, and each student
must write and use their own code in the programming parts of the assignment. It is acceptable for
students to collaborate in figuring out answers and to help each other solve the problems, though you
must in the end write up your own solutions individually, and you must list the names of students you
discussed this with. We will be assuming that, as participants in a graduate course, you will be taking
the responsibility to make sure you personally understand the solution to any work arising from such
collaboration.
• Online submission: You must submit your solutions online on autolab. We recommend that you
use LaTeX, but we will accept scanned solutions as well. On the Homework 2 autolab page, you can
download the submission template, which is a tar archive containing blank placeholder pdfs for each of
the three problems. Replace each of these pdf files with your solutions for the corresponding problem,
create a new tar archive of the top-level directory, and submit your archived solutions online by clicking
the “Submit File” button. You should submit a single tar archive identical to the template, except
with each of the problem pdfs replaced with your solutions.
DO NOT change the name of any of the files or folders in the submission template. In other words,
your submitted pdfs should have exactly the same names as those in the submission template. Do not
modify the directory structure.
(a) [4 Points] Prove Bayes rule:
P (A|B) = P (B|A) P (A) / P (B).
(b) [4 Points] For events A, B, and C, rewrite P (A, B, C) as a product of several conditional probabilities
and one unconditional probability involving a single event. Your conditional probabilities can use only
one event on the left side of the conditioning bar. For example, P (A|C) and P (A) would be okay, but
P (A, B|C) is not.
(c) [4 Points] Let A be any event, and let X be a random variable defined by
X = 1 if event A occurs, and X = 0 otherwise.
X is sometimes called the indicator random variable for the event A. Show that E[X] = P (A), where
E[X] denotes the expected value of X.
(d) Let X, Y , and Z be random variables taking values in {0, 1}. The following table lists the probability
of each possible assignment of 0 and 1 to the variables X, Y , and Z:
              Z = 0             Z = 1
          X = 0   X = 1     X = 0   X = 1
Y = 0      1/15    1/15      4/15    2/15
Y = 1      1/10    1/10      8/45    4/45
For example, P (X = 0, Y = 1, Z = 0) = 1/10 and P (X = 1, Y = 1, Z = 1) = 4/45.
(i) [4 Points] Is X independent of Y ? Why or why not?
(ii) [4 Points] Is X conditionally independent of Y given Z? Why or why not?
(iii) [4 Points] Calculate P (X = 0|X + Y > 0).
Problem 2: Maximum Likelihood and MAP Estimation
Suppose that X1 , . . . , Xn are iid¹ random variables, each drawn from a Bernoulli distribution with unknown parameter θ ∈ [0, 1]; that is, P (Xi = 1) = θ and P (Xi = 0) = 1 − θ for each i.
(a) [4 Points] Write a formula for the likelihood function, L(θ̂). Your function should depend on the random
variables X1 , . . . , Xn and the hypothetical parameter θ̂. Does the likelihood function depend on the
order of the random variables?
(b) [4 Points] Suppose that n = 10 and the data set contains six 1s and four 0s. Write a short computer
program that plots the likelihood function of this data for each value of θ̂ in {0, 0.01, 0.02, . . . , 1.0}. For
the plot, the x-axis should be θ̂ and the y-axis L(θ̂). Scale your y-axis so that you can see some variation
in its value. Please submit both the plot and the code that made it. Please include all plots for this
question in the problem2.pdf file, as well as the source code for producing them. That is, do not submit
the source code and plots as separate files. (A minimal sketch of such a plotting script is given after part (e), as a starting point.)
(c) [4 Points] Estimate θ̂MLE by marking on the x-axis the value of θ̂ that maximizes the likelihood. Find
a closed-form formula for the MLE. Does the closed form agree with the plot?
(d) [4 Points] Create three more likelihood plots: one where n = 5 and the data set contains three 1s and
two 0s; one where n = 100 and the data set contains sixty 1s and forty 0s; and one where n = 10 and
there are five 1s and five 0s.
(e) [4 Points] Describe how the likelihood functions and maximum likelihood estimates compare for the
different data sets.
¹ iid means Independent, Identically Distributed.
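For part (b), here is a minimal sketch of one way such a plotting script could look. It assumes Python with NumPy and Matplotlib, and it assumes the likelihood takes the product-of-Bernoulli-probabilities form that part (a) asks you to derive; treat it as a starting point rather than a reference solution.

    import numpy as np
    import matplotlib.pyplot as plt

    def likelihood(theta_hat, num_ones, num_zeros):
        # Product of Bernoulli probabilities for a data set with the given counts.
        # This product form is an assumption here; part (a) asks you to derive it.
        return theta_hat ** num_ones * (1.0 - theta_hat) ** num_zeros

    theta_grid = np.arange(0.0, 1.001, 0.01)   # theta_hat in {0, 0.01, ..., 1.0}
    L_values = likelihood(theta_grid, num_ones=6, num_zeros=4)

    plt.plot(theta_grid, L_values)
    plt.xlabel("theta_hat")
    plt.ylabel("L(theta_hat)")
    plt.title("Likelihood for six 1s and four 0s (n = 10)")
    plt.show()

The same structure works for the other plots in parts (d) and (f); only the counts (and, for part (f), the function being plotted) change.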
Maximum a Posteriori Probability Estimation
In maximum likelihood estimation, we treated the true parameter value θ as a fixed (non-random) number.
In cases where we have some prior knowledge about θ, it is useful to treat θ itself as a random variable, and
express our prior knowledge in the form of a prior probability distribution over θ. For example, suppose that
the X1 , . . . , Xn are generated in the following way:
• First, the value of θ is drawn from a given prior probability distribution
• Second, X1 , . . . , Xn are drawn independently from a Bernoulli distribution using this value for θ.
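Purely as an illustration of this two-step generative process (it is not needed to answer the questions), here is a minimal sketch in Python, assuming NumPy; the Beta(3, 3) prior is used only because it matches the prior introduced later in this problem.

    import numpy as np

    rng = np.random.default_rng(0)

    # Step 1: draw theta from a prior distribution over [0, 1]
    # (Beta(3, 3) here, matching the prior used later in this problem).
    theta = rng.beta(3, 3)

    # Step 2: draw X_1, ..., X_n independently from Bernoulli(theta).
    n = 10
    X = rng.binomial(1, theta, size=n)

    print("theta =", theta)
    print("sample =", X)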
Since both θ and the sequence X1 , . . . , Xn are random, they have a joint probability distribution. In this
setting, a natural way to estimate the value of θ is to simply choose its most probable value given its prior
distribution plus the observed data X1 , . . . , Xn .
This is called the maximum a posteriori probability (MAP) estimate of θ. Using Bayes rule, we can rewrite
the posterior probability as follows:
P (θ = θ̂ | X1 , . . . , Xn ) = P (X1 , . . . , Xn | θ = θ̂) P (θ = θ̂) / P (X1 , . . . , Xn ).
Since the probability in the denominator does not depend on θ̂, the MAP estimate is given by
θ̂MAP = argmax_θ̂ P (X1 , . . . , Xn | θ = θ̂) P (θ = θ̂) = argmax_θ̂ L(θ̂) P (θ = θ̂).
In words, the MAP estimate for θ is the value θ̂ that maximizes the likelihood function multiplied by the
prior distribution on θ. When the prior on θ is a continuous distribution with density function p, then the
MAP estimate for θ is given by
θ̂MAP = argmax_θ̂ L(θ̂) p(θ̂).
For this problem, we will use a Beta(3,3) prior distribution for θ, which has density function given by
p(θ̂) = θ̂² (1 − θ̂)² / B(3, 3),
where B(3, 3) is the normalizing constant of the Beta(3,3) distribution.
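For reference, the normalizing constant can be evaluated explicitly: B(3, 3) = Γ(3)Γ(3)/Γ(6) = (2! · 2!)/5! = 4/120 = 1/30, so the density can equivalently be written p(θ̂) = 30 θ̂² (1 − θ̂)².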
(f) [4 Points] Suppose, as in part (c), that n = 10 and we observed six 1s and four 0s. Write a short
computer program that plots the function θ̂ ↦ L(θ̂)p(θ̂) for the same values of θ̂ as in part (c).
(g) [4 Points] Estimate θ̂MAP by marking on the x-axis the value of θ̂ that maximizes the function. Find
a closed form formula for the MAP estimate. Does the closed form agree with the plot?
(h) [4 Points] Compare the MAP estimate to the MLE computed from the same data in part (c). Briefly
explain any significant difference.
(i) [4 Points] Comment on the relationship between the MAP and MLE estimates as n goes to infinity.
Problem 3: Splitting Heuristic for Decision Trees
Recall that the ID3 algorithm iteratively grows a decision tree from the root downwards. On each iteration,
the algorithm replaces one leaf node with an internal node that splits the data based on one decision attribute
(or feature). In particular, the ID3 algorithm chooses the split that reduces the entropy the most, but there
are other choices. For example, since our goal in the end is to have the lowest error, why not instead choose
the split that reduces error the most? In this problem we will explore one reason why reducing entropy is a
better criterion.
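To make the entropy criterion concrete, here is a minimal sketch of how the entropy of a binary label, and the conditional entropy of the label after splitting on a binary feature, might be computed. Python with NumPy is assumed, and the function names are illustrative only; they do not come from any particular library.

    import numpy as np

    def entropy(y):
        # Entropy (in bits) of a binary label vector y containing 0s and 1s.
        y = np.asarray(y)
        p = y.mean()
        if p == 0.0 or p == 1.0:
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def conditional_entropy(y, x):
        # Entropy of the label y after splitting on the binary feature x, i.e. H(Y | X).
        y, x = np.asarray(y), np.asarray(x)
        h = 0.0
        for value in (0, 1):
            mask = (x == value)
            if mask.any():
                h += mask.mean() * entropy(y[mask])
        return h

ID3's information gain for a split is then entropy(y) - conditional_entropy(y, x); choosing the split that reduces entropy the most means choosing the feature with the largest gain.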
Consider the following simple setting. Let us suppose each example is described by n boolean features:
X = ⟨X1 , . . . , Xn ⟩, where Xi ∈ {0, 1}, and where n ≥ 4. Furthermore, the target function to be learned is
f : X → Y , where Y = X1 ∨ X2 ∨ X3 . That is, Y = 1 if X1 = 1 or X2 = 1 or X3 = 1, and Y = 0 otherwise.
Suppose that your training data contains all of the 2n possible examples, each labeled by f . For example,
when n = 4, the data set would be
X1 X2 X3 X4 Y X1 X2 X3 X4 Y
0 0 0 0 0 0 0 0 1 0
1 0 0 0 1 1 0 0 1 1
0 1 0 0 1 0 1 0 1 1
1 1 0 0 1 1 1 0 1 1
0 0 1 0 1 0 0 1 1 1
1 0 1 0 1 1 0 1 1 1
0 1 1 0 1 0 1 1 1 1
1 1 1 0 1 1 1 1 1 1
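If you want to sanity-check your answers empirically, the full training set for a given n can be enumerated with a short script such as the following sketch; Python is assumed, and the helper name make_dataset is illustrative only.

    from itertools import product

    def make_dataset(n):
        # All 2^n Boolean examples (n >= 4 here), each labeled by Y = X1 or X2 or X3.
        data = []
        for x in product((0, 1), repeat=n):
            y = 1 if (x[0] or x[1] or x[2]) else 0
            data.append((x, y))
        return data

    # Example: the n = 4 data set shown in the table above has 16 rows.
    # print(len(make_dataset(4)))   # 16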
(a) [4 Points] How many mistakes does the best 1-leaf decision tree make, over the 2n training examples?
(The 1-leaf decision tree does not split the data even once.)
(b) [4 Points] Is there a split that reduces the number of mistakes by at least one? (That is, is there a
decision tree with one internal node that makes fewer mistakes than your answer to part (a)?) Why or why not?
(c) [4 Points] What is the entropy of the output label Y for the 1-leaf decision tree (no splits at all)?
(d) [4 Points] Is there a split that reduces the entropy of the output Y by a non-zero amount? If so, what
is it, and what is the resulting conditional entropy of Y given this split?