10-601 Machine Learning: Homework 2

Due 5 p.m. Wednesday, January 28, 2015

Instructions
• Late homework policy: Homework is worth full credit if submitted before the due date, half credit
during the next 48 hours, and zero credit after that. You must turn in at least n−1 of the n homeworks
to pass the class, even if for zero credit.
• Collaboration policy: Homeworks must be done individually, except where otherwise noted in the
assignments. “Individually” means each student must hand in their own answers, and each student
must write and use their own code in the programming parts of the assignment. It is acceptable for
students to collaborate in figuring out answers and to help each other solve the problems, though you
must in the end write up your own solutions individually, and you must list the names of students you
discussed this with. We will be assuming that, as participants in a graduate course, you will be taking
the responsibility to make sure you personally understand the solution to any work arising from such
collaboration.
• Online submission: You must submit your solutions online on autolab. We recommend that you
use LaTeX, but we will accept scanned solutions as well. On the Homework 2 autolab page, you can
download the submission template, which is a tar archive containing blank placeholder pdfs for each of
the three problems. Replace each of these pdf files with your solutions for the corresponding problem,
create a new tar archive of the top-level directory, and submit your archived solutions online by clicking
the “Submit File” button. You should submit a single tar archive identical to the template, except
with each of the problem pdfs replaced with your solutions.
DO NOT change the name of any of the files or folders in the submission template. In other words,
your submitted pdfs should have exactly the same names as those in the submission template. Do not
modify the directory structure.

Problem 1: More Probability Review


(a) [4 Points] For events A and B, prove

P (A|B) = P (B|A) P (A) / P (B).

(b) [4 Points] For events A, B, and C, rewrite P (A, B, C) as a product of several conditional probabilities
and one unconditional probability involving a single event. Your conditional probabilities can use only
one event on the left side of the conditioning bar. For example, P (A|C) and P (A) would be okay, but
P (A, B|C) is not.
(c) [4 Points] Let A be any event, and let X be a random variable defined by
X = 1 if event A occurs, and X = 0 otherwise.

X is sometimes called the indicator random variable for the event A. Show that E[X] = P (A), where
E[X] denotes the expected value of X.

(d) Let X, Y , and Z be random variables taking values in {0, 1}. The following table lists the probability
of each possible assignment of 0 and 1 to the variables X, Y , and Z:
              Z = 0              Z = 1
          X = 0    X = 1     X = 0    X = 1
Y = 0      1/15     1/15      4/15     2/15
Y = 1      1/10     1/10      8/45     4/45
For example, P (X = 0, Y = 1, Z = 0) = 1/10 and P (X = 1, Y = 1, Z = 1) = 4/45.
(i) [4 Points] Is X independent of Y ? Why or why not?
(ii) [4 Points] Is X conditionally independent of Y given Z? Why or why not?
(iii) [4 Points] Calculate P (X = 0|X + Y > 0).

Problem 2: Maximum Likelihood and Maximum a Posteriori Estimation

This problem explores two different techniques for estimating an unknown parameter of a probability distribution: the maximum likelihood estimate (MLE) and the maximum a posteriori probability (MAP) estimate.
Suppose we observe the values of n iid¹ random variables X1 , . . . , Xn drawn from a single Bernoulli distribution with parameter θ. In other words, for each Xi , we know that
P (Xi = 1) = θ and P (Xi = 0) = 1 − θ.
Our goal is to estimate the value of θ from these observed values of X1 through Xn .

Maximum Likelihood Estimation


The first estimator of θ that we consider is the maximum likelihood estimator. For any hypothetical value
θ̂, we can compute the probability of observing the outcome X1 , . . . , Xn if the true parameter value θ were
equal to θ̂. This probability of the observed data is often called the data likelihood, and the function L(θ̂)
that maps each θ̂ to the corresponding likelihood is called the likelihood function. A natural way to estimate
the unknown parameter θ is to choose the θ̂ that maximizes the likelihood function. Formally,
θ̂MLE = argmax_θ̂ L(θ̂).
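
In practice, this maximization can be approximated by evaluating L(θ̂) on a dense grid of candidate values, as in the plotting exercise below. The following is a minimal sketch of that idea (not a required solution format), assuming NumPy and Matplotlib are available; the data array and variable names are illustrative and are not the data set used in this problem.

```python
# Sketch: evaluate the Bernoulli likelihood on a grid of candidate theta_hat
# values and mark the grid maximizer. Illustrative data only.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 0, 1, 1, 0])  # illustrative observations of X_1, ..., X_n

def likelihood(theta_hat, x):
    """L(theta_hat) = prod_i theta_hat^{x_i} * (1 - theta_hat)^{1 - x_i}."""
    return np.prod(theta_hat ** x * (1.0 - theta_hat) ** (1 - x))

grid = np.arange(0.0, 1.001, 0.01)            # candidate values of theta_hat
L = np.array([likelihood(t, x) for t in grid])
theta_mle_grid = grid[np.argmax(L)]           # grid approximation of the MLE

plt.plot(grid, L)
plt.xlabel("theta_hat")
plt.ylabel("L(theta_hat)")
plt.title(f"grid maximizer ≈ {theta_mle_grid:.2f}")
plt.show()
```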

(a) [4 Points] Write a formula for the likelihood function, L(θ̂). Your function should depend on the random
variables X1 , . . . , Xn and the hypothetical parameter θ̂. Does the likelihood function depend on the
order of the random variables?
(b) [4 Points] Suppose that n = 10 and the data set contains six 1s and four 0s. Write a short computer
program that plots the likelihood function of this data for each value of θ̂ in {0, 0.01, 0.02, . . . , 1.0}. For
the plot, the x-axis should be θ̂ and the y-axis L(θ̂). Scale your y-axis so that you can see some variation
in its value. Please submit both the plot and the code that made it. Please include all plots for this
question in the problem2.pdf file, as well as the source code for producing them. That is, do not submit
the source code and plots as separate files.
(c) [4 Points] Estimate θ̂MLE by marking on the x-axis the value of θ̂ that maximizes the likelihood. Find
a closed-form formula for the MLE. Does the closed form agree with the plot?
(d) [4 Points] Create three more likelihood plots: one where n = 5 and the data set contains three 1s and
two 0s; one where n = 100 and the data set contains sixty 1s and forty 0s; and one where n = 10 and
there are five 1s and five 0s.
(e) [4 Points] Describe how the likelihood functions and maximum likelihood estimates compare for the
different data sets.
¹ iid means independent, identically distributed.

Maximum a Posteriori Probability Estimation
In the maximum likelihood estimate, we treated the true parameter value θ as a fixed (non-random) number.
In cases where we have some prior knowledge about θ, it is useful to treat θ itself as a random variable, and
express our prior knowledge in the form of a prior probability distribution over θ. For example, suppose that
the X1 , . . . , Xn are generated in the following way:
• First, the value of θ is drawn from a given prior probability distribution.
• Second, X1 , . . . , Xn are drawn independently from a Bernoulli distribution using this value for θ.
Since both θ and the sequence X1 , . . . , Xn are random, they have a joint probability distribution. In this
setting, a natural way to estimate the value of θ is to simply choose its most probable value given its prior
distribution plus the observed data X1 , . . . , Xn .

θ̂MAP = argmax_θ̂ P (θ = θ̂ | X1 , . . . , Xn ).

This is called the maximum a posteriori probability (MAP) estimate of θ. Using Bayes rule, we can rewrite
the posterior probability as follows:

P (θ = θ̂ | X1 , . . . , Xn ) = P (X1 , . . . , Xn | θ = θ̂) P (θ = θ̂) / P (X1 , . . . , Xn ).

Since the probability in the denominator does not depend on θ̂, the MAP estimate is given by

θ̂MAP = argmax_θ̂ P (X1 , . . . , Xn | θ = θ̂) P (θ = θ̂)
     = argmax_θ̂ L(θ̂) P (θ = θ̂).

In words, the MAP estimate for θ is the value θ̂ that maximizes the likelihood function multiplied by the
prior distribution on θ. When the prior on θ is a continuous distribution with density function p, then the
MAP estimate for θ is given by
θ̂MAP = argmax_θ̂ L(θ̂) p(θ̂).

For this problem, we will use a Beta(3,3) prior distribution for θ, which has density function given by

p(θ̂) = θ̂^2 (1 − θ̂)^2 / B(3, 3),

where B(α, β) is the beta function and B(3, 3) ≈ 0.0333.
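
As with the likelihood, the MAP objective L(θ̂)p(θ̂) can be evaluated on a grid of candidate values. Below is a minimal sketch of that computation, assuming NumPy and Matplotlib and reusing the illustrative Bernoulli likelihood from the earlier sketch; the data are again illustrative, not the assignment's data set.

```python
# Sketch: evaluate L(theta_hat) * p(theta_hat) for a Beta(3, 3) prior on a grid.
# Illustrative data only.
import math
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 0, 1, 1, 0])  # illustrative observations

def likelihood(theta_hat, x):
    """Bernoulli likelihood, as in the earlier sketch."""
    return np.prod(theta_hat ** x * (1.0 - theta_hat) ** (1 - x))

def beta33_density(theta_hat):
    """Beta(3, 3) density: theta^2 (1 - theta)^2 / B(3, 3)."""
    B = math.gamma(3) * math.gamma(3) / math.gamma(6)  # = 4/120 = 1/30 ≈ 0.0333
    return theta_hat ** 2 * (1.0 - theta_hat) ** 2 / B

grid = np.arange(0.0, 1.001, 0.01)
objective = np.array([likelihood(t, x) * beta33_density(t) for t in grid])
theta_map_grid = grid[np.argmax(objective)]  # grid approximation of the MAP estimate

plt.plot(grid, objective)
plt.xlabel("theta_hat")
plt.ylabel("L(theta_hat) * p(theta_hat)")
plt.title(f"grid maximizer ≈ {theta_map_grid:.2f}")
plt.show()
```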

(f) [4 Points] Suppose, as in part (b), that n = 10 and we observed six 1s and four 0s. Write a short
computer program that plots the function θ̂ ↦ L(θ̂)p(θ̂) for the same values of θ̂ as in part (b).

(g) [4 Points] Estimate θ̂MAP by marking on the x-axis the value of θ̂ that maximizes the function. Find
a closed-form formula for the MAP estimate. Does the closed form agree with the plot?
(h) [4 Points] Compare the MAP estimate to the MLE computed from the same data in part (c). Briefly
explain any significant difference.

(i) [4 Points] Comment on the relationship between the MAP and MLE estimates as n goes to infinity.

Problem 3: Splitting Heuristic for Decision Trees
Recall that the ID3 algorithm iteratively grows a decision tree from the root downwards. On each iteration,
the algorithm replaces one leaf node with an internal node that splits the data based on one decision attribute
(or feature). In particular, the ID3 algorithm chooses the split that reduces the entropy the most, but there
are other choices. For example, since our goal in the end is to have the lowest error, why not instead choose
the split that reduces error the most? In this problem we will explore one reason why reducing entropy is a
better criterion.
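
As a point of reference for the two criteria being compared, the sketch below (assuming NumPy; the helper names and the tiny data set are illustrative, not part of the assignment) computes the entropy of a boolean label vector, the error of a majority-vote leaf, and the weighted average of either quantity after splitting on one boolean feature.

```python
# Sketch: generic helpers for the two splitting criteria discussed above.
# Illustrative only; not a solution to parts (a)-(d).
import numpy as np

def entropy(y):
    """Entropy (in bits) of a 0/1 label vector y."""
    p = float(np.mean(y))
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def majority_error(y):
    """Fraction of mistakes made by predicting the majority label of y."""
    p = float(np.mean(y))
    return min(p, 1.0 - p)

def after_split(metric, y, xi):
    """Weighted average of `metric` over the branches xi == 0 and xi == 1."""
    total = 0.0
    for v in (0, 1):
        mask = (xi == v)
        if mask.any():
            total += mask.mean() * metric(y[mask])
    return total

# Illustrative usage on a tiny hand-made data set (not the table below):
y  = np.array([0, 1, 1, 1])
x1 = np.array([0, 1, 0, 1])
print(entropy(y), after_split(entropy, y, x1))                # entropy before / after splitting on x1
print(majority_error(y), after_split(majority_error, y, x1))  # error before / after splitting on x1
```
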
Consider the following simple setting. Let us suppose each example is described by n boolean features:
X = ⟨X1 , . . . , Xn ⟩, where Xi ∈ {0, 1}, and where n ≥ 4. Furthermore, the target function to be learned is
f : X → Y , where Y = X1 ∨ X2 ∨ X3 . That is, Y = 1 if X1 = 1 or X2 = 1 or X3 = 1, and Y = 0 otherwise.
Suppose that your training data contains all of the 2^n possible examples, each labeled by f. For example,
when n = 4, the data set would be

X1  X2  X3  X4  Y
 0   0   0   0  0
 1   0   0   0  1
 0   1   0   0  1
 1   1   0   0  1
 0   0   1   0  1
 1   0   1   0  1
 0   1   1   0  1
 1   1   1   0  1
 0   0   0   1  0
 1   0   0   1  1
 0   1   0   1  1
 1   1   0   1  1
 0   0   1   1  1
 1   0   1   1  1
 0   1   1   1  1
 1   1   1   1  1

(a) [4 Points] How many mistakes does the best 1-leaf decision tree make, over the 2^n training examples?
(The 1-leaf decision tree does not split the data even once.)

(b) [4 Points] Is there a split that reduces the number of mistakes by at least one? (That is, is there a
decision tree with one internal node that makes fewer mistakes than your answer to part (a)?) Why or why not?
(c) [4 Points] What is the entropy of the output label Y for the 1-leaf decision tree (no splits at all)?
(d) [4 Points] Is there a split that reduces the entropy of the output Y by a non-zero amount? If so, what
is it, and what is the resulting conditional entropy of Y given this split?
