HW 1
(a) Suppose we have a randomly sampled training set D (drawn independently of our test data), and we
select an estimator denoted θ = θ̂(D) (for example, via empirical risk minimization). Show that we
can decompose our expected mean squared error for a test input x into a bias, a variance, and an
irreducible error term as below:
E_{Y∼p(Y|x), D}[(fθ̂(D)(x) − Y)²] = Bias(fθ̂(D)(x))² + Var(fθ̂(D)(x)) + irreducible error.
You may find it helpful to recall the formulaic definitions of Variance and Bias, reproduced for you
below:
Var(fθ̂(D)(x)) = E_D[(fθ̂(D)(x) − E[fθ̂(D)(x)])²]
Bias(fθ̂(D)(x)) = E_{Y∼p(Y|x), D}[fθ̂(D)(x) − Y]
(b) Suppose our training dataset consists of D = {(xi , yi )}ni=1 where the only randomness is coming from
the label vector Y = Xθ∗ + ε where θ∗ is the true underlying linear model and each noise variable
εi is i.i.d. with zero mean and variance 1. We use ordinary least squares to estimate a θ̂ from this
data. Calculate the bias and covariance of the θ̂ estimate and use that to compute the bias and
variance of the prediction at particular test inputs x. Recall that the OLS solution is given by
θ̂ = (X ⊤ X)−1 X ⊤ Y,
where X ∈ Rn×d is our (nonrandom) data matrix and Y ∈ Rn is the (random) vector of training targets.
For simplicity, assume that X ⊤ X is diagonal.
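If you want to sanity-check whatever bias and covariance expressions you derive, a minimal numerical sketch is below; the orthogonal-column design (so that X⊤X is diagonal), the trial count, and the true θ∗ are arbitrary illustrative choices, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 50, 3, 20000

# Fixed (nonrandom) design with orthogonal columns, so X^T X is diagonal as assumed.
X = np.linalg.qr(rng.standard_normal((n, d)))[0] * np.sqrt(n)
theta_star = np.array([1.0, -2.0, 0.5])

# Redraw the noise many times, refit OLS each time, and look at the empirical statistics.
estimates = np.empty((trials, d))
for t in range(trials):
    y = X @ theta_star + rng.standard_normal(n)          # eps_i i.i.d., mean 0, variance 1
    estimates[t] = np.linalg.solve(X.T @ X, X.T @ y)     # OLS estimate for this noise draw

print("empirical bias of theta-hat:", estimates.mean(axis=0) - theta_star)
print("empirical covariance of theta-hat:\n", np.cov(estimates.T))
print("(X^T X)^{-1} for comparison:\n", np.linalg.inv(X.T @ X))
```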
(a) Recall that ridge regression can be understood as the unconstrained optimization problem
minw ∥y − Xw∥₂² + λ∥w∥₂²,    (1)
where X ∈ Rn×d is a design matrix (data), and y ∈ Rn is the target vector of measurement values.
One way to interpret “ridge regression” is as the ordinary least squares for an augmented data set —
i.e. adding a bunch of fake data points to our data. Consider the following augmented measurement
vector ŷ and data matrix X̂:
" # " #
y X
ŷ = X̂ = √ ,
0d λId
where 0d is the zero vector in Rd and Id ∈ Rd×d is the identity matrix. Show that the optimization
problem minw ∥ŷ − X̂w∥22 has the same minimizer as (1).
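A quick numerical check (not a substitute for the proof) that OLS on the augmented pair (ŷ, X̂) lands on the same minimizer as the closed-form ridge solution; the data, dimensions, and λ below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 30, 4, 0.7
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# "Fake data" augmentation: stack sqrt(lambda) * I_d under X and zeros under y.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)   # OLS on the augmented problem

# Closed-form ridge solution for comparison.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(w_aug, w_ridge))                      # expect True
```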
(b) Perhaps more surprisingly, one can achieve the same effect as in the previous part by adding fake
features to each data point instead of adding fake data points. Let’s construct the augmented design
matrix in the following way:
X̂ = [X αIn ]
i.e. we stack X with αIn horizontally. Here α is a scalar multiplier. Now our problem is underdeter-
mined: the new dimension d + n is larger than the number of points n. Therefore, there are infinitely
many values η ∈ Rd+n for which X̂η = y. Consider the following problem:
min_η ∥η∥₂²  subject to  X̂η = y.    (2)
Find the α such that if η ∗ is the minimizer of (2), then the first d coordinates of η ∗ form the
minimizer of (1).
Can you interpret what the final n coordinates of η ∗ represent?
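A small harness you could use to test candidate values of α numerically: it computes the min-norm solution of X̂η = y via the pseudo-inverse, keeps its first d coordinates, and compares them to the ridge minimizer of (1). The data, λ, and candidate list are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 4, 6, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # minimizer of (1)

def first_d_of_min_norm(alpha):
    """First d entries of the min-norm eta solving [X  alpha*I_n] eta = y."""
    X_hat = np.hstack([X, alpha * np.eye(n)])
    eta = np.linalg.pinv(X_hat) @ y          # pseudo-inverse gives the min-norm solution
    return eta[:d]

for alpha in [0.1, 0.5, 1.0, 2.0]:           # swap in whatever candidates you want to test
    print(alpha, np.allclose(first_d_of_min_norm(alpha), w_ridge))
```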
(c) We know that the Moore-Penrose pseudo-inverse for an underdetermined system (wide matrix) is
given by A† = A⊤(AA⊤)−1, which corresponds to the min-norm solution for Aη = z. That is, η = A†z
solves the optimization problem min_η ∥η∥₂² subject to Aη = z. Use this fact to show that the first d
coordinates of the min-norm solution to X̂η = y are given by
ŵ = X⊤(XX⊤ + α²I)−1 y.
Then, show that with the α you identified in the previous part, this is equivalent to the standard
formula for Ridge Regression, ŵ = (X⊤X + λI)−1X⊤y.
(Hint: For the last part, it might be useful to look up Kernel Ridge Regression in your machine learning
notes or textbook.)
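The identity the hint is pointing toward can also be spot-checked numerically before you prove it; the matrices below are random placeholders, with n < d so that the system is wide.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 5, 8, 0.3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (X^T X + lam I)^{-1} X^T y
dual   = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)   # X^T (X X^T + lam I)^{-1} y
print(np.allclose(primal, dual))                               # expect True
```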
(d) Suppose the SVD of X is X = UΣV⊤. Let’s make a change of coordinates in the feature space, such
that V becomes the identity: Xnew = XV and wnew = V⊤w. Denote the solution to ridge regression
(1) in these new coordinates as ŵnew. Show that the i-th coordinate of ŵnew can be obtained from
the corresponding coordinate of U⊤y by multiplication by σi/(σi² + λ), where σi is the i-th singular
value of X (or zero if i is greater than the rank of X).
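A sketch for verifying the claimed coordinate-wise shrinkage numerically: it compares the ridge solution rotated into the V basis against σi/(σi² + λ) times the corresponding coordinate of U⊤y. The example matrix and λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 12, 5, 0.4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
w_new = Vt @ w_ridge                            # ridge solution in the rotated coordinates

shrink = s / (s**2 + lam)                       # sigma_i / (sigma_i^2 + lambda)
print(np.allclose(w_new, shrink * (U.T @ y)))   # expect True
```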
The remaining parts (e-h) of this question are optional.
(e) One reason why we might want to have small weights w has to do with the sensitivity of the predictor
to its input. Let x be a d-dimensional list of features corresponding to a new test point. Our predictor
is w⊤ x. What is an upper bound on how much our prediction could change if we added noise
ϵ ∈ Rd to a test point’s features x?
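If you want to check your bound on a concrete example, one natural candidate (stated here as an assumption, not the required answer) is the Cauchy–Schwarz bound ∥w∥₂∥ϵ∥₂; the vectors below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 10
w, x, eps = rng.standard_normal(d), rng.standard_normal(d), 0.01 * rng.standard_normal(d)

change = abs(w @ (x + eps) - w @ x)              # actual change in the prediction w^T x
bound = np.linalg.norm(w) * np.linalg.norm(eps)  # Cauchy-Schwarz: |w^T eps| <= ||w||_2 ||eps||_2
print(change, "<=", bound, change <= bound)
```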
(f) We know that the solution to ridge regression (1) is given by ŵr = (X⊤ X + λI)−1 X⊤ y. What
happens when λ → ∞? It is for this reason that sometimes ridge regularization is referred to as
“shrinkage.”
(g) Another advantage of ridge regression can be seen for under-determined systems. Say we have the
data drawn from a d = 5 parameter model, but only have n = 4 training samples of it, i.e. X ∈ R4×5 .
Now this is clearly an underdetermined system, since n < d. Show that ridge regression with
λ > 0 results in a unique solution, whereas ordinary least squares can have an infinite number
of solutions.
[Hint: To make this point, it may be helpful to consider w = w0 + w∗ where w0 is in the null space
of X and w∗ is a solution.]
[Alternative Hint: You might want to consider (2) as the way to interpret ridge regression.]
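A numerical illustration of the phenomenon with n = 4, d = 5 (all numbers are placeholders): the ridge normal-equation matrix is invertible for λ > 0, while X⊤X is rank-deficient and adding a null-space direction to a least-squares solution leaves the fit unchanged.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, lam = 4, 5, 0.2
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

print(np.linalg.matrix_rank(X.T @ X))                     # 4 < 5: X^T X is singular
print(np.linalg.matrix_rank(X.T @ X + lam * np.eye(d)))   # 5: ridge system has a unique solution

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)            # one least-squares solution
w0 = np.linalg.svd(X)[2][-1]                              # a direction in the null space of X
print(np.allclose(X @ w_star, X @ (w_star + 3.0 * w0)))   # same fit => OLS solution is not unique
```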
(h) For the previous part, what will the answer be if you take the limit λ → 0 for ridge regression?
[Hint: You might want to consider (2) as the way to interpret ridge regression.]
Where W1 , A, and W2 are matrices and x, b and c are vectors. W1 can be viewed as a generic weighting of
the residuals and W2 along with c can be viewed as a generic weighting of the parameters.
(a) Solve this optimization problem manually by expanding it out as matrix-vector products, setting the
gradient to 0, and solving for x.
(b) Construct an appropriate matrix C and vector d that allow you to rewrite this problem as the ordinary
least squares problem minx ∥Cx − d∥₂², and use the OLS solution x∗ = (C⊤C)−1C⊤d to solve it. Confirm
your answer is in agreement with the previous part.
(c) Under what conditions would this reduce to the simple case of ridge regression that you’ve seen in the
previous problem, x∗ = (A⊤A + λI)−1A⊤b?
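A numerical scaffold for checking parts (a)–(c) against each other, assuming the objective being minimized is ∥W1(Ax − b)∥₂² + ∥W2(x − c)∥₂² (that reading is inferred from the description of W1, W2, and c, so treat it as an assumption); the matrices and vectors below are random placeholders, and C and d are left for you to construct.

```python
import numpy as np

rng = np.random.default_rng(7)
m, p = 6, 4
A  = rng.standard_normal((m, p))
b  = rng.standard_normal(m)
c  = rng.standard_normal(p)
W1 = np.diag(rng.uniform(0.5, 2.0, size=m))   # weighting on the residuals
W2 = np.diag(rng.uniform(0.5, 2.0, size=p))   # weighting on the parameters

def objective(x):
    # Assumed cost: ||W1 (A x - b)||_2^2 + ||W2 (x - c)||_2^2
    return np.sum((W1 @ (A @ x - b)) ** 2) + np.sum((W2 @ (x - c)) ** 2)

def ols_solution(C, d_vec):
    """Plug in the C and d you construct in part (b); returns (C^T C)^{-1} C^T d."""
    return np.linalg.solve(C.T @ C, C.T @ d_vec)

# Sanity check: your closed-form x* from part (a) and ols_solution(C, d) from part (b)
# should coincide, and objective() should not decrease under small perturbations of x*.
```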
5. System ID for Continuous Systems: Understanding how PyTorch can be used to estimate underlying
dynamics given observed samples of a trajectory
Consider the scalar continuous-time system
dx(t)/dt = λx(t) + bu(t),    (3)
where x(t) ∈ R and u(t) ∈ R is a scalar input.
Given an initial condition x(t0) and a constant input u(t) = ū over the interval [t0, tf), we know that
for all t ∈ [t0, tf) the differential equation eq. (3) has the unique solution
x(t) = x(t0) e^{λ(t−t0)} + ((e^{λ(t−t0)} − 1)/λ) · bū.    (4)
Assume that we record the state at known times τ0 , τ1 , . . . , τn as having corresponding state values
x0 = x(τ0 ), x1 = x(τ1 ), . . . , xn = x(τn ). The continuous-time input is known to be piecewise
constant u(t) = ui for t ∈ [τi , τi+1 ), where we know the sequence of inputs u0 , u1 , . . . , un−1 .
We now want to formulate this as a system ID question by relating the unknown parameters λ, b to
the data we have. However, the relationship between the parameters and the data we collected is now
non-linear. For the data point xi+1 , use eq. (4) to write out how xi+1 should be related to λ and b
in the form
where f (λ, x, σ, τ ) and g(λ, x, u, σ, τ ) are given nonlinear scalar functions, and ℓ(s, p) is a loss func-
tion that penalizes how far the prediction p is from the measured state s. For example, you could use
squared loss ℓ(s, p) = (s − p)2 .
To find the best possible λ∗ , b∗ , you observe that you want to solve the nonlinear system of equations:
∂c(λ, b)/∂λ |_{λ∗, b∗} = 0    (7)
∂c(λ, b)/∂b |_{λ∗, b∗} = 0    (8)
and decide to do so using Newton’s method starting with an initial guess λ0 , b0 and linearizing the
system of equations eq. (7) and eq. (8) to get a system of linear equations to solve at each step.
The system of linear equations at each iteration j + 1 can be expressed in vector form as:
" #
λ − λj
A = y. (9)
b − bj
What are the entries of the matrix A and the vector y in terms of the appropriate partial deriva-
tives of c(λ, b) evaluated at the appropriate arguments?
Assume that you can use PyTorch to compute whatever derivatives of c(λ, b) that you want — all given
functions are sufficiently differentiable. You don’t have to take the derivatives by hand. You just need
to tell us what derivatives and what arguments to evaluate them at.
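One way PyTorch's autograd can supply the required derivatives is sketched below, assuming you have written the cost c as a differentiable function of a 2-vector (λ, b); the toy cost here is only a stand-in for the real sum of per-sample losses built from f, g, and ℓ, and the update shown is one natural way to assemble a linearized step, not necessarily the exact A and y the problem has in mind.

```python
import torch
from torch.autograd.functional import hessian, jacobian

def c(params):
    """Stand-in cost; replace with the sum of losses built from f, g, and your data."""
    lam, b = params[0], params[1]
    return (lam - 1.5) ** 2 * (b + 0.3) ** 2 + (b - 2.0) ** 2

params_j = torch.tensor([0.0, 0.0])   # current iterate (lambda_j, b_j)

grad = jacobian(c, params_j)          # first partial derivatives of c at (lambda_j, b_j)
H = hessian(c, params_j)              # second partial derivatives of c at (lambda_j, b_j)

# Linearized step: solve a 2x2 linear system built from these derivatives, then update.
delta = torch.linalg.solve(H, -grad)
params_next = params_j + delta
print(params_next)
```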
6. Movie Ratings with Missing Entries: understanding validation in a way that points towards self-
supervision
In a matrix R, you have users’ movie ratings. However, not all users watched all the movies.
R = \begin{bmatrix}
0.50 & 0.00 & 0.50 & 0.50 & 0.20 & 1.0 \\
0.60 & 0.20 & 0.40 & 0.50 & ?    & ?   \\
0.50 & 0.50 & 0.00 & 0.25 & 0.60 & 1.0 \\
0.60 & 0.10 & 0.50 & 0.55 & ?    & ?   \\
1.00 & 0.40 & 0.60 & ?    & ?    & ?
\end{bmatrix}    (10)
where the element at the ith row and jth column indicates the rating of movie i by user j. A “?” means that
there is no rating of that movie by that user.
Our goal is to predict ratings for the missing entries, so we can recommend movies to users. In order to do
this, you want to find the hidden goodness vectors for the movies, and the hidden sensitivity vectors of the
users. However, due to missing entries, it is not possible to run an SVD on the entire rating matrix R.
It turns out that we have a submatrix R′ in R that does not have any missing entries.
R′ = \begin{bmatrix}
0.50 & 0.00 & 0.50 & 0.50 \\
0.60 & 0.20 & 0.40 & 0.50 \\
0.50 & 0.50 & 0.00 & 0.25 \\
0.60 & 0.10 & 0.50 & 0.55
\end{bmatrix}    (11)
We provide a decomposition of this matrix:
R′ = \underbrace{\begin{bmatrix} 0.5 & 0.0 \\ 0.4 & 0.2 \\ 0.0 & 0.5 \\ 0.5 & 0.1 \end{bmatrix}}_{G}
\underbrace{\begin{bmatrix} 4.0 & 0.0 \\ 0.0 & 2.0 \end{bmatrix}}_{D}
\underbrace{\begin{bmatrix} 0.25 & 0.0 & 0.25 & 0.25 \\ 0.5 & 0.5 & 0.0 & 0.25 \end{bmatrix}}_{S}    (12)
where the ith row of the matrix G represents the goodness row vector gi⊤ of movie i, the jth column of
the matrix S represents the sensitivity vector sj of user j, and each diagonal entry of the matrix D shows
the weight each attribute has in determining the rating of a movie by a user.
NOTE: This decomposition in eq. (12) is not an SVD; G and S do not have orthonormal vectors.
(a) Use this to estimate the rating of movie 2 as rated by user 6. (Show work that uses s6. Unjustified
answers will get no credit.)
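A numerical sketch of the kind of computation involved, assuming ratings are modeled as rij ≈ gi⊤ D sj: an unseen user's sensitivity vector can be fit by least squares from the ratings that user did give, and then reused to fill in a missing entry. The indices and observed values below are dummy placeholders; substitute the entries of R that are actually observed for the user in question.

```python
import numpy as np

G = np.array([[0.5, 0.0],
              [0.4, 0.2],
              [0.0, 0.5],
              [0.5, 0.1]])
D = np.diag([4.0, 2.0])
S = np.array([[0.25, 0.0, 0.25, 0.25],
              [0.5,  0.5, 0.0,  0.25]])

print(G @ D @ S)                          # sanity check: reproduces R' from eq. (12)

# Fit a sensitivity vector for a user outside S from that user's observed ratings,
# using only movies whose goodness vectors are known (assumed model: r_ij = g_i^T D s_j).
rated_movies = [0, 1]                     # dummy indices; use the movies the user actually rated
observed = np.array([0.7, 0.9])           # dummy ratings; use the observed entries of R
A = G[rated_movies] @ D                   # each row is g_i^T D
s_new, *_ = np.linalg.lstsq(A, observed, rcond=None)

# A missing rating is then predicted as g_i^T D s_new for the movie of interest.
print(G[3] @ D @ s_new)                   # e.g., predicted rating of movie 4 by this user
```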
(b) For the 5th movie, we have three ratings and want to find two parameters of goodness. Formulate a
least squares problem Ag5 ≈ b to estimate g5 (goodness vector of movie 5). You need to tell us A
explicitly as a matrix with numerical entries. We give you that b = [1.00, 0.40, 0.60]⊤ since those are the
three ratings we know for this movie.
(c) We now consider a ratings matrix R without missing entries (that is different from the previous R)
where the matrix is partitioned into four blocks R11 , R12 , R21 , R22 as below.
" #
R11 R12
R= (14)
R21 R22
In order to find the optimal number of principal components, we compute a PCA model from the SVD
of R11 with k principal components, for k = 1, 2, . . . , 6. We then use the chosen components and the
singular values of R11 together with the information in R12 and R21 to create an estimate R̂22 for the
held-out ratings in R22. We can also use the first k terms of the SVD of R11 to reconstruct R̂11 as the
best rank-k approximation to R11.
The training errors ∥R11 − R̂11∥_F² and validation errors ∥R22 − R̂22∥_F² for each candidate choice of
k are given in the table below.
Select    k    Training error    Validation error
⃝          1        1.428             3.104
⃝          2        0.414             2.494
⃝          3        0.093             0.462
⃝          4        0.017             0.090
⃝          5        0.011             0.132
⃝          6        0.006             0.161
Find the optimal number of principal components k we should use. (No need to give any justifica-
tion.)
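For reference, the training errors in the table are the kind of quantity you would get from a truncated SVD: the best rank-k approximation of R11 and its squared Frobenius error. A sketch is below; the R11 used here is a random placeholder since the actual block is not given.

```python
import numpy as np

rng = np.random.default_rng(8)
R11 = rng.standard_normal((8, 8))   # placeholder block; the real R11 is not shown here

U, s, Vt = np.linalg.svd(R11, full_matrices=False)
for k in range(1, 7):
    R11_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]           # best rank-k approximation
    train_err = np.linalg.norm(R11 - R11_hat, "fro") ** 2  # squared Frobenius error
    print(k, train_err)
```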
where x, y ∈ R, b ∈ Rd, W(1) ∈ Rd×1, and W(2) ∈ R1×d. We define our loss function to be the squared
error,
ℓ(x, y, W(1), b, W(2)) = (1/2) ∥f̂(x) − y∥₂².
(c) Now we return to the full network function fˆ(x). Derive the location ei of the elbow of the i’th
elementwise ReLU activation.
(d) Derive the new elbow location e′i of the i’th elementwise ReLU activation after one stochastic
gradient update with learning rate λ.
(a) What sources (if any) did you use as you worked through the homework?
(b) If you worked with someone on this homework, who did you work with?
List names and student IDs. (In case of homework party, you can also just describe the group.)
(c) Roughly how many total hours did you work on this homework? Write it down here where you’ll
need to remember it for the self-grade form.
Contributors:
• Saagar Sanghavi.
• Brandon Trabucco.
• Alexander Tsigler.
• Anant Sahai.
• Jane Yu.
• Philipp Moritz.
• Soroush Nasiriany.
• Ashwin Vangipuram.
• Jichan Chung.
• Druv Pai.
• Josh Sanz.