COS324 Course Notes
INTRODUCTION TO
MACHINE LEARNING
Lecture Notes for COS 324 at Princeton University
Copyright © 2024 Sanjeev Arora, Simon Park, Dennis Jacob, Danqi Chen
published by
tufte-latex.github.io/tufte-latex
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic
License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/2.0/ or send a
letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
September 8, 2024
Contents
I Supervised Learning 13
4 Linear Classification 47
4.1 General Form of a Linear Model 47
4.2 Logistic Regression 48
II Unsupervised Learning 67
6 Clustering 69
6.1 Unsupervised Learning 69
6.2 Clustering 69
6.3 k-Means Clustering 71
6.4 Clustering in Programming 76
7 Low-Dimensional Representation 81
7.1 Low-Dimensional Representation with Error 82
7.2 Application 1: Stylometry 84
7.3 Application 2: Eigenfaces 87
19 Calculus 239
19.1 Calculus in One Variable 239
19.2 Multivariable Calculus 241
Introduction
Finally, note that all the above material has to fit in Princeton’s
12-week term. We do not have 13-15 weeks as at other universities.
Part III covers the basics of deep learning. Topics include feedfor-
ward neural networks and convolutional neural networks.
Data: We have a set of data to learn from. The data may have an ad-
ditional field called a label. If the data is labeled, the goal of the
learning is to predict the label of newly seen data. Such learning
is called supervised learning; examples include regression (Chapters 1 and 5) and classification (Chapter 4). If the data is unlabeled, the goal of the learning is to extract the structure of the current data. Such learning is called unsupervised learning; examples include clustering (Chapter 6) and dimensionality reduction (Chapter 7).
Model fitting: The process of finding the best (or good enough) values
of model parameters, based on the provided data, is called fitting
the model. There are many different ways to assess how “good”
a model is. In many cases, a loss function is defined to measure
how far the predictions of the model are from the actual output.
In Chapter 4, we see how different loss functions are defined for
different models.
Supervised Learning
1 Linear Regression: An Introduction
This chapter introduces least squares linear regression, one of the simplest and most popular models in data science. Several of you may have seen it in high school. In particular, we focus on understanding linear regression in the context of machine learning. Using linear regression as an example, we will introduce the terminology and ideas (some of which were mentioned in the Preface) that are widely applicable to more complicated models in ML.
w = a_0 + a_1 h    (1.1)

(1/n) ∑_{i=1}^{n} (w_i − a_0 − a_1 h_i)²    (1.2)
Example 1.1.1. If the data points (h, w) are given as {(65, 130), (58, 120), (73, 160)}, the least squares fit will find the values of a_0, a_1 that minimize

(1/3)((130 − a_0 − 65a_1)² + (120 − a_0 − 58a_1)² + (160 − a_0 − 73a_1)²)

which are a_0 = −510/13, a_1 = 35/13.
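This fit is easy to check numerically. A minimal sketch (the variable names are only for this illustration):

import numpy as np

h = np.array([65.0, 58.0, 73.0])      # heights from Example 1.1.1
w = np.array([130.0, 120.0, 160.0])   # weights from Example 1.1.1

# least squares fit of w = a0 + a1*h via the design matrix [1, h]
A = np.vstack([np.ones_like(h), h]).T
a0, a1 = np.linalg.lstsq(A, w, rcond=None)[0]
print(a0, a1)   # approximately -39.23 = -510/13 and 2.69 = 35/13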
Problem 1.1.2. Between the two lines in Figure 1.2, which is more preferred by the least squares regression method? (Hint: Fix a_1 and minimize for a_0. Then minimize for a_1. Completing the square may be useful.)²
² A more general calculus-based approach will be introduced in a later chapter.
1.1.1 Multivariate Linear Regression
One can generalize the above example to multi-variable settings. In general, we have k predictor variables and one effect variable.³
³ In the above example k = 1. The predictor variable was height and the effect variable was weight.
The data points consist of k + 1 coordinates, where the last coordinate is the value y of the effect variable and the first k coordinates contain the values of the predictor variables x_1, x_2, ..., x_k. Then the relationship we are trying to fit has the form
y = a_0 + a_1 x_1 + a_2 x_2 + · · · + a_k x_k    (1.3)
and the least squares fit method will find the values of a_0, a_1, ..., a_k that minimize

(1/n) ∑_{i=1}^{n} (y_i − a_0 − a_1 x_{1i} − a_2 x_{2i} − · · · − a_k x_{ki})²    (1.4)

If we append a dummy coordinate x_0 = 1 to each data point and write ⃗x = (x_0, x_1, ..., x_k) and ⃗a = (a_0, a_1, ..., a_k), then the relationship (1.3) can be written compactly as

y ≈ ⃗a · ⃗x    (1.5)

and the least squares fit method will find ⃗a ∈ R^{k+1} that minimizes

(1/n) ∑_{i=1}^{n} (y_i − ⃗a · ⃗x_i)²    (1.6)
where (⃗xi , yi ) is the i-th data point. We discuss how to find the best
values of a0 , a1 , . . . , ak later in Chapter 3; for now just assume that the
solution can be found.
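For concreteness, here is a minimal numpy sketch of computing such an ⃗a (the synthetic data is only for illustration; in practice X and y come from the dataset):

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))                 # n data points with k predictor variables each
y = 2 + X @ np.array([1.0, -3.0, 0.5])      # effect variable generated with a0 = 2

X1 = np.hstack([np.ones((n, 1)), X])        # prepend a dummy coordinate of 1 for a0
a = np.linalg.lstsq(X1, y, rcond=None)[0]   # minimizes (1.6); recovers (2, 1, -3, 0.5)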
1.2.1 Introduction
While you might have seen linear regression as early as in high
school, you probably did not see this cool application. In sentiment
classification, we are given a piece of text and have to label it with +1
if it expresses positive sentiment and −1 otherwise.
Table 1.1: Data from the Stanford Sentiment Treebank (SST). https://nlp.stanford.edu/sentiment/treebank.html

The film's performances are thrilling.      +1
It's not a great monster movie.             −1
It is definitely worth seeing.              +1
Unflinchingly bleak and desperate.          −1
How can we train a model to label text snippets with the correct
sentiment value, given a dataset of training examples? Here is an
idea to try to solve it using a linear regression model. We first enu-
merate all English words and assign a real-valued score to each word,
where the score for the i-th word is denoted by w_i. These scores will be chosen to minimize

∑_i ( y_i − ∑_{j ∈ S_i} w_j )²    (1.7)

where S_i is the multiset of words in the i-th piece of text. Each of the values ( y_i − ∑_{j ∈ S_i} w_j )² is called a least squares error or, more generally, the loss for that particular training example. The full summation is called the training loss of the dataset.
Example 1.2.2. Assume we are training a sentiment prediction model on a dataset, and that the fitted scores include w_I = 0.15, w_like = 0.55, w_dislike = −0.74, w_this = 0.03, and w_movie = −0.07. Then the output of the model on the sentence "I like this movie" from the training data will be 0.15 + 0.55 + 0.03 − 0.07 = 0.66. The output for "I dislike this movie" from the training data will be 0.15 − 0.74 + 0.03 − 0.07 = −0.63.

Writing the sum ∑_{j ∈ S_i} w_j as a dot product ⃗w · ⃗x_i, where ⃗x_i records how many times each dictionary word occurs in the i-th piece of text (its bag-of-words, or BoW, representation), shows that the linear model we have proposed for sentiment prediction is just a subcase of linear regression (see (1.5)).
Example 1.2.3. Consider the same model in Example 1.2.2. The BoW repre-
sentation for the sentence “I like this movie” is (1, 1, 0, 1, 1, 0 · · · ). The BoW
representation for the sentence “I dislike this movie” is (1, 0, 1, 1, 1, 0 · · · ).
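As a small illustration, here is a sketch of this computation in code; the dictionary order and the scores are assumptions chosen to match the numbers in Example 1.2.2:

import numpy as np

# assumed dictionary order: I, like, dislike, this, movie
w = np.array([0.15, 0.55, -0.74, 0.03, -0.07])   # word scores (model parameters)

x_like = np.array([1, 1, 0, 1, 1])      # BoW representation of "I like this movie"
x_dislike = np.array([1, 0, 1, 1, 1])   # BoW representation of "I dislike this movie"

print(w @ x_like)      #  0.66
print(w @ x_dislike)   # -0.63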
Table 1.4: The squared residual for four training examples.

I like this movie.       0.12
I dislike this movie.    0.14
I like this.             0.07
I dislike this.          0.19
Now it is time to test the model. Assume that the sentence "I like a movie" is provided to the model as test data. The test loss can be calculated in a way similar to the training loss as (+1 − 0.63)² ≃ 0.14.
But to actually test if the model produces the correct sentiment label
for this newly seen data, we now wish the model to output either +1
or −1, the only two labels that exist in the population. An easy fix
is to change the output of the model at test time to be sign(∑ j∈S w j ).
For this test data, the model will output sign(0.63) = +1.
On the Stanford Sentiment Treebank, this approach of training a least squares model yields a success rate of 78%.⁷ By contrast, state-of-the-art deep learning methods yield success rates exceeding 96%!
⁷ To be more exact, this result is from a model called ridge regression, which is a linear regression model augmented by an ℓ2 regularizer; this will be explained in Chapter 3.
One thing to note is that while the training loss is calculated and explicitly used in the training process, the test loss is only a statistic that is generated after the training is over. It is a metric to assess whether the model fitted on the training data also performs well on more general data.
Let’s try to understand the relationship between low test loss (the
squared residual) and high test accuracy (for what fraction of test
data points the sentiment was correct). Heuristically, the test loss (average squared residual) being 0.75 means that the absolute value of the residual on a typical point is √0.75 ≈ 0.87. This means
that for a data point with an actual positive sentiment (i.e., label +1),
the output of the model is roughly expected to lie in the interval
[1 − 0.87, 1 + 0.87], and similarly, for a data point with an actual
negative sentiment (i. e., label −1), the output of the model is roughly
expected to lie in the interval [−1 − 0.87, −1 + 0.87]. Once we take
the sign sign(∑ j∈S w j ) of the output of the model, the output is thus
likely to be rounded off to the correct label. We also note that the
training accuracy is almost 100%. This usually happens in settings
where the number of parameters (i.e., number of predictor variables)
exceeds the number of training data points (or is close to it). The
following problem explores this.
Example 1.3.1. Patients’ raw data might include height and weight.
If we use linear regression, the effect variable can only depend upon a
linear combination of height and weight. But it is known that several
health outcomes are better modeled using Body Mass Index, defined as
weight/(height²). Thus the experimenter may include a separate coordinate
for BMI, even though it duplicates information already present in the other
coordinates.
Example 1.3.2. In Example 1.3.1, the weight in pounds may span a range
of [90, 350], whereas cholesterol ratio may span a range of [1, 10]. It is often
a good idea to normalize the coordinate, which means to replace x with
( x − µ)/σ where µ is the mean of the coordinate values in the dataset and σ
is the standard deviation.
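A minimal sketch of this normalization in numpy (the toy matrix is only for illustration):

import numpy as np

X = np.array([[150.0, 4.2], [210.0, 3.1], [300.0, 7.8]])   # e.g., columns for weight and cholesterol ratio
mu = X.mean(axis=0)        # mean of each coordinate
sigma = X.std(axis=0)      # standard deviation of each coordinate
X_norm = (X - mu) / sigma  # each coordinate now has mean 0 and standard deviation 1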
Thus the same raw dataset can have multiple featurizations, with
different numbers of coordinates. Problem 1.2.5 may make us wary
of using featurizations with too many coordinates. We will learn a
technique called regularization in Chapter 3, which helps mitigate the
issue identified in Problem 1.2.5.
We will use functions defined in some of these external packages. You may not always
be familiar with the usage of these functions. It is important to check
the official documentation to learn about the usage and the signature
of the functions.
The code snippet below uses the three aforementioned packages to
perform linear regression on any given dataset.
# import necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
import matplotlib.pyplot as plt
For readers who are not familiar with Python, we discuss some
key details. In the first section of the code, we import the relevant
packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
import matplotlib.pyplot as plt
As seen in this example, there are two ways to load a package. The
first option is to import the full package with the import keyword
import numpy as np
Here we can refer to methods of the numpy package, such as sign(), using the alias np. Alternatively, we can selectively import particular methods or classes with the from keyword
from sklearn.model_selection import train_test_split
The next part of the code is preparing the train, test data.
X = ...
y = ...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
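The discussion below refers to producing the same kind of split by slicing the arrays directly; a sketch of such a snippet (train_size is an assumed variable name) is:

train_size = int(0.8 * len(X))   # e.g., use 80% of the data for training
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]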
Note that we have omitted some of the bounding indices. If the start
index is omitted, Python assumes it to be 0 (so that the subarray is
from the start of the array); for example, X[:train_size] is the first
train_size entries of X. If the end index is omitted, Python assumes it
to be n, the length of the array (so that the subarray ends at the end
of the array); for instance, X[train_size:] is the remaining entries of X,
once we remove the first train_size entries. Another way to slice the
arrays is by specifying the number of test data points
test_size = ...
X_train = X[:-test_size]
X_test = X[-test_size:]
y_train = y[:-test_size]
y_test = y[-test_size:]
Here, notice that the index -test_size is a negative number. In this case,
Python interprets this as n - test_size, where n is the size of the array.
In other words, it is the index of the test_size-th element from the
back of the array.
The third part of the code is fitting the linear regression model.
linreg = LinearRegression().fit(X_train, y_train)
pred_train = linreg.predict(X_train)
pred_test = linreg.predict(X_test)
The first line will generate the least squares fit model based on the
train data. Then we can have the model make predictions on the
train, test data.
Next, we print out the mean squared loss and the accuracy for the
train, test data.
print('Train MSE: ', '{0:.4f}'.format(mse(y_train, pred_train)))
print('Test MSE: ', '{0:.4f}'.format(mse(y_test, pred_test)))
print('Train Acc: ', '{0:.2f}'.format(100*(np.sign(pred_train)==y_train).mean()))
print('Test Acc: ', '{0:.2f}'.format(100*(np.sign(pred_test)==y_test).mean()))
Notice that we use the mse() method that we imported from the sklearn package. Also notice that when computing the accuracy, we first convert the real-valued predictions into ±1 labels with np.sign() and then compare them against the true labels.
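The last part of the program visualizes the test predictions. A sketch of the snippet that the following explanation refers to (the color, labels, and title are assumptions):

plt.scatter(y_test, pred_test, c='blue')
plt.xlabel('true value')
plt.ylabel('predicted value')
plt.title('Predictions on the test data')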
The first line draws a scatter plot with y_test on the x-axis and pred_test on the y-axis. Notice that you can specify the color of the
data points by specifying the value of the parameter c. In general,
parameters are optional values you can provide to Python functions.
If the values to parameters are omitted, the functions will use their
default values. The second and third lines specify the labels that will
be written next to the axes. The final line specifies the title of the
plot.
2 Statistical Learning: What It Means to Learn
W = 50 + H + 0.1C − 4E (2.1)
Let's also say that the average squared residual on train and test data were both 100. This means that the relationship (2.1) holds with an error of 10 lbs on a typical test data point.²
² Also, the trained model exhibits perfect generalization: the test loss is the same as the training loss!

Question 2.1.1. Alice was one of the Princeton residents in the study, but the prediction of the model is far off from her actual value (the squared residual is 300). Does this prove the model wrong?
The answer is no. The least squares linear regression finds the
model that minimizes the average squared residual across all training
data points. The residual could be large for a particular individual.
Question 2.1.2. There was a follow-up study of every Princeton resident who is taller than 7 feet. All of them reported a squared residual of 500. Does this prove the model wrong?
The answer is still no. People who are taller than 7 feet make up
a tiny fraction of the entire population. Their residuals have a very
small effect on the average squared residual. The residual could be
large for a small subset of the population.
Question 2.1.3. There was a follow-up survey that tested the model on every
single Princeton resident. Is it possible that the average squared residual is
200 for the entire population?
The answer is yes, although it is unlikely. Consider the distribution of 4-tuples (H, E, C, W) over the entire Princeton population. This is some distribution over a finite set of 4-dimensional vectors.³ The 200 residents we surveyed were randomly drawn from this distribution.
³ 31,000 vectors to be more exact; the population of Princeton is 31,000.
The answer is no. The model was fitted to and tested on the distri-
bution obtained before everyone tried to gain weight. It has not been
fitted on the distribution of data points from people who changed
their values of C and E. In particular, note that if everyone reduces
their E and increases their C, then the distribution has definitely
changed — the average value of the E coordinate in this distribution
has decreased, whereas the average value of the C coordinate has
increased.
In general, a relationship learned from a fitted model illustrates
correlation and need not imply causation. The values of H, C, E in (2.1)
do not cause W to take a specific value. The equation only shows that
the values are connected via this linear relationship on average (with
some bounded square residuals).
The learnt relationship holds only for the distribution D that the data was sampled from.
The performance of the model on test data is an estimate of the
performance of the model on the full distribution D .
There is a small probability that the estimate using test data is off. This is
analogous to polling errors in opinion polls. The computation of
“confidence bounds” is discussed in Chapter 18.
the studies were done primarily on males. It turns out that heart diseases in female patients behave differently. Many practices that came out of those studies turned out to be harmful to female patients.⁴
⁴ See https://www.theatlantic.com/health/archive/2015/10/heart-disease-women/412495/.
2. Classifiers released by tech companies in the recent past were found to
w ← w − η f′(w)

f(⃗w + ⃗h) ≈ f(⃗w) + ∇f(⃗w) · ⃗h

f(⃗w − η∇f) ≈ f(⃗w) − η∥∇f∥₂²

⃗w ← ⃗w − η∇f(⃗w)    (3.1)
∂f/∂w_1 = 2w_1(w_1² + w_2²)(4(w_1² + w_2²)² − 21(w_1² + w_2²) + 26)
∂f/∂w_2 = 2w_2(w_1² + w_2²)(4(w_1² + w_2²)² − 21(w_1² + w_2²) + 26)
Now imagine initiating the gradient descent algorithm from the point
(0.5, 1) where the gradient vector is (7.5, 15). One iteration of gradient
descent with η = 0.01 would move from (0.5, 1) to (0.425, 0.85). The
gradient vector at (0.425, 0.85) is (7.90, 15.81) and the next iteration of GD
will move the point from (0.425, 0.85) to (0.35, 0.69). After 200 iterations,
the algorithm moves the point to (0.03, 0.06), which is very close to the
global minimum of f .
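A minimal sketch of these iterations in code, using the partial derivatives above (the function and variable names are only for illustration):

import numpy as np

def grad(w):
    # gradient of f from the formulas above, with r = w1^2 + w2^2
    r = w[0]**2 + w[1]**2
    return 2 * w * r * (4 * r**2 - 21 * r + 26)

w = np.array([0.5, 1.0])
eta = 0.01
for _ in range(200):
    w = w - eta * grad(w)
print(w)   # close to (0.03, 0.06), approaching the global minimum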
The local minimum that the gradient descent algorithm outputs depends on the initial point.
3.3 Regularizers
ℓ(⃗w) + λR(⃗w)    (3.2)
Example 3.3.1. Recall the sentiment prediction model using least squares
loss. Suppose the training data consists of two data points: (⃗x1 , y1 ) =
((1, 0, 1), −1) and (⃗x2 , y2 ) = ((1, 1, 0), +1). Then the least squares loss,
without any regularizer, can be written as
(1/2)((−1 − (w_0 + w_2))² + (1 − (w_0 + w_1))²)    (3.3)

A little thought suggests that the minimum value of this loss is 0 provided there exists (w_0, w_1, w_2) such that w_0 + w_2 = −1 and w_0 + w_1 = 1. You can verify that infinitely many solutions exist: all ⃗w* = (w_0, w_1, w_2) that lie on the line (0, 1, −1) + t(1, −1, −1) where t ∈ R. In other words, the loss has infinitely many minimizers.
Now if we impose an ℓ2 regularizer, the loss becomes

(1/2)((−1 − (w_0 + w_2))² + (1 − (w_0 + w_1))²) + λ(w_0² + w_1² + w_2²)    (3.4)
Any minimizer of this loss must make the gradient zero. In other words, the
minimizer will satisfy the following system of linear equations:
(2 + 2λ)w0 + w1 + w2 = 0
w0 + (1 + 2λ)w1 = 1
w0 + (1 + 2λ)w2 = −1
You can verify that ⃗w** = (0, 1/(1 + 2λ), −1/(1 + 2λ)) is the unique minimizer for any λ > 0. For a sufficiently small value of λ, the corresponding ⃗w** is close enough to the line (0, 1, −1) + t(1, −1, −1). That is, it has a non-zero training loss, but the value is very close to zero. Combined with the fact that it has a relatively small norm, ⃗w** becomes the minimizer of the regularized loss.
With the ℓ2 regularizer R(⃗w) = ∥⃗w∥₂², the gradient of the regularized loss is ∇ℓ + 2λ⃗w, so each step of gradient descent takes the form

⃗w_{t+1} ← ⃗w_t − η(∇ℓ + 2λ⃗w_t)

where ⃗w_t denotes the weight vector at the t-th time step. This update rule can be rewritten as

⃗w_{t+1} ← ⃗w_t(1 − 2ηλ) − η∇ℓ    (3.5)
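A sketch of one such update in code (the quadratic toy loss is only for illustration):

import numpy as np

def gd_step_with_l2(w, grad_loss, eta, lam):
    # implements (3.5): shrink the weights, then take a gradient step on the unregularized loss
    return w * (1 - 2 * eta * lam) - eta * grad_loss(w)

# toy loss l(w) = ||w - (1, 1)||^2, whose gradient is 2(w - (1, 1))
w = np.array([3.0, -2.0])
for _ in range(100):
    w = gd_step_with_l2(w, lambda v: 2 * (v - np.array([1.0, 1.0])), eta=0.05, lam=0.1)
print(w)   # pulled towards (1, 1) but shrunk towards the origin by the regularizer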
# initialize variables
num_iter = ...
x = np.zeros(num_iter + 1)
y = np.zeros(num_iter + 1)
x[0], y[0], eta = ...
A call such as np.arange(-2, 3) will create an array [−2, −1, 0, 1, 2]. Then np.meshgrid(x, y) will create a grid from the array of x values and the array of y values. We can now perform the 3D plotting with the following code.
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(projection='3d')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.view_init(elev=ax.elev, azim=ax.azim)
# the color map cm can be imported with: from matplotlib import cm
ax.plot_surface(X, Y, Z, cmap=cm.coolwarm, alpha=0.5)
In PyTorch, for example, a gradient descent optimizer can be created with the following lines:
import torch
model = ...
opt = torch.optim.SGD(model.parameters(), lr=0.1)
The code above will create an instance of the Optimizer class, which has pre-defined methods that will compute the gradients and automate the Gradient Descent process.
The answer is that using the sign(z) function in the loss makes gradient-based optimization ill-behaved. The derivative of sign(z) is 0 everywhere except at z = 0, where it is undefined (the function jumps there), so the gradient is uninformative about how to update the weight vector.
regression model — (1) given an input ⃗x, the models learn a parameter vector ⃗w that minimizes a loss, defined as a differentiable function of ⃗w · ⃗x; (2) at test time, the model outputs sign(⃗w · ⃗x).³ The main difference, however, is that the loss for the linear models introduced in this chapter is designed specifically for the binary classification task. Pay close attention to our "story" for why the loss makes sense. This will prepare you to understand any new loss functions you come across in your future explorations.
³ There are other ways to output a discrete ±1 label, but using the sign function is the most canonical way. We will discuss the behavior of the models at test time later in the chapter.
See Figure 4.1. Note that "the probability that the output is +1" is greater than 1/2 precisely if ⃗w · ⃗x > 0. Furthermore, increasing the value of ⃗w · ⃗x causes the probability to rise towards 1. Conversely, if ⃗w · ⃗x < 0, then "the probability of label −1" is greater than 1/2. When ⃗w · ⃗x = 0, the probabilities of label +1 and −1 are both equal to 1/2. In this way, the logistic model can be seen as a continuous version of sign(⃗w · ⃗x).
Under the logistic model, the probability assigned to the correct label y_i of a training point ⃗x_i is

1 / (1 + exp(−y_i ⃗w · ⃗x_i))

and training minimizes the logistic loss, the sum of the negative logs of these probabilities:

ℓ(⃗w) = ∑_{i=1}^{N} log(1 + exp(−y_i ⃗w · ⃗x_i))

Note that this expression involves a sum over training data points, which, as discussed in Section 3.2, is a very desirable and practical property of a loss in machine learning.
Problem 4.2.5. Verify that the gradient of the logistic loss function is

∇ℓ = ∑_{i=1}^{N} −y_i ⃗x_i / (1 + exp(y_i ⃗w · ⃗x_i))    (4.6)
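A minimal numpy sketch of the logistic loss and the gradient in (4.6), together with a finite-difference check (all names are only for illustration):

import numpy as np

def logistic_loss(w, X, y):
    # sum_i log(1 + exp(-y_i * (w . x_i)))
    return np.sum(np.log(1 + np.exp(-y * (X @ w))))

def logistic_grad(w, X, y):
    # (4.6): sum_i -y_i x_i / (1 + exp(y_i * (w . x_i)))
    return X.T @ (-y / (1 + np.exp(y * (X @ w))))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = np.sign(rng.normal(size=10))
w = rng.normal(size=3)
eps = 1e-6
numeric = (logistic_loss(w + np.array([eps, 0, 0]), X, y) - logistic_loss(w, X, y)) / eps
print(numeric, logistic_grad(w, X, y)[0])   # the two values agree closely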
Table 4.2: Sample data of compatibility scores for four pairs of students.

Sleep (S)    Music (M)    Compatible?
1            0.5          +1
0.75         1            +1
0.25         0            −1
0            1            −1

compatibility scores for four pairs of roommates from previous years and whether or not they turned out to be compatible (+1 for compatible, −1 for incompatible).
Model 2 only uses the music compatibility score. For example, Model
1 assigns the probability that the first pair of students are compatible
as
σ(⃗w_1 · ⃗x_1) = 1 / (1 + exp(−1)) ≃ 0.73
We can calculate the probability for the other pairs and for Model 2
and fill out the following Table 4.3:
∑_{i=1}^{4} log(1 + exp(−y_i ⃗w · ⃗x_i)) = log(1 + exp(−(w_0 · 1 + w_1 · 1 + w_2 · 0.5)))
    + log(1 + exp(−(w_0 · 1 + w_1 · 0.75 + w_2 · 1)))
    + log(1 + exp(w_0 · 1 + w_1 · 0.25 + w_2 · 0))
    + log(1 + exp(w_0 · 1 + w_1 · 0 + w_2 · 1))

and the values that minimize this loss can be found as w_0 ≈ −21, w_1 ≈ 32, w_2 ≈ 8.9.
The hinge function (plotted in Figure 4.3) is defined as

Hinge(z) = max(0, 1 − z)    (4.8)

Note that this function is always at least zero, and strictly positive for z < 1. As z decreases to negative infinity, there is no finite upper bound on its value. The derivative is zero for z > 1 and −1 for z < 1. The derivative is undefined at z = 1, but we can arbitrarily choose either 0 or −1 as the newly defined value there.
Figure 4.3: The graph of the hinge function.
For a single labeled data point (⃗x, y) where y ∈ {−1, 1}, the SVM loss is defined as

ℓ = Hinge(y ⃗w · ⃗x)    (4.9)

The SVM loss for the entire training dataset can be defined as

∑_i Hinge(y_i ⃗w · ⃗x_i)    (4.10)
that is, the sum of the SVM loss on each of the training data points. Suppose y = +1. Then this loss is 0 only when ⃗w · ⃗x ≥ 1. In other words, making the loss zero not only requires ⃗w · ⃗x to be positive, but also comfortably above 0. If ⃗w · ⃗x dips below 1, the loss is positive and increases towards +∞ as ⃗w · ⃗x → −∞. (Likewise, if the label is y = −1, then the loss is 0 only when ⃗w · ⃗x ≤ −1.)
Recall that the goal of a gradient-based optimization algorithm is
to minimize the loss. Therefore, the loss gives a clear indication of
the direction of improvement until the data point has been classified
correctly with a comfortable margin away from 0, out of the zone of
confusion.
Example 4.3.1. Recall the roommate compatibility data from Table 4.2. Consider the soft-margin SVM with the weight vector ⃗w = (−1.5, 3, 0). This means the decision boundary — the set of points where ⃗w · ⃗x = 0 — is drawn at Sleep = 1/2, and the margins — the set of points where ⃗w · ⃗x = ±1 — are drawn at Sleep = 5/6 and Sleep = 1/6. Figure 4.4 shows the decision boundary and the two margin lines of the model. The SVM loss is zero for the point (1, 0.5) because it is labeled +1 and is away from the decision boundary with enough margin. Similarly, the loss is zero for the point (0, 1). The loss for the point (0.75, 1), however, can be calculated as Hinge(+1 · (−1.5 + 3 · 0.75 + 0 · 1)) = Hinge(0.75) = 0.25,
and the update rule for a gradient descent algorithm will be written as
which is now lower than the SVM loss before the update.
So far, we have only seen problems where the model has to classify using two labels ±1. In many settings there are k possible labels for each data point⁹ and the model has to assign one of them.
⁹ This is the case in most settings in modern machine learning. For instance, in the famous ImageNet challenge, each image belongs to one of 1000 classes.
The conceptual framework is similar to logistic regression, except the model defines a nonzero probability for each label as follows. The notation assumes the data is d-dimensional and the model parameters consist of k vectors ⃗θ_1, ⃗θ_2, ..., ⃗θ_k ∈ R^d. We define a new vector ⃗z ∈ R^k where each coordinate z_i satisfies z_i = ⃗θ_i · ⃗x. Then the probability of a particular label is defined through the softmax function (see Chapter 19):

Pr[label = i | ⃗x] = softmax(⃗z)_i = exp(z_i) / ∑_{j=1}^{k} exp(z_j)    (4.11)
Problem 4.4.1. Using the result of Problem 19.2.4, verify that the definition
of logistic regression as in (4.2), (4.3) are equivalent to the definition of
multi-class regression as in (4.11).
Since exp(z) > 0 for every real number z, the model above assigns a nonzero probability to every label. In some settings that may be appropriate. But as in the case of logistic regression, at test time we also have the option of extracting a deterministic answer out of the model: the i ∈ {1, 2, ..., k} that has the largest value of ⃗θ_i · ⃗x.
∑_i Hinge(y_i ⃗w · ⃗x_i) + λ∥⃗w∥₂²    (4.12)
Let's see why regularization is sensible for SVMs, and even needed. The Hinge function (4.8) treats the point z = 1 as special. In terms of the SVM loss, this translates to the thought that having ⃗w · ⃗x > 1 is a more "confident" classification of ⃗x than just having sign(⃗w · ⃗x) be correct (i.e., ⃗w · ⃗x > 0). But this choice is arbitrary because we have not specified the scale of ⃗w. If ⃗w · ⃗x = 1/10 then scaling ⃗w by a factor of 10 ensures ⃗w · ⃗x > 1. Thus the training algorithm has cheap and meaningless ways of reducing the training loss. By applying an ℓ2 regularizer, we are able to prevent this easy route for the model, and instead force the training to find optimal weights ⃗w with a small norm.
Problem 4.5.1. Write a justification for why it makes sense to limit the ℓ2 norm of the classifier during logistic regression. How can a large norm lead to false confidence (i.e., unrealistically low training loss)?
# prepare dataset
X = ... # array of shape (n, d), each row is a d-dimensional data point
y = ... # array of shape (n), each value = -1 or +1
w = ... # array of shape (d), each value is a weight for each dimension
X_train, X_test, y_train, y_test, eta = ...
# define functions
def loss(X, y, w):
    # returns the logistic loss: sum_i log(1 + exp(-y_i * (w . x_i)))
    return np.sum(np.log(1 + np.exp(-y * (X @ w))))
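Later, the plotting code looks up the color of each data point in a Python dictionary; a sketch of that snippet (the exact colors are an assumption):

colors = {1: 'blue', -1: 'red'}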
This is equivalent to a hash table in Java. Here, 1 and −1 are the keys and "blue" and "red" are respectively their values.
We will now discuss multi-dimensional arrays in Python. There
are multiple ways to perform array indexing. For example, if X is a
2-dimensional array, both X[i][j] and X[i, j] can be used to extract the
entry at the i-th row, j-th column. It is also possible to provide a set
of rows or a set of columns to extract. The following code snippet generates an array of shape (2, 2), where each entry is from row 0 or 1 and column 0 or 2:
X[np.ix_([0, 1], [0, 2])]
(Note that the plain fancy-indexing expression X[[0, 1], [0, 2]] would instead return just the two entries X[0, 0] and X[1, 2].) We can also combine slicing with indexing, as in
X[:, 0]
This extracts the full set of rows and the column 0, or in other words, the first column of X.
Finally, we use a list comprehension to specify the plotting color for
each data point:
[colors[y_i] for y_i in y]
This is Python syntactic sugar that allows the user to create an array
while iterating over the elements of an iterator. The code snippet here
is equivalent to the following code.
color_list = []   # avoid naming the variable "list", which would shadow the built-in
for y_i in y:
    color_list.append(colors[y_i])
5 Exploring "Data Science" via Linear Regression
So far, our treatment of machine learning has been from the perspec-
tive of a computer scientist. It is important to note, however, that
models such as linear regression are useful in a variety of other fields
including the physical sciences, social sciences, etc. In this chapter,
we present case studies from different fields. Here, the inputs xi are
considered to be explanatory variables, the output y is considered to be
the effect variable, and the weights wi quantify the causal significance
of the associated inputs xi on the output y. The interpretation of
weights as a type of causality is crucial; often, the ideal method of
determining causality through a set of rigorous randomized control
trials is too expensive.
Our first case study comes from the field of economics. In 1978, Harrison and Rubinfeld released a classic study on the willingness to pay for clean air in the Boston metropolitan area. Their methodology involved an economic model called hedonic pricing,¹ which essentially estimates the value of a good by breaking it down into "constituent characteristics."
¹ This definition is paraphrased from the following Wikipedia article: https://en.wikipedia.org/wiki/Hedonic_regression
It turns out we can use linear regression to help determine which of these attributes are most important. Specifically, suppose we have a dataset of house sales where y represents the price of the house and ⃗x ∈ R^15 represents a set of house attributes.²
² x_0 is a dummy variable, and the remaining 14 coordinates x_1, ..., x_14 each correspond to an attribute.
Then, we aim to find an optimum set of weights ⃗w for the linear model:

y ≈ ∑_{i=0}^{14} w_i x_i    (5.1)
Table 5.1 lists all 14 attributes that were used in the linear regression model. Before fitting the model with these attributes, it is useful to reason intuitively about some of the attributes. For instance, we expect the weight w_5 corresponding to RM, the average number of rooms per dwelling, to be positive.
1. If the world has a problem, the data will reflect it and so will our
models
y_t ≈ ∑_{s=1}^{k} w_s X_{ts}
Unsupervised Learning
6 Clustering
6.2 Clustering
But this definition does not apply to the clusters in Figure 6.1. The
points in the middle of the plot are far away from the points on the
top right corner or the bottom left corner. So whichever cluster we
assign the middle points to, they will be farther away from some
where ⃗y_i′ is the mean of the newly defined cluster C_i′. Notice that each of the cluster costs ∑_{⃗x ∈ C_i′} ∥⃗x − ⃗y_i′∥₂² is the sum of the squared distances between a set of points ⃗x ∈ C_i′ and their mean. By Lemma 6.3.1, this sum is smaller than the sum of squared distances between the same set of points and any other point. In particular, we can compare with ⃗y_i, the mean of C_i before the update. That is,

∑_{⃗x ∈ C_i′} ∥⃗x − ⃗y_i′∥₂² ≤ ∑_{⃗x ∈ C_i′} ∥⃗x − ⃗y_i∥₂²

Now notice that the summand ∥⃗x − ⃗y_i∥₂² on the right hand side of the inequality is the squared distance between the point ⃗x and the mean ⃗y_i (before the update) of the cluster C_i′ that ⃗x is newly assigned to. In other words, we can rewrite this term as ∥⃗x − ⃗y_{i_x}∥₂² and instead sum over all points ⃗x in the dataset. That is,

∑_{i=1}^{k} ∑_{⃗x ∈ C_i′} ∥⃗x − ⃗y_i∥₂² = ∑_{⃗x ∈ D} ∥⃗x − ⃗y_{i_x}∥₂²

Finally, recall that the index i_x was defined as i_x = arg min_i ∥⃗x − ⃗y_i∥₂. In particular, if j was the index of the cluster that a data point ⃗x originally belonged to, then ∥⃗x − ⃗y_{i_x}∥₂² ≤ ∥⃗x − ⃗y_j∥₂². Therefore, we have the following inequality,

∑_{⃗x ∈ D} ∥⃗x − ⃗y_{i_x}∥₂² ≤ ∑_{j=1}^{k} ∑_{⃗x ∈ C_j} ∥⃗x − ⃗y_j∥₂²
Notice that the right hand side of the inequality is the total cost at the
beginning of the iteration.
This example also shows that clustering into two clusters can be
turned into a technique for binary classification — use training data
to come up with two clusters; at test time, compute a ±1 label for
each data point according to which of the two cluster centers it is
closer to.
2. For each data point ⃗x compute D (⃗x), the distance between ⃗x and
the nearest center which has already been chosen.
Problem 6.3.4. Argue that the optimum cost for k + 1 clusters is no more
than the optimum cost for k clusters.
Note that Problem 6.3.4 only concerns the optimum cost, which
as we discussed may not be attained by the k-means algorithm.
Nevertheless, it does suggest that we can try various values of k and
see when the cost is low enough to be acceptable.
A frequent heuristic is the elbow method: create a plot of the num-
ber of clusters vs. the final value of the cost as in Figure 6.3 and look
for an “elbow” where the objective tapers off. Note that if the dataset
is too complicated for a simple Euclidean distance cost, the data
might not be easy to cluster “nicely” meaning there is no “elbow”
shown on the plot.
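A minimal sketch of producing such a plot with scikit-learn's KMeans (the dataset X is a placeholder):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = ...   # dataset: array of shape (n, d)
costs = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    costs.append(km.inertia_)   # final value of the k-means cost for this k
plt.plot(range(1, 11), costs)
plt.xlabel('number of clusters k')
plt.ylabel('cost')
plt.show()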
# prepare dataset
from sklearn.datasets import load_digits   # imports needed for this snippet
from sklearn.preprocessing import scale
X, y = load_digits(n_class=2, return_X_y=True)
X = scale(X)
X_train, X_test = ...
# define functions
def initialize_cluster_mean(X, k):
    # X: array of shape (n, d), each row is a d-dimensional data point
    # k: number of clusters
    # returns Y: array of shape (k, d), each row is the center of a cluster
    ...

def assign_cluster(X, Y):
    # X: array of shape (n, d), each row is a d-dimensional data point
    # Y: array of shape (k, d), each row is the center of a cluster
    # returns loss, the sum of squared distance from each point to its assigned cluster
    # returns C: array of shape (n), each value is the index of the closest cluster
    ...

for i in range(max_iters):
    loss, C = assign_cluster(X, Y)
    Y = update_cluster_mean(X, k, Y)
return loss, C, Y
k = int(C.max()) + 1
from itertools import cycle
colors = cycle('bgrcmk')
for i in range(k):
    idx = (C == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=next(colors))
plt.show()
The load_digits() method loads scikit-learn's handwritten digits dataset (a small, 8 × 8-pixel cousin of MNIST), with around 180 data points per digit. The scale() method linearly scales each coordinate of the data points such that the mean is 0 and the variance is 1. The PCA() method helps visualize the digit data points, which are 64-dimensional, in the Cartesian plane (i.e., R²). See the next Chapter 7 for details on this process.
Next we prepare the dataset by calling the load_digits() method.
X, y = load_digits(n_class=2, return_X_y=True)
X = scale(X)
X_train, X_test = ...
def assign_cluster(X, Y):
    ...
    return loss, C

for i in range(max_iters):
    loss, C = assign_cluster(X, Y)
    Y = update_cluster_mean(X, k, Y)
return loss, C, Y
k = int(C.max()) + 1
from itertools import cycle
colors = cycle('bgrcmk')
for i in range(k):
    idx = (C == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=next(colors))
plt.show()
The cycle() method from the itertools package lets you iterate through
an array indefinitely, with the index wrapping around back to the
start, once it reaches the end of the array.
Now, consider the for loop section in the helper function above. We
first use Boolean conditions to concisely generate a new array.
idx = (C == i)
This generates a Boolean array with the same length as C, where each
entry is either True/False based on whether the corresponding entry in
C is equal to i. The following code is equivalent.
idx = np.zeros(C.size, dtype=bool)   # boolean array, so it can be used as a mask
for j in range(C.size):
    idx[j] = (C[j] == i)
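A sketch of the snippet that the following explanation refers to (the function name k_means and its arguments are assumptions based on the loop shown earlier):

from sklearn.decomposition import PCA

_, C, Y = k_means(X_train, k=2, max_iters=100)       # hypothetical driver returning (loss, C, Y); loss is discarded
X_2d = PCA(n_components=2).fit_transform(X_train)    # project the 64-dimensional points to 2 dimensions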
The first line of this code snippet shows how we can use the _ symbol
to selectively disregard individual return values of a function call.
The second line of code uses the PCA() method to transform the
64-dimensional data X_train into 2-dimensional data so that we can
visualize it with the scatter_plot() method. We will learn the details of
this process in the next Chapter 7.
7 Low-Dimensional Representation
Example 7.0.1. Figure 7.1 shows three vectors ⃗v1 = (3.42, −1.33, 6.94), ⃗v2 =
(7.30, 8.84, 1.95), ⃗v3 = (−7.92, −6.37, −5.66) in R3 . The three vectors
have rank 2 — they are all in the 2-dimensional linear subspace generated
(1/N) ∑_i ∥⃗v_i − ⃗v̂_i∥₂² ≤ ε    (7.1)

We say that ⃗v̂_1, ..., ⃗v̂_N are the low-rank or low-dimensional approximations of ⃗v_1, ..., ⃗v_N. Typically we will assume without loss of generality that the basis vectors are orthonormal (i.e., have ℓ2 norm equal to 1 and are pairwise orthogonal).
Example 7.1.2. Figure 7.2 shows three vectors ⃗v_1 = (3.42, −1.33, 6.94), ⃗v_2 = (7.30, 8.84, 1.95), ⃗v_3 = (−6.00, −7.69, −6.86) in R³. The three vectors have rank 2 with mean-squared error 2.5. If you choose the basis vectors ⃗u_1 = (8, 8, 4), ⃗u_2 = (1, −4, 6) and the low-rank approximations ⃗v̂_1 = ⃗v_1, ⃗v̂_2 = ⃗v_2, ⃗v̂_3 = (−7.92, −6.37, −5.66) ∈ span(⃗u_1, ⃗u_2), then

(1/3) ∑_i ∥⃗v_i − ⃗v̂_i∥₂² ≃ 2.28 ≤ 2.5

Note that the basis vectors in this example are only orthogonal and not orthonormal, but it is easy to make them orthonormal by normalizing them.
Problem 7.1.3. Show that if ⃗u_1, ⃗u_2, ..., ⃗u_k ∈ R^d is any set of orthonormal vectors and ⃗v ∈ R^d, then the vector ⃗v̂ in span(⃗u_1, ⃗u_2, ..., ⃗u_k) that minimizes ∥⃗v − ⃗v̂∥₂² is

∑_{j=1}^{k} (⃗v · ⃗u_j) ⃗u_j    (7.2)

(Hint: If α_1, α_2, ..., α_k minimize ∥⃗v − ∑_j α_j ⃗u_j∥₂², then the gradient of this expression with respect to α_1, α_2, ..., α_k must be zero.)
Problem 7.1.4. Suppose the ⃗v_i's are unit vectors² and the vectors ⃗u_1, ⃗u_2, ..., ⃗u_k were the basis vectors of a random k-dimensional subspace in R^d (that is, chosen without regard to the ⃗v_i's). Heuristically argue that the ε one would need in (7.1) would be 1 − k/d.
² Note: the maximum possible value of ε when the ⃗v_i's are unit vectors is 1. Convince yourself!
Formally, SVD takes a matrix as its input; the rows of this matrix
are the vector ⃗vi ’s. The procedure operates on this matrix to output
a low-rank approximation. We discuss details in Section 20.3. To
follow the rest of this chapter, you do not need to understand details
of the procedure. You just need to remember the fact that the best
k-dimensional representation is computable for each k. In practice,
programming languages have packages that will do the calculations for you. Below is a Python code snippet that will calculate the SVD.
from sklearn.decomposition import TruncatedSVD

# n * d data matrix whose rows are the vectors v_i
data = ...
# prepare transform on dataset matrix "data"
svd = TruncatedSVD(n_components=k)
svd.fit(data)
# apply transform to dataset and output an n * k matrix
transformed = svd.transform(data)
From Figure 7.9, we also see that the approximations are more accurate when the corresponding value of k is larger. In fact, Figure 7.10 shows the average value of ∥⃗v_i − ⃗v̂_i∥₂² / ∥⃗v_i∥₂² as a function of the rank of the approximation. Note that this value roughly represents the fraction of ⃗v lost in the approximation. It can be seen that the error is a decreasing function of k.⁵
⁵ This was also explored in Problem 7.1.5.
However, note that doing machine learning — specifically face recognition — on low-rank
By the Chain Rule, we can split this joint probability into the product
of a marginal probability and four conditional probabilities:
The first assumption says that, for example, the conditional probability

Pr[X_3 = "you" | X_1 = "I", X_2 = "love"]

can be simplified as Pr[X_3 = "you" | X_2 = "love"]. These probabilities can then be estimated from the training corpus; for instance,

Pr["I"] ≈ Count("I") / (total number of words)    (8.4)
where Count refers to the number of occurrences of the word in
the text. In other words, this is the proportion of the occurrence of
the word “I” in the entire corpus. Similarly, we can estimate the
conditional probability of the word “love” given its previous word is
"I" as

Pr["love" | "I"] ≈ Count("I love") / ∑_w Count("I" + w)    (8.5)
where the n-gram probabilities are estimated from a training corpus as the following:

Pr[w_i] ≈ Count(w_i) / (total number of words)

Pr[w_j | w_i ... w_{j−1}] ≈ Count(w_i ... w_{j−1} w_j) / ∑_w Count(w_i ... w_{j−1} w)
Pr["Yee" | "Yee"] = 2/4    Pr["Haw" | "Yee"] = 2/4
Pr["Yee" | "Haw"] = 2/3    Pr["Haw" | "Haw"] = 1/3

Then by the bigram model, the probability that we see the sentence "Yee Haw Yee" out of all sentences of length 3 can be calculated as

Pr["Yee"] × Pr["Haw" | "Yee"] × Pr["Yee" | "Haw"] = 5/8 × 2/4 × 2/3 ≃ 0.21
We want to maximize this value under the constraint ∑_{i=1}^{V} p_i = 1. A solution to this type of problem can be found via the Lagrange multiplier method. We will illustrate with an example.
Example 8.2.3. We revisit the cowperson language from Example 8.2.2. Here V = 2 and T = 8. Let p_1 = Pr["Yee"] and p_2 = Pr["Haw"]. Then the probability assigned to the training corpus by the unigram model is

Pr["Yee Haw Haw Yee Yee Yee Haw Yee"] = p_1⁵ p_2³

The Lagrange multiplier method maximizes this by considering the auxiliary function

f(p_1, p_2) = p_1⁵ p_2³ + λ(p_1 + p_2 − 1)
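Setting the partial derivatives of f to zero and solving:

∂f/∂p_1 = 5p_1⁴ p_2³ + λ = 0,    ∂f/∂p_2 = 3p_1⁵ p_2² + λ = 0

Subtracting the two equations gives 5p_1⁴ p_2³ = 3p_1⁵ p_2², i.e., 5p_2 = 3p_1. Combined with the constraint p_1 + p_2 = 1, this yields p_1 = 5/8 and p_2 = 3/8, which are exactly the fractions of "Yee" and "Haw" in the training corpus.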
Example 8.3.1. The training corpus "Yee Haw Haw Yee Yee Yee Haw Yee" actually consists of three different sentences: (1) "Yee Haw," (2) "Haw Yee Yee," and (3) "Yee Haw Yee." We can append the start and stop tokens to the corpus and transform it into

⟨s⟩ Yee Haw ⟨/s⟩ ⟨s⟩ Haw Yee Yee ⟨/s⟩ ⟨s⟩ Yee Haw Yee ⟨/s⟩
With these start and stop tokens in mind, we slightly relax the As-
sumption 2 of the n-gram model and investigate the probability of a
word w being the first or the last word of a sentence, separately from
other probabilities. We will denote these probabilities respectively as
Pr[w | ⟨s⟩] and Pr[⟨/s⟩ | w]. The former probability will be estimated
as

Pr[w | ⟨s⟩] ≈ Count(⟨s⟩ w) / (total number of sentences)    (8.11)

which is the proportion of sentences that start with the word w in the corpus. The latter probability is estimated as

Pr[⟨/s⟩ | w] ≈ Count(w ⟨/s⟩) / Count(w)    (8.12)

which is the proportion of the occurrence of w that is at the end of a sentence in the corpus.
Also, notice that other bigram probabilities are also affected when
introducing the stop tokens. In (8.6), the denominator originally did
not include the occurrence of the substring at the end of the sentence
because there was no word to follow that substring. However, if we
consider ⟨/s⟩ as a word in the dictionary, the denominator can now
include the case where the substring is at the end of the sentence.
Therefore, the denominator is just equivalent to the Count of the
substring in the corpus. Therefore, the bigram probabilities after
introducing start, stop tokens can be estimated instead as⁶

Pr[w_j | w_{j−1}] ≈ Count(w_{j−1} w_j) / Count(w_{j−1})    (8.13)

⁶ If we consider ⟨s⟩, ⟨/s⟩ as vocabularies of the dictionary, (8.13) can also include (8.11), (8.12).
Example 8.3.2. We revisit Example 8.2.2. The bigram frequency table and the bigram probability table can be recalculated as in Table 8.3 and Table 8.4.⁷
⁷ Note that the values in the Total column now correspond to the unigram count of that word.

Pr["Yee" | ⟨s⟩] = 2/3    Pr["Haw" | ⟨s⟩] = 1/3
Pr["Yee" | "Yee"] = 1/5    Pr["Haw" | "Yee"] = 2/5    Pr[⟨/s⟩ | "Yee"] = 2/5
Pr["Yee" | "Haw"] = 2/3    Pr["Haw" | "Haw"] = 0/3    Pr[⟨/s⟩ | "Haw"] = 1/3
Example 8.3.4. The probability that we see the sentence "Yee Haw Yee" in the cowperson language can be calculated as

Pr["Yee" | ⟨s⟩] × Pr["Haw" | "Yee"] × Pr["Yee" | "Haw"] × Pr[⟨/s⟩ | "Yee"] = 2/3 × 2/5 × 2/3 × 2/5 ≃ 0.07

Problem 8.3.5. Verify that (8.14) defines a probability distribution over all sentences of finite length.
Once the quadrigram model outputs the word “traitor” after the
phrase “go seek the,” the problem is even worse. As can be seen
in Figure 8.6, the quadrigram model assigns probability of 1 to the
word “Gloucester” meaning that the phrase “seek the traitor” only
appears before the word “Gloucester.” So the model has memorized
one completion of the phrase from the training text. From this exam-
ple, we can see that text production based on n-grams is sampling
and remixing text fragments seen in the training corpus.
8.4.2 Perplexity
Having described a way to train a simple language model, we now turn our attention to a formal way of testing¹⁰ a language model.
¹⁰ This method is used even for testing state-of-the-art models.
Just like any other model in ML, a language model will be given
Example 8.4.2. Consider the uniform ("clueless") model which assumes that the probability of every word is equal in any given situation. That is, if V is the vocabulary size (i.e., size of the dictionary),

Pr[w_i] = Pr[w_i | w_1 ... w_{i−1}] = 1/V

for any given w_1, ..., w_i ∈ V. This model assigns (1/V)^T to every sequence of T words, including the corpus. Therefore, the perplexity of the model is

((1/V)^T)^{−1/T} = V
8.4.5 Smoothing
One big problem with our naive definition of the perplexity of a
model is that it does not account for a zero denominator. That is,
if the model assigns probability exactly 0 to the corpus, then the
perplexity of the model will be ∞!¹⁷
¹⁷ Mathematically, it is undefined, but here assume that the result is a positive infinity that is larger than any real number.

Example 8.4.5. Suppose the phrase "green cream" never appeared in the training corpus, but the test corpus contains the sentence "You like green cream." Then a bigram model will have a perplexity of ∞ because it assigns probability 0 to the bigram "green cream."
Example 8.4.6. Recall the cowperson language with the start and stop
tokens from Example 8.3.2. Upon further research, it turns out the language
actually consists of three words: {Yee, Haw, Moo}, but the training corpus "Yee Haw Haw Yee Yee Yee Haw Yee" left out one of the words in the dictionary. By applying add-1 smoothing to the bigram model, we can recalculate the bigram frequency and the bigram probability tables as in Table 8.6 and Table 8.7.
∑_w Pr[w | w′] = 1
Also, say we want to calculate the trigram probability of “like green cream,”
which is also zero in the naive trigram model. We can approximate it instead
as
There are other variants of backoff smoothing,¹⁹ with some theory for what the "best" choice is, but we will not cover them in these notes.
¹⁹ For instance, Good-Turing and Kneser-Ney smoothing.
9 Matrix Factorization and Recommender Systems
Definition 9.1.1 (User Affinity). Given a taste vector A_i = (a_{i,1}, a_{i,2}, ..., a_{i,r}) for user i and the genre vector B_j = (b_{1,j}, b_{2,j}, ..., b_{r,j}) for movie j, we define the affinity of user i for movie j as

A_i · B_j = ∑_{k=1}^{r} a_{i,k} b_{k,j}    (9.1)

Definition 9.1.2 (Similarity Metric). Given taste vector A_i for user i and taste vector A_j for user j, we define the similarity of user i and user j as

∑_{k=1}^{r} a_{i,k} a_{j,k}    (9.2)
L(A, B) = (1/|Ω|) ∑_{(i,j)∈Ω} (M_{ij} − (AB)_{ij})²    (9.5)
During 2006-09, DVDs were all the rage. Companies like Netflix were quite interested in recommending movies as accurately as possible in order to retain clients. At the time, Netflix was using an algorithm which had stagnated around RMSE = 0.95.¹ Seeking fresh ideas, Netflix curated an anonymized database of 100M ratings (each rating was on a 1-5 scale) of 0.5M users for 18K movies. Adding a cash incentive of $1,000,000, Netflix challenged the world to come up with a model that could achieve a much lower RMSE!² It turned out that matrix factorization would be the key to achieving lower scores. In this example, m = 0.5M, n = 18K, and Ω corresponds to the 100M ratings out of m · n ≈ 10B affinities.³
¹ RMSE is shorthand for √MSE.
² This was an influential competition, and is an inspiration for today's hackathons, Kaggle, etc.
³ Less than 1% of possible elements are accounted for by Ω.
After a lengthy competition,⁴ the power of matrix factorization is on full display when we consider the final numbers:
⁴ Amazingly, a group of Princeton undergraduates managed to achieve the second place!
• Netflix's algorithm: RMSE = 0.95
L(A, B) = (1/|Ω|) ∑_{(i,j)∈Ω} (M_{ij} − (AB)_{ij})²    (9.5 revisited)
∂L(A, B)/∂A_{i′k′} = (1/|Ω|) ∑_{(i,j)∈Ω} 2(M_{ij} − (AB)_{ij}) · ∂(−(AB)_{ij})/∂A_{i′k′}
    = (1/|Ω|) ∑_{j: (i′,j)∈Ω} 2(M_{i′j} − (AB)_{i′j}) · ∂(−(AB)_{i′j})/∂A_{i′k′}
    = (1/|Ω|) ∑_{j: (i′,j)∈Ω} 2(M_{i′j} − (AB)_{i′j}) · (−B_{k′j})
    = (1/|Ω|) ∑_{j: (i′,j)∈Ω} −2B_{k′j}(M_{i′j} − (AB)_{i′j})    (9.6)

Note that the second step is derived because (AB)_{ij} = ∑_k A_{ik}B_{kj} and if i ≠ i′, then ∂(AB)_{ij}/∂A_{i′k′} = 0. Enumerating (i, j) ∈ Ω can be changed to only enumerating (i′, j) ∈ Ω. Similarly, we can consider an arbitrary element B_{k′j′}:

∂L(A, B)/∂B_{k′j′} = (1/|Ω|) ∑_{(i,j)∈Ω} 2(M_{ij} − (AB)_{ij}) · ∂(−(AB)_{ij})/∂B_{k′j′}
    = (1/|Ω|) ∑_{i: (i,j′)∈Ω} 2(M_{ij′} − (AB)_{ij′}) · ∂(−(AB)_{ij′})/∂B_{k′j′}
    = (1/|Ω|) ∑_{i: (i,j′)∈Ω} 2(M_{ij′} − (AB)_{ij′}) · (−A_{ik′})
    = (1/|Ω|) ∑_{i: (i,j′)∈Ω} −2A_{ik′}(M_{ij′} − (AB)_{ij′})    (9.7)
∂L̂(A, B)/∂A_{i′k′} = (1/|S|) ∑_{j: (i′,j)∈S} −2B_{k′j}(M_{i′j} − (AB)_{i′j})    (9.9)

∂L̂(A, B)/∂B_{k′j′} = (1/|S|) ∑_{i: (i,j′)∈S} −2A_{ik′}(M_{ij′} − (AB)_{ij′})    (9.10)
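A minimal numpy sketch of stochastic gradient descent with these updates, processing one observed rating at a time (all names and the toy data are only for illustration):

import numpy as np

def sgd_matrix_factorization(ratings, m, n, r, eta=0.01, epochs=100):
    # ratings: dict mapping an observed pair (i, j) in Omega to the rating M_ij
    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.1, size=(m, r))
    B = rng.normal(scale=0.1, size=(r, n))
    for _ in range(epochs):
        for (i, j), M_ij in ratings.items():
            err = M_ij - A[i] @ B[:, j]          # M_ij - (AB)_ij
            A_i_old = A[i].copy()
            A[i] += 2 * eta * err * B[:, j]      # gradient step from (9.9) with |S| = 1
            B[:, j] += 2 * eta * err * A_i_old   # gradient step from (9.10) with |S| = 1
    return A, B

# toy usage: 3 users, 4 movies, 5 observed ratings
ratings = {(0, 0): 5.0, (0, 2): 3.0, (1, 1): 4.0, (2, 3): 1.0, (2, 0): 2.0}
A, B = sgd_matrix_factorization(ratings, m=3, n=4, r=2)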
Deep Learning
10 Introduction to Deep Learning
Deep learning does not represent a specific model per se, but
rather categorizes a group of models called (artificial) neural net-
works (NNs) (or deep nets) which involve several computational
layers. Linear models studied in earlier chapters, such as logistic
regression in Section 4.2, can be seen as special sub-cases involving
only a single layer. The main difference, however, is that general deep
nets employ nonlinearity in between each layer, which allows a much
broader scale of expressivity. Also, the multiple layers in a neural net
can be viewed as computing “intermediate representations” of the
data, or “high level features” before arriving at its final answer. By
contrast, a linear model works only with the data representation it
was given.
Deep nets come in various types, including Feed-Forward NNs (FFNNs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Residual Nets, and Transformers.² Training uses a variant of Gradient Descent, and the gradient of the loss is computed using an algorithm called backpropagation.
² Interestingly, a technique called Neural Architecture Search uses deep learning to design custom deep learning architectures for a given task.
Due to the immense popularity of deep learning, a variety of
software environments such as Tensorflow and PyTorch allow quick
implementation of deep learning models. You will encounter them in
the homework.
these models.⁵ However, by the 21st century deep learning had gone out of fashion. This changed in 2012, when Krizhevsky, Sutskever, and Hinton leveraged deep learning techniques through their AlexNet model and set new standards for performance on the ImageNet dataset.⁶ Deep learning has since seen a resurgence throughout the last decade, boosted by some key factors:
⁵ Paper: https://www.nature.com/articles/323533a0.
⁶ Paper: https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
• Hardware, such as GPU and TPU (Tensor Processing Unit, specifically developed for neural network machine learning) technology, has made training faster.
[Figure: a single neuron with inputs x_1, ..., x_5, edge weights w_1, ..., w_5, computing f(⃗w · ⃗x).]
If the activation f is the sigmoid σ, the neuron will output

y = σ(⃗w · ⃗x) = σ(0) = 1/2

If the activation is ReLU, it will output:

[⃗w · ⃗x]₊ = [0]₊ = 0
values produced by all nodes that have a directed edge to it. If the directed graph of connections is acyclic — which is the case in most popular architectures — this process of producing the values takes finite time and we end up with a unique value at each of the output nodes.¹¹ The term hidden nodes is used for nodes that are not input or output nodes.
¹¹ We will not study Recurrent Neural Nets (RNNs), where the graph contains cycles. These used to be popular until a few years ago, and present special difficulties due to the presence of directed loops. For instance, can you come up with instances where the output is not well-defined?

10.3 Why Deep Learning?

Now that we are aware of the basic building blocks of neural networks,
Proof. Assume to the contrary that there are such values. Let ⃗x_1 = (0, 0), ⃗x_2 = (0, 1), ⃗x_3 = (1, 0), ⃗x_4 = (1, 1). Then we know that

g(⃗w · ⃗x_1 + b) = g(⃗w · ⃗x_4 + b) = 0
g(⃗w · ⃗x_2 + b) = g(⃗w · ⃗x_3 + b) = 1

which means that

⃗w · ⃗x_1 + b ≤ 0,    ⃗w · ⃗x_4 + b ≤ 0
⃗w · ⃗x_2 + b > 0,    ⃗w · ⃗x_3 + b > 0

Now let ⃗x = (1/2, 1/2). Since we have ⃗x = (1/2)⃗x_1 + (1/2)⃗x_4, we should have

⃗w · ⃗x + b = (1/2)((⃗w · ⃗x_1 + b) + (⃗w · ⃗x_4 + b)) ≤ 0

since we are taking the average of two non-positive numbers. But at the same time, since ⃗x = (1/2)⃗x_2 + (1/2)⃗x_3, we should have

⃗w · ⃗x + b = (1/2)((⃗w · ⃗x_2 + b) + (⃗w · ⃗x_3 + b)) > 0

since we are taking the average of two positive numbers. This leads to a contradiction.
Figure 10.5 visualizes the truth table for XOR in the 2D plane. The
two axes represent the two inputs x1 and x2 ; blue circles denote that
y = 1; and white circles denote that y = 0. A single linear neuron can
be understood as drawing a red line that can separate white points
from blue points. Notice that it is possible to draw such a line for
AND and OR functions, but the data points that characterize the
XOR function are not linearly separable.
[Figure 10.6: a two-layer neural network with hidden nodes h_1, h_2 and bias terms b that computes the XOR function.]
Problem 10.3.3. Verify that the model in Figure 10.6 represents the XOR
function by constructing a truth table.
The network in Figure 10.6 has two layers of neurons. If we only focus on the final layer of the
neural network, we expect the boundary of the binary classification
to be linear in the values of h1 , h2 . However, the values of h1 , h2 are
not linear in the input values x1 , x2 because the hidden nodes utilize a
nonlinear activation function. Hence the boundary of the classification
is also not linear in the input values x1 , x2 . The nonlinear activation
function transforms the input space into a space where the XOR
operation is linearly separable. As shown in Figure 10.7, the h space is
quite clearly linearly separable in contrast to the original x space.
Example 10.4.1. Say ⃗o = (3, 0, 1) are the values of the output neurons of a neural network before applying the activation function. If we decide to apply the softmax function as the activation function, the final outputs of the network will be softmax(⃗o) ≃ (0.84, 0.04, 0.11). If the network was trying to solve a multi-class classification task, we can understand that the given input is most likely to be of class 1, with probability 0.84 according to the model.

Example 10.4.2. Consider two vectors ⃗z_1 = (5, 2, 3) and ⃗z_2 = (3, 0, 1). Then softmax(⃗z_1) = softmax(⃗z_2) because ⃗z_2 = ⃗z_1 − (2, 2, 2).

Problem 10.4.3. Prove the property that softmax(⃗z) = softmax(⃗z + c · ⃗1) for any c ∈ R. (Hint: multiply both the numerator and the denominator of softmax(⃗z)_k by exp(c).)
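A quick numerical check of this property (a small helper written directly in numpy):

import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

z1 = np.array([5.0, 2.0, 3.0])
z2 = np.array([3.0, 0.0, 1.0])   # z1 shifted by the constant vector (2, 2, 2)
print(softmax(z1))   # approximately (0.84, 0.04, 0.11)
print(softmax(z2))   # identical output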
11 Feedforward Neural Network and Backpropagation
Note that not all layers of feedforward neural networks are nec-
essarily fully-connected (a typical case is a Convolutional Neural
Network, which we will explore in Chapter 12). However, feedfor-
ward neural networks with fully-connected layers are very common
and also easy to implement.
The two hidden nodes in the first hidden layer are characterized
by the following equations:
h_1⁽¹⁾ = ReLU(2x_1 − 3x_2)
h_2⁽¹⁾ = ReLU(−x_1 + x_2 + 2x_3)    (11.1)

and the two hidden nodes in the second hidden layer are characterized by a similar set of equations (11.2). The output node is

o = −h_1⁽²⁾ + 2h_2⁽²⁾

and, for the example input used here, the second hidden layer evaluates to

h_1⁽²⁾ = ReLU(0 + 2 · 2) = 4
h_2⁽²⁾ = ReLU(0 − 2 · 2) = 0
Networks can have more than one output node. An example is the
network in Figure 11.2.
The networks in Figure 11.1 and Figure 11.2 are the same except
for the output layer; the former has one output node, while the latter
has three output nodes. Now the output values of the network in
Figure 11.2 can be calculated as:
o_1 = −h_1⁽²⁾ + 2h_2⁽²⁾
o_2 = 2h_1⁽²⁾ + h_2⁽²⁾    (11.3)
o_3 = h_1⁽²⁾ + 2h_2⁽²⁾

ô_1 = softmax(o_1, o_2, o_3)_1
ô_2 = softmax(o_1, o_2, o_3)_2    (11.4)
ô_3 = softmax(o_1, o_2, o_3)_3
o_1 = −4 + 0 = −4
o_2 = 2 · 4 + 0 = 8
o_3 = 4 + 0 = 4

ô_1 = softmax(−4, 8, 4)_1 = e⁻⁴ / (e⁻⁴ + e⁸ + e⁴) ≃ 0.00
ô_2 = softmax(−4, 8, 4)_2 = e⁸ / (e⁻⁴ + e⁸ + e⁴) ≃ 0.98
ô_3 = softmax(−4, 8, 4)_3 = e⁴ / (e⁻⁴ + e⁸ + e⁴) ≃ 0.02
Notice that if we set $\vec{x} = (x_1, x_2, x_3) \in \mathbb{R}^3$ and $\vec{h}^{(1)} = (h_1^{(1)}, h_2^{(1)}) \in \mathbb{R}^2$ and define a matrix $W^{(1)} \in \mathbb{R}^{2 \times 3}$ whose $(i, j)$ entry is $w_{i,j}^{(1)}$, then we can further rewrite (11.1) as 1
$$\vec{h}^{(1)} = \mathrm{ReLU}\left(W^{(1)} \vec{x}\right) \tag{11.5}$$
where the ReLU function is applied element-wise.
1 Here we interpret the vectors $\vec{x}, \vec{h}^{(1)}$ as column vectors, or equivalently a $3 \times 1$ matrix and a $2 \times 1$ matrix respectively. This will be a convention throughout this chapter.
Similarly, if we let $w_{i,j}^{(2)}$ be the weight between the $i$-th node $h_i^{(2)}$ in the second hidden layer and the $j$-th node $h_j^{(1)}$ in the first hidden layer, (11.2) can be rewritten as
$$h_1^{(2)} = \mathrm{ReLU}(w_{1,1}^{(2)} h_1^{(1)} + w_{1,2}^{(2)} h_2^{(1)}), \qquad h_2^{(2)} = \mathrm{ReLU}(w_{2,1}^{(2)} h_1^{(1)} + w_{2,2}^{(2)} h_2^{(1)})$$
or in matrix notation as
$$\vec{h}^{(2)} = \mathrm{ReLU}\left(W^{(2)} \vec{h}^{(1)}\right) \tag{11.6}$$
where $\vec{h}^{(2)} = (h_1^{(2)}, h_2^{(2)}) \in \mathbb{R}^2$ and $W^{(2)} \in \mathbb{R}^{2 \times 2}$ is a matrix whose $(i, j)$ entry is $w_{i,j}^{(2)}$.
Next, if we let $w_{i,j}^{(o)}$ be the weight between the $i$-th output node $o_i$ (before softmax) and the $j$-th node $h_j^{(2)}$ in the second hidden layer, (11.3) can be rewritten as
$$o_1 = w_{1,1}^{(o)} h_1^{(2)} + w_{1,2}^{(o)} h_2^{(2)}, \qquad o_2 = w_{2,1}^{(o)} h_1^{(2)} + w_{2,2}^{(o)} h_2^{(2)}, \qquad o_3 = w_{3,1}^{(o)} h_1^{(2)} + w_{3,2}^{(o)} h_2^{(2)}$$
or in matrix notation as
$$\vec{o} = W^{(o)} \vec{h}^{(2)} \tag{11.7}$$
For the running example, the final output after the softmax is
$$\hat{\vec{o}} = \mathrm{softmax}(\vec{o}) = \left(\frac{e^{-4}}{e^{-4} + e^{8} + e^{4}}, \; \frac{e^{8}}{e^{-4} + e^{8} + e^{4}}, \; \frac{e^{4}}{e^{-4} + e^{8} + e^{4}}\right)$$
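The following NumPy sketch carries out this forward pass in matrix form. The first-layer and output-layer weights are read off from (11.1) and (11.3); the second-layer weights and the input $\vec{x} = (0, 0, 1)$ are assumptions chosen so that the hidden values match the running example.

import numpy as np

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

W1 = np.array([[ 2.0, -3.0, 0.0],      # from (11.1)
               [-1.0,  1.0, 2.0]])
W2 = np.array([[ 1.0,  2.0],           # assumed second-layer weights
               [ 2.0, -2.0]])
Wo = np.array([[-1.0, 2.0],            # from (11.3)
               [ 2.0, 1.0],
               [ 1.0, 2.0]])
x = np.array([0.0, 0.0, 1.0])          # assumed input giving h^(1) = (0, 2)

h1 = relu(W1 @ x)      # [0. 2.]
h2 = relu(W2 @ h1)     # [4. 0.]
o = Wo @ h2            # [-4.  8.  4.]
print(softmax(o))      # ~ [0.00 0.98 0.02]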
for each k = 1, 2, . . . , L.
If $d_{out} = 1$, we let $o = W^{(L)} \vec{h}^{(L-1)}$ denote the final output of the model. If $d_{out} > 1$, we let $\vec{o} = W^{(L)} \vec{h}^{(L-1)}$ denote the output layer before the softmax and $\hat{\vec{o}} = f^{(L)}(\vec{o})$ denote the output layer after the softmax.
Example 11.2.1. The number of weights in the model in Figure 11.2 can be
calculated as
3 × 2 + 2 × 2 + 2 × 3 = 16
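A quick PyTorch check of this count, as a minimal sketch assuming the layer sizes 3, 2, 2, 3 and no bias terms (matching the weight count above):

import torch.nn as nn

net = nn.Sequential(
    nn.Linear(3, 2, bias=False), nn.ReLU(),
    nn.Linear(2, 2, bias=False), nn.ReLU(),
    nn.Linear(2, 3, bias=False),
)
print(sum(p.numel() for p in net.parameters()))   # 16 = 3*2 + 2*2 + 2*3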
$$\hat{\vec{o}} = \mathrm{softmax}(W' \vec{x})$$
where $\vec{x} \in \mathbb{R}^{d_{in}}$ is the input vector, $y$ is its gold value (i.e., actual value in the training data), and $o = W^{(o)} \vec{h}^{(L-1)}$ is the final output (i.e., prediction) of the neural network.
(y − o )2 = (0 − (−4))2 = 16
When there are multiple nodes in the output layer (e.g., for multi-
class classification), we can use a loss function that is similar to the
logistic loss. Recall that in logistic regression, we defined the logistic
$$\sum_{(\vec{x}, y) \in \mathcal{D}} -\log \hat{o}_y \qquad \text{(Cross-Entropy Loss)}$$
where $y \in \{1, \ldots, d_{out}\}$ denotes the gold label, and $\hat{o}_y$ denotes the probability that the model assigns to class $y$, that is, the $y$-th coordinate of the output vector $\hat{\vec{o}} = \mathrm{softmax}(\vec{o})$ after applying the softmax function.
$$-\log \hat{o}_y = -\log \frac{e^{4}}{e^{-4} + e^{8} + e^{4}} \simeq 4.02$$
But once a neural network grows in size, the second step of cal-
culating the gradients starts to become a problem. Naively trying to
calculate each of the gradients separately becomes inefficient. Instead,
Backpropagation 2 is an efficient way to calculate the gradients with respect to the network parameters such that the number of steps for the computation is linear in the size of the neural network.
2 Reference: https://www.nature.com/articles/323533a0
The key idea is to perform the computation in a very specific
sequence — from the output layer to the input layer. By the Chain
Rule, we can use the already computed gradient value of a node in a
higher layer in the computation of the gradient of a node in a lower
layer. This way, the gradient values propagate back from the top layer
to the bottom layer.
Using the updated values of the first hidden layer, the second hidden layer will be calculated as
$$h_1^{(2)} = \mathrm{ReLU}(1 \cdot 3 + 2 \cdot (2 + \Delta)) = 7 + 2\Delta$$
$$h_2^{(2)} = \mathrm{ReLU}(2 \cdot 3 + (-2) \cdot (2 + \Delta)) = 2 - 2\Delta$$
This shows that the rate of change of $h_1^{(2)}$ and $h_2^{(2)}$ with respect to change of $w_{2,2}^{(1)}$ is 2 and $-2$ respectively. Finally, the output layer can now be calculated as
$$o_1 = (-1) \cdot (7 + 2\Delta) + 2 \cdot (2 - 2\Delta) = -3 - 6\Delta$$
This shows that the rate of change of $o_1$ with respect to change of $w_{2,2}^{(1)}$ is $-6$.
Example 11.3.2. Now we consider the meaning of $\partial o_1 / \partial h_1^{(2)}$: how changing the value of $h_1^{(2)}$ by an infinitesimal $\Delta$ affects $o_1$. Note that this is a thought experiment that does not correspond to a change that is possible if the net were a physical object constructed of nodes and wires; the value of $h_1^{(2)}$ is completely decided by the previous layers and cannot be changed in isolation without touching the previous layers. However, treating these values as variables, it is possible to put on a calculus hat and think about the rate of change of one with respect to the other.
If only the value of $h_1^{(2)}$ is changed from 7 to $7 + \Delta$ in Figure 11.3, then $o_1$ can be calculated as
$$o_1 = (-1) \cdot (7 + \Delta) + 2 \cdot 2 = -3 - \Delta$$
which shows that $\partial o_1 / \partial h_1^{(2)} = -1$.
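The hand-computed derivative $\partial o_1 / \partial w_{2,2}^{(1)} = -6$ can be verified with autograd. This is a sketch under two assumptions: the second-layer weights are the ones read off from the worked example above, and the input $\vec{x} = (3, 1, 2)$ is chosen so that the first hidden layer takes the values $(3, 2)$ used in Example 11.3.1.

import torch

W1 = torch.tensor([[ 2.0, -3.0, 0.0],
                   [-1.0,  1.0, 2.0]], requires_grad=True)
W2 = torch.tensor([[ 1.0,  2.0],         # assumed second-layer weights
                   [ 2.0, -2.0]])
x = torch.tensor([3.0, 1.0, 2.0])         # assumed input with h^(1) = (3, 2)

h1 = torch.relu(W1 @ x)    # (3., 2.)
h2 = torch.relu(W2 @ h1)   # (7., 2.)
o1 = -h2[0] + 2 * h2[1]    # -3.
o1.backward()
print(W1.grad[1, 1])       # tensor(-6.), i.e., do1/dw_{2,2}^{(1)}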
Some readers may notice that (11.10) is just the result of the Chain Rule in multivariable calculus. It shows that the effect of $w_{2,2}^{(1)}$ on $o_1$ is the sum of its effects through all possible paths that the values propagate through the network, and the amount of effect for each path can be calculated by multiplying the appropriate partial derivatives between each layer.
In this section, we calculated by hand one partial derivative $\partial o_1 / \partial w_{2,2}^{(1)}$. But in general, to compute the loss gradient, we see below that we want to calculate the partial derivative of each output node with respect to every weight of the network.
Notice that the partial derivative $\partial y_i / \partial x_j$ is equal to $A_{ij}$. This means that the $(i, j)$ entry of the Jacobian matrix is the $(i, j)$ entry of the matrix $A$, and hence $J(\vec{y}, \vec{x}) = A$.
$$y_1 = 2x_1 - x_2 + 3x_3, \qquad y_2 = -x_1 + 2x_3$$
We can also denote this with an indicator function $\mathbb{1}(x_i > 0)$. Also for any $j \neq i$, we see that $\partial y_i / \partial x_j = 0$. Therefore, the Jacobian matrix $J(\vec{y}, \vec{x})$ is a diagonal matrix whose entry down the diagonal is $\mathbb{1}(x_i > 0)$; that is, $J(\vec{y}, \vec{x}) = \mathrm{diag}(\mathbb{1}(\vec{x} > 0))$.
for any layer $k$. The matrix $\frac{\partial \ell}{\partial W^{(k)}}$ will be called the gradient with respect to the weights of the $k$-th layer. 6 Now the update rule for the gradient descent algorithm can be written as the following:
$$W^{(k)} \rightarrow W^{(k)} - \eta \cdot \frac{\partial \ell}{\partial W^{(k)}} \tag{11.13}$$
6 Alternatively, you can think of flattening $W^{(k)}$ into a single vector, then finding the Jacobian matrix $\partial \ell / \partial W^{(k)}$, and later converting it back to a matrix form.
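As a minimal illustration of (11.13), the following sketch runs one forward pass, one call to backpropagation, and one manual gradient-descent step on a tiny (made-up) network:

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 2, bias=False), nn.ReLU(),
                    nn.Linear(2, 1, bias=False))
x, y = torch.randn(1, 3), torch.zeros(1, 1)
loss = (net(x) - y).pow(2).sum()   # squared loss
loss.backward()                    # fills in p.grad for every weight matrix

eta = 0.01
with torch.no_grad():
    for p in net.parameters():
        p -= eta * p.grad          # W^(k) <- W^(k) - eta * dl/dW^(k)
        p.grad.zero_()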
1. Output Layer: First recall that the cross-entropy loss due to one data point is
$$\ell = -\log \frac{e^{o_y}}{\sum_{i=1}^{d_{out}} e^{o_i}} = -\log(e^{o_y}) + \log\left(\sum_{i=1}^{d_{out}} e^{o_i}\right) = -o_y + \log\left(\sum_{i=1}^{d_{out}} e^{o_i}\right)$$
$$\frac{\partial \ell}{\partial \vec{h}^{(L-1)}} = \frac{\partial \ell}{\partial \vec{o}} J(\vec{o}, \vec{h}^{(L-1)}) = \frac{\partial \ell}{\partial \vec{o}} W^{(o)} \tag{11.15}$$
Now as an inductive hypothesis, assume that we have already com-
puted the gradient (or Jacobian matrix) ∂ℓ/∂⃗h(k+1) . We now compute
$$\frac{\partial \ell}{\partial \vec{h}^{(k)}} = \frac{\partial \ell}{\partial \vec{h}^{(k+1)}} J(\vec{h}^{(k+1)}, \vec{h}^{(k)}) = \frac{\partial \ell}{\partial \vec{h}^{(k+1)}} \mathrm{diag}\left(\mathbb{1}(W^{(k+1)} \vec{h}^{(k)} > 0)\right) W^{(k+1)} \tag{11.16}$$
Therefore, the gradient with respect to $w_{i,j}^{(o)}$ can be calculated as
$$\frac{\partial \ell}{\partial w_{i,j}^{(o)}} = \frac{\partial \ell}{\partial o_i} \cdot \frac{\partial o_i}{\partial w_{i,j}^{(o)}} = \frac{\partial \ell}{\partial o_i} \cdot h_j^{(L-1)}$$
$$\frac{\partial \ell}{\partial w_{i,j}^{(k)}} = \frac{\partial \ell}{\partial z_i^{(k)}} \cdot \frac{\partial z_i^{(k)}}{\partial w_{i,j}^{(k)}} = \frac{\partial \ell}{\partial z_i^{(k)}} \cdot h_j^{(k-1)}$$
$$\begin{aligned}
\frac{\partial \ell}{\partial W^{(k)}} &= \left(\frac{\partial \ell}{\partial \vec{z}^{(k)}}\right)^{\!\top} \left(\vec{h}^{(k-1)}\right)^{\!\top} = \left(\frac{\partial \ell}{\partial \vec{h}^{(k)}} J(\vec{h}^{(k)}, \vec{z}^{(k)})\right)^{\!\top} \left(\vec{h}^{(k-1)}\right)^{\!\top} \\
&= J(\vec{h}^{(k)}, \vec{z}^{(k)})^{\top} \left(\frac{\partial \ell}{\partial \vec{h}^{(k)}}\right)^{\!\top} \left(\vec{h}^{(k-1)}\right)^{\!\top} = \mathrm{diag}\left(\mathbb{1}(W^{(k)} \vec{h}^{(k-1)} > 0)\right) \left(\frac{\partial \ell}{\partial \vec{h}^{(k)}}\right)^{\!\top} \left(\vec{h}^{(k-1)}\right)^{\!\top}
\end{aligned} \tag{11.18}$$
1. Compute the Jacobian matrix with respect to the output layer:
$$\frac{\partial \ell}{\partial \vec{o}} = \left(\hat{\vec{o}} - \vec{e}_y\right)^{\top} \qquad \text{((11.14) revisited)}$$
2. Compute the Jacobian matrix with respect to the last hidden layer, $\frac{\partial \ell}{\partial \vec{h}^{(L-1)}} \in \mathbb{R}^{1 \times d_{L-1}}$:
$$\frac{\partial \ell}{\partial \vec{h}^{(L-1)}} = \frac{\partial \ell}{\partial \vec{o}} W^{(o)} \qquad \text{((11.15) revisited)}$$
Then compute the Jacobian matrix with respect to each earlier hidden layer:
$$\frac{\partial \ell}{\partial \vec{h}^{(k)}} = \frac{\partial \ell}{\partial \vec{h}^{(k+1)}} \mathrm{diag}\left(\mathbb{1}(W^{(k+1)} \vec{h}^{(k)} > 0)\right) W^{(k+1)} \qquad \text{((11.16) revisited)}$$
Next, compute the gradient with respect to the $(k+1)$-th hidden layer weights $\frac{\partial \ell}{\partial W^{(k+1)}} \in \mathbb{R}^{d_{k+1} \times d_k}$:
$$\frac{\partial \ell}{\partial W^{(k+1)}} = \mathrm{diag}\left(\mathbb{1}(W^{(k+1)} \vec{h}^{(k)} > 0)\right) \left(\frac{\partial \ell}{\partial \vec{h}^{(k+1)}}\right)^{\!\top} \left(\vec{h}^{(k)}\right)^{\!\top} \qquad \text{((11.18) revisited)}$$
Note that these instructions are based on a model that adopts the
cross-entropy loss and the ReLU activation function. Using alterna-
tive losses and/or activation functions would result in a similar form,
although the actual derivatives may be slightly different.
$$\sigma'(z) = \left(\frac{1}{1 + e^{-z}}\right)' = \frac{e^{-z}}{(1 + e^{-z})^2} = \sigma(z) \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \cdot (1 - \sigma(z)) \tag{11.19}$$
There is also the hyperbolic tangent function $\tanh(z) = \frac{e^{2z} - 1}{e^{2z} + 1}$.
Problem 11.4.7. Compute f ′ (z) for f (z) = tanh(z); show how f ′ (z) can
be written in terms of f (z).
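A quick numerical sanity check of (11.19) with autograd; the same scaffold can be reused to check your answer to Problem 11.4.7 by swapping in torch.tanh:

import torch

z = torch.linspace(-4.0, 4.0, 9, requires_grad=True)
s = torch.sigmoid(z)
s.sum().backward()                            # z.grad now holds sigma'(z) elementwise
print(torch.allclose(z.grad, s * (1 - s)))    # True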
$$\frac{\partial \ell}{\partial \vec{h}^{(k)}} = \frac{\partial \ell}{\partial \vec{h}^{(k+1)}} \mathrm{diag}\left(g(W^{(k+1)} \vec{h}^{(k)})\right) W^{(k+1)}$$
$$\frac{\partial \ell}{\partial W^{(k+1)}} = \mathrm{diag}\left(g(W^{(k+1)} \vec{h}^{(k)})\right) \left(\frac{\partial \ell}{\partial \vec{h}^{(k+1)}}\right)^{\!\top} \left(\vec{h}^{(k)}\right)^{\!\top}$$
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.activation = nn.ReLU()

net = Net()
$$y_{t+1} = w_1 x_t + w_2 x_{t+1} + w_3 x_{t+2} = \vec{w} \cdot (x_t, x_{t+1}, x_{t+2})$$
$$y_{t+2} = w_1 x_{t+1} + w_2 x_{t+2} + w_3 x_{t+3} = \vec{w} \cdot (x_{t+1}, x_{t+2}, x_{t+3})$$
Notice that we are reusing the same weights and applying them to multiple
different values of xt to calculate yt . It is almost like sliding a filter down the
array of xt ’s and applying it to every set of 3 consecutive inputs. For this
reason, we call w⃗ the convolution filter weight of length 3.
Example 12.1.2. Consider an input sequence ⃗x = (2, 1, 1, 7, −1, 2, 3, 1) and
⃗ = (3, 2, −1). The first two output values will be:
a convolution filter w
y1 = 2 × 3 + 1 × 2 + 1 × (−1) = 7
y2 = 1 × 3 + 1 × 2 + 7 × (−1) = −2
Following a similar calculation for the other values, we see that the full
output sequence is ⃗y = (7, −2, 18, 17, −2, 11). Note that the length of ⃗y
should be $|\vec{x}| - |\vec{w}| + 1 = 8 - 3 + 1 = 6$.
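The whole output sequence of Example 12.1.2 can be reproduced by sliding the filter manually, as a short NumPy sketch:

import numpy as np

x = np.array([2, 1, 1, 7, -1, 2, 3, 1])
w = np.array([3, 2, -1])
# apply the filter at every position where it fits completely inside x
y = np.array([w @ x[t:t + len(w)] for t in range(len(x) - len(w) + 1)])
print(y)   # [ 7 -2 18 17 -2 11]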
Problem 12.1.3. If yt = 2xt−1 − xt+1 , yt+1 = 2xt − xt+2 , and yt+2 =
2xt+1 − xt+3 , what is the convolution filter weight?
The result is a new image and we can view each filter as a transfor-
mation which takes an image and returns an image. In the above
equation, the filter size is (2k + 1) × (2k + 1). For example, if k = 1, we
can consider the convolution weight filter to be
$$\begin{pmatrix} w_{-1,-1} & w_{-1,0} & w_{-1,+1} \\ w_{0,-1} & w_{0,0} & w_{0,+1} \\ w_{+1,-1} & w_{+1,0} & w_{+1,+1} \end{pmatrix}$$
The filter can only be applied to an image of size m × n at a location
where the filter completely fits inside the image. Therefore, the
locations in the input image where the center of the filter can be
placed are k < i ≤ m − k, k < j ≤ n − k and the size of the output
image is (m − 2k) × (n − 2k).
Example 12.2.3. If the input and convolution filter are given as follows:
$$X = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 \end{pmatrix} \quad \text{and} \quad W = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}$$
then the pixel at (1, 1) of the resulting image can be calculated by applying the filter at the top left corner of the input image. That is, we take the inner product of the top-left $3 \times 3$ block of $X$ with the filter $W$, which is
$$1 \times 1 + 1 \times 0 + 1 \times 1 + 0 \times 0 + 1 \times 1 + 1 \times 0 + 0 \times 1 + 0 \times 0 + 1 \times 1 = 4$$
Therefore, the (1, 1) entry of the resulting image is 4. Similarly, the remaining pixels of the resulting image can be calculated by moving around the filter as in Figure 12.4. The output image is given as:
$$Y = \begin{pmatrix} 4 & 3 & 4 \\ 2 & 4 & 3 \\ 2 & 3 & 4 \end{pmatrix}$$
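The output matrix above can be reproduced with a short NumPy sketch that applies the filter at every location where it fits completely inside the image:

import numpy as np

X = np.array([[1,1,1,0,0],
              [0,1,1,1,0],
              [0,0,1,1,1],
              [0,0,1,1,0],
              [0,1,1,0,0]])
W = np.array([[1,0,1],
              [0,1,0],
              [1,0,1]])
m, n, k = X.shape[0], X.shape[1], W.shape[0]
Y = np.array([[np.sum(X[i:i+k, j:j+k] * W) for j in range(n - k + 1)]
              for i in range(m - k + 1)])
print(Y)   # [[4 3 4] [2 4 3] [2 3 4]]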
12.2.2 Padding
In standard 2D convolution, the size of the output image is not equal
to the size of the input image because we only consider locations
where the filter fits completely in the image. However, sometimes
we may want their sizes to be the same. In such a case, we apply a
technique called padding. The idea is to pad pixels to all four edges
of the input image (left, right, up, and down) so that the number of
valid locations for the filter is the same as the number of pixels in the
original image. In particular, if the filter size is (2k + 1) × (2k + 1), we
need to pad k pixels on each side.
There are multiple ways to implement padding. Zero padding is
when the values at all padded pixels are set to 0. “Same” padding is
when the values at padded pixels are set to the value of the nearest
pixel at the edge of the input image. In practice, zero padding is a
more common form of padding (it is equivalent to adding a black
frame around the image).
12.2.5 Channels
In general, we do not only use one convolution filter. We construct
a network of multiple layers, and for each of the layers, we apply
multiple convolutional filters. Different filters will be able to detect
different features of the image (e.g., one filter detects edges and one
filter detects dots), and we want to apply different filters indepen-
dently on the input image. The result of applying a given filter on a
single input image is called a channel. We can stack the channels to
create a 3D structure, as shown in Figure 12.8.
where nin is the number of input channels, X(u) is the image at the
u-th input channel, Y(v) is the image at the v-th output channel, and
W(u,v) is the filter between the u-th input channel and the v-th output
channel.
Example 12.2.6. Assume there are 6 input channels and 3 output channels, and the filter size is 5 × 5. Then for each of the 6 × 3 = 18 pairs of input and output channels, we have a kernel of weights of size 5 × 5, so there are a total of 6 × 3 × 5 × 5 = 450 weights.
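The same count can be read off a PyTorch convolutional layer, assuming no bias terms:

import torch.nn as nn

conv = nn.Conv2d(in_channels=6, out_channels=3, kernel_size=5, bias=False)
print(conv.weight.shape)     # torch.Size([3, 6, 5, 5])
print(conv.weight.numel())   # 450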
12.2.6 Pooling
Pooling is another popular way to reduce the size of the output of a
convolutional layer. In contrast to stride, which applies convolution
operation every s pixels, pooling partitions each image (channel) to
patches of size ∆ × ∆ and performs a reduction operation on each
patch. You can think of this as similar to what happens when you
lower the resolution of an image. The reduction operation can either
involve taking the max of all the values in the patch (“max-pooling”):
$$y_{i,j} = \max_{1 \le r,s \le \Delta} X_{(i-1)\cdot\Delta + r,\; (j-1)\cdot\Delta + s}$$
or taking the average of all the values in the patch (“average-pooling”):
$$y_{i,j} = \frac{1}{\Delta^2} \sum_{r,s=1}^{\Delta} X_{(i-1)\cdot\Delta + r,\; (j-1)\cdot\Delta + s}$$
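A minimal sketch of both reductions with patch size $\Delta = 2$ on a made-up 4 × 4 image:

import torch
import torch.nn as nn

x = torch.arange(16.0).reshape(1, 1, 4, 4)   # batch and channel dimensions of size 1
print(nn.MaxPool2d(2)(x))    # each output pixel is the max of a 2x2 patch
print(nn.AvgPool2d(2)(x))    # each output pixel is the mean of a 2x2 patch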
automatically!
Finally, the above convolutional neural network is still a simple
network, compared to modern convolutional neural networks. Inter-
ested students can look up architectures such as AlexNet, Inception,
VGG, ResNet and DenseNet.
For example, suppose the net has to decide if the input image has
a triangle anywhere in it. If the net were fully connected, it would
have to learn how to detect triangles centered at every possible pixel
location (i, j). By contrast, if a simple convolutional filter can detect
a triangle, we can just replicate this filter in all patches to detect a
triangle anywhere in the image.
Now consider the CNN architecture in Figure 12.12. The architec-
ture has two convolutional layers, the first with a ReLU activation
function, and the second with a sigmoid activation function. 5 We will choose an appropriate convolutional weight and bias such that the architecture can detect a particular simple visual pattern.
5 In both Example 12.2.8 and Example 12.2.9, the second convolutional layer can be considered a fully connected layer if we flatten image Y.
(Figure 12.12: Input X → Conv 1 → ReLU → Image Y → Conv 2 → Output o → σ → Output ô.)
where xi,j is the (i, j)-th entry of the input image. Notice that this
value is zero everywhere, except if xi,j = 255, in which case yi,j takes
the value 1. That is,
$$y_{i,j} = \begin{cases} 1 & x_{i,j} = 255 \\ 0 & \text{otherwise} \end{cases}$$
See Figure 12.13 to see the effect of this choice of convolutional layer
on a sample image. We see that we have now successfully identified
the pixels in the input image that take the value 255.
Notice that this value is −5 if and only if there is no pixel such that
yi,j = 1 (i.e., xi,j = 255). If there is one such pixel, the output is 5; if
there are two, the output is 15. The important thing is, the output is
at least 5 if there is at least one pixel in the input image whose value
is 255, and otherwise the output is ≤ 0. That is
$$o = \begin{cases} \ge 5 & \exists (i,j) : x_{i,j} = 255 \\ -5 & \text{otherwise} \end{cases}$$
Finally, when we apply the sigmoid function, the final output of the model will be
$$\hat{o} = \sigma(o) = \begin{cases} \ge 0.99 & \exists (i,j) : x_{i,j} = 255 \\ 0.01 & \text{otherwise} \end{cases}$$
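The following sketch builds one choice of weights consistent with this description (the exact values in Figure 12.12 may differ, and the 8 × 8 image size is an assumption): Conv 1 is a 1 × 1 filter with bias −254, so after ReLU we get $y_{i,j} = 1$ exactly when $x_{i,j} = 255$ for integer pixel values; Conv 2 sums all $y_{i,j}$ with weight 10 and bias −5.

import torch
import torch.nn as nn

conv1 = nn.Conv2d(1, 1, kernel_size=1, bias=True)
conv2 = nn.Conv2d(1, 1, kernel_size=8, bias=True)   # assumes an 8x8 image Y
with torch.no_grad():
    conv1.weight.fill_(1.0); conv1.bias.fill_(-254.0)
    conv2.weight.fill_(10.0); conv2.bias.fill_(-5.0)

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 3, 4] = 255.0                     # one bright pixel
o = conv2(torch.relu(conv1(x)))
print(torch.sigmoid(o).item())            # ~ 0.99 (with no bright pixel: ~ 0.01)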
the output of the CNN should have a value close to 1 and otherwise the
output should have a value close to 0.
where $x_{i,j}$ is the $(i, j)$-th entry of the input image. 7 Notice that this value is zero everywhere, except if $x_{i,j} + x_{i-1,j} + x_{i,j-1} + x_{i,j+1} + x_{i+1,j} = 1275$, in which case $y_{i,j}$ takes the value 1. This can only happen if $x_{i-1,j} = x_{i,j-1} = x_{i,j} = x_{i,j+1} = x_{i+1,j} = 255$. That is, if the input image has the pattern in (12.4) centered around $(i, j)$.
$$y_{i,j} = \begin{cases} 1 & \text{Pattern in (12.4) exists at } (i, j) \\ 0 & \text{otherwise} \end{cases}$$
7 Since there is no padding, the values $y_{1,1}, y_{1,8}, y_{8,1}, y_{8,8}$ are not defined.
See Figure 12.14 to see the effect of this choice of convolutional layer
on two sample images.
Notice that this value is −5 if and only if there is no pixel such that
yi,j = 1 (i.e., the pattern exists at (i, j)). If there is one such pixel, the
output is 5; if there are two, the output is 15. The important thing
is, the output is at least 5 if there is at least one copy of the given
pattern, and otherwise the output is ≤ 0. That is
$$o = \begin{cases} \ge 5 & \exists (i,j) : \text{Pattern in (12.4) exists at } (i, j) \\ -5 & \text{otherwise} \end{cases}$$
Finally, when we apply the sigmoid function, the final output of the model will be
$$\hat{o} = \sigma(o) = \begin{cases} \ge 0.99 & \exists (i,j) : \text{Pattern in (12.4) exists at } (i, j) \\ 0.01 & \text{otherwise} \end{cases}$$
But the basic idea is the same — identify all paths through which the
corresponding weight affects the output of the model and add up the
amount of effect for each path.
Figure 12.15 shows a portion of a sample neural network where
weight sharing occurs. That is, the same weight w is used between
the following four pairs of nodes: ( x1 , y1 ), ( x2 , y2 ), ( x2 , y3 ), ( x3 , y4 ).
If we wanted to find the gradient ∂o/∂w, we need to consider the
four paths that the weight w affects the output: w → yi → o where
1 ≤ i ≤ 4.
$$\frac{\partial o}{\partial w} = \sum_{i=1}^{4} \frac{\partial o}{\partial w^{(i)}} \cdot \frac{\partial w^{(i)}}{\partial w} = \sum_{i=1}^{4} \frac{\partial o}{\partial w^{(i)}}$$
$$\frac{\partial \ell}{\partial w_{i,j}^{(k)}} = \frac{\partial \ell}{\partial h_i^{(k)}} \cdot \frac{\partial h_i^{(k)}}{\partial w_{i,j}^{(k)}} = \frac{\partial \ell}{\partial h_i^{(k)}} \cdot h_j^{(k-1)}$$
This is because the weight $w_{i,j}^{(k)}$ is only used to compute $h_i^{(k)}$ out
of all nodes in the next hidden layer. In comparison, consider a
convolutional layer, which computes an output image Y ∈ Rn×n from
an input image X ∈ Rm×m and filter W ∈ R(2k+1)×(2k+1) . Notice
that the weight wi,j is used to compute all of the pixels in the output
image. Therefore, we just need to add (or pool) the gradient flow from
each of these paths. The gradient with respect to a particular weight
$w_{i,j}$ will be
$$\begin{aligned}
\frac{\partial \ell}{\partial w_{i,j}} &= \left(\frac{\partial \ell}{\partial y_{1,1}} \cdot \frac{\partial y_{1,1}}{\partial w_{i,j}} + \frac{\partial \ell}{\partial y_{1,2}} \cdot \frac{\partial y_{1,2}}{\partial w_{i,j}} + \ldots + \frac{\partial \ell}{\partial y_{1,n}} \cdot \frac{\partial y_{1,n}}{\partial w_{i,j}}\right) \\
&\quad + \left(\frac{\partial \ell}{\partial y_{2,1}} \cdot \frac{\partial y_{2,1}}{\partial w_{i,j}} + \ldots + \frac{\partial \ell}{\partial y_{2,n}} \cdot \frac{\partial y_{2,n}}{\partial w_{i,j}}\right) + \ldots + \left(\frac{\partial \ell}{\partial y_{n,1}} \cdot \frac{\partial y_{n,1}}{\partial w_{i,j}} + \ldots + \frac{\partial \ell}{\partial y_{n,n}} \cdot \frac{\partial y_{n,n}}{\partial w_{i,j}}\right) \\
&= \frac{\partial \ell}{\partial y_{1,1}} \cdot x_{i+1,j+1} + \ldots + \frac{\partial \ell}{\partial y_{1,n}} \cdot x_{i+1,j+n} + \frac{\partial \ell}{\partial y_{2,1}} \cdot x_{i+2,j+1} + \ldots + \frac{\partial \ell}{\partial y_{2,n}} \cdot x_{i+2,j+n} + \ldots + \frac{\partial \ell}{\partial y_{n,n}} \cdot x_{i+n,j+n} \\
&= \sum_{1 \le r,s \le n} \frac{\partial \ell}{\partial y_{r,s}} \cdot x_{i+r,j+s}
\end{aligned} \tag{12.5}$$
That is, the Jacobian matrix ∂ℓ/∂W is the output when applying a
convolution filter ∂ℓ/∂Y to the input matrix X.
Similarly, we can try to calculate the Jacobian matrix with respect
to the input matrix X. Each input pixel xi,j is used to calculate the
output pixels yi+r,j+s where −k ≤ r, s ≤ k. The gradient with respect
to a particular input pixel will be
$$\begin{aligned}
\frac{\partial \ell}{\partial x_{i,j}} &= \left(\frac{\partial \ell}{\partial y_{i-k,j-k}} \cdot \frac{\partial y_{i-k,j-k}}{\partial x_{i,j}} + \ldots + \frac{\partial \ell}{\partial y_{i-k,j+k}} \cdot \frac{\partial y_{i-k,j+k}}{\partial x_{i,j}}\right) \\
&\quad + \left(\frac{\partial \ell}{\partial y_{i-k+1,j-k}} \cdot \frac{\partial y_{i-k+1,j-k}}{\partial x_{i,j}} + \ldots + \frac{\partial \ell}{\partial y_{i-k+1,j+k}} \cdot \frac{\partial y_{i-k+1,j+k}}{\partial x_{i,j}}\right) \\
&\quad + \ldots + \left(\frac{\partial \ell}{\partial y_{i+k,j-k}} \cdot \frac{\partial y_{i+k,j-k}}{\partial x_{i,j}} + \ldots + \frac{\partial \ell}{\partial y_{i+k,j+k}} \cdot \frac{\partial y_{i+k,j+k}}{\partial x_{i,j}}\right)
\end{aligned}$$
That is, the Jacobian matrix ∂ℓ/∂X is the output when applying the
horizontally and vertically inverted image of W as the convolutional
filter to the input matrix ∂ℓ/∂Y.
# forward propagation
net = ConvNet()              # the ConvNet class defined earlier in this section
output = net(image)
# backpropagation
loss = torch.norm(output - torch.ones(output.shape[1]))**2
loss.backward()
optimizer.step()
optimizer.zero_grad()
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
return x
Just like the FFNN code from the previous chapter, we define all the
layers and activations we are going to use in the constructor. Note
that in addition to instances of the nn.Linear class and the nn.ReLU
class, we also make use of classes like nn.Conv2d and nn.MaxPool2d
which are specifically designed for CNNs.
We extract one training image with the following code.
images, labels = next(iter(train_loader))
image = images[0].unsqueeze(0)
Next, we calculate the gradients of the loss with the following line of
code
loss.backward()
Finally, we reset the values of the gradients to zero with the following
code.
optimizer.zero_grad()
Reinforcement Learning
13
Introduction to Reinforcement Learning
to the agent, and the agent learns about the environment while also
learning to act in it. 2 2
Think of playing a Role-playing Game
(RPG) where you need to unlock parts
of the map by advancing the story.
Example 13.1.2. Self-driving cars, like those built by Tesla, are becoming
increasingly popular. Let’s imagine how we could construct a state diagram
for the task of driving autonomously. Each state can be represented by the
current configuration of a number of factors (e.g., the car speed, distance
from lane boundaries, distance to nearest vehicle, etc.) Possible actions
include increasing/decreasing speed, changing gear, changing direction,
changing lane, etc.
In general, not all states are reachable, given a state s and an action
a. That is, some transition probability p(s′ | s, a) is zero. For these
states, it is conventional to leave out the corresponding transitions
when representing the RL environment as a state diagram as in
Figure 13.1 or Figure 13.2.
Example 13.1.4. Consider the state diagram shown in Figure 13.1. This is
a special case where there is only one action a in the set A. In other words,
the agent is not making any choices; instead, it is just following probabilistic
transition over time steps. To calculate the probability of reaching state s3
from s0 , we note there are two different paths. The first path is s0 − s1 − s3
and the second path is s0 − s2 − s3 . We thus calculate the probabilities of each
of these paths and note that the overall probability of reaching s3 will be the
sum of both: 0.2 · 0.7 + 0.8 · 0.4 = 0.46.
states that consider the pressure and CO2 level in the patient's lungs for the
past k seconds. Actions might include adjusting the flow rate of oxygen
via valve settings as needed. Possible transitions might include the typical
mechanical response of the lungs, or unexpected spasms. Finally, we can
define the goal to be maintaining steady pressure in the patient without
“overshooting” and causing damage.
Example 13.1.8 (Example 13.1.5 revisited). When the baby stands and
feels secure after grabbing onto something, the parents applaud the baby,
and the baby receives a positive reward: r ( a′ | s0 , s1 ) = 5. When the baby
feels secure without grabbing onto the nearest support, the parents feel even
prouder and the baby gets a more positive reward: r ( a | s0 , s1 ) = 10. When
the baby falls to the ground, the baby feels pain and receives negative reward:
r ( a | s0 , s2 ) = r ( a′ | s0 , s2 ) = −5.
Real-life robots with precise and reliable hardware can get very
expensive to buy, let alone train. An easier playground for students
(especially those trying to work with a single GPU on CoLab) is
doing RL in a virtual environment.
MuJoCo is a famous physics engine that allows creating virtual
objects with somewhat realistic “joints” that can be commanded to
move similar to real-life robots. OpenAI and DeepMind have open-
source environments that allow experimentation in the MuJoCo
environment. The official website gives a pretty good overview of the
software:
the set of coordinates, velocity, and acceleration for each limb, the velocity
and acceleration for the motors in each joint, and the environment itself
straight ahead. The actions can include the agent increasing or decreasing
motor speed in their joints. Finally, the final goal is to stay upright, run
forward at a reasonable pace, and avoid obstacles.
Even though you eat only two out of the three slices during the first night, there is a 1/2 chance that your roommate eats the remaining slice overnight. Therefore, the action of “eating 2 slices” can lead to two possible states — “1 slice left” or “0 slices left” — each with probability 1/2.
Example 13.3.2. We can calculate the expected reward associated with eating
two slices on the first night by analyzing the Figure 13.5. You first gain
reward of 1.5 by eating the two slices on the first night. Then with probability 1/2 (where the roommate does not eat the remaining slice overnight), you get to eat the last slice on the second night and gain additional reward of 1. With probability 1/2 (where the roommate eats the remaining slice), you cannot gain any more reward. That is, the expected reward is 1.5 + 0.5 · 1 + 0.5 · 0 = 2.
Problem 13.3.3. Consider the result of Example 13.3.2. Would you prefer to
take two slices on the first night or three slices?
We next consider the action where you decide to eat one out of
the three slices on the first night. Successive states and associated
probabilities are shown in Figure 13.6.
Let’s review the key ingredients of RL. We have the agent, who senses
the environment and captures it as the current state. There is a finite
number of actions available at any given state, and taking an action a
in state s will cause a transition to s′ with probability p(s′ | s, a). Each transition is accompanied by a reward r(a | s, s′) ∈ R. Finally, the
goal of the agent is to maximize the expected reward via a sequence
of actions.
A Markov Decision Process (MDP) is a formalization of these con-
cepts. It is a directed graph which consists of four key features:
that cycle. For instance, continuing our cake theme, we may have a
scenario in which you receive a fresh cake every 3 days. But now we
run into a problem: how can we calculate the expected reward when
there is an unbounded number of steps?
The solution lies in the concept of future discounting. The basic
idea is to reduce, or discount, the amount of reward we get from
Example 14.2.1. Consider again the cake eating MDP example without
a discount factor. We already established through Example 13.3.2 and
Example 13.3.5 that to maximize the expected reward, you need to eat one
slice per day until all slices are gone. That is, in any state j where j = 1, 2, 3,
you need to take action 1.
Recall that if there are M actions and N states, there are at most
MN 2 transitions in the graph of the MDP. Because a policy specifies
one action per state, there are at most N 2 transitions that remain
when we choose a specific policy. Therefore, it can be understood
that a policy trims out the MDP.
Example 14.2.3. Let’s revisit Figure 14.1. If we fix the policy to be π (s) = 1
for any s ∈ S, we can focus our attention to the action a = 1. Then there
are three trajectories that will lead from state 3 to state 0, based on what
the roommate does overnight. The first trajectory is 3 → 1 → 0 with
probability 0.5 × 1 and reward 1 + 1. The second trajectory is 3 → 2 → 0
with probability 0.5 × 0.5 and reward 1 + 1. The last trajectory is 3 → 2 →
1 → 0 with probability 0.5 × 0.5 × 1 and reward 1 + 1 + 1.
Note that in an MRP tree, the same state can appear multiple
times, but each copy of the same state is identical — that is, the
subtree rooted at each copy must be identical. In Figure 14.2, the
state 1 appears twice in the tree. Every time it appears, it can only
lead to state 0 with probability 1. This is simply the result of fixing
a policy π — once we know the state we are in, we only have one
choice for the action to take.
The policy also induces a value function on this tree. The value
function assigns a value to each node of the tree, and each value
intuitively measures how much reward the agent should expect to
collect once the agent knows they have arrived at that node. By the
observation from the previous paragraph, this expected reward is the
same for two nodes if they are copies of the same state. Therefore,
we can equivalently define the value function for each state s instead.
Formally, we define the value function as the following.
Definition 14.2.5 (Value Function). vπ (s), the value of state s under the
policy π, is the expected discounted reward of a random trajectory starting
from s. We can define this value by using the following recursive formula:
Out of all choices for a policy, we are interested in the optimal policy,
the one that maximizes the expected (discounted) reward. Surpris-
ingly, it is known that there always exists a policy π ∗ that obtains the
maximum expected reward from all initial states simultaneously; that
is $\pi^* = \arg\max_{\pi} v_{\pi}(s)$ for every state $s$. 4
4 If there are multiple such policies, we denote any one of them by $\pi^*$.
The value function of the optimal policy is called the optimal value function and is often denoted as $v^*(s)$. Then we can express the optimal value function using (14.2) as:
$$v^*(s) = \max_{\pi} \sum_{s'} p(s' \mid s, \pi(s)) \left( r(\pi(s) \mid s, s') + \gamma v_{\pi}(s') \right)$$
This is just restating the fact that the optimal value of state s is the
maximum of all possible values vπ (s) of s under a policy π — i.e.,
the Bellman equation evaluated with the values vπ (s′ ) of each child
node s′ under that specific policy π.
But we can even go further than this result. It is known that the
optimal value also satisfies the following:
Notice that vπ (s′ ) in the summation has now been replaced with
v∗ (s′ ). This property, known as the Bellman Optimality condition,
states that the optimal value is even the maximum when the Bellman
equation is evaluated with the values v∗ (s′ ), regardless of the choice
of the policy π.
Notice that the right-hand side of (14.3) only depends on the
choice of the action a of the given state s, not any other states. There-
fore, we can rewrite (14.3) as:
which also suggests that the optimal action at state s can be ex-
pressed as:
But the problem is: it is unclear how to turn this into an efficient
algorithm. Computing the value v∗ (s) depends on the value v∗ (s′ ),
which can also depend on v∗ (s), which becomes recursive.
In this section, we present an iterative algorithm called the value
iteration method which will be used to compute the optimal policy.
Before we describe the algorithm, we unpack the underlying ideas.
policy that looks at least near-optimal, and use its value as a lower
bound for the optimal policy.
First, let v∗ (s) denote the value vπ ∗ (s) of state s for an optimal
policy π ∗ . Since there is only one action to choose from at state A, we
know that
$$v^*(A) = 10 + \gamma v^*(A') \tag{14.6}$$
Now, at the state $A'$, one possible trajectory you can follow is “go up four steps” (each with reward 0) back to $A$. We know that the optimal value has to be at least as great as this value. That is,
$$v^*(A') \ge \gamma^4 v^*(A) \tag{14.7}$$
Combining the two inequalities,
$$v^*(A) \ge 10 + \gamma^5 v^*(A) \qquad\Longrightarrow\qquad v^*(A) \ge \frac{10}{1 - \gamma^5} \approx 24.4$$
Example 14.3.2. See Figure 14.6. Suppose there are two actions to take at
state s. The first action, labeled as blue, will lead to state s1 with reward −1
with probability 0.5 and s3 with reward −1 with probability 0.5. The second
action, labeled as red, will lead to state s2 with reward 2 with probability
1. The discount factor is given as 0.6. Now assume that someone tells us
that they know a way to get an expected reward of 12 starting from s1 , 1
from s2 , and 4 from s3 , regardless of the choice of initial action at s. In
other words, the optimal values for these three states are lower bounded by:
v∗ (s1 ) ≥ 12, v∗ (s2 ) ≥ 1 and v∗ (s3 ) ≥ 4. Using this fact, we consider two
strategies 7 — (1) first take action blue at state s and play optimally thereon 7
This is not necessarily a policy because
based on the other person’s knowledge; (2) first take action red at state s and the second part of playing optimally
may require you to return to state s and
play optimally thereon. The lower bound for the expected reward for each of take an action that is inconsistent with
the two strategies can be computed as: your initial choice of action.
vblue (s) ≥ 0.5 × (−1 + 0.6 × 12) + 0.5 × (−1 + 0.6 × 4) = 3.8
vred (s) ≥ 1.0 × (2 + 0.6 × 1) = 2.6
1. Initialize some values v0 (s) for each state s such that we are guar-
anteed v0 (s) ≤ v∗ (s)
2. For each time step k = 1, 2, . . ., and for each state s, use the values $v_k(s')$ of the immediate children $s'$ to compute an updated value $v_{k+1}(s)$ such that $v_{k+1}(s) \le v^*(s)$. 8
3. When $k \to \infty$, each $v_k(s)$ will converge to the optimal value $v^*(s)$.
8 These values $v_k(s)$ maintained by the algorithm are not necessarily associated with a specific policy. They are just a lower bound for the optimal value $v^*(s)$ that will be improved over time.
Recall from (14.1) that if all transition rewards are within $[-R, R]$, then the expected reward at any state for any policy lies in $\left[-\frac{R}{1-\gamma}, \frac{R}{1-\gamma}\right]$. Therefore, we can set the initial value $v_0(s) = -\frac{R}{1-\gamma}$ to be the lower bound for each state $s$. 9
9 Our proof assumes this special initialization where $v_0(s) = -\frac{R}{1-\gamma}$ for all states $s$. It turns out the value iteration method converges to the optimal value for arbitrary initialization, but the proof is more complicated.
After the $k$-th iteration of the algorithm, we will maintain a value $v_k(s)$ for state $s$, where the condition $v_k(s) \le v^*(s)$ is maintained as an invariant. Now at the $(k+1)$-th iteration, the algorithm will update the values at each state $s$ as the following:
$$v_{k+1}(s) = \max_{a} \sum_{s'} p(s' \mid s, a) \left( r(a \mid s, s') + \gamma v_k(s') \right)$$
This is just the Bellman equation evaluated with the values $v_k(s')$ of each child node.
$$v_1(A) = p(A' \mid A, a) \cdot \left( r(a \mid A, A') + \gamma v_0(A') \right)$$
Proposition 14.3.5. For each time step k = 1, 2, . . ., and for each state s, the
invariant vk (s) ≤ v∗ (s) holds.
Notice that for any specific policy π and for any next state s′ , we
have
Since this inequality holds for every policy π, we have the following:
where the last equality uses the fact that $\sum_{s'} p(s' \mid s, a^*) = 1$ because $p$
is a probability distribution. Since this inequality holds for any state
s, we conclude that
where v∗ (s) is the value that the value iteration algorithm converges
to. If there are multiple actions a that satisfy the equation above,
arbitrarily choose an action.
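A minimal sketch of value iteration on a made-up two-state MDP; the transition model P[s][a] maps each action to a list of (probability, next state, reward) triples, and the state names, rewards, and discount factor below are all illustrative assumptions:

P = {
    0: {0: [(1.0, 1, 5.0)]},
    1: {0: [(0.5, 0, 0.0), (0.5, 1, 1.0)]},
}
gamma = 0.9
v = {s: 0.0 for s in P}
for _ in range(1000):
    # Bellman update: v_{k+1}(s) = max_a sum_{s'} p(s'|s,a) (r + gamma * v_k(s'))
    v = {s: max(sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values())
         for s in P}
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print(v, policy)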
Example 14.3.9 (Example 14.3.1 revisited). Say we ran the value iteration
algorithm on the Gridworld. The output of the algorithm (the optimal values
of each state) is given in Table 14.1.
Table 14.1: Optimal values v∗(s) of the Gridworld.
22.0  24.4  22.0  19.4  17.5
19.8  22.0  19.8  17.8  16.0
17.8  19.8  17.8  16.0  14.4
16.0  17.8  16.0  14.4  13.0
14.4  16.0  14.4  13.0  11.7
Consider the state A′ = (5, 2). There are four actions to take: left-
/right/up/down. Each action would yield the following values when evaluat-
ing the Bellman equation:
The only action that maximizes the value is the action “go up.” Therefore,
we can conclude that the optimal policy π ∗ will adopt the action “go up” for
the state A′ .
In model-free RL, we know the set of states S and the set of actions A,
but the transition probabilities and rewards are unknown. The agent
now needs to explore the environment to estimate the transition
probabilities and rewards. Suppose the agent is originally in state
s1 , chooses to take an action a, and ends up in state s2 . The agent
immediately observes some reward r(a | s1, s2), but we need more information to figure out p(s2 | s1, a).
One way we can estimate the transition probabilities is through the
Maximum Likelihood Principle. This concept has been used before
when considering estimating unigram probabilities in Chapter 8. In
model-free RL, an agent can keep track of the number of times they
took action a at state s1 and ended up in state s2 — denote this as
$\#(s_1, a, s_2)$. Then the estimate of the transition probability is:
$$p(s_2 \mid s_1, a) = \frac{\#(s_1, a, s_2)}{\sum_{s'} \#(s_1, a, s')} \tag{15.1}$$
The Central Limit Theorem (see Chapter 18) guarantees that esti-
mates will improve with more observations and quickly converge to
underlying state-action transition probabilities and rewards.
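A minimal counting sketch of the estimator (15.1), using made-up state and action names:

from collections import Counter, defaultdict

counts = defaultdict(Counter)
experiences = [("s1", "a", "s2"), ("s1", "a", "s2"), ("s1", "a", "s3")]  # toy data
for s, a, s_next in experiences:
    counts[(s, a)][s_next] += 1

def p_hat(s_next, s, a):
    # estimate of p(s_next | s, a) by normalized counts
    return counts[(s, a)][s_next] / sum(counts[(s, a)].values())

print(p_hat("s2", "s1", "a"))   # 2/3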
Groundhog Day is an early movie about a “time loop” and the title
has even become an everyday term. The film tracks cynical TV weath-
erman Phil Connors (Bill Murray) who is tasked with going to the
small town of Punxsutawney and filming its annual Groundhog Day
celebration. He ends up reliving the same day over and over again,
and becomes temporarily trapped. Along the way, he attempts to
court his producer Rita Hanson (Andie MacDowell), and is only
released from the time loop after a concerted effort to improve his
character.
Sounds philosophically deep! On the internet you can find various
interpretations of the movie: Buddhist interpretation (“many reincar-
nations ending in Nirvana”) and psychoanalysis (“revisiting of the
same events over and over again to reach closure”). The RL interpre-
tation is that Phil is in a model-free RL environment, 1 revisiting the same events of the day over and over again and figuring out his optimal actions.
1 Specifically, a model-free RL environment with an ability to reset to an initial state. This happens for example with a robot vacuum that periodically returns to its charging station. After charging, it starts exploring the MDP from the initial state again.
In 1972, the classic game of Pong was released by Atari. This was the
first commercially successful video game, and had a major cultural
impact on the perception of video games by the general public. The
rules of the game are simple: each player controls a virtual paddle
which can move vertically in order to rally a ball back and forth
(one participant may be an AI agent). If a player misses the ball, the
other player wins a point. We can consider the total number of points
accumulated by a player to be their reward so far. While technology
and video games have become far more advanced in the present, it
is still useful to analyze Pong today. This is because it is a simple
example of a physics-based system, similar to (but far less advanced
than) the MuJoCo stick figure simulations discussed in Chapter 13. It
thus provides a useful case study to demonstrate how an agent can
learn basic principles of physics through random exploration and
estimation of transition probabilities.
Let’s apply some simplifications in the interest of brevity. We
define the pong table to be 5 × 5 pixels in size, the ball to have a size
of 1 pixel, and the paddles to be 2 pixels in height. We define the
state at a time t as the locations of the two paddles at time t, and the
locations of the ball at time t and time t − 1. 2
2 Storing the location of the ball at time t − 1 and time t allows us to calculate the difference between the two locations and thus gives an estimate for the velocity.
We additionally restrict attention to the problem of tracking and returning the ball, also known as “Pico-Pong.” Thus, we define the game to begin when the opponent hits the ball. The agent gets a reward of +1 if they manage to hit the ball, −1 if they miss, and 0 if the ball is still in play. As soon as the agent either hits the ball or misses, we define that the game ends. Of course, these additional rules of the game are not available to the agent playing the game. The agent needs to “learn” these rules by observing the possible states, transitions, and corresponding rewards.
In general, these simplifications remove complications of modeling
the opponent and makes the MDP acyclic; an explanatory diagram
is shown in Figure 15.1. Throughout this section, we will build
intuition about different aspects of our Pico-Pong model through
some examples.
the agent will be able to implicitly “learn” that the ball can never
move away from them.
Problem 15.2.1. Suppose the agent is playing randomly and the ball is trav-
eling at a speed of 1 pixel per step. Which of the transitions in Figure 15.4 is
never seen, and why?
Problem 15.2.2. Suppose the agent is playing randomly and the ball is
traveling at a speed of 1 pixel per step. What reward is achieved given the
current state and chosen action in Figure 15.6, and why?
moving away from the agent or ball moving too fast). The agent will
be able to ignore these states, while learning how to play optimally in
states that did occur while exploring.
Also, the agent has now “learnt” the transition probabilities and
rewards of the MDP. Using these estimates, the agent is able to build
up a representation of the MDP. Since the underlying MDP for the
simplified Pico-Pong model is acyclic, the optimal policy can be
determined using a simple look-ahead tree. An example diagram is
shown in Figure 15.7.
We provide a specific example to aid the exposition. Suppose
an agent finds themselves in the state shown in Figure 15.8. Since
the path of the ball is already determined, the next possible state is
uniquely determined by the choice of the action — “go down” or
“stay in place” or “go up.” If the agent chooses to “go down,” the
game will end with a reward of +1. If the agent chooses to “stay in
place” or “go up,” the game continues for another time step, but no
matter the choice of action on that step, the game will end with a
reward of −1. Therefore, the agent will learn that the optimal policy
will assign the action of “go down” in the state shown in Figure 15.8.
Problem 15.2.3. Draw out the look-ahead tree from the state shown in
Figure 15.8.
Problem 15.2.4. Suppose we start from the state shown in Figure 15.9.
Assuming optimal play, what is the expected reward for the agent? (Hint:
consider if the agent will be able to reach the ball in time.)
Impressive! The agent has learnt how to return the ball in Pico-
Pong by first building up the MDP and its transitions/rewards
through repeated observations, and then computing the optimum
policy for the constructed MDP through a look-ahead tree. 3 3
How would you extend these ideas to
design a rudimentary ping pong robot
which can track and return the ball?
15.3 Q-learning
• Exploration: This pertains to what the agent did in the first phase.
Random paddle movements were used to help build up previously
unknown knowledge of the MDP — transition probabilities and
rewards.
15.3.2 Q-function
We now introduce the Q-function, an important concept that helps tie
together concepts of exploration and exploitation when considering
general MDPs with discounted rewards.
$$Q(s, a) = \sum_{s'} p(s' \mid s, a) \left( r(a \mid s, s') + \gamma \max_{b} Q(s', b) \right) \tag{15.2}$$
The first condition of Definition 15.3.2 states that for a fixed state s,
the action a that maximizes Q(s, a) is a = π ∗ (s). This condition only
cares about the relative ordering of the values of Q(s, a) — as long
as Q(s, π ∗ (s)) is the maximum value among all Q(s, a), then it is fine.
This condition guarantees that the action we take in the exploitation
step is an optimal action.
The second condition is formally stating that the values of the
Q-function are estimates of the expected reward when we take action
a from state s. It also suggests that Q-function needs to “behave
like” a value function vπ for some policy π. However, whereas a
similar condition for a value function vπ only needs to hold for one
particular action (i.e., a = π (s)) given a state s, this condition for a
Q-function should hold for any arbitrary action a. Note that for an optimal Q-function, the term $\max_b Q(s', b)$ in (15.2) is equivalent to $v_{\pi_Q}(s')$.
15.3.3 Q-learning
Now that we have defined the Q-function and the optimal Q-function,
it is time for us to study how to learn the optimal Q-function. This
process is called Q-learning. The basic idea is to probabilistically
choose between exploration or exploitation: we define some probabil-
ity ϵ ∈ [0, 1] such that we choose a random action a with probability ϵ
(exploration) or choose the action a according to the current canonical
policy πQ with probability 1 − ϵ (exploitation). If we choose the explo-
ration option, we use its outcome to update the Q(s, a) table. But how
should we define the update rule?
Let’s take a step back and consider a (plausibly?) real life scenario.
You are a reporter for the Daily Princetonian at Princeton, and want
to estimate the average wealth of alumni at a Princeton Reunions
event. The alumni, understandably vexed by such a request, strike
a compromise that you are only allowed to ask one alum about their
net worth. Can you get an estimate of the average? Well, you could
pick an alum at random and ask them their net worth! 4
4 The expectation gives the right average. But typically the answer would be far from the true average, especially if Jeff Bezos happens to be attending the reunion.
With this intuition, we return to the world of Q-learning. Suppose you start at some state $s_t$, take an action $a_t$, receive a reward of $r_t$, and arrive at state $s_{t+1}$. We call this process an experience. Now, when we
update the current estimate of Q(st , at ), we ideally want to mimic the
behavior of the optimal Q-function in (15.2) and update it to:
$$Q(s_t, a_t) = \sum_{s'} p(s' \mid s_t, a_t) \cdot \left( r(a_t \mid s_t, s') + \gamma \max_{b} Q(s', b) \right) \tag{15.3}$$
$$\mathbb{E}[Q'_t] = \sum_{s'} p(s' \mid s_t, a_t) \cdot \left( r(a_t \mid s_t, s') + \gamma \max_{b} Q(s', b) \right)$$
This is because the agent took a transition to state st+1 with probabil-
ity p(st+1 | st , at ) (of course, the agent does not know this value). This
for some learning rate η > 0. You can understand this update rule in
two different ways. First, we are gently nudging the value of Q(st , at )
towards the estimate Q′t from the most recent experience. We can
alternatively think of the updated value of Q(st , at ) as the weighted
average of the previous value of Q(st , at ) and the estimate Q′t . In
either approach, the most important thing to note is that we combine
both the previous Q value and the new estimate to compute the
updated Q value. This is because the new estimate is just a single
sample that can be far off from the actual expectation, and also
because after enough iterations, we can assume the previous Q value
to contain information from past experience.
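A minimal tabular Q-learning sketch consistent with this description: an ε-greedy choice between exploration and exploitation, and an update that nudges $Q(s_t, a_t)$ toward the one-sample estimate $Q'_t = r_t + \gamma \max_b Q(s_{t+1}, b)$. The state and action names below are made up.

import random
from collections import defaultdict

Q = defaultdict(float)            # keys are (state, action) pairs
actions = ["up", "down", "stay"]
gamma, eta, epsilon = 0.9, 0.1, 0.2

def choose_action(s):
    if random.random() < epsilon:                     # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])      # exploit (canonical policy)

def update(s_t, a_t, r_t, s_next):
    q_estimate = r_t + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s_t, a_t)] = (1 - eta) * Q[(s_t, a_t)] + eta * q_estimate

update("s0", "up", 1.0, "s1")
print(Q[("s0", "up")])            # 0.1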
Many Deep RL models are trained to play games (e.g., chess, Go) be-
cause it is easy to evaluate progress. By letting them compete against
How can we utilize RL concepts to play this game? In general, we can create a Deep Policy Net (DPN) to learn W, which is a function that takes state s as an input and outputs a probability distribution $p_W(a \mid s)$ over the next possible actions from s. AlphaGo is an example of a DPN engineered by the Google DeepMind lab. It takes the current board position as the input and uses a ConvNet to learn the internal weights, and outputs the value given by a softmax function. In its initial setup, the DPN was trained using a big dataset of past games. 12
(IBM's Deep Blue model beat the world chess champion Kasparov in 1997.)
12 Source: https://www.youtube.com/watch?v=Wujy7OzvdJk
To be more specific, AlphaGo used supervised learning from
human data to learn the optimal policy (action to take at each game
setting). In other words, it used convolutional layers to replicate the
moves of professional players as closely as possible. Since the CNN
Advanced Topics
16
Machine Learning and Ethics
At first sight, this may appear to be a very good idea: even if it saves just one life, surely the project is worth it? But the announcement of the project caused a lot of controversy among people. The following are some of the potential problems that people identified:
from the previous section. If the binary classification model for the
loan approval satisfies the demographic parity property, then the
model approves loans for different races at similar rates. One way
to achieve this condition is to use a regularizer term $\lambda \left( \Pr[y \mid x_i = a] - \Pr[y \mid x_i = b] \right)^2$ during training. 3
3 Does this seem like a good formulation of fairness?
Another property we might want a “fair” model to satisfy is called predictive parity. This is the property that the model in Table 16.1 failed to satisfy.
Example 17.1.2. Assume that you first heard the word tejuino and have no
idea what the word means. But you learn that the word may appear in the
following four contexts.
Table 17.2: A portion of a word-word co-occurrence matrix for a corpus of Wikipedia articles. Source: https://www.english-corpora.org/wiki/.
             ···  computer  data   result  pie   sugar  ···
cherry       ···         2     8        9  442      25  ···
strawberry   ···         0     0        1   60      19  ···
digital      ···      1670  1683       85    5       4  ···
information  ···      3325  3982      378    5      13  ···
ing objective:
$$J(\theta) = \sum_{i,j \in V} f(X_{ij}) \left( u_i \cdot v_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \tag{17.1}$$
where $f$ is some non-linear function and $b_i$, $\tilde{b}_j$ are bias terms, and $\theta$ is the set of all entries in $u_i$, $v_j$ and $b_i$, $\tilde{b}_j$. This is within the same line of logic as optimizing
$$L(A, B) = \frac{1}{|\Omega|} \sum_{i,j \in \Omega} \left( M_{ij} - (AB)_{ij} \right)^2 \qquad \text{((9.5) revisited)}$$
1. Similar words should have similar word vectors: This is the most
important property we can think of.
Notice that Scandinavian countries are the top 3 entries on the list, and the rest are also European country names.
Example 17.1.6. In Figure 17.1, notice that vman − vwoman ≈ vking − vqueen .
The vector difference in common can be understood as representing the male-
female relationship. Similarly, there seems to be a common vector difference
for representing the difference in verb tense.
Problem 17.2.1. “The students opened their .” Can you guess the
next word?
Problem 17.2.2. “As the proctor started the clock, the students opened their
.” Can you guess the next word?
Example 17.2.4. Assume we are given two contexts “You like green
” and “You like yellow ” to fill the blanks in. A n-gram
model will try to calculate the raw probabilities Pr[w | You like green] and
Pr[w | You like yellow]. However, if the word embeddings showed that
v green ≈ vyellow , then we can imagine that the two contexts are similar
enough. Then we may be able to estimate the probabilities better.
v 1 , v 2 , . . . , v n ∈ Rd
This will be the input layer. Then we define the fully connected
hidden layer as
⃗h = tanh(W⃗x + ⃗b) ∈ Rh
⃗z = U⃗h ∈ R|V |
d |V | + ndh + h + h |V |
where the terms are respectively for the input embeddings, W, ⃗b, U.
When d = h, sometimes we tie the input and output embeddings.
That is, we can consider U to be the parameters required for the
output embeddings. At this point, the language model reduces
to a |V|-way classification, and we can create lots of training examples by sliding the input-output indices. That is, when given a huge text, we can create lots of input-output tuples as follows: ((w1, . . . , wn), wn+1), ((w2, . . . , wn+1), wn+2), . . ..
sequence length can vary and the key is to reuse the weight matrices at
different time steps. When the inputs are given as ⃗x1 ,⃗x2 , . . .⃗x T ∈ Rd
and we want to find outputs ⃗h1 , ⃗h2 , . . . ⃗h T ∈ Rh , we train the parame-
ters
W ∈ Rh×h , U ∈ Rh×d , ⃗b ∈ Rh
such that
$$\vec{h}_t = g(W \vec{h}_{t-1} + U \vec{x}_t + \vec{b}) \in \mathbb{R}^h$$
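A minimal NumPy sketch of this recurrence with $g = \tanh$, reusing the same $W$, $U$, $\vec{b}$ at every time step; the dimensions and random inputs below are made up for illustration:

import numpy as np

d, h, T = 4, 3, 5
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(h, h)), rng.normal(size=(h, d)), np.zeros(h)
xs = rng.normal(size=(T, d))          # input vectors x_1, ..., x_T
h_t = np.zeros(h)
for x_t in xs:
    h_t = np.tanh(W @ h_t + U @ x_t + b)   # same weights at every step
    print(h_t)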
18.1.2 Probability
Given a sample space S, we define a probability for each event A of
that space. This probability measures the likelihood that the outcome
of the random phenomenon belongs to the set A.
Definition 18.1.3 (Probability). A probability Pr : S → R≥0 is a
mapping from each event A ⊂ S to a non-negative real number Pr[ A] ≥ 0
such that the following properties are satisfied:
1. 0 ≤ Pr[ A] ≤ 1 for any A ⊂ S
2. Pr[S] = 1
When the sample space is finite or countably infinite, 1 the properties above can be simplified into the following condition:
$$\sum_{x \in S} \Pr[\{x\}] = 1$$
1 A countably infinite set refers to a set whose elements can be numbered with integer indices. The set N of natural numbers or the set Q of rational numbers are examples of countably infinite sets.
Example 18.1.4. Consider the sample space of “the outcome of tossing two
dice” again. Assuming the two dice are fair, the probability of each outcome
can be defined as 1/36. Then the probability of the event “the sum of the
numbers on the two dice is 5” is 4/36.
Example 18.1.5. We are picking a point uniformly at random from the sam-
ple space [0, 2] × [0, 2] in the Cartesian coordinate system. The probability of
the event that the point is drawn from the bottom left quarter [0, 1] × [0, 1] is
1/4.
Example 18.1.7. Consider the sample space of “the outcome of tossing two
dice” again. Let A1 be the event “the number on the first die is 1” and let
A2 be the event “the number on the second die is 4.” The joint probability of
A1 , A2 is 1/36. The marginal probability of each of the events is 1/6.
$$\Pr[A \mid A_1, \ldots, A_n] = \frac{\Pr[A \cap A_1 \cap \ldots \cap A_n]}{\Pr[A_1 \cap \ldots \cap A_n]}$$
Example 18.1.9. Consider the sample space of “the outcome of tossing
two dice” again. Let A1 be the event “the number on the first die is 1” and
let A2 be the event “the sum of the numbers on the two dice is 5.” The
conditional probability of A1 given A2 is 1/4. The conditional probability of
A2 given A1 is 1/6.
Pr[ A] = Pr[ A | B]
or equivalently
Pr[ B] = Pr[ B | A]
or equivalently
Pr[ A ∩ B] = Pr[ A] · Pr[ B]
Example 18.1.13. Consider the sample space of “the outcome of tossing two
dice” again. Let A1 be the event “the number on the first die is 1” and let
A2 be the event “the sum of the numbers on the two dice is 5.” A1 and A2
are not independent.
Example 18.2.2. Consider the sample space of “the outcome of tossing two
dice” again. Then the random variable X = “sum of the numbers on the two
dice” maps the outcome (1, 4) to the value 5.
X = X1 + . . . + X n
X = X1 · · · X n
Example 18.2.4. Consider the sample space of “the outcome of tossing two
dice” again. Then the random variable X = “sum of the numbers on the two
dice” is the sum of the two random variables X1 = “the number on the first
die” and X2 = “the number on the second die.”
Example 18.2.5. Consider the sample space of “the outcome of tossing two
dice” and the random variable X = “sum of the numbers on the two dice”
again. Then
Pr[ X = 5] = Pr[{(1, 4), (2, 3), (3, 2), (4, 1)}] = 4/36
FX ( x ) = Pr[ X ≤ x ]
1. FX is increasing
p X (i ) = Pr[ X = i ]
1. ∑ p X (i ) = 1
i
2. FX ( x ) = ∑ p X (i )
i≤x
for any a, b ∈ R.
$$E[X] = \sum_i i \cdot p_X(i) = \sum_i i \cdot \Pr[X = i]$$
Example 18.2.12. Consider the sample space of “the outcome of tossing two dice” again. Then the expected value of the random variable X = “the number on the first die” can be computed as
$$E[X] = 1 \cdot \frac{6}{36} + 2 \cdot \frac{6}{36} + 3 \cdot \frac{6}{36} + 4 \cdot \frac{6}{36} + 5 \cdot \frac{6}{36} + 6 \cdot \frac{6}{36} = 3.5$$
Proposition 18.2.13 (Linearity of Expectation). If X is the sum of the
random variables X1 , . . . , Xn , then the following holds:
E [ X ] = E [ X1 ] + . . . + E [ X n ]
E[ aX + b] = aE[ X ] + b
Example 18.2.14. Consider the sample space of “the outcome of tossing two
dice.” Then the expected value of the random variable X = “the sum of the
numbers of the two dice” can be computed as
E[ X ] = 3.5 + 3.5 = 7
$$\Pr[|X - \mu| \ge k\sigma] \le \frac{1}{k^2}$$
for any k > 0. (Hint: Suppose the probability was greater than 1/k2 . What
could you conclude about E[( X − µ)2 ]? )
$$p_{X \mid Y}(i \mid j) = \frac{p_{X,Y}(i, j)}{p_Y(j)}$$
for all possible values j′ that Y can take. If we plug this into the
denominator above,
$$\Pr[Y = j \mid X = i] = \frac{\Pr[X = i \mid Y = j] \Pr[Y = j]}{\sum_{j'} \Pr[X = i \mid Y = j'] \Pr[Y = j']}$$
But we are more interested in the probability that the coin is fair/biased
based on the observation of the coin flip. Therefore, we can apply the Bayes’
Rule.
$$\Pr[\theta = 0.7 \mid D = H] = \frac{\Pr[D = H \mid \theta = 0.7] \Pr[\theta = 0.7]}{\Pr[D = H]}$$
which can be calculated as
$$\frac{\Pr[D = H \mid \theta = 0.7] \Pr[\theta = 0.7]}{\Pr[D = H \mid \theta = 0.7] \Pr[\theta = 0.7] + \Pr[D = H \mid \theta = 0.5] \Pr[\theta = 0.5]} = \frac{0.7 \cdot 0.5}{0.7 \cdot 0.5 + 0.5 \cdot 0.5} \simeq 0.58$$
So if we observe one Heads, there is a 58% chance that the coin was biased
and a 42% chance that it was fair.
pX(i) = pX|Y(i | j)
or equivalently,
pY(j) = pY|X(j | i)
or equivalently,
pX,Y(i, j) = pX(i) · pY(j)
1. E[ X1 · · · Xn ] = E[ X1 ] · · · E[ Xn ]
Expected value may not happen too often. For n = 10, the expected
number of Heads is 8, but that is seen only with probability 0.3. In
other words, with probability 0.7, the number of Heads is different
from the expectation.²

²In such cases, expectation can be a misleading term. It may in fact be never
seen. For instance, the expected number of eyes in an individual drawn from
the human population is somewhere between 1 and 2, but no individual has a
non-integral number of eyes. Thus mean value is a more intuitive term.

The highly likely values fall in a smaller and smaller band around the
expected value, as n increases.
For n = 10, there is a good chance that the number of Heads is
quite far from the expectation. For n = 100, the number of
Heads lies in [68, 90] with quite high probability. For n = 1000 it
lies in [770, 830] with high probability.
The probability curve becomes more symmetrical around the mean. Contrast
the case where n = 10 with the case where n = 100.
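These concentration claims can be checked with exact binomial sums. The Python sketch below assumes the running example uses a biased coin with Pr[Heads] = 0.8, so that the expected number of Heads is 8 when n = 10:

from math import comb

def binom_pmf(n, p, k):
    # probability of exactly k Heads in n independent flips with bias p
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_in_range(n, p, lo, hi):
    return sum(binom_pmf(n, p, k) for k in range(lo, hi + 1))

print(binom_pmf(10, 0.8, 8))               # ~0.30: exactly 8 Heads out of 10
print(prob_in_range(100, 0.8, 68, 90))     # high probability for n = 100
print(prob_in_range(1000, 0.8, 770, 830))  # even higher for n = 1000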
Example 18.3.4. A deep neural network model was trained to predict cancer
patients’ chances of staying in remission a year after chemotherapy, and
we are interested in finding out its accuracy p. When the model is tested
on n = 1000 held-out data points, this problem is equivalent to the coin
flipping problem. For each of the held-out data points, the probability that the
model makes the correct prediction is p. By observing the number of correct
predictions on the held-out data, we can construct a confidence interval for
p. Say the test accuracy was p′ = 70%. Then the 68% confidence interval
can be written as
np ∈ [np′ − σ, np′ + σ]
where σ = √(np′(1 − p′)) ≈ 14.5, or equivalently,
p ∈ [0.6855, 0.7145]
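A short Python sketch of this arithmetic, with σ computed as √(np′(1 − p′)) as in the interval above:

from math import sqrt

n, p_hat = 1000, 0.7                       # held-out set size and observed accuracy p'
sigma = sqrt(n * p_hat * (1 - p_hat))      # std. dev. of the number of correct predictions
lo, hi = (n * p_hat - sigma) / n, (n * p_hat + sigma) / n
print(sigma, (lo, hi))                     # ~14.49, (~0.6855, ~0.7145)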
where ⃗δ = (δ1 , δ2 , . . . , δk )
1. exp(x) > 0 for any x ∈ R
2. exp(x) is increasing
3. lim_{x → −∞} exp(x) = 0
4. lim_{x → ∞} exp(x) = ∞
5. exp(−x) = 1/exp(x)
Figure 19.1: The graph of the exponential function.
1. log(x) is increasing
2. lim_{x → 0+} log(x) = −∞
3. lim_{x → ∞} log(x) = ∞
Proposition 19.1.6. The following properties hold for the sigmoid function:
3. lim_{x → −∞} σ(x) = 0
4. lim_{x → ∞} σ(x) = 1
5. The graph of σ is symmetric about the point (0, 1/2). In particular,
σ(x) + σ(−x) = 1
19.1.3 Differentiation
Definition 19.1.7 (Derivative). Given a function f : R → R, its
derivative f ′ is defined as
f′(x) = lim_{h → 0} ( f(x + h) − f(x) ) / h
We alternatively denote f′(x) as (d/dx) f(x).
exp′ ( x ) = exp( x )
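A finite-difference approximation of the difference quotient in Definition 19.1.7 gives a quick numerical check of this identity (an illustrative Python sketch):

import math

def numerical_derivative(f, x, h=1e-6):
    # approximate f'(x) by the difference quotient with a small step h
    return (f(x + h) - f(x)) / h

for x in [-1.0, 0.0, 2.0]:
    print(x, numerical_derivative(math.exp, x), math.exp(x))   # the two values nearly agree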
f1(x1, x2) = x1² x2
f2(x1, x2) = x1 x2²
softmax(⃗z)i = e^{zi} / ∑_{j=1}^{k} e^{zj}    (19.1)
Problem 19.2.4. Show that for k = 2, the definition of the softmax function
is equivalent to the sigmoid function (after slight rearrangement/renaming of
terms).
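One way to experiment with Problem 19.2.4 is to implement (19.1) directly and compare softmax((z, 0)) with σ(z); fixing the second coordinate to 0 is one possible renaming of terms. A Python sketch:

import math

def softmax(z):
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

for z in [-2.0, 0.0, 1.5]:
    print(softmax([z, 0.0])[0], sigmoid(z))   # the two values agree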
19.2.3 Differentiation
Just like with functions in one variable, we can define differentiation
for mappings in several variables. The key point is that now we will
define a partial derivative for each pair ( xi , y j ) of coordinate xi of the
domain and coordinate y j of the range.
∂yj/∂xi (evaluated at ⃗x) = lim_{h → 0} ( fj(x1, . . . , xi + h, . . . , xn) − fj(x1, . . . , xi, . . . , xn) ) / h
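For instance, the following Python sketch approximates ∂f1/∂x1 for the function f1(x1, x2) = x1² x2 defined above and compares it with the closed form 2x1x2:

def f1(x1, x2):
    return x1**2 * x2

def partial_x1(f, x1, x2, h=1e-6):
    # difference quotient in the x1 coordinate, holding x2 fixed
    return (f(x1 + h, x2) - f(x1, x2)) / h

x1, x2 = 3.0, 2.0
print(partial_x1(f1, x1, x2), 2 * x1 * x2)   # both are (approximately) 12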
∂h/∂x = ∂s/∂x + ∂(t²)/∂x
      = ∂s/∂x + ∂(t²)/∂t · ∂t/∂x
      = 3 + 2t · 1
      = 2x − 1
20.1 Vectors
A vector is a collection of entries (here, we focus only on real numbers).
For example, the pair (1, 2) is a real vector of size 2, and the 3-tuple
(1, 0, 2) is a real vector of size 3. We primarily categorize vectors by
their size. For example, the set of all real vectors of size n is denoted
as Rn. Any element of Rn can be thought of as representing a point (or
equivalently, the direction from the origin to the point) in the
n-dimensional Cartesian space. A real number in R is also known as a
scalar, as opposed to vectors in Rn where n > 1.
Figure 20.1: A visualization of a vector ⃗v = (2, 1) in R2.
20.1.1 Vector Space
We are interested in two operations defined on vectors — vector
addition and scalar multiplication. Given vectors ⃗x = (x1, x2, . . . , xn)
and ⃗y = (y1, y2, . . . , yn) and a scalar c ∈ R, the vector addition is
defined as
⃗x + ⃗y = (x1 + y1, x2 + y2, . . . , xn + yn) ∈ Rn
Figure 20.2: A visualization of ⃗x + ⃗y where ⃗x = (2, 1) and ⃗y = (1, 2).
An expression of the form a1⃗x1 + a2⃗x2 + . . . + ak⃗xk, where the ai's are
scalars and the ⃗xi's are vectors, is called a linear combination of the
vectors ⃗xi's. Notice that the zero vector ⃗0 (i.e., the vector with
all zero entries) can always be represented as a linear combination of
an arbitrary collection of vectors, if all ai's are chosen to be zero. This
is known as a trivial linear combination, and any other choice of ai's is
known as a non-trivial linear combination.
Example 20.1.3. The set {(−1, 2), (3, 0), (1, 4)} of three vectors is linearly
dependent because the last vector can be represented as a linear combination
of the remaining two vectors:
(1, 4) = 2 · (−1, 2) + (3, 0)
Example 20.1.4. The set {(−1, 2, 1), (3, 0, 0), (1, 4, 1)} of three vectors is
linearly independent because there is no way to write one vector as a linear
combination of the remaining two vectors.
20.1.4 Span
Definition 20.1.5. The span of a set of vectors ⃗x1 , . . . ,⃗xk is the set of all
vectors that can be represented as a linear combination of ⃗xi ’s.
Example 20.1.6. (1, 4) is in the span of {(−1, 2), (3, 0)} because
(1, 4) = 2 · (−1, 2) + (3, 0)
Example 20.1.7. (1, 4, 1) is not in the span of {(−1, 2, 1), (3, 0, 0)} because
there is no way to choose a1, a2 ∈ R such that
(1, 4, 1) = a1 · (−1, 2, 1) + a2 · (3, 0, 0)
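Span membership can also be checked numerically. The Python sketch below (using numpy, with a hypothetical tolerance) solves a least-squares problem and tests whether the residual is zero:

import numpy as np

def in_span(vectors, v, tol=1e-10):
    A = np.column_stack(vectors)                    # columns are the spanning vectors
    coeffs, *_ = np.linalg.lstsq(A, v, rcond=None)  # best coefficients a1, a2, ...
    return np.linalg.norm(A @ coeffs - v) < tol     # zero residual means v is in the span

print(in_span([np.array([-1., 2.]), np.array([3., 0.])], np.array([1., 4.])))              # True
print(in_span([np.array([-1., 2., 1.]), np.array([3., 0., 0.])], np.array([1., 4., 1.])))  # False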
Example 20.1.8. In R3, the two vectors (1, 0, 0) and (0, 1, 0) span the
2-dimensional XY-plane. Similarly, the vectors (1, 0, 1) and (0, 2, 1) span
the 2-dimensional plane 2x + y − 2z = 0.²

²The term dimension will be formally defined soon. Here, we rely on your intuition.

In Example 20.1.8, we see examples where 2 vectors span a 2-dimensional
subspace. In general, the dimension of the subspace spanned by k vectors
can go up to k, but it can also be strictly smaller than k. This is related
to the linear independence of the vectors.
is a d-dimensional subspace of Rn .
Conversely, if we know that the span of the k vectors is a d-dimensional
subspace, then the maximum number of vectors that are linearly indepen-
dent with each other is d, and any subcollection of d linearly independent
vectors satisfies (20.1).
20.1.6 Basis
Definition 20.1.12. A collection {⃗x1 , . . . ,⃗xk } of linearly independent
vectors in Rn that span a set S is known as a basis of S. In particular, if
the vectors of the basis are orthogonal/orthonormal, the basis is called an
orthogonal/orthonormal basis of S.
The set S in Definition 20.1.12 can be the entire vector space Rn ,
but it can also be some subspace of Rn with a lower dimension.
Example 20.1.13. The set {(1, 0, 0), (0, 1, 0), (0, 0, 1)} of three vec-
tors is a basis for R3 . When we exclude the last vector (0, 0, 1), the set
{(1, 0, 0), (0, 1, 0)} is a basis of the 2-dimensional XY-plane in R3 .
Given some subspace S, the basis of S is not unique. However,
every basis of S must have the same size — this size is called the
dimension of S. For a finite dimensional space S, it is known that
there exists an orthogonal basis of S. There is a well-known algorithm
— Gram-Schmidt process — that can transform an arbitrary basis
into an orthogonal basis (and eventually an orthonormal basis via
normalization).
20.1.7 Projection
Vector projection is the key concept used in the Gram-Schmidt process
that computes an orthogonal basis. Given a fixed vector ⃗a, it decom-
poses any given vector ⃗x into a sum of two components — one that is
proj⃗a(⃗x) = ((⃗x · ⃗a)/(⃗a · ⃗a)) ⃗a
and is parallel to the fixed vector ⃗a. The remaining component
⃗x − proj⃗a(⃗x)
is orthogonal to ⃗a.
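A minimal Gram-Schmidt sketch built from this projection formula (assuming the input vectors are linearly independent) looks as follows in Python:

import numpy as np

def proj(a, x):
    # projection of x onto the fixed vector a
    return (np.dot(x, a) / np.dot(a, a)) * a

def gram_schmidt(vectors):
    ortho = []
    for v in vectors:
        w = v.astype(float)
        for u in ortho:
            w = w - proj(u, w)                      # remove the component along u
        ortho.append(w)
    return [u / np.linalg.norm(u) for u in ortho]   # normalize to get an orthonormal basis

basis = gram_schmidt([np.array([1., 0., 1.]), np.array([0., 2., 1.])])
print(np.dot(basis[0], basis[1]))   # approximately 0: the two basis vectors are orthogonal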
20.2 Matrices
That is, it is defined as the inner product of the i-th row of X and the
j-th column of Y.
• ( X1 + X2 ) Y = X1 Y + X2 Y
• ( X + Y )⊺ = X⊺ + Y⊺
• (cX)⊺ = c(X⊺ )
• (XY)⊺ = Y⊺ X⊺
⃗y = A⃗x ∈ Rm
Example 20.2.7. Consider the matrix
M = ⎡  1   0   1 ⎤
    ⎢ −2  −3   1 ⎥
    ⎣  3   3   0 ⎦
The rank of M is 2 because the third row can be expressed as the second row
subtracted from the first row.
When we interpret a matrix as a linear transformation, the rank
measures the dimension of the output space.
Proposition 20.2.8. A ∈ Rm×n has rank k if and only if the image of the
linear transformation, i.e., the subspace
{A⃗x | ⃗x ∈ Rn }
of Rm , has dimension k.
There are many known algorithms to compute the rank of a
matrix. Examples include Gaussian elimination or certain decom-
positions (expressing a matrix as the product of other matrices with
certain properties). Given m vectors in Rn, we can find the maximum
number of linearly independent vectors by constructing a matrix whose
rows are the given vectors⁴ and finding the rank of that matrix.

⁴By Proposition 20.2.4, the order of the rows can be arbitrary.
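For instance, numpy reproduces the rank from Example 20.2.7 (an illustrative sketch):

import numpy as np

M = np.array([[1, 0, 1],
              [-2, -3, 1],
              [3, 3, 0]])
print(np.linalg.matrix_rank(M))   # 2, since row 3 = row 1 - row 2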
A⃗v = λ⃗v
rank(A) + nullity(A) = n
such that the eigenvectors for λ are orthogonal to each other. That
is, if ⃗u, ⃗v are eigenvectors for the same eigenvalue λ, then ⃗u ·⃗v = 0.
Now assume ⃗u, ⃗v are two eigenvectors with distinct eigenvalues λ, µ
respectively. Then
λ(⃗u · ⃗v) = (λ⃗u) · ⃗v = (A⃗u) · ⃗v = ∑_{i,j=1}^{n} ai,j uj vi = ⃗u · (A⃗v) = ⃗u · (µ⃗v) = µ(⃗u · ⃗v)
where the third and the fourth equality can be verified by direct
computation (the fourth also uses the symmetry ai,j = aj,i of A). Since
λ ≠ µ, we conclude ⃗u · ⃗v = 0. We have now
shown that ⃗u · ⃗v = 0 for any pair of eigenvectors ⃗u, ⃗v — this means
that the basis of eigenvectors is also orthogonal. After normalization,
the basis can be made orthonormal.
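The following Python sketch illustrates this fact for a randomly generated symmetric matrix, using numpy's eigh routine for symmetric matrices:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                                    # a symmetric matrix
eigenvalues, U = np.linalg.eigh(A)                   # columns of U are eigenvectors
print(np.allclose(U.T @ U, np.eye(4)))               # True: the eigenvectors are orthonormal
print(np.allclose(A @ U, U @ np.diag(eigenvalues)))  # True: A u_i = lambda_i u_i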
∥⃗x∥² = ⃗x · ⃗x
      = (α1⃗u1 + α2⃗u2 + . . . + αn⃗un) · (α1⃗u1 + α2⃗u2 + . . . + αn⃗un)
      = ∑_{i,j=1}^{n} αi αj (⃗ui · ⃗uj)
      = ∑_{i=1}^{n} αi²
where for the last equality, we use the fact that the ⃗ui's are orthonormal
— that is, ⃗ui · ⃗uj = 0 if i ≠ j and ⃗ui · ⃗ui = 1. Since ⃗x has norm 1, we see
that ∑_{i=1}^{n} αi² = 1. Now notice that
A⃗x = α1λ1⃗u1 + . . . + αnλn⃗un, so ∥A⃗x∥² = ∑_{i=1}^{n} αi² λi²
The allocation of weights αi that will maximize ∑_{i=1}^{n} αi² λi² while
maintaining ∑_{i=1}^{n} αi² = 1 is assigning αi = ±1 to the eigenvalue λi that has
the highest value of λi². This shows that the unit vector ⃗x = ±⃗ui is an
eigenvector with the eigenvalue λi.
AA⊺⃗x = λ⃗x
where v̂i is the low-dimensional representation of ⃗vi that can be
computed as
v̂i = (⃗vi · ⃗u)⃗u
Since we are already given a fixed set of vectors ⃗vi, we cannot change
the values of ∥⃗vi∥². Therefore, minimizing the last term of the equation
above amounts to maximizing ∑_{i=1}^{N} (⃗vi · ⃗u)². Notice that
∑_{i=1}^{N} (⃗vi · ⃗u)² = ∥A⊺⃗u∥² = ⃗u⊺AA⊺⃗u
Again, the allocation of αi's that maximizes ∑_{i=1}^{n} αi² λi while maintaining
∑_{i=1}^{n} αi² = 1 is assigning αi = ±1 to the eigenvector corresponding to
the highest value of λi.
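The following Python sketch illustrates this conclusion on a hypothetical small data set: the top eigenvector of AA⊺ attains at least as large a value of ∑ (⃗vi · ⃗u)² as any other unit vector ⃗u:

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 50))           # 50 data vectors v_i in R^3, stored as the columns of A

eigenvalues, U = np.linalg.eigh(A @ A.T)   # eigenvalues in increasing order
u_best = U[:, -1]                          # eigenvector for the largest eigenvalue

def objective(u):
    return float(np.sum((A.T @ u) ** 2))   # sum_i (v_i . u)^2 = ||A^T u||^2

u_random = rng.standard_normal(3)
u_random /= np.linalg.norm(u_random)       # an arbitrary unit vector for comparison
print(objective(u_best) >= objective(u_random))   # True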