Machine Learning Basics: Lecture Slides For Chapter 5 of Deep Learning Ian Goodfellow
(Goodfellow 2016)
The task, T
q Regression: Learn 𝑓: ℝⁿ → ℝ
o Example: Predict claim amount for an insured
person
q Transcription: unstructured representation to
discrete textual form
o Example: optical character recognition, speech
recognition
q Machine Translation: Sequence to sequence
o Example: translate English to French
(Goodfellow 2016)
The task, T
q Structured output: output is a vector or data
structure of values that are tightly interrelated
o Example: parsing natural language sentence
into a tree that describes grammatical structure
by tagging nodes of the tree as being verbs,
nouns, etc.
o image segmentation: assigning a pixel to a
segment
o Image captioning
(Goodfellow 2016)
The task, T
q Anomaly detection: flag unusual or atypical
events
o Credit card fraud detection
q Synthesis and sampling: generate examples
that are similar to those in the training data
o Generate textures for video games
o Speech synthesis
q Imputation of missing values: predict the
values of the missing entries
(Goodfellow 2016)
The task, T
q Denoising: 𝑓: 𝑥̃ ∈ ℝⁿ → 𝑥 ∈ ℝⁿ
o Predict the clean example 𝑥 from the corrupted example 𝑥̃
o 𝑝(𝑥 | 𝑥̃) = ?
q Density estimation or probability mass estimation
o 𝑝_model(𝒙): ℝⁿ → ℝ (𝒙 can be discrete or continuous)
o Example: 𝑝_model(𝑥ᵢ | 𝒙₋ᵢ)
(Goodfellow 2016)
The performance measure, P
• Accuracy:
q The proportion of examples for which the model produces
the correct output
• Error rate (0-1 loss):
q The proportion of examples for which the model produces
incorrect output
• Average log-probability of some examples (for density
estimation)
• It is sometimes difficult to decide what should be measured
• It is sometimes impractical to measure an implicit performance metric
• Evaluate the performance measure using a test set
(Goodfellow 2016)
The experience, E
• Supervised learning: 𝑝(𝑦|𝒙)
q Experience is a labeled dataset (or datapoints)
q Each datapoint has a label or target
(Goodfellow 2016)
The experience, E
• Semi-Supervised learning:
q Some examples include supervision targets, but
others do not.
• Multi-instance learning:
q A collection of examples is labeled as containing
an example of a class
• Reinforcement learning:
q Interaction with an environment
(Goodfellow 2016)
A dataset
• Design matrix X:
q Each row is an example
q Each column is a feature
q Example: iris dataset: 𝑋 ∈ ℝ^(150×4)
• Vector of labels y
q Example: 0 is a person, 1 is a car, 2 is a cat, etc.
q The label can be a set of words (e.g.
transcription)
(Goodfellow 2016)
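A minimal sketch of the design-matrix view, assuming scikit-learn is available; load_iris returns the 150×4 iris design matrix and its label vector:

```python
# Minimal sketch: the iris design matrix, assuming scikit-learn is installed.
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # design matrix: one row per example, one column per feature
y = iris.target    # vector of integer labels (0, 1, 2 for the three species)
print(X.shape)     # (150, 4): 150 examples, 4 features
```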
Example: Linear regression
• Task: regression problem ℝⁿ → ℝ
q 𝑦̂ = 𝒘ᵀ𝒙
• Performance measure:
q Have a test set: 𝑿𝒕𝒆𝒔𝒕 , 𝒚𝒕𝒆𝒔𝒕
q 𝑀𝑆𝐸_test = (1/𝑚) Σᵢ (𝒚̂^test − 𝒚^test)ᵢ² = (1/𝑚) ‖𝒚̂^test − 𝒚^test‖₂²
• Experience
q 𝑿𝒕𝒓𝒂𝒊𝒏 , 𝒚𝒕𝒓𝒂𝒊𝒏
q Minimize 𝑀𝑆𝐸_train
(Goodfellow 2016)
Linear regression
• 𝛻𝒘 𝑀𝑆𝐸_train = 0
• 𝛻𝒘 (1/𝑚) (𝑿^train 𝒘 − 𝒚^train)ᵀ(𝑿^train 𝒘 − 𝒚^train) = 0
• (𝑿^trainᵀ 𝑿^train) 𝒘 = 𝑿^trainᵀ 𝒚^train (normal equations)
(Goodfellow 2016)
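A minimal numpy sketch of this closed-form solution, assuming a full-rank training design matrix X_train (m × n) and target vector y_train (hypothetical names):

```python
# Minimal sketch: closed-form linear regression via the normal equations,
# assuming X_train (m x n) and y_train (m,) are numpy arrays and X_train has full column rank.
import numpy as np

def fit_linear_regression(X_train, y_train):
    # Solve (X^T X) w = X^T y directly instead of forming an explicit inverse.
    return np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

def mse(X, y, w):
    residual = X @ w - y
    return np.mean(residual ** 2)
```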
Linear Regression
Figure 5.1
(Goodfellow 2016)
Capacity, overfitting and
underfitting
• Generalization: ability to perform well on previously
unobserved data
• i.i.d. assumptions:
q The examples in each dataset are independent
from each other,
q and that the train set and test set are identically
distributed, drawn from the same probability
distribution as each other.
q We call that shared underlying distribution the
data generating distribution, denoted 𝑝_data
(Goodfellow 2016)
Capacity, overfitting and
underfitting
• For some fixed value w, the expected training set
error is exactly the same as the expected test set
error, because both expectations are formed using
the same dataset sampling process.
• In practice:
q expected test set error ≥ expected training set error
• We need ability to:
q 1. Make the training error small (otherwise: underfitting)
q 2. Make the gap between training and test error small (otherwise: overfitting)
(Goodfellow 2016)
Capacity, overfitting and
underfitting
• Underfitting occurs when the model is not able to
obtain a sufficiently low error value on the training
set.
• Overfitting occurs when the gap between the
training error and test error is too large.
• Informally, a model’s capacity is its ability to fit a
wide variety of functions.
q Low capacity may cause underfitting
q High capacity may cause overfitting
(Goodfellow 2016)
Hypothesis space
• One way to control the capacity of a learning
algorithm is by choosing its hypothesis space,
q the set of functions that the learning algorithm is
allowed to select as being the solution.
q For example, linear regression has the set of all
linear functions of its input
q To increase capacity: (model is polynomial of
degree 9)
𝑦̂ = 𝑏 + Σᵢ₌₁⁹ 𝑤ᵢ 𝑥ⁱ
(Goodfellow 2016)
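A minimal numpy sketch of increasing capacity by fitting polynomials of growing degree; x, y, and degree are assumed to be 1-D training inputs, targets, and the chosen degree:

```python
# Minimal sketch: polynomial regression of a chosen degree, assuming 1-D inputs
# x (m,) and targets y (m,). Higher degree = higher capacity.
import numpy as np

def fit_polynomial(x, y, degree):
    # Columns are x^0 (bias), x^1, ..., x^degree.
    X = np.vander(x, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_polynomial(x, w):
    X = np.vander(x, N=len(w), increasing=True)
    return X @ w
```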
Underfitting and Overfitting
in Polynomial Estimation
(Goodfellow 2016)
Generalization and Capacity
Figure 5.3
(Goodfellow 2016)
High capacity: non-
parametric models
• Nearest neighbor regression
𝑦̂ = 𝑦ᵢ where 𝑖 = argminᵢ ‖𝑿ᵢ,: − 𝒙‖₂²
(Goodfellow 2016)
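A minimal numpy sketch of nearest-neighbor regression, assuming training data X (m × n) and targets y:

```python
# Minimal sketch of 1-nearest-neighbor regression, assuming training data
# X (m x n) and targets y (m,).
import numpy as np

def nearest_neighbor_predict(X, y, x_query):
    # Squared L2 distance from the query to every training example.
    distances = np.sum((X - x_query) ** 2, axis=1)
    i = np.argmin(distances)
    return y[i]
```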
Bayes error
• The error incurred by an oracle making predictions
from the true distribution 𝑝 𝒙, 𝑦 .
• This error is due to:
q There may still be noise in the distribution
q 𝑦 might be inherently stochastic
q 𝑦 may be deterministic but involves other
variables besides 𝒙
(Goodfellow 2016)
Training Set Size
moderate amount of
noise added to a
degree-5 polynomial
(Goodfellow 2016)
The No Free Lunch theorem
• Averaged over all possible data generating distributions,
every classification algorithm has the same error rate
when classifying previously unobserved points.
(Goodfellow 2016)
No free lunch
• The goal of machine learning research is not to
seek a universal learning algorithm or the absolute
best learning algorithm.
• Instead, our goal is to understand what kinds of
distributions are relevant to the “real world”
q and what kinds of machine learning algorithms
perform well on data drawn from distributions we care
about.
(Goodfellow 2016)
Regularization
• We can give a learning algorithm a preference for
one solution in its hypothesis space to another.
q both functions are eligible, but one is preferred.
• Example: linear regression with weight decay:
𝐽(𝒘) = 𝑀𝑆𝐸_train + 𝜆𝒘ᵀ𝒘
• We are expressing a preference for the weights to
have smaller L2 norm
q Larger 𝜆 forces the weights to become smaller
(Goodfellow 2016)
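A minimal numpy sketch of linear regression with weight decay (ridge regression), assuming X, y, and a regularization strength lam chosen by the user:

```python
# Minimal sketch: linear regression with weight decay (ridge regression),
# assuming X (m x n), y (m,), and a regularization strength lam >= 0.
import numpy as np

def fit_ridge(X, y, lam):
    # Setting the gradient of (1/m)||Xw - y||^2 + lam0 * w^T w to zero gives
    # (X^T X + m*lam0*I) w = X^T y; here lam plays the role of m*lam0.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```

Larger values of lam shrink the weights toward zero, as the slide describes.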
Weight Decay
(Goodfellow 2016)
Hyperparameters
• The values of hyperparameters are not adapted by
the learning algorithm itself
• Polynomial regression hyperparameters:
q Degree of the polynomial
q 𝜆 (control of the weight decay)
• To set the hyperparameters, we need a validation
set
q Typically, one uses about 80% of the training data
for training and 20% for validation.
(Goodfellow 2016)
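A minimal sketch of an 80/20 train/validation split for choosing hyperparameters, assuming numpy arrays X and y:

```python
# Minimal sketch: an 80/20 train/validation split, assuming X (m x n) and y (m,).
import numpy as np

def split_train_valid(X, y, valid_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))        # shuffle example indices
    n_valid = int(valid_fraction * len(X))
    valid_idx, train_idx = perm[:n_valid], perm[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]
```

Each candidate hyperparameter value (e.g. degree, 𝜆) is then fit on the training split and scored on the validation split; the best-scoring value is kept.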
Cross validation
• It is used to estimate generalization error of a learning
algorithm A when the given dataset D is too small for a
simple train/test or train/valid split to yield accurate
estimation of generalization error
(Goodfellow 2016)
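A minimal sketch of k-fold cross-validation, assuming a learning algorithm A that returns a model with a predict method, and a loss function (both hypothetical interfaces):

```python
# Minimal sketch of k-fold cross-validation. Assumes A(X_train, y_train) returns
# a model with a .predict(X) method and loss(y_pred, y_true) returns a scalar.
import numpy as np

def k_fold_cv(A, loss, X, y, k=5):
    folds = np.array_split(np.arange(len(X)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = A(X[train_idx], y[train_idx])        # train on k-1 folds
        errors.append(loss(model.predict(X[test_idx]), y[test_idx]))  # test on the held-out fold
    return np.mean(errors)
```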
Cross validation
(Goodfellow 2016)
Point estimation
• Provides the single “best” prediction of some quantity of
interest
• can be a single parameter or a vector of parameters in
some parametric model
• A point estimator or statistic is any function of the data
(assuming i.i.d. samples):
𝜽̂_𝑚 = 𝑔(𝒙⁽¹⁾, 𝒙⁽²⁾, …, 𝒙⁽ᵐ⁾)
• Frequentist perspective: we assume that the true
parameter value θ is fixed but unknown
• Function estimation is really just the same as estimating
a parameter θ; the function estimator is simply a point
estimator in function space.
(Goodfellow 2016)
Statistical bias
• bias(𝜽̂_𝑚) = 𝔼(𝜽̂_𝑚) − 𝜽, where the expectation is over the data 𝒙⁽¹⁾, 𝒙⁽²⁾, …, 𝒙⁽ᵐ⁾
• An estimator is unbiased when bias(𝜽̂_𝑚) = 0
(Goodfellow 2016)
Example: Bernoulli
distribution
! " #
• Consider a set of samples {𝑥 ,𝑥 ,…,𝑥 } that are i.i.d according to a Bernoulli distribution with mean θ
q 𝑃 𝑥 𝑖 ; 𝜃 = 𝜃$ %
1 − 𝜃 !&$ %
(Goodfellow 2016)
Example: Gaussian
distribution - mean
• Consider a set of samples {𝑥⁽¹⁾, 𝑥⁽²⁾, …, 𝑥⁽ᵐ⁾} that
are i.i.d. according to a Gaussian distribution:
q 𝑝(𝑥⁽ⁱ⁾) = 𝒩(𝑥⁽ⁱ⁾; 𝜇, 𝜎²)
• 𝜇̂_𝑚 = (1/𝑚) Σᵢ₌₁ᵐ 𝑥⁽ⁱ⁾
(Goodfellow 2016)
Example: Gaussian
distribution - variance
• Biased estimator: 𝜎̂²_𝑚 = (1/𝑚) Σᵢ₌₁ᵐ (𝑥⁽ⁱ⁾ − 𝜇̂_𝑚)²
• Unbiased estimator: 𝜎̃²_𝑚 = (1/(𝑚 − 1)) Σᵢ₌₁ᵐ (𝑥⁽ⁱ⁾ − 𝜇̂_𝑚)²
(Goodfellow 2016)
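A minimal numpy illustration of the two estimators; the ddof argument selects the divisor m (biased) or m − 1 (unbiased):

```python
# Minimal sketch: biased vs. unbiased variance estimators on i.i.d. samples.
import numpy as np

x = np.random.default_rng(0).normal(loc=0.0, scale=2.0, size=50)

biased = np.var(x, ddof=0)    # divides by m      (biased: underestimates sigma^2 on average)
unbiased = np.var(x, ddof=1)  # divides by m - 1  (unbiased)
```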
Variance
• How much do we expect an estimator to vary as a
function of the data sample?
• 𝑉𝑎𝑟(𝜃̂) = ?
• Standard error 𝑆𝐸(𝜃̂) = ?
• Just as we might like an estimator to exhibit low
bias we would also like it to have relatively low
variance.
(Goodfellow 2016)
Example: Bernoulli
distribution
(Goodfellow 2016)
Trading off Bias and Variance
(Goodfellow 2016)
Bias vs. variance
(Goodfellow 2016)
Maximum Likelihood
Estimation
• Rather than guessing that some function might
make a good estimator and then analyzing its bias
and variance, we would like to have some principle
from which we can derive specific functions that are
good estimators for different models.
(Goodfellow 2016)
Conditional log-likelihood
• Supervised learning
(Goodfellow 2016)
Conditional log-likelihood for
linear regression
• To maximize: the conditional log-likelihood
Σᵢ log 𝑝(𝑦⁽ⁱ⁾ | 𝒙⁽ⁱ⁾; 𝜽), where the targets 𝑦⁽ⁱ⁾ come from
the dataset and the output of linear regression is taken
as the mean of a Gaussian over 𝑦
• Same as minimizing: 𝑀𝑆𝐸_train
(Goodfellow 2016)
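A compact restatement of the argument on this slide, assuming the conditional distribution 𝑝(𝑦 | 𝒙) = 𝒩(𝑦; 𝑦̂(𝒙; 𝒘), 𝜎²) with fixed 𝜎²:

```latex
% Conditional log-likelihood under a Gaussian with fixed variance:
\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}\right)
  = -\, m \log \sigma \;-\; \frac{m}{2}\log(2\pi)
    \;-\; \sum_{i=1}^{m} \frac{\left\lVert \hat{y}^{(i)} - y^{(i)} \right\rVert^{2}}{2\sigma^{2}}
```

Only the last term depends on 𝒘, so maximizing the conditional log-likelihood in 𝒘 is equivalent to minimizing 𝑀𝑆𝐸_train = (1/𝑚) Σᵢ ‖𝑦̂⁽ⁱ⁾ − 𝑦⁽ⁱ⁾‖².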
Properties of maximum
likelihood
• Maximum likelihood is the best estimator asymptotically
(as the number of examples 𝑚 → ∞), in terms of its rate
of convergence as 𝑚 increases.
• Consistency: plim_(𝑚→∞) 𝜃̂_𝑚 = 𝜃
q Under appropriate conditions
o The true distribution must lie within the model
family
o The true distribution must correspond to exactly
one value of θ
(Goodfellow 2016)
Bayesian statistics
• the true parameter 𝜃 is unknown or uncertain and
thus is represented as a random variable.
(Goodfellow 2016)
Maximum A Posteriori (MAP)
Estimation
• Add the prior: 𝜽_MAP = argmax_𝜽 𝑝(𝜽 | 𝒙) = argmax_𝜽 [log 𝑝(𝒙 | 𝜽) + log 𝑝(𝜽)]
(Goodfellow 2016)
Supervised learning
examples
• Logistic regression
• Support vector machines
• Decision trees
(Goodfellow 2016)
Logistic regression
• 𝑝(𝑦 = 1 | 𝒙; 𝜽) = 𝜎(𝜽ᵀ𝒙)
(Goodfellow 2016)
Support vector machines
• Introduced by Vapnik in the 90s
• Widest street approach
• Decision rule
q Which side of the street?
o 𝒘·𝒙 + 𝑏 ≥ 0 ⇒ +
o 𝒘·𝒙 + 𝑏 < 0 ⇒ −
• 𝒘·𝒙₊ + 𝑏 ≥ 1
o 𝒘·𝒙₋ + 𝑏 ≤ −1
(Goodfellow 2016)
SVM
• y" = +1 for + samples
• y" = −1 for – samples
• 𝑦" (𝒘𝒙𝒊 + 𝑏) ≥ 1 same equation for x+ and x-
• 𝑦" 𝒘𝒙𝒊 + 𝑏 − 1 ≥ 0
• Constraints: 𝑦" 𝒘𝒙𝒊 + 𝑏 − 1 = 0 for 𝒙𝒊 in the gutter
𝐰 *(F ((*DF) =
• w𝑖𝑑𝑡ℎ = 𝒙D − 𝒙( = − =
𝒘 𝒘 𝒘 𝒘
• Maximize the width of the street:
= * 𝒘 '
q max ⇒ max ⇒ min
𝒘 𝒘 =
(Goodfellow 2016)
Lagrange multipliers
• 𝐿 = ½‖𝒘‖² − Σᵢ 𝛼ᵢ [𝑦ᵢ(𝒘·𝒙ᵢ + 𝑏) − 1], with 𝛼ᵢ ≥ 0
• ∇_𝒘 𝐿 = 𝒘 − Σᵢ 𝛼ᵢ 𝑦ᵢ 𝒙ᵢ
q 𝒘 − Σᵢ 𝛼ᵢ 𝑦ᵢ 𝒙ᵢ = 𝟎 ⇒ 𝒘 = Σᵢ 𝛼ᵢ 𝑦ᵢ 𝒙ᵢ
• ∂𝐿/∂𝑏 = −Σᵢ 𝛼ᵢ 𝑦ᵢ = 0 ⇒ Σᵢ 𝛼ᵢ 𝑦ᵢ = 0
• Replacing in 𝐿:
q 𝐿 = ½ (Σᵢ 𝛼ᵢ 𝑦ᵢ 𝒙ᵢ)·(Σⱼ 𝛼ⱼ 𝑦ⱼ 𝒙ⱼ) − (Σᵢ 𝛼ᵢ 𝑦ᵢ 𝒙ᵢ)·(Σⱼ 𝛼ⱼ 𝑦ⱼ 𝒙ⱼ) − 𝑏 Σᵢ 𝛼ᵢ 𝑦ᵢ + Σᵢ 𝛼ᵢ
q 𝐿 = Σᵢ 𝛼ᵢ − ½ Σᵢ Σⱼ 𝛼ᵢ 𝛼ⱼ 𝑦ᵢ 𝑦ⱼ 𝒙ᵢ·𝒙ⱼ
• Numerical analysts will solve this problem (maximize with
respect to 𝛼ᵢ ≥ 0) and give us the 𝛼ᵢ
• Numerical analysts will solve this problem (max with
respect to 𝛼) ≥ 0) and give us the 𝛼)
(Goodfellow 2016)
Kernel trick
• 𝛼" ≠ 0 only for points in the gutter, also known as
support vectors
• Decision for 𝒙𝒋 : 𝒘𝒙𝒋 + 𝑏 = ∑𝛼" 𝑦" 𝒙𝒊 𝒙𝒋 + 𝑏
• We can change the representation of 𝒙𝒊 to be 𝜙 𝒙𝒊
• We do not need direct access to 𝜙 𝒙𝒊
q Instead we define 𝑘 𝒙𝒊 , 𝒙𝒋 = 𝜙 𝒙𝒊 𝜙 𝒙𝒋
q The function k is called kernel
• 𝑓 𝒙 = ∑𝛼" 𝑦" 𝑘 𝒙𝒊 , 𝒙 + 𝑏
q This function is nonlinear with respect to 𝒙
(Goodfellow 2016)
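A minimal numpy sketch of this kernelized decision function, assuming the multipliers alpha, the bias b, and the support vectors X_sv with labels y_sv were already obtained from the quadratic program on the previous slides:

```python
# Minimal sketch of f(x) = sum_i alpha_i y_i k(x_i, x) + b. Assumes alpha and b
# come from a QP solver; X_sv (s x n), y_sv (s,), alpha (s,) are the support
# vectors, their labels, and their Lagrange multipliers.
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    # Gaussian (RBF) kernel between two vectors.
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def decision_function(x, X_sv, y_sv, alpha, b, kernel=rbf_kernel):
    return sum(a * y * kernel(x_sv, x) for a, y, x_sv in zip(alpha, y_sv, X_sv)) + b

def classify(x, X_sv, y_sv, alpha, b, kernel=rbf_kernel):
    return 1 if decision_function(x, X_sv, y_sv, alpha, b, kernel) >= 0 else -1
```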
Advantages of the kernel
trick
• The kernel trick is powerful for two reasons.
q it allows us to learn models that are nonlinear as a
function of 𝒙 using convex optimization techniques
that are guaranteed to converge efficiently
q the kernel function often admits an implementation
that is significantly more computationally efficient than
naively constructing two 𝜙(𝒙) vectors and explicitly
taking their dot product.
o In some cases, 𝜙(𝒙) can even be infinite-dimensional
(Goodfellow 2016)
Gaussian kernel
• Gaussian kernel
q a.k.a. RBF: Radial Basis Function kernel
q 𝑘(𝒖, 𝒗) = 𝒩(𝒖 − 𝒗; 𝟎, 𝜎²𝑰) ∝ exp(−‖𝒖 − 𝒗‖² / (2𝜎²))
(Goodfellow 2016)
Watch (SVM)
• https://www.youtube.com/watch?v=_PwhiWxHK8o
(Goodfellow 2016)
K-Nearest Neighbors
• There is not even really a training stage or learning
process.
q We find the k-nearest neighbors to x in the
training data X. We then return the average of the
corresponding y values in the training set
• In the case of classification, we can average over
one-hot code vectors
q We can then interpret the average over these
one-hot codes as giving a probability distribution
over classes
(Goodfellow 2016)
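A minimal numpy sketch of k-NN classification by averaging one-hot codes, assuming training data X, integer labels y in {0, …, C−1}, and a query point x_query:

```python
# Minimal sketch of k-nearest-neighbor classification via averaged one-hot codes,
# assuming X (m x n) and integer labels y (m,).
import numpy as np

def knn_predict_proba(X, y, x_query, k=5, num_classes=None):
    if num_classes is None:
        num_classes = int(y.max()) + 1
    distances = np.sum((X - x_query) ** 2, axis=1)   # squared L2 distances
    neighbors = np.argsort(distances)[:k]            # indices of the k nearest examples
    one_hot = np.eye(num_classes)[y[neighbors]]      # k x C one-hot codes
    return one_hot.mean(axis=0)                      # probability distribution over classes
```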
K-NN limitations
• It cannot learn that one feature is more
discriminative than another.
q For example, imagine we have a regression task
with 𝒙 ∈ ℝ¹⁰⁰ drawn from an isotropic Gaussian
distribution,
q but only a single variable is relevant to the output:
𝑦 = 𝑥₁
q Nearest neighbor regression will not be able to
detect this simple pattern.
(Goodfellow 2016)
Decision Trees
How a decision tree
might divide ℝ².
(Goodfellow 2016)
Unsupervised learning
• Density estimation
• Learning to draw samples from a distribution
• Learning to denoise data from some distribution
• Clustering the data into groups of related examples
• Simpler representation:
q Lower-dimensional representation
q Sparse representation
q Independent representation
(Goodfellow 2016)
Principal Components
Analysis
• lower dimensionality
• no linear correlation
(Goodfellow 2016)
No linear correlation
• Let us consider the 𝑚×𝑛-dimensional design matrix 𝑿.
• We will assume that the data has a mean of zero,
𝔼 𝒙 = 0.
q If this is not the case, the data can easily be centered
by subtracting the mean from all examples in a
preprocessing step.
• The unbiased sample covariance matrix associated with
X is given by:
q 𝑉𝑎𝑟(𝒙) = (1/(𝑚 − 1)) 𝑿ᵀ𝑿
• PCA finds 𝒛 = 𝑾𝑻 𝒙 where 𝑉𝑎𝑟 𝒛 is diagonal.
(Goodfellow 2016)
Decorrelation
• PCA finds 𝒛 = 𝑾ᵀ𝒙 where 𝑉𝑎𝑟(𝒛) is diagonal.
q Remember that 𝑿ᵀ𝑿 = 𝑾𝚲𝑾ᵀ
• 𝑉𝑎𝑟(𝒛) = (1/(𝑚−1)) 𝒁ᵀ𝒁 = (1/(𝑚−1)) 𝑾ᵀ𝑿ᵀ𝑿𝑾 = (1/(𝑚−1)) 𝑾ᵀ𝑾𝚲𝑾ᵀ𝑾
• 𝑉𝑎𝑟(𝒛) = (1/(𝑚−1)) 𝚲 since 𝑾ᵀ𝑾 = 𝑰
• PCA disentangles the unknown factors of variation
underlying the data.
q this disentangling takes the form of finding a rotation of
the input space that aligns the principal axes of variance
with the basis of the new representation space
associated with z.
(Goodfellow 2016)
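A minimal numpy sketch of PCA by eigendecomposition of the sample covariance, assuming a centered design matrix X:

```python
# Minimal sketch of PCA via eigendecomposition, assuming X (m x n) is centered.
import numpy as np

def pca(X):
    m = X.shape[0]
    cov = X.T @ X / (m - 1)              # unbiased sample covariance
    eigvals, W = np.linalg.eigh(cov)     # columns of W are eigenvectors
    order = np.argsort(eigvals)[::-1]    # sort directions by decreasing variance
    return W[:, order], eigvals[order]

# Usage: Z = X @ W gives decorrelated features; Z.T @ Z / (m - 1) is
# (numerically) the diagonal matrix of sorted eigenvalues.
```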
k-means clustering
• Divides the training set into k different clusters of
examples
• Provides a k-dimensional one-hot code vector ℎ
representing an input 𝑥.
q If 𝑥 belongs to cluster 𝑖, then ℎᵢ = 1 and all other
entries of the representation ℎ are zero.
q This is an example of a sparse representation
(Goodfellow 2016)
k-means algorithm
• initialize k different centroids {𝜇(1), . . . , 𝜇(𝑘)} to
different values,
• then alternate between two different steps until
convergence:
q In one step, each training example is assigned to
cluster 𝑖, where 𝑖 is the index of the nearest
centroid 𝜇(𝑖).
q In the other step, each centroid 𝜇(𝑖) is updated to
the mean of all training examples 𝑥(𝑗) assigned to
cluster 𝑖.
(Goodfellow 2016)
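A minimal numpy sketch of the two alternating steps, assuming a float design matrix X and a chosen k:

```python
# Minimal sketch of k-means, assuming data X (m x n, float) and k clusters.
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the centroids to k distinct training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each example.
        distances = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        assignments = np.argmin(distances, axis=1)
        # Update step: each centroid becomes the mean of its assigned examples.
        for i in range(k):
            if np.any(assignments == i):
                centroids[i] = X[assignments == i].mean(axis=0)
    return centroids, assignments
```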
How to evaluate clustering?
• Suppose we have images of
q Red trucks, gray trucks
q Red cars, gray cars
• One clustering algorithm may find a cluster of cars and a
cluster of trucks
• Another learning algorithm may find a cluster of red
vehicles and a cluster of gray vehicles
• A third algorithm may find 4 clusters …
q Red cars' similarity to gray trucks is the same as their
similarity to gray cars!
• A distributed representation may have two attributes
for each vehicle
(Goodfellow 2016)
Building a machine learning
algorithm
• Simple recipe
q A dataset
q A cost function (e.g. MSE, negative log-likelihood)
q A model (e.g. linear, non-linear)
q An optimization procedure (closed form, gradient
descent, stochastic gradient descent …)
(Goodfellow 2016)
Exercise
• Link equations and definitions
(Goodfellow 2016)
Stochastic gradient descent
(Goodfellow 2016)
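The slide's figure is not reproduced here; a minimal numpy sketch of minibatch stochastic gradient descent for linear regression with MSE loss, assuming inputs X, targets y, and a learning rate eps:

```python
# Minimal sketch of minibatch SGD for linear regression with MSE loss,
# assuming X (m x n), y (m,), learning rate eps, and a batch size.
import numpy as np

def sgd_linear_regression(X, y, eps=0.01, batch_size=32, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        perm = rng.permutation(len(X))               # shuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)  # gradient of the minibatch MSE
            w -= eps * grad                               # gradient step
    return w
```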
Challenges motivating deep
learning
• Simple machine learning algorithms work very well
on a wide variety of important problems.
• However, they have not succeeded in solving the
central problems in AI, such as recognizing speech
or recognizing objects.
• Generalizing to new examples becomes
exponentially more difficult when working with high-
dimensional data,
• Such data also often impose high computational
costs.
(Goodfellow 2016)
Curse of Dimensionality
Many machine learning
problems become exceedingly
difficult when the number of
dimensions in the data is high.
(Goodfellow 2016)
Smoothness prior is not
enough!
• Is there a way to represent a complex function that
has many more regions to be distinguished than the
number of training examples?
q A checkerboard contains many variations but
there is a simple structure to them.
• The answer is Yes:
q If we introduce some dependencies between the
regions
q via additional assumptions about the underlying
data generating distribution
(Goodfellow 2016)
Deep learning solution
• Deep learning assumes that the data was
generated by the composition of factors or features,
potentially at multiple levels in a hierarchy
• These apparently mild assumptions allow an
exponential gain in the relationship between the
number of examples and the number of regions that
can be distinguished.
• Deep and distributed representations counter the
curse of dimensionality.
(Goodfellow 2016)
Manifold
• A manifold is a connected set of points
q that can be approximated well by considering only a
small number of degrees of freedom, or dimensions,
embedded in a higher-dimensional space.
• A manifold is a set of points, associated with a
neighborhood around each point.
q Transformations can be applied to move on the
manifold from one position to a neighboring one.
• From any given point, the manifold locally appears to be
a Euclidean space.
q In everyday life, we experience the surface of the
world as a 2-D plane, but it is in fact a spherical
manifold in 3-D space.
(Goodfellow 2016)
Manifold
(Goodfellow 2016)
Manifold learning
• Manifold learning algorithms assume that most of
ℝⁿ consists of invalid inputs,
q interesting inputs occur only along a collection of
manifolds
q with interesting variations in the output of the learned
function occurring only along directions that lie on
the manifold,
q or with interesting variations happening only when
we move from one manifold to another.
• The key assumption remains that probability mass is highly concentrated.
(Goodfellow 2016)
Arguments supporting the
manifold hypothesis
• the probability distribution over images, text strings,
and sounds that occur in real life is highly
concentrated.
(Figure: uniformly sampled images)
(Goodfellow 2016)
Arguments supporting the
manifold hypothesis
• Examples we encounter are connected to each
other by applying transformations to traverse the
manifold.
Congratulations
(Goodfellow 2016)