Machine Learning Basics: Lecture Slides For Chapter 5 of Deep Learning Ian Goodfellow
(Goodfellow 2016)
The task, T
q Regression: Learn 𝑓: ℝⁿ → ℝ
o Example: Predict claim amount for an insured
person
q Transcription: unstructured representation to
discrete textual form
o Example: optical character recognition, speech
recognition
q Machine Translation: Sequence to sequence
o Example: translate English to French
(Goodfellow 2016)
The task, T
q Structured output: output is a vector or data
structure of values that are tightly interrelated
o Example: parsing natural language sentence
into a tree that describes grammatical structure
by tagging nodes of the tree as being verbs,
nouns, etc.
o image segmentation: assigning a pixel to a
segment
o Image captioning
(Goodfellow 2016)
The task, T
q Anomaly detection: flag unusual or atypical
events
o Credit card fraud detection
q Synthesis and sampling: generate examples
that are similar to those in the training data
o Generate textures for video games
o Speech synthesis
q Imputation of missing values: predict the
values of the missing entries
(Goodfellow 2016)
The task, T
q Denoising: 𝑓: 𝑥̃ ∈ ℝⁿ → 𝑥 ∈ ℝⁿ
o Predict the clean example 𝑥 from the corrupted example 𝑥̃
o 𝑝(𝑥 | 𝑥̃) = ?
q Density estimation or probability mass estimation
o 𝑝_model(𝒙): ℝⁿ → ℝ (𝒙 can be discrete or continuous)
o Example: 𝑝_model(𝑥ᵢ | 𝒙₋ᵢ)
(Goodfellow 2016)
The performance measure, P
• Accuracy:
q The proportion of examples for which the model produces
the correct output
• Error rate (0-1 loss):
q The proportion of examples for which the model produces
incorrect output
• Average log-probability of some examples (for density
estimation)
• It is sometimes difficult to decide what should be measured
• It is sometimes impractical to measure an implicit performance metric
• Evaluate the performance measure using a test set
(Goodfellow 2016)
The experience, E
• Supervised learning: 𝑝(𝑦|𝒙)
q Experience is a labeled dataset (or datapoints)
q Each datapoint has a label or target
(Goodfellow 2016)
The experience, E
• Semi-Supervised learning:
q Some examples include supervision targets, but
others do not.
• Multi-instance learning:
q A collection of examples is labeled as containing
an example of a class
• Reinforcement learning:
q Interaction with an environment
(Goodfellow 2016)
A dataset
• Design matrix X:
q Each row is an example
q Each column is a feature
q Example: iris dataset: 𝑋 ∈ ℝ^(150×4)
• Vector of labels y
q Example: 0 is a person, 1 is a car, 2 is a cat, etc.
q The label can be a set of words (e.g.
transcription)
(Goodfellow 2016)
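A minimal sketch of the design-matrix view, assuming scikit-learn is available; load_iris returns the 150×4 iris design matrix and its label vector:

```python
# Minimal sketch: the iris design matrix, assuming scikit-learn is installed.
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # design matrix: one row per example, one column per feature
y = iris.target    # vector of integer labels (0, 1, 2 for the three species)
print(X.shape)     # (150, 4): 150 examples, 4 features
```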
Example: Linear regression
• Task: regression problem ℝⁿ → ℝ
q 𝑦̂ = 𝒘ᵀ𝒙
• Performance measure:
q Have a test set: 𝑿𝒕𝒆𝒔𝒕 , 𝒚𝒕𝒆𝒔𝒕
q 𝑀𝑆𝐸_test = (1/𝑚) Σᵢ (𝒚̂^test − 𝒚^test)ᵢ² = (1/𝑚) ‖𝒚̂^test − 𝒚^test‖₂²
• Experience
q 𝑿𝒕𝒓𝒂𝒊𝒏 , 𝒚𝒕𝒓𝒂𝒊𝒏
q Minimize 𝑀𝑆𝐸_train
(Goodfellow 2016)
Linear regression
• 𝛻𝒘 𝑀𝑆𝐸_train = 0
• 𝛻𝒘 (1/𝑚) (𝑿^train 𝒘 − 𝒚^train)ᵀ(𝑿^train 𝒘 − 𝒚^train) = 0
• (𝑿^trainᵀ 𝑿^train) 𝒘 = 𝑿^trainᵀ 𝒚^train (normal equations)
(Goodfellow 2016)
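A minimal numpy sketch of this closed-form solution, assuming a full-rank training design matrix X_train (m × n) and target vector y_train (hypothetical names):

```python
# Minimal sketch: closed-form linear regression via the normal equations,
# assuming X_train (m x n) and y_train (m,) are numpy arrays and X_train has full column rank.
import numpy as np

def fit_linear_regression(X_train, y_train):
    # Solve (X^T X) w = X^T y directly instead of forming an explicit inverse.
    return np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

def mse(X, y, w):
    residual = X @ w - y
    return np.mean(residual ** 2)
```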
Linear Regression
Figure 5.1
(Goodfellow 2016)
Capacity, overfitting and
underfitting
• Generalization: ability to perform well on previously
unobserved data
• i.i.d. assumptions:
q The examples in each dataset are independent
from each other,
q and that the train set and test set are identically
distributed, drawn from the same probability
distribution as each other.
q We call that shared underlying distribution the
data generating distribution, denoted 𝑝_data
(Goodfellow 2016)
Capacity, overfitting and
underfitting
• For some fixed value w, the expected training set
error is exactly the same as the expected test set
error, because both expectations are formed using
the same dataset sampling process.
• In practice:
q expected test set error ≥ expected training set error
• We need ability to:
q 1. Make the training error small (otherwise: underfitting)
q 2. Make the gap between training and test error small (otherwise: overfitting)
(Goodfellow 2016)
Capacity, overfitting and
underfitting
• Underfitting occurs when the model is not able to
obtain a sufficiently low error value on the training
set.
• Overfitting occurs when the gap between the
training error and test error is too large.
• Informally, a model’s capacity is its ability to fit a
wide variety of functions.
q Low capacity may cause underfitting
q High capacity may cause overfitting
(Goodfellow 2016)
Hypothesis space
• One way to control the capacity of a learning
algorithm is by choosing its hypothesis space,
q the set of functions that the learning algorithm is
allowed to select as being the solution.
q For example, linear regression has the set of all
linear functions of its input
q To increase capacity: (model is polynomial of
degree 9)
𝑦̂ = 𝑏 + Σᵢ₌₁⁹ 𝑤ᵢ 𝑥ⁱ
(Goodfellow 2016)
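A minimal numpy sketch of increasing capacity by fitting polynomials of growing degree; x, y, and degree are assumed to be 1-D training inputs, targets, and the chosen degree:

```python
# Minimal sketch: polynomial regression of a chosen degree, assuming 1-D inputs
# x (m,) and targets y (m,). Higher degree = higher capacity.
import numpy as np

def fit_polynomial(x, y, degree):
    # Columns are x^0 (bias), x^1, ..., x^degree.
    X = np.vander(x, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_polynomial(x, w):
    X = np.vander(x, N=len(w), increasing=True)
    return X @ w
```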
Underfitting and Overfitting
in Polynomial Estimation
(Goodfellow 2016)
Generalization and Capacity
Figure 5.3
(Goodfellow 2016)
High capacity: non-
parametric models
• Nearest neighbor regression
𝑦̂ = 𝑦ᵢ where 𝑖 = argminᵢ ‖𝑿ᵢ,: − 𝒙‖₂²
(Goodfellow 2016)
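A minimal numpy sketch of nearest-neighbor regression, assuming training data X (m × n) and targets y:

```python
# Minimal sketch of 1-nearest-neighbor regression, assuming training data
# X (m x n) and targets y (m,).
import numpy as np

def nearest_neighbor_predict(X, y, x_query):
    # Squared L2 distance from the query to every training example.
    distances = np.sum((X - x_query) ** 2, axis=1)
    i = np.argmin(distances)
    return y[i]
```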
Bayes error
• The error incurred by an oracle making predictions
from the true distribution 𝑝 𝒙, 𝑦 .
• This error is due to:
q There may still be noise in the distribution
q 𝑦 might be inherently stochastic
q 𝑦 may be deterministic but involves other
variables besides 𝒙
(Goodfellow 2016)
Training Set Size
moderate amount of
noise added to a
degree-5 polynomial
(Goodfellow 2016)
The No Free Lunch theorem
• Averaged over all possible data generating distributions,
every classification algorithm has the same error rate
when classifying previously unobserved points.
(Goodfellow 2016)
No free lunch
• The goal of machine learning research is not to
seek a universal learning algorithm or the absolute
best learning algorithm.
• Instead, our goal is to understand what kinds of
distributions are relevant to the “real world”
q and what kinds of machine learning algorithms
perform well on data drawn from distributions we care
about.
(Goodfellow 2016)
Regularization
• We can give a learning algorithm a preference for
one solution in its hypothesis space to another.
q both functions are eligible, but one is preferred.
• Example: linear regression with weight decay:
𝐽(𝒘) = 𝑀𝑆𝐸_train + 𝜆𝒘ᵀ𝒘
• We are expressing a preference for the weights to
have smaller L2 norm
q Larger 𝜆 forces the weights to become smaller
(Goodfellow 2016)
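A minimal numpy sketch of linear regression with weight decay (ridge regression), assuming X, y, and a regularization strength lam chosen by the user:

```python
# Minimal sketch: linear regression with weight decay (ridge regression),
# assuming X (m x n), y (m,), and a regularization strength lam >= 0.
import numpy as np

def fit_ridge(X, y, lam):
    # Setting the gradient of (1/m)||Xw - y||^2 + lam0 * w^T w to zero gives
    # (X^T X + m*lam0*I) w = X^T y; here lam plays the role of m*lam0.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```

Larger values of lam shrink the weights toward zero, as the slide describes.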
Weight Decay
(Goodfellow 2016)
Hyperparameters
• The values of hyperparameters are not adapted by
the learning algorithm itself
• Polynomial regression hyperparameters:
q Degree of the polynomial
q 𝜆 (control of the weight decay)
• To set the hyperparameters, we need a validation
set
q Typically, one uses about 80% of the training data
for training and 20% for validation.
(Goodfellow 2016)
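A minimal sketch of an 80/20 train/validation split for choosing hyperparameters, assuming numpy arrays X and y:

```python
# Minimal sketch: an 80/20 train/validation split, assuming X (m x n) and y (m,).
import numpy as np

def split_train_valid(X, y, valid_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))        # shuffle example indices
    n_valid = int(valid_fraction * len(X))
    valid_idx, train_idx = perm[:n_valid], perm[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]
```

Each candidate hyperparameter value (e.g. degree, 𝜆) is then fit on the training split and scored on the validation split; the best-scoring value is kept.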
Cross validation
• It is used to estimate generalization error of a learning
algorithm A when the given dataset D is too small for a
simple train/test or train/valid split to yield accurate
estimation of generalization error
(Goodfellow 2016)
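A minimal sketch of k-fold cross-validation, assuming a learning algorithm A that returns a model with a predict method, and a loss function (both hypothetical interfaces):

```python
# Minimal sketch of k-fold cross-validation. Assumes A(X_train, y_train) returns
# a model with a .predict(X) method and loss(y_pred, y_true) returns a scalar.
import numpy as np

def k_fold_cv(A, loss, X, y, k=5):
    folds = np.array_split(np.arange(len(X)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = A(X[train_idx], y[train_idx])        # train on k-1 folds
        errors.append(loss(model.predict(X[test_idx]), y[test_idx]))  # test on the held-out fold
    return np.mean(errors)
```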
Cross validation
(Goodfellow 2016)
Point estimation
• Provides the single “best” prediction of some quantity of
interest
• can be a single parameter or a vector of parameters in
some parametric model
• A point estimator or statistic is any function of the data
(assuming i.i.d. samples):
𝜽̂_𝑚 = 𝑔(𝒙⁽¹⁾, 𝒙⁽²⁾, …, 𝒙⁽ᵐ⁾)
• Frequentist perspective: we assume that the true
parameter value θ is fixed but unknown
• Function estimation is really just the same as estimating
a parameter θ; the function estimator is simply a point
estimator in function space.
(Goodfellow 2016)
Statistical bias
• bias(𝜽̂_𝑚) = 𝔼(𝜽̂_𝑚) − 𝜽, where the expectation is over the data 𝒙⁽¹⁾, 𝒙⁽²⁾, …, 𝒙⁽ᵐ⁾
• An estimator is unbiased when bias(𝜽̂_𝑚) = 0
(Goodfellow 2016)
Example: Bernoulli
distribution
! " #
• Consider a set of samples {𝑥 ,𝑥 ,…,𝑥 } that are i.i.d according to a Bernoulli distribution with mean θ
q 𝑃 𝑥 𝑖 ; 𝜃 = 𝜃$ %
1 − 𝜃 !&$ %
(Goodfellow 2016)
Example: Gaussian
distribution - mean
• Consider a set of samples {𝑥⁽¹⁾, 𝑥⁽²⁾, …, 𝑥⁽ᵐ⁾} that
are i.i.d. according to a Gaussian distribution:
q 𝑝(𝑥⁽ⁱ⁾) = 𝒩(𝑥⁽ⁱ⁾; 𝜇, 𝜎²)
• 𝜇̂_𝑚 = (1/𝑚) Σᵢ₌₁ᵐ 𝑥⁽ⁱ⁾
(Goodfellow 2016)
Example: Gaussian
distribution - variance
• Biased estimator: 𝜎̂²_𝑚 = (1/𝑚) Σᵢ₌₁ᵐ (𝑥⁽ⁱ⁾ − 𝜇̂_𝑚)²
• Unbiased estimator: 𝜎̃²_𝑚 = (1/(𝑚 − 1)) Σᵢ₌₁ᵐ (𝑥⁽ⁱ⁾ − 𝜇̂_𝑚)²
(Goodfellow 2016)
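A minimal numpy illustration of the two estimators; the ddof argument selects the divisor m (biased) or m − 1 (unbiased):

```python
# Minimal sketch: biased vs. unbiased variance estimators on i.i.d. samples.
import numpy as np

x = np.random.default_rng(0).normal(loc=0.0, scale=2.0, size=50)

biased = np.var(x, ddof=0)    # divides by m      (biased: underestimates sigma^2 on average)
unbiased = np.var(x, ddof=1)  # divides by m - 1  (unbiased)
```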
Variance
• How much do we expect an estimator to vary as a
function of the data sample?
• 𝑉𝑎𝑟(𝜃̂) = ?
• Standard error 𝑆𝐸(𝜃̂) = ?
• Just as we might like an estimator to exhibit low
bias we would also like it to have relatively low
variance.
(Goodfellow 2016)
Example: Bernoulli
distribution
(Goodfellow 2016)
Trading off Bias and Variance
(Goodfellow 2016)
Bias vs. variance
(Goodfellow 2016)
Maximum Likelihood
Estimation
• Rather than guessing that some function might
make a good estimator and then analyzing its bias
and variance, we would like to have some principle
from which we can derive specific functions that are
good estimators for different models.
(Goodfellow 2016)
Conditional log-likelihood
• Supervised learning
(Goodfellow 2016)
Conditional log-likelihood for
linear regression
• To maximize: the conditional log-likelihood
Σᵢ log 𝑝(𝑦⁽ⁱ⁾ | 𝒙⁽ⁱ⁾; 𝜽), where the targets 𝑦⁽ⁱ⁾ come from
the dataset and the output of linear regression is taken
as the mean of a Gaussian over 𝑦
• Same as minimizing: 𝑀𝑆𝐸_train
(Goodfellow 2016)
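A compact restatement of the argument on this slide, assuming the conditional distribution 𝑝(𝑦 | 𝒙) = 𝒩(𝑦; 𝑦̂(𝒙; 𝒘), 𝜎²) with fixed 𝜎²:

```latex
% Conditional log-likelihood under a Gaussian with fixed variance:
\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}\right)
  = -\, m \log \sigma \;-\; \frac{m}{2}\log(2\pi)
    \;-\; \sum_{i=1}^{m} \frac{\left\lVert \hat{y}^{(i)} - y^{(i)} \right\rVert^{2}}{2\sigma^{2}}
```

Only the last term depends on 𝒘, so maximizing the conditional log-likelihood in 𝒘 is equivalent to minimizing 𝑀𝑆𝐸_train = (1/𝑚) Σᵢ ‖𝑦̂⁽ⁱ⁾ − 𝑦⁽ⁱ⁾‖².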
Properties of maximum
likelihood
• Maximum likelihood is the best estimator asymptotically
(as the number of examples 𝑚 → ∞), in terms of its rate
of convergence as 𝑚 increases.
• Consistency: plim_(𝑚→∞) 𝜃̂_𝑚 = 𝜃
q Under appropriate conditions
o The true distribution must lie within the model
family
o The true distribution must correspond to exactly
one value of θ
(Goodfellow 2016)
Bayesian statistics
• the true parameter 𝜃 is unknown or uncertain and
thus is represented as a random variable.
(Goodfellow 2016)
Maximum A Posteriori (MAP)
Estimation
• Add the prior: 𝜽_MAP = argmax_𝜽 𝑝(𝜽 | 𝒙) = argmax_𝜽 [log 𝑝(𝒙 | 𝜽) + log 𝑝(𝜽)]
(Goodfellow 2016)
Supervised learning
examples
• Logistic regression
• Support vector machines
• Decision trees
(Goodfellow 2016)
Logistic regression
• 𝑝(𝑦 = 1 | 𝒙; 𝜽) = 𝜎(𝜽ᵀ𝒙)
(Goodfellow 2016)
Support vector machines
• Introduced by Vapnik in the 90s
• Widest street approach
• Decision rule
q Which side of the street?
o 𝒘·𝒙 + 𝑏 ≥ 0 ⇒ +
o 𝒘·𝒙 + 𝑏 < 0 ⇒ −
• 𝒘·𝒙₊ + 𝑏 ≥ 1
o 𝒘·𝒙₋ + 𝑏 ≤ −1
(Goodfellow 2016)
SVM
• y" = +1 for + samples
• y" = −1 for – samples
• 𝑦" (𝒘𝒙𝒊 + 𝑏) ≥ 1 same equation for x+ and x-
• 𝑦" 𝒘𝒙𝒊 + 𝑏 − 1 ≥ 0
• Constraints: 𝑦" 𝒘𝒙𝒊 + 𝑏 − 1 = 0 for 𝒙𝒊 in the gutter
𝐰 *(F ((*DF) =
• w𝑖𝑑𝑡ℎ = 𝒙D − 𝒙( = − =
𝒘 𝒘 𝒘 𝒘
• Maximize the width of the street:
= * 𝒘 '
q max ⇒ max ⇒ min
𝒘 𝒘 =
(Goodfellow 2016)
Lagrange multipliers
• 𝐿 = ½‖𝒘‖² − Σᵢ 𝛼ᵢ [𝑦ᵢ(𝒘·𝒙ᵢ + 𝑏) − 1], with 𝛼ᵢ ≥ 0
• ∇_𝒘 𝐿 = 𝒘 − Σᵢ 𝛼ᵢ 𝑦ᵢ 𝒙ᵢ
q 𝒘 − Σᵢ 𝛼ᵢ 𝑦ᵢ 𝒙ᵢ = 𝟎 ⇒ 𝒘 = Σᵢ 𝛼ᵢ 𝑦ᵢ 𝒙ᵢ
• ∂𝐿/∂𝑏 = −Σᵢ 𝛼ᵢ 𝑦ᵢ = 0 ⇒ Σᵢ 𝛼ᵢ 𝑦ᵢ = 0
• Replacing in 𝐿:
q 𝐿 = ½ (Σᵢ 𝛼ᵢ 𝑦ᵢ 𝒙ᵢ)·(Σⱼ 𝛼ⱼ 𝑦ⱼ 𝒙ⱼ) − (Σᵢ 𝛼ᵢ 𝑦ᵢ 𝒙ᵢ)·(Σⱼ 𝛼ⱼ 𝑦ⱼ 𝒙ⱼ) − 𝑏 Σᵢ 𝛼ᵢ 𝑦ᵢ + Σᵢ 𝛼ᵢ
q 𝐿 = Σᵢ 𝛼ᵢ − ½ Σᵢ Σⱼ 𝛼ᵢ 𝛼ⱼ 𝑦ᵢ 𝑦ⱼ 𝒙ᵢ·𝒙ⱼ
• Numerical analysts will solve this problem (maximize with
respect to 𝛼ᵢ ≥ 0) and give us the 𝛼ᵢ
• Numerical analysts will solve this problem (max with
respect to 𝛼) ≥ 0) and give us the 𝛼)
(Goodfellow 2016)
Kernel trick
• 𝛼" ≠ 0 only for points in the gutter, also known as
support vectors
• Decision for 𝒙𝒋 : 𝒘𝒙𝒋 + 𝑏 = ∑𝛼" 𝑦" 𝒙𝒊 𝒙𝒋 + 𝑏
• We can change the representation of 𝒙𝒊 to be 𝜙 𝒙𝒊
• We do not need direct access to 𝜙 𝒙𝒊
q Instead we define 𝑘 𝒙𝒊 , 𝒙𝒋 = 𝜙 𝒙𝒊 𝜙 𝒙𝒋
q The function k is called kernel
• 𝑓 𝒙 = ∑𝛼" 𝑦" 𝑘 𝒙𝒊 , 𝒙 + 𝑏
q This function is nonlinear with respect to 𝒙
(Goodfellow 2016)
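A minimal numpy sketch of this kernelized decision function, assuming the multipliers alpha, the bias b, and the support vectors X_sv with labels y_sv were already obtained from the quadratic program on the previous slides:

```python
# Minimal sketch of f(x) = sum_i alpha_i y_i k(x_i, x) + b. Assumes alpha and b
# come from a QP solver; X_sv (s x n), y_sv (s,), alpha (s,) are the support
# vectors, their labels, and their Lagrange multipliers.
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    # Gaussian (RBF) kernel between two vectors.
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def decision_function(x, X_sv, y_sv, alpha, b, kernel=rbf_kernel):
    return sum(a * y * kernel(x_sv, x) for a, y, x_sv in zip(alpha, y_sv, X_sv)) + b

def classify(x, X_sv, y_sv, alpha, b, kernel=rbf_kernel):
    return 1 if decision_function(x, X_sv, y_sv, alpha, b, kernel) >= 0 else -1
```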
Advantages of the kernel
trick
• The kernel trick is powerful for two reasons.
q it allows us to learn models that are nonlinear as a
function of 𝒙 using convex optimization techniques
that are guaranteed to converge efficiently
q the kernel function often admits an implementation
that is significantly more computationally efficient than
naively constructing two 𝜙(𝒙) vectors and explicitly
taking their dot product.
o In some cases, 𝜙(𝒙) can even be infinite-dimensional
(Goodfellow 2016)
Gaussian kernel
• Gaussian kernel
q a.k.a. RBF: Radial Basis Function kernel
q 𝑘(𝒖, 𝒗) = 𝒩(𝒖 − 𝒗; 𝟎, 𝜎²𝑰) ∝ exp(−‖𝒖 − 𝒗‖² / (2𝜎²))
(Goodfellow 2016)
Watch (SVM)
• https://www.youtube.com/watch?v=_PwhiWxHK8o
(Goodfellow 2016)
K-Nearest Neighbors
• There is not even really a training stage or learning
process.
q We find the k-nearest neighbors to x in the
training data X. We then return the average of the
corresponding y values in the training set
• In the case of classification, we can average over
one-hot code vectors
q We can then interpret the average over these
one-hot codes as giving a probability distribution
over classes
(Goodfellow 2016)
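A minimal numpy sketch of k-NN classification by averaging one-hot codes, assuming training data X, integer labels y in {0, …, C−1}, and a query point x_query:

```python
# Minimal sketch of k-nearest-neighbor classification via averaged one-hot codes,
# assuming X (m x n) and integer labels y (m,).
import numpy as np

def knn_predict_proba(X, y, x_query, k=5, num_classes=None):
    if num_classes is None:
        num_classes = int(y.max()) + 1
    distances = np.sum((X - x_query) ** 2, axis=1)   # squared L2 distances
    neighbors = np.argsort(distances)[:k]            # indices of the k nearest examples
    one_hot = np.eye(num_classes)[y[neighbors]]      # k x C one-hot codes
    return one_hot.mean(axis=0)                      # probability distribution over classes
```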
K-NN limitations
• It cannot learn that one feature is more
discriminative than another.
q For example, imagine we have a regression task
with 𝒙 ∈ ℝ¹⁰⁰ drawn from an isotropic Gaussian
distribution,
q but only a single variable is relevant to the output:
𝑦 = 𝑥₁
q Nearest neighbor regression will not be able to
detect this simple pattern.
(Goodfellow 2016)
Decision Trees
How a decision tree
might divide ℝ².
(Goodfellow 2016)
Unsupervised learning
• Density estimation
• Learning to draw samples from a distribution
• Learning to denoise data from some distribution
• Clustering the data into groups of related examples
• Simpler representation:
q Lower-dimensional representation
q Sparse representation
q Independent representation
(Goodfellow 2016)
Principal Components
Analysis
• lower dimensionality
• no linear correlation
(Goodfellow 2016)
No linear correlation
• Let us consider the 𝑚×𝑛-dimensional design matrix 𝑿.
• We will assume that the data has a mean of zero,
𝔼 𝒙 = 0.
q If this is not the case, the data can easily be centered
by subtracting the mean from all examples in a
preprocessing step.
• The unbiased sample covariance matrix associated with
X is given by:
q 𝑉𝑎𝑟(𝒙) = (1/(𝑚 − 1)) 𝑿ᵀ𝑿
• PCA finds 𝒛 = 𝑾𝑻 𝒙 where 𝑉𝑎𝑟 𝒛 is diagonal.
(Goodfellow 2016)
Decorrelation
• PCA finds 𝒛 = 𝑾ᵀ𝒙 where 𝑉𝑎𝑟(𝒛) is diagonal.
q Remember that 𝑿ᵀ𝑿 = 𝑾𝚲𝑾ᵀ
• 𝑉𝑎𝑟(𝒛) = (1/(𝑚−1)) 𝒁ᵀ𝒁 = (1/(𝑚−1)) 𝑾ᵀ𝑿ᵀ𝑿𝑾 = (1/(𝑚−1)) 𝑾ᵀ𝑾𝚲𝑾ᵀ𝑾
• 𝑉𝑎𝑟(𝒛) = (1/(𝑚−1)) 𝚲 since 𝑾ᵀ𝑾 = 𝑰
• PCA disentangles the unknown factors of variation
underlying the data.
q this disentangling takes the form of finding a rotation of
the input space that aligns the principal axes of variance
with the basis of the new representation space
associated with z.
(Goodfellow 2016)
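A minimal numpy sketch of PCA by eigendecomposition of the sample covariance, assuming a centered design matrix X:

```python
# Minimal sketch of PCA via eigendecomposition, assuming X (m x n) is centered.
import numpy as np

def pca(X):
    m = X.shape[0]
    cov = X.T @ X / (m - 1)              # unbiased sample covariance
    eigvals, W = np.linalg.eigh(cov)     # columns of W are eigenvectors
    order = np.argsort(eigvals)[::-1]    # sort directions by decreasing variance
    return W[:, order], eigvals[order]

# Usage: Z = X @ W gives decorrelated features; Z.T @ Z / (m - 1) is
# (numerically) the diagonal matrix of sorted eigenvalues.
```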
k-means clustering
• Divides the training set into k different clusters of
examples
• Provides a k-dimensional one-hot code vector ℎ
representing an input 𝑥.
q If 𝑥 belongs to cluster 𝑖, then ℎᵢ = 1 and all other
entries of the representation ℎ are zero.
q This is an example of a sparse representation
(Goodfellow 2016)
k-means algorithm
• initialize k different centroids {𝜇(1), . . . , 𝜇(𝑘)} to
different values,
• then alternate between two different steps until
convergence:
q In one step, each training example is assigned to
cluster 𝑖, where 𝑖 is the index of the nearest
centroid 𝜇(𝑖).
q In the other step, each centroid 𝜇(𝑖) is updated to
the mean of all training examples 𝑥(𝑗) assigned to
cluster 𝑖.
(Goodfellow 2016)
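A minimal numpy sketch of the two alternating steps, assuming a float design matrix X and a chosen k:

```python
# Minimal sketch of k-means, assuming data X (m x n, float) and k clusters.
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the centroids to k distinct training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each example.
        distances = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        assignments = np.argmin(distances, axis=1)
        # Update step: each centroid becomes the mean of its assigned examples.
        for i in range(k):
            if np.any(assignments == i):
                centroids[i] = X[assignments == i].mean(axis=0)
    return centroids, assignments
```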
How to evaluate clustering?
• Suppose we have images of
q Red trucks, gray trucks
q Red cars, gray cars
• One clustering algorithm may find a cluster of cars and a
cluster of trucks
• Another learning algorithm may find a cluster of red
vehicles and a cluster of gray vehicles
• A third algorithm may find 4 clusters …
q Red cars' similarity to gray trucks is the same as their
similarity to gray cars!
• A distributed representation may have two attributes
for each vehicle
(Goodfellow 2016)
Building a machine learning
algorithm
• Simple recipe
q A dataset
q A cost function (e.g. MSE, negative log-likelihood)
q A model (e.g. linear, non-linear)
q An optimization procedure (closed form, gradient
descent, stochastic gradient descent …)
(Goodfellow 2016)
Exercise
• Link equations and definitions
(Goodfellow 2016)
Stochastic gradient descent
(Goodfellow 2016)
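The slide's figure is not reproduced here; a minimal numpy sketch of minibatch stochastic gradient descent for linear regression with MSE loss, assuming inputs X, targets y, and a learning rate eps:

```python
# Minimal sketch of minibatch SGD for linear regression with MSE loss,
# assuming X (m x n), y (m,), learning rate eps, and a batch size.
import numpy as np

def sgd_linear_regression(X, y, eps=0.01, batch_size=32, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        perm = rng.permutation(len(X))               # shuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)  # gradient of the minibatch MSE
            w -= eps * grad                               # gradient step
    return w
```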
Challenges motivating deep
learning
• Simple machine learning algorithms work very well
on a wide variety of important problems.
• However, they have not succeeded in solving the
central problems in AI, such as recognizing speech
or recognizing objects.
• Generalizing to new examples becomes
exponentially more difficult when working with high-
dimensional data,
• Such data also often impose high computational
costs.
(Goodfellow 2016)
Curse of Dimensionality
Many machine learning
problems become exceedingly
difficult when the number of
dimensions in the data is high.
(Goodfellow 2016)
Smoothness prior is not
enough!
• Is there a way to represent a complex function that
has many more regions to be distinguished than the
number of training examples?
q A checkerboard contains many variations but
there is a simple structure to them.
• The answer is Yes:
q If we introduce some dependencies between the
regions
q via additional assumptions about the underlying
data generating distribution
(Goodfellow 2016)
Deep learning solution
• Deep learning assumes that the data was
generated by the composition of factors or features,
potentially at multiple levels in a hierarchy
• These apparently mild assumptions allow an
exponential gain in the relationship between the
number of examples and the number of regions that
can be distinguished.
• Deep and distributed representations counter the
curse of dimensionality.
(Goodfellow 2016)
Manifold
• A manifold is a connected set of points
q that can be approximated well by considering only a
small number of degrees of freedom, or dimensions,
embedded in a higher-dimensional space.
• A manifold is a set of points, associated with a
neighborhood around each point.
q Transformations can be applied to move on the
manifold from one position to a neighboring one.
• From any given point, the manifold locally appears to be
a Euclidean space.
q In everyday life, we experience the surface of the
world as a 2-D plane, but it is in fact a spherical
manifold in 3-D space.
(Goodfellow 2016)
Manifold
(Goodfellow 2016)
Manifold learning
• Manifold learning algorithms assume that most of
ℝⁿ consists of invalid inputs,
q interesting inputs occur only along a collection of
manifolds
q with interesting variations in the output of the learned
function occurring only along directions that lie on
the manifold,
q or with interesting variations happening only when
we move from one manifold to another.
• The key assumption remains that probability mass is highly concentrated.
(Goodfellow 2016)
Arguments supporting the
manifold hypothesis
• the probability distribution over images, text strings,
and sounds that occur in real life is highly
concentrated.
(Figure: uniformly sampled images)
(Goodfellow 2016)
Arguments supporting the
manifold hypothesis
• Examples we encounter are connected to each
other by applying transformations to traverse the
manifold.
Congratulations
(Goodfellow 2016)