Machine Learning
Linear Models for Regression
Marcello Restelli
March 9, 2017
Outline
1 Linear Regression
2 Minimizing Least Squares
3 Regularization
4 Bayesian Linear Regression
Linear Regression
Regression Problems
The goal of regression is to learn a mapping from an input x to a continuous output t
We assume there exists a model that goes from x to t
Examples:
Predict stock market price
Predict age of a web user
Predict effect of an actuation in robotics
Predict the value of a house
Predict the temperature in a building
Linear Regression
Linear Models
Many real processes can be approximated with linear models
Linear regression often appears as a module of larger systems
Linear problems can be solved analytically
Linear prediction provides an introduction to many of the core concepts of machine learning
Augmented with kernels, it can model non-linear relationships
Linear Regression
Linear Function
Linear function in the parameters w:
y(x, w) = w_0 + \sum_{j=1}^{D-1} w_j x_j = w^T x,    where x = (1, x_1, \dots, x_{D-1})^T

w_0 is the offset
Each point in (w_0, w_1) space describes a model
Linear Regression
Loss Functions for Regression
We need to quantify what it means to do well or poorly on a task
We need to define a loss (error) function: L(t, y(x))
The average, or expected, loss is given by:
E[L] = \int \int L(t, y(x)) \, p(x, t) \, dx \, dt

A common choice is the squared loss function:

E[L] = \int \int (t - y(x))^2 \, p(x, t) \, dx \, dt

The optimal solution (if we assume a completely flexible function) is the conditional average:

y(x) = \int t \, p(t|x) \, dt = E[t|x]
Linear Regression
Other Loss Functions
A simple generalization of the squared loss is the Minkowski loss:

E[L] = \int \int |t - y(x)|^q \, p(x, t) \, dx \, dt

The minimum of E[L] is given by:
the conditional mean for q = 2
the conditional median for q = 1
the conditional mode for q → 0
(a quick numerical check of the first two cases is sketched below)
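As a small numerical sanity check (a sketch added here, not part of the original slides; the target distribution is an arbitrary skewed choice), a grid search over constant predictions recovers the mean for q = 2 and the median for q = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.gamma(shape=2.0, scale=1.0, size=100_000)    # skewed target distribution (illustrative choice)

candidates = np.linspace(0.0, 10.0, 2001)            # constant predictions y
sq_loss  = [np.mean((t - y) ** 2)  for y in candidates]   # q = 2
abs_loss = [np.mean(np.abs(t - y)) for y in candidates]   # q = 1

print(candidates[np.argmin(sq_loss)],  t.mean())      # both close to the mean (about 2.0)
print(candidates[np.argmin(abs_loss)], np.median(t))  # both close to the median (about 1.68)
```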
Linear Regression
Basis Functions
To consider non-linear functions of the input, we can use non-linear basis functions:

y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x) = w^T \phi(x),    where \phi(x) = (1, \phi_1(x), \dots, \phi_{M-1}(x))^T

Examples:
Polynomial: \phi_j(x) = x^j
Gaussian: \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2\sigma^2} \right)
Sigmoidal: \phi_j(x) = \frac{1}{1 + \exp\left( \frac{\mu_j - x}{\sigma} \right)}
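As a concrete illustration (not from the slides; the function names and the choices of the centers μ_j and width σ are assumptions), a minimal NumPy sketch of the three families of basis functions:

```python
import numpy as np

def polynomial_basis(x, M):
    """phi(x) = (1, x, x^2, ..., x^(M-1)) for a 1D input array x."""
    return np.vstack([x ** j for j in range(M)]).T            # shape (N, M)

def gaussian_basis(x, mus, sigma):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 sigma^2)), with a bias column prepended."""
    phi = np.exp(-((x[:, None] - mus[None, :]) ** 2) / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

def sigmoidal_basis(x, mus, sigma):
    """phi_j(x) = 1 / (1 + exp((mu_j - x) / sigma)), with a bias column prepended."""
    phi = 1.0 / (1.0 + np.exp((mus[None, :] - x[:, None]) / sigma))
    return np.hstack([np.ones((len(x), 1)), phi])

# Example: 9 Gaussian basis functions with centers spread over [-1, 1]
x = np.linspace(-1.0, 1.0, 100)
Phi = gaussian_basis(x, mus=np.linspace(-1.0, 1.0, 9), sigma=0.2)
print(Phi.shape)   # (100, 10): bias column + 9 basis functions
```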
Linear Regression
Basis Functions
Example
(Figure: example data plotted against the input x and against the feature x^2.)
Linear Regression
Discriminative vs Generative
Generative approach:
Model the joint density: p(x, t) = p(x|t)p(t)
Infer the conditional density: p(t|x) = p(x, t) / p(x)
Marginalize to find the conditional mean: E[t|x] = \int t \, p(t|x) \, dt
Discriminative approach:
Model the conditional density p(t|x)
Marginalize to find the conditional mean: E[t|x] = \int t \, p(t|x) \, dt
Direct approach:
Find a regression function y(x) directly from the training data
Minimizing Least Squares
Minimizing Least Squares
Given a data set with N samples, let us consider the following error (loss) function:

L(w) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2

This is (half) the residual sum of squares (RSS), a.k.a. sum of squared errors (SSE)
It can also be written as the squared \ell_2-norm of the vector \epsilon of residual errors:

RSS(w) = \| \epsilon \|_2^2 = \sum_{i=1}^{N} \epsilon_i^2
Minimizing Least Squares
Ordinary Least Squares
Closed-Form Optimization
Let’s write RSS in matrix form with Φ = (φ(x1 ), . . . , φ(xN ))T and
t = (t1 , . . . , tN )T
L(w) = \frac{1}{2} RSS(w) = \frac{1}{2} (t - \Phi w)^T (t - \Phi w)

Compute the first and second derivatives:

\frac{\partial L(w)}{\partial w} = -\Phi^T (t - \Phi w),    \frac{\partial^2 L(w)}{\partial w \, \partial w^T} = \Phi^T \Phi

Assuming \Phi^T \Phi is nonsingular:

\hat{w}_{OLS} = \left( \Phi^T \Phi \right)^{-1} \Phi^T t

Complexity: O(NM^2 + M^3)
Cholesky: M^3 + NM^2/2
QR: NM^2
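A minimal NumPy sketch of the closed-form solution on assumed synthetic data; in practice a least-squares routine (QR/SVD based) is preferable to explicitly forming the inverse of Φ^T Φ:

```python
import numpy as np

def fit_ols_normal_eq(Phi, t):
    """Solve (Phi^T Phi) w = Phi^T t; assumes Phi^T Phi is nonsingular."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

def fit_ols(Phi, t):
    """Numerically safer route via a least-squares solver."""
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=50)    # synthetic data: t = 1 + 2x + noise
Phi = np.column_stack([np.ones_like(x), x])          # design matrix with a bias column
print(fit_ols(Phi, t))                               # roughly [1.0, 2.0]
print(np.allclose(fit_ols(Phi, t), fit_ols_normal_eq(Phi, t)))  # True on this well-conditioned problem
```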
Minimizing Least Squares
Geometric Interpretation
t is an N-dimensional vector
Let us denote by \varphi_j the j-th column of \Phi
Define \hat{t} as the N-dimensional vector whose n-th element is y(x_n, w)
\hat{t} is a linear combination of \varphi_1, \dots, \varphi_M, so \hat{t} lies in an M-dimensional subspace S
Since \hat{t} minimizes the SSE with respect to t, it is the orthogonal projection of t onto the subspace S:

\hat{t} = \Phi \hat{w} = \Phi \left( \Phi^T \Phi \right)^{-1} \Phi^T t

H = \Phi \left( \Phi^T \Phi \right)^{-1} \Phi^T is called the hat matrix
Minimizing Least Squares
Geometric Example
Assume N = 3 and M = D = 2

\Phi = X = \begin{pmatrix} 1 & 2 \\ 1 & -2 \\ 1 & 2 \end{pmatrix},    t = \begin{pmatrix} 5 \\ 1 \\ 2 \end{pmatrix},    \hat{t} = \begin{pmatrix} 3.5 \\ 1 \\ 3.5 \end{pmatrix}
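The projection can be checked numerically with the numbers above (a small sketch, not from the slides):

```python
import numpy as np

Phi = np.array([[1.0,  2.0],
                [1.0, -2.0],
                [1.0,  2.0]])
t = np.array([5.0, 1.0, 2.0])

H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T   # hat matrix: projector onto the column space of Phi
t_hat = H @ t

print(t_hat)                        # [3.5, 1.0, 3.5], as in the example
print(np.allclose(H @ H, H))        # True: a projection matrix is idempotent
print(np.dot(t - t_hat, t_hat))     # ~0: the residual is orthogonal to the fitted vector
```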
Minimizing Least Squares
Gradient Optimization
Closed-form solution is not practical with big data
We can use sequential (online) updates
Stochastic gradient descent
If the loss function can be expressed as a sum over samples, L(w) = \sum_n L_n(w), we can update the parameters after each sample:

w^{(k+1)} = w^{(k)} - \alpha^{(k)} \nabla L_n(w^{(k)})

w^{(k+1)} = w^{(k)} - \alpha^{(k)} \left( w^{(k)T} \phi(x_n) - t_n \right) \phi(x_n)

where k is the iteration index and \alpha^{(k)} is the learning rate
For convergence, the learning rate has to satisfy:

\sum_{k=0}^{\infty} \alpha^{(k)} = +\infty,    \sum_{k=0}^{\infty} \left( \alpha^{(k)} \right)^2 < +\infty
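A minimal sketch of the sequential update on assumed synthetic data; the learning-rate schedule α^(k) ∝ k^(-3/4) is one choice that satisfies the two conditions above:

```python
import numpy as np

def lms_sgd(Phi, t, n_epochs=200, alpha0=0.5):
    """Sequential least-squares updates: w <- w - alpha_k (w^T phi_n - t_n) phi_n."""
    N, M = Phi.shape
    w = np.zeros(M)
    k = 0
    for _ in range(n_epochs):
        for n in np.random.permutation(N):
            k += 1
            alpha = alpha0 / k ** 0.75        # sum(alpha) diverges, sum(alpha^2) converges
            w -= alpha * (w @ Phi[n] - t[n]) * Phi[n]
    return w

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 100)
print(lms_sgd(Phi, t))    # drifts toward the batch OLS solution (about [1, 2])
```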
Minimizing Least Squares
Maximum Likelihood (ML)
The output variable t can be modeled as a deterministic function f of the input x plus random noise:

t = f(x) + \epsilon

We want to approximate f(x) with y(x, w)
We assume \epsilon \sim N(0, \sigma^2)
Given N samples, with inputs X = \{x_1, \dots, x_N\} and outputs t = (t_1, \dots, t_N)^T, the likelihood function is:

p(t | X, w, \sigma^2) = \prod_{n=1}^{N} N\left( t_n \,|\, w^T \phi(x_n), \sigma^2 \right)
Minimizing Least Squares
Maximum Likelihood (ML)
Assuming the samples to be independent and identically distributed (iid), we consider the log-likelihood:

\ell(w) = \ln p(t | X, w, \sigma^2) = \sum_{n=1}^{N} \ln p(t_n | x_n, w, \sigma^2) = -\frac{N}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} RSS(w)

To find the maximum likelihood estimate, we set the gradient to zero:

\nabla \ell(w) = \sum_{n=1}^{N} t_n \phi(x_n)^T - w^T \left( \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T \right) = 0

w_{ML} = \left( \Phi^T \Phi \right)^{-1} \Phi^T t
Minimizing Least Squares
Variance of the Parameters
We assume that:
the observations t_i are uncorrelated and have constant variance \sigma^2
the x_i are fixed (non-random)
The variance-covariance matrix of the least-squares estimate is:

Var(\hat{w}_{OLS}) = \left( \Phi^T \Phi \right)^{-1} \sigma^2

Usually, the variance \sigma^2 is estimated by:

\hat{\sigma}^2 = \frac{1}{N - M - 1} \sum_{n=1}^{N} \left( t_n - \hat{w}^T \phi(x_n) \right)^2

Assuming that the model is linear in the features \phi_1(), \dots, \phi_M() and that the noise is additive and Gaussian:

\hat{w} \sim N\left( w, (\Phi^T \Phi)^{-1} \sigma^2 \right),    (N - M - 1)\hat{\sigma}^2 \sim \sigma^2 \chi^2_{N-M-1}

These properties can be used to form hypothesis tests and confidence intervals
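A small sketch of these estimates (the helper name and synthetic data are mine; here Φ has a bias column plus the basis-function columns, so the slide's denominator N − M − 1 corresponds to N minus the number of columns of Φ):

```python
import numpy as np

def ols_inference(Phi, t):
    """OLS fit, unbiased noise-variance estimate, and parameter standard errors."""
    N, n_cols = Phi.shape
    w_hat, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    residuals = t - Phi @ w_hat
    sigma2_hat = residuals @ residuals / (N - n_cols)   # N - M - 1 in the slide's notation
    cov_w = sigma2_hat * np.linalg.inv(Phi.T @ Phi)     # estimated Var(w_hat)
    return w_hat, sigma2_hat, np.sqrt(np.diag(cov_w))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 200)
w_hat, sigma2_hat, std_err = ols_inference(Phi, t)
print(w_hat, sigma2_hat, std_err)    # sigma2_hat should be close to the true 0.01
```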
Minimizing Least Squares
Gauss-Markov Theorem
Theorem (Gauss-Markov)
The least squares estimate of w has the smallest variance among all linear
unbiased estimates.
It follows that the least squares estimator has the lowest MSE among all linear unbiased estimators
However, there may exist a biased estimator with smaller MSE
Minimizing Least Squares
Multiple Outputs
Let us now consider the case of multiple outputs
We could use a different set of basis functions for each output, thus having independent regression problems
Usually, a single set of basis functions is considered:

\hat{W}_{ML} = \left( \Phi^T \Phi \right)^{-1} \Phi^T T

For each output t_k we have:

\hat{w}_k = \left( \Phi^T \Phi \right)^{-1} \Phi^T t_k

where t_k is an N-dimensional column vector
The solution decouples between the different outputs
The pseudo-inverse \Phi^\dagger = \left( \Phi^T \Phi \right)^{-1} \Phi^T needs to be computed only once
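A small sketch showing that one pseudo-inverse solves all outputs at once and that the per-output solutions decouple (assumed synthetic data and shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 100, 3, 2                                  # samples, basis functions, outputs
Phi = rng.normal(size=(N, M))
W_true = rng.normal(size=(M, K))
T = Phi @ W_true + 0.1 * rng.normal(size=(N, K))     # N x K target matrix

Phi_pinv = np.linalg.pinv(Phi)                       # equals (Phi^T Phi)^-1 Phi^T for full column rank
W_hat = Phi_pinv @ T                                 # all K regressions in one shot
per_output = np.column_stack([Phi_pinv @ T[:, k] for k in range(K)])
print(np.allclose(W_hat, per_output))                # True: the columns decouple
```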
Minimizing Least Squares
Increasing Model Complexity: Quadratic Function
(Figure: data generated from the quadratic function together with polynomial fits of degree 1, 2, 10, 15, and 19.)
Minimizing Least Squares
Increasing Model Complexity: Sinusoidal Function
(Figure: fits of increasing model complexity on the sinusoidal dataset.)
Minimizing Least Squares
Under-fitting vs Over-Fitting
With low-order polynomials we have under-fitting
With high-order polynomials we get excellent fit over the training data,
but a poor representation of the true function: over-fitting
We want to have good generalization
We use a test set of 100 samples
to evaluate generalization
E_{RMS} = \sqrt{ \frac{2 \cdot RSS(\hat{w})}{N} }
Minimizing Least Squares
How to Avoid Over-fitting?
This is the problem of model selection (we will see this later)
What happens when the number of training samples increases?
Minimizing Least Squares
How to Avoid Over-fitting?
What happens to the parameters when the model gets more complex?
        M=0      M=1      M=3          M=9
ŵ_0     0.19     0.82     0.31         0.35
ŵ_1             -1.27     7.99         232.37
ŵ_2                      -25.43       -5321.83
ŵ_3                       17.37        48568.31
ŵ_4                                   -231639.30
ŵ_5                                    640042.26
ŵ_6                                  -1061800.52
ŵ_7                                   1042400.18
ŵ_8                                   -557682.99
ŵ_9                                    125201.43
Regularization
Ridge Regression
One way to reduce the MSE is to change the loss function as follows
L(w) = LD (w) + λLW (w)
LD (w): error on data (e.g., RSS)
LW (w): model complexity
By taking L_W(w) = \frac{1}{2} w^T w = \frac{1}{2} \|w\|_2^2 we get:

L(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right)^2 + \frac{\lambda}{2} \|w\|_2^2

It is called ridge regression (or weight decay)
It is a regularization (or parameter shrinkage) method
The loss function is still quadratic in w:

\hat{w}_{ridge} = \left( \lambda I + \Phi^T \Phi \right)^{-1} \Phi^T t
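A minimal sketch of the closed-form ridge solution, plus an illustration on assumed synthetic data of how the weight norm shrinks as λ grows, in the spirit of the example that follows:

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Ridge solution (lambda I + Phi^T Phi)^-1 Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Degree-15 polynomial fit of noisy quadratic data (illustrative, not the slides' dataset)
rng = np.random.default_rng(0)
x = np.linspace(2.0, 20.0, 30)
t = 0.5 * x ** 2 + rng.normal(0.0, 10.0, size=x.shape)
z = (x - x.mean()) / x.std()                 # standardize before building the polynomial features
Phi = np.vander(z, 16, increasing=True)      # columns 1, z, ..., z^15

for lam in [1e-4, 1e-2, 1.0, 100.0]:
    w = fit_ridge(Phi, t, lam)
    print(lam, np.linalg.norm(w))            # the norm of w shrinks as lambda grows
```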
Regularization
Ridge Regression
Quadratic Example
Polynomial of degree 15 fitted to the quadratic dataset for increasing values of λ
(Figure: the fitted curve for each value of λ; the weight norm shrinks as the regularization grows.)

λ          ||w||_2
0          2.74e+6
1e-10      1.75e+3
1e-9       9.32e+2
1e-8       3.92e+2
1e-7       1.54e+2
1e-6       6.22e+1
Regularization
Ridge Regression
Quadratic Example: Weights
Polynomial degree 15
(Figure: the value of each weight as a function of λ, for λ from 10^-10 to 10^-6; the weights shrink toward zero as λ increases.)
Regularization
Ridge Regression
Sinusoidal Example
Regularization
Lasso
Another popular regularization method is lasso
L(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right)^2 + \frac{\lambda}{2} \|w\|_1,    where \|w\|_1 = \sum_{j=1}^{M} |w_j|

Differently from ridge, lasso is nonlinear in the t_i and no closed-form solution exists (it is a quadratic programming problem)
Nonetheless, it has the advantage of making some weights exactly equal to zero for sufficiently large values of λ
Lasso yields sparse models (see the sketch below)
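A minimal sketch contrasting ridge and lasso sparsity on assumed synthetic data. It uses scikit-learn's solvers as one possible implementation; note that its alpha parameter plays the role of λ and that its lasso objective scales the data term by 1/(2N), so the values below are only illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge   # lasso has no closed form, so use an iterative solver

rng = np.random.default_rng(0)
N, M = 100, 20
Phi = rng.normal(size=(N, M))
w_true = np.zeros(M)
w_true[:3] = [2.0, -3.0, 1.5]                   # only 3 features are actually relevant
t = Phi @ w_true + 0.1 * rng.normal(size=N)

ridge = Ridge(alpha=1.0).fit(Phi, t)
lasso = Lasso(alpha=0.1).fit(Phi, t)

print(np.sum(np.abs(ridge.coef_) < 1e-6))       # typically 0: ridge shrinks but rarely zeroes weights
print(np.sum(np.abs(lasso.coef_) < 1e-6))       # typically close to 17: lasso zeroes the irrelevant weights
```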
Regularization
Lasso vs Ridge Regression
(Figure: the \ell_2 (ridge) and \ell_1 (lasso) constraint regions in (w_1, w_2) space, together with the unregularized optimum w*.)
Lasso tends to generate sparser solutions than a quadratic regularizer
Bayesian Linear Regression
Bayesian Approach
We formulate our knowledge about the world in a probabilistic way
We define the model that expresses our knowledge qualitatively
Our model will have some unknown parameters
We capture our assumptions about unknown parameters by specifying the
prior distribution over those parameters before seeing the data
We observe the data
We compute the posterior probability distribution for the parameters,
given observed data
We use the posterior distribution to:
Make predictions by averaging over the posterior distribution
Examine/Account for uncertainty in the parameter values
Make decisions by minimizing expected posterior loss
Bayesian Linear Regression
Posterior Distribution
The posterior distribution for the model parameters can be found by
combining the prior with the likelihood for the parameters given data
This is accomplished using Bayes’ Rule:
P(parameters | data) = \frac{P(data | parameters) \, P(parameters)}{P(data)}

p(w | D) = \frac{p(D | w) \, P(w)}{P(D)}

where
p(w | D) is the posterior probability of parameters w given training data D
p(D | w) is the probability (likelihood) of observing D given w
P(w) is the prior probability over the parameters
P(D) is the marginal likelihood (normalizing constant): P(D) = \int p(D | w) P(w) \, dw
We want the most probable value of w given the data: maximum a
posteriori (MAP). It is the mode of the posterior.
Bayesian Linear Regression
Predictive Distribution
Stating Bayes’ rule in words: posterior ∝ likelihood × prior
Prediction for a new data point x∗ (given the training dataset D) can be
done by integrating over the posterior distribution:
p(x^* | D) = \int p(x^* | w, D) \, p(w | D) \, dw = E[\, p(x^* | w, D) \,]
which is sometimes called predictive distribution
Note that computing the predictive distribution requires knowledge of
the posterior distribution, which is usually intractable
Bayesian Linear Regression
Bayesian Linear Regression
Another approach to avoid the over-fitting problem of ML is to use Bayesian
linear regression
In the Bayesian approach the parameters of the model are considered as drawn
from some distribution
Assuming a Gaussian likelihood model, the conjugate prior is Gaussian too
p(w) = N (w|w0 , S0 )
Given the data D, the posterior is still Gaussian
p(w | t, \Phi, \sigma^2) \propto N(w | w_0, S_0) \, N(t | \Phi w, \sigma^2 I_N) = N(w | w_N, S_N)

w_N = S_N \left( S_0^{-1} w_0 + \frac{\Phi^T t}{\sigma^2} \right)

S_N^{-1} = S_0^{-1} + \frac{\Phi^T \Phi}{\sigma^2}

For sequential data, the posterior acts as the prior for the next iteration
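A direct NumPy transcription of this update (the function name is an assumption):

```python
import numpy as np

def bayes_posterior(Phi, t, sigma2, w0, S0):
    """Gaussian posterior N(w | w_N, S_N) for prior N(w | w_0, S_0) and noise variance sigma2."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    w_N = S_N @ (S0_inv @ w0 + Phi.T @ t / sigma2)
    return w_N, S_N
```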
Bayesian Linear Regression
Relation to Ridge Regression
In Gaussian distributions the mode coincides with the mean
It follows that wN is the MAP estimator (Maximum a posteriori)
If the prior has infinite variance, wN reduces to the ML estimator
If w_0 = 0 and S_0 = \tau^2 I, then w_N reduces to the ridge estimate with \lambda = \frac{\sigma^2}{\tau^2} (a numerical check is sketched below)
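A quick numerical check of this equivalence, on synthetic data drawn as in the 1D example that follows:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 30)
Phi = np.column_stack([np.ones_like(x), x])
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, 30)        # noise std 0.2, i.e. variance 0.04

sigma2, tau2 = 0.04, 0.5
lam = sigma2 / tau2

S_N = np.linalg.inv(np.eye(2) / tau2 + Phi.T @ Phi / sigma2)   # posterior with w0 = 0, S0 = tau^2 I
w_N = S_N @ (Phi.T @ t / sigma2)
w_ridge = np.linalg.solve(lam * np.eye(2) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(w_N, w_ridge))                     # True: posterior mean (MAP) = ridge estimate
```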
Bayesian Linear Regression
1D Example
Data generated from t(x) = -0.3 + 0.5x + \epsilon, where \epsilon \sim N(0, 0.04)
x values taken uniformly from [-1, 1]
Model: y(x, w) = w_0 + w_1 x
We assume \sigma^2 = 0.04 and \tau^2 = 0.5 to be known
(Figure: the parameter space (w_0, w_1) and the data space (x, t).)
Bayesian Linear Regression
1D Example
(Figure: posterior over (w_0, w_1) and corresponding regression lines after observing 0, 1, 2, and 20 data points; a sequential-update sketch follows.)
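A minimal sketch of the sequential updating behind these plots, reusing the previous posterior as the prior at each step (a reconstruction under the stated model, not the slides' code):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, tau2 = 0.04, 0.5
w_N, S_N = np.zeros(2), tau2 * np.eye(2)             # prior N(0, tau^2 I) before any data

for n in range(20):
    x = rng.uniform(-1.0, 1.0)
    t = -0.3 + 0.5 * x + rng.normal(0.0, np.sqrt(sigma2))
    phi = np.array([1.0, x])
    S_prev_inv = np.linalg.inv(S_N)                  # the previous posterior acts as the new prior
    S_N = np.linalg.inv(S_prev_inv + np.outer(phi, phi) / sigma2)
    w_N = S_N @ (S_prev_inv @ w_N + phi * t / sigma2)

print(w_N)            # approaches the true parameters (-0.3, 0.5)
print(np.diag(S_N))   # posterior variances shrink as observations accumulate
```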
Bayesian Linear Regression
Predictive Distribution
We are interested in the posterior predictive distribution
p(t | x, D, \sigma^2) = \int N\left( t \,|\, w^T \phi(x), \sigma^2 \right) N(w | w_N, S_N) \, dw = N\left( t \,|\, w_N^T \phi(x), \sigma_N^2(x) \right)

where

\sigma_N^2(x) = \underbrace{\sigma^2}_{\text{noise in the target values}} + \underbrace{\phi(x)^T S_N \phi(x)}_{\text{uncertainty associated with parameter values}}

In the limit, as N → ∞, the second term goes to zero
The variance of the predictive distribution then arises only from the additive noise governed by the parameter σ
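A minimal sketch of the predictive mean and variance for the same 1D setup (the helper name and data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 0.04, 0.5
x = rng.uniform(-1.0, 1.0, 20)
Phi = np.column_stack([np.ones_like(x), x])
t = -0.3 + 0.5 * x + rng.normal(0.0, np.sqrt(sigma2), 20)

S_N = np.linalg.inv(np.eye(2) / tau2 + Phi.T @ Phi / sigma2)
w_N = S_N @ (Phi.T @ t / sigma2)

def predictive(x_star):
    """Mean and variance of p(t | x*, D): sigma_N^2(x*) = sigma^2 + phi(x*)^T S_N phi(x*)."""
    phi = np.array([1.0, x_star])
    return w_N @ phi, sigma2 + phi @ S_N @ phi

print(predictive(0.0))   # variance slightly above the noise floor sigma^2 = 0.04
print(predictive(3.0))   # larger variance: more parameter uncertainty far from the training inputs
```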
Bayesian Linear Regression
Example
Sinusoidal dataset, 9 Gaussian basis functions
Bayesian Linear Regression
Modeling Challenges
The first challenge is in specifying a suitable model and suitable prior distributions
A suitable model should admit all the possibilities that are thought to be at all likely
A suitable prior should avoid giving zero or very small probabilities to
possible events, but should also avoid spreading out the probability over
all possibilities
To avoid uninformative priors, we may need to model dependencies
between parameters
One strategy is to introduce latent variables into the model and
hyperparameters into the prior
Both of these are ways of modeling dependencies in a tractable way
Bayesian Linear Regression
Computational Challenges
The other big challenge is computing the posterior distribution. There are
several approaches:
Analytical integration: If we use “conjugate priors”, the posterior
distribution can be computed analytically. Only works for simple
models
Gaussian (Laplace) approximation: Approximate the posterior distribution with a Gaussian. Works well when there is a lot of data compared to the model complexity
Monte Carlo integration: Once we have samples from the posterior distribution, we can do many things. Currently, a common approach is Markov Chain Monte Carlo (MCMC), which consists in simulating a Markov chain that converges to the posterior distribution
Variational approximation: A cleverer way of approximating the
posterior. It is usually faster than MCMC, but it is less general
Bayesian Linear Regression
Pros and Cons of Fixed Basis Functions
Advantages
Closed-form solution
Tractable Bayesian treatment
Arbitrary non-linearity with the proper basis functions
Limitations
Basis functions are chosen independently from the training set
Curse of dimensionality