Lecture 4: Foundations of Supervised Learning
Previously, we learned about supervised learning, derived our first algorithm, and used it to predict diabetes risk.
In this lecture, we are going to dive deeper into why supervised learning really works.
First, let’s look at the data, and define where it comes from.
Later, this will be useful to precisely define when supervised learning is guaranteed to work.
At a high level, a supervised machine learning problem has the following structure: a training dataset and a learning algorithm, which together produce a predictive model.
5 Data Distribution
We will assume that the dataset is sampled from a probability distribution P, which we will call
the data distribution. We will denote this as
x, y ∼ P.
The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, n\}$ consists of independent and identically distributed (IID) samples from P.
The key assumption is that the training examples are independent and identically distributed (IID).
* Each training example is from the same distribution.
* This distribution doesn't depend on previous training examples.
Example: Flipping a coin. Each flip has the same probability of heads and tails, and doesn't depend on previous flips.
Counter-Example: Yearly census data. The population in each year will be close to that of the
previous year.
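To make the IID assumption concrete, here is a small illustrative sketch (not from the original notebook) contrasting IID coin flips with a census-like sequence in which each value depends on the previous one:

import numpy as np

np.random.seed(0)

# IID: every flip comes from the same Bernoulli(0.5) distribution,
# independently of all previous flips.
coin_flips = np.random.binomial(1, 0.5, size=10)

# Not IID: a census-like sequence where each year's value is the
# previous year's value plus a small random change.
population = 100 + np.cumsum(np.random.randn(10))

print(coin_flips)
print(population)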
As a running example, we will use data generated from a simple cosine function with added noise:

import numpy as np

def true_fn(X):
    return np.cos(1.5 * np.pi * X)
Let’s now draw samples from the distribution. We will generate random x, and then generate
random y using
$$y = f(x) + \epsilon$$
for a random noise variable ϵ.
[3]: n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1
8 Data Distribution: Motivation
We will assume that the dataset is sampled from a probability distribution P, which we will call
the data distribution. We will denote this as
x, y ∼ P.
The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, n\}$ consists of independent and identically distributed (IID) samples from P.
There are several things we may want out of a good model:
1. Interpretable features that explain how x affects y.
2. Confidence intervals around y (we will see later how to obtain these).
3. Accurate predictions of the targets y from inputs x.

In this lecture, we will focus on the last of these: accurate predictions.
12 Hold-Out Dataset: Definition
A hold-out dataset

$$\dot{\mathcal{D}} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1, 2, \ldots, m\}$$

is another dataset that is sampled IID from the same distribution P as the training dataset D, and the two datasets are disjoint.

Let's generate a hold-out dataset for the example we saw earlier.
[5]: import numpy as np
     import matplotlib.pyplot as plt
     np.random.seed(0)

     def true_fn(X):
         return np.cos(1.5 * np.pi * X)

     n_holdout_samples = 30  # hold-out set size (value assumed; not given in the original)
     X = np.sort(np.random.rand(n_samples))
     y = true_fn(X) + np.random.randn(n_samples) * 0.1
     X_holdout = np.sort(np.random.rand(n_holdout_samples))
     y_holdout = true_fn(X_holdout) + np.random.randn(n_holdout_samples) * 0.1
Suppose we have some definition of accuracy on a single data point, is_accurate(y, f(x)). We say a supervised learning model is accurate if it correctly predicts the target on new (held-out) data.
We can say that a predictive model f is accurate if its probability of making an error on a random hold-out sample is small:

$$1 - P\left[\text{is\_accurate}(\dot{y}, f(\dot{x}))\right] \leq \epsilon$$

for $\dot{x}, \dot{y} \sim P$, for some small $\epsilon > 0$ and some definition of accuracy.

We can also say that a predictive model f is inaccurate if its probability of making an error on a random hold-out sample is large:

$$1 - P\left[\text{is\_accurate}(\dot{y}, f(\dot{x}))\right] \geq \epsilon$$

or equivalently

$$P\left[\text{is\_accurate}(\dot{y}, f(\dot{x}))\right] \leq 1 - \epsilon.$$
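For instance, here is a minimal sketch of estimating this error probability empirically on the hold-out data generated above. The tolerance-based definition of is_accurate and the use of the true function as a stand-in for a trained model are assumptions made purely for illustration:

# Sketch: estimate 1 - P[is_accurate] on the hold-out set.
# The tolerance below is an arbitrary, illustrative choice.
def is_accurate(y_true, y_pred, tol=0.2):
    return np.abs(y_true - y_pred) < tol

model = true_fn  # stand-in for a trained predictive model f
error_rate = 1 - is_accurate(y_holdout, model(X_holdout)).mean()
print('Estimated error probability on hold-out data: %.3f' % error_rate)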
14 Generalization
In machine learning, generalization is the property of predictive models to achieve good performance on new, held-out data that is distinct from the training set.
Will supervised learning return a model that generalizes?
Let's say that we apply our supervised learning algorithm to this problem:
1. We define a model class M containing H different models.
2. One of these models fits the training data perfectly (is accurate on every point), and we choose that model.
(These are assumptions that could be changed.)
Claim: The probability that supervised learning will return an inaccurate model decreases expo-
nentially with training set size n.
1. A model f is inaccurate if $P[\text{is\_accurate}(\dot{y}, f(\dot{x}))] \leq 1 - \epsilon$. The probability that an inaccurate model f perfectly fits the training set is at most $\prod_{i=1}^n P\left[\text{is\_accurate}(y^{(i)}, f(x^{(i)}))\right] \leq (1 - \epsilon)^n$.
2. We have H models in M, and any of them could be inaccurate. The probability that at least one of the at most H inaccurate models fits the training set perfectly is $\leq H(1 - \epsilon)^n$.

Therefore, the claim holds.
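To get a feel for this bound, here is a small numerical sketch (the values of H and ϵ are illustrative assumptions, not from the lecture) showing how H(1 − ϵ)^n shrinks as the training set grows:

# Illustrative values of the model class size H and accuracy threshold eps.
H, eps = 10000, 0.05
for n in [10, 100, 500, 1000]:
    print('n = %4d:  H * (1 - eps)^n = %.3e' % (n, H * (1 - eps) ** n))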
# Part 3: Overfitting and Underfitting
Let’s now dive deeper into the concept of generalization and two possible failure modes of supervised
learning: overfitting and underfitting.
18 Review: Generalization
We will assume that the dataset is governed by a probability distribution P, which we will call the
data distribution. We will denote this as
x, y ∼ P.
A hold-out set $\dot{\mathcal{D}} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1, 2, \ldots, n\}$ consists of independent and identically distributed (IID) samples from P and is distinct from the training set.
A model that generalizes is accurate on a hold-out set.
Recall polynomial regression, in which we fit a model

$$f_\theta(x) := \theta^\top \phi(x)$$

that is linear in $\theta$ but non-linear in x, because the features $\phi(x) : \mathbb{R} \to \mathbb{R}^p$ are non-linear. By using polynomial features such as $\phi(x) = [1 \;\; x \;\; \ldots \;\; x^p]$, we can fit any polynomial of degree p.
When we switch from linear models to polynomials, we can better fit the data and increase the
accuracy of our models.
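As a quick illustration, here is a minimal sketch of what the polynomial feature map ϕ(x) produces on a few inputs, using scikit-learn's PolynomialFeatures (with include_bias=True so that the constant feature 1 is included):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_small = np.array([[0.1], [0.5], [2.0]])
# each row becomes [1, x, x^2, x^3]
print(PolynomialFeatures(degree=3, include_bias=True).fit_transform(x_small))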
Consider the synthetic dataset that we have seen earlier.
[7]: from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
np.random.seed(0)
n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1
Although fitting a linear model does not work well, quadratic or cubic polynomials improve the fit.
[8]: degrees = [1, 2, 3]
     plt.figure(figsize=(14, 5))
     for i in range(len(degrees)):
         ax = plt.subplot(1, len(degrees), i + 1)
         polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
         linear_regression = LinearRegression()
         pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
         pipeline.fit(X[:, np.newaxis], y)
         # plot the fitted polynomial against the training data (plotting lines reconstructed)
         X_test = np.linspace(0, 1, 100)
         ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
         ax.scatter(X, y, edgecolor='b', s=20, label="Samples")
         ax.set_title("Degree %d" % degrees[i])
         ax.legend(loc="best")
As we increase the complexity of our model class M to even higher-degree polynomials, we are able to fit the data increasingly well.
What happens if we further increase the degree of the polynomial?
[10]: degrees = [30]
      plt.figure(figsize=(14, 5))
      for i in range(len(degrees)):
          ax = plt.subplot(1, len(degrees), i + 1)
          polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
          linear_regression = LinearRegression()
          pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
          pipeline.fit(X[:, np.newaxis], y)
As the degree of the polynomial increases to the size of the dataset, we are increasingly able to fit
every point in the dataset.
However, this results in a highly irregular curve: its behavior outside the training set is wildly
inaccurate.
23 Overfitting
Overfitting is one of the most common failure modes of machine learning.
* A very expressive model (a high-degree polynomial) fits the training dataset perfectly.
* The model also makes wildly incorrect predictions outside this dataset, and doesn't generalize.
24 Underfitting
We can measure overfitting and underfitting by estimating accuracy on held-out data and comparing it to accuracy on the training data.
* If training performance is high but held-out performance is low, we are overfitting.
* If training performance is low (and held-out performance is also low), we are underfitting.
[11]: degrees = [1, 20, 5]
      titles = ['Underfitting', 'Overfitting', 'A Good Fit']
      plt.figure(figsize=(14, 5))
      for i in range(len(degrees)):
          ax = plt.subplot(1, len(degrees), i + 1)
          polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
          linear_regression = LinearRegression()
          pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
          pipeline.fit(X[:, np.newaxis], y)
          # plot the fitted model and the training data (plotting lines reconstructed)
          X_test = np.linspace(0, 1, 100)
          ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
          ax.scatter(X, y, edgecolor='b', s=20, label="Samples")
          ax.set_xlim((0, 1))
          ax.set_ylim((-2, 2))
          ax.legend(loc="best")
          ax.set_title("{} (Degree {})".format(titles[i], degrees[i]))
          ax.text(0.05, -1.7, 'Holdout MSE: %.4f' %
                  ((y_holdout - pipeline.predict(X_holdout[:, np.newaxis]))**2).mean())
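The figure above reports only the hold-out MSE. Here is a minimal sketch that also computes the training MSE for comparison, reusing the synthetic arrays defined earlier; the degree chosen here is an illustrative assumption:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Compare training and hold-out MSE for a single polynomial degree.
degree = 20  # illustrative choice
model = Pipeline([("pf", PolynomialFeatures(degree=degree, include_bias=False)),
                  ("lr", LinearRegression())])
model.fit(X[:, np.newaxis], y)
train_mse = ((y - model.predict(X[:, np.newaxis])) ** 2).mean()
holdout_mse = ((y_holdout - model.predict(X_holdout[:, np.newaxis])) ** 2).mean()
print('Train MSE: %.4f   Holdout MSE: %.4f' % (train_mse, holdout_mse))
# A training MSE far below the hold-out MSE suggests overfitting;
# both being high suggests underfitting.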
Balancing overfitting vs. underfitting is a major challenge in applying machine learning. Briefly, here are some approaches:
* To fight underfitting, we may increase our model class to encompass more expressive models.
* We may also create richer features for the data that will make the dataset easier to fit.

We will see many ways of dealing with overfitting, but here are some ideas:
* If we're overfitting, we may reduce the complexity of our model by reducing the size of M.
* We may also modify our objective to penalize complex models that may overfit the data.
# Part 4: Regularization
We will now see a very important way to reduce overfitting — regularization. We will also see
several important new algorithms.
28 Review: Generalization
We will assume that the dataset is governed by a probability distribution P, which we will call the
data distribution. We will denote this as
x, y ∼ P.
A hold-out set $\dot{\mathcal{D}} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1, 2, \ldots, n\}$ consists of independent and identically distributed (IID) samples from P and is distinct from the training set.
29 Review: Overfitting
Overfitting is one of the most common failure modes of machine learning.
* A very expressive model (a high-degree polynomial) fits the training dataset perfectly.
* The model also makes wildly incorrect predictions outside this dataset, and doesn't generalize.
We can visualize overfitting by trying to fit a small dataset with a high degree polynomial.
[12]: degrees = [30]
      plt.figure(figsize=(14, 5))
      for i in range(len(degrees)):
          ax = plt.subplot(1, len(degrees), i + 1)
          polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
          linear_regression = LinearRegression()
          pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
          pipeline.fit(X[:, np.newaxis], y)
30 Regularization: Intuition
The idea of regularization is to penalize complex models that may overfit the data.
In the previous example, a less complex model would rely less on polynomial terms of high degree.
31 Regularization: Definition
The idea of regularization is to train models with an augmented objective J : M → R defined over
a training dataset D of size n as
$$J(f) = \frac{1}{n}\sum_{i=1}^n L(y^{(i)}, f(x^{(i)})) + \lambda \cdot R(f).$$

When the model $f_\theta$ is parameterized by parameters $\theta$, this becomes

$$J(\theta) = \frac{1}{n}\sum_{i=1}^n L(y^{(i)}, f_\theta(x^{(i)})) + \lambda \cdot R(\theta).$$
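To make this definition concrete, here is a minimal sketch (an illustration, not code from the lecture) of an augmented objective for a linear model with mean squared error loss and a generic regularizer R:

import numpy as np

def augmented_objective(theta, X, y, regularizer, lam):
    """Average loss over the training set plus lambda times the regularizer."""
    loss = np.mean((y - X @ theta) ** 2)
    return loss + lam * regularizer(theta)

# example regularizer: the squared L2 norm R(theta) = ||theta||_2^2
l2_regularizer = lambda theta: np.sum(theta ** 2)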
32 L2 Regularization: Definition
In the context of linear models $f(x) = \theta^\top x$, L2 regularization yields the following objective:

$$J(\theta) = \frac{1}{n}\sum_{i=1}^n L(y^{(i)}, \theta^\top x^{(i)}) + \frac{\lambda}{2} \cdot \|\theta\|_2^2.$$

• The regularizer $R : \mathcal{M} \to \mathbb{R}$ is the function $R(\theta) = \|\theta\|_2^2 = \sum_{j=1}^d \theta_j^2$. This is the squared L2 norm of $\theta$.
• The regularizer penalizes large parameters. This prevents us from over-relying on any single
feature and penalizes wildly irregular solutions.
• L2 regularization can be used with most models (linear, neural, etc.)
Let’s consider an application to the polynomial model we have seen so far. Given polynomial
features ϕ(x), we optimize the following objective:
$$J(\theta) = \frac{1}{2n}\sum_{i=1}^n \left(y^{(i)} - \theta^\top \phi(x^{(i)})\right)^2 + \frac{\lambda}{2} \cdot \|\theta\|_2^2.$$
We are going to implement regularized and standard polynomial regression on three random training
sets sampled from the same distribution.
[13]: from sklearn.linear_model import Ridge

      # Fit an unregularized and an L2-regularized (ridge) degree-30 polynomial on three
      # training sets drawn from the same distribution. The loop structure, polynomial
      # degree, and ridge strength alpha are reconstructed / assumed values.
      n_plots = 3
      plt.figure(figsize=(14, 5))
      for i in range(n_plots):
          X = np.sort(np.random.rand(n_samples))
          y = true_fn(X) + np.random.randn(n_samples) * 0.1

          polynomial_features = PolynomialFeatures(degree=30, include_bias=False)
          linear_regression = LinearRegression()
          pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
          pipeline2 = Pipeline([("pf", polynomial_features), ("lr", Ridge(alpha=1e-3))])
          pipeline.fit(X[:, np.newaxis], y)
          pipeline2.fit(X[:, np.newaxis], y)

          # visualize results
          X_test = np.linspace(0, 1, 100)
          ax = plt.subplot(1, n_plots, i + 1)
          # ax.plot(X_test, true_fn(X_test), label="True function")
          ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="No Regularization")
          ax.plot(X_test, pipeline2.predict(X_test[:, np.newaxis]), label="L2 Regularization")
          ax.scatter(X, y, edgecolor='b', s=20, label="Samples")
          ax.legend(loc="best")
We can show that by using small weights, we prevent the model from learning irregular functions.

[14]: print('Non-regularized weights of the polynomial model need to be large to fit every point:')
      print(pipeline.named_steps['lr'].coef_[:4])
      print()
      # weights of the L2-regularized model, for comparison
      print(pipeline2.named_steps['lr'].coef_[:4])
34 How to Choose λ?
In brief, the most common approach is to choose the value of λ that results in the best performance
on a held-out validation set.
We will later see this strategy and several others in more detail.
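As a preview, here is a minimal sketch of this strategy on the synthetic example, using the hold-out set from earlier as the validation set; the candidate values of λ and the polynomial degree are illustrative assumptions:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

best_lam, best_mse = None, float('inf')
for lam in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    model = Pipeline([("pf", PolynomialFeatures(degree=30, include_bias=False)),
                      ("lr", Ridge(alpha=lam))])
    model.fit(X[:, np.newaxis], y)
    mse = ((y_holdout - model.predict(X_holdout[:, np.newaxis])) ** 2).mean()
    if mse < best_mse:
        best_lam, best_mse = lam, mse
print('Best lambda: %g (validation MSE: %.4f)' % (best_lam, best_mse))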
How do we fit regularized models? In the linear case, we can do this easily by deriving generalized normal equations!
Let $L(\theta) = \frac{1}{2}(X\theta - y)^\top(X\theta - y)$ be our least squares objective. We can write the Ridge objective as:

$$J(\theta) = \frac{1}{2}(X\theta - y)^\top(X\theta - y) + \frac{1}{2}\lambda \|\theta\|_2^2$$

This allows us to derive the gradient as follows:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \left( \frac{1}{2}(X\theta - y)^\top(X\theta - y) + \frac{1}{2}\lambda \|\theta\|_2^2 \right) \\
&= \nabla_\theta \left( L(\theta) + \frac{1}{2}\lambda \|\theta\|_2^2 \right) \\
&= \nabla_\theta L(\theta) + \lambda\theta \\
&= (X^\top X)\theta - X^\top y + \lambda\theta \\
&= (X^\top X + \lambda I)\theta - X^\top y
\end{aligned}
$$

We used the derivation of the normal equations for least squares to obtain $\nabla_\theta L(\theta)$, as well as the fact that $\nabla_x x^\top x = 2x$.

We can set the gradient to zero to obtain normal equations for the Ridge model:

$$(X^\top X + \lambda I)\theta = X^\top y,$$

which gives the solution

$$\theta^* = (X^\top X + \lambda I)^{-1} X^\top y.$$

Note that the matrix $(X^\top X + \lambda I)$ is always invertible, which addresses a problem with least squares that we saw earlier.
The resulting algorithm, ridge regression, can be summarized as:
• Type: Supervised learning (regression)
• Model family: Linear models
• Objective function: L2-regularized mean squared error
• Optimizer: Normal equations
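To connect the algebra to code, here is a minimal numpy sketch of these normal equations applied to polynomial features of the synthetic data. The degree and the value of λ are illustrative assumptions, and unlike scikit-learn's Ridge this version also penalizes the intercept term:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

lam = 1e-3  # illustrative regularization strength
Phi = PolynomialFeatures(degree=5).fit_transform(X[:, np.newaxis])  # design matrix
# solve (Phi^T Phi + lam * I) theta = Phi^T y
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
print(theta)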
# Part 5: Regularization and Sparsity
We will now look at another form of regularization, which has an important new property called sparsity.
37 Regularization: Definition
The idea of regularization is to train models with an augmented objective J : M → R defined over
a training dataset D of size n as
$$J(f) = \frac{1}{n}\sum_{i=1}^n L(y^{(i)}, f(x^{(i)})) + \lambda \cdot R(f).$$
38 L1 Regularization: Definition
Another closely related approach to regularization is to penalize the size of the weights using the
L1 norm.
In the context of linear models $f(x) = \theta^\top x$, L1 regularization yields the following objective:

$$J(\theta) = \frac{1}{n}\sum_{i=1}^n L(y^{(i)}, \theta^\top x^{(i)}) + \lambda \cdot \|\theta\|_1.$$
• The regularizer $R : \mathcal{M} \to \mathbb{R}$ is the function $R(\theta) = \|\theta\|_1 = \sum_{j=1}^d |\theta_j|$. This is also known as the L1 norm of $\theta$.
• The regularizer also penalizes large weights. It also forces more weights to decay to zero, as
opposed to just being small.
39 Algorithm: Lasso
L1-regularized linear regression is also known as the Lasso (least absolute shrinkage and selection
operator).
• Type: Supervised learning (regression)
• Model family: Linear models
• Objective function: L1-regularized mean squared error
• Optimizer: gradient descent, coordinate descent, least angle regression (LARS) and others
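Here is a minimal sketch of the Lasso in scikit-learn on the synthetic polynomial example; the degree and regularization strength are illustrative assumptions. Note how many of the fitted weights are driven exactly to zero:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

lasso_pipeline = Pipeline([("pf", PolynomialFeatures(degree=10, include_bias=False)),
                           ("lr", Lasso(alpha=0.01, max_iter=100000))])
lasso_pipeline.fit(X[:, np.newaxis], y)
coef = lasso_pipeline.named_steps['lr'].coef_
print('%d of %d weights are exactly zero' % ((coef == 0).sum(), len(coef)))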
An equivalent way to think about regularization is as a constrained optimization problem:

$$\min_{\theta \in \Theta}\; L(\theta) \quad \text{such that} \quad R(\theta) \leq \lambda'$$

We will not prove this, but solving this problem is equivalent to solving the penalized problem for some $\lambda > 0$ that's different from $\lambda'$.

In other words:
* We can regularize by explicitly enforcing R(θ) to be less than a value instead of penalizing it.
* For each value of λ, we are implicitly setting a constraint on R(θ).
Applied to the Lasso, this constrained formulation becomes:

$$\min_{\theta \in \Theta}\; \frac{1}{2n}\sum_{i=1}^n \left(y^{(i)} - \theta^\top x^{(i)}\right)^2 \quad \text{such that} \quad \|\theta\|_1 \leq \lambda'$$
42 L1 vs. L2 Regularization
The following image by Divakar Kapil and Hastie et al. explains the difference between the two
norms.
43 Sparsity: Definition
To better understand sparsity, we will fit L2-regularized linear models to the UCI diabetes dataset
and observe the magnitude of each weight (colored lines) as a function of the regularization param-
eter.
[15]: # based on https://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_path.html
      from sklearn.datasets import load_diabetes
      X, y = load_diabetes(return_X_y=True)
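The resulting coefficient-path figure is not reproduced here. Below is a minimal sketch of how such a path can be computed, loosely following the scikit-learn example linked above; the grid of regularization strengths is an assumed choice:

from sklearn.linear_model import Ridge

alphas = np.logspace(-5, 5, 100)
coefs = []
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)

plt.plot(alphas, coefs)
plt.xscale('log')
plt.xlabel('Regularization strength (alpha)')
plt.ylabel('Coefficients')
plt.title('Ridge coefficients as a function of regularization')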
45 Sparsity: Lasso Model
The above Ridge model did not produce sparse weights. Let’s now compare it to a Lasso model.
[16]: # Based on: https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_lars.html
      import warnings
      warnings.filterwarnings("ignore")
      from sklearn.datasets import load_diabetes
      from sklearn.linear_model import lars_path

      # compute the Lasso path with least angle regression
      # (path computation reconstructed from the referenced scikit-learn example)
      X, y = load_diabetes(return_X_y=True)
      _, _, coefs = lars_path(X, y, method='lasso')
      xx = np.sum(np.abs(coefs.T), axis=1)
      xx /= xx[-1]

      # plot each coefficient (one line per feature) along the path
      plt.plot(xx, coefs.T)
      plt.ylabel('Coefficients')
      plt.xlabel('Regularization Strength')
      plt.title('LASSO Path')
      plt.axis('tight')