Lab2 Linear Regression
Lab2 Linear Regression
Lab2 Linear Regression
NOTE: This is a lab project accompanying the following book [MLF] and it should be used together with the
book.
The purpose of this lab is to apply a simple machine learning method, namely linear regression, to some
regression and classification tasks on two popular data sets. We will show how linear regression may differ
when used to solve a regression or classification problem. As we know, linear regression is simple enough
so that we can derive the closed-form solution to solve it. In this project, we will use both the closed-form
method and an iterative gradient descent method (e.g. minibatch SGD) to solve linear regression for these
tasks and compare their pros and cons in practice. Moreover, we will use linear regression as a simple
example to explain some fine-tuning tricks when using any iterative optimization methods (e.g. SGD) in
machine learning. As we will see in the up-coming projects, these tricks become vital in learning large
models in machine learning, such as deep neural networks.
Prerequisites: N/A
Example 2.1:
Use linear regression to predict house prices in the popular Boston House data set
(https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). Consider to use both the closed-form
solution and an iterative method to fit to the data and discuss their pros and cons with some experimental
results.
Mounted at /content/drive
In [ ]: # load Boston House data set
import pandas as pd
import numpy as np
print(data.shape)
print(target.shape)
(506, 13)
(506,)
# add a constant column of '1' to accomodate the bias (see the margin
note on page 107)
data_wb = np.hstack((data, np.ones((data.shape[0], 1), dtype=data.dtyp
e)))
print(data_wb.shape)
(506, 14)
mean square error for the closed-form solution: 21.89483
Consider to solve the above linear regression using an iterative optimization, such as gradient descent.
Refer to eq.(6.8) on page 112, the objective function in linear regression, i.e. the mean square error (MSE), is
given as:
𝑁
1
(𝐰⊺ 𝐱𝑖 − 𝑦𝑖 )
2
2∑
𝐸(𝐰) =
𝑖=1
(𝐰 𝐗 𝐗𝐰 − 2𝐰⊺ 𝐗⊺ 𝐲 + 𝐲⊺ 𝐲)
1 ⊺ ⊺
=
2
we can show that its gradient can be computed in several equivallent ways as follows:
𝑁 𝑁
∂𝐸(𝐰)
(𝐰 𝐱𝑖 − 𝑦𝑖 )𝐱𝑖 = ∑ 𝐱𝑖 (𝐱𝑖 𝐰 − 𝑦𝑖 )
⊺ ⊺
∑
=
∂𝐰 𝑖=1 𝑖=1
𝑁 𝑁
( ∑ 𝐱𝑖 𝐱𝑖 )𝐰
⊺
𝑦𝑖 𝐱𝑖
∑
= −
𝑖=1 𝑖=1
⊺ ⊺
= 𝐗 𝐗𝐰 − 𝐗 𝐲
where 𝐗 and 𝐲 are defined in the same way as on page 112.
In the following, we use the formula from last row to calculate gradients via vectorization. Furthermore, we
implement a mini-batch SGD, .i.e. Algorithm 2.3 on page 62, to learn linear regression iteratively.
In [ ]: # solve linear regression using gradient descent
import numpy as np
class Optimizer():
def __init__(self, lr, annealing_rate, batch_size, max_epochs):
self.lr = lr
self.annealing_rate = annealing_rate
self.batch_size = batch_size
self.max_epochs = max_epochs
lr = op.lr
errors = np.zeros(op.max_epochs)
for epoch in range(op.max_epochs):
indices = np.random.permutation(n) #randomly shuffle data indices
for batch_start in range(0, n, op.batch_size):
X_batch = X[indices[batch_start:batch_start + op.batch_size]]
y_batch = y[indices[batch_start:batch_start + op.batch_size]]
w -= lr * w_grad / X_batch.shape[0]
return w, errors
In [ ]: import matplotlib.pyplot as plt
Finally, let us show how to solve the above linear regression problem using the scikit-learning
implmenetation.
In [ ]: import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
Example 2.2:
Use linear regression to build a binary classifier to classify two digits ('3' and '8') in the MNIST data set.
Consider to use both the closed-form solution and an iterative method to fit to the data and discuss their
pros and cons with experimental results.
In [ ]: # install python_mnist
Collecting python_mnist
Downloading python_mnist-0.7-py2.py3-none-any.whl (9.6 kB)
Installing collected packages: python-mnist
Successfully installed python-mnist-0.7
In [ ]: #load MINST images
# add a constant column of '1' to accomodate the bias (see the margin
note on page 107)
X_train = np.hstack((X_train, np.ones((X_train.shape[0], 1), dtype=X_t
rain.dtype)))
X_test = np.hstack((X_test, np.ones((X_test.shape[0], 1), dtype=X_test
.dtype)))
print(X_train.shape)
print(y_train)
print(X_test.shape)
print(y_test)
(11982, 784)
[-1 -1 -1 ... 1 -1 1]
(1984, 784)
[-1 -1 -1 ... -1 1 -1]
In [ ]: # use the closed-form solution
accuracy = np.count_nonzero(np.equal(np.sign(predict),y_train))/y_trai
n.size*100.0
print(f'classification accuracy on training data for the closed-form s
olution: {accuracy:.2f}%')
accuracy = np.count_nonzero(np.equal(np.sign(predict),y_test))/y_test.
size*100.0
print(f'classification accuracy on test data for the closed-form solut
ion: {accuracy:.2f}%')
mean square error on training data for the closed-form solution: 0.1
9629
classification accuracy on training data for the closed-form solutio
n: 96.99%
mean square error on training data for the closed-form solution: 1.3
0940
classification accuracy on test data for the closed-form solution: 9
5.92%
In [ ]: # use linear regression from sklearn
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
Next, let us consider to use mini-batch stochastic gradient descent (SGD) to learn linear regression models
for this binary classification problem. When we fine-tune any SGD method for a classificaion problem in
machline learning, it is very important to monitor the following three learning curves:
1. Classification Accuracy on the training set (curve A): this is the goal of the empirical risk mininization
(ERM) of the zero-one loss for classification (see Eq.(5.6) on page 99).
2. Classification Accuracy on an unseen test/development set (curve B): we need to compare the curves A
and B over the learning course to monitor whether overfitting or underfit occurs. Overfitting happens
when the gap between A and B is overly big while underfitting happens when A and B get very close
and both of them yield fairly poor performance. Moreover, we can also monitor the curves A and B to
determine when to terminate the learning proces for the best possible performance on the
test/devopement set.
3. The value of the learning objective function (curve C): because the zero-one loss is not directly
minimizable, we will have to establish a proxy objective function according to some criteria (see Tabe
7.1 on page 135). These objective functions are closelly related to the zero-one loss but they are NOT
the same. When we fine-tune an iterative optimization method, the first thing is to ensure that the value
of the chosen objective function descreases over the entire learning course. If we cannot reduce the
objective function (even when a sufficiently small learning rate is used), it is very likely that the
implementation or code is buggy. Furthermore, if curve C is going down while curve A is not going up,
this is another indicator that someting is still wrong in the implementation.
class Optimizer():
def __init__(self, lr, annealing_rate, batch_size, max_epochs):
self.lr = lr
self.annealing_rate = annealing_rate
self.batch_size = batch_size
self.max_epochs = max_epochs
lr = op.lr
errorsA = np.zeros(op.max_epochs)
errorsB = np.zeros(op.max_epochs)
errorsC = np.zeros(op.max_epochs)
w -= lr * w_grad / X_batch.shape[0]
lr *= op.annealing_rate
print(f'epoch={epoch}: the mean square error is {errorsC[epoch]:.3
f} ({errorsA[epoch]:.3f},{errorsB[epoch]:.3f})')
return w, errorsA, errorsB, errorsC
fig, ax = plt.subplots(2)
fig.suptitle('monitoring three learning curves (A, B, C)')
ax[0].plot(C, 'b', 0.196*np.ones(C.shape[0]), 'c--')
ax[0].legend(['curve C', 'closed-form solution'])
In the above setting, we use a large mini-batch (50), which leads to fairly smooth convergence. As we can
see, even though there is a big gap in the objective function between SGD (curve C) and the closed-form
solution, classification accuracy of SGD exceeds that of the closed-form solution on either the training or
testing set. This indicates that MSE used in linear regression is NOT a good learning criterion for
classification (see why in section 7.1.1 on page 136).
In [ ]: import matplotlib.pyplot as plt
fig, ax = plt.subplots(2)
fig.suptitle('monitoring three learning curves (A, B, C)')
ax[0].plot(C, 'b', 0.196*np.ones(C.shape[0]), 'c--')
ax[0].legend(['curve C', 'closed-form solution'])
Exercises
Problem 2.1:
Use Ridge regression to solve the regression problem in Example 2.1 as well as the classification problem in
Example 2.2, also implement both closed-form and iterative approachs, compare the results of Ridge
regression with those of linear regression.
Problem 2.2:
Use LASSO to solve the regression problem in Example 2.1 as well as the classification problem in Example
2.2, compare the results of LASSO with those of linear regression and Ridge regression.