Machine Learning Lecture-Notes
Lectures on YouTube:
https://www.youtube.com/@mathtalent
Seongjai Kim
These lecture notes will grow as time marches on; various core algorithms, useful techniques, and interesting examples will be incorporated over time.
Seongjai Kim
April 30, 2023
Contents
Title ii
Prologue iii
Table of Contents ix
1 Introduction 1
1.1. Why and What in Machine Learning (ML)? . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1. Inference problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2. Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3. Machine learning examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2. Three Different Types of ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1. Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2. Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3. A Machine Learning Modelcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Exercises for Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Python Basics 17
2.1. Why Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2. Python in 30 Minutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1. Python essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2. Frequently used Python rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.3. Looping and functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3. Zeros of Polynomials in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4. Python Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Exercises for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2. Convergence and optimization issues with the gradient descent method . . . . 49
3.3.3. Feature scaling and stochastic gradient descent . . . . . . . . . . . . . . . . . . 50
Exercises for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
P Projects 361
P.1. mCLESS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
P.1.1. Review: Simple classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
P.1.2. The mCLESS classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
P.1.3. Feature expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
P.2. Noise-Removal and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
P.3. Gaussian Sailing to Overcome Local Minima Problems . . . . . . . . . . . . . . . . . . 379
P.4. Quasi-Newton Methods Using Partial Information of the Hessian . . . . . . . . . . . . 381
P.5. Effective Preprocessing Technique for Filling Missing Data . . . . . . . . . . . . . . . . 383
Bibliography 385
Index 391
Chapter 1. Introduction
This is a hard question to which we can only give a somewhat fuzzy answer.
But at a high enough level of abstraction, there are two answers:
These answers are so abstract that they are probably completely unsatisfying. But let's (start to) clear things up, by looking at some particular examples of “inference” and “modeling” problems.
Contents of Chapter 1
1.1. Why and What in Machine Learning (ML)? . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2. Three Different Types of ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3. A Machine Learning Modelcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Exercises for Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
(c) If I give you a recording of somebody speaking, can you produce text of
what they are saying?
1.1.2. Modeling
A second type of problem associated with machine learning (ML) might
be roughly described as:
One example is regression analysis. Most models can be broken into two
categories:
• Question: How should we pick the hypothesis space, the set of possible
functions f ?
– Often, there exist infinitely many hypotheses
– Let’s select f from PM , polynomials of degree ≤ M .
– Which one is best?
• Error: misfit
Key Idea 1.6. Training and test performance. Assume that each
training and test example–label pair (x, y) is drawn independently at
random from the same (but unknown) population of examples and labels.
Represent this population as a probability distribution p(x, y), so that:
8 #=====================================================================
9 # DATA: Read & Preprocessing
10 # load_iris, load_wine, load_breast_cancer, ...
11 #=====================================================================
12 data_read = datasets.load_iris(); #print(data_read.keys())
13
14 X = data_read.data
15 y = data_read.target
16 dataname = data_read.filename
17 targets = data_read.target_names
18 features = data_read.feature_names
19
20 print('X.shape=',X.shape, 'y.shape=',y.shape)
21 #---------------------------------------------------------------------
22 # SETTING
23 #---------------------------------------------------------------------
24 N,d = X.shape; labelset=set(y)
25 nclass=len(labelset);
26 print('N,d,nclass=',N,d,nclass)
27
31 #=====================================================================
32 # CLASSIFICATION
33 #=====================================================================
34 btime = time.time()
35 Acc = np.zeros([run,1])
36 ##from sklearn.neighbors import KNeighborsClassifier
37 ##clf = KNeighborsClassifier(5)
38 from myCLF import myCLF ## My classifier
39
40 for it in range(run):
41 Xtrain, Xtest, ytrain, ytest = train_test_split(
42 X, y, test_size=rtest, random_state=it, stratify = y)
43 ##clf.fit(Xtrain, ytrain);
44 clf = myCLF(Xtrain,ytrain); clf.fit(); ## My classifier
45 Acc[it] = clf.score(Xtest, ytest)
46
47 #-----------------------------------------------
48 # Print: Accuracy && E-time
49 #-----------------------------------------------
50 etime = time.time()-btime
51 print(' %s: Acc.(mean,std) = (%.2f,%.2f)%%; Average E-time= %.5f'
52 %(dataname,np.mean(Acc)*100,np.std(Acc)*100,etime/run))
53
54 #=====================================================================
55 # Scikit-learn Classifiers, for Comparisons
56 #=====================================================================
57 exec(open("sklearn_classifiers.py").read())
sklearn_classifiers.py
1 #=====================================================================
2 # Required: X, y, [dataname, run]
3 print('========= Scikit-learn Classifiers, for Comparisons =========')
4 #=====================================================================
5 from sklearn.preprocessing import StandardScaler
6 from sklearn.datasets import make_moons, make_circles, make_classification
7 from sklearn.neural_network import MLPClassifier
8 from sklearn.neighbors import KNeighborsClassifier
9 from sklearn.linear_model import LogisticRegression
10 from sklearn.svm import SVC
11 from sklearn.gaussian_process import GaussianProcessClassifier
12 from sklearn.gaussian_process.kernels import RBF
13 from sklearn.tree import DecisionTreeClassifier
14 from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
15 from sklearn.naive_bayes import GaussianNB
16 from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
17 #from sklearn.inspection import DecisionBoundaryDisplay
18
19 #-----------------------------------------------
20 classifiers = [
21 LogisticRegression(max_iter = 1000),
22 KNeighborsClassifier(5),
23 SVC(kernel="linear", C=0.5),
24 SVC(gamma=2, C=1),
25 RandomForestClassifier(max_depth=5, n_estimators=50, max_features=1),
26 MLPClassifier(alpha=1, max_iter=1000),
27 AdaBoostClassifier(),
28 GaussianNB(),
29 QuadraticDiscriminantAnalysis(),
30 GaussianProcessClassifier(),
31 ]
32 names = [
33 "Logistic Regr",
34 "KNeighbors-5 ",
35 "Linear SVM ",
36 "RBF SVM ",
37 "Random Forest",
38 "Deep-NN ",
39 "AdaBoost ",
40 "Naive Bayes ",
41 "QDA ",
42 "Gaussian Proc",
43 ]
44 #-----------------------------------------------
45 if dataname is None: dataname = 'No-dataname';
46 if run is None: run = 100;
47
48 #===============================================
49 import os; acc_max=0
50 for name, clf in zip(names, classifiers):
51 Acc = np.zeros([run,1])
52 btime = time.time()
53
54 for it in range(run):
55 Xtrain, Xtest, ytrain, ytest = train_test_split(
56 X, y, test_size=rtest, random_state=it, stratify = y)
57
58 clf.fit(Xtrain, ytrain);
59 Acc[it] = clf.score(Xtest, ytest)
60
61 etime = time.time()-btime
62 accmean = np.mean(Acc)*100
63 print('%s: %s: Acc.(mean,std) = (%.2f,%.2f)%%; E-time= %.5f'
64 %(os.path.basename(dataname),name,accmean,np.std(Acc)*100,etime/run))
65 if accmean>acc_max:
66 acc_max= accmean; algname = name
67 print('sklearn classifiers max: %s= %.2f' %(algname,acc_max))
1.1. The code in Section 1.3 will run without requiring any further implementation of your own, if you (1) uncomment lines 36-37 and 43 and (2) comment out lines 38 and 44.
Chapter 2. Python Basics
Contents of Chapter 2
2.1. Why Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2. Python in 30 Minutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3. Zeros of Polynomials in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4. Python Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Exercises for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Advantages of Python
Python has the following characteristics.
• Easy to learn and use
• Flexible and reliable
• Extensively used in Data Science
• Handy for Web Development purposes
• Vast library support
• Among the fastest-growing programming languages in the tech
industry
Disadvantage of Python
Python is an interpreted and dynamically-typed language. Its line-by-line execution, combined with high flexibility, typically leads to slow execution. Python is slower than Matlab, which in turn is slower than C.
• You may create and import your own C module into Python. If you extend Python with pieces of compiled C code, the resulting code is easily 100× faster than pure Python. Best choice!
• Cython: It is designed as a C-extension for Python, developed for users not familiar with C. For a Cython implementation, see e.g. https://www.youtube.com/watch?v=JKMkhARcwdU, one of Simon Funke's YouTube videos.
6 import numpy as np
7 import numpy.linalg as la
8 import matplotlib.pyplot as plt
9 print("\tnp = numpy; la = numpy.linalg; plt = matplotlib.pyplot")
10
Programming Features
• Python does not support pointers.
• Python codes are stored with .py extension.
• Indentation: Python uses indentation to define a block of code.
– A code block (body of a function, loop, etc.) starts with indenta-
tion and ends with the first unindented line.
– The amount of indentation is up to the user, but it must be consis-
tent throughout that block.
• Comments:
– The hash (#) symbol is used to start writing a comment.
– Multi-line comments: Python uses triple quotes, either ''' or """.
• Reference semantics
>>> a = [1, 2, 3]
>>> b = a
>>> a.append(4)
>>> b
[1, 2, 3, 4]
Be careful when copying lists and numpy arrays!
• numpy, range, and iteration
>>> list(range(8))
[0, 1, 2, 3, 4, 5, 6, 7]
>>> li = ['abc', 14, 4.34, 23]
>>> import numpy as np
>>> for k in range(np.size(li)):
...     li[k]
... <Enter>
'abc'
14
4.34
23
• numpy array and deepcopy
>>> from copy import deepcopy
>>> A = np.array([1,2,3])
>>> B = A
>>> C = deepcopy(A)
>>> A *= 4
>>> B
array([ 4, 8, 12])
>>> C
array([1, 2, 3])
12 ## Docstrings in Python
13 def double(num):
14 """Function to double the value"""
15 return 2*num
16 print(double.__doc__)
17 # Output: Function to double the value
18
35
36 ## Python Dictionary
37 d = {'key1':'value1', 'Seth':22, 'Alex':21}
38 print(d['key1'],d['Alex'],d['Seth'])
39 # Output: value1 21 22
40
41 ## Output Formatting
42 x = 5.1; y = 10
43 print('x = %d and y = %d' %(x,y))
44 print('x = %f and y = %d' %(x,y))
45 print('x = {} and y = {}'.format(x,y))
46 print('x = {1} and y = {0}'.format(x,y))
47 # Output: x = 5 and y = 10
48 # x = 5.100000 and y = 10
49 # x = 5.1 and y = 10
50 # x = 10 and y = 5.1
51
52 print("x=",x,"y=",y, sep="#",end="&\n")
53 # Output: x=#5.1#y=#10&
54
55 ## Python Input
56 C = input('Enter any: ')
57 print(C)
58 # Output: Enter any: Starkville
59 # Starkville
8 if __name__ == '__main__':
9 num = input('Enter a natural number: ')
10 cubes = get_cubes(int(num))
11 print(cubes)
3 cubes = get_cubes(8)
4 print(cubes)
Execution
1 [Fri Jul.22] python call_get_cubes.py
2 [1, 8, 27, 64, 125, 216, 343, 512]
– Thus
    P'(x_n) = Q(x_n).    (2.4)
That is, the evaluation of Q at x_n becomes the desired quantity P'(x_n).
Example 2.4. Let P(x) = x^4 − 4x^3 + 7x^2 − 5x − 2. Use Newton's method and Horner's method to implement a code and find an approximate zero of P near 3.
Solution. First, let’s try to use built-in functions.
zeros_of_poly_built_in.py
1 import numpy as np
2
3 # assumed definitions (elided in the original listing; consistent with the output below)
4 P = np.poly1d([1, -4, 7, -5, -2])    # P(x) = x^4 - 4x^3 + 7x^2 - 5x - 2
5 Pder = P.deriv()                     # P'(x)
6
7 print(P)
8 print(Pder)
9 print(np.roots(P))
10 print(P(3), Pder(3))
Output
1 4 3 2
2 1 x - 4 x + 7 x - 5 x - 2
3 3 2
4 4 x - 12 x + 14 x - 5
5 [ 2. +0.j 1.1378411+1.52731225j 1.1378411-1.52731225j -0.2756822+0.j ]
6 19 37
Zeros-Polynomials-Newton-Horner.py
1 def horner(A,x0):
2     # input: A = [a_n,...,a_1,a_0]
3     # output: p=P(x0), d=P'(x0)
4     # (lines 1-6 reconstructed here as an assumption, following horner.m below)
5     n = len(A); p = A[0]; d = 0
6
7     for i in range(1,n):
8         d = p + x0*d
9         p = A[i] +x0*p
10    return p,d
11
12 def newton_horner(A,x0,tol,itmax):
13 """ input: A = [a_n,...,a_1,a_0]
14 output: x: P(x)=0 """
15 x=x0
16 for it in range(1,itmax+1):
17 p,d = horner(A,x)
18 h = -p/d;
19 x = x + h;
20 if(abs(h)<tol): break
21 return x,it
22
23 if __name__ == '__main__':
24 coeff = [1, -4, 7, -5, -2]; x0 = 3
25 tol = 10**(-12); itmax = 1000
26 x,it =newton_horner(coeff,x0,tol,itmax)
27 print("newton_horner: x0=%g; x=%g, in %d iterations" %(x0,x,it))
Execution
1 [Sat Jul.23] python Zeros-Polynomials-Newton-Horner.py
2 newton_horner: x0=3; x=2, in 7 iterations
Note: The above Python code can be compared with the Matlab code below.
horner.m
1 function [p,d] = horner(A,x0)
2 % input: A = [a_0,a_1,...,a_n]
3 % output: p=P(x0), d=P'(x0)
4
5 n = size(A(:),1);
6 p = A(n); d=0;
7
8 for i = n-1:-1:1
9 d = p + x0*d;
10 p = A(i) +x0*p;
11 end
newton_horner.m
1 function [x,it] = newton_horner(A,x0,tol,itmax)
2 % input: A = [a_0,a_1,...,a_n]; x0: initial for P(x)=0
2 % output: x: P(x)=0
4
5 x = x0;
6 for it=1:itmax
7 [p,d] = horner(A,x);
8 h = -p/d;
9 x = x + h;
10 if(abs(h)<tol), break; end
11 end
Call_newton_horner.m
1 a = [-2 -5 7 -4 1];
2 x0=3;
3 tol = 10^-12; itmax=1000;
4 [x,it] = newton_horner(a,x0,tol,itmax);
5 fprintf(" newton_horner: x0=%g; x=%g, in %d iterations\n",x0,x,it)
6 Output: newton_horner: x0=3; x=2, in 7 iterations
Observation 2.5.
Python programming is as easy and simple as Matlab programming.
• In particular, numpy is developed for Matlab-like implementation, with enhanced convenience.
• Python uses classes for object-oriented programming.
• Furthermore, Python is an open-source (free) programming language, which helps explain why Python is among the fastest-growing languages in use.
In the following, we will build a simple class, as Dr. Xu did in [82, Appendix B.5]; you will learn how to initialize, refine, and use classes.
Polynomial_01.py
1 class Polynomial():
2 """A class of polynomials"""
3
4 def __init__(self,coefficient):
5 """Initialize coefficient attribute of a polynomial."""
6 self.coeff = coefficient
7
8 def degree(self):
9 """Find the degree of a polynomial"""
10 return len(self.coeff)-1
11
12 if __name__ == '__main__':
13 p2 = Polynomial([1,2,3])
14 print(p2.coeff) # a variable; output: [1, 2, 3]
15 print(p2.degree()) # a method; output: 2
16 def degree(self):
17 """Find the degree of a polynomial"""
18 return len(self.coeff)-1
19
20 def evaluate(self,x):
21 """Evaluate a polynomial."""
22
23 n = self.degree()
24 eval = []
25 for xi in x:
26 p = self.coeff[0] #Horner's method
27 for k in range(1,n+1):
28 p = self.coeff[k]+ xi*p
29 eval.append(p)
30 return eval
31
32 if __name__ == '__main__':
33 poly1 = Polynomial()
34 print('poly1, default coefficients:', poly1.coeff)
35 poly1.coeff = [1,2,-3]
36 print('poly1, coefficients after reset:', poly1.coeff)
37 print('poly1, degree:', poly1.degree())
38
39 poly2 = Polynomial()
40 poly2.coeff = [1,2,3,4,-5]
41 print('poly2, coefficients after reset:', poly2.coeff)
42 print('poly2, degree:', poly2.degree())
43
48 print('poly2.evaluate([-1,0,1,2]):',poly2.evaluate([-1,0,1,2]))
Inheritance
Note: If we want to write a class that is just a specialized version of another class, we do not need to write the class from scratch.
• We call the specialized class a child class and the other, more general class a parent class.
• The child class can inherit all the attributes and methods from the parent class; it can also define its own special attributes and methods, or even override methods of the parent class.
Classes.py
1 class Polynomial():
2 """A class of polynomials"""
3
4 def __init__(self,coefficient):
5 """Initialize coefficient attribute of a polynomial."""
6 self.coeff = coefficient
7
8 def degree(self):
9 """Find the degree of a polynomial"""
10 return len(self.coeff)-1
11
12 class Quadratic(Polynomial):
13 """A class of quadratic polynomial"""
14
15 def __init__(self,coefficient):
16 """Initialize the coefficient attributes ."""
17 super().__init__(coefficient)
18 self.power_decrease = 1
19
20 def roots(self):
21 a,b,c = self.coeff
22 if self.power_decrease != 1:
23 a,c = c,a
24 discriminant = b**2-4*a*c
25 r1 = (-b+discriminant**0.5)/(2*a)
26 r2 = (-b-discriminant**0.5)/(2*a)
27 return [r1,r2]
28
29 def degree(self):
30 return 2
• Line 12: We must include the name of the parent class in the paren-
theses of the definition of the child class (to indicate the parent-child
relation for inheritance).
• Line 17: The super() function gives a child object all the attributes defined in the parent class.
• Line 18: An additional child class attribute self.power_decrease is
initialized.
• Lines 20-27: define a new method called roots.
• Lines 29-30: The method degree() overrides the parent’s method.
call_Quadratic.py
1 from Classes import *
2
3 quad1 = Quadratic([2,-3,1])
4 print('quad1, roots:',quad1.roots())
5 quad1.power_decrease = 0
6 print('roots when power_decrease = 0:',quad1.roots())
7 # Output: quad1, roots: [1.0, 0.5]
8 # roots when power_decrease = 0: [2.0, 1.0]
Python Keywords
Keywords are predefined and reserved words in Python.
Ex.: and, or, not, if, else, elif, for, while, break, del, ...
Note: A while loop has not been covered in the lecture; however, you can easily figure it out by yourself. A minimal sketch is given below.
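The following is a minimal sketch of a while loop (the file name while_demo.py is made up for illustration); it mirrors the for-loop examples shown earlier and sums the integers 1 through n.
while_demo.py
n = 10
total = 0; k = 1
while k <= n:      # the body repeats as long as the condition holds
    total += k
    k += 1
print('sum of 1..%d = %d' %(n, total))
# Output: sum of 1..10 = 55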
2.3. Write a function that takes as input a list of values and returns the largest value. Do
this without using the Python max() function; you should combine a for loop and an
if statement.
2.4. Let P_4(x) = 2x^4 − 5x^3 − 11x^2 + 20x + 10. Solve the following.
Chapter 3. Simple Machine Learning Algorithms for Classification

In this chapter, we will make use of two of the first algorithmically described machine learning algorithms for classification: the perceptron and adaptive linear neurons (Adaline). We will start by implementing a perceptron step by step in Python and training it to classify different flower species in the Iris dataset.
Contents of Chapter 3
3.1. Binary Classifiers – Artificial Neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2. The Perceptron Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3. Adaline: ADAptive LInear NEuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Exercises for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Remark 3.2. Neurons are interconnected nerve cells that are involved
in the processing and transmitting of chemical and electrical signals.
Such a nerve cell can be described as a simple logic gate with binary
outputs;
• multiple signals arrive at the dendrites,
• they are integrated into the cell body,
• and if the accumulated signal exceeds a certain threshold, an output
signal is generated that will be passed on by the axon.
Linear classifiers
As artificial neurons, they have the following characteristics:
• Inputs are feature values: x
• Each feature has a weight: w
• Weighted sum (integration) is the activation:
    activation_w(x) = Σ_j w_j x_j = w · x    (3.1)
Unknowns, in ML:
    Training:   w
    Prediction: activation_w(x)
Examples:
• Perceptron
• Adaline (ADAptive LInear NEuron)
• Support Vector Machine (SVM) ⇒ nonlinear decision boundaries, too
where θ is a threshold.
For simplicity, we can bring the threshold θ in (3.2) to the left side of the equation; define a weight-zero as w_0 = −θ and reformulate as
    φ(z) = { 1 if z ≥ 0;  −1 otherwise },   z = w^T x = w_0 + w_1 x_1 + · · · + w_m x_m.    (3.3)
The update of the weight vector w can be more formally written as
    w := w + Δw,   Δw = η (y^(i) − ŷ^(i)) x^(i),    (3.4)
where η is the learning rate, 0 < η < 1, y^(i) is the true class label of the i-th training sample, and ŷ^(i) denotes the predicted class label.
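For illustration only (this is not the implementation used later in the lecture), a minimal NumPy sketch of the perceptron rule (3.3)-(3.4) could look as follows; the function name perceptron_fit and the handling of w_0 as a bias entry are assumptions.
perceptron_sketch.py
import numpy as np

def perceptron_fit(X, y, eta=0.1, n_iter=10):
    """Minimal perceptron sketch: X is N-by-m, labels y are in {-1,+1}."""
    w = np.zeros(1 + X.shape[1])        # w[0] plays the role of w0 = -theta
    for _ in range(n_iter):
        for xi, yi in zip(X, y):
            z = w[0] + xi.dot(w[1:])    # net input, cf. (3.3)
            yhat = 1 if z >= 0 else -1  # unit step activation
            update = eta*(yi - yhat)    # perceptron rule, cf. (3.4)
            w[0] += update
            w[1:] += update*xi
    return w
For a linearly separable dataset, the updates become zero once every sample is classified correctly.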
    y^(i) ŵ^T x^(i) ≥ γ > 0,   ∀ i,    (3.5)

Theorem 3.8. Assume the data set D = {(x^(i), y^(i))} is linearly separable with margin γ, i.e.,
    ∃ ŵ, ||ŵ|| = 1 :  y^(i) ŵ^T x^(i) ≥ γ > 0,  ∀ i.    (3.6)
Suppose that ||x^(i)|| ≤ R, ∀ i, for some R > 0. Then the maximum number of mistakes made by the perceptron algorithm is bounded by R²/γ².
Proof. Assume the perceptron algorithm makes yet another mistake, for (x^(ℓ), y^(ℓ)). Then
    ||w^(ℓ+1)||² = ||w^(ℓ) + η (y^(ℓ) − ŷ^(ℓ)) x^(ℓ)||²
                = ||w^(ℓ)||² + ||η (y^(ℓ) − ŷ^(ℓ)) x^(ℓ)||² + 2η (y^(ℓ) − ŷ^(ℓ)) w^(ℓ)T x^(ℓ)    (3.7)
                ≤ ||w^(ℓ)||² + ||η (y^(ℓ) − ŷ^(ℓ)) x^(ℓ)||² ≤ ||w^(ℓ)||² + (2η R)²,
    ŵ^T w^(ℓ+1) = ŵ^T w^(ℓ) + η (y^(ℓ) − ŷ^(ℓ)) ŵ^T x^(ℓ) ≥ ŵ^T w^(ℓ) + 2η γ,
which implies
    ŵ^T w^(ℓ) ≥ ℓ · (2η γ)    (3.10)
and therefore
    ||w^(ℓ)||² ≥ ℓ² · (2η γ)².    (3.11)
It follows from (3.9) and (3.11) that ℓ ≤ R²/γ².
Optimal Separator?
Figure 3.3
Example 3.10. Support Vector Machine (Cortes & Vapnik, 1995) chooses
the linear separator with the largest margin.
Figure 3.4
• − vs {◦, +} ⇒ weights w−
• + vs {◦, −} ⇒ weights w+
• ◦ vs {+, −} ⇒ weights w◦
    φ(w^T x) = w^T x.

Definition 3.12. In the case of Adaline, we can define the cost function J to learn the weights as the Sum of Squared Errors (SSE) between the calculated outcomes and the true class labels:
    J(w) = (1/2) Σ_i ( y^(i) − φ(w^T x^(i)) )².    (3.12)
Algorithm 3.13. The Gradient Descent Method uses −∇J(w) for the search direction (update direction):
    w := w + Δw,   Δw = −η ∇J(w).

Note: The above Adaline rule is compared with the perceptron rule (3.4):
    w := w + Δw,   Δw = η (y^(i) − ŷ^(i)) x^(i).
♦ Hyperparameters
Definition 3.14. In ML, a hyperparameter is a parameter whose
value is set before the learning process begins. Thus it is an algorithmic
parameter. Examples are
• The learning rate (η)
• The number of maximum epochs/iterations (n_iter)
• The SGD method updates the weights based on a single training ex-
ample.
• The SGD method typically reaches convergence much faster be-
cause of the more frequent weight updates.
• Since each search direction is calculated based on a single training example, the error surface is noisier than in the gradient descent method; this is one reason the SGD method can escape shallow local minima more readily.
• To obtain accurate results via the SGD method, it is important to present the data in a random order, which may prevent cycles over epochs.
• In the SGD method, the learning rate η is often set adaptively, de-
creasing over iteration k. For example, ηk = c1 /(k + c2 ).
♦ Mini-batch learning
Definition 3.17. A compromise between the batch gradient descent and
the SGD is the so-called mini-batch learning. Mini-batch learning can
be understood as applying batch gradient descent to smaller subsets of
the training data – for example, 32 samples at a time.
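As a rough illustration of mini-batch learning applied to the linear (Adaline-type) model above, consider the sketch below; the batch size of 32 follows the example in Definition 3.17, while the function name, the per-epoch shuffling, and the batch-averaged update are assumptions.
minibatch_sketch.py
import numpy as np

def minibatch_gd(X, y, eta=0.01, n_epochs=20, batch_size=32, seed=0):
    """Mini-batch gradient descent for the linear activation phi(z) = z."""
    rng = np.random.default_rng(seed)
    w = np.zeros(1 + X.shape[1])
    N = X.shape[0]
    for _ in range(n_epochs):
        idx = rng.permutation(N)                   # shuffle every epoch (avoids cycles)
        for start in range(0, N, batch_size):
            b = idx[start:start + batch_size]
            output = w[0] + X[b].dot(w[1:])        # linear activation on the mini-batch
            errors = y[b] - output
            w[1:] += eta * X[b].T.dot(errors) / len(b)   # averaged gradient step
            w[0]  += eta * errors.mean()
    return w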
To get the Iris dataset, you may reuse some lines of the code given on earlier pages (starting on page 31).
3.3. Perturb the dataset X by a random Gaussian noise G_σ with an observable σ (so that G_σ(X) is not linearly separable) and do the examples in Exercise 3.2 again.
Chapter 4. Gradient-based Methods for Optimization
Contents of Chapter 4
4.1. Gradient Descent Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2. Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3. Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4. The Stochastic Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5. The Levenberg–Marquardt Algorithm, for Nonlinear Least-Squares Problems . . . . . 81
Exercises for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
¹ The Rosenbrock function in 3D is given as f(x, y, z) = [(1 − x)² + 100 (y − x²)²] + [(1 − y)² + 100 (z − y²)²], which has exactly one minimum, at (1, 1, 1). Similarly, one can define the Rosenbrock function in general N-dimensional spaces, for N ≥ 4, by adding one more component for each enlarged dimension. That is,
    f(x) = Σ_{i=1}^{N−1} [ (1 − x_i)² + 100 (x_{i+1} − x_i²)² ],   where x = [x_1, x_2, · · · , x_N] ∈ R^N.
See Wikipedia (https://en.wikipedia.org/wiki/Rosenbrock_function) for details.
The goal of the gradient descent method is to address directly the process of minimizing the function f, using the fact that −∇f(x) is the direction of steepest descent of f at x. Given an initial point x_0, we move it in the direction of −∇f(x_0) so as to get a smaller function value. That is,
    x_{n+1} = x_n − γ ∇f(x_n),    (4.5)
for some γ > 0. The parameter γ is called the step length or the learning rate.
Picking the step length γ: Assume that the step length is chosen to be independent of n, although one can play with other choices as well. The question is how to select γ in order to make the best gain of the method. To turn the right-hand side of (4.5) into a more manageable form, we invoke Taylor's Theorem:²
    f(x + t) = f(x) + t f'(x) + ∫_x^{x+t} (x + t − s) f''(s) ds.    (4.6)
Assuming that |f''(s)| ≤ L, we have
    f(x + t) ≤ f(x) + t f'(x) + (t²/2) L.
Now, letting x = x_n and t = −γ f'(x_n) reads
    f(x_{n+1}) = f(x_n − γ f'(x_n))
              ≤ f(x_n) − γ f'(x_n) f'(x_n) + (1/2) L [γ f'(x_n)]²    (4.7)
              = f(x_n) − [f'(x_n)]² (γ − (L/2) γ²).
The gain (learning) from the method occurs when
    γ − (L/2) γ² > 0   ⇒   0 < γ < 2/L,    (4.8)
and it will be best when γ − (L/2) γ² is maximal. This happens at the point
    γ = 1/L.    (4.9)
² Taylor's Theorem with integral remainder: Suppose f ∈ C^{n+1}[a, b] and x_0 ∈ [a, b]. Then, for every x ∈ [a, b],
    f(x) = Σ_{k=0}^{n} [f^{(k)}(x_0)/k!] (x − x_0)^k + R_n(x),   R_n(x) = (1/n!) ∫_{x_0}^{x} (x − s)^n f^{(n+1)}(s) ds.
    f'(x̂) = lim_{n→∞} f'(x_n) = 0,    (4.14)
Figure 4.3 (γ = 0.2): On a well-conditioned quadratic function, the gradient descent converges in a few iterations to the optimum.
Figure 4.4 (γ = 0.02): On a poorly-conditioned quadratic function, the gradient descent converges, but takes many more iterations than on the above well-conditioned problem. This is partially because gradient descent requires a much smaller step size on this problem to converge.
Figure 4.5 (γ = 0.02): Gradient descent also converges on a poorly-conditioned non-convex problem; convergence is slow in this case.
The gradient descent algorithm with backtracking line search then becomes

Algorithm 4.11. (The Gradient Descent Algorithm, with Backtracking Line Search).
    input: initial guess x_0, step size γ_0 > 0;
    for n = 0, 1, 2, · · · do
        initial step size estimate γ_n;
        while (TRUE) do
            if f(x_n − γ_n ∇f(x_n)) ≤ f(x_n) − (γ_n/2) ||∇f(x_n)||²:  break;    (4.19)
            else γ_n = γ_n/2;
        end while
        x_{n+1} = x_n − γ_n ∇f(x_n);
    end for
    return x_{n+1};
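A direct transcription of Algorithm 4.11 into Python might look like the sketch below; the function and variable names are assumptions, and the Rosenbrock test at the end is only a usage example.
gd_backtracking_sketch.py
import numpy as np

def gradient_descent_bt(f, grad, x0, gamma0=1.0, n_iter=5000, tol=1e-8):
    """Gradient descent with the backtracking line search test (4.19)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        gamma = gamma0                           # initial step size estimate
        while f(x - gamma*g) > f(x) - 0.5*gamma*np.dot(g, g):
            gamma *= 0.5                         # sufficient decrease failed: halve
        x = x - gamma*g
    return x

# Usage example: the 2D Rosenbrock function
f    = lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2
grad = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                           200*(x[1] - x[0]**2)])
print(gradient_descent_bt(f, grad, [-1.0, 1.0]))   # slowly approaches (1, 1)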
Figure 4.7: In this example we can clearly see the effect of the backtracking line search strategy: once the algorithm is in a region of low curvature, it can take larger step sizes. The final result is a much improved convergence compared with the fixed step-size equivalent.
Figure 4.8: The backtracking line search also improves convergence on non-convex prob-
lems.
The gradient descent method thus updates its iterates by minimizing the following surrogate function:
    Q_n(x) = f(x_n) + ∇f(x_n) · (x − x_n) + (1/(2γ)) ||x − x_n||².    (4.24)
Differentiating the surrogate and equating to zero recovers the update x_{n+1} = x_n − γ ∇f(x_n).

    H_n = ∇²f(x_n),    (4.27)
Remark 4.16. Newton's method can be seen as finding the critical points of f, i.e., x̂ such that ∇f(x̂) = 0. Let
    x_{n+1} = x_n + Δx.    (4.30)
Then
    ∇f(x_{n+1}) = ∇f(x_n + Δx) = ∇f(x_n) + ∇²f(x_n) Δx + O(|Δx|²).
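Setting ∇f(x_{n+1}) ≈ 0 above and solving for Δx gives the Newton step; a minimal sketch is below, where the damping factor γ and all names are assumptions (no line search or safeguards are included).
newton_sketch.py
import numpy as np

def newton_method(grad, hess, x0, gamma=1.0, n_iter=50, tol=1e-10):
    """Newton iteration: x_{n+1} = x_n - gamma*[Hess f(x_n)]^{-1} grad f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = np.linalg.solve(hess(x), -g)   # solve H dx = -g rather than invert H
        x = x + gamma*dx
    return x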
For the three example functions in Section 4.1.2, Newton's method performs better, as shown in the following.
Figure 4.10 (γ = 1): In this case the quadratic approximation is exact and the method converges in a single iteration.
Figure 4.11 (γ = 1): Although badly conditioned, the cost function is quadratic; the method converges in a single iteration.
Figure 4.12 (γ = 1): When the Hessian is close to singular, there might be some numerical instabilities. However, the result is better than that of the gradient descent method in Figure 4.5.
Claim 4.18. The Hessian (or Hessian matrix) describes the local
curvature of a function. The eigenvalues and eigenvectors of the Hes-
sian have geometric meaning:
• The first principal eigenvector (corresponding to the largest eigen-
value in modulus) is the direction of greatest curvature.
• The last principal eigenvector (corresponding to the smallest eigen-
value in modulus) is the direction of least curvature.
• The corresponding eigenvalues are the respective amounts of these
curvatures.
The eigenvectors of the Hessian are called principal directions, which
are always orthogonal to each other. The eigenvalues of the Hessian
are called principal curvatures and are invariant under rotation and
always real-valued.
and therefore
    H^{-1} v = Σ_{j=1}^{d} (1/λ_j) ξ_j u_j,    (4.33)
where the components of v in the leading principal directions of H have been diminished by larger factors.
• Thus the angle measured from H^{-1}v to the least principal direction of H becomes smaller than the angle measured from v.
• It is also true when v is the gradient vector (in fact, the negation of the gradient vector).
where
    angle(a, b) = arccos( (a · b) / (||a|| ||b||) ).
This implies that by setting v = −∇f(x_n), the adjusted vector H^{-1}v is a rotation (and scaling) of the steepest descent vector towards the least curvature direction.
Claim 4.20. The net effect of H^{-1} is to rotate and scale the gradient vector to face towards the minimizer by a certain degree, which may make Newton's method converge much faster than the gradient descent method.
Example 4.21. One can easily check that at each point (x, y) on the ellipsoid
    z = f(x, y) = (x − h)²/a² + (y − k)²/b²,    (4.35)
the vector −[∇²f(x, y)]^{-1} ∇f(x, y) is always facing towards the minimizer (h, k). See Exercise 2.
• Imposing the secant condition H_{n+1} s_n = y_n and with (4.40), we get the update equation for H_{n+1}:
    H_{n+1} = H_n + (y_n y_n^T)/(y_n · s_n) − ((H_n s_n)(H_n s_n)^T)/(s_n · H_n s_n).    (4.41)
• Let B_n = H_n^{-1}, the inverse of H_n. Then, applying the Sherman–Morrison formula, we can update B_{n+1} = H_{n+1}^{-1} as follows:
    B_{n+1} = ( I − (s_n y_n^T)/(y_n · s_n) ) B_n ( I − (y_n s_n^T)/(y_n · s_n) ) + (s_n s_n^T)/(y_n · s_n).    (4.42)
See Exercise 4.4.
³ Rank-one matrices: Let A be an m × n matrix. Then rank(A) = 1 if and only if there exist column vectors v ∈ R^m and w ∈ R^n such that A = v w^T.
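A sketch of the inverse-Hessian update (4.42) inside a bare-bones quasi-Newton loop is given below; a fixed step size replaces a proper line search, and the function names and the positive-definiteness guard are assumptions, so this is illustrative rather than a full BFGS implementation.
bfgs_sketch.py
import numpy as np

def bfgs_update_B(B, s, y):
    """Inverse-Hessian update (4.42): returns B_{n+1} from B_n, s_n, y_n."""
    rho = 1.0 / np.dot(y, s)
    I = np.eye(len(s))
    V = I - rho*np.outer(s, y)
    return V @ B @ V.T + rho*np.outer(s, s)

def quasi_newton_bfgs(grad, x0, gamma=0.5, n_iter=200, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                    # B_0 = I approximates H^{-1}
    g = grad(x)
    for _ in range(n_iter):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - gamma * (B @ g)       # quasi-Newton step (no line search here)
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if np.dot(y, s) > 1e-12:          # keep B positive definite
            B = bfgs_update_B(B, s, y)
        x, g = x_new, g_new
    return x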
Figure 4.16: Even on the ill-conditioned nonconvex problem, the BFGS algorithm also
converges extremely fast, with a convergence that is more similar to Newton’s method than
to gradient descent.
We can write the full stochastic gradient algorithm as follows. The algo-
rithm has only one free parameter: γ.
Algorithm 4.25. (Stochastic Gradient Descent).
The SGD can be much more efficient than gradient descent in the
case in which the objective consists of a large sum, because at each
iteration we only need to evaluate a partial gradient and not the full gradi-
ent.
Example 4.26. A least-squares problem can be written in a form acceptable by SGD, since
    (1/m) ||Ax − b||² = (1/m) Σ_{i=1}^{m} (A_i x − b_i)²,    (4.46)
Figure 4.17 (γ_n = γ = 0.2): For the well-conditioned convex problem, stochastic gradient with constant step size converges quickly to a neighborhood of the optimum, but then bounces around.
Figure 4.18: Stochastic gradient with decreasing step sizes is quite robust to the choice of step size. On one hand there is really no good way to set the step size (e.g., no equivalent of line search for gradient descent), but on the other hand it converges for a wide range of step sizes.
    p := p + Δp.    (4.51)
The goal of each iteration is to find the parameter update Δp that reduces f. We will begin with the gradient descent method and the Gauss–Newton method.
    Δp_gd = γ J^T W (y − ŷ(p)),    (4.53)

Algorithm Derivation
    [J^T W J] Δp_gn = J^T W (y − ŷ(p)).    (4.57)
    [J^T W J + λI] Δp_lm = J^T W (y − ŷ(p)),    (4.58)
then λ is increased.
• Then the step is accepted when ρ_k(Δp_lm) > ε_0, for a threshold ε_0 > 0.
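A minimal sketch of the damped update (4.58), with W = I, is shown below; the acceptance test is simplified to accepting the step whenever the cost decreases, instead of the gain-ratio test ρ_k, and the names and the factor 10 for adjusting λ are assumptions.
lm_sketch.py
import numpy as np

def lm_step(J, r, lam):
    """One Levenberg-Marquardt update (4.58) with W = I:
       solve (J^T J + lam*I) dp = J^T r, where r = y - yhat(p)."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam*np.eye(n), J.T @ r)

def levenberg_marquardt(residual, jacobian, p0, lam=1e-2, n_iter=100):
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        r, J = residual(p), jacobian(p)
        dp = lm_step(J, r, lam)
        if np.sum(residual(p + dp)**2) < np.sum(r**2):
            p, lam = p + dp, lam/10.0    # good step: accept and reduce damping
        else:
            lam *= 10.0                  # poor step: increase damping and retry
    return p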
4.1. (Gradient descent method). Implement the gradient descent algorithm (4.15) and
the gradient descent algorithm with backtracking line search (4.19).
4.2. (Net effect of the inverse Hessian matrix). Verify the claim in Example 4.21.
4.3. (Newton's method). Implement a line search version of Newton's method (4.32) with the Rosenbrock function in 2D.
(a) Recall the results in Exercise 1. With the backtracking line search, is Newton's method better than the gradient descent method?
(b) Now, we will approximate the Hessian matrix by its diagonal. That is,
    D_n := [ ∂²f/∂x²   0 ;  0   ∂²f/∂y² ](x_n)  ≈  ∇²f(x_n) = [ ∂²f/∂x²   ∂²f/∂x∂y ;  ∂²f/∂y∂x   ∂²f/∂y² ](x_n).    (4.62)
How does Newton's method perform when the Hessian matrix is replaced by D_n?
4.4. (BFGS update). Consider H_{n+1} and B_{n+1} in (4.41) and (4.42), respectively:
    H_{n+1} = H_n + (y_n y_n^T)/(y_n · s_n) − ((H_n s_n)(H_n s_n)^T)/(s_n · H_n s_n),
    B_{n+1} = ( I − (s_n y_n^T)/(y_n · s_n) ) B_n ( I − (y_n s_n^T)/(y_n · s_n) ) + (s_n s_n^T)/(y_n · s_n).
4.5. (Curve fitting; Optional for undergraduates). Consider a set of data consisting of four points:
      i  |   1     2     3     4
    x_i  |  0.0   1.0   2.0   3.0
    y_i  |  1.1   2.6   7.2  21.1
(a) Implement the three algorithms introduced in Section 4.5: the gradient descent
method, the Gauss-Newton method, and the Levenberg-Marquardt method.
(b) Ignore the weight vector W , i.e., set W = I.
(c) For each method, set p0 = [a0 , b0 ] = [1.0, 0.8].
(d) Discuss how to choose γ for the gradient descent and λ for the Levenberg-Marquardt.
Hint: The Jacobian for this example must be in R^{4×2}; more precisely,
    J = ∂ŷ(x, p)/∂p = [ 1, 0 ;  e^b, a e^b ;  e^{2b}, 2a e^{2b} ;  e^{3b}, 3a e^{3b} ],
Chapter 5. Popular Machine Learning Classifiers

Contents of Chapter 5
5.1. Logistic Sigmoid Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2. Classification via Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3. Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4. Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5. k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Exercises for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
    s'(x) = [e^x (1 + e^x) − e^x · e^x] / (1 + e^x)² = e^x / (1 + e^x)² = s(x)(1 − s(x)).    (5.4)
Figure 5.2: The rectifier and its smooth approximation, the softplus function ln(1 + e^x).
• We can then define the logit function, which is simply the logarithm of the odds ratio (log-odds):
    logit(p) = ln( p / (1 − p) ).    (5.7)
• The logit function takes input values in (0, 1) and transforms them to values over the entire real line, which we can use to express a linear relationship between feature values and the log-odds:
Solution.
Ans: logit^{-1}(z) = 1 / (1 + e^{-z}), the standard logistic sigmoid function.
Remark 5.4. Logistic Regression can be applied not only for classi-
fication (class labels) but also for class-membership probability.
• For example, logistic regression is used in weather forecasting (to
predict the chance of rain).
• Similarly, it can be used to predict the probability that a patient has
a particular disease given certain symptoms.
– This is why logistic regression enjoys great popularity in the field
of medicine.
¹ Note that the Adaline minimizes the sum-squared-error (SSE) cost function, defined as J(w) = (1/2) Σ_i (φ(z^(i)) − y^(i))², where z^(i) = w^T x^(i), using the gradient descent method; see Section 3.3.1.
and therefore
    ∇J(w) = −Σ_{i=1}^{n} [ y^(i) − φ(z^(i)) ] x^(i).    (5.17)
Note: The above gradient descent rule for logistic regression has the same form as that of Adaline; see (3.15) on p. 48. The only difference is the activation function φ(z).
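A minimal sketch of gradient descent for logistic regression built on (5.17) is given below; it assumes labels y in {0, 1}, handles the bias as w[0], and uses made-up names, so it is not the Algorithm 5.9 referenced in the exercises.
logreg_gd_sketch.py
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, eta=0.05, n_iter=1000):
    """Batch gradient descent with the gradient (5.17)."""
    w = np.zeros(1 + X.shape[1])
    for _ in range(n_iter):
        z = w[0] + X.dot(w[1:])
        errors = y - sigmoid(z)            # y^(i) - phi(z^(i))
        w[1:] += eta * X.T.dot(errors)     # step along -grad J(w), cf. (5.17)
        w[0]  += eta * errors.sum()
    return w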
Figure 5.6
Regularization
• One way of finding a good bias–variance tradeoff.
• It is useful to prevent overfitting and also to handle
– collinearity (high correlation among features),
– filtering out noise from data,
– the multiple local minima problem.
• The concept behind regularization is to introduce additional information (bias) to penalize extreme parameter (weight) values.
• The most common form of regularization is so-called L2 regularization (sometimes also called L2 shrinkage or weight decay):
    (λ/2) ||w||² = (λ/2) Σ_{j=1}^{m} w_j²,    (5.20)
The cost function for logistic regression can be regularized by adding a simple regularization term, which will shrink the weights during model training: for z^(i) = w^T x^(i),
    J(w) = Σ_{i=1}^{n} [ −y^(i) ln φ(z^(i)) − (1 − y^(i)) ln(1 − φ(z^(i))) ] + (λ/2) ||w||².    (5.21)
Note: Regularization
• Regularization is another reason why feature scaling such as stan-
dardization is important.
• For regularization to work properly, we need to ensure that all our
features are on comparable scales.
• Then, via the regularization parameter λ, we can control how well we
fit the training data while keeping the weights small. By increasing
the value of λ, we increase the regularization strength.
• See § 6.3 for details on feature scaling.
To find an optimal hyperplane that maximizes the margin, let's begin by considering the positive and negative hyperplanes that are parallel to the decision boundary:
    w_0 + w^T x_+ = 1,
    w_0 + w^T x_- = −1,    (5.22)
where w = [w_1, w_2, · · · , w_d]^T. If we subtract those two linear equations from each other, then we have
    w · (x_+ − x_-) = 2
and therefore
    (w/||w||) · (x_+ − x_-) = 2/||w||.    (5.23)
(b) Evaluate f at all these points, to find the maximum and minimum.

Self-study 5.13. Use the method of Lagrange multipliers to find the extreme values of f(x, y) = x² + 2y² on the circle x² + y² = 1.
Hint: ∇f = λ∇g  ⟹  [2x, 4y]^T = λ [2x, 2y]^T. Therefore,
    (1)  2x = 2x λ,
    (2)  4y = 2y λ,
    (3)  x² + y² = 1.
From (1), x = 0 or λ = 1.
Now, consider
    min_x max_α L(x, α)   subj.to α ≥ 0.    (5.29)
Figure 5.9: min_x x², subj.to x ≥ 1.
(3) Let x < 1. ⇒ max_{α≥0} {−α(x − 1)} = ∞. However, min_x won't make this happen! (min_x is fighting max_α.) That is, when x < 1, the objective L(x, α) becomes huge as α grows; then, min_x will push x ↗ 1 or increase it to become x ≥ 1. In other words, min_x forces max_α to behave, so the constraints will be satisfied.

Now, the goal is to solve (5.29). In the following, we will define the dual problem of (5.29), which is equivalent to the primal problem.
where
    L(x, α) = x² − α (x − 1).
The term min_x L(x, α) is called the Lagrange dual function and the Lagrange multiplier α is also called the dual variable.
How to solve it. For the Lagrange dual function min_x L(x, α), the minimum occurs where the gradient is equal to zero:
    (d/dx) L(x, α) = 2x − α = 0   ⇒   x = α/2.    (5.32)
Plugging this into L(x, α), we have
    L(x, α) = (α/2)² − α (α/2 − 1) = α − α²/4.
    ∇_w L([w, w_0], α) = w − Σ_{i=1}^{N} α_i y^(i) x^(i) = 0   ⇒   w = Σ_{i=1}^{N} α_i y^(i) x^(i),
    (∂/∂w_0) L([w, w_0], α) = −Σ_{i=1}^{N} α_i y^(i) = 0   ⇒   Σ_{i=1}^{N} α_i y^(i) = 0,

    α_i ≥ 0,   (dual feasibility)
    α_i [y^(i)(w_0 + w^T x^(i)) − 1] = 0,   (complementary slackness)
    y^(i)(w_0 + w^T x^(i)) − 1 ≥ 0.   (primal feasibility)    (5.39)
Complementary slackness will be discussed in detail on page 111.
Again using the first KKT condition, we can rewrite the first term:
    −(1/2) ||w||² = −(1/2) Σ_{i=1}^{N} α_i y^(i) x^(i) · Σ_{j=1}^{N} α_j y^(j) x^(j)
                 = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y^(i) y^(j) x^(i) · x^(j).    (5.41)
Plugging (5.41) into the (simplified) Lagrangian (5.40), we see that the Lagrangian now depends on α only.
Support vectors
Assume momentarily that we have w_0^*. Consider the complementary slackness KKT condition along with the primal and dual feasibility conditions. Then

Definition 5.20. The examples in the first category, for which the scaled margin is 1 and the constraints are active, are called support vectors. They are the closest to the decision boundary.

    y^(i)(w_0^* + w^{*T} x^(i)) ≥ 1   and   min_i y^(i)(w_0^* + w^{*T} x^(i)) = 1.
Complementary Slackness
Definition 5.21. Types of Constraints
• A binding constraint is one where some optimal solution is on the
hyperplane for the constraint.
• A non-binding constraint is one where no optimal solution is on
the line for the constraint.
• A redundant constraint is one whose removal would not change
the feasible region.
Remark 5.24. The motivation for introducing the slack variable ξ is:
1. The linear constraints need to be relaxed for inseparable data.
2. Allow the optimization to converge
• under appropriate cost penalization,
• in the presence of misclassifications.
Such a strategy of the SVM is called soft-margin classification.
The constraints allow some slack of size ξ_i, but we pay a price for it in the objective. That is, if y^(i) f(x^(i)) ≥ 1, then ξ_i = 0 and the penalty is 0; otherwise, y^(i) f(x^(i)) = 1 − ξ_i and we pay the price ξ_i > 0.
So the only difference from the original problem’s dual, (5.42), is that
αi ≥ 0 is changed to 0 ≤ αi ≤ C . Neat!
See § 5.3.6, p. 120, for the solution of (5.51), using the SMO algorithm.
Note:
• G = ZZ T ∈ RN ×N is called the Gram matrix. That is,
For example, for the inseparable data set in Figure 5.12, we define
Kernel Trick
Recall the dual problem to the soft-margin SVM given in (5.51):
    max_α [ Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y^(i) y^(j) x^(i) · x^(j) ],   subj.to
    0 ≤ α_i ≤ C, ∀ i,   and   Σ_{i=1}^{N} α_i y^(i) = 0.    (5.54)
One of the most widely used kernels is the Radial Basis Function (RBF) kernel, also simply called the Gaussian kernel.

Common Kernels
• Sigmoid:
    K(x^(i), x^(j)) = tanh(a x^(i) · x^(j) + b)    (5.59)
• Gaussian RBF:
    K(x^(i), x^(j)) = exp( −||x^(i) − x^(j)||² / (2σ²) )    (5.60)
• And many others: Fisher kernel, graph kernel, string kernel, ... a very active area of research!
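For illustration, the kernels (5.59)-(5.60) and the corresponding kernel (Gram) matrix can be sketched as follows; the function names and default parameters are assumptions.
kernels_sketch.py
import numpy as np

def sigmoid_kernel(xi, xj, a=1.0, b=0.0):
    """Sigmoid kernel, cf. (5.59)."""
    return np.tanh(a*np.dot(xi, xj) + b)

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian RBF kernel, cf. (5.60)."""
    return np.exp(-np.sum((xi - xj)**2) / (2.0*sigma**2))

def kernel_matrix(X, kernel, **kwargs):
    """K[i, j] = K(x^(i), x^(j)) over all sample pairs (rows of X)."""
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j], **kwargs)
    return K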
Question. Start with (5.63). Let's say you want to hold α_2, · · · , α_N fixed and take a coordinate step in the first direction. That is, change α_1 to maximize the objective in (5.63). Can we make any progress? Can we get a better feasible solution by doing this?
Turns out, no. Let's see why. Look at the constraint in (5.63), Σ_{i=1}^{N} α_i y^(i) = 0. This means
    α_1 y^(1) = −Σ_{i=2}^{N} α_i y^(i)   ⇒   α_1 = −y^(1) Σ_{i=2}^{N} α_i y^(i).
Figure 5.13
• In practice, this can result in a very deep tree with many nodes, which can easily lead to overfitting. (We typically set a limit for the maximal depth of the tree.)
where
    f : the feature to perform the split,
    D_P : the parent dataset,
    D_j : the dataset of the j-th child node,
    I : the impurity measure,
    N_P : the total number of samples at the parent node,
    N_j : the number of samples in the j-th child node.
Impurity measure?
Commonly used in binary decision trees:
• Entropy:
    I_H(t) = −Σ_{i=1}^{c} p(i|t) log_2 p(i|t)    (5.68)
• Gini impurity:
    I_G(t) = Σ_{i=1}^{c} p(i|t) (1 − p(i|t)) = 1 − Σ_{i=1}^{c} p(i|t)²    (5.69)
• Classification error:
    I_E(t) = 1 − max_i {p(i|t)}    (5.70)
where p(i|t) denotes the proportion of the samples that belong to class i for a particular node t.
⇒ In practice, both Gini impurity and entropy yield very similar results.
• Classification error: It is less sensitive to changes in the class probabilities of the nodes.
⇒ The classification error is a useful criterion for pruning, but not recommended for growing a decision tree.
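The three impurity measures (5.68)-(5.70) are direct functions of the class proportions p(i|t); a minimal sketch (function names assumed) is given below.
impurity_sketch.py
import numpy as np

def class_proportions(labels):
    """p(i|t): proportion of each class among the samples at node t."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):                       # (5.68)
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

def gini(labels):                          # (5.69)
    p = class_proportions(labels)
    return 1.0 - np.sum(p**2)

def classification_error(labels):          # (5.70)
    return 1.0 - np.max(class_proportions(labels))

# Example: a node holding 40 samples of class 0 and 60 of class 1
node = np.array([0]*40 + [1]*60)
print(entropy(node), gini(node), classification_error(node))
# approximately: 0.971  0.48  0.4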
Figure 5.15: A decision tree result with Gini impurity measure, for three classes with
two features (petal length, petal width). Page 99, Python Machine Learning, 3rd Ed..
The maximum in (5.71) often happens when one of the child impurities
is zero or very small.
Typically, the larger the number of trees, the better the performance
of the random forest classifier at the expense of an increased compu-
tational cost.
Figure 5.16: Illustration for how a new data point (?) is assigned the triangle class label,
based on majority voting, when k = 5.
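A minimal k-NN prediction matching the majority-vote picture in Figure 5.16 may be sketched as follows (Euclidean distance, k = 5, names assumed); it is not the scikit-learn KNeighborsClassifier used elsewhere in the lecture.
knn_sketch.py
import numpy as np
from collections import Counter

def knn_predict(Xtrain, ytrain, x, k=5):
    """Predict the label of a single point x by majority vote among
       its k nearest training samples (Euclidean metric)."""
    dists = np.linalg.norm(Xtrain - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(ytrain[nearest])
    return votes.most_common(1)[0][0]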
(a) Apply the logistic regression gradient descent (Algorithm 5.9) for the noisy data
Gσ (XSD ).
(b) Modify the code for the logistic regression with regularization (5.21) and apply
the resulting algorithm for Gσ (XSD ).
(c) Compare their performances
5.4. (Optional for Undergraduate Students) Verify the formulation in (5.51), which is
dual to the minimization of (5.50).
5.5. Experiment examples on pp. 84–91, Python Machine Learning, 3rd Ed., in order to
optimize the performance of kernel SVM by finding a best kernel and optimal hyper-
parameters (gamma and C).
Choose one of Exercises 6 and 7 below to implement and experiment with. The experiment will guide you to understand how such ML software can be composed from scratch. You may use the example codes thankfully shared by Dr. Jason Brownlee, who is the founder of machinelearningmastery.com.
5.6. Implement a decision tree algorithm that incorporates the Gini impurity measure,
from scratch, to run for the data used on page 96, Python Machine Learning, 3rd Ed..
Compare your results with the figure on page 97 of the book. You may refer to
https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/
5.7. Implement a k-NN algorithm, from scratch, to run for the data used on page 106, Python Machine Learning, 3rd Ed.. Compare your results with the figure on page 103 of the book. You may refer to
https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
Chapter 6. Data Preprocessing in Machine Learning
Contents of Chapter 6
6.1. General Remarks on Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.2. Dealing with Missing Data & Categorical Data . . . . . . . . . . . . . . . . . . . . . . . 136
6.3. Feature Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.4. Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.5. Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Exercises for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
The more disciplined you are in your handling of data, the more
consistent and better results you are likely to achieve.
There are two common approaches to bring different features onto the same scale:
• min-max scaling (normalization):
    x_{j,norm}^(i) = ( x_j^(i) − x_{j,min} ) / ( x_{j,max} − x_{j,min} ) ∈ [0, 1],    (6.3)
where x_{j,min} and x_{j,max} are the minimum and maximum of the j-th feature column (in the training dataset), respectively.
• standardization:
    x_{j,std}^(i) = ( x_j^(i) − µ_j ) / σ_j,    (6.4)
where µ_j is the sample mean of the j-th feature column and σ_j is the corresponding standard deviation.
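Both rescalings (6.3) and (6.4) are one-liners in NumPy; a sketch is given below, where the statistics are computed column-wise on the training set only (function names are assumptions).
scaling_sketch.py
import numpy as np

def minmax_scale(Xtrain, X):
    """Min-max scaling (6.3), with parameters taken from the training set."""
    xmin, xmax = Xtrain.min(axis=0), Xtrain.max(axis=0)
    return (X - xmin) / (xmax - xmin)

def standardize(Xtrain, X):
    """Standardization (6.4): zero mean, unit standard deviation per feature."""
    mu, sigma = Xtrain.mean(axis=0), Xtrain.std(axis=0)
    return (X - mu) / sigma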
where
    R_p(w) := (1/p) ||w||_p^p,   p = 1, 2.    (6.6)
• Regularization can be considered as adding a penalty term to the
cost function to encourage smaller weights; or in other words, we
penalize large weights.
• Thus, by increasing the regularization strength (λ ↑),
– we can shrink the weights towards zero, and
– decrease the dependence of our model on the training data.
• The minimizer w∗ must be the point where the Lp -ball intersects
with the minimum-valued contour of the unpenalized cost function.
– The variable λ in (6.5) is a kind of Lagrange multiplier.
Figure 6.1: L2-regularization (||w||_2² = Σ_{i=1}^{m} w_i²) and L1-regularization (||w||_1 = Σ_{i=1}^{m} |w_i|).
LASSO (L1 -regularization). In the right figure, the L1 -ball touches the
minimum-valued contour of the cost function at w1 = 0; the optimum
is more likely located on the axes, which encourages sparsity (zero
entries in w∗ ).
where
    sign(w_i) = { 1 if w_i > 0;  −1 if w_i < 0;  0 if w_i = 0 }.
There is no research addressing the question of training vs. test data; more
research and more experience are needed to gain a better understanding.
(a) Perform the same experiment with the k-NN classifier replaced by the support
vector machine (soft-margin SVM classification).
(b) In particular, analyze accuracy of the soft-margin SVM and plot the result
as in the figure on p. 139.
6.2. On pp. 141-143, the permutation feature importance is assessed from the ran-
dom forest classifier, using the wine dataset.
(a) Discuss whether or not you can derive feature importance for a k-NN classifier.
(b) Assess feature importance with the logistic regression classifier, using the
same dataset.
(c) Based on the computed feature importance, analyze and plot accuracy of the
logistic regression classifier for k_features = 1, 2, · · · , 13.
Chapter 7. Feature Extraction: Data Compression
Feature Extraction
• It can be understood as an approach to dimensionality reduction
and data compression.
– with the goal of maintaining most of the relevant information
• In practice, feature extraction is used
– to improve storage space or the computational efficiency
– to improve the predictive performance by reducing the curse
of dimensionality
Contents of Chapter 7
7.1. Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.2. Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.3. Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.4. Kernel Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Exercises for Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
    w_1 = arg max_{w≠0} (w^T X^T X w)/(w^T w) = v_1/||v_1||,   (X^T X) v_1 = λ_1 v_1,    (7.3)
where λ_1 is the largest eigenvalue of X^T X ∈ R^{d×d},
and then (2) finding the weight vector which extracts the maximum variance from this new data matrix X̂_k:
    w_k = arg max_{||w||=1} ||X̂_k w||².    (7.5)

Claim 7.3. The above turns out to give the (normalized) eigenvectors of X^T X. That is, the transformation matrix W is the stack of eigenvectors of X^T X:
    W = [v_1 v_2 · · · v_d],   (X^T X) v_j = λ_j v_j,
where λ_1 ≥ λ_2 ≥ · · · ≥ λ_d ≥ 0.
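Following Claim 7.3, PCA can be sketched as an eigendecomposition of X^T X with X mean-centered row-wise; the function name and the absence of any feature scaling are assumptions.
pca_sketch.py
import numpy as np

def pca_transform(X, k):
    """Project the (row-wise) data X onto its first k principal components."""
    Xc = X - X.mean(axis=0)          # center each feature column
    C = Xc.T @ Xc                    # proportional to the covariance matrix
    lam, V = np.linalg.eigh(C)       # eigenpairs of the symmetric matrix, ascending
    order = np.argsort(lam)[::-1]    # sort so that lam_1 >= lam_2 >= ...
    W = V[:, order[:k]]              # W = [v_1, ..., v_k]
    return Xc @ W                    # Z = X W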
where
    U : n × d, orthogonal (the left singular vectors of X),
    Σ : d × d, diagonal (the singular values of X),
    V : d × d, orthogonal (the right singular vectors of X),
with σ_1 ≥ σ_2 ≥ · · · ≥ σ_d ≥ 0.
• In terms of this factorization, the matrix X^T X reads
    X^T X = (U Σ V^T)^T U Σ V^T = V Σ U^T U Σ V^T = V Σ² V^T,    (7.13)
where
    Σ_k := diag(σ_1, · · · , σ_k, 0, · · · , 0).    (7.16)
• Now, using (7.15), the truncated data matrix reads
    X_k = Z_k W_k^T = U Σ_k W_k^T = U Σ_k W^T = U Σ_k V^T.    (7.17)
    ||X − X_k||_2 = ||U Σ V^T − U Σ_k V^T||_2 = ||U (Σ − Σ_k) V^T||_2 = ||Σ − Σ_k||_2 = σ_{k+1},    (7.18)
with σ_1 ≥ σ_2 ≥ · · · ≥ σ_n ≥ 0.
where
    U : m × n, orthogonal (the left singular vectors of A),
    Σ : n × n, diagonal (the singular values of A),
    V : n × n, orthogonal (the right singular vectors of A).
are orthogonal.
• Now, we write
    U^T A V = [u Ũ]^T · A · [v Ṽ] = [ u^T A v   u^T A Ṽ ;  Ũ^T A v   Ũ^T A Ṽ ].
Since
we have
    U^T A V = [ σ  0 ; 0  U_1 Σ_1 V_1^T ] = [ 1  0 ; 0  U_1 ] [ σ  0 ; 0  Σ_1 ] [ 1  0 ; 0  V_1 ]^T,
or equivalently
    A = U [ 1  0 ; 0  U_1 ] [ σ  0 ; 0  Σ_1 ] [ 1  0 ; 0  V_1 ]^T V^T.    (7.25)
    U = [u_1 u_2 · · · u_n],   Σ = diag(σ_1, σ_2, · · · , σ_n),   V = [v_1 v_2 · · · v_n],
    A = U Σ V^T  ⟺  A V = U Σ V^T V = U Σ,
we have
    A V = A [v_1 v_2 · · · v_n] = [A v_1  A v_2  · · ·  A v_n]
        = [u_1 · · · u_r · · · u_n] diag(σ_1, · · · , σ_r, 0, · · · , 0)    (7.26)
        = [σ_1 u_1  · · ·  σ_r u_r  0 · · · 0].
Therefore,
    A = U Σ V^T  ⟺  { A v_j = σ_j u_j,  j = 1, 2, · · · , r;   A v_j = 0,  j = r + 1, · · · , n }.    (7.27)
• Equation (7.29) gives how to find the singular values {σ_j} and the right singular vectors V, while (7.27) shows a way to compute the left singular vectors U.
• (Dyadic decomposition) The matrix A ∈ R^{m×n} can be expressed as
    A = Σ_{j=1}^{n} σ_j u_j v_j^T.    (7.31)
When rank(A) = r ≤ n,
    A = Σ_{j=1}^{r} σ_j u_j v_j^T.    (7.32)
This property has been utilized for various approximations and applications, e.g., by dropping singular vectors corresponding to small singular values.
    B_1 = {v_1, v_2, · · · , v_n},
    B_2 = {σ_1 u_1, σ_2 u_2, · · · , σ_r u_r}   (for a subspace of R^m):
    B_1 = {v_1, v_2, · · · , v_n}  --A-->  B_2 = {σ_1 u_1, σ_2 u_2, · · · , σ_r u_r}.    (7.33)
Then, ∀ x ∈ S^{n−1},
    x = x_1 v_1 + x_2 v_2 + · · · + x_n v_n,
    A x = σ_1 x_1 u_1 + σ_2 x_2 u_2 + · · · + σ_r x_r u_r = y_1 u_1 + y_2 u_2 + · · · + y_r u_r,   (y_j = σ_j x_j).    (7.34)
So, we have
    y_j = σ_j x_j  ⟺  x_j = y_j/σ_j,
    Σ_{j=1}^{n} x_j² = 1 (sphere)  ⟺  Σ_{j=1}^{r} y_j²/σ_j² = α ≤ 1 (ellipsoid).    (7.35)
Example 7.12. We build the set A(S^{n−1}) by multiplying one factor of A = U Σ V^T at a time. Assume for simplicity that A ∈ R^{2×2} and nonsingular. Let
    A = [ 3  −2 ; −1  2 ] = U Σ V^T
      = [ −0.8649  0.5019 ; 0.5019  0.8649 ] [ 4.1306  0 ; 0  0.9684 ] [ −0.7497  0.6618 ; 0.6618  0.7497 ].
Then, for x ∈ S^1,
    A x = U Σ V^T x = U Σ (V^T x).
In general,
    σ_1 ≥ · · · ≥ σ_r > 0.
Define, for k = 1, · · · , r − 1,
    A_k = Σ_{j=1}^{k} σ_j u_j v_j^T   (sum of rank-1 matrices).
    A_k = U Σ_k V^T,    (7.41)
    A_k^+ = V Σ_k^+ U^T,    (7.42)
where
    Σ_k^+ = diag(1/σ_1, 1/σ_2, · · · , 1/σ_k, 0, · · · , 0).    (7.43)
Full SVD
• For A ∈ R^{m×n},
    A = U Σ V^T  ⟺  U^T A V = Σ,
where U ∈ R^{m×n} and Σ, V ∈ R^{n×n}.
• Expand
    U → Ũ = [U  U_2] ∈ R^{m×m}  (orthogonal),
    Σ → Σ̃ = [ Σ ; O ] ∈ R^{m×n},
where O is an (m − n) × n zero matrix.
• Then,
    Ũ Σ̃ V^T = [U  U_2] [ Σ ; O ] V^T = U Σ V^T = A.    (7.44)
    A^T A = V Λ V^T,   V = [v_1 v_2 · · · v_n].
Lemma 7.18. Let A ∈ Rn×n be symmetric. Then (a) all the eigenvalues
of A are real and (b) eigenvectors corresponding to distinct eigenvalues
are orthogonal.
    λ_1 = 18 and λ_2 = 5.
3.  σ_1 = √λ_1 = √18 = 3√2,  σ_2 = √λ_2 = √5. So
    Σ = [ √18  0 ; 0  √5 ].
4.  u_1 = (1/σ_1) A v_1 = (1/√18) A [3/√13; 2/√13] = (1/√18)(1/√13) [7; −4; 13] = [7/√234; −4/√234; 13/√234],
    u_2 = (1/σ_2) A v_2 = (1/√5) A [−2/√13; 3/√13] = (1/√5)(1/√13) [4; 7; 0] = [4/√65; 7/√65; 0].
5.  A = U Σ V^T = [ 7/√234  4/√65 ; −4/√234  7/√65 ; 13/√234  0 ] [ √18  0 ; 0  √5 ] [ 3/√13  2/√13 ; −2/√13  3/√13 ].
Thus,
    A^+ = V Σ^{-1} U^T = [ 3/√13  −2/√13 ; 2/√13  3/√13 ] [ 1/√18  0 ; 0  1/√5 ] [ 7/√234  −4/√234  13/√234 ; 4/√65  7/√65  0 ]
        = [ −1/30  −4/15  1/6 ; 11/45  13/45  1/9 ].
Pn · · · P1 A Q1 · · · Qn−2 = B (7.45)
Numerical rank
In the absence of round-off errors and uncertainties in the data, the SVD
reveals the rank of the matrix. Unfortunately the presence of errors makes
rank determination problematic. For example, consider
A = [ 1/3  1/3  2/3
      2/3  2/3  4/3
      1/3  2/3  3/3       (7.46)
      2/5  2/5  4/5
      3/5  1/5  4/5 ]
• Obviously A is of rank 2, as its third column is the sum of the first two.
• Matlab “svd" (with IEEE double precision) produces
σ1 = 2.5987, σ2 = 0.3682, and σ3 = 8.6614 × 10−17 .
• The approximation
A_k = UΣ_k V^T = Σ_{i=1}^{k} σ_i u_i v_i^T   (7.50)
(Figure: the rank-k approximations for k = 20, 50, and 100.)
LDA objective
• The LDA objective is to perform dimensionality reduction.
– So what? PCA does that, too!
• However, we want to preserve as much of the class discriminatory
information as possible.
– OK, this is new!
LDA
• Consider a pattern classification problem, where we have c classes.
• Suppose each class has Nk samples in Rd , where k = 1, 2, · · · , c.
• Let Xk = {x(1) , x(2) , · · · , x(Nk ) } be the set of d-dimensional samples for
class k.
• Let X ∈ Rd×N be the data matrix, stacking all the samples from
all
P classes, such that each column represents a sample, where N =
k Nk .
• The LDA seeks to obtain a transformation of X to Z through projecting
the samples in X onto a hyperplane with dimension c − 1.
1
In PCA, the main idea is to re-express the available dataset to extract the relevant information by re-
ducing the redundancy and to minimize the noise. While (unsupervised) PCA attempts to find the
orthogonal component axes of maximum variance in a dataset, the goal in the (supervised) LDA is to find
the feature subspace that optimizes class separability.
• For each class k, we define the scatter (an equivalent of the variance) as
s̃_k^2 = Σ_{x∈X_k} (z − μ̃_k)^2,   z = w^T x.   (7.56)
• The quantity s̃_k^2 measures the variability within class X_k after projecting it on the z-axis.
• Thus, s̃_1^2 + s̃_2^2 measures the variability within the two classes at hand after projection; it is called the within-class scatter of the projected samples.
• Fisher's linear discriminant is defined as the linear function w^T x that maximizes the objective function:
w* = arg max_w J(w),  where  J(w) = (μ̃_1 − μ̃_2)^2 / (s̃_1^2 + s̃_2^2).   (7.57)
S_w^{-1} S_b w = λ w  ⟺  S_b w = λ S_w w;   (7.65)
9 S1 = cov(X1,0)*n;
10 S2 = cov(X2,0)*n;
11 Sw = S1+S2; % Sw = [20,13; 13,31]
12
15 invSw_Sb = inv(Sw)*Sb;
16 [V,L] = eig(invSw_Sb); % V1 = [ 0.9503,0.3113]; L1 = 1.0476
17 % V2 = [-0.6905,0.7234]; L2 = 0.0000
• If we have N sample (column) vectors, we can stack them into one ma-
trix as follows.
Z = W T X, (7.68)
where X ∈ Rd×N , W ∈ Rd×(c−1) , and Z ∈ R(c−1)×N .
Recall: For the two classes case, the within-class scatter matrix was
computed as
Sw = S1 + S2 .
Xk , and Sw ∈ Rd×d .
Recall: For the two classes case, the between-class scatter matrix
was computed as
Sb = (µ1 − µ2 )(µ1 − µ2 )T .
where rank(Sb ) = c − 1.
W* = arg max_W J(W) = arg max_W [ (W^T S_b W) / (W^T S_w W) ].   (7.71)
Illustration – 3 classes
• Let us generate a dataset for each class to illustrate the LDA transfor-
mation.
• For each class:
– Use the random number generator to generate a uniform stream of
500 samples that follows U(0, 1).
– Using the Box-Muller approach, convert the generated uniform stream
to N (0, 1).
– Then use the method of eigenvalues and eigenvectors to manip-
ulate the standard normal to have the required mean vector and
covariance matrix .
– Estimate the mean and covariance matrix of the resulted dataset.
4 %% uniform stream
5 U = rand(2,1000); u1 = U(:,1:2:end); u2 = U(:,2:2:end);
6
41 lda_c3_visualize;
denormalize.m
1 function Xnew = denormalize(X,Mu,Cov)
2 % it manipulates data samples in N(0,1) to something else.
3
6 Xnew = zeros(size(X));
7 for j=1:size(X,2)
8 Xnew(:,j)= VsD * X(:,j);
9 end
10
lda_c3_visualize.m
1 figure, hold on; axis([-10 20 -5 20]);
2 xlabel('x_1 - the first feature','fontsize',12);
3 ylabel('x_2 - the second feature','fontsize',12);
4 plot(X1(1,:)',X1(2,:)','ro','markersize',4,"linewidth",2)
5 plot(X2(1,:)',X2(2,:)','g+','markersize',4,"linewidth",2)
6 plot(X3(1,:)',X3(2,:)','bd','markersize',4,"linewidth",2)
7 hold off
8 print -dpng 'LDA_c3_Data.png'
9
17 plot(Mu1(1),Mu1(2),'c.','markersize',20)
18 plot(Mu2(1),Mu2(2),'m.','markersize',20)
19 plot(Mu3(1),Mu3(2),'r.','markersize',20)
20 plot(Mu(1),Mu(2),'k*','markersize',15,"linewidth",3)
21 text(Mu(1)+0.5,Mu(2)-0.5,'\mu','fontsize',18)
22
33
43 figure, plot(z1_wk,z1_wk_pdf,'ro',z2_wk,z2_wk_pdf,'g+',...
44 z3_wk,z3_wk_pdf,'bd')
45 xlabel('z','fontsize',12); ylabel('p(z|w1)','fontsize',12);
46 print -dpng 'LDA_c3_Xw1_pdf.png'
47
61 figure, plot(z1_wk,z1_wk_pdf,'ro',z2_wk,z2_wk_pdf,'g+',...
62 z3_wk,z3_wk_pdf,'bd')
63 xlabel('z','fontsize',12); ylabel('p(z|w2)','fontsize',12);
64 print -dpng 'LDA_c3_Xw2_pdf.png'
Figure 7.9: w1∗ (solid line in black) and w2∗ (dashed line in magenta).
λ1 = 3991.2, λ2 = 1727.7.
Figure 7.10: Classes PDF, along the first projection vector w1∗ ; λ1 = 3991.2.
Figure 7.11: Classes PDF, along the second projection vector w2∗ ; λ2 = 1727.7.
Apparently, the projection vector that has the highest eigenvalue pro-
vides higher discrimination power between classes.
We summarize the main steps that are required to perform the LDA for
dimensionality reduction.
1. Standardize the d-dimensional dataset (d is the number of features).
2. For each class j, compute the d-dimensional mean vector µj .
3. Construct the within-class scatter matrix Sw (7.69) and the
between-class scatter matrix Sb (7.70).
4. Compute the eigenvectors and corresponding eigenvalues of the ma-
trix Sw−1 Sb (7.72).
5. Sort the eigenvalues by decreasing order to rank the corresponding
eigenvectors.
6. Choose the k eigenvectors that correspond to the k largest eigenvalues to construct a transformation matrix W; the samples are then projected via Z = W^T X, as in (7.68).
Remark 7.24.
• rank(Sw−1 Sb ) ≤ c − 1; we must have k ≤ c − 1.
• The projected feature Zij is x(i) · wj in the projected coordinates and
(x(i) · wj ) wj in the original coordinates.
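To make the summary concrete, here is a minimal NumPy sketch of Steps 1-6 for a labeled dataset; rows of X are samples, y holds integer labels, and the function name and the eigen-solve of S_w^{-1} S_b are illustrative assumptions (S_w is assumed invertible).

import numpy as np

def lda_transform(X, y, k):
    """Minimal LDA sketch: X in R^{N x d} (rows = samples), y integer class labels."""
    d = X.shape[1]
    mu = X.mean(axis=0)                       # overall mean
    Sw = np.zeros((d, d)); Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)     # within-class scatter, cf. (7.69)
        diff = (mu_c - mu).reshape(-1, 1)
        Sb += Xc.shape[0] * diff @ diff.T     # between-class scatter, cf. (7.70)
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)   # cf. (7.72)
    idx = np.argsort(evals.real)[::-1][:k]    # k largest eigenvalues, k <= c - 1
    W = evecs[:, idx].real                    # d x k transformation matrix
    return X @ W                              # projected samples, N x k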
The (supervised) LDA classifier is expected to work better than the (unsupervised) PCA for the datasets in Figures 7.9 and 7.12.
Recall: Fisher’s LDA was generalized under the assumption of equal class
covariances and normally distributed classes.
However, even if one or more of those assumptions are (slightly) violated,
the LDA for dimensionality reduction can still work reasonably well.
where λ_1 ≥ λ_2 ≥ ··· ≥ λ_k ≥ 0.
• (Remark 7.4, p. 152). The matrix Z_k ∈ R^{N×k} consists of scaled eigenvectors of XX^T:
Z_k = [√λ_1 u_1 | √λ_2 u_2 | ··· | √λ_k u_k],   (XX^T) u_j = λ_j u_j,  u_i^T u_j = δ_ij.   (7.78)
• A data (row) vector x (new or old) is transformed to a k-dimensional row vector of principal components
z = x W_k ∈ R^{1×k}.   (7.79)
Then,
V ≅ W;  σ_j^2 = λ_j,  j = 1, 2, ···, d,
Z_k = [σ_1 u_1 | σ_2 u_2 | ··· | σ_k u_k].   (7.80)
and therefore
v = (1/(λN)) Σ_{i=1}^{N} φ(x^{(i)}) φ(x^{(i)})^T v = (1/(λN)) Σ_{i=1}^{N} [φ(x^{(i)}) · v] φ(x^{(i)}),   (7.88)
Note:
• The above claim means that all eigenvectors v with λ 6= 0 lie in the
span of φ(x(1) ), · · · , φ(x(N ) ).
• Thus, finding the eigenvectors in (7.84) is equivalent to finding the
coefficients α = (α1 , α2 , · · · , αN )T .
How to find α
• By substituting this back into the equation and using (7.83), we get
C v_j = λ_j v_j  ⇒  (1/N) φ(X)^T φ(X) φ(X)^T α_j = λ_j φ(X)^T α_j,   (7.90)
and therefore
(1/N) φ(X) φ(X)^T φ(X) φ(X)^T α_j = λ_j φ(X) φ(X)^T α_j.   (7.91)
• Let K be the similarity (kernel) matrix:
K := φ(X) φ(X)^T ∈ R^{N×N}.   (7.92)
• Then, referring to (7.78) derived for the standard PCA, we may conclude that the k principal components for the kernel PCA are
A_k = [√μ_1 α_1 | √μ_2 α_2 | ··· | √μ_k α_k] ∈ R^{N×k}.   (7.97)
• It follows from (7.86), (7.89), and (7.95)-(7.96) that for a new point x, its projection onto the principal components is:
z_j = φ(x)^T v_j = (1/√μ_j) φ(x)^T Σ_{ℓ=1}^{N} α_{ℓj} φ(x^{(ℓ)}) = (1/√μ_j) Σ_{ℓ=1}^{N} α_{ℓj} φ(x)^T φ(x^{(ℓ)})
    = (1/√μ_j) Σ_{ℓ=1}^{N} α_{ℓj} K(x, x^{(ℓ)}) = (1/√μ_j) K(x, X)^T α_j.   (7.98)
That is, due to (7.95) and (7.96), when v_j is expressed in terms of α_j, it must be scaled by 1/√μ_j.
• In matrix form,
K̃ = K − K 1_{1/N} − 1_{1/N} K + 1_{1/N} K 1_{1/N},   (7.102)
K̃ α_j = μ_j α_j,   α_i^T α_j = δ_ij.   (7.105)
• With an appropriate choice of kernel function, the kernel PCA can give
a good re-encoding of the data that lies along a nonlinear manifold.
• The kernel matrix is in (N × N )-dimensions, so the kernel PCA will
have difficulties when we have lots of data points.
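As a rough illustration of (7.92)-(7.105), the sketch below computes an RBF kernel PCA with NumPy/SciPy; the kernel width gamma, the function name, and returning only the training projections are assumptions of this sketch.

import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel_pca(X, gamma, k):
    """Kernel PCA sketch: X in R^{N x d}, RBF kernel, k principal components."""
    K = np.exp(-gamma * cdist(X, X, 'sqeuclidean'))          # kernel matrix, (7.92)
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K_t = K - one_n @ K - K @ one_n + one_n @ K @ one_n       # centering, (7.102)
    mu, alphas = np.linalg.eigh(K_t)                          # eigenpairs (ascending)
    mu, alphas = mu[::-1], alphas[:, ::-1]                    # sort descending
    # projections of the training data: columns scaled by sqrt(mu_j), cf. (7.97)
    return alphas[:, :k] * np.sqrt(np.maximum(mu[:k], 0.0))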
7.1. Read pp. 145–158, Python Machine Learning, 3rd Ed., about the PCA.
(a) Find the optimal number of components k* which produces the best classification accuracy (for logistic regression), by experimenting with the example code using n_components = 1, 2, ···, 13.
(b) What is the corresponding cumulative explained variance?
7.2. Let A ∈ R^{m×n}. Prove that ||A||_2 = σ_1, the largest singular value of A. Hint: Use the following
||Av_1||_2 / ||v_1||_2 = σ_1 ||u_1||_2 / ||v_1||_2 = σ_1  ⟹  ||A||_2 ≥ σ_1,
and arguments around Equations (7.34) and (7.35) for the opposite directional inequality.
7.3. Recall that the Frobenius matrix norm is defined by
||A||_F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} |a_{ij}|^2 )^{1/2},   A ∈ R^{m×n}.
Show that ||A||_F = (σ_1^2 + ··· + σ_k^2)^{1/2}, where σ_j are the nonzero singular values of A. Hint: You may use the norm-preserving property of orthogonal matrices. That is, if U is orthogonal, then ||UB||_2 = ||B||_2 and ||UB||_F = ||B||_F.
7.4. Prove Lemma 7.18. Hint: For (b), let Avi = λi vi , i = 1, 2, and λ1 6= λ2 . Then
For (a), you may use a similar argument, but with the dot product being defined for
complex values, i.e.,
u · v = uT v,
where v is the complex conjugate of v.
7.5. Use Matlab to generate a random matrix A ∈ R8×6 with rank 4. For example,
A = randn(8,4);
A(:,5:6) = A(:,1:2)+A(:,3:4);
[Q,R] = qr(randn(6));
A = A*Q;
(a) Print out A on your computer screen. Can you tell by looking if it has (numerical)
rank 4?
(b) Use Matlab’s “svd" command to obtain the singular values of A. How many are
“large?" How many are “tiny?" (You may use the command “format short e" to get
a more accurate view of the singular values.)
(c) Use Matlab’s “rank" command to confirm that the numerical rank is 4.
(d) Use the “rank" command with a small enough threshold that it returns the value
6. (Type “help rank" for information about how to do this.)
7.6. Verify (7.64). Hint: Use the quotient rule for ∂J(w)/∂w and equate the numerator to zero.
7.7. Try to understand the kernel PCA more deeply by experimenting with pp. 175–188, Python Machine Learning, 3rd Ed. Its implementation is slightly different from (but equivalent to) Summary 7.27.
(a) Modify the code, following Summary 7.27, and test if it works as expected as in
Python Machine Learning, 3rd Ed..
(b) The datasets considered are transformed via the Gaussian radial basis function
(RBF) kernel only. What happens if you use the following kernels?
K1 (x(i) , x(j) ) = (a1 + b1 x(i) · x(j) )2 (polynomial of degree up to 2)
K2 (x(i) , x(j) ) = tanh(a2 + b2 x(i) · x(j) ) (sigmoid)
Cluster Analysis
Contents of Chapter 8
8.1. Basics for Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.2. K-Means and K-Medoids Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.3. Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
8.4. DBSCAN: Density-based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
8.5. Cluster Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
8.6. Self-Organizing Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Exercises for Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
d(i, j) = ||x_i − x_j||_p = (|x_{i1} − x_{j1}|^p + |x_{i2} − x_{j2}|^p + ··· + |x_{id} − x_{jd}|^p)^{1/p},   (8.1)
Center-based Clusters
Contiguity-based Clusters
Density-based Clusters
Conceptual Clusters
Partitional Clustering
Divide data objects into non-overlapping subsets (clusters) such that
each data object is in exactly one subset
Examples are
Hierarchical Clustering
A set of nested clusters, organized as a hierarchical tree
Objective Functions
Global objective function
Figure 8.13: Lloyd's algorithm: the Voronoi diagrams, the given centroids (•), and the updated centroids (+), for iteration = 1, 2, 3, and 15.
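For reference, a minimal NumPy sketch of Lloyd's iteration follows; the random initialization and the fixed number of sweeps are simplifying assumptions, not the settings used for the figure.

import numpy as np

def lloyd_kmeans(X, K, n_iter=15, seed=0):
    """Minimal Lloyd's algorithm: X in R^{N x d}, K clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: nearest centroid for each point
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: mean of the points assigned to each centroid
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids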
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent outliers
– Split “loose” clusters, i.e., clusters with relatively high SSE
– Merge clusters that are “close” and that have relatively low SSE
The PAM finds the best K-medoids among a given data, and the CLARA
finds the best K-medoids among the selected samples.
⊕ It is more efficient and scalable than both the PAM and the CLARA;
handles outliers.
⊕ Focusing techniques and spatial access structures may further im-
prove its performance; see (Ng & Han, 2002) [55] and (Schubert &
Rousseeuw, 2018) [71].
Complexity
Limitations
• Single linkage
• Complete linkage
• Group linkage
• Centroid linkage
• Ward’s minimum variance
Dendrogram
• Agglomerative clustering is monotonic
– The similarity between merged clusters is monotone decreas-
ing with the level of the merge.
• Dendrogram: Plot each merge at the dissimilarity between the two
merged groups
– Provides an interpretable visualization of the algorithm and
data
– Useful summarization tool, part of why hierarchical clustering
is popular
Figure 8.24: Six observations and a dendrogram showing their hierarchical clustering.
Remark 8.2.
• The height of the dendrogram indicates the order in which the clus-
ters were joined; it reflects the distance between the clusters.
• The greater the difference in height, the more dissimilarity.
• Observations are allocated to clusters by drawing a horizontal line
through the dendrogram. Observations that are joined together below
the line are in the same clusters.
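SciPy's hierarchical-clustering routines produce exactly this kind of dendrogram; the short sketch below is one possible usage, with the 'complete' linkage and the two-cluster cut chosen only for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.RandomState(0).rand(6, 2)          # six observations, as in Figure 8.24
Z = linkage(X, method='complete')                # agglomerative clustering
dendrogram(Z)                                    # plot merges vs. dissimilarity
plt.show()
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters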
Single Link
Complete Link
Average Link
DBSCAN
• Given a set of points, it groups points that are closely packed together
(points with many nearby neighbors),
– marking as outliers points that lie alone in low-density regions.
• It is one of the most common clustering algorithms and also most
cited in scientific literature. (Citation #: 28,008, as of Apr. 15, 2023)
– In 2014, the algorithm was awarded the test of time awarda at
the leading data mining conference,
KDD 2014: https://www.kdd.org/kdd2014/.
a
The test of time award is an award given to algorithms which have received substantial attention in
theory and practice.
User parameters:
• ε: the radius of a neighborhood
• minPts: the minimum number of points in the ε-neighborhood
• Point A and 5 other red points are core points. They are all reachable
from one another, so they form a single cluster.
• Points B and C are not core points, but are reachable from A (via
other core points) and thus belong to the cluster as well.
• Point N is a noise point that is neither a core point nor directly-
reachable.
Note: Reachability is not a symmetric relation since, by definition,
no point may be reachable from a non-core point, regardless of distance.
(A non-core point may be reachable, but nothing can be reached from it.)
DBSCAN: Pseudocode
DBSCAN
1 DBSCAN(D, eps, MinPts)
2 C=0 # Cluster counter
3 for each unvisited point P in dataset D
4 mark P as visited
5 NP = regionQuery(P, eps) # Find neighbors of P
6 if size(NP) < MinPts
7 mark P as NOISE
8 else
9 C = C + 1
10 expandCluster(P, NP, C, eps, MinPts)
11
Figure 8.29: Original points (left) and point types of the DBSCAN clustering with eps=10
and MinPts=4 (right): core (green), border (blue), and noise (red).
• Resistant to Noise
• Can handle clusters of different shapes and sizes
• Eps and MinPts depend on each other and can be hard to specify
• Varying densities
• High-dimensional data
Figure 8.31: The DBSCAN clustering. For both cases, it results in 3 clusters.
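As a usage sketch (distinct from the pseudocode above), scikit-learn's DBSCAN takes the same two user parameters; the eps and min_samples values below are placeholders that would be tuned per dataset.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.08, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = radius, min_samples = minPts
labels = db.labels_                          # cluster labels; -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)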
• Two matrices
– Proximity matrix (P ∈ R^{N×N}): entry (i, j) measures the similarity (or distance) between the items to which row i and column j correspond.
– Incidence matrix (I ∈ R^{N×N})
* One row and one column for each data point
* An entry is 1 if the associated pair of points belong to the same cluster
* An entry is 0 if the associated pair of points belong to different clusters
Order the similarity matrix with respect to cluster labels and inspect
visually.
Internal Measures
• (Average) SSE is good for comparing two clusterings or two clusters:
SSE = Σ_{i=1}^{K} Σ_{x∈C_i} ||x − μ_i||^2.   (8.9)
Silhouette Coefficient
• Entropy
– For cluster j, let p_ij be the probability that a member of cluster j belongs to class i, defined as
p_ij = n_ij / N_j,   (8.13)
• Purity
– The purity of cluster j is given by
SOM Architecture
Remark 8.6. The following are quite deeply related to each other.
(a) Repeatability
(b) Optimality
(c) Convergence
(d) Interpretability
Originally, the SOM algorithm was defined for data described by numer-
ical vectors which belong to a subset X of Rd .
• Continuous setting: the input space X ⊂ Rd is modeled by a proba-
bility distribution with a density function f ,
• Discrete setting: the input space X comprises N data points
x1 , x2 , · · · , xN ∈ R d .
Here the discrete setting means a finite subset of the input space.
Components of Self-Organization
Preliminary 8.7. The self-organization process involves four major
components:
1. Initialization: All the connection weights are initialized with
small random values.
2. Competition: For each input pattern, the neurons compute their re-
spective values of a discriminant function which provides the basis
for competition.
• The particular neuron with the smallest value of the discrim-
inant function is declared the winner.
3. Cooperation: The winning neuron determines the spatial lo-
cation of a topological neighborhood of excited neurons, thereby
providing the basis for cooperation among neighboring neurons.
(smoothing the neighborhood of the winning neuron)
4. Adaptation: The excited neurons decrease their individual values
of the discriminant function in relation to the input pattern through
suitable adjustment of the associated connection weights, such that
the response of the winning neuron to the subsequent application of a
similar input pattern is enhanced.
(making the winning neuron look more like the observation)
Thus, the neuron whose weight vector comes closest to the input vector (i.e.
is most similar to it) is declared the winner; see Preliminary 8.7, item 2.
where I(x) is the index of the winning neuron and η(t) is a learning
rate (0 < η(t) < 1, constant or decreasing).
• The effect of each weight update is to move the weight vectors
wk of the winning neuron and its neighbors towards the input vector
x.
– Repeated presentations of the training data thus lead to topological ordering. A sketch of one such update step is given below.
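Since the update formula itself is not reproduced above, the sketch uses the standard Kohonen rule w_k ← w_k + η(t) h_{k,I(x)}(t) (x − w_k), with a Gaussian neighborhood function as an assumption.

import numpy as np

def som_update(W, grid, x, eta, sigma):
    """One SOM step: W is a K x d weight matrix, grid is K x 2 neuron coordinates."""
    winner = np.argmin(((W - x) ** 2).sum(axis=1))          # competition: best-matching unit
    grid_d2 = ((grid - grid[winner]) ** 2).sum(axis=1)      # distances on the map lattice
    h = np.exp(-grid_d2 / (2.0 * sigma ** 2))               # cooperation: neighborhood function
    W += eta * h[:, None] * (x - W)                         # adaptation: move weights toward x
    return W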
Theoretical Issues
• The algorithm is easy to define and to use, and a lot of practical studies
confirm that it works.
– However, the theoretical study of its convergence when t tends to
∞ remains without complete proof and provides open problems.
– The main question is to know if the solution obtained from a finite
sample converges to the true solution that might be obtained from
the true data distribution.
• When t tends to ∞, the Rd -valued stochastic processes [wk (t)]k=1,2,··· ,K
can present oscillations, explosion to infinity, convergence in distribu-
tion to an equilibrium process, convergence in distribution or almost
sure to a finite set of points in Rd , etc.. Some of the open questions are:
– Is the algorithm convergent in distribution or almost surely, when
t tends to ∞?
– What happens when η(t) is constant? (when it decreases?)
– If a limit state exists, is it stable?
– How to characterize the organization?
8.1. We will experiment with the K-Means algorithm following the first section of Chapter 11, Python Machine Learning, 3rd Ed., in a slightly different fashion.
(a) Make a dataset of 4 clusters (modifying the code on pp. 354–355).
(b) For K = 1, 2, · · · , 10, run the K-Means clustering algorithm with the initialization
init=’k-means++’.
(c) For each K, compute the within-cluster SSE (distortion) for an elbow analysis
to select an appropriate K. Note: Rather than using inertia_ attribute, imple-
ment a function for the computation of distortion.
(d) Produce silhouette plots for K = 3, 4, 5, 6.
8.2. Now, let’s experiment DBSCAN, following Python Machine Learning, 3rd Ed., pp. 376–
381.
(a) Produce a dataset having three half-moon-shaped structures each of which con-
sists of 100 samples.
(b) Compare performances of K-Means, AGNES, and DBSCAN.
(Set n_clusters=3 for K-Means and AGNES.)
(c) For K-Means and AGNES, what if you choose n_clusters much larger than 3 (for
example, 9, 12, 15)?
(d) Again, for K-Means and AGNES, perform an elbow analysis to select an appro-
priate K.
9
CHAPTER
Neural Networks and Deep Learning
Contents of Chapter 9
9.1. Basics for Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
9.2. Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
9.3. Back-Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
9.4. Deep Learning: Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . 285
Exercises for Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Feature Learning (Representation Learning)
(Diagram: input data −→ representation −→ learning algorithm.)
or, equivalently,
φ(z) = { 1 if z ≥ 0;  −1 otherwise },   z = b + w · x,   (9.2)
Figure 9.3: The standard logistic sigmoid function σ(z) = 1/(1 + e−z ).
=⇒
Figure 9.5: Segmentation.
MNIST Dataset :
A modified subset of two datasets collected by NIST (US National Insti-
tute of Standards and Technology):
• The first part contains 60,000 images (for training)
• The second part is 10,000 images (for test)
Each image is in 28 × 28 grayscale pixels.
• Cost function
C(W, B) = (1/(2N)) Σ_i ||y^{(i)} − a(x^{(i)})||^2,   (9.6)
where W denotes the collection of all weights in the network, B all the biases, and a(x^{(i)}) is the vector of outputs from the network when x^{(i)} is input.
• Gradient descent method
[W; B] ← [W; B] + [ΔW; ΔB],   (9.7)
where
[ΔW; ΔB] = −η [∇_W C; ∇_B C].
• Stochastic gradient descent estimates the gradient from a small batch of randomly chosen training inputs x̃^{(1)}, x̃^{(2)}, ···, x̃^{(m)},
13 class Network(object):
14 def __init__(self, sizes):
15 """The list ``sizes`` contains the number of neurons in the
16 respective layers of the network. For example, if the list
17 was [2, 3, 1] then it would be a three-layer network, with the
18 first layer containing 2 neurons, the second layer 3 neurons,
19 and the third layer 1 neuron. """
20
21 self.num_layers = len(sizes)
22 self.sizes = sizes
23 self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
24 self.weights = [np.random.randn(y, x)
25 for x, y in zip(sizes[:-1], sizes[1:])]
26
4 import network
5 n_neurons = 20
6 net = network.Network([784 , n_neurons, 10])
7
Validation Accuracy
1 Epoch 0: 9006 / 10000
2 Epoch 1: 9128 / 10000
3 Epoch 2: 9202 / 10000
4 Epoch 3: 9188 / 10000
5 Epoch 4: 9249 / 10000
6 ...
7 Epoch 25: 9356 / 10000
8 Epoch 26: 9388 / 10000
9 Epoch 27: 9407 / 10000
10 Epoch 28: 9410 / 10000
11 Epoch 29: 9428 / 10000
Accuracy Comparisons
• scikit-learn’s SVM classifier using the default settings: 9435/10000
• A well-tuned SVM: ≈98.5%
• Well-designed (convolutional) NN: 9979/10000 (only 21 missed!)
9.3. Back-Propagation
9.3.1. Notations
• Let's begin with notations which let us refer to weights, biases, and activations in the network in an unambiguous way.
w^ℓ_{jk} : the weight for the connection from the k-th neuron in the (ℓ−1)-th layer to the j-th neuron in the ℓ-th layer
b^ℓ_j : the bias of the j-th neuron in the ℓ-th layer
a^ℓ_j : the activation of the j-th neuron in the ℓ-th layer
Figure 9.8: The weight w^ℓ_{jk}.
where the sum is over all neurons k in the (ℓ−1)-th layer. Denote the weighted input by
z^ℓ_j := Σ_k w^ℓ_{jk} a^{ℓ−1}_k + b^ℓ_j.   (9.11)
• Now, define
W^ℓ = [w^ℓ_{jk}] : the weight matrix for layer ℓ,
b^ℓ = [b^ℓ_j] : the bias vector for layer ℓ,
z^ℓ = [z^ℓ_j] : the weighted input vector for layer ℓ,      (9.12)
a^ℓ = [a^ℓ_j] : the activation vector for layer ℓ.
Remark 9.1. Thus the cost function in (9.14) satisfies the assumptions, with
C_x = (1/2) ||y(x) − a^L(x)||^2 = (1/2) Σ_j (y_j(x) − a^L_j(x))^2.   (9.16)
• The reason we need the first assumption is because what the back-propagation actually lets us do is compute the partial derivatives ∂C_x/∂w^ℓ_{jk} and ∂C_x/∂b^ℓ_j for a single training example.
– We then can recover ∂C/∂w^ℓ_{jk} and ∂C/∂b^ℓ_j by averaging over training examples.
• With the assumptions in mind, we may focus on computing the partial derivatives for a single example.
Theorem 9.3. Suppose that the cost function C satisfies the two assumptions in Section 9.3.2 so that it represents the cost for a single training example. Assume the network contains L layers, of which the feedforward model is given as in (9.10):
a^ℓ_j = σ(z^ℓ_j),   z^ℓ_j = Σ_k w^ℓ_{jk} a^{ℓ−1}_k + b^ℓ_j;   ℓ = 2, 3, ···, L.
Then,
(a) δ^L_j = (∂C/∂a^L_j) σ′(z^L_j),
(b) δ^ℓ_j = Σ_k w^{ℓ+1}_{kj} δ^{ℓ+1}_k σ′(z^ℓ_j),   ℓ = L−1, ···, 2,
(c) ∂C/∂b^ℓ_j = δ^ℓ_j,   ℓ = 2, ···, L,                           (9.18)
(d) ∂C/∂w^ℓ_{jk} = a^{ℓ−1}_k δ^ℓ_j,   ℓ = 2, ···, L.
Proof. Here, we will prove (b) only; see Exercise 1 for the others. Using the definition (9.17) and the chain rule, we have
δ^ℓ_j = ∂C/∂z^ℓ_j = Σ_k (∂C/∂z^{ℓ+1}_k)(∂z^{ℓ+1}_k/∂z^ℓ_j) = Σ_k (∂z^{ℓ+1}_k/∂z^ℓ_j) δ^{ℓ+1}_k.   (9.19)
Note
z^{ℓ+1}_k = Σ_i w^{ℓ+1}_{ki} a^ℓ_i + b^{ℓ+1}_k = Σ_i w^{ℓ+1}_{ki} σ(z^ℓ_i) + b^{ℓ+1}_k.   (9.20)
Differentiating it, we obtain
∂z^{ℓ+1}_k/∂z^ℓ_j = Σ_i w^{ℓ+1}_{ki} ∂σ(z^ℓ_i)/∂z^ℓ_j = w^{ℓ+1}_{kj} σ′(z^ℓ_j).   (9.21)
(Hadamard product)
(c ⊙ d)_j = c_j d_j.   (9.22)
• For example,
[1; 2] ⊙ [3; 4] = [1·3; 2·4] = [3; 8].   (9.23)
z^ℓ = W^ℓ a^{ℓ−1} + b^ℓ;   a^ℓ = σ(z^ℓ);   ℓ = 2, 3, ···, L.
3. Output error δ^L:
δ^L = ∇_{a^L} C ⊙ σ′(z^L);
4. Back-propagate the error:
∇_{b^ℓ} C = δ^ℓ;   ∇_{W^ℓ} C = δ^ℓ (a^{ℓ−1})^T;   ℓ = 2, ···, L.
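The four equations in (9.18) translate almost line-by-line into NumPy. The sketch below assumes the quadratic cost (9.16) and sigmoid activations, with column-vector inputs; it is a minimal illustration, not the textbook's reference code.

import numpy as np

def sigmoid(z):        return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(z):  s = sigmoid(z); return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Gradients for one training example; weights[l], biases[l] per layer."""
    # feedforward, storing every z^l and a^l
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b; zs.append(z)
        a = sigmoid(z); activations.append(a)
    # output error, (9.18a): delta^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b = [None] * len(biases); grad_W = [None] * len(weights)
    grad_b[-1] = delta                                  # (9.18c)
    grad_W[-1] = delta @ activations[-2].T              # (9.18d)
    # back-propagate the error, (9.18b)
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta
        grad_W[-l] = delta @ activations[-l - 1].T
    return grad_W, grad_b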
Figure 9.11
Remarks 9.9. The neural network exemplified in Figure 9.11 can pro-
duce a classification accuracy better than 98%, for the MNIST hand-
written digit dataset.
• But upon reflection, it’s strange to use networks of fully-
connected layers to classify images.
– The network architecture does not take into account the spatial
structure of the images.
– For instance, it treats input pixels which are far apart and close
together, on exactly the same footing.
• We slide the local receptive field across the entire input image.
– For each local receptive field, there is a different hidden neuron in
the first hidden layer.
Figure 9.13: Two of local receptive fields, starting from the top-left corner.
(Geometry of neurons in the first hidden layer is 24 × 24.)
Note: We have seen that the local receptive field is moved by one pixel
at a time (stride_length=1).
• In fact, sometimes a different stride length is used.
– For instance, we might move the local receptive field 2 pixels to
the right (or down).
– Most software gives a hyperparameter for the user to set the stride
length.
To put it in slightly more abstract terms, CNNs are well adapted to the
translation invariance of images.a
a
Move a picture of a cat a little ways, and it’s still an image of a cat.
Modern CNNs are often built with 10 to 50 feature maps, each associated
to a r × r local receptive field: r = 3 ∼ 9.
Figure 9.15: The 20 images corresponding to 20 different feature maps, which are ac-
tually learned when classifying the MNIST dataset (r = 5).
(c) Pooling
• CNNs also contain pooling layers, in addition to the convolutional
layers just mentioned.
• Pooling layers are usually used right after convolutional layers.
• What they do is to simplify the information in the output from the
convolutional layer.
From Figure 9.14: Since we have 24×24 neurons output from the convo-
lutional layer, after pooling we will have 12 × 12 neurons for each feature
map:
Types of pooling
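For example, 2×2 max-pooling (the most common type) keeps the largest activation in each 2×2 block, turning a 24×24 feature map into a 12×12 one. A minimal NumPy sketch follows; the 24×24 input size simply mirrors the discussion above.

import numpy as np

def max_pool_2x2(fmap):
    """fmap: H x W feature map with even H, W; returns (H/2) x (W/2) pooled map."""
    H, W = fmap.shape
    blocks = fmap.reshape(H // 2, 2, W // 2, 2)   # group entries into 2x2 blocks
    return blocks.max(axis=(1, 3))                # keep the maximum of each block

pooled = max_pool_2x2(np.random.rand(24, 24))     # 24x24 -> 12x12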
Figure 9.18: A simple CNN of three feature maps, to classify MNIST digits.
Figure 9.19: The images missed by an ensemble of 5 CNNs. The label in the top right of each image is the correct classification, while the label in the bottom right is the classifier's output.
• Keras
– A high-level neural network application programming interface
(API)
– It can run on top of TensorFlow, Theano, and Microsoft Cognitive
Toolkit (CNTK).
– Keras focuses on being modular, user-friendly, and extensible.
– Written in Python
• Tensorflow
– Low and High API levels: An end-to-end open-source deep
learning framework developed by Google in 2015
– It is a symbolic math library used for neural networks ⇒ fast!
– Tensorflow is difficult to use and debug.
– Written in Python, C++, CUDA
• Pytorch
– Low and High API levels: It is a deep learning framework based on Torch, developed by Facebook in 2017, and taken over by the PyTorch Foundation (part of the Linux Foundation) in late 2022.
– It has outstanding community support and development.
– Pytorch is much easier to use and debug than Tensorflow.
– Written in Python, C++, and CUDA (its predecessor Torch was written in Lua)
9.1. Complete proof of Theorem 9.3. Hint : The four fundamental equations in (9.18) can
be obtained by simple applications of the chain rule.
9.2. The core equations of back-propagation in a network with fully-connected layers are
given in (9.18). Suppose we have a network containing a convolutional layer, a max-
pooling layer, and a fully-connected output layer, as in the network shown in Fig-
ure 9.18. How are the core equations of back-propagation modified?
9.3. (Designing a deep network). First, download a CNN code (including the MNIST
dataset) by accessing to
https://github.com/mnielsen/neural-networks-and-deep-learning.git
or ‘git clone’ it. In the ‘src’ directory, there are 8 python source files:
conv.py mnist_average_darkness.py mnist_svm.py network2.py
expand_mnist.py mnist_loader.py network.py network3.py
On lines 16–22 in Figure 9.20 below, I put a design of a CNN model, which involved 2
hidden layers, one for a convolution-pooling layer and the other for a fully-connected
layer. Its test accuracy becomes approximately 98.8% in 30 epochs.
(a) Set ‘GPU = False’ in network3.py, if you are NOT using a GPU.
(Default: ‘GPU = True’, set on line 50-some of network3.py)
(b) Modify Run_network3.py appropriately to design a CNN model as accurate as
possible. Can your network achieve an accuracy better than 99.5%? Hint : You
may keep using the SoftmaxLayer for the final layer. ReLU (Rectified Linear
Units) seems comparable with the sigmoid function (default) for activation. The
default p_dropout=0.0. You should add some more layers, meaningful, and tune
well all of hyperparameters: the number of feature maps for convolutional layers,
the number of fully-connected neurons, η, p_dropout, etc..
(c) Design an ensemble of 5 such networks to further improve the accuracy. Hint :
Explore the function ‘ensemble’ defined in conv.py.
Run_network3.py
1 """ Run_network3.py:
2 -------------------
3 A CNN model, for the MNIST dataset,
4 which uses network3.py written by Michael Nielsen.
5 The source code can be downloaded from
6 https://github.com/mnielsen/neural-networks-and-deep-learning.git
7 or 'git clone' it
8 """
9 import network3
10 from network3 import Network, ReLU
11 from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
12
15 mini_batch_size = 10
16 net = Network([
17 ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
18 filter_shape=(20, 1, 5, 5),
19 poolsize=(2, 2), activation_fn=ReLU),
20 FullyConnectedLayer(
21 n_in=20*12*12, n_out=100, activation_fn=ReLU, p_dropout=0.0),
22 SoftmaxLayer(n_in=100, n_out=10, p_dropout=0.5)], mini_batch_size)
23
Data Mining
Contents of Chapter 10
10.1.Introduction to Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
10.2.Vectors and Matrices in Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
10.3.Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
10.4.Eigenvalue Methods in Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Exercises for Chapter 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– Remote sensors on a satellite
– Telescopes scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
Related Fields
10.2.1. Examples
Example 10.1. Term-document matrices are used in information re-
trieval. Consider the following set of five documents. Key words, referred
to as terms, are marked in boldface.
Document 1: The Google matrix P is a model of the Internet.
Document 2: Pij is nonzero if there is a link from web page j to i.
Document 3: The Google matrix is used to rank all web pages.
Document 4: The ranking is done by solving a matrix eigenvalue
problem.
Document 5: England dropped out of the top 10 in the FIFA ranking.
• Assume that we want to find all documents that are relevant with respect to the query "ranking of web pages". This is represented by a query vector, constructed in an analogous way as the term-document matrix, using the same dictionary:
q = (0, 0, 0, 0, 0, 0, 0, 1, 1, 1)^T ∈ R^{10}.   (10.2)
Thus the query itself is considered as a document.
• Now, the information retrieval task can be formulated as a mathematical
problem: find the columns of A that are close to the vector q.
A = F̂ Ω^{-1} Ĝ,   (10.9)
where
F̂ = (f_1, ···, f_r) ∈ R^{m×r},  f_i = A^{(i)} f^{(i)},
Ω = diag(ω_1, ···, ω_r) ∈ R^{r×r},                    (10.10)
Ĝ = (g_1, ···, g_r) ∈ R^{n×r},  g_i = A^{(i)T} g^{(i)}.
Theorem 10.4 can be generalized to the case where the reduction of rank is
larger than one, as shown in the next theorem.
Note: There are many choices of F and G that satisfy the condition
(10.11).
• Therefore, various rank-reduction decompositions are possible.
• It is known that several standard matrix factorizations in numerical
linear algebra are instances of the Wedderburn formula:
– Gram-Schmidt orthogonalization,
– singular value decomposition,
– QR and Cholesky decomposition, and
– the Lanczos procedure.
Stemming
• Stemming is the process of reducing each word that is conjugated or
has a suffix to its stem.
• Clearly, from the point of view of information retrieval, no information
is lost in the following reduction.
computable
computation
computing =⇒ comput
computed
computational
where
• fij is term frequency,
the number of times term i appears in document j,
• ni is the number of documents that contain term i
(inverse document frequency).
If a term occurs frequently in only a few documents, then both factors
are large. In this case the term discriminates well between different
groups of documents, and it gets a large weight in the documents where
it appears.
Figure 10.3: The first 500 rows and columns of the Medline matrix. Each dot represents a
non-zero element.
Query Matching
• The query is parsed using the same dictionary as the documents, giv-
ing a vector q ∈ Rm .
• Query matching is the process of finding all documents that are con-
sidered relevant to a particular query q.
• This is often done using the cosine distance measure: all documents {a_j} are returned for which
(q · a_j) / (||q|| ||a_j||) ≥ tol,   (10.18)
where tol is a user-defined tolerance.
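A small NumPy sketch of query matching follows; the tf-idf weighting a_ij = f_ij · log(n/n_i) used here is the common choice and an assumption of the sketch (the text above describes the two factors but the exact formula is not reproduced), and the tolerance value is a placeholder.

import numpy as np

def tfidf(F):
    """F: m x n term-frequency matrix (terms x documents)."""
    n = F.shape[1]
    n_i = np.count_nonzero(F, axis=1)                 # documents containing term i
    return F * np.log(n / np.maximum(n_i, 1))[:, None]

def query_matching(A, q, tol=0.3):
    """Indices of documents whose cosine with q is at least tol, cf. (10.18)."""
    cos = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q) + 1e-12)
    return np.where(cos >= tol)[0]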
Figure 10.4: Recall versus precision diagram for query matching for Q9, using the vector
space method.
Figure 10.5: Recall versus precision diagram for query matching for Q9, using the full
vector space method (solid curve) and the rank 100 approximation (dashed).
We then compute the SVD of the term-document matrix, and use a rank
2 approximation. After projection to the two-dimensional subspace the
cosines, computed according to (10.22), are
It turns out that Document 1, which was deemed totally irrelevant for the
query in the original representation, is now highly relevant. In addition,
the scores for the relevant Documents 2-4 have been reinforced. At the
same time, the score for Document 5 has been significantly reduced.”
• However, I view it as a warning.
10.4.1. Pagerank
• Let all web pages be ordered from 1 to n, and let i be a particular web
page.
• Then Oi will denote the set of pages that i is linked to, the outlinks.
The number of outlinks is denoted Ni = |Oi |.
• The set of inlinks, denoted Ii , are the pages that have an outlink to i.
That is, the weighting is such that the rank of a page j is divided evenly
among its outlinks.
λr = Qr, λ = 1, (10.25)
Reformulation of (10.23)
Modify the matrix Q to have an eigenvalue λ = 1.
• Assume that a surfer visiting a web page, always chooses the next page
among the outlinks with equal probability.
• Assume that the random surfer never get stuck.
– In other words, there should be no web pages without outlinks
(such a page corresponds to a zero column in Q).
• Therefore the model is modified so that zero columns are replaced by
a constant value in each position.
Remark 10.26. Given the size of the Internet and reasonable assump-
tions about its structure,
• it is highly probable that the link graph is not strongly connected,
• which means that the Pagerank eigenvector of P may not be
well-defined.
One billion dollar idea, by Sergey Brin and Lawrence Page in 1996
• The Google matrix is the matrix
G = αP + (1 − α)(1/n) e e^T,   (10.31)
for some α satisfying 0 < α < 1, called the damping factor.
• Obviously G is irreducible (since G > 0) and column-stochastic.^a
• Furthermore,
e^T G = α e^T P + (1 − α)(1/n) e^T e e^T = α e^T + (1 − α) e^T = e^T.   (10.32)
• The pagerank equation reads
G r = [αP + (1 − α)(1/n) e e^T] r = r.   (10.33)
a
An n × n matrix is called a Markov matrix if all entries are nonnegative and the sum of each column vector is equal to 1. A Markov matrix is also called a stochastic matrix.
Note: The random-walk interpretation of the additional rank-one term is that, at each time step, the surfer visiting a page will jump to a random page in the whole web with probability 1 − α (sometimes referred to as teleportation).
• Recall (10.27): P = Q + (1/n) e d^T, which can be interpreted as follows.
    When a random surfer visits a web page with no outlinks, the surfer will jump to any page with an equal probability 1/n.
• The convex combination in (10.31): G = αP + (1 − α)(1/n) e e^T.
    Although there are outlinks, the surfer will jump to any page with an equal probability (1 − α)/n.
Remark 10.28. The vector (1/n) e in (10.31) can be replaced by a nonnegative vector v with ||v||_1 = 1, which can be chosen in order to make the search biased towards certain kinds of web pages. Therefore, it is referred to as a personalization vector.
The only viable method so far for Pagerank computations on the whole
web seems to be the power method.
• The rate of convergence of the power method depends on the ratio of
the second largest and the largest eigenvalue in magnitude.
• Here, we have
|1 − λ(k) | = O(αk ), (10.35)
due to Proposition 10.27.
• In view of the huge dimension of the Google matrix, it is non-trivial
to compute the matrix-vector product. We will consider some details.
Thus, when the power method begins with y^{(0)} with ||y^{(0)}||_1 = 1, the normalization step in the power method is unnecessary.
• Let us look at the multiplication in some detail:
z = [αP + (1 − α)(1/n) e e^T] y = αQy + β (e/n),   (10.38)
where
β = α d^T y + (1 − α) e^T y.   (10.39)
Note:
• From Proposition 10.27, we know that the second eigenvalue of the
Google matrix is αλ2 .
• A typical value of α = 0.85.
• Approximately k = 57 iterations are needed to reach 0.85k < 10−4 .
• This is reported to be close to the number of iterations used by Google.
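A minimal NumPy sketch of the power iteration (10.38)-(10.39) follows; it assumes the link matrix Q and the dangling-node indicator d are already available as arrays, and the stopping tolerance matches the exercise below.

import numpy as np

def pagerank(Q, d, alpha=0.85, tol=1e-4, maxit=200):
    """Power method for the Google matrix.
    Q: n x n column-stochastic link matrix (zero columns allowed);
    d: indicator vector of the zero (dangling) columns of Q."""
    n = Q.shape[0]
    y = np.ones(n) / n                                       # ||y||_1 = 1
    for k in range(maxit):
        beta = alpha * (d @ y) + (1.0 - alpha) * y.sum()     # Eq. (10.39)
        z = alpha * (Q @ y) + beta / n                       # Eq. (10.38)
        if np.abs(z - y).sum() < tol:
            return z, k + 1
        y = z
    return y, maxit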
10.1. Consider Example 10.17, p.319. Compute vectors of cosines, for each subspace ap-
proximations, i.e., with Ak where k = 1, 2, · · · , 5.
10.2. Verify equations in Derivation 10.30, p.328, particularly (10.38), (10.39), and (10.40).
10.3. Consider the link matrix Q in (10.4) and its corresponding link graph in Figure 10.2.
Find the pagerank vector r by solving the Google pagerank equation.
• You may initialize the power method with any vector r(0) satisfying ||r(0) ||1 = 1.
• Set α = 0.85.
• Let the iteration stop, when residual < 10−4 .
10.4. Now, consider a modified link matrix Q̃, obtained by adding an outlink from page 4 to page 5 in Figure 10.2. Find the pagerank vector r̃, setting parameters and initialization the same way as for the previous problem.
• Compare r with r̃.
• Compare the number of iterations for convergence.
11
CHAPTER
Quadratic Programming
Quadratic programming (QP) is the process of solving a constrained
quadratic optimization problem. That is, the objective f is quadratic and
the constraints are linear in several variables x ∈ Rn . Quadratic program-
ming is a particular type of nonlinear programming. Its general form is
min_{x∈R^n} f(x) := (1/2) x^T A x − x^T b,   subj.to
Cx = c,                     (11.1)
Dx ≤ d,
where A ∈ Rn×n is symmetric, C ∈ Rm×n , D ∈ Rp×n , b ∈ Rn , c ∈ Rm , and
d ∈ Rp .
Contents of Chapter 11
11.1.Equality Constrained Quadratic Programming . . . . . . . . . . . . . . . . . . . . . . . 332
11.2.Direct Solution for the KKT System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
11.3.Linear Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
11.4.Iterative Solution of the KKT System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
11.5.Active Set Strategies for Convex QP Problems . . . . . . . . . . . . . . . . . . . . . . . 350
11.6.Interior-point Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
11.7.Logarithmic Barriers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
Exercises for Chapter 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
Let Z ∈ Rn×(n−m) be a matrix whose columns span the null space of C, N (C),
i.e.,
CZ = 0 or Span(Z) = N (C). (11.6)
Definition 11.1. The matrix K in (11.5) is called the KKT matrix and
the matrix Z T AZ is referred to as the reduced Hessian.
Proposition 11.6.
• If A, B 0, then A · B ≥ 0, and A · B = 0 implies AB = 0.
• A symmetric matrix A is semidefinite if A · B ≥ 0 for every B 0.
K [x; 0] = [A  C^T;  C  0] [x; 0] = 0,   (11.9)
which implies the KKT matrix K is singular.
Now, we assume that the KKT matrix is singular. That is, there are x ∈ R^n, z ∈ R^m, not both zero, such that
K [x; z] = [A  C^T;  C  0] [x; z] = 0,
which implies
Ax + C^T z = 0  and  Cx = 0.   (11.10)
It follows from the above equations that
0 = x^T A x + x^T C^T z = x^T A x,
pT Ax∗ = pT (b − C T λ∗ ) = pT b − pT C T λ∗ = pT b,
[A  C^T;  C  0] [x; λ] = [b; c],   (11.19)
where the block matrix on the left is denoted by K and the second row expresses the primal feasibility.
For direct solutions of the KKT system (11.19), this section considers sym-
metric factorization, the range-space approach, and the null-space approach.
P KP T = LDLT , (11.20)
9 print('L0=\n',L0)
10 print('D=\n',D)
11 print('P=\n',P)
12 print('L0*D*L0^T=\n',L0.dot(D).dot(L0.T),'\n#-----------')
13
where wY ∈ Rm , wZ ∈ Rn−m .
7. On the other hand, substituting (11.26) into the first equation of (11.19), we get
Ax + C T λ = AY wY + AZwZ + C T λ = b. (11.28)
The reduced KKT system (11.29) can be solved easily e.g. by a Cholesky factoriza-
tion of the reduced Hessian Z T AZ ∈ R(n−m)×(n−m) .
8. Once wY and wZ have been computed as solutions of (11.27) and (11.29), respectively,
x∗ is obtained from (11.26).
9. When Lagrange multipliers λ∗ is to be computed, we multiply (11.28) by Y T and
solve the resulting equation:
(CY )T λ∗ = Y T b − Y T Ax∗ . (11.30)
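As a quick numerical check of the KKT system (11.19), one can also assemble and solve the block system directly (assuming the reduced Hessian is positive definite, so K is nonsingular). The sketch below is only an illustration of the system, not one of the structured approaches above; the example matrices are made up.

import numpy as np

def solve_kkt(A, C, b, c):
    """Solve the KKT system (11.19): [[A, C^T], [C, 0]] [x; lam] = [b; c]."""
    n, m = A.shape[0], C.shape[0]
    K = np.block([[A, C.T], [C, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([b, c]))
    return sol[:n], sol[n:]                      # x*, lambda*

# tiny example: min 1/2 x^T A x - x^T b  subj.to  x1 + x2 = 1
A = np.array([[2.0, 0.0], [0.0, 2.0]])
C = np.array([[1.0, 1.0]])
x_opt, lam_opt = solve_kkt(A, C, np.array([1.0, 1.0]), np.array([1.0]))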
Ax = b, (11.31)
A = M − N, (11.32)
M x = N x + b. (11.33)
M xk = N xk−1 + b, (11.34)
or, equivalently,
• It follows from (11.33) and (11.34) that the error equation reads
M ek = N ek−1 (11.36)
or, equivalently,
ek = M −1 N ek−1 . (11.37)
• Since
||e_k|| ≤ ||M^{-1}N|| ||e_{k-1}|| ≤ ||M^{-1}N||^2 ||e_{k-2}|| ≤ ··· ≤ ||M^{-1}N||^k ||e_0||,   (11.38)
the iteration converges for every initial error e_0 provided that
||M^{-1}N|| < 1.   (11.39)
connecting from Pi to Pj .
|z − aii | ≤ Λi , 1 ≤ i ≤ n. (11.43)
|λ − 2| < 2
Positiveness
Definition 11.21. An n × n complex-valued matrix A = [a_{ij}] is diagonally dominant if
|a_{ii}| ≥ Λ_i := Σ_{j=1, j≠i}^{n} |a_{ij}|,   (11.44)
Implementation of (11.51):
compute r_k = (r_k^{(1)}, r_k^{(2)})^T := β − Kψ_k;
compute L r_k = [ r_k^{(1)};  −C Â^{-1} r_k^{(1)} + r_k^{(2)} ];           (11.52)
solve M_1 Δψ_k = L r_k;
set ψ_{k+1} = ψ_k + Δψ_k;
(C i and D i are row vectors; we will deal with them like column vectors.)
The primal active set strategy is an iterative procedure:
• Given a feasible iterate xk , k ≥ 0, we determine its active set
xk+1 = xk − αk pk , (11.60)
Remark 11.31. (Update for I_ac(x_{k+1})). Let's begin with defining the set of blocking constraints:
I_bl(p_k) := { i ∉ I_ac(x_k) : D_i^T p_k < 0,  (D_i^T x_k − d_i)/(D_i^T p_k) ≤ 1 }.   (11.65)
Then we specify I_ac(x_{k+1}) by adding the most restrictive blocking constraint to I_ac(x_k):
I_ac(x_{k+1}) := I_ac(x_k) ∪ { j ∈ I_bl(p_k) : (D_j^T x_k − d_j)/(D_j^T p_k) = α̂_k }.   (11.66)
For such a j (the index of the newly added constraint), we clearly have
D_j^T x_{k+1} = D_j^T x_k − α_k D_j^T p_k = D_j^T x_k − α̂_k D_j^T p_k = d_j,   (11.67)
Newton's method:
• Given a feasible iterate (x, μ, z) = (x_k, μ_k, z_k), we introduce a duality measure θ:
θ := (1/p) Σ_{i=1}^{p} z_i μ_i = z^T μ / p.   (11.75)
where
∇F(x, μ, z) := [ A  D^T  0;  D  0  I;  0  Z  M ].
• The new iterate (x_{k+1}, μ_{k+1}, z_{k+1}) is then determined by means of
(x_{k+1}, μ_{k+1}, z_{k+1}) = (x_k, μ_k, z_k) + α (Δx, Δμ, Δz),   (11.77)
Algorithms based on barrier functions are iterative methods where the it-
erates are forced to stay within the interior of the feasible set:
def
F int == {x ∈ Rn | D Ti x − di < 0, 1 ≤ i ≤ p}. (11.79)
βk → 0, as k → ∞. (11.81)
Then there holds:
1. For any β > 0, the logarithmic barrier function Bβ (x) is convex in
F int and attains a minimizer xβ ∈ F int .
2. There is a unique minimizer; any local minimizer xβ is also a
global minimizer of Bβ (x).
Recall: The KKT conditions for (11.78) are given in (11.71), p. 353:
∇x Q(x) + DT µ = Ax − b + DT µ = 0,
Dx − d ≤ 0,
(11.83)
µi (Dx − d)i = 0, i = 1, 2, · · · , p,
µi ≥ 0, i = 1, 2, · · · , p.
(b) Now, considering the inequality constraints, discuss why the problem is yet ad-
mitting a unique solution α∗ .
11.6. Consider the following inequality-constrained QP problem
min_{x∈R^3} Q(x) := (x_1 − 1)^2 + 2(x_2 − 2)^2 + 2(x_3 − 2)^2 − 2x_1 x_2,   subj.to
x_1 + x_2 − 3x_3 ≤ 0,             (11.93)
4x_1 − x_2 + x_3 ≤ 1.
Implement the interior-point method with the Newton’s iterative update, pre-
sented in Section 11.6, to solve the QP problem (11.93).
(a) As in Exercise 6, you should first formulate the KKT system for (11.93) by iden-
tifying A ∈ R3×3 , D ∈ R2×3 , and vectors b ∈ R3 and d ∈ R2 .
(b) Choose a feasible initial value (x0 , µ0 , z0 ).
(c) Select the algorithm parameter σ ∈ [0, 1] appropriately for Newton’s method.
(d) Discuss how to determine α in (11.77) in order for the new iterate (xk+1 , µk+1 , zk+1 )
to stay feasible.
P
APPENDIX
Projects
Contents of Projects
P.1. mCLESS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
P.2. Noise-Removal and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
P.3. Gaussian Sailing to Overcome Local Minima Problems . . . . . . . . . . . . . . . . . . 379
P.4. Quasi-Newton Methods Using Partial Information of the Hessian . . . . . . . . . . . . 381
P.5. Effective Preprocessing Technique for Filling Missing Data . . . . . . . . . . . . . . . . 383
P.1. mCLESS
Note: Some machine learning algorithms are considered as black
boxes, because
• the models are sufficiently complex and
• they are not straightforwardly interpretable to humans.
Lack of interpretability in predictive models can undermine trust
in those models, especially in health care, in which so many decisions
are – literally – life and death issues [59].
Project Objectives
• Activation:
φ(z) = { 1, if z ≥ θ;  0, otherwise },   (P.1.2)
where θ is a threshold. When the logistic sigmoid function is chosen for the activation function, i.e., φ(z) = 1/(1 + e^{−z}), the resulting classifier is called the Logistic Regression.
Remark P.2. Note that the net input in (P.1.1) represents a hyper-
plane in Rd .
• More complex neural networks can be built, stacking the simple ar-
tificial neurons as building blocks.
• Machine learning (ML) is to train weights from datasets of an arbi-
trary number of classes.
– The weights must be trained in such a way that data points in a
class are heavily weighted by the corresponding part of weights.
• The activation function is incorporated in order
(a) to keep the net input restricted to a certain limit as per
our requirement and, more importantly,
(b) to add nonlinearity to the network.
where the number in () in the superscript denotes the class that the
point belongs to.
• Let’s consider an artificial neural network of the identity activation
and no hidden layer, for simplicity.
where the j-th column weights heavily points in the j-th class.
– Define the source matrix:
B = [δci ,j ] ∈ RN ×3 . (P.1.7)
For example, if the i-th point is in Class 0, then the i-th row of B is
[1, 0, 0].
(A^T A) Ŵ = A^T B,   A^T A ∈ R^{3×3}.   (P.1.9)
Prediction
The prediction step in the mCLESS is quite simple:
(a) Let [x_1, x_2] be a new point.
(b) Compute
[1, x_1, x_2] Ŵ = [p_0, p_1, p_2],   Ŵ ∈ R^{3×3}.   (P.1.10)
Note. Ideally, if the point [x_1, x_2] is in class j, then p_j is near 1, while the others would be near 0. Thus p_j is the largest.
(c) Decide the class c:
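A compact NumPy sketch of the training step (P.1.9) and the prediction steps (a)-(c) is given below; the function and variable names are placeholders, and the least-squares call is one way (among several) to solve the normal equation.

import numpy as np

def mcless_fit(X, y, n_class):
    """X: N x d points, y: integer labels in {0,...,n_class-1}."""
    N = X.shape[0]
    A = np.column_stack([np.ones(N), X])                    # information matrix, (P.1.5)
    B = np.zeros((N, n_class)); B[np.arange(N), y] = 1.0    # source matrix, (P.1.7)
    W, *_ = np.linalg.lstsq(A, B, rcond=None)               # solves (A^T A) W = A^T B
    return W

def mcless_predict(X, W):
    A = np.column_stack([np.ones(X.shape[0]), X])
    P = A @ W                                               # (P.1.10)
    return np.argmax(P, axis=1)                             # pick the largest p_j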
for all points in the dataset. Compare the equation with (P.1.3).
• The corresponding expanded information and weight matrices read
Ã = [ 1  x_11  x_12  ···  x_1d  σ(x_1)
      1  x_21  x_22  ···  x_2d  σ(x_2)
      ⋮                          ⋮
      1  x_N1  x_N2  ···  x_Nd  σ(x_N) ],
W̃ = [ w_0^{(0)}      w_0^{(1)}      ···  w_0^{(C−1)}
      w_1^{(0)}      w_1^{(1)}      ···  w_1^{(C−1)}
      ⋮
      w_d^{(0)}      w_d^{(1)}      ···  w_d^{(C−1)}
      w_{d+1}^{(0)}  w_{d+1}^{(1)}  ···  w_{d+1}^{(C−1)} ],        (P.1.15)
where Ã ∈ R^{N×(d+2)}, W̃ ∈ R^{(d+2)×C}, and C is the number of classes.
• Feature expansion can be performed multiple times. When α features are added, the optimal weight matrix Ŵ ∈ R^{(d+1+α)×C} is the least-squares solution of
(Ã^T Ã) Ŵ = Ã^T B,   (P.1.16)
where Ã^T Ã ∈ R^{(d+1+α)×(d+1+α)} and B is the same as in (P.1.7).
5 def generate_data(n,scale,theta):
6 # Normally distributed around the origin
7 x = np.random.normal(0,1, n); y = np.random.normal(0,1, n)
8 P = np.vstack((x, y)).T
9 # Transform
10 sx,sy = scale
11 S = np.array([[sx,0],[0,sy]])
12 c,s = np.cos(theta), np.sin(theta)
13 R = np.array([[c,-s],[s,c]]).T #T, due to right multiplication
14 return P.dot(S).dot(R)
15
16 def synthetic_data():
17 N=0
18 plt.figure()
19 for i in range(N_CLASS):
20 scale = SCALE[i]; theta = THETA[i]; N+=N_D1
21 D1 = generate_data(N_D1,scale,theta) +TRANS[i]
22 D1 = np.column_stack((D1,i*np.ones([N_D1,1])))
23 if i==0: DATA = D1
24 else: DATA = np.row_stack((DATA,D1))
25 plt.scatter(D1[:,0],D1[:,1],s=15,c=COLOR[i],marker=MARKER[i])
26
27 np.savetxt(DAT_FILENAME,DATA,delimiter=',',fmt=FORMAT)
28 print(' saved: %s' %(DAT_FILENAME))
29
39 if __name__ == '__main__':
40 synthetic_data()
GLOBAL_VARIABLES.py
1 import numpy as np
2 import matplotlib.pyplot as plt
3
4 N_D1 = 100
5 FORMAT = '%.3f','%.3f','%d'
6
14 N_CLASS = len(SCALE)
15
16 DAT_FILENAME = 'synthetic.data'
17 FIG_FILENAME = 'synthetic-data.png'
18 FIG_INTERPRET = 'synthetic-data-interpret.png'
19
20 def myfigsave(figname):
21 plt.savefig(figname,bbox_inches='tight')
22 print(' saved: %s' %(figname))
What to do
1. Implement mCLESS:
• Training. You should implement modules for each of (P.1.5) and
(P.1.7). Then use Xtrain and ytrain to get A and B.
• Test. Use the same module (implemented for A) to get Atest from Xtest. Then compute P = Atest·Ŵ as in (P.1.10). Now, you can get the prediction using
prediction = np.argmax(P,axis=1);
which may be compared with ytest to obtain accuracy.
2. Use following datasets:
• Synthetic datasets. Generate two different synthetic datasets:
1 Use Line 7 in GLOBAL_VARIABLES.py
2 Use Line 8 in GLOBAL_VARIABLES.py
• Real datasets. Use public datasets such as iris and wine.
To get the public datasets, you may use:
from sklearn import datasets
data_read1 = datasets.load_iris()
data_read2 = datasets.load_wine()
3. Compare the performance of mCLESS with
• LogisticRegression(max_iter = 1000)
• KNeighborsClassifier(5)
• SVC(gamma=2, C=1)
• RandomForestClassifier(max_depth=5, n_estimators=50, max_features=1)
See Section 1.3.
4. (Optional for Undergraduate Students) Add modules for feature expansion, as described on page 369.
• For this, try to find an interpretable strategy for choosing an effective point p such that the feature expansion with (P.1.17) improves accuracy.
• Repeat the experiments in Steps 1-3.
You may start with the machine learning modelcode in Section 1.3; add
your own modules.
Project Objectives
For each of selected datasets, you will design the best model for
noise-removal and classification.
Confidence Region
A confidence score indicates the likelihood that a machine learning
model assigns the respective intent correctly.
Figure 5.16: Illustration for how a new data point (?) is assigned the triangle class label,
based on majority voting, when k = 5.
PCA
Recall: (PCA), p. 189. Consider a data matrix X ∈ RN ×d :
◦ each of the N rows represents a different data point,
◦ each of the d columns gives a particular kind of feature, and
◦ each column has zero empirical mean (e.g., after standardization).
where λ_1 ≥ λ_2 ≥ ··· ≥ λ_k ≥ 0.
• (Remark 7.4, p. 152). The matrix Z_k ∈ R^{N×k} consists of scaled eigenvectors of XX^T:
Z_k = [√λ_1 u_1 | √λ_2 u_2 | ··· | √λ_k u_k],   (XX^T) u_j = λ_j u_j,  u_i^T u_j = δ_ij.   (P.2.3)
• A data (row) vector x (new or old) is transformed to a k-dimensional row vector of principal components
z = x W_k ∈ R^{1×k}.   (P.2.4)
Then,
V ≅ W;  σ_j^2 = λ_j,  j = 1, 2, ···, d,
Z_k = [σ_1 u_1 | σ_2 u_2 | ··· | σ_k u_k].   (P.2.5)
• For each class, one may perform PCA along with the SVD.
for c in range(nclass):
Xc = X[y==c]; CC = np.mean(Xc,axis=0)
U, s, VT = svd(Xc-CC,full_matrices=False)
• Let μ^{(c)} = CC[c], V^{(c)} = [v_1^{(c)}, ···, v_d^{(c)}], and σ_j^{(c)} = s[j], the j-th singular value for Class c. Define an anisotropic distance as
γ^{(c)}(x) = Σ_{j=1}^{d} [ ((x − μ^{(c)}) · v_j^{(c)}) / σ_j^{(c)} ]^2.   (P.2.6)
Remark P.10. Let r_max^{(c)} be
(r_max^{(c)})^2 = max_{x∈X^{(c)}} γ^{(c)}(x).   (P.2.7)
Then γ^{(c)}(x) = (r_max^{(c)})^2 approximates the MVEE relatively well. See Figure P.4 (a).
Figure P.4: Approximate MVEEs for: (a) the dataset and (b) the confidence regions.
Note: You must first find confidence regions for the training dataset,
which can be viewed as denoising. You may then begin the training
step with the denoised dataset.
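A sketch of the PCA-based confidence region, following (P.2.6)-(P.2.7): per class, points whose anisotropic distance exceeds a percentile of the in-class distances are treated as noise. The 95th-percentile cutoff (rather than the exact r_max of (P.2.7)) and the function name are assumptions of the sketch.

import numpy as np
from numpy.linalg import svd

def pca_denoise(X, y, keep=95):
    """Keep, per class, the points inside an approximate confidence region."""
    mask = np.zeros(len(X), dtype=bool)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        Xc = X[idx]; CC = Xc.mean(axis=0)
        U, s, VT = svd(Xc - CC, full_matrices=False)
        proj = (Xc - CC) @ VT.T                   # coordinates along v_j^(c)
        gamma = ((proj / s) ** 2).sum(axis=1)     # anisotropic distance (P.2.6)
        mask[idx] = gamma <= np.percentile(gamma, keep)
    return X[mask], y[mask]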
What to do
1. Download PCA-KNN-Denoising.PY.tar:
https://skim.math.msstate.edu/LectureNotes/data/PCA-KNN-
Denoising.PY.tar
2. Compose a denoising-and-classification code, using appropriate
functions from the downloaded package.
• You must implement both denoising algorithms: the k-NN-based
and the PCA-based.
3. Use similar datasets, utilized for Project 1. mCLESS, Section P.1:
• Select a synthetic dataset, using Line 7 or 8 in GLOBAL_VARIABLES.py.
• Real datasets. Use public datasets such as iris and wine.
To get the public datasets, you may use:
from sklearn import datasets
data_read1 = datasets.load_iris()
data_read2 = datasets.load_wine()
4. Compare the performances of the classifiers with and without denoising:
• LogisticRegression(max_iter = 1000)
• KNeighborsClassifier(5)
• SVC(gamma=2, C=1)
• RandomForestClassifier(max_depth=5, n_estimators=50, max_features=1)
5. (Optional for Undergraduate Students)
Add modules for clustering-and-PCA denoising.
• For example, the MVEE does not make sense for a half-moon dataset.
• Add another dataset, such as
from sklearn.datasets import make_moons
X, y = make_moons(noise=0.2, n_samples=400, random_state=12)
• Perform k-Means cluster analysis for each class, with k = 4.
• For each cluster in each class, perform the PCA-based denoising.
• Carry out Steps 2–4.
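A minimal sketch of this clustering-and-PCA idea on the half-moon data; the per-cluster
test reuses the anisotropic distance (P.2.6), k = 4 follows the step above, and the 95%
quantile threshold is an assumption.

import numpy as np
from numpy.linalg import svd
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y = make_moons(noise=0.2, n_samples=400, random_state=12)

keep = np.zeros(len(y), dtype=bool)
for c in np.unique(y):
    idx = np.where(y == c)[0]
    # k-Means with k = 4 within this class
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X[idx])
    for cl in range(4):
        sub = idx[labels == cl]
        mu = X[sub].mean(axis=0)
        U, s, VT = svd(X[sub] - mu, full_matrices=False)
        s = np.maximum(s, 1e-12)
        # anisotropic distance (P.2.6) to this cluster's ellipsoid
        g = np.sum(((X[sub] - mu) @ VT.T / s) ** 2, axis=1)
        keep[sub] = g <= np.quantile(g, 0.95)

X_dn, y_dn = X[keep], y[keep]
print(f"kept {keep.sum()} of {len(y)} points")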
for cw ≥ 0. Since

    [ 1   cw   1  ]   [ 1  ]
    [ cw  cw²  cw ] = [ cw ] [ 1  cw  1 ],
    [ 1   cw   1  ]   [ 1  ]

the convolution smoothing can be implemented easily and conveniently as

    S ∗ A = (1 / (2 + cw)²) [ 1, cw, 1 ]^T ∗ ( [ 1, cw, 1 ] ∗ A ).
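A minimal sketch of this separable implementation, using scipy.ndimage.convolve1d
to apply the filter [1, cw, 1] along each axis and normalizing by (2 + cw)²; the
boundary mode is an assumption, since the notes do not fix one here.

import numpy as np
from scipy.ndimage import convolve1d

def smooth(A, cw=1.0, mode="nearest"):
    """One application of S * A via the separable factorization above."""
    w = np.array([1.0, cw, 1.0])
    B = convolve1d(A, w, axis=0, mode=mode)   # column direction: [1, cw, 1]^T
    B = convolve1d(B, w, axis=1, mode=mode)   # row direction:    [1, cw, 1]
    return B / (2.0 + cw) ** 2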
Tasks to do
1. Go to http://www.sfu.ca/~ssurjano/optimization.html (Virtual Library
of Simulation Experiments) and select three functions of your interest
that have multiple local minima (for example, the Ackley function and
the Griewank function).
2. Store each of the functions in a 2D array A of sufficiently large dimensions.
3. Compute
    Aσ = gaussian_filter(A, σ),   or   At = S ∗ S ∗ · · · ∗ S ∗ A   (applied t times).
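A minimal sketch of Steps 2–3 for one test function, the Ackley function in its
standard 2D form (a = 20, b = 0.2, c = 2π); the grid extent, resolution, and σ
values are assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

# Step 2: store the function on a sufficiently large 2D grid.
x = np.linspace(-5, 5, 401)
X, Y = np.meshgrid(x, x)
A = (-20 * np.exp(-0.2 * np.sqrt(0.5 * (X**2 + Y**2)))
     - np.exp(0.5 * (np.cos(2 * np.pi * X) + np.cos(2 * np.pi * Y)))
     + 20 + np.e)

# Step 3: Gaussian smoothing for a few values of sigma.
i, j = np.unravel_index(np.argmin(A), A.shape)
print(f"raw grid : minimum at (x, y) = ({x[j]:+.3f}, {x[i]:+.3f})")
for sigma in (2, 5, 10):
    A_sigma = gaussian_filter(A, sigma)
    i, j = np.unravel_index(np.argmin(A_sigma), A_sigma.shape)
    print(f"sigma = {sigma:2d}: minimum at (x, y) = ({x[j]:+.3f}, {x[i]:+.3f})")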
You may work in a group of two people; however, you must report individually.
P.4. Quasi-Newton Methods Using Partial Information of the Hessian
    Ĥ q̂ = ∇f̂(xn).   (P.4.5)

    σ ≅ ‖∇f̂(xn)‖ / ‖q̂‖.   (P.4.7)
• Comparisons:
– Implement (or download codes for) the original Newton's method
and one of the quasi-Newton methods (e.g., BFGS).
– Let us call our method the partial Hessian (PH)-Newton method.
Compare the PH-Newton method with those known methods for: the num-
ber of iterations, the total elapsed time, convergence behavior, and
stability/robustness.
– Test with, e.g., the Rosenbrock function defined on R^d, d ≥ 10, with
various initial points x0.
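For the reference methods, scipy.optimize provides ready-made Newton-type and BFGS
solvers together with the d-dimensional Rosenbrock function; the baseline sketch
below only indicates the PH-Newton method by a placeholder, since that is your own
implementation.

import time
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

d = 10
rng = np.random.default_rng(1)
x0 = rng.uniform(-2, 2, size=d)       # one of several initial points to try

for method, kwargs in [
    ("Newton-CG", dict(jac=rosen_der, hess=rosen_hess)),   # Newton-type solver
    ("BFGS", dict(jac=rosen_der)),                         # quasi-Newton solver
]:
    t0 = time.perf_counter()
    res = minimize(rosen, x0, method=method, **kwargs)
    print(f"{method:>9s}: nit = {res.nit:3d}, f = {res.fun:.3e}, "
          f"time = {time.perf_counter() - t0:.3f}s")

# ph_newton(rosen, rosen_der, x0, ...)   # your PH-Newton method goes here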
P.5. Effective Preprocessing Technique for Filling Missing Data
Objectives.
• Algorithm Development:
– Think of a good strategy or two for filling the missing values, if it
is to be done manually.
– What information available from the dataset is useful and helps us
build a good strategy?
– Suggestions: Use nearby values to interpolate; try to employ the
concept of feature importance, if available.
• Comparisons:
– For the Wine dataset, for example, erase r% of the data values at
random; r = 5, 10, 20.
– Compare your new filling strategy with (1) the simple sample-removal
method and (2) the imputation strategy using the mean, median, or mode.
– Perform accuracy analysis for various classifiers, e.g., logistic
regression, support vector machines, and random forests.
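A minimal sketch of this comparison protocol on the Wine dataset, with mean
imputation as the baseline filling strategy; your own strategy would replace the
SimpleImputer step, and the 70/30 split and the logistic-regression pipeline are
assumptions.

import numpy as np
from sklearn import datasets
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = datasets.load_wine(return_X_y=True)

for r in (5, 10, 20):
    # Erase r% of the data values at random.
    X_miss = X.copy()
    mask = rng.random(X.shape) < r / 100.0
    X_miss[mask] = np.nan

    Xtr, Xte, ytr, yte = train_test_split(X_miss, y, test_size=0.3,
                                          random_state=0, stratify=y)
    # Baseline: mean imputation (try "median" or "most_frequent" as well).
    model = make_pipeline(SimpleImputer(strategy="mean"),
                          StandardScaler(),
                          LogisticRegression(max_iter=1000))
    model.fit(Xtr, ytr)
    print(f"r = {r:2d}%: mean-imputed accuracy = {model.score(Xte, yte):.3f}")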
Index
:, Python slicing, 21
__init__() constructor, 31
acceptance criterion, 85
activation function, 363
active set, 350
Adaline, 47, 95, 363
adaline, 37
adaptive linear neurons, 37
adaptive step size, 63
agglomerative clustering, 224, 226
AGNES, 224, 228
algorithmic parameter, 49
anisotropic distance, 376
API, 294
application programming interface, 294
approximate complementarity, 357
approximate Hessian, 83
argmax, numpy, 366
artificial neural networks, 92
artificial neurons, 38
artificial topographic maps, 249
association rule discovery, 300
attributes, 31
averaged perceptron, 44
averaging operator, 379
back-propagation, 278
back-propagation algorithm, 283
backtracking line search, 63
barrier functions, 355
barrier parameter, 355
batch gradient descent, 51
best split, 126
between-class scatter, 176
between-class scatter matrix, 176, 178
BFGS algorithm, 72
bias, 40, 100, 266
bias-variance tradeoff, 101
binary classifier, 38
binary decision tree, 123
binding constraint, 111
binomial distribution, 96
bisecting K-Means algorithm, 217
black boxes, 362
bootstrap sample, 127
border point, 232
C++, 294
call_get_cubes.py, 25
categorical data, 137
Cauchy, Augustin-Louis, 56
central path, 354
centroid, 204
centroid linkage, 227
chain rule, 265
chi-squared error criterion, 81
child class, 34
CIFAR, 265
CLARA, 221, 222
CLARANS, 223
class, 29, 30
class attribute, 33
class-membership probability, 95
Classes.py, 34
classification, 2, 300
classification error, 124
Clay Institute, 207
cluster analysis, 199
cluster cohesion, 242
cluster separation, 242
cluster validation, 236
clustering, 2, 11, 199, 300
CNN, 265, 286
code block, 20
column-stochastic matrix, 323