Perceptron Example (Practice Que)
Perceptron Node – Threshold Logic Unit

[Figure: perceptron node with inputs x1 … xn, weights w1 … wn, threshold θ, and output z]

z = 1 if Σ(i=1..n) xi wi ≥ θ
z = 0 if Σ(i=1..n) xi wi < θ
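As a concrete illustration, here is a minimal Python sketch of the threshold logic unit above; the function and argument names are my own, not from the slides.

def perceptron_output(x, w, theta):
    """Threshold logic unit: output 1 if the weighted sum reaches theta, else 0."""
    net = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if net >= theta else 0

# Example: two inputs with weights .4 and -.2 and threshold .1 (the values used later in the slides)
print(perceptron_output([0.8, 0.3], [0.4, -0.2], 0.1))  # net = 0.26 >= 0.1 -> 1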
Perceptron Learning Algorithm

[Figure: perceptron with weights w1 = .4, w2 = -.2 and threshold θ = .1]

z = 1 if Σ(i=1..n) xi wi ≥ θ
z = 0 if Σ(i=1..n) xi wi < θ

Training set:  x1   x2   t
               .8   .3   1
               .4   .1   0
First Training Instance

Apply the first pattern (.8, .3) with target t = 1:
net = .8(.4) + .3(-.2) = .26 ≥ θ = .1, so z = 1
The output matches the target, so no weight change is needed.
Second Training Instance

Apply the second pattern (.4, .1) with target t = 0:
net = .4(.4) + .1(-.2) = .14 ≥ θ = .1, so z = 1
The output does not match the target, so the weights are updated: Δwi = (t - z) * c * xi
Perceptron Rule Learning

Δwi = c(t – z) xi

where wi is the weight from input i to the perceptron node, c is the learning rate, t is the target for the current instance, z is the current output, and xi is the ith input.

Least perturbation principle
◦ Only change weights if there is an error
◦ Use a small c rather than changing the weights enough to make the current pattern correct
◦ Scale the change by xi

A minimal code sketch of this rule follows.
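The update can be written directly from the formula. Below is a small Python sketch (my own naming, not from the slides) that applies the perceptron rule to one training pattern; c = 1 is assumed for illustration, since the slides do not give a learning rate for this example.

def perceptron_rule_update(w, x, t, z, c):
    """Perceptron rule: w_i <- w_i + c * (t - z) * x_i.
    No change when the output z already equals the target t."""
    return [wi + c * (t - z) * xi for wi, xi in zip(w, x)]

# Second training instance from the previous slides:
# x = (.4, .1), t = 0, z = 1, so each weight moves by -x_i.
print(perceptron_rule_update([0.4, -0.2], [0.4, 0.1], t=0, z=1, c=1))  # [0.0, -0.3]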
Augmented Pattern Vectors

1 0 1 -> 0
1 0 0 -> 1

Augmented version:
1 0 1 1 -> 0
1 0 0 1 -> 1

Treat the threshold like any other weight. No special case. Call it a bias since it biases the output up or down.
Since we start with random weights anyway, we can ignore the -θ notation and just think of the bias as an extra available weight. (Note: the author uses a -1 input.)
Always use a bias weight.
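A quick sketch of the augmentation, assuming the bias input is appended as a constant 1 (the slides note that a -1 input is also used):

def augment(pattern):
    """Append a constant bias input of 1 to a pattern vector."""
    return list(pattern) + [1]

print(augment([1, 0, 1]))  # [1, 0, 1, 1]
print(augment([1, 0, 0]))  # [1, 0, 0, 1]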
Perceptron Rule Example

Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0).
Assume a learning rate c of 1 and initial weights all 0: Δwi = c(t – z) xi

Training set:
0 0 1 -> 0
1 1 1 -> 1
1 0 1 -> 1
0 1 1 -> 0
Example

Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0), learning rate c = 1, and initial weights all 0.

Pattern (x1 x2 x3 bias)   Target   Weights before   Net   z   ΔW
0 0 1 1                   0        0  0  0  0       0     0   0  0  0  0
1 1 1 1                   1        0  0  0  0       0     0   1  1  1  1
1 0 1 1                   1        1  1  1  1       3     1   0  0  0  0
0 1 1 1                   0        1  1  1  1       3     1   0 -1 -1 -1
0 0 1 1                   0        1  0  0  0       0     0   0  0  0  0
1 1 1 1                   1        1  0  0  0       1     1   0  0  0  0
1 0 1 1                   1        1  0  0  0       1     1   0  0  0  0
0 1 1 1                   0        1  0  0  0       0     0   0  0  0  0

After the fourth pattern the weights settle at (1, 0, 0, 0); the second epoch makes no further changes, so training has converged.
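Below is a short Python sketch (illustrative naming, not from the slides) that runs the perceptron rule over this training set and prints a per-pattern trace matching the table above (with the weights shown after each update):

def train_perceptron(patterns, targets, c=1, epochs=2):
    """Perceptron rule with a bias input of 1; outputs 1 if net > 0, else 0."""
    w = [0] * (len(patterns[0]) + 1)          # weights start at 0; the last one is the bias weight
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            x = list(x) + [1]                 # augment with the bias input
            net = sum(xi * wi for xi, wi in zip(x, w))
            z = 1 if net > 0 else 0
            dw = [c * (t - z) * xi for xi in x]
            w = [wi + dwi for wi, dwi in zip(w, dw)]
            print(x, t, net, z, dw, w)
    return w

patterns = [(0, 0, 1), (1, 1, 1), (1, 0, 1), (0, 1, 1)]
targets = [0, 1, 1, 0]
print(train_perceptron(patterns, targets))    # final weights [1, 0, 0, 0]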
Linear Separability

Linear Separability and Generalization

Limited Functionality of Hyperplane
How to Handle Multi-Class Output

This is an issue with any learning model which only supports binary classification (perceptron, SVM, etc.).

Create one perceptron for each output class, where the training set considers all other classes to be negative examples (one-vs-rest; see the sketch below)
◦ Run all perceptrons on novel data and set the output to the class of the perceptron which outputs high
◦ If there is a tie, choose the perceptron with the highest net value

Create one perceptron for each pair of output classes, where the training set only contains examples from the 2 classes (one-vs-one)
◦ Run all perceptrons on novel data and set the output to be the class with the most wins (votes) from the perceptrons
◦ In case of a tie, use the net values to decide
◦ The number of models grows with the square of the number of output classes
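As an illustration of the one-vs-rest scheme, here is a minimal Python sketch (names, weight values, and the fallback when no perceptron fires are my own assumptions): each class has its own weight vector, the predicted class is one whose perceptron outputs high, and ties are broken by the largest net value as described above.

def predict_one_vs_rest(x, class_weights, theta=0.0):
    """class_weights maps class label -> weight vector (same length as x)."""
    nets = {label: sum(xi * wi for xi, wi in zip(x, w))
            for label, w in class_weights.items()}
    firing = [label for label, net in nets.items() if net >= theta]
    candidates = firing if firing else list(nets)           # if no perceptron fires, fall back to all classes
    return max(candidates, key=lambda label: nets[label])   # break ties by highest net value

# Hypothetical 2-feature example with three classes:
weights = {"A": [0.5, -0.1], "B": [-0.2, 0.7], "C": [0.1, 0.1]}
print(predict_one_vs_rest([1.0, 1.0], weights))  # nets: A=0.4, B=0.5, C=0.2 -> "B"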
UC Irvine Machine Learning Database

Iris Data Set
4.8,3.0,1.4,0.3, Iris-setosa
5.1,3.8,1.6,0.2, Iris-setosa
4.6,3.2,1.4,0.2, Iris-setosa
5.3,3.7,1.5,0.2, Iris-setosa
5.0,3.3,1.4,0.2, Iris-setosa
7.0,3.2,4.7,1.4, Iris-versicolor
6.4,3.2,4.5,1.5, Iris-versicolor
6.9,3.1,4.9,1.5, Iris-versicolor
5.5,2.3,4.0,1.3, Iris-versicolor
6.5,2.8,4.6,1.5, Iris-versicolor
6.0,2.2,5.0,1.5, Iris-virginica
6.9,3.2,5.7,2.3, Iris-virginica
5.6,2.8,4.9,2.0, Iris-virginica
7.7,2.8,6.7,2.0, Iris-virginica
6.3,2.7,4.9,1.8, Iris-virginica
Objective Functions: Accuracy/Error

How do we judge the quality of a particular model (e.g. a perceptron with a particular setting of weights)?
Consider how accurate the model is on the data set
◦ Classification accuracy = # correct / total instances
◦ Classification error = # misclassified / total instances (= 1 – accuracy)
For nominal data, pattern error is typically 1 for a mismatch and 0 for a match
◦ For nominal (including binary) outputs and targets, SSE and classification error are equivalent
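A small sketch of these two measures (illustrative code and values, not from the slides):

def classification_accuracy(targets, outputs):
    """Fraction of instances where the output matches the target."""
    correct = sum(1 for t, z in zip(targets, outputs) if t == z)
    return correct / len(targets)

targets, outputs = [0, 1, 1, 0], [0, 1, 0, 0]
acc = classification_accuracy(targets, outputs)
print(acc, 1 - acc)  # 0.75 0.25  (accuracy, classification error)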
Mean Squared Error

Mean Squared Error (MSE) – SSE/n, where n is the number of instances in the data set
◦ This can be nice because it normalizes the error for data sets of different sizes
◦ MSE is the average squared error per pattern

Root Mean Squared Error (RMSE) – the square root of the MSE
◦ This puts the error value back into the same units as the targets and can thus be more intuitive
◦ RMSE is a typical distance (error) of the outputs from the targets, in the same scale as the targets
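A quick numeric sketch of SSE, MSE, and RMSE (illustrative values, not from the slides):

import math

targets = [1.0, 0.0, 1.0, 0.0]
outputs = [0.8, 0.2, 0.6, 0.1]

sse = sum((t - z) ** 2 for t, z in zip(targets, outputs))  # sum squared error
mse = sse / len(targets)                                   # average squared error per pattern
rmse = math.sqrt(mse)                                      # back in the same units as the targets
print(sse, mse, rmse)  # 0.25 0.0625 0.25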
Gradient Descent Learning: Minimize (Maximize) the Objective Function

[Figure: error landscape – SSE (sum squared error), Σ(ti – zi)², plotted against the weight values]
Deriving a Gradient Descent Learning Algorithm

The goal is to decrease the overall error (or other objective function) each time a weight is changed.
Total sum squared error is one possible objective function: E = Σ(ti – zi)²
Seek a weight-changing algorithm such that ∂E/∂wij is negative.
If such a formula can be found, then we have a gradient descent learning algorithm.
The delta rule is a variant of the perceptron rule which gives a gradient descent learning algorithm.
Delta Rule Algorithm

The delta rule uses (target – net), i.e. the net value before it goes through the threshold, in the learning rule that decides the weight update.
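Under that description, the per-pattern update looks like the perceptron rule but with net in place of the thresholded output z. A minimal Python sketch (my own naming; a plain, non-batch version):

def delta_rule_update(w, x, t, c):
    """Delta rule: w_i <- w_i + c * (t - net) * x_i, where net is the unthresholded sum."""
    net = sum(xi * wi for xi, wi in zip(x, w))
    return [wi + c * (t - net) * xi for wi, xi in zip(w, x)]

# Same second training instance as earlier (x = (.4, .1), t = 0, c = 1), bias left out for brevity:
print(delta_rule_update([0.4, -0.2], [0.4, 0.1], t=0, c=1))
# net = 0.14, so each weight moves by -0.14 * x_i -> [0.344, -0.214]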
Perceptron Rule vs Delta Rule

The perceptron rule (target – thresholded output) is guaranteed to converge to a separating hyperplane if the problem is linearly separable. Otherwise it may not converge – it could get stuck in a cycle.

The single-layer delta rule is guaranteed to have only one global minimum. Thus it will converge to the best SSE solution whether the problem is linearly separable or not.
◦ It could have a higher misclassification rate than the perceptron rule and a less intuitive decision surface – we will discuss this with regression

Stopping criteria – for these models, stop when no longer making progress
◦ When you have gone a few epochs with no significant improvement/change between epochs (including oscillations); a sketch of such a check follows
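One way to code that stopping check (a sketch under my own assumptions: "progress" is measured as a drop in per-epoch error, and the patience and tolerance values are arbitrary):

def should_stop(epoch_errors, patience=5, tolerance=1e-4):
    """Stop when the last `patience` epochs show no significant improvement over the best earlier error."""
    if len(epoch_errors) <= patience:
        return False
    best_before = min(epoch_errors[:-patience])
    recent_best = min(epoch_errors[-patience:])
    return recent_best > best_before - tolerance

print(should_stop([0.9, 0.5, 0.30, 0.31, 0.30, 0.31, 0.30, 0.31]))  # True: no meaningful improvement lately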