Lec03 Neural Networks
Biological Motivation
• The study of artificial neural networks (ANNs) has been inspired in part
by the observation that biological learning systems are built of very
complex webs of interconnected neurons.
• Artificial neural networks are built out of a densely interconnected set
of simple units, where each unit takes a number of real-valued inputs
(possibly the outputs of other units) and produces a single real-valued
output (which may become the input to many other units).
• The human brain is estimated to contain a densely interconnected
network of approximately 10^11 neurons, each connected, on average, to
10^4 others.
– Neuron activity is typically excited or inhibited through connections to other
neurons.
ALVINN – Neural Network Learning To Steer An
Autonomous Vehicle
Properties of Artificial Neural Networks
• A large number of very simple, neuron-like processing elements called
units,
• A large number of weighted, directed connections between pairs of
units
– Weights may be positive or negative real values
• Local processing in that each unit computes a function based on the
outputs of a limited number of other units in the network
• Each unit computes a simple function of its input values, which are the
weighted outputs from other units.
– If there are n inputs to a unit, then the unit's output, or activation, is defined by
a = g(w1·x1 + w2·x2 + … + wn·xn) (see the sketch below).
– Each unit computes a (simple) function g of the linear combination of its inputs.
• Learning by tuning the connection weights
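As a concrete illustration of this activation computation, a minimal Python sketch (the function names here are illustrative, not from the slides):

```python
# A single unit: weighted sum of inputs passed through an activation function g.
def unit_activation(weights, inputs, g):
    net = sum(w * x for w, x in zip(weights, inputs))
    return g(net)

# Example: a unit with a simple threshold activation.
step = lambda net: 1 if net > 0 else -1
print(unit_activation([0.5, 0.5], [1, 1], step))    # 1
print(unit_activation([0.5, 0.5], [-1, -1], step))  # -1
```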
Appropriate Problems for NN Learning
• Instances are represented by many attribute-value pairs.
• The target function output may be discrete-valued, real-valued, or a
vector of several real-valued or discrete-valued attributes.
• The training examples may contain errors.
• Long training times are acceptable.
• Fast evaluation of the learned target function may be required.
• The ability of humans to understand the learned target function is not
important.
Perceptron
[Perceptron diagram: inputs x1, …, xn with weights w1, …, wn, plus a fixed input x0 = 1 with weight w0, feed a threshold unit computing Σ_{i=0}^{n} wi·xi.]

o(x1, …, xn) = 1 if Σ_{i=0}^{n} wi·xi > 0
               −1 otherwise
Perceptron
• Perceptron is a Linear Threshold Unit (LTU).
• A perceptron takes a vector of real-valued inputs, calculates a linear
combination of these inputs, then outputs 1 if the result is greater than
some threshold and -1 otherwise.
• Given inputs x1 through xn, the output o(x1, …, xn) computed by the
perceptron is 1 if Σ_{i=0}^{n} wi·xi > 0 (with x0 = 1) and −1 otherwise.
Perceptron - Learning
• Learning a perceptron involves choosing values for weights w0, …,wn.
• The space H of candidate hypotheses considered in perceptron learning
is the set of all possible real-valued weight vectors
Representational Power of Perceptrons
• A perceptron represents a hyperplane decision surface in the
n-dimensional space of instances.
• The perceptron outputs 1 for instances lying on one side of the
hyperplane and outputs -1 for instances lying on the other side.
• The equation for this decision hyperplane is w⃗ · x⃗ = 0, i.e. w0 + w1·x1 + … + wn·xn = 0 (with x0 = 1).
• Some sets of positive and negative examples cannot be separated by any
hyperplane. Those that can be separated are called linearly separable
sets of examples.
• A single perceptron can be used to represent many boolean functions.
– AND, OR, NAND, NOR are representable by a perceptron (a quick check follows below).
– XOR is not representable by a perceptron.
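A quick check of the representability claim for AND in Python; the weights (0.5, 0.5) and threshold 0.75 match the AND unit that appears later in the multi-layer XOR slide, while the 0/1 input encoding is an assumption made for this sketch:

```python
# A perceptron computing AND over 0/1 inputs: output 1 iff 0.5*x1 + 0.5*x2 - 0.75 > 0.
def and_perceptron(x1, x2):
    return 1 if 0.5 * x1 + 0.5 * x2 - 0.75 > 0 else -1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", and_perceptron(a, b))   # only (1, 1) yields 1
```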
Representational Power of Perceptrons
[Figure: two 2-D plots of + and − training examples. Left: a linearly separable set, representable by a perceptron. Right: an XOR-like arrangement that no single hyperplane can separate, NOT representable by a perceptron.]
Perceptron Training Rule
• One way to learn an acceptable weight vector is to begin with random weights,
then iteratively apply the perceptron to each training example,
modifying the perceptron weights whenever it misclassifies an example.
– If a training example is classified correctly, the weights are not updated.
• This process is repeated, iterating through the training examples as
many times as needed until the perceptron classifies all training
examples correctly.
– Each pass through all of the training examples is called one epoch
• Weights are modified at each step according to the perceptron training rule.
Perceptron Training Rule
wi ← wi + Δwi
Δwi = η (t − o) xi
where t is the target value,
o is the perceptron output, and
η is a small constant (e.g. 0.1) called the learning rate
• Worked example (η = 0.1, with x0 = 1):
– Initial weights w = [0.25, −0.1, 0.5]; decision boundary x2 = 0.2·x1 − 0.5.
– (x, t) = ([−1, −1], 1): o = sgn(0.25 + 0.1 − 0.5) = −1 ≠ t, so Δw = [0.2, −0.2, −0.2] and w becomes [0.45, −0.3, 0.3].
– (x, t) = ([2, 1], −1): o = sgn(0.45 − 0.6 + 0.3) = 1 ≠ t, so Δw = [−0.2, −0.4, −0.2] and w becomes [0.25, −0.7, 0.1].
– (x, t) = ([1, 1], 1): o = sgn(0.25 − 0.7 + 0.1) = −1 ≠ t, so the weights are updated once more.
(This sequence is reproduced in code below.)
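A minimal Python sketch of the perceptron training rule applied to the worked example above (the data points, the initial weights, and η = 0.1 come from the slide; the function names are illustrative):

```python
# Perceptron training rule: wi <- wi + eta * (t - o) * xi, with x0 = 1 as bias input.
def predict(w, x):
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

def perceptron_update(w, x, t, eta=0.1):
    o = predict(w, x)
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, [1] + list(x))]

w = [0.25, -0.1, 0.5]
for x, t in [([-1, -1], 1), ([2, 1], -1), ([1, 1], 1)]:
    print(x, "t =", t, "o =", predict(w, x))
    w = perceptron_update(w, x, t)
print("weights after one pass:", w)
```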
Gradient Descent
• Consider linear unit without threshold and continuous output o (not just
–1,1)
o = w0 + w1 x1 + … + wn xn
• Train the wi’s such that they minimize the squared error
E[w0,…,wn] = ½ Σ_{d∈D} (td − od)²
where D is the set of training examples, td is the target output for training
example d, and od is the output of the linear unit for training example d.
Gradient Descent
• The w0, w1 plane represents the entire
hypothesis space.
• The vertical axis indicates the error E
relative to some fixed set of training
examples.
• The error surface summarizes the
desirability of every weight vector in the
hypothesis space (we desire a hypothesis
with minimum error).
• The arrow shows the negated gradient at
one particular point, indicating the direction
in the w0, w1 plane producing steepest
descent along the error surface.
• Gradient descent search determines a weight vector that minimizes E by starting with an
arbitrary initial weight vector, then repeatedly modifying it in small steps.
• At each step, the weight vector is altered in the direction that produces the steepest descent
along the error surface.
• This process continues until the global minimum error is reached.
Gradient Descent
• How can we calculate the direction of steepest descent along the error
surface?
• This direction can be found by computing the derivative of E with
respect to each component of the vector w0,..,wn.
• This vector derivative is called the gradient of E with respect to
w0,…,wn, written ∇E(w0,…,wn).
• Gradient:
∇E[w0,…,wn] = [∂E/∂w0, …, ∂E/∂wn]
• When the gradient is interpreted as a vector in weight space, the
gradient specifies the direction that produces the steepest increase in E.
• The negative of this vector (−∇E[w0,…,wn]) gives the direction of
steepest decrease.
Training Rule for Gradient Descent
• For each weight wi:
– wi ← wi + Δwi
– Δwi = −η ∂E/∂wi
• η is a small constant called the learning rate (a short derivation of ∂E/∂wi follows below)
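To fill in the step between the squared-error definition and this rule: for a linear unit, od = w0 + w1·x1d + … + wn·xnd, so the gradient can be computed directly (this is the standard derivation):

∂E/∂wi = ∂/∂wi [ ½ Σ_{d∈D} (td − od)² ]
       = Σ_{d∈D} (td − od) · ∂/∂wi (td − od)
       = −Σ_{d∈D} (td − od) · xid

Hence Δwi = −η ∂E/∂wi = η Σ_{d∈D} (td − od) xid, which is exactly the quantity accumulated per weight in the Gradient-Descent procedure that follows.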
Gradient Descent
Gradient-Descent(training_examples, η)
Each training example is a pair of the form ⟨(x1,…,xn), t⟩, where (x1,…,xn) is the vector
of input values and t is the target output value; η is the learning rate (e.g. 0.1)
• Initialize each wi to some small random value
• Until the termination condition is met, Do
– Initialize each Δwi to zero
– For each ⟨(x1,…,xn), t⟩ in training_examples, Do
• Input the instance (x1,…,xn) to the linear unit and compute the output o
• For each linear unit weight wi, Do
– Δwi ← Δwi + η (t − o) xi
– For each linear unit weight wi, Do
• wi ← wi + Δwi
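A runnable Python sketch of this procedure for a linear unit; the function name, the toy data, and the fixed epoch count used as the termination condition are illustrative choices, not from the slides:

```python
import random

# Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn:
# weight changes are accumulated over all examples, then applied once per epoch.
def gradient_descent(training_examples, eta=0.01, epochs=1000):
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]   # w[0] is the bias weight
    for _ in range(epochs):
        delta = [0.0] * (n + 1)
        for x, t in training_examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            for i, xi in enumerate([1.0] + list(x)):
                delta[i] += eta * (t - o) * xi
        w = [wi + dwi for wi, dwi in zip(w, delta)]
    return w

# Toy example: points generated by t = 2*x1 - 1.
data = [([0.0], -1.0), ([1.0], 1.0), ([2.0], 3.0)]
print(gradient_descent(data))   # weights approach [-1, 2]
```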
Incremental (Stochastic) Gradient Descent
Gradient descent is a strategy for searching through a large or
infinite hypothesis space that can be applied whenever
• the hypothesis space contains continuously parameterized
hypotheses (e.g., the weights in a linear unit), and
• the error can be differentiated with respect to these hypothesis
parameters.
The key practical difficulties in applying gradient descent are
– converging to a local minimum can sometimes be quite slow (i.e., it
can require many thousands of gradient descent steps), and
– if there are multiple local minima in the error surface, then there is
no guarantee that the procedure will find the global minimum.
Incremental (Stochastic) Gradient Descent
• Batch mode: gradient descent over the entire training data D
w ← w − η ∇ED[w]
ED[w] = ½ Σ_{d∈D} (td − od)²
• Incremental mode: gradient descent over individual training examples d
w ← w − η ∇Ed[w]
Ed[w] = ½ (td − od)²
Incremental (Stochastic) Gradient Descent
Incremental (Stochastic) Gradient-Descent(training_examples, η)
Each training example is a pair of the form ⟨(x1,…,xn), t⟩, where (x1,…,xn) is the vector
of input values and t is the target output value; η is the learning rate (e.g. 0.1)
• Initialize each wi to some small random value
• Until the termination condition is met, Do
– For each ⟨(x1,…,xn), t⟩ in training_examples, Do
• Input the instance (x1,…,xn) to the linear unit and compute the output o
• For each linear unit weight wi, Do
– wi ← wi + η (t − o) xi
(Relative to the batch version, the struck-out steps are: initializing each Δwi to zero,
accumulating into Δwi, and the separate final loop wi ← wi + Δwi; here each weight is
updated immediately after every training example.)
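For comparison, a minimal Python sketch of the incremental (stochastic) variant; the only change from the batch sketch above is that each weight is updated immediately after every example (names, toy data, and epoch count are again illustrative):

```python
import random

# Stochastic (incremental) gradient descent for a linear unit:
# weights are updated after each training example rather than once per epoch.
def stochastic_gradient_descent(training_examples, eta=0.01, epochs=1000):
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]   # w[0] is the bias weight
    for _ in range(epochs):
        for x, t in training_examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, [1.0] + list(x))]
    return w

data = [([0.0], -1.0), ([1.0], 1.0), ([2.0], 3.0)]
print(stochastic_gradient_descent(data))   # weights approach [-1, 2]
```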
Incremental (Stochastic) Gradient Descent
• In standard gradient descent, the error is summed over all examples
before updating weights, whereas in stochastic gradient descent weights
are updated upon examining each training example.
• Summing over multiple examples in standard gradient descent requires
more computation per weight update step. On the other hand, because it
uses the true gradient, standard gradient descent is often used with a
larger step size per weight update than stochastic gradient descent.
• In cases where there are multiple local minima with respect to
E(w0,…,wn), stochastic gradient descent can sometimes avoid falling
into these local minima because it uses the various ∇Ed(w0,…,wn)
rather than ∇E(w0,…,wn) to guide its search.
Comparison Perceptron and Gradient Descent Rule
The perceptron training rule is guaranteed to succeed if
• the training examples are linearly separable, and
• the learning rate is sufficiently small.
The linear unit training rule (gradient descent / delta rule) converges asymptotically to the
hypothesis with minimum squared error, given a sufficiently small learning rate, even when
the training data contain noise or are not linearly separable.
Multi-Layer Networks
• Single perceptrons can only express linear decision surfaces.
• Multilayer networks are capable of expressing a rich variety of
nonlinear decision surfaces.
[Figure: a feed-forward network with an input layer, a hidden layer, and an output layer.]
Multi-Layer Networks with Linear Units
Ex. XOR
• Multiple layers of cascaded linear units still produce only linear
functions.
[Figure: XOR computed by a two-layer network of threshold units over inputs x1 and x2,
built from an OR unit (0.5·x1 + 0.5·x2 − 0.25 > 0, i.e. w1 = w2 = 0.5, w0 = −0.25) and an
AND unit (0.5·x1 + 0.5·x2 − 0.75 > 0, i.e. w1 = w2 = 0.5, w0 = −0.75), whose outputs are
combined to produce XOR.]
(This construction is checked in code below.)
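A small Python check of this construction; the OR and AND unit weights come from the slide, while the way the output unit combines them (fire when OR is on and AND is off, i.e. XOR = OR AND NOT AND) is an assumed wiring for illustration:

```python
# Threshold (perceptron-style) units over 0/1 inputs and outputs.
step = lambda net: 1 if net > 0 else 0

def or_unit(x1, x2):    # weights from the slide: 0.5, 0.5, w0 = -0.25
    return step(0.5 * x1 + 0.5 * x2 - 0.25)

def and_unit(x1, x2):   # weights from the slide: 0.5, 0.5, w0 = -0.75
    return step(0.5 * x1 + 0.5 * x2 - 0.75)

def xor_net(x1, x2):
    # Assumed output wiring: active when the OR unit fires and the AND unit does not.
    return step(or_unit(x1, x2) - and_unit(x1, x2) - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # 0 1 1 0
```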
Multi-Layer Networks with Non-Linear Units
• Multiple layers of cascaded linear units still produce only linear
functions.
• We prefer networks capable of representing highly nonlinear functions.
• What we need is a unit whose output is a nonlinear function of its
inputs, but whose output is also a differentiable function of its inputs.
• One solution is the sigmoid unit, a unit very much like a perceptron,
but based on a smoothed, differentiable threshold function.
Sigmoid Unit
[Sigmoid unit diagram: inputs x1, …, xn with weights w1, …, wn, plus a fixed input x0 = 1 with weight w0.]

net = Σ_{i=0}^{n} wi·xi
o = σ(net) = 1 / (1 + e^(−net))

σ(x) = 1 / (1 + e^(−x)) is the sigmoid function.
Sigmoid Unit
Derive gradient descent rules to train:
– one sigmoid unit:
∂E/∂wi = −Σ_{d∈D} (td − od) · od (1 − od) · xid
(sketched in code below)
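A minimal Python sketch of one batch gradient descent step for a single sigmoid unit using this gradient (the function names, learning rate, and the OR-style toy data are illustrative):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# One batch step using dE/dwi = -sum_d (td - od) * od * (1 - od) * xid, with x0 = 1.
def sigmoid_unit_step(w, training_examples, eta=0.5):
    grad = [0.0] * len(w)
    for x, t in training_examples:
        xs = [1.0] + list(x)
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, xs)))
        for i, xi in enumerate(xs):
            grad[i] += -(t - o) * o * (1 - o) * xi
    return [wi - eta * gi for wi, gi in zip(w, grad)]

# Toy example: a sigmoid unit learning OR with 0/1 targets.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w = [0.0, 0.0, 0.0]
for _ in range(1000):
    w = sigmoid_unit_step(w, data)
print(w)   # w0 ends up negative, w1 and w2 positive: the unit approximates OR
```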
Backpropagation Algorithm
• Create a feed-forward network with n_in inputs, n_hidden hidden units, and n_out output units.
• Initialize each weight wi to some small random value (e.g., between −.05 and .05).
• Until the termination condition is met, Do
– For each training example ⟨(x1,…,xn), t⟩, Do
// Propagate the input forward through the network:
1. Input the instance (x1,…,xn) to the network and compute the output ou of every unit u
// Propagate the errors backward through the network:
2. For each output unit k, compute its error term δk ← ok (1 − ok) (tk − ok)
3. For each hidden unit h, compute its error term δh ← oh (1 − oh) Σ_{k∈outputs} w_kh δk
4. Update each network weight: w_ji ← w_ji + η δj x_ji
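A compact, runnable Python sketch of this algorithm for a single hidden layer of sigmoid units, trained here on XOR; the network size, learning rate, epoch count, and function names are illustrative choices, not from the slides:

```python
import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, w_hidden, w_out):
    xs = [1.0] + list(x)                                    # x0 = 1 bias input
    h = [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in w_hidden]
    hs = [1.0] + h                                          # bias input for the output layer
    o = [sigmoid(sum(w * v for w, v in zip(row, hs))) for row in w_out]
    return xs, hs, o

def backprop(examples, n_hidden=3, n_out=1, eta=0.5, epochs=10000):
    n_in = len(examples[0][0])
    rnd = lambda: random.uniform(-0.05, 0.05)
    w_hidden = [[rnd() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rnd() for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            xs, hs, o = forward(x, w_hidden, w_out)
            # Output error terms: delta_k = ok (1 - ok) (tk - ok)
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # Hidden error terms: delta_h = oh (1 - oh) * sum_k w_kh delta_k
            d_hid = [hs[h + 1] * (1 - hs[h + 1]) *
                     sum(w_out[k][h + 1] * d_out[k] for k in range(n_out))
                     for h in range(n_hidden)]
            # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji
            for k in range(n_out):
                w_out[k] = [w + eta * d_out[k] * v for w, v in zip(w_out[k], hs)]
            for h in range(n_hidden):
                w_hidden[h] = [w + eta * d_hid[h] * v for w, v in zip(w_hidden[h], xs)]
    return w_hidden, w_out

# XOR with 0/1 targets.
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
w_hidden, w_out = backprop(data)
for x, t in data:
    print(x, t, round(forward(x, w_hidden, w_out)[2][0], 2))
```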
Backpropagation
• Gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
– in practice it often works well (can be invoked multiple times with
different initial weights)
• Often includes a weight momentum term:
Δwi,j(n) = η δj xi,j + α Δwi,j(n − 1)
• Minimizes error over the training examples
• Training can be slow: typically 1000-10000 iterations
• Using the network after training is fast
8-3-8 Binary Encoder -Decoder
• An 8 × 3 × 8 network was trained to learn the identity function, using the eight training
examples shown.
• After 5000 training epochs, the three hidden unit values encode the eight distinct
inputs using the encoding shown on the right.
• If the encoded values are rounded to zero or one, the result is the standard binary
encoding for eight distinct values.
Single Layer – Linear Problem
Multi-Layer Network – NonLinear Problem
Multi-Layer Network – NonLinear Problem2
Multi-Layer Network – NonLinear Problem2
Expressive Capabilities of ANNs
Boolean functions
– Every boolean function can be represented by a network with a single
hidden layer,
– but it might require a number of hidden units exponential in the number of inputs.
Continuous functions
– Every bounded continuous function can be approximated with
arbitrarily small error by a network with one hidden layer.
– Any function can be approximated to arbitrary accuracy by a
network with two hidden layers.