Lecture 4
Iris setosa
www.mathworks.com/help/toolbox/stats/bqzdnrv-1.html
www.fs.fed.us/wildflowers/beauty/iris/blueflag/
Your first workshop task (using J48 in Weka) was to see if there was enough information in the four attributes of petal length and width and sepal length and width to distinguish and classify these similar-looking flowers. Notice that colour may not help here. Your next workshop will involve using ANNs to distinguish between these flowers.
Iris virginica
www.fs.fed.us/wildflowers/beauty/iris/blueflag/
Artificial Neural Networks
Images of the brain. Top left: photo; top right: what is currently known; left: close up showing the brain consisting of layers of interconnected nerve cells (neurons) and tissue.
All images taken from www.idsia.ch/NNcourse/brain.html4
Neurons
Artificial
http://research.yale.edu/ysm/images/78.2/articles-neural-neuron.jpg
http://faculty.washington.edu/chudler/color/pic1an.gif
http://www.frontiersin.org/neuromorphic_engineering/10.3389/fnins.2011.00026/full
Biological computing through action potential spikes. A: abstract physiology. B: biochemistry. Physiology and biochemistry lead to spikes; can spikes be used for computing?
The post-synaptic neuron spikes more (or less) depending on pre-synaptic behaviour
Action potential spiking can take place many times per second, depending on which part of the brain we look at
http://www.socialbehavior.uzh.ch/teaching/ComputationalNeuroeconomicsFS11/Chapter10.pptx
http://ilab.usc.edu/classes/2002cs561/notes/session28.ppt
Feedforward
http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt
Introduction
What is an (artificial) neural network?
- A set of nodes (units, neurons, processing elements)
- Each node has input and output
- Each node performs a simple computation by its node function
- Weighted connections between nodes
- Connectivity gives the structure/architecture of the net
- What can be computed by an NN is primarily determined by the connections and their weights
- A very much simplified version of networks of neurons in animal nerve systems
- The neuron is the basic computational unit (a primitive processor), not a program
http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt
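To make this concrete, here is a minimal Python sketch (not from the lecture; the function and variable names are illustrative) of a single node that computes a simple function of its weighted inputs, showing how the weights, rather than a program, determine what is computed:

# Minimal sketch of one node: output = node_function(weighted sum of inputs)
def node_output(inputs, weights, node_function):
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return node_function(weighted_sum)

# Example node function: a hard threshold (the node either "fires" or not)
def step(s):
    return 1 if s > 0 else 0

# Same node, same node function: different weights give different behaviour
print(node_output([1, 0, 1], [0.5, 0.2, 0.8], step))    # -> 1
print(node_output([1, 0, 1], [-0.5, 0.2, -0.8], step))  # -> 0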
Introduction
Von Neumann machine:
- One or a few high-speed (ns) processors with considerable computing power
- One or a few shared high-speed buses for communication
- Sequential memory access by address
- Problem-solving knowledge is separated from the computing component
- Hard to be adaptive

Human Brain:
- Large number (~10^11) of low-speed (ms) processors with limited computing power
- Large number (~10^15) of low-speed connections
- Problem-solving knowledge resides in the connectivity of neurons
- Adaptation by changing the connectivity
- Easily adapts for learning
- Fault tolerant
http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt
Introduction
ANN vs Bio NN correspondence:
- cell body -> node (unit)
- signal from other neurons -> input
- firing frequency -> node output
- firing mechanism -> node function
- synapses -> connections
- synaptic strength -> connection weight
Highly parallel: simple local computation (at the neuron level) achieves global results as an emerging property of the interaction (at the network level). Pattern directed: individual nodes have meaning only in the context of a pattern. Fault-tolerant / graceful degradation. Learning/adaptation plays an important role.
http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt
History of NN
McCulloch & Pitts (1943)
First mathematical model of biological neurons. All Boolean operations can be implemented by these neuron-like nodes (with different thresholds and excitatory/inhibitory connections). A competitor to the von Neumann model as a general-purpose computing device. Origin of automata theory.
Hebb (1949)
Hebbian rule of learning: increase the connection strength between neurons i and j whenever both i and j are activated. Or increase the connection strength between nodes i and j whenever both nodes are simultaneously ON or OFF.
http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt
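A hedged Python sketch of the Hebbian rule stated above (the names eta and w_ij are illustrative, not from the lecture): the connection strength between nodes i and j grows when both are active together.

def hebbian_update(w_ij, x_i, x_j, eta=0.1):
    # delta w_ij = eta * x_i * x_j: increases only when i and j are both active
    return w_ij + eta * x_i * x_j

w = 0.0
w = hebbian_update(w, x_i=1, x_j=1)   # both ON  -> strengthened to 0.1
w = hebbian_update(w, x_i=1, x_j=0)   # only one -> unchanged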
History of NN
Early boom (1950s to early 1960s): Rosenblatt (1958)
Perceptron: a network of threshold nodes for pattern classification. Perceptron convergence theorem: everything that can be represented by a perceptron can be learned.
A neuron only fires if its input signal exceeds a certain amount (the threshold) in a short time period. Synapses vary in strength: good connections allow a large signal, while slight connections allow only a weak signal. Synapses can be either excitatory or inhibitory.
Perceptron
Linear threshold unit (LTU)
Inputs x1, x2, ..., xn with weights w1, w2, ..., wn. Usually, an extra constant input x0 = 1 with weight w0 is included to ensure that, even if all the inputs are 0, a non-zero value is fed into the threshold function.

o(x1, ..., xn) = 1 if the sum over i = 0..n of wi*xi > 0, and -1 otherwise

Threshold function: if the sum of weighted inputs is greater than 0, output 1, else output -1.
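As a minimal sketch (illustrative names, not the lecture's own code), the LTU defined above can be written as:

def ltu_output(x, w):
    # x = [x1, ..., xn]; w = [w0, w1, ..., wn], with w0 the weight on the
    # constant bias input x0 = 1
    s = w[0] * 1 + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

print(ltu_output([1, 0, 1], [-0.3, 0.5, 0.2, 0.8]))  # -> 1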
If the output is correct (t = o), the weights wi are not changed. If the output is incorrect (t ≠ o), the weights wi are changed so that the output of the perceptron for the adjusted weights is closer to t. The algorithm converges to the correct classification after repeated presentations of the samples, provided the training data is linearly separable and the learning rate is sufficiently small.
1. Samples are fed in one at a time to the input units; repeat many times
3. The actual output is stored in a temporary file to allow calculation of the error (the difference between desired output and actual output); errors can be summed
Training an ANN (perceptron) for the AND function
Perceptron with bias input -1 (weight W1 = ?), input A (weight W2 = ?), input B (weight W3 = ?); threshold t = 0.0. Output is +1 if t is exceeded, 0 otherwise.
For AND: A B | Output
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1
Initialize with random weight values. Introduce an additional input constant (called the bias) to ensure that the threshold function receives some input even if all other inputs are zero.
Training Perceptrons
Perceptron with bias input -1 (weight W1 = 0.3), input A (weight W2 = 0.5), input B (weight W3 = -0.4); threshold t = 0.0; output y.
For AND: A B | Output
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1

I1  I2  I3 | Summation                             | Output
-1   0   0 | (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3  | 0
-1   0   1 | (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7  | 0
-1   1   0 | (-1*0.3) + (1*0.5) + (0*-0.4) = 0.2   | 1
-1   1   1 | (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2  | 0
Given the current weights, this perceptron does not produce the correct results for two combinations of AND: 1 0 and 1 1.
Exercise: fill in the values in the summation table to determine whether this perceptron correctly performs the AND function.
Perceptron with bias input -1 (weight W1 = 0.4), input A (weight W2 = 0.7), input B (weight W3 = -0.2); threshold t = 0.0; output y.
For AND: A B | Output
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1

I1  I2  I3 | Summation | Output
-1   0   0 |           |
-1   0   1 |           |
-1   1   0 |           |
-1   1   1 |           |
Solution to Exercise
Perceptron with bias input -1 (weight W1 = 0.4), input A (weight W2 = 0.7), input B (weight W3 = -0.2); threshold t = 0.0; output y.

I1  I2  I3 | Summation                             | Output
-1   0   0 | (-1*0.4) + (0*0.7) + (0*-0.2) = -0.4  | 0
-1   0   1 | (-1*0.4) + (0*0.7) + (1*-0.2) = -0.6  | 0
-1   1   0 | (-1*0.4) + (1*0.7) + (0*-0.2) = 0.3   | 1
-1   1   1 | (-1*0.4) + (1*0.7) + (1*-0.2) = 0.1   | 1
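The summations above can be checked with a short sketch (illustrative, not part of the lecture), using the exercise weights W1 = 0.4, W2 = 0.7, W3 = -0.2 and threshold t = 0.0:

weights = [0.4, 0.7, -0.2]
t = 0.0
for inputs in ([-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]):
    s = sum(w * x for w, x in zip(weights, inputs))
    output = 1 if s > t else 0
    print(inputs, round(s, 2), output)
# prints summations -0.4, -0.6, 0.3, 0.1 with outputs 0, 0, 1, 1 (as in the table)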
Weight adjustment
With W1 = 0.3, W2 = 0.5, W3 = -0.4:
I1  I2  I3 | Summation                             | Output
-1   0   0 | (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3  | 0
-1   0   1 | (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7  | 0
-1   1   0 | (-1*0.3) + (1*0.5) + (0*-0.4) = 0.2   | 1
-1   1   1 | (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2  | 0

With W1 = 0.4, W2 = 0.7, W3 = -0.2:
I1  I2  I3 | Summation                             | Output
-1   0   0 | (-1*0.4) + (0*0.7) + (0*-0.2) = -0.4  | 0
-1   0   1 | (-1*0.4) + (0*0.7) + (1*-0.2) = -0.6  | 0
-1   1   0 | (-1*0.4) + (1*0.7) + (0*-0.2) = 0.3   | 1
-1   1   1 | (-1*0.4) + (1*0.7) + (1*-0.2) = 0.1   | 1
Note that the summation results are in the right direction for producing the correct output for 1 1 but not for 1 0 (given threshold of t=0.0)
Learning algorithm
Epoch: presentation of the entire training set to the neural network. In the case of the AND function an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).
Error: the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it output a 1, then Error = -1.
Learning question 1: Is there a set of weights that will produce the correct results for all input values of the AND function?
Learning question 2: If so, how do we find these weights automatically rather than manually adjusting the weights?
Weight adjustment for input 1 0 (target 0, actual output 1), with a learning rate of 0.1:
Error = (0 - 1) = -1
Weight change for w1 = 0.1 x -1 x -1 = 0.1 (w1 = 0.4 + 0.1 = 0.5)
Weight change for w2 = 0.1 x -1 x 1 = -0.1 (w2 = 0.7 - 0.1 = 0.6)
Weight change for w3 = 0.1 x -1 x 0 = 0 (w3 = -0.2, no change)
With the adjusted weights W1 = 0.5, W2 = 0.6, W3 = -0.2, the next time 1 0 is presented:
I1  I2  I3 | Summation                            | Output
-1   1   0 | (-1*0.5) + (1*0.6) + (0*-0.2) = 0.1  | 1
That is, while the output is still wrong, the summation for input 1 0 has fallen from 0.3 to 0.1. At least one more presentation of 1 0 is required to produce the desired output 0, given the threshold t = 0.0.
Next presentation
Error = (0 - 1) = -1
Weight change for w1 = 0.1 x -1 x -1 = 0.1 (w1 = 0.5 + 0.1 = 0.6)
Weight change for w2 = 0.1 x -1 x 1 = -0.1 (w2 = 0.6 - 0.1 = 0.5)
Weight change for w3 = 0.1 x -1 x 0 = 0 (w3 = -0.2, no change)
When 1 0 is presented again:
I1  I2  I3 | Summation                             | Output
-1   1   0 | (-1*0.6) + (1*0.5) + (0*-0.2) = -0.1  | 0
Since the summation for 1 0 is now below 0, the correct output for AND(1,0) = 0 is produced. We can now move on to the next wrongly classified sample, 1 1. One can change the weights after processing a single pattern, or accumulate weight error values over a batch of patterns before changing the weights; the latter allows all patterns to be presented to the perceptron's existing weights before the weights are changed.
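The whole procedure (per-pattern updates, learning rate 0.1, threshold t = 0.0, weight change = learning rate x error x input) can be sketched as follows. This is an illustrative reconstruction, not the lecture's own code, starting from the exercise weights:

def train_and_perceptron(weights=(0.4, 0.7, -0.2), eta=0.1, epochs=20):
    # Each pattern is (bias, A, B) with target AND(A, B)
    data = [([-1, 0, 0], 0), ([-1, 0, 1], 0), ([-1, 1, 0], 0), ([-1, 1, 1], 1)]
    weights = list(weights)
    for _ in range(epochs):
        for inputs, target in data:
            s = sum(w * x for w, x in zip(weights, inputs))
            output = 1 if s > 0 else 0
            error = target - output                  # e.g. 0 - 1 = -1
            # weight change = learning rate x error x input
            weights = [w + eta * error * x for w, x in zip(weights, inputs)]
    return weights

print(train_and_perceptron())   # weights that classify AND correctly

Since AND is linearly separable, repeated presentations converge, exactly as the convergence theorem stated earlier.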
http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt
History of NN
The setback (mid-1960s to late 1970s)
Serious problems with the perceptron model (Minsky and Papert's book, 1969):
- Single-layer perceptrons cannot represent (learn) simple functions such as XOR
- Multi-layer networks of non-linear units may have greater power, but there was no learning rule for such nets
- Scaling problem: connection weights may grow infinitely
The first two problems were overcome by later efforts in the 1980s, but the scaling problem persists.
Death of Rosenblatt (1971).
Meanwhile, the von Neumann machine and symbolic AI thrived.
http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt
History of NN
Renewed enthusiasm and flourishing (1980s to present)
New techniques
- Backpropagation learning for multi-layer feed-forward nets (with non-linear, differentiable node functions)
- Thermodynamic models (Hopfield net, Boltzmann machine, etc.)
- Unsupervised learning
Impressive applications (character recognition, speech recognition, text-to-speech transformation, process control, associative memory, etc.). ANNs are now a preferred computational method in many applications (e.g. pattern recognition).
http://ilab.usc.edu/classes/2002cs561/notes/session28.ppt
Information is distributed
[Diagram: feed-forward network with an Input Layer, a Hidden Layer and an Output Layer]
The output layer can contain more than one unit. E.g. output classes 1/0 can be represented by one unit, or by two units (1 0 for class 1 and 0 1 for class 2).
One can change weights after processing a single pattern or accumulate weight error values over a batch of patterns before changing the weights.
Why back-propagation?
Each weight shares the blame for the prediction error with the other weights. The back-propagation algorithm decides how to distribute the blame among all the weights and adjusts the weights accordingly: a small portion of the blame leads to a small adjustment, a large portion of the blame leads to a large adjustment.
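As a hedged illustration of how that blame can be distributed, here is standard textbook back-propagation for a tiny 2-2-1 network with logistic units (the weights and names are made up for the sketch, not taken from the lecture):

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# W_hid[j][i] connects input i to hidden unit j; W_out[j] connects hidden j to the output
W_hid = [[0.1, 0.4], [-0.2, 0.3]]
W_out = [0.2, -0.5]
eta = 0.5

def backprop_step(x, target):
    global W_hid, W_out
    # Forward pass
    h = [sigmoid(sum(w * xi for w, xi in zip(W_hid[j], x))) for j in range(2)]
    o = sigmoid(sum(w * hj for w, hj in zip(W_out, h)))
    # Blame (delta) at the output, then shared back to the hidden units
    delta_o = (target - o) * o * (1 - o)
    delta_h = [h[j] * (1 - h[j]) * W_out[j] * delta_o for j in range(2)]
    # Each weight is adjusted in proportion to its share of the blame
    W_out = [W_out[j] + eta * delta_o * h[j] for j in range(2)]
    W_hid = [[W_hid[j][i] + eta * delta_h[j] * x[i] for i in range(2)] for j in range(2)]

backprop_step([1, 0], 1)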
Assume three inputs and one output; 1 0 1 is the pattern at the input nodes, with 0 the target (the actual output is 1, so the error is 0 - 1 = -1). Starting weights: w1 = 0.5, w2 = 0.2, w3 = 0.8.

Assume the learning rate is 1:
w1new = 0.5 + 1*(0-1)*1 = -0.5
w2new = 0.2 + 1*(0-1)*0 = 0.2
w3new = 0.8 + 1*(0-1)*1 = -0.2

A large learning rate can lead to weight oscillation. Assume instead a learning rate of 0.2:
w1new = 0.5 + 0.2*(0-1)*1 = 0.3
w2new = 0.2 + 0.2*(0-1)*0 = 0.2
w3new = 0.8 + 0.2*(0-1)*1 = 0.6

Note how the weights that are more to blame (those connected to active inputs) are changed, while w2 is left unchanged.
Transfer functions: the transfer function is usually the same for every unit in the same layer. There are various choices of transfer / activation function that determine what is output from a neuron.
Choice of transfer function for one output unit depends on the class information:
1. If the classes are 0 and 1, use logistic or threshold
2. If non-binary, use tanh (e.g. -1, 0, 1 for tripartite classification)
3. If there are N classes, a class is represented as (0, ..., 0, 1, 0, ..., 0) at the output layer (i.e. as many output nodes as distinct class values)
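A small sketch of those transfer-function choices (standard definitions; the function names are illustrative):

import math

def logistic(s):            # outputs in (0, 1): suits 0/1 class labels
    return 1.0 / (1.0 + math.exp(-s))

def threshold(s, t=0.0):    # hard 0/1 output
    return 1 if s > t else 0

def tanh(s):                # outputs in (-1, 1): suits e.g. -1 / 0 / 1 labels
    return math.tanh(s)

# For N classes, the target is one unit per class, e.g. class 2 of 4:
target = [0, 1, 0, 0]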
Local Minima
Advantages of back-propagation:
- Relatively simple implementation
- Standard method and generally works well
Disadvantages of back-propagation:
- Slow and inefficient
- Can get stuck in local minima, resulting in sub-optimal solutions
[Diagram: error surface showing a local minimum and the global minimum]
Weights can get stuck through gradient descent (i.e. the error has reached a local minimum). Gradient descent must be amended to allow learning to leave flat spots.
weight change = (small constant x error x input activation) + (momentum constant x old weight change)

Δw(t) = η x d + a x Δw(t-1)
where Δw is the change in weight, η is the learning rate, d is the error x input activation, and a is the momentum parameter.
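A sketch of that update in code (illustrative names; d is the error x input activation as above):

def momentum_update(w, error, activation, prev_delta, eta=0.1, a=0.9):
    d = error * activation               # error x input activation
    delta = eta * d + a * prev_delta     # current change keeps a fraction of the old change
    return w + delta, delta              # new weight, plus the change for the next step

# First step has no previous change:
w, dw = momentum_update(w=0.5, error=-1, activation=1, prev_delta=0.0)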
Batch Update
With the default BP update, the weights are updated after every pattern. With batch update, the changes for each weight are accumulated but not applied until the end of each epoch. Batch update gives the correct direction of the gradient for the entire data set, whereas per-pattern update may make some weight changes in directions quite different from the average gradient over the entire data set.
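The contrast can be sketched as follows (illustrative helper code, not the lecture's): per-pattern update changes the weights inside the loop, while batch update accumulates the changes and applies them once per epoch.

def predict(weights, inputs):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0

def online_epoch(weights, data, eta=0.1):
    for inputs, target in data:
        error = target - predict(weights, inputs)
        weights = [w + eta * error * x for w, x in zip(weights, inputs)]  # update now
    return weights

def batch_epoch(weights, data, eta=0.1):
    acc = [0.0] * len(weights)
    for inputs, target in data:
        error = target - predict(weights, inputs)                # same weights all epoch
        acc = [a + eta * error * x for a, x in zip(acc, inputs)]
    return [w + a for w, a in zip(weights, acc)]                 # update once per epoch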
Example of non-training. Note how the error graph goes into a flat line
Practical Differences
A traditional program need have no knowledge of the hardware on which it runs; a neural network is totally dependent on its architecture and is therefore hardware dependent. You can write a traditional program without paying attention to the hardware, but to solve a problem on a neural network you have to experiment with different architectures and parameters (the "hardware").
Philosophical differences
A computer program contains explicit rules for manipulating symbols, e.g.
If x>y then a=1 else a=2
A neural network represents symbols (x, y, etc.) as distributed patterns (typically feature vectors), e.g. 001100 for a, 110011 for b. A neural network doesn't use explicit rules but weights on connections (and threshold functions in units) to perform tasks.
Operation
A traditional program takes input, transforms the input through rules and produces output. A neural network uses spreading activation, which represents the probability that a neuron generates an action potential spike (fires) that spreads to the other units connected to it.
Explaining learning
Computationalism explains learning as the application of symbolic formulae to data to extract important features or derive new conclusions. Connectionism explains learning as modifying the weights on connections.
Summary
You have been introduced to the two main views in AI concerning intelligence and how to build intelligence into a machine: write increasingly sophisticated software, or construct increasingly sophisticated neural networks. The debate is not just about machine intelligence but about ourselves: are we von Neumann computers (like your desktop or Apple iPad)? Are we neural networks (no real neural-network computers have yet been built)?
With weights W1 = 0.5, W2 = 0.3, W3 = -0.2 and threshold t = 0.0, when 1 1 is presented:
I1  I2  I3 | Summation                             | Output
-1   1   1 | (-1*0.5) + (1*0.3) + (1*-0.2) = -0.4  | 0
Using the perceptron learning rule and a learning rate of 0.1, show how the weights are changed for subsequent presentations of 1 1. How many presentations does it take before the perceptron produces the correct output for 1 1? Then test your weights on 0 1, 0 0 and 1 0.
Solution to Exercise: 1 1
I1  I2  I3 | Summation                             | Output
-1   1   1 | (-1*0.5) + (1*0.3) + (1*-0.2) = -0.4  | 0
First presentation: Error = (target output - actual output) = 1. Weight change = learning rate x error x input.
Weight change for w1 = 0.1 x 1 x -1 = -0.1 (w1 = 0.5 - 0.1 = 0.4)
Weight change for w2 = 0.1 x 1 x 1 = 0.1 (w2 = 0.3 + 0.1 = 0.4)
Weight change for w3 = 0.1 x 1 x 1 = 0.1 (w3 = -0.2 + 0.1 = -0.1)
I1  I2  I3 | Summation                             | Output
-1   1   1 | (-1*0.4) + (1*0.4) + (1*-0.1) = -0.1  | 0
Second presentation: Error = (target output - actual output) = 1. Weight change = learning rate x error x input.
Weight change for w1 = 0.1 x 1 x -1 = -0.1 (w1 = 0.4 - 0.1 = 0.3)
Weight change for w2 = 0.1 x 1 x 1 = 0.1 (w2 = 0.4 + 0.1 = 0.5)
Weight change for w3 = 0.1 x 1 x 1 = 0.1 (w3 = -0.1 + 0.1 = 0)
I1  I2  I3 | Summation                          | Output
-1   1   1 | (-1*0.3) + (1*0.5) + (1*0) = 0.2   | 1
For 1 0: with these weights (w1 = 0.3, w2 = 0.5, w3 = 0), the summation is (-1*0.3) + (1*0.5) + (0*0) = 0.2, so the output is 1, which is still incorrect for AND(1,0) = 0 and requires further training.