UNIT 2
1. Mathematical Foundations and Learning Mechanisms.
Mathematical Preliminaries
Notation
Intervals, Sequences, and Set Closure
Vector Spaces
Inner Product, Norm and Orthogonality
Functions and Continuity
Continuous Functions
Homeomorphisms
Monotonic Functions
Matrices, Linear Transformations, and Eigenvectors
Notation
Plain uppercase letters denote vectors, whereas boldface uppercase letters denote matrices:
$X$ is a vector, whereas $\mathbf{A}$ is a matrix.
$X = (x_1, \ldots, x_n)^T$ is a vector.
$\mathbf{A} = [a_{ij}]$ is a matrix.
As far as possible, the symbols $i$, $j$, $l$ will be employed to denote components, whereas $k$ will be used to denote an iteration index or a discrete time index.
The vector $X^k$ at iteration $k$ has components $X^k = (x_1^k, \ldots, x_n^k)^T$.
Intervals, Sequences, and Set Closure
We consistently assume the following notation: $\mathbb{R}$ denotes the set of real numbers; $\mathbb{R}^+$ denotes the positive reals; $\mathbb{Z}$ denotes the set of integers; and $\mathbb{Z}^+$ denotes the set of positive integers.
Intervals. Recall that an interval $I$ is a set of real numbers with the property that if $x \in I$, $y \in I$, and $x \le z \le y$, then $z \in I$.
In describing intervals we usually employ the following notation:
$(a, b) = \{x : a < x < b\}$
$[a, b] = \{x : a \le x \le b\}$
$(a, \infty) = \{x : x > a\}$
$[a, \infty) = \{x : x \ge a\}$
$(-\infty, b) = \{x : x < b\}$
$(-\infty, b] = \{x : x \le b\}$
Note that these intervals may be bounded above, below, or both. An interval $I$ is bounded above if there exists a real number $h$ that is greater than or equal to every element of $I$, and bounded below if there exists a number $l$ that is less than or equal to every element of $I$.
The continuum property of the real numbers states that every non-empty set of real numbers that is bounded above has a least upper bound (lub), and every non-empty set that is bounded below has a greatest lower bound (glb).
If $\mathrm{glb}(I) \in I$ and $\mathrm{lub}(I) \in I$, then $I$ is closed. Otherwise $I$ is either half-open or open.
Sequences
A sequence $\{x_n\} \subset I$ is a mapping from a set of integers, commonly $\mathbb{Z}^+$, into $I$. A point $x \in \mathbb{R}$ is a limit point of $I$ if there is a sequence of points $\{x_n\} \subset I$ that converges to $x$.
For example, $\{1/n\}_{n=1}^{\infty}$ is a sequence of points in the interval $[0, 1]$ that converges to the point 0; therefore 0 is a limit point of this sequence. Similarly, the sequence $\{1 - 1/n\}_{n=1}^{\infty}$ converges to the point 1. Clearly, 0 and 1 are limit points of both the open interval $(0, 1)$ and the closed interval $[0, 1]$.
Set Closure
One can now define the closure of a set in terms of its limit points. A set together with its
limit points is called its closure. A set J is closed if it contains all of its limit points. In the
above example, (0, 1) is therefore open and [0,1] is closed.
The complement of a closed set is open. By definition, an empty set is closed.
Further, any finite union of closed sets is closed. But surprisingly infinite unions of closed
sets need not be closed.
A related, informal notion is closure of a set under an operation. For example, $3.1 + 5.5 = 8.6$ is a real number: adding two real numbers gives a real number, so the reals are closed under addition. By contrast, $4 - 8 = -4$, and $-4$ is not a whole number (whole numbers cannot be negative), so the whole numbers are not closed under subtraction.
For a set of shirts:
Shirts are closed under the operation of "wash" (a washed shirt is still a shirt).
Shirts are not closed under the operation of "rip".
Vector Space
A vector is a quantity having direction as well as magnitude, determining the position of one point in space relative to another.
A vector space consists of vectors, together with an associative and commutative operation of addition of vectors, and an associative and distributive operation of multiplication of vectors by scalars.
Suppose that V is a set upon which we have defined two operations:
(1) Vector addition, which combines two elements of V and is denoted by “+”,
and
(2) Scalar multiplication, which combines a complex number with an element
of V and is denoted by juxtaposition. Then V, along with the two operations, is
a vector space over C if the following ten properties hold.
AC Additive Closure
If u, v ∈ V, then u + v ∈ V.
SC Scalar Closure
If α ∈ C and u ∈ V, then α u ∈ V.
C Commutativity
If u, v ∈ V then u + v = v + u.
AA Additive Associativity
If u, v, w ∈ V, then u+(v + w) = ( u + v) + w.
Z Zero Vector
There is a vector, 0, called the zero vector, such that u + 0 = u for all u ∈ V.
AI Additive Inverses
If u ∈ V, then there exists a vector −u ∈ V so that u + (−u) = 0.
SMA Scalar Multiplication Associativity
If α, β ∈ C and u ∈ V, then α(βu)=(αβ)u.
DVA Distributivity across Vector Addition
If α ∈ C and u, v ∈ V, then α(u + v)=αu+αv.
DSA Distributivity across Scalar Addition
If α, β ∈ C and u ∈ V, then (α + β)u=αu+βu.
O One
If u ∈ V, then 1u=u.
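These ten properties can be spot-checked numerically for the familiar space $\mathbb{R}^3$. Below is a minimal sketch (assuming NumPy is available); it samples a few fixed vectors and scalars, so it illustrates the axioms rather than proving them.

```python
# Spot-checking the vector space axioms on R^3 with NumPy arrays.
# This samples specific vectors and scalars; it is an illustration, not a proof.
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([-4.0, 0.5, 2.0])
w = np.array([0.0, 1.0, -1.0])
alpha, beta = 2.0, -3.0

assert np.allclose(u + v, v + u)                              # C:   commutativity
assert np.allclose(u + (v + w), (u + v) + w)                  # AA:  additive associativity
assert np.allclose(u + np.zeros(3), u)                        # Z:   zero vector
assert np.allclose(u + (-u), np.zeros(3))                     # AI:  additive inverses
assert np.allclose(alpha * (beta * u), (alpha * beta) * u)    # SMA: scalar mult. associativity
assert np.allclose(alpha * (u + v), alpha * u + alpha * v)    # DVA: distributivity (vector add)
assert np.allclose((alpha + beta) * u, alpha * u + beta * u)  # DSA: distributivity (scalar add)
assert np.allclose(1.0 * u, u)                                # O:   one
print("All sampled axioms hold.")
```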
Inner Product
The inner product of two vectors $X, Y \in \mathbb{R}^n$ is the scalar $(X, Y) = X^T Y = \sum_{i=1}^{n} x_i y_i$. The norm (magnitude) of $X$ is $\|X\| = \sqrt{X^T X}$.
Orthogonality
In $\mathbb{R}^n$, two vectors $u$ and $v$ are perpendicular if and only if the distance between $u$ and $v$ is the same as the distance between $u$ and $-v$.
Two vectors $X$ and $Y$ are orthogonal if they satisfy the condition $X^T Y = 0$, that is, if $(X, Y) = 0$.
The product of the squares of the magnitudes of two vectors can never be less than the square of their inner product. This is stated in the form of the Cauchy-Schwarz inequality:
$(X^T Y)^2 \le \|X\|^2 \|Y\|^2$
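The inequality is easy to check empirically. The sketch below (assuming NumPy) draws random vector pairs and verifies that the bound holds for each; the dimension and sample count are arbitrary choices.

```python
# Empirically verifying the Cauchy-Schwarz inequality
# (X^T Y)^2 <= ||X||^2 ||Y||^2 on randomly drawn vector pairs.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=5)
    y = rng.normal(size=5)
    lhs = np.dot(x, y) ** 2
    rhs = np.dot(x, x) * np.dot(y, y)   # ||x||^2 * ||y||^2
    assert lhs <= rhs + 1e-12           # small tolerance for rounding error
print("Cauchy-Schwarz held on all sampled pairs.")
```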
Functions and Continuity
Continuous Function
A function is continuous when its graph is a single unbroken curve.
That is not a formal definition, but it helps you understand the idea.
For example, $f(x) = x^2$ is continuous at every point of $\mathbb{R}$: its graph is a single unbroken curve.
Homeomorphisms
In mathematics, a homeomorphism is a correspondence between two figures, surfaces, or other geometrical objects, defined by a one-to-one mapping that is continuous in both directions.
In general, a function $f$ is a one-to-one (or injective) function if distinct elements of the domain map to distinct elements of the range.
Alternatively, for a one-to-one map, we say that $f(x) = f(y) \implies x = y$.
Monotonicity
The monotonicity of a function tells us if the function is increasing or decreasing.
A function is increasing when its graph rises from left to right.
In technical terms, a function is increasing on an interval $I$ if for any $x_1$ and $x_2$ in $I$, $x_1 < x_2$ implies $f(x_1) < f(x_2)$.
Matrices
Matrices play a crucial role in neural network theory.
We start by recalling that a matrix is a rectangular array of real or complex numbers, and
in the present context we are interested only in matrices with real valued entries.
Matrices that have an equal number of rows and columns are called square matrices.
Matrix Multiplication
A matrix can (and should) be looked upon as a horizontal stack of column vectors, or a
vertical stack of row vectors.
Multiplication can equivalently be carried out by rows: the $i$th row of the product $\mathbf{AB}$ is the $i$th row of $\mathbf{A}$ multiplied by $\mathbf{B}$.
Linear Transformation
A linear transformation $A$ (or just a linear map) defined on a space such as $\mathbb{R}^n$ satisfies two important properties:
$A(X_1 + X_2) = A(X_1) + A(X_2) \quad \forall X_1, X_2 \in \mathbb{R}^n$
$A(\alpha X) = \alpha A(X) \quad \forall X \in \mathbb{R}^n, \alpha \in \mathbb{R}$
Linear transformations generally scale and rotate points in $\mathbb{R}^n$.
Eigenvalues and eigenvectors are important in linear differential equations, where one wants to find a rate of change or to maintain a relationship between two variables. They are also used to reduce noise in data, and they can help us improve efficiency in computationally intensive tasks.
An eigenvector $X$ of a matrix $A$, with associated eigenvalue $\lambda$, satisfies:
$AX = \lambda X$
$AX - \lambda X = 0$
$(A - \lambda I)X = 0$
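As a quick illustration, the sketch below (assuming NumPy) computes the eigenpairs of a small symmetric matrix and confirms that each satisfies $AX = \lambda X$; the matrix itself is an arbitrary example.

```python
# Computing eigenvalues and eigenvectors numerically, then checking
# the defining relation A X = lambda X for each eigenpair.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
for lam, x in zip(eigenvalues, eigenvectors.T):   # columns are eigenvectors
    assert np.allclose(A @ x, lam * x)            # equivalently (A - lambda I) X = 0
print("eigenvalues:", eigenvalues)                # 3 and 1 for this matrix
```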
Expectations and Covariance Matrices
The average value of a random variable $X$ is called the expected value, or the expectation, written $E[X]$.
The expectation of the vector $X = (x_1, \ldots, x_n)^T$ is a vector $\bar{X}$ whose elements $(\bar{x}_1, \ldots, \bar{x}_n)$ are the expectations of the components.
For a discrete random variable, the expectation is calculated as the probability-weighted sum of the values that can be drawn:
$E[X] = x_1 p_1 + x_2 p_2 + \cdots + x_n p_n$
where the probabilities $p_i$ are given by the distribution; for a continuous variable the sum becomes an integral against the probability density function $p(X)$.
The covariance matrix $K$ associated with a real random vector $X$ is the expected value of the outer product $(X - \bar{X})(X - \bar{X})^T$:
$K = E[(X - \bar{X})(X - \bar{X})^T]$
The covariance matrix is also commonly denoted by the uppercase Greek letter $\Sigma$; the covariance for each pair of random variables is calculated as above.
The covariance matrix $K$ is related to the correlation matrix $R = E[XX^T]$ by:
$K = R - \bar{X}\bar{X}^T$
$R = K + \bar{X}\bar{X}^T$
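The sketch below (assuming NumPy) estimates $\bar{X}$, $K$, and $R$ from samples, with sample averages standing in for expectations, and checks the relation $K = R - \bar{X}\bar{X}^T$; the data are arbitrary draws.

```python
# Estimating the expectation, covariance matrix K, and correlation
# matrix R from samples, and checking K = R - Xbar Xbar^T.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 3))     # 10000 samples of a 3-dimensional vector
X_bar = X.mean(axis=0)              # sample estimate of E[X]

D = X - X_bar                       # deviations from the mean
K = (D.T @ D) / len(X)              # E[(X - Xbar)(X - Xbar)^T]
R = (X.T @ X) / len(X)              # E[X X^T]

assert np.allclose(K, R - np.outer(X_bar, X_bar))
print(K)
```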
Learning Process
The property that is of primary significance for a neural network is the ability of the network to learn from its environment and to improve its performance through learning. The improvement in performance takes place over time in accordance with some prescribed measure.
A neural network learns about its environment through an interactive process of adjustments applied to its synaptic weights and bias levels. Ideally, the network becomes more knowledgeable about its environment after each iteration of the learning process.
We define learning in the context of neural networks as:
Learning is a process by which the free parameters of a neural network are adapted through
a process of stimulation by the environment in which the network is embedded. The type
of learning is determined by the manner in which the parameter changes take place.
This definition of the learning process implies the following sequence of events:
The neural network is stimulated by an environment.
The neural network undergoes changes in its free parameters as a result of this
stimulation.
The neural network responds in a new way to the environment because of the
changes that have occurred in its internal structure.
A prescribed set of well-defined rules for the solution of a learning problem is called a learning algorithm. We have a "kit of tools" represented by a diverse variety of learning algorithms, each of which offers advantages of its own. Basically, learning algorithms differ from each other in the way in which the adjustment to a synaptic weight of a neuron is formulated.
Supervised learning
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher. Basically, supervised learning is learning in which we teach or train the machine using data that is well labeled, meaning each example is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data), and the supervised learning algorithm, having analysed the training data (the set of training examples), produces a correct outcome from the labeled data.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences, without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training is given to the machine. The machine is therefore restricted to finding the hidden structure in unlabeled data by itself.
Vectors can be thought of as an array of numbers in which the order of the numbers also matters (i.e., they have magnitude and direction).
The individual numbers are denoted by writing the vector name with a subscript indicating the position of the individual member. For example, $x_1$ is the first number, $x_2$ is the second number, and so on. If we want to write the vector with its members explicit, we enclose the individual elements in square brackets, as
$X = [x_1, x_2, \ldots, x_n]$
Addition of two vectors is performed by adding the corresponding elements of each vector. For example,
$[1, 2, 3] + [4, 5, 6] = [5, 7, 9]$
When a vector is multiplied by a scalar, each element gets multiplied by the scalar. For example,
$2 \cdot [1, 2, 3] = [2, 4, 6]$
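A minimal sketch of these two operations with NumPy (an assumed dependency; any array library would do):

```python
# Elementwise vector addition and scalar multiplication.
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(x + y)   # [5 7 9]: corresponding elements are added
print(2 * x)   # [2 4 6]: each element is scaled by the scalar
```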
Matrix
A matrix is a two-dimensional array of numbers. Generally, matrices are represented by an uppercase bold letter such as $\mathbf{A}$.
Since a matrix is two-dimensional, each element is represented by a lowercase letter with two indices, such as $a_{ij}$, where $i$ represents the row and $j$ represents the column.
A representation of an $m \times n$ matrix is shown below:
$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$
Matrix multiplication
The product $\mathbf{C}$ of two matrices $\mathbf{A}$ and $\mathbf{B}$ has elements $c_{ij}$ given by the dot product of the $i$th row of $\mathbf{A}$ with the $j$th column of $\mathbf{B}$:
$c_{ij} = \sum_k a_{ik} b_{kj}$
Diagonal Matrix
A square matrix whose non-diagonal elements are zero is called a diagonal matrix. For example:
$\begin{bmatrix} 2 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 7 \end{bmatrix}$
Identity Matrix
A square matrix whose non-diagonal elements are zero and whose diagonal elements are all 1 is called an identity matrix. For example:
$\mathbf{I} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$
Transpose of a Matrix
The transpose of a matrix $X$, denoted by $X^T$, is the result of flipping the rows and columns of $X$.
If $X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, then $X^T = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$.
Inverse of a Matrix
The inverse of a square matrix $X$, denoted by $X^{-1}$, is a special matrix with the property that the product of the two is the identity matrix:
$X X^{-1} = X^{-1} X = I$
Here $I$ is an identity matrix, and $X$ and $X^{-1}$ are inverses of each other. An instance of an inverse pair is illustrated below:
$X = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}, \quad X^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix}$
These two matrices are inverses of each other: when we multiply them, we get an identity matrix.
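The sketch below (assuming NumPy) reproduces this example numerically, computing the inverse and confirming that both products give the identity; the transpose is printed for good measure.

```python
# Inverse, identity, and transpose for the 2x2 example above.
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 1.0]])

X_inv = np.linalg.inv(X)            # [[ 1. -1.], [-1.  2.]]
I = np.eye(2)                       # 2x2 identity matrix

assert np.allclose(X @ X_inv, I)    # X X^-1 = I
assert np.allclose(X_inv @ X, I)    # X^-1 X = I
print(X.T)                          # transpose flips rows and columns
```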
Solution
(The problem: a man must ferry a lion, a goat, and a bundle of grass across a river in a boat that carries only him and one item at a time; left unattended, the lion would eat the goat, and the goat would eat the grass.)
The man first takes the goat in the boat to the other side of the river and comes back alone in the boat.
The man then takes the lion in the boat, leaves the lion at the other side of the river, and comes back with the goat.
The man next takes the grass in the boat, leaves the grass at the other side of the river with the lion, and comes back alone.
The man finally crosses with the goat in the boat.
4. Optimization Concept
Optimization is already a part of our lives, even if we are not aware of it. For example, you wake up early in the morning and try to catch a bus or train on time. Your objective is to catch the bus or train on time while making proper use of your limited time in the morning.
You go shopping and decide what to purchase within your budget constraint. Here, you try to maximize your satisfaction within your budget and within a time constraint.
There are many such examples in our lives in which we optimize, maximizing or minimizing our objectives subject to other constraints such as time, cost, etc.
Optimization methods are applied in many fields: economics, management, planning,
logistics, robotics, optimal design, engineering, signal processing, etc.
Phases
The objective of the first phase is the acquisition of information, that is, identifying the parameters that affect the system. These could be profit, time, potential energy, or any other quantity or combination of quantities.
The second, very important, phase is the development of a model to integrate
all the parameters identified.
The third phase focuses on the evaluation and analysis of the performance of the system.
Finally, an optimization phase is regularly used to improve system performance.
Any optimization problem has three basic ingredients (Arora 2003):
The optimization variables, also called the design variables.
The cost function, also called the objective function.
The constraints, expressed as equalities or inequalities.
Error-Correction Learning
Error-correction learning is used to identify and reduce the error in the learning process.
The notation used is as follows:
The output signal of neuron k is denoted by $y_k(n)$.
The desired output signal of neuron k is denoted by $d_k(n)$.
The error signal is denoted by $e_k(n) = d_k(n) - y_k(n)$.
If $e_k(n)$ is equal to 0 then there is no error; if there is a difference, we need to minimize the cost function.
This objective is achieved by minimizing a cost function, or index of performance, $E(n)$, defined in terms of the error signal $e_k(n)$ as:
$E(n) = \tfrac{1}{2}\, e_k^2(n)$
The step-by-step adjustments to the synaptic weights of neuron k are continued until the system reaches a steady state. At that point the learning process is terminated.
Minimization of the cost function leads to a learning rule commonly referred to as the delta rule or Widrow-Hoff rule.
According to the delta rule, the adjustment $\Delta w_{kj}(n)$ applied to the synaptic weight $w_{kj}$ at time step $n$ is defined by
$\Delta w_{kj}(n) = \eta\, e_k(n)\, x_j(n)$
where $\eta$ is a positive constant that determines the rate of learning as we proceed from one step in the learning process to another. It is therefore natural that we refer to $\eta$ as the learning-rate parameter.
Having computed the synaptic adjustment $\Delta w_{kj}(n)$, the updated value of the synaptic weight $w_{kj}$ is determined by
$w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)$
In effect, $w_{kj}(n)$ and $w_{kj}(n+1)$ may be viewed as the old and new values of the synaptic weight $w_{kj}$, respectively. In computational terms we may also write
$w_{kj}(n) = z^{-1}[w_{kj}(n+1)]$
where $z^{-1}$ is the unit-delay operator; that is, $z^{-1}$ represents a storage element.
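The following sketch performs one delta-rule update for a single linear neuron. It assumes NumPy, and the learning rate and toy signal values are illustrative choices, not values from the text.

```python
# One delta-rule (Widrow-Hoff) update for a single linear neuron k.
import numpy as np

eta = 0.1                           # learning-rate parameter (illustrative)
w_k = np.array([0.5, -0.3, 0.2])    # synaptic weights w_kj at step n
x_n = np.array([1.0, 0.5, -1.0])    # input signals x_j(n)
d_k = 1.0                           # desired response d_k(n)

y_k = np.dot(w_k, x_n)              # output signal y_k(n)
e_k = d_k - y_k                     # error signal e_k(n) = d_k(n) - y_k(n)
E_n = 0.5 * e_k ** 2                # cost function E(n) = 1/2 e_k^2(n)

delta_w = eta * e_k * x_n           # delta rule: eta * e_k(n) * x_j(n)
w_k = w_k + delta_w                 # w_kj(n+1) = w_kj(n) + delta_w_kj(n)
print(w_k, E_n)
```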
6. Memory-Based Learning
In memory-based learning, all (or most) of the past experiences are explicitly stored in a large memory of correctly classified input-output examples $\{(x_i, d_i)\}_{i=1}^{N}$, where $x_i$ denotes an input vector and $d_i$ denotes the corresponding desired response. Without loss of generality, we have restricted the desired response to be a scalar.
For example, in a binary pattern classification problem there are two classes of hypotheses, denoted by $E_1$ and $E_2$, to be considered. In this example, the desired response $d_i$ takes the value 0 (or $-1$) for class $E_1$ and the value 1 for class $E_2$.
When classification of a test vector $x_{test}$ (not seen before) is required, the algorithm responds by retrieving and analyzing the training data in a "local neighbourhood" of $x_{test}$.
All memory-based learning algorithms involve two essential ingredients:
a. Criterion used for defining the local neighbourhood of the test vector xtest.
b. Learning rule applied to the training examples in the local neighbourhood of xtest.
The algorithms differ from each other in the way in which these two ingredients are defined.
In a simple yet effective type of memory-based learning known as the nearest neighbour rule, the local neighbourhood is defined as the training example that lies in the immediate neighbourhood of the test vector $x_{test}$. In particular, the vector $x'_N \in \{x_1, \ldots, x_N\}$ is said to be the nearest neighbour of $x_{test}$ if
$\min_i d(x_i, x_{test}) = d(x'_N, x_{test})$
where $d(x_i, x_{test})$ is the Euclidean distance between the vectors $x_i$ and $x_{test}$.
The class associated with the minimum distance, that is, with the vector $x'_N$, is reported as the classification of $x_{test}$. This rule is independent of the underlying distribution responsible for generating the training examples.
Cover and Hart (1967) have formally studied the nearest neighbour rule as a tool for pattern
classification.
The analysis is based on two assumptions:
a. The classified examples (xi, di) are independently and identically distributed (iid),
according to the joint probability distribution of the example (x, d).
b. The sample size N is infinitely large.
Under these two assumptions, it is shown that the probability of classification error incurred by the nearest neighbour rule is bounded above by twice the Bayes probability of error, that is, the minimum probability of error over all decision rules.
In this sense, it may be said that half the classification information in a training set of
infinite size is contained in the nearest neighbour, which is a surprising result.
A variant of the nearest neighbour classifier is the k-nearest neighbour classifier, which proceeds as follows:
a. Identify the k classified patterns that lie nearest to the test vector $x_{test}$, for some integer k.
b. Assign $x_{test}$ to the class (hypothesis) that is most frequently represented among the k nearest neighbours of $x_{test}$ (i.e., use a majority vote to make the classification).
Thus, the k-nearest neighbour classifier acts like an averaging device.
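A compact sketch of both rules follows (assuming NumPy): with k = 1 it reduces to the nearest neighbour rule, and for larger k it takes a majority vote. The toy data and class labels are illustrative.

```python
# Nearest-neighbour and k-nearest-neighbour classification with
# Euclidean distance and a majority vote over the k nearest examples.
import numpy as np
from collections import Counter

def knn_classify(X_train, d_train, x_test, k=1):
    dists = np.linalg.norm(X_train - x_test, axis=1)   # d(x_i, x_test)
    nearest = np.argsort(dists)[:k]                    # k nearest neighbours
    votes = Counter(d_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                  # majority class

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
d_train = np.array([0, 0, 1, 1])                       # classes E1 = 0, E2 = 1

print(knn_classify(X_train, d_train, np.array([0.2, 0.1]), k=1))  # -> 0
print(knn_classify(X_train, d_train, np.array([0.8, 0.9]), k=3))  # -> 1
```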
Hebbian Learning
Hebb's postulate of learning is the oldest and the most famous of all learning rules; it is named in honor of the neuropsychologist Hebb (1949).
When an axon of cell A is near enough to excite a cell B and repeatedly or persistently
takes part in firing it, some growth process or metabolic changes take place in one or both
cells such that A’s efficiency as one of the cells firing B, is increased.
Hebb proposed this change as a basis of associative learning (at the cellular level), which
would result in an enduring modification in the activity pattern of a spatially distributed
“assembly of nerve cells”.
This statement is made in a neurobiological context. We may expand and rephrase it as a
two-part rule:
a. If two neurons on either side of a synapse are activated simultaneously (i.e.,
synchronously), then the strength of that synapse is selectively increased.
b. If two neurons on either side of a synapse are activated asynchronously, then that
synapse is selectively weakened or eliminated.
Such a synapse is called Hebbian synapse. More precisely, we define a Hebbian synapse
as a synapse which uses a time-dependent, highly local, and strongly interactive
mechanism to increase synaptic efficiency as a function of the correlation between the
presynaptic and postsynaptic activities.
Four key properties characterise a Hebbian synapse:
i. Time-Dependent Mechanism: the modifications in a Hebbian synapse depend on the exact times of occurrence of the presynaptic and postsynaptic signals.
ii. Local Mechanism: the modification uses only information that is locally available at the synapse.
iii. Interactive Mechanism: the change depends on the interaction of the signals on both sides of the synapse.
iv. Conjunctional (Correlational) Mechanism: the modification is driven by the co-occurrence, or correlation, of the presynaptic and postsynaptic activities.
Mathematical Modelling
Consider a synaptic weight $w_{kj}$ of neuron k with presynaptic and postsynaptic signals denoted by $x_j$ and $y_k$ respectively.
The adjustment applied to $w_{kj}$ at time step n is expressed as
$\Delta w_{kj}(n) = F(y_k(n), x_j(n))$
where $F(\cdot, \cdot)$ is a function of both the postsynaptic and presynaptic signals.
Hebb’s Hypothesis
The change in weight is expressed as
$\Delta w_{kj}(n) = \eta\, y_k(n)\, x_j(n)$
where $\eta$ is the learning rate. This is also referred to as the activity product rule.
Covariance Hypothesis
The adjustment applied to the synaptic weight is defined as
$\Delta w_{kj}(n) = \eta\, (x_j - \bar{x})(y_k - \bar{y})$
where $\bar{x}$ and $\bar{y}$ are the time-averaged values of the presynaptic and postsynaptic signals.
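The sketch below computes both adjustments for a single synapse. The learning rate, signal values, and time averages are illustrative numbers, not values from the text.

```python
# One-step weight adjustments under Hebb's activity product rule and
# under the covariance hypothesis, for a single synapse w_kj.
eta = 0.05                   # learning rate (illustrative)
x_j, y_k = 0.8, 0.6          # presynaptic and postsynaptic signals
x_bar, y_bar = 0.5, 0.5      # time-averaged signals (illustrative)

dw_hebb = eta * y_k * x_j                        # activity product rule
dw_cov = eta * (x_j - x_bar) * (y_k - y_bar)     # covariance hypothesis
print(dw_hebb, dw_cov)       # note dw_cov can be negative: weakening
```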
Competitive Learning
For a neuron k to be the winning neuron, its induced local field $v_k$ for a specified input pattern $x$ must be the largest among all the neurons in the network. The output signal $y_k$ of the winning neuron k is set equal to one; the output signals of all the neurons that lose the competition are set equal to zero.
We thus write:
$y_k = \begin{cases} 1 & \text{if } v_k > v_j \text{ for all } j,\ j \ne k \\ 0 & \text{otherwise} \end{cases}$
where the induced local field $v_k$ represents the combined action of all the forward and feedback inputs to neuron k.
Let $\omega_{kj}$ denote the synaptic weight connecting input node j to neuron k. Suppose that each neuron is allotted a fixed amount of synaptic weight (i.e., all synaptic weights are positive), which is distributed among its input nodes; that is,
$\sum_j \omega_{kj} = 1 \quad \text{for all } k$
A neuron then learns by shifting synaptic weights from its inactive to active input nodes.
If a neuron does not respond to a particular input pattern, no learning takes place in that
neuron.
If a particular neuron wins the competition, each input node of that neuron relinquishes (voluntarily gives up) some proportion of its synaptic weight, and the weight relinquished is then distributed equally among the active input nodes. According to the standard competitive learning rule, the change $\Delta\omega_{kj}$ applied to synaptic weight $\omega_{kj}$ is defined by
$\Delta\omega_{kj} = \begin{cases} \eta\,(x_j - \omega_{kj}) & \text{if neuron } k \text{ wins the competition} \\ 0 & \text{if neuron } k \text{ loses the competition} \end{cases}$
where $\eta$ is the learning-rate parameter and $x_j$ is the signal at the jth input node.
This rule has the overall effect of moving the synaptic weight vector $\omega_k$ of the winning neuron k towards the input pattern $x$.
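To close, here is a minimal sketch of one competitive-learning step (assuming NumPy): the neuron with the largest induced local field wins and moves its weight vector toward the input pattern, while the losers are left unchanged. The weights and input are illustrative.

```python
# One step of standard competitive learning: winner-take-all, then the
# winner's weight vector moves toward the input pattern.
import numpy as np

eta = 0.1                        # learning-rate parameter (illustrative)
W = np.array([[0.6, 0.4],        # row k holds the weights w_kj of neuron k
              [0.3, 0.7],
              [0.5, 0.5]])
x = np.array([0.9, 0.1])         # input pattern

v = W @ x                        # induced local fields v_k
winner = int(np.argmax(v))       # y_k = 1 for the winner, 0 for the rest

W[winner] += eta * (x - W[winner])   # delta_w_kj = eta (x_j - w_kj), winner only
print(winner, W[winner])
```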