Day 4
Morten Hjorth-Jensen^{1,2}
1 Department of Physics and Center for Computing in Science Education, University of Oslo, Norway
2 Department of Physics and Astronomy and Facility for Rare Ion Beams and National Superconducting Cyclotron Laboratory, Michigan State University, USA
Neural networks
Artificial neural networks are computational systems that can learn to perform
tasks by considering examples, generally without being programmed with any
task-specific rules. They are loosely inspired by biological systems, wherein neurons
interact by sending signals in the form of mathematical functions between layers.
All layers can contain an arbitrary number of neurons, and each connection is
represented by a weight variable.
Artificial neurons
The field of artificial neural networks has a long history of development, and
is closely connected with the advancement of computer science and computers
in general. A model of artificial neurons was first developed by McCulloch and
Pitts in 1943 to study signal processing in the brain and has later been refined
by others. The general idea is to mimic neural networks in the human brain,
which is composed of billions of neurons that communicate with each other by
sending electrical signals. Each neuron accumulates its incoming signals, which
must exceed an activation threshold to yield an output. If the threshold is not
overcome, the neuron remains inactive, i.e. has zero output.
This behaviour has inspired a simple mathematical model for an artificial
neuron.
$$y = f\left(\sum_{i=1}^n w_i x_i\right) = f(u) \qquad (1)$$
2. neural networks designed specifically for image processing, the most promi-
nent example of this class being Convolutional Neural Networks (CNNs),
3. neural networks for sequential data such as Recurrent Neural Networks
(RNNs), and
4. neural networks for unsupervised learning such as Deep Boltzmann Ma-
chines.
In the natural sciences, DNNs and CNNs have already found numerous applications.
In statistical physics, they have been applied to detect phase transitions in 2D
Ising and Potts models, lattice gauge theories, and different phases of polymers,
or to solve the Navier-Stokes equation in weather forecasting. Deep learning has
also found interesting applications in quantum physics. Various quantum phase
transitions can be detected and studied using DNNs and CNNs, including topological
phases and even non-equilibrium many-body localization. Representing quantum
states as DNNs and performing quantum state tomography are among the impressive
achievements that reveal the potential of DNNs to facilitate the study of quantum
systems.
In quantum information theory, it has been shown that one can perform gate
decompositions with the help of neural networks.
The applications are not limited to the natural sciences. There is a plethora
of applications in essentially all disciplines, from the humanities to life science
and medicine.
Feed-forward neural networks
The feed-forward neural network (FFNN) was the first and simplest type of
ANNs that were devised. In this network, the information moves in only one
direction: forward through the layers.
Nodes are represented by circles, while the arrows display the connections
between the nodes, including the direction of information flow. Additionally,
each arrow corresponds to a weight variable (figure to come). We observe that
each node in a layer is connected to all nodes in the subsequent layer, making
this a so-called fully-connected FFNN.
and a linear output layer (”linear” here means that each node in the output layer
has a linear activation function). The layers are normally fully-connected and
there are no cycles, thus RBFs can be viewed as a type of fully-connected FFNN.
They are, however, usually treated as a separate type of NN due to their unusual
activation functions.
Multilayer perceptrons
One often uses so-called fully-connected feed-forward neural networks with three
or more layers (an input layer, one or more hidden layers and an output layer)
consisting of neurons that have non-linear activation functions.
Such networks are often called multilayer perceptrons (MLPs).
Mathematical model
The output y is produced via the activation function f
$$y = f\left(\sum_{i=1}^n w_i x_i + b_i\right) = f(z),$$
This function receives $x_i$ as inputs. Here the activation is $z = \sum_{i=1}^n w_i x_i + b_i$.
In an FFNN of such neurons, the inputs xi are the outputs of the neurons in the
preceding layer. Furthermore, an MLP is fully-connected, which means that each
neuron receives a weighted sum of the outputs of all neurons in the previous
layer.
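As an illustration only (a minimal sketch, not part of the original notes), the single-neuron model above can be written in a few lines of NumPy; the inputs, weights and bias below are arbitrary example values and the sigmoid plays the role of the activation function $f$.

import numpy as np

def sigmoid(z):
    # a common choice for the activation function f
    return 1.0 / (1.0 + np.exp(-z))

# arbitrary example values for the inputs x_i, the weights w_i and the bias b
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.2, 0.4, -0.1])
b = 0.1

# weighted sum z = sum_i w_i x_i + b, followed by the activation y = f(z)
z = np.dot(w, x) + b
y = sigmoid(z)
print("z =", z, " y =", y)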
Mathematical model
First, for each node i in the first hidden layer, we calculate a weighted sum zi1 of
the input coordinates xj ,
$$z_i^1 = \sum_{j=1}^{M} w_{ij}^1 x_j + b_i^1 \qquad (2)$$
Here bi is the so-called bias which is normally needed in case of zero activation
weights or inputs. How to fix the biases and the weights will be discussed below.
The value of $z_i^1$ is the argument to the activation function $f_i$ of each node $i$. The
variable $M$ stands for all possible inputs to a given node $i$ in the first layer. We
define the output $y_i^1$ of all neurons in layer 1 as
$$y_i^1 = f(z_i^1) = f\left(\sum_{j=1}^{M} w_{ij}^1 x_j + b_i^1\right) \qquad (3)$$
where we assume that all nodes in the same layer have identical activation
functions, hence the notation f . In general, we could assume in the more general
case that different layers have different activation functions. In this case we
would identify these functions with a superscript l for the l-th layer,
$$y_i^l = f^l(u_i^l) = f^l\left(\sum_{j=1}^{N_{l-1}} w_{ij}^l y_j^{l-1} + b_i^l\right) \qquad (4)$$
where Nl is the number of nodes in layer l. When the output of all the nodes
in the first hidden layer are computed, the values of the subsequent layer can be
calculated and so forth until the output is obtained.
Mathematical model
The output of neuron i in layer 2 is thus,
$$y_i^2 = f^2\left(\sum_{j=1}^{N} w_{ij}^2 y_j^1 + b_i^2\right) \qquad (5)$$
$$= f^2\left(\sum_{j=1}^{N} w_{ij}^2 f^1\left(\sum_{k=1}^{M} w_{jk}^1 x_k + b_j^1\right) + b_i^2\right) \qquad (6)$$
where we have substituted $y_j^1$ with its expression in terms of the inputs $x_k$. Finally, the ANN output reads
$$y_i^3 = f^3\left(\sum_{j=1}^{N} w_{ij}^3 y_j^2 + b_i^3\right) \qquad (7)$$
$$= f^3\left(\sum_j w_{ij}^3 f^2\left(\sum_k w_{jk}^2 f^1\left(\sum_m w_{km}^1 x_m + b_k^1\right) + b_j^2\right) + b_i^3\right) \qquad (8)$$
Mathematical model
We can generalize this expression to an MLP with l hidden layers. The complete
functional form is,
$$y_i^{l+1} = f^{l+1}\left(\sum_{j=1}^{N_l} w_{ij}^{l+1} f^{l}\left(\sum_{k=1}^{N_{l-1}} w_{jk}^{l} \cdots f^1\left(\sum_{n=1}^{N_0} w_{mn}^1 x_n + b_m^1\right) \cdots + b_k^{l}\right) + b_i^{l+1}\right) \qquad (9)$$
Mathematical model
This confirms that an MLP, despite its quite convoluted mathematical form, is
nothing more than an analytic function, specifically a mapping of real-valued
vectors $\hat{x} \in \mathbb{R}^n \rightarrow \hat{y} \in \mathbb{R}^m$.
Furthermore, the flexibility and universality of an MLP can be illustrated
by realizing that the expression is essentially a nested sum of scaled activation
functions of the form
This is not just a convenient and compact notation, but also a useful and
intuitive way to think about MLPs: The output is calculated by a series of
matrix-vector multiplications and vector additions that are used as input to the
activation functions. For each operation $\hat{W}^l \hat{y}^{l-1}$ we move forward one layer.
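A minimal sketch of this picture (not from the original notes; the layer sizes and random weights are arbitrary): each forward step is one matrix-vector multiplication, one vector addition and an element-wise activation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# arbitrary architecture: 4 inputs, two hidden layers with 5 nodes, 3 outputs
layer_sizes = [4, 5, 5, 3]
rng = np.random.default_rng(0)

# one weight matrix W^l and one bias vector b^l per layer
weights = [rng.normal(size=(m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.normal(size=m) for m in layer_sizes[1:]]

# for each operation W^l y^{l-1} + b^l we move forward one layer
y = rng.normal(size=layer_sizes[0])   # the input vector x
for W, b in zip(weights, biases):
    y = sigmoid(W @ y + b)
print(y)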
• Non-constant
• Bounded
• Monotonically-increasing
• Continuous
$$f(x) = \tanh(x)$$
Relevance. The sigmoid function is more biologically plausible because the
output of inactive neurons is zero. Such activation functions are called one-sided.
However, it has been shown that the hyperbolic tangent performs better than
the sigmoid for training MLPs, and the rectified linear unit (ReLU), discussed
below, has become the most popular activation function for deep neural networks.
"""The sigmoid function (or the logistic curve) is a
function that takes any real number, z, and outputs a number (0,1).
It is useful in neural networks for assigning weights on a relative scale.
The value z is the weighted sum of parameters involved in the learning algorithm."""
import numpy
import matplotlib.pyplot as plt
import math as mt
z = numpy.arange(-5, 5, .1)
sigma_fn = numpy.vectorize(lambda z: 1/(1+numpy.exp(-z)))
sigma = sigma_fn(z)
7
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, sigma)
ax.set_ylim([-0.1, 1.1])
ax.set_xlim([-5,5])
ax.grid(True)
ax.set_xlabel(’z’)
ax.set_title(’sigmoid function’)
plt.show()
"""Step Function"""
z = numpy.arange(-5, 5, .02)
step_fn = numpy.vectorize(lambda z: 1.0 if z >= 0.0 else 0.0)
step = step_fn(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, step)
ax.set_ylim([-0.5, 1.5])
ax.set_xlim([-5,5])
ax.grid(True)
ax.set_xlabel(’z’)
ax.set_title(’step function’)
plt.show()
"""Sine Function"""
z = numpy.arange(-2*mt.pi, 2*mt.pi, 0.1)
t = numpy.sin(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, t)
ax.set_ylim([-1.0, 1.0])
ax.set_xlim([-2*mt.pi,2*mt.pi])
ax.grid(True)
ax.set_xlabel(’z’)
ax.set_title(’sine function’)
plt.show()
"""Rectified Linear Unit (ReLU)"""
z = numpy.arange(-2, 2, .02)
y = numpy.where(z > 0, z, 0)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, y)
ax.set_ylim([-2.0, 2.0])
ax.set_xlim([-2.0, 2.0])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('Rectified linear unit')
plt.show()
The multilayer perceptron (MLP)
The multilayer perceptron is a very popular, and easy to implement approach,
to deep learning. It consists of
1. A neural network with one or more layers of nodes between the input and
the output nodes.
2. The multilayer network structure, or architecture, or topology, consists of
an input layer, one or more hidden layers, and one output layer.
3. The input nodes pass values to the first hidden layer, its nodes pass the
information on to the second and so on till we reach the output layer.
As a convention, it is normal to call a network with one layer of input units, one
layer of hidden units and one layer of output units a two-layer network. A
network with two layers of hidden units is correspondingly called a three-layer network, and so on.
For an MLP network there are no direct connections between the output
nodes/neurons/units and the input nodes/neurons/units. Hereafter we will refer to
the various entities of a layer as nodes. There are also no connections within a
single layer.
The number of input nodes does not need to equal the number of output
nodes. This applies also to the hidden layers. Each layer may have its own
number of nodes and activation functions.
The hidden layers get their name from the fact that they are not linked to
observables, and, as we will see below when we define the so-called activation $\hat{z}$,
we can think of this as a basis expansion of the original inputs $\hat{x}$. The difference
between neural networks and, say, linear regression is that these
basis functions (which correspond to the weights in the network) are now learned
from data. This results in an important difference between neural networks and
deep learning approaches on one side and methods like logistic regression or
linear regression and their modifications on the other side.
parameters. It is the multilayer feedforward architecture itself which gives neural
networks the potential of being universal approximators.
As our cost function at the output layer we take the quadratic cost
$$C(\hat{W}) = \frac{1}{2}\sum_{i=1}^n \left(y_i - t_i\right)^2,$$
where the $t_i$s are our $n$ targets (the values we want to reproduce), while
the outputs of the network after having propagated all inputs $\hat{x}$ are given by
$y_i$. Below we will demonstrate how the basic equations arising from the back
propagation algorithm can be modified in order to study classification problems
with $K$ classes.
Definitions
With our definition of the targets t̂, the outputs of the network ŷ and the inputs
x̂ we define now the activation zjl of node/neuron/unit j of the l-th layer as a
function of the bias, the weights which add up from the previous layer l − 1 and
the forward passes/outputs âl−1 from the previous layer as
$$z_j^l = \sum_{i=1}^{M_{l-1}} w_{ij}^l a_i^{l-1} + b_j^l,$$
where $b_j^l$ are the biases of layer $l$. Here $M_{l-1}$ represents the total number
of nodes/neurons/units of layer $l-1$. The figure here illustrates this equation.
We can rewrite this in a more compact form as the matrix-vector products we
discussed earlier,
$$\hat{z}^l = \left(\hat{W}^l\right)^T \hat{a}^{l-1} + \hat{b}^l.$$
With the activation values $\hat{z}^l$ we can in turn define the output of layer $l$ as
$\hat{a}^l = f(\hat{z}^l)$, where $f$ is our activation function. In the examples here we will use
the sigmoid function discussed in our logistic regression lectures. We will also
use the same activation function $f$ for all layers and their nodes. It means we
have
$$a_j^l = f(z_j^l) = \frac{1}{1+\exp{(-z_j^l)}}.$$
$$\frac{\partial z_j^l}{\partial w_{ij}^l} = a_i^{l-1},$$
and
$$\frac{\partial z_j^l}{\partial a_i^{l-1}} = w_{ij}^l.$$
With our definition of the activation function we have that (note that this
function depends only on $z_j^l$)
$$\frac{\partial a_j^l}{\partial z_j^l} = a_j^l(1-a_j^l) = f(z_j^l)\left(1-f(z_j^l)\right).$$
$$\frac{\partial C(\hat{W}^L)}{\partial w_{jk}^L} = \left(a_j^L - t_j\right)\frac{\partial a_j^L}{\partial w_{jk}^L},$$
The last partial derivative can easily be computed and reads (by applying the
chain rule)
$$\frac{\partial a_j^L}{\partial w_{jk}^L} = \frac{\partial a_j^L}{\partial z_j^L}\frac{\partial z_j^L}{\partial w_{jk}^L} = a_j^L(1-a_j^L)\,a_k^{L-1}.$$
Defining
$$\delta_j^L = a_j^L(1-a_j^L)\left(a_j^L - t_j\right) = f'(z_j^L)\frac{\partial C}{\partial (a_j^L)},$$
and using the Hadamard product of two vectors we can write this as
$$\hat{\delta}^L = f'(\hat{z}^L)\circ\frac{\partial C}{\partial (\hat{a}^L)}.$$
With the definition of δjL we have a more compact definition of the derivative
of the cost function in terms of the weights, namely
$$\frac{\partial C(\hat{W}^L)}{\partial w_{jk}^L} = \delta_j^L a_k^{L-1}.$$
$$\delta_j^L = \frac{\partial C}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial z_j^L},$$
which can also be interpreted as the partial derivative of the cost function with
respect to the biases bL
j , namely
$$\delta_j^L = \frac{\partial C}{\partial b_j^L}\frac{\partial b_j^L}{\partial z_j^L} = \frac{\partial C}{\partial b_j^L}.$$
That is, the error δjL is exactly equal to the rate of change of the cost function
as a function of the bias.
Bringing it together
We have now three equations that are essential for the computations of the
derivatives of the cost function at the output layer. These equations are needed
to start the algorithm and they are
$$\frac{\partial C(\hat{W}^L)}{\partial w_{jk}^L} = \delta_j^L a_k^{L-1}, \qquad (13)$$
and
$$\delta_j^L = f'(z_j^L)\frac{\partial C}{\partial (a_j^L)}, \qquad (14)$$
and
$$\delta_j^L = \frac{\partial C}{\partial b_j^L}. \qquad (15)$$
We can generalise this and define the error of node $j$ in a general layer $l$ as
$$\delta_j^l = \frac{\partial C}{\partial z_j^l}.$$
We want to express this in terms of the equations for layer l + 1. Using the chain
rule and summing over all k entries we have
$$\delta_j^l = \sum_k \frac{\partial C}{\partial z_k^{l+1}}\frac{\partial z_k^{l+1}}{\partial z_j^l} = \sum_k \delta_k^{l+1}\frac{\partial z_k^{l+1}}{\partial z_j^l},$$
and recalling that
$$z_j^{l+1} = \sum_{i=1}^{M_l} w_{ij}^{l+1} a_i^l + b_j^{l+1},$$
we obtain the back-propagation equation for the error,
$$\delta_j^l = \sum_k \delta_k^{l+1} w_{jk}^{l+1} f'(z_j^l),$$
which allows us to set up the full algorithm.
First, we set up the input data $\hat{x}$ and the activations $\hat{z}_1$ of the input layer and
compute the activation function and the pertinent outputs $\hat{a}^1$.
Secondly, we then perform the feed-forward pass until we reach the output layer,
computing all $\hat{z}_l$ and the corresponding activation-function outputs $\hat{a}^l$ for $l = 2, 3, \dots, L$.
Finally, we update the weights and the biases using gradient descent for each
$l = L-1, L-2, \dots, 2$ according to the rules
$$w_{jk}^l \leftarrow w_{jk}^l - \eta\,\delta_j^l a_k^{l-1},$$
$$b_j^l \leftarrow b_j^l - \eta\frac{\partial C}{\partial b_j^l} = b_j^l - \eta\,\delta_j^l.$$
The parameter η is the learning parameter discussed in connection with
the gradient descent methods. Here it is convenient to use stochastic gradient
descent (see the examples below) with mini-batches with an outer loop that
steps through multiple epochs of training.
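As a sketch only (not the implementation used later in these notes), one gradient-descent step based on the backpropagation equations above, for a single hidden layer, the sigmoid activation and the quadratic cost; the data, layer sizes and learning rate are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
eta = 0.1                                   # learning rate

# toy data: 10 samples, 4 features, 1 output
X = rng.normal(size=(10, 4))
t = rng.integers(0, 2, size=(10, 1)).astype(float)

W1 = rng.normal(size=(4, 6)); b1 = np.zeros(6) + 0.01
W2 = rng.normal(size=(6, 1)); b2 = np.zeros(1) + 0.01

# forward pass
a1 = sigmoid(X @ W1 + b1)                   # hidden activations
a2 = sigmoid(a1 @ W2 + b2)                  # network outputs y

# output error: delta^L = f'(z^L) dC/da^L = a(1-a)(a - t) for the quadratic cost
delta2 = a2 * (1 - a2) * (a2 - t)
# propagate the error backwards: delta^l = (delta^{l+1} W^{l+1,T}) * f'(z^l)
delta1 = (delta2 @ W2.T) * a1 * (1 - a1)

# gradient-descent updates of weights and biases
W2 -= eta * a1.T @ delta2;  b2 -= eta * delta2.sum(axis=0)
W1 -= eta * X.T @ delta1;   b1 -= eta * delta1.sum(axis=0)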
Setting up a Multi-layer perceptron model for classification
We are now going to develop an example based on the MNIST database. This
is a classification problem and we need to use the cross-entropy function we
discussed in connection with logistic regression. The cross-entropy defines our
cost function for classification problems with neural networks.
In binary classification with two classes (0, 1) we define the logistic/sigmoid
function as the probability that a particular input is in class 0 or 1. This is
possible because the logistic function takes any input from the real numbers
and outputs a number between 0 and 1, and can therefore be interpreted as a
probability. It also has other nice properties, such as a derivative that is simple
to calculate.
For an input $a$ from the hidden layer, the probability that the input $\hat{x}$ is in
class 0 or 1 is given below. We let $\hat{\theta}$ represent the unknown weights and biases to be
adjusted by our equations. The variable $x$ represents our activation values $z$.
We have
$$P(y=0 \mid \hat{x},\hat{\theta}) = \frac{1}{1+\exp{(-\hat{x})}},$$
and
$$P(y=1 \mid \hat{x},\hat{\theta}) = 1 - P(y=0 \mid \hat{x},\hat{\theta}),$$
where y ∈ {0, 1} and θ̂ represents the weights and biases of our network.
This last equality means that we can interpret our cost function as a sum
over the loss function for each point in the dataset Li (θ̂). The negative sign is
just so that we can think about our algorithm as minimizing a positive number,
rather than maximizing a negative number.
In multiclass classification it is common to treat each integer label as a so
called one-hot vector:
$$y = 5 \rightarrow \hat{y} = (0,0,0,0,0,1,0,0,0,0), \quad \mathrm{and} \quad y = 1 \rightarrow \hat{y} = (0,1,0,0,0,0,0,0,0,0),$$
i.e. a binary bit string of length $C$, where $C = 10$ is the number of classes in
the MNIST dataset (numbers from 0 to 9).
If x̂i is the i-th input (image), yic refers to the c-th component of the i-th
output vector ŷi . The probability of x̂i being in class c will be given by the
softmax function:
$$P(y_{ic}=1 \mid \hat{x}_i,\hat{\theta}) = \frac{\exp{\left((\hat{a}_i^{\mathrm{hidden}})^T\hat{w}_c\right)}}{\sum_{c'=0}^{C-1}\exp{\left((\hat{a}_i^{\mathrm{hidden}})^T\hat{w}_{c'}\right)}},$$
which reduces to the logistic function in the binary case. The likelihood of
this C-class classifier is now given as:
$$P(\mathcal{D} \mid \hat{\theta}) = \prod_{i=1}^{n}\prod_{c=0}^{C-1}\left[P(y_{ic}=1)\right]^{y_{ic}}.$$
$$p(y_i=1 \mid x_i,\hat{\beta}) = \frac{\exp{(\beta_0+\beta_1 x_i)}}{1+\exp{(\beta_0+\beta_1 x_i)}},$$
and
p(yi = 0|xi , β̂) = 1 − p(yi = 1|xi , β̂).
The parameters β̂ were defined using a minimization method like gradient descent
or Newton-Raphson’s method.
Now we replace $x_i$ with the activation $z_i^l$ for a given layer $l$ and the outputs
as $y_i = a_i^l = f(z_i^l)$, with $z_i^l$ now being a function of the weights $w_{ij}^l$ and biases $b_i^l$.
We have then
$$a_i^l = y_i = \frac{\exp{(z_i^l)}}{1+\exp{(z_i^l)}},$$
with
$$z_i^l = \sum_j w_{ij}^l a_j^{l-1} + b_i^l,$$
where the superscript l − 1 indicates that these are the outputs from layer l − 1.
Our cost function at the final layer $l = L$ is now
$$C(\hat{W}) = -\sum_{i=1}^n \left[t_i \log{a_i^L} + (1-t_i)\log{(1-a_i^L)}\right],$$
where we have defined the targets $t_i$. The derivatives of the cost function with
respect to the output $a_i^L$ are then easily calculated and we get
$$\frac{\partial C(\hat{W})}{\partial a_i^L} = \frac{a_i^L - t_i}{a_i^L(1-a_i^L)}.$$
In case we use another activation function than the logistic one, we need to
evaluate other derivatives. For the softmax function we have
$$\frac{\partial f(z_i^l)}{\partial z_j^l} = f(z_i^l)\left(\delta_{ij} - f(z_j^l)\right).$$
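As a quick numerical sanity check of this derivative (a sketch with arbitrary values, not part of the original notes), one can compare the analytic expression against a finite-difference estimate.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.3, -1.0, 2.0])
f = softmax(z)

# analytic Jacobian: df_i/dz_j = f_i (delta_ij - f_j)
jac = np.diag(f) - np.outer(f, f)

# finite-difference approximation of the same Jacobian
eps = 1e-6
num = np.zeros_like(jac)
for j in range(len(z)):
    dz = np.zeros_like(z); dz[j] = eps
    num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(jac - num)))   # should be very small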
Collect and pre-process data
Here we will be using the MNIST dataset, which is readily available through
the scikit-learn package. You may also find it for example here. The MNIST
(Modified National Institute of Standards and Technology) database is a large
database of handwritten digits that is commonly used for training various image
processing systems. The MNIST dataset consists of 70 000 images of size 28 × 28
pixels, each labeled from 0 to 9. The scikit-learn dataset we will use consists of a
selection of 1797 images of size 8 × 8 collected and processed from this database.
To feed data into a feed-forward neural network we need to represent the
inputs as a design/feature matrix $X = (n_{\mathrm{inputs}}, n_{\mathrm{features}})$. Each row represents
an input, in this case a handwritten digit, and each column represents a feature,
in this case a pixel. The correct answers, also known as labels or targets, are
represented as a 1D array of integers $Y = (n_{\mathrm{inputs}}) = (5, 3, 1, 8, \dots)$.
As an example, say we want to build a neural network using supervised
learning to predict Body-Mass Index (BMI) from measurements of height (in
m) and weight (in kg). If we have measurements of 5 people the design/feature
matrix could be for example:
$$X = \begin{bmatrix} 1.85 & 81 \\ 1.71 & 65 \\ 1.95 & 103 \\ 1.55 & 42 \\ 1.63 & 56 \end{bmatrix},$$
and the targets would be:
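in this case the BMI value for each person, computed as weight/height$^2$ (in kg/m$^2$), which for the example matrix above is approximately $\hat{y} = (23.7,\; 22.2,\; 27.1,\; 17.5,\; 21.1)$.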
# import the scikit-learn digits dataset
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()

# define inputs and labels
inputs = digits.images
labels = digits.target
print("inputs = (n_inputs, pixel_width, pixel_height) = " + str(inputs.shape))
print("labels = (n_inputs) = " + str(labels.shape))
# equivalently in numpy
def train_test_split_numpy(inputs, labels, train_size, test_size):
    n_inputs = len(inputs)
    # shuffle inputs and labels with the *same* permutation so that
    # each image keeps its correct label
    shuffled_indices = np.random.permutation(n_inputs)
    inputs_shuffled = inputs[shuffled_indices]
    labels_shuffled = labels[shuffled_indices]

    train_end = int(n_inputs*train_size)
    X_train, X_test = inputs_shuffled[:train_end], inputs_shuffled[train_end:]
    Y_train, Y_test = labels_shuffled[:train_end], labels_shuffled[train_end:]

    return X_train, X_test, Y_train, Y_test
$$y = f(z),$$
where $f$ is the activation function, $a_i$ represents input from neuron $i$ in the
preceding layer and $w_i$ is the weight of input $i$. The activation of the neurons in
the input layer is just the features (e.g. a pixel value).
The simplest activation function for a neuron is the Heaviside function:
$$f(z) = \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases}$$
A feed-forward neural network with this activation is known as a perceptron.
For a binary classifier (i.e. two classes, 0 or 1, dog or not-dog) we can also use this
in our output layer. This activation can be generalized to k classes (using e.g. the
one-against-all strategy), and we call these architectures multiclass perceptrons.
However, it is now common to use the terms Single Layer Perceptron (SLP)
(1 hidden layer) and Multilayer Perceptron (MLP) (2 or more hidden layers) to
refer to feed-forward neural networks with any activation function.
Typical choices for activation functions include the sigmoid function, hyper-
bolic tangent, and Rectified Linear Unit (ReLU). We will be using the sigmoid
function σ(x):
$$f(x) = \sigma(x) = \frac{1}{1+e^{-x}},$$
which is inspired by probability theory (see logistic regression) and was most
commonly used until about 2011. See the discussion below concerning other
activation functions.
Layers
• Input
Since each input image has 8x8 = 64 pixels or features, we have an input layer
of 64 neurons.
• Hidden layer
We will use 50 neurons in the hidden layer receiving input from the neurons in
the input layer. Since each neuron in the hidden layer is connected to the 64
inputs we have 64x50 = 3200 weights to the hidden layer.
• Output
Since each neuron in the output layer is connected to the 50 inputs from the
hidden layer we have 50x10 = 500 weights to the output layer.
will be mapped to zero (before being passed through the activation). The bias
unit has an output of 1, and a weight to each neuron $j$, $b_j$:
$$z_j = \sum_{i=1}^n w_{ij} a_i + b_j.$$
The bias weights b̂ are often initialized to zero, but a small value like 0.01
ensures all neurons have some output which can be backpropagated in the first
training cycle.
# building our neural network
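A minimal sketch of the setup that the feed-forward code below assumes (the initialization scheme, the 80/20 split and the variable names hidden_weights, hidden_bias, output_weights and output_bias are assumptions chosen to match the later snippets):

import numpy as np
from sklearn import datasets

# ensure the same random numbers appear every time
np.random.seed(0)

# the scikit-learn version of the MNIST digits (8x8 images)
digits = datasets.load_digits()

# flatten the images into a design matrix X and pick out the labels
inputs = digits.images
labels = digits.target
n_inputs = len(inputs)
X = inputs.reshape(n_inputs, -1)

# split into training and test data using the numpy function defined above
X_train, X_test, Y_train, Y_test = train_test_split_numpy(X, labels, train_size=0.8, test_size=0.2)

# network architecture
n_features = X.shape[1]        # 64 pixels
n_hidden_neurons = 50
n_categories = 10

# weights and bias in the hidden layer
hidden_weights = np.random.randn(n_features, n_hidden_neurons)
hidden_bias = np.zeros(n_hidden_neurons) + 0.01

# weights and bias in the output layer
output_weights = np.random.randn(n_hidden_neurons, n_categories)
output_bias = np.zeros(n_categories) + 0.01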
Feed-forward pass
Denote F the number of features, H the number of hidden neurons and C the
number of categories. For each input image we calculate a weighted sum of input
features (pixel values) to each neuron j in the hidden layer l:
$$z_j^l = \sum_{i=1}^{F} w_{ij}^l x_i + b_j^l,$$
$$a_j^l = f(z_j^l).$$
We calculate a weighted sum of inputs (activations in the hidden layer) to
each neuron j in the output layer:
$$z_j^L = \sum_{i=1}^{H} w_{ij}^L a_i^l + b_j^L.$$
Finally we calculate the output of neuron j in the output layer using the
softmax function:
$$a_j^L = \frac{\exp{(z_j^L)}}{\sum_{c=0}^{C-1}\exp{(z_c^L)}}.$$
Matrix multiplications
Since our data has the dimensions $X = (n_{\mathrm{inputs}}, n_{\mathrm{features}})$ and our weights to
the hidden layer have the dimensions $W_{\mathrm{hidden}} = (n_{\mathrm{features}}, n_{\mathrm{hidden}})$, we can
easily feed the network all our training data in one go by taking the matrix
product
$$X W_{\mathrm{hidden}} = (n_{\mathrm{inputs}}, n_{\mathrm{hidden}}),$$
and obtain a matrix that holds the weighted sum of inputs to the hidden
layer for each input image and each hidden neuron. We also add the bias to
obtain a matrix of weighted sums to the hidden layer Z h :
ẑ l = X̂ Ŵ l + b̂l ,
meaning the same bias (1D array with size equal number of hidden neurons)
is added to each input image. This is then passed through the activation:
âl = f (ẑ l ).
This is fed to the output layer:
$$\hat{z}^L = \hat{a}^l \hat{W}^L + \hat{b}^L.$$
Finally we receive our output values for each image and each category by
passing it through the softmax function:
def sigmoid(x):
    return 1/(1 + np.exp(-x))

def feed_forward(X):
    # weighted sum of inputs to the hidden layer
    z_h = np.matmul(X, hidden_weights) + hidden_bias
    # activation in the hidden layer
    a_h = sigmoid(z_h)
    # weighted sum of inputs to the output layer
    z_o = np.matmul(a_h, output_weights) + output_bias
    # softmax output: a probability for each category, for each input
    exp_term = np.exp(z_o)
    probabilities = exp_term / np.sum(exp_term, axis=1, keepdims=True)
    return probabilities
probabilities = feed_forward(X_train)
print("probabilities = (n_inputs, n_categories) = " + str(probabilities.shape))
print("probability that image 0 is in category 0,1,2,...,9 = \n" + str(probabilities[0]))
print("probabilities sum up to: " + str(probabilities[0].sum()))
print()
# we obtain a prediction by taking the class with the highest likelihood
def predict(X):
probabilities = feed_forward(X)
return np.argmax(probabilities, axis=1)
predictions = predict(X_train)
print("predictions = (n_inputs) = " + str(predictions.shape))
print("prediction for image 0: " + str(predictions[0]))
print("correct label for image 0: " + str(Y_train[0]))
$$\theta_{i+1} = \theta_i - \eta \nabla C(\theta_i),$$
where η is known as the learning rate, which controls how big a step we take
towards the minimum. This update can be repeated for any number of iterations,
or until we are satisfied with the result.
A simple and effective improvement is a variant called mini-batch gradient
descent (often referred to as stochastic gradient descent with mini-batches).
Instead of calculating the gradient on the whole dataset, we calculate an
approximation to the gradient on a subset of the data called a minibatch. If
there are $N$ data points and we have a minibatch size of $M$, the total number
of batches is $N/M$. We denote each minibatch $B_k$, with $k = 1, 2, \dots, N/M$. The
gradient then becomes:
gradient then becomes:
$$\nabla C(\theta) = \frac{1}{N}\sum_{i=1}^N \nabla L_i(\theta) \quad\rightarrow\quad \frac{1}{M}\sum_{i\in B_k} \nabla L_i(\theta),$$
i.e. instead of averaging the loss over the entire dataset, we average over a
minibatch.
This has two important benefits:
1. Introducing stochasticity decreases the chance that the algorithm becomes
stuck in a local minimum.
2. It significantly speeds up the calculation, since we do not have to use the
entire dataset to calculate the gradient.
The various optimization methods, with codes and algorithms, are discussed in
our lectures on gradient descent approaches.
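A minimal sketch (not from these notes) of the mini-batch loop described above, applied to a simple quadratic loss $L_i = \frac{1}{2}(x_i^T\theta - y_i)^2$ on synthetic data; the problem size, learning rate and number of epochs are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

# toy linear-regression problem with N data points and minibatch size M
N, M, eta = 1000, 50, 0.01
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = np.zeros(3)

n_epochs = 20
for epoch in range(n_epochs):
    perm = rng.permutation(N)
    for k in range(N // M):
        batch = perm[k * M:(k + 1) * M]
        # gradient averaged over the mini-batch B_k only
        grad = X[batch].T @ (X[batch] @ theta - y[batch]) / M
        theta -= eta * grad
print(theta)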
Regularization
It is common to add an extra term to the cost function, proportional to the size
of the weights. This is equivalent to constraining the size of the weights, so that
they do not grow out of control. Constraining the size of the weights means that
the weights cannot grow arbitrarily large to fit the training data, and in this
way reduces overfitting.
We will measure the size of the weights using the so called L2-norm, meaning
our cost function becomes:
$$C(\theta) = \frac{1}{N}\sum_{i=1}^N L_i(\theta) \quad\rightarrow\quad \frac{1}{N}\sum_{i=1}^N L_i(\theta) + \lambda\|\hat{w}\|_2^2 = \frac{1}{N}\sum_{i=1}^N L_i(\theta) + \lambda\sum_{ij} w_{ij}^2,$$
i.e. we sum up all the weights squared. The factor λ is known as a regular-
ization parameter.
In order to train the model, we need to calculate the derivative of the cost
function with respect to every bias and weight in the network. In total our
network has (64+1)×50 = 3250 weights in the hidden layer and (50+1)×10 = 510
weights to the output layer (+1 for the bias), and the gradient must be calculated
for every parameter. We use the backpropagation algorithm discussed above.
This is a clever use of the chain rule that allows us to calculate the gradient
efficiently.
Matrix multiplication
To train our network more efficiently, these equations are implemented using
matrix operations. The error in the output layer is calculated simply as, with $\hat{t}$
being our targets,
$$\hat{\delta}_L = \hat{t} - \hat{y} = (n_{\mathrm{inputs}}, n_{\mathrm{categories}}).$$
The gradient for the output weights is calculated as
where f 0 (ah ) is the derivative of the activation in the hidden layer. The
matrix products mean that we are summing up the products for each neuron
in the output layer. The symbol ◦ denotes the Hadamard product, meaning
element-wise multiplication.
This again gives us the gradients in the hidden layer:
$$\nabla \hat{b}_h = \sum_{i=1}^{n_{\mathrm{inputs}}} \hat{\delta}_h = (n_{\mathrm{hidden}}).$$
# one-hot in numpy
def to_categorical_numpy(integer_vector):
n_inputs = len(integer_vector)
n_categories = np.max(integer_vector) + 1
onehot_vector = np.zeros((n_inputs, n_categories))
onehot_vector[range(n_inputs), integer_vector] = 1
return onehot_vector
#Y_train_onehot, Y_test_onehot = to_categorical(Y_train), to_categorical(Y_test)
Y_train_onehot, Y_test_onehot = to_categorical_numpy(Y_train), to_categorical_numpy(Y_test)
def feed_forward_train(X):
    # weighted sum of inputs to the hidden layer
    z_h = np.matmul(X, hidden_weights) + hidden_bias
    # activation in the hidden layer
    a_h = sigmoid(z_h)
    # weighted sum of inputs to the output layer, followed by the softmax
    z_o = np.matmul(a_h, output_weights) + output_bias
    exp_term = np.exp(z_o)
    probabilities = exp_term / np.sum(exp_term, axis=1, keepdims=True)
    # the hidden-layer activation is also needed in the backpropagation
    return a_h, probabilities

eta = 0.01
lmbd = 0.01
for i in range(1000):
    # calculate gradients; backpropagation(X, Y) is assumed to return dWo, dBo, dWh, dBh
    dWo, dBo, dWh, dBh = backpropagation(X_train, Y_train_onehot)
    # regularization term gradients
    dWo += lmbd * output_weights
    dWh += lmbd * hidden_weights
    # gradient-descent update of the weights and biases
    output_weights -= eta * dWo
    output_bias -= eta * dBo
    hidden_weights -= eta * dWh
    hidden_bias -= eta * dBh
Improving performance
As we can see the network does not seem to be learning at all. It seems to be
just guessing the label for each image. In order to obtain a network that does
something useful, we will have to do a bit more work.
The choice of hyperparameters such as the learning rate and the regularization
parameter is hugely influential for the performance of the network. Typically a
grid search is performed, wherein we test hyperparameters separated
by orders of magnitude. For example we could test the learning rates $\eta =
10^{-6}, 10^{-5}, \dots, 10^{-1}$ with different regularization parameters $\lambda = 10^{-6}, \dots, 10^{0}$.
Next, we haven't implemented minibatching yet, which introduces stochasticity
and is thought to act as an important regularizer on the weights. We call a
feed-forward + backward pass with a minibatch an iteration, and a full training
period going through the entire dataset ($n/M$ batches) an epoch.
If this does not improve network performance, you may want to consider
altering the network architecture, adding more neurons or hidden layers. Andrew
Ng goes through some of these considerations in this video. You can find a
summary of the video here.
        self.epochs = epochs
        self.batch_size = batch_size
        self.iterations = self.n_inputs // self.batch_size
        self.eta = eta
        self.lmbd = lmbd

        self.create_biases_and_weights()

    def create_biases_and_weights(self):
        self.hidden_weights = np.random.randn(self.n_features, self.n_hidden_neurons)
        self.hidden_bias = np.zeros(self.n_hidden_neurons) + 0.01

        self.output_weights = np.random.randn(self.n_hidden_neurons, self.n_categories)
        self.output_bias = np.zeros(self.n_categories) + 0.01
    def feed_forward(self):
        # feed-forward for training
        self.z_h = np.matmul(self.X_data, self.hidden_weights) + self.hidden_bias
        self.a_h = sigmoid(self.z_h)
        self.z_o = np.matmul(self.a_h, self.output_weights) + self.output_bias
        exp_term = np.exp(self.z_o)
        self.probabilities = exp_term / np.sum(exp_term, axis=1, keepdims=True)

    def backpropagation(self):
        error_output = self.probabilities - self.Y_data
        error_hidden = np.matmul(error_output, self.output_weights.T) * self.a_h * (1 - self.a_h)
        self.output_weights_gradient = np.matmul(self.a_h.T, error_output)
        self.output_bias_gradient = np.sum(error_output, axis=0)
        self.hidden_weights_gradient = np.matmul(self.X_data.T, error_hidden)
        self.hidden_bias_gradient = np.sum(error_hidden, axis=0)
        # add the regularization term and update the parameters by gradient descent
        if self.lmbd > 0.0:
            self.output_weights_gradient += self.lmbd * self.output_weights
            self.hidden_weights_gradient += self.lmbd * self.hidden_weights
        self.output_weights -= self.eta * self.output_weights_gradient
        self.output_bias -= self.eta * self.output_bias_gradient
        self.hidden_weights -= self.eta * self.hidden_weights_gradient
        self.hidden_bias -= self.eta * self.hidden_bias_gradient
    def train(self):
        data_indices = np.arange(self.n_inputs)

        for i in range(self.epochs):
            for j in range(self.iterations):
                # pick datapoints without replacement for each minibatch
                chosen_datapoints = np.random.choice(
                    data_indices, size=self.batch_size, replace=False
                )

                # minibatch training data
                self.X_data = self.X_data_full[chosen_datapoints]
                self.Y_data = self.Y_data_full[chosen_datapoints]

                self.feed_forward()
                self.backpropagation()
# equivalent in numpy
def accuracy_score_numpy(Y_test, Y_pred):
return np.sum(Y_test == Y_pred) / len(Y_test)
Adjust hyperparameters
We now perform a grid search to find the optimal hyperparameters for the
network. Note that we are only using 1 layer with 50 neurons, and human
performance is estimated to be around 98% (2% error rate).
# hyperparameters (the values follow the setup discussed above)
epochs = 100
batch_size = 100
n_hidden_neurons = 50
n_categories = 10

eta_vals = np.logspace(-5, 1, 7)
lmbd_vals = np.logspace(-5, 1, 7)
# store the models for later use
DNN_numpy = np.zeros((len(eta_vals), len(lmbd_vals)), dtype=object)

# grid search
for i, eta in enumerate(eta_vals):
    for j, lmbd in enumerate(lmbd_vals):
        dnn = NeuralNetwork(X_train, Y_train_onehot, eta=eta, lmbd=lmbd, epochs=epochs,
                            batch_size=batch_size, n_hidden_neurons=n_hidden_neurons,
                            n_categories=n_categories)
        dnn.train()
        DNN_numpy[i][j] = dnn
        test_predict = dnn.predict(X_test)
print("Learning rate = ", eta)
print("Lambda = ", lmbd)
print("Accuracy score on test set: ", accuracy_score(Y_test, test_predict))
print()
Visualization
# visual representation of grid search
# uses seaborn heatmap, you can also do this with matplotlib imshow
import seaborn as sns
sns.set()

train_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
test_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))

for i in range(len(eta_vals)):
    for j in range(len(lmbd_vals)):
        dnn = DNN_numpy[i][j]
        train_pred = dnn.predict(X_train)
        test_pred = dnn.predict(X_test)
        train_accuracy[i][j] = accuracy_score(Y_train, train_pred)
        test_accuracy[i][j] = accuracy_score(Y_test, test_pred)
scikit-learn implementation
scikit-learn focuses more on traditional machine learning methods, such as
regression, clustering, decision trees, etc. As such, it has only two types of neural
networks: a multilayer perceptron outputting continuous values, MLPRegressor,
and a multilayer perceptron outputting labels, MLPClassifier. We will see how
simple it is to use these classes.
scikit-learn implements a few improvements compared to our neural network, such
as early stopping, a varying learning rate, different optimization methods, etc.
We would therefore expect a better performance overall.
from sklearn.neural_network import MLPClassifier
# store models for later use
DNN_scikit = np.zeros((len(eta_vals), len(lmbd_vals)), dtype=object)
for i, eta in enumerate(eta_vals):
    for j, lmbd in enumerate(lmbd_vals):
        dnn = MLPClassifier(hidden_layer_sizes=(n_hidden_neurons), activation='logistic',
                            alpha=lmbd, learning_rate_init=eta, max_iter=epochs)
        dnn.fit(X_train, Y_train)
        DNN_scikit[i][j] = dnn
Visualization
# optional
# visual representation of grid search
# uses seaborn heatmap, could probably do this in matplotlib
import seaborn as sns
sns.set()
train_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
test_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
for i in range(len(eta_vals)):
for j in range(len(lmbd_vals)):
dnn = DNN_scikit[i][j]
train_pred = dnn.predict(X_train)
test_pred = dnn.predict(X_test)
Tensorflow, building one in Keras is really quite trivial, though the performance
may suffer.
In our previous example we used only one hidden layer, and in this we will
use two. From this it should be quite clear how to build one using an arbitrary
number of hidden layers, using data structures such as Python lists or NumPy
arrays.
Tensorflow
Tensorflow is an open source library machine learning library developed by the
Google Brain team for internal use. It was released under the Apache 2.0 open
source license in November 9, 2015.
Tensorflow is a computational framework that allows you to construct machine
learning models at different levels of abstraction, from high-level, object-oriented
APIs like Keras, down to the C++ kernels that Tensorflow is built upon. The
higher levels of abstraction are simpler to use, but less flexible, and our choice
of implementation should reflect the problems we are trying to solve.
Tensorflow uses so-called graphs to represent your computation in terms
of the dependencies between individual operations, such that you first build a
Tensorflow graph to represent your model, and then create a Tensorflow session
to run the graph.
In this guide we will analyze the same data as we did in our NumPy and
scikit-learn tutorial, gathered from the MNIST database of images. We will give
an introduction to the lower level Python Application Program Interfaces (APIs),
and see how we use them to build our graph. Then we will build (effectively) the
same graph in Keras, to see just how simple solving a machine learning problem
can be.
To install tensorflow on Unix/Linux systems, use pip as
pip install tensorflow
and/or if you use anaconda, just write (or install from the graphical user
interface)
conda install tensorflow
# download MNIST dataset
digits = datasets.load_digits()
import tensorflow as tf
class NeuralNetworkTensorflow:
def __init__(
self,
X_train,
Y_train,
X_test,
Y_test,
n_neurons_layer1=100,
n_neurons_layer2=50,
n_categories=2,
epochs=10,
batch_size=100,
eta=0.1,
lmbd=0.0):
self.n_inputs = X_train.shape[0]
self.n_features = X_train.shape[1]
self.n_neurons_layer1 = n_neurons_layer1
self.n_neurons_layer2 = n_neurons_layer2
self.n_categories = n_categories
self.epochs = epochs
self.batch_size = batch_size
self.iterations = self.n_inputs // self.batch_size
self.eta = eta
self.lmbd = lmbd
    def create_placeholders(self):
        # placeholders are fine here, but "Datasets" are the preferred method
        # of streaming data into a model
        with tf.name_scope('data'):
            self.X = tf.placeholder(tf.float32, shape=(None, self.n_features), name='X_data')
            self.Y = tf.placeholder(tf.float32, shape=(None, self.n_categories), name='Y_data')

    def create_DNN(self):
        with tf.name_scope('DNN'):
            # the weights are stored to calculate regularization loss later

            # Output layer
            self.W_out = self.weight_variable([self.n_neurons_layer2, self.n_categories], name='out', dtype=tf.float32)
            b_out = self.bias_variable([self.n_categories], name='out', dtype=tf.float32)
            self.z_out = tf.matmul(a_fc2, self.W_out) + b_out
    def create_loss(self):
        with tf.name_scope('loss'):
            softmax_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.Y, logits=self.z_out))

            regularizer_loss_fc1 = tf.nn.l2_loss(self.W_fc1)
            regularizer_loss_fc2 = tf.nn.l2_loss(self.W_fc2)
            regularizer_loss_out = tf.nn.l2_loss(self.W_out)
            regularizer_loss = self.lmbd*(regularizer_loss_fc1 + regularizer_loss_fc2 + regularizer_loss_out)

            self.loss = softmax_loss + regularizer_loss

    def create_accuracy(self):
        with tf.name_scope('accuracy'):
            probabilities = tf.nn.softmax(self.z_out)
            predictions = tf.argmax(probabilities, axis=1)
            labels = tf.argmax(self.Y, axis=1)

            correct_predictions = tf.equal(predictions, labels)
            correct_predictions = tf.cast(correct_predictions, tf.float32)
            self.accuracy = tf.reduce_mean(correct_predictions)

    def create_optimiser(self):
        with tf.name_scope('optimizer'):
            self.optimizer = tf.train.GradientDescentOptimizer(learning_rate=self.eta).minimize(self.loss, global_step=self.global_step)

    def weight_variable(self, shape, name='', dtype=tf.float32):
        initial = tf.truncated_normal(shape, stddev=0.1)
        return tf.Variable(initial, name=name, dtype=dtype)
sess.run([DNN.loss, DNN.optimizer],
feed_dict={DNN.X: batch_X,
DNN.Y: batch_Y})
accuracy = sess.run(DNN.accuracy,
feed_dict={DNN.X: batch_X,
DNN.Y: batch_Y})
step = sess.run(DNN.global_step)
Optimizing and using gradient descent
epochs = 100
batch_size = 100
n_neurons_layer1 = 100
n_neurons_layer2 = 50
n_categories = 10
eta_vals = np.logspace(-5, 1, 7)
lmbd_vals = np.logspace(-5, 1, 7)
DNN_tf = np.zeros((len(eta_vals), len(lmbd_vals)), dtype=object)
DNN_tf[i][j] = DNN
sns.set()
train_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
test_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
for i in range(len(eta_vals)):
for j in range(len(lmbd_vals)):
DNN = DNN_tf[i][j]
train_accuracy[i][j] = DNN.train_accuracy
test_accuracy[i][j] = DNN.test_accuracy
Using Keras
Keras is a high level neural network that supports Tensorflow, CTNK and
Theano as backends. If you have Tensorflow installed Keras is available through
the tf.keras module. If you have Anaconda installed you may run the following
command
conda install keras
sgd = optimizers.SGD(lr=eta)
model.compile(loss=’categorical_crossentropy’, optimizer=sgd, metrics=[’accuracy’])
return model
DNN_keras[i][j] = DNN
print("Learning rate = ", eta)
print("Lambda = ", lmbd)
print("Test accuracy: %.3f" % scores[1])
print()
# optional
# visual representation of grid search
# uses seaborn heatmap, could probably do this in matplotlib
import seaborn as sns
sns.set()
test_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
for i in range(len(eta_vals)):
for j in range(len(lmbd_vals)):
DNN = DNN_keras[i][j]
y=outputs
#%%
plt.figure()
plt.scatter(x[:,0],x[:,2],s=40,c=y,cmap=plt.cm.Spectral)
plt.xlabel('Mean radius',fontweight='bold')
plt.ylabel('Mean perimeter',fontweight='bold')
plt.show()

plt.figure()
plt.scatter(x[:,5],x[:,6],s=40,c=y, cmap=plt.cm.Spectral)
plt.xlabel('Mean compactness',fontweight='bold')
plt.ylabel('Mean concavity',fontweight='bold')
plt.show()

plt.figure()
plt.scatter(x[:,0],x[:,1],s=40,c=y,cmap=plt.cm.Spectral)
plt.xlabel('Mean radius',fontweight='bold')
plt.ylabel('Mean texture',fontweight='bold')
plt.show()

plt.figure()
plt.scatter(x[:,2],x[:,1],s=40,c=y,cmap=plt.cm.Spectral)
plt.xlabel('Mean perimeter',fontweight='bold')
plt.ylabel('Mean compactness',fontweight='bold')
plt.show()
# %%
# %%
batch_size=100                          #Number of samples per gradient update

# %%
def NN_model(inputsize,n_layers,n_neuron,eta,lamda):
    model=Sequential()
    for i in range(n_layers):           #Run loop to add hidden layers to the model
        if (i==0):                      #First layer requires input dimensions
            model.add(Dense(n_neuron,activation='relu',kernel_regularizer=regularizers.l2(lamda),input_dim=inputsize))
        else:                           #Subsequent layers are capable of automatic shape inferencing
            model.add(Dense(n_neuron,activation='relu',kernel_regularizer=regularizers.l2(lamda)))
    model.add(Dense(2,activation='softmax'))  #2 outputs - ordered and disordered (softmax for probabilities)
    sgd=optimizers.SGD(lr=eta)
    model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])
    return model
for i in range(len(n_neuron)):          #run loops over hidden neurons and learning rates to calculate
    for j in range(len(eta)):           #accuracy scores
        DNN_model=NN_model(X_train.shape[1],n_layers,n_neuron[i],eta[j],lamda)
        DNN_model.fit(X_train,y_train,epochs=epochs,batch_size=batch_size,verbose=1)
        Train_accuracy[i,j]=DNN_model.evaluate(X_train,y_train)[1]
        Test_accuracy[i,j]=DNN_model.evaluate(X_test,y_test)[1]
def plot_data(x,y,data,title=None):
    # plot results
    fontsize=16

    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(data, interpolation='nearest', vmin=0, vmax=1)

    cbar=fig.colorbar(cax)
    cbar.ax.set_ylabel('accuracy (%)',rotation=90,fontsize=fontsize)
    cbar.set_ticks([0,.2,.4,0.6,0.8,1.0])
    cbar.set_ticklabels(['0%','20%','40%','60%','80%','100%'])

    # put text on matrix elements
    for i, x_val in enumerate(np.arange(len(x))):
        for j, y_val in enumerate(np.arange(len(y))):
            c = "${0:.1f}\\%$".format( 100*data[j,i])
            ax.text(x_val, y_val, c, va='center', ha='center')

    ax.set_xticklabels(['']+x)
    ax.set_yticklabels(['']+y)
    ax.set_xlabel('$\\mathrm{learning\\ rate}$',fontsize=fontsize)
    ax.set_ylabel('$\\mathrm{hidden\\ neurons}$',fontsize=fontsize)
    if title is not None:
        ax.set_title(title)

    plt.tight_layout()
    plt.show()

plot_data(eta,n_neuron,Train_accuracy, 'training')
plot_data(eta,n_neuron,Test_accuracy, 'testing')
The derivative of the logistic function
Looking at the logistic activation function, when inputs become large (negative
or positive), the function saturates at 0 or 1, with a derivative extremely close to
0. Thus when backpropagation kicks in, it has virtually no gradient to propagate
back through the network, and what little gradient exists keeps getting diluted
as backpropagation progresses down through the top layers, so there is really
nothing left for the lower layers.
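A short numerical illustration of this saturation (a sketch, not part of the original notes): the derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$ is at most 0.25 and essentially vanishes for large $|z|$.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0, 10.0, -10.0]:
    s = sigmoid(z)
    print(f"z = {z:6.1f}  sigma = {s:.6f}  derivative = {s * (1 - s):.2e}")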
In their paper, Glorot and Bengio propose a way to significantly alleviate
this problem. We need the signal to flow properly in both directions: in the
forward direction when making predictions, and in the reverse direction when
backpropagating gradients. We don’t want the signal to die out, nor do we want
it to explode and saturate. For the signal to flow properly, the authors argue
that we need the variance of the outputs of each layer to be equal to the variance
of its inputs, and we also need the gradients to have equal variance before and
after flowing through a layer in the reverse direction.
One of the insights in the 2010 paper by Glorot and Bengio was that the
vanishing/exploding gradients problems were in part due to a poor choice of
activation function. Until then most people had assumed that if Nature had
chosen to use roughly sigmoid activation functions in biological neurons, they
must be an excellent choice. But it turns out that other activation functions
behave much better in deep neural networks, in particular the ReLU activation
function, mostly because it does not saturate for positive values (and also because
it is quite fast to compute).
If runtime performance is an issue, then you may opt for the leaky ReLU
function over the ELU function. If you don't want to tweak yet another hyperparameter,
you may just use the default $\alpha$ of 0.01 for the leaky ReLU, and 1 for
the ELU. If you have spare time and computing power, you can use cross-validation
or the bootstrap to evaluate other activation functions.
If the validation and test sets are drawn from the same distributions, then a good
performance on the validation set should lead to similarly good performance on
the test set.
However, sometimes the training data and test data differ in subtle ways
because, for example, they are collected using slightly different methods, or
because it is cheaper to collect data in one way versus another. In this case,
there can be a mismatch between the training and test data. This can lead to the
neural network overfitting these small differences between the test and training
sets, and a poor performance on the test set despite having a good performance
on the validation set. To rectify this, Andrew Ng suggests making two validation
or dev sets, one constructed from the training data and one constructed from
the test data. The difference between the performance of the algorithm on these
two validation sets quantifies the train-test mismatch. This can serve as another
important diagnostic when using DNNs for supervised learning.
• Need labeled data. Like all supervised learning methods, DNNs for supervised
learning require labeled data. Often, labeled data is harder to acquire
than unlabeled data (e.g. one must pay for human experts to label images).
• Supervised neural networks are extremely data intensive. DNNs
are data hungry. They perform best when data is plentiful. This is doubly
so for supervised methods where the data must also be labeled. The utility
of DNNs is extremely limited if data is hard to acquire or the datasets are
small (hundreds to a few thousand samples). In this case, the performance
of other methods that utilize hand-engineered features can exceed that of
DNNs.
Some of these remarks are particular to DNNs, others are shared by all supervised
learning methods. This motivates the use of unsupervised methods which in
part circumvent these problems.
telescopes. Moreover, a lot of man-made sources of radio frequency interference
are also present, which can produce the same signals as pulsars. Hence, the
classification of pulsars from possible candidate data is of great importance.
In reality, pulsars are very weak radio sources. They are classified from a
given data sample by first extracting information from the pool of data and then
identifying which features or characteristics are relevant. Since the individual
pulses vary considerably, astronomers typically stack them up and
generate an integrated pulse profile for pulsar classification. This profile is
the coherent addition of thousands of pulses together, in a process known as
folding. Moreover, pulses will arrive at various times across different radio
frequencies. The delay of these pulses from frequency to frequency is called
dispersion and is caused by the ionized inter-stellar medium. Astronomers
usually fit for the shape of the delay so as to reduce its negative effect. However,
as with all kinds of fitting procedures, there is always an uncertainty associated with it.
there would always be an uncertainty associated with it. This is expressed in
the so-called DM-SNR (dispersion-measure-signal-to-noise-ratio) curve. Both
the integrated profile curve and DM-SNR curve are used in identifying possible
pulsar candidates. These curves provide eight numerical characteristic features
as depicted and listed below.
plt.figure(figsize=(12,6))
plt.subplot(121)
ax = sns.countplot(y = data["target_class"],
                   palette=["r","g"],
                   linewidth=1,
                   edgecolor="k")

for i,j in enumerate(data["target_class"].value_counts().values):
    ax.text(.7,i,j,weight = "bold",fontsize = 27)
plt.title("Count for target variable in dataset")
plt.subplot(122)
plt.pie(data["target_class"].value_counts().values,
labels=["not pulsar stars","pulsar stars"],
autopct="%1.0f%%",wedgeprops={"linewidth":2,"edgecolor":"white"})
my_circ = plt.Circle((0,0),.7,color = "white")
plt.gca().add_artist(my_circ)
plt.subplots_adjust(wspace = .2)
plt.title("Proportion of target variable in dataset")
plt.show()
plt.figure(figsize=(13,20))
for i,j,k in itertools.zip_longest(columns,range(length),colors):
    plt.subplot(length//2,length//4,j+1)   # integer division for the subplot grid
    sns.distplot(data[i],color=k)
    plt.title(i)
    plt.subplots_adjust(hspace = .3)
    plt.axvline(data[i].mean(),color = "k",linestyle="dashed",label="MEAN")
    plt.axvline(data[i].std(),color = "b",linestyle="dotted",label="STANDARD DEVIATION")
    plt.legend(loc="upper right")
Before delving into any analysis, we would like to decrease the amount of
data to look at. This can be achieved by performing a correlation plot. As
seen in Figure 5, the features that correlate the most with the class (pulsar or
non-pulsar) are the mean, the excess kurtosis, and the skewness of the integrated
profile. We could choose to only analyze these, but this would mean that pulsars
can be identified with only one of the two plots that astrophysicists use to classify
the data. In this case, collinearity may become a problem since these features
are strongly correlated with each other.
# correlation matrix
corr_matrix = data.corr()

# Pair plots
sns.pairplot(data,hue = "Target class")
plt.title("Pair plot for variables")
# plt.savefig('pairplot.png',bbox_inches='tight')
plt.show()

# Violin Plot
columns = [x for x in data.columns if x not in ["Target class"]]
length = len(columns)

plt.figure(figsize=(13,25))
for i,j in itertools.zip_longest(columns,range(length)):
    plt.subplot(length//2,length//4,j+1)   # integer division for the subplot grid
    sns.violinplot(x=data["Target class"],y=data[i],
                   palette=["Orangered","lime"],alpha=.5)
    plt.title(i)
#plt.savefig('violinplot.png',bbox_inches='tight')
plt.show()
# Logistic regression
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
# We define the hyperparameters to be used
nlambdas = 500 # why not? the code runs relatively fast
lmbdas = np.logspace(-5, 5, nlambdas)
kfold = KFold(n_splits = 5) #cross validation spliting
# We preallocate data
# true values that will be found later
train_accuracy=np.zeros(lmbdas.shape,np.float64)
test_accuracy=np.zeros(lmbdas.shape,np.float64)
train_red_accuracy=np.zeros(lmbdas.shape,np.float64)
test_red_accuracy=np.zeros(lmbdas.shape,np.float64)
# dummy arrays to be averaged later on
train_accuracy_d=np.zeros(5,np.float64)
test_accuracy_d=np.zeros(5,np.float64)
train_red_accuracy_d=np.zeros(5,np.float64)
test_red_accuracy_d=np.zeros(5,np.float64)
# We create the design matrix X and separate the labels into Y
x_fea = [x for x in data.columns if x not in ['Target class']]
#print(x_fea)
X = np.zeros((data.shape[0],data.shape[1]-1))
X_red = np.zeros((data.shape[0],3))
Y = np.zeros(data.shape[0])
for i,feature in enumerate(x_fea):  # Here we just take the variables of interest
    X[:,i] = data[feature]
    if 'Mean profile' == feature:
        X_red[:,0] = data[feature]
    if 'Kurtosis profile' == feature:
        X_red[:,1] = data[feature]
    if 'Skewness profile' == feature:
        X_red[:,2] = data[feature]

Y[:] = data['Target class']
# We perform a logistic regression for each value of lambda
for i,lmbda in enumerate(lmbdas):
    #define model
    logreg = LogisticRegression(C=1.0/lmbda,solver='liblinear')

    # cross validation: loop over the k folds
    j = 0
    for train_inds, test_inds in kfold.split(X):
        X_train = X[train_inds]
        X_red_train = X_red[train_inds]
        Y_train = Y[train_inds]

        X_test = X[test_inds]
        X_red_test = X_red[test_inds]
        Y_test = Y[test_inds]

        # We will scale the data
        scaler = StandardScaler()
        scaler.fit(X_train)
        # first on full data
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)
        # then rescale and do on reduced data
        scaler.fit(X_red_train)
        X_red_train = scaler.transform(X_red_train)
        X_red_test = scaler.transform(X_red_test)
        del scaler

        # fit on the full set of features and store the accuracies
        logreg.fit(X_train, Y_train)
        train_accuracy_d[j]=logreg.score(X_train,Y_train)
        test_accuracy_d[j]=logreg.score(X_test,Y_test)

        # fit on the reduced set of features and store the accuracies
        logreg.fit(X_red_train, Y_train)
        train_red_accuracy_d[j]=logreg.score(X_red_train,Y_train)
        test_red_accuracy_d[j]=logreg.score(X_red_test,Y_test)

        j += 1
        del X_red_train,X_red_test,X_train,Y_train,X_test,Y_test # delete useless data

    #Average to get accuracy values
    train_accuracy[i]=np.mean(train_accuracy_d)
    test_accuracy[i]=np.mean(test_accuracy_d)
    train_red_accuracy[i]=np.mean(train_red_accuracy_d)
    test_red_accuracy[i]=np.mean(test_red_accuracy_d)
plt.figure(figsize=(15,7))
plt.semilogx(lmbdas,train_accuracy,label=’train’)
plt.semilogx(lmbdas,test_accuracy,label=’test’)
plt.semilogx(lmbdas,train_red_accuracy,label=’train reduced’)
plt.semilogx(lmbdas,test_red_accuracy,’--’,label=’test reduced’)
plt.xlabel('$\\lambda$')
plt.ylabel('$\\mathrm{accuracy}$')
plt.grid()
plt.legend()
plt.show()
sns.set()
train_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
test_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
for i in range(len(eta_vals)):
for j in range(len(lmbd_vals)):
dnn = DNN_scikit[i][j]
train_pred = dnn.predict(X_train)
test_pred = dnn.predict(X_test)
train_accuracy[i][j] = accuracy_score(y_train, train_pred)
test_accuracy[i][j] = accuracy_score(y_test, test_pred)
Convolutional neural networks are very similar to the ordinary neural networks
discussed above: they are made up of neurons that have learnable weights and
biases. Each neuron receives some inputs, performs a dot product and optionally
follows it with a non-linearity. The whole network still expresses a single
differentiable score function: from the raw image pixels on one end to class
scores at the other. And they still have a loss function (for example Softmax)
on the last (fully-connected) layer and all the tips/tricks we developed for
learning regular neural networks still apply (back propagation, gradient descent, and so on).
What is the difference? CNN architectures make the explicit assumption
that the inputs are images, which allows us to encode certain
properties into the architecture. These then make the forward function
more efficient to implement and vastly reduce the number of
parameters in the network.
Here we provide only a superficial overview; for the more interested, we
highly recommend the course IN5400 – Machine Learning for Image Analysis
and the slides of CS231.
Another good read is the article here: https://arxiv.org/pdf/1603.07285.pdf.
3D volumes of neurons
Convolutional Neural Networks take advantage of the fact that the input consists
of images and they constrain the architecture in a more sensible way.
In particular, unlike a regular Neural Network, the layers of a CNN have
neurons arranged in 3 dimensions: width, height, depth. (Note that the word
depth here refers to the third dimension of an activation volume, not to the
depth of a full Neural Network, which can refer to the total number of layers in
a network.)
To understand this better, consider the example of an image with an input volume
of activations of dimensions 32 × 32 × 3 (width, height and depth, respectively).
The neurons in a layer will only be connected to a small region of the layer
before it, instead of all of the neurons in a fully-connected manner. Moreover,
the final output layer could for this specific image have dimensions 1 × 1 × 10,
because by the end of the CNN architecture we will reduce the full image into a
single vector of class scores, arranged along the depth dimension.
• INPUT (32 × 32 × 3) will hold the raw pixel values of the image, in this
case an image of width 32, height 32, and with three color channels R,G,B.
• CONV (convolutional )layer will compute the output of neurons that are
connected to local regions in the input, each computing a dot product
between their weights and a small region they are connected to in the input
volume. This may result in volume such as [32 × 32 × 12] if we decided to
use 12 filters.
• RELU layer will apply an elementwise activation function, such as the
max(0, x) thresholding at zero. This leaves the size of the volume un-
changed ([32 × 32 × 12]).
• POOL (pooling) layer will perform a downsampling operation along the
spatial dimensions (width, height), resulting in volume such as [16×16×12].
• FC (i.e. fully-connected) layer will compute the class scores, resulting in
volume of size [1 × 1 × 10], where each of the 10 numbers correspond to
a class score, such as among the 10 categories of the MNIST images we
considered above . As with ordinary Neural Networks and as the name
implies, each neuron in this layer will be connected to all the numbers in
the previous volume.
Transforming images
CNNs transform the original image layer by layer from the original pixel values
to the final class scores.
Observe that some layers contain parameters and others don't. In particular,
the CNN layers perform transformations that are a function of not only the
activations in the input volume, but also of the parameters (the weights and biases
of the neurons). On the other hand, the RELU/POOL layers will implement
a fixed function. The parameters in the CONV/FC layers will be trained with
gradient descent so that the class scores that the CNN computes are consistent
with the labels in the training set for each image.
CNNs in brief
In summary:
• A CNN architecture is in the simplest case a list of Layers that transform
the image volume into an output volume (e.g. holding the class scores)
• There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL
are by far the most popular)
• Each Layer accepts an input 3D volume and transforms it to an output
3D volume through a differentiable function
• Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL
don’t)
• Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL
do, RELU doesn’t)
For more material on convolutional networks, we strongly recommend the course IN5400 – Machine Learning for Image Analysis and the slides of CS231, which is taught at Stanford University (consistently ranked as one of the top computer science programs in the world). Michael Nielsen's book is a must-read, in particular chapter 6, which deals with CNNs.
CNNs in more detail, building convolutional neural networks in Tensorflow and Keras
As discussed above, CNNs are neural networks built on the assumption that the inputs to the network are 2D images. This matters because the number of features, i.e. pixels, grows very quickly with the image size, so a fully-connected network would need an enormous number of weights and biases to model images accurately.
As before, we still have our input, a hidden layer and an output. What is novel about convolutional networks are the convolutional and pooling layers stacked in pairs between the input and the hidden layer. In addition, the data are no longer represented as a 2D feature matrix; instead, each input is a set of 2D matrices, typically one for each color channel (red, green, blue).
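A quick back-of-the-envelope count shows why this matters; the image size and layer widths below are only illustrative numbers.

# weights needed to connect a colour image fully to a modest hidden layer,
# compared with the weights of a small convolutional layer (illustrative numbers)
width, height, depth = 256, 256, 3
n_hidden = 100

n_dense_weights = width * height * depth * n_hidden
print(f"fully-connected weights: {n_dense_weights:,}")    # 19,660,800

n_filters, receptive_field = 10, 3
n_conv_weights = n_filters * receptive_field**2 * depth
print(f"convolutional weights:   {n_conv_weights:,}")     # 270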
Setting it up
This means that to represent the entire dataset of images, we need a 4D matrix or tensor with dimensions (number of inputs, image width, image height, depth), where the depth is the number of color channels, as in the sketch below.
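The following is a minimal sketch, assuming the scikit-learn digits dataset used in the code examples further down, of one way the 4D tensor and the train/test arrays could be prepared; the 80/20 split is an arbitrary choice.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
# (n_inputs, width, height, depth): 1797 greyscale 8x8 images, depth 1
inputs = digits.images[:, :, :, np.newaxis]
labels = np.eye(10)[digits.target]              # one-hot encoded class labels

X_train, X_test, Y_train, Y_test = train_test_split(inputs, labels, test_size=0.2)
print(X_train.shape)                            # (1437, 8, 8, 1)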
Strong correlations
Images typically have strong local correlations, meaning that a small part of
the image varies little from its neighboring regions. If for example we have an
image of a blue car, we can roughly assume that a small blue part of the image
is surrounded by other blue regions.
Therefore, instead of connecting every single pixel to a neuron in the first
hidden layer, as we have previously done with deep neural networks, we can
instead connect each neuron to a small part of the image (in all 3 RGB depth
dimensions). The size of each small area is fixed and is known as the receptive field.
Layers of a CNN
The layers of a convolutional neural network arrange neurons in 3D: width,
height and depth. The input image is typically a square matrix of depth 3.
A convolution is performed on the image which outputs a 3D volume of
neurons. The weights to the input are arranged in a number of 2D matrices,
known as filters.
Each filter slides along the input image, taking the dot product between each
small part of the image and the filter, in all depth dimensions. This is then
passed through a non-linear function, typically the Rectified Linear Unit (ReLU)
function, which serves as the activation of the neurons in the first convolutional
layer. This is further passed through a pooling layer, which reduces the size
of the convolutional layer, e.g. by taking the maximum or average across some
small regions, and this serves as input to the next convolutional layer.
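The following is a bare-bones numpy sketch of these three steps on a single-channel toy image (the sizes, and the use of stride 1 without padding, are arbitrary choices): slide the filter across the image and take dot products over local regions, apply the ReLU, then downsample with 2 × 2 max pooling.

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))            # a single-channel toy image
kernel = rng.random((3, 3))           # one filter with a 3x3 receptive field

# convolution with stride 1 and no padding -> 6x6 feature map
conv = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        conv[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

activ = np.maximum(conv, 0)           # ReLU activation

# 2x2 max pooling -> 3x3
pooled = activ.reshape(3, 2, 3, 2).max(axis=(1, 3))
print(conv.shape, activ.shape, pooled.shape)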
Systematic reduction
By systematically reducing the size of the input volume, through convolution and
pooling, the network should create representations of small parts of the input,
and then from them assemble representations of larger areas. The final pooling
layer is flattened to serve as input to a hidden layer, such that each neuron in the
final pooling layer is connected to every single neuron in the hidden layer. This
then serves as input to the output layer, e.g. a softmax output for classification.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# display five random images from the scikit-learn digits dataset (8x8 greyscale digits)
digits = datasets.load_digits()
indices = np.arange(len(digits.images))
random_indices = np.random.choice(indices, size=5)
for i, image in enumerate(digits.images[random_indices]):
    plt.subplot(1, 5, i+1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title("Label: %d" % digits.target[random_indices[i]])
plt.show()
import tensorflow as tf

class ConvolutionalNeuralNetworkTensorflow:
    def __init__(
            self,
            X_train,
            Y_train,
            X_test,
            Y_test,
            n_filters=10,
            n_neurons_connected=50,
            n_categories=10,
            receptive_field=3,
            stride=1,
            padding=1,
            epochs=10,
            batch_size=100,
            eta=0.1,
            lmbd=0.0):

        self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

        self.X_train = X_train
        self.Y_train = Y_train
        self.X_test = X_test
        self.Y_test = Y_test

        self.n_inputs, self.input_width, self.input_height, self.depth = X_train.shape

        self.n_filters = n_filters
        # one 2x2 max-pooling layer halves both spatial dimensions, hence the factor 4
        self.n_downsampled = int(self.input_width*self.input_height*n_filters / 4)
        self.n_neurons_connected = n_neurons_connected
        self.n_categories = n_categories

        self.receptive_field = receptive_field
        self.stride = stride
        # strides for tf.nn.conv2d: the batch and channel strides are always 1
        self.strides = [1, stride, stride, 1]
        self.padding = padding

        self.epochs = epochs
        self.batch_size = batch_size
        self.iterations = self.n_inputs // self.batch_size
        self.eta = eta
        self.lmbd = lmbd

        self.create_placeholders()
        self.create_CNN()
        self.create_loss()
        self.create_optimiser()
        self.create_accuracy()

    def create_placeholders(self):
        with tf.name_scope('data'):
            self.X = tf.placeholder(tf.float32, shape=(None, self.input_width, self.input_height, self.depth), name='X_data')
            self.Y = tf.placeholder(tf.float32, shape=(None, self.n_categories), name='Y_data')

    def create_CNN(self):
        with tf.name_scope('CNN'):
            # Convolutional layer
            self.W_conv = self.weight_variable([self.receptive_field, self.receptive_field, self.depth, self.n_filters], name='conv', dtype=tf.float32)
            b_conv = self.bias_variable([self.n_filters], name='conv', dtype=tf.float32)
            z_conv = tf.nn.conv2d(self.X, self.W_conv, self.strides, padding='SAME', name='conv') + b_conv
            a_conv = tf.nn.relu(z_conv)

            # 2x2 max pooling
            a_pool = tf.nn.max_pool(a_conv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME', name='pool')

            # Fully connected layer
            a_pool_flat = tf.reshape(a_pool, [-1, self.n_downsampled])
            self.W_fc = self.weight_variable([self.n_downsampled, self.n_neurons_connected], name='fc', dtype=tf.float32)
            b_fc = self.bias_variable([self.n_neurons_connected], name='fc', dtype=tf.float32)
            a_fc = tf.nn.relu(tf.matmul(a_pool_flat, self.W_fc) + b_fc)

            # Output layer
            self.W_out = self.weight_variable([self.n_neurons_connected, self.n_categories], name='out', dtype=tf.float32)
            b_out = self.bias_variable([self.n_categories], name='out', dtype=tf.float32)
            self.z_out = tf.matmul(a_fc, self.W_out) + b_out

    def create_loss(self):
        with tf.name_scope('loss'):
            softmax_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.Y, logits=self.z_out))

            regularizer_loss_conv = tf.nn.l2_loss(self.W_conv)
            regularizer_loss_fc = tf.nn.l2_loss(self.W_fc)
            regularizer_loss_out = tf.nn.l2_loss(self.W_out)
            regularizer_loss = self.lmbd*(regularizer_loss_conv + regularizer_loss_fc + regularizer_loss_out)

            self.loss = softmax_loss + regularizer_loss

    def create_accuracy(self):
        with tf.name_scope('accuracy'):
            probabilities = tf.nn.softmax(self.z_out)
            predictions = tf.argmax(probabilities, 1)
            labels = tf.argmax(self.Y, 1)

            correct_predictions = tf.equal(predictions, labels)
            correct_predictions = tf.cast(correct_predictions, tf.float32)
            self.accuracy = tf.reduce_mean(correct_predictions)

    def create_optimiser(self):
        with tf.name_scope('optimizer'):
            self.optimizer = tf.train.GradientDescentOptimizer(learning_rate=self.eta).minimize(self.loss, global_step=self.global_step)

    def weight_variable(self, shape, name='', dtype=tf.float32):
        initial = tf.truncated_normal(shape, stddev=0.1)
        return tf.Variable(initial, name=name, dtype=dtype)

    def bias_variable(self, shape, name='', dtype=tf.float32):
        initial = tf.constant(0.1, shape=shape)
        return tf.Variable(initial, name=name, dtype=dtype)

    def fit(self):
        data_indices = np.arange(self.n_inputs)

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for i in range(self.epochs):
                for j in range(self.iterations):
                    # pick a random mini-batch of training images and labels
                    chosen_datapoints = np.random.choice(data_indices, size=self.batch_size, replace=False)
                    batch_X, batch_Y = self.X_train[chosen_datapoints], self.Y_train[chosen_datapoints]

                    sess.run([self.loss, self.optimizer],
                             feed_dict={self.X: batch_X,
                                        self.Y: batch_Y})
                    accuracy = sess.run(self.accuracy,
                                        feed_dict={self.X: batch_X,
                                                   self.Y: batch_Y})
                    step = sess.run(self.global_step)

            # final evaluation on the full training and test sets
            self.train_loss, self.train_accuracy = sess.run([self.loss, self.accuracy],
                                                            feed_dict={self.X: self.X_train,
                                                                       self.Y: self.Y_train})
            self.test_loss, self.test_accuracy = sess.run([self.loss, self.accuracy],
                                                          feed_dict={self.X: self.X_test,
                                                                     self.Y: self.Y_test})
eta_vals = np.logspace(-5, 1, 7)
lmbd_vals = np.logspace(-5, 1, 7)

# grid-search bookkeeping; the hyperparameter values below are assumptions,
# chosen to match the defaults of the class above
n_filters = 10
n_neurons_connected = 50
n_categories = 10
epochs = 10
batch_size = 100

CNN_tf = np.zeros((len(eta_vals), len(lmbd_vals)), dtype=object)
train_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
test_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))

for i, eta in enumerate(eta_vals):
    for j, lmbd in enumerate(lmbd_vals):
        CNN = ConvolutionalNeuralNetworkTensorflow(X_train, Y_train, X_test, Y_test,
                                                   n_filters=n_filters, n_neurons_connected=n_neurons_connected,
                                                   n_categories=n_categories, epochs=epochs, batch_size=batch_size,
                                                   eta=eta, lmbd=lmbd)
        CNN.fit()

        print("Learning rate = ", eta)
        print("Lambda = ", lmbd)
        print("Test accuracy: %.3f" % CNN.test_accuracy)
        print()

        CNN_tf[i][j] = CNN
        train_accuracy[i][j] = CNN.train_accuracy
        test_accuracy[i][j] = CNN.test_accuracy
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.optimizers import SGD
from tensorflow.keras import regularizers

# NOTE: the layer stack inside this function is an assumption; it mirrors the
# TensorFlow class above (one convolution, 2x2 max pooling, one hidden layer)
def create_convolutional_neural_network_keras(input_shape, receptive_field,
                                              n_filters, n_neurons_connected,
                                              n_categories, eta, lmbd):
    model = Sequential()
    model.add(Conv2D(n_filters, (receptive_field, receptive_field), input_shape=input_shape,
                     padding='same', activation='relu', kernel_regularizer=regularizers.l2(lmbd)))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(n_neurons_connected, activation='relu', kernel_regularizer=regularizers.l2(lmbd)))
    model.add(Dense(n_categories, activation='softmax'))
    sgd = SGD(lr=eta)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

epochs = 100
batch_size = 100
input_shape = X_train.shape[1:4]
receptive_field = 3
n_filters = 10
n_neurons_connected = 50
n_categories = 10

eta_vals = np.logspace(-5, 1, 7)
lmbd_vals = np.logspace(-5, 1, 7)
Final part
CNN_keras = np.zeros((len(eta_vals), len(lmbd_vals)), dtype=object)
# the grid-search loop sketched below fills CNN_keras[i][j] with a fitted model
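The loop below is one possible way to run this grid search; it is a sketch assuming the create_convolutional_neural_network_keras function defined above and uses Keras' fit and evaluate methods (evaluate returns [loss, accuracy] for a compiled model).

train_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
test_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))

for i, eta in enumerate(eta_vals):
    for j, lmbd in enumerate(lmbd_vals):
        # build and train one model per (eta, lambda) combination
        CNN = create_convolutional_neural_network_keras(input_shape, receptive_field,
                                                        n_filters, n_neurons_connected,
                                                        n_categories, eta, lmbd)
        CNN.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, verbose=0)
        train_accuracy[i][j] = CNN.evaluate(X_train, Y_train, verbose=0)[1]
        test_accuracy[i][j] = CNN.evaluate(X_test, Y_test, verbose=0)[1]
        CNN_keras[i][j] = CNN

        print("Learning rate = ", eta)
        print("Lambda = ", lmbd)
        print("Test accuracy: %.3f" % test_accuracy[i][j])
        print()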
Final visualization
# visual representation of the grid search
# uses a seaborn heatmap; could also be done directly in matplotlib
import seaborn as sns
sns.set()

test_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
for i in range(len(eta_vals)):
    for j in range(len(lmbd_vals)):
        CNN = CNN_keras[i][j]
        # evaluate() returns [loss, accuracy] for the compiled Keras models
        test_accuracy[i][j] = CNN.evaluate(X_test, Y_test, verbose=0)[1]

fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(test_accuracy, annot=True, ax=ax, cmap="viridis")
ax.set_title("Test accuracy")
ax.set_xlabel(r"$\lambda$")
ax.set_ylabel(r"$\eta$")
plt.show()
Fun links
1. Self-Driving cars using a convolutional neural network