Deep Learning Using Python
ISBN: 978-81-19585-32-8
No part of the book may be printed, copied, stored, retrieved, duplicated and
reproduced in any form without the written permission of the
editor/publisher.
DISCLAIMER
Information contained in this book has been published by Parab Publications
and has been obtained by the author from sources believed to be reliable and
correct to the best of their knowledge. The author is solely responsible for
the contents of the articles compiled in this book. Responsibility of
authenticity of the work or the concepts/views presented by the author
through this book shall lie with the author and the publisher has no role or
claim or any responsibility in this regard. Errors, if any, are purely
unintentional, and readers are requested to communicate such errors to the
author to avoid discrepancies in the future.
Published by:
Parab Publications
Dedicated to
My Parents
Preface
Next, the book delves into building various types of neural networks using
Python libraries. From feedforward neural networks for basic tasks to
convolutional neural networks (CNNs) for image data and recurrent neural
networks (RNNs) for sequential data processing, each chapter provides
hands-on examples and code snippets to facilitate understanding and
implementation.
Deep Learning Using Python equips readers with the essential tools and
knowledge to embark on their journey into deep learning. Whether for
academic study, professional development, or personal interest, this book
serves as a comprehensive guide to mastering deep learning techniques and
leveraging Python's capabilities to build intelligent systems for diverse
applications.
Acknowledgement
I would like to thank my family and loved ones for their constant support,
comprehension, and inspiration during the many hours that I have invested
in the writing, research, and editing of this text.
Lastly, I express gratitude to the readers for their interest and trust in this
work, with the hope that it serves as a meaningful resource in the field of
Deep Learning.
Contents
Preface (v - vi)
Acknowledgement (vii)
Bibliography 289
Index 291
An Introduction to Deep Learning
At its core, deep learning utilizes artificial neural networks (ANNs) to process
data and extract meaningful patterns. These networks are composed of layers of
interconnected nodes, or neurons, that perform computations. Each layer processes
the input data and passes it on to the next layer, with each subsequent layer learning
more abstract features from the data.
NEURAL NETWORK LAYERS
1. Input Layer: Receives raw data as input, such as images, text, or numerical
data.
2. Hidden Layers: Intermediate layers between the input and output layers.
Each hidden layer performs complex transformations and feature extraction.
3. Output Layer: Produces the final result, such as a class label or a predicted value.
Most modern deep learning models are based on artificial neural networks,
specifically convolutional neural networks (CNNs), although they can also include
propositional formulas or latent variables organized layer-wise in deep generative
models such as the nodes in deep belief networks and deep Boltzmann machines.
In deep learning, each level learns to transform its input data into a slightly
more abstract and composite representation. In an image recognition application,
the raw input may be a matrix of pixels; the first representational layer may
abstract the pixels and encode edges; the second layer may compose and encode
arrangements of edges; the third layer may encode a nose and eyes; and the fourth
layer may recognize that the image contains a face. Importantly, a deep learning
process can learn which features to optimally place in which level on its own. This
does not eliminate the need for hand-tuning; for example, varying numbers of
layers and layer sizes can provide different degrees of abstraction.
The word "deep" in "deep learning" refers to the number of layers through
which the data is transformed. More precisely, deep learning systems have a
substantial credit assignment path (CAP) depth. The CAP is the chain of
transformations from input to output. CAPs describe potentially causal connections
between input and output. For a feedforward neural network, the depth of the
CAPs is that of the network and is the number of hidden layers plus one (as the
output layer is also parameterized). For recurrent neural networks, in which a
signal may propagate through a layer more than once, the CAP depth is potentially
unlimited. No universally agreed-upon threshold of depth divides shallow learning
from deep learning, but most researchers agree that deep learning involves CAP
depth higher than 2. CAP of depth 2 has been shown to be a universal approximator
in the sense that it can emulate any function. Beyond that, more layers do not add
to the function approximator ability of the network. Deep models (CAP > 2) are
able to extract better features than shallow models and hence, extra layers help
in learning the features effectively.
Deep learning architectures can be constructed with a greedy layer-by-layer
method. Deep learning helps to disentangle these abstractions and pick out which
features improve performance.
For supervised learning tasks, deep learning methods eliminate feature
engineering, by translating the data into compact intermediate representations akin
to principal components, and derive layered structures that remove redundancy
in representation.
Deep learning algorithms can be applied to unsupervised learning tasks. This
is an important benefit because unlabeled data are more abundant than the labeled
data. Examples of deep structures that can be trained in an unsupervised manner
are deep belief networks.
INTERPRETATIONS
Deep neural networks are generally interpreted in terms of the universal
approximation theorem or probabilistic inference.
The classic universal approximation theorem concerns the capacity of
feedforward neural networks with a single hidden layer of finite size to approximate
continuous functions. In 1989, the first proof was published by George Cybenko
for sigmoid activation functions and was generalised to feed-forward multi-layer
architectures in 1991 by Kurt Hornik. Recent work showed that universal
approximation also holds for unbounded activation functions such as the rectified
linear unit.
The universal approximation theorem for deep neural networks concerns the
capacity of networks with bounded width but the depth is allowed to grow. Lu
et al. proved that if the width of a deep neural network with ReLU activation is
strictly larger than the input dimension, then the network can approximate any
Lebesgue integrable function; if the width is smaller than or equal to the input dimension,
then a deep neural network is not a universal approximator.
The probabilistic interpretation derives from the field of machine learning. It
features inference, as well as the optimization concepts of training and testing,
related to fitting and generalization, respectively. More specifically, the probabilistic
interpretation considers the activation nonlinearity as a cumulative distribution
function. The probabilistic interpretation led to the introduction of dropout as
a regularizer in neural networks. The probabilistic interpretation was introduced by
researchers including Hopfield, Widrow and Narendra and popularized in surveys
such as the one by Bishop.
HISTORY
Some sources point out that Frank Rosenblatt developed and explored all of
the basic ingredients of the deep learning systems of today. He described it in his
book "Principles of Neurodynamics: Perceptrons and the Theory of Brain
Mechanisms", published by Cornell Aeronautical Laboratory, Inc., Cornell
University in 1962.
The first general, working learning algorithm for supervised, deep, feedforward,
multilayer perceptrons was published by Alexey Ivakhnenko and Lapa in 1967.
A 1971 paper described a deep network with eight layers trained by the group
method of data handling. Other deep learning working architectures, specifically
those built for computer vision, began with the Neocognitron introduced by
Kunihiko Fukushima in 1980.
The term Deep Learning was introduced to the machine learning community
by Rina Dechter in 1986, and to artificial neural networks by Igor Aizenberg and
colleagues in 2000, in the context of Boolean threshold neurons.
In 1989, Yann LeCun et al. applied the standard backpropagation algorithm,
which had been around as the reverse mode of automatic differentiation since
1970, to a deep neural network with the purpose of recognizing handwritten ZIP
codes on mail. While the algorithm worked, training required 3 days.
Independently in 1988, Wei Zhang et al. applied the backpropagation algorithm
to a convolutional neural network (a simplified Neocognitron by keeping only the
convolutional interconnections between the image feature layers and the last fully
connected layer) for alphabet recognition, and also proposed an implementation
of the CNN with an optical computing system. Subsequently, Wei Zhang et al.
modified the model by removing the last fully connected layer and applied it for
medical image object segmentation in 1991 and breast cancer detection in
mammograms in 1994.
In 1994, André de Carvalho, together with Mike Fairhurst and David Bisset,
published experimental results of a multi-layer boolean neural network, also
known as a weightless neural network, composed of a 3-layer self-organising
feature extraction neural network module (SOFT) followed by a multi-layer
classification neural network module (GSN), which were independently trained.
Each layer in the feature extraction module extracted features of growing
complexity relative to the previous layer.
In 1995, Brendan Frey demonstrated that it was possible to train (over two
days) a network containing six fully connected layers and several hundred hidden
units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton.
Many factors contribute to the slow speed, including the vanishing gradient
problem analyzed in 1991 by Sepp Hochreiter.
Since 1997, Sven Behnke extended the feed-forward hierarchical convolutional
approach in the Neural Abstraction Pyramid by lateral and backward connections
in order to flexibly incorporate context into decisions and iteratively resolve local
ambiguities.
Simpler models that use task-specific handcrafted features such as Gabor
filters and support vector machines (SVMs) were a popular choice in the 1990s
and 2000s, because of the computational cost of artificial neural networks (ANNs)
and a lack of understanding of how the brain wires its biological networks.
Both shallow and deep learning (e.g., recurrent nets) of ANNs have been
explored for many years. These methods never outperformed non-uniform internal-
handcrafting Gaussian mixture model/Hidden Markov model (GMM-HMM)
technology based on generative models of speech trained discriminatively. Key
difficulties have been analyzed, including gradient diminishing and weak temporal
correlation structure in neural predictive models. Additional difficulties were the
lack of training data and limited computing power.
Most speech recognition researchers moved away from neural nets to pursue
generative modeling. An exception was at SRI International in the late 1990s.
Funded by the US government‘s NSA and DARPA, SRI studied deep neural
networks in speech and speaker recognition. The speaker recognition team led by
Larry Heck reported significant success with deep neural networks in speech
processing in the 1998 National Institute of Standards and Technology Speaker
Recognition evaluation. The SRI deep neural network was then deployed in the
Nuance Verifier, representing the first major industrial application of deep learning.
The principle of elevating "raw" features over hand-crafted optimization was
first explored successfully in the architecture of deep autoencoder on the "raw"
spectrogram or linear filter-bank features in the late 1990s, showing its superiority
over the Mel-Cepstral features that contain stages of fixed transformation from
spectrograms. The raw features of speech, waveforms, later produced excellent
larger-scale results.
Many aspects of speech recognition were taken over by a deep learning
method called long short-term memory (LSTM), a recurrent neural network
published by Hochreiter and Schmidhuber in 1997. LSTM RNNs avoid the vanishing
gradient problem and can learn "Very Deep Learning" tasks that require memories
of events that happened thousands of discrete time steps before, which is important
for speech. In 2003, LSTM started to become competitive with traditional speech
recognizers on certain tasks. Later it was combined with connectionist temporal
classification (CTC) in stacks of LSTM RNNs. In 2015, Google‘s speech recognition
reportedly experienced a dramatic performance jump of 49% through CTC-trained
LSTM, which they made available through Google Voice Search.
In 2006, publications by Geoff Hinton, Ruslan Salakhutdinov, Osindero and
Teh showed how a many-layered feedforward neural network could be effectively
pre-trained one layer at a time, treating each layer in turn as an unsupervised
restricted Boltzmann machine, then fine-tuning it using supervised backpropagation.
The papers referred to learning for deep belief nets.
Deep learning is part of state-of-the-art systems in various disciplines,
particularly computer vision and automatic speech recognition (ASR). Results on
commonly used evaluation sets such as TIMIT (ASR) and MNIST (image
classification), as well as a range of large-vocabulary speech recognition tasks
have steadily improved. Convolutional neural networks (CNNs) were superseded
for ASR by CTC for LSTM, but are more successful in computer vision.
The impact of deep learning in industry began in the early 2000s, when CNNs
already processed an estimated 10% to 20% of all the checks written in the US,
according to Yann LeCun. Industrial applications of deep learning to large-scale
speech recognition started around 2010.
The 2009 NIPS Workshop on Deep Learning for Speech Recognition was
motivated by the limitations of deep generative models of speech, and the possibility
that, given more capable hardware and large-scale data sets, deep neural nets
(DNN) might become practical. It was believed that pre-training DNNs using
generative models of deep belief nets (DBN) would overcome the main difficulties
of neural nets. However, it was discovered that replacing pre-training with large
amounts of training data for straightforward backpropagation when using DNNs
with large, context-dependent output layers produced error rates dramatically
lower than then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov
Model (HMM) and also than more-advanced generative model-based systems. The
nature of the recognition errors produced by the two types of systems was
characteristically different, offering technical insights into how to integrate deep
learning into the existing highly efficient, run-time speech decoding system deployed
by all major speech recognition systems. Analysis around 2009–2010, contrasting
the GMM (and other generative speech models) vs. DNN models, stimulated early
industrial investment in deep learning for speech recognition, eventually leading
to pervasive and dominant use in that industry. That analysis was done with
comparable performance (less than 1.5% in error rate) between discriminative
DNNs and generative models.
In 2010, researchers extended deep learning from TIMIT to large vocabulary
speech recognition, by adopting large output layers of the DNN based on context-
dependent HMM states constructed by decision trees.
Advances in hardware have driven renewed interest in deep learning. In 2009,
Nvidia was involved in what was called the "big bang" of deep learning, "as deep-learning
neural networks were trained with Nvidia graphics processing units
(GPUs)." That year, Andrew Ng determined that GPUs could increase the speed
of deep-learning systems by about 100 times. In particular, GPUs are well-suited
for the matrix/vector computations involved in machine learning. GPUs speed up
training algorithms by orders of magnitude, reducing running times from weeks
to days. Further, specialized hardware and algorithm optimizations can be used
for efficient processing of deep learning models.
We cannot start deep learning without explaining linear and logistic regression,
which are the basis of deep learning.
Linear regression
It is a statistical method that allows us to summarise and study relationships
between two continuous (quantitative) variables.
In this example, we have historical data based on the size of the house. We
plot the data points on the graph, shown as dots. Linear regression is the technique
of finding a straight line through these points with the least error (this will be
explained later). Once we have a line with low error, we can predict the house
price based on the size of the house.
In this example, we have a historical dataset of students who have passed or
not passed based on their grades and test scores. If we need to know whether a student
will pass or not based on the grade and test score, logistic regression can be used.
In logistic regression, similar to linear regression, we find the best possible
straight line that separates the two classes (passed and not passed).
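To make this concrete, here is a minimal sketch using scikit-learn. The house sizes, prices, grades, and test scores below are made-up illustrative numbers, not data from the examples above.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: predict house price from size (hypothetical data)
sizes = np.array([[50], [80], [120], [160]])       # square metres
prices = np.array([100, 150, 210, 280])            # price in thousands
lin = LinearRegression().fit(sizes, prices)
print(lin.predict([[100]]))                        # estimated price for a 100 m^2 house

# Logistic regression: predict pass/fail from grade and test score (hypothetical data)
X = np.array([[4, 40], [6, 55], [7, 70], [9, 85]])   # [grade, test score]
y = np.array([0, 0, 1, 1])                            # 0 = not passed, 1 = passed
log = LogisticRegression().fit(X, y)
print(log.predict([[8, 75]]))                         # predicted class for a new student

In both cases the model fits a line (or decision boundary) that minimizes the error on the historical data and then uses it to predict outcomes for new inputs.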
ACTIVATION FUNCTION
Activation functions are functions that decide, given the inputs into the node,
what the node's output should be. Because it is the activation function that decides
the actual output, we often refer to the outputs of a layer as its "activations".
One of the simplest activation functions is the Heaviside step function. This
function returns a 0 if the linear combination is less than 0. It returns a 1 if the
linear combination is positive or equal to zero.
The output unit returns the result of f(h), where h is the input to the output
unit: h = w_1*x_1 + w_2*x_2 + ... + b, and the output is y = f(h).
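A tiny sketch of such a unit in NumPy follows; the inputs, weights, and bias are arbitrary example values, not values from the book.

import numpy as np

def heaviside(h):
    # Returns 1 if the linear combination is greater than or equal to 0, otherwise 0
    return 1 if h >= 0 else 0

inputs = np.array([0.7, 0.3])      # e.g., test score and grade (example values)
weights = np.array([0.4, -0.2])    # example weights
bias = -0.1

h = np.dot(weights, inputs) + bias   # linear combination
output = heaviside(h)                # f(h)
print(output)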
WEIGHTS
When input data comes into a neuron, it gets multiplied by a weight value
that is assigned to this particular input. For example, the neuron in the university
example above has two inputs, test scores and grades, so it has two associated
weights that can be adjusted individually.
Use of weights
These weights start out as random values, and as the neural network learns
more about what kind of input data leads to a student being accepted into a
university, the network adjusts the weights based on any errors in categorization
that the previous weights resulted in. This is called training the neural network.
Remember, we can associate the weight with m (the slope) in the original linear equation:
y = mx + b
BIAS
Weights and biases are the learnable parameters of deep learning models.
The bias is represented as b in the above linear equation.
to remember them by using a reference of these visuals. That is why you are likely
to forget the answers you have written in a certification exam after a couple of
days if you don‘t revise them again. Similarly, if you have binge-watched a sitcom
on Netflix, you are likely to remember the dialogues and scenes for a long time
if you watch it repeatedly.
INTRODUCTION TO DEEP LEARNING ALGORITHMS
Before we move on to the list of deep learning models in machine learning,
let‘s understand the structure and working of deep learning algorithms with the
famous MNIST dataset. The human brain is a network of billions of neurons that
help represent a tremendous amount of knowledge. Deep Learning also uses the
same analogy of a brain neuron for processing the information and recognizing
them. Let‘s understand this with an example.
The above image is taken from the very famous MNIST dataset that gives a
glimpse of the visual representation of digits. The MNIST dataset is widely used
in many image processing techniques. Now, let‘s observe the image of how each
number is written in different ways. We as human beings can easily understand
these digits even if they are slightly tweaked or written differently because we
have written them and seen them millions of times. But how will you make a
computer recognize these digits if you are building an image processing system?
That is where Deep learning comes into the picture!
What is a Neural Network in Deep Learning?
One can visually represent the fundamental structure of a neural network as
in the above image, with mainly three components –
1. Input Layer
2. Hidden Layers
3. Output layer
The above image shows only one Hidden layer, and we can call it an Artificial
Neural Network or a neural network. On the other hand, deep learning has several
hidden layers, and that is where it gets its name "Deep". These hidden layers are
interconnected and are used to make our model learn to give the final output.
Each node with information is passed in the form of inputs, and the node
multiplies the inputs with random weight values and adds a bias before calculation.
A nonlinear or activation function is then applied to determine which particular
node will determine the output.
The activation functions used in artificial neural networks work like logic
gates. So, if we require the output of an OR gate to be 1, we will need to pass
the input values as 0,1 or 1,0. Different deep learning models use different
activation functions and sometimes a combination of activation functions.
Neural networks and deep learning structures have similarities, but one cannot
use shallow neural networks for unstructured data like images, videos, sensor data, etc.
We need multiple hidden layers (sometimes even thousands) for these types
of data, so we use deep neural networks.
How do Deep Learning Algorithms Work?
For the MNIST example discussed above, we can consider the digits as the
input that are sent in a 28x28 pixel grid format to hidden layers for digit recognition.
The hidden layers classify the digit (whether it is 0,1,2,...9) based on the shape.
For example, if we consider the digit 8, it looks like two knots interconnected
with each other. The image data, converted into pixel binaries (0, 1), is sent as an input
to the input layer.
Each connection in the layers has a weight associated with it, which determines
the input value‘s importance. The initial weights are set randomly.
We can have negative weights also associated with these connections if the
importance needs to be reduced.
The weights are updated after every iteration using the backpropagation
algorithm.
In some cases, there might not be a prominent input image of the digit, and
that is when several iterations have to be performed to train the deep learning
model by increasing the number of hidden layers. Finally, the final output is
generated based on the weights and number of iterations.
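As an illustrative sketch of the flow described above, the following Keras code (assuming TensorFlow is installed; this is not the book's own listing, and the layer sizes and epoch count are arbitrary choices) feeds 28x28 MNIST pixel grids through hidden layers to a 10-way output and trains the randomly initialized weights with backpropagation.

import tensorflow as tf

# Load the MNIST digits and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# 28x28 input -> hidden layers -> 10 output classes (digits 0-9)
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Backpropagation adjusts the weights after every iteration to reduce the loss
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))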
Now that we have a basic understanding of input and output layers in deep
learning, let‘s understand some of the primary deep learning algorithms, how they
work, and their use cases.
TOP DEEP LEARNING ALGORITHMS LIST
Multilayer Perceptrons (MLPs)
MLP is the most basic deep learning algorithm and also one of the oldest deep
learning techniques. If you are a beginner in deep learning and have just started
exploring it, we recommend you get started with MLP. MLPs can be referred to
as a form of Feedforward neural networks.
How does MLP deep learning algorithm work?
• The working of MLP is the same as what we discussed above in our
MNIST data example. The first layer takes the inputs, and the last produces
the output based on the hidden layers. Each node is connected to every
node on the next layer, so the information is constantly fed forward
between the multiple layers, which is why it is referred to as a feed-forward
network.
• MLP uses a prevalent supervised learning technique called backpropagation
for training.
• Each hidden layer is fed with some weights (randomly assigned values). The
combination of the weights and input is supplied to an activation function
which is passed further to the next layer to determine the output. If we don‘t
arrive at the expected output, we calculate the loss (error) and we back-track
to update the weights. It is an iterative process until the predicted output
is obtained (trial and error). It is critical in training the deep learning model,
as the correct weights will determine your final output.
MLPs commonly use the sigmoid, rectified linear unit (ReLU), and
tanh functions as activation functions.
APPLICATIONS OF MLP
MLPs are used by social media sites (Instagram, Facebook) for compressing image
data, which significantly helps images load even when the network strength is
not too strong.
Other applications include image and speech recognition, data compression,
and classification problems.
Pros of MLP
1. They do not make any assumptions regarding the Probability density
functions (PDF), unlike the other models that are based on Probability.
2. Ability to provide the decision function directly by training the perceptron.
Cons of MLP
1. Due to the hard-limit transfer function, the perceptrons can only give
outputs in the form of 0 and 1.
2. While updating the weights in layers, the MLP network may be stuck in
a local minimum which can hamper accuracy.
RADIAL BASIS FUNCTION NETWORKS (RBFNS)
As the name suggests, it is based on the Radial basis function (RBF) activation
function. The model training process requires slightly less time using RBFN than
MLP.
Suppose you have built a model to predict the next word based on the previous
ones. Assume you are trying to predict the last word in the sentence "the sun rises
in the east": we don't need any further context, and obviously the next word will
be "east". In these types of cases, where there is not much gap between the
relevant information and the place where it is needed, RNNs can learn and predict
the output easily. But if we have a sentence like "I was born in India. I speak fluent
Hindi", this kind of prediction requires some context from the previous sentence
about where a person was born, and RNNs might not be able to learn and connect
the information in such cases.
How do LSTM deep learning algorithms work?
The cell state and hidden state are transferred to the next cell. As the name
suggests, memory blocks remember things, and the changes to these memory
blocks are done through mechanisms referred to as gates.
The key to LSTMs is the cell state (the horizontal line at the top, which runs
through the diagram).
Step 1: The LSTM decides what information should be kept intact and what
should be thrown away from the cell state. The sigmoid layer is responsible for
making this decision.
Step 2: The LSTM decides what new information should be kept, replacing the
irrelevant information identified in step 1; the tanh and the sigmoid play an important
role in identifying relevant information.
Step 3: The output is determined with the help of the cell state, which will
now be a filtered version because of the applied sigmoid and tanh functions.
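A minimal Keras sketch of an LSTM (assuming TensorFlow is available; the sequence length, unit count, and random data here are hypothetical) that maps a sequence to a single prediction, such as the next value in a time series:

import numpy as np
import tensorflow as tf

# Hypothetical data: 100 sequences, each with 10 time steps of 1 feature
X = np.random.rand(100, 10, 1)
y = np.random.rand(100, 1)        # target: the next value in each series

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(10, 1)),   # gated memory cells
    tf.keras.layers.Dense(1),                        # predicted next value
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=10, verbose=0)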
Applications
Anomaly detection in network traffic data or IDSs (intrusion detection systems),
Time-series forecasting, Auto-completion, text and video analysis, and Caption
generation.
Pros of LSTM
LSTMs, when compared to conventional RNNs, are very handy in modeling
chronological sequences and long-range dependencies.
Cons of LSTM
1. High computation and resources are required to train the LSTM model,
and it is also a very time-consuming process.
2. They are prone to overfitting.
It is not possible to visualize such data using scatter or pair plots. This is where
SOMs come in. They reduce the data's dimensionality (less relevant features are
removed) and help us visualize the distribution of feature values.
How does the SOM deep learning algorithm work?
SOMs group similar data items together by creating a 1D or 2D map. Similar
to the other algorithms, weights are initialized randomly for each node. At each
step, one sample vector x is randomly taken from the input data set and the
distances between x and all the other vectors are computed.
The Best-Matching Unit (BMU), the node whose weight vector is closest to x,
is then selected. Once the BMU is identified, the weight vectors are updated, and
the BMU and its topological neighbors are moved closer to the input vector x in
the input space. This process is repeated until we get the expected output.
For our example, the program would first select a color from an array of
samples, such as red, and then search the map for the nodes whose weights are
closest to red. The weights of those nodes and their neighbors are nudged toward
red; then the next color, blue, is chosen, and the process continues.
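The core SOM update described above can be sketched in a few lines of NumPy. This is a toy illustration with made-up colors and a fixed neighborhood of one node, not a full SOM implementation with a decaying learning rate or neighborhood radius.

import numpy as np

np.random.seed(0)
grid_h, grid_w = 10, 10
weights = np.random.rand(grid_h, grid_w, 3)        # random RGB weights for each node
samples = np.array([[1.0, 0.0, 0.0],               # red
                    [0.0, 0.0, 1.0]])              # blue
learning_rate = 0.1

for step in range(100):
    x = samples[np.random.randint(len(samples))]   # pick one sample vector
    dists = np.linalg.norm(weights - x, axis=2)    # distance from x to every node
    bmu = np.unravel_index(np.argmin(dists), dists.shape)  # Best-Matching Unit
    # Move the BMU and its immediate neighbours closer to x
    r0, c0 = bmu
    for r in range(max(0, r0 - 1), min(grid_h, r0 + 2)):
        for c in range(max(0, c0 - 1), min(grid_w, c0 + 2)):
            weights[r, c] += learning_rate * (x - weights[r, c])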
Applications
Image analysis, fault diagnosis, process monitoring and control, etc. SOMs
are used for 3D modeling human heads from stereo images because of their ability
to generate powerful visualizations, and they are extensively valuable for the
healthcare sector for creating 3D charts.
Pros of SOMs
1. We can easily interpret and understand the data using SOM.
2. Using dimensionality reduction further makes it much simpler to check
for any similarities within our data.
Cons of SOMs
1. SOM requires neuron weights to be necessary and sufficient to cluster the
input data.
2. If we provide too little or far too much data while training a SOM, we may
not get informative or very accurate output.
GENERATIVE ADVERSARIAL NETWORKS (GANS)
It is an unsupervised learning algorithm capable of automatically discovering
and learning the patterns in the data. GANs then generate new examples that
resemble the original dataset.
AUTOENCODERS
A simple example is this: suppose your friend has asked you to share a piece of software
you have saved on your computer. The folder size of that software is close to 1
GB. If you directly upload this whole folder to your Google drive, it will take
a lot of time. But if you compress it, then the data size will reduce, and you can
upload it easily. Your friend can directly download this folder, extract the data,
and get the original folder.
In the above example, the original folder is the input, the compressed folder
is encoded data, and when your friend extracts the compressed folder, it is
decoding.
How do autoencoders work?
There are 3 main components in Autoencoders –
1. Encoder – The encoder compresses the input into a latent space
representation which can be reconstructed later to get the original input.
2. Code – This is the compressed part (latent space representation) that is
obtained after encoding.
3. Decoder – The decoder aims to reconstruct the code to its original form.
The reconstruction output obtained may not be as accurate as the original
and might have some loss.
The code layer present between the encoder and decoder is also referred to
as Bottleneck. It is used to decide which aspects of input data are relevant and
what can be neglected. The bottleneck is a very significant layer in our network.
Without it, the network could easily learn to memorize the input values by passing
them along through the network.
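A minimal encoder/bottleneck/decoder sketch in Keras follows (assuming TensorFlow; the 784-dimensional inputs, such as flattened 28x28 images, the 32-unit code layer, and the mean-squared-error reconstruction loss are illustrative choices, not prescriptions from the book).

import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
# Encoder: compress the input into a smaller latent representation
encoded = tf.keras.layers.Dense(128, activation='relu')(inputs)
code = tf.keras.layers.Dense(32, activation='relu')(encoded)        # bottleneck / code layer
# Decoder: reconstruct the original input from the code
decoded = tf.keras.layers.Dense(128, activation='relu')(code)
outputs = tf.keras.layers.Dense(784, activation='sigmoid')(decoded)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x_train, x_train, epochs=10)   # trained to reproduce its own input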
Applications
Colouring of images, image compression, denoising, etc.
They are used in the healthcare industry for medical imaging (the technique and
process of imaging the human body's interior for performing clinical analysis),
e.g., breast cancer detection.
Pros of Autoencoders
Using multiple encoder and decoder layers reduces the computational cost of
representing some functions to a certain extent.
Cons of Autoencoders
1. They are not as efficient as GANs when reconstructing images; for complex
images, they usually do not work well.
2. We might lose essential data from our original input after encoding.
Cons of DBNs
Hardware requirements are high to process inputs.
Image processing is a very useful technology and the demand from the industry
seems to be growing every year. Historically, image processing that uses machine
learning appeared in the 1960s as an attempt to simulate the human vision system
and automate the image analysis process. As the technology developed and improved,
solutions for specific tasks began to appear.
The rapid acceleration of computer vision in 2010, thanks to deep learning
and the emergence of open source projects and large image databases only increased
the need for image processing tools.
Currently, many useful libraries and projects have been created that can help
you solve image processing problems with machine learning or simply improve
the processing pipelines in the computer vision projects where you use ML.
Frameworks and libraries
In theory, you could build your image processing application from scratch,
just you and your computer. But in reality, it‘s way better to stand on the shoulders
of giants and use what other people have built and extend or adjust it where
needed.
This is where libraries and frameworks come in, and in image processing,
where creating efficient implementations is often a difficult task, this is even more
true.
So, let me give you my list of libraries and frameworks that you can use in
your image processing projects:
OpenCV
Open-source library of computer vision and image processing algorithms.
Designed and well optimized for real-time computer vision applications.
Designed to develop open infrastructure.
Functionality:
• Basic data structures
• Image processing algorithms
• Basic algorithms for computer vision
Functionality:
• Getting information about raster data
• Convert to various formats
• Data re-projection
• Creation of mosaics from rasters
• Creation of shapefiles with raster tile index
MIScnn
Framework for 2D/3D Medical Image Segmentation.
Functionality:
• Creation of segmentation pipelines
• Preprocessing
• Input/output
• Data augmentation
• Patch analysis
• Automatic evaluation
• Cross-validation
Tracking
JavaScript library for computer vision.
Functionality:
• Color tracking
• Face recognition
• Using modern HTML5 specifications
• Lightweight kernel (~ 7 KB)
WebGazer
Library for eye tracking.
Uses a webcam to determine the location of visitors‘ gaze on the page in real-
time (where the person is looking).
Functionality:
• Self-calibration of the model, which observes the interaction of visitors
with a web page and trains a mapping between eye features and
position on the screen
Improving your training dataset usually gets you bigger improvements than
state-of-the-art network architectures or training methods.
With that in mind, let me give you a list of image datasets that you can use
in your projects:
Diversity in Faces
A dataset designed to reduce the bias of algorithms.
A million labeled images of faces of people of different nationalities, ages and
genders, as well as other indicators – head size, face contrast, nose length, forehead
height, face proportions, etc. and their relationships to each other.
FaceForensics
Dataset for recognizing fake photos and videos.
A set of images (over half a million) created using the Face2Face, FaceSwap
and DeepFakes methods.
1000 videos with faces made using each of the falsification methods.
YouTube-8M Segments
Dataset of Youtube videos, with marked up content in dynamics.
Approximately 237 thousand layouts and 1000 categories.
SketchTransfer
Dataset for training neural networks to generalize
The data consists of real-world tagged images and unlabeled sketches.
DroneVehicle
Dataset for counting objects in drone images.
15,532 RGB drone shots, with an infrared shot for each image.
Object marking is available for both RGB and infrared images.
The dataset contains directional object boundaries and object classes.
In total, 441,642 objects were marked in the dataset for 31,064 images.
Waymo Open Dataset
Dataset for training autopilot vehicles.
Includes videos of driving with marked objects.
3,000 driving videos totaling 16.7 hours, 600,000 frames, about 25 million
3D object boundaries and 22 million 2D object boundaries.
In this case, we say anything below the blue line will be "No (not passed)" and
anything above it will be "Yes (passed)". Similarly, we say anything on the left side will
be "No (not passed)" and on the right side "Yes (passed)".
Just as we have neurons in the nervous system, we can define each line as one neuron
connected to the next layer's neurons as well as to neurons in the same layer. In this
case we have two neurons that represent the two lines. The above picture is an
example of a simple neural network where two neurons accept the input data,
compute yes or no based on their conditions, and pass the result to the second-layer
neuron, which combines the results from the previous layer. For this specific example,
with test score 1 and grade 8 as input, the output will be "Not passed", which is
accurate, but with logistic regression we may get "passed".
To summarise, by using multiple neurons in different layers we can increase
the accuracy of the model. This is the basis of a neural network.
The diagram below shows a simple network. The linear combination of the
weights, inputs, and bias form the input h, which passes through the activation
function f(h), giving the final output, labeled y.
The good thing about this architecture, and what makes neural networks possible,
is that the activation function f(h) can be any function, not just the step function
shown earlier.
For example, if you let f(h) = h, the output will be the same as the input. Now
the output of the network is
y = w_1*x_1 + w_2*x_2 + ... + b
This equation should be familiar to you; it is the same as the linear regression
model!
Other activation functions you‘ll see are the logistic (often called the sigmoid),
tanh, and softmax functions.
sigmoid(x) = 1 / (1 + e^(-x))
The sigmoid function is bounded between 0 and 1, and as an output can be
interpreted as a probability for success. It turns out, again, using a sigmoid as the
activation function results in the same formulation as logistic regression.
We can finally write the output of the simple neural network based on the sigmoid
as below:
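In code, under hypothetical weights and inputs, the sigmoid output of this simple network is just the sigmoid applied to the weighted sum plus the bias:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

inputs = np.array([1.0, 8.0])      # e.g., test score and grade (example values)
weights = np.array([0.5, 0.3])     # example weights
bias = -2.0

output = sigmoid(np.dot(weights, inputs) + bias)
print(output)                      # a value between 0 and 1, interpretable as a probability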
If the summation of the inputs is greater than or equal to the threshold b, the predicted
output will be 1, else it will be 0. The loss for a particular observation will be the
squared difference between Y_actual and Y_predicted.
Similarly, for all the observations, calculate the summation of the squared
differences between Y_actual and Y_predicted to get the total loss of the model
for a particular threshold value b.
Learning Algorithm
The purpose of the learning algorithm is to find out the best value for the
parameter b so that the loss of the model will be minimum. In the ideal scenario,
the loss of the model for the best value of b would be zero.
For n features in the data, the summation we are computing can take only
values between 0 and n because all of our inputs are binary (0 or 1): 0 indicates
all the features are off and n indicates all the features are on. Therefore the
different values the threshold b can take will also vary from 0 to n. As we have
only one parameter with a range of values 0 to n, we can use the brute force
approach to find the best value of b.
• Initialize b with a random integer in [0, n].
• For each observation, find the predicted outcome: calculate the summation of
the inputs and check whether it is greater than or equal to b. If it is greater than
or equal to b, the predicted outcome will be 1, or else it will be 0.
• After finding the predicted outcome, compute the loss for each observation.
• Finally, compute the total loss of the model by summing up all the individual
losses.
• Similarly, we can iterate over all the possible values of b and find the total
loss of the model. Then we can choose the value of b such that the loss
is minimum (see the code sketch after this list).
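A small sketch of this brute-force search follows; the binary feature matrix X and labels Y below are made-up examples, not the book's dataset.

import numpy as np

# Hypothetical binary data: 4 observations with 3 binary features each
X = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 1],
              [0, 1, 0]])
Y = np.array([1, 0, 1, 0])           # actual outcomes
n = X.shape[1]

best_b, best_loss = None, float('inf')
for b in range(n + 1):                           # try every possible threshold
    Y_pred = (X.sum(axis=1) >= b).astype(int)    # predict 1 if the sum of inputs >= b
    loss = np.sum((Y - Y_pred) ** 2)             # total squared-error loss
    if loss < best_loss:
        best_b, best_loss = b, loss
print(best_b, best_loss)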
MODEL EVALUATION
After finding the best threshold value b from the learning algorithm, we can
evaluate the model on the test data by comparing the predicted outcome and the
actual outcome.
For the above-shown test data, the accuracy of the MP neuron model = 75%.
This is the best part of the post, according to me. Let's start with the OR
function.
OR Function
We already discussed that the OR function‘s thresholding parameter theta is
1, for obvious reasons. The inputs are obviously boolean, so only 4 combinations
are possible — (0,0), (0,1), (1,0) and (1,1). Now plotting them on a 2D graph and
making use of the OR function‘s aggregation equation
i.e., x_1 + x_2 ≥ 1, using which we can draw the decision boundary as shown
in the graph below. Mind you again, this is not a real-number graph.
We just used the aggregation equation, i.e., x_1 + x_2 = 1, to graphically show
that all those inputs whose output is 1 when passed through the OR function M-P
neuron lie ON or ABOVE that line, and all the input points that lie BELOW that
line are going to output 0.
Voila!! The M-P neuron just learnt a linear decision boundary! The M-P
neuron is splitting the input sets into two classes — positive and negative. Positive
ones (which output 1) are those that lie ON or ABOVE the decision boundary
and negative ones (which output 0) are those that lie BELOW the decision
boundary. Let's convince ourselves that the M-P unit is doing the same for all the
boolean functions by looking at more examples (if it is not already clear from the
math).
AND Function
In this case, the decision boundary equation is x_1 + x_2 = 2. Here, the only
input point that lies ON or ABOVE the line, (1,1), outputs 1 when passed through the
AND function M-P neuron. It fits! The decision boundary works!
Tautology
The plane that satisfies the decision boundary equation x_1 + x_2 + x_3 =
1 is shown below:
Take your time and convince yourself by looking at the above plot that all
the points that lie ON or ABOVE that plane (positive half space) will result in
output 1 when passed through the OR function M-P unit and all the points that
lie BELOW that plane (negative half space) will result in output 0.
Just by hand coding a thresholding parameter, M-P neuron is able to conveniently
represent the boolean functions which are linearly separable.
Linear separability (for boolean functions): There exists a line (plane) such
that all inputs which produce a 1 lie on one side of the line (plane) and all inputs
which produce a 0 lie on other side of the line (plane).
Limitations Of M-P Neuron
• What about non-boolean (say, real) inputs?
• Do we always need to hand code the threshold?
• Are all inputs equal? What if we want to assign more importance to some
inputs?
• What about functions which are not linearly separable? Say XOR function.
I hope it is now clear why we are not using the M-P neuron today. Overcoming
the limitations of the M-P neuron, Frank Rosenblatt, an American psychologist,
proposed the classical perceptron model, the mighty artificial neuron, in 1958. It
is a more generalized computational model than the McCulloch-Pitts neuron, where
weights and thresholds can be learnt over time.
More on perceptron and how it learns the weights and thresholds etc. in my
future posts.
Basically, a neuron takes an input signal (dendrite), processes it like the CPU
(soma), passes the output through a cable like structure to other connected neurons
(axon to synapse to other neuron‘s dendrite). Now, this might be biologically
inaccurate as there is a lot more going on out there but on a higher level, this is
what is going on with a neuron in our brain — takes an input, processes it, throws
out an output.
Our sense organs interact with the outer world and send the visual and sound
information to the neurons. Let‘s say you are watching Friends. Now the information
your brain receives is taken in by the "laugh or not" set of neurons that will help
you make a decision on whether to laugh or not. Each neuron gets fired/activated
only when its respective criteria (more on this later) is met like shown below.
Not real.
Of course, this is not entirely true. In reality, it is not just a couple of neurons
which would do the decision making. There is a massively parallel interconnected
network of 10¹¹ neurons (100 billion) in our brain and their connections are not
as simple as I showed you above. It might look something like this:
Division of work
It is believed that neurons are arranged in a hierarchical fashion (however,
many credible alternatives with experimental support are proposed by the scientists)
and each layer has its own role and responsibility. To detect a face, the brain could
be relying on the entire network and not on a single layer.
We can see that g(x) is just doing a sum of the inputs — a simple
aggregation.
And theta here is called thresholding parameter.
For example, if I always watch the game when the sum turns out to be 2 or
more, the theta is 2 here.
This is called the Thresholding Logic.
Boolean Functions Using M-P Neuron
So far we have seen how the M-P neuron works. Now let's look at how
this very neuron can be used to represent a few boolean functions.
Mind you that our inputs are all boolean and the output is also boolean so
essentially, the neuron is just trying to learn a boolean function.
A lot of boolean decision problems can be cast into this form, based on
appropriate input variables: whether to continue reading this post,
whether to watch Friends after reading this post, etc. can all be represented by
the M-P neuron.
M-P Neuron: A Concise Representation
This representation just denotes that, for the boolean inputs x_1, x_2 and x_3,
if g(x), i.e., the sum, is ≥ theta, the neuron will fire; otherwise, it won't.
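As a quick illustration (with example thresholds only), the M-P neuron's fire/don't-fire rule can be written directly in Python:

def mp_neuron(inputs, theta):
    # g(x): simple aggregation, the sum of the boolean inputs
    g = sum(inputs)
    # f(g): fire (output 1) if the sum is greater than or equal to theta
    return 1 if g >= theta else 0

# AND of three inputs: fires only when all inputs are on (theta = 3)
print(mp_neuron([1, 1, 1], theta=3))   # 1
print(mp_neuron([1, 0, 1], theta=3))   # 0

# OR of three inputs: fires when any input is on (theta = 1)
print(mp_neuron([0, 1, 0], theta=1))   # 1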
AND Function
An AND function neuron would only fire when ALL the inputs are ON, i.e.,
g(x) ≥ 3 here.
A Function with an Inhibitory Input
Now this might look like a tricky one, but it's really not. Here, we have an
inhibitory input, i.e., x_2, so whenever x_2 is 1, the output will be 0.
Keeping that in mind, we know that x_1 AND !x_2 would output 1 only when
x_1 is 1 and x_2 is 0, so it is obvious that the threshold parameter should be 1.
Let's verify that: g(x), i.e., x_1 + x_2, would be ≥ 1 in only 3 cases:
Case 1: when x_1 is 1 and x_2 is 0
Case 2: when x_1 is 1 and x_2 is 1
Case 3: when x_1 is 0 and x_2 is 1
But in both Case 2 and Case 3, we know that the output will be 0 because
x_2 is 1 in both of them, thanks to the inhibition.
And we also know that x_1 AND !x_2 would output 1 for Case 1 (above)
so our thresholding parameter holds good for the given function.
NOR Function
For a NOR neuron to fire, we want ALL the inputs to be 0 so the thresholding
parameter should also be 0 and we take them all as inhibitory input.
NOT Function
Training
Weights start out as random values, and as the neural network learns more
about what kind of input data leads to a student being accepted into a
university(above example), the network adjusts the weights based on any errors
in categorization that the previous weights resulted in.
This is called training the neural network. Once we have the trained network,
we can use it to predict the output for similar inputs.
Error
This is a very important concept that defines how well a network is performing during
training.
In the training phase, the network makes use of the error value to adjust the
weights so that the error is reduced at each step.
The goal of the training phase is to minimize the error.
Mean Squared Error is one of the popular error functions; it is a modified
version of the Sum of Squared Errors.
Forward Propagation
By propagating values from the first layer (the input layer) through all the
mathematical functions represented by each node, the network outputs a value.
This process is called a forward pass.
print('Output-layer Output:')
print(output_layer_out)
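The two print statements above are the tail of a longer listing whose beginning is missing here. A self-contained sketch of the forward pass they belong to might look like the following; the input values and the two weight matrices are hypothetical.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical input and weights
X = np.array([0.5, 0.1, -0.2])
weights_input_to_hidden = np.array([[0.1, -0.2],
                                    [0.4,  0.5],
                                    [-0.3, 0.2]])
weights_hidden_to_output = np.array([[0.3],
                                     [-0.1]])

# Forward pass: propagate values from the input layer through each layer
hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print('Output-layer Output:')
print(output_layer_out)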
Gradient Descent
Gradient descent is an optimization algorithm used to find the values of the
parameters (coefficients) of a function f that minimize a cost function. Gradient
descent is best used when the parameters cannot be calculated analytically
(e.g., using linear algebra) and must be searched for by an optimization
algorithm. Gradient descent is used to find the minimum error by minimizing a
"cost" function.
In the university example (explained in the neural network section), the
correct lines to divide the dataset are already defined. How do we find the correct
lines? As we know, weights are adjusted during the training process. Adjusting the
weights will enable each neuron to correctly divide the given dataset.
To figure out how we‘re going to find these weights, start by thinking about
the goal. We want the network to make predictions as close as possible to the real
values. To measure this, we need a metric of how wrong the predictions are,
the error. A common metric is the sum of the squared errors (SSE):
E = (1/2) Σ_μ Σ_j (ŷ_j^μ - y_j^μ)²
where ŷ is the prediction and y is the true value, and you take the sum over
all output units j and another sum over all data points μ.
The SSE is a good choice for a few reasons. The square ensures the error is
always positive and larger errors are penalized more than smaller errors. Also, it
makes the math nice, always a plus.
Remember that the output of a neural network, the prediction, depends on the
weights.
import numpy as np

# NOTE: `features` (a pandas DataFrame) and `targets` are assumed to be defined
# earlier, e.g., loaded from the admissions data used in this example.

np.random.seed(21)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Hyperparameters
n_hidden = 2       # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # Calculate the output
        hidden_input = np.dot(x, weights_input_hidden)
        hidden_output = sigmoid(hidden_input)
        output = sigmoid(np.dot(hidden_output, weights_hidden_output))

        ## Backward pass ##
        # Calculate the network's prediction error
        error = y - output

        # Calculate the error term for the output unit
        output_error_term = error * output * (1 - output)

        ## Propagate errors to the hidden layer ##
        # Calculate the hidden layer's contribution to the error
        hidden_error = np.dot(output_error_term, weights_hidden_output)
        hidden_error_term = hidden_error * hidden_output * (1 - hidden_output)

        # Accumulate the weight steps for this pass through the data
        del_w_hidden_output += output_error_term * hidden_output
        del_w_input_hidden += hidden_error_term * x[:, None]

    # Update the weights with the averaged gradient steps
    weights_input_hidden += learnrate * del_w_input_hidden / n_records
    weights_hidden_output += learnrate * del_w_hidden_output / n_records
The building block of the deep neural networks is called the sigmoid neuron.
Sigmoid neurons are similar to perceptrons, but they are slightly modified such
that the output from the sigmoid neuron is much smoother than the step functional
output from perceptron.
In this post, we will talk about the motivation behind the creation of sigmoid
neuron and working of the sigmoid neuron model.
This is the first part in a two-part series discussing the working of the sigmoid
neuron and its learning algorithm:
1 | Sigmoid Neuron — Building Block of Deep Neural Networks
2 | Sigmoid Neuron Learning Algorithm Explained With Math
Why Sigmoid Neuron
Before we go into the working of a sigmoid neuron, let‘s talk about the
perceptron model and its limitations in brief.
The perceptron model takes several real-valued inputs and gives a single binary
output. In the perceptron model, every input x_i has a weight w_i associated with it.
The weights indicate the importance of the input in the decision-making process.
The model output is decided by a threshold: if the weighted sum of the inputs
is greater than the threshold, the output will be 1, else the output will be 0. In other words,
the model will fire if the weighted sum is greater than the threshold.
From the mathematical representation, we might say that the thresholding
logic used by the perceptron is very harsh. Let‘s see the harsh thresholding logic
with an example.
Red points indicate that a person would not buy a car and green points
indicate that a person would like to buy a car. Isn't it a bit odd that a person with
a salary of 50.1K will buy a car but someone with 49.9K will not buy a car? The small
change in the input to a perceptron can sometimes cause the output to completely
flip, say from 0 to 1. This behavior is not a characteristic of the specific problem
we choose or the specific weight and the threshold we choose. It is a characteristic
of the perceptron neuron itself, which behaves like a step function. We can overcome
this by replacing the step function with a smoother one: the sigmoid.
We no longer see a sharp transition at the threshold b. The output from the
sigmoid neuron is not 0 or 1. Instead, it is a real value between 0–1 which can
be interpreted as a probability.
REGRESSION AND CLASSIFICATION
The inputs to the sigmoid neuron can be real numbers unlike the boolean
inputs in MP Neuron and the output will also be a real number between 0–1. In
the sigmoid neuron, we are trying to regress the relationship between X and Y
in terms of probability. Even though the output is between 0–1, we can still use
the sigmoid function for binary classification tasks by choosing some threshold.
Learning Algorithm
The parameters w and b of the sigmoid neuron model are learned by using
the gradient descent algorithm.
The objective of the learning algorithm is to determine the best possible values
for the parameters, such that the overall loss (squared error loss) of the model is
minimized as much as possible. Here goes the learning algorithm:
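A compact sketch of that learning algorithm for a single sigmoid neuron with one input follows (gradient descent on the squared-error loss; the data, learning rate, and epoch count are made up for illustration).

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical 1-D data: X = salary, Y = bought a car (0/1)
X = np.array([2.0, 3.5, 5.0, 8.0])
Y = np.array([0, 0, 1, 1])

w, b = 0.0, 0.0
learning_rate = 0.1

for epoch in range(1000):
    dw, db = 0.0, 0.0
    for x, y in zip(X, Y):
        y_pred = sigmoid(w * x + b)
        # Gradient of the squared-error loss for the sigmoid neuron
        grad = (y_pred - y) * y_pred * (1 - y_pred)
        dw += grad * x
        db += grad
    # Update the parameters in the direction that reduces the loss
    w -= learning_rate * dw
    b -= learning_rate * db

print(w, b)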
Loss Optimization
We will keep doing the update operation until we are satisfied. "Till satisfied"
could mean any of the following:
• The overall loss of the model becomes zero.
• The overall loss of the model becomes a very small value closer to zero.
• Iterating for a fixed number of passes based on computational capacity.
Can It Handle Non-Linear Data?
One of the limitations of the perceptron model is that the learning algorithm
works only if the data is linearly separable. That means that the positive points
will lie on one side of the boundary and the negative points will lie on the other side.
Can the sigmoid neuron handle non-linearly separable data?
Let's take an example of whether a person is going to buy a car or not based
on two inputs, X_1 — Salary in Lakhs Per Annum (LPA) and X_2 — Size of the family.
I am assuming that there is a relationship between X and Y, and that it can be
approximated using the sigmoid function.
The red points indicate that the output is 0 and green points indicate that it
is 1.
As we can see from the figure, there is no line or a linear boundary that can
effectively separate red and green points. If we train a perceptron on this data,
the learning algorithm will never converge because the data is not linearly
separable. Instead of going for convergence, I will run the model for a certain
number of iterations so that the errors will be minimized as much as possible.
Perceptron Decision boundary for fixed iterations
From the perceptron decision boundary, we can see that the perceptron doesn't
distinguish between the points that lie close to the boundary and the points that lie
far inside, because of the harsh thresholding logic. But in a real-world scenario,
we would expect that a person who is sitting on the fence of the boundary could go either
way, unlike a person who is far inside the decision boundary. Let's see
how sigmoid neuron will handle this non-linearly separable data. Once I fit our
two-dimensional data using the sigmoid neuron, I will be able to generate the 3D
contour plot shown below to represent the decision boundary for all the observations.
Sigmoid Neuron Decision Boundary (Left) & Top View of Decision Boundary
(Right)
For comparison, let‘s take the same two observations and see what will be
predicted outcome from the sigmoid neuron for these observations. As you can
see the predicted value for the observation present in the far left of the plot is
zero (present in the dark red region) and the predicted value of another observation
is around 0.35, i.e., there is a 35% chance that the person might buy a car. Unlike
the rigid output from the perceptron, we now have a smooth and continuous output
between 0 and 1, which can be interpreted as a probability.
STILL DOES NOT COMPLETELY SOLVE OUR PROBLEM FOR
NON-LINEAR DATA.
Although we have introduced the non-linear sigmoid neuron function, it is still
not able to effectively separate red points from green points. The important point
is that, from a rigid decision boundary in the perceptron, we have taken our first step
in the direction of creating a decision boundary that works well for non-linearly
separable data. The sigmoid neuron is thus the building block of the deep neural
network; eventually we have to use a network of neurons to help us create
a "perfect" decision boundary.
Hidden layers have ushered in a new era, with the old techniques proving
inefficient, particularly for problems like pattern recognition, object detection,
image segmentation, and other image-processing-based problems. The CNN is one of
the most widely deployed deep learning neural networks.
BACKGROUND OF CNNS
Around the 1980s, CNNs were developed and deployed for the first time.
At the time, a CNN could only detect handwritten digits, and it was primarily
used to read zip codes, pin codes, etc.
The most common aspect of any A.I. model is that it requires a massive
amount of data to train. This was one of the biggest problems that CNN faced
at the time, and due to this, they were only used in the postal industry. Yann LeCun
was the first to introduce convolutional neural networks.
LeCun's work built on that of Kunihiko Fukushima, a renowned Japanese scientist
who invented the neocognitron, a very simple neural network used for image
recognition.
What is CNN?
In the field of deep learning, a convolutional neural network (CNN) is a class
of deep neural networks most commonly deployed for analyzing images and for
image recognition.
Convolution Layers
This is the very first layer in the CNN and is responsible for extracting the
different features from the input images. In this layer, the mathematical
convolution operation is performed between the input image and a filter of a
specific size M×M.
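As an illustration, a single convolution layer of this kind can be created with
Keras; the filter count, kernel size, and input shape below are illustrative
assumptions, not values from the text:
import tensorflow as tf
from tensorflow import keras

# One convolution layer: 32 filters of size 3x3 (M = 3), with a ReLU
# non-linearity, applied to a batch of 28x28 single-channel images.
conv = keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu')
x = tf.random.normal((1, 28, 28, 1))      # a dummy input image
print(conv(x).shape)                      # (1, 26, 26, 32)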
THE FULLY CONNECTED
The Fully Connected (FC) layer comprises the weights and biases together
with the neurons and is used to connect the neurons between two separate layers.
FC layers usually form the last few layers of a CNN architecture, positioned just
before the output layer.
Pooling layer
The pooling layer is responsible for reducing the spatial size of the convolved
feature. This significantly reduces the dimensions of the data and, with it, the
computing power required to process it.
There are two types of pooling:
1. Average pooling
2. Max pooling
A Pooling Layer is usually applied after a Convolutional Layer. This layer's
major goal is to lower the size of the convolved feature map to reduce computational
expenses. This is accomplished by reducing the connections between layers and
operating independently on each feature map. There are numerous sorts of Pooling
operations, depending on the mechanism utilised.
In Max Pooling, the largest element is taken from each region of the feature map.
Average Pooling calculates the average of the elements in a predefined-size image
segment, and Sum Pooling calculates the total sum of the components in the
predefined section. The Pooling Layer typically connects the Convolutional Layer
and the FC Layer.
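A minimal sketch of max and average pooling applied to a convolved feature map,
assuming the Keras pooling layers (the shapes are illustrative):
import tensorflow as tf
from tensorflow import keras

x = tf.random.normal((1, 26, 26, 32))             # a convolved feature map
max_pool = keras.layers.MaxPooling2D(pool_size=(2, 2))
avg_pool = keras.layers.AveragePooling2D(pool_size=(2, 2))
print(max_pool(x).shape)                          # (1, 13, 13, 32): spatial size halved
print(avg_pool(x).shape)                          # (1, 13, 13, 32)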
Dropout
To avoid overfitting (when a model performs well on training data but not
on new data), a dropout layer is utilised, in which a few neurons are removed
from the neural network during the training phase, resulting in a smaller model.
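A minimal sketch of a Keras dropout layer (the rate of 0.5 is an illustrative
assumption, not a value from the text):
import tensorflow as tf
from tensorflow import keras

dropout = keras.layers.Dropout(0.5)
x = tf.ones((1, 10))
# During training roughly half the activations are zeroed (the rest are
# rescaled); at inference time the layer passes its input through unchanged.
print(dropout(x, training=True))
print(dropout(x, training=False))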
Activation Functions
They’re utilised to learn and approximate any form of network variable-
to-variable association that’s both continuous and complex.
It gives the network non-linearity. ReLU, Softmax, and tanh are some of the most
frequently used activation functions.
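For illustration, these activation functions can be written in a few lines of
NumPy (a sketch, not code from the text):
import numpy as np

def relu(z):
    return np.maximum(0, z)

def tanh(z):
    return np.tanh(z)

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print(relu(z))      # [0. 0. 2.]
print(tanh(z))
print(softmax(z))   # sums to 1, usable as class probabilities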
TRAINING THE CONVOLUTIONAL NEURAL NETWORK
The process of adjusting the values of the weights is known as the "training"
of the neural network.
Firstly, the CNN starts with random weights. During training, the network is fed
a large dataset of images labelled with their corresponding class labels (cat,
dog, horse, etc.). The CNN processes each image with its randomly assigned weights
and then compares its output with the class label of the input image.
If the output does not match the class label (which mostly happens at the
beginning of the training process), the network makes a correspondingly small
adjustment to the weights of its neurons so that the output matches the class
label more closely.
The corrections to the weight values are made through a technique known as
backpropagation. Backpropagation optimizes the tuning process and makes it easier
to make adjustments for better accuracy. Each run through the training image
dataset is called an "epoch."
The CNN goes through a series of epochs during training, adjusting its weights
in small amounts as required.
After each epoch, the network becomes a bit more accurate at classifying and
correctly predicting the classes of the training images. As the CNN improves,
the adjustments made to the weights become smaller and smaller.
After training the CNN, we use a test dataset to verify its accuracy. The test
dataset is a set of labelled images that were not included in the training
process. Each image is fed to the CNN, and the output is compared with the actual
class label of the test image. Essentially, the test dataset evaluates the
prediction performance of the CNN.
If a CNN's accuracy is good on its training data but poor on the test data, the
model is said to be "overfitting." This usually happens when the training dataset
is too small.
Limitations
CNNs use massive computing power and resources to recognize visual patterns and
trends that are impossible for the human eye to detect.
Training a convolutional neural network usually takes a very long time, especially
with large image datasets.
Training generally requires very specialized hardware, such as a GPU.
PYTHON CODE FOR IMPLEMENTING A CNN FOR CLASSIFICATION
Importing Relevant Libraries
import numpy as np
%matplotlib inline
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
import seaborn as sn   # needed for the confusion-matrix heatmap below
tf.compat.v1.set_random_seed(2019)
Loading MNIST Dataset
(X_train,Y_train),(X_test,Y_test) =
keras.datasets.mnist.load_data()
Scaling The Data
X_train = X_train / 255
X_test = X_test / 255
# flattening
X_train_flattened = X_train.reshape(len(X_train), 28*28)
X_test_flattened = X_test.reshape(len(X_test), 28*28)
Designing The Neural Network
model = keras.Sequential([
    keras.layers.Dense(10, input_shape=(784,), activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train_flattened, Y_train, epochs=5)
Output:
Epoch 1/5
1875/1875 [==============================] - 8s 4ms/step -
loss: 0.7187 - accuracy: 0.8141
Epoch 2/5
1875/1875 [==============================] - 6s 3ms/step -
loss: 0.3122 - accuracy: 0.9128
Epoch 3/5
1875/1875 [==============================] - 6s 3ms/step -
loss: 0.2908 - accuracy: 0.9187
Epoch 4/5
1875/1875 [==============================] - 6s 3ms/step -
loss: 0.2783 - accuracy: 0.9229
Epoch 5/5
1875/1875 [==============================] - 6s 3ms/step -
loss: 0.2643 - accuracy: 0.9262
Confusion Matrix for visualization of predictions
Y_predict = model.predict(X_test_flattened)
Y_predict_labels = [np.argmax(i) for i in Y_predict]
cm = tf.math.confusion_matrix(labels=Y_test, predictions=Y_predict_labels)
%matplotlib inline
plt.figure(figsize=(10, 7))
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
Output
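Note that the model above uses a single dense layer on flattened images rather
than convolution layers. A minimal convolutional variant for the same MNIST data,
assuming the Keras Conv2D, MaxPooling2D, Flatten, Dense, and Dropout layers (the
layer sizes below are illustrative, not taken from the text), might look like:
from tensorflow import keras

cnn = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation='softmax')
])
cnn.compile(optimizer='adam',
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy'])
# The images keep their 2-D shape plus a channel dimension instead of being flattened.
cnn.fit(X_train.reshape(len(X_train), 28, 28, 1), Y_train, epochs=5)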
Matlab and Python in Machine Learning
A deep neural network combines multiple nonlinear processing layers, using simple
elements operating in parallel and inspired by biological nervous systems. It
consists of an input layer, several hidden layers, and an output layer. The layers
are interconnected via nodes, or neurons, with each hidden layer using the output
of the previous layer as its input. It is generally perceived that the greater the
number of layers, the better the accuracy of the final model.
Deep Learning is especially well-suited to identification applications such as
face recognition, text translation, voice recognition, and advanced driver assistance
systems, including lane classification and traffic sign recognition, since it is
deemed to have better accuracy than classical ML. Advanced tools and techniques have
dramatically improved Deep Learning algorithms—to the point where they can
outperform humans at classifying images, win against the world's best Go player,
or power voice-controlled assistants such as Amazon Alexa and Google Home.
USING MATLAB® FOR DEEP LEARNING
MATLAB® is a programming platform designed specifically for engineers and
scientists to analyze and design systems and products in a fast and efficient
manner. The heart of MATLAB is the MATLAB language, a matrix-based language
allowing the most natural expression of computational mathematics. MATLAB
is considered one of the best programming languages at being able to handle the
matrices of Deep Learning in a simple and intuitive manner.
Advantages of MATLAB for Deep Learning
MATLAB has interactive Deep Learning apps for labeling signal data, audio data,
images, and video. Labelling is one of the most tedious tasks in Deep Learning,
and MATLAB's apps help automate it.
MATLAB can help with generating synthetic data when you don't have enough data
for the right scenarios. This is a huge benefit, as Deep Learning depends heavily
on large datasets.
• In the case of automated driving, you can author scenarios and
simulate the output of different sensors using a 3D simulation
environment.
• In radar and communications, this includes generating data for
waveform-modulation-identification and target classification
applications.
• MATLAB has a variety of ways to interact and transfer data between
Deep Learning frameworks.
On the other hand, Python is free and open-source software. Not only can
you download Python at no cost, but you can also download, look at, and modify
the source code as well. This is a big advantage for Python because it means that
anyone can pick up the development of the language if the current developers were
unable to continue for some reason.
If you‘re a researcher or scientist, then using open-source software has some
pretty big benefits. Paul Romer, the 2018 Nobel Laureate in Economics, is a
recent convert to Python. By his estimation, switching to open-source software
in general, and Python in particular, brought greater integrity and accountability
to his research. This was because all of the code could be shared and run by
any interested reader.
Moreover, since Python is available at no cost, a much broader audience
can use the code you develop. As you‘ll see a little later on in the article,
Python has an awesome community that can help you get started with the
language and advance your knowledge. There are tens of thousands of tutorials,
articles, and books all about Python software development. Here are a few to
get you started:
• Introduction to Python 3
• Basic Data Types in Python
• Python 3 Basics Learning Path
Plus, with so many developers in the community, there are hundreds of
thousands of free packages to accomplish many of the tasks that you‘ll want to
do with Python.
Like MATLAB, Python is an interpreted language. This means that Python
code can be ported between all of the major operating system platforms and CPU
architectures out there, with only small changes required for different platforms.
There are distributions of Python for desktop and laptop CPUs and for
microcontrollers, such as Adafruit boards running CircuitPython. Python can also
talk to other microcontrollers like Arduino with a
simple programming interface that is almost identical no matter the host operating
system.
For all of these reasons, and many more, Python is an excellent choice to
replace MATLAB as your programming language of choice. Now that you‘re
convinced to try out Python, read on to find out how to get it on your computer
and how to switch from MATLAB!
Note: GNU Octave is a free and open-source clone of MATLAB. In this sense,
GNU Octave has the same philosophical advantages that Python has around code
reproducibility and access to the software.
Sometimes, though, a package is only available with pip, and for those cases,
you can read What Is Pip? A Guide for New Pythonistas.
Changing the Default Window Layout in Spyder
The default window in Spyder looks like the image below. This is for version
3.3.4 of Spyder running on Windows 10. It should look quite similar on macOS
or Linux:
Before you take a tour of the user interface, you can make the interface look
a little more like MATLAB. In the View → Window layouts menu, choose MATLAB
layout. That will change the window automatically so it has the same areas that
you‘re used to from MATLAB, annotated on the figure below:
In the top left of the window is the File Explorer or directory listing. In this
pane, you can find files that you want to edit or create new files and folders to
work with.
In the top center is a file editor. In this editor, you can work on Python scripts
that you want to save to re-run later on. By default, the editor opens a file
called temp.py located in Spyder‘s configuration directory. This file is meant as
a temporary place to try things out before you save them in a file somewhere else
on your computer.
In the bottom center is the console. Like in MATLAB, the console is where
you can run commands to see what they do or when you want to debug some code.
Variables created in the console are not saved if you close Spyder and open it
up again. The console is technically running IPython by default.
Any commands that you type in the console will be logged into the history
file in the bottom right pane of the window. Furthermore, any variables that you
create in the console will be shown in the variable explorer in the top right pane.
Notice that you can adjust the size of any pane by putting your mouse over
the divider between panes, clicking, and dragging the edge to the size that you
want. You can close any of the panes by clicking the x in the top of the pane.
You can also break any pane out of the main window by clicking the button
that looks like two windows in the top of the pane, right next to the x that closes
the pane.
When a pane is broken out of the main window, you can drag it around and
rearrange it however you want. If you want to put the pane back in the main
window, drag it with the mouse so a transparent blue or gray background appears
and the neighboring panes resize, then let go and the pane will snap into place.
Once you have the panes arranged exactly how you want, you can ask Spyder
to save the layout. Go to the View menu and find the Window layouts flyout again.
Then click Save current layout and give it a name. This lets you reset to
your preferred layout at any time if something gets changed by accident. You
can also reset to one of the default configurations from this menu.
Getting an Integrated Development Environment
One of the big advantages of MATLAB is that it includes a development
environment with the software. This is the window that you‘re most likely used
to working in. There is a console in the center where you can type commands,
a variable explorer on the right, and a directory listing on the left.
Unlike MATLAB, Python itself does not have a default development
environment.
It is up to each user to find one that fits their needs. Fortunately, Anaconda
comes with two different integrated development environments (IDEs) that are
similar to the MATLAB IDE to make your switch seamless. These are called
Spyder and JupyterLab. In the next two sections, you‘ll see a detailed introduction
to Spyder and a brief overview of JupyterLab.
Spyder
Spyder is an IDE for Python that is developed specifically for scientific Python
work. One of the really nice things about Spyder is that it has a mode specifically
designed for people like you who are converting from MATLAB to Python. You‘ll
see that a little later on.
Running Statements in the Console in Spyder
In this chapter, you‘re going to be writing some simple Python commands,
but don‘t worry if you don‘t quite understand what they mean yet.
You‘ll learn more about Python syntax a little later on in this chapter. What
you want to do right now is get a sense for how Spyder‘s interface is similar to
and different from the MATLAB interface.
You‘ll be working a lot with the Spyder console in this chapter, so you should
learn about how it works.
In the console, you‘ll see a line that starts with In [1]:, for input line 1. Spyder
(really, the IPython console) numbers all of the input lines that you type. Since
this is the first input you‘re typing, the line number is 1. In the rest of this chapter,
you'll see references to "input line X," where X is the number in the square
brackets.
One of the first things I like to do with folks who are new to Python is show
them the Zen of Python. This short poem gives you a sense of what Python is all
about and how to approach working with Python.
To see the Zen of Python, type import this on input line 1 and then run the
code by pressing Enter. You‘ll see an output like below:
In [1]: import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
This code has import this on input line 1. The output from running import
this is to print the Zen of Python onto the console. We‘ll return to several of the
stanzas in this poem later on in the chapter.
3. Size shows the size of the data stored variable, which is more useful
for lists and other data structures.
4. Value shows the current value of the variable.
In the file editor, type the following code, which is similar to what you already typed in the console:
var_4 = 10
var_5 = 20
var_6 = var_4 + var_5
Then, there are three ways to run the code:
1. You can use the F5 keyboard shortcut to run the file just like in
MATLAB.
2. You can click the green right-facing triangle in the menu bar just
above the Editor and File explorer panes.
3. You can use the Run → Run menu option.
The first time you run a file, Spyder will open a dialog window asking you
to confirm the options you want to use. For this test, the default options are fine
and you can click Run at the bottom of the dialog box:
If you‘re interested in learning more about JupyterLab, you can read a lot more
about the next evolution of the Notebook in the blog post announcing the beta
release or in the JupyterLab documentation. You can also learn about the Notebook
interface in Jupyter Notebook: An Introduction and the Using Jupyter
Notebooks course. One neat thing about the Jupyter Notebook-style document is
that the code cells you created in Spyder are very similar to the code cells in a
Jupyter Notebook.
Summarizing Your Experience in Spyder
Now you have the basic tools to use Spyder as a replacement for the MATLAB
integrated development environment. You know how to run code in the console
or type code into a file and run the file. You also know where to look to see your
directories and files, the variables that you‘ve defined, and the history of the
commands you typed.
Once you‘re ready to start organizing your code into modules and packages,
you can check out the following resources:
• Python Modules and Packages – An Introduction
• How to Publish an Open-Source Python Package to PyPI
• How to Publish Your Own Python Package to PyPI
Spyder is a really big piece of software, and you‘ve only just scratched the
surface. You can learn a lot more about Spyder by reading the official documentation,
the troubleshooting and FAQ guide, and the Spyder wiki.
LEARNING ABOUT PYTHON’S MATHEMATICAL LIBRARIES
Now you‘ve got Python on your computer and you‘ve got an IDE where you
feel at home. So how do you learn about how to actually accomplish a task in
Python? With MATLAB, you can use a search engine to find the topic you‘re
looking for just by including MATLAB in your query. With Python, you‘ll usually
get better search results if you can be a bit more specific in your query than just
including Python.
You‘ll take the next step to really feeling comfortable with Python by learning
about how Python functionality is divided into several libraries. You‘ll also learn
what each library does so you can get top-notch results with your searches!
Python is sometimes called a batteries-included language. This means that
most of the important functions you need are already included when you install
Python. For instance, Python has built-in math and statistics libraries that include
the basic operations.
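For example (a small illustrative snippet, not from the original text):
import math
import statistics

values = [2.5, 3.1, 4.7, 3.3]
print(statistics.mean(values))      # 3.4
print(statistics.stdev(values))     # sample standard deviation
print(math.sqrt(16))                # 4.0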
In this code, you are first creating num to store the value 10 and then checking
whether the value of num is equal to 10. If it is, you are displaying the phrase num
is equal to 10 on the console from line 2. Otherwise, the else clause will kick in
and display num is not equal to 10. Of course, if you run this code, you will see
the num is equal to 10 output and then I am now outside the if block.
Now you should modify your code so it looks like the sample below:
1 num = 10;
2 if num == 10
3     disp("num is equal to 10")
4 else
5     disp("num is not equal to 10")
6 end
7 disp("I am now outside the if block")
In this code, you have only changed lines 3 and 5 by adding some spaces or
indentation in the front of the line. The code will perform identically to the
previous example code, but with the indentation, it is much easier to tell what
code goes in the if part of the statement and what code is in the else part of the
statement.
In Python, indentation at the start of a line is used to delimit the beginning
and end of class and function definitions, if statements, and for and while loops.
There is no end keyword in Python. This means that indentation is very important
in Python!
In addition, in Python the definition line of an if/else/elif statement,
a for or while loop, a function, or a class is ended by a colon. In MATLAB, the
colon is not used to end the line.
Consider this code example:
1 num = 10
2 if num == 10:
3     print("num is equal to 10")
4 else:
5     print("num is not equal to 10")
6 print("I am now outside the if block")
On the first line, you are defining num and setting its value to 10. On line 2,
writing if num == 10: tests the value of num compared to 10. Notice the colon at
the end of the line. Next, line 3 must be indented in Python‘s syntax. On that line,
you are using print() to display some output to the console, in a similar way
to disp() in MATLAB. You‘ll read more about print() versus disp().
On line 4, you are starting the else block. Notice that the e in the else keyword
is vertically aligned with the i in the if keyword, and the line is ended by a colon.
Because the else is dedented relative to print() on line 3, and because it is aligned
with the if keyword, Python knows that the code within the if part of the block
has finished and the else part is starting. Line 5 is indented by one level, so it forms
the block of code to be executed when the else statement is satisfied.
Lastly, on line 6 you are printing a statement from outside the if/else block.
This statement will be printed regardless of the value of num. Notice that
the p in print() is vertically aligned with the i in if and the e in else. This is how
Python knows that the code in the if/else block has ended. If you run the code
above, Python will display num is equal to 10 followed by I am now outside the
if block.
Now you should modify the code above to remove the indentation and see
what happens. If you try to type the code without indentation into the Spyder/
IPython console, you will get an IndentationError:
In [1]: num = 10
In [2]: if num == 10:
...: print("num is equal to 10")
File "<ipython-input-2-f453ffd2bc4f>", line 2
print("num is equal to 10")
^
IndentationError: expected an indented block
In this code, you first set the value of num to 10 and then tried to write
the if statement without indentation. In fact, the IPython console is smart and
automatically indents the line after the if statement for you, so you‘ll have to delete
the indentation to produce this error.
When you‘re indenting your code, the official Python style guide called PEP
8 recommends using 4 space characters to represent one indentation level. Most
text editors that are set up to work with Python files will automatically insert 4
spaces if you press the Tab key on your keyboard. You can choose to use the tab
character for your code if you want, but you shouldn‘t mix tabs and spaces or
you‘ll probably end up with a TabError if the indentation becomes mismatched.
Conditional Statements Use elif in Python
In MATLAB, you can construct conditional statements with if, elseif, and else.
These kinds of statements allow you to control the flow of your program in
response to different conditions.
You should try this idea out with the code below, and then compare the
example of MATLAB vs Python for conditional statements:
1 num = 10;
2 if num == 10
3     disp("num is equal to 10")
4 elseif num == 20
5     disp("num is equal to 20")
6 else
7     disp("num is neither 10 nor 20")
8 end
In this code block, you are defining num to be equal to 10. Then you are
checking if the value of num is 10, and if it is, using disp() to print output to the
console. If num is 20, you are printing a different statement, and if num is neither
10 nor 20, you are printing the third statement.
In Python, the elseif keyword is replaced with elif:
1 num = 10
2 if num == 10:
3     print("num is equal to 10")
4 elif num == 20:
5     print("num is equal to 20")
6 else:
7     print("num is neither 10 nor 20")
This code block is functionally equivalent to the previous MATLAB code
block. There are two main differences. On line 4, elseif is replaced with elif, and
there is no end statement to end the block. Instead, the if block ends when the next
dedented line of code is found after the else. You can read more in the Python
documentation for if statements.
Calling Functions and Indexing Sequences Use Different Brackets
in Python
In MATLAB, when you want to call a function or when you want to index
an array, you use round brackets (()), sometimes also called parentheses. Square
brackets ([]) are used to create arrays.
You can test out the differences in MATLAB vs Python with the example code
below:
>> arr = [10, 20, 30];
>> arr(1)
ans =
10
>> sum(arr)
ans =
60
In this code, you first create an array using the square brackets on the right
side of the equal sign. Then, you retrieve the value of the first element by arr(1),
using the round brackets as the indexing operator. On the third input line, you
are calling sum() and using the round brackets to indicate the parameters that
should be passed into sum(), in this case just arr. MATLAB computes the sum of
the elements in arr and returns that result.
Python uses separate syntax for calling functions and indexing sequences. In
Python, using round brackets means that a function should be executed and using
square brackets will index a sequence:
In [1]: arr = [10, 20, 30]
In [2]: arr[0]
Out[2]: 10
In [3]: sum(arr)
Out[3]: 60
In this code, you are defining a Python list on input line 1. Python lists have
some important distinctions from arrays in MATLAB and arrays from the NumPy
package.
On the input line 2, you are displaying the value of the first element of the
list with the indexing operation using square brackets. On input line 3, you are
calling sum() using round brackets and passing in the list stored in arr. This results
in the sum of the list elements being displayed on the last line. Notice that Python
uses square brackets for indexing the list and round brackets for calling functions.
The First Index in a Sequence Is 0 in Python
In MATLAB, you can get the first value from an array by using 1 as the index.
This style follows the natural numbering convention and matches how you would
count the number of items in the sequence. You can try out the differences of
MATLAB vs Python with this example:
>> arr = [10, 20, 30];
>> arr(1)
ans =
10
>> arr(0)
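The Python counterpart of this example appears to fall on a page not reproduced
here; the sketch below shows the zero-based indexing that this section's heading
describes (in MATLAB, arr(0) raises an error because indices start at 1):
In [1]: arr = [10, 20, 30]
In [2]: arr[0]
Out[2]: 10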
In Python, the last value in a sequence can be retrieved by using the index -1:
In [1]: arr = [10, 20, 30]
In [2]: arr[-1]
Out[2]: 30
In this code, you are defining a Python list with three elements on input line
1. On input line 2, you are displaying the value of the last element of the list,
which has the index -1 and the value 30.
In fact, by using negative numbers as the index values you can work your way
backwards through the sequence:
In [3]: arr[-2]
Out[3]: 20
In [4]: arr[-3]
Out[4]: 10
In this code, you are retrieving the second-to-last and third-to-last elements
from the list, which have values of 20 and 10, respectively.
Exponentiation Is Done With ** in Python
In MATLAB, when you want to raise a number to a power you use the caret
operator (^). The caret operator is a binary operator that takes two numbers.
Other binary operators include addition (+), subtraction (-), multiplication (*), and
division (/), among others. The number on the left of the caret is the base and
the number on the right is the exponent.
Try out the differences of MATLAB vs Python with this example:
>> 10^2
ans =
100
In this code, you are raising 10 to the power of 2 using the caret, resulting
in an answer of 100.
In Python, you use two asterisks (**) when you want to raise a number to
a power:
In [1]: 10 ** 2
Out[1]: 100
In this code, you are raising 10 to the power of 2 using two asterisks, resulting
in an answer of 100.
Notice that there is no effect of including spaces on either side of the asterisks.
In Python, the typical style is to have spaces on both sides of a binary operator.
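The definition of add_or_subtract() and the call that produces Out[5] below fall
on pages not reproduced here. A minimal definition consistent with the calls
discussed in this section might look like the following sketch (the default value
of subtract is an assumption):
def add_or_subtract(num_1, num_2, subtract=False):
    """Return num_1 + num_2, or num_1 - num_2 when subtract is True."""
    if subtract:
        return num_1 - num_2
    return num_1 + num_2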
Out[5]: -10
In this code, you have used keyword arguments for all three arguments
to add_or_subtract(). Keyword arguments are specified by stating the argument
name, then an equals sign, then the value that argument should have. One of the
big advantages of keyword arguments is that they make your code more explicit.
(As the Zen of Python says, explicit is better than implicit.)
However, they make the code somewhat longer, so it‘s up to your judgement
when to use keyword arguments or not.
Another benefit of keyword arguments is that they can be specified in any
order:
In [6]: add_or_subtract(subtract=True, num_2=20, num_1=10)
Out[6]: -10
In this code, you have specified the three arguments for add_or_subtract() as
keyword arguments, but the order is different from in the function definition.
Nonetheless, Python connects the right variables together because they are specified
as keywords instead of positional arguments.
You can also mix positional and keyword arguments together in the same
function call. If positional and keyword arguments are mixed together, the positional
arguments must be specified first, before any keyword arguments:
In [7]: add_or_subtract(10, 20, subtract=True)
Out[7]: -10
In this code, you have specified the values for num_1 and num_2 using
positional arguments, and the value for subtract using a keyword argument. This
is probably the most common case of using keyword arguments, because it
provides a good balance between being explicit and being concise.
Finally, there is one last benefit of using keyword arguments and default
values. Spyder, and other IDEs, provide introspection of function definitions. This
will tell you the names of all of the defined function arguments, which ones have
default arguments, and the value of the default arguments. This can save you time
and make your code easier and faster to read.
There Are No switch/case Blocks in Python
In MATLAB, you can use switch/case blocks to execute code by checking the
value of a variable for equality with some constants. This type of syntax is quite
useful when you know you want to handle a few discrete cases. Try out a switch/
case block with this example:
num = 10;
switch num
    case 10
        disp("num is 10")
    case 20
        disp("num is 20")
    otherwise
        disp("num is neither 10 nor 20")
end
In this code, you start by defining num and setting it equal to 10 and on the
following lines you test the value of num. This code will result in the output num
is 10 being displayed on the console, since num is equal to 10.
This syntax is an interesting comparison of MATLAB vs Python because
Python does not have a similar syntax. Instead, you should use an if/elif/else block:
num = 10
if num == 10:
    print("num is 10")
elif num == 20:
    print("num is 20")
else:
    print("num is neither 10 nor 20")
In this code, you start by defining num and setting it equal to 10. On the next
several lines you are writing an if/elif/else block to check the different values that
you are interested in.
Namespaces Are One Honking Great Idea in Python
In MATLAB, all functions are found in a single scope. MATLAB has a defined
search order for finding functions within the current scope. If you define your own
function for something that MATLAB already includes, you may get unexpected
behavior.
As you saw in the Zen of Python, namespaces are one honking great
idea. Namespaces are a way to provide different scopes for names of functions,
classes, and variables. This means you have to tell Python which library has the
function you want to use. This is a good thing, especially in cases where multiple
libraries provide the same function.
For instance, the built-in math library provides a square root function, as does
the more advanced NumPy library. Without namespaces, it would be more difficult
to tell Python which square root function you wanted to use.
To tell Python where a function is located, you first have to import the library,
which creates the namespace for that library‘s code. Then, when you want to use
a function from the library, you tell Python which namespace to look in:
In [1]: import math
In [2]: math.sqrt(4)
Out[2]: 2.0
In this code, on input line 1 you imported the math library that is built-in to
Python. Then, input line 2 computes the square root of 4 using the square root
function from within the math library. The math.sqrt() line should be read as "from
within math, find sqrt()."
The import keyword searches for the named library and binds the namespace
to the same name as the library by default. You can read more about how Python
searches for libraries in Python Modules and Packages – An Introduction.
You can also tell Python what name it should use for a library. For instance,
it is very common to see numpy shortened to np with the following code:
In [3]: import numpy as np
In [4]: np.sqrt(4)
Out[4]: 2.0
In this code, input line 3 imports NumPy and tells Python to put the library
into the np namespace. Then, whenever you want to use a function from NumPy,
you use the np abbreviation to find that function. On input line 4, you are computing
the square root of 4 again, but this time using np.sqrt(). np.sqrt() should be read
as "from within NumPy, find sqrt()."
There are two main caveats to using namespaces where you should be careful:
1. You should not name a variable with the same name as one of the
functions built into Python. You can find a complete list of these
functions in the Python documentation. The most common variable
names that are also built-in functions and should not be used
are dir, id, input, list, max, min, sum, str, type, and vars.
2. You should not name a Python file (one with the extension .py) with
the same name as a library that you have installed. In other words,
you should not create a Python file called math.py. This is because
Python searches the current working directory first when it tries to
import a library. If you have a file called math.py, that file will be
found before the built-in math library and you will probably see
an AttributeError.
NATURE
MATLAB is closed-source software and a proprietary commercial product.
Thus, you need to purchase it to be able to use it. For every additional MATLAB
toolbox you wish to install and run, you need to incur extra charges. The cost
aspect aside, it is essential to note that since MATLAB is proprietary to
MathWorks, its user base is quite limited. Also, if MathWorks were ever to go
out of business, MATLAB would lose its industrial importance.
Unlike MATLAB, Python is an open-source programming language, meaning
it is entirely free. You can download and install Python and make alterations to
the source code to best suit your needs. Due to this reason, Python enjoys a bigger
fan following and user base. Naturally, the Python community is pretty extensive,
with hundreds of thousands of developers contributing actively to enrich the
language continually. As we stated earlier, Python offers numerous free packages,
making it an appealing choice for developers worldwide.
Syntax
The most notable technical difference between MATLAB and Python lies in
their syntax. While MATLAB treats everything as an array, Python treats everything
as a general object.
For instance, in MATLAB, strings can either be arrays of strings or arrays
of characters, but in Python, strings are denoted by a unique object called "str."
Another example highlighting the difference between MATLAB and Python‘s
syntax is that in MATLAB, a comment is anything that starts after the percent
sign (%). In contrast, comments in Python typically follow the hash symbol (#).
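For example (a small illustrative snippet, not from the original text):
# This is a Python comment; it starts with the hash symbol.
greeting = "hello"            # a single "str" object, not an array of characters
print(type(greeting))         # <class 'str'>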
IDE
MATLAB boasts an integrated development environment. It is a neat interface with
a console located in the center where you can type commands, a variable explorer
on the right, and a directory listing on the left.
On the other hand, Python does not include a default development environment.
Users need to choose an IDE that fits their requirement specifications. Anaconda,
a popular Python package, encompasses two different IDEs – Spyder and JupyterLab
– that function as efficiently as the MATLAB IDE.
Tools
Programming languages are usually accompanied by a suite of specialized
tools to support a wide range of user requirements, from modeling scientific data
to building ML models. Integrated tools make the development process easier,
quicker, and more seamless.
Although MATLAB does not have a host of libraries, its standard library
includes integrated toolkits to cover complex scientific and computational
challenges. The best thing about MATLAB toolkits is that they are developed by
experts, rigorously tested, and well-documented for scientific and engineering
operations.
The toolkits are designed to collaborate efficiently and also integrate seamlessly
with parallel computing environments and GPUs. Moreover, since they are updated
together, you get fully-compatible versions of the tools.
As for Python, all of its libraries contain many useful modules for different
programming needs and frameworks. Some of the best Python libraries include
NumPy, SciPy, PyTorch, OpenCV Python, Keras, TensorFlow, Matplotlib, Theano,
Requests, and NLTK. Being an open-source programming language, Python offers
the flexibility and freedom to developers to design Python-based software tools
(like GUI toolkits) for extending the capabilities of the language.
WHICH IS BETTER FOR DEEP LEARNING MATLAB OR
PYTHON?
Deep Learning techniques have changed the field of computer vision significantly
over the last decade, providing state-of-the-art solutions to problems such as
object detection and image classification, and opening the door to new challenges
such as image-to-image translation and visual question answering (VQA).
The success and popularization of Deep Learning in the field of computer
vision and related areas are fostered, in great part, by the availability of rich tools,
apps and frameworks in the Python and MATLAB ecosystems.
MATLAB is a robust computing environment for mathematical and technical
computing operations involving arrays, matrices, and linear algebra, while
Python is a high-level, general-purpose programming language designed for ease
of use by people accomplishing all sorts of tasks.
MATLAB has been used for scientific computing for a long time, while Python has
evolved into an efficient programming language with the emergence of artificial
intelligence, deep learning, and machine learning. Though both are used to execute
various data analysis and rendering tasks, there are some elementary differences.
MATLAB VS PYTHON
MATLAB, also known as matrix laboratory, was designed by Cleve Moler and is a
multi-paradigm programming language developed by MathWorks. It is helpful for
matrix manipulation, implementation of algorithms, and interfacing with programs
written in other programming languages. MATLAB is primarily used for numerical
computing.
Python, on the other hand, was created by Guido van Rossum in 1991 and is a
high-level, general-purpose programming language. Python supports multiple
paradigms such as procedural, functional, and object-oriented programming.
Python is the most widely used language in modern machine learning research in
industry and academia. It is the number one language for natural language
processing (NLP), computer vision (CV), and reinforcement learning, thanks to
available packages such as NLTK, OpenCV, OpenAI Gym, etc.
Generic programming tasks are problems that are not specific to any application.
For example, reading and saving data to a file, preprocessing a CSV or text file,
writing scripts or functions for basic problems like counting the number of
occurrences of an event, plotting data, or performing basic statistical tasks such
as computing the mean, median, standard deviation, etc.
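As a small illustration of such tasks in Python (the file name and column names
below are assumptions used only for the example):
import csv
import statistics
from collections import Counter

# Read a CSV file and compute simple statistics over one numeric column.
with open('measurements.csv', newline='') as f:
    rows = list(csv.DictReader(f))

values = [float(row['value']) for row in rows]
print(statistics.mean(values), statistics.median(values), statistics.stdev(values))

# Count the number of occurrences of each label (an "event").
print(Counter(row['label'] for row in rows))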
MACHINE LEARNING
This is the area where Python and R have a clear advantage over Matlab. They
both have access to numerous libraries and packages for both classical (random
forest, regression, SVM, etc.) and modern (deep learning and neural networks such
as CNN, RNN, etc.) machine learning models. However, Python is the most widely
used language for modern machine learning research in industry and academia.
It is the number one language for natural language processing (NLP), computer
vision (CV), and reinforcement learning, thanks to many available packages such
as NLTK, OpenCV, OpenAI Gym, etc.
Python is also the number one language for most research or work involving
neural networks and deep learning, thanks to many available libraries and platforms
such as Tensorflow, Pytorch, Keras, etc.
Probabilistic Graphical Modeling (PGM)
Probabilistic graphical models are a class of models for inference and learning
on graphs. They are divided into undirected graphical models (sometimes referred
to as Markov random fields) and directed graphical models (Bayesian networks).
Python, R, and Matlab all have support for PGM. However, Python and R
are outperforming Matlab in this area. Matlab, thanks to the BNT (Bayesian
Network Toolbox) by Kevin Murphy, has support for the static and dynamic
Bayesian network. The Matlab standard library (hmmtrain) supports the discrete
hidden Markov model (HMM), a well-known class of dynamic Bayesian networks.
Matlab also supports the conditional random field (CRF) thanks to crfChain (by
Mark Schmidt and Kevin Swersky) and UGM by Mark Schmidt.
Python has excellent support for PGM thanks to hmmlearn (Full support for
discrete and continuous HMM), pomegranate, bnlearn (a wrapper around the
bnlearn in R), pypmc, bayespy, pgmpy, etc. It also has better support for CRF
through sklearn-crfsuite.
R has excellent support for PGM. It has numerous stunning packages and
libraries such as bnlearn, bnstruct, depmixS4, etc. The support for CRF is done
through the CRF and crfsuite packages.
Causal Inference
R by far is the most widely used language in causal inference research (along
with SAS and STATA; however, R is free while the other two are not). It has
numerous libraries such as bnlearn, bnstruct for causal discovery (structure learning)
to learn the DAG (directed acyclic graph) from data. It has libraries and functions
for various techniques such as outcome regression, IPTW, g-estimation, etc.
Python also, thanks to the dowhy package by Microsoft research, is capable
of combining the Pearl causal network framework with the Rubin potential outcome
model and provides an easy interface for causal inference modeling.
Time-Series Analysis
R is also the strongest and by far the most widely used language for time series
analysis and forecasting. Numerous books have been written about time series
forecasting using R. There are many libraries to implement algorithms such as
ARIMA, Holt-Winters, exponential smoothing. For example, the forecast package
by Rob Hyndman is the most used package for time series forecasting. Python,
thanks to neural networks, especially the LSTM, has received a lot of attention
in time series forecasting. Furthermore, the Prophet package by Facebook, written in both
R and Python provides excellent and automated support for time series analysis
and forecasting.
Signal Processing and Digital Communication
This is the area where Matlab is the strongest and is used often in research
and industry. Matlab communications toolbox provides all functionalities needed
to implement a complete communication system. It has functionalities to implement
all well-known modulation schemes, channel and source coding, equalizer, and
necessary decoding and detection algorithms in the receiver. The DSP system
toolbox provides all functionalities to design IIR (Infinite Impulse Response), FIR
(Finite Impulse Response), and adaptive filters. It has complete support for FFT
(Fast Fourier Transform), IFFT, wavelet, etc.
Python, although not as capable as Matlab in this area, has support for
digital communication algorithms through the CommPy and Komm packages.
Control and Dynamical System
Matlab is still the most widely used language for implementing the control
and dynamical system algorithms thanks to the control system toolbox. It has
extensive support for all well-known methods such as PID controllers, state-space
design, root locus, transfer functions, pole-zero diagrams, the Kalman filter, and many
more. However, the main strength of Matlab comes from its excellent and
versatile graphical editor, Simulink. Simulink lets you simulate a real-world
system using drag-and-drop blocks (it is similar to LabVIEW). The Simulink
output can then be imported to Matlab for further analysis. Python has support
for control and dynamical system through the control and dynamical systems
library.
Optimization and Numerical Analysis
All three programming languages have excellent support for optimization
problems such as linear programming (LP), convex optimization, nonlinear
optimization with and without constraint.
The support for optimization and numerical analysis in Matlab is done through
the optimization toolbox. This supports linear programming (LP), mixed-integer
linear programming (MILP), quadratic programming (QP), second-order cone
programming (SOCP), nonlinear programming (NLP), constrained linear least
squares, nonlinear least squares, nonlinear equations, etc. CVX is another strong
package in Matlab written by Stephen Boyd and his Ph.D. student for convex
optimization. Python supports optimization through various packages such as
CVXOPT, pyOpt (Nonlinear optimization), PuLP(Linear Programming), and
CVXPY (python version of CVX for convex optimization problems). R supports
convex optimization through CVXR (Similar to CVX and CVXPY), optimx
(quasi-Newton and conjugate gradient method), and ROI (linear, quadratic, and
conic optimization problems).
Web Development
This is an area where Python outperforms R and Matlab by a large margin.
Actually, neither R nor Matlab is really used for web development.
Python
Advantage:
• Object-oriented language.
• Easy to learn and user-friendly syntax.
Disadvantage:
• Lack of good packages for signal processing and communication
(still behind for engineering applications).
• Steeper learning curve than MATLAB, since it is an object-oriented
programming (OOP) language and is harder to master.
• Requires more time and expertise to set up and install the working
environment.
R
Advantage:
• So many wonderful libraries in statistics and machine learning.
• Open-source and free.
• Number one language for time series analysis, causal inference, and
PGM.
• A large community of researchers, especially in academia.
• Ability to create web applications, for example, through the Shiny
package.
Disadvantage:
• Slower compared to Python and Matlab.
• More limited scope in terms of applications compared to Python (it
cannot be used for game development or as a backend for web
development).
• Not an object-oriented language.
• Lack of good packages for signal processing and communication
(still behind for engineering applications).
• Smaller user communities compared to Python.
• Harder to learn and less user-friendly compared to Python and Matlab.
To summarize, Python is the most popular language for machine learning, AI,
and web development while it provides excellent support for PGM and optimization.
On the other hand, Matlab is a clear winner for engineering applications while
it has lots of good libraries for numerical analysis and optimization. The biggest
disadvantage of Matlab is that it is not free or open-source. R is a clear winner
for time series analysis, causal inference, and PGM. It also has excellent support
for machine learning and data science applications.
Gradient Descent in Machine Learning
When we have a single parameter (theta), we can plot the dependent variable
cost on the y-axis and theta on the x-axis. If there are two parameters, we can
go with a 3-D plot, with cost on one axis and the two parameters (thetas) along
the other two axes.
It can also be visualized by using Contours. This shows a 3-D plot in two
dimensions with parameters along both axes and the response as a contour. The
value of the response increases away from the center and has the same value along
the rings. The response is directly proportional to the distance of a point from
the center (along a direction).
Local Minima
The cost function may have many local minimum points. Gradient descent may
settle on any one of the minima, depending on the initial point (i.e., the initial
parameters, theta) and the learning rate. Therefore, the optimization may converge
to different points with different starting points and learning rates.
o Move away from the direction of the gradient, i.e., step from the current
point by alpha times the gradient in the opposite direction, where alpha is
defined as the learning rate. It is a tuning parameter in the optimization
process which helps to decide the length of the steps.
What is Cost-function?
The cost function is defined as the measurement of the difference, or error,
between the actual values and the predicted values at the current position,
expressed as a single real number. It improves machine learning efficiency by
providing feedback to the model so that it can minimize the error and find the
local or global minimum.
Further, it continuously iterates along the direction of the negative gradient
until the cost function approaches zero. At this point of steepest descent, the
model will stop learning further. Although the cost function and the loss function
are often considered synonymous, there is a minor difference between them.
The slight difference between the loss function and the cost function is
about the error within the training of machine learning models, as loss function
refers to the error of one training example, while a cost function calculates
the average error across an entire training set.
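A small numerical illustration of this difference (the values are illustrative,
not from the text):
import numpy as np

y_true = np.array([3.0, 5.0, 4.0])
y_pred = np.array([2.5, 5.5, 3.0])
per_example_loss = (y_true - y_pred) ** 2   # loss: error of one training example
cost = per_example_loss.mean()              # cost: average loss over the whole set
print(per_example_loss)                     # [0.25 0.25 1.  ]
print(cost)                                 # 0.5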
How does Gradient Descent work?
Before starting the working principle of gradient descent, we should know
some basic concepts to find out the slope of a line from linear regression. The
equation for simple linear regression is given as:
Y = mX + c
where 'm' represents the slope of the line and 'c' represents the intercept
on the y-axis.
The starting point is used to evaluate the performance as it is considered just
as an arbitrary point. At this starting point, we will derive the first derivative or
slope and then use a tangent line to calculate the steepness of this slope. Further,
this slope will inform the updates to the parameters (weights and bias).
The slope is steeper at the starting (arbitrary) point, but as new parameters are
generated, the steepness gradually reduces until it approaches the lowest point,
which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, or the
error between the predicted and actual values. To minimize the cost function, two
pieces of information are required: the direction in which to move (given by the
gradient) and the learning rate.
The learning rate is defined as the step size taken to reach the minimum or lowest point. This
is typically a small value that is evaluated and updated based on the behavior of
the cost function. If the learning rate is high, it results in larger steps but also
risks overshooting the minimum. A low learning rate, on the other hand, gives small
step sizes, which compromises overall efficiency but offers the advantage of more
precision.
TYPES OF GRADIENT DESCENT
Based on the error in various training models, the Gradient Descent learning
algorithm can be divided into Batch gradient descent, stochastic gradient descent,
and mini-batch gradient descent. Let‘s understand these different types of gradient
descent:
Batch Gradient Descent:
Batch gradient descent (BGD) is used to find the error for each point in the
training set and update the model after evaluating all training examples. This
procedure is known as the training epoch. In simple words, it is a greedy approach
where we have to sum over all examples for each update.
Advantages of Batch gradient descent:
o It produces less noise in comparison to other types of gradient descent.
o It produces stable gradient descent convergence.
o It is computationally efficient, as all resources are used to process all
training samples.
Stochastic gradient descent
Stochastic gradient descent (SGD) is a type of gradient descent that runs one
training example per iteration. In other words, it processes a training epoch
example by example and updates the model parameters one training example at a
time. Since it requires only one training example at a time, it is easier to
store in memory. However, it loses some computational efficiency compared to
batch gradient descent because of its frequent updates. Further, due to the
frequent updates, the gradient is noisy. This noise, however, can sometimes be
helpful for escaping local minima and finding the global minimum.
Advantages of Stochastic gradient descent:
In Stochastic gradient descent (SGD), learning happens on every example, and
it has a few advantages over other types of gradient descent.
o It is easier to allocate in desired memory.
o It is relatively fast to compute than batch gradient descent.
o It is more efficient for large datasets.
Whenever the slope of the cost function is at or close to zero, the model stops learning. Apart from the global minimum, two other scenarios can produce such a slope: saddle points and local minima. A local minimum has a shape similar to the global minimum, with the slope of the cost function increasing on both sides of the current point.
In contrast, at a saddle point the negative gradient exists only on one side of the point, which is a local maximum in one direction and a local minimum in the other. The name saddle point comes from the shape of a horse's saddle.
A local minimum is so named because the value of the loss function is minimal at that point within a local region. In contrast, the global minimum is so named because the value of the loss function is minimal there globally, across the entire domain of the loss function.
Vanishing and Exploding Gradient
In a deep neural network, if the model is trained with gradient descent and
backpropagation, there can occur two more issues other than local minima and
saddle point.
Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation the gradient shrinks as it is propagated backward, causing the earlier layers of the network to learn more slowly than the later layers. When this happens, the weight updates of the earlier layers become so small that they are insignificant.
Exploding Gradient:
An exploding gradient is the opposite of a vanishing gradient: it occurs when the gradient is too large, creating an unstable model. In this scenario the model weights grow too large and may eventually be represented as NaN. This problem can be mitigated by dimensionality reduction techniques, which help to reduce complexity within the model.
or
cost = evaluate(f(coefficient))
The derivative of the cost is calculated. The derivative is a concept from
calculus and refers to the slope of the function at a given point. We need to know
the slope so that we know the direction (sign) to move the coefficient values in
order to get a lower cost on the next iteration.
delta = derivative(cost)
Now that we know from the derivative which direction is downhill, we can
now update the coefficient values. A learning rate parameter (alpha) must be
specified that controls how much the coefficients can change on each update.
coefficient = coefficient - (alpha * delta)
This process is repeated until the cost of the coefficients (cost) is 0.0 or close
enough to zero to be good enough.
You can see how simple gradient descent is. It does require you to know the gradient of your cost function or the function you are optimizing, but besides that, it's very straightforward.
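As a minimal sketch of this loop in Python (assuming, purely for illustration, the cost function f(coefficient) = coefficient**2, whose derivative is 2 * coefficient):

# A sketch of the gradient descent loop described above. The names cost, delta
# and alpha mirror the text; the cost function is an illustrative assumption.
def gradient_descent(alpha=0.1, n_iterations=50):
    coefficient = 5.0                      # arbitrary starting point
    for _ in range(n_iterations):
        cost = coefficient ** 2            # cost = evaluate(f(coefficient))
        delta = 2 * coefficient            # delta = derivative(cost)
        coefficient = coefficient - alpha * delta
    return coefficient, cost

coefficient, cost = gradient_descent()
print(coefficient, cost)                   # both approach 0.0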
BATCH GRADIENT DESCENT FOR MACHINE LEARNING
The goal of all supervised machine learning algorithms is to best estimate a
target function (f) that maps input data (X) onto output variables (Y). This
describes all classification and regression problems.
Some machine learning algorithms have coefficients that characterize the algorithm's estimate of the target function (f). Different algorithms have different
representations and different coefficients, but many of them require a process of
optimization to find the set of coefficients that result in the best estimate of the
target function.
Common examples of algorithms with coefficients that can be optimized using
gradient descent are Linear Regression and Logistic Regression.
How closely a machine learning model fits the target function can be evaluated in a number of different ways, often specific to the algorithm. The cost function evaluates the coefficients of the model by calculating a prediction for each training instance in the dataset, comparing the predictions to the actual output values, and computing a sum or average error (such as the Sum of Squared Residuals, SSR, in the case of linear regression).
From the cost function a derivative can be calculated for each coefficient so
that it can be updated using exactly the update equation described above.
The cost is calculated for a machine learning algorithm over the entire training
dataset for each iteration of the gradient descent algorithm. One iteration of the
algorithm is called one batch and this form of gradient descent is referred to as
batch gradient descent.
Batch gradient descent is the most common form of gradient descent described
in machine learning.
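The following sketch shows what batch gradient descent might look like for simple linear regression (Y = mX + c); the data and settings here are illustrative only, not taken from the text:

# Batch gradient descent for simple linear regression: the whole training set
# is used to compute the gradient on every iteration (one "batch" per update).
def batch_gradient_descent(xs, ys, alpha=0.05, epochs=2000):
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_m = sum((m * x + c - y) * x for x, y in zip(xs, ys)) / n
        grad_c = sum((m * x + c - y) for x, y in zip(xs, ys)) / n
        m -= alpha * grad_m
        c -= alpha * grad_c
    return m, c

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]                        # generated from y = 2x + 1
print(batch_gradient_descent(xs, ys))        # approximately (2.0, 1.0)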
STOCHASTIC GRADIENT DESCENT FOR MACHINE
LEARNING
Gradient descent can be slow to run on very large datasets.
Because one iteration of the gradient descent algorithm requires a prediction
for each instance in the training dataset, it can take a long time when you have
many millions of instances.
In situations when you have large amounts of data, you can use a variation
of gradient descent called stochastic gradient descent.
In this variation, the gradient descent procedure described above is run but
the update to the coefficients is performed for each training instance, rather than
at the end of the batch of instances.
The first step of the procedure requires that the order of the training dataset
is randomized. This is to mix up the order that updates are made to the coefficients.
Because the coefficients are updated after every training instance, the updates will
be noisy jumping all over the place, and so will the corresponding cost function.
By mixing up the order for the updates to the coefficients, it harnesses this random
walk and avoids it getting distracted or stuck.
The update procedure for the coefficients is the same as that above, except
the cost is not summed over all training patterns, but instead calculated for one
training pattern.
The learning can be much faster with stochastic gradient descent for very large
training datasets and often you only need a small number of passes through the
dataset to reach a good or good enough set of coefficients, e.g. 1-to-10 passes
through the dataset.
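A corresponding sketch of stochastic gradient descent, where the coefficients are updated after every single training instance and the order of the data is shuffled on each pass (again, the data and settings are illustrative):

import random

# Stochastic gradient descent for the same linear model (Y = mX + c).
def stochastic_gradient_descent(data, alpha=0.05, epochs=100):
    m, c = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)                 # randomise the order of updates
        for x, y in data:
            error = (m * x + c) - y          # error for one training pattern
            m -= alpha * error * x
            c -= alpha * error
    return m, c

data = [(1, 3), (2, 5), (3, 7), (4, 9), (5, 11)]     # y = 2x + 1
print(stochastic_gradient_descent(data))              # roughly (2.0, 1.0)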
Tips for Gradient Descent
This chapter lists some tips and tricks for getting the most out of the gradient
descent algorithm for machine learning.
• Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm on each iteration. The expectation for a well-performing gradient descent run is that the cost decreases on each iteration; if it does not, try reducing the learning rate.
ALGORITHM FOR STOCHASTIC GRADIENT DESCENT
Repeat {
For i = 1 to m {
θj = θj - α * (h(x(i)) - y(i)) * xj(i)   for every j = 0 … n   (α is the learning rate)
}
}
ALGORITHM FOR MINI BATCH GRADIENT DESCENT
Let b be the number of examples in one batch, where b < m.
Assume b = 10 and m = 100.
Note: We can adjust the batch size, but it is generally kept as a power of 2. The reason is that some hardware, such as GPUs, achieves better run time with common batch sizes such as powers of 2.
Repeat {
For i = 1, 11, 21, ....., 91 {
Let Σ denote the summation over k from i to i + 9.
θj = θj - (α / b) * Σ (h(x(k)) - y(k)) * xj(k)   for every j = 0 … n
}
}
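A rough Python sketch of the same idea for a simple linear model, where each update averages the gradient over one batch of b examples (the data and settings below are illustrative):

# Mini-batch gradient descent: the training set of m examples is processed in
# batches of size b, and the parameters are updated once per batch using the
# averaged gradient over that batch.
def mini_batch_gradient_descent(xs, ys, alpha=0.1, b=10, epochs=1000):
    m_coef, c = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        for start in range(0, n, b):
            batch = list(zip(xs[start:start + b], ys[start:start + b]))
            grad_m = sum((m_coef * x + c - y) * x for x, y in batch) / len(batch)
            grad_c = sum((m_coef * x + c - y) for x, y in batch) / len(batch)
            m_coef -= alpha * grad_m
            c -= alpha * grad_c
    return m_coef, c

xs = [i / 100 for i in range(100)]            # m = 100 examples
ys = [2 * x + 1 for x in xs]                  # generated from y = 2x + 1
print(mini_batch_gradient_descent(xs, ys))    # approximately (2.0, 1.0)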
Convergence trends in different variants of Gradient Descents
In case of Batch Gradient Descent, the algorithm follows a straight path
towards the minimum. If the cost function is convex, then it converges to a global
minimum and if the cost function is not convex, then it converges to a local
minimum. Here the learning rate is typically held constant. In the case of stochastic gradient descent and mini-batch gradient descent, the algorithm does not converge exactly but keeps fluctuating around the global minimum. Therefore, in order to make it converge, the learning rate has to be decreased slowly. The convergence of stochastic gradient descent is much noisier, because in one iteration it processes only one training example.
We'll walk through how gradient descent works, what types of it are used today, and its advantages and tradeoffs.
INTRODUCTION TO GRADIENT DESCENT
Gradient descent is an optimization algorithm that is used when training a machine learning model. It is based on a convex function and tweaks its parameters iteratively to minimize a given function to its local minimum.
Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. It is used in machine learning simply to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.
You start by defining the initial parameters' values, and from there gradient descent uses calculus to iteratively adjust the values so that they minimize the given cost function. To understand this concept fully, it is important to know about gradients.
What is a Gradient?
“A gradient measures how much the output of a function changes if you
change the inputs a little bit.” —Lex Fridman (MIT)
A gradient simply measures the change in all weights with regard to the change
in error. You can also think of a gradient as the slope of a function. The higher
the gradient, the steeper the slope and the faster a model can learn. But if the slope
is zero, the model stops learning. In mathematical terms, a gradient is a partial
derivative with respect to its inputs.
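A small numerical illustration of this idea (a sketch, with an arbitrarily chosen example function): the partial derivatives are estimated by nudging each input a little bit and measuring how much the output changes.

# Finite-difference estimate of a gradient: nudge each input slightly and
# measure the change in the output.
def numerical_gradient(f, inputs, eps=1e-6):
    grads = []
    for i in range(len(inputs)):
        nudged = list(inputs)
        nudged[i] += eps
        grads.append((f(nudged) - f(inputs)) / eps)
    return grads

# Example function of two weights: f(w) = w0**2 + 3*w1
f = lambda w: w[0] ** 2 + 3 * w[1]
print(numerical_gradient(f, [2.0, 1.0]))   # approximately [4.0, 3.0]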
Imagine the image below illustrates our hill from a top-down view and the
red arrows are the steps of our climber. Think of a gradient in this context as a
vector that contains the direction of the steepest step the blindfolded man can take
and also how long that step should be.
Note that the gradient from X0 to X1 is much longer than the one from X3 to X4. This is because the steepness (slope) of the hill, which determines the length of the vector, is smaller near X3 and X4. This perfectly represents the example of the hill, because the hill gets less steep the higher it is climbed. Therefore a reduced gradient goes along with a reduced slope and a reduced step size for the hill climber.
So this formula basically tells us the next position we need to go to, which is the direction of the steepest descent. Let's look at another example to really drive the concept home.
Imagine you have a machine learning problem and want to train your algorithm
with gradient descent to minimize your cost-function J(w, b) and reach its local
minimum by tweaking its parameters (w and b). The image below shows the
horizontal axes representing the parameters (w and b), while the cost function J(w,
b) is represented on the vertical axis. The cost function shown here is convex. Gradient descent starts at an initial point (somewhere around the top of our illustration) and takes one step after another in the steepest downward direction (i.e., from the top toward the bottom of the illustration) until it reaches the point where the cost function is as small as possible.
IMPORTANCE OF THE LEARNING RATE
The size of the steps gradient descent takes in the direction of the local minimum is determined by the learning rate, which controls how fast or slow we move toward the optimal weights.
For gradient descent to reach the local minimum we must set the learning rate
to an appropriate value, which is neither too low nor too high.
This is important because if the steps it takes are too big, it may never reach the local minimum, because it bounces back and forth across the convex valley of the cost function. If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but it may take a while.
So, for this reason, the learning rate should be neither too high nor too low. You can check whether your learning rate is doing well by plotting the cost on a graph: a good learning rate produces a steadily decreasing curve, while a bad one produces a curve that decreases very slowly or even increases.
If gradient descent is working properly, the cost function should decrease after
every iteration.
When gradient descent can no longer decrease the cost function and remains more or less at the same level, it has converged. The number of iterations gradient descent needs to converge can vary a lot. It can take 50 iterations, 60,000, or maybe even 3 million, making the number of iterations to convergence hard to estimate in advance.
There are some algorithms that can automatically tell you if gradient descent
has converged, but you must define a threshold for the convergence beforehand,
which is also pretty hard to estimate. For this reason, simple plots are the preferred
convergence test.
Another advantage of monitoring gradient descent via plots is that it allows us to easily spot when it is not working properly, for example if the cost function is increasing. Most of the time the reason for an increasing cost function when using gradient descent is a learning rate that is too high.
If the plot shows the learning curve just going up and down, without really
reaching a lower point, try decreasing the learning rate. Also, when starting out
with gradient descent on a given problem, simply try 0.001, 0.003, 0.01, 0.03, 0.1,
0.3, 1, etc., as the learning rates and look at which one performs the best.
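As a sketch of this kind of check (using the toy cost function f(w) = w**2 purely for illustration, and matplotlib for the plot), one can plot the cost per iteration for each candidate learning rate:

import matplotlib.pyplot as plt

# Run gradient descent on f(w) = w**2 for several candidate learning rates and
# plot the cost per iteration, so good and bad rates are easy to spot.
def cost_history(alpha, w=5.0, iterations=50):
    history = []
    for _ in range(iterations):
        history.append(w ** 2)       # record the cost before each update
        w -= alpha * 2 * w           # the gradient of w**2 is 2w
    return history

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    plt.plot(cost_history(alpha), label=f"alpha={alpha}")

plt.xlabel("Iteration")
plt.ylabel("Cost")
plt.legend()
plt.show()

With a rate of 1.0 the cost never decreases on this toy function, while very small rates decrease it only slowly; the intermediate rates show the steadily falling curve we want.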
TYPES OF GRADIENT DESCENT
There are three popular types of gradient descent that mainly differ in the amount of data they use: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
IMAGE RECOGNITION
Image recognition is one of the most common applications of machine learning.
It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend tagging suggestions.
Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with their names, and the technology behind this is machine learning's face detection and recognition algorithms.
Speech Recognition
While using Google, we get an option to "Search by voice"; this comes under speech recognition, and it is a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana, and Alexa all use speech recognition technology to follow voice instructions.
Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day.
Everyone who uses Google Maps is helping to make the app better: it takes information from users and sends it back to its database to improve performance.
Product recommendations:
Machine learning is widely used by various e-commerce and entertainment
companies such as Amazon, Netflix, etc., for product recommendation to the user.
Whenever we search for a product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.
Google understands the user's interests using various machine learning algorithms and suggests products according to those interests.
Similarly, when we use Netflix, we find recommendations for series, movies, and other entertainment, and this is also done with the help of machine learning.
Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars.
Machine learning plays a significant role in self-driving cars. Tesla, one of the most popular car manufacturers, is working on self-driving cars and uses machine learning methods to train its car models to detect people and objects while driving.
Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox, marked with the important symbol, and spam emails in our spam box, and the technology behind this is machine learning. Below are some of the spam filters used by Gmail:
o Content Filter
o Header filter
Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis; for example, models can predict the exact position of lesions in the brain, which helps in finding brain tumors and other brain-related diseases easily.
Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all, as machine learning helps us here too by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.
Imagine how much more valuable your data would be to your business if your
document-intake solution could extract data from images as seamlessly as it does
from the text.
Thanks to deep learning, intelligent document processing (IDP) is able to
combine various AI technologies to not only automatically classify photos, but
also describe the various elements in pictures and write short sentences describing
each segment with proper English grammar.
IDP leverages a deep learning network known as a CNN (convolutional neural network) to learn patterns that naturally occur in photos. IDP is then able to adapt as new data is processed, drawing on ImageNet, one of the biggest databases of labeled images, which has been instrumental in advancing computer vision.
One of the ways this type of technology is implemented with impact is in the
document-heavy insurance industry. Claims processing starts with a small army
of humans manually entering data from forms.
In a typical use case, the claim includes a set of documents such as: claim
forms, police reports, accident scene and vehicle damage pictures, vehicle operator
driver‘s license, insurance copy, bills, invoices, and receipts.
Documents like these aren‘t standard, and the business systems that automate
most of the claims processing can‘t function without data from the forms.
To turn those documents into data, the convolutional neural networks are trained using GPU-accelerated deep learning frameworks such as Caffe2, Chainer, Microsoft Cognitive Toolkit, MXNet, PaddlePaddle, PyTorch, and TensorFlow, together with inference optimizers such as TensorRT.
Deep neural networks were first applied to modern speech recognition around 2009 and were deployed in production by Google in 2012. Deep learning, also called neural networks, is a subset of machine learning that uses a model of computing that is very much inspired by the structure of the brain.
Data Labeling
It is better to manually label the input data so that the deep learning algorithm can eventually learn to make predictions on its own. Several off-the-shelf manual data labeling tools are available for this purpose.
The objective at this point will be mainly to identify the actual object or text
in a particular image, demarcating whether the word or object is oriented improperly,
and identifying whether the script (if present) is in English or other languages.
To automate the tagging and annotation of images, NLP pipelines can be applied. ReLU (rectified linear unit) is then used as the non-linear activation function, as it performs better and decreases training time.
To increase the size of the training dataset, we can also try data augmentation: copying the existing images and transforming them, for example by making them smaller, blowing them up, or cropping elements, as in the sketch below.
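A minimal sketch of such augmentation using the Pillow library (the file name sample.jpg is only a placeholder):

from PIL import Image

# Copy one training image and transform it in a few simple ways to create
# extra training examples.
original = Image.open("sample.jpg")
w, h = original.size

augmented = [
    original.resize((w // 2, h // 2)),        # make it smaller
    original.resize((w * 2, h * 2)),          # blow it up
    original.crop((0, 0, w // 2, h // 2)),    # crop out one corner element
]

for i, img in enumerate(augmented):
    img.save(f"sample_augmented_{i}.jpg")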
Using RCNN
With the usage of Region-based Convolutional Neural Network (aka RCNN),
locations of objects in an image can be detected with ease. Within just 3 years
the RCNN has moved from Fast RCNN, Faster RCNN to Mask RCNN, making
tremendous progress towards human-level cognition of images. Below is an example
of the final output of the image recognition model where it was trained by deep
learning CNN to identify categories and products in images.
If you are new to deep learning methods and do not want to train your own model, you could have a look at Google Cloud Vision. It works pretty well for general cases.
Figure . Example outputs of the trained model: category detection and product detection.
Natural Language
Processing
In healthcare, for example, natural language processing helps analyze notes and text in electronic health records that would otherwise be inaccessible for study when seeking to improve care.
Figure . A human neuron collects inputs from other neurons using dendrites and
sums all the inputs. If the total is greater than a threshold value, it produces an
output.
The average human brain has approximately 100 billion neurons. A human
neuron uses dendrites to collect inputs from other neurons, adds all the inputs,
and if the resulting sum is greater than a threshold, it fires and produces an output.
The fired output is then sent to other connected neurons.
There are many different types of activation functions with different properties,
but one of the simplest is the step function. A step function outputs a 1 if the input
is higher than a certain threshold, otherwise it outputs a 0. For example, if a
perceptron has two inputs (x1 and x2):
x1 = 0.9
x2 = 0.7
which have weightings (w1 and w2) of:
w1 = 0.2
w2 = 0.9
and the activation function threshold is equal to 0.75, then weighting the inputs and adding them together yields:
x1w1 + x2w2 = (0.9×0.2) + (0.7×0.9) = 0.81
Because the total input (0.81) is higher than the threshold (0.75), the neuron will fire. Since we chose a simple step function, the output would be 1.
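The same perceptron, written out as a short Python sketch of the example above:

# Weight the two inputs, sum them, and fire (output 1) only if the sum
# exceeds the threshold, as in the step-function example above.
def step_perceptron(inputs, weights, threshold=0.75):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

x = [0.9, 0.7]      # x1, x2
w = [0.2, 0.9]      # w1, w2
print(step_perceptron(x, w))   # total = 0.81 > 0.75, so the output is 1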
So how does all this lead to intelligence? It starts with the ability to learn
something simple through training.
Training a perceptron
Figure . To train a perceptron, the weights are adjusted to minimize the output
error. Output error is defined as the difference between the desired output and the
actual output.
Figure . A perceptron can learn to separate dogs and cats given size and
domestication data. As more training examples are added, the perceptron updates
its linear boundary.
Multilayer perceptrons
A single neuron is capable of learning simple patterns, but when many neurons
are connected together, their abilities increase dramatically. Each of the 100 billion
neurons in the human brain has, on average, 7,000 connections to other neurons.
It has been estimated that the brain of a three-year-old child has about one
quadrillion connections between neurons. And, theoretically, there are more possible
neural connections in the brain than there are atoms in the universe.
A multilayer perceptron (MLP) is an artificial neural network with multiple
layers of neurons between input and output. MLPs are also called feedforward
neural networks. Feedforward means that data flow in one direction from the input
to the output layer.
Figure . A multilayer perceptron has multiple layers of neurons between the input
and output. Each neuron’s output is connected to every neuron in the next layer.
Typically, every neuron's output is connected to every neuron in the next layer.
Layers that come between the input and output layers are referred to as hidden
layers.
Figure . Although single perceptrons can learn to classify linear patterns, they are unable to handle nonlinear or other more complicated datasets. Multilayer perceptrons are more capable of handling nonlinear patterns, and can even classify data that is not linearly separable.
MLPs are widely used for pattern classification, recognition, prediction, and
approximation, and can learn complicated patterns that are not separable using
linear or other easily articulated curves. The capacity of an MLP network to learn
complicated patterns increases with the number of neurons and layers.
MLPs have been successful at a wide range of AI tasks, from speech recognition
to predicting thermal conductivity of aqueous electrolyte solutions and controlling
a continuous stirred-tank reactor. For example, an MLP for recognizing printed
digits (e.g., the account and routing numbers printed on a check) would comprise a grid of inputs that reads the individual pixels of a digit (say, a 9×12 bitmap), followed by one or more hidden layers, and finally 10 output neurons to indicate which digit (0–9) was recognized in the input.
Figure . An MLP can learn the dynamics of the plant by evaluating the error
between the actual plant output and the neural network output.
As another example, MLPs have been used for predictive control of chemical
reactors. The typical setup trains a neural network to learn the forward dynamics
of the plant. The prediction error between the plant output and the neural network
output is used for training the neural network. The neural network learns from
previous inputs and outputs to predict future values of the plant output. For
example, a controller for a catalytic continuous stirred-tank reactor can be trained
to maintain appropriate product concentration and flow by using past data about
inflow Q1 and Q2 at concentrations Cb1 and Cb2, respectively, liquid level h, and
outflow Q0 at concentration Cb.
In general, given a statistically relevant dataset, an artificial neural network
can learn from it.
TRAINING A MULTILAYER PERCEPTRON
Training a single perceptron is easy — all weights are adjusted repeatedly until
the output matches the expected value for all training data. For a single perceptron,
weights can be adjusted using the formulas:
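The specific formulas are not reproduced in this excerpt; as a hedged sketch, the textbook-standard perceptron learning rule nudges each weight in proportion to the input and the output error:

# A sketch of the standard perceptron learning rule (not necessarily the exact
# formulas referred to above): w_i <- w_i + lr * (target - output) * x_i.
def train_perceptron(data, lr=0.1, epochs=20, threshold=0.0):
    w = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in data:
            total = sum(x * wi for x, wi in zip(inputs, w)) + bias
            output = 1 if total > threshold else 0
            error = target - output                 # desired minus actual
            w = [wi + lr * error * x for wi, x in zip(w, inputs)]
            bias += lr * error
    return w, bias

# Toy data: learn the logical AND function.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(data))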
Training a multilayer perceptron, by contrast, requires the backpropagation algorithm. The "back" part of the name stems from the fact that calculation of the gradient proceeds backward through the network: the gradient of the final layer of weights is calculated first and the gradient of the first layer of weights is calculated last.
Figure . Sigmoid and tanh functions are nonlinear activation functions. The output
of the sigmoid function is a value between 0 and 1. The output of the sigmoid
function can be used to represent a probability, often the probability that the input
belongs to a category (e.g., cat or dog).
Consider an example MLP with two inputs feeding three hidden neurons (Xh1, Xh2, Xh3) via weights W1–W6, which are in turn connected to a single output neuron (Xo) via weights W7–W9. Assume that we are using the sigmoid activation function, that the initial weights are randomly assigned, and that the input values [1, 1] lead to an output of 0.77.
Figure . An example MLP with three layers accepts an input of [1, 1] and computes
an output of 0.77.
Let's assume that the desired output for inputs [1, 1] is 0. The backpropagation algorithm can be used to adjust weights. First, calculate the error at the last neuron's (Xo) output:
Recall that the output (0.77) was obtained by applying the sigmoid activation
function to the weighted sum of the previous layer's outputs (1.2):
Hence, the gradient or rate of change of the sigmoid function at x = 1.2 is:
(0.77) × (1 – 0.77) = 0.177
If we multiply the error in the output (–0.77) by this rate of change (0.177), we get –0.13. This can be proposed as a small change in the input sum that could move the system toward the proverbial "bottom of the hill."
Recall that the sum of the weighted inputs of the output neuron (1.2) is the
product of the output of the three neurons in the previous layer and the weights
between them and the output neuron:
To change this sum (So) by –0.13, we can adjust each incoming weight (W7,
W8, W9) proportional to the corresponding output of the previous (hidden layer)
neuron (Xh1, Xh2, Xh3). So, the weights between the hidden neurons and the output
neuron become:
W7new = W7old + (–0.13/Xh1) = 0.3 + (–0.13/0.73) = 0.11
W8new = W8old + (–0.13/Xh2) = 0.5 + (–0.13/0.79) = 0.33
W9new = W9old + (–0.13/Xh3) = 0.9 + (–0.13/0.67) = 0.7
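As a rough check, the worked example above can be reproduced in a few lines of Python, using the values given in the text (hidden outputs 0.73, 0.79, 0.67; output-layer weights 0.3, 0.5, 0.9; desired output 0). Because the text rounds intermediate values, the results match only approximately:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

xh = [0.73, 0.79, 0.67]                      # hidden-layer outputs Xh1-Xh3
w_out = [0.3, 0.5, 0.9]                      # weights W7-W9

so = sum(x * w for x, w in zip(xh, w_out))   # weighted sum, about 1.2
out = sigmoid(so)                            # about 0.77
error = 0.0 - out                            # desired output is 0
grad = out * (1 - out)                       # about 0.177
delta_so = error * grad                      # about -0.13

# Adjust each incoming weight in proportion to the corresponding hidden output.
w_new = [w + delta_so / x for w, x in zip(w_out, xh)]
print(round(so, 2), round(out, 2), [round(w, 2) for w in w_new])
# roughly: 1.22 0.77 [0.11, 0.33, 0.7]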
After adjusting the weights between the hidden layer neurons and the output
neuron, we repeat the process and similarly adjust the weights between the input
and hidden layer neurons.
This is done by first calculating the gradient at the input coming into each
neuron in the hidden layer. For example, the gradient at Xh3 is: 0.67×(1–0.67) =
0.22.
The proposed change in the sum of weighted inputs of Xh3 (i.e., S3) can be
calculated by multiplying the gradient (0.22) by the proposed change in the sum
of weighted inputs of the following neuron (–0.13), and dividing by the weight
from this neuron to the following neuron (W9).
Note that we are propagating errors backward, so it was the error in the following neuron (Xo) that we proportionally propagated backward to this neuron's inputs.
The proposed change in the sum of weighted inputs of Xh3 (i.e., S3) is:
Change in S3 = Gradient at Xh3 × Proposed change in So/W9
Change in S3 = 0.22 × (–0.13)/0.9 = –0.03
Note that we use the original value of W9 (0.9) rather than the recently
calculated new value (0.7) to propagate the error backward. This is because
although we are working one step at a time, we are trying to search the entire
space of possible weight combinations and change them in the right direction
(toward the bottom of the hill). In each iteration, we propagate the output error
through original weights, leading to new weights for the iteration. This global
backward propagation of the output neuron‘s error is the key concept that lets all
weights change toward ideal values.
Once you know the proposed change in the weighted sum of inputs of each
neuron (S1, S2, S3), you can change the weights leading to the neuron (W1 through W6) in proportion to the output from the previous neuron. Thus, W6 changes from
0.3 to 0.27.
Upon repeating this process for all weights, the new output in this example
becomes 0.68, which is a little closer to the ideal value (0) than what we started
with (0.77). By performing just one such iteration of forward and back propagation,
the network is already learning!
A small neural network like the one in this example will typically learn to
produce correct outputs after a few hundred such iterations of weight adjustments.
On the other hand, training AlphaGo's neural network, which has tens of thousands of neurons arranged in more than a dozen layers, takes more serious computing power, which is becoming increasingly available.
Looking forward
Even with all the amazing progress in AI, such as self-driving cars, the
technology is still very narrow in its accomplishments and far from autonomous.
Today, 99% of machine learning requires human work and large amounts of data
that need to be normalized and labeled (i.e., this is a dog; this is a cat). And, people
need to supply and fine-tune the appropriate algorithms. All of this relies on
manual labor.
Other challenges that plague neural networks include:
• Bias. Machine learning is looking for patterns in data. If you start
with bad data, you will end up with bad models.
• Over-fitting. In general, a model is typically trained by maximizing
its performance on a particular training dataset. The model thus
memorizes the training examples, but may not learn to generalize
to new situations and datasets.
• Hyper-parameter optimization. The value of a hyper-parameter is
defined prior to the commencement of the learning process (e.g.,
number of layers, number of neurons per layer, type of activation
function, initial value of weights, value of the learning rate, etc.).
Changing the value of such parameters by a small amount can
invoke large changes in the performance of the network.
• Black-box problems. Neural networks are essentially black boxes, and
researchers have a hard time understanding how they deduce
particular conclusions. Their operation is largely invisible to humans,
rendering them unsuitable for domains in which verifying the process
is important.
Thus far, we have looked at neural networks that learn from data. This
approach is called supervised learning. During training under supervised learning, an input is presented to the network and it produces an output that is compared with the desired (target) output. An error is generated if there is a difference between the actual output and the target output, and the weights are adjusted based on this error until the actual output matches the desired output.
Supervised learning relies on manual human labor for collecting, preparing, and
labeling a large amount of training data.
Unsupervised learning does not depend on target outputs for learning. Instead,
inputs of a similar type are combined to form clusters. When a new input pattern
is applied, the neural network gives an output indicating the class to which the
input pattern belongs.
Reinforcement learning involves learning by trial and error, solely from rewards
or punishments. Such neural networks construct and learn their own knowledge
directly from raw inputs, such as vision, without any hand-engineered features or
domain heuristics. AlphaGo Zero, the successor to AlphaGo, is based on
reinforcement learning. Unlike AlphaGo, which was initially trained on thousands
of human games to learn how to play Go, AlphaGo Zero learned to play simply
by playing games against itself. Although it began with completely random play,
it eventually surpassed human level of play and defeated the previous version of
AlphaGo by 100 games to 0.
Neural networks
A major drawback of statistical methods is that they require elaborate feature
engineering. Since 2015, the field has thus largely abandoned statistical methods
and shifted to neural networks for machine learning. Popular techniques include
the use of word embeddings to capture semantic properties of words, and an
increase in end-to-end learning of a higher-level task (e.g., question answering)
instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech
tagging and dependency parsing).
In some areas, this shift has entailed substantial changes in how NLP systems
are designed, such that deep neural network-based approaches may be viewed as
a new paradigm distinct from statistical natural language processing. For instance,
the term neural machine translation (NMT) emphasizes the fact that deep learning-
based approaches to machine translation directly learn sequence-to-sequence
transformations, obviating the need for intermediate steps such as word alignment and language modeling that were used in statistical machine translation (SMT).
COMMON NLP TASKS
The following is a list of some of the most commonly researched tasks in
natural language processing. Some of these tasks have direct real-world applications,
while others more commonly serve as subtasks that are used to aid in solving larger
tasks.
Though natural language processing tasks are closely intertwined, they can
be subdivided into categories for convenience. A coarse division is given below.
TEXT AND SPEECH PROCESSING
Optical character recognition (OCR)
Given an image representing printed text, determine the corresponding text.
Speech recognition
Given a sound clip of a person or people speaking, determine the textual
representation of the speech. This is the opposite of text to speech and is one of
the extremely difficult problems colloquially termed "AI-complete". In natural
speech there are hardly any pauses between successive words, and thus speech
segmentation is a necessary subtask of speech recognition. In most spoken languages,
the sounds representing successive letters blend into each other in a process termed
coarticulation, so the conversion of the analog signal to discrete characters can
be a very difficult process. Also, given that words in the same language are spoken
by people with different accents, the speech recognition software must be able
to recognize the wide variety of input as being identical to each other in terms
of its textual equivalent.
Speech segmentation
Given a sound clip of a person or people speaking, separate it into words. A
subtask of speech recognition and typically grouped with it.
Text-to-speech
Given a text, transform those units and produce a spoken representation. Text-
to-speech can be used to aid the visually impaired.
Word segmentation (Tokenization)
Separate a chunk of continuous text into separate words. For a language like
English, this is fairly trivial, since words are usually separated by spaces. However,
some written languages like Chinese, Japanese and Thai do not mark word
boundaries in such a fashion, and in those languages text segmentation is a
significant task requiring knowledge of the vocabulary and morphology of words
in the language. Sometimes this process is also used in cases like bag of words
(BOW) creation in data mining.
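For a language like English, a minimal sketch of tokenization can be as simple as splitting on whitespace and stripping punctuation:

import string

# Separate a chunk of continuous text into words, stripping punctuation.
def tokenize(text):
    return [word.strip(string.punctuation) for word in text.split()]

print(tokenize("Deep learning, using Python, is fun."))
# ['Deep', 'learning', 'using', 'Python', 'is', 'fun']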
MORPHOLOGICAL ANALYSIS
Lemmatization
The task of removing inflectional endings only and returning the base dictionary form of a word, which is known as a lemma. Lemmatization is another technique for reducing words to their normalized form, but in this case the transformation actually uses a dictionary to map words to their base form.
Morphological segmentation
Separate words into individual morphemes and identify the class of the
morphemes. The difficulty of this task depends greatly on the complexity of the
morphology (i.e., the structure of words) of the language being considered. English
has fairly simple morphology, especially inflectional morphology, and thus it is often
possible to ignore this task entirely and simply model all possible forms of a word
(e.g., "open, opens, opened, opening") as separate words. In languages such as
Turkish or Meitei, a highly agglutinated Indian language, however, such an approach
is not possible, as each dictionary entry has thousands of possible word forms.
Part-of-speech tagging
Given a sentence, determine the part of speech (POS) for each word. Many
words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or a verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech.
Stemming
The process of reducing inflected (or sometimes derived) words to a base form (e.g., "close" is the root of "closed", "closing", "close", "closer", etc.). Stemming yields results similar to lemmatization, but does so on the basis of rules rather than a dictionary.
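A short sketch contrasting the two, using the NLTK library (this assumes NLTK and its WordNet data have already been installed and downloaded):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["closed", "closing", "closer"]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))

# Stemming applies rules (e.g., "closing" -> "close"), while lemmatization
# looks the word up in a dictionary to return its base form.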
SYNTACTIC ANALYSIS
Grammar induction
Generate a formal grammar that describes a language's syntax.
Sentence breaking (also known as "sentence boundary disambiguation")
Given a chunk of text, find the sentence boundaries. Sentence boundaries are
often marked by periods or other punctuation marks, but these same characters
can serve other purposes (e.g., marking abbreviations).
Parsing
Determine the parse tree (grammatical analysis) of a given sentence. The
grammar for natural languages is ambiguous and typical sentences have multiple
possible analyses: perhaps surprisingly, for a typical sentence there may be thousands
of potential parses (most of which will seem completely nonsensical to a human).
There are two primary types of parsing: dependency parsing and constituency
parsing. Dependency parsing focuses on the relationships between words in a
sentence (marking things like primary objects and predicates), whereas constituency
parsing focuses on building out the parse tree using a probabilistic context-free
grammar (PCFG).
LEXICAL SEMANTICS (OF INDIVIDUAL WORDS IN
CONTEXT)
Lexical semantics
What is the computational meaning of individual words in context?
Distributional semantics
How can we learn semantic representations from data?
Semantic parsing
Given a piece of text (typically a sentence), produce a formal representation
of its semantics, either as a graph (e.g., in AMR parsing) or in accordance with
a logical formalism (e.g., in DRT parsing). This challenge typically includes
aspects of several more elementary NLP tasks from semantics (e.g., semantic role
labelling, word-sense disambiguation) and can be extended to include full-fledged
discourse analysis.
Semantic role labelling
Given a single sentence, identify and disambiguate semantic predicates (e.g.,
verbal frames), then identify and classify the frame elements (semantic roles).
DISCOURSE (SEMANTICS BEYOND INDIVIDUAL
SENTENCES)
Coreference resolution
Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example of this task, and is specifically concerned with matching up pronouns with the nouns or names to which they refer. The more general task of coreference resolution also includes identifying so-called "bridging relationships" involving referring expressions. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to).
Discourse analysis
This rubric includes several related tasks. One task is discourse parsing, i.e.,
identifying the discourse structure of a connected text, i.e. the nature of the
discourse relationships between sentences (e.g. elaboration, explanation, contrast).
Another possible task is recognizing and classifying the speech acts in a chunk
of text (e.g. yes-no question, content question, statement, assertion, etc.).
Implicit semantic role labelling
Given a single sentence, identify and disambiguate semantic predicates (e.g.,
verbal frames) and their explicit semantic roles in the current sentence. Then,
identify semantic roles that are not explicitly realized in the current sentence,
classify them into arguments that are explicitly realized elsewhere in the text and
those that are not specified, and resolve the former against the local text. A closely
related task is zero anaphora resolution, i.e., the extension of coreference resolution
to pro-drop languages.
Recognizing textual entailment
Given two text fragments, determine if one being true entails the other, entails
the other‘s negation, or allows the other to be either true or false.
Topic segmentation and recognition
Given a chunk of text, separate it into segments each of which is devoted to
a topic, and identify the topic of the segment.
Argument mining
The goal of argument mining is the automatic extraction and identification
of argumentative structures from natural language text with the aid of computer
programs. Such argumentative structures include the premise, conclusions, the
argument scheme and the relationship between the main and subsidiary argument,
or the main and counter-argument within discourse.
HIGHER-LEVEL NLP APPLICATIONS
Automatic summarization (text summarization)
Produce a readable summary of a chunk of text. Often used to provide
summaries of the text of a known type, such as research papers, articles in the
financial section of a newspaper.
Book generation
Not an NLP task proper but an extension of natural language generation and
other NLP tasks is the creation of full-fledged books. The first machine-generated book was created by a rule-based system in 1984 (Racter, The policeman's beard is half-constructed). The first published work by a neural network, 1 the Road, was published in 2018; marketed as a novel, it contains sixty million words. Both of these systems are basically elaborate but nonsensical (semantics-free) language models.
The first machine-generated science book was published in 2019 (Beta Writer,
Lithium-Ion Batteries, Springer, Cham). Unlike Racter and 1 the Road, this is
grounded on factual knowledge and based on text summarization.
Dialogue management
Computer systems intended to converse with a human.
Document AI
A Document AI platform sits on top of the NLP technology, enabling users with no prior experience of artificial intelligence, machine learning or NLP to quickly train a computer to extract the data they need from different document types.
Question answering
Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?").
Text-to-image generation
Given a description of an image, generate an image that matches the description.
Text-to-scene generation
Given a description of a scene, generate a 3D model of the scene.
Artificial Deep Neural Networks
between them. The weights and inputs are multiplied and return an output between
0 and 1. If the network did not accurately recognize a particular pattern, an
algorithm would adjust the weights. That way the algorithm can make certain
parameters more influential, until it determines the correct mathematical
manipulation to fully process the data.
Recurrent neural networks (RNNs), in which data can flow in any direction,
are used for applications such as language modeling. Long short-term memory is
particularly effective for this use. Convolutional deep neural networks (CNNs) are
used in computer vision. CNNs also have been applied to acoustic modeling for
automatic speech recognition (ASR).
Challenges
As with ANNs, many issues can arise with naively trained DNNs. Two
common issues are overfitting and computation time.
DNNs are prone to overfitting because of the added layers of abstraction,
which allow them to model rare dependencies in the training data. Regularization
methods such as Ivakhnenko's unit pruning, weight decay, or sparsity can be applied during training to combat overfitting. Alternatively, dropout regularization
randomly omits units from the hidden layers during training. This helps to exclude
rare dependencies. Finally, data can be augmented via methods such as cropping
and rotating such that smaller training sets can be increased in size to reduce the
chances of overfitting.
DNNs must consider many training parameters, such as the size (number of
layers and number of units per layer), the learning rate, and initial weights.
Sweeping through the parameter space for optimal parameters may not be feasible
due to the cost in time and computational resources.
Various tricks, such as batching (computing the gradient on several training
examples at once rather than individual examples) speed up computation. Large
processing capabilities of many-core architectures (such as GPUs or the Intel Xeon
Phi) have produced significant speedups in training, because of the suitability of
such processing architectures for the matrix and vector computations.
Alternatively, engineers may look for other types of neural networks with more
straightforward and convergent training algorithms. CMAC (cerebellar model articulation controller) is one such kind of neural network. It does not require learning rates or randomized initial weights. The training process can
be guaranteed to converge in one step with a new batch of data, and the computational
complexity of the training algorithm is linear with respect to the number of neurons
involved.
HARDWARE
Since the 2010s, advances in both machine learning algorithms and computer
hardware have led to more efficient methods for training deep neural networks
that contain many layers of non-linear hidden units and a very large output layer.
By 2019, graphics processing units (GPUs), often with AI-specific enhancements,
had displaced CPUs as the dominant method of training large-scale commercial
cloud AI. OpenAI estimated the hardware computation used in the largest deep
learning projects from AlexNet (2012) to AlphaZero (2017), and found a 300,000-
fold increase in the amount of computation required, with a doubling-time trendline
of 3.4 months. Special electronic circuits called deep learning processors were
designed to speed up deep learning algorithms. Deep learning processors include
neural processing units (NPUs) in Huawei cellphones and cloud computing servers
such as tensor processing units (TPU) in the Google Cloud Platform. Cerebras
Systems has also built a dedicated system to handle large deep learning models,
the CS-2, based on the largest processor in the industry, the second-generation
Wafer Scale Engine (WSE-2).
Atomically thin semiconductors are considered promising for energy-efficient
deep learning hardware where the same basic device structure is used for both
logic operations and data storage. In 2020, Marega et al. published experiments
with a large-area active channel material for developing logic-in-memory devices
and circuits based on floating-gate field-effect transistors (FGFETs).
In 2021, J. Feldmann et al. proposed an integrated photonic hardware accelerator
for parallel convolutional processing. The authors identify two key advantages of
integrated photonics over its electronic counterparts: (1) massively parallel data
transfer through wavelength division multiplexing in conjunction with frequency
combs, and (2) extremely high data modulation speeds. Their system can execute
trillions of multiply-accumulate operations per second, indicating the potential of
integrated photonics in data-heavy AI applications.
adjusts its weighted associations according to a learning rule and using this error
value. Successive adjustments will cause the neural network to produce output
which is increasingly similar to the target output. After a sufficient number of these
adjustments the training can be terminated based upon certain criteria. This is
known as supervised learning.
Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using
the results to identify cats in other images. They do this without any prior
knowledge of cats, for example, that they have fur, tails, whiskers, and cat-like
faces. Instead, they automatically generate identifying characteristics from the
examples that they process.
History
Warren McCulloch and Walter Pitts (1943) opened the subject by creating a
computational model for neural networks. In the late 1940s, D. O. Hebb created
a learning hypothesis based on the mechanism of neural plasticity that became
known as Hebbian learning. Farley and Wesley A. Clark (1954) first used
computational machines, then called "calculators", to simulate a Hebbian network.
In 1958, psychologist Frank Rosenblatt invented the perceptron, the first artificial
neural network, funded by the United States Office of Naval Research. The first
functional networks with many layers were published by Ivakhnenko and Lapa
in 1965, as the Group Method of Data Handling. The basics of continuous
backpropagation were derived in the context of control theory by Kelley in 1960
and by Bryson in 1961, using principles of dynamic programming. Thereafter
research stagnated following Minsky and Papert (1969), who discovered that basic
perceptrons were incapable of processing the exclusive-or circuit and that computers
lacked sufficient power to process useful neural networks.
In 1970, Seppo Linnainmaa published the general method for automatic
differentiation (AD) of discrete connected networks of nested differentiable
functions. In 1973, Dreyfus used backpropagation to adapt parameters of controllers
in proportion to error gradients. Werbos's (1975) backpropagation algorithm enabled
practical training of multi-layer networks. In 1982, he applied Linnainmaa‘s AD
method to neural networks in the way that became widely used.
The development of metal–oxide–semiconductor (MOS) very-large-scale
integration (VLSI), in the form of complementary MOS (CMOS) technology,
enabled increasing MOS transistor counts in digital electronics. This provided
more processing power for the development of practical artificial neural networks
in the 1980s.
In 1986 Rumelhart, Hinton and Williams showed that backpropagation learned
interesting internal representations of words as feature vectors when trained to
predict the next word in a sequence.
From 1988 onward, the use of neural networks transformed the field of protein
structure prediction, in particular when the first cascading networks were trained
on profiles (matrices) produced by multiple sequence alignments.
In 1992, max-pooling was introduced to help with least-shift invariance and
tolerance to deformation to aid 3D object recognition. Schmidhuber adopted a
multi-level hierarchy of networks (1992) pre-trained one level at a time by
unsupervised learning and fine-tuned by backpropagation.
Neural networks' early successes included predicting the stock market and in
1995 a (mostly) self-driving car.
Geoffrey Hinton et al. (2006) proposed learning a high-level representation
using successive layers of binary or real-valued latent variables with a restricted
Boltzmann machine to model each layer. In 2012, Ng and Dean created a network
that learned to recognize higher-level concepts, such as cats, only from watching
unlabeled images. Unsupervised pre-training and increased computing power from
GPUs and distributed computing allowed the use of larger networks, particularly
in image and visual recognition problems, which became known as "deep learning".
Ciresan and colleagues (2010) showed that despite the vanishing gradient
problem, GPUs make backpropagation feasible for many-layered feedforward
neural networks. Between 2009 and 2012, ANNs began winning prizes in image
recognition contests, approaching human level performance on various tasks,
initially in pattern recognition and handwriting recognition. For example, the bi-
directional and multi-dimensional long short-term memory (LSTM) of Graves et
al. won three competitions in connected handwriting recognition in 2009 without
any prior knowledge about the three languages to be learned.
Ciresan and colleagues built the first pattern recognizers to achieve human-
competitive/superhuman performance on benchmarks such as traffic sign recognition
(IJCNN 2012).
MODELS
ANNs began as an attempt to exploit the architecture of the human brain to
perform tasks that conventional algorithms had little success with. They soon
reoriented towards improving empirical results, mostly abandoning attempts to
remain true to their biological precursors. Neurons are connected to each other
in various patterns, to allow the output of some neurons to become the input of
others. The network forms a directed, weighted graph.
Neuron and myelinated axon, with signal flow from inputs at dendrites to outputs
at axon terminals
In between them are zero or more hidden layers. Single layer and unlayered
networks are also used. Between two layers, multiple connection patterns are
possible. They can be 'fully connected', with every neuron in one layer connecting
to every neuron in the next layer. They can be pooling, where a group of neurons
in one layer connect to a single neuron in the next layer, thereby reducing the
number of neurons in that layer. Neurons with only such connections form a
directed acyclic graph and are known as feedforward networks. Alternatively,
networks that allow connections between neurons in the same or previous layers
are known as recurrent networks.
Hyperparameter
A hyperparameter is a constant parameter whose value is set before the
learning process begins. The values of parameters are derived via learning. Examples
of hyperparameters include learning rate, the number of hidden layers and batch
size. The values of some hyperparameters can be dependent on those of other
hyperparameters. For example, the size of some layers can depend on the overall
number of layers.
Learning
Learning is the adaptation of the network to better handle a task by considering
sample observations. Learning involves adjusting the weights (and optional
thresholds) of the network to improve the accuracy of the result. This is done by
minimizing the observed errors. Learning is complete when examining additional
observations does not usefully reduce the error rate. Even after learning, the error
rate typically does not reach 0. If after learning, the error rate is too high, the
network typically must be redesigned. Practically this is done by defining a cost
function that is evaluated periodically during learning. As long as its output
continues to decline, learning continues. The cost is frequently defined as a
statistic whose value can only be approximated. The outputs are actually numbers,
so when the error is low, the difference between the output (almost certainly a
cat) and the correct answer (cat) is small. Learning attempts to reduce the total
of the differences across the observations. Most learning models can be viewed
as a straightforward application of optimization theory and statistical estimation.
Learning rate
The learning rate defines the size of the corrective steps that the model takes
to adjust for errors in each observation. A high learning rate shortens the training
time, but with lower ultimate accuracy, while a lower learning rate takes longer,
but with the potential for greater accuracy. Optimizations such as Quickprop are
primarily aimed at speeding up error minimization, while other refinements mainly try to increase reliability.
Other
In a Bayesian framework, a distribution over the set of allowed models is
chosen to minimize the cost. Evolutionary methods, gene expression programming,
simulated annealing, expectation-maximization, non-parametric methods and
particle swarm optimization are other learning algorithms. Convergent recursion
is a learning algorithm for cerebellar model articulation controller (CMAC) neural
networks.
Modes
Two modes of learning are available: stochastic and batch. In stochastic
learning, each input creates a weight adjustment. In batch learning weights are
adjusted based on a batch of inputs, accumulating errors over the batch. Stochastic
learning introduces "noise" into the process, using the local gradient calculated
from one data point; this reduces the chance of the network getting stuck in local
minima. However, batch learning typically yields a faster, more stable descent to
a local minimum, since each update is performed in the direction of the batch‘s
average error. A common compromise is to use "mini-batches", small batches with
samples in each batch selected stochastically from the entire data set.
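The three modes can be sketched with plain NumPy on a toy regression problem; the data, learning rate and helper names below are purely illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))                      # toy inputs
    y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # targets roughly equal to 3x

    def gradient(w, xb, yb):
        # gradient of the mean squared error for a single weight w
        return 2 * np.mean((xb[:, 0] * w - yb) * xb[:, 0])

    def train(batch_size, lr=0.1, epochs=20):
        w = 0.0
        for _ in range(epochs):
            order = rng.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = order[start:start + batch_size]
                w -= lr * gradient(w, X[batch], y[batch])
        return w

    print(train(batch_size=1))       # stochastic learning: one example per update
    print(train(batch_size=len(X)))  # batch learning: the whole data set per update
    print(train(batch_size=16))      # the usual compromise: mini-batches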
TYPES
ANNs have evolved into a broad family of techniques that have advanced the
state of the art across multiple domains. The simplest types have one or more static
components, including number of units, number of layers, unit weights and topology.
Dynamic types allow one or more of these to evolve via learning. The latter are
much more complicated, but can shorten learning periods and produce better
results. Some types allow/require learning to be "supervised" by the operator,
while others operate independently. Some types operate purely in hardware, while
others are purely software and run on general purpose computers.
Some of the main breakthroughs include: convolutional neural networks that
have proven particularly successful in processing visual and other two-dimensional
data; long short-term memory avoid the vanishing gradient problem and can handle
signals that have a mix of low and high frequency components aiding large-
vocabulary speech recognition, text-to-speech synthesis, and photo-real talking
heads; competitive networks such as generative adversarial networks in which
multiple networks (of varying structure) compete with each other, on tasks such
as winning a game or on deceiving the opponent about the authenticity of an input.
NETWORK DESIGN
Neural architecture search (NAS) uses machine learning to automate ANN
design. Various approaches to NAS have designed networks that compare well
with hand-designed systems. The basic search algorithm is to propose a candidate
model, evaluate it against a dataset and use the results as feedback to teach the
NAS network. Available systems include AutoML and AutoKeras.
Design issues include deciding the number, type and connectedness of network
layers, as well as the size of each and the connection type (full, pooling, ...).
Hyperparameters must also be defined as part of the design (they are not
learned), governing matters such as how many neurons are in each layer, learning
rate, step, stride, depth, receptive field and padding (for CNNs), etc.
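A hedged sketch of such a search with the AutoKeras package mentioned above (assuming it is installed; argument names and defaults may differ between versions, and x_train/y_train are placeholders for your own data):

    import autokeras as ak

    search = ak.ImageClassifier(max_trials=5)    # propose and evaluate up to 5 candidate models
    # search.fit(x_train, y_train, epochs=10)    # each trial is evaluated against the dataset
    # best_model = search.export_model()         # the best Keras model found by the search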
Use
Using Artificial neural networks requires an understanding of their
characteristics.
• Choice of model: This depends on the data representation and the
application. Overly complex models are slow learning.
• Learning algorithm: Numerous trade-offs exist between learning
algorithms. Almost any algorithm will work well with the correct
hyperparameters for training on a particular data set. However,
selecting and tuning an algorithm for training on unseen data
requires significant experimentation.
• Robustness: If the model, cost function and learning algorithm are
selected appropriately, the resulting ANN can become robust.
ANN capabilities fall within the following broad categories:
• Function approximation, or regression analysis, including time series
prediction, fitness approximation and modeling.
• Classification, including pattern and sequence recognition, novelty
detection and sequential decision making.
• Data processing, including filtering, clustering, blind source
separation and compression.
• Robotics, including directing manipulators and prostheses.
APPLICATIONS
Because of their ability to reproduce and model nonlinear processes, artificial
neural networks have found applications in many disciplines. Application areas
include system identification and control (vehicle control, trajectory prediction,
process control, natural resource management), quantum chemistry, general game
playing, pattern recognition (radar systems, face identification, signal classification,
object recognition and more), sequence recognition, medical diagnosis, finance and data mining.
Capacity
A model's "capacity" property corresponds to its ability to model any given
function. It is related to the amount of information that can be stored in the network
and to the notion of complexity. Two notions of capacity are known by the
community. The information capacity and the VC Dimension. The information
capacity of a perceptron is discussed intensively in Sir David MacKay's book,
which summarizes work by Thomas Cover. The capacity of a network of standard
neurons (not convolutional) can be derived by four rules that follow from
understanding a neuron as an electrical element. The information capacity captures
the functions modelable by the network given any data as input. The second notion
is the VC dimension, which uses the principles of measure theory and finds the
maximum capacity under the best possible circumstances, that is, given input data
in a specific form. It has been noted that the VC Dimension for arbitrary inputs
is half the information capacity of a perceptron. The VC Dimension for arbitrary
points is sometimes referred to as Memory Capacity.
Convergence
Models may not consistently converge on a single solution, firstly because
local minima may exist, depending on the cost function and the model. Secondly,
the optimization method used might not guarantee to converge when it begins far
from any local minimum. Thirdly, for sufficiently large data or parameters, some
methods become impractical.
Another issue worth mentioning is that training may cross a saddle point,
which may lead the convergence in the wrong direction.
The convergence behavior of certain types of ANN architectures is better
understood than that of others. When the width of the network approaches infinity, the
ANN is well described by its first-order Taylor expansion throughout training, and
so inherits the convergence behavior of affine models. Another example is that when
parameters are small, ANNs are often observed to fit target functions from low
to high frequencies. This behavior is referred to as the spectral bias, or frequency
principle, of neural networks. This phenomenon is the opposite to the behavior
of some well studied iterative numerical schemes such as Jacobi method. Deeper
neural networks have been observed to be more biased towards low frequency
functions.
Generalization and statistics
Applications whose goal is to create a system that generalizes well to unseen
examples face the possibility of over-training. This arises in convoluted or over-
specified systems when the network capacity significantly exceeds the number of free
parameters that are actually needed. Two approaches address over-training. The first is to use cross-
validation and similar techniques to check for the presence of over-training and
to select hyperparameters that minimize the generalization error. The second is to use some form of regularization.
Supervised neural networks that use a mean squared error (MSE) cost function
can use formal statistical methods to determine the confidence of the trained
model. The MSE on a validation set can be used as an estimate for variance. This
value can then be used to calculate the confidence interval of network output,
assuming a normal distribution. A confidence analysis made this way is statistically
valid as long as the output probability distribution stays the same and the network
is not modified.
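A minimal NumPy sketch of that calculation, assuming a trained regression network and a held-out validation set (the numbers below are illustrative):

    import numpy as np

    y_val = np.array([2.0, 3.1, 4.2, 5.0])    # true values on the validation set
    y_pred = np.array([2.1, 2.9, 4.0, 5.3])   # network outputs on the validation set

    mse = np.mean((y_val - y_pred) ** 2)       # used as an estimate of the output variance
    sigma = np.sqrt(mse)

    new_prediction = 3.7
    # 95% confidence interval, assuming normally distributed errors
    low, high = new_prediction - 1.96 * sigma, new_prediction + 1.96 * sigma
    print(f"prediction {new_prediction:.2f}, 95% CI [{low:.2f}, {high:.2f}]")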
By assigning a softmax activation function, a generalization of the logistic
function, on the output layer of the neural network (or a softmax component in
a component-based network) for categorical target variables, the outputs can be
interpreted as posterior probabilities, which gives a certainty measure on the classifications.
Now, to compute the derivative with respect to the parameters, the function
must be differentiable; that is why we want a continuous function.
Example of continuous functions:
As is clear, this is non-linear data. We cannot draw a line in any way such
that it can separate the red points from the blue points.
We want the model to be such that it gives an output of 0 for all the points
in the green region (all the blue points) and an output of 1 for all the points in the
red region (all the red points), as in the image below:
We want the function output to be 1 for the red points and the function output
to be 0 for the blue points, which means we want our function to be of the following
form:
We can notice that this is not a very smooth function; it has sharp edges (shown
in blue in the image above), which means it will not be differentiable at certain points.
What we want is a very smooth function, a function of the following form:
This is one of the reasons why we need complex functions: a lot of real-world
data requires functions of the above form. Such a function is not like a sigmoid
neuron (S-shaped), nor like an MP neuron or perceptron (linear), so we need a
different family of functions, more complex than what we have seen so far.
Now, no matter what values we set the parameters to, we are not going to get
a surface that can exactly fit the data.
Artificial neural networks are a fascinating area of study, although they can
be intimidating when just getting started. There is a lot of specialized terminology
used when describing the data structures and algorithms used in the field.
NEURON WEIGHTS
You may be familiar with linear regression, where the weights on the inputs
are very much like the coefficients used in a regression equation.
Like linear regression, each neuron also has a bias which can be thought of
as an input that always has the value 1.0, and it, too, must be weighted.
For example, a neuron may have two inputs, which require three weights—
one for each input and one for the bias.
Weights are often initialized to small random values, such as values from 0
to 0.3, although more complex initialization schemes can be used.
Like linear regression, larger weights indicate increased complexity and fragility.
Keeping the weights in the network small is desirable, and regularization techniques
can be used to achieve this.
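A minimal sketch of such a neuron's weighted sum in plain Python (the input values are illustrative):

    import random

    inputs = [0.7, 0.2]                                   # two inputs, therefore three weights
    weights = [random.uniform(0.0, 0.3) for _ in inputs]  # small random initial weights
    bias_weight = random.uniform(0.0, 0.3)                # weight on the bias input, which is always 1.0

    weighted_sum = bias_weight * 1.0 + sum(w * x for w, x in zip(weights, inputs))
    print(weighted_sum)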
Activation
The weighted inputs are summed and passed through an activation function,
sometimes called a transfer function.
An activation function is a simple mapping of summed weighted input to the
output of the neuron. It is called an activation function because it governs the
threshold at which the neuron is activated and the strength of the output signal.
Historically, simple step activation functions were used when the summed
input was above a threshold of 0.5, for example. Then the neuron would output
a value of 1.0; otherwise, it would output a 0.0.
Traditionally, non-linear activation functions are used. This allows the network
to combine the inputs in more complex ways and, in turn, provide a richer
capability in the functions they can model. Non-linear functions like the logistic,
also called the sigmoid function, were used to output a value between 0 and 1
with an s-shaped distribution. The hyperbolic tangent function, also called tanh,
outputs the same distribution over the range -1 to +1.
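These transfer functions are easy to sketch in plain Python:

    import math

    def step(activation, threshold=0.5):
        # historical step function: output 1.0 above the threshold, otherwise 0.0
        return 1.0 if activation >= threshold else 0.0

    def sigmoid(activation):
        # logistic function: squashes any input into (0, 1) with an S-shape
        return 1.0 / (1.0 + math.exp(-activation))

    def tanh(activation):
        # hyperbolic tangent: the same S-shape over the range (-1, 1)
        return math.tanh(activation)

    for a in (-2.0, 0.0, 2.0):
        print(a, step(a), round(sigmoid(a), 3), round(tanh(a), 3))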
Networks of Neurons
Neurons are arranged into networks of neurons.
A row of neurons is called a layer, and one network can have multiple layers.
The architecture of the neurons in the network is often called the network topology.
Input or Visible Layer
The bottom layer that takes input from your dataset is called the visible or input layer.
Its nodes are not neurons as described above but simply pass the input value through
to the next layer.
Hidden Layers
Layers after the input layer are called hidden layers because they are not
directly exposed to the input. The simplest network structure is to have a single
neuron in the hidden layer that directly outputs the value.
Given increases in computing power and efficient libraries, very deep neural
networks can be constructed. Deep learning can refer to having many hidden layers
in your neural network. They are deep because they would have been unimaginably
slow to train historically but may take seconds or minutes to train using modern
techniques and hardware.
Output Layer
The final hidden layer is called the output layer, and it is responsible for
outputting a value or vector of values that correspond to the format required for
the problem.
The choice of activation function in the output layer is strongly constrained
by the type of problem that you are modeling; a short code sketch follows the list below. For example:
• A regression problem may have a single output neuron, and the neuron
may have no activation function.
• A binary classification problem may have a single output neuron and use
a sigmoid activation function to output a value between 0 and 1 to represent
the probability of predicting a value for class 1. This can be turned
into a crisp class value by using a threshold of 0.5 and snapping values less
than the threshold to 0 and values at or above it to 1.
• A multi-class classification problem may have multiple neurons in the
output layer, one for each class (e.g., three neurons for the three classes
in the famous iris flowers classification problem). In this case, a softmax
activation function may be used to output a probability of the network
predicting each of the class values. Selecting the output with the highest
probability can be used to produce a crisp class classification value.
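A short Keras sketch of the three output-layer choices described above (assuming TensorFlow/Keras is installed; the layer sizes are illustrative only):

    from tensorflow import keras

    # regression: a single output neuron with no activation function
    regression_output = keras.layers.Dense(1)

    # binary classification: a single neuron with a sigmoid, later thresholded at 0.5
    binary_output = keras.layers.Dense(1, activation="sigmoid")

    # multi-class classification (e.g., the three iris classes): one neuron per class with softmax
    multiclass_output = keras.layers.Dense(3, activation="softmax")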
TRAINING NETWORKS
Once configured, the neural network needs to be trained on your dataset.
Data Preparation
You must first prepare your data for training on a neural network.
Data must be numerical, for example, real values. If you have categorical data,
such as a sex attribute with the values "male" and "female," you can convert it
to a real-valued representation called one-hot encoding. This is where one new
column is added for each class value (two columns in the case of sex of male
and female), and a 0 or 1 is added for each row depending on the class value for
that row.
This same one-hot encoding can be used on the output variable in classification
problems with more than one class. This would create a binary vector from a single
column that would be easy to directly compare to the output of the neuron in the
network‘s output layer. That, as described above, would output one value for each
class.
Neural networks require the input to be scaled in a consistent way. You can
rescale it to the range between 0 and 1, called normalization. Another popular
technique is to standardize it so that the distribution of each column has a mean
of zero and a standard deviation of 1. Scaling also applies to image pixel data.
Data such as words can be converted to integers, such as the popularity rank of
the word in the dataset and other encoding techniques.
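A minimal sketch of these preparation steps, assuming pandas and scikit-learn are installed (the two-column data set is illustrative):

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    df = pd.DataFrame({"sex": ["male", "female", "female"],
                       "height_cm": [180.0, 165.0, 172.0]})

    # one-hot encoding: one new 0/1 column per class value of the categorical attribute
    df = pd.get_dummies(df, columns=["sex"])

    # normalization: rescale a numeric column to the range 0 to 1
    df["height_norm"] = MinMaxScaler().fit_transform(df[["height_cm"]]).ravel()

    # standardization: zero mean and a standard deviation of 1
    df["height_std"] = StandardScaler().fit_transform(df[["height_cm"]]).ravel()
    print(df)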
Stochastic Gradient Descent
The classical and still preferred training algorithm for neural networks is
called stochastic gradient descent.
This is where one row of data is exposed to the network at a time as input.
The network processes the input upward, activating neurons as it goes to finally
produce an output value. This is called a forward pass on the network. It is the
type of pass that is also used after the network is trained in order to make
predictions on new data.
The output of the network is compared to the expected output, and an error
is calculated. This error is then propagated back through the network, one layer
at a time, and the weights are updated according to the amount they contributed
to the error. This clever bit of math is called the backpropagation algorithm.
The process is repeated for all of the examples in your training data. One round
of updating the network for the entire training dataset is called an epoch. A network
may be trained for tens, hundreds, or many thousands of epochs.
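The following NumPy sketch shows the forward pass, the backpropagation of the error one layer at a time, and the notion of an epoch, on a toy problem; the network size, learning rate and number of epochs are illustrative only:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 2))                    # 8 training rows with 2 inputs each
    y = (X[:, 0] + X[:, 1] > 0).astype(float)      # toy binary target

    W1, b1 = rng.normal(size=(2, 4)) * 0.1, np.zeros(4)
    W2, b2 = rng.normal(size=(4, 1)) * 0.1, np.zeros(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    lr = 0.5

    for epoch in range(100):                       # one epoch = one pass over the whole training set
        for xi, yi in zip(X, y):                   # stochastic gradient descent: one row at a time
            h = sigmoid(xi @ W1 + b1)              # forward pass through the hidden layer
            out = sigmoid(h @ W2 + b2)             # forward pass through the output layer
            d_out = (out - yi) * out * (1 - out)   # error propagated back from the output...
            d_h = (d_out @ W2.T) * h * (1 - h)     # ...and then back through the hidden layer
            W2 -= lr * np.outer(h, d_out); b2 -= lr * d_out
            W1 -= lr * np.outer(xi, d_h);  b1 -= lr * d_h

    preds = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel()
    print(np.round(preds, 2), y)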
Weight Updates
The weights in the network can be updated from the errors calculated for each
training example, and this is called online learning. It can result in fast but also
chaotic changes to the network.
Alternatively, the errors can be saved across all the training examples, and
the network can be updated at the end. This is called batch learning and is often
more stable.
Typically, because datasets are so large and because of computational
efficiencies, the size of the batch, the number of examples the network is shown
before an update, is often reduced to a small number, such as tens or hundreds
of examples.
The amount that weights are updated is controlled by a configuration parameter
called the learning rate. It is also called the step size and controls the step or change
made to a network weight for a given error. Often small learning rates are used,
such as 0.1 or 0.01 or smaller.
The update equation can be complemented with additional configuration terms
that you can set; a brief optimizer configuration sketch follows the list below.
• Momentum is a term that incorporates the properties from the previous
weight update to allow the weights to continue to change in the same
direction even when there is less error being calculated.
• Learning Rate Decay is used to decrease the learning rate over epochs to
allow the network to make large changes to the weights at the beginning
and smaller fine-tuning changes later in the training schedule.
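In Keras, for example, momentum and learning rate decay are configured on the optimizer; the sketch below assumes TensorFlow/Keras is installed and the numbers are illustrative only:

    from tensorflow import keras

    schedule = keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.1,    # large corrective steps at the start of training
        decay_steps=1000,
        decay_rate=0.9)               # smaller fine-tuning steps later in the schedule

    optimizer = keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)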
Prediction
Once a neural network has been trained, it can be used to make predictions.
You can make predictions on test or validation data in order to estimate the
skill of the model on unseen data. You can also deploy it operationally and use
it to make predictions continuously. The network topology and the final set of
weights are all you need to save from the model.
Predictions are made by providing the input to the network and performing
a forward-pass, allowing it to generate an output you can use as a prediction.
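A minimal Keras sketch of saving a network (here left untrained for brevity) and reusing it for a forward pass on new data; the shapes and file name are illustrative:

    import numpy as np
    from tensorflow import keras

    model = keras.Sequential([keras.Input(shape=(4,)),
                              keras.layers.Dense(1, activation="sigmoid")])
    model.compile(optimizer="sgd", loss="binary_crossentropy")

    model.save("trained_net.keras")                   # the topology and weights are all you need
    restored = keras.models.load_model("trained_net.keras")

    x_new = np.random.rand(3, 4)                      # three new rows of input data
    print(restored.predict(x_new))                    # a forward pass produces the predictions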
AUTOMATIC SPEECH RECOGNITION
Early large-scale successes of deep learning in speech recognition came on small recognition tasks based on TIMIT. The data set contains 630 speakers from eight major dialects
of American English, where each speaker reads 10 sentences. Its small size lets
many configurations be tried. More importantly, the TIMIT task concerns phone-
sequence recognition, which, unlike word-sequence recognition, allows weak
phone bigram language models. This lets the strength of the acoustic modeling
aspects of speech recognition be more easily analyzed. The error rates listed below,
including these early results and measured as percent phone error rates (PER),
have been summarized since 1991.
Method                                                      Percent phone error rate (PER) (%)
Randomly Initialized RNN 26.1
Bayesian Triphone GMM-HMM 25.6
Hidden Trajectory (Generative) Model 24.8
Monophone Randomly Initialized DNN 23.4
Monophone DBN-DNN 22.4
Triphone GMM-HMM with BMMI Training 21.7
Monophone DBN-DNN on fbank 20.7
Convolutional DNN 20.0
Convolutional DNN w. Heterogeneous Pooling 18.7
Ensemble DNN/CNN/RNN 18.3
Bidirectional LSTM 17.8
Hierarchical Convolutional Deep Maxout Network 16.5
The debut of DNNs for speaker recognition in the late 1990s, for speech
recognition around 2009-2011, and of LSTM around 2003-2007 accelerated
progress in eight major areas:
• Scale-up/out and accelerated DNN training and decoding
• Sequence discriminative training
• Feature processing by deep models with solid understanding of the
underlying mechanisms
• Adaptation of DNNs and related deep models
• Multi-task and transfer learning by DNNs and related deep models
• CNNs and how to design them to best exploit domain knowledge of speech
• RNN and its rich LSTM variants
• Other types of deep models including tensor-based models and integrated
deep generative/discriminative models.
Natural language processing
Recursive neural architectures have been applied to natural language tasks such as
detecting paraphrasing. Deep neural architectures provide the best results for
constituency parsing, sentiment analysis, information retrieval, spoken language
understanding, machine translation, contextual entity linking, writing style
recognition, text classification and others.
Recent developments generalize word embedding to sentence embedding.
Google Translate (GT) uses a large end-to-end long short-term memory (LSTM)
network. Google Neural Machine Translation (GNMT) uses an example-based
machine translation method in which the system "learns from millions of examples".
It translates "whole sentences at a time, rather than pieces". Google Translate
supports over one hundred languages. The network encodes the "semantics of the
sentence rather than simply memorizing phrase-to-phrase translations". GT uses
English as an intermediate between most language pairs.
Drug discovery and toxicology
A large percentage of candidate drugs fail to win regulatory approval. These
failures are caused by insufficient efficacy (on-target effect), undesired interactions
(off-target effects), or unanticipated toxic effects. Research has explored use of
deep learning to predict the biomolecular targets, off-targets, and toxic effects of
environmental chemicals in nutrients, household products and drugs.
AtomNet is a deep learning system for structure-based rational drug design.
AtomNet was used to predict novel candidate biomolecules for disease targets
such as the Ebola virus and multiple sclerosis.
In 2017 graph neural networks were used for the first time to predict various
properties of molecules in a large toxicology data set. In 2019, generative neural
networks were used to produce molecules that were validated experimentally all
the way into mice.
Customer relationship management
Deep reinforcement learning has been used to approximate the value of
possible direct marketing actions, defined in terms of RFM variables. The estimated
value function was shown to have a natural interpretation as customer lifetime
value.
Recommendation systems
Recommendation systems have used deep learning to extract meaningful
features for a latent factor model for content-based music and journal
recommendations. Multi-view deep learning has been applied for learning user
preferences from multiple domains. The model uses a hybrid collaborative and
content-based approach and enhances recommendations in multiple tasks.
Bioinformatics
An autoencoder ANN was used in bioinformatics, to predict gene ontology
annotations and gene-function relationships. In medical informatics, deep learning
was used to predict sleep quality based on data from wearables and predictions
of health complications from electronic health record data.
Medical image analysis
Deep learning has been shown to produce competitive results in medical
application such as cancer cell classification, lesion detection, organ segmentation
and image enhancement. Modern deep learning tools demonstrate the high accuracy
of detecting various diseases and the helpfulness of their use by specialists to
improve the diagnosis efficiency.
Mobile advertising
Finding the appropriate mobile audience for mobile advertising is always
challenging, since many data points must be considered and analyzed before a
target segment can be created and used in ad serving by any ad server. Deep
learning has been used to interpret large, many-dimensioned advertising datasets.
Many data points are collected during the request/serve/click internet advertising
cycle. This information can form the basis of machine learning to improve ad
selection.
Image restoration
Deep learning has been successfully applied to inverse problems such as
denoising, super-resolution, inpainting, and film colorization. These applications
include learning methods such as ―Shrinkage Fields for Effective Image Restoration‖
which trains on an image dataset, and Deep Image Prior, which trains on the image
that needs restoration.
Financial fraud detection
Deep learning is being successfully applied to financial fraud detection, tax
evasion detection, and anti-money laundering. A demonstration of unsupervised
learning would be potentially impressive here, because prosecution of financial crime
is otherwise required to produce training data.
Also of note is that while the state of the art model in automated financial
crime detection has existed for quite some time, the applications for deep learning
referred to here dramatically under perform much simpler theoretical models. One
such, yet to be implemented model, the Sensor Location Heuristic and Simple Any
Human Detection for Financial Crimes (SLHSAHDFC), is an example.
The model works with the simple heuristic of choosing where it gets its input
data. By placing the sensors by places frequented by large concentrations of wealth
and power and then simply identifying any live human being, it turns out that the
automated detection of financial crime is accomplished at very high accuracies
and very high confidence levels. Even better, the model has shown to be extremely
effective at identifying not just crime but large, very destructive and egregious
crime. Due to the effectiveness of such models it is highly likely that applications
to financial crime detection by deep learning will never be able to compete.
Military
The United States Department of Defense applied deep learning to train robots
in new tasks through observation.
Partial differential equations
Physics informed neural networks have been used to solve partial differential
equations in both forward and inverse problems in a data driven manner. One
example is reconstructing fluid flow governed by the Navier-Stokes equations.
Using physics-informed neural networks does not require the often expensive mesh
generation that conventional CFD methods rely on.
Image Reconstruction
Image reconstruction is the reconstruction of the underlying images from
image-related measurements. Several works have shown that deep learning methods
outperform analytical methods for various applications, e.g., spectral imaging and
ultrasound imaging.
Epigenetic clock
An epigenetic clock is a biochemical test that can be used to measure age.
Galkin et al. used deep neural networks to train an epigenetic aging clock of
unprecedented accuracy using >6,000 blood samples. The clock uses information
from 1000 CpG sites and predicts people with certain conditions older than healthy
controls: IBD, frontotemporal dementia, ovarian cancer, obesity. The aging clock
is planned to be released for public use in 2021 by an Insilico Medicine spinoff
company Deep Longevity.
FRAUD DETECTION
Fraud is a growing problem in the digital world. In 2021, consumers reported
2.8 million cases of fraud to the Federal Trade Commission. Identity theft and
imposter scams were the two most common fraud categories.
To help prevent fraud, companies like Signifyd use deep learning to detect
anomalies in user transactions. Those companies deploy deep learning to collect
data from a variety of sources, including the device location, length of stride and
credit card purchasing patterns to create a unique user profile.
Mastercard has taken a similar approach, leveraging its Decision Intelligence
and AI Express platforms to more accurately detect fraudulent credit card activity.
And for companies that rely on e-commerce, Riskified is making consumer
finance easier by reducing the number of bad orders and chargebacks for
merchants.
CUSTOMER RELATIONSHIP MANAGEMENT
Customer relationship management systems are often referred to as the "single
source of truth" for revenue teams. They contain emails, phone call records and
notes about all of the company‘s current and former customers as well as its
prospects. Aggregating that information has helped revenue teams provide a better
customer experience, but the introduction of deep learning in CRM systems has
unlocked another layer of customer insights.
Deep learning is able to sift through all of the scraps of data a company collects
about its prospects to reveal trends about why customers buy, when they buy and
what keeps them around. This includes predictive lead scoring, which helps
companies identify customers they have the best chances to close; scraping data
from customer notes to make it easier to identify trends; and predictions about
customer support needs.
COMPUTER VISION
Deep learning aims to mimic the way the human mind digests information
and detects patterns, which makes it a perfect way to train vision-based AI
programs. Using deep learning models, those platforms are able to take in a series
of labeled photo sets to learn to detect objects like airplanes, faces and guns.
The application for image recognition is expansive. Neurala uses an algorithm
it calls Lifelong-DNN to complete manufacturing quality inspections. Others, like
ZeroEyes, use deep learning to detect firearms in public places like schools and
government property. When a gun is detected, the system is designed to alert police
in an effort to prevent shootings. And finally, companies like Motional rely on
AI technologies to reinforce their LiDAR, radar and camera systems in autonomous
vehicles.
Agriculture
Agriculture will remain a key source of food production in the coming years,
so people have found ways to make the process more efficient with deep learning
and AI tools. In fact, a 2021 Forbes article revealed that the agriculture industry
is expected to invest $4 billion in AI solutions by 2026. Farmers have already
found various uses for the technology, wielding AI to detect intrusive wild animals,
forecast crop yields and power self-driving machinery.
Blue River Technology has explored the possibilities of self-driven farm
products by combining machine learning, computer vision and robotics. The
results have been promising, leading to smart machines — like a lettuce bot that
knows how to single out weeds for chemical spraying while leaving plants alone.
In addition, companies like Taranis blend computer vision and deep learning to
monitor fields and prevent crop loss due to weeds, insects and other causes.
Vocal AI
When it comes to recreating human speech or translating voice to text, deep
learning has a critical role to play. Deep learning models enable tools like Google
Voice Search and Siri to take in audio, identify speech patterns and translate it
into text. Then there‘s DeepMind‘s WaveNet model, which employs neural networks
to take text and identify syllable patterns, inflection points and more. This enables
companies like Google to train their virtual assistants to sound more human. In
addition, Mozilla‘s 2017 RRNoise Project used it to identify and suppress
background noise in audio files, providing users with clearer audio.
NATURAL LANGUAGE PROCESSING
The introduction of natural language processing technology has made it possible
for robots to read messages and divine meaning from them. Still, the process can
be somewhat oversimplified, failing to account for the ways that words combine
to change the meaning or intent behind a sentence.
Deep learning enables natural language processors to identify more complicated
patterns in sentences to provide a more accurate interpretation. Companies like
Gamalon use deep learning to power a chatbot that is able to respond to a larger
volume of messages and provide more accurate responses.
Other companies like Strong apply it in its NLP tool to help users translate
text, categorize text to help mine data from a collection of messages and identify
sentiment in text. Grammarly also uses deep learning in combination with
grammatical rules and patterns to help users identify the errors and tone of their
messages.
Data Refining
When large amounts of raw data are collected, it‘s hard for data scientists to
identify patterns, draw insights or do much with it. It needs to be processed. Deep
learning models are able to take that raw data and make it accessible. Companies
like Descartes Labs use a cloud-based supercomputer to refine data. Making sense
of swaths of raw data can be useful for disease control, disaster mitigation, food
security and satellite imagery.
Virtual Assistants
The divide between humans and machines continues to blur as virtual assistants
become a part of everyday life. These AI-driven tools display a mix of AI, machine
learning and deep learning techniques in order to process commands. Apple‘s Siri
and Google‘s Google Assistant are two prominent examples, with both being able
to operate across laptops, speakers, TVs and other devices. People can expect to
see more virtual assistants and chatbots in the near future as the industry is on
track to undergo plenty of growth through 2028.
Autonomous Vehicles
Driving is all about taking in external factors like the cars around you, street
signs and pedestrians and reacting to them safely to get from point A to B. While
we‘re still a ways away from fully autonomous vehicles, deep learning has played
a crucial role in helping the technology come to fruition. It allows autonomous
vehicles to take into account where you want to go, predict what the obstacles
in your environment will do and create a safe path to get you to that location.
For instance, Zoox has used AI technologies to help its fully autonomous
robotaxi vehicles learn from some of the most challenging driving situations to
improve their decision-making under various circumstances. Other self-driving car
companies that use deep learning to power their technology include Tesla-owned
DeepScale and Waymo, a subsidiary of Google.
Supercomputers
While some software uses deep learning in its solution, if you want to build
your own deep learning model, you need a supercomputer. Companies like Boxx
and Nvidia have built workstations that can handle the processing power needed
to build deep learning models. NVIDIA's DGX Station claims to be the "equivalent
of hundreds of traditional servers," and enables users to test and tweak their
models. Boxx‘s APEXX W-class products work with deep learning frameworks
to provide more powerful processing and dependable computer performance.
Investment Modeling
Investment modeling is another industry that has benefited from deep learning.
Predicting the market requires tracking and interpreting dozens of data points from
earning call conversations to public events to stock pricing. Companies like Aiera
use an adaptive deep learning platform to provide institutional investors with real-
time analysis on individual equities, content from earnings calls and public company
events. Even some of the bigger names like Morgan Stanley are joining the AI
movement, using AI technologies to provide sound advice on wealth management
through robo-advisors.
Climate Change
Organizations are stepping up to help people adapt to quickly accelerating
environmental change. One Concern has emerged as a climate intelligence leader,
factoring environmental events such as extreme weather into property risk
assessments. Meanwhile, NCX has expanded the carbon-offset movement to include
smaller landowners by using AI technology to create an affordable carbon
marketplace.
E-commerce
Online shopping is now the de-facto way people purchase goods, but it can
still be frustrating to scroll through dozens of pages to find the right pair of shoes
that match your style. Some e-commerce companies are turning to deep learning
to make the hunt easier.
Among Clarifai‘s many deep learning offerings is a tool that helps brands with
image labeling to boost SEO traffic and surface alternative products for users when
an item is out of stock. E-commerce giant eBay also applies a suite of AI, machine
learning and deep learning techniques to power its global online marketplace and
further enhance its search engine capabilities.
Emotional Intelligence
While computers may not be able to replicate human emotions, they are
getting better at understanding our moods thanks to deep learning. Patterns like
a shift in tone, a slight frown or a huff are all valuable data signals that can help
AI detect our moods.
Companies like Affectiva are using deep learning to track all of those vocal
and facial reactions to provide a nuanced understanding of our moods. Others like
Cogito analyze the behaviors of customer service representatives to gauge their
emotional intelligence and offer real-time advice for improved interactions.
Entertainment
Ever wonder how streaming platforms seem to intuit the perfect show for you
to binge-watch next? Well, you have deep learning to thank for that. Streaming
platforms aggregate tons of data on what content you choose to consume and what
you ignore. Take Netflix as an example. The streaming platform uses machine
learning to find patterns in what its viewers watch so that it can create a personalized
experience for its users.
Deep Dreaming
Introduced back in 2015 by a team of Google engineers, the concept of deep
dreaming has given another dimension to the realm of deep learning. Deep dreaming
involves feeding algorithms to machines, which can then mimic the process of
dreaming in human neural networks. A website called Deep Dream Generator has
taken advantage of these algorithms, allowing creators to produce breathtaking
digital art.
Advertising
Companies can glean a lot of information from how a user interacts with its
marketing. It can signal intent to buy, show that the product resonates with them
or that they want to learn more information. Many marketing tech firms are using
deep learning to generate even more insights into customers. Companies like
6sense use deep learning to train their software to better understand buyers based
on how they engage with an app or navigate a website. This can be used to help
businesses more accurately target potential buyers and create tailored ad campaigns.
Other firms like Dstillery use it to understand more about a customer‘s consumers
to help each ad campaign reach the target audience for the product.
Manufacturing
The success of a factory often hinges on machines, humans and robots working
together as efficiently as possible to produce a replicable product. When one part
of the production gets out of whack, it can come at a devastating cost to the
company. Deep learning is being used to make that process even more efficient
and eliminate those errors.
Companies like OneTrack are using it to scan factory floors for anomalies like
a teetering box or an improperly used forklift and alert workers to safety risks.
The goal is to prevent errors that can slow down production and cause harm. Then
there‘s Fanuc, which uses it to train its AI Error Proofing tool to discern good
parts from bad parts during the manufacturing process. Energy giant General
Electric also uses deep learning in its Predix platform to track and find all possible
points of failure on a factory floor.
Healthcare
The healthcare industry contends with inefficiencies, but deep learning plays
a crucial role in streamlining the patient experience. KenSci, a company under
the Advata umbrella, uses AI technology that learns from past performance data
to predict how much space and what resources teams need to provide proper
patient care. In addition, PathAI harnesses the predictive abilities of AI to garner
more accurate data from drug research, clinical trials and patient diagnostics. Deep
learning has also been proven to detect skin cancer through images, according to
a National Center for Biotechnology Report.
Sports
Top-performing athletes are able to be more intentional about the ways they
improve their games, thanks to AI-driven data. Companies like Hawk-Eye
Innovations have raised the level of professional play through advanced replay
systems, ball-tracking technology and timely game data. However, this attention
to detail isn‘t reserved for sports royalty. Nex has built the HomeCourt app that
basketball players of all skill levels can consult for insights on how to fine-tune
their shooting motion and more.
For us humans, the novel idea of creating machines that can mimic human
intellect, and even augment it, has been and still is, of great interest. And it is
thanks to the efforts made around this idea that Artificial Intelligence, Machine
Learning and later, Deep Learning came into existence.
Now, these three concepts, or rather technologies, are interesting on their own.
However, citing the limitations of the topic we have at hand, we will primarily
discuss Deep Learning here.
Deep Learning is probably the closest we have come so far to engineering
a system based on the working of the brain. It is a complicated system that
endeavors to solve problems and learn concepts that we were once limited to
human intelligence.
Identifying images, translating human language, conversing with humans and
assisting computers to make independent decisions are some of the innumerable
purposes that Deep Learning fulfils.
Many businesses and organizations across the globe are now employing Deep
Learning to fuel their growth and enhance their operations. They are using it to
predict consumer behaviour, detect changes in market trends, create marketing
strategies and whatnot.
Furthermore, according to ReportLinker, the market for Deep Learning will
be valued at a whopping $44.3 Billion by 2027.
Thus, it makes sense for us to know about some of the most commonplace
but significant Deep Learning applications in today‘s world. And that‘s what we
are going to look at in this chapter on applications of deep learning. But before
that, let us gather some insights into what deep learning is.
UNDERSTANDING DEEP LEARNING
Deep learning is fundamentally a subdiscipline of Machine Learning (hence
the "learning" in the name). It bases its operations on another subset of machine
learning called Neural Networks.
Neural Networks are networks of neurons or nodes – algorithmic locations
where the computation of inputs takes place and output is produced. Neural
networks are either biological or artificial, with the latter ones finding use in many
AI-propelled applications.
These networks are essentially a somewhat less sophisticated reproduction of
the biological structure of the human brain, with the nodes signifying the neurons
or nerve cells.
Apart from the first layer, the input layer, and the last layer, the output layer, the
rest of the layers stay hidden in a deep learning system.
APPLICATIONS OF DEEP LEARNING
Now, it is time we answered the million-dollar question: "What are the common
applications of deep learning in artificial intelligence (AI)?"
Healthcare
The healthcare sector has long been one of the prominent adopters of modern
technology to overhaul itself. As such, it is not surprising to see Deep Learning
finding uses in interpreting medical data for
• the diagnosis, prognosis & treatment of diseases
• drug prescription
• analysing MRIs, CT scans, ECG, X-Rays, etc., to detect and notify about
medical anomalies
• personalising treatment
• monitoring the health of patients and more
One notable application of deep learning is found in the diagnosis and treatment
of cancer.
Medical professionals use a CNN or Convolutional Neural Network, a Deep
learning method, to grade different types of cancer cells. They expose high-res
histopathological images to deep CNN models after magnifying them 20X or 40X.
The deep CNN models then demarcate various cellular features within the sample
and detect carcinogenic elements.
Personalized Marketing
Personalized marketing is a concept that has seen much action in the recent
few years. Marketers are now aiming their advertising campaigns at the pain points
of individual consumers, offering them exactly what they need. And Deep Learning
is playing a significant role in this.
Today, consumers are generating a lot of data thanks to their engagement with
social media platforms, IoT devices, web browsers, wearables and the ilk. However,
most of the data generated from these sources are disparate (text, audio, video,
location data, etc.).
To cope with this, businesses use customisable Deep Learning models to
interpret data from different sources and distil them to extract valuable customer
insights. They then use this information to predict consumer behaviour and target
their marketing efforts more efficiently.
So now you understand how those online shopping sites know what products
to recommend to you.
Financial Fraud Detection
Virtually no sector is exempt from the evil called "fraudulent transactions"
or "financial fraud". However, it is the financial corporations (banks, insurance
firms, etc.) that have to bear the brunt of this menace the most. Not a day goes
by when criminals attack financial institutions. There are a plethora of ways to
usurp financial resources from them.
Thus, for these organizations, detecting and predicting financial fraud is
critical, to say the least. And this is where Deep Learning comes into the picture.
Financial organizations are now using the concept of anomaly detection to flag
inappropriate transactions. They employ learning algorithms such as logistic
regression (credit card fraud detection is a prime use case), decision trees and random
forests, alongside deep neural networks, to analyze the patterns common to valid transactions. Then, these
models are put into action to flag financial transactions that seem potentially
fraudulent.
Some examples of fraud being detected and deterred by Deep Learning include:
• identity theft
• insurance fraud
• investment fraud
• fund misappropriation
Natural Language Processing
NLP or Natural Language Processing is another prominent area where Deep
Learning is showing promising results.
Natural Language Processing, as the name suggests, is all about enabling
machines to analyze and understand human language. The premise sounds simple,
right? Well, the thing is, human language is punishingly complex for machines
to interpret. It is not just the alphabet and words but also the context, the accents,
the handwriting and whatnot that discourage machines from processing or generating
human language accurately.
Deep Learning-based NLP is doing away with many of the issues related to
understanding human language by training machines (Autoencoders and Distributed
Representation) to produce appropriate responses to linguistic inputs.
One such example is the personal assistants we use on our smartphones. These
applications come embedded with Deep Learning imbued NLP models to understand
human speech and return appropriate output. It is, thus, no wonder why Siri and
Alexa sound so much like how people talk in real life.
Another use case of Deep Learning-based NLP is how websites written in one
human language automatically get translated to the user-specified language.
Autonomous Vehicles
The concept of building automated or self-governing vehicles goes back to 1977,
when the Tsukuba Mechanical Engineering Laboratory unveiled the world's
first semi-automatic car. The car, a technological marvel then, carried a pair of
cameras and an analogue computer to steer itself on a specially designed street.
However, it wasn't until 1989 that ALVINN (Autonomous Land Vehicle in
a Neural Network), a modified military ambulance, used neural networks to
navigate by itself on roads.
Since then, deep learning and autonomous vehicles have enjoyed a strong
bond, with the former enhancing the latter‘s performance exponentially.
Autonomous vehicles use cameras, sensors – LiDARs, RADARs, motion
sensors – and external information such as geo-mapping to perceive their
environment and collect relevant data. They use this equipment both individually
and in tandem for documenting the data.
This data is then fed to deep learning algorithms that direct the vehicle to
perform appropriate actions such as
• accelerating, steering and braking
• identifying or planning routes
• traversing the traffic
• detecting pedestrians and other vehicles at a distance as well as in proximity
• recognising traffic signs
Deep learning is playing a huge role in realizing the perceived motives of self-
driving vehicles of reducing road accidents, helping the disabled drive, eliminating
traffic jams, etc.
And although still in nascent stages, the day is not far when we will see deep
learning-powered vehicles form a majority of the road traffic.
Fake News Detection
The concept of spreading fake news to tip the scales in one‘s favour is not
old. However, due to the explosive popularity of the internet, and social media
platforms, in particular, fake news has become ubiquitous.
Fake news, apart from misinforming the citizens, can be used to alter political
campaigns, vilify certain situations and individuals, and commit other similar
morally illegible acts. As such, curbing any and all fake news becomes a priority.
Deep Learning proposes a way to deal with the menace of fake news by using
complex language detection techniques to classify fraudulent news sources. This
method essentially works by gathering information from trusted sources and
juxtaposing it against a piece of news to verify its validity.
Python Implementation of
Neuron Model
We will see how to implement the MP neuron model using Python. The data set
we will be using is the breast cancer data set from sklearn. Before building the
MP Neuron model, we will start by loading the data and separating the features
from the target variable.
Once we load the data, we can use the sklearn‘s train_test_split function to
split the data into two parts — training and testing in the ratio of 80:20.
MP Neuron PreProcessing Steps
Remember from our previous discussion that the MP Neuron takes only binary values
as input, so we need to convert the continuous features into binary format.
To achieve this, we will use the pandas.cut function to split all the features into
0 or 1 in one shot. Once we are ready with the inputs, we need to build
the model, train it on the training data and evaluate the model performance on
the test data.
To create an MP Neuron model we will create a class, and inside this class
we will have three different functions (a code sketch follows the list below):
• model function — to calculate the summation of the Binarized inputs.
• predict function — to predict the outcome for every observation in the data.
• fit function — the fit function iterates over all the possible values of the
threshold b and finds the best value of b, such that the loss is minimum.
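A minimal sketch of the whole workflow, assuming scikit-learn and pandas are installed; the choice of two bins and the label order passed to pandas.cut are illustrative:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = data.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    # binarize every continuous feature into 0/1 in one shot
    X_train_bin = X_train.apply(pd.cut, bins=2, labels=[1, 0])
    X_test_bin = X_test.apply(pd.cut, bins=2, labels=[1, 0])

    class MPNeuron:
        def __init__(self):
            self.b = None

        def model(self, x):
            # fire 1 if the sum of the binarized inputs reaches the threshold b
            return int(sum(x) >= self.b)

        def predict(self, X):
            return np.array([self.model(x) for x in X])

        def fit(self, X, y):
            # try every possible threshold and keep the one with the highest accuracy
            accuracies = {}
            for b in range(X.shape[1] + 1):
                self.b = b
                accuracies[b] = accuracy_score(self.predict(X), y)
            self.b = max(accuracies, key=accuracies.get)

    mp = MPNeuron()
    mp.fit(X_train_bin.values, y_train)
    print("test accuracy:", accuracy_score(mp.predict(X_test_bin.values), y_test))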
The McCulloch-Pitts neural model, which was the earliest ANN model, has
only two types of inputs — Excitatory and Inhibitory.
The excitatory inputs have weights of positive magnitude and the inhibitory
inputs have weights of negative magnitude.
The inputs of the McCulloch-Pitts neuron could be either 0 or 1. It has a
threshold function as an activation function. So, the output signal y_out is 1 if the
input y_sum is greater than or equal to a given threshold value, else 0. The diagrammatic
representation of the model is as follows:
McCulloch-Pitts Model
Simple McCulloch-Pitts neurons can be used to design logical operations. For
that purpose, the connection weights need to be correctly decided, along with the
threshold value of the activation function. For a better understanding, let me
consider an example:
John carries an umbrella if it is sunny or if it is raining. There are four given
situations. I need to decide when John will carry the umbrella. The situations are
as follows:
• First scenario: It is not raining, nor it is sunny
• Second scenario: It is not raining, but it is sunny
So, any point (x1, x2) which lies above the decision boundary, as depicted by
the graph, will be assigned to class c1 and the points which lie below the boundary
are assigned to class c2.
Thus, we see that for a data set with linearly separable classes, perceptrons
can always be employed to solve classification problems using decision lines (for 2-
dimensional space), decision planes (for 3-dimensional space) or decision
hyperplanes (for n-dimensional space). Appropriate values of the synaptic weights
can be obtained by training a perceptron. However, one assumption for perceptron
to work properly is that the two classes should be linearly separable i.e. the classes
should be sufficiently separated from each other. Otherwise, if the classes are non-
linearly separable, then the classification problem cannot be solved by perceptron.
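A minimal sketch with scikit-learn's Perceptron on two linearly separable clusters (the cluster centres are illustrative):

    import numpy as np
    from sklearn.linear_model import Perceptron

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(50, 2)),     # class 0 around (0, 0)
                   rng.normal(5, 1, size=(50, 2))])    # class 1 around (5, 5)
    y = np.array([0] * 50 + [1] * 50)

    clf = Perceptron().fit(X, y)
    print("weights:", clf.coef_, "bias:", clf.intercept_)
    print("training accuracy:", clf.score(X, y))       # 1.0 when the classes are separable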
The output value can be +1 or -1. A bias input x0 (where x0 =1) having a weight
w0 is added. The activation function is such that if weighted sum is positive or
0, the output is +1, else it is -1. Formally, the output y = +1 if the weighted sum w0·x0 + w1·x1 + ... + wn·xn >= 0, and y = -1 otherwise. The supervised learning
algorithm adopted by ADALINE network is known as Least Mean Square (LMS)
or DELTA Rule. A network combining a number of ADALINE units is termed
MADALINE (many ADALINE). MADALINE networks can be used to solve
problems related to non-linear separability.
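A minimal NumPy sketch of the LMS (delta) rule on an illustrative linearly separable problem; the weights move in proportion to the error between the target and the linear weighted sum, before thresholding:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    targets = np.where(X[:, 0] + X[:, 1] >= 0, 1, -1)   # +1 / -1 targets

    w = np.zeros(2)
    w0 = 0.0                 # weight on the bias input x0 = 1
    lr = 0.01

    for _ in range(50):
        for x, t in zip(X, targets):
            net = w0 + x @ w                 # linear output used for learning
            error = t - net
            w += lr * error * x              # LMS / delta rule update
            w0 += lr * error

    outputs = np.where(w0 + X @ w >= 0, 1, -1)   # output +1 if the weighted sum >= 0, else -1
    print("accuracy:", np.mean(outputs == targets))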
ACTIVATION FUNCTION
An activation function is simply the function you use to get the output of a node.
It is also known as a Transfer Function.
Activation functions with Neural Networks
It is used to determine the output of a neural network, such as yes or no. It maps
the resulting values into a range such as 0 to 1 or -1 to 1, depending upon the function.
The activation functions can be basically divided into two types:
1. Linear Activation Function
2. Non-linear Activation Functions
Linear or Identity Activation Function
As the name suggests, the function is simply a line. Therefore, the output of the
function is not confined to any range.
Equation: f(x) = x
Range: (-infinity to infinity)
A linear activation function does not help with the complexity or the various
parameters of the usual data that is fed to neural networks.
Non-linear Activation Function
The nonlinear activation functions are the most used activation functions.
Nonlinearity allows the network to form curved decision boundaries, which makes
it easy for the model to generalize or adapt to a variety of data and to differentiate
between outputs.
The main terms needed to understand nonlinear functions are:
Derivative or Differential
The change in the y-axis with respect to the change in the x-axis. It is also known
as the slope.
Monotonic function
A function which is either entirely non-increasing or non-decreasing.
The nonlinear activation functions are mainly divided on the basis of their
range or curves:
Sigmoid or Logistic Activation Function
The main reason why we use the sigmoid function is that its output lies between
0 and 1. Therefore, it is especially used for models where we have to predict a
probability as the output: since the probability of anything exists only in the range
of 0 and 1, sigmoid is the right choice.
The function is differentiable, which means we can find the slope of the
sigmoid curve at any point.
The function is monotonic, but the function's derivative is not.
The logistic sigmoid function can cause a neural network to get stuck during
training, because its gradient becomes very small for large positive or negative inputs.
The softmax function is a more generalized logistic activation function which
is used for multiclass classification.
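Both functions are easy to sketch with NumPy:

    import numpy as np

    def sigmoid(z):
        # output between 0 and 1: suitable for predicting a single probability
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        # generalization of the logistic function for multiclass classification:
        # the outputs are positive and sum to 1, so they can be read as class probabilities
        e = np.exp(z - np.max(z))            # subtract the maximum for numerical stability
        return e / e.sum()

    print(sigmoid(0.0))                          # 0.5
    print(softmax(np.array([2.0, 1.0, 0.1])))    # three class probabilities summing to 1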
Tanh or hyperbolic tangent Activation Function
tanh is also like the logistic sigmoid but better. The range of the tanh function
is (-1 to 1). tanh is also sigmoidal (S-shaped).
The advantage is that the negative inputs will be mapped strongly negative
and the zero inputs will be mapped near zero in the tanh graph.
ReLU (Rectified Linear Unit) Activation Function
The ReLU function outputs the input directly when it is positive and zero otherwise,
f(x) = max(0, x). Its drawback is that all the negative values become zero
immediately in the graph, which in turn affects the resulting graph by not mapping
the negative values appropriately.
Leaky ReLU
It is an attempt to solve the dying ReLU problem. Instead of outputting zero for negative inputs, the function lets a small fraction of them through: f(x) = max(a·x, x), where a is a small constant.
The leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01 or so.
When a is not fixed at 0.01, the function is called Randomized ReLU.
Therefore the range of the Leaky ReLU is (-infinity, +infinity).
Both Leaky and Randomized ReLU functions are monotonic in nature, and so are their derivatives.
COMPONENTS OF NEURAL NETWORKS
There are different types of neural networks but they always consist of the
same components: neurons, synapses, weights, biases, and functions.
Neurons
A neuron or a node is a basic unit of neural networks that receives information,
performs simple calculations, and passes it further.
All neurons in a net are divided into three groups:
• Input neurons that receive information from the outside world;
• Hidden neurons that process that information;
• Output neurons that produce a conclusion.
In a large neural network with many neurons and connections between them,
neurons are organized in layers. There is an input layer that receives information,
a number of hidden layers, and the output layer that provides valuable results.
Every neuron performs a transformation on the input information. Neurons only operate on numbers in the range [0, 1] or [-1, 1]. In order to turn data into something that a neuron can work with, we need normalization, which was discussed earlier in the context of regression analysis.
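As a small illustration of that step (a sketch with made-up numbers, assuming simple min-max scaling), features can be rescaled into the [0, 1] range like this:

import numpy as np

def min_max_normalize(x):
    # Rescale each feature column to the [0, 1] range.
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

data = np.array([[150.0, 40.0],
                 [180.0, 90.0],
                 [165.0, 65.0]])
normalized = min_max_normalize(data)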
Synapses and weights
A synapse is what connects the neurons, like an electrical cable. Every synapse has a weight. The weights also contribute to the changes in the input information: the results of the neuron with the greater weight will be dominant in the next neuron, while information from less 'weighty' neurons will not be passed on. One can say that the matrix of weights governs the whole neural system.
How do you know which neuron has the biggest weight? During the initialization
(first launch of the NN), the weights are randomly assigned but then you will have
to optimize them.
Bias
A bias neuron allows for more variations of weights to be stored. Biases add a richer representation of the input space to the model's weights.
In the case of neural networks, a bias neuron is added to every layer. It plays
a vital role by making it possible to move the activation function to the left or
right on the graph.
It is true that ANNs can work without bias neurons. However, they are almost
always added and counted as an indispensable part of the overall model.
How ANNs work
Every neuron processes input data to extract a feature. Let‘s imagine that we
have three features and three neurons, each of which is connected with all these
features.
Each of the neurons has its own weights that are used to weight the features.
During the training of the network, you need to select such weights for each
of the neurons that the output provided by the whole network would be true-to-
life.
To perform transformations and get an output, every neuron has an activation
function. This combination of functions performs a transformation that is described
by a common function F — this describes the formula behind the NN‘s magic.
There are a lot of activation functions. The most common ones are linear,
sigmoid, and hyperbolic tangent. Their main difference is the range of values they
work with.
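To make the idea concrete, here is a minimal sketch (with made-up weights and inputs) of what such a layer computes: a weighted sum of the three features, plus a bias, passed through a sigmoid activation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

features = np.array([0.2, 0.7, 0.1])        # three input features
weights = np.array([[0.5, -0.6, 0.1],       # one row of weights per neuron
                    [0.9, 0.4, -0.3],
                    [-0.2, 0.8, 0.7]])
bias = np.array([0.1, 0.0, -0.1])

outputs = sigmoid(weights @ features + bias)   # outputs of the three neurons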
How do you train an algorithm?
Neural networks are trained like any other algorithm. You want to get some
results and provide information to the network to learn from. For example, we
want our neural network to distinguish between photos of cats and dogs and
provide plenty of examples.
Delta is the difference between the data and the output of the neural network.
We use calculus magic and repeatedly optimize the weights of the network until
the delta is zero. Once the delta is zero or close to it, our model is correctly able
to predict our example data.
Iteration
This is a kind of counter that increases every time the neural network goes
through one training set. In other words, this is the total number of training sets
completed by the neural network.
Epoch
The epoch increases each time we go through the entire set of training sets.
The more epochs there are, the better is the training of the model.
Batch
Batch size is equal to the number of training examples in one forward/
backward pass. The higher the batch size, the more memory space you‘ll need.
What is the difference between an iteration and an epoch?
• one epoch is one forward pass and one backward pass of all the
training examples;
• number of iterations is a number of passes, each pass using [batch
size] number of examples. To be clear, one pass equals one forward
pass + one backward pass (we do not count the forward pass and
backward pass as two different passes).
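For example, with 1,000 training examples and a batch size of 100, one epoch consists of 10 iterations; training for 5 epochs therefore performs 50 iterations in total.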
And what about errors?
Error is a deviation that reflects the discrepancy between expected and received
output. The error should become smaller after every epoch. If this does not happen,
then you are doing something wrong.
The error can be calculated in different ways, but we will consider only two
main ways: Arctan and Mean Squared Error.
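As a rough sketch (one common formulation of these two error measures, not necessarily the exact one intended here), both can be computed in a couple of lines of NumPy:

import numpy as np

def mean_squared_error(expected, predicted):
    return np.mean((expected - predicted) ** 2)

def arctan_error(expected, predicted):
    # Squash large deviations with arctan before squaring and averaging.
    return np.mean(np.arctan(expected - predicted) ** 2)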
Recurrent neural networks
A recurrent neural network feeds information from previous steps back into itself, so its predictions
become more precise every time because it remembers the results of the previous
iteration and can use that information to make better decisions.
Recurrent neural networks are widely used in natural language processing and
speech recognition.
Convolutional neural networks
Convolutional neural networks are the standard of today‘s deep machine
learning and are used to solve the majority of problems. Convolutional neural
networks can be either feed-forward or recurrent.
Let‘s see how they work. Imagine we have an image of Albert Einstein. We
can assign a neuron to all pixels in the input image.
But there is a big problem here: if you connect each neuron to all pixels, then,
firstly, you will get a lot of weights. Hence, it will be a very computationally
intensive operation and take a very long time. Then, there will be so many weights
that this method will be very unstable to overfitting. It will predict everything well
on the training example but work badly on other images.
Generative adversarial networks
GANs are used, for example, to generate photographs that are perceived by the human eye as natural images or deepfakes.
A generative adversarial network is an unsupervised machine learning algorithm
that is a combination of two neural networks, one of which (network G) generates
patterns and the other (network A) tries to distinguish genuine samples from the
fake ones. Since networks have opposite goals – to create samples and reject
samples – they start an antagonistic game that turns out to be quite effective.
What kind of problems do NNs solve?
Neural networks are used to solve complex problems that require analytical
calculations similar to those of the human brain. The most common uses for neural
networks are:
• Classification. NNs label the data into classes by implicitly analyzing
its parameters. For example, a neural network can analyse the
parameters of a bank client such as age, solvency, credit history and
decide whether to loan them money.
• Prediction. The algorithm has the ability to make predictions. For
example, it can foresee the rise or fall of a stock based on the
situation in the stock market.
• Recognition. This is currently the widest application of neural
networks. For example, a security system can use face recognition
to only let authorized people into the building.
Deep learning and neural networks are useful technologies that expand human
intelligence and skills. Neural networks are just one type of deep learning
architecture. However, they have become widely known because NNs can effectively
solve a huge variety of tasks and cope with them better than other algorithms.
Sigmoid Function
Whether you implement a neural network yourself or you use a built in library
for neural network learning, it is of paramount importance to understand the
significance of a sigmoid function. The sigmoid function is the key to understanding
how a neural network learns complex problems. This function also served as a
basis for discovering other functions that lead to efficient and good solutions for
supervised learning in deep learning architectures.
In this chapter, you will discover the sigmoid function and its role in learning
from examples in neural networks.
Graph of the sigmoid function and its derivative. Some important properties
are also shown.
A few other properties include:
1. Domain: (-∞, +∞)
2. Range: (0, +1)
3. sigmoid(0) = 0.5
4. The function is monotonically increasing.
5. The function is continuous everywhere.
6. The function is differentiable everywhere in its domain.
7. Numerically, it is enough to compute this function's value over a small range of numbers, e.g., [-10, +10]. For values less than -10, the function's value is almost zero; for values greater than 10, its value is almost one.
THE SIGMOID AS A SQUASHING FUNCTION
The sigmoid function is also called a squashing function as its domain is the
set of all real numbers, and its range is (0, 1). Hence, if the input to the function
is either a very large negative number or a very large positive number, the output
is always between 0 and 1. The same goes for any number between -∞ and +∞.
Sigmoid As An Activation Function In Neural Networks
The sigmoid function is used as an activation function in neural networks. Just
to review what is an activation function, the figure below shows the role of an
activation function in one layer of a neural network. A weighted sum of inputs
is passed through an activation function and this output serves as an input to the
next layer.
When the activation function for a neuron is a sigmoid function it is a
guarantee that the output of this unit will always be between 0 and 1. Also, as
the sigmoid is a non-linear function, the output of this unit would be a non-linear
function of the weighted sum of inputs. Such a neuron, which employs a sigmoid function as its activation function, is termed a sigmoid unit.
Linear Vs. Non-Linear Separability?
Suppose we have a typical classification problem, where we have a set of
points in space and each point is assigned a class label. If a straight line (or a
hyperplane in an n-dimensional space) can divide the two classes, then we have
a linearly separable problem. On the other hand, if a straight line is not enough
to divide the two classes, then we have a non-linearly separable problem. The
figure below shows data in the 2 dimensional space. Each point is assigned a red
or blue class label. The left figure shows a linearly separable problem that requires
a linear boundary to distinguish between the two classes. The right figure shows
a non-linearly separable problem, where a non-linear decision boundary is required.
For three dimensional space, a linear decision boundary can be described via
the equation of a plane. For an n-dimensional space, the linear decision boundary
is described by the equation of a hyperplane.
Why The Sigmoid Function Is Important In Neural Networks?
If we use a linear activation function in a neural network, then this model can
only learn linearly separable problems. However, with the addition of just one
hidden layer and a sigmoid activation function in the hidden layer, the neural
network can easily learn a non-linearly separable problem. Using a non-linear
function produces non-linear boundaries and hence, the sigmoid function can be
used in neural networks for learning complex decision functions.
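As a quick illustration of that claim (this sketch assumes scikit-learn is available; it is not part of the book's own code), a one-hidden-layer network with logistic (sigmoid) activations can learn XOR, the classic non-linearly separable problem; with a network this small, a different random seed may occasionally need a retry:

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])   # XOR labels: no straight line separates them

clf = MLPClassifier(hidden_layer_sizes=(4,), activation='logistic',
                    solver='lbfgs', max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict(X))        # typically prints [0 1 1 0]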
The only non-linear function that can be used as an activation function in a neural network is one which is monotonically increasing. So, for example, sin(x) or cos(x) cannot be used as activation functions. Also, the activation function should be defined and differentiable everywhere, so that gradients can be computed during training.
Graph 12. MNIST digit 5, which consists of 28x28 pixel values between 0 and 255.
Now, the computer can't really "see" a digit like we humans do, but if we dissect the image into an array of 784 numbers like [0, 0, 180, 16, 230, …, 4, 77, 0, 0, 0], then we can feed this array into our neural network. The computer can't understand an image by "seeing" it, but it can understand and analyze the pixel numbers that represent the image.
So, let's set up a neural network like the one in Graph 13. It has 784 input neurons for the 28x28 pixel values. Let's assume it has 16 hidden neurons and 10 output neurons. The 10 output neurons, returned to us in an array, will each be in charge of classifying one digit from 0 to 9. So if the neural network thinks the handwritten digit is a zero, then we should get an output array of [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]: the first output in this array, which senses the digit to be a zero, is "fired" to 1 by our neural network, and the rest are 0. If the neural network thinks the handwritten digit is a 5, then we should get [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]: the sixth element, which is in charge of classifying a five, is triggered while the rest are not. And so on.
Graph 13: Multi-Layer Sigmoid Neural Network with 784 input neurons, 16
hidden neurons, and 10 output neurons
Remember we mentioned that neural networks become better by repetitively
training themselves on data so that they can adjust the weights in each layer of
the network to get the final results/actual output closer to the desired output? So
when we actually train this neural network with all the training examples in
MNIST dataset, we don't know what weights we should assign to each of the layers. So we just ask the computer to randomly assign weights in each layer (we don't want all the weights to be 0, for reasons explained later).
This concept of randomly initializing weights is important because each time
you train a deep learning neural network, you are initializing different numbers
to the weights. So essentially, you and I have no clue what‘s going on in the neural
network until after the network is trained. A trained neural network has weights
which are optimized at certain values that make the best prediction or classification
on our problem. It‘s a black box, literally. And each time the trained network will
have different sets of weights. For the sake of argument, let‘s imagine the following
case in Graph 14, which I borrow from Michael Nielsen‘s online book:
Graph 15. Neural Networks are Black Boxes. Each Time is Different.
If you train the neural network with a new set of randomized weights, it might
produce the following network instead (compare Graph 15 with Graph 14),
since the weights are randomized and we never know which one will learn which
or what pattern. But the network, if properly trained, should still trigger the correct
hidden neurons and then the correct output.
One last thing to mention: In a multi-layer neural network, the first hidden
layer will be able to learn some very simple patterns. Each additional hidden layer
will somehow be able to learn progressively more complicated patterns. Check out Graph 16 from Scientific American for an example of face recognition.
In a forward pass, we multiply the input by the weights, add a bias and apply an activation function,
and pass the output to the next layer. We keep repeating the process until we reach
the last layer. The final value is our output. We then compute the error between the "calculated output" and the "true output", calculate the partial derivatives of this error with respect to the parameters in each layer going backwards, and keep updating the parameters accordingly.
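A minimal sketch of that whole loop for a single sigmoid neuron and one training example (made-up numbers, squared-error loss assumed) looks like this:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input features
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.0                          # bias
t = 1.0                          # true output
lr = 0.1                         # learning rate

# Forward pass: weighted sum, bias, activation
z = np.dot(w, x) + b
y = sigmoid(z)

# Backward pass: derivative of the squared error w.r.t. each parameter
error = y - t
grad_z = error * y * (1 - y)     # chain rule through the sigmoid
grad_w = grad_z * x
grad_b = grad_z

# Parameter update
w -= lr * grad_w
b -= lr * grad_b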
Neural networks are said to be universal function approximators. The main
underlying goal of a neural network is to learn complex non-linear functions. If
we do not apply any non-linearity in our multi-layer neural network, we are simply trying to separate the classes using a linear hyperplane. As we know, in the real world nothing is linear!
Also, imagine we perform the simple linear operation described above, namely: multiply the input by weights, add a bias and sum them across all the inputs arriving at the neuron. It is likely that in certain situations this output takes a large value. When this output is fed into further layers, it can be transformed into even larger values, making things computationally uncontrollable. This is where activation functions play a major role, i.e. squashing a real number into a fixed interval (e.g. between -1 and 1).
Let us see different types of activation functions and how they compare against
each other:
Sigmoid
The sigmoid activation function has the mathematical form sig(z) = 1 / (1 + e^-z). It takes a real-valued number and squashes it into the range between 0 and 1, which is why it is often read as a probability.
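A minimal NumPy version, in the same spirit as the tanh and ReLU snippets below (a sketch, not the book's reference code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))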
Tanh
The tanh or hyperbolic tangent activation function has the mathematical form
tanh(z) = (e^z - e^-z) / (e^z + e^-z). It is basically a shifted sigmoid neuron: it takes a real-valued number and squashes it between -1 and +1. Similar to the sigmoid neuron, it saturates at large positive and negative values. However, its output is always zero-centered, which helps because the neurons in the later layers of the network then receive inputs that are zero-centered. Hence, in practice, tanh activation functions are preferred over sigmoid in hidden layers.
import numpy as np

def tanh(z):
    return np.tanh(z)
ReLU:
The ReLU (Rectified Linear Unit) activation function has the form relu(z) = max(0, z): it passes positive inputs through unchanged and outputs zero for all negative inputs, which makes it very cheap to compute but leads to the problem described below.
Dead Neurons:
• ReLU units can be fragile during training and can "die". That is, if a unit is not activated initially, then during backpropagation zero gradients flow through it. Hence, neurons that "die" stop responding to variations in the output error, and their parameters will never be updated during backpropagation. However, there are variants such as Leaky ReLU that can be used to overcome this problem. Also, a proper setting of the learning rate can prevent neurons from dying.
import numpy as np

def relu(z):
    return z * (z > 0)
Leaky ReLU:
The Leaky ReLU is just an extension of the traditional ReLU function. As we saw, for values less than 0 the gradient is 0, which results in "dead neurons" in those regions. To address this problem, Leaky ReLU comes in handy: instead of defining values less than 0 as 0, we define negative values as a small linear fraction of the input. The small value commonly used is 0.01, and the function is represented as LeakyReLU(z) = max(0.01 * z, z). The idea of Leaky ReLU can be extended even further: instead of multiplying z by a constant number, we can learn the multiplier and treat it as an additional hyperparameter in our process. This is known as Parametric ReLU, and in practice it is believed to perform better than Leaky ReLU.
import numpy as np

def leaky_relu(z):
    return np.maximum(0.01 * z, z)
The following code plots the sigmoid function and its derivative:
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # Return the sigmoid value and its derivative.
    s = 1 / (1 + np.exp(-x))
    ds = s * (1 - s)
    return s, ds

x = np.arange(-6, 6, 0.01)

fig, ax = plt.subplots(figsize=(9, 5))
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.plot(x, sigmoid(x)[0], color='#307EC7', linewidth=3, label='sigmoid')
ax.plot(x, sigmoid(x)[1], color='#9621E2', linewidth=3, label='derivative')
ax.legend(loc='upper right', frameon=False)
plt.show()
Observations:
• The sigmoid function has values between 0 and 1.
• The output is not zero-centered.
• Sigmoids saturate and kill gradients.
• Towards the top and bottom of the sigmoid curve, the function changes slowly; the derivative curve above shows that the slope (gradient) is close to zero there.
The more complicated the information, the more non-linear the mapping from features to the ground-truth label will usually be. If there were no activation function in a neural network, the network would not be able to represent such complicated mappings mathematically and would not be able to solve the tasks it is really meant to solve.
An activation function makes it easy for a neural network model to adapt to a variety of data and to differentiate between the outcomes.
These functions are mainly divided on the basis of their range or curves:
Sigmoid Activation Functions
Sigmoid takes a real value as the input and outputs another value between 0 and 1. The sigmoid activation function translates an input in the range (-∞, ∞) to the range (0, 1).
Tanh Activation Functions
The tanh function is just another possible function that can be used as a non-
linear activation function between layers of a neural network. It shares a few things
in common with the sigmoid activation function. Unlike a sigmoid function that
will map input values between 0 and 1, the Tanh will map values between -1 and
1. Similar to the sigmoid function, one of the interesting properties of the tanh
function is that the derivative of tanh can be expressed in terms of the function
itself.
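For instance (a small sketch of that property, assuming NumPy), the derivative can be written without deriving anything new:

import numpy as np

def tanh_derivative(x):
    t = np.tanh(x)
    return 1.0 - t ** 2    # d/dx tanh(x) expressed via tanh itself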
ReLU Activation Functions
The formula is deceptively simple: max(0,z). Despite its name, Rectified
Linear Units, it‘s not linear and provides the same benefits as Sigmoid but with
better performance.
Leaky Relu
Leaky ReLU is a variant of ReLU. Instead of being 0 when z < 0, a leaky ReLU allows a small, non-zero, constant gradient (normally α = 0.01). However, the consistency of the benefit across tasks is presently unclear. Leaky ReLUs attempt to fix the "dying ReLU" problem.
Parametric Relu
PReLU gives the neurons the ability to choose what slope is best in the negative region. They can become ReLU or leaky ReLU with certain values of α.
Maxout
The Maxout activation is a generalization of the ReLU and the leaky ReLU
functions. It is a piecewise linear function that returns the maximum of inputs,
designed to be used in conjunction with the dropout regularization technique. Both
ReLU and leaky ReLU are special cases of Maxout. The Maxout neuron, therefore,
enjoys all the benefits of a ReLU unit and does not have any drawbacks like dying
ReLU. However, it doubles the total number of parameters for each neuron, and
hence, a higher total number of parameters need to be trained.
ELU
The Exponential Linear Unit or ELU is a function that tends to converge faster
and produce more accurate results. Unlike other activation functions, ELU has an
extra alpha constant which should be a positive number. ELU is very similar to ReLU except for negative inputs: both are identity functions for non-negative inputs, but for negative inputs ELU smoothly saturates towards -α, whereas ReLU cuts off sharply at zero.
Softmax Activation Functions
The softmax function calculates the probability distribution of an event over 'n' different events. In a general way, this function calculates the probabilities of each target class over all possible target classes. The calculated probabilities then help determine the target class for the given inputs.
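A minimal, numerically stable sketch of this (assuming NumPy; the scores below are invented for illustration):

import numpy as np

def softmax(z):
    # Subtract the max before exponentiating to avoid overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)    # probabilities that sum to 1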
When to use which Activation Function in a Neural Network?
Specifically, it depends on the problem type and the value range of the
expected output. For example, to predict values that are larger than 1, tanh or
sigmoid are not suitable to be used in the output layer, instead, ReLU can be used.
On the other hand, if the output values have to be in the range (0,1) or (-1,
1) then ReLU is not a good choice, and sigmoid or tanh can be used here. While
performing a classification task and using the neural network to predict a probability
distribution over the mutually exclusive class labels, the softmax activation function
should be used in the last layer. However, regarding the hidden layers, as a rule
of thumb, use ReLU as an activation for these layers.
In the case of a binary classifier, the sigmoid activation function should be used in the output layer. The sigmoid and tanh activation functions tend to work poorly for hidden layers; for hidden layers, ReLU or its improved variant leaky ReLU should be used. For a multiclass classifier, softmax is the best choice of activation function. Though more activation functions exist, these are the most commonly used.
One of the important design choices to make is what activation function to use in the hidden layers as well as at the output layer of the network.
Elements of a Neural Network :- Input Layer :- This layer accepts input features. It provides information from the outside world to the network; no computation is performed at this layer, its nodes just pass on the information (features) to the hidden layer. Hidden Layer :- Nodes of this layer are not exposed to the outer world; they are part of the abstraction provided by any neural network. The hidden layer performs all sorts of computations on the features entered through the input layer and transfers the result to the output layer. Output Layer :- This layer brings the information learned by the network up to the outer world.
What is an activation function and why use one? Definition of activation function :- An activation function decides whether a neuron should be activated or not by calculating the weighted sum and further adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
Explanation :- We know that a neural network has neurons that work in correspondence with weights, biases and their respective activation functions. In a neural network, we update the weights and biases of the neurons on the basis of the error at the output. This process is known as back-propagation. Activation functions make back-propagation possible since the gradients are supplied along with the error to update the weights and biases.
OR
tanh(x) = 2 * sigmoid(2x) - 1
• Value Range :- -1 to +1
• Nature :- non-linear
• Uses :- Usually used in hidden layers of a neural network, as its values lie between -1 and 1; hence the mean of the hidden-layer activations comes out to be 0 or very close to it, which helps in centering the data by bringing the mean close to 0. This makes learning for the next layer much easier.
• Equation :- A(x) = max(0, x). It gives an output of x if x is positive and 0 otherwise.
• Value Range :- [0, inf)
• Nature :- non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.
• Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.
In simple words, ReLU learns much faster than the sigmoid and tanh functions.
5). Softmax Function :- The softmax function is also a type of sigmoid function but is handy when we are trying to handle multi-class classification problems.
• Nature :- non-linear
• Uses :- Usually used when trying to handle multiple classes. The softmax function is commonly found in the output layer of image classification problems. The softmax function squeezes the outputs for each class between 0 and 1 and also divides by the sum of the outputs.
• Output :- The softmax function is ideally used in the output layer of the classifier, where we are actually trying to attain the probabilities that define the class of each input.
• The basic rule of thumb is that if you really don't know which activation function to use, simply use ReLU, as it is a general-purpose activation function for hidden layers and is used in most cases these days.
• If your output is for binary classification, then the sigmoid function is a very natural choice for the output layer.
• If your output is for multi-class classification, then softmax is very useful for predicting the probabilities of each class.
So, let's get to it! The x and y axes represent the values of the two weights. The z axis represents the value of the loss function for a particular pair of weight values. Our goal is to find the particular values of the weights for which the loss is minimum. Such a point is called a minima of the loss function.
You have randomly initialized weights in the beginning, so your neural network is probably behaving like a drunk version of yourself, classifying images of cats as humans. Such a situation corresponds to point A on the contour, where the network is performing badly and consequently the loss is high.
We need to find a way to navigate to the bottom of the "valley", to point B, where the loss function has a minima. So how do we do that?
Gradient Descent
When we initialize our weights, we are at point A in the loss landscape. The
first thing we do is to check, out of all possible directions in the x-y plane, moving
along which direction brings about the steepest decline in the value of the
loss function. This is the direction we have to move in. This direction is given
by the direction exactly opposite to the direction of the gradient. The gradient,
the higher dimensional cousin of derivative, gives us the direction with the steepest
ascent.
To wrap your head around it, consider the following figure. At any point of
our curve, we can define a plane that is tangential to the point. In higher dimensions,
we can always define a hyperplane, but let‘s stick to 3-D for now. Then, we can
have infinite directions on this plane. Out of them, precisely one direction will
give us the direction in which the function has the steepest ascent. This direction
is given by the gradient at that point.
Now, once we have the direction we want to move in, we must decide the size of the step we must take. The size of this step is called the learning rate. We must choose it carefully to ensure we can get down to the minima.
If we go too fast, we might overshoot the minima and keep bouncing along the ridges of the "valley" without ever reaching the minima. Go too slow, and
the training might turn out to be too long to be feasible at all. Even if that is not the case, very slow learning rates make the algorithm more prone to getting stuck in a local minima, something we will cover later in this chapter.
Once we have our gradient and the learning rate, we take a step, recompute the gradient at whatever position we end up at, and repeat the process.
While the direction of the gradient tells us which direction has the steepest ascent, its magnitude tells us how steep the steepest ascent/descent is. So, at the minima, where the contour is almost flat, you would expect the gradient to be almost zero. In fact, it is precisely zero at the point of minima.
Gradient Descent in Action
The familiar picture of a ball rolling down the loss surface can give a somewhat inaccurate picture of what gradient descent really is. The trajectory we take is entirely confined to the x-y plane, the plane containing the weights.
As depicted in the above animation, gradient descent doesn‘t involve moving
in z direction at all. This is because only the weights are the free parameters,
described by the x and y directions. The actual trajectory that we take is defined
in the x-y plane as follows.
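Written out as a one-step sketch (grad is assumed to stand for the gradient of the loss with respect to the weights; it is not a function defined in the book):

import numpy as np

def gradient_descent_step(w, grad, alpha):
    # One gradient-descent update: move against the gradient,
    # scaled by the learning rate alpha.
    return w - alpha * grad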
This update is performed during every iteration. Here, w is the weights vector,
which lies in the x-y plane. From this vector, we subtract the gradient of the loss
function with respect to the weights multiplied by alpha, the learning rate. The
gradient is a vector which gives us the direction in which loss function has the
steepest ascent.
The direction of steepest descent is the direction exactly opposite to the
gradient, and that is why we are subtracting the gradient vector from the weights
vector.
If imagining vectors is a bit hard for you, note that almost the same update rule is applied to every weight of the network simultaneously. The only change is that, since we are performing the update individually for each weight, the gradient in the above equation is replaced by the projection of the gradient vector along the direction represented by the particular weight.
Gradient descent is driven by the gradient, which will be zero at the base of any minima. Local minima are called so since the value of the loss function is at a minimum only within a local neighbourhood of that point, not necessarily over the entire loss surface. A real network's loss function is defined over a huge number of weights, and it is hard even to visualize what such a high-dimensional function looks like.
However, given the sheer talent in the field of deep learning these days, people have come up with ways to visualize the contours of loss functions in 3-D. A recent paper pioneers a technique called Filter Normalization, explaining which is beyond the scope of this chapter.
However, it does give us a view of the underlying complexities of the loss functions we deal with. For example, the following contour is a constructed 3-D representation of the loss contour of a VGG-56 deep network's loss function on the CIFAR-10 dataset.
CHALLENGES WITH GRADIENT DESCENT #2: SADDLE
POINTS
The basic lesson we took away regarding the limitation of gradient descent
was that once it arrived at a region with gradient zero, it was almost impossible
for it to escape it regardless of the quality of the minima. Another sort of problem
we face is that of saddle points, which look like this.
A Saddle Point
You can also see a saddle point in the earlier picture, where two "mountains" meet.
A saddle point gets its name from the saddle of a horse, which it resembles.
While it is a minima in one direction (x), it is a local maxima in another direction, and if the contour is flatter towards the x direction, gradient descent would keep oscillating to and fro in the y direction, giving us the illusion that we have converged to a minima.
RANDOMNESS TO THE RESCUE!
So, how do we go about escaping local minima and saddle points while trying to converge to a global minima? The answer is randomness.
Till now we were doing gradient descent with the loss function that had been created by summing the loss over all examples of the training set. If we get into a local minima or saddle point, we are stuck. A way to help gradient descent escape these is to use what is called Stochastic Gradient Descent.
In stochastic gradient descent, instead of taking a step by computing the gradient of the loss function created by summing all the individual loss functions, we take a step by computing the gradient of the loss of only one randomly sampled (without replacement) example.
In contrast to Stochastic Gradient Descent, where each example is
stochastically chosen, our earlier approach processed all examples in one single
batch, and therefore, is known as Batch Gradient Descent.
Batch gradient descent also has an advantage from a computational standpoint. When we perform gradient descent with a loss function that is created by summing all the individual losses, the gradients of the individual losses can be calculated in parallel, whereas the gradient has to be calculated sequentially, step by step, in the case of stochastic gradient descent.
So, what we do is a balancing act. Instead of using the entire dataset, or just a single example, to construct our loss function, we use a fixed number of examples, say 16, 32 or 128, to form what is called a mini-batch. The word is used in contrast with processing all the examples at once, which is generally called Batch Gradient Descent.
The size of the mini-batch is chosen so as to ensure we get enough stochasticity to ward off local minima, while leveraging enough computational power from parallel processing.
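A minimal sketch of such a mini-batch SGD loop (the grad_fn callback is an assumption standing in for whatever computes the gradient of the loss over a batch):

import numpy as np

def minibatch_sgd(X, y, w, grad_fn, lr=0.01, batch_size=32, epochs=10):
    # grad_fn(X_batch, y_batch, w) is assumed to return the gradient
    # of the loss over that mini-batch with respect to the weights w.
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad_fn(X[batch], y[batch], w)
    return w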
LOCAL MINIMA REVISITED: THEY ARE NOT AS BAD AS
YOU THINK
Before you antagonise local minima, recent research has shown that local minima are not necessarily bad. In the loss landscape of a neural network, there are just far too many minima, and a "good" local minima might perform just as well as a global minima.
Why do I say "good"? Because you could still get stuck in "bad" local minima which are created as a result of erratic training examples. "Good" local minima, often referred to in the literature as optimal local minima, can exist in considerable numbers given a neural network's high-dimensional loss function.
It might also be noted that a lot of neural networks perform classification. If a local minima corresponds to the network producing scores between 0.7 and 0.8 for the correct labels, while the global minima produces scores between 0.95 and 0.98 for the correct labels on the same examples, the output class prediction is going to be the same for both.
A desirable property of a minima is that it should be on the flatter side. Why? Because flat minima are easy to converge to, given there is less chance of overshooting the minima and bouncing between its ridges.
More importantly, we expect the loss surface of the test set to be slightly
different from that of the training set, on which we do our training. For a flat and
wide minima, the loss won‘t change much due to this shift, but this is not the case
for narrow minima. The point that we are trying to make is flatter minima
generalise better and are thus desirable.
Entertainment
Based on viewers' browsing history and behavior, online streaming companies give suggestions to help them make product and service choices. Deep learning techniques are also used to add sound to silent movies and generate subtitles automatically.
News Aggregation and Fake News Detection
Deep Learning allows you to customize news depending on the readers‘
persona. You can aggregate and filter out news information as per social,
geographical, and economic parameters and the individual preferences of a reader.
Neural Networks help develop classifiers that can detect fake and biased news
and remove it from your feed. They also warn you of possible privacy breaches.
Composing Music
A machine can learn the notes, structures, and patterns of music and start
producing music independently. Deep Learning-based generative models such as
WaveNet can be used to develop raw audio. Long Short Term Memory Network
helps to generate music automatically. Music21 Python toolkit is used for computer-
aided musicology. It allows us to train a system to develop music by teaching music
theory fundamentals, generating music samples, and studying music.
Image Coloring
Image colorization has seen significant advancements using Deep Learning.
Image colorization is taking an input of a grayscale image and then producing an
output of a colorized image. ChromaGAN is an example of a picture colorization
model. A generative network is framed in an adversarial model that learns to
colorize by incorporating a perceptual and semantic understanding of both class
distributions and color.
Robotics
Deep Learning is heavily used for building robots to perform human-like tasks.
Robots powered by Deep Learning use real-time updates to sense obstacles in their
path and pre-plan their journey instantly. They can be used to carry goods in hospitals, factories and warehouses, for inventory management, for manufacturing products, and so on.
Boston Dynamics robots react to people when someone pushes them around,
they can unload a dishwasher, get up when they fall, and do other tasks as well.
Now, let‘s understand our next deep learning application, i.e. Image captioning.
Image Captioning
Image Captioning is the method of generating a textual description of an
image. It uses computer vision to understand the image‘s content and a language
model to turn the understanding of the image into words in the right order. A
recurrent neural network such as an LSTM is used to turn the labels into a coherent
sentence. Microsoft has built its caption bot where you can upload an image or
the URL of any image, and it will display the textual description of the image.
Another such application that suggests a perfect caption and best hashtags for a
picture is Caption AI.
Advertising
In Advertising, Deep Learning allows optimizing a user‘s experience. Deep
Learning helps publishers and advertisers to increase the significance of the ads
and boosts the advertising campaigns. It will enable ad networks to reduce costs
by dropping the cost per acquisition of a campaign from $60 to $30. You can create
data-driven predictive advertising, real-time bidding of ads, and target display
advertising.
Self Driving Cars
Deep Learning is the driving force behind autonomous, self-driving automobiles. Deep Learning technologies are actually "learning machines" that learn how to act and respond using millions of data sets and training. To
diversify its business infrastructure, Uber Artificial Intelligence laboratories are
powering additional autonomous cars and developing self-driving cars for on-
demand food delivery. Amazon, on the other hand, has delivered their merchandise
using drones in select areas of the globe.
The perplexing problem that the bulk of self-driving-car designers are addressing is subjecting the vehicles to a wide variety of scenarios to assure safe driving. The cars have operational sensors for calculating adjacent objects. Furthermore, they manoeuvre through traffic using data from their cameras, sensors, geo-mapping, and sophisticated models. Tesla is one popular example.
Natural Language Processing
Another important field where Deep Learning is showing promising results
is NLP, or Natural Language Processing. It is the procedure for allowing robots
to study and comprehend human language.
However, keep in mind that human language is excruciatingly difficult for
robots to understand. Machines are discouraged from correctly comprehending or
creating human language not only because of the alphabet and words, but also
because of context, accents, handwriting, and other factors.
Many of the challenges associated with comprehending human language are
being addressed by Deep Learning-based NLP by teaching computers (Autoencoders
and Distributed Representation) to provide suitable responses to linguistic inputs.
Visual Recognition
Just assume you're going through your old memories or photographs. You may choose to print some of them. In the absence of metadata, the only way to achieve this used to be manual effort: the most you could do was order them by date, but downloaded photographs occasionally lack that metadata.
the other hand, has made the job easier. Images may be sorted using it based on
places recognised in pictures, faces, a mix of individuals, events, dates, and so
on. To detect aspects when searching for a certain photo in a library, state-of-the-
art visual recognition algorithms with various levels from basic to advanced are
required.
Fraud Detection
Another attractive application for deep learning is fraud protection and detection;
major companies in the payment system sector are already experimenting with it.
PayPal, for example, uses predictive analytics technology to detect and prevent
fraudulent activity. The business claimed that examining sequences of user behaviour
using neural networks‘ long short-term memory architecture increased anomaly
identification by up to 10%. Sustainable fraud detection techniques are essential
for every fintech firm, banking app, or insurance platform, as well as any organisation
that gathers and uses sensitive data. Deep learning has the ability to make fraud
more predictable and hence avoidable.
Personalisations
Every platform is now attempting to leverage chatbots to create tailored
experiences with a human touch for its users. Deep Learning is assisting e-
commerce behemoths such as Amazon, E-Bay, and Alibaba in providing smooth
tailored experiences such as product suggestions, customised packaging and
discounts, and spotting huge income potential during the holiday season. Even in
newer markets, reconnaissance is accomplished by providing goods, offers, or
plans that are more likely to appeal to human psychology and contribute to growth
in micro markets. Online self-service solutions are on the increase, and dependable
procedures are bringing services to the internet that were previously only physically
available.
Detecting Developmental Delay in Children
Early diagnosis of developmental impairments in children is critical since
early intervention improves children‘s prognoses. Meanwhile, a growing body of
research suggests a link between developmental impairment and motor competence,
therefore motor skill is taken into account in the early diagnosis of developmental delay.
Index
A
Activation Functions, 4, 5, 6, 11, 13, 14, 15, 18, 35, 67, 150, 156, 162, 195, 203, 230, 231, 232, 234, 240, 252, 256, 257, 258, 259, 260, 262, 264, 265, 267, 268.
Artificial Deep Neural Networks, 177.
Artificial Neural Network, 158, 161, 178, 180, 182, 184, 187.
Automatic Speech Recognition, 8, 179, 206.
B
Biological Neurons, 73, 177, 184.
C
Components of Neural Networks, 237.
D
Deep Learning, 1, 2, 3, 6, 8, 12, 13, 14, 22, 33, 63, 71, 72, 73, 74, 75, 116, 148, 149, 195, 201, 206, 211, 217, 218, 219, 220, 221, 222, 223, 224, 236, 246, 256, 272, 283, 284, 285, 286, 287, 288.
Descent Algorithm, 59, 127, 135, 136.
G
Geometric Interpretation, 41.
Gradient Descent, 4, 53, 124, 127, 128, 130, 131, 132, 133, 134, 135, 136, 138, 139, 141, 144, 195, 205, 272, 273, 274, 275, 276, 278, 279, 280, 281, 282.
I
Image Processing, 13, 25, 27, 29, 31, 63, 90.
L
Learning Applications, 211, 283.
M
Machine Learning, 12, 25, 71, 91, 135, 150, 153, 159, 168, 174, 175, 183, 186, 283, 284, 289.
N
Natural Language Processing, 2, 75, 152, 213, 221, 285.
Neural Network, 1, 13, 33, 63, 69, 150, 219, 221, 255, 265, 267, 268, 272.
ABOUT THE AUTHOR
Ahmad Ali AlZubi is a full Professor at the Computer Science Department, King Saud University, Saudi Arabia. He obtained his PhD in Computer Networks Engineering from the National Technical University of Ukraine (NTUU) in 1999. His current research interests include, but are not limited to, Computer Networks, Grid Computing, Cloud Computing, AI, Machine Learning and Deep Learning and their applications in various fields, and services automation. He has also gained valuable industry experience, having worked as a consultant and a member of the Saudi National Team for E-Government in Saudi Arabia. He has authored a book titled Heart Disease Prediction Using Machine Learning (ISBN: 978-81-19477-42-5).
India | UAE | Nigeria | Malaysia | Montenegro | Iraq | Egypt | Thailand | Uganda | Philippines | Indonesia
Parab Publications || www.parabpublications.com || info@parabpublications.com