DLT Unit-5
Topics
➢ Interactive Applications of Deep Learning
❖ Natural Language Processing
❖ Generative Adversarial Networks
❖ Deep Reinforcement Learning
➢ Deep Learning Research
❖ Auto Encoders
❖ Deep Generative Models
▪ Boltzmann Machine
▪ Restricted Boltzmann Machine
▪ Deep Belief Networks
▪ Deep Boltzmann Machine
Components of Natural Language Processing (NLP)
• Natural Language Generation (NLG)
• Natural Language Understanding (NLU)
Natural Language Generation (NLG)
• NLG is a method of creating meaningful phrases and sentences (natural language) from
data. It comprises three stages: text planning, sentence planning, and text realization.
– Text planning: Retrieving applicable content.
– Sentence planning: Forming meaningful phrases and setting the sentence tone.
– Text realization: Mapping sentence plans to sentence structures.
• Chatbots, machine translation tools, analytics platforms, voice assistants, sentiment
analysis platforms, and AI-powered transcription tools are some applications of NLG.
Sentence Segmentation
• Sentence segmentation divides text into its component sentences. This is obvious in languages like English, where the end of a sentence is marked by a period, but it is still not trivial.
• A period can be used to mark an abbreviation as well as to terminate a sentence, and in
this case, the period should be part of the abbreviation token itself.
• The process becomes even more complex in languages, such as ancient Chinese, that
don’t have a delimiter that marks the end of a sentence.
• Stop word removal aims to remove the most commonly occurring words that don’t
add much information to the text.
• For example, “the,” “a,” “an,” and so on.
• Tokenization splits text into individual words and word fragments.
• The result generally consists of a word index and tokenized text in which words may
be represented as numerical tokens for use in various deep learning methods.
• A method that instructs language models to ignore unimportant tokens can improve
efficiency.
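As a concrete illustration, here is a minimal sketch of tokenization and stop word removal using NLTK (one of the libraries introduced later in this unit); the sample sentence is illustrative:
# a minimal tokenization and stop word removal sketch using NLTK
import nltk
nltk.download('punkt')       # tokenizer models
nltk.download('stopwords')   # stop word lists
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)                       # split the text into word tokens
stop_words = set(stopwords.words('english'))
filtered = [t for t in tokens if t.lower() not in stop_words]
print(tokens)    # all tokens, including stop words and punctuation
print(filtered)  # stop words such as "the" and "over" removed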
• TF-IDF: In Bag-of-Words, we count the occurrence of each word or n-gram in a
document. In contrast, with TF-IDF, we weight each word by its importance. To
evaluate a word’s significance, we consider two things:
– Term Frequency: How important is the word in the document?
• TF (word in a document) = Number of occurrences of that word in
document / Number of words in document
– Inverse Document Frequency: How important is the term in the whole
corpus?
• IDF (word in a corpus) = log (number of documents in the corpus /
number of documents that include the word)
• A word is important if it occurs many times in a document. But that creates a problem.
Words like “a” and “the” appear often. And as such, their TF score will always be high.
We resolve this issue by using Inverse Document Frequency, which is high if the word
is rare and low if the word is common across the corpus. The TF-IDF score of a term is
the product of TF and IDF.
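To make these formulas concrete, here is a minimal sketch that computes TF-IDF by hand for a toy corpus (the documents and words are illustrative):
# a minimal TF-IDF sketch following the formulas above
import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tf(word, document):
    words = document.split()
    return words.count(word) / len(words)           # term frequency in one document

def idf(word, corpus):
    containing = sum(1 for doc in corpus if word in doc.split())
    return math.log(len(corpus) / containing)       # rarity of the word in the corpus

def tf_idf(word, document, corpus):
    return tf(word, document) * idf(word, corpus)

print(tf_idf("cat", corpus[0], corpus))  # rare word: relatively high score
print(tf_idf("the", corpus[0], corpus))  # common word: relatively low score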
NLP Libraries
• Scikit-learn: It provides a wide range of algorithms for building machine learning
models in Python.
• Natural Language Toolkit (NLTK): NLTK is a complete toolkit covering all common NLP techniques.
• Pattern: It is a web mining module for NLP and machine learning.
• TextBlob: It provides an easy interface for basic NLP tasks such as sentiment analysis, noun-phrase extraction, and POS tagging.
• Quepy: Quepy is used to transform natural language questions into queries in a
database query language.
• SpaCy: SpaCy is an open-source NLP library which is used for Data Extraction, Data
Analysis, Sentiment Analysis, and Text Summarization.
• Gensim: Gensim works with large datasets and processes data streams.
Applications of NLP
Question Answering
– Question Answering focuses on building systems that automatically answer the
questions asked by humans in a natural language.
Spam detection
– Spam detection is used to detect unwanted e-mails getting to a user's inbox.
Sentiment Analysis
• Sentiment Analysis is also known as opinion mining. It is used on the web to analyse the attitude, behaviour, and emotional state of the sender.
• This application is implemented through a combination of NLP (Natural Language Processing) and statistics: values are assigned to the text (positive, negative, or neutral), and the mood of the context (happy, sad, angry, etc.) is identified.
Machine translation
Machine translation is used to translate text or speech from one natural language to another
natural language.
Speech Recognition
• Speech recognition is used for converting spoken words into text. It is used in
applications, such as mobile, home automation, video recovery, dictating to Microsoft
Word, voice biometrics, voice user interface, and so on.
Chatbot
• Implementing the Chatbot is one of the important applications of NLP.
• It is used by many companies to provide the customer's chat services.
Information extraction
• Information extraction is one of the most important applications of NLP. It is used for extracting structured information from unstructured or semi-structured machine-readable documents.
Natural Language Understanding (NLU)
• NLU converts large bodies of text into more formal representations, such as first-order-logic structures, that are easier for computer programs to manipulate.
Generative Adversarial Networks (GANs)
• Generative modeling is an unsupervised learning task that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset.
• GANs are a clever way of training a generative model by framing the problem as a
supervised learning problem with two sub-models: the generator model that we train to
generate new examples, and the discriminator model that tries to classify examples as
either real (from the domain) or fake (generated).
• The two models are trained together in an adversarial, zero-sum game until the discriminator model is fooled about half the time, meaning the generator model is generating plausible examples.
• GANs are an exciting and rapidly changing field, delivering on the promise of
generative models in their ability to generate realistic examples across a range of
problem domains, most notably in image-to-image translation tasks such as translating
photos of summer to winter or day to night, and in generating photorealistic photos of
objects, scenes, and people that even humans cannot tell are fake.
Shown below is an example of a GAN. There is a database that has real 100 rupee notes. The
generator neural network generates fake 100 rupee notes. The discriminator network will help
identify the real and fake notes.
The GAN model architecture involves two sub-models: a generator model for generating new
examples and a discriminator model for classifying whether generated examples are real, from
the domain, or fake, generated by the generator model.
• Generator. Model that is used to generate new plausible examples from the problem domain.
• Discriminator. Model that is used to classify examples as real (from the domain) or fake
(generated).
Generative adversarial networks are based on a game theoretic scenario in which the
generator network must compete against an adversary. The generator network directly
produces samples. Its adversary, the discriminator network, attempts to distinguish between
samples drawn from the training data and samples drawn from the generator.
The GAN architecture was first described in the 2014 paper by Ian Goodfellow, et al. titled
“Generative Adversarial Networks.”
A standardized approach called Deep Convolutional Generative Adversarial Networks, or
DCGAN, that led to more stable models was later formalized by Alec Radford, et al. in the
2015 paper titled “Unsupervised Representation Learning with Deep Convolutional Generative
Adversarial Networks.”
What is a Generator?
A Generator in GANs is a neural network that creates fake data on which the discriminator is trained. It learns to generate plausible data. The generated examples/instances become
negative training examples for the discriminator. It takes a fixed-length random vector carrying
noise as input and generates a sample.
The main aim of the Generator is to make the discriminator classify its output as real. The part of the GAN that trains the Generator includes the random noise input, the generator network itself, the discriminator (whose classification provides the feedback signal), and the generator loss, which penalizes the Generator for failing to fool the discriminator.
What is a Discriminator?
The Discriminator is a neural network that distinguishes real data from the fake data created by the Generator. The discriminator's training data comes from two different sources:
• The real data instances, such as real pictures of birds, humans, currency notes, etc., are used
by the Discriminator as positive samples during training.
• The fake data instances created by the Generator are used as negative examples during the
training process.
The discriminator connects to two loss functions. During discriminator training, the discriminator ignores the generator loss and uses only the discriminator loss.
In the process of training the discriminator, the discriminator classifies both real data and fake
data from the generator. The discriminator loss penalizes the discriminator for misclassifying
a real data instance as fake or a fake data instance as real.
The discriminator updates its weights through backpropagation from the discriminator loss
through the discriminator network.
How Do GANs Work?
A GAN consists of two neural networks: a Generator G(z) and a Discriminator D(x). The two play an adversarial game. The generator's aim is to fool the discriminator by producing data similar to the training set; the discriminator tries not to be fooled by distinguishing fake data from real data. Both work simultaneously to learn to model complex data such as audio, video, or image files.
The Generator network takes a random noise sample and generates a fake data sample. The Generator is trained to increase the probability that the Discriminator network makes mistakes.
Below is an example of a GAN trying to identify if the 100 rupee notes are real or fake. So,
first, a noise vector or the input vector is fed to the Generator network. The generator creates
fake 100 rupee notes. The real images of 100 rupee notes stored in a database are passed to the
discriminator along with the fake notes. The Discriminator then classifies the notes as real or fake.
We train the model, calculate the loss function at the end of the discriminator network, and
backpropagate the loss into both discriminator and generator models.
There are two main components of a GAN – Generator Neural Network and Discriminator
Neural Network.
The Generator Network takes a random input and tries to generate a sample of data. In the above image, the generator G(z) takes an input z sampled from the probability distribution p(z). It generates data that is then fed into the discriminator network D(x). The task of the Discriminator Network is to take input either from the real data or from the generator and try to predict whether the input is real or generated. It takes an input x from p_data(x), where p_data(x) is our real data distribution. D(x) then solves a binary classification problem using a sigmoid function, giving output in the range 0 to 1.
Now the training of a GAN proceeds (as we saw above) as a fight between the generator and the discriminator. This can be represented mathematically as
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
In our function V(D, G), the first term is the expected log-probability that data from the real distribution p_data(x) is recognized as real when passed through the discriminator. The discriminator tries to maximize this by pushing D(x) toward 1. The second term covers data from the random input p(z): the generator turns z into a fake sample, which is then passed through the discriminator to be identified as fake. Here the discriminator tries to push D(G(z)) toward 0, which maximizes log(1 - D(G(z))). So overall, the discriminator is trying to maximize our function V.
On the other hand, the task of the generator is exactly the opposite: it tries to minimize the function V so that the difference between real and fake data is as small as possible. In other words, it is a cat-and-mouse game between the generator and the discriminator!
• Pass 1: Train the discriminator and freeze the generator (freezing means setting training to false; the network does only a forward pass, and no backpropagation is applied).
• Pass 2: Train the generator and freeze the discriminator.
Steps to train a GAN:
Step 1: Define the problem. Decide what you want to generate (for example, fake images or fake text) and collect data for it.
Step 2: Define the architecture of the GAN. Define what your GAN should look like: should both your generator and discriminator be multilayer perceptrons, or convolutional neural networks? This step will depend on what problem you are trying to solve.
Step 3: Train the discriminator on real data for n epochs. Take the data you want to generate fakes of and train the discriminator to correctly predict it as real. Here the value n can be any natural number.
Step 4: Generate fake inputs for generator and train discriminator on fake data. Get
generated data and let the discriminator correctly predict them as fake.
Step 5: Train the generator with the output of the discriminator. Now that the discriminator is trained, you can get its predictions and use them as an objective for training the generator. Train the generator to fool the discriminator.
Step 6: Repeat steps 3 to 5 for a few epochs.
Step 7: Manually check whether the fake data seems legitimate. If it seems appropriate, stop training; otherwise go to step 3. This is a bit of a manual task, as hand-evaluating the data is the best way to check its fakeness. When this step is over, you can evaluate whether the GAN is performing well enough.
• Problem with Counting: GANs fail to learn how many of a particular object should occur at a location. For example, a generated head may contain more eyes than are naturally present.
• Problems with Global Structure: as with the problem of perspective, GANs do not understand holistic structure. For example, one generated image showed a “quadruple cow”: a cow standing on its hind legs and simultaneously on all four legs. That is definitely not possible in real life!
Implementing a Toy GAN
Let's see a toy implementation of a GAN to strengthen our theory. We will try to generate digits by training a GAN on the Identify the Digits dataset. A bit about the dataset: it contains 28×28 black-and-white images, all in “.png” format. For our task, we will only work on the training set. We will need the following Python modules:
• numpy
• pandas
• tensorflow
• keras
• keras_adversarial
Before starting with the code, let us understand the internal working through pseudocode. The pseudocode of GAN training can be thought of as follows:
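A sketch of that training loop, following the minibatch procedure from Goodfellow et al. (2014):
for each training iteration:
    for k steps:
        sample a minibatch of m noise samples {z(1), ..., z(m)} from p(z)
        sample a minibatch of m real examples {x(1), ..., x(m)} from the data
        update the discriminator by ascending its stochastic gradient of
            (1/m) * Σ [ log D(x(i)) + log(1 - D(G(z(i)))) ]
    sample a minibatch of m noise samples {z(1), ..., z(m)} from p(z)
    update the generator by descending its stochastic gradient of
        (1/m) * Σ [ log(1 - D(G(z(i)))) ]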
Note: This is the first implementation of GAN that was published in the paper. Numerous improvements/updates to the pseudocode can be seen in more recent papers, such as adding batch normalization in the generator and discriminator networks, training the generator k times, etc.
Let us first import all the modules
# import modules
%pylab inline
import os
import numpy as np
import pandas as pd
from scipy.misc import imread
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten, Reshape, InputLayer
from keras.regularizers import L1L2
To have deterministic randomness, we set a seed value (the rng object is used later when sampling images)
# to stop potential randomness
seed = 128
rng = np.random.RandomState(seed)
We then set the paths for our data
# set path
root_dir = os.path.abspath('.')
data_dir = os.path.join(root_dir, 'Data')
Let us load our data
# load data
train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))
test = pd.read_csv(os.path.join(data_dir, 'test.csv'))
temp = []
for img_name in train.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)

train_x = np.stack(temp)
train_x = train_x / 255.
To visualize what our data looks like, let us plot one of the images
# print image
img_name = rng.choice(train.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
img = imread(filepath, flatten=True)
pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()
# define variables
g_input_shape = 100
d_input_shape = (28, 28)
hidden_1_num_units = 500
hidden_2_num_units = 500
g_output_num_units = 784
d_output_num_units = 1
epochs = 25
batch_size = 128
# generator
model_1 = Sequential([
    Dense(units=hidden_1_num_units, input_dim=g_input_shape, activation='relu',
          kernel_regularizer=L1L2(1e-5, 1e-5)),
    Dense(units=hidden_2_num_units, activation='relu',
          kernel_regularizer=L1L2(1e-5, 1e-5)),
    Dense(units=g_output_num_units, activation='sigmoid',
          kernel_regularizer=L1L2(1e-5, 1e-5)),
    Reshape(d_input_shape),
])

# discriminator
model_2 = Sequential([
    InputLayer(input_shape=d_input_shape),
    Flatten(),
    Dense(units=hidden_1_num_units, activation='relu',
          kernel_regularizer=L1L2(1e-5, 1e-5)),
    Dense(units=hidden_2_num_units, activation='relu',
          kernel_regularizer=L1L2(1e-5, 1e-5)),
    Dense(units=d_output_num_units, activation='sigmoid',
          kernel_regularizer=L1L2(1e-5, 1e-5)),
])
We will then define our GAN. For that, we will first import a few important modules.
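The following is a minimal sketch of that definition step, assuming the keras_adversarial package listed earlier (its simple_gan, AdversarialModel, and gan_targets helpers); the optimizer and loss choices here are illustrative:
# import keras_adversarial helpers
from keras_adversarial import AdversarialModel, simple_gan, gan_targets
from keras_adversarial import AdversarialOptimizerSimultaneous, normal_latent_sampling

# combine generator and discriminator into one GAN, sampling z from a normal distribution
gan = simple_gan(model_1, model_2, normal_latent_sampling((100,)))

# wrap as an adversarial model with separate trainable weights for each player
model = AdversarialModel(base_model=gan,
                         player_params=[model_1.trainable_weights, model_2.trainable_weights])
model.adversarial_compile(adversarial_optimizer=AdversarialOptimizerSimultaneous(),
                          player_optimizers=['adam', 'adam'],
                          loss='binary_crossentropy')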
history = model.fit(x=train_x, y=gan_targets(train_x.shape[0]), epochs=10,
batch_size=batch_size)
To visualize training progress, we plot the loss of each player and the combined loss:
plt.plot(history.history['player_0_loss'])
plt.plot(history.history['player_1_loss'])
plt.plot(history.history['loss'])
After training for 100 epochs, I got the following generated images
Applications of Generative Adversarial Networks (GANs)
1. Generate new data from available data – generating new samples from available samples that are not identical to the real ones.
2. Generate realistic pictures of people that have never existed.
3. GANs are not limited to images; they can generate text, articles, songs, poems, etc.
4. Generate music by cloning a voice – if you provide some voice samples, GANs can generate a similar-sounding clone of it. In one research paper, researchers from NIT in Tokyo proposed a system that is able to generate melodies from lyrics with the help of learned relationships between notes and subjects.
5. Text to Image Generation (Object GAN and Object Driven GAN)
6. Creation of anime characters in Game Development and animation production.
7. Image-to-Image Translation – we can translate one image into another without changing the background of the source image. For example, GANs can replace a dog with a cat.
8. Low resolution to high resolution – if you pass a low-resolution image or video, a GAN can produce a high-resolution version of it.
9. Prediction of the next frame in a video – by training a neural network on small frames of video, GANs can generate or predict the next frame of the video.
10. Interactive image generation – GANs can generate images and video footage in an artistic form if they are trained on the right real dataset.
11. Speech – Researchers from the College of London recently published a system called
GAN-TTS that learns to generate raw audio through training on 567 corpora of speech
data.
Deep Reinforcement Learning
• Reinforcement Learning is a type of machine learning algorithm that learns to solve a
multi-level problem by trial and error. The machine is trained on real-life scenarios to
make a sequence of decisions. It receives either rewards or penalties for the actions it
performs. Its goal is to maximize the total reward.
• By Deep Reinforcement Learning we mean multiple layers of Artificial Neural
Networks that are present in the architecture to replicate the working of a human brain.
• Environment - All actions that the reinforcement learning agent makes directly affect
the environment. Here, the board of chess is the environment. The environment takes
the agent's present state and action as information and returns the reward to the agent
with a new state.
• For example, the move made by the bot will either have a negative/positive effect on
the whole game and the arrangement of the board. This will decide the next action and
state of the board.
• State - A state (S) is a particular situation in which the agent finds itself.
• Reward (R) - The environment gives feedback by which we determine the validity of
the agent’s actions in each state. It is crucial in the scenario of Reinforcement Learning
where we want the machine to learn all by itself and the only critic that would help it
in learning is the feedback/reward it receives.
• For example, in a chess game scenario, a reward occurs when the bot captures the opponent's piece.
• Discount factor - Over time, the discount factor modifies the importance of incentives.
Given the uncertainty of the future it’s better to add variance to the value estimates.
Discount factor helps in reducing the degree to which future rewards affect our value
function estimates.
• Policy (π) - It decides what action to take in a certain state to maximize the reward.
• Value (V)—It measures the optimality of a specific state. It is the expected discounted
rewards that the agent collects following the specific policy.
• Q-value or action-value - Q Value is a measure of the overall expected reward if the
agent (A) is in state (s) and takes action (a), and then plays until the end of the episode
according to some policy (π).
Markov Decision Process (MDP)
• Markov Decision Process is a Reinforcement Learning algorithm that gives us a way to
formalize sequential decision making.
• This formalization is the basis of the problems that are solved by Reinforcement Learning. The key component of a Markov Decision Process (MDP) is a decision-maker, called an agent, that interacts with the environment it is placed in.
• These interactions occur sequentially over time.
• At each time step, the agent receives some representation of the environment's state. Given this representation, the agent selects an action. The environment then transitions into a new state, and the agent receives a reward as a consequence of its previous action.
• Let’s wrap up everything that we have covered till now.
• The process of selecting an action from a given state, transitioning to a new state and
receiving a reward happens sequentially over and over again. This creates something
called a trajectory that shows the sequence of states, actions and rewards.
• Throughout the process, it is the responsibility of the reinforcement learning agent to
maximize the total amount of rewards that it received from taking actions in given states
of environments.
• The agent not only wants to maximize the immediate rewards but the cumulative reward
it receives in the whole process.
• An important point to note about the Markov Decision Process is that it does not worry
about the immediate reward but aims to maximize the total reward of the entire
trajectory.
• Sometimes, it might prefer to get a small reward in the next timestamp to get a higher
reward eventually over time.
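To make the trajectory idea concrete, here is a minimal sketch of the agent-environment interaction loop; the env object and its reset/step interface are illustrative assumptions, not a specific library's API:
# a minimal sketch of the MDP interaction loop
import random

def random_policy(state, actions):
    return random.choice(actions)           # pick an action uniformly at random

def run_episode(env, policy, max_steps=100):
    trajectory = []                         # the sequence of (state, action, reward)
    state = env.reset()
    for t in range(max_steps):
        action = policy(state, env.actions)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory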
Bellman Equations
• The Bellman optimality equation expresses the value of a state recursively in terms of the values of its successor states:
$$V(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \right]$$
• Rather than summing over numerous time steps, this equation simplifies the computation of the value function, allowing us to find the best solution to a complex problem by breaking it down into smaller, recursive subproblems.
Dynamic Programming
• In the Bellman optimality equations, if we have large state spaces, it becomes extremely difficult, and close to impossible, to solve the system of equations explicitly.
• Hence, we shift our approach from recursion to Dynamic Programming.
• Dynamic Programming is a method of solving problems by breaking them into simpler
sub-problems. In Dynamic Programming, we are going to create a lookup table to
estimate the value of each state.
• There are two classes of Dynamic Programming:
– 1. Value Iteration
– 2. Policy Iteration
• Value iteration
– In this method, the optimal policy (the optimal action for a given state) is obtained by choosing the action that maximizes the optimal state-value function for the given state.
– The optimal state-value function is obtained using an iterative update, hence the name Value Iteration.
– The Value Iteration method computes the optimal state-value function V(s) by iteratively improving its estimate. The algorithm initializes V(s) with arbitrary random values, then updates the Q(s, a) and V(s) values until they converge. Value Iteration is guaranteed to converge to the optimal values. (A small code sketch follows the policy-iteration discussion below.)
• Policy iteration
This algorithm has two phases in its working:
– 1. Policy Evaluation—It computes the values for the states in the environment
using the policy provided by the policy improvement phase.
– 2. Policy Improvement—Looking into the state values provided by the policy
evaluation part, it improves the policy so that it can get higher state values.
• Firstly, the reinforcement learning agent starts with a random policy π(0). Policy Evaluation evaluates the value functions (such as state values) for that particular policy.
• Policy Improvement then improves the policy, giving us π(1), and so on, until we reach the optimal policy, where the algorithm stops. The algorithm communicates back and forth between the two phases: Policy Improvement gives the policy to the Policy Evaluation module, which computes values.
• Then, looking at the computed state values, Policy Improvement improves the policy, and this process iterates.
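As referenced above, here is a minimal value-iteration sketch on a tiny, made-up MDP; the states, transition table, rewards, and discount factor are illustrative assumptions:
# value iteration on a toy MDP
GAMMA = 0.9
states = ['s0', 's1', 's2']
actions = ['left', 'right']

# P[(state, action)] = list of (probability, next_state, reward) triples
P = {
    ('s0', 'left'):  [(1.0, 's0', 0.0)],
    ('s0', 'right'): [(1.0, 's1', 1.0)],
    ('s1', 'left'):  [(1.0, 's0', 0.0)],
    ('s1', 'right'): [(1.0, 's2', 10.0)],
    ('s2', 'left'):  [(1.0, 's2', 0.0)],
    ('s2', 'right'): [(1.0, 's2', 0.0)],
}

V = {s: 0.0 for s in states}
for _ in range(100):  # iterate the Bellman update until the values stop changing
    V = {s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])
                for a in actions)
         for s in states}

# the optimal policy chooses the action that maximizes the state-value function
policy = {s: max(actions, key=lambda a: sum(p * (r + GAMMA * V[s2])
                                            for p, s2, r in P[(s, a)]))
          for s in states}
print(V)        # converged state values
print(policy)   # greedy policy: move right toward the +10 reward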
Q-learning
• Q-Learning combines the policy and value functions, and it tells us jointly how useful
a given action is in gaining some future reward.
• Quality is assigned to a state-action pair as Q (s,a) based on the future value that it
expects given the current state and best possible policy the agent has. Once the agent
learns this Q-Function, it looks for the best possible action at a particular state (s) that
yields the highest quality.
• Once we have an optimal Q-function (Q*), we can determine the optimal policy by
applying a Reinforcement Learning algorithm to find an action that maximizes the
value for each state.
• In other words, Q* gives the largest expected return achievable by any policy π for each
possible state-action pair.
• In the basic Q-Learning approach, we need to maintain a look-up table called q-map
for each state-action pair and the corresponding value associated with it.
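A minimal sketch of the tabular Q-Learning update that maintains the q-map described above; the hyperparameter values and the environment interface are illustrative assumptions:
# tabular Q-learning: q-map plus the standard update rule
from collections import defaultdict
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1       # learning rate, discount, exploration
Q = defaultdict(float)                      # the q-map: (state, action) -> value

def choose_action(state, actions):
    # epsilon-greedy: mostly exploit the best known action, occasionally explore
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, actions):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])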
Autoencoder
• An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner.
• The goal of an autoencoder is to:
– learn a representation for a set of data, usually for dimensionality reduction, by training the network to ignore signal noise.
– Along with the reduction side, a reconstructing side is also learned, where the
autoencoder tries to generate from the reduced encoding a representation as
close as possible to its original input. This helps autoencoders to learn important
features present in the data.
• Recently, the autoencoder concept has become more widely used for learning
generative models of data.
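To make the idea concrete, here is a minimal Keras sketch of an autoencoder, reusing the style of the Keras code earlier in this unit; the layer sizes and the 784-dimensional (flattened 28×28) input are illustrative choices:
# a minimal autoencoder: compress to a small code, then reconstruct
from keras.models import Sequential
from keras.layers import Dense, InputLayer

autoencoder = Sequential([
    InputLayer(input_shape=(784,)),        # flattened 28x28 input
    Dense(128, activation='relu'),         # encoder: compress ...
    Dense(32, activation='relu'),          # ... down to a 32-dimensional code
    Dense(128, activation='relu'),         # decoder: expand back ...
    Dense(784, activation='sigmoid'),      # ... to a reconstruction of the input
])
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# train the network to reconstruct its own input:
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)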
Types of Autoencoders
• Denoising autoencoder
• Sparse Autoencoder
• Deep Autoencoder
• Contractive Autoencoder
• Undercomplete Autoencoder
• Convolutional Autoencoder
• Variational Autoencoder
1) Denoising Autoencoder
• Denoising autoencoders create a corrupted copy of the input by introducing some noise. This prevents the autoencoder from simply copying the input to the output without learning features of the data.
• These autoencoders take a partially corrupted input while training to recover the
original undistorted input.
• The model learns a vector field for mapping the input data towards a lower
dimensional manifold which describes the natural data to cancel out the added
noise.
Advantages:
• It was introduced to achieve good representation. Such a representation is one that can
be obtained robustly from a corrupted input and that will be useful for recovering the
corresponding clean input.
• Corruption of the input can be done randomly by setting some of the input values to zero; the remaining values are copied unchanged into the noised input.
• The loss function is minimized between the output and the original, uncorrupted input.
• Setting up a single-thread denoising autoencoder is easy.
Drawbacks:
• To train an autoencoder to denoise data, it is necessary to perform preliminary stochastic
mapping in order to corrupt the data and use as input.
• This model isn't able to develop a mapping which memorizes the training data because
our input and target output are no longer the same.
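A minimal sketch of the corruption step, assuming the autoencoder sketched earlier and numpy; the noise level is an illustrative choice:
# denoising setup: corrupt the inputs, train to recover the clean originals
import numpy as np

def corrupt(x, drop_prob=0.3):
    # masking noise: randomly set a fraction of the input values to zero
    mask = np.random.binomial(1, 1.0 - drop_prob, size=x.shape)
    return x * mask

# noisy inputs as data, clean inputs as targets:
# autoencoder.fit(corrupt(x_train), x_train, epochs=10, batch_size=128)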
2) Sparse Autoencoder
• Sparse autoencoders have more hidden nodes than input nodes. They can still discover important features from the data.
• In a generic sparse-autoencoder visualization, the obscurity of a node corresponds to its level of activation.
• A sparsity constraint is introduced on the hidden layer to prevent the output layer from simply copying the input data.
• Sparsity may be obtained by additional terms in the loss function during the training
process, either by comparing the probability distribution of the hidden unit activations
with some low desired value, or by manually zeroing all but the strongest hidden unit
activations.
Advantages:
• Sparse autoencoders have a sparsity penalty, a value close to zero but not exactly zero.
Sparsity penalty is applied on the hidden layer in addition to the reconstruction error.
This prevents overfitting.
• They keep the highest activation values in the hidden layer and zero out the rest of the hidden nodes. This prevents the autoencoder from using all of the hidden nodes at a time, forcing it to use only a reduced number of hidden nodes.
Drawbacks:
• For it to work, it is essential that the individual nodes of a trained model that activate are data-dependent, and that different inputs result in activations of different nodes through the network.
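A minimal sketch of the sparsity penalty in Keras, using an L1 activity regularizer on the hidden layer; the hidden-layer width and penalty weight are illustrative choices:
# sparse autoencoder: an L1 activity penalty pushes hidden activations toward zero
from keras.models import Sequential
from keras.layers import Dense, InputLayer
from keras import regularizers

sparse_ae = Sequential([
    InputLayer(input_shape=(784,)),
    Dense(1024, activation='relu',                      # more hidden units than inputs
          activity_regularizer=regularizers.l1(1e-5)),  # sparsity penalty on activations
    Dense(784, activation='sigmoid'),
])
sparse_ae.compile(optimizer='adam', loss='binary_crossentropy')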
3) Deep Autoencoder
• Deep Autoencoders consist of two identical deep belief networks: one network for encoding and another for decoding.
• Typically, deep autoencoders have 4 to 5 layers for encoding and the next 4 to 5 layers
for decoding.
• We use unsupervised layer by layer pre-training for this model. The layers are
Restricted Boltzmann Machines which are the building blocks of deep-belief networks.
• Processing the benchmark dataset MNIST, a deep autoencoder would use binary
transformations after each RBM.
• Deep autoencoders are useful in topic modeling, or statistically modeling abstract topics
that are distributed across a collection of documents.
• They are also capable of compressing images into 30-number vectors.
Advantages:
• Deep autoencoders can be used for other types of datasets with real-valued data, on
which you would use Gaussian rectified transformations for the RBMs instead.
• Final encoding layer is compact and fast.
Disadvantages:
• There is a chance of overfitting, since there are more parameters than input data.
• Training may be troublesome, since at the stage of the decoder's backpropagation the learning rate should be lowered or made slower, depending on whether binary or continuous data is being handled.
4) Contractive Autoencoder
• The objective of a contractive autoencoder is to have a robust learned representation
which is less sensitive to small variation in the data.
• Robustness of the representation for the data is done by applying a penalty term to the
loss function.
• Contractive autoencoder is another regularization technique just like sparse and
denoising autoencoders. However, this regularizer corresponds to the Frobenius norm
of the Jacobian matrix of the encoder activations with respect to the input.
• The Frobenius norm of the Jacobian matrix for the hidden layer is calculated with respect to the input; it is the sum of the squares of all elements of the matrix.
Advantages:
• Contractive autoencoder is a better choice than denoising autoencoder to learn useful
feature extraction.
• This model learns an encoding in which similar inputs have similar encodings. Hence,
we're forcing the model to learn how to contract a neighborhood of inputs into a smaller
neighborhood of outputs.
5) Undercomplete Autoencoder
• The objective of undercomplete autoencoder is to capture the most important features
present in the data.
• Undercomplete autoencoders have a smaller dimension for hidden layer compared to
the input layer. This helps to obtain important features from the data.
• It minimizes the loss function by penalizing the reconstruction g(f(x)) for being different from the input x.
Advantages:
• Undercomplete autoencoders do not need any regularization as they maximize the
probability of data rather than copying the input to the output.
Drawbacks:
• Using an overparameterized model due to lack of sufficient training data can create
overfitting.
6) Convolutional Autoencoder
• Autoencoders in their traditional formulation do not take into account the fact that a signal can be seen as a sum of other signals.
• Convolutional Autoencoders use the convolution operator to exploit this observation.
• They learn to encode the input as a set of simple signals and then try to reconstruct the input from them, modifying the geometry or the reflectance of the image.
• They are the state-of-art tools for unsupervised learning of convolutional
filters.
• Once these filters have been learned, they can be applied to any input in order
to extract features. These features, then, can be used to do any task that requires
a compact representation of the input, like classification.
Advantages:
• Due to their convolutional nature, they scale well to realistic-sized high dimensional
images.
• Can remove noise from picture or reconstruct missing parts.
Drawbacks:
• The reconstruction of the input image is often blurry and of lower quality due to
compression during which information is lost.
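A minimal convolutional-autoencoder sketch in Keras; the filter counts and kernel sizes are illustrative choices:
# convolutional autoencoder: convolution/pooling to encode, upsampling to decode
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, UpSampling2D, InputLayer

conv_ae = Sequential([
    InputLayer(input_shape=(28, 28, 1)),
    Conv2D(16, (3, 3), activation='relu', padding='same'),   # encoder
    MaxPooling2D((2, 2), padding='same'),                    # 28x28 -> 14x14
    Conv2D(8, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2), padding='same'),                    # 14x14 -> 7x7
    Conv2D(8, (3, 3), activation='relu', padding='same'),    # decoder
    UpSampling2D((2, 2)),                                    # 7x7 -> 14x14
    Conv2D(16, (3, 3), activation='relu', padding='same'),
    UpSampling2D((2, 2)),                                    # 14x14 -> 28x28
    Conv2D(1, (3, 3), activation='sigmoid', padding='same'), # reconstruction
])
conv_ae.compile(optimizer='adam', loss='binary_crossentropy')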
7) Variational Autoencoder
• Variational autoencoder models make strong assumptions concerning the distribution
of latent variables. They use a variational approach for latent representation learning,
which results in an additional loss component and a specific estimator for the training
algorithm called the Stochastic Gradient Variational Bayes estimator.
• It assumes that the data is generated by a directed graphical model and that the encoder learns an approximation q_φ(z|x) to the posterior distribution, where φ and θ denote the parameters of the encoder (recognition model) and decoder (generative model), respectively. The probability distribution of the latent vector of a variational autoencoder typically matches that of the training data much more closely than a standard autoencoder's does.
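For reference, the standard VAE training objective (the evidence lower bound) combines a reconstruction term with a KL-divergence regularizer on the latent distribution:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$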
Advantages:
• It gives significant control over how we want to model our latent distribution unlike the
other models.
• After training you can just sample from the distribution followed by decoding and
generating new data.
Drawbacks:
• When training the model, there is a need to calculate the relationship of each parameter
in the network with respect to the final output loss using a technique known as
backpropagation. Hence, the sampling process requires some extra attention.
Applications
1. Dimensionality Reduction
2. Image Compression
3. Image Denoising
4. Feature Extraction
5. Image Generation
6. Sequence to Sequence Prediction
7. Recommendation System
Summary
• Autoencoders work by compressing the input into a latent space representation and then
reconstructing the output from this representation. This kind of network is composed
of two parts:
• Encoder: This is the part of the network that compresses the input into a latent-space
representation. It can be represented by an encoding function h=f(x).
• Decoder: This part aims to reconstruct the input from the latent space representation. It
can be represented by a decoding function r=g(h).
• If the only purpose of autoencoders was to copy the input to the output, they would be
useless. We hope that by training the autoencoder to copy the input to the output, the
latent representation will take on useful properties. This can be achieved by creating
constraints on the copying task.
• If the autoencoder is given too much capacity, it can learn to perform the copying task without extracting any useful information about the distribution of the data. This can also occur if the dimension of the latent representation is the same as the input dimension, and in the overcomplete case, where the dimension of the latent representation is greater than the input dimension. In these cases, even a linear encoder and linear decoder can learn to copy the input to the output without learning anything useful about the data distribution.
• Ideally, one could train any architecture of autoencoder successfully, choosing the code
dimension and the capacity of the encoder and decoder based on the complexity of
distribution to be modeled.
• Autoencoders are learned automatically from data examples. It means that it is easy to
train specialized instances of the algorithm that will perform well on a specific type of
input and that it does not require any new engineering, only the appropriate training
data.
• However, autoencoders do a poor job of general-purpose image compression. As an autoencoder is trained on a given set of data, it will achieve reasonable compression results on data similar to its training set, but it will be a poor general-purpose image compressor.
RBM Training
1. Gibbs Sampling
2. Contrastive Divergence
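A minimal numpy sketch of one contrastive-divergence (CD-1) update for an RBM with binary units; the array shapes and learning rate are illustrative assumptions:
# one contrastive divergence (CD-1) step for a binary RBM
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_visible, b_hidden, lr=0.1):
    # positive phase: sample hidden units given the data
    p_h0 = sigmoid(v0 @ W + b_hidden)
    h0 = (np.random.rand(*p_h0.shape) < p_h0).astype(float)
    # negative phase: one Gibbs step back to the visible layer and up again
    p_v1 = sigmoid(h0 @ W.T + b_visible)
    v1 = (np.random.rand(*p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_hidden)
    # move weights toward the data statistics and away from the model statistics
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / v0.shape[0]
    b_visible += lr * (v0 - v1).mean(axis=0)
    b_hidden += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_visible, b_hidden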
Applications of Deep Belief Networks
• Image Recognition
• Video Recognition
• Motion-Capture Data
Deep Boltzmann Machines