Syllabus: The purpose of GAN, An analogy from the real world, Building blocks of GAN, Implementation of GAN, Applications of GAN, Challenges of GAN models, Setup failure and bad initialization, Mode collapse, Problems with counting, Problems with perspective.
The purpose of GAN
Generative adversarial networks are machine learning systems that can learn to mimic a
given distribution of data. They were first proposed in a 2014 NeurIPS paper by deep
learning expert Ian Goodfellow and his colleagues.
GANs consist of two neural networks, one trained to generate data and the other trained to
distinguish fake data from real data (hence the “adversarial” nature of the model). Although
the idea of a structure to generate data isn’t new, when it comes to image and video
generation, GANs have provided impressive results such as:
Style transfer using CycleGAN, which can perform a number of convincing style
transformations on images
Generation of human faces with StyleGAN, as demonstrated on the website This
Person Does Not Exist
Structures that generate data, including GANs, are considered generative models, in contrast to the more widely studied discriminative models. Some generative models are able to generate samples from the model distribution; GANs are an example of such models, and they focus primarily on generating samples from the data distribution. You might be wondering why generative models are worth studying, especially generative models that are only capable of generating data rather than providing an estimate of the density function. Some of the reasons to study generative models are as follows:
Sampling (or generation) is straightforward.
Training doesn't involve maximum likelihood estimation.
They are robust to overfitting, since the generator never sees the training data directly.
GANs are good at capturing the modes of a distribution.
Let's consider the real-world relationship between a money-counterfeiting criminal and the police, and enumerate the objectives of each in terms of money:
Figure 1a: GAN real-world analogy
To become a successful money counterfeiter, the criminal needs to fool the police, so that the police can't tell the difference between the counterfeit/fake money and real money.
As a paragon of justice, the police want to detect fake money as effectively as possible.
This can be modeled as a minimax game in game theory, and the phenomenon is called an adversarial process. GAN, introduced by Ian Goodfellow in 2014, is a special case of an adversarial process in which two neural networks compete against each other. The first network generates data, and the second network tries to find the difference between the real data and the fake data generated by the first network. The second network outputs a scalar in [0, 1], which represents the probability that the input is real data.
The building blocks of GAN
In a GAN, the first network is called the generator and is often represented as G(z), and the second network is called the discriminator and is often represented as D(x):
Here are the steps a GAN takes:
The discriminator is in a feedback loop with the ground truth of the images, which
we know.
The generator is in a feedback loop with the discriminator.
At the equilibrium point, which is the optimal point in the minimax game, the first network models the real data, and the second network outputs a probability of 0.5, since the output of the first network is indistinguishable from real data:
Sometimes the two networks eventually reach equilibrium, but this is not always
guaranteed and the two networks can continue learning for a long time. An example of
learning with both generator and discriminator loss is shown in the following figure:
Figure 1c: Loss of two networks, generator and discriminator
Generator
The generator network takes as input random noise and tries to generate a sample of data.
In the preceding figure, we can see that generator G(z) takes an input z from probability
distribution p(z) and generates data that is then fed into a discriminator network D(x).
Discriminator
The discriminator network takes input either from the real data or from the generator's generated data and tries to predict whether the input is real or generated. It takes an input x from the real data distribution p_data(x) and then solves a binary classification problem, giving output as a scalar in the range 0 to 1.
GANs are gaining a lot of popularity because of their ability to tackle the important challenge
of unsupervised learning, since the amount of available unlabeled data is much larger than
the amount of labeled data. Another reason for their popularity is that GANs are able to
generate the most realistic images among generative models. Although this is subjective, it
is an opinion shared by most practitioners.
Consider a dataset consisting of points (x₁, x₂) located on a sine curve, having a very particular distribution. The overall structure of a GAN to generate pairs (x̃₁, x̃₂) resembling the samples of this dataset is shown in the following figure:
The generator G is fed with random data from a latent space, and its role is to generate data
resembling the real samples. In this example, you have a two-dimensional latent space, so
that the generator is fed with random (z₁, z₂) pairs and is required to transform them so that
they resemble the real samples.
The structure of the neural network G can be arbitrary, allowing you to use networks such as a multilayer perceptron (MLP), a convolutional neural network (CNN), or any other structure, as long as the dimensions of the input and output match the dimensions of the latent space and the real data.
The discriminator D is fed with either real samples from the training dataset or generated
samples provided by G. Its role is to estimate the probability that the input belongs to the
real dataset. The training is performed so that D outputs 1 when it’s fed a real sample and 0
when it’s fed a generated sample.
As with G, you can choose an arbitrary neural network structure for D as long as it respects
the necessary input and output dimensions. In this example, the input is two-dimensional.
For a binary discriminator, the output may be a scalar ranging from 0 to 1.
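As a concrete illustration, here is a minimal sketch of such a G and D for the two-dimensional example, written as Keras MLPs; the layer widths and activations are illustrative assumptions, not prescribed above:

import tensorflow as tf
from tensorflow import keras

latent_dim = 2  # the two-dimensional latent space (z1, z2) from the example

# G: maps a random (z1, z2) pair to a generated (x1, x2) pair
generator = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=[latent_dim]),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(2),  # output dimension matches the real data (x1, x2)
])

# D: maps an (x1, x2) pair to the probability that it is a real sample
discriminator = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=[2]),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # scalar in [0, 1]
])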
The GAN training process consists of a two-player minimax game in which D is adapted to
minimize the discrimination error between real and generated samples, and G is adapted to
maximize the probability of D making a mistake.
Although the dataset containing the real data isn’t labeled, the training processes
for D and G are performed in a supervised way. At each step in the training, D and G have
their parameters updated. In fact, in the original GAN proposal, the parameters of D are
updated k times, while the parameters of G are updated only once for each training step.
However, to make the training simpler, you can consider k equal to 1.
To train D, at each iteration you label some real samples taken from the training data as 1
and some generated samples provided by G as 0. This way, you can use a conventional
supervised training framework to update the parameters of D in order to minimize a loss
function, as shown in the following scheme:
For each batch of training data containing labeled real and generated samples, you update
the parameters of D to minimize a loss function. After the parameters of D are updated, you
train G to produce better generated samples. The output of G is connected to D, whose
parameters are kept frozen, as depicted here:
You can imagine the system composed of G and D as a single classification system that
receives random samples as input and outputs the classification, which in this case can be
interpreted as a probability.
When G does a good enough job to fool D, the output probability should be close to 1. You
could also use a conventional supervised training framework here: the dataset to train the
classification system composed of G and D would be provided by random input samples,
and the label associated with each input sample would be 1.
During training, as the parameters of D and G are updated, it’s expected that the generated
samples given by G will more closely resemble the real data, and D will have more trouble
distinguishing between real and generated data.
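The two-phase procedure above can be sketched as follows, reusing the G and D from the previous sketch; the sine-curve sampler, batch size, and step count are assumptions for illustration:

import numpy as np

# compile D on its own, then freeze it inside the combined model so that
# training the combined model updates only G's parameters
discriminator.compile(loss="binary_crossentropy", optimizer="adam")
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(loss="binary_crossentropy", optimizer="adam")

batch_size = 64
for step in range(5000):
    # Phase 1: train D on real samples labeled 1 and generated samples labeled 0
    x1 = np.random.uniform(0, 2 * np.pi, size=(batch_size, 1))
    real = np.hstack([x1, np.sin(x1)])            # points on the sine curve
    z = np.random.normal(size=(batch_size, latent_dim))
    fake = generator.predict(z, verbose=0)
    # D was compiled before being frozen, so train_on_batch still updates it
    discriminator.train_on_batch(
        np.vstack([real, fake]),
        np.vstack([np.ones((batch_size, 1)), np.zeros((batch_size, 1))]))
    # Phase 2: train G through the frozen D, labeling the random inputs as 1
    z = np.random.normal(size=(batch_size, latent_dim))
    gan.train_on_batch(z, np.ones((batch_size, 1)))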
Implementation of GAN
The generator(z) takes as input a 100-dimensional vector from a random distribution (in this case we are using a uniform distribution) and returns a 784-dimensional vector, which is an MNIST image (28x28). The z here is the prior for G(z); in this way, the generator learns a mapping from the prior space to p_data (the real data distribution).
The discriminator(x), in contrast, takes MNIST image(s) as input and returns a scalar that represents the probability that the image is real. Now, let's discuss an algorithm for training a GAN.
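The snippets below use generator, discriminator, theta_D, theta_G, X, and Z without showing their definitions. A plausible TF1-style sketch consistent with those calls (the hidden width of 128 and the initialization are assumptions):

import tensorflow as tf   # TensorFlow 1.x style API
import numpy as np

Z_dim = 100
X = tf.placeholder(tf.float32, shape=[None, 784])    # flattened MNIST images
Z = tf.placeholder(tf.float32, shape=[None, Z_dim])  # noise prior

def xavier_init(size):
    return tf.random_normal(shape=size, stddev=1. / tf.sqrt(size[0] / 2.))

G_W1 = tf.Variable(xavier_init([Z_dim, 128])); G_b1 = tf.Variable(tf.zeros([128]))
G_W2 = tf.Variable(xavier_init([128, 784]));   G_b2 = tf.Variable(tf.zeros([784]))
theta_G = [G_W1, G_W2, G_b1, G_b2]

D_W1 = tf.Variable(xavier_init([784, 128])); D_b1 = tf.Variable(tf.zeros([128]))
D_W2 = tf.Variable(xavier_init([128, 1]));   D_b2 = tf.Variable(tf.zeros([1]))
theta_D = [D_W1, D_W2, D_b1, D_b2]

def generator(z):
    # map the 100-dimensional prior to a 784-dimensional MNIST image
    G_h1 = tf.nn.relu(tf.matmul(z, G_W1) + G_b1)
    return tf.nn.sigmoid(tf.matmul(G_h1, G_W2) + G_b2)

def discriminator(x):
    # map an image to the probability (and logit) of it being real
    D_h1 = tf.nn.relu(tf.matmul(x, D_W1) + D_b1)
    D_logit = tf.matmul(D_h1, D_W2) + D_b2
    return tf.nn.sigmoid(D_logit), D_logit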
G_sample = generator(Z)
D_real, D_logit_real = discriminator(X)
D_fake, D_logit_fake = discriminator(G_sample)
# Loss functions according to the original GAN paper
D_loss = -tf.reduce_mean(tf.log(D_real) + tf.log(1. - D_fake))
G_loss = -tf.reduce_mean(tf.log(D_fake))
The TensorFlow optimizer can only do minimization, so in order to maximize an objective we minimize its negative, hence the negative signs on the losses seen previously. Also, as per the paper's pseudo-algorithm, it's better for the generator to maximize tf.reduce_mean(tf.log(D_fake)) instead of minimizing tf.reduce_mean(tf.log(1. - D_fake)), as the former provides stronger gradients early in training. We then train the networks one by one with those preceding loss functions:
# Only update D(X)'s parameters, so var_list = theta_D
D_solver = tf.train.AdamOptimizer().minimize(D_loss, var_list=theta_D)
# Only update G(Z)'s parameters, so var_list = theta_G
G_solver = tf.train.AdamOptimizer().minimize(G_loss, var_list=theta_G)
def sample_Z(m, n):
    '''Uniform prior for G(Z)'''
    return np.random.uniform(-1., 1., size=[m, n])

for it in range(1000000):
    X_mb, _ = mnist.train.next_batch(mb_size)
    _, D_loss_curr = sess.run([D_solver, D_loss],
                              feed_dict={X: X_mb, Z: sample_Z(mb_size, Z_dim)})
    _, G_loss_curr = sess.run([G_solver, G_loss],
                              feed_dict={Z: sample_Z(mb_size, Z_dim)})
After that, we start with random noise, and as training continues, G(Z) starts moving towards p_data. This is evident from the samples generated by G(Z) becoming more and more similar to the original MNIST images.
Applications of GAN
GANs are generating a lot of excitement in a wide variety of fields. Some of the exciting applications of GANs in recent years are listed as follows:
Translating one image to another (such as horse to zebra) with CycleGAN, and performing image editing through Conditional GAN.
Automatic synthesis of realistic images from a textual sentence using StackGAN, and transferring style from one domain to another using Discovery GAN (DiscoGAN).
Enhancing image quality and generating high-resolution images with pre-trained models using SRGAN.
Generating a realistic image from attributes: let's say a burglar comes to your apartment, but you don't have a picture of him/her. Now the system at the police station could generate a realistic image of the thief based on the description provided by you and search a database.
Predicting the next frame in a video, or dynamic video generation.
Generative Adversarial Networks (GANs) were introduced in 2014 by Ian J. Goodfellow and co-authors. GANs perform unsupervised learning tasks in machine learning. A GAN consists of two models that automatically discover and learn the patterns in input data. The two models compete with each other to scrutinize, capture, and replicate the variations within a dataset. GANs can be used to generate new examples that plausibly could have been drawn from the original dataset.
Shown below is an example of a GAN. There is a database that has real 100-rupee notes. The generator neural network generates fake 100-rupee notes, and the discriminator network helps identify the real and fake notes.
What is a Generator?
A Generator in GANs is a neural network that creates fake data on which the discriminator is trained. It learns to generate plausible data; the generated examples/instances become negative training examples for the discriminator. It takes a fixed-length random vector carrying noise as input and generates a sample.
The main aim of the Generator is to make the discriminator classify its output as real. The part of the GAN that trains the Generator includes the random input, the generator network that transforms it into a data instance, the discriminator network that classifies the generated data, and the generator loss that penalizes the Generator for failing to fool the discriminator.
Let’s see the next topic in this article on what GANs are, i.e., a Discriminator.
What is a Discriminator?
The Discriminator is a neural network that distinguishes real data from the fake data created by the Generator. The discriminator's training data comes from two different sources:
The real data instances, such as real pictures of birds, humans, currency notes, etc., are
used by the Discriminator as positive samples during training.
The fake data instances created by the Generator are used as negative examples during
the training process.
The discriminator connects to two loss functions; during discriminator training, it ignores the generator loss and uses only the discriminator loss.
In the process of training the discriminator, the discriminator classifies both real data and
fake data from the generator. The discriminator loss penalizes the discriminator for
misclassifying a real data instance as fake or a fake data instance as real.
The discriminator updates its weights through backpropagation from the discriminator loss
through the discriminator network.
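The two penalties just described amount to standard binary cross-entropy. A minimal sketch (the helper name and the use of tf.keras losses are our own choices, not from the notes):

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(real_output, fake_output):
    # penalize real data scored as fake and fake data scored as real
    real_loss = bce(tf.ones_like(real_output), real_output)   # real should score 1
    fake_loss = bce(tf.zeros_like(fake_output), fake_output)  # fake should score 0
    return real_loss + fake_loss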
Now, let's learn how GANs work.
A GAN consists of two neural networks: a Generator G(z) and a Discriminator D(x). The two play an adversarial game. The generator's aim is to fool the discriminator by producing data similar to the training set; the discriminator tries not to be fooled, separating fake data from real data. Both work simultaneously to learn and model complex data such as audio, video, or image files.
The Generator network takes a random sample and generates a fake data sample. The Generator is trained to increase the probability of the Discriminator network making mistakes.
Below is an example of a GAN trying to identify whether 100-rupee notes are real or fake. First, a noise vector (the input vector) is fed to the Generator network, which creates fake 100-rupee notes. The real images of 100-rupee notes stored in a database are then passed to the Discriminator along with the fake notes, and the Discriminator classifies the notes as real or fake.
We train the model, calculate the loss function at the end of the discriminator network, and
backpropagate the loss into both discriminator and generator models.
Mathematical Equation
The two networks play the following minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

Here,
G = Generator
D = Discriminator
p_data(x) = distribution of real data
p_z(z) = distribution of the noise fed to the Generator
D(x) = probability that x came from the real data rather than from G
Vanilla GANs: Vanilla GANs have a min-max optimization formulation in which the Discriminator is a binary classifier that uses sigmoid cross-entropy loss during optimization. Both the Generator and the Discriminator in a vanilla GAN are multilayer perceptrons, and the algorithm optimizes the equation above using stochastic gradient descent.
Deep Convolutional GANs (DCGANs): DCGANs use convolutional neural networks rather than plain multilayer perceptrons for both the Discriminator and the Generator. They are more stable and generate better-quality images. The Generator is a set of convolution layers with fractionally-strided (transpose) convolutions, so it up-samples its input at every convolutional layer; the Discriminator is a set of convolution layers with strided convolutions, so it down-samples its input at every convolution layer.
Conditional GANs: Vanilla GANs can be extended into conditional models by using extra label information to generate better results. In a CGAN, an additional parameter 'y' is given to the Generator so that it generates data corresponding to that label. Labels are also fed as input to the Discriminator to help it distinguish the real data from the fake generated data.
Super Resolution GANs: SRGANs use deep neural networks along with an adversarial
network to produce higher resolution images. SRGANs generate a photorealistic high-
resolution image when given a low-resolution image.
Generative methods are a very powerful tool that can be used to solve a number of
problems. Their goal is to generate new data samples that are likely to belong to the training
dataset. Generative methods can do this in two ways, by learning an approximate
distribution of the data space then sampling from it, or by learning to generate samples that
are likely to belong to this data space (avoiding the step of approximating the data
distribution).
Above you can see a diagram of the architecture of GANs. GANs consist of two networks
(generator and discriminator) that are essentially competing against each other; the two
networks have adversarial goals.
The generator attempts to maximize the probability of fooling the discriminator into thinking
its generated images are real. The discriminator’s goal is to correctly classify the real data as
real, and the generated data as fake. These objectives are expressed in the loss functions of
the networks, which will be optimized during training.
In GANs, the generator's loss function is minimized while the discriminator's loss function is maximized. The generator attempts to maximize the number of samples the discriminator misclassifies as real (its false positives), and the discriminator attempts to maximize its classification accuracy.
The goals are naturally opposite, and therefore so are the gradients used to train the networks. This can become a problem, which I will discuss later.
Once training is complete, the generator is all we care about. The generator takes in a random noise vector and outputs the image that is most likely to belong to the training data space. Remember that even though this effectively learns a mapping between the random variable (z) and the image data space, there is no guarantee that the mapping between the two spaces will be smooth. GANs do not learn the distribution of the data; they learn how to generate samples similar to those belonging to the training data.
Applications of GANs
With the help of DCGANs, you can train on images of cartoon characters to generate faces of anime characters as well as Pokemon characters.
GANs can be trained on the images of humans to generate realistic faces. The faces that
you see below have been generated using GANs and do not exist in reality.
GANs can build realistic images from textual descriptions of objects like birds, humans, and other animals. We input a sentence and generate multiple images fitting the description. Below is an example of text-to-image translation using GANs for a bird with a black head, a yellow body, and a short beak.
A major disadvantage of GANs is that, as mentioned earlier, the discriminator and generator have opposite objectives and therefore gradients of opposite sign. It can be shown that when optimizing a GAN, a minimum will not be achieved; instead, the optimization algorithm will end up in a saddle point.
Another common problem with GANs is that when training these models, it is easy for the
discriminator to overpower the generator. The discriminator simply gets too good too quickly
and the generator is unable to learn how to generate images that fool the discriminator.
Intuitively this makes sense: a classification task will always be easier than the generator's task of learning how to generate new samples.
DCGAN
DCGAN uses convolutional layers in the discriminator and convolutional-transpose layers in the generator. It was proposed by Radford et al. in the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. Here the discriminator consists of strided convolution layers, batch normalization layers, and LeakyReLU activation functions, and it takes a 3x64x64 input image. The generator consists of convolutional-transpose layers, batch normalization layers, and ReLU activations, and its output is a 3x64x64 RGB image.
Architecture
The generator of the DCGAN architecture takes a 100-dimensional noise vector as input. First, it projects and reshapes this input to 4x4x1024, and then performs a fractionally-strided convolution four times with a stride of 1/2 (this means that every time it is applied, the image dimension doubles while the number of output channels is reduced). The generated output has dimensions of (64, 64, 3). There are some architectural changes proposed in the generator, such as the removal of all fully connected layers and the use of Batch Normalization, which helps in stabilizing training. In the paper, the authors use the ReLU activation function in all layers of the generator except for the output layer, which uses tanh. We will implement a generator following similar guidelines, but not exactly the same architecture.
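Following that description, a Keras sketch of the original 64x64 generator might look as follows (the 5x5 kernel size and "same" padding are assumptions consistent with the paper's guidelines, not exact details from the notes):

from tensorflow import keras

dcgan_generator = keras.models.Sequential([
    keras.layers.Dense(4 * 4 * 1024, input_shape=[100]),  # project the noise vector
    keras.layers.Reshape([4, 4, 1024]),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2DTranspose(512, (5, 5), strides=(2, 2), padding="same",
                                 activation="relu"),      # -> 8x8x512
    keras.layers.BatchNormalization(),
    keras.layers.Conv2DTranspose(256, (5, 5), strides=(2, 2), padding="same",
                                 activation="relu"),      # -> 16x16x256
    keras.layers.BatchNormalization(),
    keras.layers.Conv2DTranspose(128, (5, 5), strides=(2, 2), padding="same",
                                 activation="relu"),      # -> 32x32x128
    keras.layers.Conv2DTranspose(3, (5, 5), strides=(2, 2), padding="same",
                                 activation="tanh"),      # -> 64x64x3 output
])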
The role of the discriminator here is to determine whether an image comes from the real dataset or from the generator. The discriminator can be designed much like a convolutional neural network performing an image classification task. However, the authors of the paper suggested some changes in the discriminator architecture: instead of fully connected layers, they used only strided convolutions with LeakyReLU as the activation function. The input of the discriminator is a single image (from the dataset or generated), and the output is a score that determines whether the image is real or generated.
Implementation
In this section we will be discussing the implementation of DCGAN in Keras, since our
dataset in the Fashion MNIST dataset, this dataset contains images of size (28, 28) of 1
color channel instead of (64, 64) of 3 color channels. So, we need to make some changes in
the architecture, we will be discussing these changes as we go along.
In the first step, we import the necessary modules such as TensorFlow, Keras, matplotlib, etc. We will be using TensorFlow 2, which provides built-in support for the Keras library as its default high-level API.
Now we load the Fashion-MNIST dataset. Conveniently, the dataset can be imported from the tf.keras.datasets API, so we don't need to download and load the files manually. This dataset contains 60k training images and 10k test images, each of dimension (28, 28, 1). Since the value of each pixel is in the range [0, 255], we divide these values by 255 to normalize them.
Next, we visualize some of the images from the Fashion-MNIST dataset using the matplotlib library.
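Since the notes describe these steps without showing the code, here is a minimal sketch of the import, loading, normalization, and visualization steps, assuming TensorFlow 2 (variable names follow the later snippets):

import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt

# load Fashion-MNIST directly from the tf.keras.datasets API
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()

# pixel values lie in [0, 255]; divide by 255 to normalize
x_train = x_train.astype(np.float32) / 255.0

# visualize a few images from the dataset
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.imshow(x_train[i], cmap="binary")
    plt.axis("off")
plt.show()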
# code
batch_size = 32

# This dataset fills a buffer with buffer_size elements,
# then randomly samples elements from this buffer,
# replacing the selected elements with new elements.
def create_batch(x_train):
    dataset = tf.data.Dataset.from_tensor_slices(x_train).shuffle(1000)
    # Combines consecutive elements of this dataset into batches.
    dataset = dataset.batch(batch_size, drop_remainder=True).prefetch(1)
    return dataset
Now, we define the generator architecture. This generator takes a vector of size 100, first projects it with a dense layer and reshapes it into a (7, 7, 128) tensor, and then applies transpose convolutions in combination with batch normalization. The output of this generator is a generated image of dimension (28, 28, 1).
# code
num_features = 100

generator = keras.models.Sequential([
    keras.layers.Dense(7 * 7 * 128, input_shape=[num_features]),
    keras.layers.Reshape([7, 7, 128]),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2DTranspose(
        64, (5, 5), (2, 2), padding="same", activation="selu"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2DTranspose(
        1, (5, 5), (2, 2), padding="same", activation="tanh"),
])
generator.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 6272) 633472
_________________________________________________________________
reshape (Reshape) (None, 7, 7, 128) 0
_________________________________________________________________
batch_normalization (BatchNo (None, 7, 7, 128) 512
_________________________________________________________________
conv2d_transpose (Conv2DTran (None, 14, 14, 64) 204864
_________________________________________________________________
batch_normalization_1 (Batch (None, 14, 14, 64) 256
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 28, 28, 1) 1601
=================================================================
Total params: 840,705
Trainable params: 840,321
Non-trainable params: 384
_________________________________________________________________
Now, we define the discriminator architecture. The discriminator takes an image of size 28x28 with 1 color channel and outputs a scalar value representing the probability that the image comes from the real dataset rather than from the generator.
python3
discriminator = keras.models.Sequential([
    keras.layers.Conv2D(64, (5, 5), (2, 2), padding="same",
                        input_shape=[28, 28, 1]),
    keras.layers.LeakyReLU(0.2),
    keras.layers.Dropout(0.3),
    keras.layers.Conv2D(128, (5, 5), (2, 2), padding="same"),
    keras.layers.LeakyReLU(0.2),
    keras.layers.Dropout(0.3),
    keras.layers.Flatten(),
    keras.layers.Dense(1, activation='sigmoid')
])
discriminator.summary()
discriminator.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 14, 14, 64) 1664
_________________________________________________________________
leaky_re_lu (LeakyReLU) (None, 14, 14, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 14, 14, 64) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 7, 7, 128) 204928
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 7, 7, 128) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 7, 7, 128) 0
_________________________________________________________________
flatten (Flatten) (None, 6272) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 6273
=================================================================
Total params: 212,865
Trainable params: 212,865
Non-trainable params: 0
_________________________________________________________________
Now we need to compile our DCGAN model (the combination of generator and discriminator). We first compile the discriminator; we then set its trainable attribute to False so that, when the combined model is trained, only the generator's weights are updated.
python3
# compile the discriminator using binary cross-entropy loss and the Adam optimizer
discriminator.compile(loss="binary_crossentropy", optimizer="adam")
# freeze the discriminator's weights inside the combined model
discriminator.trainable = False
# combine the generator and discriminator
gan = keras.models.Sequential([generator, discriminator])
# compile the combined GAN (which trains the generator) with the same loss and optimizer
gan.compile(loss="binary_crossentropy", optimizer="adam")
To monitor progress, we save a 5x5 grid of generated images at the end of each epoch. The notes show only the plotting loop; here it is wrapped in a helper function whose name and arguments are our own:
def generate_and_save_images(predictions, epoch):
    for i in range(25):
        plt.subplot(5, 5, i + 1)
        plt.imshow(predictions[i, :, :, 0] * 127.5 + 127.5, cmap='binary')
        plt.axis('off')
    plt.savefig('image_epoch_{:04d}.png'.format(epoch))
Now, we need to train the model, but before that we also need to create batches of training data and add a dimension representing the number of color channels. Since the generator's tanh output lies in [-1, 1], we also rescale the pixel values to that range.
python3
# reshape to add a color channel and rescale pixels from [0, 1] to [-1, 1]
x_train_dcgan = x_train.reshape(-1, 28, 28, 1) * 2. - 1.
# create batches
dataset = create_batch(x_train_dcgan)
# call the training function with 10 epochs (the %%time magic can be used to record run time)
train_dcgan(gan, dataset, batch_size, num_features, epochs=10)
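The train_dcgan function called above is not defined in these notes. A plausible sketch of the two-phase training loop, consistent with the alternating-update description earlier (the normal noise distribution and per-batch label construction are assumptions):

def train_dcgan(gan, dataset, batch_size, num_features, epochs=10):
    generator, discriminator = gan.layers
    for epoch in range(epochs):
        for X_batch in dataset:
            # Phase 1: train the discriminator on generated images (label 0)
            # and real images (label 1)
            noise = tf.random.normal(shape=[batch_size, num_features])
            generated_images = generator(noise)
            X_fake_and_real = tf.concat([generated_images, X_batch], axis=0)
            y1 = tf.constant([[0.]] * batch_size + [[1.]] * batch_size)
            discriminator.trainable = True
            discriminator.train_on_batch(X_fake_and_real, y1)
            # Phase 2: train the generator through the frozen discriminator,
            # labeling the noise inputs as real
            noise = tf.random.normal(shape=[batch_size, num_features])
            y2 = tf.constant([[1.]] * batch_size)
            discriminator.trainable = False
            gan.train_on_batch(noise, y2)
        # save a grid of samples at the end of each epoch
        predictions = generator(tf.random.normal(shape=[25, num_features]))
        generate_and_save_images(predictions.numpy(), epoch)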
Now we define a function that takes the saved images and converts them into a GIF; a minimal version uses imageio's writer:
python3
import imageio
import glob

anim_file = 'dcgan_results.gif'
with imageio.get_writer(anim_file, mode='I') as writer:
    for filename in sorted(glob.glob('image_epoch_*.png')):
        writer.append_data(imageio.imread(filename))
ADDITIONAL: CycleGAN
Let X be a set of images of horses and Y be a set of images of zebras. The goal is to learn a mapping function G: X -> Y such that images generated by G(X) are indistinguishable from images of Y. This objective is achieved using an adversarial loss. The formulation not only learns G, but also learns an inverse mapping function F: Y -> X, and uses a cycle-consistency loss to enforce F(G(X)) ≈ X and vice versa.
While training, two kinds of training observations can be given as input. One set of observations has paired images {xᵢ, yᵢ}, where each xᵢ has its yᵢ counterpart. The other set has a set of images from X and another set of images from Y, without any matching between xᵢ and yᵢ.
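A minimal sketch of the cycle-consistency term described above, assuming an L1 reconstruction penalty; the helper name and the weighting factor lam = 10.0 follow common practice and are assumptions here:

import tensorflow as tf

def cycle_consistency_loss(real_x, cycled_x, lam=10.0):
    # penalize the difference between x and F(G(x)), pushing F(G(X)) towards X
    return lam * tf.reduce_mean(tf.abs(real_x - cycled_x))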