Autoencoders and Restricted Boltzmann Machines
Amir H. Payberah
payberah@kth.se
2020-10-22
Let’s Start With An Example
1 / 61
I Which of them is easier to memorize?
2 / 61
I Which of them is easier to memorize?
I Seq1: 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
2 / 61
I Which of them is easier to memorize?
I Seq1: 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
I Seq2: 50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20
2 / 61
Seq1 : 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
Seq2 : 50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20
3 / 61
Seq1 : 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
Seq2 : 50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20
I Seq1 is shorter, so it should be easier.
3 / 61
Seq1 : 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
Seq2 : 50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20
I Seq1 is shorter, so it should be easier.
I But, Seq2 follows two simple rules:
3 / 61
Seq1 : 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
Seq2 : 50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20
I Seq1 is shorter, so it should be easier.
I But, Seq2 follows two simple rules:
• Even numbers are followed by their half.
• Odd numbers are followed by their triple plus one.
3 / 61
Seq1 : 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
Seq2 : 50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20
I Seq1 is shorter, so it should be easier.
I But, Seq2 follows two simple rules:
• Even numbers are followed by their half.
• Odd numbers are followed by their triple plus one.
I You don’t need pattern if you could quickly and easily
memorize very long sequences
3 / 61
Seq1 : 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
Seq2 : 50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20
I Seq1 is shorter, so it should be easier.
I But, Seq2 follows two simple rules:
• Even numbers are followed by their half.
• Odd numbers are followed by their triple plus one.
I You don’t need pattern if you could quickly and easily
memorize very long sequences
I But, it is hard to memorize long sequences that makes it useful
to recognize patterns.
3 / 61
I 1970, W. Chase and H. Simon
I They observed that expert chess players were able to memorize the positions of all
the pieces in a game by looking at the board for just 5 seconds.
4 / 61
I This was only the case when the pieces were placed in realistic positions, not when
the pieces were placed randomly.
5 / 61
I This was only the case when the pieces were placed in realistic positions, not when
the pieces were placed randomly.
I Chess experts don’t have a much better memory than you and I.
5 / 61
I This was only the case when the pieces were placed in realistic positions, not when
the pieces were placed randomly.
I Chess experts don’t have a much better memory than you and I.
I They just see chess patterns more easily due to
their experience with the game.
5 / 61
I This was only the case when the pieces were placed in realistic positions, not when
the pieces were placed randomly.
I Chess experts don’t have a much better memory than you and I.
I They just see chess patterns more easily due to
their experience with the game.
I Patterns helps them store information efficiently.
5 / 61
Autoencoders
6 / 61
Autoencoders (1/5)
I Just like the chess players in this memory experiment.
7 / 61
Autoencoders (1/5)
I Just like the chess players in this memory experiment.
I An autoencoder looks at the inputs, converts them to an efficient internal represen-
tation, and then spits out something that looks very close to the inputs.
7 / 61
Autoencoders (2/5)
I The same architecture as a Multi-Layer Perceptron (MLP).
8 / 61
Autoencoders (2/5)
I The same architecture as a Multi-Layer Perceptron (MLP).
I Except that the number of neurons in the output layer must be equal to the number
of inputs.
8 / 61
Autoencoders (3/5)
I An autoencoder is always composed of two parts.
9 / 61
Autoencoders (3/5)
I An autoencoder is always composed of two parts.
I An encoder (recognition network), h = f(x)
Converts the inputs to an internal representation.
9 / 61
Autoencoders (3/5)
I An autoencoder is always composed of two parts.
I An encoder (recognition network), h = f(x)
Converts the inputs to an internal representation.
I A decoder (generative network), r = g(h)
Converts the internal representation to the outputs.
9 / 61
Autoencoders (3/5)
I An autoencoder is always composed of two parts.
I An encoder (recognition network), h = f(x)
Converts the inputs to an internal representation.
I A decoder (generative network), r = g(h)
Converts the internal representation to the outputs.
I If an autoencoder learns to set g(f(x)) = x everywhere,
it is not especially useful, why?
9 / 61
Autoencoders (4/5)
I Autoencoders are designed to be unable to learn to copy perfectly.
10 / 61
Autoencoders (4/5)
I Autoencoders are designed to be unable to learn to copy perfectly.
I The models are forced to prioritize which aspects of the input should be copied, they
often learn useful properties of the data.
10 / 61
Autoencoders (5/5)
I Autoencoders are neural networks capable of learning efficient representations of the
input data (called codings) without any supervision.
11 / 61
Autoencoders (5/5)
I Autoencoders are neural networks capable of learning efficient representations of the
input data (called codings) without any supervision.
I Dimension reduction: these codings typically have a much lower dimensionality than
the input data.
11 / 61
Different Types of Autoencoders
I Stacked autoencoders
I Denoising autoencoders
I Sparse autoencoders
I Variational autoencoders
12 / 61
Different Types of Autoencoders
I Stacked autoencoders
I Denoising autoencoders
I Sparse autoencoders
I Variational autoencoders
13 / 61
Stacked Autoencoders (1/3)
I Stacked autoencoder: autoencoders with multiple hidden layers.
14 / 61
Stacked Autoencoders (1/3)
I Stacked autoencoder: autoencoders with multiple hidden layers.
I Adding more layers helps the autoencoder learn more complex codings.
14 / 61
Stacked Autoencoders (1/3)
I Stacked autoencoder: autoencoders with multiple hidden layers.
I Adding more layers helps the autoencoder learn more complex codings.
I The architecture is typically symmetrical with regards to the central hidden layer.
14 / 61
Stacked Autoencoders (2/3)
I In a symmetric architecture, we can tie the weights of the decoder layers to the
weights of the encoder layers.
15 / 61
Stacked Autoencoders (2/3)
I In a symmetric architecture, we can tie the weights of the decoder layers to the
weights of the encoder layers.
I In a network with N layers, the decoder layer weights can be defined as wN−l+1 = wTl ,
with l = 1, 2, · · · , N2 .
15 / 61
Stacked Autoencoders (2/3)
I In a symmetric architecture, we can tie the weights of the decoder layers to the
weights of the encoder layers.
I In a network with N layers, the decoder layer weights can be defined as wN−l+1 = wTl ,
with l = 1, 2, · · · , N2 .
I This halves the number of weights in the model, speeding up training and limiting
the risk of overfitting.
15 / 61
Stacked Autoencoders (3/3)
stacked_encoder = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.Dense(100, activation="relu"),
keras.layers.Dense(30, activation="relu"),
])
stacked_decoder = keras.models.Sequential([
keras.layers.Dense(100, activation="relu", input_shape=[30]),
keras.layers.Dense(28 * 28, activation="sigmoid"),
keras.layers.Reshape([28, 28])
])
model = keras.models.Sequential([stacked_encoder, stacked_decoder])
16 / 61
Different Types of Autoencoders
I Stacked autoencoders
I Denoising autoencoders
I Sparse autoencoders
I Variational autoencoders
17 / 61
Denoising Autoencoders (1/4)
I One way to force the autoencoder to learn useful features is to add noise to its inputs,
training it to recover the original noise-free inputs.
18 / 61
Denoising Autoencoders (1/4)
I One way to force the autoencoder to learn useful features is to add noise to its inputs,
training it to recover the original noise-free inputs.
I This prevents the autoencoder from trivially copying its inputs to its outputs, so it
ends up having to find patterns in the data.
18 / 61
Denoising Autoencoders (2/4)
I The noise can be pure Gaussian noise added to the inputs, or it can be randomly
switched off inputs, just like in dropout.
19 / 61
Denoising Autoencoders (3/4)
denoising_encoder = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.Dropout(0.5),
keras.layers.Dense(100, activation="relu"),
keras.layers.Dense(30, activation="relu")
])
denoising_decoder = keras.models.Sequential([
keras.layers.Dense(100, activation="relu", input_shape=[30]),
keras.layers.Dense(28 * 28, activation="sigmoid"),
keras.layers.Reshape([28, 28])
])
model = keras.models.Sequential([denoising_encoder, denoising_decoder])
20 / 61
Denoising Autoencoders (4/4)
denoising_encoder = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.GaussianNoise(0.2),
keras.layers.Dense(100, activation="relu"),
keras.layers.Dense(30, activation="relu")
])
denoising_decoder = keras.models.Sequential([
keras.layers.Dense(100, activation="relu", input_shape=[30]),
keras.layers.Dense(28 * 28, activation="sigmoid"),
keras.layers.Reshape([28, 28])
])
model = keras.models.Sequential([denoising_encoder, denoising_decoder])
21 / 61
Different Types of Autoencoders
I Stacked autoencoders
I Denoising autoencoders
I Sparse autoencoders
I Variational autoencoders
22 / 61
Sparse Autoencoders (1/2)
I Adding an appropriate term to the cost function to push the autoencoder to reducing
the number of active neurons in the coding layer.
23 / 61
Sparse Autoencoders (1/2)
I Adding an appropriate term to the cost function to push the autoencoder to reducing
the number of active neurons in the coding layer.
I This forces the autoencoder to represent each input as a combination of a small
number of activations.
23 / 61
Sparse Autoencoders (1/2)
I Adding an appropriate term to the cost function to push the autoencoder to reducing
the number of active neurons in the coding layer.
I This forces the autoencoder to represent each input as a combination of a small
number of activations.
I As a result, each neuron in the coding layer typically ends up representing a useful
feature.
23 / 61
Sparse Autoencoders (2/2)
sparse_l1_encoder = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.Dense(100, activation="selu"),
keras.layers.Dense(300, activation="sigmoid", activity_regularizer=keras.regularizers.l1(1e-3))
])
sparse_l1_decoder = keras.models.Sequential([
keras.layers.Dense(100, activation="selu", input_shape=[300]),
keras.layers.Dense(28 * 28, activation="sigmoid"),
keras.layers.Reshape([28, 28])
])
model = keras.models.Sequential([sparse_l1_encoder, sparse_l1_decoder])
24 / 61
Different Types of Autoencoders
I Stacked autoencoders
I Denoising autoencoders
I Sparse autoencoders
I Variational autoencoders
25 / 61
Variational Autoencoders (1/6)
I Variational autoencoders are probabilistic autoencoders.
26 / 61
Variational Autoencoders (1/6)
I Variational autoencoders are probabilistic autoencoders.
I Their outputs are partly determined by chance, even after training.
• As opposed to denoising autoencoders, which use randomness only during training.
26 / 61
Variational Autoencoders (1/6)
I Variational autoencoders are probabilistic autoencoders.
I Their outputs are partly determined by chance, even after training.
• As opposed to denoising autoencoders, which use randomness only during training.
I They are generative autoencoders, meaning that they can generate new instances
that look like they were sampled from the training set.
26 / 61
Variational Autoencoders (2/6)
I Instead of directly producing a coding for a given input, the encoder produces a mean
coding µ and a standard deviation σ.
27 / 61
Variational Autoencoders (2/6)
I Instead of directly producing a coding for a given input, the encoder produces a mean
coding µ and a standard deviation σ.
I The actual coding is then sampled randomly from a Gaussian distribution with mean
µ and standard deviation σ.
27 / 61
Variational Autoencoders (2/6)
I Instead of directly producing a coding for a given input, the encoder produces a mean
coding µ and a standard deviation σ.
I The actual coding is then sampled randomly from a Gaussian distribution with mean
µ and standard deviation σ.
I After that the decoder just decodes the
sampled coding normally.
27 / 61
Variational Autoencoders (3/6)
I The cost function is composed of two parts.
28 / 61
Variational Autoencoders (3/6)
I The cost function is composed of two parts.
I 1. the usual reconstruction loss.
• Pushes the autoencoder to reproduce its inputs.
• Using cross-entropy.
28 / 61
Variational Autoencoders (3/6)
I The cost function is composed of two parts.
I 1. the usual reconstruction loss.
• Pushes the autoencoder to reproduce its inputs.
• Using cross-entropy.
I 2. the latent loss
• Pushes the autoencoder to have codings that look as though they were sampled from
a simple Gaussian distribution.
28 / 61
Variational Autoencoders (3/6)
I The cost function is composed of two parts.
I 1. the usual reconstruction loss.
• Pushes the autoencoder to reproduce its inputs.
• Using cross-entropy.
I 2. the latent loss
• Pushes the autoencoder to have codings that look as though they were sampled from
a simple Gaussian distribution.
• Using the KL divergence between the target distribution (the Gaussian distribution) and
the actual distribution of the codings.
28 / 61
Variational Autoencoders (3/6)
I The cost function is composed of two parts.
I 1. the usual reconstruction loss.
• Pushes the autoencoder to reproduce its inputs.
• Using cross-entropy.
I 2. the latent loss
• Pushes the autoencoder to have codings that look as though they were sampled from
a simple Gaussian distribution.
• Using the KL divergence between the target distribution (the Gaussian distribution) and
the actual distribution
PK of the codings.
• latent loss = − 12 1 (1 + log (σi2 ) − σi2 − µ2i )
28 / 61
Variational Autoencoders (4/6)
I Encoder part
inputs = keras.layers.Input(shape=[28, 28])
z = keras.layers.Flatten()(inputs)
z = keras.layers.Dense(150, activation="relu")(z)
z = keras.layers.Dense(100, activation="relu")(z)
codings_mean = keras.layers.Dense(10)(z)
codings_log_var = keras.layers.Dense(10)(z)
codings = Sampling()([codings_mean, codings_log_var]) # normal distribution
variational_encoder = keras.models.Model(inputs=[inputs], outputs=[codings])
29 / 61
Variational Autoencoders (5/6)
I Decoder part
decoder_inputs = keras.layers.Input(shape=[codings_size])
x = keras.layers.Dense(100, activation="relu")(decoder_inputs)
x = keras.layers.Dense(150, activation="relu")(x)
x = keras.layers.Dense(28 * 28, activation="sigmoid")(x)
outputs = keras.layers.Reshape([28, 28])(x)
variational_decoder = keras.models.Model(inputs=[decoder_inputs], outputs=[outputs])
30 / 61
Variational Autoencoders (6/6)
codings = variational_encoder(inputs)
reconstructions = variational_decoder(codings)
model = keras.models.Model(inputs=[inputs], outputs=[reconstructions])
latent_loss = -0.5 * K.sum(1 + codings_log_var - K.exp(codings_log_var)
- K.square(codings_mean), axis=-1)
model.add_loss(K.mean(latent_loss) / 784.)
31 / 61
32 / 61
Restricted Boltzmann Machines
33 / 61
Restricted Boltzmann Machines
I A Restricted Boltzmann Machine (RBM) is a stochastic neural network.
34 / 61
Restricted Boltzmann Machines
I A Restricted Boltzmann Machine (RBM) is a stochastic neural network.
I Stochastic meaning these activations have a probabilistic element, instead of deter-
ministic functions, e.g., logistic or ReLU.
34 / 61
Restricted Boltzmann Machines
I A Restricted Boltzmann Machine (RBM) is a stochastic neural network.
I Stochastic meaning these activations have a probabilistic element, instead of deter-
ministic functions, e.g., logistic or ReLU.
I The neurons form a bipartite graph:
• One visible layer and one hidden layer.
• A symmetric connection between the two layers.
• There are no connections between neurons within
a layer.
34 / 61
Let’s Start With An Example
35 / 61
RBM Example (1/11)
I We have a set of six movies, and we ask users to tell us which ones they want to
watch.
36 / 61
RBM Example (1/11)
I We have a set of six movies, and we ask users to tell us which ones they want to
watch.
I We want to learn two latent neurons (hidden neurons) underlying movie preferences,
e.g., SF/fantasy and Oscar winners
36 / 61
RBM Example (2/11)
I Our RBM would look like the following.
37 / 61
RBM Example (3/11)
I Alice: (HP=1, Avatar=1, LOTR=1, Glad=0, Titan=0, Sep=0), Big SF fan.
38 / 61
RBM Example (3/11)
I Alice: (HP=1, Avatar=1, LOTR=1, Glad=0, Titan=0, Sep=0), Big SF fan.
I Bob: (HP=1, Avatar=0, LOTR=1, Glad=0, Titan=0, Sep=0), SF fan, but not Avatar.
38 / 61
RBM Example (3/11)
I Alice: (HP=1, Avatar=1, LOTR=1, Glad=0, Titan=0, Sep=0), Big SF fan.
I Bob: (HP=1, Avatar=0, LOTR=1, Glad=0, Titan=0, Sep=0), SF fan, but not Avatar.
I Carol: (HP=1, Avat=1, LOTR=1, Glad=0, Titan=0, Sep=0), Big SF fan.
38 / 61
RBM Example (3/11)
I Alice: (HP=1, Avatar=1, LOTR=1, Glad=0, Titan=0, Sep=0), Big SF fan.
I Bob: (HP=1, Avatar=0, LOTR=1, Glad=0, Titan=0, Sep=0), SF fan, but not Avatar.
I Carol: (HP=1, Avat=1, LOTR=1, Glad=0, Titan=0, Sep=0), Big SF fan.
I David: (HP=0, Avat= 0, LOTR=1, Glad=1, Titan=1, Sep=1), Big Oscar winners fan.
38 / 61
RBM Example (3/11)
I Alice: (HP=1, Avatar=1, LOTR=1, Glad=0, Titan=0, Sep=0), Big SF fan.
I Bob: (HP=1, Avatar=0, LOTR=1, Glad=0, Titan=0, Sep=0), SF fan, but not Avatar.
I Carol: (HP=1, Avat=1, LOTR=1, Glad=0, Titan=0, Sep=0), Big SF fan.
I David: (HP=0, Avat= 0, LOTR=1, Glad=1, Titan=1, Sep=1), Big Oscar winners fan.
I Eric: (HP=0, Avat=0, LOTR=1, Glad=1, Titan=0, Sep=1), Oscar winners fan, but not Titanic.
38 / 61
RBM Example (3/11)
I Alice: (HP=1, Avatar=1, LOTR=1, Glad=0, Titan=0, Sep=0), Big SF fan.
I Bob: (HP=1, Avatar=0, LOTR=1, Glad=0, Titan=0, Sep=0), SF fan, but not Avatar.
I Carol: (HP=1, Avat=1, LOTR=1, Glad=0, Titan=0, Sep=0), Big SF fan.
I David: (HP=0, Avat= 0, LOTR=1, Glad=1, Titan=1, Sep=1), Big Oscar winners fan.
I Eric: (HP=0, Avat=0, LOTR=1, Glad=1, Titan=0, Sep=1), Oscar winners fan, but not Titanic.
I Fred: (HP=0, Avat=0, LOTR=1, Glad=1, Titan=1, Sep=1), Big Oscar winners fan.
38 / 61
RBM Example (4/11)
I Assume the given input xi is the 0 or 1 for each visible neuron vi .
• 1: like a movie, and 0: dislike a movie
39 / 61
RBM Example (4/11)
I Assume the given input xi is the 0 or 1 for each visible neuron vi .
• 1: like a movie, and 0: dislike a movie
I Compute the activation energy at hidden neuron hj :
X
a(hj ) = wij vi
i
39 / 61
RBM Example (5/11)
I For each hidden neuron hj , we compute the probability p(hj ).
X
a(hj ) = wij vi
i
1
p(hj ) = sigmoid(a(hj )) =
1 + e−a(hj )
40 / 61
RBM Example (5/11)
I For each hidden neuron hj , we compute the probability p(hj ).
X
a(hj ) = wij vi
i
1
p(hj ) = sigmoid(a(hj )) =
1 + e−a(hj )
I We turn on the hidden neuron hj with the probability p(hj ), and turn it off with
probability 1 − p(hj ).
40 / 61
RBM Example (6/11)
I Declaring that you like Harry Potter, Avatar, and LOTR, doesn’t guarantee that the
SF/fantasy hidden neuron will turn on.
41 / 61
RBM Example (6/11)
I Declaring that you like Harry Potter, Avatar, and LOTR, doesn’t guarantee that the
SF/fantasy hidden neuron will turn on.
I But it will turn on with a high probability.
41 / 61
RBM Example (6/11)
I Declaring that you like Harry Potter, Avatar, and LOTR, doesn’t guarantee that the
SF/fantasy hidden neuron will turn on.
I But it will turn on with a high probability.
• In reality, if you want to watch all three of those movies makes us highly suspect you
like SF/fantasy in general.
• But there’s a small chance you like them for other reasons.
41 / 61
RBM Example (7/11)
I Conversely, if we know that one person likes SF/fantasy (so that the SF/fantasy
neuron is on)
42 / 61
RBM Example (7/11)
I Conversely, if we know that one person likes SF/fantasy (so that the SF/fantasy
neuron is on)
I We can ask the RBM to generate a set of movie recommendations.
42 / 61
RBM Example (7/11)
I Conversely, if we know that one person likes SF/fantasy (so that the SF/fantasy
neuron is on)
I We can ask the RBM to generate a set of movie recommendations.
I The hidden neurons send messages to the visible (movie) neurons, telling them to
update their states. X
a(vi ) = wij hj
j
1
p(vi ) = sigmoid(a(vi )) =
1 + e−a(vi )
42 / 61
RBM Example (7/11)
I Conversely, if we know that one person likes SF/fantasy (so that the SF/fantasy
neuron is on)
I We can ask the RBM to generate a set of movie recommendations.
I The hidden neurons send messages to the visible (movie) neurons, telling them to
update their states. X
a(vi ) = wij hj
j
1
p(vi ) = sigmoid(a(vi )) =
1 + e−a(vi )
I Being on the SF/fantasy neuron doesn’t guarantee that we’ll always recommend all
three of Harry Potter, Avatar, and LOTR.
42 / 61
RBM Example (7/11)
I Conversely, if we know that one person likes SF/fantasy (so that the SF/fantasy
neuron is on)
I We can ask the RBM to generate a set of movie recommendations.
I The hidden neurons send messages to the visible (movie) neurons, telling them to
update their states. X
a(vi ) = wij hj
j
1
p(vi ) = sigmoid(a(vi )) =
1 + e−a(vi )
I Being on the SF/fantasy neuron doesn’t guarantee that we’ll always recommend all
three of Harry Potter, Avatar, and LOTR.
• For example not everyone who likes science fiction liked Avatar.
42 / 61
RBM Example (8/11)
I How do we learn the connection weights wij in our network?
43 / 61
RBM Example (8/11)
I How do we learn the connection weights wij in our network?
I Assume, as an input we have a bunch of binary vectors x with six elements corre-
sponding to a user’s movie preferences.
43 / 61
RBM Example (8/11)
I How do we learn the connection weights wij in our network?
I Assume, as an input we have a bunch of binary vectors x with six elements corre-
sponding to a user’s movie preferences.
I We do the following steps in each epoch:
43 / 61
RBM Example (8/11)
I How do we learn the connection weights wij in our network?
I Assume, as an input we have a bunch of binary vectors x with six elements corre-
sponding to a user’s movie preferences.
I We do the following steps in each epoch:
I 1. Take a training instance x and set the states of the visible neurons to these
preferences.
43 / 61
RBM Example (9/11)
I 2. Update the states of the hidden neurons.
44 / 61
RBM Example (9/11)
I 2. Update the states of the hidden neurons.
P
• Compute a(hj ) = i wij vi for each hidden neuron hj .
44 / 61
RBM Example (9/11)
I 2. Update the states of the hidden neurons.
P
• Compute a(hj ) = i wij vi for each hidden neuron hj .
1
• Set hj to 1 with probability p(hj ) = sigmoid(a(hj )) =
1+e−a(hj )
44 / 61
RBM Example (9/11)
I 2. Update the states of the hidden neurons.
P
• Compute a(hj ) = i wij vi for each hidden neuron hj .
1
• Set hj to 1 with probability p(hj ) = sigmoid(a(hj )) =
1+e−a(hj )
I 3. For each edge eij , compute positive(eij ) = vi × hj
• I.e., for each pair of neurons, measure whether they are both on.
44 / 61
RBM Example (10/11)
I 4. Update the state of the visible neurons in a similar manner.
45 / 61
RBM Example (10/11)
I 4. Update the state of the visible neurons in a similar manner.
0
• We denote the updated
P visible neurons with vi .
• Compute a(vi ) = j wij hj for each visible neuron v0i .
0
45 / 61
RBM Example (10/11)
I 4. Update the state of the visible neurons in a similar manner.
0
• We denote the updated
P visible neurons with vi .
• Compute a(vi ) = j wij hj for each visible neuron v0i .
0
• Set v0i to 1 with probability p(v0i ) = sigmoid(a(v0i )) = 1
0
1+e−a(vi )
45 / 61
RBM Example (10/11)
I 4. Update the state of the visible neurons in a similar manner.
0
• We denote the updated
P visible neurons with vi .
• Compute a(vi ) = j wij hj for each visible neuron v0i .
0
• Set v0i to 1 with probability p(v0i ) = sigmoid(a(v0i )) = 1
0
1+e−a(vi )
I 5. Update the hidden neurons again similar to step 2. We denote the updated hidden
neurons with h0j .
45 / 61
RBM Example (10/11)
I 4. Update the state of the visible neurons in a similar manner.
0
• We denote the updated
P visible neurons with vi .
• Compute a(vi ) = j wij hj for each visible neuron v0i .
0
• Set v0i to 1 with probability p(v0i ) = sigmoid(a(v0i )) = 1
0
1+e−a(vi )
I 5. Update the hidden neurons again similar to step 2. We denote the updated hidden
neurons with h0j .
I 6. For each edge eij , compute negative(eij ) = v0i × h0j
45 / 61
RBM Example (11/11)
I 7. Update the weight of each edge eij .
wij = wij + η(positive(eij ) − negative(eij ))
46 / 61
RBM Example (11/11)
I 7. Update the weight of each edge eij .
wij = wij + η(positive(eij ) − negative(eij ))
I 8. Repeat over all training examples.
46 / 61
RBM Example (11/11)
I 7. Update the weight of each edge eij .
wij = wij + η(positive(eij ) − negative(eij ))
I 8. Repeat over all training examples.
I 9. Continue until the error between the training examples and their reconstructions
falls below some threshold or we reach some maximum number of epochs.
46 / 61
RBM Training (1/2)
I Step 1, Gibbs sampling: what we have done in steps 1-6.
47 / 61
RBM Training (1/2)
I Step 1, Gibbs sampling: what we have done in steps 1-6.
I Given an input vector v, compute p(h|v).
47 / 61
RBM Training (1/2)
I Step 1, Gibbs sampling: what we have done in steps 1-6.
I Given an input vector v, compute p(h|v).
I Knowing the hidden values h, we use p(v|h) for prediction of new input values v.
47 / 61
RBM Training (1/2)
I Step 1, Gibbs sampling: what we have done in steps 1-6.
I Given an input vector v, compute p(h|v).
I Knowing the hidden values h, we use p(v|h) for prediction of new input values v.
I This process is repeated k times.
47 / 61
RBM Training (2/2)
I Step 2, contrastive divergence: what we have done in step 7.
• Just a fancy name for approximate gradient descent.
w = w + η(positive(e) − negative(e))
48 / 61
More Details about RBM
49 / 61
Energy-based Model (1/3)
I Energy a quantitative property of physics.
50 / 61
Energy-based Model (1/3)
I Energy a quantitative property of physics.
• E.g., gravitational energy describes the potential energy a body with mass has in
relation to another massive object due to gravity.
50 / 61
Energy-based Model (2/3)
I One purpose of deep learning models is to encode dependencies between variables.
51 / 61
Energy-based Model (2/3)
I One purpose of deep learning models is to encode dependencies between variables.
I The capturing of dependencies happen through associating of a scalar energy to each
state of the variables.
• Serves as a measure of compatibility.
51 / 61
Energy-based Model (2/3)
I One purpose of deep learning models is to encode dependencies between variables.
I The capturing of dependencies happen through associating of a scalar energy to each
state of the variables.
• Serves as a measure of compatibility.
I A high energy means a bad compatibility.
51 / 61
Energy-based Model (2/3)
I One purpose of deep learning models is to encode dependencies between variables.
I The capturing of dependencies happen through associating of a scalar energy to each
state of the variables.
• Serves as a measure of compatibility.
I A high energy means a bad compatibility.
I An energy based model tries always to minimize a predefined energy function.
51 / 61
Energy-based Model (3/3)
I The energy function for the RBMs is defined as:
X X X
E(v, h) = −( wij vi hj + bi vi + cj hj )
ij i j
52 / 61
Energy-based Model (3/3)
I The energy function for the RBMs is defined as:
X X X
E(v, h) = −( wij vi hj + bi vi + cj hj )
ij i j
I v and h represent the visible and hidden units, respectively.
52 / 61
Energy-based Model (3/3)
I The energy function for the RBMs is defined as:
X X X
E(v, h) = −( wij vi hj + bi vi + cj hj )
ij i j
I v and h represent the visible and hidden units, respectively.
I w represents the weights connecting visible and hidden units.
52 / 61
Energy-based Model (3/3)
I The energy function for the RBMs is defined as:
X X X
E(v, h) = −( wij vi hj + bi vi + cj hj )
ij i j
I v and h represent the visible and hidden units, respectively.
I w represents the weights connecting visible and hidden units.
I b and c are the biases of the visible and hidden layers, respectively.
52 / 61
RBM is a Probabilistic Model (1/2)
I The probability of a certain state of v and h:
e−E(v,h)
p(v, h) = P −E(v,h)
v,h e
53 / 61
RBM is a Probabilistic Model (1/2)
I The probability of a certain state of v and h:
e−E(v,h)
p(v, h) = P −E(v,h)
v,h e
I In physics, the joint distribution p(v, h) is known as the Boltzmann Distribution or
Gibbs Distribution.
53 / 61
RBM is a Probabilistic Model (1/2)
I The probability of a certain state of v and h:
e−E(v,h)
p(v, h) = P −E(v,h)
v,h e
I In physics, the joint distribution p(v, h) is known as the Boltzmann Distribution or
Gibbs Distribution.
I At each point in time the RBM is in a certain state.
• The state refers to the values of neurons in the visible and hidden layers v and h.
53 / 61
RBM is a Probabilistic Model (2/2)
I It is difficult to calculate the joint probability due to the huge number of possible
combination of v and h.
e−E(v,h)
p(v, h) = P −E(v,h)
v,h e
54 / 61
RBM is a Probabilistic Model (2/2)
I It is difficult to calculate the joint probability due to the huge number of possible
combination of v and h.
e−E(v,h)
p(v, h) = P −E(v,h)
v,h e
I Much easier is the calculation of the conditional probabilities of state h given the
state v and vice versa (Gibbs sampling)
p(h|v) = Πi p(hi |v)
p(v|h) = Πj p(vj |h)
54 / 61
Learning in Boltzmann Machines (1/2)
I RBMs try to learn a probability distribution from the data they are given.
55 / 61
Learning in Boltzmann Machines (1/2)
I RBMs try to learn a probability distribution from the data they are given.
I Given a training set of state vectors v, learning consists of finding parameters w of
p(v, h), in a way that the training vectors have high probability p(v).
P −E(v,h)
e
p(v|h) = P h −E(v,h)
v,h e
55 / 61
Learning in Boltzmann Machines (1/2)
I RBMs try to learn a probability distribution from the data they are given.
I Given a training set of state vectors v, learning consists of finding parameters w of
p(v, h), in a way that the training vectors have high probability p(v).
P −E(v,h)
e
p(v|h) = P h −E(v,h)
v,h e
I Use the maximum-likelihood estimation.
55 / 61
Learning in Boltzmann Machines (1/2)
I RBMs try to learn a probability distribution from the data they are given.
I Given a training set of state vectors v, learning consists of finding parameters w of
p(v, h), in a way that the training vectors have high probability p(v).
P −E(v,h)
e
p(v|h) = P h −E(v,h)
v,h e
I Use the maximum-likelihood estimation.
I For a model of the form p(v) with parameters w, the log-likelihood given a single
training example v is:
P −E(v,h)
e X X
log p(v|h) = log P h −E(v,h) = log e−E(v,h) − log e−E(v,h)
v,h e h v,h
55 / 61
Learning in Boltzmann Machines (2/2)
I The log-likelihood gradients for an RBM with binary units:
∂ log p(v|h)
= positive(eij ) − negative(eij )
∂wij
56 / 61
Learning in Boltzmann Machines (2/2)
I The log-likelihood gradients for an RBM with binary units:
∂ log p(v|h)
= positive(eij ) − negative(eij )
∂wij
I Then, we can update the weight w as follows:
(next)
wij = wij + η(positive(eij ) − negative(eij ))
56 / 61
57 / 61
Summary
58 / 61
Summary
I Autoencoders
• Stacked autoencoders
• Denoising autoencoders
• Variational autoencoders
I Restricted Boltzmann Machine
• Gibbs sampling
• Contrastive divergence
59 / 61
Reference
I Ian Goodfellow et al., Deep Learning (Ch. 14, 20)
I Aurélien Géron, Hands-On Machine Learning (Ch. 17)
60 / 61
Questions?
61 / 61