DL Lecture8 Autoencoder

• Autoencoders are neural networks trained to copy their input to their output. They have an encoder that maps the input to a hidden representation and a decoder that maps the hidden representation back to the output.
• Undercomplete autoencoders constrain the hidden representation to have smaller dimension than the input, forcing it to learn a compressed representation; with linear maps and mean squared error this reduces to learning the principal components of the data.
• Regularized autoencoders add regularization terms to encourage other properties, like sparsity of the hidden representation (sparse autoencoder) or robustness to noise (denoising autoencoder).


Deep Learning Basics

Lecture 8: Autoencoder & DBM


Princeton University COS 495
Instructor: Yingyu Liang
Autoencoder
Autoencoder
• Neural networks trained to attempt to copy their input to their output

• Contain two parts:


• Encoder: map the input to a hidden representation
• Decoder: map the hidden representation to the output
Autoencoder

[Figure: input 𝑥 → hidden representation ℎ (the code) → reconstruction 𝑟]
Autoencoder

[Figure: encoder 𝑓(⋅) maps 𝑥 to ℎ; decoder 𝑔(⋅) maps ℎ to 𝑟]

ℎ = 𝑓(𝑥),   𝑟 = 𝑔(ℎ) = 𝑔(𝑓(𝑥))
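As a concrete sketch of this structure (not part of the original slides; layer sizes and activation choices are arbitrary assumptions), a minimal PyTorch autoencoder might look like:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder f: input x -> hidden representation (code) h
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        # Decoder g: code h -> reconstruction r
        self.decoder = nn.Sequential(nn.Linear(code_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)   # h = f(x)
        r = self.decoder(h)   # r = g(h) = g(f(x))
        return r

x = torch.rand(16, 784)       # a batch of fake inputs in [0, 1]
model = Autoencoder()
print(model(x).shape)         # torch.Size([16, 784])

Here the encoder and decoder are single layers; deeper encoders and decoders follow the same pattern.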
Why want to copy input to output
• We do not really care about the copying itself

• Interesting case: NOT able to copy exactly but strive to do so


• Autoencoder forced to select which aspects to preserve and thus
hopefully can learn useful properties of the data

• Historical note: goes back to (LeCun, 1987; Bourlard and Kamp, 1988;
Hinton and Zemel, 1994).
Undercomplete autoencoder
• Constrain the code to have smaller dimension than the input
• Training: minimize a loss function
𝐿(𝑥, 𝑟) = 𝐿(𝑥, 𝑔(𝑓(𝑥)))

[Figure: 𝑥 → ℎ → 𝑟]
Undercomplete autoencoder
• Constrain the code to have smaller dimension than the input
• Training: minimize a loss function
𝐿(𝑥, 𝑟) = 𝐿(𝑥, 𝑔(𝑓(𝑥)))

• Special case: 𝑓, 𝑔 linear, 𝐿 mean square error


• Reduces to Principal Component Analysis
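The following NumPy sketch (synthetic data, illustrative only) shows the PCA solution that an undercomplete linear autoencoder with mean squared error would recover:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # 500 samples, input dimension 20 (synthetic)
X = X - X.mean(axis=0)              # center the data
k = 5                               # code dimension, smaller than the input

# Top-k principal directions from the SVD of the centered data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                        # shape (20, k)

H = X @ V                           # linear "encoder": h = V^T x
R = H @ V.T                         # linear "decoder": r = V h
print("mean squared reconstruction error:", np.mean((X - R) ** 2))
# A linear autoencoder trained with mean squared error converges to the same
# top-k principal subspace, hence the same minimal reconstruction error.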
Undercomplete autoencoder
• What about nonlinear encoder and decoder?

• Capacity should not be too large


• Suppose we are given data 𝑥1, 𝑥2, …, 𝑥𝑛
• Encoder maps 𝑥𝑖 to 𝑖
• Decoder maps 𝑖 to 𝑥𝑖
• A one-dimensional ℎ then suffices for perfect reconstruction (see the sketch below)
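The point can be made concrete with a deliberately degenerate example (purely illustrative): a lookup-table "autoencoder" with unlimited capacity reconstructs perfectly from a one-dimensional code yet captures no structure in the data.

# A degenerate "autoencoder" with unlimited capacity: the encoder maps x_i to
# the index i, the decoder maps i back to x_i. Reconstruction is perfect with a
# one-dimensional code, but nothing about the data is actually learned.
data = [(0.2, 0.7, 0.1), (0.9, 0.4, 0.5), (0.3, 0.3, 0.8)]   # x_1, ..., x_n

encode = {x: i for i, x in enumerate(data)}    # encoder: x_i -> i
decode = {i: x for i, x in enumerate(data)}    # decoder: i -> x_i

for x in data:
    assert decode[encode[x]] == x              # perfect reconstruction
print("perfect reconstruction from a 1-D code, no useful structure captured")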
Regularization
• Typically NOT enforced by
• Keeping the encoder/decoder shallow or
• Using a small code size

• Regularized autoencoders: add a regularization term that encourages the model to have other properties
• Sparsity of the representation (sparse autoencoder)
• Robustness to noise or to missing inputs (denoising autoencoder)
• Smallness of the derivative of the representation
Sparse autoencoder
• Constrain the code to have sparsity
• Training: minimize a loss function
𝐿𝑅 = 𝐿(𝑥, 𝑔(𝑓(𝑥))) + 𝑅(ℎ)

[Figure: 𝑥 → ℎ → 𝑟]
Probabilistic view of regularizing ℎ
• Suppose we have a probabilistic model 𝑝(ℎ, 𝑥)
• MLE on 𝑥
log 𝑝(𝑥) = log ∑ℎ′ 𝑝(ℎ′, 𝑥)

• Problem: hard to sum over ℎ′


Probabilistic view of regularizing ℎ
• Suppose we have a probabilistic model 𝑝(ℎ, 𝑥)
• MLE on 𝑥
max log 𝑝(𝑥) = max log ∑ℎ′ 𝑝(ℎ′, 𝑥)

• Approximation: suppose ℎ = 𝑓(𝑥) gives the most likely hidden representation, and ∑ℎ′ 𝑝(ℎ′, 𝑥) can be approximated by 𝑝(ℎ, 𝑥)
Probabilistic view of regularizing ℎ
• Suppose we have a probabilistic model 𝑝(ℎ, 𝑥)
• Approximate MLE on 𝑥, ℎ = 𝑓(𝑥)
max log 𝑝(ℎ, 𝑥) = max [log 𝑝(𝑥|ℎ) + log 𝑝(ℎ)]

where log 𝑝(𝑥|ℎ) corresponds to the (reconstruction) loss and log 𝑝(ℎ) to the regularization
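To make the loss/regularization correspondence concrete (an added worked step, not from the slides), assume a Gaussian likelihood for 𝑝(𝑥|ℎ) and a Laplace prior for 𝑝(ℎ), with constants absorbed into 𝜆:

% Sketch of the correspondence under assumed Gaussian/Laplace choices (requires amsmath).
\begin{gather*}
p(x \mid h) \propto \exp\!\big(-\tfrac{1}{2}\lVert x - g(h)\rVert_2^2\big), \qquad
p(h) \propto \exp\!\big(-\lambda \lVert h \rVert_1\big), \\
-\log p(h, x) = -\log p(x \mid h) - \log p(h)
= \tfrac{1}{2}\lVert x - g(h)\rVert_2^2 + \lambda \lVert h \rVert_1 + \mathrm{const}.
\end{gather*}

So maximizing log 𝑝(ℎ, 𝑥) amounts to minimizing a squared reconstruction loss plus an L1 sparsity penalty on the code, which is the sparse autoencoder objective on the next slide.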
Sparse autoencoder
• Constrain the code to have sparsity
• Laplacian prior: 𝑝(ℎ) = (𝜆/2) exp(−(𝜆/2)‖ℎ‖₁)

• Training: minimize a loss function


𝐿𝑅 = 𝐿(𝑥, 𝑔(𝑓(𝑥))) + 𝜆‖ℎ‖₁
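A minimal training-loop sketch for this objective (PyTorch; the sizes, 𝜆, and optimizer settings are arbitrary assumptions):

import torch
import torch.nn as nn

encoder = nn.Linear(784, 64)
decoder = nn.Linear(64, 784)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
lam = 1e-3                                    # sparsity weight λ (arbitrary)

x = torch.rand(32, 784)                       # stand-in batch
for step in range(100):
    h = torch.relu(encoder(x))                # code h = f(x)
    r = torch.sigmoid(decoder(h))             # reconstruction r = g(h)
    # L_R = L(x, g(f(x))) + λ‖h‖₁ (reconstruction loss plus L1 penalty on the code)
    loss = nn.functional.mse_loss(r, x) + lam * h.abs().sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final loss:", loss.item())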
Denoising autoencoder
• Traditional autoencoder: encouraged to learn 𝑔(𝑓(⋅)) to be the identity

• Denoising autoencoder: minimize a loss function


𝐿(𝑥, 𝑟) = 𝐿(𝑥, 𝑔(𝑓(𝑥̃)))
where 𝑥̃ is 𝑥 + noise
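A corresponding denoising sketch (PyTorch; the Gaussian noise level and layer sizes are arbitrary assumptions). Note that the loss compares the reconstruction with the clean 𝑥, not with 𝑥̃:

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(32, 784)                       # clean inputs (stand-in batch)
for step in range(100):
    x_tilde = x + 0.3 * torch.randn_like(x)   # x̃ = x + noise
    r = decoder(encoder(x_tilde))             # reconstruct from the corrupted input
    loss = nn.functional.mse_loss(r, x)       # L(x, g(f(x̃))): target is the clean x
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final loss:", loss.item())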
Boltzmann machine
Boltzmann machine
• Introduced by Ackley et al. (1985)

• General “connectionist” approach to learning arbitrary probability distributions over binary vectors
• Special case of energy model: 𝑝(𝑥) = exp(−𝐸(𝑥)) / 𝑍
Boltzmann machine
• Energy model:
𝑝(𝑥) = exp(−𝐸(𝑥)) / 𝑍
• Boltzmann machine: special case of energy model with
𝐸(𝑥) = −𝑥ᵀ𝑈𝑥 − 𝑏ᵀ𝑥
where 𝑈 is the weight matrix and 𝑏 is the bias parameter
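A small NumPy sketch (random parameters, illustrative only) of evaluating this energy and the unnormalized probability exp(−𝐸(𝑥)):

import numpy as np

rng = np.random.default_rng(0)
d = 5                                         # number of binary units
U = rng.normal(scale=0.1, size=(d, d))
U = (U + U.T) / 2                             # symmetric weight matrix
np.fill_diagonal(U, 0.0)                      # no self-connections
b = rng.normal(scale=0.1, size=d)             # bias parameter

def energy(x):
    # E(x) = -xᵀ U x - bᵀ x
    return -x @ U @ x - b @ x

x = rng.integers(0, 2, size=d)                # one binary configuration
print("E(x) =", energy(x))
print("unnormalized probability exp(-E(x)) =", np.exp(-energy(x)))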
Boltzmann machine with latent variables
• Some variables are not observed
𝑥 = (𝑥𝑣, 𝑥ℎ), with 𝑥𝑣 visible and 𝑥ℎ hidden
𝐸(𝑥) = −𝑥𝑣ᵀ𝑅𝑥𝑣 − 𝑥𝑣ᵀ𝑊𝑥ℎ − 𝑥ℎᵀ𝑆𝑥ℎ − 𝑏ᵀ𝑥𝑣 − 𝑐ᵀ𝑥ℎ

• Universal approximator of probability mass functions


Maximum likelihood
• Suppose we are given data 𝑋 = 𝑥𝑣1 , 𝑥𝑣2 , … , 𝑥𝑣𝑛
• Maximum likelihood is to maximize
log 𝑝(𝑋) = ∑𝑖 log 𝑝(𝑥𝑣𝑖)
where
𝑝(𝑥𝑣) = ∑𝑥ℎ 𝑝(𝑥𝑣, 𝑥ℎ) = (1/𝑍) ∑𝑥ℎ exp(−𝐸(𝑥𝑣, 𝑥ℎ))

• 𝑍 = ∑𝑥𝑣,𝑥ℎ exp(−𝐸(𝑥𝑣, 𝑥ℎ)): partition function, difficult to compute
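For a toy model the partition function can still be computed by brute force, which makes the difficulty explicit: the enumeration below ranges over all 2^(𝑛𝑣+𝑛ℎ) configurations (NumPy sketch, random parameters, illustrative only):

import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2                               # tiny sizes so enumeration is feasible
R = rng.normal(scale=0.1, size=(n_v, n_v)); R = (R + R.T) / 2
S = rng.normal(scale=0.1, size=(n_h, n_h)); S = (S + S.T) / 2
W = rng.normal(scale=0.1, size=(n_v, n_h))
b = rng.normal(scale=0.1, size=n_v)
c = rng.normal(scale=0.1, size=n_h)

def energy(xv, xh):
    # E(x) = -xvᵀ R xv - xvᵀ W xh - xhᵀ S xh - bᵀ xv - cᵀ xh
    return -(xv @ R @ xv + xv @ W @ xh + xh @ S @ xh + b @ xv + c @ xh)

states_v = [np.array(s) for s in itertools.product([0, 1], repeat=n_v)]
states_h = [np.array(s) for s in itertools.product([0, 1], repeat=n_h)]
Z = sum(np.exp(-energy(v, h)) for v in states_v for h in states_h)

xv = np.array([1, 0, 1])
p_xv = sum(np.exp(-energy(xv, h)) for h in states_h) / Z   # marginal p(x_v)
print("Z =", Z, " p(x_v) =", p_xv)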


Restricted Boltzmann machine
• Invented under the name harmonium (Smolensky, 1986)
• Popularized by Hinton and collaborators as the Restricted Boltzmann machine
Restricted Boltzmann machine
• Special case of Boltzmann machine with latent variables:
𝑝(𝑣, ℎ) = exp(−𝐸(𝑣, ℎ)) / 𝑍
where the energy function is
𝐸(𝑣, ℎ) = −𝑣ᵀ𝑊ℎ − 𝑏ᵀ𝑣 − 𝑐ᵀℎ
with the weight matrix 𝑊 and the biases 𝑏, 𝑐
• Partition function
𝑍 = ∑𝑣 ∑ℎ exp(−𝐸(𝑣, ℎ))
Restricted Boltzmann machine

[Figure from Deep Learning, Goodfellow, Bengio and Courville]
Restricted Boltzmann machine
• Conditional distribution is factorial
𝑝(ℎ|𝑣) = 𝑝(𝑣, ℎ) / 𝑝(𝑣) = ∏𝑗 𝑝(ℎ𝑗|𝑣)
and
𝑝(ℎ𝑗 = 1|𝑣) = 𝜎(𝑐𝑗 + 𝑣ᵀ𝑊:,𝑗), where 𝜎 is the logistic function
Restricted Boltzmann machine
• Similarly,
𝑝(𝑣|ℎ) = 𝑝(𝑣, ℎ) / 𝑝(ℎ) = ∏𝑖 𝑝(𝑣𝑖|ℎ)
and
𝑝(𝑣𝑖 = 1|ℎ) = 𝜎(𝑏𝑖 + 𝑊𝑖,:ℎ), where 𝜎 is the logistic function
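These factorial conditionals make block Gibbs sampling straightforward; a NumPy sketch with random parameters (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 3
W = rng.normal(scale=0.1, size=(n_v, n_h))    # weight matrix
b = rng.normal(scale=0.1, size=n_v)           # visible bias
c = rng.normal(scale=0.1, size=n_h)           # hidden bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v = rng.integers(0, 2, size=n_v).astype(float)
for step in range(1000):
    # p(h_j = 1 | v) = σ(c_j + vᵀ W_{:,j}); the h_j are conditionally independent
    h = (rng.random(n_h) < sigmoid(c + v @ W)).astype(float)
    # p(v_i = 1 | h) = σ(b_i + W_{i,:} h); the v_i are conditionally independent
    v = (rng.random(n_v) < sigmoid(b + W @ h)).astype(float)
print("sample after Gibbs steps:", v, h)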
Deep Boltzmann machine
• Special case of energy model. Take 3 hidden layers and ignore bias:
𝑝(𝑣, ℎ¹, ℎ², ℎ³) = exp(−𝐸(𝑣, ℎ¹, ℎ², ℎ³)) / 𝑍
• Energy function
𝐸(𝑣, ℎ¹, ℎ², ℎ³) = −𝑣ᵀ𝑊¹ℎ¹ − (ℎ¹)ᵀ𝑊²ℎ² − (ℎ²)ᵀ𝑊³ℎ³
with the weight matrices 𝑊¹, 𝑊², 𝑊³
• Partition function
𝑍 = ∑𝑣,ℎ¹,ℎ²,ℎ³ exp(−𝐸(𝑣, ℎ¹, ℎ², ℎ³))
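A NumPy sketch (random weights, illustrative only) of evaluating this DBM energy:

import numpy as np

rng = np.random.default_rng(0)
sizes = [6, 5, 4, 3]                          # dimensions of v, h1, h2, h3
W1 = rng.normal(scale=0.1, size=(sizes[0], sizes[1]))
W2 = rng.normal(scale=0.1, size=(sizes[1], sizes[2]))
W3 = rng.normal(scale=0.1, size=(sizes[2], sizes[3]))

def energy(v, h1, h2, h3):
    # E = -vᵀ W¹ h¹ - (h¹)ᵀ W² h² - (h²)ᵀ W³ h³
    return -(v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3)

v, h1, h2, h3 = (rng.integers(0, 2, size=n).astype(float) for n in sizes)
print("E(v, h1, h2, h3) =", energy(v, h1, h2, h3))
# The unnormalized probability is exp(-E); the partition function Z would again
# require summing over every joint configuration of v, h1, h2, h3.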
Deep Boltzmann machine

[Figure from Deep Learning, Goodfellow, Bengio and Courville]
