Deep Learning Computer Vision
• Playing Go
  f( [board position] ) = "5-5" (next move)
• Dialogue System
  f( "Hi" ) = "Hello"
  (what the user said → system response)
Image Recognition Framework
f( [image] ) = "cat"
Step 1: a set of functions (Model): f1, f2, …
  f1( [image] ) = "cat",  f2( [image] ) = "monkey"
  f1( [image] ) = "dog",  f2( [image] ) = "snake"
Step 2: goodness of function f: decide which candidate is better (here f1 is better!)
Supervised Learning Framework
Training: a set of functions f1, f2, … (Step 1) plus training data labeled "monkey", "cat", "dog", …
Testing: f( [image] ) = "cat"
Three Steps for Deep Learning
Neural Network: Neuron
A neuron is a simple function:
  z = a1 w1 + ⋯ + ak wk + ⋯ + aK wK + b    (w1 … wK: weights, b: bias)
  a = σ(z)    (σ: activation function)
Sigmoid function: σ(z) = 1 / (1 + e^(−z))
Example: inputs (1, −1), weights (1, −2), bias 1:
  z = 1×1 + (−1)×(−2) + 1 = 4, and σ(4) ≈ 0.98
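A minimal sketch of the neuron above in Python (NumPy), reproducing the σ(4) ≈ 0.98 example; the function names are illustrative, not from the slides.

import numpy as np

def sigmoid(z):
    # sigmoid activation: squashes z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    # weighted sum of the inputs plus the bias, then the activation function
    return sigmoid(np.dot(w, a) + b)

# inputs (1, -1), weights (1, -2), bias 1  ->  z = 4, sigmoid(4) ~ 0.98
print(neuron(np.array([1.0, -1.0]), np.array([1.0, -2.0]), 1.0))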
Neural Network
Different connections lead to different network structures.
Each neuron computes σ(z) from its own weighted sum; the neurons have different values of weights and biases.
Weights and biases are the network parameters 𝜃.
Fully Connect Feedforward Network
Example: inputs (1, −1); first-layer weights (1, −2) with bias 1 and (−1, 1) with bias 0:
  σ(1×1 + (−1)×(−2) + 1) = σ(4) = 0.98
  σ(1×(−1) + (−1)×1 + 0) = σ(−2) = 0.12
where σ(z) = 1 / (1 + e^(−z)) is the sigmoid function.
Passing (1, −1) through all three layers gives (0.98, 0.12) → (0.86, 0.11) → (0.62, 0.83); passing (0, 0) gives (0.73, 0.5) → (0.72, 0.12) → (0.51, 0.85).
This is a function with an input vector and an output vector:
  f([1, −1]) = [0.62, 0.83]    f([0, 0]) = [0.51, 0.85]
Given parameters 𝜃, we define a function; given a network structure, we define a function set.
Fully Connect Feedforward Network
Input layer (x1 … xN) → Layer 1 → Layer 2 → … → Layer L (hidden layers of neurons) → Output layer (y1 … yM).
Deep means many hidden layers.
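A small NumPy sketch of the forward pass through one fully connected layer, using the first-layer weights of the example above; the deeper layers' weights are not fully listed in the slides, so only the first layer is reproduced here.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(x, W, b):
    # one fully connected layer: z = W x + b, a = sigmoid(z)
    return sigmoid(W @ x + b)

# first layer of the example: weights (1, -2) and (-1, 1), biases 1 and 0
W1 = np.array([[1.0, -2.0],
               [-1.0, 1.0]])
b1 = np.array([1.0, 0.0])

print(layer_forward(np.array([1.0, -1.0]), W1, b1))  # ~ [0.98, 0.12]
print(layer_forward(np.array([0.0, 0.0]), W1, b1))   # ~ [0.73, 0.50]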
Why Deep? Universality Theorem
Any continuous function f
f : R N → RM
Can be realized by a network
with one hidden layer
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html (given enough hidden neurons)
But deeper networks work better in practice. Reference: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
ImageNet error rates:
  AlexNet (2012, 8 layers): 16.4%
  VGG (2014, 19 layers): 7.3%
  GoogleNet (2014): 6.7%
  Residual Net (2015, special structure): 3.57%
(The slide compares the network depths with Taipei 101.)
Output Layer
• Softmax layer as the output layer
Ordinary layer: y1 = σ(z1), y2 = σ(z2), …; in general, the output of the network can be any value.
Softmax layer: y_i = e^(z_i) ∕ Σ_{j=1}^{3} e^(z_j)
Example: z = (3, 1, −3) → e^z ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0)
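A small NumPy sketch of the softmax computation above; subtracting max(z) is a standard numerical-stability trick, not something the slide requires.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability; result unchanged
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))  # ~ [0.88, 0.12, 0.00]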
Example Application
Input: a 16 × 16 image flattened into x1 … x256 (ink → 1, no ink → 0).
Output: y1 … y10; each dimension represents the confidence of a digit, e.g. y1 = 0.1 (is "1"), y2 = 0.7 (is "2"), …, y10 = 0.2 (is "0") → the image is "2".
Example Application
• Handwriting Digit Recognition
The neural network is the "machine": input x1 … x256, output y1 (is "1"), y2 (is "2"), …, y10 (is "0") → "2".
What is needed is a function with a 256-dim vector as input and a 10-dim vector as output.
Example Application
Handwriting Digit Recognition: input layer (x1 … xN) → hidden layers (Layer 1 … Layer L) → output layer (y1 is "1", y2 is "2", …, y10 is "0").
A function set containing the candidates for handwriting digit recognition.
Softmax output layer: x1 … x256 → … → y1 … y10; the 16 × 16 = 256 inputs encode ink → 1, no ink → 0.
The learning target: for an input image of "1", y1 should have the maximum value.
Given a set of parameters, the outputs y1 … y10 should be as close as possible to the target (1, 0, …, 0).
Loss 𝑙: the distance between the network output and the target.
Total loss over all R training examples: 𝐿 = Σ_{r=1}^{R} 𝑙^r
For each training example x^r, the network output y^r is compared with the target 𝑦̂^r to get loss 𝑙^r; these should be as small as possible. Find a function in the function set that minimizes the total loss L.
Network parameters 𝜃 = {𝑤1, 𝑤2, 𝑤3, ⋯, 𝑏1, 𝑏2, 𝑏3, ⋯}: millions of parameters (~10^6 weights).
Gradient Descent
Network parameters 𝜃 = {𝑤1, 𝑤2, ⋯, 𝑏1, 𝑏2, ⋯}
For each parameter w, compute 𝜕𝐿∕𝜕𝑤: if it is positive, decrease w; if it is negative, increase w.
Update: 𝑤 ← 𝑤 − 𝜂 𝜕𝐿∕𝜕𝑤, where 𝜂 is called the "learning rate".
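A minimal gradient-descent sketch for a single parameter; the toy loss L(w) = (w − 3)² and the learning rate are illustrative assumptions, not from the slides.

import numpy as np

def dL_dw(w):
    # gradient of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

eta = 0.1               # learning rate
w = np.random.randn()   # random initialization

for step in range(100):
    w = w - eta * dL_dw(w)  # w <- w - eta * dL/dw

print(w)  # close to 3, the minimum of the toy loss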
Gradient Descent
[Figure: contour plot over (𝑤1, 𝑤2); color shows the value of the total loss L. Following the negative gradient, hopefully we reach a minimum.]
Difficulties: local minima (𝜕𝐿∕𝜕𝑤 = 0), plateaus (𝜕𝐿∕𝜕𝑤 ≈ 0, very slow), and saddle points (𝜕𝐿∕𝜕𝑤 = 0).
libdnn: developed by NTU student 周伯威.
Ref: https://www.youtube.com/watch?v=ibJpTrp5mcE
Three Steps for Deep Learning
For example, you can do …
• Spam filtering: features such as whether "Talk" or "free" appears in the e-mail → Network → 1/0 (Yes: spam / No: not spam).
  (http://spam-filter-review.toptenreviews.com/)
• Document classification: features such as whether "stock" or "president" appears in the document → Network → category (politics, economics, sports, finance, …).
  http://top-breaking-news.com/
Outline
Toolkits: very flexible, or need some effort to learn.
Thoughts on using Keras.
Example Application
• Handwriting Digit Recognition
Machine: a 28 × 28 image → "1"
Network: 28 × 28 = 784 inputs → hidden layer of 500 neurons → hidden layer of 500 neurons → Softmax → y1, y2, …, y10.
Keras
Step 3.2: Find the optimal network parameters.
Update rule: 𝑤 ← 𝑤 − 𝜂 𝜕𝐿∕𝜕𝑤 (learning rate, e.g. 0.1).
Testing:
case 1:
case 2:
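A hedged Keras sketch of the 784 → 500 → 500 → 10 network described above (modern Sequential API); the loss, optimizer, batch size, and the mapping of "case 1" / "case 2" to evaluate / predict are assumptions for illustration.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(500, activation="sigmoid", input_shape=(784,)),
    layers.Dense(500, activation="sigmoid"),
    layers.Dense(10, activation="softmax"),
])

model.compile(loss="categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=0.1),
              metrics=["accuracy"])

# x_train: (N, 784) values in [0, 1]; y_train: (N, 10) one-hot labels
# model.fit(x_train, y_train, batch_size=100, epochs=20)
# score  = model.evaluate(x_test, y_test)  # e.g. case 1: accuracy on test data
# result = model.predict(x_new)            # e.g. case 2: predictions for new data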
Keras
• Using GPU to speed up training
• Way 1:
  THEANO_FLAGS=device=gpu0 python YourCode.py
• Way 2 (in your code):
  import os
  os.environ["THEANO_FLAGS"] = "device=gpu0"
Demo
Three Steps for Deep Learning → Recipe of Deep Learning
Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function → Neural Network.
Good results on training data? NO → go back to the three steps. YES →
Good results on testing data? NO → overfitting! YES → done.
Do not always blame overfitting: check the training results first.
Different approaches for different problems.
Recipe of Deep Learning
YES
Momentum
Choosing Proper Loss
For an input image of "1", the softmax outputs y1 … y10 are compared with the target 𝑦̂ = (1, 0, …, 0) to compute the loss.
Which one is better?
  Square Error: Σ_{i=1}^{10} (y_i − 𝑦̂_i)²
  Cross Entropy: − Σ_{i=1}^{10} 𝑦̂_i ln y_i
(Both are 0 when the output equals the target.)
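A small NumPy sketch computing both losses for a 10-class output and a one-hot target, as defined above; the example output vector is made up for illustration.

import numpy as np

def square_error(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat):
    # -sum_i y_hat_i * ln(y_i); only the correct class contributes for one-hot targets
    return -np.sum(y_hat * np.log(y))

y_hat = np.zeros(10); y_hat[0] = 1.0   # target: the correct class
y = np.full(10, 0.02); y[0] = 0.82     # made-up softmax output

print(square_error(y, y_hat), cross_entropy(y, y_hat))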
Demo
[Figure: total loss surface over w1 and w2 for square error vs. cross entropy.]
Reference: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
Recipe of Deep Learning
YES
Momentum
We do not really minimize total loss!
Mini-batch:
➢ Randomly initialize the network parameters.
➢ Pick the 1st batch: L′ = 𝑙¹ + 𝑙³¹ + ⋯ ; update the parameters once.
➢ Pick the 2nd batch: L′′ = 𝑙² + 𝑙¹⁶ + ⋯ ; update the parameters once.
➢ …
➢ Until all mini-batches have been picked: one epoch.
E.g. 100 examples in a mini-batch; repeat the whole process for 20 epochs.
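A minimal sketch of the mini-batch loop in Python; grad_fn and the data arrays are placeholders standing in for a real network, not code from the slides.

import numpy as np

def minibatch_sgd(params, X, Y, grad_fn, eta=0.1, batch_size=100, epochs=20):
    # grad_fn(params, x_batch, y_batch) returns a gradient shaped like params
    n = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(n)           # shuffle examples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # pick one mini-batch
            g = grad_fn(params, X[idx], Y[idx])    # gradient of the batch loss L'
            params = params - eta * g              # update the parameters once per batch
    return params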
Mini-batch
Original gradient descent: one update per pass over all the training data. With mini-batch: one update per batch (e.g. batch 1: x¹, x³¹, …; batch 2: x², x¹⁶, …), so there are many updates in one epoch, but each update direction is noisier ("unstable!!!").
Recipe of Deep Learning
YES
Momentum
Hard to get the power of Deep …
In a deep sigmoid network the gradients of the layers near the input are small: a large change Δ𝑤 to an early weight is attenuated by every sigmoid it passes through and produces only a small change Δ𝑙 at the output.
Intuitive way to compute the derivatives: 𝜕𝑙∕𝜕𝑤 ≈ Δ𝑙∕Δ𝑤
ReLU: a = z when z > 0, a = 0 when z ≤ 0. Neurons with zero output can be removed, giving a thinner linear network, so the remaining parts do not have smaller gradients.
Demo
ReLU – variants
  Leaky ReLU: a = 0.01 z for z < 0
  Parametric ReLU: a = α z for z < 0, where α is also learned by gradient descent
Maxout: ReLU is a special case of Maxout.
Group the linear outputs and take the max within each group. E.g. with inputs x1, x2: first layer max(5, 7) = 7 and max(−1, 1) = 1; second layer max(1, 2) = 2 and max(3, 4) = 4.
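A small NumPy sketch of the activations discussed above (ReLU, Leaky ReLU, and one maxout group); a group size of 2 matches the slide's example.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # fixed alpha = Leaky ReLU; in Parametric ReLU, alpha is learned too
    return np.where(z > 0, z, alpha * z)

def maxout(z, group_size=2):
    # take the max within each group of linear outputs
    return z.reshape(-1, group_size).max(axis=1)

print(relu(np.array([5.0, -1.0])))               # [5. 0.]
print(leaky_relu(np.array([5.0, -1.0])))         # [ 5.   -0.01]
print(maxout(np.array([5.0, 7.0, -1.0, 1.0])))   # [7. 1.], as in the slide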
Momentum
Learning Rates
Set the learning rate η carefully: if it is too large, the updates overshoot and the total loss may not decrease; if it is too small, training is very slow.
Learning Rates
• Popular & Simple Idea: Reduce the learning rate by
some factor every few epochs.
• At the beginning, we are far from the destination, so we
use larger learning rate
• After several epochs, we are close to the destination, so
we reduce the learning rate
• E.g. 1/t decay: 𝜂^t = 𝜂 ∕ (t + 1)
• Learning rate cannot be one-size-fits-all
• Giving different parameters different learning
rates
Adagrad
  Original: 𝑤 ← 𝑤 − 𝜂 𝜕𝐿∕𝜕𝑤
  Adagrad: 𝑤 ← 𝑤 − 𝜂_w 𝜕𝐿∕𝜕𝑤   (parameter-dependent learning rate)
  𝜂_w = 𝜂 ∕ √( Σ_{i=0}^{t} (g^i)² ), where 𝜂 is a constant and g^i is the 𝜕𝐿∕𝜕𝑤 obtained at the i-th update: the summation of the squares of the previous derivatives.
Example:
  𝑤1: g⁰ = 0.1, g¹ = 0.2, … → learning rates 𝜂∕√(0.1²) = 𝜂∕0.1, then 𝜂∕√(0.1² + 0.2²) = 𝜂∕0.22, …
  𝑤2: g⁰ = 20.0, g¹ = 10.0, … → learning rates 𝜂∕√(20²) = 𝜂∕20, then 𝜂∕√(20² + 10²) = 𝜂∕22, …
Observation: 1. the learning rate gets smaller and smaller for all parameters; 2. smaller derivatives give a larger learning rate, and vice versa.
Why? Larger derivatives → smaller learning rate; smaller derivatives → larger learning rate.
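A minimal NumPy sketch of the Adagrad update defined above; grad_fn, the toy loss, and the small eps (added only to avoid division by zero) are illustrative choices.

import numpy as np

def adagrad(w, grad_fn, eta=0.1, steps=100, eps=1e-8):
    sum_sq = np.zeros_like(w)             # running sum of squared gradients
    for t in range(steps):
        g = grad_fn(w)
        sum_sq += g ** 2
        w = w - (eta / (np.sqrt(sum_sq) + eps)) * g  # parameter-dependent learning rate
    return w

# toy loss L(w) = w1^2 + 100 * w2^2, so w2 has much larger derivatives than w1
grad_fn = lambda w: np.array([2.0 * w[0], 200.0 * w[1]])
print(adagrad(np.array([1.0, 1.0]), grad_fn))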
Momentum
Hard to find the optimal network parameters: the total loss has plateaus (very slow, 𝜕𝐿∕𝜕𝑤 ≈ 0), saddle points (𝜕𝐿∕𝜕𝑤 = 0), and local minima (𝜕𝐿∕𝜕𝑤 = 0).
Adam = RMSProp (an advanced Adagrad) + Momentum
Demo
Recipe of Deep Learning
YES
Regularization
YES
Network Structure
Panacea for Overfitting
• Have more training data
• Create more training data (?)
  Handwriting recognition: create new training data from the original, e.g. shift 15°.
Recipe of Deep Learning
YES
Regularization
YES
Network Structure
Dropout
Training: neurons are randomly dropped, so the network becomes thinner.
Testing:
➢ No dropout.
⚫ If the dropout rate at training is p%, multiply all the weights by (1 − p)%.
⚫ Assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
Dropout – Intuitive Reason
Training (with dropout) is like practicing with heavy weights tied to your legs; at testing (no dropout) the weights are taken off, so you become much stronger.
Dropout – Intuitive Reason
• Why multiply the weights by (1 − p)% (p%: dropout rate) when testing?
Training of dropout (rate 50%): about half of the inputs to each neuron are dropped, giving some value z.
Testing of dropout (no dropout): with all inputs present and the trained weights 𝑤1 … 𝑤4 unchanged, z′ ≈ 2z; multiplying the weights by 0.5 (i.e. by 1 − p%) gives z′ ≈ z.
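A small NumPy sketch of dropout at training time and the (1 − p) weight scaling at test time described above; the layer shape and data are illustrative.

import numpy as np

p = 0.5  # dropout rate

def train_forward(x, W):
    # during training, each input unit is dropped with probability p
    mask = (np.random.rand(*x.shape) > p).astype(float)
    return W @ (x * mask)

def test_forward(x, W):
    # no dropout at test time; multiply the weights by (1 - p) instead
    return (W * (1.0 - p)) @ x

x = np.ones(4)
W = np.ones((1, 4))
print(train_forward(x, W))  # varies run to run, about 2 on average
print(test_forward(x, W))   # 2.0, matching the expected training-time value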
Dropout is a kind of ensemble.
Ensemble: train a set of networks and average their outputs y1, y2, y3, y4.
Training of dropout: each mini-batch (1, 2, 3, 4, …) trains a different thinned network; with M neurons there are 2^M possible networks, all sharing parameters.
Testing: instead of averaging all these networks, use the full network with all the weights multiplied by (1 − p)%; the result is approximately the average ≈ y.
More about dropout
• More references for dropout: [Nitish Srivastava, JMLR'14] [Pierre Baldi, NIPS'13] [Geoffrey E. Hinton, arXiv'12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML'13]
• Dropconnect [Li Wan, ICML'13]
  • Dropout deletes neurons; Dropconnect deletes the connections between neurons
• Annealed dropout [S.J. Rennie, SLT'14]
  • The dropout rate decreases over epochs
• Standout [J. Ba, NIPS'13]
  • Each neuron has a different dropout rate
Demo: the earlier network with dropout added after each 500-unit hidden layer:
… → 500 → model.add( Dropout(0.8) ) → 500 → model.add( Dropout(0.8) ) → Softmax → y1, y2, …, y10
Demo
Recipe of Deep Learning
YES
Regularization
YES
Network Structure
CNN is a very good example!
(next lecture)
Concluding Remarks
Recipe of Deep Learning
YES
Step 1: define a NO
Good Results on
set of function
Testing Data?
Step 2: goodness
of function YES
NO
Step 3: pick the Good Results on
best function Training Data?
Neural
Network
Lecture II:
Variants of Neural
Networks
Variants of Neural Networks
Convolutional Neural
Network (CNN) Widely used in
image processing
An image is represented as pixels x1 … xN and fed into the network. The first-layer neurons act as the most basic classifiers; the second layer uses the first layer as modules to build classifiers (e.g. a "beak" detector); the following layers use the second layer as modules, and so on.
Why CNN for Image
• The same patterns appear in different regions.
“upper-left
beak” detector
“middle beak”
detector
Why CNN for Image
• Subsampling the pixels will not change the object
bird
bird
subsampling
The whole CNN: Convolution → Max Pooling → (can repeat many times) → Flatten → Fully Connected Feedforward network.
Why this structure:
Property 1: some patterns are much smaller than the whole image → Convolution
Property 2: the same patterns appear in different regions → Convolution
Property 3: subsampling the pixels will not change the object → Max Pooling
The whole CNN
[image] → Convolution → Max Pooling → (repeat many times) → Flatten → Fully Connected Feedforward network → cat / dog / …
CNN – Convolution
The filters are the network parameters to be learned.
6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Filter 1 (3 × 3 matrix):
 1 -1 -1
-1  1 -1
-1 -1  1
Filter 2 (3 × 3 matrix):
-1  1 -1
-1  1 -1
-1  1 -1
…
Each filter detects a small pattern (3 × 3). (Property 1)
Sliding Filter 1 over the image with stride = 1, the first two inner products are 3 and −1; with stride = 2 they would be 3 and −3. We set stride = 1 below.
The full stride-1 output of Filter 1 is a 4 × 4 image (Property 2):
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
Do the same process for every filter. Filter 2 gives:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Together the outputs form the Feature Map (here two 4 × 4 images).
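A small NumPy sketch of the convolution above (stride 1, no padding), reproducing the 4 × 4 output of Filter 1; it is written as a plain loop for clarity rather than with a library convolution routine.

import numpy as np

def conv2d(image, kernel, stride=1):
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros(((H - kH) // stride + 1, (W - kW) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            out[i, j] = np.sum(patch * kernel)  # inner product of patch and filter
    return out

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]], dtype=float)
filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]], dtype=float)

print(conv2d(image, filter1))  # 4 x 4 feature map; top-left value 3, as in the slides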
CNN – Zero Padding
Pad the border of the 6 × 6 image with zeros before convolving with Filter 1; this way you get another 6 × 6 image instead of a 4 × 4 one.
CNN – Colorful image
A color image has 3 channels, so the input is a stack of three 6 × 6 matrices and each filter is correspondingly a 3 × 3 × 3 tensor (Filter 1 and Filter 2 repeated across the channels in the illustration).
Convolution v.s. Fully Connected
Convolution: the 6 × 6 image is convolved with the 3 × 3 filters.
Fully connected: the 6 × 6 image is flattened into a 36-dim vector x1 … x36 and every input connects to every neuron.
Viewing convolution as a sparsely connected layer (pixels numbered 1–36):
The neuron producing the first feature-map value (3) connects only to 9 inputs, pixels 1, 2, 3, 7, 8, 9, 13, 14, 15, not to all 36 → less parameters!
The neuron producing the next value (−1) connects to the shifted 9 pixels 2, 3, 4, 8, 9, 10, 14, 15, 16 and uses the same weights as the first neuron → shared weights, even less parameters!
The whole CNN
[image] → Convolution → Max Pooling → (repeat many times) → Flatten → Fully Connected Feedforward network → cat / dog / …
CNN – Max Pooling
The convolution outputs of Filter 1 and Filter 2:
Filter 1:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
Filter 2:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
CNN – Max Pooling
Group each 4 × 4 output into 2 × 2 blocks and keep the maximum of each block: the result is a new but smaller (2 × 2) image per filter:
Filter 1: 3 0 / 3 1    Filter 2: -1 1 / 0 3
Each filter is a channel of the new image.
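A small NumPy sketch of 2 × 2 max pooling, reproducing the pooled Filter 1 output above.

import numpy as np

def max_pool_2x2(fmap):
    H, W = fmap.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            out[i // 2, j // 2] = fmap[i:i+2, j:j+2].max()  # max of each 2 x 2 block
    return out

fmap1 = np.array([[ 3, -1, -3, -1],
                  [-3,  1,  0, -3],
                  [-3, -3,  0,  1],
                  [ 3, -2, -2, -1]], dtype=float)

print(max_pool_2x2(fmap1))  # [[3. 0.], [3. 1.]], as in the slides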
The whole CNN
After Convolution + Max Pooling we get a new image, smaller than the original; the number of channels is the number of filters (here 2: 3 0 / 3 1 and -1 1 / 0 3). Convolution and Max Pooling can be repeated many times.
The whole CNN
[image] → Convolution → Max Pooling → (repeat) → a new image → Flatten → Fully Connected Feedforward network → cat / dog / …
Flatten: lay the 2 × 2 × 2 output (3, 0, 3, 1 and −1, 1, 0, 3) out as an 8-dim vector and feed it to the fully connected feedforward network.
Convolutional Neural Network
[image] → CNN (Convolution, Max Pooling, fully connected) → output compared with the target, e.g. "monkey" 0, "cat" 1, "dog" 0, …
Learning: nothing special, just gradient descent …
CNN in Keras
Only the network structure and the input format (vector → 3-D tensor) are modified.
Input_shape = (1, 28, 28): 1 channel (black/white; 3 for RGB), 28 × 28 pixels.
The first convolution uses 25 filters of size 3 × 3 (e.g. patterns such as 1 -1 -1 / -1 1 -1 / -1 -1 1 and -1 1 -1 / -1 1 -1 / -1 1 -1), followed by Max Pooling, then another Convolution and Max Pooling.
CNN in Keras
Input: 1 × 28 × 28
→ Convolution (25 filters): 25 × 26 × 26. How many parameters for each filter? 9.
→ Max Pooling: 25 × 13 × 13
→ Convolution (50 filters): 50 × 11 × 11. How many parameters for each filter? 225.
→ Max Pooling: 50 × 5 × 5
→ Flatten: 1250 → Fully Connected Feedforward network → output.
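A hedged Keras sketch of the CNN above (25 and 50 filters of 3 × 3, 2 × 2 max pooling, flatten to 1250, then fully connected layers); the channels-last input shape (28, 28, 1) is used instead of the slide's (1, 28, 28) ordering, and the activations, dense-layer size, loss, and optimizer are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(25, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # 26 x 26 x 25
    layers.MaxPooling2D((2, 2)),                                            # 13 x 13 x 25
    layers.Conv2D(50, (3, 3), activation="relu"),                           # 11 x 11 x 50
    layers.MaxPooling2D((2, 2)),                                            # 5 x 5 x 50
    layers.Flatten(),                                                       # 1250
    layers.Dense(100, activation="relu"),   # size of this layer is an assumption
    layers.Dense(10, activation="softmax"),
])

model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.summary()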
Live Demo
What does CNN learn?
The output of the k-th filter in the second convolution layer (50 filters of 3 × 3, output 50 × 11 × 11) is an 11 × 11 matrix with elements a_{ij}^k.
Degree of activation of the k-th filter: a^k = Σ_{i=1}^{11} Σ_{j=1}^{11} a_{ij}^k
Find the input that activates the filter most: x* = arg max_x a^k (by gradient ascent).
[Figure: the same analysis applied to the layers after flatten; a 3 × 3 grid of the images found for neurons 0–8.]
Deep Dream
The CNN modifies the input image.
http://deepdreamgenerator.com/
Deep Dream
• Given a photo, machine adds what it sees ……
http://deepdreamgenerator.com/
Deep Style
• Given a photo, make its style like famous paintings
https://dreamscopeapp.com/
Deep Style
• Given a photo, make its style like famous paintings
https://dreamscopeapp.com/
Deep Style
Two CNNs are involved: one captures the content of the photo, the other the style of the painting; how to combine them?
More Application: Playing Go
Network input: the board as a 19 × 19 matrix (image); black: 1, white: −1, none: 0.
Network output: the next move, a 19 × 19 vector over positions.
A fully-connected feedforward network can be used, but a CNN performs much better.
More Application: Playing Go
Training: records of previous plays, e.g. 黑: 5之五, 白: 天元, 黑: 五之5, …
For the board after the first move, the CNN's target is "天元" = 1, else = 0; for the board after the second move, the target is "五之5" = 1, else = 0.
Why CNN for playing Go?
• Some patterns are much smaller than the whole
image
Convolutional Neural
Network (CNN)
Slot filling example: Destination → Taipei; time of arrival → November 2nd.
Example Application
Solving slot filling with a feedforward network?
Input: a word, each word represented as a vector, e.g. Taipei → x1, x2; output y1, y2.
1-of-N encoding: each dimension corresponds to one word (apple = [1 0 0 0 0 …], bag = [0 1 0 0 0 …], cat, dog, elephant, …).
Beyond 1-of-N: add a dimension for "other" (e.g. w = "Gandalf", w = "Sauron" → "other" = 1), or use word hashing over tri-letters (26 × 26 × 26 dimensions; w = "apple" → a-p-p, p-p-l, p-l-e set to 1).
Example Application
Output: the probability distribution that the input word belongs to each slot, e.g. y1 = dest, y2 = time of departure.
Example Application
"arrive Taipei on November 2nd" → Taipei is dest; "leave Taipei" → Taipei is place of departure. The same input (Taipei) must give different outputs, so the network needs memory: the hidden-layer outputs a1, a2 are stored and used together with the next input, step after step (x^t, x^{t+1}, x^{t+2}, …).
Bidirectional RNN
Two RNNs read x^t, x^{t+1}, x^{t+2}, … in opposite directions, and the outputs y^t, y^{t+1}, y^{t+2} are produced from both.
Long Short-term Memory (LSTM)
A special neuron with 4 inputs and 1 output:
• Output gate: a signal (from other parts of the network) controls whether the stored value is read out.
• Forget gate: a signal controls whether the memory cell keeps or forgets its value.
• Input gate: a signal controls whether the input is written into the memory cell.
The gate activation function f is usually a sigmoid; its value is between 0 and 1 and mimics an open or closed gate.
Memory update: c′ = g(z) f(z_i) + c f(z_f)
Output: a = h(c′) f(z_o)
[Worked example: gate inputs of about ±10 make the sigmoid gate values ≈ 1 or ≈ 0, showing how the cell stores, keeps, and outputs concrete values such as 3 and 7.]
LSTM
In practice x^t and the cell values c^{t−1} are vectors: x^t is transformed into four vectors z^f, z^i, z, z^o that drive the forget gate, input gate, cell input, and output gate to produce c^t and y^t. (Extension: "peephole" connections also feed c^{t−1} into the gates.) The same block is applied at t, t+1, …. This is quite standard now.
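A minimal NumPy sketch of one LSTM step following the equations above (c′ = g(z) f(z_i) + c f(z_f), a = h(c′) f(z_o)); the weight matrices are random placeholders and g, h are taken as tanh, which is a common but here assumed choice.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, c, params):
    # each of the four signals gets its own linear transform of the input x
    Wz, Wi, Wf, Wo, bz, bi, bf, bo = params
    z  = np.tanh(Wz @ x + bz)   # cell input   g(z)
    zi = sigmoid(Wi @ x + bi)   # input gate   f(z_i)
    zf = sigmoid(Wf @ x + bf)   # forget gate  f(z_f)
    zo = sigmoid(Wo @ x + bo)   # output gate  f(z_o)
    c_new = z * zi + c * zf     # c' = g(z) f(z_i) + c f(z_f)
    a = np.tanh(c_new) * zo     # a  = h(c') f(z_o)
    return a, c_new

dim_in, dim_cell = 4, 3
rng = np.random.default_rng(0)
params = [rng.standard_normal((dim_cell, dim_in)) for _ in range(4)] + \
         [np.zeros(dim_cell) for _ in range(4)]
a, c = lstm_step(rng.standard_normal(dim_in), np.zeros(dim_cell), params)
print(a, c)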
Three Steps for Deep Learning
RNN training: the hidden outputs a1, a2, a3 are copied forward through time, and the same weights (e.g. W^i) are used at every step.
Training sentence: "arrive Taipei on November 2nd", with targets other / dest / other / time / time.
Three Steps for Deep Learning
Training uses Backpropagation Through Time (BPTT): unfold the network over time and update every weight by 𝑤 ← 𝑤 − 𝜂 𝜕𝐿∕𝜕𝑤.
Unfortunately …
• RNN-based networks are not always easy to learn.
[Figure: real experiments on language modeling; the total loss sometimes jumps wildly across epochs, and only a lucky run converges smoothly.]
The error surface is rough: it is either very flat or very steep. → Clipping: clip the gradient when it is too large, so a single update does not jump too far.
[Figure: the total loss (cost) surface over w1, w2.]
Many to one: the input is a vector sequence, but the output is a single vector. E.g. sentiment analysis: 我覺得……太糟了 ("I feel … it's just too awful"). [Shen & Lee, Interspeech 16]
Key term extraction: the document x1 … xT goes through an embedding layer and hidden layers; attention weights α1 … αT form Σ αi Vi, and the output layer produces key terms such as "DNN", "LSTM".
Many to Many (Output is shorter)
• Both input and output are sequences, but the output is shorter.
• E.g. speech recognition: the frame-level outputs "好 φ φ 棒 φ φ φ φ" and "好 φ φ 棒 φ 棒 φ φ" (φ: null) correspond to different transcriptions (好棒 vs. 好棒棒).
Many to Many (No Limitation)
• Both input and output are sequences with different lengths → sequence-to-sequence learning.
• E.g. machine translation (machine learning → 機器學習)
The RNN reads "machine", "learning"; its final state contains all information about the input sequence, and the decoder then generates the output.
Many to Many (No Limitation)
Without a stopping criterion the output keeps growing: 機 器 學 習 慣 性 … (like the never-ending chain-reply game: 推 tlkagk: =========斷==========).
Ref: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87 (PTT wiki on 接龍推文, chain-reply posts)
Many to Many (No Limitation)
Add a stop symbol "===" so the model learns when to end: 機 器 學 習 ===.
Replacing the input RNN with a CNN that encodes an image gives Image Caption Generation.
Image Caption Generation
• Can a machine describe what it sees in an image?
• Demo: by NTU EE seniors 蘇子睿, 林奕辰, 徐翊祥, 陳奕安 (MTK industry-academia alliance)
http://news.ltn.com.tw/photo/politics/breakingnews/975542_1
Video Caption Generation: a video → "A girl is running."
Attention-based Model
Like human memory: asked "What is deep learning?", you recall what you learned in these lectures, not breakfast today or the summer vacation 10 years ago; the answer is organized from the relevant memory.
http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html
Attention-based Model: Input → DNN/RNN → output; a reading head controller decides where the reading head looks in the machine's memory.
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).ecm.mp4/index.html
Attention-based Model v2: the reading head controller, together with semantic analysis of the query, attends over the machine's memory.
Visual Question Answering (source: http://visualqa.org/): a reading head controller attends over the image.
(proposed by FB AI group)
Convolutional Neural
Network (CNN)
Unsupervised Learning
• 化繁為簡 ("from complex to simple")
  • Auto-encoder
  • Word Vector and Audio Word Vector
• 無中生有 ("generating from nothing")
Reinforcement Learning
Unsupervised Learning
• 化繁為簡: we only have the function's input (no labels); learn a function that maps an object to a code.
• 無中生有: we only have the function's output; learn a function that generates an object from a code.
Outline
Unsupervised Learning
• 化繁為簡: Auto-encoder, Word Vector and Audio Word Vector
• 無中生有
Reinforcement Learning
Motivation
• In MNIST, a digit is 28 × 28 dims.
• Most 28 × 28-dim vectors are not digits.
Unsupervised Learning
• 化繁為簡: Auto-encoder, Word Vector and Audio Word Vector
• 無中生有
Reinforcement Learning
Auto-encoder
NN Encoder: the 28 × 28 = 784-dim input → a code (usually < 784 dims), a compact representation of the input object.
NN Decoder: the code → a reconstruction; it can reconstruct the original object.
Encoder and decoder are learned together so that the output is as close as possible to the input: x → NN Encoder → c → NN Decoder → x̂.
Deep Auto-encoder
• NN encoder + NN decoder = a deep network.
Input layer → layers → bottleneck layer (the code) → layers → output layer; the first half is the encoder (x → code), the second half the decoder (code → x̂); the output should be as close as possible to the input.
Reference: Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
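A hedged Keras sketch of a deep auto-encoder on 784-dim inputs with a 30-dim bottleneck, following the 1000-500-250-30 layer sizes quoted below; the activations, loss, and optimizer are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

# encoder: 784 -> 1000 -> 500 -> 250 -> 30 (the code / bottleneck)
# decoder: 30 -> 250 -> 500 -> 1000 -> 784 (the reconstruction)
autoencoder = keras.Sequential([
    layers.Dense(1000, activation="relu", input_shape=(784,)),
    layers.Dense(500, activation="relu"),
    layers.Dense(250, activation="relu"),
    layers.Dense(30, name="code"),
    layers.Dense(250, activation="relu"),
    layers.Dense(500, activation="relu"),
    layers.Dense(1000, activation="relu"),
    layers.Dense(784, activation="sigmoid"),
])

# the target is the input itself: the reconstruction should be as close as possible
autoencoder.compile(loss="mse", optimizer="adam")
# autoencoder.fit(x_train, x_train, batch_size=100, epochs=20)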
Deep Auto-encoder
Original image (784 dims) → 1000 → 500 → 250 → 30-dim code → 250 → 500 → 1000 → 784-dim reconstruction.
[Figure: reconstructions with a 30-dim code, deep auto-encoder vs. PCA (784 → 30 → 784), and visualizations with a 2-dim code.]
Auto-encoder
More: Contractive auto-encoder. Ref: Rifai, Salah, et al. "Contractive auto-encoders: Explicit invariance during feature extraction." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
• De-noising auto-encoder: add noise to x to get x′, encode x′ to c, decode to x̂, and make x̂ as close as possible to the original clean x.
Auto-encoder – Pre-training DNN
• Greedy Layer-wise Pre-training (target network: 784 → 1000 → 1000 → 500 → output 10)
1. Train an auto-encoder 784 → 1000 (W1) → 784 with the input 𝑥 as the target 𝑥̂.
2. Fix W1, compute the 1000-dim hidden output a1, and train an auto-encoder 1000 → 1000 (W2) → 1000 on a1.
3. Fix W1 and W2, compute a2, and train an auto-encoder 1000 → 500 (W3) → 1000 on a2.
4. Randomly initialize the output-layer weights W4, then fine-tune the whole network (W1, W2, W3, W4) by backpropagation.
Outline
Unsupervised Learning
• 化繁為簡: Auto-encoder, Word Vector and Audio Word Vector
• 無中生有
Reinforcement Learning
Word Vector/Embedding
• Machines learn the meaning of words from reading a lot of documents without supervision.
tree
flower
dog rabbit
run
jump cat
Word Embedding
• Machines learn the meaning of words from reading a lot of documents without supervision.
• A word can be understood by its context: "You shall know a word by the company it keeps."
  蔡英文 and 馬英九 are something very similar, because they appear in similar contexts:
  馬英九 520宣誓就職 / 蔡英文 520宣誓就職 (both "take the inauguration oath on 5/20")
How to exploit the context?
• Count based
• If two words wi and wj frequently co-occur, V(wi) and
V(wj) would be close to each other
• E.g. Glove Vector:
http://nlp.stanford.edu/projects/glove/
• Prediction based: predict the next word wi from the previous words (…… wi-2 wi-1 ___).
Prediction-based
The previous word wi-1 is 1-of-N encoded and fed into a neural network; the output gives the probability of each word being the next word wi (e.g. after "潮水" the network should predict "退了").
After "蔡英文" or "馬英九", the next word "宣誓就職" should have a large probability, because the training text contains "…… 蔡英文 宣誓就職 ……" (wi-1 = 蔡英文, wi = 宣誓就職) and "…… 馬英九 宣誓就職 ……" (wi-1 = 馬英九, wi = 宣誓就職).
Since both inputs must lead to the same prediction, the network maps 蔡英文 and 馬英九 to nearby points in its hidden layer (z1, z2), which serves as the word embedding.
Prediction-based – Various Architectures
• Continuous bag of words (CBOW) model: use the neighboring words wi-1 and wi+1 to predict the missing word wi (…… wi-1 ____ wi+1 ……).
Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014
Word Embedding
• Characteristics: V(Germany) ≈ V(Berlin) − V(Rome) + V(Italy)
• Solving analogies: Rome is to Italy as Berlin is to ___? Compute V(Berlin) − V(Rome) + V(Italy) and find the word whose vector is closest.
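A small NumPy sketch of the analogy trick above: find the word whose embedding is closest (by cosine similarity) to V(Berlin) − V(Rome) + V(Italy). The tiny embedding table is made up purely for illustration.

import numpy as np

# made-up 3-dim embeddings, only to illustrate the lookup
emb = {
    "Berlin":  np.array([0.9, 0.1, 0.8]),
    "Rome":    np.array([0.8, 0.2, 0.1]),
    "Italy":   np.array([0.1, 0.9, 0.1]),
    "Germany": np.array([0.2, 0.8, 0.8]),
}

def analogy(a, b, c, emb):
    # answer to "b is to a as c is to ?": the word closest to V(a) - V(b) + V(c)
    target = emb[a] - emb[b] + emb[c]
    cosine = lambda u, v: np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("Berlin", "Rome", "Italy", emb))  # "Germany" with these toy vectors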
Document to Vector
• Paragraph Vector: Le, Quoc, and Tomas Mikolov. "Distributed Representations of
Sentences and Documents.“ ICML, 2014
• Seq2seq Auto-encoder: Li, Jiwei, Minh-Thang Luong, and Dan Jurafsky. "A
hierarchical neural autoencoder for paragraphs and documents." arXiv preprint,
2015
• Skip Thought: Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S.
Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler, “Skip-Thought Vectors”
arXiv preprint, 2015.
• Exploiting other kind of labels:
• Huang, Po-Sen, et al. "Learning deep structured semantic models for web
search using clickthrough data." ACM, 2013.
• Shen, Yelong, et al. "A latent semantic model with convolutional-pooling
structure for information retrieval." ACM, 2014.
• Socher, Richard, et al. "Recursive deep models for semantic
compositionality over a sentiment treebank." EMNLP, 2013.
• Tai, Kai Sheng, Richard Socher, and Christopher D. Manning. "Improved
semantic representations from tree-structured long short-term memory
networks." arXiv preprint, 2015.
Audio Word to Vector
Like an infant, the machine learns from audio without labels; different utterances of the same word (e.g. "ever") should map to nearby vectors.
Sequence-to-sequence Auto-encoder
An audio segment is a sequence of acoustic features x1, x2, x3, x4; an RNN Encoder compresses it into a vector, and an RNN Decoder reconstructs y1, y2, y3, y4. The RNN encoder and decoder are jointly trained.
Sequence-to-sequence Auto-encoder
• Visualizing the embedding vectors of spoken words such as fear, fame, name, near.
Audio Word to Vector – Application
Query-by-example search over spoken content: the user speaks a query (e.g. "US President"); the audio segments of the spoken content are mapped to vectors off-line, the spoken query is mapped to a vector on-line, similarity between the vectors is computed, and the search results are returned.
Experimental Results
• Query-by-Example Spoken Term Detection
SA: sequence auto-encoder; DSA: de-noising sequence auto-encoder (input: clean speech + noise, output: clean speech).
[Figure: MAP vs. training epochs of the sequence auto-encoder.]
Next Step ……
• Can we include semantics?
walk
dog
walked
cat
run cats
flower tree
Outline
Unsupervised Learning
• 化繁為簡: Auto-encoder, Word Vector and Audio Word Vector
• 無中生有
Reinforcement Learning
Creation
Draw something!
Creation
• Generative Models:
https://openai.com/blog/generative-models/
https://www.quora.com/What-did-Richard-Feynman-mean-when-he-said-What-I-
cannot-create-I-do-not-understand
PixelRNN
Ref: Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, Pixel Recurrent Neural Networks, arXiv preprint, 2016
E.g. 3 × 3 images: the NN reads the pixels generated so far and predicts the next pixel, one pixel at a time; it is trained on real-world images.
PixelRNN – beyond Image
Audio: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, WaveNet: A Generative Model for Raw Audio, arXiv preprint, 2016
Video: Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu, Video Pixel Networks, arXiv preprint, 2016
Auto-encoder
Trained so that the output is as close as possible to the input: input → NN Encoder → code → NN Decoder → output.
Generation: randomly generate a vector as the code and feed it to the NN Decoder; does it produce an image?
VAE
NN Encoder: input → (m1, m2, m3) and (σ1, σ2, σ3).
Sample (e1, e2, e3) from a normal distribution and set c_i = exp(σ_i) × e_i + m_i.
NN Decoder: (c1, c2, c3) → output.
Minimize the reconstruction error, and also minimize Σ_{i=1}^{3} ( exp(σ_i) − (1 + σ_i) + (m_i)² ).
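A minimal NumPy sketch of the VAE sampling step and the extra minimization term above; the encoder outputs m and σ are made-up numbers, and a full VAE would also need the encoder, decoder, and reconstruction loss.

import numpy as np

def sample_code(m, sigma):
    # c_i = exp(sigma_i) * e_i + m_i, with e drawn from a normal distribution
    e = np.random.randn(*m.shape)
    return np.exp(sigma) * e + m

def latent_penalty(m, sigma):
    # sum_i exp(sigma_i) - (1 + sigma_i) + m_i^2, minimized with the reconstruction error
    return np.sum(np.exp(sigma) - (1.0 + sigma) + m ** 2)

m = np.array([0.5, -0.3, 0.1])        # made-up encoder outputs
sigma = np.array([-1.0, 0.2, -0.5])

print(sample_code(m, sigma))
print(latent_penalty(m, sigma))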
Why VAE?
[Figure: how codes are encoded and decoded in the code space.]
VAE on Cifar-10: https://github.com/openai/iaf
Sentence VAE: sentence → NN Encoder → code → NN Decoder → sentence. Moving through the code space:
i went to the store to buy some groceries.
i store to buy some groceries.
i were to buy any groceries.
……
"come with me," she said.
"talk to me," she said.
"don't worry about it," she said.
Ref: http://www.wired.co.uk/article/google-artificial-intelligence-poetry
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio, Generating Sentences from a Continuous Space, arXiv preprint, 2015
Problems of VAE
• It does not really try to simulate real images: the decoder output only has to be as close as possible to the target, so a realistic image and a fake one can score similarly.
Generative Adversarial Network (GAN)
(Evolution analogy: the generated examples gradually acquire realistic features such as brown color 棕色 and leaf veins 葉脈.)
Ref: https://openai.com/blog/generative-models/
Drawing comic/anime faces (畫漫畫)
• Ref: https://github.com/mattya/chainer-DCGAN
• Ref: http://qiita.com/mattya/items/e5bfe5e04b9d2f0bbd47
Want to practice generative models?
Pokémon Creation
• Small images of 792 Pokémon
• Can the machine learn to create new Pokémon?
• Source of images: http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_base_stats_(Generation_VI)
• Original images are 40 × 40; they are shrunk to 20 × 20.
Pokémon Creation
Each pixel is 1-of-N encoded by its color (0 0 1 0 0 ……).
It is difficult to evaluate generation. [Figure: completions when 50% or 75% of an image is covered, and drawing from scratch.]
Pokémon Creation: drawing from scratch needs some randomness → use a VAE.
The VAE has a 10-dim code: NN Encoder → (m, σ); c_i = exp(σ_i) × e_i + m_i; NN Decoder → output; minimize the reconstruction error.
To see what the code learns: pick two of the 10 code dimensions, fix the other eight, and feed the codes into the NN Decoder to look at the generated images.
Pokémon Creation – Data
• Original images (40 × 40): http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/image.rar
• Pixels (20 × 20): http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/pixel_color.txt
  • Each line corresponds to an image, and each number corresponds to a pixel
• Colormap (index → color): http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Pokemon_creation/colormap.txt
Outline
Unsupervised Learning
• 化繁為簡: Example: Word Vector and Audio Word Vector
• 無中生有
Reinforcement Learning
Scenario of Reinforcement Learning
The agent observes the environment, takes an action that affects the environment, and receives a reward (e.g. "Don't do that" → a negative reward).
The agent learns to take actions that maximize the expected reward.
(Image: http://www.sznews.com/news/content/2013-11/26/content_8800180.htm)
Supervised v.s. Reinforcement
• Supervised: learning from a teacher, e.g. "Hello" → say "Hi"; "Bye bye" → say "Good bye".
• Reinforcement: learning from a critic: after a whole dialogue ("Hello" … ☺ … "Bad"), the agent only knows the outcome was bad, not which sentence was wrong.
Scenario of Reinforcement Learning
The agent learns to take actions to maximize the expected reward, e.g. playing Go: observation = the board, action = the next move; if the agent wins, reward = 1; if it loses, reward = −1; otherwise reward = 0.
Supervised v.s. Reinforcement
• Supervised: learn a function from input to output from labeled examples.
• Reinforcement Learning: the function interacts with the environment and learns from the reward it receives.
Application: Interactive Retrieval
• Interactive retrieval is helpful. [Wu & Lee, INTERSPEECH 16]
The user gives a spoken query (e.g. "Deep Learning"); the goal is better retrieval performance with less user labor, and more interaction helps. The task cannot be addressed by a linear model.
More applications
• Alpha Go, playing video games, dialogue
• Flying helicopter
  • https://www.youtube.com/watch?v=0JL04JJjocc
• Driving
  • https://www.youtube.com/watch?v=0xo1Ldx3L5Q
• Google cuts its giant electricity bill with DeepMind-powered AI
  • http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
To learn deep reinforcement learning …
• Lectures of David Silver
  • http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
  • 10 lectures (1:30 each)
• Deep Reinforcement Learning
  • http://videolectures.net/rldm2015_silver_reinforcement_learning/
Conclusion