
COMPX310-19A
Machine Learning
Chapter 11: Training Deep Neural Networks
An introduction using Python, Scikit-Learn, Keras, and TensorFlow

Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
Training DNNs
 Is rather tricky, often seen as a “black art” rather than a science
 Issues include:
 Vanishing gradients
 Not enough (labelled) training data
 Very slow training times
 High risk of overfitting

 But: there are solutions for all of these nowadays

Vanishing gradients
 The opposite of ‘exploding gradients’: gradients shrink towards zero in the lower layers, and a zero gradient stops learning

Vanishing gradients
 One remedy: initialise weights with mean 0 and a carefully chosen variance (e.g. Glorot/Xavier or He initialisation):

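A minimal Keras sketch of choosing such an initializer (the layer size and the fan_avg variant are illustrative):

from tensorflow import keras

# Glorot (Xavier) initialisation is the Keras default for Dense layers;
# He initialisation is the usual choice for ReLU-family activations.
layer = keras.layers.Dense(100, activation="relu",
                           kernel_initializer="he_normal")

# The same scheme based on fan_avg instead of fan_in:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg",
                                                 distribution="uniform")
layer = keras.layers.Dense(100, activation="relu",
                           kernel_initializer=he_avg_init)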
Leaky ReLU
 Rectified linear unit: the gradient is either 0 or 1, and a constant 0 can cause “death” of units; leaky ReLU keeps a small non-zero slope for negative inputs to avoid this

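A minimal sketch of using leaky ReLU in Keras (the model structure is illustrative):

from tensorflow import keras

# LeakyReLU is added as its own layer after a Dense layer that has no
# activation of its own.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),   # small non-zero slope for negative inputs
    keras.layers.Dense(10, activation="softmax"),
])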
ELU
 Exponential linear unit: more expensive to compute, but sometimes faster to converge

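In Keras, ELU is available as a built-in activation string, for example:

from tensorflow import keras

layer = keras.layers.Dense(300, activation="elu",
                           kernel_initializer="he_normal")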
Self-normalizing NN
 Uses SELU, a scaled ELU, which automatically keeps the output of every layer at mean=0, variance=1, provided:

 Input was scaled to mean=0, variance=1 (use StandardScaler)

 Every hidden layer was initialised with kernel_initializer="lecun_normal"

 Network is ‘sequential’, no complex architecture allowed

 Theoretical guarantees exist only for Dense layers (but in practice it seems to work for convolutional layers and others as well)

 Suggested ranking: SELU > ELU > leaky ReLU > ReLU > tanh > sigmoid
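A sketch of a self-normalizing network in Keras following the conditions above (depth and layer sizes are illustrative):

from tensorflow import keras

# A plain sequential stack of Dense layers with SELU activation and LeCun
# initialisation; fed standard-scaled inputs, it keeps activations near
# mean=0, variance=1.
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="selu",
                                 kernel_initializer="lecun_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))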
Batch Normalization
 Like using an incremental and trainable version of
StandardScaler inside the network:

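A minimal sketch with BatchNorm layers added after each activation (the model structure is illustrative):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),   # effectively standardises the inputs
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])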
Normalizing activation input
 We can normalize after the summation, just before the activation
function is called:

 Notice how Activation is now a separate “layer”


 The bias term is unnecessary, because the BatchNorm layer already learns its own offset per unit
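A sketch of this “BatchNorm before activation” variant:

from tensorflow import keras

# The Dense layers drop their bias (use_bias=False) because BatchNorm already
# learns an offset; the activation function becomes its own layer.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(10, activation="softmax"),
])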
Not enough data?
 Re-use a network trained on similar data (transfer learning)

Keras example
 A: trained on 8 FashionMNIST classes, B: shirts vs. sandals
 Copy the model structure:

 Keep a copy of the old A (training B will also modify A):

 ‘Freeze’ most weights (temporarily); see the sketch below

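A sketch of these three steps, assuming model_A is the already trained 8-class Keras model:

from tensorflow import keras

# Reuse all of A's layers except its output layer, then add a new binary output.
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

# Keep an untouched copy of A, because model_B_on_A shares A's layer objects.
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

# Temporarily freeze the reused layers so only the new output layer is trained.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])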
Keras example cont.
 Train a bit, unfreeze all, train some more

 Why: large errors from the new last layer might otherwise damage the ‘good’ weights copied from A
 Also note the use of ‘optimizer …’: after changing which layers are frozen, the model must be recompiled, typically with a smaller learning rate (see the sketch below)
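Continuing the sketch above (X_train_B, y_train_B, X_valid_B, y_valid_B are placeholders for task B’s data):

# Train the new output layer for a few epochs while the reused layers are frozen.
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

# Unfreeze everything; after changing trainable flags the model must be
# recompiled, here with a much smaller learning rate so that fine-tuning does
# not wreck the reused weights.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-4),
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))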
Nothing similar enough?
 Maybe use unsupervised pretraining

 Or invent an auxiliary task for pretraining, e.g. predict the ‘missing’ word in a tweet first; later try sentiment analysis (happy/sad)

Faster than SGD
 Lots of ideas for SGD improvements:
 Momentum: physics idea, escape local minima, push across plateaus
 Nesterov momentum, or accelerated gradient: smarter variant
 AdaGrad: adaptive gradient, adjusting the learning rate per dimension
 RMSProp: smarter AdaGrad
 Adam: adaptive moment estimation, combines RMSProp + Momentum
 Nadam: Adam + Nesterov

 Current best practice: use Adam first, but be open to alternatives (see the sketch below)
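All of these are available as Keras optimizer classes; a sketch (the hyperparameter values shown are the common defaults, and `model` is a placeholder for an existing model):

from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)                 # momentum
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)  # Nesterov
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
optimizer = keras.optimizers.Nadam(learning_rate=0.001)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer, metrics=["accuracy"])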
Learning rate?
 What is a good learning rate?

Adaptive rates are better
 Many ideas:
 Power scheduling: r, r/2, r/3, r/4, …
 Exponential scheduling: r, 0.1r, 0.01r, …
 Piecewise constant: e.g. 0.1 for 5 epochs, then 0.001 for 50 epochs …
 Performance scheduling: monitor validation set, reduce rate when
validation error stalls

 The book explains how to achieve this in Keras

 Generally, Performance scheduling and Exponential scheduling seem to be the best choices (see the sketch below)

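A sketch of exponential and performance scheduling via Keras callbacks (`model` and the data arrays are placeholders):

from tensorflow import keras

# Exponential scheduling: divide the rate by 10 every s epochs.
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

exp_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay(lr0=0.01, s=20))

# Performance scheduling: halve the rate when the validation loss stops improving.
perf_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

# Pass one of the schedulers (not both) to fit():
history = model.fit(X_train, y_train, epochs=25,
                    validation_data=(X_valid, y_valid),
                    callbacks=[exp_scheduler])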
Regularization
 Can use the usual L2, L1, and L1+L2 regularization:

 Also keras.regularizers.l1() and keras.regularizers.l1_l2()

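For example (the regularization strength 0.01 is illustrative):

from tensorflow import keras

layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))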
DropOut
 Regularization the ’computer science’ way (like early stopping)
 During training, for each mini-batch, ignore some units at random

More on DropOut
 In Keras:

 Can also ‘dropout’ single weights/connections


 AlphaDropout: use for SELUs; preserves the mean and standard deviation (see the sketch below)

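A sketch of both variants in Keras (the dropout rates and model structure are illustrative):

from tensorflow import keras

# Standard dropout: randomly switch off 20% of the previous layer's outputs
# during each training step; at prediction time the layers do nothing.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax"),
])

# For SELU-based self-normalizing networks, use AlphaDropout instead: it
# preserves the mean and standard deviation of its inputs.
alpha_drop = keras.layers.AlphaDropout(rate=0.2)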
MCDropout
 Monte Carlo Dropout:
 Use dropout also during prediction
 Predict multiple times (e.g. 100x)
 Average
 => more reliable prediction plus variance information
 Best way to use in Keras: define your own subclass (see the sketch below)

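A sketch of MC Dropout (`model` and `X_test` are placeholders for a trained dropout model and its test data):

import numpy as np
from tensorflow import keras

# Subclass that keeps dropout active at prediction time as well.
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

# Monte Carlo prediction: run the model many times with dropout switched on
# (training=True forces it for ordinary Dropout layers), then average.
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])
y_proba = y_probas.mean(axis=0)   # more reliable averaged prediction
y_std = y_probas.std(axis=0)      # spread across runs = variance information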
Good defaults
 If you train from scratch: scale the input and use the defaults below:
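A rough sketch of one such default configuration (scaled inputs, He initialisation with ELU activations, Nadam, early stopping); the data arrays are placeholders and the exact defaults may differ from the slide’s table:

from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Scale the inputs to mean=0, variance=1.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

model = keras.models.Sequential([
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                       input_shape=X_train_scaled.shape[1:]),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Nadam(), metrics=["accuracy"])

# Early stopping doubles as regularization.
early_stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid), callbacks=[early_stop])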
