
COMPX310-19A
Machine Learning
Chapter 11: Training Deep Neural Networks
An introduction using Python, Scikit-Learn, Keras, and TensorFlow

Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
Training DNNs
 Is rather tricky, often seen as a “black art” rather than a science
 Issues include:
 Vanishing gradients
 Not enough (labelled) training data
 Very slow training times
 High risk of overfitting

 But: there are solutions for all of these nowadays

Vanishing gradients
 The opposite of ‘exploding gradients’: gradients shrink towards zero in the lower layers, and a zero gradient stops learning

Vanishing gradients
 One remedy: initialise weights with mean 0 and a carefully chosen variance (e.g. Glorot/Xavier or He initialisation):

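A minimal Keras sketch of choosing such an initializer (the layer size and the fan_avg variant are illustrative):

from tensorflow import keras

# Glorot (Xavier) initialisation is the Keras default for Dense layers;
# He initialisation is the usual choice for ReLU-family activations.
layer = keras.layers.Dense(100, activation="relu",
                           kernel_initializer="he_normal")

# The same scheme based on fan_avg instead of fan_in:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg",
                                                 distribution="uniform")
layer = keras.layers.Dense(100, activation="relu",
                           kernel_initializer=he_avg_init)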
Leaky ReLU
 Rectified linear unit: the gradient is either 0 or 1, and a constant 0 can cause “death” of units; leaky ReLU keeps a small non-zero slope for negative inputs to avoid this

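A minimal sketch of using leaky ReLU in Keras (the model structure is illustrative):

from tensorflow import keras

# LeakyReLU is added as its own layer after a Dense layer that has no
# activation of its own.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),   # small non-zero slope for negative inputs
    keras.layers.Dense(10, activation="softmax"),
])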
ELU
 Exponential linear unit: more expensive to compute, but sometimes faster to converge

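In Keras, ELU is available as a built-in activation string, for example:

from tensorflow import keras

layer = keras.layers.Dense(300, activation="elu",
                           kernel_initializer="he_normal")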
Self-normalizing NN
 Uses SELU, a scaled ELU, which automatically keeps the output of every layer at mean=0, variance=1, provided:

 Input was scaled to mean=0, variance=1 (use StandardScaler)

 Every hidden layer was initialised with kernel_initializer="lecun_normal"

 Network is ‘sequential’, no complex architecture allowed

 Theoretical guarantees exist only for Dense layers (but in practice it seems to work for convolutional layers and others as well)

 Suggested ranking: SELU > ELU > leaky ReLU > ReLU > tanh > sigmoid
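A sketch of a self-normalizing network in Keras following the conditions above (depth and layer sizes are illustrative):

from tensorflow import keras

# A plain sequential stack of Dense layers with SELU activation and LeCun
# initialisation; fed standard-scaled inputs, it keeps activations near
# mean=0, variance=1.
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="selu",
                                 kernel_initializer="lecun_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))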
Batch Normalization
 Like using an incremental and trainable version of
StandardScaler inside the network:

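A minimal sketch with BatchNorm layers added after each activation (the model structure is illustrative):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),   # effectively standardises the inputs
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])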
Normalizing activation input
 We can normalize after the summation, just before the activation
function is called:

 Notice how Activation is now a separate “layer”


 The bias term is unnecessary, because the BatchNorm layer already learns its own offset per unit
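A sketch of this “BatchNorm before activation” variant:

from tensorflow import keras

# The Dense layers drop their bias (use_bias=False) because BatchNorm already
# learns an offset; the activation function becomes its own layer.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(10, activation="softmax"),
])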
Not enough data?
 Re-use a network trained on similar data (transfer learning)

Keras example
 A: trained on 8 FashionMNIST classes, B: shirts vs. sandals
 Copy the model structure:

 Keep a copy of the old A (training B will also modify A):

 ‘Freeze’ most weights (temporarily); see the sketch below

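A sketch of these three steps, assuming model_A is the already trained 8-class Keras model:

from tensorflow import keras

# Reuse all of A's layers except its output layer, then add a new binary output.
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

# Keep an untouched copy of A, because model_B_on_A shares A's layer objects.
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

# Temporarily freeze the reused layers so only the new output layer is trained.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])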
Keras example cont.
 Train a bit, unfreeze all, train some more

 Why: large errors from the new last layer might otherwise damage the ‘good’ weights copied from A
 Also note the use of ‘optimizer …’: after changing which layers are frozen, the model must be recompiled, typically with a smaller learning rate (see the sketch below)
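Continuing the sketch above (X_train_B, y_train_B, X_valid_B, y_valid_B are placeholders for task B’s data):

# Train the new output layer for a few epochs while the reused layers are frozen.
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

# Unfreeze everything; after changing trainable flags the model must be
# recompiled, here with a much smaller learning rate so that fine-tuning does
# not wreck the reused weights.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-4),
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))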
Nothing similar enough?
 Maybe use unsupervised pretraining

 Or invent an auxiliary task for pretraining, e.g. predict the ‘missing’ word in a tweet first; later try sentiment analysis (happy/sad)

Faster than SGD
 Lots of ideas for SGD improvements:
 Momentum: physics idea, escape local minima, push across plateaus
 Nesterov momentum, or accelerated gradient: smarter variant
 AdaGrad: adaptive gradient, adjusting the learning rate per dimension
 RMSProp: smarter AdaGrad
 Adam: adaptive moment estimation, combines RMSProp + Momentum
 Nadam: Adam + Nesterov

 Current best practice: use Adam first, but be open to alternatives (see the sketch below)
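All of these are available as Keras optimizer classes; a sketch (the hyperparameter values shown are the common defaults, and `model` is a placeholder for an existing model):

from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)                 # momentum
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)  # Nesterov
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
optimizer = keras.optimizers.Nadam(learning_rate=0.001)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer, metrics=["accuracy"])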
Learning rate?
 What is a good learning rate?

Adaptive rates are better
 Many ideas:
 Power scheduling: r, r/2, r/3, r/4, …
 Exponential scheduling: r, 0.1r, 0.01r, …
 Piecewise constant: e.g. 0.1 for 5 epochs, then 0.001 for 50 epochs …
 Performance scheduling: monitor validation set, reduce rate when
validation error stalls

 The book explains how to achieve this in Keras

 Generally, Performance scheduling and Exponential scheduling seem to be the best choices (see the sketch below)

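A sketch of exponential and performance scheduling via Keras callbacks (`model` and the data arrays are placeholders):

from tensorflow import keras

# Exponential scheduling: divide the rate by 10 every s epochs.
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)
    return exponential_decay_fn

exp_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay(lr0=0.01, s=20))

# Performance scheduling: halve the rate when the validation loss stops improving.
perf_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

# Pass one of the schedulers (not both) to fit():
history = model.fit(X_train, y_train, epochs=25,
                    validation_data=(X_valid, y_valid),
                    callbacks=[exp_scheduler])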
Regularization
 Can use the usual L2, L1, and L1+L2 regularization:

 Also keras.regularizers.l1() and keras.regularizers.l1_l2()

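For example (the regularization strength 0.01 is illustrative):

from tensorflow import keras

layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))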
DropOut
 Regularization the ’computer science’ way (like early stopping)
 During training, for each mini-batch, ignore some units at random

More on DropOut
 In Keras:

 Can also ‘dropout’ single weights/connections


 AlphaDropout: use for SELUs; preserves the mean and standard deviation (see the sketch below)

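A sketch of both variants in Keras (the dropout rates and model structure are illustrative):

from tensorflow import keras

# Standard dropout: randomly switch off 20% of the previous layer's outputs
# during each training step; at prediction time the layers do nothing.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax"),
])

# For SELU-based self-normalizing networks, use AlphaDropout instead: it
# preserves the mean and standard deviation of its inputs.
alpha_drop = keras.layers.AlphaDropout(rate=0.2)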
MCDropout
 Monte Carlo Dropout:
 Use dropout also during prediction
 Predict multiple times (e.g. 100x)
 Average
 => more reliable prediction plus variance information
 Best way to use in Keras: define your own subclass (see the sketch below)

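A sketch of MC Dropout (`model` and `X_test` are placeholders for a trained dropout model and its test data):

import numpy as np
from tensorflow import keras

# Subclass that keeps dropout active at prediction time as well.
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

# Monte Carlo prediction: run the model many times with dropout switched on
# (training=True forces it for ordinary Dropout layers), then average.
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])
y_proba = y_probas.mean(axis=0)   # more reliable averaged prediction
y_std = y_probas.std(axis=0)      # spread across runs = variance information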
Good defaults
 If you train from scratch: scale the input and use the defaults below:
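A rough sketch of one such default configuration (scaled inputs, He initialisation with ELU activations, Nadam, early stopping); the data arrays are placeholders and the exact defaults may differ from the slide’s table:

from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Scale the inputs to mean=0, variance=1.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

model = keras.models.Sequential([
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                       input_shape=X_train_scaled.shape[1:]),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Nadam(), metrics=["accuracy"])

# Early stopping doubles as regularization.
early_stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid), callbacks=[early_stop])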
