COMPX310-19A Machine Learning Chapter 11: Training Deep Neural Networks
Machine Learning
Chapter 11: Training Deep Neural Networks
An introduction using Python, Scikit-Learn, Keras, and TensorFlow
Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
Training DNNs
Rather tricky, often seen as a "black art" rather than a science
Issues include:
Vanishing gradients
Not enough (labelled) training data
Very slow training times
High risk of overfitting
Vanishing gradients
The opposite of "exploding gradients": as gradients shrink towards zero, the lower layers stop learning
Vanishing gradients
One remedy: initialize weights with zero mean and a variance scaled to the layer size (Glorot or He initialization):
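A minimal Keras sketch of such an initialization, assuming a ReLU-family activation (for which He initialization is the usual choice); the layer size is illustrative:

    from tensorflow import keras

    # Dense layer whose weights are drawn with zero mean and a variance
    # scaled to the layer's fan-in (He initialization), which helps keep
    # gradients from vanishing with ReLU-family activations.
    layer = keras.layers.Dense(100, activation="relu",
                               kernel_initializer="he_normal")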
Leaky ReLU
Rectified linear unit: the gradient is either 0 or 1; a constant 0 can cause units to "die". Leaky ReLU instead gives negative inputs a small non-zero slope.
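A minimal sketch of Leaky ReLU in Keras (tf.keras 2.x argument name; layer sizes illustrative); it is added as its own layer after a Dense layer that has no activation of its own:

    from tensorflow import keras

    # Leaky ReLU keeps a small slope (alpha) for negative inputs, so units
    # can recover instead of "dying" with a permanently zero gradient.
    model = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.Dense(300, kernel_initializer="he_normal"),
        keras.layers.LeakyReLU(alpha=0.2),   # applied as a separate layer
        keras.layers.Dense(10, activation="softmax"),
    ])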
ELU
Exponential linear unit: more expensive to compute, but sometimes faster to converge
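In Keras, ELU is available directly as an activation string; a minimal sketch:

    from tensorflow import keras

    # ELU is smooth for negative inputs, at the cost of an exponential.
    layer = keras.layers.Dense(100, activation="elu",
                               kernel_initializer="he_normal")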
Self normalizing NN
Uses SELU, a scaled ELU, which automatically keeps the output of every layer at mean 0 and variance 1, provided:
the inputs are standardized (mean 0, variance 1)
the weights use LeCun normal initialization
the model is a plain, sequential stack of Dense layers
Suggested ranking: SELU > ELU > Leaky ReLU > ReLU > tanh > sigmoid
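A minimal self-normalizing sketch in Keras (depth and widths illustrative): SELU activations with LeCun normal initialization in a plain stack of Dense layers, assuming standardized inputs.

    from tensorflow import keras

    # Plain stack of Dense layers with SELU + LeCun normal initialization;
    # the inputs are assumed to be standardized (mean 0, variance 1).
    model = keras.models.Sequential()
    model.add(keras.layers.Flatten(input_shape=[28, 28]))
    for _ in range(20):
        model.add(keras.layers.Dense(100, activation="selu",
                                     kernel_initializer="lecun_normal"))
    model.add(keras.layers.Dense(10, activation="softmax"))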
Batch Normalization
Like using an incremental and trainable version of StandardScaler inside the network:
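A minimal sketch with BatchNormalization layers between the Dense layers (layer sizes illustrative):

    from tensorflow import keras

    # Each BatchNormalization layer standardizes its inputs per mini-batch
    # and learns a scale and an offset, like a trainable StandardScaler.
    model = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(10, activation="softmax"),
    ])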
Normalizing activation input
We can normalize after the summation, just before the activation function is called:
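A sketch of this variant: the Dense layers omit their bias (Batch Normalization supplies its own offset) and the activation is applied as a separate layer after BN.

    from tensorflow import keras

    # Batch Normalization after the linear summation, before the activation.
    model = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
        keras.layers.BatchNormalization(),
        keras.layers.Activation("elu"),
        keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
        keras.layers.BatchNormalization(),
        keras.layers.Activation("elu"),
        keras.layers.Dense(10, activation="softmax"),
    ])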
Keras example
A: trained on 8 FashionMNIST classes, B: shirts vs. sandals
Copy the model structure:
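A sketch of the copy, assuming model A was saved beforehand (the file name is a placeholder):

    from tensorflow import keras

    # Load the pretrained 8-class model A (placeholder file name).
    model_A = keras.models.load_model("my_model_A.h5")

    # Reuse every layer except A's output layer, then add a new binary output.
    model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
    model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

    # Optional: clone A (and copy its weights) first, so training B does not
    # also modify A's layers, which are shared objects here.
    model_A_clone = keras.models.clone_model(model_A)
    model_A_clone.set_weights(model_A.get_weights())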
Keras example cont.
Freeze the copied layers, train a bit, unfreeze all, then train some more (see the sketch below)
Why: large errors from the new last layer might otherwise also damage the "good" weights from the copied layers
Also note the use of "optimizer …": the learning rate is lowered after unfreezing
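A sketch of the two-stage schedule, assuming model_B_on_A from the previous slide and task-B data (X_train_B, y_train_B, X_valid_B, y_valid_B) are already defined; epoch counts and learning rates are illustrative.

    from tensorflow import keras

    # Stage 1: freeze the reused layers so the new output layer's large
    # initial errors cannot wreck the copied weights.
    for layer in model_B_on_A.layers[:-1]:
        layer.trainable = False
    model_B_on_A.compile(loss="binary_crossentropy",
                         optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                         metrics=["accuracy"])
    model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                     validation_data=(X_valid_B, y_valid_B))

    # Stage 2: unfreeze everything, recompile with a smaller learning rate,
    # and fine-tune for a few more epochs.
    for layer in model_B_on_A.layers[:-1]:
        layer.trainable = True
    model_B_on_A.compile(loss="binary_crossentropy",
                         optimizer=keras.optimizers.SGD(learning_rate=1e-4),
                         metrics=["accuracy"])
    model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                     validation_data=(X_valid_B, y_valid_B))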
Nothing similar enough?
Maybe use unsupervised pretraining
Faster than SGD
Lots of ideas for SGD improvements:
Momentum: physics idea, escape local minima, push across plateaus
Nesterov momentum, or accelerated gradient: smarter variant
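Both are options of the SGD optimizer in Keras (learning rate illustrative):

    from tensorflow import keras

    # Momentum optimization; set nesterov=True for the accelerated variant.
    optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9,
                                     nesterov=True)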
Learning rate?
What is a good learning rate?
Adaptive rates are better
Many ideas:
Power scheduling: r, r/2, r/3, r/4, …
Exponential scheduling: r, 0.1r, 0.01r, …
Piecewise constant: e.g. 0.1 for 5 epochs, then 0.001 for 50 epochs …
Performance scheduling: monitor the validation set and reduce the rate when the validation error stalls (see the sketch below)
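A sketch of two of these in Keras (constants illustrative): exponential scheduling via a LearningRateScheduler callback, and performance scheduling via ReduceLROnPlateau; both are passed to fit() through callbacks=[...].

    from tensorflow import keras

    # Exponential scheduling: multiply the rate by 0.1 every s epochs.
    def exponential_decay(lr0=0.01, s=20):
        def schedule(epoch):
            return lr0 * 0.1 ** (epoch / s)
        return schedule

    exp_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay())

    # Performance scheduling: halve the rate when the validation loss stalls.
    plateau_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)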
Regularization
Can use the usual L2, L1, and L1+L2 regularization:
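A minimal sketch for one layer (regularization factor illustrative); keras.regularizers.l1() and l1_l2() work the same way:

    from tensorflow import keras

    # L2 (ridge) penalty on this layer's weights, added to the training loss.
    layer = keras.layers.Dense(100, activation="elu",
                               kernel_initializer="he_normal",
                               kernel_regularizer=keras.regularizers.l2(0.01))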
DropOut
Regularization the ’computer science’ way (like early stopping)
During training, for each mini-batch, ignore some units at random
More on DropOut
In Keras:
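A minimal sketch (dropout rate and layer sizes illustrative): a Dropout layer randomly zeroes a fraction of its inputs during training and is a no-op at test time.

    from tensorflow import keras

    model = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.Dropout(rate=0.2),
        keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
        keras.layers.Dropout(rate=0.2),
        keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
        keras.layers.Dropout(rate=0.2),
        keras.layers.Dense(10, activation="softmax"),
    ])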
MCDropout
Monte Carlo Dropout:
Use dropout also during prediction
Predict multiple times (e.g. 100x)
Average
=> more reliable predictions plus variance (uncertainty) information
Best way to use it in Keras: define your own Dropout subclass (sketched below)
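A sketch of such a subclass; the usage shown in the comments assumes a trained model built with MCDropout layers and some test set X_test (both placeholders).

    import numpy as np
    from tensorflow import keras

    # Dropout that stays active at prediction time as well.
    class MCDropout(keras.layers.Dropout):
        def call(self, inputs):
            return super().call(inputs, training=True)

    # Usage (placeholders): stack e.g. 100 stochastic predictions and average;
    # the spread across runs gives a rough uncertainty estimate.
    # y_probas = np.stack([model.predict(X_test) for _ in range(100)])
    # y_proba = y_probas.mean(axis=0)   # averaged prediction
    # y_std = y_probas.std(axis=0)      # variance information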
Good defaults
If you train from scratch: scale the input and use the suggested defaults (He initialization, ELU activation, Batch Normalization for deep networks, early stopping, and a momentum-based optimizer):
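A minimal sketch in that spirit (layer sizes, rates, and patience are illustrative): scaled inputs assumed, He initialization, ELU activations, a Nesterov momentum optimizer, and early stopping.

    from tensorflow import keras

    model = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
        keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=keras.optimizers.SGD(learning_rate=0.01,
                                                 momentum=0.9, nesterov=True),
                  metrics=["accuracy"])
    # Early stopping keeps the best weights seen on the validation set.
    early_stopping = keras.callbacks.EarlyStopping(patience=10,
                                                   restore_best_weights=True)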