DEPARTMENT OF APPLIED MATHEMATICS, COMPUTER SCIENCE AND STATISTICS
HYPERPARAMETER OPTIMIZATION
Big Data Science (Master in Statistical Data Analysis)
PARAMETER OPTIMIZATION
̶ So far, we have talked about parameter optimization:
̶ Our model contains trainable parameters
̶ We define a loss function
̶ An optimization algorithm searches for the parameters that minimize the loss:
‒ Analytic solutions
‒ Newton-Raphson
‒ (Stochastic) gradient descent
‒ ...
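As a quick refresher, below is a minimal gradient-descent sketch for a toy least-squares problem (the data, learning rate and iteration count are all illustrative):

```python
import numpy as np

# Toy least-squares problem: the weights w are the trainable parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Gradient descent on the mean squared error loss.
w = np.zeros(3)
learning_rate = 0.1
for _ in range(500):
    gradient = 2.0 / len(y) * X.T @ (X @ w - y)  # gradient of the MSE w.r.t. w
    w -= learning_rate * gradient

print(w)  # close to w_true
```

Note that the learning rate and the number of iterations are not touched by the gradient update itself: they are fixed beforehand, which is exactly the situation described on the next slide.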
HYPERPARAMETER OPTIMIZATION
̶ Most models also have hyperparameters:
̶ Fixed before training the model
̶ Encode assumptions about the model
̶ Not accounted for in the gradient of the loss function
EXAMPLES OF HYPERPARAMETERS
Linear models:
• Regularization constant

Random Forest:
• Number of trees
• Maximum depth
• Minimum leaf size
• Criterion for split
• Number of features per split
• ...

SVM:
• Kernel
• Margin
• Kernel parameters: polynomial degree, Gaussian kernel width, ...

Neural networks:
• Architecture: number of layers, size of each layer
• Activation function
• Dropout
• Regularization
• ...

KNN:
• $K$
• Distance metric
• Parameters of approximate search structures
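To make this concrete, a brief sketch of how such hyperparameters are fixed before training, here as constructor arguments of scikit-learn estimators (the specific values are arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hyperparameters are passed to the constructor and stay fixed during fitting.
rf = RandomForestClassifier(
    n_estimators=500,        # number of trees
    max_depth=10,            # maximum depth
    min_samples_leaf=5,      # minimum leaf size
    criterion="gini",        # criterion for split
    max_features="sqrt",     # number of features per split
)
svm = SVC(kernel="rbf", C=1.0, gamma=0.1)  # kernel, margin penalty, kernel width
knn = KNeighborsClassifier(n_neighbors=7, metric="euclidean")  # K, distance metric
```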
CHOOSING HYPERPARAMETERS
̶ Manual search
̶ Grid search
̶ Random search
̶ Automated methods:
̶ Bayesian optimization
̶ Evolutionary optimization
MANUAL TUNING
̶ Using assumptions or knowledge to select the hyperparameters
̶ Pros:
̶ Computationally efficient
̶ Cons:
̶ Requires manual labor
̶ Prone to bias
̶ Limited combinations are tested
GRID SEARCH
̶ For each hyperparameter, define a subset of values that will be
tested
̶ Iteratively test all combinations
̶ Pros:
̶ The individual effect of parameters can be studied
̶ Cons:
̶ The number of combinations can become very high
̶ Few values are tested for every parameter
̶ The combined effect of parameters is not completely modeled
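A minimal grid-search sketch with scikit-learn's GridSearchCV (the dataset and the candidate values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Subset of values per hyperparameter; all combinations are evaluated
# (3 x 3 = 9 here), each with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Here 3 × 3 = 9 combinations are evaluated; adding a third hyperparameter with 3 values would already triple that, which is why the number of combinations grows so quickly.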
RANDOM SEARCH
̶ A probability distribution is specified for each hyperparameter
̶ Samples are drawn and tested
̶ Pros:
̶ The combined effect of parameters is somewhat modeled
̶ More values per parameter can be considered
̶ Cons:
̶ The search is not guided
̶ The individual effect of parameters is not clear
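The same search written with RandomizedSearchCV, drawing each hyperparameter from a distribution instead of a fixed grid (again a sketch; the ranges and the budget n_iter are arbitrary):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# One distribution per hyperparameter; n_iter samples are drawn and evaluated.
param_distributions = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_distributions,
                            n_iter=30, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```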
GRID VS RANDOM
J. Bergstra, Y. Bengio, "Random Search for Hyper-Parameter Optimization", Journal of Machine Learning Research 13 (2012) 281-305
HYPERPARAMETER OPTIMIZATION AS AN OPTIMIZATION PROBLEM
AUTOMATED HYPERPARAMETER OPTIMIZATION
̶ Why not solve hyperparameter optimization in the
same way as parameter optimization?
̶ Main approaches:
̶ Bayesian optimization
̶ Evolutionary algorithms
SEQUENTIAL MODEL-BASED BAYESIAN OPTIMIZATION (SMBO)
1. Query the function $f$ at $t$ points and record the resulting pairs $S = \{(\boldsymbol{\theta}_i, f(\boldsymbol{\theta}_i))\}_{i=1}^{t}$
2. For a fixed number of iterations:
   1. Fit a probabilistic model $\mathcal{M}$ to the pairs in $S$
   2. Apply an acquisition function $a(\boldsymbol{\theta}, \mathcal{M})$ to select a promising input $\boldsymbol{\theta}$ to evaluate next
   3. Evaluate $f(\boldsymbol{\theta})$ and add $(\boldsymbol{\theta}, f(\boldsymbol{\theta}))$ to $S$
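A minimal SMBO sketch with a Gaussian-process surrogate and the expected-improvement acquisition function, minimizing a one-dimensional toy objective f that stands in for an expensive validation-loss evaluation (all names and settings are illustrative):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(theta):
    # Stand-in for an expensive objective, e.g. a cross-validated loss.
    return np.sin(3 * theta) + 0.1 * theta ** 2

rng = np.random.default_rng(0)
low, high = -3.0, 3.0

# 1. Query f at t initial points and record the pairs (theta_i, f(theta_i))
theta_obs = rng.uniform(low, high, size=5)
y_obs = np.array([f(t) for t in theta_obs])

for _ in range(20):
    # 2.1 Fit a probabilistic model M (here a Gaussian process) to S
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(theta_obs.reshape(-1, 1), y_obs)

    # 2.2 Acquisition function: expected improvement over a candidate grid
    cand = np.linspace(low, high, 1000).reshape(-1, 1)
    mu, sigma = gp.predict(cand, return_std=True)
    best = y_obs.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    theta_next = cand[np.argmax(ei), 0]

    # 2.3 Evaluate f and add the new pair to S
    theta_obs = np.append(theta_obs, theta_next)
    y_obs = np.append(y_obs, f(theta_next))

print("best theta:", theta_obs[y_obs.argmin()], "loss:", y_obs.min())
```

Libraries such as scikit-optimize, Optuna or Hyperopt implement this loop with different surrogates and acquisition functions, so it rarely needs to be written by hand.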
GENETIC ALGORITHMS
̶ Applying the principles of natural selection to optimization
̶ Solutions are encoded as "chromosomes"
̶ A crossover operator combines two chromosomes into new ones
̶ A mutation operator introduces random mutations
1. Generate an initial population of solutions
2. For a number of generations:
1. Cross over solutions to increase the population size
2. Apply the mutation operator
3. Evaluate new solutions
4. Discard some "bad" solutions to maintain a "good" population
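A bare-bones genetic-algorithm sketch for two continuous hyperparameters; the fitness function is a toy stand-in for cross-validated performance, and the population size, mutation rate and number of generations are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(chrom):
    # Stand-in for cross-validated performance of a model built with the
    # hyperparameters in chrom; higher is better (maximum at (1, -2)).
    c, g = chrom
    return -(c - 1.0) ** 2 - (g + 2.0) ** 2

def crossover(a, b):
    # Uniform crossover: each gene is taken from one of the two parents.
    mask = rng.random(a.shape) < 0.5
    return np.where(mask, a, b)

def mutate(chrom, rate=0.3, scale=0.5):
    # Each gene is perturbed with probability `rate`.
    noise = rng.normal(scale=scale, size=chrom.shape)
    return np.where(rng.random(chrom.shape) < rate, chrom + noise, chrom)

# 1. Initial population of candidate hyperparameter settings
population = rng.uniform(-4, 4, size=(20, 2))

for _ in range(30):
    # 2.1-2.2 Crossover and mutation create offspring
    parents = population[rng.integers(len(population), size=(20, 2))]
    offspring = np.array([mutate(crossover(p, q)) for p, q in parents])
    # 2.3-2.4 Evaluate everyone and keep the best 20 solutions
    pool = np.vstack([population, offspring])
    scores = np.array([fitness(c) for c in pool])
    population = pool[np.argsort(scores)[-20:]]

print("best hyperparameters:", max(population, key=fitness))
```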
PARTITIONING
PARTITIONING FOR HYPERPARAMETER OPTIMIZATION
̶ Remember: NEVER TRAIN ON THE TEST SET
̶ This also holds when optimizing hyperparameters
TEST SET + CROSS VALIDATION
[Diagram: the data is first split into a training set and a held-out test set; the training set is then divided into cross-validation folds, each fold serving once as the validation set while the remaining folds are used for training.]
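In code this pattern usually looks as follows: hold out a test set, tune with cross-validation on the remaining data, and touch the test set only once at the very end (a scikit-learn sketch; dataset and grid are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Held-out test set: never used during hyperparameter tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Cross-validated search on the training portion only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5)
search.fit(X_train, y_train)

# The test set is used exactly once, for the final performance estimate.
print(search.best_params_, search.score(X_test, y_test))
```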
NESTED CROSS VALIDATION
[Diagram: nested cross-validation; each outer fold serves once as the test set, and the remaining folds are split again into inner folds, each serving once as the validation set for hyperparameter tuning.]
NESTED CROSS VALIDATION: EXAMPLE
̶ 5 folds
̶ 3 classifiers: Logistic Regression, Random Forest, SVM
̶ We want to know which classifier is better suited to our problem
̶ We also want to optimize the hyperparameters of each classifier
̶ 3 inner folds for hyperparameter optimization
̶ The ultimate goal is to have a system in production doing real
predictions
NESTED CROSS VALIDATION: EXAMPLE
1. For each outer fold i in [1...5]:
   1. Validation set: fold i
   2. Training set: folds {1,2,3,4,5}\{i}
   3. Split the training set into 3 inner folds
   4. For each classifier $C$ in {LR, RF, SVM}:
      1. For each combination of hyperparameters $\theta_c$ for $C$:
         1. For each inner fold j in [1...3]:
            1. (Inner) validation set: fold j
            2. (Inner) training set: folds {1,2,3}\{j}
            3. Train classifier $C(\theta_c)$ on the inner training set
            4. Evaluate $C(\theta_c)$ on the inner validation set
         2. Calculate the average performance of $C(\theta_c)$ across the 3 inner folds
      2. Select the best-performing hyperparameters $\theta_c^{*(i)}$ for classifier $C$
      3. Evaluate $C(\theta_c^{*(i)})$ on the (outer) validation set
2. Calculate the average performance of each $C(\theta_c^{*(i)})$ across all outer validation folds
3. Select the best classifier $C^*$
4. Select $\theta_{C^*}^*$ as the optimal hyperparameters for $C^*$
5. Train $C^*(\theta_{C^*}^*)$ on the entire dataset

̶ Note that the best parameters $\theta_c^{*(i)}$ for each classifier depend on the outer fold that was used for training
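The same scheme can be written compactly with scikit-learn by nesting a hyperparameter search inside an outer cross-validation; a sketch under the example's 5 outer / 3 inner folds (dataset and hyperparameter grids are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)

candidates = {
    "LR": (LogisticRegression(max_iter=5000), {"C": [0.1, 1, 10]}),
    "RF": (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}),
    "SVM": (SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}),
}

for name, (estimator, grid) in candidates.items():
    # Inner loop: hyperparameter search on the training folds.
    inner_search = GridSearchCV(estimator, grid, cv=inner_cv)
    # Outer loop: each outer fold is used once as the validation set.
    scores = cross_val_score(inner_search, X, y, cv=outer_cv)
    print(name, scores.mean())
```

The classifier with the best averaged outer score would then be re-tuned and retrained on the entire dataset, as in steps 3 to 5 above.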