Hyper-parameter estimation method with particle swarm optimization
Yaru Li, Yulai Zhang, Xiaohan Wei
Department of Software Engineering, Zhejiang University of Science and Technology, Hangzhou, China
arXiv:2011.11944v2 [cs.LG] 14 Dec 2020
Abstract
Particle swarm optimization (PSO) methods cannot be used directly in the problem of hyper-parameter estimation, since the mathematical formulation of the mapping from hyper-parameters to the loss function or the generalization accuracy is unclear. The Bayesian optimization (BO) framework is capable of converting the optimization of the hyper-parameters into the optimization of an acquisition function. The acquisition function is non-convex and multi-peaked, so the problem can be better solved by PSO. The method proposed in this paper uses the particle swarm method to optimize the acquisition function in the BO framework and thus obtain better hyper-parameters. The performance of the proposed method on both classification and regression models is evaluated and demonstrated. The results on several benchmark problems are improved.
Keywords: particle swarm, Bayesian optimization, hyper-parameters
1. Introduction
Particle swarm optimization (PSO) [1] methods have been successfully used for the estimation of model parameters in the field of machine learning [2][3]. However, when it comes to the problem of hyper-parameter estimation [4][5], particle swarm methods, as well as many other optimization methods, cannot be applied directly. The difficulty lies in the fact that the mapping from the hyper-parameters of a model to the loss function or the generalization error lacks an explicit mathematical expression, and evaluating it is computationally expensive.
2. Preliminaries
2.1. Particle Swarm Optimization
PSO is a method based on swarm intelligence, first proposed by Kennedy and Eberhart in 1995 [1][12]. Because of its simplicity of implementation, the PSO algorithm has been successfully applied in machine learning, signal processing, adaptive control and so on [13].
As a first step, a population of m particles is initialized randomly; each particle is a potential solution to the problem in the search space. In each iteration, the velocity and position of each particle are updated using two values: one is the best position found so far by the particle itself (pb), and the other is the best position found so far by the whole population (gb). Suppose there are m particles in the d-dimensional search space. The velocity and position of the i-th particle at time t are updated as

v_i(t + 1) = ω v_i(t) + c1 r1 (pb_i − x_i(t)) + c2 r2 (gb − x_i(t))    (1)

x_i(t + 1) = x_i(t) + v_i(t + 1)    (2)

where ω is the inertia weight, c1 and c2 are the acceleration coefficients, and r1, r2 are random numbers drawn uniformly from [0, 1].

The PSO procedure (Algorithm 1) can be summarized as follows:
Step 1. Initialize a population of m particles with random positions and velocities in the search space;
Step 2. Evaluate the fitness of each particle;
Step 3. Compare each particle's fitness value with its pb. If the current value is better than pb, set pb to the current value and the pb location to the current location;
Step 4. Compare the fitness value with the population's overall previous best. If the current value is better than gb, reset gb to the current particle's array index and value;
Step 5. Update the position and velocity of the particle according to equations (1)(2);
Step 6. Go to Step 2 until a stopping criterion is met.
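To make equations (1)-(2) and Steps 1-6 concrete, the following is a minimal Python sketch of the PSO loop; the function names, the toy objective and the bound handling are illustrative assumptions, not the authors' implementation.

import numpy as np

def pso_minimize(f, bounds, n_particles=20, n_iters=100,
                 w=0.8, c1=1.85, c2=2.0, seed=0):
    # f: objective to minimize (for maximization, pass the negated function)
    # bounds: sequence of (lower, upper) limits, one pair per dimension
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    lo, hi = bounds[:, 0], bounds[:, 1]
    d = len(bounds)

    # Step 1: random initial positions and velocities
    x = rng.uniform(lo, hi, size=(n_particles, d))
    v = rng.uniform(-(hi - lo), hi - lo, size=(n_particles, d))

    # Step 2: initial fitness, personal bests (pb) and global best (gb)
    pb = x.copy()
    pb_val = np.array([f(p) for p in x])
    gb = pb[np.argmin(pb_val)].copy()

    for _ in range(n_iters):
        r1 = rng.random((n_particles, d))
        r2 = rng.random((n_particles, d))
        v = w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)   # equation (1)
        x = np.clip(x + v, lo, hi)                            # equation (2), kept inside bounds
        val = np.array([f(p) for p in x])
        better = val < pb_val                                 # Steps 3-4: update pb and gb
        pb[better], pb_val[better] = x[better], val[better]
        gb = pb[np.argmin(pb_val)].copy()
    return gb, pb_val.min()

# Example: minimize a simple quadratic over [-5, 5]^2
best_x, best_val = pso_minimize(lambda z: float(np.sum(z ** 2)),
                                [[-5.0, 5.0], [-5.0, 5.0]])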
2.2. Bayesian Optimization
Bayesian optimization (BO) treats the objective f as an expensive black-box function. A probabilistic surrogate model is placed on f, and at each iteration the next evaluation point is chosen by maximizing an acquisition function α_t built from the surrogate model and the observed data D:

x_{t+1} = argmax_x α_t(x; D)    (4)
There are three commonly used acquisition functions: probability of improvement (PI), expected improvement (EI) and upper confidence bound (UCB). These acquisition functions trade off exploration against exploitation.
In recent years, BO has been widely used in machine learning for hyper-parameter estimation and automatic model selection [16][17][18][19][20], which has promoted research on BO methods for hyper-parameter estimation in many respects.
3. PSO-BO
The BO algorithm based on PSO is an iterative process. First, Algorithm 1 is used to optimize the acquisition function and obtain x_{t+1}; then, the objective function value is evaluated according to y_{t+1} = f(x_{t+1}) + ε; finally, D is updated with the new sample point {(x_{t+1}, y_{t+1})}, and the posterior distribution of the probabilistic surrogate model is updated for the next iteration.
With a Gaussian process (GP) surrogate, the observed values and the value y_{t+1} at a new point x_{t+1} are jointly Gaussian:

y_{1:t+1} ∼ N( 0, [ K + σ_ε^2 I      k
                    k^T              k(x_{t+1}, x_{t+1}) ] )    (5)

where k : X × X → R is the covariance function, k = [k(x_1, x_{t+1}), · · · , k(x_t, x_{t+1})]^T, and K is the Gram matrix

K = [ k(x_1, x_1)  · · ·  k(x_1, x_t)
      · · ·        · · ·  · · ·
      k(x_t, x_1)  · · ·  k(x_t, x_t) ]
I is the identity matrix and σ_ε^2 is the noise variance. The prediction at the new point can be made by conditioning on the original observations: the posterior distribution of y_{t+1} is Gaussian with mean and variance

μ_t(x_{t+1}) = k^T (K + σ_ε^2 I)^{-1} y_{1:t}    (6)

σ_t^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) − k^T (K + σ_ε^2 I)^{-1} k    (7)

The first choice we need to make is the covariance function; in this work the Matérn 5/2 kernel is adopted:

K_{M52}(x, x') = θ_0 ( 1 + √(5 r^2(x, x')) + (5/3) r^2(x, x') ) exp{ −√(5 r^2(x, x')) }    (8)

where r^2(x, x') is the length-scale weighted squared distance between x and x' and θ_0 is the amplitude hyper-parameter.
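As an illustration of how such a GP surrogate can be realized in practice, the sketch below fits a zero-mean GP with a Matérn 5/2 kernel (equation (8)) plus a noise term and queries the posterior mean and standard deviation of equations (6)-(7) at a new point; the toy data and hyper-parameter values are assumptions for illustration only.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Zero-mean GP prior with Matern-5/2 covariance and observation noise sigma_eps^2
kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=2.5) \
         + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

# Observed data D = {(x_i, y_i)}, i = 1..t, from the black-box objective
X_obs = np.array([[0.1], [0.4], [0.9]])
y_obs = np.sin(3.0 * X_obs).ravel()
gp.fit(X_obs, y_obs)

# Posterior mean mu_t(x_{t+1}) and standard deviation sigma_t(x_{t+1})
x_new = np.array([[0.6]])
mu, sigma = gp.predict(x_new, return_std=True)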
The second choice we need to make is the acquisition function. Although our method is applicable to most acquisition functions, in our experiments we use UCB, which is one of the most popular choices. GP-UCB was proposed by Srinivas et al. in 2009 [21]. The UCB strategy tries to increase the value of the upper confidence bound of the surrogate model as much as possible, and its acquisition function is as follows:

α_t(x; D) = μ_t(x) + √(β_t) σ_t(x)    (9)

where β_t > 0 is a parameter that balances exploration and exploitation.
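For reference, a minimal implementation of this acquisition on top of a fitted scikit-learn GP could look as follows; the fixed beta value is an assumption, since the paper does not specify it.

import numpy as np

def ucb(x, gp, beta=2.0):
    # alpha_t(x) = mu_t(x) + sqrt(beta) * sigma_t(x), evaluated on a fitted GP
    x = np.atleast_2d(x)
    mu, sigma = gp.predict(x, return_std=True)
    return mu + np.sqrt(beta) * sigma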
3.2. Algorithm framework
PSO-BO consists of the following steps: (i) assume a surrogate model for
the black box function f , (ii) define an acquisition function α based on the
surrogate model of f , and maximize α by the PSO to decide the next eval-
uation point , (iii) observe the objective function at the point specified by α
maximization, and update the GP model using the observed data. PSO-BO
algorithm repeat (ii) and (iii) above until it meets the stopping conditions.
The algorithm framework is as follows
Algorithm 2 PSO-BO
Input: surrogate model for f, acquisition function α
Output: optimal hyper-parameter vector x∗
Step 1. Initialize the hyper-parameter vector x_0;
Step 2. For t = 1, 2, ..., T do:
Step 3. Use Algorithm 1 to maximize the acquisition function to get the next evaluation point: x_{t+1} = argmax_{x∈X} α(x|D);
Step 4. Evaluate the objective function value y_{t+1} = f(x_{t+1}) + ε_{t+1};
Step 5. Update the data: D_{t+1} = D ∪ {(x_{t+1}, y_{t+1})}, and update the surrogate model;
Step 6. End for.
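Putting the pieces together, a compact sketch of Algorithm 2 could look as follows; it reuses the pso_minimize and ucb helpers sketched above, and all names and default values are illustrative rather than the authors' implementation.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

def pso_bo(objective, bounds, n_init=5, n_iters=20, beta=2.0, seed=0):
    # objective: black-box function of a hyper-parameter vector, to be maximized.
    # pso_minimize and ucb are the helpers sketched in Sections 2.1 and 3.
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    lo, hi = bounds[:, 0], bounds[:, 1]

    # Step 1: initial design, evaluated on the true objective
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))
    y = np.array([objective(x) for x in X])

    kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=2.5) \
             + WhiteKernel(noise_level=1e-3)

    for _ in range(n_iters):                                   # Step 2: main loop
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
        # Step 3: maximize the UCB acquisition with PSO (minimize its negation)
        x_next, _ = pso_minimize(lambda z: -float(ucb(z, gp, beta)[0]), bounds)
        y_next = objective(x_next)                             # Step 4: evaluate f at x_{t+1}
        X = np.vstack([X, x_next])                             # Step 5: update D and refit next round
        y = np.append(y, y_next)

    best = int(np.argmax(y))
    return X[best], y[best]                                    # best hyper-parameter vector found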
4. Experiments
PSO can guarantee the convergence of the algorithm when (ω, c1, c2) are chosen within the stability region −1 < ω < 1, 0 < c1 + c2 < 4(1 + ω) [22][23]. c1 and c2 affect the expectation and variance of the particle positions: the smaller the variance, the more concentrated the optimization results and the better the stability of the optimization. [1] studied the influence of the values of c1 and c2 on the expectation and variance of the position for the purpose of reducing the variance.
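As a quick sanity check, the stability condition can be verified directly for the parameter values chosen later in this section (ω = 0.8, c1 = 1.85, c2 = 2): c1 + c2 = 3.85 < 4(1 + 0.8) = 7.2.

def in_stability_region(w, c1, c2):
    # stability region from [22][23]: -1 < w < 1 and 0 < c1 + c2 < 4 * (1 + w)
    return -1.0 < w < 1.0 and 0.0 < c1 + c2 < 4.0 * (1.0 + w)

assert in_stability_region(0.8, 1.85, 2.0)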
In order to demonstrate the performance of the proposed PSO-BO algorithm, two different datasets were analyzed with an AdaBoost regressor, a random forest (RF) classifier and an XGBoost classifier [24]. Both datasets were randomly split into train/validation sets. The zero mean function and the Matérn 5/2 covariance function (8) were adopted as the prior for the GP. In the experiments we used the AdaBoost regressor, RF classifier and XGBoost classifier implementations available in Scikit-learn.
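To connect these experiments with the PSO-BO sketch above, the objective passed to the optimizer can simply wrap a cross-validated score, as in the sketch below for an RF classifier on the Digits dataset. The concrete hyper-parameter names and their order are an assumption for illustration, since Table 1 is not reproduced here; only the first component is treated as real-valued and the others are rounded to integers, as stated in the text.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_data, y_data = load_digits(return_X_y=True)

def rf_objective(h):
    # h = [max_features (real), n_estimators, max_depth, min_samples_split] -- assumed names
    clf = RandomForestClassifier(
        max_features=float(np.clip(h[0], 0.01, 1.0)),
        n_estimators=int(round(h[1])),
        max_depth=int(round(h[2])),
        min_samples_split=max(2, int(round(h[3]))),
        random_state=0,
    )
    # 5-fold cross-validated accuracy, as used in the experiments
    return cross_val_score(clf, X_data, y_data, cv=5).mean()

# e.g. best_h, best_acc = pso_bo(rf_objective, bounds=[[0.1, 1.0], [10, 200], [2, 30], [2, 10]])
# (the ranges above are illustrative, not the ones from Table 1)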
4.1. Data sets and setups
Two datasets are used in this experiment: the Boston housing dataset and the Digits dataset [25], both available in Scikit-learn. The Boston housing dataset contains 506 examples, each with 13 dimensions, and is used for the regression task. The Digits dataset contains 1797 examples, each with 64 dimensions, and is used for the classification task.
We first estimated four hyper-parameters of the RF classifier model trained on the Boston housing dataset in order to set the value of ω. A table with the hyper-parameters to be optimized, their range and their type is displayed in Table 1. Note that while the first parameter takes real values, the others take integer values. In the process of optimizing the hyper-parameters of the RF classifier with PSO-BO, ω was varied from 0.1 to 0.9, and all other parameters were kept the same to ensure a fair comparison. The experiments were repeated 10 times. We use accuracy to measure performance on the classification task, and the averaged results are shown in Fig. 1. The vertical axis represents the accuracy on the validation set of the Boston housing dataset, and 5-fold cross validation was used on the dataset. As shown in Fig. 1, the accuracy is highest when ω = 0.8; however, as ω grows, the time required for the optimization process also increases significantly. According to the study of [1] on the parameter settings of the PSO algorithm, PSO has the fastest convergence rate when ω is between 0.8 and 1.2, so the parameters of the PSO algorithm in PSO-BO were set to c1 = 1.85, c2 = 2 and ω = 0.8.
Figure 1: Experimental results of the developed PSO-BO approach with different values
of ω. Horizontal-axis: values of ω ; Vertical-axis: the accuracy on the validation set of RF
classifier model.
Table 2: Types and range of hyper-parameters of AdaBoost
Name Type Range
Learning rate Real (0.1, 1)
Number of estimators Integer (10, 250)
Figure 2: Experimental results of the developed PSO-BO approach with different values
of ω. Horizontal-axis: values of ω ; Vertical-axis: the R2 on the validation set of AdaBoost
regressor model.
4.3. Hyper-parameter Optimization of RF
In this section, we estimated four hyper-parameters (d = 4) of the RF classifier using the Digits dataset. The four hyper-parameters are estimated by PSO-BO, L-BFGS-B-BO and TNC-BO respectively. The range and type of the hyper-parameters to be optimized are displayed in Table 1. Note that while the first parameter takes real values, the others take integer values. As shown in Table 4, comparing the averaged, maximum and minimum results, PSO-BO outperforms the other two algorithms under comparison in terms of classification accuracy on the validation set of the Digits dataset; 5-fold cross validation was used on the dataset. In Fig. 3, the vertical axis represents the accuracy; it can be seen that PSO-BO with ω = 0.8 performed better than the other settings.
Figure 3: Experimental results of the developed PSO-BO approach with different values
of ω. Horizontal-axis: values of ω ; Vertical-axis: the accuracy on the validation set of RF
classifier model.
4.4. Hyper-parameter Optimization of XGBoost
In this section, we estimated five hyper-parameters (d = 5) of the XGBoost classifier using the Digits dataset. The five hyper-parameters are estimated by PSO-BO, L-BFGS-B-BO and TNC-BO respectively. The range and type of the hyper-parameters to be optimized are displayed in Table 5. Note that while the first and second parameters take real values, the others take integer values. As shown in Table 6, comparing the averaged, maximum and minimum results, PSO-BO outperforms the other two algorithms under comparison in terms of classification accuracy on the validation set of the Digits dataset; 5-fold cross validation was used on the dataset. In Fig. 4, the vertical axis represents the accuracy; it can be seen that PSO-BO with ω = 0.8 performed better than the other settings.
5. Conclusion
In this paper, we developed a new approach, PSO-BO, based on the PSO algorithm. In the PSO-BO framework, the PSO method is used to optimize the acquisition function to obtain new evaluation points, which significantly reduces the computational cost.
Figure 4: Experimental results of the developed PSO-BO approach with different values
of ω. Horizontal-axis: values of ω ; Vertical-axis: the accuracy on the validation set of
XGBoost classifier model.
Acknowledgements
This work is supported by NSFC-61803337, NSFC-61803338, ZJSTF-
LGF18F020011.
References
[1] Q. Bai, Analysis of particle swarm optimization algorithm, Computer
and information science 3 (1) (2010) 180.
[3] M. Schwaab, E. C. Biscaia Jr, J. L. Monteiro, J. C. Pinto, Nonlinear
parameter estimation through particle swarm optimization, Chemical
Engineering Science 63 (6) (2008) 1542–1552.
[10] D. C. Liu, J. Nocedal, On the limited memory BFGS method for large
scale optimization, Mathematical programming 45 (1-3) (1989) 503–528.
[14] J. Snoek, H. Larochelle, R. P. Adams, Practical bayesian optimization
of machine learning algorithms, in: Advances in neural information pro-
cessing systems, 2012, pp. 2951–2959.
[23] Y.-L. Zheng, L.-H. Ma, L.-Y. Zhang, J.-X. Qian, On the convergence
analysis and parameter selection in particle swarm optimization, in: Pro-
ceedings of the 2003 International Conference on Machine Learning and
Cybernetics (IEEE Cat. No. 03EX693), Vol. 3, IEEE, 2003, pp. 1802–
1807.
[24] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.