
Hyper-parameter estimation method with particle swarm optimization

Yaru Li, Yulai Zhang, Xiaohan Wei
Department of Software Engineering, Zhejiang University of Science and Technology, Hangzhou, China, 310023, zhangyulai@zust.edu.cn

arXiv:2011.11944v2 [cs.LG] 14 Dec 2020

Abstract
Particle swarm optimization (PSO) methods cannot be directly applied to the problem of hyper-parameter estimation, since the mapping from hyper-parameters to the loss function or generalization accuracy has no explicit mathematical formulation. The Bayesian optimization (BO) framework converts the optimization of the hyper-parameters into the optimization of an acquisition function. The acquisition function is non-convex and multi-peak, so the problem can be solved well by PSO. The method proposed in this paper uses the particle swarm method to optimize the acquisition function in the BO framework in order to obtain better hyper-parameters. The performance of the proposed method is evaluated and demonstrated on both classification and regression models, and the results on several benchmark problems are improved.
Keywords: particle swarm, Bayesian optimization, hyper-parameters

1. Introduction
Particle swarm optimization (PSO) [1] methods have been successfully used for the estimation of model parameters in the field of machine learning [2][3]. However, when it comes to the problem of hyper-parameter estimation [4][5], particle swarm methods, as well as many other optimization methods, cannot be applied directly. The difficulty lies in the fact that the mapping from the hyper-parameters of a model to its loss function or generalization error lacks an explicit mathematical expression, and its evaluation is computationally expensive.



Therefore, naive methods such as grid search [6] and random search [7] have traditionally been used for hyper-parameter estimation in engineering practice. These methods run large numbers of independent experiments under different hyper-parameter guesses and then pick the best hyper-parameters. Recently, the Bayesian optimization (BO) framework [8][9] has been proposed to deal with the hyper-parameter estimation problem in the machine learning community. BO tries to find the optimal value of a black-box function by constructing a posterior probability of the black-box function's output from a finite number of sample points obtained from experiments. A surrogate model is used to construct the mapping from the hyper-parameters to the model accuracy. BO then turns the optimization of the hyper-parameters into an optimization problem over an acquisition function [8][9]. The acquisition function describes the likelihood that a point is the maximum or minimum of the generalization accuracy or error of the model. The mathematical expression of the acquisition function may be high dimensional and may have many local minima.
The particle swarm method fits this task well. In this paper, we use the PSO method to solve the optimization problem of the acquisition function. In existing works, gradient-based optimization methods such as L-BFGS-B [10] and TNC [11] are used in the BO framework. Calculating the first and second derivatives of the acquisition function is computationally expensive, and reaching the global optimum cannot be guaranteed. If PSO can return better results on the acquisition function, the generalization accuracy of the machine learning model can be improved with high probability.
The rest of this paper is organized as follows. The preliminaries of PSO and BO are described in Section 2; Section 3 introduces the proposed method in detail; Section 4 verifies the performance of the algorithm by experiments; conclusions are offered in Section 5.

2. Preliminaries
2.1. Particle Swarm Optimization
PSO is a method based on swarm intelligence, first proposed by Kennedy and Eberhart in 1995 [1][12]. Because of its simplicity of implementation, the PSO algorithm has been successfully used in machine learning, signal processing, adaptive control and so on [13].

In the first step, a population of m particles is initialized randomly; each particle is a potential solution, in the search space, to the problem that needs to be solved. In each iteration, the velocity and position of each particle are updated using two values: one is the best value (pb) found by the particle itself, and the other is the best value (gb) found by the whole population so far. Suppose there are m particles in the d-dimensional search space; the velocity and position of the i-th particle at time t are expressed as

vi (t) = [vi1 (t), vi2 (t), · · · , vid (t)]T


xi (t) = [xi1 (t), xi2 (t), · · · , xid (t)]T
The best value of the i-th particle and the overall best value of the population at iteration t are

pbi (t) = [pi1 (t), pi2 (t), · · · , pid (t)]T


gb (t) = [g1 (t), g2 (t), · · · , gd (t)]T
At iteration t + 1, the position and velocity of the particle are updated as
follows:

vi (t + 1) = ωvi (t) + c1 r1 (pbi (t) − xi (t)) + c2 r2 (gb (t) − xi (t)) (1)


xi (t + 1) = xi (t) + vi (t + 1) (2)
where ω is the inertia weight coefficient, which trades off global search ability against local search ability; c1 and c2 are the learning factors of the algorithm. If c1 = 0, the algorithm easily falls into a local optimum and cannot jump out; if c2 = 0, the convergence of PSO becomes slow; r1 and r2 are random variables uniformly distributed in [0, 1].
In each iteration of the PSO algorithm, only the best particle transmits its information to the other particles. The algorithm generally has two termination conditions: a maximum number of iterations or a sufficiently good fitness value. The process of PSO is as follows

Algorithm 1 Particle Swarm Optimization


Step 1. Initialize a population of particles with random positions and velocities in the d-dimensional problem space;
Step 2. For each particle, evaluate the fitness function in the d-dimensional space;
Step 3. Compare each particle's fitness with its pb. If the current value is better than pb, set pb to the current value and the pb location to the current location;
Step 4. Compare the fitness with the population's overall previous best. If the current value is better than gb, reset gb to the current particle's index and value;
Step 5. Update the position and velocity of the particle according to equations (1)(2);
Step 6. Go to Step 2 until a stopping criterion is met.
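For illustration, a minimal Python/NumPy sketch of Algorithm 1 is given below. The fitness function, bounds, population size, iteration count and the default values of ω, c1 and c2 are placeholders chosen for the example, not prescriptions from this paper.

```python
import numpy as np

def pso_maximize(fitness, bounds, m=30, iters=100, w=0.8, c1=1.85, c2=2.0, seed=0):
    """Minimal particle swarm maximizer following Algorithm 1 (illustrative sketch).

    fitness: callable mapping a vector of shape (d,) to a scalar.
    bounds:  array of shape (d, 2) with lower/upper limits per dimension.
    """
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    d = len(lo)

    # Step 1: random initial positions and velocities.
    x = rng.uniform(lo, hi, size=(m, d))
    v = rng.uniform(-(hi - lo), hi - lo, size=(m, d))

    # Step 2: initial fitness, personal bests (pb) and global best (gb).
    f = np.array([fitness(p) for p in x])
    pb, pb_f = x.copy(), f.copy()
    gb = pb[pb_f.argmax()].copy()

    for _ in range(iters):
        # Step 5: velocity and position updates, equations (1) and (2).
        r1, r2 = rng.random((m, d)), rng.random((m, d))
        v = w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
        x = np.clip(x + v, lo, hi)

        # Steps 2-4: re-evaluate fitness and update personal and global bests.
        f = np.array([fitness(p) for p in x])
        better = f > pb_f
        pb[better], pb_f[better] = x[better], f[better]
        gb = pb[pb_f.argmax()].copy()

    return gb, pb_f.max()
```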

2.2. Bayesian optimization


BO was first proposed by Pelikan of the University of Illinois at Urbana-Champaign in 1998 [14]. Given a finite number of known sample points, BO finds the optimal value of the function by constructing a posterior probability of the output of the objective function f [8][9]. Because the BO framework is very data efficient, it is particularly useful in situations where evaluations of f are costly, derivatives with respect to x are not available, and f is non-convex and multi-peak. The BO framework has two key ingredients. One is a probabilistic surrogate model, which consists of a prior distribution over functions. The other is an acquisition function. BO is a sequential model-based approach, since the posterior probability is updated as each data point is obtained [15].
Mathematically, denote f (x) as the objective function:

x∗ = argmaxf (x) (3)


where x ∈ X, X ⊆ R^d, and X is the hyper-parameter space. The purpose of this article is to find the maximum value of the objective function. Suppose the existing data is D1:t = {(xi , yi ), i = 1, 2, · · · , t}, where yi is the generalization accuracy of the model under the hyper-parameters xi . In the following, D1:t is abbreviated as D. We hope to estimate the maximum value of the objective function within a limited number of iterations. If y is regarded as a noisy observation of the generalization accuracy, then y = f (x) + ε, where the noise ε is i.i.d. with p(ε) = N (0, σε2 ). The goal of hyper-parameter estimation is to find x∗ in the d-dimensional hyper-parameter space.
One problem with this maximum expected accuracy framework is that the true sequential accuracy is typically computationally intractable. This has led to the introduction of many myopic heuristics known as acquisition functions; the next sample point is obtained by maximizing the acquisition function:

xt+1 = argmax αt (x; D) (4)
There are three commonly used acquisition functions: probability of improvement (PI), expected improvement (EI) and the upper confidence bound (UCB). These acquisition functions trade off exploration against exploitation.
In recent years, BO has been widely used for hyper-parameter estimation and automatic model selection in machine learning [16][17][18][19][20], which has promoted research on BO methods for hyper-parameter estimation in many respects.

3. PSO-BO
The BO algorithm based on PSO is an iterative process. First, Algorithm 1 is used to optimize the acquisition function and obtain xt+1 ; then, the objective function value is evaluated according to yt+1 = f (xt+1 ) + ε; finally, D is updated with the new sample point {(xt+1 , yt+1 )}, and the posterior distribution of the probabilistic surrogate model is updated for the next iteration.

3.1. Algorithm Description


The effectiveness of BO depends on the acquisition function α. In general, α is non-convex and multi-peak, so a non-convex optimization problem has to be solved in the search space X. The PSO algorithm is simple, has few parameters to adjust, and converges quickly. It does not require the derivatives of the objective function. Therefore, the PSO algorithm is chosen in this paper to optimize the acquisition function and obtain new sample points.
The first choice we need to make is the surrogate model. Using a Gaussian process (GP) as the surrogate model is a popular choice, due to its potent function approximation properties and its ability to quantify uncertainty. A GP is a prior over functions that allows us to encode our prior beliefs about the properties of the function f , such as smoothness and periodicity [4]. A GP is a nonparametric model [21] that is fully characterized by its prior mean function and its positive-definite kernel, or covariance function. Formally, every finite subset of a GP obeys a multivariate normal distribution. Assuming that the prior mean of the model is 0, the joint distribution of the original observation data D and the new sample point (xt+1 , yt+1 ) can be expressed as follows

\[
y_{1:t+1} \sim N\!\left( 0,\;
\begin{bmatrix}
K + \sigma_{\varepsilon}^{2} I & k \\
k^{T} & k(x_{t+1}, x_{t+1})
\end{bmatrix}
\right)
\]

where $k : X \times X \rightarrow \mathbb{R}$ is the covariance function, $k = [k(x_1, x_{t+1}), \cdots, k(x_t, x_{t+1})]^{T}$, and the Gram matrix is

\[
K =
\begin{bmatrix}
k(x_1, x_1) & \cdots & k(x_1, x_t) \\
\vdots & \ddots & \vdots \\
k(x_t, x_1) & \cdots & k(x_t, x_t)
\end{bmatrix}
\]

where I is the identity matrix and σε2 is the noise variance. The prediction can be made by conditioning on the original observation data as well as the new point x, since the posterior distribution of yt+1 is

p(yt+1 | y1:t , x1:t+1 ) = N (µt (xt+1 ), σt2 (xt+1 )) (5)


The mathematical expectation and variance of yt+1 are as follows

µt (xt+1 ) = k T (K + σε2 I)−1 y1:t (6)


σt2 (xt+1 ) = k(xt+1 , xt+1 ) − k T (K + σε2 I)−1 k (7)
The ability of the GP to express a distribution over functions depends only on the covariance function. The Matern-52 covariance function is one common choice and is given by

\[
K_{M52}(x, x') = \theta_{0} \left( 1 + \sqrt{5 r^{2}(x, x')} + \frac{5}{3} r^{2}(x, x') \right) \exp\!\left\{ -\sqrt{5 r^{2}(x, x')} \right\} \qquad (8)
\]
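As an illustration of equations (6)-(8), the sketch below computes the GP posterior mean and variance at a single point under a Matern-52 kernel. The isotropic form r^2(x, x') = ||x − x'||^2 / ℓ^2 and the values of θ0, ℓ and σε2 are assumptions made for this example; the paper does not specify them.

```python
import numpy as np

def matern52(X1, X2, theta0=1.0, length=1.0):
    """Matern-52 covariance, equation (8), assuming r^2(x, x') = ||x - x'||^2 / length^2."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1) / length ** 2
    r = np.sqrt(5.0 * d2)
    return theta0 * (1.0 + r + (5.0 / 3.0) * d2) * np.exp(-r)

def gp_posterior(X, y, x_new, noise_var=1e-4):
    """Posterior mean (6) and variance (7) of a zero-mean GP at a single point x_new."""
    K = matern52(X, X) + noise_var * np.eye(len(X))      # K + sigma_eps^2 * I
    k = matern52(X, x_new[None, :]).ravel()              # k = [k(x_i, x_new)]
    mu = k @ np.linalg.solve(K, y)                                           # equation (6)
    var = matern52(x_new[None, :], x_new[None, :])[0, 0] - k @ np.linalg.solve(K, k)  # equation (7)
    return mu, var
```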
The second choice we need to make is the acquisition function. Although our method is applicable to most acquisition functions, we choose to use UCB, which is more popular, in our experiments. GP-UCB was proposed by Srinivas et al. in 2009 [21]. The UCB strategy tries to increase the value of the upper confidence bound on the surrogate model as much as possible, and its acquisition function is as follows

αU CB (x) = µ(x) + γσ(x) (9)

γ is a parameter that controls the trade-off between exploration (visiting unexplored areas in X) and exploitation (refining our belief by querying close to previous samples). This parameter can be fixed to a constant value.
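Building on the GP posterior sketch above, the UCB acquisition (9) with a fixed γ can be written as follows; the default γ = 2.0 is only an illustrative choice.

```python
import numpy as np

def ucb(x, X, y, gamma=2.0, noise_var=1e-4):
    """Upper confidence bound acquisition, equation (9): mu(x) + gamma * sigma(x)."""
    mu, var = gp_posterior(X, y, x, noise_var=noise_var)  # from the GP sketch above
    return mu + gamma * np.sqrt(max(var, 0.0))
```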

3.2. Algorithm framework
PSO-BO consists of the following steps: (i) assume a surrogate model for the black-box function f ; (ii) define an acquisition function α based on the surrogate model of f , and maximize α by PSO to decide the next evaluation point; (iii) observe the objective function at the point specified by the maximization of α, and update the GP model using the observed data. The PSO-BO algorithm repeats (ii) and (iii) until the stopping conditions are met. The algorithm framework is as follows

Algorithm 2 PSO-BO
Input: surrogate model for f , acquisition function α
Output: optimal hyper-parameter vector x∗
Step 1. Initialize the hyper-parameter vector x0 ;
Step 2. For t = 1, 2, ..., T do:
Step 3. Use Algorithm 1 to maximize the acquisition function and obtain the next evaluation point: xt+1 = argmaxx∈X α(x|D);
Step 4. Evaluate the objective function value yt+1 = f (xt+1 ) + εt+1 ;
Step 5. Update the data: Dt+1 = D ∪ {(xt+1 , yt+1 )}, and update the surrogate model;
Step 6. End for.
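The sketch below ties the pieces together into the PSO-BO loop of Algorithm 2, reusing the pso_maximize and ucb routines from the earlier sketches. The objective f, its bounds and the iteration budget T are placeholders; this is an illustrative outline, not the authors' exact implementation.

```python
import numpy as np

def pso_bo(f, bounds, n_init=5, T=30, gamma=2.0, seed=0):
    """PSO-BO sketch (Algorithm 2): maximize a black-box f by optimizing UCB with PSO."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]

    # Start from a few random observations (Step 1).
    X = rng.uniform(lo, hi, size=(n_init, len(lo)))
    y = np.array([f(x) for x in X])

    for _ in range(T):
        # Step 3: maximize the acquisition function with PSO (Algorithm 1).
        x_next, _ = pso_maximize(lambda x: ucb(x, X, y, gamma=gamma), bounds)

        # Step 4: evaluate the (noisy) objective at the proposed point.
        y_next = f(x_next)

        # Step 5: update the data set; the GP posterior is recomputed inside ucb().
        X = np.vstack([X, x_next])
        y = np.append(y, y_next)

    return X[y.argmax()], y.max()
```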

4. Experiments
PSO is guaranteed to converge when (ω, c1 , c2 ) is chosen within the stability region −1 < ω < 1, 0 < c1 + c2 < 4(1 + ω) [22][23]. c1 and c2 affect the expectation and variance of the particle positions. The smaller the variance is, the more concentrated the optimization results are, and the better the stability of the optimization is. The influence of the values of c1 and c2 on the expectation and variance of the positions has been studied in [1] with the purpose of reducing the variance.
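As a small illustration, the stability condition above can be checked directly; the helper below is only a sketch in this paper's notation.

```python
def in_stability_region(w, c1, c2):
    """Check the PSO convergence condition -1 < w < 1 and 0 < c1 + c2 < 4(1 + w)."""
    return -1.0 < w < 1.0 and 0.0 < c1 + c2 < 4.0 * (1.0 + w)

# The setting adopted later in this paper satisfies the condition:
assert in_stability_region(w=0.8, c1=1.85, c2=2.0)
```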
In order to demonstrate the performance of the proposed PSO-BO algorithm, two different data sets were analyzed with an AdaBoost regressor, a random forest (RF) classifier and an XGBoost classifier [24]. Both datasets were randomly split into train/validation sets. The zero mean function and the Matern-52 covariance function (8) were adopted as the prior for the GP. In the experiments we used the AdaBoost regressor, RF classifier and XGBoost classifier implementations available in Scikit-learn.

4.1. Data sets and setups
Two datasets are used in this experiment: the Boston housing dataset and the Digits dataset [25], both available in Scikit-learn. The Boston housing dataset contains 506 examples, each with 13 dimensions, for regression tasks. The Digits dataset contains 1797 examples, each with 64 dimensions, for classification tasks.
We first estimated four hyper-parameters of the RF classifier model trained on the Boston housing dataset in order to set the value of ω. A table with the hyper-parameters to be optimized, their ranges and their types is displayed in Table 1. Note that while the first parameter takes real values, the others take integer values. In the process of optimizing the hyper-parameters of the RF classifier by PSO-BO, ω was varied from 0.1 to 0.9, and all other parameters were kept the same to ensure a fair comparison. The experiments were repeated 10 times. We use accuracy to measure performance on the classification task, and the averaged results are shown in Fig. 1. The vertical axis represents the accuracy on the validation set of the Boston housing dataset; 5-fold cross validation was used on the dataset. As Fig. 1 shows, the accuracy is highest when ω = 0.8; as ω grows, however, the time required for the optimization process also increases significantly. Referring to the study of PSO parameter settings in [1], which finds that PSO has the highest rate of convergence when ω is between 0.8 and 1.2, the parameters of the PSO algorithm in PSO-BO were set to c1 = 1.85, c2 = 2, ω = 0.8.

Table 1: Types and range of hyper-parameters of RF


Name Type Range
Max features Real (0.1, 0.999)
Number of estimators Integer (10, 250)
Minimum Number of Samples to Split Integer (2, 25)
Max depth Integer (5, 15)
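For illustration, an objective f over the hyper-parameters of Table 1 could be wrapped as 5-fold cross-validated accuracy of a Scikit-learn random forest, as sketched below. The Digits data is used here, as in Section 4.3; the rounding of integer-typed hyper-parameters and all other details are assumptions of this sketch, not the exact experimental code.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_data, y_data = load_digits(return_X_y=True)

# Search space from Table 1: (max_features, n_estimators, min_samples_split, max_depth).
rf_bounds = np.array([[0.1, 0.999], [10, 250], [2, 25], [5, 15]])

def rf_objective(x):
    """Map a hyper-parameter vector to 5-fold cross-validated accuracy.
    Integer-typed hyper-parameters are rounded, since PSO searches a continuous space."""
    model = RandomForestClassifier(
        max_features=float(x[0]),
        n_estimators=int(round(x[1])),
        min_samples_split=int(round(x[2])),
        max_depth=int(round(x[3])),
        random_state=0,
    )
    return cross_val_score(model, X_data, y_data, cv=5).mean()

# best_x, best_acc = pso_bo(rf_objective, rf_bounds)   # using the PSO-BO sketch above
```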

In the process of optimizing the hyper-parameters of machine learning models, there are generally bound constraints on the hyper-parameters. Among the methods used to optimize the acquisition function in the Bayesian optimization framework, L-BFGS-B, TNC, SLSQP and trust-constr support this kind of constraint, and L-BFGS-B and TNC are the most commonly used. Therefore, in this paper, L-BFGS-B and TNC are selected as the methods to optimize the acquisition function in the comparative experiments, and the corresponding Bayesian optimization frameworks are referred to as L-BFGS-B-BO and TNC-BO respectively.

Figure 1: Experimental results of the developed PSO-BO approach with different values
of ω. Horizontal-axis: values of ω ; Vertical-axis: the accuracy on the validation set of RF
classifier model.

To ensure a fair comparison, we implemented all methods in Python using
the same packages. In all experiments, we used a zero-mean GP surrogate
model with a Matern-52 kernel. We optimized the kernel and likelihood
hyper-parameters by maximising the log marginal likelihood [1]. All methods
used UCB as the acquisition function. Each experiment was started with 5
random initial observations.

4.2. Hyper-parameter Optimization of AdaBoost


In this section, we estimated two hyper-parameters (d = 2) of the AdaBoost regressor using the Boston housing dataset. The two hyper-parameters were estimated by PSO-BO, L-BFGS-B-BO and TNC-BO respectively. The ranges and types of the hyper-parameters to be optimized are displayed in Table 2. Note that while the first parameter takes real values, the other takes integer values. We use R2 to measure performance on the regression task. As shown in Table 3, comparing the averaged, maximum and minimum results, PSO-BO outperforms the other two algorithms in terms of R2 on the validation set of the Boston housing dataset; 5-fold cross validation was used on the dataset. In Fig. 2, the vertical axis represents R2; it can be seen that PSO-BO with ω = 0.8 performed better than the other settings.

Table 2: Types and range of hyper-parameters of AdaBoost
Name Type Range
Learning rate Real (0.1, 1)
Number of estimators Integer (10, 250)

Figure 2: Experimental results of the developed PSO-BO approach with different values
of ω. Horizontal-axis: values of ω ; Vertical-axis: the R2 on the validation set of AdaBoost
regressor model.

Table 3: Comparison between the PSO-BO, L-BFGS-B-BO, TNC-BO


Method PSO-BO (ω = 0.8) L-BFGS-B-BO TNC-BO
MAX 0.614 0.61 0.614
MIN 0.603 0.595 0.598
AVE 0.609 0.6021 0.6045

4.3. Hyper-parameter Optimization of RF


In this section, we estimated four hyper-parameters (d = 4) of the RF classifier using the Digits dataset. The four hyper-parameters were estimated by PSO-BO, L-BFGS-B-BO and TNC-BO respectively. The ranges and types of the hyper-parameters to be optimized are displayed in Table 1. Note that while the first parameter takes real values, the others take integer values. As shown in Table 4, comparing the averaged, maximum and minimum results, PSO-BO outperforms the other two algorithms in terms of classification accuracy on the validation set of the Digits dataset; 5-fold cross validation was used on the dataset. In Fig. 3, the vertical axis represents the accuracy; it can be seen that PSO-BO with ω = 0.8 performed better than the other settings.

Figure 3: Experimental results of the developed PSO-BO approach with different values
of ω. Horizontal-axis: values of ω ; Vertical-axis: the accuracy on the validation set of RF
classifier model.

Table 4: Comparison between the PSO-BO, L-BFGS-B-BO, TNC-BO


Method PSO-BO (ω = 0.8) L-BFGS-B-BO TNC-BO
MAX 0.9471 0.9449 0.9443
MIN 0.9421 0.9404 0.9399
AVE 0.9443 0.9427 0.9421

4.4. Hyper-parameter Optimization of XGBoost
In this section, we estimated five hyper-parameters (d = 5) of the XGBoost classifier using the Digits dataset. The five hyper-parameters were estimated by PSO-BO, L-BFGS-B-BO and TNC-BO respectively. The ranges and types of the hyper-parameters to be optimized are displayed in Table 5. Note that while the first and second parameters take real values, the others take integer values. As shown in Table 6, comparing the averaged, maximum and minimum results, PSO-BO outperforms the other two algorithms in terms of classification accuracy on the validation set of the Digits dataset; 5-fold cross validation was used on the dataset. In Fig. 4, the vertical axis represents the accuracy; it can be seen that PSO-BO with ω = 0.8 performed better than the other settings.

Table 5: Types and range of hyper-parameters of XGBoost


Name Type Range
Sub sample Real (0.5, 1)
Col sample by tree Real (0.1, 1)
Gamma Real (0, 10)
Min child weight Integer (1, 20)
Max depth Integer (2, 10)

Table 6: Comparison between the PSO-BO, L-BFGS-B-BO, TNC-BO


Method PSO-BO (ω = 0.8) L-BFGS-B-BO TNC-BO
MAX 0.947 0.9438 0.946
MIN 0.9393 0.9376 0.938
AVE 0.9414 0.93945 0.9405

Figure 4: Experimental results of the developed PSO-BO approach with different values of ω. Horizontal-axis: values of ω; Vertical-axis: the accuracy on the validation set of the XGBoost classifier model.

5. Conclusion
In this paper, we developed a new approach, PSO-BO, based on the PSO algorithm. In the PSO-BO framework, the PSO method is used to optimize the acquisition function to obtain new evaluation points, which significantly reduces the computational burden. Empirical evaluation on machine learning models showed that PSO-BO improves upon the state of the art. The resulting method can be used with most acquisition functions. However, the algorithm runs slowly in high-dimensional spaces. In future work, we plan to improve the PSO-BO algorithm on this basis to improve its running efficiency in high-dimensional spaces.

Acknowledgements
This work is supported by NSFC-61803337, NSFC-61803338, ZJSTF-
LGF18F020011.

References
[1] Q. Bai, Analysis of particle swarm optimization algorithm, Computer and Information Science 3 (1) (2010) 180.

[2] P. Regulski, D. Vilchis-Rodriguez, S. Djurović, V. Terzija, Estimation of composite load model parameters using an improved particle swarm optimization method, IEEE Transactions on Power Delivery 30 (2) (2014) 553–560.

[3] M. Schwaab, E. C. Biscaia Jr, J. L. Monteiro, J. C. Pinto, Nonlinear parameter estimation through particle swarm optimization, Chemical Engineering Science 63 (6) (2008) 1542–1552.

[4] J. S. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: Advances in Neural Information Processing Systems, 2011, pp. 2546–2554.

[5] K. Krajsek, R. Mester, Marginalized maximum a posteriori hyper-parameter estimation for global optical flow techniques, in: AIP Conference Proceedings, Vol. 872, American Institute of Physics, 2006, pp. 311–318.

[6] P. Stoica, A. B. Gershman, Maximum-likelihood DOA estimation by data-supported grid search, IEEE Signal Processing Letters 6 (10) (1999) 273–275.

[7] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, The Journal of Machine Learning Research 13 (1) (2012) 281–305.

[8] P. I. Frazier, A tutorial on Bayesian optimization, arXiv preprint arXiv:1807.02811 (2018).

[9] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. De Freitas, Taking the human out of the loop: A review of Bayesian optimization, Proceedings of the IEEE 104 (1) (2015) 148–175.

[10] D. C. Liu, J. Nocedal, On the limited memory BFGS method for large scale optimization, Mathematical Programming 45 (1-3) (1989) 503–528.

[11] S. G. Nash, A survey of truncated-Newton methods, Journal of Computational and Applied Mathematics 124 (1-2) (2000) 45–59.

[12] C.-C. Hung, L. Wan, Hybridization of particle swarm optimization with the k-means algorithm for image classification, in: 2009 IEEE Symposium on Computational Intelligence for Image Processing, IEEE, 2009, pp. 60–64.

[13] Z. Wenjing, Parameter identification of LuGre friction model in servo system based on improved particle swarm optimization algorithm, in: 2007 Chinese Control Conference, IEEE, 2007, pp. 135–139.

[14] J. Snoek, H. Larochelle, R. P. Adams, Practical Bayesian optimization of machine learning algorithms, in: Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.

[15] E. Brochu, V. M. Cora, N. De Freitas, A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning, arXiv preprint arXiv:1012.2599 (2010).

[16] N. Mahendran, Z. Wang, F. Hamze, N. De Freitas, Adaptive MCMC with Bayesian optimization, in: Artificial Intelligence and Statistics, 2012, pp. 751–760.

[17] P. Hennig, C. J. Schuler, Entropy search for information-efficient global optimization, The Journal of Machine Learning Research 13 (1) (2012) 1809–1837.

[18] E. C. Garrido-Merchán, D. Hernández-Lobato, Dealing with categorical and integer-valued variables in Bayesian optimization with Gaussian processes, Neurocomputing 380 (2020) 20–35.

[19] S. Toscano-Palmerin, P. I. Frazier, Bayesian optimization with expensive integrands, arXiv preprint arXiv:1803.08661 (2018).

[20] M. Seeger, Gaussian processes for machine learning, International Journal of Neural Systems 14 (02) (2004) 69–106.

[21] N. Srinivas, A. Krause, S. M. Kakade, M. Seeger, Gaussian process optimization in the bandit setting: No regret and experimental design, arXiv preprint arXiv:0912.3995 (2009).

[22] M. Jiang, Y. P. Luo, S. Y. Yang, Stochastic convergence analysis and parameter selection of the standard particle swarm optimization algorithm, Information Processing Letters 102 (1) (2007) 8–16.

[23] Y.-L. Zheng, L.-H. Ma, L.-Y. Zhang, J.-X. Qian, On the convergence analysis and parameter selection in particle swarm optimization, in: Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 03EX693), Vol. 3, IEEE, 2003, pp. 1802–1807.

[24] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.

[25] S. Hettich, C. Blake, C. Merz, UCI repository of machine learning databases, University of California, Irvine, Dept. of Information and Computer Sciences (1998).

