Keywords: AutoML; Insurance data analytics; Imbalance learning; AI education

Abstract

Machine Learning (ML) has gained popularity in actuarial research and insurance industrial applications. However, the performance of most ML tasks heavily depends on data preprocessing, model selection, and hyperparameter optimization, which are considered to be intensive in terms of domain knowledge, experience, and manual labor. Automated Machine Learning (AutoML) aims to automatically complete the full life-cycle of ML tasks and provides state-of-the-art ML models without human intervention or supervision. This paper introduces an AutoML workflow that allows users without domain knowledge or prior experience to achieve robust and effortless ML deployment by writing only a few lines of code. This proposed AutoML is specifically tailored for the insurance application, with features like the balancing step in data preprocessing, ensemble pipelines, and customized loss functions. These features are designed to address the unique challenges of the insurance domain, including the imbalanced nature of common insurance datasets. The full code and documentation are available on the GitHub repository (https://github.com/PanyiDong/InsurAutoML).
1. Introduction

Machine Learning (ML), as described by Mitchell et al. (1990), is a multidisciplinary sub-field of Artificial Intelligence (AI) focused on developing and implementing algorithms and statistical models that enable computer systems to perform data-driven tasks or make predictions through "leveraging data" and iterative learning processes. This data-driven approach guides the design of ML algorithms, allowing them to grasp the distributions and structures within datasets and unveil correlations that elude traditional mathematical and statistical methods. Professionals in data-related fields, such as data scientists and ML engineers, can engage in autonomous decision-making based on data and benefit from cutting-edge predictions generated by modern ML models.

In recent decades, ML has significantly reshaped various industries and gained widespread popularity in academia due to its exceptional predictive capabilities. As summarized by Jordan and Mitchell (2015), ML has made significant contributions in various fields, including robotics, autonomous driving, language processing, and computer vision. The medical and healthcare industry, as suggested by Kononenko (2001) and Qayyum et al. (2020), is increasingly adopting ML for applications such as medical image analysis and clinical treatments. Furthermore, ML models have significantly improved personalization and targeting, marketing strategy, and customer engagement in the marketing sector, as summarized by Ma and Sun (2020). Guerra and Castelli (2021) present the ML innovations in the banking sector, particularly in the analysis of liquidity risks, bank risks, and credit risks. Additionally, there is a growing trend in adopting ML models in the insurance sector and among actuarial researchers and industry practitioners, as evidenced by the recent literature. Some recent literature advances emerging data-driven research topics in areas such as climate risks, health and long-term care, and telematics. For instance, by combining dynamic weather information with deep learning techniques, Shi et al. (2024) develop an improved predictive model, leading to improved insurance claim management. Hartman et al. (2020) and Cummings and Hartman (2022) explore various ML models on health and long-term care insurance. Masello et al. (2023) and Peiris et al. (2024) find that the integration of telematics through ML models can better comprehend risk characteristics. Other researchers focus on developing enhanced ML models to further improve predictive capabilities or address specific challenges within the insurance sector. Charpentier et al. (2023) propose a reinforcement learning technique and explore its application in the financial sector. Gan and Valdez (2024) explore the use of exponential family principal component analysis in analyzing insurance compositional data to improve predictive accuracy. Turcotte and Boucher (2024) develop a generalized additive model for location, scale, and shape (GAMLSS) to better understand longitudinal data. To address the excessive zeros and
the heavily right-skewed distribution in claim management, So (2024) compare various ML models and propose a zero-inflated boosted tree model. By combining classification and regression, Quan et al. (2023) propose a hybrid tree-based ML algorithm to model claim frequency and severity. In addition to advancements in ML models, researchers, such as Frees et al. (2016) and Charpentier (2015), have collected various insurance datasets. For further studies, see Quan and Valdez (2018), Noll et al. (2020), Si et al. (2022) and Quan et al. (2024).

In an ideal scenario, ML models are crafted to automate data-driven tasks, aiming to minimize manual labor and human intervention. However, due to their dependence on data, there is no one-size-fits-all solution, necessitating the training of a specific ML model for each task. This complexity is compounded by the vast array of ML models available, each with numerous hyperparameters controlling the learning process and impacting model performance. It has even led to the emergence of a new research field, Hyperparameter Optimization (Yang and Shami, 2020), which has become increasingly vital in both academic and industrial practice. Consequently, finding the optimal hyperparameter setting for the ML model becomes a labor-intensive process reliant on extensive experience in ML. Additionally, the proliferation of digitalization has led to explosive growth in data volume and variety (number of features and observations), coupled with a decline in overall data quality. As a result, real-world datasets, especially in industrial settings, demand meticulous data preprocessing to achieve optimal performance. This data preprocessing is often manual, model-specific, and trial-and-error, further complicating the training of the ML model. These factors contribute to the difficulty of building practical ML models for individuals lacking prior experience. Even for ML experts, achieving optimal performance in new datasets can be daunting, as evidenced in Kaggle competitions (https://www.kaggle.com/competitions).

In this paper, we endeavor to make ML more accessible to inexperienced users and develop an inclusive learning tool that assists insurance practitioners and actuarial researchers in utilizing state-of-the-art ML tools in their daily operations. In fact, this research stems from the IRisk Lab project (IRisk Lab currently serves as an academic-industry collaboration hub, facilitates the integration of discovery-based learning experiences for students, and showcases state-of-the-art research in all areas of Risk Analysis and Advanced Analytics; https://asrm.illinois.edu/illinois-risk-lab), which facilitates the learning of ML for actuarial science students and establishes a comprehensive pipeline for automating data-driven tasks. Automated Machine Learning (AutoML), as summarized in Zöller and Huber (2021), offers a solution aimed at diminishing repetitive manual tuning efforts, thus expediting ML training. Its objective is to encourage the adoption of ML across various domains, particularly among inexperienced users, by facilitating full ML life cycles and enhancing comprehension of domain datasets. Successful implementations of AutoML span across academic open-source libraries, LeDell and Poirier (2020), Feurer et al. (2022), and industry-commercialized products, including startups like DataRobot (https://www.datarobot.com/) and cloud services such as AWS SageMaker (https://aws.amazon.com/sagemaker/). However, existing open-source AutoML implementations may not be suitable for the insurance domain due to specific challenges, such as imbalanced datasets, a high prevalence of missing values, and scalability issues.

Our proposed AutoML pipeline tailored for the insurance domain encompasses fully functional components for data preprocessing, model selection, and hyperparameter optimization. We envision several use cases for our proposed AutoML. Firstly, it can serve as a performance benchmark for evaluating future ML creations among researchers, actuaries, and data scientists in the insurance sector. For instance, users can input datasets into our AutoML at the onset of a data project. Meanwhile, users can manually analyze the datasets, plan their subsequent steps, and produce preliminary results. Upon obtaining a prototype of the initial model, users can then compare it with the results generated by our AutoML to gauge if their model surpasses the benchmark set by our AutoML. This utilization of AutoML as a benchmark aids in standardizing and advancing ML methodologies within the dynamic landscape of the insurance industry, including academic research. Secondly, our proposed AutoML generates training history as a byproduct, offering insights that users may have overlooked. For example, users can review the training process and the results recorded by AutoML over time. This feature can provide insights that align with user experiments or intuition, or it can sometimes present counterintuitive findings that prompt users to reconsider their approach. Thirdly, our AutoML, designed with flexibility in the search space and optimization algorithms, can seamlessly incorporate future innovations while maintaining its strength in automation. For experienced users seeking to unlock the full potential of our AutoML, its flexible design enables them to leverage tuning results to gain a deeper understanding of the underlying data structure. Such insights can then serve as a guideline for manually reducing the search complexity, ultimately facilitating the attainment of optimal performance in time and computation. Finally, our research has real-world applicability and is available as an open-source tool, making it free for users to implement in practical scenarios. For instance, in the life-cycle of insurance business operations, our AutoML can offer tool sets to insurers to improve underwriting processes, optimize pricing strategies, enhance risk management practices, and improve operational efficiency, cost reduction, and customer satisfaction. Thus, AutoML presents a promising opportunity for insurance companies to leverage advanced ML models and unlock the full potential of their data. Additionally, we believe that our open-source AutoML can serve as an educational tool for university students and a benchmark-building resource for academic research.

The paper is structured in the following sections: Section 2 provides an overview of the general AutoML workflow and formulates the processes involved in model selection and hyperparameter optimization. Section 3 focuses specifically on our AutoML design tailored for the insurance domain, emphasizing the integration of sampling techniques and ensemble learning strategies to address the unique issues in insurance data. To showcase the feasibility of our AutoML in the insurance domain, Section 4 presents experiment results demonstrating the performance of our AutoML on various insurance datasets and compares it with existing research. Notations utilized throughout the paper can be found in Appendix A, Table 11. The experiments demonstrate that our AutoML carries the potential to achieve superior performance without extensive human intervention, thereby freeing practitioners and researchers from tedious manual tuning work. Section 5 provides concluding remarks, summarizing the key findings and insights presented in the paper.

2. The concept of AutoML

Most ML algorithm architectures characterize an ML model 𝑀 by a set of parameters 𝜃 and hyperparameters 𝜆. Parameters 𝜃 are essential components of the ML model structure, representing values that are trainable during model calibration and estimated from data, such as the estimated prediction values of the leaf nodes in tree-based models or the weights and biases in a neural network's linear layer. In contrast, hyperparameters 𝜆 control the model's flexibility and learning process, such as the maximum depth in tree-based models or the learning rate in neural networks. Unlike parameters 𝜃, which are determined during the training process, hyperparameters 𝜆 are set before training and are chosen by users based on their experience and preferences. Extensive empirical experiments have shown that the careful selection of hyperparameters can significantly impact model performance. However, there exists no universal set of hyperparameters that guarantees optimal performance across all datasets, nor are there established theoretical foundations providing precise guidance for their selection.
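To make the distinction concrete, consider the small sketch below (our illustration using scikit-learn; the specific library and model are not part of the paper): the maximum depth of a decision tree is a hyperparameter 𝜆 fixed before training, while the fitted split thresholds and leaf predictions are the parameters 𝜃 estimated from the data.

```python
# Minimal sketch: hyperparameters are fixed before training,
# parameters are estimated from the data during fitting.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=500)

# Hyperparameter (lambda): chosen by the user before training.
model = DecisionTreeRegressor(max_depth=3)

# Parameters (theta): split thresholds and leaf values learned from (X, y).
model.fit(X, y)
print("learned split thresholds:", model.tree_.threshold[:5])
print("learned leaf predictions:", model.tree_.value[:5].ravel())
```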
With a set of fixed hyperparameters 𝜆, the optimization of the ML model 𝑀 utilizing dataset 𝒟 = (X, y) can be expressed as

𝜃* = argmin_𝜃 ℒ(𝑀_𝜆^𝜃(X), y)

where ℒ denotes the loss function, which takes predictions and true values as inputs and returns a numerical value indicating the goodness-of-fit of the model. Thus, for any given loss function ℒ, the model 𝑀 and hyperparameter 𝜆 are critical components that control the task performance. In a broader sense, selecting the loss function can also be considered …

… where 𝒱 and ℒ refer to the same evaluation process and the loss function defined in the naive model selection process.

As suggested by He et al. (2021), many solutions for Model Selection and HPO, especially in the field of Neural Networks (NNs) where the Model Selection is transformed into the Neural Architecture Search (NAS) problem, rely on a two-stage framework that separates the model architecture search and hyperparameter optimization. The two-stage optimization can be expressed as:

𝑀* = argmin_{𝑀∈ℳ} 𝔼_{𝒟∼(𝒟_train, 𝒟_valid)} 𝒱(ℒ, 𝑀_{𝜆_0}^𝜃, 𝒟)
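The following sketch illustrates the two-stage idea on a toy problem (a minimal example with scikit-learn; the model families, grids, and loss are illustrative assumptions rather than the search space used by our AutoML): the model family is selected first under default hyperparameters, and only the winner's hyperparameters are tuned afterwards.

```python
# Two-stage search sketch: (1) model selection with default hyperparameters,
# (2) hyperparameter optimization for the selected model family only.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)

# Stage 1: pick the model family M* under its default hyperparameters (lambda_0).
candidates = {"ridge": Ridge(), "forest": RandomForestRegressor(random_state=0)}
stage1 = {name: -cross_val_score(m, X, y, cv=3,
                                 scoring="neg_mean_squared_error").mean()
          for name, m in candidates.items()}
best_name = min(stage1, key=stage1.get)

# Stage 2: optimize the hyperparameters (lambda) of the selected family.
grids = {"ridge": {"alpha": [0.1, 1.0, 10.0]},
         "forest": {"max_depth": [3, 5, None], "n_estimators": [100, 300]}}
stage2 = GridSearchCV(candidates[best_name], grids[best_name], cv=3,
                      scoring="neg_mean_squared_error").fit(X, y)
print(best_name, stage2.best_params_)
```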
… through automated trial and error, which is otherwise infeasible without extensive manual work. Our AutoML aims to build models that are robust, accurate, and capable of delivering reliable predictions despite the inherent challenges associated with data quality. This systematic approach allows us to maximize the value extracted from the available data, ultimately leading to better insights and decision-making in the insurance sector.

Another unique challenge in the insurance sector is the problem of imbalanced data. When modeling future claims, for example, claim events are relatively rare, with most policyholders not experiencing any claims. Typically, the claim events constitute less than 10% of all policies, and in some cases, such as policies covering catastrophic events, the proportion can be as low as 0.1%. This phenomenon is referred to as an imbalanced data problem in ML, where a single class comprises the majority of all observations. Specifically, in the field of imbalance learning, the class taking the dominant proportion is typically referred to as the majority class while the others denote the minority class. In certain domains, observations within the minority class or rare observations are referred to as outliers (Hodge and Austin, 2004). These outliers are often treated as noise that can be removed or merged into other classes. However, in the insurance industry, these minority class or rare observations corresponding to the claim events contribute as a crucial estimation of financial liabilities or pure premiums in the actuarial term. Consequently, accurately estimating the minority class or rare observations is crucial for the insurance domain. However, the majority of ML models, by their designs, assume equal contributions from each observation, regardless of whether they belong to the majority or minority classes. As a result, many ML models, without appropriate modifications, underperform on imbalanced datasets. The problem of imbalanced data has caught the attention of researchers across diverse fields, leading to the development of various solutions summarized by He and Garcia (2009) and Guo et al. (2017). These imbalance learning techniques, in general, can be categorized as: (1) Sampling methods; (2) Cost-sensitive methods; (3) Ensemble methods; and (4) Kernel-based methods and Active Learning methods. Our AutoML incorporates a series of sampling methods as a critical data preprocessing component to balance majority and minority classes. In addition, advancements in imbalance learning within actuarial science, such as cost-sensitive loss functions (Hu et al., 2022; So et al., 2021), can be integrated into our loss function. Furthermore, as summarized by Sagi and Rokach (2018), ensemble learning not only achieves state-of-the-art performance, but also has the potential to address the imbalance problem effectively by combining multiple ML models into a predictive ensemble model. The multiple evaluation pipelines fitted during our AutoML training can naturally serve as candidates to form an ensemble.

Our AutoML focuses on structured data, such as tabular data from databases or CSV files, and is particularly designed for supervised learning problems, specifically regression and classification. While many other types of AutoML handle unstructured data, like audio, text, video, and images, and incorporate unsupervised learning or active learning, these may lie beyond the scope of our AutoML. Instead, we aim to tailor AutoML solutions to the insurance domain to enhance the performance and adaptability of ML solutions. In the following, we introduce our AutoML from a microscopic to a macroscopic perspective. Subsection 3.1 outlines each component of the model pipeline, with a specific focus on sampling techniques. Subsection 3.2 details how these components are interconnected and optimized within our AutoML workflow. Subsection 3.3 summarizes the integration of the ensemble model as an alternative solution to imbalance learning and as the highest layer of our AutoML production. Lastly, Subsection 3.4 discusses the selection of loss functions and their strategic implications in business contexts.

3.1. Model pipeline

Applying the model 𝑀 directly to the raw dataset 𝒟, as formulated in Section 2, is often impractical or even impossible for several reasons. These include the diverse and complex structures of datasets, the infeasible computational run-time required for optimization, and sub-optimal performance metrics during the evaluation phase. For example, missing values can hinder the application of linear models without proper imputation methods, and the inclusion of wide-ranging features can substantially slow down the convergence rate of gradient-based optimization methods. To address these challenges, a category of techniques known as Data Preprocessing is employed to prepare the raw dataset before model optimization.

Data preprocessing techniques, such as imputation and feature selection, lack dedicated evaluation metrics aside from the final model prediction metrics. Moreover, there is no universal preprocessing technique that consistently demonstrates superior performance across all datasets. As a result, these data preprocessing techniques are often data-dependent, requiring extensive manual adjustments and domain expertise. This reliance adds another layer of complexity to the construction of the optimal pipeline for any ML task, in addition to model selection and hyperparameter optimization.

In our AutoML, we incorporate five classes of data preprocessing techniques commonly employed in ML, aiming to cover a comprehensive range of data preprocessing needs. According to García et al. (2016), these data preprocessing techniques are both essential and beneficial for ML models. In the order of pipeline fitting, we include Data Encoding, Data Imputation, Data Balancing, Data Scaling, and Feature Selection, which we implement in our AutoML based on the references outlined as follows (due to space constraints, please refer to Appendix C for detailed lists of the components used in our AutoML pipelines):

Data Encoding: Data Encoding converts string-based categorical features into numerical ones, either in the format of ordinal or binary one-hot encoding. Since most ML models do not support string variables, this unified numerical representation of features ensures consistency across the dataset and prevents issues like unseen categorical variables and type inconsistencies.

Data Imputation: Missing values, whether due to intrinsic information loss or database mismanagement, pose significant challenges for ML. Various imputation techniques have been proposed to address the missing value issue, such as statistical methods like multiple imputation (Azur et al., 2011), non-parametric approaches (Stekhoven and Bühlmann, 2012), and generative adversarial networks (Yoon et al., 2018). These imputation solutions determine the optimal estimates for the missing cells and populate them accordingly. Unlike the common practice of discarding observations containing missing values, i.e., complete data analysis, which maintains the accuracy of the remaining data, imputation techniques leverage all available information, potentially enhancing the understanding of the dataset while also carrying the risk of introducing misleading imputed values. To address the uncertainty in imputation, we offer users several highly cited methods to help find the best imputation solution.
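The sketch below illustrates these first two preprocessing classes on a toy policy table (a hedged example with pandas and scikit-learn; the column names and specific methods are illustrative and not taken from the paper's datasets or implementation).

```python
# Sketch of Data Encoding and Data Imputation on a toy policy table.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    "VehBrand": ["B1", "B2", "B1", None, "B3"],
    "DrivAge": [45, 23, np.nan, 37, 61],
    "Exposure": [1.0, 0.5, 0.8, np.nan, 1.0],
})

# Data Encoding: ordinal codes for one column, one-hot encoding for the frame.
df["VehBrand_ordinal"] = df["VehBrand"].astype("category").cat.codes
encoded = pd.get_dummies(df.drop(columns="VehBrand_ordinal"), columns=["VehBrand"])

# Data Imputation: simple feature statistics vs. a kNN-based imputer.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(encoded)
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(encoded)
print(pd.DataFrame(knn_imputed, columns=encoded.columns).round(2))
```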
Data Balancing: Data balancing addresses class imbalance using sampling methods, which are especially beneficial in fields like insurance, where infrequent but significant events, such as frauds and claims, must be accurately modeled. As summarized by Batista et al. (2004), sampling methods allow ML models to learn from these rare events better by adjusting class distributions and improving decision boundaries.

As summarized by Chawla et al. (2002), sampling methods are generally classified into over-sampling and under-sampling. Over-sampling increases the number of minority class instances by generating synthetic data, while under-sampling reduces the number of majority class observations.

For the dataset 𝒟 = (X, y), an imbalance problem can be described by splitting it into a majority subset 𝒟_major = (X_major, y_major) and a minority subset 𝒟_minor = (X_minor, y_minor), where 𝑦 = 𝐴 for every 𝑦 ∈ y_major and 𝑦 ≠ 𝐴 for every 𝑦 ∈ y_minor, such that

𝒟 = [𝒟_major; 𝒟_minor] = ([X_major; X_minor], [y_major; y_minor])

and an imbalance ratio

|𝒟_major| / |𝒟_minor| ≫ 1
Here, 𝐴 refers to the response variable defining the majority class, and |𝒟_*| denotes the cardinality of dataset 𝒟_*, indicating the number of observations. In insurance, for instance, non-fraudulent observations usually form the majority class, suggesting 𝐴 = 0. Given the imbalance between 𝒟_major and 𝒟_minor, conventional ML models are usually more influenced by observations in 𝒟_major than those in 𝒟_minor, resulting in inferior ML performance.

Over-sampling retains the structure of 𝒟_major and constructs a sampled minority subset 𝒟^𝑌_minor such that 𝒟_minor ⊂ 𝒟^𝑌_minor and

|𝒟_major| / |𝒟^𝑌_minor| = 𝑅

where 𝑅 is a pre-defined threshold or can be a hyperparameter, measuring the imbalance ratio, typically 𝑅 ≈ 1. In our AutoML, datasets with an imbalance ratio greater than 𝑅 are considered imbalanced, and sampling methods are applied to adjust the imbalance ratio. The additional observations in 𝒟^𝑌_minor ∖ 𝒟_minor are synthetically generated samples that simulate the statistical properties of the observations in 𝒟_minor. Common generation methods include duplication, as described by Batista et al. (2004) in the Simple Random Over-Sampling method, and linear interpolation in the feature space, as used in Synthetic Minority Over-Sampling Techniques (SMOTE) by Chawla et al. (2002). These synthetic minority samples increase the size of the minority subset, making the majority and minority classes comparable in size and mitigating the imbalance problem caused by the disparity between the majority and minority classes.

Under-sampling, on the contrary, maintains 𝒟_minor and constructs a reduced majority subset 𝒟^𝑌_major such that 𝒟^𝑌_major ⊂ 𝒟_major and

|𝒟^𝑌_major| / |𝒟_minor| = 𝑅

The subset 𝒟^𝑌_major is constructed by removing observations from 𝒟_major while preserving the statistical similarities. Most under-sampling techniques use k Nearest Neighbors (kNN) as the base learner. Tomek Link, proposed by Tomek (1976), utilizes kNN to find the adjoining majority-minority pairs for removal. Edited Nearest Neighbors (ENN) (Wilson, 1972) employs the predictions of kNN as majority votes to determine which majority class observations should be removed. Condensed Nearest Neighbors (CNN) by Hart (1968) uses kNN as the benchmark to determine the necessary majority class observations required to generate the subset. Assuming that the majority and minority classes can be viewed as a binary classification problem and that distinctions between them can be discerned through their spatial distributions in feature space, the kNN leverages nearest neighbors as a criterion for statistical significance to identify sample importance. By removing observations that disagree with kNN predictions, majority observations that distort the smoothness of decision boundaries can be precisely eliminated, leading to smooth decision boundaries and a proper balance between majority and minority classes.
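As a small illustration of the two sampling families (a sketch based on the imbalanced-learn package, which is one possible implementation rather than the exact components of our AutoML), SMOTE interpolates synthetic minority observations while random under-sampling discards majority observations until the target ratio 𝑅 is reached:

```python
# Sketch: over-sampling (SMOTE) vs. under-sampling on an imbalanced dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Roughly 5% minority class, mimicking rare claim events.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# Over-sampling: generate synthetic minority samples up to ratio R = 1.
X_over, y_over = SMOTE(sampling_strategy=1.0, random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# Under-sampling: drop majority samples down to the same ratio.
X_under, y_under = RandomUnderSampler(sampling_strategy=1.0,
                                      random_state=0).fit_resample(X, y)
print("after under-sampling:", Counter(y_under))
```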
Data Scaling: Data scaling might not always significantly impact model performance, but it usually accelerates convergence, especially in the case of gradient-based optimizations on skewed features. Considering the intensive computation involved in AutoML optimization, time efficiency is crucial, making the scaling of features as important as other data preprocessing components. Furthermore, some of the scaling techniques help remove outliers in features, which are beneficial to the ML models in certain scenarios. In our AutoML, we incorporate a series of common scaling techniques, such as standardization and normalization.

Feature Selection: In real-world applications, unprocessed features can suffer from redundancy or ambiguity, negatively affecting run-time and potentially undermining the model performance. Feature selection addresses this issue by reducing dimensions and effectively identifying a subset of valuable features. As summarized by Chandrashekar and Sahin (2014), feature selection techniques can be either model-dependent (i.e., wrapper methods) or model-free (i.e., filter methods), and the effectiveness of the selection heavily relies on the datasets.

Preprocessing techniques, combined with ML models, comprise a pipeline that represents a real-world workflow of modeling tasks. To differentiate the types of hyperparameters, we extend the notations from Section 2 as follows: encoding algorithm 𝐸 with hyperparameter 𝜆_𝐸; imputation algorithm 𝐼 with hyperparameter 𝜆_𝐼; balancing algorithm 𝐵 with hyperparameter 𝜆_𝐵; scaling algorithm 𝑆 with hyperparameter 𝜆_𝑆; feature selection algorithm 𝐹 with hyperparameter 𝜆_𝐹; and ML model 𝑀 with hyperparameter 𝜆_𝑀. In our AutoML, the pipeline can be initialized by the input of algorithm-hyperparameter pairs, denoted as

𝐏_0 = 𝑀_{𝜆_𝑀} ∘ 𝐹_{𝜆_𝐹} ∘ 𝑆_{𝜆_𝑆} ∘ 𝐵_{𝜆_𝐵} ∘ 𝐼_{𝜆_𝐼} ∘ 𝐸_{𝜆_𝐸}

where the initialized pipeline 𝐏_0 is controlled by trainable parameters 𝜃. The pipeline can then be trained to optimize 𝜃 as follows:

𝜃* = argmin_𝜃 ℒ(𝐏_0(X), y)

where 𝒟 = (X, y) is a dataset and ℒ refers to the loss function. The preprocessing techniques are applied to the datasets sequentially in the specified order, as demonstrated in the parameterization of the pipeline. This ordering demonstrated in our AutoML is a widely accepted modeling process that embeds some of the typical incentives in the domain of data science. For example, the encoding and imputation processes solve fundamental incompatibility issues that are not resolvable in subsequent operations. Further, it is evident from the pipeline fitting that, while the pipeline encompasses the entire life-cycle of modeling tasks, the selection of algorithms and their corresponding hyperparameters still relies on manual decisions or automated optimization. The automated optimization of such algorithm-hyperparameter pairs underscores the term "Auto" in AutoML.
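A composed pipeline of this form can be sketched as follows (using the imbalanced-learn Pipeline so the balancing step is applied only at fit time; the concrete components are illustrative stand-ins for the algorithm-hyperparameter pairs our AutoML searches over, and the encoding step is omitted because the toy features are already numeric):

```python
# Sketch of the composed pipeline P0 = M ∘ F ∘ S ∘ B ∘ I ∘ E,
# instantiated with one concrete algorithm per step.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

pipeline = Pipeline(steps=[
    ("I", SimpleImputer(strategy="median")),   # imputation
    ("B", SMOTE(random_state=0)),              # balancing (applied at fit time)
    ("S", StandardScaler()),                   # scaling
    ("F", SelectKBest(f_classif, k=10)),       # feature selection
    ("M", LogisticRegression(max_iter=1000)),  # ML model
])
pipeline.fit(X, y)
print(f"training accuracy: {pipeline.score(X, y):.3f}")
```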
3.2. Automated optimization

To achieve automated optimization, we adopt the framework of CASH, as formulated in Section 2, and extend it to the preprocessing-modeling space. Given the preprocessing-modeling pipeline summarized in Subsection 3.1, we construct the conjunction space for Data Encoding, Data Imputation, Data Balancing, Data Scaling, and Feature Selection, denoted as 𝒜_𝐸, 𝒜_𝐼, 𝒜_𝐵, 𝒜_𝑆, 𝒜_𝐹, respectively. In addition, the optimal algorithm-hyperparameter pairs can be identified by exploring the documentation from implemented libraries or packages. This allows us to incorporate the pipeline fitting into the objective function 𝒱, extending it as follows:

𝒱(ℒ, 𝑀_{𝜆_𝑀} ∘ 𝐹_{𝜆_𝐹} ∘ 𝑆_{𝜆_𝑆} ∘ 𝐵_{𝜆_𝐵} ∘ 𝐼_{𝜆_𝐼} ∘ 𝐸_{𝜆_𝐸}, 𝒟)

In the following, we demonstrate the optimization strategy adopted in our AutoML, focusing on supervised learning tasks on tabular datasets. Fig. 1 illustrates the automated optimization workflow, where a space of preprocessing techniques and ML models, along with the corresponding hyperparameter space for each method, is constructed and stored prior to optimization. This space is usually denoted as the Search Space and can be written as

𝒜 = 𝒜_𝐸 × 𝒜_𝐼 × 𝒜_𝐵 × 𝒜_𝑆 × 𝒜_𝐹 × 𝒜_𝑀

It is worth noting that, while our AutoML default search space includes all possible methods, representing the largest possible space, it is flexible and can be modified (add/remove) according to user needs.

Although the optimization theoretically guarantees finding the global optimum, in practice, it is nearly impossible due to limited computing resources, especially given the multi-dimensional search space we have designed in our AutoML. Thus, we introduce two of the most apparent and natural constraints, time and number of trials, as the computing budget. These two constraints are referred to as the time budget and evaluation budget, respectively, and both can be modified according to
user demands. The time budget denotes the maximum allowed run-time for experiments, while the evaluation budget limits the number of evaluations or trials executed during the experiments. Both budgets are audited before each round of evaluations to determine whether a new trial should be generated. A new trial is initiated only if both budgets are not depleted. It is worth noting that the time and evaluation budgets are highly correlated: a larger number of trials typically requires a longer runtime, while allowing for longer experiments generally enables deeper exploration of the search space. Furthermore, since the intermediate optimal loss during the search is non-increasing with respect to the number of trials for fixed sampling procedures, the optimization guarantees at least non-degenerating performance as more time is spent searching or more sets of hyperparameters are explored. Consequently, increasing either the time or evaluation budget generally improves the performance of the optimization in practice.

For each round of evaluation, a specific set of hyperparameters is sampled from the search space using a predefined sampling method, commonly referred to as the Search Algorithm. The sampled hyperparameter sets can be either independent of each other (e.g., Random Search, Grid Search) or conditional on previous evaluations (e.g., Bayesian Search, Snoek et al. (2012), Wu et al. (2019)). Each sampled hyperparameter …

Algorithm 1: The AutoML optimization.
Input: Dataset 𝒟 = (𝒟_train, 𝒟_valid); Search space 𝒜; Time budget 𝑇; Evaluation budget 𝐺; Search algorithm 𝑆𝑎𝑚𝑝
Output: Optimal pipeline with hyperparameter settings 𝐏*
1   𝑘 = 0 ;                                    /* Round of evaluation */
2   𝑡_re = 𝑇 ;                                 /* Remaining time budget */
3   𝑔_re = 𝐺 ;                                 /* Remaining evaluation budget */
4   while 𝑡_re > 0 and 𝑔_re > 0 do
5       𝑡_start = CurrentTime;
6       (𝐸^(𝑘), 𝜆_𝐸^(𝑘)), (𝐼^(𝑘), 𝜆_𝐼^(𝑘)), (𝐵^(𝑘), 𝜆_𝐵^(𝑘)), (𝑆^(𝑘), 𝜆_𝑆^(𝑘)), (𝐹^(𝑘), 𝜆_𝐹^(𝑘)), (𝑀^(𝑘), 𝜆_𝑀^(𝑘)) = 𝑆𝑎𝑚𝑝^(𝑘)(𝒜);
7       𝐏_𝑘 = 𝑀^(𝑘)_{𝜆_𝑀^(𝑘)} ∘ 𝐹^(𝑘)_{𝜆_𝐹^(𝑘)} ∘ 𝑆^(𝑘)_{𝜆_𝑆^(𝑘)} ∘ 𝐵^(𝑘)_{𝜆_𝐵^(𝑘)} ∘ 𝐼^(𝑘)_{𝜆_𝐼^(𝑘)} ∘ 𝐸^(𝑘)_{𝜆_𝐸^(𝑘)};
8       𝐿^eval,(𝑘) = 𝒱(ℒ, 𝐏_𝑘, 𝒟);
9       𝑡_end = CurrentTime;
10      𝑘 = 𝑘 + 1;
11      𝑡_re = 𝑡_re − (𝑡_end − 𝑡_start);
12      𝑔_re = 𝑔_re − 1;
        /* Recording hyperparameters, preprocessed data, trained pipeline */
13  end
14  𝑘* = argmin_𝑘 𝐿^eval,(𝑘) ;                 /* Find optimal pipeline order */
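The budget-audited loop of Algorithm 1 can be sketched in a few lines (a simplified random-search stand-in for the search algorithm 𝑆𝑎𝑚𝑝, with a single model family in place of the full pipeline; the actual implementation delegates sampling and parallel execution to Ray Tune and the search libraries named in the continuation below):

```python
# Simplified sketch of Algorithm 1: random search under a joint
# time budget T (seconds) and evaluation budget G (trials).
import random
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
search_space = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

T, G = 60.0, 16               # time budget (seconds) and evaluation budget
t_re, g_re = T, G             # remaining budgets
best_loss, best_config = float("inf"), None

while t_re > 0 and g_re > 0:
    t_start = time.time()
    config = {k: random.choice(v) for k, v in search_space.items()}  # Samp(A)
    pipeline = RandomForestRegressor(random_state=0, **config)
    loss = -cross_val_score(pipeline, X, y, cv=3,
                            scoring="neg_mean_squared_error").mean()
    if loss < best_loss:                      # track the incumbent optimum
        best_loss, best_config = loss, config
    t_re -= time.time() - t_start             # audit the budgets
    g_re -= 1

print(best_config, round(best_loss, 2))
```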
… integration or conversion of frequently-used search algorithms like HyperOpt (Bergstra et al., 2013), Nevergrad (Rapin and Teytaud, 2018), and Optuna (Akiba et al., 2019), combined with the parallel hyperparameter tuning architecture supported by Ray Tune, enables us to create a more flexible environment for experiment settings while ensuring efficiency and effectiveness.

3.3. Ensemble model

As discussed previously, ensemble learning offers alternative solutions to the imbalance problem. In addition to the sampling techniques summarized in Subsection 3.1, we integrate ensemble learning into our AutoML to further address the imbalance problem. Ensemble models combine multiple trained models into a single model by aggregating their predictions, potentially addressing the imbalance problem that individual models cannot effectively handle. Moreover, constructing an ensemble model is recognized as an effective technique for achieving state-of-the-art performance in practical applications. In our AutoML, pipeline-level ensembles are integrated within the pipeline optimization framework. Consequently, the pipelines {𝐏_𝑘}, 𝑘 = 1, 2, ..., 𝐺, generated during each evaluation round in Algorithm 1 can naturally serve as candidates to form the ensemble model. To enhance the flexibility of the ensemble models, we implement three major ensemble structures: Stacking, Bagging, and Boosting, as summarized in Dong et al. (2020).

Considering a total of 𝐺 trained pipelines constrained by the time/evaluation budget following specific training protocols, 𝐻 pipelines are selected to construct the ensemble model. The final predictions of the ensemble model can be computed as the aggregation of individual predictions through certain Voting Mechanisms. Consequently, the ensemble model Σ, given 𝐺 pipelines {𝐏_𝑔}, 𝑔 = 1, 2, …, 𝐺, can be expressed as

Σ_𝐻 = Σ_𝐻(𝐏_1, 𝐏_2, …, 𝐏_𝐺) = Σ_𝐻(𝐏_(1), 𝐏_(2), …, 𝐏_(𝐻))

where 𝐻 denotes the pre-fixed hyperparameter for the size of the ensemble model (number of selected pipelines), and 𝐏_(ℎ) refers to the ℎ-th pipeline ranked by the evaluation loss 𝐿^eval computed from the objective function 𝒱. The predictions, given the ensemble model Σ_𝐻 and an input matrix X, can be expressed as

ŷ = Σ_𝐻(X) = 𝛾(𝐏_(1)(X), 𝐏_(2)(X), …, 𝐏_(𝐻)(X))

where 𝛾 denotes the voting mechanism attached to the ensemble model Σ_𝐻.

It is important to note that the three ensemble structures only function as the protocol of the training diagram, predictions through the voting mechanism, and, in our work, validation for the completion of the task. They do not involve any additional training procedures. The optimization of each individual pipeline given the input training/validation datasets thus remains unaffected by the deployment of ensemble models. Specifically, the naming of these ensemble models distinguishes them by their training diagrams. The stacking ensemble models utilize a fully parallel optimization approach across entire datasets, where multiple base pipelines are trained independently and their predictions are aggregated through the voting mechanism. In contrast, bagging ensemble models are optimized by training multiple base pipelines on random subsets of the data, aiming to reduce variance and enhance generalization. Boosting ensemble models iteratively improve model predictions by focusing on the residuals from previous pipelines, sequentially refining predictions to minimize overall error. Refer to Appendix B for details of the ensemble strategies.

3.4. Loss functions

Loss functions play a pivotal role in both model creation and evaluation. By mapping pairs of true observations and model predictions to a real value, loss functions provide a quantifiable measure of model performance. As summarized by Wang et al. (2022), a variety of loss functions have been developed to address the specific demands of different ML tasks. These loss functions are meticulously designed to optimize the model performance across diverse applications, ensuring alignment with the objectives and constraints of each particular task.

The choice of loss functions can significantly influence model creation and, ultimately, the success of ML models. In financial modeling, Bams et al. (2009) demonstrate that the choice of loss functions has a substantial impact on option valuation, underscoring their critical role in this domain. Similarly, in deep learning, the design of appropriate loss functions is crucial for tasks such as Object Detection (Lin et al., 2017) and Image Segmentation (Salehi et al., 2017). In the insurance domain, one of the major challenges is the imbalanced distribution of response variables, which complicates the direct application of ML models. To address this issue, researchers have proposed carefully calibrated imbalance learning algorithms and adjusted cost-sensitive loss functions, as highlighted by Hu et al. (2022), Zhang et al. (2024), and So and Valdez (2024).

Choosing the appropriate loss function is essential for aligning ML models with insurance business objectives. It ensures that models not only perform well technically, but also deliver outcomes that are meaningful and beneficial to the insurance business. Loss functions guide the model during training by quantifying errors, ensuring that the model learns to minimize the types of error that matter the most to the insurance context. Our AutoML framework provides a variety of common loss functions suitable for both imbalanced and balanced learning scenarios. Additionally, it offers flexibility for users to employ customized loss functions based on their specific needs. This customization allows users to define loss functions that better capture their unique objectives, leading to more relevant and actionable insights. For instance, in sensitive areas like insurance pricing, ensuring fairness across different demographic groups is crucial. Custom loss functions can be designed to enforce fairness constraints and comply with regulatory requirements, helping to ensure ethical and equitable outcomes.

4. AutoML in action

To demonstrate the feasibility and efficacy of our AutoML, we conducted experiments using several datasets studied by actuarial science researchers. We evaluated both the performance and run-time of these experiments. Our AutoML framework is user-friendly and requires only a few lines of code to deploy, making it accessible to inexperienced users. Please refer to Appendix D for the code required to run the experiments and the corresponding descriptions, along with the optimal pipeline hyperparameter settings observed in these experiments.

4.1. French motor third-party liability

In this experiment, we use the French Motor Third-Party Liability dataset, freMTPL2freq, from the package CASDatasets (Charpentier, 2015), which comprises 677,991 motor liability policies collected in France with the response variable ClaimNb, indicating the number of claims during the exposure period, and 10 categorical/numerical explanatory variables (excluding the policy ID, IDpol). Refer to Table 1 for the full list of features and response variables and their descriptions. The task is to predict future claim frequency, which is framed as a regression problem. We follow the same random seed and train/test split percentage suggested by Noll et al. (2020) to replicate the train/test sets, without applying any preprocessing techniques before feeding the data into the AutoML pipeline. The performance of the experiment is evaluated by mean Poisson deviance, the same metric utilized in Noll et al. (2020), which can be expressed as

𝒟_Poi(y, ŷ) = (2/𝑍) ∑_{𝑧=1}^{𝑍} ( ŷ_𝑧 − 𝑦_𝑧 + 𝑦_𝑧 log(𝑦_𝑧 / ŷ_𝑧) )

for a total of 𝑍 true response values 𝑦_𝑧 and predictions ŷ_𝑧 (𝑧 = 1, 2, ..., 𝑍). The evaluation of mean Poisson deviance is common in actuarial practice for claim frequency modeling.
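For reference, the metric can be computed directly with a few lines of code (a helper we add for illustration; it mirrors the formula above, with the 𝑦_𝑧 log(𝑦_𝑧/ŷ_𝑧) term taken as zero when 𝑦_𝑧 = 0, and scikit-learn's mean_poisson_deviance offers an equivalent implementation):

```python
import numpy as np

def mean_poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance D_Poi(y, y_hat) as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # y_z * log(y_z / y_hat_z) is taken as 0 whenever y_z = 0.
    ratio = np.divide(y_true, y_pred, out=np.ones_like(y_true), where=y_true > 0)
    return 2.0 * np.mean(y_pred - y_true + y_true * np.log(ratio))

print(round(mean_poisson_deviance([0, 1, 2, 0], [0.1, 0.8, 1.5, 0.2]), 4))
```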
Table 1
Features & Response variables of freMTPL2freq dataset.

Table 2
AutoML performance on freMTPL2freq dataset.

G      T/s       runtime/s   Train Deviance   Test Deviance
8      900       807.62      0.3622           0.3689
16     1,800     1,082.21    0.3826           0.3890
32     3,600     2,092.15    0.3156           0.3250
64     7,200     4,417.51    0.3022           0.3122
128    14,400    8,052.91    0.2925           0.3034
256    28,800    12,624.60   0.2779           0.3009
512    57,600    34,036.03   0.2762           0.3020
1024   115,200   63,401.81   0.2539           0.3114

To illustrate the performance of our AutoML in terms of evaluation and time budget, we train the AutoML across a range of increasing evaluation budgets 𝐺 and time budgets 𝑇. The runtime and train/test deviance for each ensemble of fitted pipelines are shown in Table 2, with a visualization provided in Fig. 2 on a log scale. In Fig. 2, it is evident that as 𝐺 and 𝑇 increase synchronously, the runtime increases approximately linearly with the evaluation budget. Concurrently, train/test deviance decreases, indicating the performance gain from a more extensive hyperparameter search. The superior performance at 𝐺 = 8 compared to 𝐺 = 16 can be explained by the insufficient number of fitted pipelines at early training stages. At the starting point 𝐺 = 8, only two valid pipelines could be selected to form the ensemble model out of five trained pipelines, due to the metric of mean Poisson deviance only allowing non-zero predictions. In contrast, at 𝐺 = 16, although five valid pipelines could be selected under the current evaluation and time budget, the ensemble construction underperforms compared to the one from the 𝐺 = 8 stage. This is because the early stages of hyperparameter search do not provide enough valid and high-performing pipelines. This suggests that users should consider increasing the evaluation and time budget for better results. On the other hand, the increasing test deviance observed from 𝐺 = 256 to 𝐺 = 1,024 does not necessarily indicate overfitting but rather reflects the limitations of the search concerning the vast hyperparameter space designed for the AutoML pipeline. This suggests that users should consider stopping the AutoML process and investigating the fitted pipelines.

4.2. Wisconsin local government property insurance fund

…

𝑥̂ = log(𝑥 + 1)

where 𝑥 denotes the original variable and 𝑥̂ refers to the transformed variable. In training our AutoML, we optimize based on the loss function using the Coefficient of Determination (𝑅²). For a total of 𝑍 true values y and their corresponding predictions ŷ generated by the fitted AutoML, it can be written as

𝑅²(y, ŷ) = 1 − RSS/TSS

where RSS = ∑_{𝑧=1}^{𝑍} (ŷ_𝑧 − 𝑦_𝑧)² and TSS = ∑_{𝑧=1}^{𝑍} (𝑦_𝑧 − ȳ)² denote the residual sum of squares and the total sum of squares respectively, and ȳ = (1/𝑍) ∑_{𝑧=1}^{𝑍} 𝑦_𝑧 is the mean of all true values.

The experiment follows the same structure as demonstrated in Subsection 4.1, where we record the runtime and train/test errors as the evaluation and time budget increase. Table 4 and Fig. 3 summarize the results and the visualization correspondingly. Both Table 4 and Fig. 3 suggest that scaling the evaluation and time budget on the LGPIF dataset has a less significant impact on the performance compared to the freMTPL2freq dataset. One obvious reason is that the LGPIF dataset is significantly smaller than the freMTPL2freq dataset, requiring fewer trials.
Table 3
Features & Response variables of LGPIF dataset.

          Variable        Type          Description
Feature   Type            Categorical   Binary indicator of property type (City, County, Misc, School, Town, Village)
          IsRC            Categorical   Binary indicator of replacement cost
          Coverage        Numerical     Coverage of the property (BC, IM, PN, PO, CN, CO)
          lnDeduct        Numerical     Logarithm of deductible (BC, IM)
          NoClaimCredit   Categorical   Binary indicator of prior claim reports (BC, IM, PN, PO, CN, CO)

Table 4
AutoML performance on LGPIF dataset.

G   T/s   runtime/s   Train 𝑅²   Test 𝑅²

Table 5
AutoML performance on ausprivauto claim occurrence.

G   T/s   runtime/s   Train AUC   Test AUC
Table 6
AutoML performance on ausprivauto claim frequency.
Table 7
AutoML performance on ausprivauto claim amount.
Table 8
Comparison of AutoML with GLM models.

Data    Model     Metric   GLM        G      AutoML
LGPIF   Tweedie   𝑅²       0.2062     1024   0.2377
                  Gini     0.4089            0.4187
                  ME       0.1609            0.0476
                  MSE      14.0533           13.4956
                  MAE      2.8749            2.8955

Table 9
Comparison of AutoML with other actuarial literature.

Data    Benchmark                Metric   Benchmark   G      AutoML
LGPIF   Quan and Valdez (2018)   𝑅²       0.229       1024   0.2377
                                 Gini     0.414              0.4187
                                 ME       0.048              0.0476
                                 MSE      13.651             13.4956
                                 MAE      2.883              2.8955

Table 10
Comparison of AutoML on general ML tasks.

Task               Model              Evaluation Metric   Score
Flood Prediction   InsurAutoML        𝑅² score            0.86476
                   autogluon medium                       0.86171
                   autogluon best                         0.86262
                   h2o                                    0.86073
                   auto-sklearn                           0.86233

the claim severity prediction. While Quan and Valdez (2018) utilized various tree-based models, our AutoML ensemble model leverages multiple state-of-the-art algorithms, resulting in superior prediction performance. A similar trend is observed for the claim occurrence prediction on the ausprivauto_occ dataset, where our AutoML outperforms the stacking ensemble model approach suggested by Si et al. (2022). The comparison of the experiment LGPIF in Table 8 and Table 9 suggests that our AutoML not only outperforms the GLM but also has the potential to …

In addition to insurance-specific tasks, our AutoML can be employed in general ML tasks. In the following, we demonstrate the capability of our AutoML compared with other popular AutoML frameworks. Specifically, we test our AutoML on two Kaggle competition datasets, Regression with a Flood Prediction Dataset (https://www.kaggle.com/competitions/playground-series-s4e5) and Regression with an Abalone Dataset (https://www.kaggle.com/competitions/playground-series-s4e4). We compare our AutoML with three popular AutoML architectures, autogluon (https://github.com/autogluon/autogluon; Erickson et al., 2020), h2o (https://github.com/h2oai/h2o-3; LeDell and Poirier, 2020) and auto-sklearn (https://github.com/automl/auto-sklearn; Feurer et al., 2015). The private scores submitted by each AutoML architecture are presented in Table 10. The evaluation metrics employed are consistent with those used in the corresponding competitions. Specifically, the 𝑅² score is used for the Flood Prediction task, while the Root Mean Squared Logarithmic Error (RMSLE) is utilized for the Abalone task. The RMSLE can be mathematically expressed as:

RMSLE(y, ŷ) = sqrt( (1/𝑍) ∑_{𝑧=1}^{𝑍} ( log(1 + ŷ_𝑧) − log(1 + 𝑦_𝑧) )² )
For autogluon, we ran experiments using medium preset and best preset,
sented in Table 10. The evaluation metrics employed are consistent with
denoting autogluon medium and autogluon best respectively. For our Au-
toML, as well as for the h2o and auto-sklearn frameworks, we allocated a
6 runtime of 6-10 hours for each experiment. The results indicate that our
https://www.kaggle.com/competitions/playground-series-s4e5.
7
https://www.kaggle.com/competitions/playground-series-s4e4. AutoML is competitive in performance compared to these widely used
8
https://github.com/autogluon/autogluon. and well-established AutoML frameworks on general ML tasks. More-
9
https://github.com/h2oai/h2o-3. over, our AutoML pipeline demonstrates versatility, being applicable to
10
https://github.com/automl/auto-sklearn. both general ML tasks and insurance-specific ML applications.
… the stacking procedure for the ensemble model, specifically line 14 in the algorithm description.

… is the best-performing pipeline for subset dataset 𝒟^(ℎ). The ensemble model, targeting optimal performance, is then constructed as:
… set 𝒟_train = (X_train, y_train), but instead focuses on optimizing the residual dataset

𝒟_ℎ = (X_train, y_train,ℎ)

where

y_train,ℎ = y_train,ℎ−1 − 𝐏_{ℎ−1}(X_train)

for ℎ = 1, 2, ..., 𝐻. The initialization of the residuals and predictions can be defined as y_train,0 = y_train and 𝐏_0(X_train) = 0, indicating the first pipeline is trained on the original train set 𝒟_train. By learning from residuals instead of original response variables, the subsequent pipelines possess the potential to address errors that preceding pipelines could not, thereby potentially solving the problem of imbalance. In our AutoML, rather than utilizing a full sequential training diagram, we split the training by steps. At step ℎ, (ℎ = 1, 2, ..., 𝐻), pipelines 𝐏_{ℎ,𝑘}, (𝑘 = 1, 2, ..., 𝐺∕∕𝐻), aim to optimize on datasets 𝒟_ℎ, and the best-performing pipelines 𝐏_{ℎ,(1)} are selected by the ascending order of evaluation losses. The components of the ensemble model Σ_𝐻 can then be defined as 𝐏_ℎ = 𝐏_{ℎ,(1)} and utilized to generate dataset 𝒟_{ℎ+1} for the next step. Following the residual learning framework, the boosting ensemble utilized in our work can be summarized as Algorithm 4. The algorithm closely resembles Algorithm 3, with the distinction of modifying its learning data set using residuals on response variables instead of employing feature subsets on feature matrices, as seen in the bagging ensemble. Similar to the bagging ensemble approach, the optimization of the residual dataset 𝒟_ℎ can be fully parallelized. This leads to more efficient optimization and prevents the occurrence of sub-optimal pipelines on individual residual datasets.

Voting Mechanism in Ensemble: The voting mechanism, a critical component in the ensemble, determines how the predictions from individual candidate pipelines can be aggregated to produce the final prediction for the ensemble model.

For the boosting ensemble, given its unique training diagram of residual learning, the voting mechanism is typically the summation of pipelines:

ŷ = Σ_𝐻(X) = ∑_{ℎ=1}^{𝐻} 𝐏_ℎ(X)
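The residual-learning scheme can be sketched with ordinary regressors standing in for the fitted pipelines 𝐏_ℎ (a minimal two-stage illustration we add; the full procedure, including budget splitting and per-step pipeline selection, is given in Algorithm 4 below):

```python
# Sketch of the boosting ensemble's residual learning: each stage fits
# the residuals left by the previous stages, and predictions are summed.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=6, noise=15.0, random_state=0)

H = 2
stages = [LinearRegression(), DecisionTreeRegressor(max_depth=4, random_state=0)]
residual = y.copy()                    # y_train,0 = y_train
prediction = np.zeros_like(y)          # P_0(X_train) = 0

for h in range(H):
    stages[h].fit(X, residual)         # P_h trained on D_h = (X, y_train,h)
    stage_pred = stages[h].predict(X)
    prediction += stage_pred           # voting mechanism: summation
    residual = residual - stage_pred   # y_train,h+1 = y_train,h - P_h(X)

print("boosting ensemble MSE:", round(float(np.mean((y - prediction) ** 2)), 2))
```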
Algorithm 4: The Boosting Ensemble.
Input: Dataset 𝒟 = (𝒟_train, 𝒟_valid); Search space 𝒜; Time budget 𝑇; Evaluation budget 𝐺; Search algorithm 𝑆𝑎𝑚𝑝; Size of the ensemble model 𝐻
Output: Ensemble Σ_𝐻
1   Initialization: y_train,0 = y_train; y_valid,0 = y_valid; ŷ_train,0 = 0; ŷ_valid,0 = 0;
2   for ℎ ← 1 to 𝐻 do
3       𝑘 = 0 ;                                /* Round of evaluation */
4       𝑡_re = 𝑇∕∕𝐻 ;                           /* Remaining time budget */
5       𝑔_re = 𝐺∕∕𝐻 ;                           /* Remaining evaluation budget */
6       y_train,ℎ = y_train,ℎ−1 − ŷ_train,ℎ−1 ; y_valid,ℎ = y_valid,ℎ−1 − ŷ_valid,ℎ−1 ;
7       𝒟_ℎ = ((X_train, y_train,ℎ), (X_valid, y_valid,ℎ));
8       while 𝑡_re > 0 and 𝑔_re > 0 do
9           𝑡_start = CurrentTime;
10          (𝐸^(𝑘), 𝜆_𝐸^(𝑘)), (𝐼^(𝑘), 𝜆_𝐼^(𝑘)), (𝐵^(𝑘), 𝜆_𝐵^(𝑘)), (𝑆^(𝑘), 𝜆_𝑆^(𝑘)), (𝐹^(𝑘), 𝜆_𝐹^(𝑘)), (𝑀^(𝑘), 𝜆_𝑀^(𝑘)) = 𝑆𝑎𝑚𝑝^(𝑘)(𝒜);
11          𝐏_{ℎ,𝑘} = 𝑀^(𝑘)_{𝜆_𝑀^(𝑘)} ∘ 𝐹^(𝑘)_{𝜆_𝐹^(𝑘)} ∘ 𝑆^(𝑘)_{𝜆_𝑆^(𝑘)} ∘ 𝐵^(𝑘)_{𝜆_𝐵^(𝑘)} ∘ 𝐼^(𝑘)_{𝜆_𝐼^(𝑘)} ∘ 𝐸^(𝑘)_{𝜆_𝐸^(𝑘)};
12          𝐿^eval,(𝑘) = 𝒱(ℒ, 𝐏_{ℎ,𝑘}, 𝒟_ℎ);
13          𝑡_end = CurrentTime;
14          𝑘 = 𝑘 + 1;
15          𝑡_re = 𝑡_re − (𝑡_end − 𝑡_start);
16          𝑔_re = 𝑔_re − 1;
17      end
18      {𝐏_{ℎ,(𝑘)}} = sort({𝐏_{ℎ,𝑘}});
19      ŷ_train,ℎ = 𝐏_{ℎ,(1)}(X_train); ŷ_valid,ℎ = 𝐏_{ℎ,(1)}(X_valid);
20  end
21  Σ_𝐻 = Σ_𝐻(𝐏_{1,(1)}, 𝐏_{2,(1)}, …, 𝐏_{𝐻,(1)});
22  return Σ_𝐻;

For stacking and bagging ensembles, the voting mechanisms must be tailored to the nature of the regression or classification tasks, but are interchangeable between stacking and bagging structures.

In regression tasks, the voting mechanisms often involve aggregation statistics. Commonly used statistics include mean, median, and maximum, calculated by performing the corresponding operation on all 𝐻 predictions by individual pipelines. For example, in the mean voting mechanism, the corresponding predictions of the ensemble models can be expressed as

ŷ = Σ_𝐻(X) = (1/𝐻) ∑_{ℎ=1}^{𝐻} 𝐏_ℎ(X)

In classification tasks, the voting can be either hard or soft (Polikar, 2006), which differ by aggregating the class-level predictions or the probabilities of each class. The soft voting aggregates the prediction probabilities generated by individual pipelines. For an 𝑂-category classification problem, the prediction probability of pipeline ℎ, 𝐏_ℎ, can be represented as a probability vector

𝐏_ℎ^prob(X) = (𝑝_{ℎ,1}, 𝑝_{ℎ,2}, …, 𝑝_{ℎ,𝑂})

where 𝑝_{ℎ,𝑜} ∈ [0, 1], (𝑜 = 1, 2, …, 𝑂), denotes the probability of being categorised as class 𝑜 and ∑_{𝑜=1}^{𝑂} 𝑝_{ℎ,𝑜} = 1. The ensemble model prediction in soft voting is the class with the highest aggregated probability:

ŷ = argmax ∑_{ℎ=1}^{𝐻} 𝐏_ℎ^prob(X) = argmax( ∑_{ℎ=1}^{𝐻} 𝑝_{ℎ,1}, ∑_{ℎ=1}^{𝐻} 𝑝_{ℎ,2}, …, ∑_{ℎ=1}^{𝐻} 𝑝_{ℎ,𝑂} )

The hard voting, however, aggregates the prediction classes instead of the prediction probabilities. The prediction classes of individual pipeline ℎ can be expressed as

𝐏_ℎ(X) = argmax(𝑝_{ℎ,1}, 𝑝_{ℎ,2}, …, 𝑝_{ℎ,𝑂})

indicating that the predicted class is the most probable class based on probability. If class 𝑜 is the predicted class, the prediction can further be re-written as a unit vector

𝐏_ℎ(X) = 𝐞_𝑜 = (0, …, 0, 1, 0, …, 0)

with the single 1 in the 𝑜-th position. The aggregation of predictions in hard voting can be considered as the majority vote among all 𝐻 pipelines, with the probabilities reduced to indicator functions/unit vector representations. The eventual predictions of the ensemble model can be calculated as

ŷ = argmax( ∑_{ℎ=1}^{𝐻} 1{𝐏_ℎ(X)=1}, ∑_{ℎ=1}^{𝐻} 1{𝐏_ℎ(X)=2}, …, ∑_{ℎ=1}^{𝐻} 1{𝐏_ℎ(X)=𝑂} ) = argmax ∑_{ℎ=1}^{𝐻} 𝐏_ℎ(X)

where the first expression utilizes the numerical representation of 𝐏_ℎ(X) and the second corresponds to the unit vector representation.

With the training diagram and the voting mechanism, multiple pipelines can be coordinated to address the imbalance problems that individual pipelines alone might struggle with, as demonstrated in various empirical studies (Polikar, 2006; Galar et al., 2012; Dong et al., 2020).
be tailored to the nature of the regression or classification tasks, but are
Table 12
interchangeable between stacking and bagging structures. Data Encoding Methods.
In regression tasks, the voting mechanisms often involve aggregation
Method Description
statistics. Commonly used statistics include mean, median, and maxi-
mum, calculated by performing the corresponding operation on all 𝐻 Data Encoding Transform string-based variables into numeric representations,
and preserve the corresponding conversion table to ensure
predictions by individual pipelines. For example, in the mean voting
consistent application on the test sets. Ordinal and One-hot
mechanism, the corresponding predictions of the ensemble models can encoding available.
be expressed as
𝐻
1 ∑
ŷ = Σ𝐻 (X) = (X) Table 13
𝐻 ℎ=1 ℎ Data Imputation Methods.
In classification tasks, the voting can be either hard or soft (Po- Method Description
likar, 2006), which differs by aggregating the class-level predictions Simple Imputation Impute missing values by applying relevant feature
or probabilities of each class. The soft voting aggregates the prediction statistics, such as the mean or median.
probabilities generated by individual pipelines. For a 𝑂 -category clas- Joint Imputation Assuming the data follows a multivariate normal
distribution, compute the mean vector and covariance
sification problem, the prediction probability of pipeline ℎ, ℎ , can be
matrix based on the observed values. Then, impute the
represented as a probability vector missing values by sampling from this distribution.
Expectation Iteratively update the imputed values, along with the mean
ℎ𝑝𝑟𝑜𝑏 (X) = (𝑝ℎ,1 , 𝑝ℎ,2 , … , 𝑝ℎ,𝑂 ) Maximization (EM) vector and covariance matrix of the joint distribution, to
maximize the likelihood of the multivariate normal
where 𝑝ℎ,𝑜 ∈ [0, 1], (𝑜 = 1, 2, … , 𝑂 ), denotes the probability of being cat- distribution.
∑𝑂
egorised as class 𝑜 and 𝑜=1 𝑝ℎ,𝑜 = 1. The ensemble model prediction in kNN Imputation Impute missing values using a k-Nearest Neighbors (kNN)
model trained on the observed values. (Stekhoven and
soft voting is the class with the highest aggregated probability: Bühlmann, 2012)
Miss Forest Impute missing values by training a Random Forest model
𝐻
∑ 𝐻
∑ 𝐻
∑ 𝐻
∑ Imputation on the observed values. (Stekhoven and Bühlmann, 2012)
ŷ = argmax ℎ𝑝𝑟𝑜𝑏 (X) = argmax( 𝑝ℎ,1 , 𝑝ℎ,2 , … , 𝑝ℎ,𝑂 ) Multiple Initially impute the missing values through a simple
ℎ=1 ℎ=1 ℎ=1 ℎ=1 Imputation by imputation method. Then, iteratively refine the imputation
The hard voting, however, aggregates the prediction classes instead of Chained Equations by removing the previously imputed values, training a base
(MICE) ML model using only the observed values, and re-imputing
the prediction probabilities. The prediction classes of individual pipeline the missing values column by column. This process is
ℎ can be expressed as repeated multiple times until the imputations converge to
stable estimates. (Azur et al., 2011)
ℎ (X) = argmax(𝑝ℎ,1 , 𝑝ℎ,2 , … , 𝑝ℎ,𝑂 ) Generative Train a Generative Adversarial Network (GAN) model to
Adversarial impute missing values. The model consists of a generator
indicating that the predicted class is the most probable class based on Imputation Nets that produces realistic imputations and a discriminator that
probability. If class 𝑜 is the predicted class, the prediction can further (GAIN) differentiates between actual observed values and the
generated imputations. (Yoon et al., 2018)
be re-written as a unit vector
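To make the mean, soft, and hard voting mechanisms described above concrete, the following minimal sketch aggregates the predictions of H candidate pipelines with plain NumPy. The array shapes and function names are illustrative assumptions and are not part of the InsurAutoML API.

import numpy as np

def mean_voting(preds):
    # Mean voting for regression: preds has shape (H, n_samples).
    return preds.mean(axis=0)

def soft_voting(probs):
    # Soft voting: probs has shape (H, n_samples, O); each pipeline's rows sum to 1.
    # The predicted class maximizes the probability aggregated over pipelines.
    return probs.sum(axis=0).argmax(axis=1)

def hard_voting(probs):
    # Hard voting: each pipeline first predicts its most probable class (a unit
    # vector over the O classes), then the class tallies are aggregated.
    classes = probs.argmax(axis=2)                    # shape (H, n_samples)
    H, n_samples = classes.shape
    O = probs.shape[2]
    counts = np.zeros((n_samples, O), dtype=int)
    for h in range(H):
        counts[np.arange(n_samples), classes[h]] += 1
    return counts.argmax(axis=1)

# toy example: H = 3 pipelines, 2 samples, O = 2 classes
probs = np.array([[[0.7, 0.3], [0.4, 0.6]],
                  [[0.6, 0.4], [0.2, 0.8]],
                  [[0.1, 0.9], [0.3, 0.7]]])
print(soft_voting(probs))   # [1 1]
print(hard_voting(probs))   # [0 1]

In the toy example, soft and hard voting disagree on the first sample, which is exactly the distinction between the two aggregation schemes drawn above.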
Table 14
Data Balancing Methods.
Method Description
Simple Random Over-Sampling: Randomly replicate minority samples until the desired majority-minority threshold is achieved.
Simple Random Under-Sampling: Randomly remove majority samples to achieve the desired majority-minority threshold.
Tomek Link: By identifying adjacent majority-minority class pairs and removing majority samples from these pairs, the ratio of the majority class can be reduced. (Tomek, 1976)
Edited Nearest Neighbor (ENN): Remove majority samples that are misclassified by a kNN model trained on the original data. (Wilson, 1972)
Condensed Nearest Neighbor (CNN): Select a subset of majority class observations that aligns with the predictions of a kNN model trained on the original data. (Hart, 1968)
Synthetic Minority Over-Sampling Technique (SMOTE): Generate synthetic minority samples by applying linear interpolation between existing minority samples in the feature space. (Chawla et al., 2002)
One Sided Selection (OSS), CNN-TomekLink: A two-step method where the first stage utilizes Tomek Links, followed by CNN.
Smote-TomekLink, Smote-ENN: Two-step balancing strategies that combine two balancing algorithms.
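As an illustration of the simplest entries in Table 14, the sketch below implements simple random over-sampling in plain NumPy. The function name and its ratio argument are illustrative simplifications, not the InsurAutoML implementation (whose behaviour is governed by an imbalance_threshold hyperparameter, as reported in Listings 3, 5, and 7).

import numpy as np

def random_over_sample(X, y, minority_class, ratio=1.0, seed=42):
    # Randomly replicate minority samples until the minority count reaches
    # ratio * (majority count), i.e. simple random over-sampling.
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_class)
    majority_idx = np.flatnonzero(y != minority_class)
    n_extra = max(int(ratio * len(majority_idx)) - len(minority_idx), 0)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
    return X[keep_idx], y[keep_idx]

# toy example: 90 majority (class 0) observations vs 10 minority (class 1)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = random_over_sample(X, y, minority_class=1, ratio=1.0)
print(np.bincount(y_bal))   # [90 90]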
Table 15
Data Scaling Methods.
Method Description
Standardization: Scale by $x_j^{\mathrm{scaled}} = \frac{x_j - \mu_j}{\sigma_j}$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$.
Normalization: Scale by $x_j^{\mathrm{scaled}} = \frac{x_j - x_{j,\min}}{x_{j,\max} - x_{j,\min}}$, where $x_{j,\min}$ and $x_{j,\max}$ are the minimum and maximum value of feature $j$.
Min-Max Scaling: Scale by $x_j^{\mathrm{scaled}} = \frac{x_j - x_{j,\min}}{x_{j,\max} - x_{j,\min}}\,(f_{j,\max} - f_{j,\min}) + f_{j,\min}$, where $f_{j,\min}$ and $f_{j,\max}$ are two pre-defined hyperparameters for a customized scaling range.
Robust Scaling: Scale by $x_j^{\mathrm{scaled}} = \frac{x_j - Q_{1/2,j}}{Q_{h,j} - Q_{l,j}}$, where $Q_{1/2,j}$, $Q_{l,j}$ and $Q_{h,j}$ are the 50th percentile and pre-defined low and high quantiles of feature $j$.
Power Transformation: Scale the feature exponentially.
Quantile Transformation: Transform the feature into a uniform or normal distribution.
Winsorization: Cap the feature at a certain quantile to remove outliers.
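For reference, the formulas in Table 15 correspond to standard scikit-learn preprocessing transformers. The short check below, which assumes a plain NumPy feature matrix, reproduces the standardization and min-max formulas and applies robust scaling to a feature with an outlier; it is illustrative only and is not taken from the InsurAutoML code base.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])   # a single feature with an outlier

# Standardization: (x - mu) / sigma
x_std = (X - X.mean(axis=0)) / X.std(axis=0)
assert np.allclose(x_std, StandardScaler().fit_transform(X))

# Min-max scaling to a customized range [f_min, f_max]
f_min, f_max = 0.0, 10.0
x_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) * (f_max - f_min) + f_min
assert np.allclose(x_mm, MinMaxScaler(feature_range=(f_min, f_max)).fit_transform(X))

# Robust scaling: (x - median) / (Q_high - Q_low), far less sensitive to the outlier
print(RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(X).ravel())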
Table 16
Feature Selection Methods.
Method Description
Feature Filter: Rank features based on their univariate correlation with the response variable, then select top-ranking features.
Percentile Feature Selection, Rates Feature Selection: Select features that fall within the highest percentile/rate by constructing univariate ML models to rank them.
Stepwise Feature Selection (SFS): Iteratively add or remove one feature at a time to identify the optimal subset of features.
Adaptive Sequential Forward Floating Search (ASFFS): An extension of SFS that allows both the addition and removal of features with dynamic steps to handle variable dependencies. (Somol et al., 1999)
Principal Component Analysis (PCA): Perform eigenvalue decomposition to calculate principal components, reducing data dimensionality while retaining essential information.
Truncated Singular Value Decomposition (SVD): Use SVD to decompose the design matrix and reduce dimensionality.
Minimal-Redundancy-Maximal-Relevance (mRMR): A variant of SFS that enhances feature selection performance by simultaneously maximizing the relevance of the selected features with respect to the response variable while minimizing redundancy among the selected features. (Peng et al., 2005)
Copula-based Feature Selection (CBFS): A variant of SFS that accounts for variable dependencies by embedding information into a copula framework. (Lall et al., 2021)
Genetic Algorithm (GA): Optimize feature selection by employing processes such as selection, crossover, and mutation to maximize model performance. (Tan et al., 2007)
Extra Tree Feature Selection: Select features based on their importance in an Extra Trees model, which uses random splits to determine feature significance.
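Several entries in Table 16 have direct scikit-learn counterparts. The sketch below applies a percentile-style univariate filter and PCA to a synthetic regression dataset; it illustrates the listed techniques under assumed toy settings rather than reproducing the InsurAutoML implementation.

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectPercentile, f_regression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Percentile-style filter: keep the top 25% of features ranked by univariate F-score
X_filtered = SelectPercentile(score_func=f_regression, percentile=25).fit_transform(X, y)
print(X_filtered.shape)   # (200, 5)

# PCA: keep enough principal components to retain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))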
Table 17
Classification Models.
Method Description
Stochastic Gradient Descent (SGD): Regularized linear models with non-linear loss functions optimized by SGD.
Logistic Regression: A linear model that uses a logistic (sigmoid) link function to predict probabilities.
Adaboost: Sequentially trains a series of weak learners, adjusting their weights so that each subsequent learner focuses more on the errors made by its predecessors.
Passive Aggressive (PA): Updates the model only when a prediction is incorrect or uncertain, aggressively adjusting the decision boundary while remaining passive when predictions are correct.
Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA): Identify the Bayes-optimal linear/quadratic boundary in the feature space that can effectively distinguish the distinct classes.
Generalized Additive Models (GAMs): A linear additive framework that learns smooth functions for each feature.
k Nearest Neighbor (kNN): Predicts the output by considering the majority or average outcome of the k closest data points in the feature space.
Linear Support Vector Machine (SVM), Kernel SVM: Linear/non-linear kernel-based SVM models that find the optimal decision boundary.
Gaussian Naive Bayes, Bernoulli Naive Bayes, Multinomial Naive Bayes: Gaussian/Bernoulli/Multinomial-based probabilistic classification algorithms based on Bayes' Theorem.
Decision Tree: Grows a single decision tree that recursively splits the data into branches and nodes based on feature values.
Extra Tree: Trains a series of completely randomized decision trees on sub-samples and averages their predictions.
Random Forest: An ensemble of decision trees trained on random feature subsets, with the final prediction made by averaging or voting across all trees.
Histogram Gradient Boosting, LightGBM, XGBoost: Different implementations of gradient boosting decision trees.
Multi-layer Perceptron (MLP): A feedforward neural network consisting of multiple layers of fully interconnected neurons.
Table 18
Regression Models.
Method Description
Appendix D. Experiment details

In the following, we present the code necessary to run the experiments demonstrated in Section 4 and provide brief descriptions of each code block. The scripts for all experiments are available in our GitHub repository.11

D.1. French motor third-party liability

Listing 1 presents the code used to run the French Motor Third-Party Liability experiment. The first 12 lines (Lines 1-12) are dedicated to preparing for the experiment. Specifically, Line 5 specifies the evaluation metric, which is the mean Poisson deviance. Lines 7-10 define key parameters such as the experiment's random seed, evaluation budget, the number of candidates used to construct the ensemble model, and the time budget. Lines 15-16 handle the data loading process, while Lines 19-20 define the response variable and the set of features. The train/test split is executed in Lines 26-34, where the split is determined by the train set index generated by Listing 2, following the procedure outlined by Noll et al. (2020). Once the training is complete, predictions are made on both the train and test sets, and the train/test mean Poisson deviance is reported as indicated in Lines 53-56.

In the training setup, we assign the experiment name based on the evaluation budget, as indicated in Line 39. For instance, if the evaluation budget is set to 512, the experiment will be named freMTPL2freq_512.

11 https://github.com/PanyiDong/InsurAutoML/tree/master/experiments.
Listing 1. Code for the French Motor Third-Party Liability experiment.
1 import InsurAutoML
2 from InsurAutoML import load_data, AutoTabularRegressor
3 import numpy as np
4 import pandas as pd
5 from sklearn.metrics import mean_poisson_deviance
6
7 seed = 42
8 n_trials = 64
9 N_ESTIMATORS = 4
10 TIMEOUT = (n_trials / 4) * 450
11
12 InsurAutoML.set_seed(seed)
13
14 # load data
15 database = load_data(data_type = ".csv").load(path = "")
16 database_names = [*database]
17
18 # define response/features
19 response = "ClaimNb"
20 features = np.sort(list(
21 set(database["freMTPL2freq"].columns) - set(["IDpol", "ClaimNb"])
22 ))
23
24 # read train index & get test index
25 # python dataframe index starts from 0, but R starts from 1
26 train_index = np.sort(pd.read_csv("train_index.csv").values.flatten()) - 1
27 test_index = np.sort(
28 list(set(database["freMTPL2freq"].index) - set(train_index))
29 )
30 # train/test split
31 train_X, test_X, train_y, test_y = (
32 database["freMTPL2freq"].loc[train_index, features], database["freMTPL2freq"].loc[test_index, features],
33 database["freMTPL2freq"].loc[train_index, response], database["freMTPL2freq"].loc[test_index, response],
34 )
35
36
37 # fit AutoML model
38 mol = AutoTabularRegressor(
39 model_name = "freMTPL2freq_{}".format(n_trials),
40 n_estimators = N_ESTIMATORS,
41 max_evals = n_trials,
42 timeout = TIMEOUT,
43 validation=False,
44 search_algo="HyperOpt",
45 objective= mean_poisson_deviance,
46 cpu_threads = 12,
47 balancing = ["SimpleRandomOverSampling", "SimpleRandomUnderSampling"],
48 seed = seed,
49 )
50 mol.fit(train_X, train_y)
51
52
53 train_pred = mol.predict(train_X)
54 test_pred = mol.predict(test_X)
55
56 mean_poisson_deviance(train_y, train_pred), mean_poisson_deviance(test_y, test_pred)
Listing 2. R code used to generate the train set index, following Noll et al. (2020).
1 RNGversion("3.5.0")
2 set.seed(100)
3 ll <- sample(c(1:nrow(freMTPL2freq)), round(0.9 * nrow(freMTPL2freq)), replace = FALSE)
4 write.csv(ll, "train_index.csv") # the train_index.csv generated in R is utilized in the AutoML train/test split
Upon completion of the training, a folder with the same name is created to store all the experiment results, along with a file of the same name containing the final ensemble model. The experiment configuration, including the components of the ensemble model, the evaluation budget, the time budget, the cross-validation methodology, the search algorithm, the evaluation metric, the number of parallel computing threads, the balancing algorithms, and the random seed, is defined in Lines 40-48. Specifically, we employ the HyperOpt search algorithm as described by Bergstra et al. (2013). Due to computational constraints, we limit the balancing algorithms to simple random over-sampling and under-sampling techniques.

Listing 3 presents the optimal hyperparameters identified through our AutoML experiments. All four top-performing pipelines employ ordinal encoding for data transformation. Given the absence of missing values, imputation is not necessary. Among the pipelines, two implement normalization, while the other two utilize quantile transformation. Three of the selected pipelines are based on Histogram Gradient Boosting regression tree models, whereas one employs LightGBM as the regression model. Although the selected pipelines employ similar preprocessing techniques and regression models, the chosen hyperparameters for each pipeline vary, with some differing significantly.
Listing 3. Optimal pipelines identified in the French Motor Third-Party Liability experiment.
1 For pipeline 1:
2 Optimal encoding method is: DataEncoding
3 Optimal encoding hyperparameters:{’dummy_coding’: False}
4
5 Optimal imputation method is: no_processing
6 Optimal imputation hyperparameters:{}
7
8 Optimal balancing method is: SimpleRandomOverSampling
9 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9637986293893803}
10
11 Optimal scaling method is: QuantileTransformer
12 Optimal scaling hyperparameters:{}
13
14 Optimal feature selection method is: no_processing
15 Optimal feature selection hyperparameters:{}
16
17 Optimal regression model is: HistGradientBoostingRegressor
18 Optimal regression hyperparameters:{’early_stop’: ’valid’, ’l2_regularization’: 3.5597237113674115e-08, ’learning_rate’:
0.021408454901122625, ’loss’: ’squared_error’, ’max_bins’: 255, ’max_depth’: None, ’max_leaf_nodes’: 771, ’min_samples_leaf’:
2, ’n_iter_no_change’: 10, ’scoring’: ’loss’, ’tol’: 1e-07, ’validation_fraction’: 0.2533770905902288}
19
20 For pipeline 2:
21 Optimal encoding method is: DataEncoding
22 Optimal encoding hyperparameters:{’dummy_coding’: False}
23
24 Optimal imputation method is: no_processing
25 Optimal imputation hyperparameters:{}
26
27 Optimal balancing method is: SimpleRandomUnderSampling
28 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9903581841265658}
29
30 Optimal scaling method is: Normalize
31 Optimal scaling hyperparameters:{}
32
33 Optimal feature selection method is: no_processing
34 Optimal feature selection hyperparameters:{}
35
36 Optimal regression model is: HistGradientBoostingRegressor
37 Optimal regression hyperparameters:{’early_stop’: ’valid’, ’l2_regularization’: 2.0273936664656744e-09, ’learning_rate’:
0.02368640948373309, ’loss’: ’squared_error’, ’max_bins’: 255, ’max_depth’: None, ’max_leaf_nodes’: 165, ’min_samples_leaf’:
3, ’n_iter_no_change’: 19, ’scoring’: ’loss’, ’tol’: 1e-07, ’validation_fraction’: 0.318356865844186}
38
39 For pipeline 3:
40 Optimal encoding method is: DataEncoding
41 Optimal encoding hyperparameters:{’dummy_coding’: False}
42
43 Optimal imputation method is: no_processing
44 Optimal imputation hyperparameters:{}
45
46 Optimal balancing method is: SimpleRandomUnderSampling
47 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9661230677755623}
48
49 Optimal scaling method is: QuantileTransformer
50 Optimal scaling hyperparameters:{}
51
52 Optimal feature selection method is: no_processing
53 Optimal feature selection hyperparameters:{}
54
55 Optimal regression model is: LightGBM_Regressor
56 Optimal regression hyperparameters:{’boosting’: ’dart’, ’learning_rate’: 0.984884665575003, ’max_depth’: -1, ’min_data_in_leaf’: 2,
’n_estimators’: 54, ’num_iterations’: 53, ’num_leaves’: 3, ’objective’: ’poisson’, ’tree_learner’: ’voting’}
57
58 For pipeline 4:
59 Optimal encoding method is: DataEncoding
60 Optimal encoding hyperparameters:{’dummy_coding’: False}
61
62 Optimal imputation method is: no_processing
63 Optimal imputation hyperparameters:{}
64
65 Optimal balancing method is: SimpleRandomUnderSampling
66 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9856650936908325}
67
68 Optimal scaling method is: Normalize
69 Optimal scaling hyperparameters:{}
70
71 Optimal feature selection method is: extra_trees_preproc_for_regression
72 Optimal feature selection hyperparameters:{’bootstrap’: True, ’criterion’: ’friedman_mse’, ’max_depth’: None, ’max_features’:
0.9979378812336601, ’max_leaf_nodes’: None, ’min_samples_leaf’: 2, ’min_samples_split’: 8, ’min_weight_fraction_leaf’: 0.0, ’
n_estimators’: 100}
73
74 Optimal regression model is: HistGradientBoostingRegressor
75 Optimal regression hyperparameters:{’early_stop’: ’valid’, ’l2_regularization’: 0.5251286015868422, ’learning_rate’:
0.01514498873147682, ’loss’: ’squared_error’, ’max_bins’: 255, ’max_depth’: None, ’max_leaf_nodes’: 3, ’min_samples_leaf’: 1,
’n_iter_no_change’: 3, ’scoring’: ’loss’, ’tol’: 1e-07, ’validation_fraction’: 0.07374400131323675}
Listing 4. Code for the Wisconsin Local Government Property Insurance Fund (LGPIF) experiment.
1 import InsurAutoML
2 from InsurAutoML import load_data, AutoTabularRegressor
3 import numpy as np
4 from sklearn.metrics import r2_score
5
6 seed = 42
7 n_trials = 64
8 N_ESTIMATORS = 5
9 TIMEOUT = (n_trials / 4) * 450
10
11 InsurAutoML.set_seed(seed)
12
13 # load data
14 database = load_data(data_type = ".rdata").load(path = "")
15 database_names = [*database]
16
17 # define response/features
18 response = ["yAvgBC"]
19 features = [
20     'TypeCity', 'TypeCounty', 'TypeMisc', 'TypeSchool', 'TypeTown', 'TypeVillage', 'IsRC', 'CoverageBC', 'lnDeductBC',
21     'NoClaimCreditBC', 'CoverageIM', 'lnDeductIM', 'NoClaimCreditIM', 'CoveragePN', 'NoClaimCreditPN', 'CoveragePO',
22     'NoClaimCreditPO', 'CoverageCN', 'NoClaimCreditCN', 'CoverageCO', 'NoClaimCreditCO'
23 ]
24 # log transform of response
25 database["data"][response] = np.log(database["data"][response] + 1)
26 database["dataout"][response] = np.log(database["dataout"][response] + 1)
27 # log transform of coverage features
28 database["data"][["CoverageBC", "CoverageIM", "CoveragePN", "CoveragePO", "CoverageCN", "CoverageCO"]] = np.log(
29 database["data"][["CoverageBC", "CoverageIM", "CoveragePN", "CoveragePO", "CoverageCN", "CoverageCO"]] + 1
30 )
31 database["dataout"][["CoverageBC", "CoverageIM", "CoveragePN", "CoveragePO", "CoverageCN", "CoverageCO"]] = np.log(
32 database["dataout"][["CoverageBC", "CoverageIM", "CoveragePN", "CoveragePO", "CoverageCN", "CoverageCO"]] + 1
33 )
34
35 train_X, train_y = database["data"][features], database["data"][response]
36 test_X, test_y = database["dataout"][features], database["dataout"][response]
37
38 # fit AutoML model
39 mol = AutoTabularRegressor(
40 model_name = "LGPIF_{}".format(n_trials),
41 n_estimators = N_ESTIMATORS,
42 max_evals = n_trials,
43 timeout = TIMEOUT,
44 validation="KFold",
45 valid_size=0.2,
46 search_algo="HyperOpt",
47 objective= "R2",
48 cpu_threads = 12,
49 seed = seed,
50 )
51 mol.fit(train_X, train_y)
52
53 train_pred = mol.predict(train_X)
54 test_pred = mol.predict(test_X)
55 r2_score(train_y, train_pred), r2_score(test_y, test_pred)
D.2. Wisconsin local government property insurance fund

Listing 4 presents the code used to run the Wisconsin Local Government Property Insurance Fund (LGPIF) experiments. The structure of this experiment follows the same setup as previously described in the French Motor Third-Party Liability experiment, including environment configuration, data loading, definition of features and response variables, AutoML fitting, and prediction generation. Unlike the French Motor Third-Party Liability experiment, the LGPIF dataset includes both in-sample and out-of-sample data, eliminating the need for a train/test split process. After reading the two Rdata files, we replicate the preprocessing steps outlined by Quan and Valdez (2018) by applying logarithmic transformations to the response variable and selected features.

In the training setup, we employ five-fold cross-validation, setting validation to KFold with valid_size set to 0.2. Given the smaller data size, the restrictions on the balancing algorithms applied in the previous experiment are not enforced here. Additionally, the evaluation metric selected for this experiment is the $R^2$ score.

Listing 5 presents the five optimal pipelines identified in the LGPIF experiments. With a stacking ensemble strategy in place, the top-performing pipelines share almost identical selections of preprocessing techniques and regression models. In particular, the Extra Trees regression model is chosen in four of the selected pipelines, paired with Winsorization to cap extreme feature values.
Listing 5. Optimal pipelines identified in the LGPIF experiment.
1 For pipeline 1:
2 Optimal encoding method is: DataEncoding
3 Optimal encoding hyperparameters:{’dummy_coding’: False}
4
5 Optimal imputation method is: no_processing
6 Optimal imputation hyperparameters:{}
7
8 Optimal balancing method is: no_processing
9 Optimal balancing hyperparameters:{}
10
11 Optimal scaling method is: Winsorization
12 Optimal scaling hyperparameters:{}
13
14 Optimal feature selection method is: truncatedSVD
15 Optimal feature selection hyperparameters:{’target_dim’: 206}
16
17 Optimal regression model is: ExtraTreesRegressor
18 Optimal regression hyperparameters:{’bootstrap’: False, ’criterion’: ’squared_error’, ’max_depth’: None, ’max_features’:
0.6070310845755006, ’max_leaf_nodes’: None, ’min_impurity_decrease’: 0.0, ’min_samples_leaf’: 17, ’min_samples_split’: 7, ’
min_weight_fraction_leaf’: 0.0}
19
20 For pipeline 2:
21 Optimal encoding method is: DataEncoding
22 Optimal encoding hyperparameters:{’dummy_coding’: False}
23
24 Optimal imputation method is: no_processing
25 Optimal imputation hyperparameters:{}
26
27 Optimal balancing method is: no_processing
28 Optimal balancing hyperparameters:{}
29
30 Optimal scaling method is: Winsorization
31 Optimal scaling hyperparameters:{}
32
33 Optimal feature selection method is: truncatedSVD
34 Optimal feature selection hyperparameters:{’target_dim’: 207}
35
36 Optimal regression model is: ExtraTreesRegressor
37 Optimal regression hyperparameters:{’bootstrap’: False, ’criterion’: ’squared_error’, ’max_depth’: None, ’max_features’:
0.6132171875599337, ’max_leaf_nodes’: None, ’min_impurity_decrease’: 0.0, ’min_samples_leaf’: 17, ’min_samples_split’: 7, ’
min_weight_fraction_leaf’: 0.0}
38
39 For pipeline 3:
40 Optimal encoding method is: DataEncoding
41 Optimal encoding hyperparameters:{’dummy_coding’: False}
42
43 Optimal imputation method is: no_processing
44 Optimal imputation hyperparameters:{}
45
46 Optimal balancing method is: Smote_TomekLink
47 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.997815634129791, ’k’: 8}
48
49 Optimal scaling method is: Winsorization
50 Optimal scaling hyperparameters:{}
51
52 Optimal feature selection method is: truncatedSVD
53 Optimal feature selection hyperparameters:{’target_dim’: 226}
54
55 Optimal regression model is: ExtraTreesRegressor
56 Optimal regression hyperparameters:{’bootstrap’: False, ’criterion’: ’squared_error’, ’max_depth’: None, ’max_features’:
0.6204531346722685, ’max_leaf_nodes’: None, ’min_impurity_decrease’: 0.0, ’min_samples_leaf’: 17, ’min_samples_split’: 8, ’
min_weight_fraction_leaf’: 0.0}
57
58 For pipeline 4:
59 Optimal encoding method is: DataEncoding
60 Optimal encoding hyperparameters:{’dummy_coding’: False}
61
62 Optimal imputation method is: no_processing
63 Optimal imputation hyperparameters:{}
64
65 Optimal balancing method is: no_processing
66 Optimal balancing hyperparameters:{}
67
68 Optimal scaling method is: Winsorization
69 Optimal scaling hyperparameters:{}
70
71 Optimal feature selection method is: select_rates_regression
72 Optimal feature selection hyperparameters:{’alpha’: 0.20903223117422198, ’mode’: ’fdr’, ’score_func’: ’f_regression’}
73
Listing 6. Code for the Australian Automobile Insurance experiment.
1 import pandas as pd
2 import InsurAutoML
3 from InsurAutoML import load_data, AutoTabular
4 from InsurAutoML.utils import train_test_split
5
6 seed = 42
7 n_trials = 128
8 N_ESTIMATORS = 4
9 TIMEOUT = (n_trials / 4) * 450
10
11 InsurAutoML.set_seed(seed)
12
13 # load data
14 database = load_data(data_type = ".csv").load(path = "")
15 database_names = [*database]
16
17 # define response/features
18 response = "ClaimOcc"
19 features = list(
20 set(database["ausprivauto"].columns) - set(["ClaimOcc", "ClaimNb", "ClaimAmount"])
21 )
22 features.sort()
23
24 # train/test split
25 train_X, test_X, train_y, test_y = train_test_split(
26     database['ausprivauto'][features], database['ausprivauto'][[response]], test_perc = 0.1, seed = seed
27 )
28 pd.DataFrame(train_X.index.sort_values()).to_csv("train_index.csv", index=False)
29
30 # fit AutoML model
31 mol = AutoTabular(
32 model_name="ausprivauto_occ_{}".format(n_trials),
33 max_evals=n_trials,
34 n_estimators=N_ESTIMATORS,
35 timeout=TIMEOUT,
36 validation="KFold",
37 valid_size=0.25,
38 search_algo="Optuna",
39 objective="AUC",
40 cpu_threads=12,
41 seed=seed,
42 )
43 mol.fit(train_X, train_y)
44
45 from sklearn.metrics import roc_auc_score
46
47 y_train_pred = mol.predict_proba(train_X)
48 y_test_pred = mol.predict_proba(test_X)
49 roc_auc_score(train_y.values, y_train_pred["class_1"].values), roc_auc_score(test_y.values, y_test_pred["class_1"].values)
Listing 7. Optimal pipelines identified in the Australian Automobile Insurance (ausprivauto) experiment.
1 For pipeline 1:
2 Optimal encoding method is: DataEncoding
3 Optimal encoding hyperparameters:{’dummy_coding’: False}
4
5 Optimal imputation method is: no_processing
6 Optimal imputation hyperparameters:{}
7
8 Optimal balancing method is: CNN_TomekLink
9 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9277384882773799}
10
11 Optimal scaling method is: PowerTransformer
12 Optimal scaling hyperparameters:{’method’: ’yeo-johnson’}
13
14 Optimal feature selection method is: select_rates_classification
15 Optimal feature selection hyperparameters:{’alpha’: 0.29938397321922683, ’mode’: ’fwe’, ’score_func’: ’f_classif’}
16
17 Optimal classification model is: QDA
18 Optimal classification hyperparameters:{’reg_param’: 0.05294078667761132}
19
20 For pipeline 2:
21 Optimal encoding method is: DataEncoding
22 Optimal encoding hyperparameters:{’dummy_coding’: False}
23
24 Optimal imputation method is: no_processing
25 Optimal imputation hyperparameters:{}
26
27 Optimal balancing method is: CNN_TomekLink
28 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9221903451264218}
29
30 Optimal scaling method is: PowerTransformer
31 Optimal scaling hyperparameters:{’method’: ’yeo-johnson’}
32
33 Optimal feature selection method is: select_rates_classification
34 Optimal feature selection hyperparameters:{’alpha’: 0.29698209545431653, ’mode’: ’fwe’, ’score_func’: ’f_classif’}
35
36 Optimal classification model is: QDA
37 Optimal classification hyperparameters:{’reg_param’: 0.043945988174408424}
38
39 For pipeline 3:
40 Optimal encoding method is: DataEncoding
41 Optimal encoding hyperparameters:{’dummy_coding’: False}
42
43 Optimal imputation method is: no_processing
44 Optimal imputation hyperparameters:{}
45
46 Optimal balancing method is: CNN_TomekLink
47 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9243338187985631}
48
49 Optimal scaling method is: PowerTransformer
50 Optimal scaling hyperparameters:{’method’: ’yeo-johnson’}
51
52 Optimal feature selection method is: select_rates_classification
53 Optimal feature selection hyperparameters:{’alpha’: 0.30437601126836655, ’mode’: ’fwe’, ’score_func’: ’f_classif’}
54
55 Optimal classification model is: QDA
56 Optimal classification hyperparameters:{’reg_param’: 0.055735061547359833}
57
58 For pipeline 4:
59 Optimal encoding method is: DataEncoding
60 Optimal encoding hyperparameters:{’dummy_coding’: False}
61
62 Optimal imputation method is: no_processing
63 Optimal imputation hyperparameters:{}
64
65 Optimal balancing method is: CNN_TomekLink
66 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.8982560884166318}
67
68 Optimal scaling method is: PowerTransformer
69 Optimal scaling hyperparameters:{’method’: ’yeo-johnson’}
70
71 Optimal feature selection method is: select_rates_classification
72 Optimal feature selection hyperparameters:{’alpha’: 0.2982712022144762, ’mode’: ’fwe’, ’score_func’: ’f_classif’}
73
74 Optimal classification model is: QDA
75 Optimal classification hyperparameters:{’reg_param’: 0.05729605531950389}
76
77 For pipeline 5:
78 Optimal encoding method is: DataEncoding
79 Optimal encoding hyperparameters:{’dummy_coding’: False}
80
81 Optimal imputation method is: no_processing
82 Optimal imputation hyperparameters:{}
83
84 Optimal balancing method is: CNN_TomekLink
85 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.912043608305704}
86
87 Optimal scaling method is: PowerTransformer
88 Optimal scaling hyperparameters:{’method’: ’yeo-johnson’}
89
90 Optimal feature selection method is: select_rates_classification
91 Optimal feature selection hyperparameters:{’alpha’: 0.28567583661991386, ’mode’: ’fwe’, ’score_func’: ’f_classif’}
92
93 Optimal classification model is: QDA
94 Optimal classification hyperparameters:{’reg_param’: 0.05069185745224875}
D.3. Australian automobile insurance

The code for running the Australian Automobile Insurance experiment is detailed in Listing 6. As this is a classification task, we utilize AutoTabular with automatic task type selection. Initially, we perform a 90/10 train/test split for the first experiment and reuse the generated train set index for subsequent experiments. The experimental setup involves four-fold cross-validation, employing the Optuna (Akiba et al., 2019) search algorithm and using AUC as the evaluation metric.

For the ausprivauto experiments, as illustrated in Listing 7, all five optimal pipelines consistently employ the same preprocessing methods and classification models.

Data availability

Data will be made available on request.

References

Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M., 2019. Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, pp. 2623–2631.
Azur, M.J., Stuart, E.A., Frangakis, C., Leaf, P.J., 2011. Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res. 20 (1), 40–49.
Bakhteev, O.Y., Strijov, V.V., 2020. Comprehensive analysis of gradient-based hyperparameter optimization algorithms. Ann. Oper. Res. 289 (1), 51–65.
Bams, D., Lehnert, T., Wolff, C.C.P., 2009. Loss functions in option valuation: a framework for selection. Manag. Sci. 55 (5), 853–862.
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C., 2004. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6 (1), 20–29.
Bergstra, J., Yamins, D., Cox, D., 2013. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: Dasgupta, S., McAllester, D. (Eds.), Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28. PMLR, pp. 115–123.
Chandrashekar, G., Sahin, F., 2014. A survey on feature selection methods. Comput. Electr. Eng. 40 (1), 16–28.
Charpentier, A., 2015. Computational actuarial science with R. J. R. Stat. Soc., Ser. A, Stat. Soc. 178 (3), 782–783.
Charpentier, A., Élie, R., Remlinger, C., 2023. Reinforcement learning in economics and finance. Comput. Econ. 62 (1), 425–462.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16 (1), 321–357.
Chen, T., Guestrin, C., 2016. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 785–794.
Cummings, J., Hartman, B., 2022. Using machine learning to better model long-term care insurance claims. N. Am. Actuar. J. 26 (3), 470–483.
De Jong, P., Heller, G.Z., et al., 2008. Generalized Linear Models for Insurance Data. Cambridge University Press.
Dong, X., Yu, Z., Cao, W., Shi, Y., Ma, Q., 2020. A survey on ensemble learning. Front. Comput. Sci. 14, 241–258.
Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., Smola, A., 2020. Autogluon-tabular: robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505.
Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M., Hutter, F., 2022. Auto-Sklearn 2.0: hands-free AutoML via meta-learning. J. Mach. Learn. Res. 23 (261), 1–61.
Feurer, M., Klein, A., Jost, K.E., Springenberg, T., Blum, M., Hutter, F., 2015. Efficient and robust automated machine learning. Adv. Neural Inf. Process. Syst. 28.
Frees, E.W., Lee, G., Yang, L., 2016. Multivariate frequency-severity regression models in insurance. Risks 4 (1), 4.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F., 2012. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern., Part C, Appl. Rev. 42 (4), 463–484.
Gan, G., Valdez, E.A., 2024. Compositional data regression in insurance with exponential family PCA. Variance 17 (1).
García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J.M., Herrera, F., 2016. Big data preprocessing: methods and prospects. Big Data Anal. 1, 1–22.
Guerra, P., Castelli, M., 2021. Machine learning applied to banking supervision a literature review. Risks 9 (7), 136.
Guo, H., Li, Y., Shang, J., Gu, M., Huang, Y., Gong, B., 2017. Learning from class-imbalanced data: review of methods and applications. Expert Syst. Appl. 73, 220–239.
Hart, P.E., 1968. The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14 (3), 515–516.
Hartman, B., Owen, R., Gibbs, Z., 2020. Predicting high-cost health insurance members through boosted trees and oversampling: an application using the HCCI database. N. Am. Actuar. J. 25 (1), 53–61.
He, H., Garcia, E.A., 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21 (9), 1263–1284.
He, X., Zhao, K., Chu, X., 2021. AutoML: a survey of the state-of-the-art. Knowl.-Based Syst. 212, 106622.
Hodge, V., Austin, J., 2004. A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 85–126.
Hu, C., Quan, Z., Chong, W.F., 2022. Imbalanced learning for insurance using modified loss functions in tree-based models. Insur. Math. Econ. 106, 13–32.
Jeong, H., 2024. Tweedie multivariate semi-parametric credibility with the exchangeable correlation. Insur. Math. Econ. 115, 13–21.
Jordan, M.I., Mitchell, T.M., 2015. Machine learning: trends, perspectives, and prospects. Science 349 (6245), 255–260.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y., 2017. LightGBM: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems, vol. 30.
Kononenko, I., 2001. Machine learning for medical diagnosis: history, state of the art and perspective. Artif. Intell. Med. 23 (1), 89–109.
Lall, S., Sinha, D., Ghosh, A., Sengupta, D., Bandyopadhyay, S., 2021. Stable feature selection using copula based mutual information. Pattern Recognit. 112, 107697.
LeDell, E., Poirier, S., 2020. H2O AutoML: scalable automatic machine learning. In: 7th ICML Workshop on Automated Machine Learning (AutoML).
Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J.E., Stoica, I., 2018. Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P., 2017. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
Ma, L., Sun, B., 2020. Machine learning and AI in marketing – connecting computing power to human insights. Int. J. Res. Mark. 37 (3), 481–504.
Masello, L., Castignani, G., Sheehan, B., Guillen, M., Murphy, F., 2023. Using contextual data to predict risky driving events: a novel methodology from explainable artificial intelligence. Accid. Anal. Prev. 184, 106997.
Mitchell, T., Buchanan, B., Dejong, G., Dietterich, T., Rosenbloom, P., Waibel, A., 1990. Machine learning. Annu. Rev. Comput. Sci. 4, 417–433.
Noll, A., Salzmann, R., Wuthrich, M.V., 2020. Case study: French motor third-party liability claims. Available at SSRN: https://ssrn.com/abstract=3164764 or http://dx.doi.org/10.2139/ssrn.3164764.
Okine, A.N.-A., Frees, E.W., Shi, P., 2022. Joint model prediction and application to individual-level loss reserving. ASTIN Bull. 52 (1), 91–116.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., pp. 8026–8037.
Pedregosa, F., Michel, V., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Vanderplas, J., Cournapeau, D., Varoquaux, G., Gramfort, A., Thirion, B., Dubourg, V., Passos, A., Brucher, M., Perrot, M., Duchesnay, É., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12 (85), 2825–2830.
Peiris, H., Jeong, H., Kim, J.-K., Lee, H., 2024. Integration of traditional and telematics data for efficient insurance claims prediction. ASTIN Bull. 54 (2), 263–279.
Peng, H., Long, F., Ding, C., 2005. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27 (8), 1226–1238.
Polikar, R., 2006. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6 (3), 21–45.
Qayyum, A., Qadir, J., Bilal, M., Al-Fuqaha, A., 2020. Secure and robust machine learning for healthcare: a survey. IEEE Rev. Biomed. Eng. 14, 156–180.
Quan, Z., Hu, C., Dong, P., Valdez, E.A., 2024. Improving business insurance loss models by leveraging InsurTech innovation. N. Am. Actuar. J., 1–28.
Quan, Z., Valdez, E.A., 2018. Predictive analytics of insurance claims using multivariate decision trees. Depend. Model. 6 (1), 377–407.
Quan, Z., Wang, Z., Gan, G., Valdez, E.A., 2023. On hybrid tree-based methods for short-term insurance claims. Probab. Eng. Inf. Sci. 37 (2), 597–620.
Rapin, J., Teytaud, O., 2018. Nevergrad - a gradient-free optimization platform. https://GitHub.com/FacebookResearch/Nevergrad.
Sagi, O., Rokach, L., 2018. Ensemble learning: a survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8 (4), e1249.
Salehi, S.S.M., Erdogmus, D., Gholipour, A., 2017. Tversky loss function for image segmentation using 3D fully convolutional deep networks. In: Wang, Q., Shi, Y., Suk, H.-I., Suzuki, K. (Eds.), Machine Learning in Medical Imaging. Springer International Publishing, pp. 379–387.
Servén, D., Brummitt, C., 2018. pyGAM: generalized additive models in python. https://github.com/dswah/pyGAM.
Shi, P., Zhang, W., Shi, K., 2024. Leveraging weather dynamics in insurance claims triage using deep learning. J. Am. Stat. Assoc. 119 (546), 825–838.
Si, J., He, H., Zhang, J., Cao, X., 2022. Automobile insurance claim occurrence prediction model based on ensemble learning. Appl. Stoch. Models Bus. Ind. 38 (6), 1099–1112.
Snoek, J., Larochelle, H., Adams, R.P., 2012. Practical Bayesian optimization of machine learning algorithms. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (Eds.), Advances in Neural Information Processing Systems, vol. 25. Curran Associates, Inc.
So, B., 2024. Enhanced gradient boosting for zero-inflated insurance claims and comparative analysis of CatBoost, XGBoost, and LightGBM. Scand. Actuar. J., 1–23.
So, B., Boucher, J.-P., Valdez, E.A., 2021. Cost-sensitive multi-class adaboost for understanding driving behavior based on telematics. ASTIN Bull. 51 (3), 719–751.
So, B., Valdez, E.A., 2024. SAMME.C2 algorithm for imbalanced multi-class classification. Soft Comput. 28, 9387–9404.
Somol, P., Pudil, P., Novovičová, J., Paclík, P., 1999. Adaptive floating search methods in feature selection. Pattern Recognit. Lett. 20 (11–13), 1157–1163.
Stekhoven, D.J., Bühlmann, P., 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28 (1), 112–118.
Tan, F., Fu, X., Zhang, Y., Bourgeois, A.G., 2007. A genetic algorithm-based method for feature subset selection. Soft Comput. 12, 111–120.
Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K., 2013. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, pp. 847–855.
Tomek, I., 1976. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. SMC-6 (11), 769–772.
Turcotte, R., Boucher, J.-P., 2024. GAMLSS for longitudinal multivariate claim count models. N. Am. Actuar. J. 28 (2), 337–360.
Wang, Q., Ma, Y., Zhao, K., Tian, Y., 2022. A comprehensive survey of loss functions in machine learning. Ann. Data Sci. 9, 187–212.
Wilson, D.L., 1972. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. SMC-2 (3), 408–421.
Wu, J., Chen, X.-Y., Zhang, H., Xiong, L.-D., Lei, H., Deng, S.-H., 2019. Hyperparameter optimization for machine learning models based on Bayesian optimization. J. Electron. Sci. Technol. 17 (1), 26–40.
Wüthrich, M.V., 2019. From generalized linear models to neural networks, and back. Technical report. Department of Mathematics, ETH Zurich.
Yang, L., Shami, A., 2020. On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing 415, 295–316.
Yoon, J., Jordon, J., Schaar, M.V.D., 2018. GAIN: missing data imputation using generative adversarial nets. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, pp. 5689–5698.
Young, S.R., Rose, D.C., Karnowski, T.P., Lim, S.-H., Patton, R.M., 2015. Optimizing deep learning hyper-parameters through an evolutionary algorithm. In: Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. MLHPC '15. Association for Computing Machinery, New York, NY, USA, pp. 1–5.
Zhang, Y., Ji, L., Aivaliotis, G., Taylor, C., 2024. Bayesian CART models for insurance claims frequency. Insur. Math. Econ. 114, 108–131.
Zöller, M.-A., Huber, M.F., 2021. Benchmark and survey of automated machine learning frameworks. J. Artif. Intell. Res. 70, 409–472.