
Insurance Mathematics and Economics 120 (2025) 17–41

Contents lists available at ScienceDirect

Insurance: Mathematics and Economics


journal homepage: www.elsevier.com/locate/ime

Automated machine learning in insurance


Panyi Dong, Zhiyu Quan ∗
Actuarial and Risk Management Sciences, University of Illinois Urbana-Champaign, 1409 W. Green Street (MC-382), Urbana, IL, 61801, USA

A R T I C L E  I N F O

Keywords:
AutoML
Insurance data analytics
Imbalance learning
AI education

A B S T R A C T

Machine Learning (ML) has gained popularity in actuarial research and insurance industrial applications. However, the performance of most ML tasks heavily depends on data preprocessing, model selection, and hyperparameter optimization, which are considered to be intensive in terms of domain knowledge, experience, and manual labor. Automated Machine Learning (AutoML) aims to automatically complete the full life-cycle of ML tasks and provides state-of-the-art ML models without human intervention or supervision. This paper introduces an AutoML workflow that allows users without domain knowledge or prior experience to achieve robust and effortless ML deployment by writing only a few lines of code. This proposed AutoML is specifically tailored for the insurance application, with features like the balancing step in data preprocessing, ensemble pipelines, and customized loss functions. These features are designed to address the unique challenges of the insurance domain, including the imbalanced nature of common insurance datasets. The full code and documentation are available on the GitHub repository.1
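To illustrate the "few lines of code" workflow the abstract describes, the sketch below implements a miniature, self-contained stand-in. The `fit_automl` helper is a hypothetical illustration, not the actual InsurAutoML API: it searches a tiny model/hyperparameter space and returns the candidate with the lowest validation loss, which is the user experience the paper aims for.

```python
# Minimal, self-contained sketch of the intended AutoML user experience.
# `fit_automl` is a hypothetical stand-in, NOT the InsurAutoML API: it
# searches a tiny model/hyperparameter space and returns the pipeline
# with the lowest validation loss.
import random

def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def fit_automl(X, y, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    cut = int(0.8 * len(idx))
    tr, va = idx[:cut], idx[cut:]

    candidates = []
    # Candidate 1: constant (mean) predictor -- no hyperparameters.
    mean_y = sum(y[i] for i in tr) / len(tr)
    candidates.append(("mean", lambda xs, m=mean_y: [m] * len(xs)))
    # Candidate 2: 1-D ridge regression; lam is a tunable hyperparameter.
    for lam in (0.0, 0.1, 1.0):
        num = sum(X[i] * y[i] for i in tr)
        den = sum(X[i] ** 2 for i in tr) + lam
        w = num / den
        candidates.append((f"ridge(lam={lam})",
                           lambda xs, w=w: [w * x for x in xs]))

    # Model selection: keep the candidate with minimal validation loss.
    name, model = min(
        candidates,
        key=lambda c: mse(c[1]([X[i] for i in va]), [y[i] for i in va]))
    return name, model

# The user-facing call is indeed only a few lines:
X = list(range(20))
y = [2.0 * x for x in X]          # a noiseless linear relation
name, model = fit_automl(X, y)
print(name)                       # the unpenalized ridge fit wins here
```

A real AutoML system replaces this toy grid with the full search over preprocessing steps, models, and hyperparameters described in Sections 2 and 3, but the interface contract is the same: data in, tuned pipeline out.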

1. Introduction

Machine Learning (ML), as described by Mitchell et al. (1990), is a multidisciplinary sub-field of Artificial Intelligence (AI) focused on developing and implementing algorithms and statistical models that enable computer systems to perform data-driven tasks or make predictions through "leveraging data" and iterative learning processes. This data-driven approach guides the design of ML algorithms, allowing them to grasp the distributions and structures within datasets and unveil correlations that elude traditional mathematical and statistical methods. Professionals in data-related fields, such as data scientists and ML engineers, can engage in autonomous decision-making based on data and benefit from cutting-edge predictions generated by modern ML models.

In recent decades, ML has significantly reshaped various industries and gained widespread popularity in academia due to its exceptional predictive capabilities. As summarized by Jordan and Mitchell (2015), ML has made significant contributions in various fields, including robotics, autonomous driving, language processing, and computer vision. The medical and healthcare industry, as suggested by Kononenko (2001) and Qayyum et al. (2020), is increasingly adopting ML for applications such as medical image analysis and clinical treatments. Furthermore, ML models have significantly improved personalization and targeting, marketing strategy, and customer engagement in the marketing sector, as summarized by Ma and Sun (2020). Guerra and Castelli (2021) present the ML innovations in the banking sector, particularly in the analysis of liquidity risks, bank risks, and credit risks. Additionally, there is a growing trend in adopting ML models in the insurance sector and among actuarial researchers and industry practitioners, as evidenced by the recent literature. Some recent literature advances emerging data-driven research topics in areas such as climate risks, health and long-term care, and telematics. For instance, by combining dynamic weather information with deep learning techniques, Shi et al. (2024) develop an improved predictive model, leading to improved insurance claim management. Hartman et al. (2020) and Cummings and Hartman (2022) explore various ML models on health and long-term care insurance. Masello et al. (2023) and Peiris et al. (2024) find that the integration of telematics through ML models can better comprehend risk characteristics. Other researchers focus on developing enhanced ML models to further improve predictive capabilities or address specific challenges within the insurance sector. Charpentier et al. (2023) propose a reinforcement learning technique and explore its application in the financial sector. Gan and Valdez (2024) explore the use of exponential family principal component analysis in analyzing insurance compositional data to improve predictive accuracy. Turcotte and Boucher (2024) develop a generalized additive model for location, scale, and shape (GAMLSS) to better understand longitudinal data. To address the excessive zeros and

* Corresponding author.
E-mail addresses: panyid2@illinois.edu (P. Dong), zquan@illinois.edu (Z. Quan).
1 https://github.com/PanyiDong/InsurAutoML.

https://doi.org/10.1016/j.insmatheco.2024.10.002
Received 24 August 2024; Received in revised form 6 October 2024; Accepted 29 October 2024
Available online 8 November 2024
0167-6687/© 2024 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

the heavily right-skewed distribution in claim management, So (2024) compare various ML models and propose a zero-inflated boosted tree model. By combining classification and regression, Quan et al. (2023) propose a hybrid tree-based ML algorithm to model claim frequency and severity. In addition to advancements in ML models, researchers, such as Frees et al. (2016) and Charpentier (2015), have collected various insurance datasets. For further studies, see Quan and Valdez (2018), Noll et al. (2020), Si et al. (2022) and Quan et al. (2024).

In an ideal scenario, ML models are crafted to automate data-driven tasks, aiming to minimize manual labor and human intervention. However, due to their dependence on data, there is no one-size-fits-all solution, necessitating the training of a specific ML model for each task. This complexity is compounded by the vast array of ML models available, each with numerous hyperparameters controlling the learning process and impacting model performance. It even leads to the emergence of a new research field, Hyperparameter Optimization (Yang and Shami, 2020), which has become increasingly vital in both academic and industrial practice. Consequently, finding the optimal hyperparameter setting for the ML model becomes a labor-intensive process reliant on extensive experience in ML. Additionally, the proliferation of digitalization has led to explosive growth in data volume and variety (number of features and observations), coupled with a decline in overall data quality. As a result, real-world datasets, especially in industrial settings, demand meticulous data preprocessing to achieve optimal performance. This data preprocessing is often manually optimized, model-specific, and trial-and-error, further complicating the training of the ML model. These factors contribute to the difficulty of building practical ML models for individuals lacking prior experience. Even for ML experts, achieving optimal performance on new datasets can be daunting, as evidenced in Kaggle2 competitions.

In this paper, we endeavor to make ML more accessible to inexperienced users and develop an inclusive learning tool that assists insurance practitioners and actuarial researchers in utilizing state-of-the-art ML tools in their daily operations. In fact, this research stems from the IRisk Lab3 project, which facilitates the learning of ML for actuarial science students and establishes a comprehensive pipeline for automating data-driven tasks. Automated Machine Learning (AutoML), as summarized in Zöller and Huber (2021), offers a solution aimed at diminishing repetitive manual tuning efforts, thus expediting ML training. Its objective is to encourage the adoption of ML across various domains, particularly among inexperienced users, by facilitating full ML life cycles and enhancing comprehension of domain datasets. Successful implementations of AutoML span academic open-source libraries, LeDell and Poirier (2020), Feurer et al. (2022), and industry-commercialized products, including startups like DataRobot4 and cloud services such as AWS SageMaker.5 However, existing open-source AutoML implementations may not be suitable for the insurance domain due to specific challenges, such as imbalanced datasets, a high prevalence of missing values, and scalability issues.

Our proposed AutoML pipeline tailored for the insurance domain encompasses fully functional components for data preprocessing, model selection, and hyperparameter optimization. We envision several use cases for our proposed AutoML. Firstly, it can serve as a performance benchmark for evaluating future ML creations among researchers, actuaries, and data scientists in the insurance sector. For instance, users can input datasets into our AutoML at the onset of a data project. Meanwhile, users can manually analyze the datasets, plan their subsequent steps, and produce preliminary results. Upon obtaining a prototype of the initial model, users can then compare it with the results generated by our AutoML to gauge if their model surpasses the benchmark set by our AutoML. This utilization of AutoML as a benchmark aids in standardizing and advancing ML methodologies within the dynamic landscape of the insurance industry, including academic research. Secondly, our proposed AutoML generates training history as a byproduct, offering insights that users may have overlooked. For example, users can review the training process and the results recorded by AutoML over time. This feature can provide insights that align with user experiments or intuition, or it can sometimes present counterintuitive findings that prompt users to reconsider their approach. Thirdly, our AutoML, designed with flexibility in the search space and optimization algorithms, can seamlessly incorporate future innovations while maintaining its strength in automation. For experienced users seeking to unlock the full potential of our AutoML, its flexible design enables them to leverage tuning results to gain a deeper understanding of the underlying data structure. Such insights can then serve as a guideline for manually reducing the search complexity, ultimately facilitating the attainment of optimal performance in time and computation. Finally, our research has real-world applicability and is available as an open-source tool, making it free for users to implement in practical scenarios. For instance, in the life-cycle of insurance business operations, our AutoML can offer tool sets to insurers to improve underwriting processes, optimize pricing strategies, enhance risk management practices, and improve operational efficiency, cost reduction, and customer satisfaction. Thus, AutoML presents a promising opportunity for insurance companies to leverage advanced ML models and unlock the full potential of their data. Additionally, we believe that our open-source AutoML can serve as an educational tool for university students and a benchmark-building resource for academic research.

The paper is structured in the following sections: Section 2 provides an overview of the general AutoML workflow and formulates the processes involved in model selection and hyperparameter optimization. Section 3 focuses specifically on our AutoML design tailored for the insurance domain, emphasizing the integration of sampling techniques and ensemble learning strategies to address the unique issues in insurance data. To showcase the feasibility of our AutoML in the insurance domain, Section 4 presents experiment results demonstrating the performance of our AutoML on various insurance datasets and compares it with existing research. Notations utilized throughout the paper can be found in Appendix A, Table 11. The experiments demonstrate that our AutoML carries the potential to achieve superior performance without extensive human intervention, thereby freeing practitioners and researchers from tedious manual tuning work. Section 5 provides concluding remarks, summarizing the key findings and insights presented in the paper.

2. The concept of AutoML

Most ML algorithm architectures characterize an ML model 𝑀 by a set of parameters 𝜃 and hyperparameters 𝜆. Parameters 𝜃 are essential components of the ML model structure, representing values that are trainable during model calibration and estimated from data, such as the estimated prediction values of the leaf nodes in tree-based models or the weights and biases in a neural network's linear layer. In contrast, hyperparameters 𝜆 control the model's flexibility and learning process, such as the maximum depth in tree-based models or the learning rate in neural networks. Unlike parameters 𝜃, which are determined during the training process, hyperparameters 𝜆 are set before training and are chosen by users based on their experience and preferences. Extensive empirical experiments have shown that the careful selection of hyperparameters can significantly impact model performance. However, there exists no universal set of hyperparameters that guarantees optimal performance across all datasets, nor are there established theoretical foundations providing precise guidance for their selection.

2 https://www.kaggle.com/competitions.
3 IRisk Lab currently serves as an academic-industry collaboration hub, facilitates the integration of discovery-based learning experiences for students, and showcases state-of-the-art research in all areas of Risk Analysis and Advanced Analytics. Retrieved from https://asrm.illinois.edu/illinois-risk-lab.
4 https://www.datarobot.com/.
5 https://aws.amazon.com/sagemaker/.
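The distinction between 𝜃 and 𝜆 can be made concrete with a toy gradient-descent fit. In the sketch below (a pure-Python illustration, not code from the paper's repository), the slope `w` plays the role of a parameter 𝜃 estimated from data, while `learning_rate` and `n_steps` are hyperparameters 𝜆 fixed before training begins.

```python
# Toy illustration of parameters vs. hyperparameters (not from the
# paper's repository): the slope `w` is a parameter theta, learned from
# data; `learning_rate` and `n_steps` are hyperparameters lambda, fixed
# by the user before training starts.

def train(X, y, learning_rate, n_steps):
    w = 0.0                        # parameter theta: initialized, then learned
    for _ in range(n_steps):
        # gradient of MSE(w) = mean((w*x - y)^2) with respect to w
        grad = sum(2 * (w * x - t) * x for x, t in zip(X, y)) / len(y)
        w -= learning_rate * grad  # update size driven by the hyperparameter
    return w

X = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 3.0, 6.0, 9.0]           # true slope is 3

# Same data, two hyperparameter choices: the learned parameter and the
# resulting fit differ sharply, which is exactly why HPO matters.
w_good = train(X, y, learning_rate=0.05, n_steps=500)
w_slow = train(X, y, learning_rate=0.001, n_steps=5)
print(round(w_good, 3))            # converges close to 3.0
print(round(w_slow, 3))            # far from 3.0: rate too small, too few steps
```

No gradient step ever touches `learning_rate` or `n_steps`; only an outer search loop over candidate settings (the HPO formulated next) can improve them.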


With a set of fixed hyperparameters 𝜆, the optimization of the ML model 𝑀 utilizing dataset 𝒟 = (X, y) can be expressed as

argmin_𝜃 ℒ(𝑀_𝜆^𝜃(X), y)

where ℒ denotes the loss function, which takes predictions and true values as inputs and returns a numerical value indicating the goodness-of-fit of the model. Thus, for any given loss function ℒ, the model 𝑀 and hyperparameter 𝜆 are critical components that control the task performance.

In a broader sense, selecting the loss function ℒ can also be considered as a hyperparameter, depending on the task's objective. Empirical experiments have demonstrated that ML models optimized for different loss functions often exhibit varying performance when evaluated with other loss functions. The loss function ℒ inherently embeds the intuitions of business decision-making, making its selection crucial for effective ML deployments. For example, Mean Square Error (MSE) or Mean Absolute Error (MAE) might be more suitable for tasks such as personalized insurance loss modeling or individual fair pricing, while metrics like Percentage Error (PE) may better reflect portfolio-level optimization. Therefore, ℒ can be considered as a hyperparameter based on the specific objective of ML applications. In our AutoML implementation, we offer users the option to either choose from existing loss functions or define their own based on the task requirements. We discuss this in more detail in Section 3.4.

Model Selection and Hyperparameter Optimization (HPO) are two of the most crucial components of AutoML, both fundamentally characterized as optimization problems. The objectives are to identify the best-performing model architecture and the optimal hyperparameter settings. We divide the dataset as 𝒟 = (𝒟_train, 𝒟_valid), where 𝒟_train = (X_train, y_train) is the training set and 𝒟_valid = (X_valid, y_valid) is the validation set. The model 𝑀 belongs to the model space ℳ, which is the collection of all appropriate models, i.e., 𝑀 ∈ ℳ. In addition, we denote the default hyperparameter settings (usually provided by the built-in code) of the model architecture as 𝜆_0. The naive model selection problem can then be formulated as

𝑀* = argmin_{𝑀 ∈ ℳ} 𝔼_{𝒟 ∼ (𝒟_train, 𝒟_valid)} 𝒱(ℒ, 𝑀_{𝜆_0}^𝜃, 𝒟)

where 𝒱 denotes an evaluation process that takes a loss function ℒ, an initialized model 𝑀_{𝜆_0}^𝜃, and a dataset 𝒟. The evaluation process 𝒱 first optimizes the model on the training set in the form of

𝜃* = argmin_𝜃 ℒ(𝑀_{𝜆_0}^𝜃(X_train), y_train)

and returns the evaluation loss through 𝐿_eval = ℒ(𝑀_{𝜆_0}^{𝜃*}(X_valid), y_valid) with the optimally trained parameterized model. Thus, the process of naive model selection determines the best model architecture 𝑀* in the model space ℳ that achieves the minimal evaluation loss 𝐿_eval. It should be noted that naive model selection does not involve HPO, which requires finding the optimal hyperparameters for all candidate models in the model space ℳ. In line with the established convention of minimizing loss in optimization problems, our approach to model selection also follows the direction of minimization. Thus, for loss functions (performance metrics) that aim to maximize values, such as the accuracy score, 𝑅² score, and Area Under the Curve (AUC) score, we use negative values to maintain coherence in the optimization direction.

For a fixed model 𝑀, if the hyperparameter set 𝜆 consists of 𝑁 tunable hyperparameters, an 𝑁-dimensional hyperparameter space 𝚲_𝑀 = Λ_𝑀^1 × Λ_𝑀^2 × … × Λ_𝑀^𝑁 can be configured, where Λ_𝑀^𝑗, 𝑗 = 1, …, 𝑁, represents the available range of hyperparameter 𝑗. A specific hyperparameter set 𝜆_𝑀 ∈ 𝚲_𝑀 is an 𝑁-dimensional vector within the hyperparameter space 𝚲_𝑀. HPO aims to find the optimal hyperparameter set 𝜆_𝑀* such that

𝜆_𝑀* = argmin_{𝜆_𝑀 ∈ 𝚲_𝑀} 𝔼_{𝒟 ∼ (𝒟_train, 𝒟_valid)} 𝒱(ℒ, 𝑀_{𝜆_𝑀}^𝜃, 𝒟)

where 𝒱 and ℒ refer to the same evaluation process and the loss function defined in the naive model selection process.

As suggested by He et al. (2021), many solutions for Model Selection and HPO, especially in the field of Neural Networks (NNs), where the Model Selection is transformed into the Neural Architecture Search (NAS) problem, rely on a two-stage framework that separates the model architecture search and hyperparameter optimization. The two-stage optimization can be expressed as:

𝑀* = argmin_{𝑀 ∈ ℳ} 𝔼_{𝒟 ∼ (𝒟_train, 𝒟_valid)} 𝒱(ℒ, 𝑀_{𝜆_0}^𝜃, 𝒟)

𝜆_{𝑀*} = argmin_{𝜆_𝑀 ∈ 𝚲_𝑀} 𝔼_{𝒟 ∼ (𝒟_train, 𝒟_valid)} 𝒱(ℒ, 𝑀*_{𝜆_𝑀}^𝜃, 𝒟)

In the first stage, the model 𝑀* is found using default hyperparameter settings 𝜆_0. In the second stage, the HPO is performed with the restricted model 𝑀* provided by the first stage to find the optimal hyperparameter set 𝜆_{𝑀*}. The two-stage optimization generates an optimized ML model 𝑀*_{𝜆_{𝑀*}}. While this strategy is efficient from a time and computation perspective, it may lead to local optima in both stages rather than the global optimum.

Considering that the insurance industry prioritizes accuracy over computational efficiency, we adopt a similar approach to Auto-WEKA, proposed by Thornton et al. (2013). This approach merges the problems of Model Selection and HPO into a combined algorithm selection and hyperparameter optimization (CASH) problem. The CASH framework treats the models as tunable hyperparameters, generating a hierarchical nested hyperparameter space. Therefore, the optimization of the CASH problem can be formulated as

𝑀*_{𝜆*} = argmin_{𝑀 ∈ ℳ, 𝜆_𝑀 ∈ 𝚲_𝑀} 𝔼_{𝒟 ∼ (𝒟_train, 𝒟_valid)} 𝒱(ℒ, 𝑀_{𝜆_𝑀}^𝜃, 𝒟)
        = argmin_{(𝑀, 𝜆_𝑀) ∈ ℳ × 𝚲_𝑀} 𝔼_{𝒟 ∼ (𝒟_train, 𝒟_valid)} 𝒱(ℒ, 𝑀_{𝜆_𝑀}^𝜃, 𝒟)

where ℳ × 𝚲_𝑀 denotes the conjunction space of model space ℳ and hyperparameter space 𝚲_𝑀, and (𝑀, 𝜆_𝑀) is a sample in the conjunction space. By combining Model Selection and HPO, the CASH framework is able to find the global optimum, albeit at the expense of increased computational resources.

3. Insurance domain-specific AutoML

When deploying ML models in real-world scenarios, practitioners and researchers recognize that the unique and complex structure of the insurance industry can impede ML innovations. One of the most pressing challenges for ML in the insurance sector is data quality, which is affected by several factors, including legacy systems, operation-oriented data collection practices, and inadequate database management. Firstly, many insurance companies rely on outdated legacy systems that lack the capability to store and process modern data formats, resulting in inconsistent, incomplete, or inaccurate data. Integrating data from these systems with newer technologies can be complex and error-prone. Secondly, traditional data collection practices in the insurance industry are operation-oriented rather than analysis-oriented. Data is often gathered to support specific operational processes without considering its potential use for analytical or predictive modeling. This leads to fragmented, siloed, or insufficiently detailed data, detrimental to effective ML applications. Thirdly, insufficient database management practices further degrade data quality. Issues such as duplicate records, missing values, and outdated information are common, hindering the performance and reliability of ML models. The performance of ML models is heavily dependent on data quality. High-quality data facilitates easier and more efficient model building and results in higher inference accuracy. However, data quality is primarily determined by collection procedures, which are beyond the control of ML models. In response to these challenges, our AutoML incorporates comprehensive data preprocessing techniques to extract as much valuable information as possible

through automated trial and error, which is otherwise infeasible with- ble computational run-time required for optimization, and sub-optimal
out extensive manual work. Our AutoML aims to build models that are performance metrics during the evaluation phase. For example, miss-
robust, accurate, and capable of delivering reliable predictions despite ing values can hinder the application of linear models without proper
the inherent challenges associated with data quality. This systematic imputation methods, and the inclusion of wide-ranging features can
approach allows us to maximize the value extracted from the available substantially slow down the convergence rate of gradient-based opti-
data, ultimately leading to better insights and decision-making in the mization methods. To address these challenges, a category of techniques
insurance sector. known as Data Preprocessing, is employed to prepare the raw dataset be-
Another unique challenge in the insurance sector is the problem fore model optimization.
of imbalanced data. When modeling future claims, for example, claim Data preprocessing techniques, such as imputation and feature se-
events are relatively rare, with most policyholders not experiencing any lection, lack dedicated evaluation metrics aside from the final model
claims. Typically, the claim events constitute less than 10% of all poli- prediction metrics. Moreover, there is no universal preprocessing tech-
cies, and in some cases, such as policies covering catastrophic events, the nique that consistently demonstrates superior performance across all
proportion can be as low as 0.1%. This phenomenon is referred to as an datasets. As a result, these data preprocessing techniques are often data-
imbalanced data problem in ML, where a single class comprises the ma- dependent, requiring extensive manual adjustments and domain exper-
jority of all observations. Specifically, in the field of imbalance learning, tise. This reliance adds another layer of complexity to the construction
the class taking the dominant proportion is typically referred to as major- of the optimal pipeline for any ML task, in addition to model selection
ity class while others denote minority class. In certain domains, observa- and hyperparameter optimization.
tions within the minority class or rare observations are referred to as out- In our AutoML, we incorporate five classes of data preprocessing
liers (Hodge and Austin, 2004). These outliers are often treated as noise techniques commonly employed in ML, aiming to cover a comprehen-
that can be removed or merged into other classes. However, in the insur- sive range of data preprocessing needs. According to García et al. (2016),
ance industry, these minority class or rare observations corresponding to these data preprocessing techniques are both essential and beneficial
the claim events contribute as a crucial estimation of financial liabilities for ML models. In the order of pipeline fitting, we include Data Encod-
or pure premiums in the actuarial term. Consequently, accurately estimat- ing, Data Imputation, Data Balancing, Data Scaling, and Feature Selection,
ing the minority class or rare observations is crucial for the insurance which we implement in our AutoML based on the reference outlined as
domain. However, the majority of ML models, by their designs, assume follows (Due to space constraints, please refer to Appendix C for detailed
equal contributions from each observation, regardless of whether they lists of the components used in our AutoML pipelines):
belong to the majority or minority classes. As a result, many ML mod- Data Encoding: Data Encoding converts string-based categorical
els, without appropriate modifications, underperform on imbalanced features into numerical ones, either in the format of ordinal or binary
datasets. The problem of imbalanced data has caught the attention of one-hot encoding. Since most ML models do not support string variables,
researchers across diverse fields, leading to the development of various this unified numerical representation of features ensures consistency
solutions summarized by He and Garcia (2009) and Guo et al. (2017). across the dataset and prevents issues like unseen categorical variables
These imbalance learning techniques, in general, can be categorized as: and type inconsistencies.
(1) Sampling methods; (2) Cost-sensitive methods; (3) Ensemble meth- Data Imputation: Missing values, whether due to intrinsic infor-
ods; and (4) Kernel-based methods and Active Learning methods. Our mation loss or database mismanagement, pose significant challenges
AutoML incorporates a series of sampling methods as a critical data for ML. Various imputation techniques have been proposed to address
preprocessing component to balance majority and minority classes. In the missing value issue, such as statistical methods like multiple im-
addition, advancements in imbalance learning within actuarial science, putation (Azur et al., 2011), non-parametric approach (Stekhoven and
such as cost-sensitive loss functions (Hu et al., 2022; So et al., 2021) Bühlmann, 2012), and generative adversarial networks (Yoon et al.,
can be integrated into our loss function. Furthermore, as summarized by 2018). These imputation solutions determine the optimal estimates for
Sagi and Rokach (2018), ensemble learning not only achieves state-of- the missing cells and populate them accordingly. Unlike the common
the-art performance, but also has the potential to address the imbalance practice of discarding observations containing missing values, i.e., com-
problem effectively by combining multiple ML models into a predictive plete data analysis, which maintains the accuracy of the remaining data,
ensemble model. The multiple evaluation pipelines fitted during our Au- imputation techniques leverage all available information, potentially en-
toML training can naturally serve as candidates to form an ensemble. hancing the understanding of the dataset while also carrying the risk of
Our AutoML focuses on structured data, such as tabular data from introducing misleading imputed values. To address the uncertainty in
databases or CSV files, and is particularly designed for supervised learn- imputation, we offer users several highly cited methods to help find the
ing problems, specifically regression and classification. While many best imputation solution.
other types of AutoML handle unstructured data, like audio, text, video, and images, and incorporate unsupervised learning or active learning, these may lie beyond the scope of our AutoML. Instead, we aim to tailor AutoML solutions to the insurance domain to enhance the performance and adaptability of ML solutions. In the following, we introduce our AutoML from a microscopic to a macroscopic perspective. Subsection 3.1 outlines each component of the model pipeline, with a specific focus on sampling techniques. Subsection 3.2 details how these components are interconnected and optimized within our AutoML workflow. Subsection 3.3 summarizes the integration of the ensemble model as an alternative solution to imbalance learning and as the highest layer of our AutoML production. Lastly, Subsection 3.4 discusses the selection of loss functions and their strategic implications in business contexts.

3.1. Model pipeline

Applying the model 𝑀 directly to the raw dataset 𝒟, as formulated in Section 2, is often impractical or even impossible for several reasons. These include the diverse and complex structures of datasets, the infeasi-

Data Balancing: Data balancing addresses class imbalance using sampling methods, which are especially beneficial in fields like insurance, where infrequent but significant events, such as frauds and claims, must be accurately modeled. As summarized by Batista et al. (2004), sampling methods allow ML models to learn from these rare events better by adjusting class distributions and improving decision boundaries.

As summarized by Chawla et al. (2002), sampling methods are generally classified into over-sampling and under-sampling. Over-sampling increases the number of minority class instances by generating synthetic data, while under-sampling reduces the number of majority class observations.

For the dataset 𝒟 = (X, y), an imbalance problem can be described by splitting it into a majority subset 𝒟_major = (X_major, y_major) and a minority subset 𝒟_minor = (X_minor, y_minor), where 𝑦 = 𝐴 for every 𝑦 ∈ y_major and 𝑦 ≠ 𝐴 for every 𝑦 ∈ y_minor, such that

    𝒟 = [𝒟_major; 𝒟_minor] = ([X_major; X_minor], [y_major; y_minor])

and an imbalance ratio

    |𝒟_major| / |𝒟_minor| ≫ 1
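The majority/minority split above reduces to a simple count over the response vector. A minimal sketch (the majority label 𝐴 = 0 and the threshold 𝑅 follow the text; the function and variable names are our own):

```python
from collections import Counter

def imbalance_ratio(y, majority_label=0):
    """|D_major| / |D_minor| for a response vector y and majority label A."""
    counts = Counter(y)
    n_major = counts[majority_label]
    n_minor = sum(n for label, n in counts.items() if label != majority_label)
    return n_major / n_minor

y = [0] * 950 + [1] * 50        # e.g., 5% of policies report a claim
R = 1.0                          # pre-defined threshold from the text
ratio = imbalance_ratio(y)       # 950 / 50 = 19.0
needs_sampling = ratio > R       # flagged as imbalanced -> apply sampling
```

Over-sampling and under-sampling, discussed next, then push this ratio back toward 𝑅 ≈ 1 before the model is fit.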
P. Dong and Z. Quan, Insurance Mathematics and Economics 120 (2025) 17–41
Here, 𝐴 refers to the response variable defining the majority class, and |𝒟_∗| denotes the cardinality of dataset 𝒟_∗, indicating the number of observations. In insurance, for instance, non-fraudulent observations usually form the majority class, suggesting 𝐴 = 0. Given the imbalance between 𝒟_major and 𝒟_minor, conventional ML models are usually more influenced by observations in 𝒟_major than those in 𝒟_minor, resulting in inferior ML performance.

Over-sampling retains the structure of 𝒟_major and constructs a sampled minority subset 𝒟′_minor such that 𝒟_minor ⊂ 𝒟′_minor and

    |𝒟_major| / |𝒟′_minor| = 𝑅

where 𝑅 is a pre-defined threshold or can be a hyperparameter, measuring the imbalance ratio, typically 𝑅 ≈ 1. In our AutoML, datasets with an imbalance ratio greater than 𝑅 are considered imbalanced, and sampling methods are applied to adjust the imbalance ratio. The additional observations in 𝒟′_minor ⧵ 𝒟_minor are synthetically generated samples that simulate the statistical properties of the observations in 𝒟_minor. Common generation methods include duplication, as described by Batista et al. (2004) in the Simple Random Over-Sampling method, and linear interpolation in the feature space, as used in the Synthetic Minority Over-Sampling Technique (SMOTE) by Chawla et al. (2002). These synthetic minority samples increase the size of the minority subset, making the majority and minority classes comparable in size and mitigating the imbalance problem caused by the disparity between the majority and minority classes.

Under-sampling, on the contrary, maintains 𝒟_minor and constructs a reduced majority subset 𝒟′_major such that 𝒟′_major ⊂ 𝒟_major and

    |𝒟′_major| / |𝒟_minor| = 𝑅

The subset 𝒟′_major is constructed by removing observations from 𝒟_major while preserving the statistical similarities. Most under-sampling techniques use k Nearest Neighbors (kNN) as the base learner. Tomek Link, proposed by Tomek (1976), utilizes kNN to find the adjoining majority-minority pairs for removal. Edited Nearest Neighbors (ENN) (Wilson, 1972) employs the predictions of kNN as majority votes to determine which majority class observations should be removed. Condensed Nearest Neighbors (CNN) by Hart (1968) uses kNN as the benchmark to determine the necessary majority class observations required to generate the subset. Assuming that the majority and minority classes can be viewed as a binary classification problem and that distinctions between them can be discerned through their spatial distributions in feature space, the kNN leverages nearest neighbors as a criterion for statistical significance to identify sample importance. By removing observations that disagree with kNN predictions, majority observations that distort the smoothness of decision boundaries can be precisely eliminated, leading to smooth decision boundaries and a proper balance between majority and minority classes.

Data Scaling: Data scaling might not always significantly impact model performance, but it usually accelerates convergence, especially in the case of gradient-based optimizations on skewed features. Considering the intensive computation involved in AutoML optimization, time efficiency is crucial, making the scaling of features as important as other data preprocessing components. Furthermore, some of the scaling techniques help remove outliers in features, which is beneficial to ML models in certain scenarios. In our AutoML, we incorporate a series of common scaling techniques, such as standardization and normalization.

Feature Selection: In real-world applications, unprocessed features can suffer from redundancy or ambiguity, negatively affecting run-time and potentially undermining model performance. Feature selection addresses this issue by reducing dimensions and effectively identifying a subset of valuable features. As summarized by Chandrashekar and Sahin (2014), feature selection techniques can be either model-dependent (i.e., wrapper methods) or model-free (i.e., filter methods), and the effectiveness of the selection heavily relies on the datasets.

Preprocessing techniques, combined with ML models, comprise a pipeline that represents a real-world workflow of modeling tasks. To differentiate the types of hyperparameters, we extend the notations from Section 2 as follows: encoding algorithm 𝐸 with hyperparameter 𝜆_𝐸; imputation algorithm 𝐼 with hyperparameter 𝜆_𝐼; balancing algorithm 𝐵 with hyperparameter 𝜆_𝐵; scaling algorithm 𝑆 with hyperparameter 𝜆_𝑆; feature selection algorithm 𝐹 with hyperparameter 𝜆_𝐹; and ML model 𝑀 with hyperparameter 𝜆_𝑀. In our AutoML, the pipeline 𝒫 can be initialized by the input of algorithm-hyperparameter pairs, denoted as

    𝒫_0 = 𝑀_{𝜆_𝑀} ∘ 𝐹_{𝜆_𝐹} ∘ 𝑆_{𝜆_𝑆} ∘ 𝐵_{𝜆_𝐵} ∘ 𝐼_{𝜆_𝐼} ∘ 𝐸_{𝜆_𝐸}

where the initialized pipeline 𝒫_0 is controlled by trainable parameters 𝜃_𝒫. The pipeline can then be trained to optimize 𝜃_𝒫 as follows:

    𝜃_𝒫^∗ = argmin_{𝜃_𝒫} ℒ(𝒫_0(X), y)

where 𝒟 = (X, y) is a dataset and ℒ refers to the loss function. The preprocessing techniques are applied to the datasets sequentially in the specified order, as demonstrated in the parameterization of the pipeline. This ordering, adopted in our AutoML, is a widely accepted modeling process that embeds some of the typical incentives in the domain of data science. For example, the encoding and imputation processes solve fundamental incompatibility issues that are not resolvable in subsequent operations. Further, it is evident from the pipeline fitting that, while the pipeline encompasses the entire life-cycle of modeling tasks, the selection of algorithms and their corresponding hyperparameters still relies on manual decisions or automated optimization. The automated optimization of such algorithm-hyperparameter pairs underscores the term “Auto” in AutoML.

3.2. Automated optimization

To achieve automated optimization, we adopt the framework of CASH, as formulated in Section 2, and extend it to the preprocessing-modeling space. Given the preprocessing-modeling pipeline summarized in Subsection 3.1, we construct the conjunction space for Data Encoding, Data Imputation, Data Balancing, Data Scaling, and Feature Selection, denoted as 𝒜^𝐸, 𝒜^𝐼, 𝒜^𝐵, 𝒜^𝑆, 𝒜^𝐹, respectively. In addition, the optimal algorithm-hyperparameter pairs can be identified by exploring the documentation from implemented libraries or packages. This allows us to incorporate the pipeline fitting into the objective function ℱ, extending it as follows:

    ℱ(𝒟, 𝑀_{𝜆_𝑀} ∘ 𝐹_{𝜆_𝐹} ∘ 𝑆_{𝜆_𝑆} ∘ 𝐵_{𝜆_𝐵} ∘ 𝐼_{𝜆_𝐼} ∘ 𝐸_{𝜆_𝐸}, ℒ)

In the following, we demonstrate the optimization strategy adopted in our AutoML, focusing on supervised learning tasks on tabular datasets. Fig. 1 illustrates the automated optimization workflow, where a space of preprocessing techniques and ML models, along with the corresponding hyperparameter space for each method, is constructed and stored prior to optimization. This space is usually denoted as the Search Space and can be written as

    𝒜 = 𝒜^𝐸 × 𝒜^𝐼 × 𝒜^𝐵 × 𝒜^𝑆 × 𝒜^𝐹 × 𝒜^𝑀

It is worth noting that, while our AutoML default search space includes all possible methods, representing the largest possible space, it is flexible and can be modified (add/remove) according to user needs.

Although the optimization theoretically guarantees finding the global optimum, in practice it is nearly impossible due to limited computing resources, especially given the multi-dimensional search space we have designed in our AutoML. Thus, we introduce two of the most apparent and natural constraints, time and number of trials, as the computing budget. These two constraints are referred to as the time budget and evaluation budget, respectively, and both can be modified according to user demands.
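As one concrete, simplified instance of a pipeline in the ordering of Subsection 3.1, the components 𝐸, 𝐼, 𝑆, 𝐹, and 𝑀 can be assembled with scikit-learn. This is a sketch of our own, not the paper's implementation; scikit-learn's Pipeline has no resampling step, so the balancing component 𝐵 is omitted here (a library such as imbalanced-learn provides a pipeline class that accepts one):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import PoissonRegressor

# E -> I -> (B omitted) -> S -> F -> M, applied in the order of the text.
pipe = Pipeline([
    ("encode", OrdinalEncoder()),                  # E, lambda_E
    ("impute", SimpleImputer(strategy="median")),  # I, lambda_I
    ("scale", StandardScaler()),                   # S, lambda_S
    ("select", SelectKBest(f_regression, k=2)),    # F, lambda_F
    ("model", PoissonRegressor(alpha=1.0)),        # M, lambda_M
])

# toy categorical features and a claim-count response (illustrative only)
X = np.array([["A", "low"], ["B", "high"], ["A", "high"], ["B", "low"]] * 10,
             dtype=object)
y = np.array([0, 2, 1, 1] * 10)
pipe.fit(X, y)           # optimizes theta_P for one fixed set of lambdas
y_hat = pipe.predict(X)  # strictly positive predicted frequencies
```

Calling `pipe.fit` optimizes the trainable parameters 𝜃_𝒫 for one fixed choice of algorithm-hyperparameter pairs; the automated optimization of Subsection 3.2 is what varies those pairs.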
Fig. 1. An illustration of AutoML workflow.

The time budget denotes the maximum allowed runtime for experiments, while the evaluation budget limits the number of evaluations or trials executed during the experiments. Both budgets are audited before each round of evaluations to determine whether a new trial should be generated. A new trial is initiated only if both budgets are not depleted. It is worth noting that the time and evaluation budgets are highly correlated: a larger number of trials typically requires a longer runtime, while allowing for longer experiments generally enables deeper exploration of the search space. Furthermore, since the intermediate optimal loss during the search is non-increasing with respect to the number of trials for fixed sampling procedures, the optimization guarantees at least non-degenerating performance as more time is spent searching or more sets of hyperparameters are explored. Consequently, increasing either the time or evaluation budget generally improves the performance of the optimization in practice.

For each round of evaluation, a specific set of hyperparameters is sampled from the search space 𝒜 using a predefined sampling method, commonly referred to as the Search Algorithm. The sampled hyperparameter sets can be either independent of each other (e.g., Random Search, Grid Search) or conditional on previous evaluations (e.g., Bayesian Search, Snoek et al. (2012), Wu et al. (2019)). Each sampled hyperparameter set consists of methods of encoding, imputation, balancing, scaling, feature selection, and an ML model, along with the corresponding hyperparameter settings. The initialized pipeline, comprising all method objects, is then fed into the extended objective function ℱ as input to get the evaluation loss at the end of each round. The evaluation losses, which serve as indicators of the pipeline's performance, guide the construction of the optimal pipeline. This optimal pipeline represents the final product of the optimization and modeling task. The optimization is formulated as Algorithm 1. Along with the optimal pipeline, all training records and preprocessed train/test sets are stored for examination if needed.

Algorithm 1: The AutoML optimization.
Input: Dataset 𝒟 = (𝒟_train, 𝒟_valid); Search space 𝒜; Time budget 𝑇; Evaluation budget 𝐺; Search algorithm Samp
Output: Optimal pipeline with hyperparameter settings 𝒫^∗
 1  k = 0 ;                                     /* Round of evaluation */
 2  t_re = 𝑇 ;                                  /* Remaining time budget */
 3  g_re = 𝐺 ;                                  /* Remaining evaluation budget */
 4  while t_re > 0 and g_re > 0 do
 5      t_start = CurrentTime;
 6      (𝐸^(k), 𝜆_𝐸^(k)), (𝐼^(k), 𝜆_𝐼^(k)), (𝐵^(k), 𝜆_𝐵^(k)), (𝑆^(k), 𝜆_𝑆^(k)), (𝐹^(k), 𝜆_𝐹^(k)), (𝑀^(k), 𝜆_𝑀^(k)) = Samp(𝒜);
 7      𝒫_k = 𝑀^(k)_{𝜆_𝑀^(k)} ∘ 𝐹^(k)_{𝜆_𝐹^(k)} ∘ 𝑆^(k)_{𝜆_𝑆^(k)} ∘ 𝐵^(k)_{𝜆_𝐵^(k)} ∘ 𝐼^(k)_{𝜆_𝐼^(k)} ∘ 𝐸^(k)_{𝜆_𝐸^(k)};
 8      L_eval,(k) = ℱ(𝒟, 𝒫_k, ℒ);
 9      t_end = CurrentTime;
10      k = k + 1;
11      t_re = t_re − (t_end − t_start);
12      g_re = g_re − 1;
        /* Recording hyperparameters, preprocessed data, trained pipeline */
13  end
14  k^∗ = argmin_k L_eval,(k) ;                 /* Find optimal pipeline */
15  𝒫^∗ = 𝒫_{k^∗};
16  return 𝒫^∗;

To enable efficient optimization, we employ Ray Tune, created by Liaw et al. (2018), a sub-system of the Ray library. Ray is a flexible and scalable distributed computing framework designed for high-performance and parallel computing tasks, specialized in ML workloads. It simplifies the development of distributed applications, allowing users to parallelize and scale their workloads effortlessly across clusters. Complementing Ray, Ray Tune is a specialized library for hyperparameter tuning. It provides a comprehensive suite of functionalities for efficient experimentation and optimization of ML models. Ray Tune enables users to systematically search through hyperparameter spaces, leveraging state-of-the-art optimization algorithms to fine-tune models for optimal performance. In our AutoML, Ray Tune coordinates the optimization experiments by recording key information and facilitating the tuning process. Additionally, we harness two fundamental capabilities of Ray Tune: compatibility with mainstream ML model frameworks and seamless integration with a diverse collection of search algorithms. The compatibility of Ray Tune with major ML model frameworks like Scikit-Learn (Pedregosa et al., 2011), PyTorch (Paszke et al., 2019), LightGBM (Ke et al., 2017), XGBoost (Chen and Guestrin, 2016), and pyGAM (Servén and Brummitt, 2018) allows us to utilize models ranging from simple linear models to tree-based ensemble models and deep neural networks. The broad range of available ML models ensures the versatility and applicability of our AutoML to a wide array of ML tasks and datasets. The search algorithms iteratively sample hyperparameter sets from the defined search space. The selection of an appropriate search algorithm significantly influences the attainment of the optimal solution and efficiency. Therefore, search algorithms are crucial to both the efficiency and effectiveness of optimization problems. As summarized by Yang and Shami (2020), various search algorithms have been proposed, ranging from fixed algorithms such as grid search, to completely random algorithms like random search, to algorithms conditioned on previous experiments, like evolutionary algorithms (Young et al., 2015) and gradient-based optimization (Bakhteev and Strijov, 2020).
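The budget-audited loop of Algorithm 1 can be sketched in a few lines of plain Python (a toy version: random search stands in for the search algorithm Samp, and the `evaluate` callback and `space` below are hypothetical stand-ins for fitting an actual pipeline on a real search space):

```python
import random
import time

def automl_search(space, evaluate, time_budget, eval_budget, seed=0):
    """Toy Algorithm 1: sample configs until the time or evaluation budget runs out."""
    rng = random.Random(seed)
    t_re, g_re = time_budget, eval_budget            # remaining budgets
    trials = []
    while t_re > 0 and g_re > 0:                     # budgets audited before each trial
        t_start = time.monotonic()
        config = {name: rng.choice(options)          # Samp(A): random search here
                  for name, options in space.items()}
        trials.append((evaluate(config), config))    # L_eval,(k) = F(D, P_k, L)
        t_re -= time.monotonic() - t_start
        g_re -= 1
    return min(trials, key=lambda t: t[0])           # pipeline with the lowest loss

# hypothetical two-dimensional search space and evaluation loss
space = {"model": ["glm", "gbm"], "scaler": ["standardize", "none"]}
best_loss, best_config = automl_search(
    space, evaluate=lambda c: 0.0 if c["model"] == "gbm" else 1.0,
    time_budget=5.0, eval_budget=50)
```

In the actual system, Ray Tune plays this coordinating role, with the added benefits of parallel trials and pluggable search algorithms.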
The seamless integration or conversion of frequently-used search algorithms like HyperOpt (Bergstra et al., 2013), Nevergrad (Rapin and Teytaud, 2018), and Optuna (Akiba et al., 2019), combined with the parallel hyperparameter tuning architecture supported by Ray Tune, enables us to create a more flexible environment for experiment settings while ensuring efficiency and effectiveness.

3.3. Ensemble model

As discussed previously, ensemble learning offers alternative solutions to the imbalance problem. In addition to the sampling techniques summarized in Subsection 3.1, we integrate ensemble learning into our AutoML to further address the imbalance problem. Ensemble models combine multiple trained models into a single model by aggregating their predictions, potentially addressing the imbalance problem that individual models cannot effectively handle. Moreover, constructing an ensemble model is recognized as an effective technique for achieving state-of-the-art performance in practical applications. In our AutoML, pipeline-level ensembles are integrated within the pipeline optimization framework. Consequently, the pipelines {𝒫_k}, k = 1, 2, ..., 𝐺, generated during each evaluation round in Algorithm 1 can naturally serve as candidates to form the ensemble model. To enhance the flexibility of the ensemble models, we implement three major ensemble structures: Stacking, Bagging, and Boosting, as summarized in Dong et al. (2020).

Considering a total of 𝐺 trained pipelines constrained by the time/evaluation budget following specific training protocols, 𝐻 pipelines are selected to construct the ensemble model. The final predictions of the ensemble model can be computed as the aggregation of individual predictions through certain Voting Mechanisms. Consequently, the ensemble model Σ, given 𝐺 pipelines {𝒫_g}, g = 1, 2, …, 𝐺, can be expressed as

    Σ_𝐻 = Σ_𝐻(𝒫_1, 𝒫_2, …, 𝒫_𝐺) = Σ_𝐻(𝒫_(1), 𝒫_(2), …, 𝒫_(𝐻))

where 𝐻 denotes the pre-fixed hyperparameter for the size of the ensemble model (number of selected pipelines), and 𝒫_(h) refers to the h-th pipeline ranked by the evaluation loss L_eval computed from the objective function ℱ. The predictions, given the ensemble model Σ_𝐻 and an input matrix X, can be expressed as

    ŷ = Σ_𝐻(X) = 𝛾(𝒫_(1)(X), 𝒫_(2)(X), …, 𝒫_(𝐻)(X))

where 𝛾 denotes the voting mechanism attached to the ensemble model Σ_𝐻.

It is important to note that the three ensemble structures only function as the protocol of the training diagram, predictions through the voting mechanism, and, in our work, validation for the completion of the task. They do not involve any additional training procedures. The optimization of each individual pipeline given the input training/validation datasets thus remains unaffected by the deployment of ensemble models. Specifically, the naming of these ensemble models distinguishes them by their training diagrams. The stacking ensemble models utilize a fully parallel optimization approach across entire datasets, where multiple base pipelines are trained independently and their predictions are aggregated through the voting mechanism. In contrast, bagging ensemble models are optimized by training multiple base pipelines on random subsets of the data, aiming to reduce variance and enhance generalization. Boosting ensemble models iteratively improve model predictions by focusing on the residuals from previous pipelines, sequentially refining predictions to minimize overall error. Refer to Appendix B for details of the ensemble strategies.

3.4. Loss functions

Loss functions play a pivotal role in both model creation and evaluation. By mapping pairs of true observations and model predictions to a real value, loss functions provide a quantifiable measure of model performance. As summarized by Wang et al. (2022), a variety of loss functions have been developed to address the specific demands of different ML tasks. These loss functions are meticulously designed to optimize the model performance across diverse applications, ensuring alignment with the objectives and constraints of each particular task.

The choice of loss functions can significantly influence model creation and, ultimately, the success of ML models. In financial modeling, Bams et al. (2009) demonstrate that the choice of loss functions has a substantial impact on option valuation, underscoring their critical role in this domain. Similarly, in deep learning, the design of appropriate loss functions is crucial for tasks such as Object Detection (Lin et al., 2017) and Image Segmentation (Salehi et al., 2017). In the insurance domain, one of the major challenges is the imbalanced distribution of response variables, which complicates the direct application of ML models. To address this issue, researchers have proposed carefully calibrated imbalance learning algorithms and adjusted cost-sensitive loss functions, as highlighted by Hu et al. (2022), Zhang et al. (2024), and So and Valdez (2024).

Choosing the appropriate loss function is essential for aligning ML models with insurance business objectives. It ensures that models not only perform well technically, but also deliver outcomes that are meaningful and beneficial to the insurance business. Loss functions guide the model during training by quantifying errors, ensuring that the model learns to minimize the types of error that matter the most in the insurance context. Our AutoML framework provides a variety of common loss functions suitable for both imbalanced and balanced learning scenarios. Additionally, it offers flexibility for users to employ customized loss functions based on their specific needs. This customization allows users to define loss functions that better capture their unique objectives, leading to more relevant and actionable insights. For instance, in sensitive areas like insurance pricing, ensuring fairness across different demographic groups is crucial. Custom loss functions can be designed to enforce fairness constraints and comply with regulatory requirements, helping to ensure ethical and equitable outcomes.

4. AutoML in action

To demonstrate the feasibility and efficacy of our AutoML, we conducted experiments using several datasets studied by actuarial science researchers. We evaluated both the performance and run-time of these experiments. Our AutoML framework is user-friendly and requires only a few lines of code to deploy, making it accessible to inexperienced users. Please refer to Appendix D for the code required to run the experiments and the corresponding descriptions, along with the optimal pipeline hyperparameter settings observed in these experiments.

4.1. French motor third-party liability

In this experiment, we use the French Motor Third-Party Liability dataset, freMTPL2freq, from the package CASDatasets (Charpentier, 2015), which comprises 677,991 motor liability policies collected in France, with the response variable ClaimNb, indicating the number of claims during the exposure period, and 10 categorical/numerical explanatory variables (excluding the policy ID, IDpol). Refer to Table 1 for the full list of features and response variables and their descriptions. The task is to predict future claim frequency, which is framed as a regression problem. We follow the same random seed and train/test split percentage suggested by Noll et al. (2020) to replicate the train/test sets, without applying any preprocessing techniques before feeding the data into the AutoML pipeline. The performance of the experiment is evaluated by mean Poisson deviance, the same metric utilized in Noll et al. (2020), which can be expressed as

    ℒ_Poi(y, ŷ) = (2/𝑍) Σ_{z=1}^{𝑍} ( ŷ_z − y_z + y_z log(y_z / ŷ_z) )

for a total of 𝑍 true response values y_z and predictions ŷ_z (z = 1, 2, ..., 𝑍). The evaluation of mean Poisson deviance is common in actuarial practice for claim frequency modeling.
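A direct implementation of this metric (our own sketch; the y_z log(y_z/ŷ_z) term is taken as 0 when y_z = 0, and predictions must be strictly positive):

```python
import numpy as np

def mean_poisson_deviance(y, y_hat):
    """L_Poi(y, y_hat) = (2/Z) * sum(y_hat_z - y_z + y_z * log(y_z / y_hat_z))."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # ratio defaults to 1 where y == 0, so y * log(ratio) vanishes there
    ratio = np.divide(y, y_hat, out=np.ones_like(y), where=y > 0)
    return 2.0 * np.mean(y_hat - y + y * np.log(ratio))
```

A comparable reference implementation is available as `sklearn.metrics.mean_poisson_deviance`.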
Table 1
Features & Response variables of freMTPL2freq dataset.

Category   Name         Type          Description
Features   Area         Categorical   Density values of the community
           BonusMalus   Numerical     Bonus/Malus of the driver
           Density      Numerical     Density of inhabitants in the neighborhood
           DrivAge      Numerical     Age of the driver
           Exposure     Numerical     Exposure length of enforced policies
           Region       Categorical   Region of the policies by categories
           VehAge       Numerical     Age of the vehicle
           VehBrand     Categorical   Brand of the vehicle
           VehGas       Categorical   Gasoline type of the vehicle
           VehPower     Numerical     Power of the vehicle
Response   ClaimNb      Numerical     Number of claims reported

Table 2
AutoML performance on freMTPL2freq dataset.

G      T/s       runtime/s   Train Deviance   Test Deviance
8      900       807.62      0.3622           0.3689
16     1,800     1,082.21    0.3826           0.3890
32     3,600     2,092.15    0.3156           0.3250
64     7,200     4,417.51    0.3022           0.3122
128    14,400    8,052.91    0.2925           0.3034
256    28,800    12,624.60   0.2779           0.3009
512    57,600    34,036.03   0.2762           0.3020
1024   115,200   63,401.81   0.2539           0.3114

Fig. 2. Train/Test deviance and runtime on freMTPL2freq dataset.

To illustrate the performance of our AutoML in terms of evaluation and time budget, we train the AutoML across a range of increasing evaluation budgets 𝐺 and time budgets 𝑇. The runtime and train/test deviance for each ensemble of fitted pipelines are shown in Table 2, with a visualization provided in Fig. 2 on a log scale. In Fig. 2, it is evident that as 𝐺 and 𝑇 increase synchronously, the runtime increases approximately linearly with the evaluation budget. Concurrently, train/test deviance decreases, indicating the performance gain from a more extensive hyperparameter search. The superior performance at 𝐺 = 8 compared to 𝐺 = 16 can be explained by the insufficient number of fitted pipelines at early training stages. At the starting point 𝐺 = 8, only two valid pipelines could be selected to form the ensemble model out of five trained pipelines, due to the metric of mean Poisson deviance only allowing non-zero predictions. In contrast, at 𝐺 = 16, although five valid pipelines could be selected under the current evaluation and time budget, the ensemble construction underperforms compared to the one from the 𝐺 = 8 stage. This is because the early stages of hyperparameter search do not provide enough valid and high-performing pipelines. This suggests that users should consider increasing the evaluation and time budget for better results. On the other hand, the increasing test deviance observed from 𝐺 = 256 to 𝐺 = 1,024 does not necessarily indicate over-fitting but rather reflects the limitations of the search concerning the vast hyperparameter space designed for the AutoML pipeline. This suggests that users should consider stopping the AutoML process and investigating the fitted pipelines.

4.2. Wisconsin local government property insurance fund

The Wisconsin Local Government Property Insurance Fund dataset, LGPIF, introduced in Frees et al. (2016), is another classical real-world dataset for actuarial science researchers. The local government property insurance covers six different groups: building and content (BC), contractor's equipment (IM), comprehensive new (PN), comprehensive old (PO), collision new (CN), and collision old (CO). The dataset consists of 5,677 policies (with 1,684 observed claims) collected from the years 2006 to 2010 as a train set, and 1,098 unique policies (with 328 observed claims) of the year 2011 as a test set. In our AutoML experiments, following Quan and Valdez (2018), we select only one line of business, with the variable yAvgBC, the average claim size of the coverage Building and Content, as the response, making it a supervised regression task. The dataset includes 8 continuous variables describing coverage and deductible information, and 13 categorical indicators of insured entities and historical claims are utilized as features. For details on the variables and their descriptions, refer to Table 3. Specifically, as instructed in Quan and Valdez (2018), we apply the logarithmic transformation to the response variable, yAvgBC, and 6 coverage variables, which can be expressed as

    𝑥̂ = log(𝑥 + 1)

where 𝑥 denotes the original variable and 𝑥̂ refers to the transformed variable. In training our AutoML, we optimize based on the loss function using the Coefficient of Determination (𝑅²), which, for a total of 𝑍 true values y and their corresponding predictions ŷ generated by the fitted AutoML, can be written as

    𝑅²(y, ŷ) = 1 − RSS/TSS

where RSS = Σ_{z=1}^{𝑍} (ŷ_z − y_z)² and TSS = Σ_{z=1}^{𝑍} (y_z − ȳ)² denote the residual sum of squares and the total sum of squares respectively, and ȳ = (1/𝑍) Σ_{z=1}^{𝑍} y_z is the mean of all true values.

The experiment follows the same structure as demonstrated in Subsection 4.1, where we record the runtime and train/test errors as the evaluation and time budget increase. Table 4 and Fig. 3 summarize the results and the corresponding visualization. Both Table 4 and Fig. 3 suggest that scaling the evaluation and time budget on the LGPIF dataset has a less significant impact on performance compared to the freMTPL2freq dataset. One obvious reason is that the LGPIF dataset is significantly smaller than the freMTPL2freq dataset, requiring fewer trials to achieve suitable results from our AutoML.
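Both the log transformation and the 𝑅² metric used above are one-liners (a sketch of our own, with 𝑅² written in its usual 1 − RSS/TSS form):

```python
import numpy as np

def log_transform(x):
    """x_hat = log(x + 1), applied to yAvgBC and the coverage variables."""
    return np.log1p(x)

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - RSS/TSS."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss = np.sum((y_hat - y) ** 2)       # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1.0 - rss / tss
```

Perfect predictions give 𝑅² = 1, while predicting the mean of y everywhere gives 𝑅² = 0.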
In addition, we also notice that, throughout the experiments, the best-performing individual pipelines constantly demonstrate improvement on the train set. This outcome validates the feasibility of substituting manual tuning with automatic optimization, showcasing the efficacy of our approach.

Table 3
Features & Response variables of LGPIF dataset.

Category   Name            Type          Description
Feature    Type            Categorical   Binary indicator of property type (City, County, Misc, School, Town, Village)
           IsRC            Categorical   Binary indicator of replacement cost
           Coverage        Numerical     Coverage of the property (BC, IM, PN, PO, CN, CO)
           lnDeduct        Numerical     Logarithm of deductible (BC, IM)
           NoClaimCredit   Categorical   Binary indicator of prior claim reports (BC, IM, PN, PO, CN, CO)
Response   yAvgBC          Numerical     Average claim sizes

Table 4
AutoML performance on LGPIF dataset.

G      T/s       runtime/s   Train R²   Test R²
8      900       1,202.55    0.2267     0.2151
16     1,800     1,533.87    0.2700     0.2268
32     3,600     1,513.57    0.2373     0.2235
64     7,200     2,891.36    0.2674     0.2255
128    14,400    3,367.43    0.3409     0.2361
256    28,800    8,413.55    0.3330     0.2360
512    57,600    10,313.03   0.3424     0.2372
1024   115,200   15,282.33   0.3197     0.2377
2048   230,400   40,856.35   0.3260     0.2360

Table 5
AutoML performance on ausprivauto claim occurrence.

G      T/s       runtime/s   Train AUC   Test AUC
8      900       565.15      0.8681      0.5407
16     1,800     629.83      0.8489      0.6253
32     3,600     1,127.55    0.6578      0.6754
64     7,200     2,506.17    0.6560      0.6739
128    14,400    3,749.68    0.6576      0.6754
256    28,800    4,598.22    0.6600      0.6770
512    57,600    9,709.30    0.6602      0.6774
1024   115,200   18,938.53   0.6609      0.6815
2048   230,400   30,951.82   0.6626      0.6831
4096   460,800   79,438.58   0.6627      0.6831

Fig. 3. Train/Test error and runtime on LGPIF dataset.

Fig. 4. Train/Test error and runtime on ausprivauto occurrence dataset.

4.3. Australian automobile insurance

The Automobile claim datasets in Australia, ausprivauto, have multiple response variables related to insurance claims, introduced by De Jong et al. (2008) and also collected in the package CASDatasets (Charpentier, 2015). For our experiments, we use the subset ausprivauto0405, which consists of 67,856 one-year auto insurance policies enforced in the year 2004 or 2005. We select the variable ClaimOcc, a binary indicator of the occurrence of claim events, to test the performance of AutoML on classification tasks. To differentiate this experiment from other experiments on the ausprivauto dataset, we denote the binary claim occurrence estimation task as ausprivauto_occ. The features comprise 4 categorical variables, describing the vehicle age groups (VehAge), vehicle body groups (VehBody), drivers' gender (Gender), and drivers' age groups (DrivAge), and two continuous features of vehicle value (VehValue) and exposure (Exposure). Following the example of applying Logistic Regression in De Jong et al. (2008), our AutoML can achieve a training AUC score of 0.9401 by training on the entire dataset, with 128 rounds of evaluation and 2,047.21 seconds of fitting. Compared to the training AUC score of 0.662 achieved by the exposure-adjusted model (De Jong et al., 2008), our AutoML fits the training dataset better than the Generalized Linear Model (GLM) family. However, the high AUC score may also indicate over-fitting, given that no cross-validation is applied. To ensure robust performance, subsequent experiments are conducted with the same feature sets and the same loss function, but with a 90/10 train/test split ratio and 4-fold cross-validation, following the experimental design suggested by Si et al. (2022). The loss function, AUC, can be written as

    AUC(y, ŷ_prob) = ( Σ_{μ: y_μ=0} Σ_{ν: y_ν=1} 1_{ŷ_prob,μ < ŷ_prob,ν} ) / ( (Σ_{μ=1}^{𝑍} 1_{y_μ=0}) (Σ_{ν=1}^{𝑍} 1_{y_ν=1}) )

for a binary classification problem, where y and ŷ_prob are the true classes and the predicted probabilities of the positive class.

The numerical runtime and performance, and the corresponding visualization, can be found in Table 5 and Fig. 4, respectively. As observed, the runtime exhibits a scaling pattern similar to that seen for the freMTPL2freq and LGPIF datasets. Experiments with an evaluation budget 𝐺 ≤ 16 show a notable discrepancy between the train and test metrics compared to the other entries in the table. For experiments with 𝐺 ≤ 16, the best-performing model architecture is the kNN. Obviously, the kNN model tends to overfit on this dataset and leads to a disparity between training and testing metrics.
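The pairwise AUC formula above can be implemented directly (a sketch of our own; ties in predicted probabilities are ignored, exactly as in the strict-inequality indicator of the formula):

```python
import numpy as np

def auc_pairwise(y, y_prob):
    """Fraction of (negative, positive) pairs where the positive case
    receives the strictly higher predicted probability."""
    y = np.asarray(y)
    y_prob = np.asarray(y_prob, dtype=float)
    neg = y_prob[y == 0]
    pos = y_prob[y == 1]
    wins = np.sum(pos[None, :] > neg[:, None])   # 1{y_hat_mu < y_hat_nu}
    return wins / (neg.size * pos.size)
```

For data without ties this agrees with `sklearn.metrics.roc_auc_score`; the O(n²) pair enumeration is for clarity, not efficiency.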
This disparity is significantly mitigated for experiments with 𝐺 ≥ 32, as the instability of the kNN architecture during k-fold cross-validation leads to the selection of other

Table 6
AutoML performance on ausprivauto claim frequency.

G      T/s       runtime/s   Train Deviance   Test Deviance
8      900       968.38      0.3799           0.3737
16     1,800     1,333.62    0.3748           0.3674
32     3,600     1,489.29    0.3748           0.3674
64     7,200     3,187.05    0.3748           0.3672
128    14,400    3,574.53    0.3749           0.3670
256    28,800    4,551.89    0.3742           0.3668
512    57,600    10,933.49   0.3748           0.3672
1024   115,200   13,849.67   0.3748           0.3671

Table 7
AutoML performance on ausprivauto claim amount.

G      T/s       runtime/s   Train MSE      Test MSE
8      900       952.30      993,285.74     1,220,835.99
16     1,800     1,290.22    1,101,871.13   1,192,769.13
32     3,600     2,181.53    1,099,855.88   1,191,872.59
64     7,200     2,350.45    1,099,573.20   1,191,859.57
128    14,400    4,274.43    1,099,296.76   1,191,913.62
256    28,800    8,797.69    1,098,746.18   1,191,618.57
512    57,600    11,533.74   1,098,873.21   1,192,162.05
1024   115,200   23,347.84   1,098,817.89   1,191,450.96

Fig. 5. Train/Test deviance and runtime on ausprivauto frequency dataset.

Fig. 6. Train/Test deviance and runtime on ausprivauto claims.

4.4. AutoML as a benchmark

To demonstrate the potential of our AutoML pipeline as a benchmark, we compare the model performance derived from our AutoML with that of the GLM widely used in insurance pricing, and with other actuarial literature studying the same datasets. It is important to note that although GLMs are popular in practical applications, they serve as a relatively weak benchmark and are associated with several limitations. Research by Frees et al. (2016), Okine et al. (2022), and Jeong (2024) indicates that more advanced expert models can often outperform GLM models significantly. Therefore, the comparison of our AutoML results with GLM models should be regarded as a preliminary guideline rather than a definitive measure of performance.

To ensure a fair comparison, we build GLMs using the same datasets mentioned in the previous experiments. For claim frequency estimation on the freMTPL2freq dataset and the ausprivauto dataset (i.e., experiment ausprivauto_fre), we employ Poisson regression with the logarithm of exposure as the offset. For claim occurrence on ausprivauto (i.e., experiment ausprivauto_occ), we utilize Logistic Regression with a Bernoulli distribution. Lastly, for estimation of aggregated claim amounts on the LGPIF dataset and the ausprivauto dataset (i.e., experiment ausprivauto), our model is based on a GLM with a Tweedie family.

As shown in Table 8, our AutoML consistently demonstrates superior performance compared to the GLM family under reasonably constrained computational hardware and time budgets. This comparison highlights the effectiveness of the AutoML approach in delivering superior results across various insurance pricing models, highlighting its adaptability
model architectures. Both train and test performance metrics steadily and advantages over traditional GLM frameworks.
improve with the increasing evaluation budget and time budget. Furthermore, to assess how our AutoML performs relative to cutting-
Additionally, we can construct two regression models using the re- edge actuarial research, we gather the performance metrics from Au-
sponse variables for the number of claims reported, ClaimNb, and for the toML experiments conducted on freMTPL2freq, LGPIF and claim occur-
aggregated claim amounts, ClaimAmount, with the same feature sets. In rence in ausprivauto_occ datasets, applying the consistent data-splitting
the following discussion, we refer to the experiments conducted on claim strategies as those discussed in actuarial studies.
frequency and claim amounts as ausprivauto_fre and ausprivauto, respec- As summarized in Tables 9 and 2, AutoML achieves a test Poisson de-
tively. The AutoML optimization in these two regression experiments viance of less than 0.3122, with the best performance reaching 0.3009,
tries to minimize the mean Poisson deviance 𝑃 𝑜𝑖 and Mean Squared when the evaluation budget 𝐺 greater than 64. This result surpasses the
test Poisson deviance 0.3149 reported by Wüthrich (2019). Notably, the
Error (MSE) between the true values y and predicted values ŷ respec-
results provided by researchers are superior to those from a naive Pois-
tively. The metric of MSE is defined as:
son GLM, which has a test Poisson deviance of 0.3595, reflecting the
𝑍
1 ∑ benefits of careful data preprocessing and model selection. However,
𝑀𝑆𝐸 (y, ŷ ) = (𝑦 − 𝑦̂𝑧 )2 AutoML consistently outperforms the claim frequency modeling results
𝑍 𝑧=1 𝑧
provided by humans for the freMTPL2freq dataset. This suggests the po-
The evaluation performance and runtime for ausprivauto_fre and auspri- tential of AutoML to eliminate the need for manual data preprocessing
vauto datasets are summarized in Table 6 and Table 7, and the corre- and model selection. At the very least, the training results from the Au-
sponding visualization plots are illustrated as Fig. 5 and Fig. 6. As the toML pipelines can inspire humans to discover better solutions.
tables and plots illustrate, our AutoML exhibits a diminishing trend in Additionally, as evident from the LGPIF section of Table 9, the opti-
the improvement of performance metrics over time and evaluation bud- mal predictions generated by our AutoML outperform all metrics except
get. MAE when compared to the results from Quan and Valdez (2018) on


Table 8
Comparison of AutoML with GLM models.

                                              GLM Results            AutoML
Data              Family      Metric            Test loss     G      Test loss
freMTPL2freq      Poisson     Poisson Deviance  0.3595        256    0.3009
LGPIF             Tweedie     R²                0.2062        1024   0.2377
                              Gini              0.4089               0.4187
                              ME                0.1609               0.0476
                              MSE               14.0533              13.4956
                              MAE               2.8749               2.8955
ausprivauto_occ   Bernoulli   AUC               0.6792        2048   0.6831
ausprivauto_fre   Poisson     Poisson Deviance  0.4437        256    0.3668
ausprivauto       Tweedie     RMSE              1,091.67      1024   1,091.54

Table 9
Comparison of AutoML with other actuarial literature.

                                                   Actuarial literature            AutoML
Data              Source                   Metric            Test loss     G       Test loss
freMTPL2freq      Wüthrich (2019)          Poisson Deviance  0.3149        256     0.3009
LGPIF             Quan and Valdez (2018)   R²                0.229         1024    0.2377
                                           Gini              0.414                 0.4187
                                           ME                0.048                 0.0476
                                           MSE               13.651                13.4956
                                           MAE               2.883                 2.8955
ausprivauto_occ   Si et al. (2022)         AUC               0.660         2048    0.6831

the claim severity prediction. While Quan and Valdez (2018) utilized various tree-based models, our AutoML ensemble model leverages multiple state-of-the-art algorithms, resulting in superior prediction performance. A similar trend is observed for the claim occurrence prediction on the ausprivauto_occ dataset, where our AutoML outperforms the stacking ensemble model approach suggested by Si et al. (2022). The comparison of the LGPIF experiment in Table 8 and Table 9 suggests that our AutoML not only outperforms the GLM but also has the potential to surpass state-of-the-art ML models developed by human experts.

As demonstrated in the comparative analysis presented in Table 8 against the GLM model framework and in Table 9 alongside selected actuarial literature, our AutoML consistently performs well across various insurance tasks, spanning both classification and regression. In the context of the insurance sector, our approach encompasses tasks related to predicting claim occurrence, claim frequency, and claim amounts. This superior performance across diverse tasks underscores the practicality and effectiveness of our automated approach in the insurance domain.

4.5. AutoML for general-purpose ML tasks

In addition to insurance-specific tasks, our AutoML can be employed in general ML tasks. In the following, we demonstrate the capability of our AutoML by comparing it with other popular AutoML frameworks.

Specifically, we test our AutoML on two Kaggle competition datasets, Regression with a Flood Prediction Dataset [6] and Regression with an Abalone Dataset [7]. We compare our AutoML with three popular AutoML architectures: autogluon [8] (Erickson et al., 2020), h2o [9] (LeDell and Poirier, 2020), and auto-sklearn [10] (Feurer et al., 2015).

The private scores submitted by each AutoML architecture are presented in Table 10. The evaluation metrics employed are consistent with those used in the corresponding competitions. Specifically, the R² score is used for the Flood Prediction task, while the Root Mean Squared Logarithmic Error (RMSLE) is utilized for the Abalone task. The RMSLE can be mathematically expressed as:

    RMSLE(y, ŷ) = √( (1/Z) ∑_{z=1}^{Z} (log(1 + ŷ_z) − log(1 + y_z))² )

For autogluon, we ran experiments using the medium preset and the best preset, denoted autogluon medium and autogluon best, respectively. For our AutoML, as well as for the h2o and auto-sklearn frameworks, we allocated a runtime of 6-10 hours for each experiment. The results indicate that our AutoML is competitive in performance compared to these widely used and well-established AutoML frameworks on general ML tasks. Moreover, our AutoML pipeline demonstrates versatility, being applicable to both general ML tasks and insurance-specific ML applications.

Table 10
Comparison of AutoML on general ML tasks.

Task               Model              Evaluation Metric   Score
Flood Prediction   InsurAutoML        R² score            0.86476
                   autogluon medium                       0.86171
                   autogluon best                         0.86262
                   h2o                                    0.86073
                   auto-sklearn                           0.86233
Abalone            InsurAutoML        RMSLE               0.14726
                   autogluon medium                       0.17143
                   autogluon best                         0.16840
                   h2o                                    0.14751
                   auto-sklearn                           0.14826 [a]

[a] The auto-sklearn experiment on the Abalone dataset encountered an issue with categorical preprocessing when using the default settings. Specifically, the default model predicted a single unique value for all observations, which led to an RMSLE of 0.28599. To address this, the RMSLE reported in this table is based on a preprocessed dataset where all string-based features were converted to numerical ones, allowing proper optimization.

[6] https://www.kaggle.com/competitions/playground-series-s4e5
[7] https://www.kaggle.com/competitions/playground-series-s4e4
[8] https://github.com/autogluon/autogluon
[9] https://github.com/h2oai/h2o-3
[10] https://github.com/automl/auto-sklearn
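For readers who wish to reproduce the two competition metrics, the following is a minimal numpy sketch of the R² score and the RMSLE as defined above. The toy values are purely illustrative and are not taken from the experiments; equivalent implementations also exist in standard ML libraries.

```python
import numpy as np

def rmsle(y, y_hat):
    """Root Mean Squared Logarithmic Error, per the formula above."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y)) ** 2)))

def r2_score(y, y_hat):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical abalone-style ages and predictions (illustrative only).
y_true = [9.0, 10.0, 12.0, 7.0]
y_pred = [8.5, 10.5, 11.0, 7.5]
print(rmsle(y_true, y_pred))
print(r2_score(y_true, y_pred))
```

Note that RMSLE uses log1p, so it penalizes relative rather than absolute errors, which suits strictly positive targets such as ages or claim counts.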


5. Conclusion

In conclusion, there is a noticeable gap between insurance practitioners and the data science community when it comes to effectively optimizing and applying ML solutions for both industrial and research purposes.

Our AutoML effectively bridges this gap by automating the ML pipeline for insurance-related tasks. This framework not only streamlines the modeling process but also demonstrates adaptability and robust performance across various insurance applications. Our comprehensive experimentation and comparison with traditional modeling frameworks, such as the GLM, and with existing actuarial literature confirm the feasibility and effectiveness of our AutoML approach.

The runtime scalability of our AutoML, as observed in the previous experiments, generally meets expectations, showing a roughly linear increase with the evaluation budget. However, performance improvements may plateau over time or fail to scale significantly, potentially leading to wasted computation and increased energy consumption. Therefore, to attain optimal performance, strategies should include not only increasing the computational budget, through enhanced hardware or extended evaluation/time limits, but also incorporating an early stopping mechanism. Additionally, human intervention can refine the optimization process by examining the results and limiting the search space, both of which are functions supported by our AutoML framework. Combining computational resources with human-guided intervention provides a path to achieving the best performance in ML tasks.

Our AutoML is designed to be user-friendly, allowing users with little to no prior knowledge to get started effortlessly using just 4-5 lines of code. It also offers flexibility for customization to cater to personalized use scenarios, accommodating both novices and advanced users with varying needs. Furthermore, beyond its immediate application in ML solution development, our AutoML can serve as a benchmark for evaluating future innovations in ML models within the insurance domain. By providing a consistent and comprehensive evaluation platform, it has the potential to serve as a valuable reference point for assessing the performance and advancements of emerging algorithms.

While AutoML frameworks excel at automating the selection of algorithms, hyperparameters, and preprocessing steps to optimize model performance, they often fall short in offering transparency and explainability. Current developments in Explainable Artificial Intelligence (XAI) tend to focus on post-hoc explanations of predictions or model-agnostic insights, leaving the role of preprocessing steps in the overall model process under-explored. This gap is particularly relevant in actuarial science, where XAI is still an ongoing research topic, and there is no clear path to achieving domain-specific interpretability. These challenges in AutoML systems amplify the "black-box" perception, making it difficult to understand not only how specific features influence predictions, but also why certain hyperparameters or preprocessing steps are selected. The use of ensemble methods adds further complexity, especially in high-stakes fields such as healthcare, finance, and insurance, where transparency and trust in model outcomes are essential. As the XAI field progresses, expanding our AutoML framework to include post-hoc inference tools after the modeling steps could significantly enhance interpretability, helping to bridge this gap.

CRediT authorship contribution statement

Panyi Dong: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Formal analysis, Data curation. Zhiyu Quan: Writing – review & editing, Writing – original draft, Validation, Supervision, Resources, Project administration, Methodology, Investigation, Funding acquisition, Data curation, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Notations

Table 11
Notations used in the formulation of AutoML.

Notation   Description
𝒜          a conjunction space of model space and hyperparameter space
𝒟          a dataset
ℒ          a loss function
ℳ          the model space for all ML models
𝒫          an ML workflow pipeline
𝒪          an objective function
𝒮          the search space
A          the majority class in the context of an imbalanced problem
B          a data balancing algorithm
E          a data encoding algorithm
F          a feature selection algorithm
G          evaluation budget
H          the number of pipelines to construct the ensemble model
I          an imputation algorithm
M          an ML model
N          the number of tunable hyperparameters
O          unique categories of a classification task
Q          a feature selection rule
R          the imbalance threshold
S          a data scaling algorithm
T          time budget
W          the number of features/columns of a feature matrix X
(X, y)     a pair of feature matrix X and response vector y in a dataset
Z          the number of observations/rows of a feature matrix
γ          a voting mechanism
θ          a parameter set of an ML model
λ          a hyperparameter set of an ML model
Λ_j        range of hyperparameter j
𝚲          the space defined by the ranges of all hyperparameters
Σ          a pipeline ensemble model
Ω          the number of features selected by a feature selection rule

Appendix B. Three ensemble strategies

The following paragraphs elaborate on how the training of the three ensemble structures and the voting mechanism are adopted in our work.

Stacking Ensemble: As illustrated in Fig. 7, the stacking ensemble represents the most intuitive ensemble structure, where individual pipelines are trained independently on the original full datasets. The only difference distinguishing individual pipelines is the hyperparameter set sampled by the search algorithm, which results in varying evaluation losses. Consequently, all pipelines can be trained in parallel, leveraging the multi-core, multi-threading capabilities of modern computing systems.

Once the time or evaluation budget is exhausted, a total of G pipelines, denoted as {𝒫_g}, g = 1, 2, ..., G, have been trained. These pipelines are then ranked based on their evaluation losses, producing a sorted list {𝒫_(h)}, h = 1, 2, .... Given our AutoML's optimization direction, the sorted pipelines {𝒫_(h)} are arranged in ascending order of h, indicating that 𝒫_(1) has the lowest evaluation loss and performs the best. The ensemble model, denoted as Σ_H, is constructed utilizing the best-performing H pipelines, {𝒫_(1), 𝒫_(2), ..., 𝒫_(H)}. The assembly of the ensemble model Σ_H involves no additional computation, merely the stacking of the selected pipelines, which can be expressed as

    Σ_H = Σ_H(𝒫_(1), 𝒫_(2), …, 𝒫_(H))

Algorithm 2 summarizes the stacking ensemble process, highlighting


Fig. 7. An illustration of the stacking ensemble training diagram.

the stacking procedure for the ensemble model, specifically line 14 in the algorithm description.

Algorithm 2: The Stacking Ensemble.

Input: Dataset 𝒟 = (𝒟_train, 𝒟_valid); Search space 𝒮; Time budget T; Evaluation budget G; Search algorithm Samp; Size of the ensemble model H
Output: Ensemble Σ_H
 1  k = 0 ;                              /* Round of evaluation */
 2  t_re = T ;                           /* Remaining time budget */
 3  g_re = G ;                           /* Remaining evaluation budget */
 4  while t_re > 0 and g_re > 0 do
 5      t_start = CurrentTime;
 6      (E^(k), λ_E^(k)), (I^(k), λ_I^(k)), (B^(k), λ_B^(k)), (S^(k), λ_S^(k)), (F^(k), λ_F^(k)), (M^(k), λ_M^(k)) = Samp^(k)(𝒮);
 7      𝒫_k = M^(k)_{λ_M^(k)} ∘ F^(k)_{λ_F^(k)} ∘ S^(k)_{λ_S^(k)} ∘ B^(k)_{λ_B^(k)} ∘ I^(k)_{λ_I^(k)} ∘ E^(k)_{λ_E^(k)} ;
 8      L_eval,(k) = 𝒪(ℒ, 𝒫_k, 𝒟);
 9      t_end = CurrentTime;
10      k = k + 1;
11      t_re = t_re − (t_end − t_start);
12      g_re = g_re − 1;
13  end
14  {𝒫_(k)} = sort({𝒫_k});
15  Σ_H = Σ_H(𝒫_(1), 𝒫_(2), …, 𝒫_(H));
16  return Σ_H;

Bagging Ensemble: The bagging ensemble follows a training diagram similar to that of the stacking ensemble, but differs by optimizing the individual pipelines on subsets of the features rather than on the full feature set. As demonstrated in Fig. 8, for a pre-defined number H, denoting the number of pipelines in the ensemble model Σ_H, H subsets of datasets are constructed by randomly selecting only a proportion of the features. Each pipeline 𝒫_{h,l}, h = 1, 2, …, H and l = 1, 2, …, G//H, is optimized on the subset dataset 𝒟^(h), which is constructed from the original dataset 𝒟 = (X, y) as follows: for every subset, a feature selection rule Q^(h), in the form of a W × 1 vector, where W denotes the number of features in the feature matrix X, is defined by our AutoML. The feature selection rules consist of binary values, where the w-th element of the rule q_w^(h) = 1 if the w-th feature is selected by the rule and q_w^(h) = 0 otherwise. Following the feature selection rule Q^(h), a subset matrix ρ^(h) of shape W × Ω^(h) can be defined, with elements

    ρ^(h)_{w, s_w} = q_w^(h), where s_w = ∑_{α=1}^{w} q_α^(h), for w = 1, 2, ..., W,

and ρ^(h)_{w,ω} = 0 for all other (w, ω), w = 1, 2, ..., W and ω = 1, 2, ..., Ω^(h). Here Ω^(h) is the number of features selected by rule Q^(h), so Ω^(h) = ∑_{α=1}^{W} q_α^(h). The subset dataset is then defined as 𝒟^(h) = (Xρ^(h), y).

Given the structure of tabular datasets, each pipeline aims to optimize a horizontal partition of the original dataset instead of the full original dataset, as seen in the stacking ensemble. The pipelines of each subset, once trained, are ordered by their evaluation loss within their subset groups, resulting in 𝒫_{h,(l)}, h = 1, 2, …, H and l = 1, 2, …, G//H. Using the same ascending order, each pipeline 𝒫_{h,(1)} is the best-performing pipeline for the subset dataset 𝒟^(h). The ensemble model, targeting optimal performance, is then constructed as:

    Σ_H = Σ_H(𝒫_{1,(1)}, 𝒫_{2,(1)}, …, 𝒫_{H,(1)})

by selecting the best-performing pipelines from each subset. The construction of the ensemble model becomes straightforward by stacking the best-performing pipelines from each subset dataset once the optimization and ranking on subsets are completed. The bagging algorithm is summarized in Algorithm 3. Unlike Algorithm 2, additional subset matrices {ρ^(h)}, h = 1, 2, ..., H, are required to construct the subset datasets, and the optimization loops are performed H times to obtain the pipelines {𝒫_{h,(1)}}. Bagging is typically used to prevent over-fitting, as optimizing on data subsets is sub-optimal compared to utilizing the full datasets. However, as demonstrated in Galar et al. (2012), bagging algorithms are also beneficial for imbalance learning.

Algorithm 3: The Bagging Ensemble.

Input: Dataset 𝒟 = (𝒟_train, 𝒟_valid); Search space 𝒮; Time budget T; Evaluation budget G; Search algorithm Samp; Size of the ensemble H; Subset matrices {ρ^(h)}, h = 1, 2, …, H
Output: Ensemble Σ_H
 1  for h ← 1 to H do
 2      k = 0 ;                          /* Round of evaluation */
 3      t_re = T//H ;                    /* Remaining time budget */
 4      g_re = G//H ;                    /* Remaining evaluation budget */
 5      𝒟^(h) = ((X_train ρ^(h), y_train), (X_valid ρ^(h), y_valid));
 6      while t_re > 0 and g_re > 0 do
 7          t_start = CurrentTime;
 8          (E^(k), λ_E^(k)), (I^(k), λ_I^(k)), (B^(k), λ_B^(k)), (S^(k), λ_S^(k)), (F^(k), λ_F^(k)), (M^(k), λ_M^(k)) = Samp^(k)(𝒮);
 9          𝒫_{h,k} = M^(k)_{λ_M^(k)} ∘ F^(k)_{λ_F^(k)} ∘ S^(k)_{λ_S^(k)} ∘ B^(k)_{λ_B^(k)} ∘ I^(k)_{λ_I^(k)} ∘ E^(k)_{λ_E^(k)} ;
10          L_eval,(k) = 𝒪(ℒ, 𝒫_{h,k}, 𝒟^(h));
11          t_end = CurrentTime;
12          k = k + 1;
13          t_re = t_re − (t_end − t_start);
14          g_re = g_re − 1;
15      end
16      {𝒫_{h,(k)}} = sort({𝒫_{h,k}});
17  end
18  Σ_H = Σ_H(𝒫_{1,(1)}, 𝒫_{2,(1)}, …, 𝒫_{H,(1)});
19  return Σ_H;

Boosting Ensemble: In contrast to the parallel training diagram utilized in the stacking and bagging ensembles, the boosting ensemble, as shown in Fig. 9, employs a sequential learning approach in which each subsequent pipeline is optimized based on the residuals/gradients of the previous pipelines. The sequential design of the boosting ensemble is motivated by the idea that later models are trained to correct the errors of earlier models, addressing issues that earlier models may not have effectively resolved. For a boosting ensemble of size H, the pipeline 𝒫_h, h = 1, 2, …, H, at step h, does not train on the original training

Fig. 8. An illustration of the bagging ensemble training diagram.

Fig. 9. An illustration of the boosting ensemble training diagram.

set 𝒟_train = (X_train, y_train), but instead focuses on optimizing the residual dataset

    𝒟_h = (X_train, y_{train,h})

where

    y_{train,h} = y_{train,h−1} − 𝒫_{h−1}(X_train)

for h = 1, 2, ..., H. The initialization of the residuals and predictions can be defined as y_{train,0} = y_train and 𝒫_0(X_train) = 0, indicating that the first pipeline is trained on the original train set 𝒟_train. By learning from residuals instead of the original response variables, the subsequent pipelines possess the potential to address errors that preceding pipelines could not, thereby potentially solving the problem of imbalance. In our AutoML, rather than utilizing a fully sequential training diagram, we split the training by steps. At step h, h = 1, 2, ..., H, pipelines 𝒫_{h,k}, k = 1, 2, ..., G//H, aim to optimize on the dataset 𝒟_h, and the best-performing pipelines 𝒫_{h,(1)} are selected by the ascending order of evaluation losses. The components of the ensemble model Σ_H can then be defined as 𝒫_h = 𝒫_{h,(1)} and utilized to generate the dataset 𝒟_{h+1} for the next step. Following the residual learning framework, the boosting ensemble utilized in our work can be summarized as Algorithm 4. The algorithm closely resembles Algorithm 3, with the distinction of modifying its learning dataset using residuals on response variables instead of employing feature subsets on feature matrices, as seen in the bagging ensemble. Similar to the bagging ensemble approach, the optimization within each residual dataset 𝒟_h can be fully parallelized. This leads to more efficient optimization and prevents the occurrence of sub-optimal pipelines on individual residual datasets.

Voting Mechanism in Ensemble: The voting mechanism, a critical component of the ensemble, determines how the predictions from individual candidate pipelines are aggregated to produce the final prediction of the ensemble model.

For the boosting ensemble, given its unique training diagram of residual learning, the voting mechanism is typically the summation of pipelines:

    ŷ = Σ_H(X) = ∑_{h=1}^{H} 𝒫_h(X)
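The residual recursion y_{train,h} = y_{train,h−1} − 𝒫_{h−1}(X_train) and the summation vote can be sketched in a few lines of numpy. The Stump class below is a toy stand-in for a full AutoML pipeline 𝒫_h (an assumption for illustration, not part of the actual framework), and the data are made-up values.

```python
import numpy as np

class Stump:
    """Toy weak learner: the single split of x minimizing squared error,
    predicting the mean of y on each side. Stands in for a pipeline P_h."""
    def fit(self, x, y):
        best_sse = np.inf
        for t in np.unique(x)[:-1]:              # candidate split points
            left = x <= t
            pl, pr = y[left].mean(), y[~left].mean()
            sse = ((y[left] - pl) ** 2).sum() + ((y[~left] - pr) ** 2).sum()
            if sse < best_sse:
                best_sse, self.t, self.pl, self.pr = sse, t, pl, pr
        return self

    def predict(self, x):
        return np.where(x <= self.t, self.pl, self.pr)

def fit_boosting(x, y, H=8):
    # y_h = y_{h-1} - P_{h-1}(x), with y_0 = y and P_0(x) = 0
    residual = y.astype(float).copy()
    pipelines = []
    for _ in range(H):
        p = Stump().fit(x, residual)
        residual = residual - p.predict(x)       # next step trains on residuals
        pipelines.append(p)
    return pipelines

def predict_boosting(pipelines, x):
    # Boosting voting mechanism: the summation of all pipeline predictions
    return sum(p.predict(x) for p in pipelines)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                                # toy regression target
ensemble = fit_boosting(x, y, H=8)
print(predict_boosting(ensemble, x))
```

Because every stump fits the residuals of its predecessors, the training error of the summed prediction is non-increasing in H, mirroring the error-correcting motivation described above.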


Algorithm 4: The Boosting Ensemble.

Input: Dataset 𝒟 = (𝒟_train, 𝒟_valid); Search space 𝒮; Time budget T; Evaluation budget G; Search algorithm Samp; Size of the ensemble model H
Output: Ensemble Σ_H
 1  Initialization: y_{train,0} = y_train; y_{valid,0} = y_valid; ŷ_{train,0} = 0; ŷ_{valid,0} = 0;
 2  for h ← 1 to H do
 3      k = 0 ;                          /* Round of evaluation */
 4      t_re = T//H ;                    /* Remaining time budget */
 5      g_re = G//H ;                    /* Remaining evaluation budget */
 6      y_{train,h} = y_{train,h−1} − ŷ_{train,h−1}; y_{valid,h} = y_{valid,h−1} − ŷ_{valid,h−1};
 7      𝒟_h = ((X_train, y_{train,h}), (X_valid, y_{valid,h}));
 8      while t_re > 0 and g_re > 0 do
 9          t_start = CurrentTime;
10          (E^(k), λ_E^(k)), (I^(k), λ_I^(k)), (B^(k), λ_B^(k)), (S^(k), λ_S^(k)), (F^(k), λ_F^(k)), (M^(k), λ_M^(k)) = Samp^(k)(𝒮);
11          𝒫_{h,k} = M^(k)_{λ_M^(k)} ∘ F^(k)_{λ_F^(k)} ∘ S^(k)_{λ_S^(k)} ∘ B^(k)_{λ_B^(k)} ∘ I^(k)_{λ_I^(k)} ∘ E^(k)_{λ_E^(k)} ;
12          L_eval,(k) = 𝒪(ℒ, 𝒫_{h,k}, 𝒟_h);
13          t_end = CurrentTime;
14          k = k + 1;
15          t_re = t_re − (t_end − t_start);
16          g_re = g_re − 1;
17      end
18      {𝒫_{h,(k)}} = sort({𝒫_{h,k}});
19      ŷ_{train,h} = 𝒫_{h,(1)}(X_train); ŷ_{valid,h} = 𝒫_{h,(1)}(X_valid);
20  end
21  Σ_H = Σ_H(𝒫_{1,(1)}, 𝒫_{2,(1)}, …, 𝒫_{H,(1)});
22  return Σ_H;

For the stacking and bagging ensembles, the voting mechanisms must be tailored to the nature of the regression or classification task, but are interchangeable between the stacking and bagging structures.

In regression tasks, the voting mechanisms often involve aggregation statistics. Commonly used statistics include the mean, median, and maximum, calculated by performing the corresponding operation on all H predictions by the individual pipelines. For example, in the mean voting mechanism, the corresponding predictions of the ensemble model can be expressed as

    ŷ = Σ_H(X) = (1/H) ∑_{h=1}^{H} 𝒫_h(X)

In classification tasks, the voting can be either hard or soft (Polikar, 2006), which differ by aggregating either the class-level predictions or the probabilities of each class. Soft voting aggregates the prediction probabilities generated by the individual pipelines. For an O-category classification problem, the prediction probability of pipeline h, 𝒫_h, can be represented as a probability vector

    𝒫_h^prob(X) = (p_{h,1}, p_{h,2}, …, p_{h,O})

where p_{h,o} ∈ [0, 1], o = 1, 2, …, O, denotes the probability of being categorized as class o, and ∑_{o=1}^{O} p_{h,o} = 1. The ensemble model prediction in soft voting is the class with the highest aggregated probability:

    ŷ = argmax ∑_{h=1}^{H} 𝒫_h^prob(X) = argmax( ∑_{h=1}^{H} p_{h,1}, ∑_{h=1}^{H} p_{h,2}, …, ∑_{h=1}^{H} p_{h,O} )

Hard voting, however, aggregates the prediction classes instead of the prediction probabilities. The prediction class of an individual pipeline h can be expressed as

    𝒫_h(X) = argmax(p_{h,1}, p_{h,2}, …, p_{h,O})

indicating that the predicted class is the most probable class based on probability. If class o is the predicted class, the prediction can further be re-written as a unit vector

    𝒫_h(X) = e_o = (0, …, 0, 1, 0, …, 0)

with the 1 in the o-th position. The aggregation of predictions in hard voting can be considered a majority vote among all H pipelines, with the probabilities reduced to indicator functions/unit vector representations. The eventual predictions of the ensemble model can be calculated as

    ŷ = argmax( ∑_{h=1}^{H} 1_{𝒫_h(X)=1}, ∑_{h=1}^{H} 1_{𝒫_h(X)=2}, …, ∑_{h=1}^{H} 1_{𝒫_h(X)=O} )
      = argmax ∑_{h=1}^{H} 𝒫_h(X)

where the first row utilizes the numerical representation of 𝒫_h(X) and the second row corresponds to the unit vector representation.

With the training diagram and the voting mechanism, multiple pipelines can be coordinated to address imbalance problems that individual pipelines alone might struggle with, as demonstrated in various empirical studies (Polikar, 2006; Galar et al., 2012; Dong et al., 2020).

Appendix C. Preprocessing methods and ML models

We have provided a brief list of the preprocessing methods and ML models currently used in our AutoML (see Tables 12–18). As development progresses, we intend to expand the framework by integrating more advanced and updated methods. We also welcome contributions from the broader community to support the ongoing development of our open-source AutoML.

Table 12
Data Encoding Methods.

Method          Description
Data Encoding   Transform string-based variables into numeric representations, and preserve the corresponding conversion table to ensure consistent application on the test sets. Ordinal and One-hot encoding available.

Table 13
Data Imputation Methods.

Method                              Description
Simple Imputation                   Impute missing values by applying relevant feature statistics, such as the mean or median.
Joint Imputation                    Assuming the data follows a multivariate normal distribution, compute the mean vector and covariance matrix based on the observed values. Then, impute the missing values by sampling from this distribution.
Expectation Maximization (EM)       Iteratively update the imputed values, along with the mean vector and covariance matrix of the joint distribution, to maximize the likelihood of the multivariate normal distribution.
kNN Imputation                      Impute missing values using a k-Nearest Neighbors (kNN) model trained on the observed values. (Stekhoven and Bühlmann, 2012)
Miss Forest Imputation              Impute missing values by training a Random Forest model on the observed values. (Stekhoven and Bühlmann, 2012)
Multiple Imputation by Chained      Initially impute the missing values through a simple imputation method. Then, iteratively refine the imputation by removing the previously imputed values, training a base ML model using only the observed values, and re-imputing the missing values column by column. This process is repeated multiple times until the imputations converge to stable estimates. (Azur et al., 2011)
Equations (MICE)
Generative Adversarial Imputation   Train a Generative Adversarial Network (GAN) model to impute missing values. The model consists of a generator that produces realistic imputations and a discriminator that differentiates between actual observed values and the generated imputations. (Yoon et al., 2018)
Nets (GAIN)


Table 14
Data Balancing Methods.

Method Description

Simple Random Over-Sampling Randomly replicate minority samples until the desired majority-minority threshold is achieved.
Simple Random Under-Sampling Randomly remove majority samples to achieve the desired majority-minority threshold.
Tomek Link                          By identifying adjacent majority-minority class pairs and removing majority samples from these pairs, the ratio of the majority class can be reduced. (Tomek, 1976)
Edited Nearest Neighbor (ENN) Remove majority samples that are misclassified by a kNN model trained on the original data. (Wilson, 1972)
Condensed Nearest Neighbor (CNN) Select a subset of majority class observations that aligns with the predictions of a kNN model trained on the original data. (Hart, 1968)
Synthetic Minority Over-Sampling Generate synthetic minority samples by applying linear interpolation between existing minority samples in the feature space. (Chawla
Techniques (SMOTE) et al., 2002)
One Sided Selection (OSS) A two-step method where the first stage utilizes Tomek Links, followed by CNN.
CNN-TomekLink
Smote-TomekLink Two-step balancing strategy that combines two balancing algorithms.
Smote-ENN
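The first entry in the table, Simple Random Over-Sampling, can be sketched directly in numpy. The function below is an illustrative implementation under the stated assumption of a binary-class problem; its `ratio` argument plays the role of the majority-minority threshold.

```python
import numpy as np

def random_oversample(X, y, ratio=1.0, seed=0):
    """Simple Random Over-Sampling: replicate minority-class rows, drawn
    with replacement, until n_minority >= ratio * n_majority."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_needed = int(ratio * counts.max()) - counts.min()
    if n_needed <= 0:
        return X, y                              # already balanced enough
    idx = np.flatnonzero(y == minority)          # indices of minority rows
    extra = rng.choice(idx, size=n_needed, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

X = np.arange(10).reshape(5, 2)                  # 5 toy observations, 2 features
y = np.array([0, 0, 0, 0, 1])                    # imbalanced: four 0s, one 1
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))
```

Under-sampling is the mirror image (drop majority rows instead of replicating minority rows); the more elaborate methods in the table replace random replication with geometric criteria such as Tomek links or SMOTE interpolation.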

Table 15
Data Scaling Methods.

Method                    Description
Standardization           Scale by x_j^scaled = (x_j − μ_j) / σ_j, where μ_j and σ_j are the mean and standard deviation of feature j.
Normalization             Scale by x_j^scaled = (x_j − x_{j,min}) / (x_{j,max} − x_{j,min}), where x_{j,min} and x_{j,max} are the minimum and maximum values of feature j.
Min-Max Scaling           Scale by x_j^scaled = (x_j − x_{j,min}) / (x_{j,max} − x_{j,min}) · (f_{j,max} − f_{j,min}) + f_{j,min}, where f_{j,min} and f_{j,max} are two pre-defined hyperparameters for a customized scaling range.
Robust Scaling            Scale by x_j^scaled = (x_j − Q_{1/2,j}) / (Q_{h,j} − Q_{l,j}), where Q_{1/2,j}, Q_{l,j} and Q_{h,j} are the 50-th percentile and some pre-defined low and high quantiles of feature j.
Power Transformation      Scale the feature exponentially.
Quantile Transformation   Transform the feature into a uniform or normal distribution.
Winsorization             Cap the feature at a certain quantile to remove outliers.
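Three of the scaling formulas in the table translate directly into numpy; a minimal sketch with an illustrative feature vector (the single outlier is what motivates the robust variant):

```python
import numpy as np

def standardize(x):
    # (x - mu) / sigma, per the Standardization row
    return (x - x.mean()) / x.std()

def min_max_scale(x, f_min=0.0, f_max=1.0):
    # Rescale to a user-chosen range [f_min, f_max], per the Min-Max Scaling row
    unit = (x - x.min()) / (x.max() - x.min())
    return unit * (f_max - f_min) + f_min

def robust_scale(x, low=25, high=75):
    # (x - median) / (Q_high - Q_low), per the Robust Scaling row
    q_l, q_m, q_h = np.percentile(x, [low, 50, high])
    return (x - q_m) / (q_h - q_l)

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])        # a feature with one outlier
print(standardize(x))
print(min_max_scale(x, -1.0, 1.0))
print(robust_scale(x))
```

Note how the outlier compresses the standardized and min-max versions of the regular values toward each other, while robust scaling, built from quantiles, leaves their spacing intact.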

Table 16
Feature Selection Methods.

Method Description

Feature Filter    Rank features based on their univariate correlation with the response variable, then select top-ranking features.
Percentile Feature Selection
Rates Feature Selection    Select features that fall within the highest percentile/rate by constructing univariate ML models to rank them.
Stepwise Feature Selection (SFS)    Iteratively add or remove one feature at a time to identify the optimal subset of features.
Adaptive Sequential Forward Floating Search (ASFFS)    An extension of SFS that allows both the addition and removal of features with dynamic steps to handle variable dependencies. (Somol et al., 1999)
Principal Component Analysis (PCA)    Perform eigenvalue decomposition to calculate principal components, reducing data dimensionality while retaining essential information.
Truncated Singular Value Decomposition (SVD)    Use SVD to decompose the design matrix and reduce dimensionality.
Minimal-Redundancy-Maximal-Relevance (mRMR)    A variant of SFS that enhances feature selection performance by simultaneously maximizing the relevance of the selected features with respect to the response variable while minimizing redundancy among the selected features. (Peng et al., 2005)
Copula-based Feature Selection (CBFS)    A variant of SFS that accounts for variable dependencies by embedding information into a copula framework. (Lall et al., 2021)
Genetic Algorithm (GA)    Optimize feature selection by employing processes such as selection, crossover, and mutation to maximize model performance. (Tan et al., 2007)
Extra Tree Feature Selection    Select features based on their importance in an Extra Trees model, which uses random splits to determine feature significance.
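As a small illustration of the percentile-style filters in Table 16, the sketch below (synthetic data, not from the paper's experiments) ranks ten features by their univariate F-score against the response and keeps the top 20%; only the two informative columns survive.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# the response depends only on the first two columns; the rest are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

# keep the top 20% of features ranked by univariate F-score (2 of 10)
selector = SelectPercentile(f_regression, percentile=20).fit(X, y)
kept = sorted(np.flatnonzero(selector.get_support()).tolist())
print(kept)  # → [0, 1]
```

This is the same family of filters as the `select_rates_regression` and `select_rates_classification` methods that appear in the optimal pipelines of Listings 5 and 7.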

32
P. Dong and Z. Quan Insurance Mathematics and Economics 120 (2025) 17–41

Table 17
Classification Models.

Method Description

Stochastic Gradient Descent (SGD)    Regularized linear models with non-linear loss functions optimized by SGD.
Logistic Regression    A linear model that uses a logistic (sigmoid) link function to predict probabilities.
Adaboost    Sequentially trains a series of weak learners, adjusting their weights so that each subsequent learner focuses more on the errors made by its predecessors.
Passive Aggressive (PA)    Updates the model only when a prediction is incorrect or uncertain, aggressively adjusting the decision boundary while remaining passive when predictions are correct.
Linear Discriminant Analysis (LDA)
Quadratic Discriminant Analysis (QDA)    Identify the Bayes-optimal linear/quadratic boundary in the feature space that can effectively distinguish the distinct classes.
Generalized Additive Models (GAMs)    Linear additive framework that learns smooth functions for each feature.
k Nearest Neighbor (kNN)    Predicts the output by considering the majority or average outcome of the k closest data points in the feature space.
Linear Support Vector Machine (SVM)
Kernel SVM    Linear/non-linear kernel-based SVM models that find the optimal decision boundary.
Gaussian Naive Bayes
Bernoulli Naive Bayes
Multinomial Naive Bayes    Gaussian/Bernoulli/multinomial-based probabilistic classification algorithms based on Bayes' theorem.
Decision Tree    Grow a single decision tree that recursively splits the data into branches and nodes based on feature values.
Extra Tree    Train a series of completely randomized decision trees on sub-samples and average their predictions.
Random Forest    An ensemble of decision trees trained on random feature subsets, with the final prediction made by averaging or voting across all trees.
Histogram Gradient Boosting
LightGBM
XGBoost    Different implementations of gradient boosting decision trees.
Multi-layer Perceptron (MLP)    A feedforward neural network consisting of multiple layers of fully interconnected neurons.

Table 18
Regression Models.

Method Description

Lasso Regression    Linear regression with L1-/L2-regularization.
Ridge Regression
ElasticNet
Bayesian Ridge Regression    Conditional Bayesian linear regression with L2 regularization.
Automatic Relevance Determination (ARD) Regression    Employs a Bayesian framework to estimate the relevance weights and select relevant features.
Gaussian Process    A probabilistic model with a Gaussian prior.
Linear Regression    Same model architectures as those in the classification models, but tailored to regression tasks.
SGD
Adaboost
k Nearest Neighbor (kNN)
Linear SVM
SVM
Decision Tree
Extra Tree
Random Forest
Histogram Gradient Boosting
LightGBM
XGBoost

Appendix D. Experiment details

In the following, we present the code necessary to run the experiments demonstrated in Section 4 and provide brief descriptions for each code block. The scripts for all experiments are available in our GitHub repository.11

11 https://github.com/PanyiDong/InsurAutoML/tree/master/experiments.

D.1. French motor third-party liability

Listing 1 presents the code used to run the French Motor Third-Party Liability experiment. The first 12 lines (Lines 1-12) are dedicated to preparing for the experiment. Specifically, Line 5 specifies the evaluation metric, which is the mean Poisson deviance. Lines 7-10 define key parameters such as the experiment's random seed, evaluation budget, the number of candidates used to construct the ensemble model, and the time budget. Lines 15-16 handle the data loading process, while Lines 19-20 define the response variable and the set of features. The train/test split process is executed between Lines 26-34, where the split is determined based on the train set index generated by Listing 2, following the procedure outlined by Noll et al. (2020). Once the training is complete, predictions are made on both the train and test sets, with the train/test mean Poisson deviance being reported as indicated in Lines 53-56.

In the training setup, we assign the experiment name based on the evaluation budget, as indicated in Line 39. For instance, if the evaluation budget is set to 512, the experiment will be named freMTPL2freq_512.


1 import InsurAutoML
2 from InsurAutoML import load_data, AutoTabularRegressor
3 import numpy as np
4 import pandas as pd
5 from sklearn.metrics import mean_poisson_deviance
6
7 seed = 42
8 n_trials = 64
9 N_ESTIMATORS = 4
10 TIMEOUT = (n_trials / 4) * 450
11
12 InsurAutoML.set_seed(seed)
13
14 # load data
15 database = load_data(data_type = ".csv").load(path = "")
16 database_names = [*database]
17
18 # define response/features
19 response = "ClaimNb"
20 features = np.sort(list(
21 set(database["freMTPL2freq"].columns) - set(["IDpol", "ClaimNb"])
22 ))
23
24 # read train index & get test index
25 # python dataframe index starts from 0, but R starts from 1
26 train_index = np.sort(pd.read_csv("train_index.csv").values.flatten()) - 1
27 test_index = np.sort(
28 list(set(database["freMTPL2freq"].index) - set(train_index))
29 )
30 # train/test split
31 train_X, test_X, train_y, test_y = (
32 database["freMTPL2freq"].loc[train_index, features], database["freMTPL2freq"].loc[test_index, features],
33 database["freMTPL2freq"].loc[train_index, response], database["freMTPL2freq"].loc[test_index, response],
34 )
35
36
37 # fit AutoML model
38 mol = AutoTabularRegressor(
39 model_name = "freMTPL2freq_{}".format(n_trials),
40 n_estimators = N_ESTIMATORS,
41 max_evals = n_trials,
42 timeout = TIMEOUT,
43 validation=False,
44 search_algo="HyperOpt",
45 objective= mean_poisson_deviance,
46 cpu_threads = 12,
47 balancing = ["SimpleRandomOverSampling", "SimpleRandomUnderSampling"],
48 seed = seed,
49 )
50 mol.fit(train_X, train_y)
51
52
53 train_pred = mol.predict(train_X)
54 test_pred = mol.predict(test_X)
55
56 mean_poisson_deviance(train_y, train_pred), mean_poisson_deviance(test_y, test_pred)

Listing 1: French Motor Third-Party Liability Experiment Code.
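For reference, the mean Poisson deviance that Listing 1 uses as its objective is D = (2/n) Σ_i [ y_i log(y_i/ŷ_i) − y_i + ŷ_i ], with the convention y log(y/ŷ) := 0 when y = 0. The snippet below checks this formula against scikit-learn's implementation on a few illustrative values (not data from the experiment):

```python
import numpy as np
from scipy.special import xlogy
from sklearn.metrics import mean_poisson_deviance

y_true = np.array([0.0, 1.0, 2.0, 0.0, 3.0])   # claim counts, zeros allowed
y_pred = np.array([0.5, 0.8, 1.5, 0.2, 2.5])   # predictions must be strictly positive

# D = (2/n) * sum( y*log(y/yhat) - y + yhat ); xlogy handles the y = 0 case
manual = 2 * np.mean(xlogy(y_true, y_true / y_pred) - y_true + y_pred)

assert np.isclose(manual, mean_poisson_deviance(y_true, y_pred))
```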

1 RNGversion("3.5.0")
2 set.seed(100)
3 ll <- sample(c(1:nrow(freMTPL2freq)), round(0.9 * nrow(freMTPL2freq)), replace = FALSE)
4 write.csv(ll, "train_index.csv") # the train_index.csv generated in R is utilized in AutoML train/test split

Listing 2: Generation of train set index.

Upon completion of the training, a folder with the same name is created to store all the experiment results, along with a file of the same name containing the final ensemble model. The experiment configuration, including the components of the ensemble model, the evaluation budget, the time budget, the cross-validation methodology, the search algorithm, the evaluation metric, the number of parallel computing threads, the balancing algorithms, and the random seed, is defined in Lines 40-48. Specifically, we employ the HyperOpt search algorithm as described by Bergstra et al. (2013). Due to computational constraints, we limit the balancing algorithms to simple random over-sampling and under-sampling techniques.

Listing 3 presents the optimal hyperparameters identified through our AutoML experiments. All four top-performing pipelines employ ordinal encoding for data transformation. Given the absence of missing values, imputation processes are not necessary. Among the pipelines, two implement normalization, while the other two utilize quantile transformation. Three of the selected pipelines are based on Histogram Gradient Boosting regression tree models, whereas one employs LightGBM as the regression model. Although the selected pipelines employ similar preprocessing techniques and regression models, the chosen hyperparameters for each pipeline vary, with some differing significantly.


1 For pipeline 1:
2 Optimal encoding method is: DataEncoding
3 Optimal encoding hyperparameters:{’dummy_coding’: False}
4
5 Optimal imputation method is: no_processing
6 Optimal imputation hyperparameters:{}
7
8 Optimal balancing method is: SimpleRandomOverSampling
9 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9637986293893803}
10
11 Optimal scaling method is: QuantileTransformer
12 Optimal scaling hyperparameters:{}
13
14 Optimal feature selection method is: no_processing
15 Optimal feature selection hyperparameters:{}
16
17 Optimal regression model is: HistGradientBoostingRegressor
18 Optimal regression hyperparameters:{’early_stop’: ’valid’, ’l2_regularization’: 3.5597237113674115e-08, ’learning_rate’:
0.021408454901122625, ’loss’: ’squared_error’, ’max_bins’: 255, ’max_depth’: None, ’max_leaf_nodes’: 771, ’min_samples_leaf’:
2, ’n_iter_no_change’: 10, ’scoring’: ’loss’, ’tol’: 1e-07, ’validation_fraction’: 0.2533770905902288}
19
20 For pipeline 2:
21 Optimal encoding method is: DataEncoding
22 Optimal encoding hyperparameters:{’dummy_coding’: False}
23
24 Optimal imputation method is: no_processing
25 Optimal imputation hyperparameters:{}
26
27 Optimal balancing method is: SimpleRandomUnderSampling
28 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9903581841265658}
29
30 Optimal scaling method is: Normalize
31 Optimal scaling hyperparameters:{}
32
33 Optimal feature selection method is: no_processing
34 Optimal feature selection hyperparameters:{}
35
36 Optimal regression model is: HistGradientBoostingRegressor
37 Optimal regression hyperparameters:{’early_stop’: ’valid’, ’l2_regularization’: 2.0273936664656744e-09, ’learning_rate’:
0.02368640948373309, ’loss’: ’squared_error’, ’max_bins’: 255, ’max_depth’: None, ’max_leaf_nodes’: 165, ’min_samples_leaf’:
3, ’n_iter_no_change’: 19, ’scoring’: ’loss’, ’tol’: 1e-07, ’validation_fraction’: 0.318356865844186}
38
39 For pipeline 3:
40 Optimal encoding method is: DataEncoding
41 Optimal encoding hyperparameters:{’dummy_coding’: False}
42
43 Optimal imputation method is: no_processing
44 Optimal imputation hyperparameters:{}
45
46 Optimal balancing method is: SimpleRandomUnderSampling
47 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9661230677755623}
48
49 Optimal scaling method is: QuantileTransformer
50 Optimal scaling hyperparameters:{}
51
52 Optimal feature selection method is: no_processing
53 Optimal feature selection hyperparameters:{}
54
55 Optimal regression model is: LightGBM_Regressor
56 Optimal regression hyperparameters:{’boosting’: ’dart’, ’learning_rate’: 0.984884665575003, ’max_depth’: -1, ’min_data_in_leaf’: 2,
’n_estimators’: 54, ’num_iterations’: 53, ’num_leaves’: 3, ’objective’: ’poisson’, ’tree_learner’: ’voting’}
57
58 For pipeline 4:
59 Optimal encoding method is: DataEncoding
60 Optimal encoding hyperparameters:{’dummy_coding’: False}
61
62 Optimal imputation method is: no_processing
63 Optimal imputation hyperparameters:{}
64
65 Optimal balancing method is: SimpleRandomUnderSampling
66 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9856650936908325}
67
68 Optimal scaling method is: Normalize
69 Optimal scaling hyperparameters:{}
70
71 Optimal feature selection method is: extra_trees_preproc_for_regression
72 Optimal feature selection hyperparameters:{’bootstrap’: True, ’criterion’: ’friedman_mse’, ’max_depth’: None, ’max_features’:
0.9979378812336601, ’max_leaf_nodes’: None, ’min_samples_leaf’: 2, ’min_samples_split’: 8, ’min_weight_fraction_leaf’: 0.0, ’
n_estimators’: 100}


73
74 Optimal regression model is: HistGradientBoostingRegressor
75 Optimal regression hyperparameters:{’early_stop’: ’valid’, ’l2_regularization’: 0.5251286015868422, ’learning_rate’:
0.01514498873147682, ’loss’: ’squared_error’, ’max_bins’: 255, ’max_depth’: None, ’max_leaf_nodes’: 3, ’min_samples_leaf’: 1,
’n_iter_no_change’: 3, ’scoring’: ’loss’, ’tol’: 1e-07, ’validation_fraction’: 0.07374400131323675}

Listing 3: Optimal Hyperparameter settings in freMTPL2freq experiments.

1 import InsurAutoML
2 from InsurAutoML import load_data, AutoTabularRegressor
3 import numpy as np
4 from sklearn.metrics import r2_score
5
6 seed = 42
7 n_trials = 64
8 N_ESTIMATORS = 5
9 TIMEOUT = (n_trials / 4) * 450
10
11 InsurAutoML.set_seed(seed)
12
13 # load data
14 database = load_data(data_type = ".rdata").load(path = "")
15 database_names = [*database]
16
17 # define response/features
18 response = ["yAvgBC"]
19 features = [
20 ’TypeCity’, ’TypeCounty’, ’TypeMisc’, ’TypeSchool’, ’TypeTown’, ’TypeVillage’, ’IsRC’, ’CoverageBC’, ’lnDeductBC’,
21 ’NoClaimCreditBC’, ’CoverageIM’, ’lnDeductIM’, ’NoClaimCreditIM’, ’CoveragePN’, ’NoClaimCreditPN’, ’CoveragePO’,
22 ’NoClaimCreditPO’,’CoverageCN’, ’NoClaimCreditCN’, ’CoverageCO’, ’NoClaimCreditCO’
23 ]
24 # log transform of response
25 database["data"][response] = np.log(database["data"][response] + 1)
26 database["dataout"][response] = np.log(database["dataout"][response] + 1)
27 # log transform of coverage features
28 database["data"][["CoverageBC", "CoverageIM", "CoveragePN", "CoveragePO", "CoverageCN", "CoverageCO"]] = np.log(
29 database["data"][["CoverageBC", "CoverageIM", "CoveragePN", "CoveragePO", "CoverageCN", "CoverageCO"]] + 1
30 )
31 database["dataout"][["CoverageBC", "CoverageIM", "CoveragePN", "CoveragePO", "CoverageCN", "CoverageCO"]] = np.log(
32 database["dataout"][["CoverageBC", "CoverageIM", "CoveragePN", "CoveragePO", "CoverageCN", "CoverageCO"]] + 1
33 )
34
35 train_X, train_y = database["data"][features], database["data"][response]
36 test_X, test_y = database["dataout"][features], database["dataout"][response]
37
38 # fit AutoML model
39 mol = AutoTabularRegressor(
40 model_name = "LGPIF_{}".format(n_trials),
41 n_estimators = N_ESTIMATORS,
42 max_evals = n_trials,
43 timeout = TIMEOUT,
44 validation="KFold",
45 valid_size=0.2,
46 search_algo="HyperOpt",
47 objective= "R2",
48 cpu_threads = 12,
49 seed = seed,
50 )
51 mol.fit(train_X, train_y)
52
53 train_pred = mol.predict(train_X)
54 test_pred = mol.predict(test_X)
55 r2_score(train_y, train_pred), r2_score(test_y, test_pred)

Listing 4: Wisconsin Local Government Property Insurance Fund Experiment Code.
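A side note on the log(· + 1) transformations applied in Lines 24-33 of Listing 4: adding one before taking the log keeps zero-valued responses finite, the operation is numerically equivalent to `np.log1p`, and predictions on the log scale can be mapped back to the original scale with `np.expm1`. A small illustrative check (the values are arbitrary, not from the LGPIF data):

```python
import numpy as np

y = np.array([0.0, 10.0, 2500.0, 1.2e6])   # e.g. claim amounts, possibly zero

# log(y + 1) keeps zero losses finite; np.log1p is the numerically stable form
y_log = np.log(y + 1)
assert np.allclose(y_log, np.log1p(y))

# predictions on the log scale are mapped back with expm1
assert np.allclose(np.expm1(y_log), y)
```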

D.2. Wisconsin local government property insurance fund

Listing 4 is the code used to run the Wisconsin Local Government Property Insurance Fund experiments. The structure of this experiment follows the same setup as previously described in the French Motor Third-Party Liability experiment, including environment configuration, data loading, definition of features and response variables, AutoML fitting, and prediction generation. Unlike the French Motor Third-Party Liability experiment, the LGPIF dataset includes both in-sample and out-of-sample data, eliminating the need for a train/test split process. After reading the two Rdata files, we replicate the preprocessing steps outlined by Quan and Valdez (2018) by applying logarithmic transformations to the response variable and selected features.

In the training setup, we employ five-fold cross-validation, using validation as KFold with valid_size set to 0.2. Given the smaller data size, the restrictions on the balancing algorithms applied in the previous experiment are not enforced in this experiment. Additionally, the evaluation metric selected for this experiment is the R2 score.

Listing 5 presents the five optimal pipelines identified in the LGPIF experiments. With a stacking ensemble strategy in place, the top-performing pipelines share almost identical selections of preprocessing techniques and regression models. In particular, the Extra Trees regres-


1 For pipeline 1:
2 Optimal encoding method is: DataEncoding
3 Optimal encoding hyperparameters:{’dummy_coding’: False}
4
5 Optimal imputation method is: no_processing
6 Optimal imputation hyperparameters:{}
7
8 Optimal balancing method is: no_processing
9 Optimal balancing hyperparameters:{}
10
11 Optimal scaling method is: Winsorization
12 Optimal scaling hyperparameters:{}
13
14 Optimal feature selection method is: truncatedSVD
15 Optimal feature selection hyperparameters:{’target_dim’: 206}
16
17 Optimal regression model is: ExtraTreesRegressor
18 Optimal regression hyperparameters:{’bootstrap’: False, ’criterion’: ’squared_error’, ’max_depth’: None, ’max_features’:
0.6070310845755006, ’max_leaf_nodes’: None, ’min_impurity_decrease’: 0.0, ’min_samples_leaf’: 17, ’min_samples_split’: 7, ’
min_weight_fraction_leaf’: 0.0}
19
20 For pipeline 2:
21 Optimal encoding method is: DataEncoding
22 Optimal encoding hyperparameters:{’dummy_coding’: False}
23
24 Optimal imputation method is: no_processing
25 Optimal imputation hyperparameters:{}
26
27 Optimal balancing method is: no_processing
28 Optimal balancing hyperparameters:{}
29
30 Optimal scaling method is: Winsorization
31 Optimal scaling hyperparameters:{}
32
33 Optimal feature selection method is: truncatedSVD
34 Optimal feature selection hyperparameters:{’target_dim’: 207}
35
36 Optimal regression model is: ExtraTreesRegressor
37 Optimal regression hyperparameters:{’bootstrap’: False, ’criterion’: ’squared_error’, ’max_depth’: None, ’max_features’:
0.6132171875599337, ’max_leaf_nodes’: None, ’min_impurity_decrease’: 0.0, ’min_samples_leaf’: 17, ’min_samples_split’: 7, ’
min_weight_fraction_leaf’: 0.0}
38
39 For pipeline 3:
40 Optimal encoding method is: DataEncoding
41 Optimal encoding hyperparameters:{’dummy_coding’: False}
42
43 Optimal imputation method is: no_processing
44 Optimal imputation hyperparameters:{}
45
46 Optimal balancing method is: Smote_TomekLink
47 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.997815634129791, ’k’: 8}
48
49 Optimal scaling method is: Winsorization
50 Optimal scaling hyperparameters:{}
51
52 Optimal feature selection method is: truncatedSVD
53 Optimal feature selection hyperparameters:{’target_dim’: 226}
54
55 Optimal regression model is: ExtraTreesRegressor
56 Optimal regression hyperparameters:{’bootstrap’: False, ’criterion’: ’squared_error’, ’max_depth’: None, ’max_features’:
0.6204531346722685, ’max_leaf_nodes’: None, ’min_impurity_decrease’: 0.0, ’min_samples_leaf’: 17, ’min_samples_split’: 8, ’
min_weight_fraction_leaf’: 0.0}
57
58 For pipeline 4:
59 Optimal encoding method is: DataEncoding
60 Optimal encoding hyperparameters:{’dummy_coding’: False}
61
62 Optimal imputation method is: no_processing
63 Optimal imputation hyperparameters:{}
64
65 Optimal balancing method is: no_processing
66 Optimal balancing hyperparameters:{}
67
68 Optimal scaling method is: Winsorization
69 Optimal scaling hyperparameters:{}
70
71 Optimal feature selection method is: select_rates_regression
72 Optimal feature selection hyperparameters:{’alpha’: 0.20903223117422198, ’mode’: ’fdr’, ’score_func’: ’f_regression’}
73


74 Optimal regression model is: HistGradientBoostingRegressor
75 Optimal regression hyperparameters:{’early_stop’: ’valid’, ’l2_regularization’: 5.571785756489699e-09, ’learning_rate’:
0.030169889527745974, ’loss’: ’squared_error’, ’max_bins’: 255, ’max_depth’: None, ’max_leaf_nodes’: 4, ’min_samples_leaf’:
14, ’n_iter_no_change’: 5, ’scoring’: ’loss’, ’tol’: 1e-07, ’validation_fraction’: 0.08383521405637624}
76
77 For pipeline 5:
78 Optimal encoding method is: DataEncoding
79 Optimal encoding hyperparameters:{’dummy_coding’: False}
80
81 Optimal imputation method is: no_processing
82 Optimal imputation hyperparameters:{}
83
84 Optimal balancing method is: no_processing
85 Optimal balancing hyperparameters:{}
86
87 Optimal scaling method is: Winsorization
88 Optimal scaling hyperparameters:{}
89
90 Optimal feature selection method is: truncatedSVD
91 Optimal feature selection hyperparameters:{’target_dim’: 219}
92
93 Optimal regression model is: ExtraTreesRegressor
94 Optimal regression hyperparameters:{’bootstrap’: False, ’criterion’: ’squared_error’, ’max_depth’: None, ’max_features’:
0.6352854656316334, ’max_leaf_nodes’: None, ’min_impurity_decrease’: 0.0, ’min_samples_leaf’: 18, ’min_samples_split’: 13, ’
min_weight_fraction_leaf’: 0.0}

Listing 5: Optimal Hyperparameter settings in LGPIF experiments.

1 import pandas as pd
2 import InsurAutoML
3 from InsurAutoML import load_data, AutoTabular
4 from InsurAutoML.utils import train_test_split
5
6 seed = 42
7 n_trials = 128
8 N_ESTIMATORS = 4
9 TIMEOUT = (n_trials / 4) * 450
10
11 InsurAutoML.set_seed(seed)
12
13 # load data
14 database = load_data(data_type = ".csv").load(path = "")
15 database_names = [*database]
16
17 # define response/features
18 response = "ClaimOcc"
19 features = list(
20 set(database["ausprivauto"].columns) - set(["ClaimOcc", "ClaimNb", "ClaimAmount"])
21 )
22 features.sort()
23
24 # train/test split
25 train_X, test_X, train_y, test_y = train_test_split(
26 database[’ausprivauto’][features], database[’ausprivauto’][[response]], test_perc = 0.1, seed = seed
27 )
28 pd.DataFrame(train_X.index.sort_values()).to_csv("train_index.csv", index=False)
29
30 # fit AutoML model
31 mol = AutoTabular(
32 model_name="ausprivauto_occ_{}".format(n_trials),
33 max_evals=n_trials,
34 n_estimators=N_ESTIMATORS,
35 timeout=TIMEOUT,
36 validation="KFold",
37 valid_size=0.25,
38 search_algo="Optuna",
39 objective="AUC",
40 cpu_threads=12,
41 seed=seed,
42 )
43 mol.fit(train_X, train_y)
44
45 from sklearn.metrics import roc_auc_score
46
47 y_train_pred = mol.predict_proba(train_X)
48 y_test_pred = mol.predict_proba(test_X)
49 roc_auc_score(train_y.values, y_train_pred["class_1"].values), roc_auc_score(test_y.values, y_test_pred["class_1"].values)

Listing 6: Australian Automobile Insurance Experiment Code.


1 For pipeline 1:
2 Optimal encoding method is: DataEncoding
3 Optimal encoding hyperparameters:{’dummy_coding’: False}
4
5 Optimal imputation method is: no_processing
6 Optimal imputation hyperparameters:{}
7
8 Optimal balancing method is: CNN_TomekLink
9 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9277384882773799}
10
11 Optimal scaling method is: PowerTransformer
12 Optimal scaling hyperparameters:{’method’: ’yeo-johnson’}
13
14 Optimal feature selection method is: select_rates_classification
15 Optimal feature selection hyperparameters:{’alpha’: 0.29938397321922683, ’mode’: ’fwe’, ’score_func’: ’f_classif’}
16
17 Optimal classification model is: QDA
18 Optimal classification hyperparameters:{’reg_param’: 0.05294078667761132}
19
20 For pipeline 2:
21 Optimal encoding method is: DataEncoding
22 Optimal encoding hyperparameters:{’dummy_coding’: False}
23
24 Optimal imputation method is: no_processing
25 Optimal imputation hyperparameters:{}
26
27 Optimal balancing method is: CNN_TomekLink
28 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9221903451264218}
29
30 Optimal scaling method is: PowerTransformer
31 Optimal scaling hyperparameters:{’method’: ’yeo-johnson’}
32
33 Optimal feature selection method is: select_rates_classification
34 Optimal feature selection hyperparameters:{’alpha’: 0.29698209545431653, ’mode’: ’fwe’, ’score_func’: ’f_classif’}
35
36 Optimal classification model is: QDA
37 Optimal classification hyperparameters:{’reg_param’: 0.043945988174408424}
38
39 For pipeline 3:
40 Optimal encoding method is: DataEncoding
41 Optimal encoding hyperparameters:{’dummy_coding’: False}
42
43 Optimal imputation method is: no_processing
44 Optimal imputation hyperparameters:{}
45
46 Optimal balancing method is: CNN_TomekLink
47 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.9243338187985631}
48
49 Optimal scaling method is: PowerTransformer
50 Optimal scaling hyperparameters:{’method’: ’yeo-johnson’}
51
52 Optimal feature selection method is: select_rates_classification
53 Optimal feature selection hyperparameters:{’alpha’: 0.30437601126836655, ’mode’: ’fwe’, ’score_func’: ’f_classif’}
54
55 Optimal classification model is: QDA
56 Optimal classification hyperparameters:{’reg_param’: 0.055735061547359833}
57
58 For pipeline 4:
59 Optimal encoding method is: DataEncoding
60 Optimal encoding hyperparameters:{’dummy_coding’: False}
61
62 Optimal imputation method is: no_processing
63 Optimal imputation hyperparameters:{}
64
65 Optimal balancing method is: CNN_TomekLink
66 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.8982560884166318}
67
68 Optimal scaling method is: PowerTransformer
69 Optimal scaling hyperparameters:{’method’: ’yeo-johnson’}
70
71 Optimal feature selection method is: select_rates_classification
72 Optimal feature selection hyperparameters:{’alpha’: 0.2982712022144762, ’mode’: ’fwe’, ’score_func’: ’f_classif’}
73
74 Optimal classification model is: QDA
75 Optimal classification hyperparameters:{’reg_param’: 0.05729605531950389}
76
77 For pipeline 5:
78 Optimal encoding method is: DataEncoding
79 Optimal encoding hyperparameters:{’dummy_coding’: False}


80
81 Optimal imputation method is: no_processing
82 Optimal imputation hyperparameters:{}
83
84 Optimal balancing method is: CNN_TomekLink
85 Optimal balancing hyperparameters:{’imbalance_threshold’: 0.912043608305704}
86
87 Optimal scaling method is: PowerTransformer
88 Optimal scaling hyperparameters:{’method’: ’yeo-johnson’}
89
90 Optimal feature selection method is: select_rates_classification
91 Optimal feature selection hyperparameters:{’alpha’: 0.28567583661991386, ’mode’: ’fwe’, ’score_func’: ’f_classif’}
92
93 Optimal classification model is: QDA
94 Optimal classification hyperparameters:{’reg_param’: 0.05069185745224875}

Listing 7: Optimal Hyperparameter settings in ausprivauto experiments.

sion model is frequently chosen in four of the selected pipelines, paired with Winsorization to cap extreme feature values.

D.3. Australian automobile insurance

The code for running the Australian Automobile Insurance experiment is detailed in Listing 6. As this is a classification task, we utilize AutoTabular with automatic task type selection. Initially, we perform a 90/10 train/test split for the first experiment and reuse the generated train set index for subsequent experiments. The experimental setup involves four-fold cross-validation, employing the Optuna (Akiba et al., 2019) search algorithm and using AUC as the evaluation metric.

For ausprivauto experiments, as illustrated in Listing 7, all five optimal pipelines consistently employ the same preprocessing methods and classification models.

Data availability

Data will be made available on request.

References

Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M., 2019. Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, pp. 2623–2631.
Azur, M.J., Stuart, E.A., Frangakis, C., Leaf, P.J., 2011. Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res. 20 (1), 40–49.
Bakhteev, O.Y., Strijov, V.V., 2020. Comprehensive analysis of gradient-based hyperparameter optimization algorithms. Ann. Oper. Res. 289 (1), 51–65.
Bams, D., Lehnert, T., Wolff, C.C.P., 2009. Loss functions in option valuation: a framework for selection. Manag. Sci. 55 (5), 853–862.
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C., 2004. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6 (1), 20–29.
Bergstra, J., Yamins, D., Cox, D., 2013. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: Dasgupta, S., McAllester, D. (Eds.), Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28. PMLR, pp. 115–123.
Chandrashekar, G., Sahin, F., 2014. A survey on feature selection methods. Comput. Electr. Eng. 40 (1), 16–28.
Charpentier, A., 2015. Computational actuarial science with R. J. R. Stat. Soc., Ser. A, Stat. Soc. 178 (3), 782–783.
Charpentier, A., Élie, R., Remlinger, C., 2023. Reinforcement learning in economics and finance. Comput. Econ. 62 (1), 425–462.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16 (1), 321–357.
Dong, X., Yu, Z., Cao, W., Shi, Y., Ma, Q., 2020. A survey on ensemble learning. Front. Comput. Sci. 14, 241–258.
Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., Smola, A., 2020. Autogluon-tabular: robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505.
Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M., Hutter, F., 2022. Auto-Sklearn 2.0: hands-free AutoML via meta-learning. J. Mach. Learn. Res. 23 (261), 1–61.
Feurer, M., Klein, A., Jost, K.E., Springenberg, T., Blum, M., Hutter, F., 2015. Efficient and robust automated machine learning. Adv. Neural Inf. Process. Syst. 28.
Frees, E.W., Lee, G., Yang, L., 2016. Multivariate frequency-severity regression models in insurance. Risks 4 (1), 4.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F., 2012. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern., Part C, Appl. Rev. 42 (4), 463–484.
Gan, G., Valdez, E.A., 2024. Compositional data regression in insurance with exponential family PCA. Variance 17 (1).
García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J.M., Herrera, F., 2016. Big data preprocessing: methods and prospects. Big Data Anal. 1, 1–22.
Guerra, P., Castelli, M., 2021. Machine learning applied to banking supervision a literature review. Risks 9 (7), 136.
Guo, H., Li, Y., Shang, J., Gu, M., Huang, Y., Gong, B., 2017. Learning from class-imbalanced data: review of methods and applications. Expert Syst. Appl. 73, 220–239.
Hart, P.E., 1968. The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14 (3), 515–516.
Hartman, B., Owen, R., Gibbs, Z., 2020. Predicting high-cost health insurance members through boosted trees and oversampling: an application using the HCCI database. N. Am. Actuar. J. 25 (1), 53–61.
He, H., Garcia, E.A., 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21 (9), 1263–1284.
He, X., Zhao, K., Chu, X., 2021. AutoML: a survey of the state-of-the-art. Knowl.-Based Syst. 212, 106622.
Hodge, V., Austin, J., 2004. A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 85–126.
Hu, C., Quan, Z., Chong, W.F., 2022. Imbalanced learning for insurance using modified loss functions in tree-based models. Insur. Math. Econ. 106, 13–32.
Jeong, H., 2024. Tweedie multivariate semi-parametric credibility with the exchangeable correlation. Insur. Math. Econ. 115, 13–21.
Jordan, M.I., Mitchell, T.M., 2015. Machine learning: trends, perspectives, and prospects. Science 349 (6245), 255–260.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y., 2017. LightGBM: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems, vol. 30.
Kononenko, I., 2001. Machine learning for medical diagnosis: history, state of the art and perspective. Artif. Intell. Med. 23 (1), 89–109.
Lall, S., Sinha, D., Ghosh, A., Sengupta, D., Bandyopadhyay, S., 2021. Stable feature selection using copula based mutual information. Pattern Recognit. 112, 107697.
LeDell, E., Poirier, S., 2020. H2O AutoML: scalable automatic machine learning. In: 7th ICML Workshop on Automated Machine Learning (AutoML).
Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J.E., Stoica, I., 2018. Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P., 2017. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
Ma, L., Sun, B., 2020. Machine learning and AI in marketing – connecting computing power to human insights. Int. J. Res. Mark. 37 (3), 481–504.
Chen, T., Guestrin, C., 2016. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. ACM, pp. 785–794. Masello, L., Castignani, G., Sheehan, B., Guillen, M., Murphy, F., 2023. Using contextual
Cummings, J., Hartman, B., 2022. Using machine learning to better model long-term care data to predict risky driving events: a novel methodology from explainable artificial
insurance claims. N. Am. Actuar. J. 26 (3), 470–483. intelligence. Accid. Anal. Prev. 184, 106997.
De Jong, P., Heller, G.Z., et al., 2008. Generalized Linear Models for Insurance Data. Cam- Mitchell, T., Buchanan, B., Dejong, G., Dietterich, T., Rosenbloom, P., Waibel, A., 1990.
bridge University Press. Machine learning. Annu. Rev. Comput. Sci. 4, 417–433.

Noll, A., Salzmann, R., Wuthrich, M.V., 2020. Case study: French motor third-party liability claims. Available at SSRN: https://ssrn.com/abstract=3164764 or http://dx.doi.org/10.2139/ssrn.3164764.
Okine, A.N.-A., Frees, E.W., Shi, P., 2022. Joint model prediction and application to individual-level loss reserving. ASTIN Bull. 52 (1), 91–116.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., pp. 8026–8037.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12 (85), 2825–2830.
Peiris, H., Jeong, H., Kim, J.-K., Lee, H., 2024. Integration of traditional and telematics data for efficient insurance claims prediction. ASTIN Bull. 54 (2), 263–279.
Peng, H., Long, F., Ding, C., 2005. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27 (8), 1226–1238.
Polikar, R., 2006. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6 (3), 21–45.
Qayyum, A., Qadir, J., Bilal, M., Al-Fuqaha, A., 2020. Secure and robust machine learning for healthcare: a survey. IEEE Rev. Biomed. Eng. 14, 156–180.
Quan, Z., Hu, C., Dong, P., Valdez, E.A., 2024. Improving business insurance loss models by leveraging InsurTech innovation. N. Am. Actuar. J., 1–28.
Quan, Z., Valdez, E.A., 2018. Predictive analytics of insurance claims using multivariate decision trees. Depend. Model. 6 (1), 377–407.
Quan, Z., Wang, Z., Gan, G., Valdez, E.A., 2023. On hybrid tree-based methods for short-term insurance claims. Probab. Eng. Inf. Sci. 37 (2), 597–620.
Rapin, J., Teytaud, O., 2018. Nevergrad – a gradient-free optimization platform. https://GitHub.com/FacebookResearch/Nevergrad.
Sagi, O., Rokach, L., 2018. Ensemble learning: a survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8 (4), e1249.
Salehi, S.S.M., Erdogmus, D., Gholipour, A., 2017. Tversky loss function for image segmentation using 3D fully convolutional deep networks. In: Wang, Q., Shi, Y., Suk, H.-I., Suzuki, K. (Eds.), Machine Learning in Medical Imaging. Springer International Publishing, pp. 379–387.
Servén, D., Brummitt, C., 2018. pyGAM: generalized additive models in Python. https://github.com/dswah/pyGAM.
Shi, P., Zhang, W., Shi, K., 2024. Leveraging weather dynamics in insurance claims triage using deep learning. J. Am. Stat. Assoc. 119 (546), 825–838.
Si, J., He, H., Zhang, J., Cao, X., 2022. Automobile insurance claim occurrence prediction model based on ensemble learning. Appl. Stoch. Models Bus. Ind. 38 (6), 1099–1112.
Snoek, J., Larochelle, H., Adams, R.P., 2012. Practical Bayesian optimization of machine learning algorithms. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (Eds.), Advances in Neural Information Processing Systems, vol. 25. Curran Associates, Inc.
So, B., 2024. Enhanced gradient boosting for zero-inflated insurance claims and comparative analysis of CatBoost, XGBoost, and LightGBM. Scand. Actuar. J., 1–23.
So, B., Boucher, J.-P., Valdez, E.A., 2021. Cost-sensitive multi-class AdaBoost for understanding driving behavior based on telematics. ASTIN Bull. 51 (3), 719–751.
So, B., Valdez, E.A., 2024. SAMME.C2 algorithm for imbalanced multi-class classification. Soft Comput. 28, 9387–9404.
Somol, P., Pudil, P., Novovičová, J., Paclík, P., 1999. Adaptive floating search methods in feature selection. Pattern Recognit. Lett. 20 (11–13), 1157–1163.
Stekhoven, D.J., Bühlmann, P., 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28 (1), 112–118.
Tan, F., Fu, X., Zhang, Y., Bourgeois, A.G., 2007. A genetic algorithm-based method for feature subset selection. Soft Comput. 12, 111–120.
Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K., 2013. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, pp. 847–855.
Tomek, I., 1976. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. SMC-6 (11), 769–772.
Turcotte, R., Boucher, J.-P., 2024. GAMLSS for longitudinal multivariate claim count models. N. Am. Actuar. J. 28 (2), 337–360.
Wang, Q., Ma, Y., Zhao, K., Tian, Y., 2022. A comprehensive survey of loss functions in machine learning. Ann. Data Sci. 9, 187–212.
Wilson, D.L., 1972. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. SMC-2 (3), 408–421.
Wu, J., Chen, X.-Y., Zhang, H., Xiong, L.-D., Lei, H., Deng, S.-H., 2019. Hyperparameter optimization for machine learning models based on Bayesian optimization. J. Electron. Sci. Technol. 17 (1), 26–40.
Wüthrich, M.V., 2019. From generalized linear models to neural networks, and back. Technical report. Department of Mathematics, ETH Zurich.
Yang, L., Shami, A., 2020. On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing 415, 295–316.
Yoon, J., Jordon, J., Schaar, M.V.D., 2018. GAIN: missing data imputation using generative adversarial nets. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, pp. 5689–5698.
Young, S.R., Rose, D.C., Karnowski, T.P., Lim, S.-H., Patton, R.M., 2015. Optimizing deep learning hyper-parameters through an evolutionary algorithm. In: Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. MLHPC '15. Association for Computing Machinery, New York, NY, USA, pp. 1–5.
Zhang, Y., Ji, L., Aivaliotis, G., Taylor, C., 2024. Bayesian CART models for insurance claims frequency. Insur. Math. Econ. 114, 108–131.
Zöller, M.-A., Huber, M.F., 2021. Benchmark and survey of automated machine learning frameworks. J. Artif. Intell. Res. 70, 409–472.
