
Received 9 March 2024, accepted 18 April 2024, date of publication 24 April 2024, date of current version 13 May 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3393154

Financial Fraud Detection Using Value-at-Risk With Machine Learning in Skewed Data

ABDULLAHI UBALE USMAN 1,3, SUNUSI BALA ABDULLAHI 2 (Member, IEEE), YU LIPING 1, BAYAN ALGHOFAILY 4, AHMED S. ALMASOUD 4, AND AMJAD REHMAN 4 (Senior Member, IEEE)


1 School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou 310018, China
2 Department of Electronics and Telecommunication Engineering, Faculty of Engineering, King Mongkut’s University of Technology Thonburi, Bang Mod,
Thrung Khru, Bangkok 10140, Thailand
3 Department of Statistics, Kano University of Science and Technology, Wudil 713281, Nigeria
4 College of Computer & Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia

Corresponding author: Yu Liping (yvliping@zjgsu.edu.cn)

ABSTRACT The significant losses that banks and other financial organizations suffer due to new bank
account (NBA) fraud are alarming as the number of online banking service users increases. The inherent
skewness and rarity of NBA fraud instances, which arise when non-fraud instances far outnumber fraud
instances, have been a major challenge for machine learning (ML) models, leading them to overlook
fraud and erroneously classify fraud instances as non-fraud. Such errors can erode the confidence and
trust of customers. Existing studies consider fraud patterns instead of potential losses of NBA fraud risk
features while addressing the skewness of fraud datasets. The detection of NBA fraud is proposed in this
research within the context of value-at-risk as a risk measure that considers fraud instances as a worst-case
scenario. Value-at-risk uses historical simulation to estimate potential losses of risk features and model them
as a skewed tail distribution. The risk-return features obtained from value-at-risk were classified using ML
on the bank account fraud (BAF) Dataset. The value-at-risk handles the fraud skewness using an adjustable
threshold probability range to attach weight to the skewed NBA fraud instances. A novel detection rate (DT)
metric that considers risk fraud features was used to measure the performance of the fraud detection model.
An improved fraud detection model is achieved using a K-nearest neighbor with a true positive (TP) rate of
0.95 and a DT rate of 0.9406. Under an acceptable loss tolerance in the banking sector, value-at-risk presents
an intelligent approach for establishing data-driven criteria for fraud risk management.

INDEX TERMS Detection rate, fraud detection, K-nearest neighbor, skewed instances, value-at-risk.

The associate editor coordinating the review of this manuscript and approving it for publication was Chao Tong.

VOLUME 12, 2024. © 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

A. U. Usman et al.: Financial Fraud Detection Using Value-at-Risk With Machine Learning in Skewed Data

I. INTRODUCTION
The Association of Certified Fraud Examiners (ACFE) released a financial fraud report in 2022 stating that 2,110 fraud cases involving industries in financial sectors in 133 countries resulted in losses of around $3.6 billion [1]. Financial fraud can be defined as the deliberate use of unlawful procedures or tactics to obtain financial gain [2]. The consequences of financial fraud can potentially disrupt economies, raise living expenses, and undermine consumer confidence [3]. Forms of financial fraud include insurance fraud, money laundering, new bank account fraud, credit and debit card fraud, mortgage fraud, and many more [4], [5]. The act of opening an account to commit fraud at banks or other financial organizations is known as "new bank account (NBA) fraud" [6]. Fraud not only results in immediate financial losses and erodes public confidence in institutions, but also has broader consequences, affecting customers and financial systems through market instability and contributing to larger macroeconomic downturns [7]. Fraud datasets typically exhibit properties including skewness, evolving patterns, high dimensionality, and restricted access to relevant information. Specifically, fraud skewness, in which the non-fraud class heavily outnumbers the fraud class, has been a major concern in studies, as it affects the performance of fraud detection models. The skewed fraud instances can
have a bad influence on machine learning algorithms such as distance-based algorithms [8]. Previous efforts in tackling fraud involve developing rule-based expert systems, statistical methods, machine learning, and risk-based methods [9], [10]. Due to the cost of maintenance and the inefficiency of rule-based methods [10], decision-makers turned to statistical methods such as autoregressive models to handle financial fraud [11], [12], [13]. The complex patterns and high-dimensional nature of fraud make statistical methods less effective; as such, machine learning models were deployed [10], [14]. However, some of the studies that utilize machine learning techniques were found to have a high False Positive (FP) rate [15], [16], [17]. Machine learning models can potentially handle high-dimensional data and complex patterns of fraud instances.

To evaluate the effectiveness of machine learning models, Jesus et al. [18] presented the first domain-specific, real-world bank account fraud (BAF) dataset. The datasets were generated using generative adversarial networks (GANs) and evaluated using the light gradient boosting method (LGBM). The studies [18], [19] utilize 25 sets of hyperparameter configurations to optimize the LGBM model; utility-aware reweighing was used to handle the class skewness of the BAF dataset. The study [15] utilizes stacking in ensemble learning with majority voting to evaluate the BAF dataset and address the changing fraud patterns. The study [20] uses federated learning to address data privacy issues of the BAF dataset and deep neural networks to classify fraud instances. These studies achieve good performance in addressing BAF challenges; however, they do not consider the potential losses of fraud risk features. To our knowledge, little research exists that employs machine learning techniques in NBA fraud detection. The detection of NBA fraud is proposed in this paper within the context of risk management, using value-at-risk to consider skewed fraud instances as a worst-case scenario. To adequately estimate the losses of fraud risks, value-at-risk was augmented with the expected loss and expected shortfall of frauds, which further quantify the mean and extreme loss effects, respectively. Combining these risk measures allows the quantification of risks across mean, worst-case, and extreme scenarios. Value-at-risk employs historical simulation to estimate potential losses of risk features. The risk-return features obtained from value-at-risk are based on assessing their exposure to fraud risk. The risk-return features are sent as input to the NBA fraud detection model. Different machine learning models were trained; however, the K-nearest neighbor outperformed the other models. The contributions of this paper are:

• This paper used an extreme value theorem to model the tails (potential losses) instead of the fraud pattern.
• This paper used value-at-risk to model the skewness of fraud instances more efficiently.
• This paper utilized historical simulation to estimate value-at-risk as it makes no assumptions on any distribution.
• This paper used a novel detection rate performance metric that incorporates risk fraud factors to capture the overall performance in the detection of NBA fraud instances.

The remainder of the paper is arranged as follows. The study's review of the literature is presented in Section II. The problem definition is presented in Section III. The materials and procedures are presented in Section IV. The experimental setup is presented in Section V. The results are presented in Section VI. The study's conclusions and discussions are presented in Section VII.

II. LITERATURE REVIEW
This section presents related studies in financial fraud detection. Different studies exist that utilize both statistical and artificial intelligence-based methods in the context of a risk and financial fraud perspective.

A. STATISTICAL METHODS OF FRAUD DETECTION
Many studies in the literature utilize statistical methods in evaluating financial fraud. Specifically, significant studies were found to utilize ordinary least squares (OLS) regression and autoregressive (AR) models for financial fraud evaluation. Using the Tehran Stock Exchange dataset, the study [21] uses a regression model to investigate the association between auditor characteristics and fraud detection in emerging economies. The authors provide useful information for improving the reliability of the findings. Using pooled OLS and panel regressions, the study [22] investigates the effect of political alignment on corporate fraud convictions, offering insights into the connection between politics and fraud. The authors use state-level data from 2003 to 2018 on US corporate fraud convictions and party affiliation. The study [23] utilizes OLS to investigate financial factors of financial fraud, which is attributed to the fraud triangle. The study [24] uses logistic regression to discover that external pressures and financial stability had a favorable impact on financial reporting fraud. On the other hand, collaboration, arrogance, changes in directors, incompetent oversight, and hubris have little bearing on false financial reporting. The study [25] provides evidence for the contribution of gender diversity to fraud commission and detection in Chinese listed businesses between 2007 and 2018 using a bivariate probit model. The authors opined that female corporate executives are linked to a stronger ability to detect fraud, which lowers the likelihood that businesses commit fraud. From the standpoint of external auditors, the study [26] sheds light on the causes of fraud and the function of forensic accounting, using regression analysis to analyze Lebanese data. The study [4] discovered that while the overall number of employees engaged in fraud affects the performance of money banks in Nigeria, the number of fraud cases and the total amount lost to fraud had a favorable influence. The authors' use of statistical methods such as OLS regression, Pearson correlation, and descriptive analysis strengthens their findings. The sales growth index and the depreciation index factors
that make up the M-score are used in the study [27] to analyze the possibility of profit management using the Athens Stock Exchange Market. It is pertinent to know that a large body of literature exists that utilizes the AR model. To handle large-scale non-uniform transactions more quickly, the authors of [12] employ the AR model, which makes it appropriate for detecting money laundering operations. The study [11] uses factor analysis to generate a composite indicator, and fractional integration (ARFIMA) and fractional cointegration VAR (FCVAR) approaches to evaluate the behavior of the composite suspicion tax fraud indicator with respect to GDP and tax collection. The study [13] employs the AR model, which is appropriate for studying networks with such topologies, and applies it to the detection of financial transaction fraud since it considers the block-wise structure of networks. The authors discovered that, in line with reality, there is a risk relationship between fraudulent groups and ordinary loan applicants. The study [28] outlined specific identification indicators that help with the detection of financial fraud using digital distribution laws, and the authors demonstrate that the probability of financial fraud increases significantly as the deviation of the financial data distribution from Benford's law increases.

In summary, a large body of literature uses statistical methods to analyze the causes and effects that influence financial fraud, but due to the complex nature and scalability of fraud, statistical methods are not enough to adequately examine financial fraud.

B. RISK-BASED METHODS OF FRAUD DETECTION
This section presents the financial fraud assessment from the perspective of risk mitigation. The existing studies utilize different risk measures such as value-at-risk (VaR), expected loss, and expected shortfall to assess the level of risk of fraud. The study [29] offers strategies for breaking down the risk of fraud, identifying potential fraudsters, and enabling more targeted anti-fraud measures by tying together the motivation of the fraud triangle, the human tendencies that lead to specific actions, and the meta-model of fraud. Regression analysis is utilized in the study [30] to look at how enterprises manage risk, to determine how control environments, risk assessments, control activities, information and communication, and monitoring contributed to fraud prevention and detection efforts in Indonesian firms. The study [31] defined additional security attributes that might have an impact on the cloud system and carried out anomaly detection based on risk assessment, named parallel processing (PP), that covers cyber threats and exploitation likelihoods. The model checker is then employed to determine the risk exposure rates associated with the respective attacks. The study [32] proposes a framework in which doubly-truncated severity distributions are used to estimate the operational risk, and which includes database construction and risk modeling. By applying value-at-risk and expected shortfall to identify operational risk sources like external fraud risk and legal risk sections, the authors were able to produce better and consistent results. The study [33] uses the number of compromised records to determine the cost of a data breach; the findings indicate that the total number of affected records has a Fréchet distribution, and random forest is used for estimating the number of such records. The study [34] uses estimates of generalized extreme value parameters to evaluate competency, digital technology abilities, and personality qualities that may improve the ability of external auditors to identify fraud risk; the efficiency of fraud risk assessment was linked to digital technology abilities through the application of the partial least-squares structural equation model (PLS-SEM). The study [35] identified a positive correlation between fraud risk assessment and management and the efficient use of forensic accounting using chi-square, Fisher test, and correlation; however, there is no relationship between fraud risk assessment and management in terms of techniques causing fraud. The study [9] examines fraud using ensemble learners for anomaly detection that also handle data skewness, a triage model that receives input from the ensemble model, and a risk model that estimates the financial losses. The authors successfully provide an effective risk-based fraud detection, from machine learning techniques to risk assessment, but do not evaluate fraud detection by first considering the risk component before subjecting it to machine learning detection.

In summary, risk measures are good in the assessment and management of the features associated with fraud for effective fraud prevention and control. However, due to the nonlinearity, high dimension, and complex nature of fraud, these risk measures need to be augmented with other techniques such as machine learning techniques that enable proper and efficient fraud prevention and detection.

C. MACHINE LEARNING METHODS IN FRAUD DETECTION
This section presents studies that utilize machine learning techniques for the classification of fraud applications. The majority of the presented studies consider detection while addressing the skewed nature of fraud instances. Sampling methods, hybrid methods, and other novel methods are mostly used to overcome the skewed nature of fraud datasets. The study [36] addresses class skewness in credit card fraud using quantum machine learning (QML) and support vector machines (SVM). The results show that classic machine learning techniques are still useful for non-time series data, whereas QML applications can be used for time-series-based and highly skewed data. A quantum neural network (QNN) achieves good performance in fraud detection in the study [37]. The study [38] trained different machine learning models, all of which used default implementations and parameters; XGBoost performed more accurately than any of the other models. Telecom fraud detection is assessed in the study [39] using a dynamic graph neural network (DGNN); the authors present a method for resolving the issue of telecom fraud detection in extensive phone social networks. To assess credit card fraud while
considering the skewness of fraud instances, the study [40] makes use of logistic regression (LR), K-nearest neighbor (KNN), decision tree (DT), random forest (RF), and autoencoder (AE), as they can handle skewed data better than other models; the AE model performs better than the others. KNN, linear discriminant analysis (LDA), and linear regression are used in the study [41] to investigate credit card fraud; by addressing the skewed nature of the credit card fraud data and using cross-validation techniques, KNN showed higher performance. Using an ARIMA model for fraud detection based on daily transaction counts, the study [14] carried out anomaly detection; the model is contrasted with four industry-standard anomaly detection algorithms: the box plot, isolation forest (IF), local outlier factor (LOF), and K-means models. An ensemble classifier (EC) [42] incorporating bagging and boosting has been used to address the issue of fraud class skewness; the approach is found to perform better than current methods. The study [43] addresses the issue of skewed datasets by using fuzzy C-means clustering and the selection of related instances. The authors address the issues with conventional under-sampling strategies to enhance the detection performance and accuracy. To identify fraudulent transactions, the study [44] suggested an LSTM ensemble; SMOTE-ENN was used to address the problem of fraud skewness. The method outperformed other algorithms, but the SMOTE method may occasionally produce instances that are not typical of the minority class. A dynamic ensemble technique [45] for anomaly identification in Internet of Things systems is proposed. To address the issue of fraud skewness, the borderline synthetic minority over-sampling approach (Borderline-SMOTE), One-Sided Selection (OSS), and adaptive synthetic (ADASYN) sampling were applied in the study [46]; OSS was found to be the optimal under-sampling technique, and ADASYN performs better when employing the gradient tree boosting (GTB) classifier. The random forest ensemble approach [47] performed exceptionally well with oversampling and under-sampling. However, under-sampling usually leads to the loss of important information, while oversampling brings in information that may not be fully representative of the training set.

It is widely acknowledged that the skewed distribution of fraud instances presents a significant challenge for many machine learning models. The resampling techniques that have been used for fraud skewness mitigation are not free from shortcomings. The resampled instances usually suffer from being non-representative of the dataset, overfitting, and the loss of important data. Hence, there is a need to augment machine learning algorithms with a novel approach to overcome this challenge.

TABLE 1. Table of related studies.

D. RESEARCH PROBLEM
The problem of NBA fraud keeps increasing daily as the number of online banking service users keeps increasing [3].
ML techniques applied in many studies show promising performance in overcoming NBA fraud. However, most ML models struggle when the distribution of fraud instances is skewed, as in the case of the BAF dataset. The studies [18], [19] utilize LGBM to address skewed fraud instances using the True Positive (TP) rate as a performance measure. The study [15] utilizes stacking in ensemble learning with majority voting to evaluate the BAF dataset and address the changing fraud patterns. The study [20] uses federated learning to address data privacy issues and deep neural networks to classify fraud, with TP rate as a metric. These studies achieve good performance in addressing BAF challenges. However, the studies did not consider the potential losses of fraud risk features. Major problems this paper addresses include:

• Most existing studies do not consider potential losses of fraud risk features, but fraud instances happen rarely and cause big losses when they occur.
• Fraud instances are inherently skewed compared to non-fraud instances, producing a highly skewed distribution.
• Fraud patterns tend to have more irregular and extreme values, while models like logistic regression or regression assume normality, so their predictions may be inaccurate.

III. PROBLEM DEFINITION
This paper considers $X_i = x_1, x_2, \ldots, x_n$ as a vector of observations in a respective raw feature. $X_i$ is transformed to the log return $X_p = x_1, x_2, \ldots, x_m$, which is a vector of log returns. The log return $X_p$ is computed using $\log\left(1 + \frac{x_i}{x_{i-1}}\right)$. The log returns are assessed using value-at-risk to determine the risk of fraud for each respective feature. The fraud instances are considered as the worst-case scenario and beyond. The value-at-risk models the tail of a distribution, i.e., the extreme quantiles where fraud occurs. Historical simulation was conducted to estimate the potential losses distribution $\tilde{\ell}_p = \ell(X_p) = -\left(f(t+1, Z_t + x_p) - f(t, Z_t)\right)$. The extreme value theorem is applied to estimate the tail distribution based on the skewness of fraud instances. The value-at-risk V, as a risk measure that assesses the risk of the features, is the sum of expected loss ℓ and expected shortfall C, as can be seen in (3). The risk-return features were obtained as the log returns pass through the formulation comprising ℓ, V, and C as given in (9)-(12); the equations are derived based on the event tree of fraud instances. The value-at-risk quantified risks across mean, worst-case, and extreme scenarios. This study aims to detect NBA fraud based on risk-return features using the KNN model.

IV. MATERIALS AND METHODS
This section discusses the materials and methods adopted in this research.

A. PROPOSED METHOD
The proposed design of this research is illustrated in Fig. 1, which describes the steps and process involved in NBA fraud detection. Value-at-risk, being an important part of this research, is designed to model the severe and extreme fraud risk features; it also focuses on rare fraud instances that are detrimental and very costly when they occur. However, the rare cases that are mostly skewed can distort machine learning algorithms [51], especially distance-based ones like KNN. The value-at-risk can handle the fraud skewness through adjustable threshold probability ranges (confidence levels), unlike conventional methods that attach a constant fraud probability weight to the skewed fraud instances. The preprocessed, extracted, and engineered features were sent as input to value-at-risk for simulation. Meanwhile, a distance-based KNN is designed for adjustability to detect fraudulent features by identifying rare clusters with nearest neighbor distance k. The confidence level chosen considers the rare fraud cases as higher-risk features that would result in fewer training sets, particularly for the KNN model with hyperparameter k. The fraud detection model requires the optimization of k to a lower setting to sufficiently model the fraudulent features in the rare cluster. The distance weight of KNN is imperative in inhibiting fraud skewness by assigning a higher weight to near instances, which in turn facilitates efficient detection of skewed instances.

FIGURE 1. Proposed method.

Additionally, this paper puts forward a novel approach to NBA fraud detection through the utilization of value-at-risk that appropriately models the fraud skewness. The selection of a 99.5% confidence level highlights the need to capture the 0.5% of extreme fraud risk instances, which fall under the subset of the 1% fraud rate (detection effectiveness), as shown in Fig. 2. The value-at-risk, which is a finance and risk management tool, models the tails of a fraud event that are extreme. Consequently, a novel detection rate performance metric that incorporates the risk of skewed fraud instances into the overall performance measure of detecting rare instances is put forward, as will later be seen in (21). The metric provides the model with the capacity to identify and attach more weight to rare and extreme fraud instances by including the fraud rate and confidence level in the detection process. Therefore, this research puts forward a single metric that captures the overall rate of fraud detection based on risk exposure. Under an acceptable loss tolerance in the banking sector, value-at-risk presents an intelligent approach for establishing data-driven criteria for fraud risk management.

B. DATA PREPROCESSING
This paper carries out preprocessing tasks to improve the quality of features and ensure model accuracy. The redundant feature device_fraud_count contains only zero-valued instances, all of which were manually removed from the data, making the model less complex. The categorical features device_os, employment_status, payment_type, and housing_status were labeled to make them easier to learn
because machine learning cannot process features that are non-numeric. The features zip_count, keep_alive_session, and bank_months_count were eliminated to avoid noise and collinearity issues. The features foreign_request, has_other_card, and email_is_free were eliminated as they give too many undefined log returns.

FIGURE 2. Value-at-risk return curve.

C. FEATURE EXTRACTION AND ENGINEERING
The development of an NBA fraud detection model was based on the selection of relevant features from demographic, behavioral, risk management, and transactional perspectives. Demographic features such as income, customer_age, and employment_status were selected. Behavioral features such as bank_branch_count_8w and housing_status were selected. Risk-based features that include credit_risk_score and proposed_credit_limit were selected. Transactional features such as days_since_request, total_velocity, and payment_type were also selected. Additionally, two or more existing features are combined to form a new feature as given in Table 2. The features engineered are based on location, velocity of transactions, default risk, and ability to repay loans to determine the likelihood of fraudulent behaviors. The selected features were used along with the other raw features for accurate model training.

TABLE 2. Table of new features.

D. VALUE-AT-RISK V
Because value-at-risk emphasizes statistically extreme but significant fraud instances, it is well suited to the development of efficient fraud detection models. In the financial sector, value-at-risk is a quantile of the loss distribution that gives a range of potential losses and is one of the most frequently used measures of risk. V can also be described as a statistical measure of the risk of loss over a specific time at a given confidence level. It also plays a significant part in the Basel regulatory framework. V has a confidence level α ∈ (0, 1) [48]. This experiment adopted the Solvency II framework, which uses a one-year horizon with the level of confidence α equal to 0.995. It can be written as in (1):

$$V = \mu + \sigma Z^{-1}(\alpha) \tag{1}$$

where µ is the mean of the log loss returns, σ is the standard deviation of returns, Z represents the standard normal, and $Z^{-1}(\alpha)$ represents the α quantile of Z. The value-at-risk [49] can also be written as in (2):

$$\text{Value-at-risk} = \text{Expected loss} + \text{Unexpected loss} \tag{2}$$

In this paper, we consider the unexpected loss to be the expected shortfall. Therefore, the general relationship will be written as given in (3):

$$V = \ell + C \tag{3}$$

Generally, expected loss alone does not sufficiently handle the tail risk of losses; applying V additionally quantifies the aggregate potential losses of fraud risk features. The addition of C further quantifies the extreme loss effects. This combination allows quantification of risk across mean, worst-case, and extreme scenarios. The metrics aggregate their strengths and overcome their weaknesses.

1) EXPECTED LOSS ℓ
Expected loss is an important risk measure for estimating the average or probable loss expected from a specific risk exposure. Intuitively, it indicates the loss that occurs on average in a repeated situation. ℓ is usually measured over one year; a higher value of ℓ indicates a higher risk exposure. ℓ does not sufficiently handle tail risks as it is more of an average risk measure; due to this limitation, it needs support from other risk measures like V and C. The expected loss can be written mathematically as in (4):

$$\ell = E\left(X_{p+1}\right) = \frac{\sum_{p=t-n+1}^{m} x_m}{p} = \mu \tag{4}$$

$X_p$ is a log return, which was computed using formula (5). The addition of 1 is to avoid having too many negative and undefined log returns.

$$X_p = \log\left(1 + \frac{x_i}{x_{i-1}}\right) \tag{5}$$
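
As a rough illustration of (1), (4), and (5), the following Python sketch computes log returns for a single feature, their mean as the expected loss, and the parametric value-at-risk at α = 0.995. It also adds an empirical (historical) quantile and a tail average in the spirit of the expected shortfall C. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the sample values, the use of the population standard deviation, the order-statistic quantile convention, and the rule for skipping undefined log returns are all illustrative choices that the paper does not specify.

```python
import math
from statistics import NormalDist, fmean, pstdev


def log_returns(values):
    """Eq. (5): log(1 + x_i / x_{i-1}) over consecutive observations.

    Assumption: pairs with a zero previous value or a non-positive
    log argument are skipped, since their log return is undefined.
    """
    returns = []
    for prev, cur in zip(values, values[1:]):
        if prev == 0:
            continue
        arg = 1.0 + cur / prev
        if arg > 0:
            returns.append(math.log(arg))
    return returns


def expected_loss(returns):
    """Eq. (4): the expected loss is the mean of the log returns."""
    return fmean(returns)


def parametric_var(returns, alpha=0.995):
    """Eq. (1): V = mu + sigma * Z^{-1}(alpha).

    Assumption: sigma is the population standard deviation.
    """
    mu = fmean(returns)
    sigma = pstdev(returns)
    return mu + sigma * NormalDist().inv_cdf(alpha)


def historical_var(returns, alpha=0.995):
    """Empirical alpha-quantile of the returns (historical simulation).

    Assumption: the quantile is the ceil(alpha * n)-th order statistic.
    """
    ordered = sorted(returns)
    rank = max(1, math.ceil(alpha * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


def expected_shortfall(returns, alpha=0.995):
    """Average of the returns strictly above the historical V.

    Falls back to V itself when no observation exceeds it.
    """
    v = historical_var(returns, alpha)
    tail = [x for x in returns if x > v]
    return fmean(tail) if tail else v


if __name__ == "__main__":
    # Hypothetical raw feature values, made up for illustration only.
    feature = [120.0, 95.0, 130.0, 400.0, 110.0, 105.0, 98.0, 1500.0, 102.0]
    xp = log_returns(feature)
    print("expected loss:", expected_loss(xp))
    print("parametric V:", parametric_var(xp))
    print("historical V:", historical_var(xp))
    print("expected shortfall:", expected_shortfall(xp))
```

The parametric form follows (1) directly and therefore inherits its normality assumption, whereas the historical quantile makes no distributional assumption, which matches the motivation the paper gives for historical simulation.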


2) EXPECTED SHORTFALL C
In other words, $C$ is a conditional value-at-risk given that the loss $\ell$ exceeds the threshold $V$ at the specified confidence level $\alpha$. $C$, as given in (6) and (7), represents the level of the worst $100(1-\alpha)\%$ losses in the distribution. It focuses on the severity of the rare worst-case losses ignored by $V$.

\[ C = \frac{1}{1-\alpha} \int_{\alpha}^{1} V(X_p)\, dx \tag{6} \]

\[ C = \frac{1}{1-\alpha} \int_{\alpha}^{1} \left[ V = \mu + \sigma Z^{-1}(\alpha) \right] dx \tag{7} \]

However, $V$ can be computationally expensive and difficult to apply to complex financial portfolios. It also gives no information on the severity of loss. To compensate for this weakness of $V$, the expected shortfall $C$ was employed to estimate the severity of losses in the worst cases. Historical simulation is adopted to estimate the losses of fraud risk features.

3) HISTORICAL SIMULATION
This is a non-parametric technique that uses past information to model possible future loss [50]. It uses the empirical distribution of previous changes in risk features to estimate the loss distribution. The advantage of historical simulation over the covariance method of loss estimation is its ability to adapt over time and to model dynamic conditions. The loss distribution $\tilde{\ell}_p$ measures the change in value between the returns $f(t, Z_t)$ at time $t$ and the returns $f(t+1, Z_t + x_p)$ at time $t+1$ at a specific confidence level. The negative sign reflects the interest in quantifying loss, as given in (8), where $Z_t$ denotes the condition of returns at time $t$.

\[ \tilde{\ell}_p = \ell(X_p) = -\left( f(t+1, Z_t + x_p) - f(t, Z_t) \right) \tag{8} \]

4) TREE EVENT OUTCOME
An event tree is employed in risk assessment and analysis to pinpoint the different event sequences, for both fraud and non-fraud, that may result in a particular outcome. An event tree, also known as an incident response tree (IRT) [49], contains four possible outcomes, as given in (9)-(12), that are based on prevention, detection, and response. The event tree tracks fraudulent activities and estimates the return outcomes of monitoring decisions, hence improving prediction and risk models. The formulation in this paper is based on Dan Gorton [49]: fraud without detection, fraud despite detection, fraud detected and stopped, and high-risk fraud stopped. The detection effectiveness $\gamma$ is the fraud rate.

1. Fraud without detection: $\gamma$ is not considered, since fraud goes undetected, as in (9). The worst-case scenario $V$ of losses when fraud stays undetected is understood by using $\ell$, which is the mean of returns $\mu$. $C$ aids in evaluating the tail risk related to undetected fraud.

\[ \operatorname{Quantile}(\mu, 99.5\%) = \ell + \operatorname{Average}(X_p \mid X_p > V) \tag{9} \]

where $\ell = \mu$, $V = \operatorname{Quantile}(\mu, 99.5\%)$, and $C = \operatorname{Average}(X_p \mid X_p > V)$.

2. Fraud despite detection: When fraud occurs with $(1-\gamma)$ detection, $\ell$ and $V$ are reduced in proportion to $\gamma$, while $C$ remains the same, as given in (10).

\[ \operatorname{Quantile}(\mu, 99.5\%) \times (1-\gamma) = \mu \times (1-\gamma) + \operatorname{Average}(X_p \mid X_p > V) \tag{10} \]

where $\ell = \mu \times (1-\gamma)$, $V = \operatorname{Quantile}(\mu, 99.5\%) \times (1-\gamma)$, and $C = \operatorname{Average}(X_p \mid X_p > V)$.

3. Fraud detected and stopped: Detection stops the fraud before major damage, as shown in (11). Fraud is detected with effectiveness $\gamma$, and $\ell$ is restricted to the expenses related to prevention and detection. Both $V$ and $\ell$ are proportional to $\gamma$, while $C$ remains the same.

\[ \operatorname{Quantile}(\mu, 99.5\%) \times \gamma = \mu \times \gamma + \operatorname{Average}(X_p \mid X_p > V) \tag{11} \]

where $\ell = \mu \times \gamma$, $V = \operatorname{Quantile}(\mu, 99.5\%) \times \gamma$, and $C = \operatorname{Average}(X_p \mid X_p > V)$.

4. High-risk fraud stopped: This refers to fraud that is avoided because of potential risk exposure, as given in (12). $\ell$ is the associated cost of prevention and detection of high-risk fraud; it is assumed that the associated costs are relatively lower than the mean loss. $V$ is the 99.5th percentile of loss returns, and $C$ quantifies the average loss beyond $V$.

\[ \operatorname{Quantile}(\gamma, 99.5\%) = \mu(1-\alpha) + \operatorname{Average}(X_p \mid X_p > V) \tag{12} \]

where $\ell = \mu(1-\alpha)$, $V = \operatorname{Quantile}(\gamma, 99.5\%)$, and $C = \operatorname{Average}(X_p \mid X_p > V)$.

In each of the four scenarios, these risk measures play distinct and significant roles in managing the costs and risks related to fraud detection and prevention strategies.

E. NBA FRAUD DETECTION MODEL SELECTION
Machine learning models can use high-dimensional data to analyze complex fraud patterns that humans or rule-based systems would miss. Supervised ML is particularly significant because it employs labeled data to find patterns, anomalies, and fraudulent activity. Binary logistic regression (BLR) is suitable for handling categorical data and is valued for its interpretability. Naïve Bayes (NB) is efficient and simple in terms of cost and time. K-nearest neighbor (KNN) is highly effective in fraud detection when managing transactional information, is adaptable, and has potential for use in hybrid form.

1) BINARY LOGISTIC REGRESSION
BLR is a supervised ML model [50] that is very effective in fraud detection due to its suitability for handling categorical data and its interpretability [51]. The solution for the fraud detection model is constructed using the binary fraud class $y$ and features $X_i$ [52]. $X_i$ is a vector of features $(x_1, x_2, \ldots, x_n)$ capable of influencing the decision


of the fraud detection model to classify features as either fraud or non-fraud, with class $y \in \{0, 1\}$. The BLR function applies the sigmoid function to obtain $y \in (0, 1)$. The mathematical expression is given in (13) and (14):

\[ \log(y) = \beta_0 + \sum X_i \beta_i \tag{13} \]

\[ \text{logit} = \frac{\text{Probability of fraud}}{\text{Probability of non-fraud}} = \log\left(\frac{y}{1-y}\right) \tag{14} \]

2) NAÏVE BAYES CLASSIFIER
Naïve Bayes is a supervised machine learning algorithm based on Bayes' theorem [53]. NB is very suitable for fraud detection because of its efficiency in terms of cost and time and its high accuracy [54]. Given the fraud class, the NB classifier assumes that all fraudulent features are independent of each other. Assume the target feature to be $Y_j \in \{0, 1\}$ and that $X_i$ is a vector of fraudulent features. $P(X_i / Y_j)$ is the generic conditional probability of $X_i$ given $Y_j$. The Gaussian function of NB is given in (15), where $\sigma_j^2$ and $\bar{x}$ are the variance and mean of probabilities.

\[ P\left(\frac{X_i}{Y_j}\right) = \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x_i - \bar{x})^2}{2\sigma_j^2}} \tag{15} \]

3) K-NEAREST NEIGHBOR
KNN is a supervised machine learning algorithm that is useful for classification problems [55] and is valued for its better detection and lower FP rate. KNN is highly effective in fraud detection when managing transactional information, is adaptable, and has potential for use in hybrid form [56]. To determine whether there has been fraudulent behavior in fraudulent features $X_i$, studies employ KNN to classify $X_i$ into fraud class $Y_j \in \{0, 1\}$. Two estimates are needed for the KNN fraud detection technique: the transaction correlation and the distance between the transaction's occurrence of the fraud features. The indicator function is given in (16):

\[ E(Y_j, X_i) = \begin{cases} 1, & \text{if } Y_j = X_i \\ 0, & \text{elsewhere} \end{cases} \tag{16} \]

V. EXPERIMENTAL SETUP
The simulation of value-at-risk was conducted in a Microsoft Excel environment, and the development of fraud detection models was conducted using Python. This section describes the experimental procedures carried out to develop the fraud detection model.

A. DATASET
A real-world BAF dataset is accessible to the public [18]. It contains 32 features with 1 million instances. The dataset contains details about demographic, behavioral, risk, and transactional features. The dataset is highly skewed, with a fraud class of 11029 instances and a non-fraud class of 988971 instances, as shown in Fig. 3. The primary obstacle to the detection of NBA fraud is the scarcity of datasets. The BAF dataset remains the only data in this domain. As such, the evaluation in this paper is forced to rely on the BAF dataset.

FIGURE 3. Fraud class distribution.

B. PERFORMANCE METRICS
The confusion matrix is used to evaluate the performance of a classification model and contains the values of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) [52]. The majority of studies in the literature use the TP rate. To allow comparison, evaluation metrics such as accuracy, f-score, TP rate, and FP rate are used. A novel detection rate is additionally proposed that integrates the overall detection performance with the associated risk exposure. Accuracy measures the overall performance of the model. F-score integrates precision and recall (TP rate) in a skewed dataset, where it is difficult to balance between minimizing FN and FP. The TP rate measures the proportion of correct detections. The FP rate measures the proportion of incorrect detections. The metrics are given mathematically in (17)-(20).

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{17} \]

\[ \text{TP rate} = \frac{TP}{TP + FN} \tag{18} \]

\[ \text{FP rate} = \frac{FP}{FP + TN} \tag{19} \]

\[ \text{F-score} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{20} \]

The novel detection rate measures the overall rate of detection while incorporating the risk of detecting extreme instances. $\gamma$ denotes the fraud rate and $\alpha$ denotes the confidence level, as given in (21). The component $\frac{1+\gamma(1-\alpha)}{1+\gamma}$ is proportional to the proportion of extreme fraud instances exceeding $\alpha$. It has the advantage of integrating the capacity to detect rare but extremely significant fraud cases with the overall detection performance. The detection rate ranges from 0 to 1.

\[ \text{Detection rate} = \frac{(1 + \gamma(1-\alpha))\, TP}{(1 + \gamma)(TP + FN)} \tag{21} \]
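The metrics in (17)-(21) can be sketched in a few lines of Python. This is an illustration only: the function name `fraud_metrics`, the confusion-matrix counts, and the values chosen for $\gamma$ and $\alpha$ are hypothetical and are not taken from the paper's experiments. The sketch also assumes every denominator is non-zero.

```python
def fraud_metrics(tp, fp, tn, fn, gamma, alpha):
    """Evaluate (17)-(21) from confusion-matrix counts.

    gamma: fraud rate, alpha: confidence level (both supplied by the caller).
    Assumes tp + fn > 0, fp + tn > 0 and 2*tp + fp + fn > 0.
    """
    tp_rate = tp / (tp + fn)                           # (18)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),   # (17)
        "tp_rate": tp_rate,
        "fp_rate": fp / (fp + tn),                     # (19)
        "f_score": 2 * tp / (2 * tp + fp + fn),        # (20)
        # (21): the TP rate scaled by (1 + gamma(1 - alpha)) / (1 + gamma)
        "detection_rate": (1 + gamma * (1 - alpha)) / (1 + gamma) * tp_rate,
    }

# Hypothetical counts and parameters, for illustration only.
m = fraud_metrics(tp=19, fp=2, tn=36, fn=1, gamma=0.01, alpha=0.995)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because the scaling factor in (21) is at most 1 for $\gamma \ge 0$ and $0 \le \alpha \le 1$, the detection rate never exceeds the TP rate, and the gap widens as the fraud rate $\gamma$ grows.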

C. NBA FRAUD DETECTION MODEL DEVELOPMENT
This section discusses the procedure for developing the NBA fraud detection model using raw features. Raw features refer to the features in the initial stage that went through preprocessing, feature extraction, and engineering steps before transformation to either logarithmic or risk-return features. The datasets contain different format types, and numeric features were converted to integers for simplicity and efficiency in processing the fraud detection model. Features that cause noise and collinearity were removed from the datasets to increase the performance of the model. Relevant features that facilitate accuracy were selected from demographic, behavioral, transactional, and risk perspectives. New features were engineered based on location, the velocity of transactions, default risk, and ability to repay loans to determine the risk of fraudulent behavior. The processed features and newly engineered features together form the set of raw features, which is highly skewed. The raw features were sent as input to the machine learning models to classify features as either fraud or non-fraud. BLR, KNN, and NB models are employed to build a fraud detection model.

D. EXECUTION OF THE PROPOSED APPROACH
The NBA fraud detection model presented in Section C is built from raw features. The raw features undergo preprocessing and feature selection. The engineered features, along with the other features, were sent for training using different machine learning models. However, the results obtained are not very good for NBA fraud detection. The poor performance of raw features, which is attributed to the skewed data distribution, highlighted the need for model improvement. The raw features were therefore modeled by value-at-risk. First, the raw features were transformed into log returns, which were then passed through (3) to obtain the value-at-risk $V$. The risk-return features were obtained from $V$, $\ell$, and $C$ as seen in (9)-(12). The risk-return features were then classified by machine learning models. BLR, NB, and KNN were employed to develop the NBA fraud detection model.

BLR is essentially a probability prediction model whose output needs to be turned into binary values. Maximum likelihood estimation is used to estimate the weights of BLR. A real-valued set of risk-return features is mapped into a binary class of fraud and non-fraud using the sigmoid function. The best weights produce a model that predicts a value very close to 1. Using the risk-return features in Naïve Bayes, the conditional probability of the fraud features and the prior probability of the fraud class are computed. To predict the fraud class based on new features, the posterior probability of the fraud class is obtained by combining the learned probability distributions with Bayes' rule. The KNN algorithm finds the K nearest neighbors of a given data point using a distance metric. The majority vote of the K neighbors is then used to establish the fraud class. This method enables the algorithm to classify outcomes based on the local structure of the data and to adjust to various patterns.

VI. RESULTS ANALYSIS AND DISCUSSION
This section presents the general results obtained from the experiments with skewed fraud instances and with risk-return features, together with their validation. 10-fold cross-validation was used to evaluate the NBA fraud detection models.

A. RESULT OF NBA FRAUD DETECTION MODEL WITH SKEWED FRAUD INSTANCES
This section presents the result of the NBA fraud detection model with skewed instances using BLR, KNN, and NB. The results are presented in Table 3, where the best metric results among the models are written in bold. The accuracy results of BLR, KNN, and NB are 0.9869, 0.9884, and 0.9743 respectively. The TP rate results of BLR, KNN, and NB are 0.0016, 0.0061, and 0.1355 respectively. The FP rate results of BLR, KNN, and NB are 0.002, 0.0007, and 0.0163 respectively. The f-score results of BLR, KNN, and NB are 0.0028, 0.0115, and 0.1042 respectively. The metric results are illustrated in Fig. 4. It can be observed that the accuracy and FP rate results were good. However, the TP rate and f-score results were not very good. The TP rate is a very important metric, especially in fraud detection: robust and accurate fraud detection must attain a good TP rate. The poor performance of the fraud detection model on skewed fraud instances, particularly in TP rate and f-score, highlighted the need for model improvement. We therefore improve the fraud detection model using value-at-risk augmented features, as presented in Section B.

FIGURE 4. Performance evaluation of fraud detection model with skewed features.

B. RESULT OF AN IMPROVED NBA FRAUD DETECTION MODEL USING VALUE-AT-RISK
This section presents the results of an improved NBA fraud detection model using BLR, KNN, and NB. The results indicate good performance by KNN, and the results are written


in bold, as shown in Table 4. The accuracy results of BLR, KNN, and NB are 0.8, 0.9167, and 0.8667 respectively. The TP rate results of BLR, KNN, and NB are 0.75, 0.95, and 0.875 respectively. The detection (DT) rate results of BLR, KNN, and NB are 0.7426, 0.9406, and 0.8580 respectively. The f-score results of BLR, KNN, and NB are 0.7333, 0.9333, and 0.8333 respectively. The metric results are illustrated in Fig. 5. The results show that KNN has better performance in accuracy, TP rate, DT rate, and f-score. Overall, it can be concluded that the KNN model outperforms the other models and emerges as the best NBA fraud detection model.

TABLE 3. Result of fraud detection model with raw features.

TABLE 4. Result of an improved fraud detection model.

FIGURE 5. Performance evaluation of risk-return features.

The receiver operating characteristic (ROC) curve in Fig. 6 presents the classification capability. It indicates a high TP rate and a low FP rate across different threshold values. The KNN model demonstrates high robustness in fraud detection compared to BLR and NB.

FIGURE 6. Receiver operating curve for the fraud models.

The risk-return features were reduced using principal component analysis, and the principal components (PC) were used to plot the KNN decision boundary. Fig. 7 indicates that the KNN model identifies the fraud risk patterns based on the linear boundary it exhibits, which translates to a relatively simple relationship among fraud risk features. Also, the dominance of one class in a particular region may signal a distinct fraudulent feature. The smaller k = 3 successfully reduces the influence of skewed instances, which is manifested by a high TP rate.

FIGURE 7. Decision boundary for KNN.

C. RELIABILITY ANALYSIS
The reliability analysis of the value-at-risk-based fraud detection model is done using the Kupiec test. Kupiec proposed an additional failure rate-based test in 1995 [57]. The test measures the frequency with which a value-at-risk is violated over a specified period. The null hypothesis of the test is that the violation rate expected by the value-at-risk model and the observed violation rate are equal; the violation rate is given in (22) as $h$. The test statistic follows a chi-square distribution with 1 degree of freedom


and is given in (23) as the likelihood ratio (LR):

\[ h = \frac{\text{number of violations}}{\text{total number of observations}} = \frac{v}{t} \tag{22} \]

\[ \mathrm{LR} = -2 \ln\left( \frac{(1 - h_{exp})^{t-v}\, h_{exp}^{\,v}}{(1 - h_{obs})^{t-v}\, h_{obs}^{\,v}} \right) \sim \chi^2_1 \tag{23} \]

The result of the test at a 5% significance level is given in Fig. 8. It can be seen that only name_email_similarity and days_since_request were found not to be consistent with the observed violation rate. The rest of the features were found to be consistent and reliable. Hence, the incorporation of value-at-risk was adequate and reliable.

FIGURE 8. Reliability analysis.

D. COMPARISON WITH THE STATE-OF-THE-ART METHODS IN NBA FRAUD DETECTION
The results of our experiment are compared with the state-of-the-art methods for NBA fraud. The study [18] used 100 sets of hyperparameters to optimize the LGBM model performance and obtained a TP rate of about 0.6. Under-sampling techniques for handling class skewness were used as part of the hyperparameters. The study [19] used 25 sets of hyperparameter configurations to optimize the LGBM model and obtained a TP rate of almost 0.8. Utility-aware reweighing was used to handle the class skewness. Additionally, the study [15] used ensemble learning techniques in which stacking was specifically applied to handle class skewness. The strengths of the weaker trained models were aggregated using majority voting to address evolving patterns, achieving a TP rate of 0.9. Another study [20] combined federated learning to handle data privacy with SHAP values to ensure interpretability of feature importance by human experts. A deep neural network was used to recognize patterns of fraud, and a TP rate of about 0.75 was achieved. SMOTE was employed to handle class skewness. Our paper uses KNN with the hyperparameter k to detect fraud with a TP rate of 0.95. We overcome fraud skewness using value-at-risk, which considers fraud instances as a worst-case scenario through adjustable threshold probability range weights that are attached to the skewed fraud instances. The results for comparison are given in Table 5. Also, while our paper reached an accuracy of 0.9167, another study [58] that employed the BAF dataset in its evaluation reported an accuracy of 0.677. Our approach performed better than the existing methods.

TABLE 5. Comparison with the state-of-the-art methods.

E. ABLATION STUDY
This paper conducted an ablation study to determine the contribution of the components that influence the performance of the NBA fraud detection model. The choice of logarithmic return is among the components that impacted our result. Given the log return $X_p = \log\left(1 + \frac{x_i}{x_{i-1}}\right)$, the 1 is removed from the log return formula so that it becomes $X_p = \log\left(\frac{x_i}{x_{i-1}}\right)$. The result, shown in Table 6, gives the F-score of the new bank account fraud detection models. The removal of 1 resulted in decreased performance for BLR, KNN, and NB.

TABLE 6. Ablation study using F-score.

TABLE 7. Parameter analysis involving learning rate lr of BLR.

F. PARAMETER ANALYSIS
This paper examines the hyperparameter space to determine the setup that leads to optimum model efficiency and performance. The experiment uses different parameter ranges in BLR, KNN, and NB. For BLR, the learning rate lr is examined, and the accuracy results of different parameter configurations are shown in Table 7. For KNN, the number of nearest neighbors k is evaluated, and the accuracy results of the parameter settings are shown in Table 8. For NB, different probability


distributions were evaluated, and the accuracy results are given in Table 9.

TABLE 8. Parameter analysis involving a number of nearest neighbors k of KNN.

TABLE 9. Parameter analysis involving a distribution assumption of NB.

VII. DISCUSSIONS AND CONCLUSION
This section presents a discussion of the results and the conclusion of our findings.

A. DISCUSSIONS
This paper explored improving the performance of the NBA fraud detection model by employing value-at-risk. The performance of the fraud detection models was measured based on the removal of redundant features to lower the complexity of the model, the selection of important features capable of influencing fraud detection to avoid noise and collinearity, and the engineering of features from a contextual perspective to increase model performance. The raw features were sent to the BLR, KNN, and NB models for classification. The KNN model achieved the best accuracy and FP rate, as shown in Table 3, with an accuracy of 0.9884, a TP rate of 0.0061, an FP rate of 0.0007, and an f-score of 0.0115. The fraud detection performance is neither good nor reliable, as evidenced by the TP rate and f-score, which necessitated model improvement. Value-at-risk was therefore employed to improve the model: the raw features were simulated through value-at-risk, and the resulting risk-return features were sent to the BLR, KNN, and NB models for classification. Among the models, the KNN model performs best, as shown in Table 4, with an f-score of 0.9333, a TP rate of 0.95, an accuracy of 0.9167, and a DT rate of 0.9406. The NBA fraud detection model based on value-at-risk features appears to have good performance. The reliability analysis conducted using the Kupiec test showed the model to be reliable and consistent, as shown in Fig. 8. This indicates that the value-at-risk engineered features led to the improvement of the K-nearest neighbor fraud detection model.

B. CONCLUSION
The value-at-risk-based fraud detection model presented in this paper enables the quantification and mitigation of fraud risk features and at the same time overcomes the influence of skewed fraud instances, which is crucial in solving financial fraud challenges. Value-at-risk attaches a confidence probability weight to the rare fraud cases with nearest neighbor distance k. The distance weight of KNN is imperative in inhibiting class skewness by assigning a higher weight to near instances, which in turn facilitates efficient detection of skewed instances. The deployment of expected shortfall and expected loss by value-at-risk allows quantification of risk across mean, worst-case, and extreme scenarios, enabling aggregation of their strengths. An accurate fraud detection system therefore assists organizations in making effective choices and reducing the overall expense of fraud detection and prevention. This paper does not consider time windows in the experiment. However, the major challenge is the lack of data availability in NBA fraud detection.

CONFLICT OF INTEREST
There are no competing interests disclosed by the authors.

AVAILABILITY OF DATA AND MATERIALS
The data used in this research is publicly available at https://github.com/feedzai/bank-account-fraud

ACKNOWLEDGMENT
The authors would like to acknowledge the support of Prince Sultan University for paying the Article Processing Charges (APC) of this publication and also would like to thank the School of Statistics & Mathematics, Zhejiang Gongshang University, China.

REFERENCES
[1] ACFE. Association of Certified Fraud Examiners (ACFE) 2022 Report to the Nations. Accessed: 2023. [Online]. Available: https://legacy.acfe.com/report-to-the-nations/2022/
[2] T. Ashfaq, R. Khalid, A. S. Yahaya, S. Aslam, A. T. Azar, S. Alsafari, and I. A. Hameed, “A machine learning and blockchain based efficient fraud detection mechanism,” Sensors, vol. 22, no. 19, p. 7162, Sep. 2022.
[3] N. S. Alfaiz and S. M. Fati, “Enhanced credit card fraud detection model using machine learning,” Electronics, vol. 11, no. 4, p. 662, 2022.
[4] A. Alfaadhel, I. Almomani, and M. Ahmed, “Risk-based cybersecurity compliance assessment system (RC2AS),” Appl. Sci., vol. 13, no. 10, p. 6145, May 2023.
[5] D. Sarma, W. Alam, I. Saha, M. N. Alam, M. J. Alam, and S. Hossain, “Bank fraud detection using community detection algorithm,” in Proc. 2nd Int. Conf. Inventive Res. Comput. Appl. (ICIRCA), Jul. 2020, pp. 642–646.
[6] A. Pagano, “Digital account opening fraud on demand deposit accounts: An assessment of available technology,” Ph.D. thesis, Utica College, Utica, NY, USA, 2020.
[7] Shuftipro. New Account Fraud—A New Breed of Scams. Accessed: 2023. [Online]. Available: https://shuftipro.com/reports-whitepapers/new-account-fraud.pdf
[8] R. Sasirekha, B. Kanisha, and S. Kaliraj, “Study on class imbalance problem with modified KNN for classification,” in Intelligent Data Communication Technologies and Internet of Things, vol. 101. Singapore: Springer, 2022, pp. 207–217, doi: 10.1007/978-981-16-7610-9_15.
[9] P. Vanini, S. Rossi, E. Zvizdic, and T. Domenig, “Online payment fraud: From anomaly detection to risk management,” Financial Innov., vol. 9, no. 1, p. 66, Mar. 2023, doi: 10.1186/s40854-023-00470-w.
[10] X. Zhu, X. Ao, Z. Qin, Y. Chang, Y. Liu, Q. He, and J. Li, “Intelligent financial fraud detection practices in post-pandemic era,” Innovation, vol. 2, no. 4, Nov. 2021, Art. no. 100176, doi: 10.1016/j.xinn.2021.100176.
[11] M. Monge, C. Poza, and S. Borgia, “A proposal of a suspicion of tax fraud indicator based on Google Trends to foresee Spanish tax revenues,” Int. Econ., vol. 169, pp. 1–12, May 2022, doi: 10.1016/j.inteco.2021.11.002.


[12] S. Kannan and K. Somasundaram, “Autoregressive-based outlier algorithm to detect money laundering activities,” J. Money Laundering Control, vol. 20, no. 2, pp. 190–202, May 2017, doi: 10.1108/jmlc-07-2016-0031.
[13] B. Xiao, B. Lei, W. Lan, and B. Guo, “A blockwise network autoregressive model with application for fraud detection,” Ann. Inst. Stat. Math., vol. 74, no. 6, pp. 1043–1065, Dec. 2022, doi: 10.1007/s10463-022-00822-w.
[14] G. Moschini, R. Houssou, J. Bovay, and S. Robert-Nicoud, “Anomaly and fraud detection in credit card transactions using the ARIMA model,” in Proc. 7th Int. Conf. Time Forecasting, Jul. 2021, p. 56, doi: 10.3390/engproc2021005056.
[15] A. A. Alhashmi, A. M. Alashjaee, A. A. Darem, A. F. Alanazi, and R. Effghi, “An ensemble-based fraud detection model for financial transaction cyber threat classification and countermeasures,” Eng., Technol. Appl. Sci. Res., vol. 13, no. 6, pp. 12433–12439, Dec. 2023, doi: 10.48084/etasr.6401.
[16] R. M. Aziz, R. Mahto, K. Goel, A. Das, P. Kumar, and A. Saxena, “Modified genetic algorithm with deep learning for fraud transactions of ethereum smart contract,” Appl. Sci., vol. 13, no. 2, p. 697, Jan. 2023, doi: 10.3390/app13020697.
[17] M. Hegazy, A. Madian, and M. Ragaie, “Enhanced fraud miner: Credit card fraud detection using clustering data mining techniques,” Egyptian Comput. Sci. J., vol. 40, no. 3, pp. 1–10, 2016.
[18] S. Jesus, J. Pombal, D. Alves, A. Cruz, P. Saleiro, R. Ribeiro, J. Gama, and P. Bizarro, “Turning the tables: Biased, imbalanced, dynamic tabular datasets for ML evaluation,” in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 33563–33575.
[19] J. Pombal, P. Saleiro, M. A. T. Figueiredo, and P. Bizarro, “Fairness-aware data valuation for supervised learning,” 2023, arXiv:2303.16963.
[20] T. Awosika, R. Mani Shukla, and B. Pranggono, “Transparency and privacy: The role of explainable AI and federated learning in financial fraud detection,” 2023, arXiv:2312.13334.
[21] J. Khaksar, M. Salehi, and M. Lari DashtBayaz, “The relationship between auditor characteristics and fraud detection,” J. Facilities Manage., vol. 20, no. 1, pp. 79–101, Jan. 2022, doi: 10.1108/jfm-02-2021-0024.
[22] A. Cordis, “Political alignment and corporate fraud: Evidence from the United States of America,” J. Appl. Accounting Res., Oct. 2023, doi: 10.1108/jaar-06-2022-0159.
[23] M. J. Rahman and X. Jie, “Fraud detection using fraud triangle theory: Evidence from China,” J. Financial Crime, vol. 31, no. 1, pp. 101–118, Jan. 2024, doi: 10.1108/jfc-09-2022-0219.
[24] T. Achmad, I. Ghozali, and I. D. Pamungkas, “Hexagon fraud: Detection of fraudulent financial reporting in state-owned enterprises Indonesia,” Economies, vol. 10, no. 1, p. 13, Jan. 2022.
[25] Y. Wang, M. Yu, and S. Gao, “Gender diversity and financial statement fraud,” J. Accounting Public Policy, vol. 41, no. 2, Mar. 2022, Art. no. 106903.
[26] J. Hendieh, M. Schneider, and T. Sakr, “Fraud detection and prevention,” Middle-East J. Sci. Res., vol. 31, no. 1, pp. 44–52, 2023.
[27] A. Maniatis, “Detecting the probability of financial fraud due to earnings manipulation in companies listed in Athens stock exchange market,” J. Financial Crime, vol. 29, no. 2, pp. 603–619, Mar. 2022.
[28] Y. Gong, J. Li, Z. Xu, and G. Li, “Detecting financial fraud using two types of Benford factors: Evidence from China,” Proc. Comput. Sci., vol. 214, pp. 656–663, Jan. 2022, doi: 10.1016/j.procs.2022.11.225.
[29] P. Kagias, A. Cheliatsidou, A. Garefalakis, J. Azibi, and N. Sariannidis, “The fraud triangle – an alternative approach,” J. Financial Crime, vol. 29, no. 3, pp. 908–924, May 2022, doi: 10.1108/jfc-07-2021-0159.
[30] T. Tarjo, H. V. Vidyantha, A. Anggono, R. Yuliana, and S. Musyarofah, “The effect of enterprise risk management on prevention and detection fraud in Indonesia’s local government,” Cogent Econ. Finance, vol. 10, no. 1, Dec. 2022, Art. no. 2101222, doi: 10.1080/23322039.2022.2101222.
[31] B. Stojanović and J. Božić, “Robust financial fraud alerting system based in the cloud environment,” Sensors, vol. 22, no. 23, p. 9461, Dec. 2022, doi: 10.3390/s22239461.
[32] Y. Yao and J. Li, “Operational risk assessment of third-party payment
[34] N. I. Mat Ridzuan, J. Said, F. M. Razali, D. I. Abdul Manan, and N. Sulaiman, “Examining the role of personality traits, digital technology skills and competency on the effectiveness of fraud risk assessment among external auditors,” J. Risk Financial Manage., vol. 15, no. 11, p. 536, Nov. 2022, doi: 10.3390/jrfm15110536.
[35] O. E. Akinbowale, H. E. Klingelhöfer, and M. F. Zerihun, “Application of forensic accounting techniques in the south African banking industry for the purpose of fraud risk mitigation,” Cogent Econ. Finance, vol. 11, no. 1, Dec. 2023, Art. no. 2153412, doi: 10.1080/23322039.2022.2153412.
[36] H. Wang, W. Wang, Y. Liu, and B. Alidaee, “Integrating machine learning algorithms with quantum annealing solvers for online fraud detection,” IEEE Access, vol. 10, pp. 75908–75917, 2022.
[37] N. Innan, A. Sawaika, A. Dhor, S. Dutta, S. Thota, H. Gokal, N. Patel, M. A.-Z. Khan, I. Theodonis, and M. Bennai, “Financial fraud detection using quantum graph neural networks,” Quantum Mach. Intell., vol. 6, no. 1, pp. 1–18, Jun. 2024.
[38] A. Alwadain, R. F. Ali, and A. Muneer, “Estimating financial fraud through transaction-level features and machine learning,” Mathematics, vol. 11, no. 5, p. 1184, Feb. 2023.
[39] L. Ren, R. Hu, D. Li, Y. Liu, J. Wu, Y. Zang, and W. Hu, “Dynamic graph neural network-based fraud detectors against collaborative fraudsters,” Knowl.-Based Syst., vol. 278, Oct. 2023, Art. no. 110888.
[40] V. Chang, L. M. T. Doan, A. Di Stefano, Z. Sun, and G. Fortino, “Digital payment fraud detection methods in digital ages and industry 4.0,” Comput. Electr. Eng., vol. 100, May 2022, Art. no. 107734, doi: 10.1016/j.compeleceng.2022.107734.
[41] J. Chung and K. Lee, “Credit card fraud detection: An improved strategy for high recall using KNN, LDA, and linear regression,” Sensors, vol. 23, no. 18, p. 7788, Sep. 2023, doi: 10.3390/s23187788.
[42] V. S. S. Karthik, A. Mishra, and U. S. Reddy, “Credit card fraud detection by modelling behaviour pattern using hybrid ensemble model,” Arabian J. Sci. Eng., vol. 47, no. 2, pp. 1987–1997, Feb. 2022, doi: 10.1007/s13369-021-06147-9.
[43] H. Ahmad, B. Kasasbeh, B. Aldabaybah, and E. Rawashdeh, “Class balancing framework for credit card fraud detection based on clustering and similarity-based selection (SBS),” Int. J. Inf. Technol., vol. 15, no. 1, pp. 325–333, Jan. 2023, doi: 10.1007/s41870-022-00987-w.
[44] E. Esenogho, I. D. Mienye, T. G. Swart, K. Aruleba, and G. Obaido, “A neural network ensemble with feature engineering for improved credit card fraud detection,” IEEE Access, vol. 10, pp. 16400–16407, 2022, doi: 10.1109/ACCESS.2022.3148298.
[45] J. Jiang, F. Liu, Y. Liu, Q. Tang, B. Wang, G. Zhong, and W. Wang, “A dynamic ensemble algorithm for anomaly detection in IoT imbalanced data streams,” Comput. Commun., vol. 194, pp. 250–257, Oct. 2022, doi: 10.1016/j.comcom.2022.07.034.
[46] D. Sisodia and D. S. Sisodia, “Data sampling strategies for click fraud detection using imbalanced user click data of online advertising: An empirical review,” IETE Tech. Rev., vol. 39, no. 4, pp. 789–798, Jul. 2022, doi: 10.1080/02564602.2021.1915892.
[47] A. Singh, R. K. Ranjan, and A. Tiwari, “Credit card fraud detection under extreme imbalanced data: A comparative study of data-level algorithms,” J. Exp. Theor. Artif. Intell., vol. 34, no. 4, pp. 571–598, Jul. 2022, doi: 10.1080/0952813x.2021.1907795.
[48] A. J. McNeil, R. Frey, and P. Embrechts, “Quantitative risk management: Concepts, techniques and tools, Revised edition,” in Princeton Series in Finance. Princeton, NJ, USA: Princeton Univ. Press, 2015.
[49] D. Gorton, “Modeling fraud prevention of online services using incident response trees and value at risk,” in Proc. 10th Int. Conf. Availability, Rel. Secur., Toulouse, France, Aug. 2015, pp. 149–158, doi: 10.1109/ARES.2015.17.
[50] Y. Lyu, F. Qin, R. Ke, Y. Wei, and M. Kong, “Does mixed frequency variables help to forecast value at risk in the crude oil market?” Resour. Policy, vol. 88, Jan. 2024, Art. no. 104426, doi: 10.1016/j.resourpol.2023.104426.
[51] S. B. Abdullahi and K. Chamnongthai, “IDF-sign: Addressing inconsistent depth features for dynamic sign word recognition,” IEEE Access, vol. 11, pp. 88511–88526, 2023.
[52] A. Mahajan, V. S. Baghel, and R. Jayaraman, “Credit card fraud detection using logistic regression with imbalanced dataset,” in Proc. 10th Int. Conf.
platforms: A case study of China,’’ Financial Innov., vol. 8, no. 1, p. 19, Comput. Sustain. Global Develop. (INDIACom), Mar. 2023, pp. 339–342.
Dec. 2022, doi: 10.1186/s40854-022-00332-x. [53] F. Aslam, A. I. Hunjra, Z. Ftiti, W. Louhichi, and T. Shams, ‘‘Insurance
[33] J. S. Kamdem and D. Selambi, ‘‘Cyber-risk forecasting using machine fraud detection: Evidence from artificial intelligence and machine learn-
learning models and generalized extreme value distributions,’’ Hal Sci., ing,’’ Res. Int. Bus. Finance, vol. 62, Dec. 2022, Art. no. 101744, doi:
vol. 1, pp. 1–23, Jan. 2022. 10.1016/j.ribaf.2022.101744.
[54] E. Ileberi, Y. Sun, and Z. Wang, “A machine learning based credit card fraud detection using the GA algorithm for feature selection,” J. Big Data, vol. 9, no. 1, p. 24, Dec. 2022, doi: 10.1186/s40537-022-00573-8.
[55] P. Atchaya and K. Somasundaram, “Novel logistic regression over Naive Bayes improves accuracy in credit card fraud detection,” J. Surv. Fisheries Sci., vol. 10, no. 1S, pp. 2172–2181, 2023.
[56] R. Bin Sulaiman, V. Schetinin, and P. Sant, “Review of machine learning approach on credit card fraud detection,” Hum.-Centric Intell. Syst., vol. 2, nos. 1–2, pp. 55–68, Jun. 2022, doi: 10.1007/s44230-022-00004-0.
[57] A. Kannagi, J. Gori Mohammed, S. Sabari Giri Murugan, and M. Varsha, “Intelligent mechanical systems and its applications on online fraud detection analysis using pattern recognition K-nearest neighbor algorithm for cloud security applications,” Mater. Today: Proc., vol. 81, pp. 745–749, 2023, doi: 10.1016/j.matpr.2021.04.228.
[58] P. H. Kupiec, “Techniques for verifying the accuracy of risk measurement models,” in Division of Research and Statistics, Division of Monetary Affairs, Federal Reserve Board, vol. 95. USA: Journal of Derivatives, 1995.
[59] K. Kireev, M. Andriushchenko, C. Troncoso, and N. Flammarion, “Transferable adversarial robustness for categorical data via universal robust embeddings,” 2023, arXiv:2306.04064.

ABDULLAHI UBALE USMAN received the B.Sc. degree in statistics from Kano University of Science and Technology, Wudil, Nigeria, in 2012, and the M.Sc. degree in statistics from Jodhpur National University, India, in 2016. He is currently pursuing the Ph.D. degree with the School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, China. His current research interests include financial fraud detection and machine learning.

SUNUSI BALA ABDULLAHI (Member, IEEE) received the B.Sc. and M.Sc. degrees in electronics from Bayero University Kano (BUK), Nigeria, and the Ph.D. degree in electrical and computer engineering from the King Mongkut’s University of Technology Thonburi, Thailand. His research interests include computer vision, artificial intelligence, digital image processing, nonlinear optimization and their applications in human motion analysis, multimodal data interaction analysis, and social signal processing.

YU LIPING received the Ph.D. degree. He is currently a Professor with Zhejiang Gongshang University. He is mainly involved in scientific and technological evaluation, technological innovation, and information management. He has six monographs. He is the first author for more than 170 articles. He has authored three articles in SCI and SSCI, 40 articles in first-class journals and 140 articles in CSSCI. The academic achievements were collected by Xinhua digest and seven copies of the NPC. An article was selected as Leader 5000-Top Academic Articles Platform for China’s Top Sci-Tech Journals (F5000).

BAYAN ALGHOFAILY received the master’s and Ph.D. degrees in computer science from Toronto Metropolitan University, Toronto, Canada. During that period, she was a member of the Distributed Applications and Broadband Networks Laboratory (DABNEL). She focused on studying how the performance of machine learning models is affected by dataset features. She is currently an Assistant Professor with the Department of Information System, CCIS, Prince Sultan University (PSU). She is also a member of the Artificial Intelligence and Data Analytics (AIDA) Laboratory, CCIS, PSU. Her research interests include AI, NLP, ML, and neural networks. She continues to explore this further in her research.

AHMED S. ALMASOUD received the degree from the University of Technology Sydney. He has been with Prince Sultan University (PSU), Riyadh, Saudi Arabia, since 2014, where he is currently an Assistant Professor with the College of Computer and Information Sciences. He has published original articles in the finest journals in the area of his studies. His research interests include, but are not limited to, artificial intelligence, machine learning, security architecture, and the Internet of Things.

AMJAD REHMAN (Senior Member, IEEE) received the Ph.D. degree from the Faculty of Computing, Universiti Teknologi Malaysia (UTM), Malaysia, specializing in information security using image processing techniques, in 2010. He is currently an Associate Professor with CCIS, Prince Sultan University, Riyadh, Saudi Arabia. He is also a PI in several projects and completed projects funded by MoHE Malaysia, Saudi Arabia. His research interests include bioinformatics, the IoT, information security, and pattern recognition. He received a Rector Award for the 2010 Best Student from UTM Malaysia.
