A Financial Statement Fraud Detection Model Based On Hybrid Data Mining Methods
Abstract: Financial statement fraud has long been a difficult problem for both the public and government regulators, so various data mining methods have been applied to financial statement fraud detection to provide decision support for stakeholders. The purpose of this study is to propose an optimized financial fraud detection model that combines feature selection with machine learning classification. The results indicate that random forest outperformed the other four methods. Of the two feature selection methods, Xgboost performed better. According to our experiments, 2 or 5 variables are the most suitable for the models in this paper.
Keywords: machine learning; financial statement fraud detection; feature selection
I. INTRODUCTION
The Association of Certified Fraud Examiners (ACFE) defines fraud as "A kind of misrepresentation or deception that an entity or individual makes knowing that it could result in some unauthorized benefit." According to a study conducted by the ACFE, financial statement fraud accounts for about 10% of white-collar crime. Once fraudulent accounting practices have occurred, various actions are taken to maintain the appearance of sustainability.
Because financial statement fraud causes huge property losses for investors, a large body of research has addressed this area using machine learning methods such as ANN[1-3], DT[4], SVM[5, 6], and text mining[7]. Meanwhile, other fraud problems such as credit card fraud[8, 9], internal transaction fraud[10] and insurance fraud[11] have also been investigated. Given the different characteristics of each type of financial fraud, specific methods have been developed[12]. This paper puts forward a hybrid detection model for financial statement fraud, which has the advantages of (1) combining financial and non-financial data, (2) using two feature selection methods, and (3) being easy to explain.
For this research, we chose 120 fraudulent financial statements disclosed by the CSRC during the period 2007–2016. Then, according to the industry and size of these companies, we selected matching non-fraudulent companies for contrast. The rest of this paper is organized as follows. In the next section, we review the previous literature on financial statement fraud detection. Section 3 introduces financial and non-financial variables. In section 4, we describe our experimental setting and procedure, and section 5 introduces the methods we used in this research, including classification methods and feature selection. Section 6 gives the results of the prediction. Finally, we discuss the contribution of this study and propose future work in section 7.
II. LITERATURE REVIEW
In the past, fraudulent financial statements were identified through expert analysis. Because of the huge volume and wide range of report data, experts could not fully analyze it, which led to many shortcomings in judgment. In recent years, data mining methods have been widely used in fraud detection to reduce the errors caused by experts’ judgment, including Internet fraud detection[13-15], telecommunications fraud detection[16, 17], financial fraud detection[18], and fraud in other areas. These frauds are hidden in huge amounts of information, and traditional expert analysis sometimes fails to take the whole picture into account; applying data mining methods helps solve this problem. These data mining methods include SVM[5, 6, 19], RF[20], DT[4], ANN, LR[21], etc. Most previous studies use only one or two data mining techniques, without comparing them with each other, and most overlook the importance of feature selection, which matters more than is commonly assumed.
As for the variables, previous studies usually picked them from annual reports. These variables need to be comprehensive and representative enough to cover all aspects of the company's operations. For example, quick ratio and liquidity ratio reflect solvency; sales growth rate and EPS represent profitability; and inventory turnover and total capital turnover indicate operating capacity. In addition, linguistic variables from the MD&A are used for sentiment analysis[22]. Petr Hajek found that, compared with non-fraudulent companies, fraudulent companies reported slightly higher negative sentiment in their annual reports. That is to say, fraudulent companies are more likely to use negative words.
The number of fraudulent companies used in previous studies ranged from tens to thousands, and the patterns of research in each country were slightly different. Most studies have adopted a matching method to match the non-fraudulent
Figure 1. The result of feature selection using PCA.
Xgboost is widely regarded as a trump card in data mining competitions. It can also be used for feature selection. The order of importance of all the variables is shown in Fig. 2.
model. ANN is inspired by the central nervous system of animals and models the way neurons process information. Its distributed structure gives it robustness comparable to that of the human brain, and ANN is also good at self-learning and self-organizing. Random forest improves on the decision tree by using a voting mechanism over multiple decision trees; it is in essence an ensemble algorithm. For example, if there are N samples and M variables (dimensions), the specific process of random forest is as follows: (1) Determine a value m, which indicates how many variables each tree classifier chooses. (2) Collect k samples from the data set and use them to create k tree classifiers. In addition, k sets of out-of-bag data (the samples left out of each draw) are generated to be used for testing later. (3) After a sample to be classified is entered, each tree classifier classifies it, and then all classifiers determine the classification result according to the majority rule.
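The random forest procedure described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the implementation used in this study: each "tree" is simplified to a one-split decision stump, the k samples are assumed to be bootstrap samples drawn with replacement, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, features):
    """Fit a one-split tree (stump) using only the given candidate variables."""
    default = majority(y)
    # Fallback stump: threshold +inf sends every row left, i.e. always predicts `default`.
    best_errors, best = len(y) + 1, (features[0], float("inf"), default, default)
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring observed values
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if errors < best_errors:
                best_errors, best = errors, (f, t, l_lab, r_lab)
    return best


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def fit_forest(X, y, k=25, m=1, seed=0):
    """Steps (1)-(2): draw k samples; each tree sees m randomly chosen variables."""
    rng = random.Random(seed)
    n, n_vars = len(X), len(X[0])
    forest = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (assumption)
        oob = set(range(n)) - set(idx)              # out-of-bag rows, kept for testing
        feats = rng.sample(range(n_vars), m)
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        forest.append((stump, oob))
    return forest


def predict_forest(forest, row):
    """Step (3): every tree votes and the majority label wins."""
    return majority([predict_stump(stump, row) for stump, _ in forest])


def oob_accuracy(forest, X, y):
    """Score each row using only the trees that did not train on it."""
    hits = total = 0
    for i, row in enumerate(X):
        votes = [predict_stump(s, row) for s, oob in forest if i in oob]
        if votes:
            total += 1
            hits += majority(votes) == y[i]
    return hits / total if total else float("nan")


# Toy data (hypothetical): variable 0 separates the classes, variable 1 is noise.
X = [[v, i % 3] for i, v in enumerate([0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15])]
y = [0] * 6 + [1] * 6
forest = fit_forest(X, y, k=25, m=2)
print(predict_forest(forest, [2, 1]), predict_forest(forest, [13, 1]))
print(oob_accuracy(forest, X, y))
```

Stumps keep the sketch short; a practical model would grow deeper trees and typically re-draw the m candidate variables at every split rather than once per tree.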
VI. EXPERIMENTAL RESULTS
First, we applied the machine learning methods to all the
variables; the accuracies of SVM, RF, DT, ANN, and LR are
shown in Fig. 3. SVM clearly performed better than the
others in this condition, while random forest had the lowest
accuracy among these methods.
In the same way, we tested the data using the variables
provided by RF; the results are in Fig. 5.
Random forest still performs well as the number of
selected variables grows. LR reaches the highest accuracy
when the number of selected variables is 6, but it is not stable.
Third, in order to test the two feature selection functions and
to find which variables are of great importance in machine
learning, we plotted the average of the five accuracies (Fig. 6).
We find that when the number of variables is 2 or 5, a
satisfactory result may be obtained.
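The averaging step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, and the accuracies are placeholder values for two made-up classifiers, not results from this paper.

```python
from statistics import mean


def average_accuracy_by_k(acc):
    """acc maps classifier name -> {number of variables k: accuracy}."""
    # Keep only the k values that every classifier was evaluated on.
    shared_ks = set.intersection(*(set(by_k) for by_k in acc.values()))
    return {k: mean(by_k[k] for by_k in acc.values()) for k in sorted(shared_ks)}


def best_k(avg, top=2):
    """Return the `top` values of k with the highest average accuracy."""
    return sorted(avg, key=avg.get, reverse=True)[:top]


# Placeholder accuracies for two hypothetical classifiers (NOT results from the paper).
toy_acc = {
    "clf_a": {1: 0.5, 2: 0.75, 3: 0.5},
    "clf_b": {1: 0.5, 2: 0.75, 3: 0.75},
}
avg = average_accuracy_by_k(toy_acc)
print(avg)
print(best_k(avg))
```

With five classifiers, `acc` would simply hold five inner dictionaries, one per method, and the same two calls would give the averaged curve and its best-performing variable counts.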
Figure 5. Results based on Xgboost.
Finally, to compare these five methods more intuitively, we put all the variable combinations into Fig. 7, from which we find that RF performs best.
Figure 7. Results of all variables’ combination.
VII. DISCUSSION
This article makes three contributions. First, it calculates and analyzes the factors that affect fraud behavior. According to our research, the effect of x12 and x13 on the model is confirmed by both feature selection methods. Operating profit margin is the ratio of an enterprise’s operating profit to its operating income. There is a high risk of financial fraud when a firm has a large loss and a low profit margin, which is reflected in profit-related indicators. EPS is generally used to measure the profit level of common shares and investment risk, and to reflect the operating results of an enterprise. In order to cover losses or exaggerate earnings, companies have a tendency to inflate EPS. Second, it considers the influence of the number of variables on the model. The results indicate that 2 or 5 variables are better than the others. Third, we compared the performance of five machine learning methods and found that, among them, random forest has the following advantages: (1) it is good at processing high-dimensional data; (2) it may avoid overfitting to some extent; (3) it has good robustness and stable results.
There are still shortcomings in this article: the data set is not large enough and contains only Chinese companies; the variables could be more varied and more innovative. Practical problems are far more complicated than the ones in this article, and more factors should be taken into consideration when designing variables.
REFERENCES
[1] Kirkos, E., C. Spathis, and Y. Manolopoulos, Data Mining
techniques for the detection of fraudulent financial statements.
Expert Systems with Applications, 2007. 32(4): p. 995-1003.
[2] Saha, S.C., C. Lei, and J.C. Patterson, Detecting the financial
statement fraud: The analysis of the differences between data mining
techniques and experts’ judgments. Knowledge-Based Systems,
2015. 89(9): p. 459-470.
[3] Wang, L. and C. Wu, A Combination of Models for Financial Crisis
Prediction: Integrating Probabilistic Neural Network with
Back-Propagation based on Adaptive Boosting. International Journal of
Computational Intelligence Systems, 2017. 10(1): p. 507.
[4] Kotsiantis, S., et al., Forecasting fraudulent financial statements
using data mining. Enformatika, 2006. 3(2): p. 104-110.
[5] Pai, P.F., M.F. Hsu, and M.C. Wang, A support vector
machine-based model for detecting top management fraud.
Knowledge-Based Systems, 2011. 24(2): p. 314-321.
[6] Huang, S.Y., Fraud Detection Model by Using Support Vector
Machine Techniques. International Journal of Digital Content
Technology & Its Applic, 2013.
[7] Cecchini, M., et al., Making words work: Using financial text as a
predictor of financial events. Decision Support Systems, 2011. 50(1):
p. 164-175.
[8] Bhattacharyya, S., et al., Data mining for credit card fraud: A
comparative study. Decision Support Systems, 2011. 50(3): p. 602-
613.
[9] Olszewski, D., Fraud detection using self-organizing map visualizing
the user profiles. Knowledge-Based Systems, 2014. 70(C): p. 324-
334.
[10] Jans, M., et al., A business process mining application for internal
transaction fraud mitigation. Expert Systems with Applications, 2011.
38(10): p. 13351-13359.
[11] Bermúdez, L., et al., A Bayesian dichotomous model with
asymmetric link for fraud in insurance. Insurance Mathematics &
Economics, 2008. 42(2): p. 779-786.
[12] Ngai, E.W.T., et al., The application of data mining techniques in
financial fraud detection: A classification framework and an
academic review of literature. 2013.
[13] Hoque, N., et al., Review: Network attacks: Taxonomy, tools and
systems. Journal of Network & Computer Applications, 2014. 40(1):
p. 307-324.
[14] Kumar, P.A.R. and S. Selvakumar, Detection of distributed denial of
service attacks using an ensemble of adaptive and hybrid neuro-
fuzzy systems. Computer Communications, 2013. 36(3): p. 303-319.
[15] Zhang, B., Y. Zhou, and C. Faloutsos. Toward a Comprehensive
Model in Internet Auction Fraud Detection. in Hawaii International
Conference on System Sciences, Proceedings of the. 2008.
[16] Olszewski, D., A probabilistic approach to fraud detection in
telecommunications. Knowledge-Based Systems, 2012. 26: p. 246-
258.
[17] Olszewski, D., Fraud Detection in Telecommunications Using
Kullback-Leibler Divergence and Latent Dirichlet Allocation.
Intelligent Data Analysis, 2011. 16(3): p. 467-485.
[18] Wang, L. and C. Wu, Business failure prediction based on two-stage
selective ensemble with manifold learning algorithm and kernel-
based fuzzy self-organizing map. Knowledge-Based Systems, 2017.
121: p. 99-110.
[19] Yeh, C.C., et al., A Hybrid Detecting Fraudulent Financial
Statements Model Using Rough Set Theory and Support Vector
Machines. Journal of Cybernetics, 2016. 47(4): p. 261-276.
[20] Liu, C., et al., Financial Fraud Detection Model: Based on Random
Forest. International Journal of Economics & Finance, 2015. 7(7): p.
178-188.
[21] Dechow, P.M., et al., Predicting Material Accounting Misstatements.
Social Science Electronic Publishing, 2011. 28(1): p. 17–82.
[22] Petr Hajek, R.H., Mining corporate annual reports for intelligent
detection of financial statement fraud –A comparative study of
machine learning methods. Knowledge-Based Systems, 2017. 128: p.
139-152.
[23] Kim, Y.J., B. Baik, and S. Cho, Detecting financial misstatements
with fraud intention using multi-class cost-sensitive learning. Expert
Systems with Applications, 2016. 62: p. 32-43.
[24] Guyon, I. and A. Elisseeff, An Introduction to Feature Extraction.
2006: Springer Berlin Heidelberg. 1-25.