CHAPTER 1
INTRODUCTION
Fraud detection concerns a large number of financial institutions and banks, as this crime costs them around $67 billion per year. There are different types of fraud: insurance fraud, credit card fraud, statement fraud, securities fraud, etc. Of all of them, credit card fraud is the most common type. It is defined as an unauthorized use of a credit card account. It occurs when the cardholder and the card issuer are not aware that the card is being used by a third party. The fraudsters can obtain goods without paying, or gain illegal access to funds from an account. Credit card fraud is classified into different types based on the nature of the fraudulent activities.
Simple theft (offline fraud): a stolen card is the most straightforward type of credit card fraud. It is also the fastest to be detected.
Application fraud: individuals obtain new credit cards using false personal information.
Bankruptcy fraud: using a credit card while insolvent, purchasing goods while knowing that one is not able to pay. This type can be prevented with credit scoring techniques.
Internal fraud: bank employees steal card details and use them remotely.
In the current state of the world, financial organizations expand the availability of financial facilities by employing innovative services such as credit cards, Automated Teller Machines (ATMs), and internet and mobile banking. Along with the rapid advances of e-commerce, the use of credit cards has become a convenient and necessary part of financial life. A credit card is a payment card supplied to customers as a system of payment. There are many advantages in using credit cards, such as:
Ease of purchase: Credit cards can make life easier. They allow customers to purchase on credit at an arbitrary time, location and amount without carrying cash, and they provide a convenient payment method for purchases made on the internet, over the telephone, through ATMs, etc.
Keeping a customer credit history: Having a good credit history is often important in identifying loyal customers. This history is valuable not only for credit cards, but also for other financial services like loans, rental applications, or even some jobs. Lenders and issuers of credit, such as mortgage companies, credit card companies, retail stores, and utility companies, can review a customer's credit score and history to see how punctual and responsible customers are in paying back their debts.
Protection of purchases: Credit cards may also offer customers additional protection if the purchased merchandise becomes lost, damaged, or stolen. Both the buyer's credit card statement and the card company can confirm that the customer made the purchase if the original receipt is lost or stolen. In addition, some credit card companies provide insurance for large purchases.
In spite of all the mentioned advantages, fraud is a serious issue in e-banking services that threatens credit card transactions in particular. Fraud is an intentional deception with the purpose of obtaining financial gain or causing loss by an implicit or explicit trick; it is a public law violation in which the fraudster gains an unlawful advantage or causes unlawful damage. Estimates of the damage caused by fraudulent activities indicate that fraud costs a very considerable sum of money. Credit card fraud is increasing significantly with the development of modern technology, resulting in the loss of billions of dollars worldwide each year. Statistics from the Internet Crime Complaint Center show that there has been a significant rise in reported fraud in the last decade. Financial losses caused by online fraud in the US alone were reported to be $3.4 billion in 2011.
Fraud detection involves identifying scarce fraud activities among numerous legitimate transactions as quickly as possible. Fraud detection methods are developing rapidly in order to adapt to new fraudulent strategies emerging across the world. However, the development of new fraud detection techniques is made more difficult by the severe limitation on the exchange of ideas in fraud detection. Moreover, fraud detection is essentially a rare event problem, which has been variously called outlier analysis, anomaly detection, exception mining, mining rare classes, mining imbalanced data, etc. The number of fraudulent transactions is usually a very low fraction of the total transactions. Hence the task of detecting fraudulent transactions in an accurate and efficient manner is fairly difficult and challenging. Therefore, the development of efficient methods which can distinguish rare fraud activities from billions of legitimate transactions is essential.
Although credit card fraud detection has gained attention and extensive study, especially in recent years, existing surveys of this kind of fraud neither classify all credit card fraud detection techniques nor analyse the datasets and attributes involved. Therefore, we attempt to collect and integrate a complete set of studies from the literature and analyse them from various aspects.
To the best of our knowledge, the absence of a complete and detailed credit card fraud detection survey is an important issue, which is addressed here by analysing the state of the art in credit card fraud detection.
The state-of-the-art fraud detection techniques are described and classified from different aspects, such as supervised versus unsupervised learning and numerical versus categorical data.
In credit card fraud research, each researcher has used their own dataset; there is no standard dataset or benchmark to evaluate detection methods. We attempt to gather the different datasets investigated by researchers, categorize them into real and synthesized groups, and extract the common attributes that affect the quality of detection.
Illegal use of a credit card or its information without the knowledge of the owner is referred to as credit card fraud. Credit card fraud tricks belong mainly to two groups: application fraud and behavioural fraud [3]. Application fraud takes place when fraudsters apply for new cards from a bank or issuing company using false or other people's information. Multiple applications may be submitted by one user with one set of user details (called duplication fraud) or by different users with identical details (called identity fraud).
Behavioural fraud, on the other hand, has four principal types: stolen/lost card, mail theft, counterfeit card and 'card holder not present' fraud. Stolen/lost card fraud occurs when fraudsters steal a credit card or gain access to a lost card. Mail theft fraud occurs when the fraudster intercepts a credit card or personal information in the mail before it reaches the actual cardholder. In both counterfeit and 'card holder not present' fraud, credit card details are obtained without the knowledge of the card holder. In counterfeit card fraud, counterfeit cards are made based on the card information, while in 'card holder not present' fraud remote transactions are conducted using the card details through mail, phone, or the Internet.
Based on statistical data from 2012, the high-risk countries facing credit card fraud threats are illustrated in Fig. 1. Ukraine has the highest fraud rate at a staggering 19%, closely followed by Indonesia at 18.3%. After these two, Yugoslavia, with a rate of 17.8%, is the riskiest country. The next highest fraud rates belong to Malaysia (5.9%), Turkey (9%) and finally the United States. Other countries prone to credit card fraud, with rates below 1%, are not shown.
The first group of techniques deals with supervised classification at the transaction level. In these methods, transactions are labelled as fraudulent or normal based on previous historical data. This dataset is then used to create classification models which can predict the state (normal or fraud) of new records. There are numerous model creation methods for a typical two-class classification task, such as rule induction [1], decision trees [2] and neural networks [3]. This approach has been proven to reliably detect most fraud tricks which have been observed before [4]; it is also known as misuse detection.
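As a concrete illustration of this supervised (misuse detection) setting, the following is a minimal sketch that trains a decision tree on historical transactions labelled as fraud or normal. The file name "transactions.csv" and the "Class" column are illustrative assumptions, not part of the surveyed work.

```python
# Minimal sketch of misuse (supervised) detection: a decision tree trained on
# transactions previously labelled as fraud (1) or normal (0).
# The file name and column names are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("transactions.csv")          # historical, labelled transactions
X = data.drop(columns=["Class"])                # transaction attributes
y = data["Class"]                               # 1 = fraud, 0 = normal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train, y_train)                       # learn the observed fraud/normal patterns
print(clf.predict(X_test[:5]))                  # predict the state of new records
```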
The second approach deals with unsupervised methodologies which are based on account behaviour. In this method a transaction is flagged as fraudulent if it contrasts with the user's normal behaviour. This is because we do not expect fraudsters to behave the same as the account owner or to be aware of the owner's behaviour model. To this end, we need to extract the legitimate user behavioural model (e.g. a user profile) for each account and then detect fraudulent activities according to it. By comparing new behaviours with this model, activities that are different enough are flagged as frauds. The profiles may contain activity information of the account, such as merchant types, amounts, locations and times of transactions. This method is also known as anomaly detection.
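A minimal sketch of this profile-based idea follows: it builds a simple per-account profile from legitimate history and flags transactions whose amount deviates strongly from it. The file and column names are assumptions for illustration; a real system would profile many more attributes (merchant type, location, time).

```python
# Minimal sketch of user-behaviour (anomaly) detection: build a simple profile of
# each account's legitimate spending and flag transactions that deviate from it.
# Column names ("account_id", "amount") and the file name are assumptions.
import pandas as pd

history = pd.read_csv("legitimate_history.csv")             # past legitimate activity
profiles = history.groupby("account_id")["amount"].agg(["mean", "std"])

def is_suspicious(account_id, amount, threshold=3.0):
    """Flag a transaction whose amount is far from the account's usual behaviour."""
    mean, std = profiles.loc[account_id]
    if std == 0:
        return amount != mean
    return abs(amount - mean) / std > threshold             # simple z-score rule

print(is_suspicious(account_id=1001, amount=2500.0))
```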
It is important to highlight the key differences between the user behaviour analysis and fraud analysis approaches. The fraud analysis method can detect known fraud tricks with a low false positive rate. These systems extract the signatures and models of the fraud tricks present in a labelled dataset and can then determine exactly which frauds the system is currently experiencing. If the test data does not contain any fraud signatures, no alarm is raised; thus the false positive rate can be kept extremely low. However, since the learning of a fraud analysis system (i.e. a classifier) is based on limited and specific fraud records, it cannot detect novel frauds. As a result, the false negative rate may be extremely high, depending on how ingenious the fraudsters are. User behaviour analysis, on the other hand, greatly addresses the problem of detecting novel frauds. These methods do not search for specific fraud patterns, but rather compare incoming activities with the constructed model of legitimate user behaviour. Any activity that is different enough from the model will be considered a possible fraud. Although user behaviour analysis approaches are powerful in detecting innovative frauds, they suffer from high false alarm rates. Moreover, if a fraud occurs during the training phase, this fraudulent behaviour will be entered into the baseline model and assumed to be normal in further analysis.
In this report we briefly introduce some current fraud detection techniques which are applied to credit card fraud detection tasks; the main advantages and disadvantages of each approach are also discussed.
CHAPTER 2
LITERATURE SURVEY
The literature survey is the most important step in the software development process. Before developing the tool it is necessary to determine the time factor, economy and company strength. Once these things are satisfied, the next steps are to determine which operating system and programming language can be used for developing the tool. Once the programmer starts building the tool, the programmer needs a lot of external support. This support can be obtained from senior programmers, from books and from websites. Before building the system, the above considerations are taken into account for developing the proposed system.
Some efforts have been reported in the literature on credit card fraud detection. Early rule-based expert systems rely on situations established beforehand and do not take into account the variable nature of fraud; they do not consider the individual characteristics of cardholders' behaviour; and the control of such a rule-based system is a rather complex task for the expert.
CHAPTER 3
PROBLEM STATEMENT
System Analysis
System analysis and design is the application of the systems approach to problem solving, generally using computers. To reconstruct a system, the analyst must consider its elements: outputs and inputs, processors, controls, feedback and environment.
Analysis
Analysis is a detailed study of the various operations performed by a system and of their relationships within and outside the system. One aspect of analysis is defining the boundaries of the system and determining whether or not a candidate system should consider other related systems. During analysis, data are collected on the available files, decision points and transactions handled by the present system. This involves gathering information and using structured tools for analysis.
The problem is summarized as follows: using imbalanced classification approaches, the number of
false alarms generated is higher than the number of frauds that are detected.
In the existing system, research has been reported on a case study involving credit card fraud detection, where data normalization is applied before cluster analysis. The results obtained from the use of cluster analysis and artificial neural networks on fraud detection have shown that by clustering attributes the neural network inputs can be minimized, and that promising results can be obtained when normalized data are used to train an MLP.
This research was based on unsupervised learning. The significance of this paper was to find new methods for fraud detection and to increase the accuracy of the results. The dataset for this paper is based on real-life transactional data from a large European company, and the personal details in the data are kept confidential.
The accuracy of the algorithm is around 50%. The significance of this paper was to find an algorithm that reduces the cost measure; the cost was reduced by 23%, and the algorithm found was Bayes minimum risk.
Another problematic issue in credit card fraud detection is the scarcity of available data due to confidentiality issues, which gives the community little chance to share real datasets and assess existing techniques.
Fraud detection systems are prone to several difficulties and challenges, enumerated below. An effective fraud detection technique should be able to address these difficulties in order to achieve the best performance.
Fraud detection cost: The system should take into account both the cost of fraudulent
behavior that is detected and the cost of preventing it.
Imbalanced data: Credit card fraud detection data is imbalanced by nature: only a very small percentage of all credit card transactions is fraudulent. This makes the detection of fraudulent transactions very difficult and imprecise (a minimal sketch of coping with this imbalance follows this list).
Nonexistence of a standard algorithm: No single algorithm known in the credit card fraud literature outperforms all others. Each technique has its own advantages and disadvantages; combining effective algorithms so that they support each other's strengths and cover each other's weaknesses would be of great interest.
Nonexistence of suitable metrics: The lack of good metrics for evaluating the results of fraud detection systems is still an open issue. Without such metrics, researchers and practitioners cannot compare different approaches or determine which fraud detection system is the most efficient.
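The sketch below illustrates the imbalanced-data point above: it shows how rare the positive class typically is and one common mitigation, re-weighting the classes and scoring with AUPRC instead of raw accuracy. The file and column names are assumptions.

```python
# Minimal sketch of coping with the imbalanced nature of fraud data, assuming a
# "Class" column where fraud is the rare positive class. class_weight="balanced"
# re-weights errors on the minority class instead of optimising raw accuracy.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

data = pd.read_csv("transactions.csv")
X, y = data.drop(columns=["Class"]), data["Class"]
print(y.value_counts(normalize=True))            # typically far below 1% fraud

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
print("AUPRC:", average_precision_score(y_te, scores))   # better suited than accuracy
```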
Billions of dollars are lost every year to fraudulent credit card transactions. Fraud is as old as humanity itself and can take an unlimited variety of forms. The PwC global economic crime survey of 2017 suggests that approximately 48% of organizations experienced economic crime. Therefore, there is definitely an urgent need to solve the problem of credit card fraud detection. Moreover, the development of new technologies provides additional ways in which criminals may commit fraud. The use of credit cards is prevalent in modern society and credit card fraud has kept growing in recent years. The resulting huge financial losses affect not only merchants and banks, but also the individuals who use the cards. Fraud may also affect the reputation and image of a merchant, causing non-financial losses that, though difficult to quantify in the short term, may become visible in the long term. For example, if a cardholder is a victim of fraud with a certain company, he may no longer trust its business and choose a competitor.
The credit card fraud detection problem consists of modelling past credit card transactions with the knowledge of which ones turned out to be fraud. This model is then used to identify whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing the incorrect fraud classifications. Misuse detection uses classification methods to determine whether an incoming transaction is fraud or not. Usually, such an approach has to know about the existing types of fraud in order to build models by learning the various fraud patterns. Anomaly detection builds a profile of the normal transaction behaviour of a cardholder based on his/her historical transaction data, and flags a new transaction as a potential fraud if it deviates from that normal behaviour. However, an anomaly detection method needs enough consecutive sample data to characterize the normal transaction behaviour of a cardholder.
• Several solutions have been proposed in a large body of work which, to the best of our knowledge, are built on machine learning algorithms.
• As this is a classification paradigm, only classification algorithms that can differentiate fraud and non-fraud transactions are utilized.
• Support Vector Machine, Logistic Regression and Random Forest algorithms are used, as they are the best methods according to the three considered performance measures (Accuracy, Sensitivity and AUPRC).
• A comparative analysis on the desired parameters, such as the confusion matrix, F-measure, precision, accuracy, intervention and recall, is used to compare these algorithms.
• We will develop a model for the class imbalance problem to find a trade-off between sensitivity and accuracy.
• The tabulated results will depict the differences between the algorithms, to realize the trade-offs among them.
In the proposed system, we apply the random forest algorithm to classify the credit card dataset. Random Forest is an algorithm for classification and regression; in summary, it is a collection of decision tree classifiers. Random forest has an advantage over a single decision tree in that it corrects the tendency of decision trees to overfit their training set. A subset of the training set is sampled randomly to train each individual tree, and a decision tree is built; each node is then split on a feature selected from a random subset of the full feature set. Even for large datasets with many features and data instances, training is extremely fast in random forest, because each tree is trained independently of the others. The Random Forest algorithm has been found to provide a good estimate of the generalization error and to be resistant to overfitting.
Advantages
Random forest ranks the importance of variables in a regression or classification problem in a natural way.
The 'Amount' feature is the transaction amount. The 'Class' feature is the target class for the binary classification; it takes the value 1 for the positive case (fraud) and 0 for the negative case (non-fraud). A minimal training sketch on such a dataset follows.
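The sketch below applies a Random Forest classifier to a credit card dataset with an "Amount" feature and a binary "Class" target; the file name "creditcard.csv" and the exact column layout are assumptions.

```python
# Minimal sketch of the proposed approach: Random Forest on a credit card dataset
# with an "Amount" feature and a "Class" target (1 = fraud, 0 = non-fraud).
# The file name is an illustrative assumption.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("creditcard.csv")
X, y = data.drop(columns=["Class"]), data["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_tr, y_tr)                               # each tree trained on a bootstrap sample
print(rf.score(X_te, y_te))                      # accuracy on unseen transactions

# Ranking the importance of variables, as mentioned above
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())
```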
3.3 Objectives
• Different machine learning methods are utilized, including the K-means clustering algorithm among the unsupervised learning algorithms, the Random Forest algorithm among the regression-based algorithms, and Support Vector Machines (SVM) (linear, RBF and sigmoid kernels) and ANN among the supervised algorithms.
• The objectives of credit card fraud detection are to reduce losses due to payment fraud for both merchants and issuing banks, and to increase revenue opportunities for merchants.
• The aim is to reduce false alarms and thereby also increase accuracy.
• Our goal is to identify the issues that must be solved to produce a highly efficient solution to the class imbalance problem.
• Our aim here is to detect 100% of the fraudulent transactions while minimizing the incorrect
fraud classifications.
• Performance evaluation using the confusion matrix, F-measure, precision, accuracy, intervention and recall is used to compare these algorithms; a Python and SKlearn based implementation is carried out and the results are tabulated. A short example of computing these metrics follows this list.
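A minimal sketch of this evaluation step, using scikit-learn's metric functions on placeholder test labels and predictions:

```python
# Minimal sketch of the evaluation described above: confusion matrix, precision,
# recall, F-measure and accuracy computed with scikit-learn.
# y_true / y_pred are illustrative placeholders, not real results.
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, accuracy_score)

y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]          # actual classes (1 = fraud)
y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]          # classes predicted by a model

print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
```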
CHAPTER 4
SYSTEM REQUIREMENTS
Hardware Requirements
RAM – 4GB
i3 or i5 processor
Software Requirements
Anaconda
CHAPTER 5
SYSTEM ARCHITECTURE
Frequent itemsets are sets of items that occur together in at least as many transactions as the user-defined minimum support. The support of an itemset is defined as the fraction of records of the database that contain that itemset.
This means that fraudsters are intruding into customer accounts only after learning their genuine behaviour. Therefore, instead of finding a common pattern for fraudster behaviour, it is more valid to identify fraud patterns for each customer. Thus, in this research, we construct two patterns for each customer: a legal pattern and a fraud pattern. When frequent pattern mining is applied to the credit card transaction data of a particular customer, it returns sets of attributes showing the same values in a group of transactions specified by the support.
Generally, frequent pattern mining algorithms such as Apriori return many such groups, and the longest group, containing the maximum number of attributes, is selected as that particular customer's legal pattern. The training (pattern recognition) algorithm is given below (a minimal sketch of the frequent-itemset step follows the algorithm):
Step 1. Separate each customer's transactions from the whole transaction database.
Step 2. From each customer’s transactions separate his/her legal and fraud transactions.
Step 3. Apply Apriori algorithm to the set of legal transactions of each customer. The Apriori
algorithm returns a set of frequent item sets. Take the largest frequent itemset as the legal pattern
corresponding to that customer. Store these legal patterns in legal pattern database.
Step 4. Apply Apriori algorithm to the set of fraud transactions of each customer. The Apriori
algorithm returns a set of frequent item sets. Take the largest frequent itemset as the fraud pattern
corresponding to that customer. Store these fraud patterns in fraud pattern database.
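The following minimal sketch illustrates Steps 3 and 4 for a single customer. The use of the mlxtend library and the attribute=value encoding are assumptions made for illustration; the original method only requires an Apriori implementation.

```python
# Minimal sketch: find the largest frequent itemset of one customer's (legal or
# fraud) transactions with Apriori. The mlxtend library and the attribute names
# are assumptions, not part of the original method.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Each transaction is a set of attribute=value items for one customer
transactions = [
    ["merchant=grocery", "amount=low", "time=day"],
    ["merchant=grocery", "amount=low", "time=night"],
    ["merchant=grocery", "amount=low", "time=day"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
frequent = apriori(onehot, min_support=0.6, use_colnames=True)

# Take the largest frequent itemset as this customer's pattern
pattern = max(frequent["itemsets"], key=len)
print(pattern)
```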
After finding the legal and fraud patterns for each customer, the fraud detection system traverses these fraud and legal pattern databases in order to detect frauds. These pattern databases are much smaller than the original customer transaction databases, as they contain only one record per customer. This research proposes a matching algorithm which traverses the pattern databases for a match with the incoming transaction in order to detect fraud. If a closer match is found with the legal pattern of the corresponding customer, the matching algorithm returns "0", giving a green signal to the bank to allow the transaction. If a closer match is found with the fraud pattern of the corresponding customer, the matching algorithm returns "1", giving an alarm to the bank to stop the transaction.
The size of the pattern databases is (number of customers) × (number of attributes). The matching (testing) algorithm is explained below; a minimal sketch of this matching procedure follows it.
Step 1. Count the number of attributes in the incoming transaction that match the legal pattern of the corresponding customer; call this count L.
Step 2. Count the number of attributes in the incoming transaction that match the fraud pattern of the corresponding customer; call this count F.
Step 3. If F = 0 and L exceeds the user-defined matching percentage, then the incoming transaction is legal.
Step 4. If L = 0 and F exceeds the user-defined matching percentage, then the incoming transaction is fraud.
Step 5. If both L and F meet the matching percentage, the closer match decides the outcome.
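A minimal sketch of this matching procedure follows. Representing patterns as sets of attribute=value items, the 60% matching percentage and the tie-breaking rule are assumptions made for illustration.

```python
# Minimal sketch of the matching (testing) algorithm reconstructed above.
# Patterns are sets of attribute=value items; the matching percentage of 60%
# and the helper names are illustrative assumptions.
def match_count(transaction, pattern):
    """Number of attributes of the incoming transaction that match the pattern."""
    return len(set(transaction) & set(pattern))

def classify(transaction, legal_pattern, fraud_pattern, matching_pct=0.6):
    legal = match_count(transaction, legal_pattern)
    fraud = match_count(transaction, fraud_pattern)
    if fraud == 0 and legal >= matching_pct * len(legal_pattern):
        return 0                      # green signal: allow the transaction
    if legal == 0 and fraud >= matching_pct * len(fraud_pattern):
        return 1                      # alarm: stop the transaction
    return 1 if fraud > legal else 0  # otherwise the closer match decides

incoming = ["merchant=grocery", "amount=high", "time=night"]
print(classify(incoming,
               legal_pattern=["merchant=grocery", "amount=low", "time=day"],
               fraud_pattern=["amount=high", "time=night"]))
```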
CHAPTER 6
ALGORITHMS
Unsupervised techniques
The unsupervised techniques do not need previous knowledge of fraudulent and normal records. These methods raise an alarm for those transactions that are most dissimilar from the normal ones, and they are often used in the user behaviour approach. ANNs can produce acceptable results for a sufficiently large transaction dataset, but they need a long training phase. The self-organizing map (SOM) is one of the most popular unsupervised neural network learning methods. SOM provides a clustering method which is appropriate for constructing and analysing customer profiles in credit card fraud detection. SOM operates in two phases: training and mapping. In the former phase, the map is built and the weights of the neurons are updated iteratively based on input samples; in the latter, test data are classified automatically into normal and fraudulent classes through the mapping procedure. After training the SOM, new unseen transactions are compared to the normal and fraud clusters: if a transaction is similar to the normal records, it is classified as normal, and new fraud transactions are detected similarly.
One of the advantages of using unsupervised neural networks over similar techniques is that these methods can learn from a data stream. The more data passed to a SOM model, the more the result adapts and improves; more specifically, the SOM adapts its model as time passes. Therefore it can be used and updated online in banks or other financial corporations, and as a result the fraudulent use of a card can be detected quickly and effectively. However, neural networks have some drawbacks and difficulties, mainly related to specifying a suitable architecture on the one hand and the excessive training required to reach the best performance on the other.
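A minimal sketch of SOM training and mapping follows, using the third-party MiniSom library (an assumption; the text does not prescribe a specific implementation) on placeholder transaction features.

```python
# Minimal sketch of SOM-based profiling with the MiniSom library (an assumption).
# After training, a new transaction is mapped to its best-matching unit and can be
# compared with the clusters formed by normal activity.
import numpy as np
from minisom import MiniSom

X = np.random.rand(1000, 4)                 # placeholder transaction features, scaled to [0, 1]

som = MiniSom(10, 10, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=42)
som.random_weights_init(X)
som.train_random(X, num_iteration=5000)     # training phase: weights updated iteratively

new_tx = np.random.rand(4)
print(som.winner(new_tx))                   # mapping phase: coordinates of the winning neuron
```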
Hybrid supervised and unsupervised techniques
In addition to supervised and unsupervised learning models of neural networks, some researchers
have applied hybrid models. John ZhongLei et.Al. proposed hybrid supervised (SICLN) and
unsupervised (ICLN)learning network for credit card fraud detection. They improved the reward only
rule of SICLN model to ICLN in order to update weights according to both reward and penalty. This
improvement appeared in terms of increasing stability and reducing the training time. Moreover, the
number of final clusters of the ICLN is independent from the number of initial network neurons. As a
result the inoperable neurons can be omitted from the clusters by applying the penalty rule. The
results indicated that both the ICLN and the SICLN have high performance, but the SICL outperforms
well-known unsupervised clustering algorithms.
The support vector machine (SVM) takes labelled samples as input; it gives accuracy comparable to sophisticated neural networks with elaborate features in a handwriting recognition task. It is also used in many applications, such as handwriting analysis and face analysis, especially for pattern classification and regression based applications.
The foundations of Support Vector Machines (SVM) were developed by Vapnik, and SVMs have gained popularity due to many promising features such as better empirical performance. The formulation uses the Structural Risk Minimization (SRM) principle, which has been shown to be superior to the traditional Empirical Risk Minimization (ERM) principle used by conventional neural networks. SRM minimizes an upper bound on the expected risk, whereas ERM minimizes the error on the training data. It is this difference which equips SVMs with a greater ability to generalize, which is the goal in statistical learning. SVMs were developed to solve the classification problem, but recently they have been extended to solve regression problems.
Early work with neural networks for supervised and unsupervised learning showed good results for such learning applications. MLPs use feed-forward and recurrent networks. Multilayer perceptron (MLP) properties include universal approximation of continuous nonlinear functions, learning from input-output patterns, and support for advanced network architectures.
These are simple visualizations, just to give an overview of how a neural network looks. Some issues can be noticed: neural networks can have many local minima, and deciding how many neurons are needed for a task is another issue which determines whether the optimum of that NN is reached. Another thing to note is that even if the neural network solutions tend to converge, this may not result in a unique solution [11]. Now let us look at another example where we plot the data and try to classify it; we see that there are many hyperplanes which can classify it, but which one is better? This is where the need for SVM arises. (The legends are not described, as these are sample plots intended only to illustrate the concepts involved.) From the above illustration, there are many linear classifiers (hyperplanes) that separate the data; however, only one of these achieves maximum separation. The reason we need it is that if we use an arbitrary hyperplane to classify, it might end up closer to one set of data points than to the other, and we do not want this to happen; thus the concept of the maximum margin classifier, or maximum margin hyperplane, appears as the solution. The next illustration gives the maximum margin classifier example which provides a solution (Figure 4: Illustration of a linear SVM, taken from Andrew W. Moore's slides, 2003 [2]).
The above illustration shows the maximum margin linear classifier; in this context it is an example of a simple linear SVM classifier. Another interesting question is: why maximum margin? There are some good explanations, including better empirical performance. Another reason is that even if we have made a small error in the location of the boundary, this choice gives us the least chance of causing a misclassification. Further advantages are avoiding local minima and better classification. Now we express the SVM mathematically; in this overview we present a linear SVM. The goals of SVM are to separate the data with a hyperplane and to extend this to non-linear boundaries using the kernel trick. For the SVM, the goal is to correctly classify all the training data, which requires:
[a] If yi = +1, then w · xi + b ≥ 1
[b] If yi = −1, then w · xi + b ≤ −1
[c] For all i: yi (w · xi + b) ≥ 1
In these equations xi is a data point (vector) and w is the weight vector. To separate the data, condition [c] must hold for every point. Among all possible hyperplanes, SVM selects the one for which the distance to the closest points is as large as possible: if the training data are good and every test vector is located within a radius r of a training vector, then a hyperplane located as far as possible from the data will also classify the test vectors correctly. The desired hyperplane which maximizes the margin also bisects the line between the closest points on the convex hulls of the two classes. Thus we have conditions [a], [b] and [c].
The distance of the closest point on the hyperplane to the origin can be found by maximizing x, as x lies on the hyperplane; the points on the other side give a similar expression. Solving and subtracting the two distances gives the summed distance from the separating hyperplane to the nearest points:
Maximum margin: M = 2 / ||w||
Maximizing the margin is the same as minimizing ||w||. We now have a quadratic optimization problem and need to solve for w and b: optimize a quadratic function subject to linear constraints. The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with each constraint. We need to find w and b such that Φ(w) = ½ wᵀw is minimized, subject to yi (w · xi + b) ≥ 1 for all {(xi, yi)}.
Solving, we obtain w = Σ αi yi xi and b = yk − w · xk for any xk such that αk ≠ 0. The classifying function then has the form f(x) = Σ αi yi (xi · x) + b.
SVM Representation
Here we present the QP formulation for SVM classification; this is a simple representation only.
minimize over α:   (1/2) Σ(i=1..l) Σ(j=1..l) αi αj yi yj K(xi, xj)  −  Σ(i=1..l) αi
subject to:   0 ≤ αi ≤ C for all i,   and   Σ(i=1..l) αi yi = 0
The slack variables ξi measure the error made at point (xi, yi). Training an SVM becomes quite challenging when the number of training points is large, and a number of methods for fast SVM training have been proposed.
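A minimal sketch of an SVM classifier for the fraud/non-fraud task with scikit-learn's SVC and an RBF kernel; the data file and columns are assumptions, and the features are scaled because SVMs are sensitive to feature scale.

```python
# Minimal sketch of an SVM classifier for fraud/non-fraud classification.
# The file name and column layout are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = pd.read_csv("creditcard.csv")
X, y = data.drop(columns=["Class"]), data["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_tr)                      # SVMs are sensitive to feature scale
svm = SVC(kernel="rbf", C=1.0, class_weight="balanced")  # C controls the slack penalty
svm.fit(scaler.transform(X_tr), y_tr)
print(svm.score(scaler.transform(X_te), y_te))
```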
Random Forest is a flexible, easy-to-use machine learning algorithm that produces a good result most of the time, even without hyper-parameter tuning. It is also one of the most used algorithms, because of its simplicity and the fact that it can be used for both classification and regression tasks. In this section, you are going to learn how the random forest algorithm works and several other important things about it.
Table of Contents
How it works
Real Life Analogy
Feature Importance
Difference between Decision Trees and Random Forests
Important Hyperparameters (predictive power, speed)
Advantages and Disadvantages
Use Cases
Summary
How it works
Random Forest is a supervised learning algorithm. As you can already see from its name, it creates a forest and makes it somehow random. The "forest" it builds is an ensemble of Decision Trees, most of the time trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models increases the overall result.
One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. I will talk about random forest in classification, since classification is sometimes considered the building block of machine learning. Below you can see how a random forest would look with two trees:
Random Forest has nearly the same hyperparameters as a decision tree or a bagging classifier. Fortunately, you don't have to combine a decision tree with a bagging classifier; you can simply use the classifier class of Random Forest. As already mentioned, with Random Forest you can also deal with regression tasks by using the Random Forest regressor.
Random Forest adds additional randomness to the model while growing the trees. Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features. This results in a wide diversity that generally leads to a better model. Therefore, in Random Forest, only a random subset of the features is taken into consideration by the algorithm for splitting a node. You can even make trees more random by additionally using random thresholds for each feature rather than searching for the best possible thresholds (as a normal decision tree does). A short sketch of this feature-subset randomness follows.
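The sketch below shows how this randomness is exposed in scikit-learn through max_features, together with the bootstrap sampling used by bagging; the toy data is generated only for illustration.

```python
# Minimal sketch of the extra randomness described above: max_features limits each
# split to a random subset of the features ("sqrt" of the total here).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05],
                           random_state=42)              # imbalanced toy data

rf = RandomForestClassifier(n_estimators=200,
                            max_features="sqrt",          # random feature subset per split
                            bootstrap=True,               # bagging: bootstrap sample per tree
                            random_state=42)
rf.fit(X, y)
print(rf.score(X, y))
```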
Real Life Analogy:
Imagine a guy named Andrew who wants to decide which places he should travel to during a one-year vacation trip. He asks people who know him for advice. First, he goes to a friend, who asks Andrew where he travelled to in the past and whether he liked it or not. Based on the answers, the friend will give Andrew some advice.
This is a typical decision tree algorithm approach: Andrew's friend created rules to guide his decision about what he should recommend, using Andrew's answers. Afterwards, Andrew starts asking more and more of his friends to advise him, and they again ask him different questions from which they can derive some recommendations. Then he chooses the places that were recommended to him the most, which is the typical Random Forest algorithm approach.
Feature Importance:
Another great quality of the random forest algorithm is that it is very easy to measure the relative importance of each feature for the prediction. Sklearn provides a great tool for this, which measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity across all trees in the forest. It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1.
.If you don’t know how a decision tree works and if you don’t know what a leaf or node is, here is a
good description from Wikipedia: In a decision tree each internal node represents a “test” on an
attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the
test, and each leaf node represents a class label (decision taken after computing all attributes). A node
that has no children is a leaf.
By looking at the feature importance, you can decide which features you may want to drop because they contribute little or nothing to the prediction process. This is important, because a general rule in machine learning is that the more features you have, the more likely your model is to suffer from overfitting, and vice versa.
As already mentioned, Random Forest is a collection of Decision Trees, but there are some differences.
If you input a training dataset with features and labels into a decision tree, it will formulate a set of rules, which will be used to make the predictions.
For example, if you want to predict whether a person will click on an online advertisement, you could collect the ads the person clicked on in the past and some features that describe his decision. If you put the features and labels into a decision tree, it will generate some rules; then you can predict whether the advertisement will be clicked or not. In comparison, the Random Forest algorithm randomly selects observations and features to build several decision trees and then averages the results.
Another difference is that "deep" decision trees might suffer from overfitting. Random Forest prevents overfitting most of the time by creating random subsets of the features and building smaller trees using these subsets; afterwards, it combines the subtrees. Note that this doesn't work every time, and it also makes the computation slower, depending on how many trees your random forest builds.
Important Hyperparameters:
The hyperparameters in random forest are used either to increase the predictive power of the model or to make the model faster. I will discuss here the hyperparameters of sklearn's built-in random forest function.
random_state makes the model's output replicable. The model will always produce the same results when it has a definite value of random_state and is given the same hyperparameters and the same training data.
Lastly, there is the oob_score (also called oob sampling), which is a random forest cross-validation method. In this sampling, about one third of the data is not used to train a given tree and can instead be used to evaluate its performance. These samples are called out-of-bag samples. It is very similar to the leave-one-out cross-validation method, but almost no additional computational burden goes along with it.
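A minimal sketch of these two hyperparameters in scikit-learn, on generated toy data:

```python
# Minimal sketch of the two hyperparameters discussed above: random_state makes the
# result reproducible and oob_score evaluates the model on out-of-bag samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=100,
                            oob_score=True,     # use out-of-bag samples for validation
                            random_state=42)    # same output on every run
rf.fit(X, y)
print(rf.oob_score_)                            # accuracy estimated without a separate test set
```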
As already mentioned, an advantage of random forest is that it can be used for both regression and classification tasks and that it is easy to view the relative importance it assigns to the input features. Random Forest is also considered a very handy and easy-to-use algorithm, because its default hyperparameters often produce a good prediction result. The number of hyperparameters is also not that high, and they are straightforward to understand.
One of the big problems in machine learning is overfitting, but most of the time this won't happen that easily to a random forest classifier: if there are enough trees in the forest, the classifier won't overfit the model.
The main limitation of Random Forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train, but quite slow
to create predictions once they are trained. A more accurate prediction requires more trees, which
results in a slower model. In most real-world applications the random forest algorithm is fast enough,
but there can certainly be situations where run-time performance is important and other approaches
would be preferred.
And of course Random Forest is a predictive modelling tool and not a descriptive tool. That means, if
you are looking for a description of the relationships in your data, other approaches would be
preferred.
Use Cases:
The random forest algorithm is used in a lot of different fields, like banking, the stock market, medicine and e-commerce. In banking it is used, for example, to detect customers who will use the bank's services more frequently than others and repay their debt on time. In this domain it is also used to detect fraudulent customers who want to scam the bank. In finance, it is used to determine a stock's behaviour in the future. In the healthcare domain it is used to identify the correct combination of components in medicine and to analyse a patient's medical history to identify diseases. And lastly, in e-commerce random forest is used to determine whether a customer will actually like the product or not.
Summary:
Random Forest is a great algorithm to train early in the model development process to see how it performs, and it's hard to build a "bad" Random Forest because of its simplicity. This algorithm is also a great choice if you need to develop a model in a short period of time. On top of that, it provides a pretty good indicator of the importance it assigns to your features.
Random Forests are also very hard to beat in terms of performance. Of course you can probably always find a model that performs better, like a neural network, but these usually take much more time to develop. And on top of that, random forests can handle a lot of different feature types, like binary, categorical and numerical.
Overall, Random Forest is a (mostly) fast, simple and flexible tool, although it has its limitations.
CHAPTER 7
MODULES
Data Cleansing:
When going through our data cleaning process, it's best to perform all of our cleaning in a coarse-to-fine style: start with the biggest, most glaring issues and work your way down to the nitty-gritty details. Based on this approach, the first thing we'll do is remove any unrelated or irrelevant features. Perform a very quick exploration of your dataset to determine which features aren't highly correlated with the output you want to predict. You can do this in a few ways (a short sketch follows this list):
Perform a correlation analysis of the feature variables.
Check how many rows each feature variable is missing. If a variable is missing 90% of its data points then it's probably wise to just drop it altogether.
Consider the nature of the variable itself. Is it actually useful, from a practical point of view, to be using this feature variable? Only drop it if you're quite sure it won't be helpful.
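A minimal sketch of this quick exploration with pandas, assuming a "transactions.csv" file with a "Class" target column:

```python
# Minimal sketch of the quick exploration described above: correlation with the
# target and the fraction of missing values per feature (file and columns assumed).
import pandas as pd

data = pd.read_csv("transactions.csv")

# Correlation of each numeric feature with the target class
print(data.corr(numeric_only=True)["Class"].sort_values(ascending=False))

# Fraction of missing values per column; features missing ~90% can usually be dropped
print(data.isna().mean().sort_values(ascending=False))
```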
Handling missing values
We’ve already dropped the feature variables with a high percentage of missing values. Now we want
to handle those feature variables that we do actually need but also have missing values. Again we
have a few options:
Fill in the missing rows with an arbitrary value
Fill in the missing rows with a value computed from the data’s
statistics Ignore missing rows
The first one can be done if you know what a good default value should be. But if you can compute a value from some kind of statistical analysis, that is often highly preferred, since it at least has some support from the data. The last option can be taken if we have a large enough dataset to afford throwing away some of the rows. However, before you do this, be sure to take a quick look at the data to be sure that those data points aren't critically important.
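A minimal sketch of the three options with pandas; the column names are assumptions:

```python
# Minimal sketch of the three options above: fill with a default value, fill with a
# statistic computed from the data, or drop the remaining incomplete rows.
import pandas as pd

data = pd.read_csv("transactions.csv")

data["merchant_type"] = data["merchant_type"].fillna("unknown")     # arbitrary default
data["amount"] = data["amount"].fillna(data["amount"].median())     # value from statistics
data = data.dropna()                                                # ignore remaining rows
```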
Formatting the data
When datasets are collected, the data will often be entered by human users as plain text. This can cause complications with the data format. For example, there are many ways to enter the name of the state of California: CA, C.A, California, Cali; these all need to be standardised into one uniform format. In addition, there may be cases where the data is continuous and we want to make it discrete, or vice versa. The typical steps are therefore (a brief sketch follows this list):
Standardising data formats, including acronyms, capitalisation, and style.
Discretising continuous data, or vice versa.
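A minimal sketch of both steps with pandas, following the California example; the mapping and the amount bins are assumptions:

```python
# Minimal sketch of standardising free-text values and discretising a continuous
# column (the mapping and the bins are illustrative assumptions).
import pandas as pd

data = pd.DataFrame({"state": ["CA", "C.A", "California", "Cali"],
                     "amount": [12.0, 250.0, 999.0, 15000.0]})

data["state"] = data["state"].str.replace(".", "", regex=False).str.upper()
data["state"] = data["state"].replace({"CALIFORNIA": "CA", "CALI": "CA"})

# Discretise the continuous amount into categories
data["amount_band"] = pd.cut(data["amount"], bins=[0, 100, 1000, float("inf")],
                             labels=["low", "medium", "high"])
print(data)
```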
Training:
In general, a learning problem considers a set of n samples of data and then tries to predict properties
of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional
entry (aka multivariate data), it is said to have several attributes or features.
When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must
be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters
can be tweaked until the estimator performs optimally. This way, knowledge about the test set can
“leak” into the model and evaluation metrics no longer report on generalization performance. To
solve this problem, yet another part of the dataset can be held out as a so-called “validation set”:
training proceeds on the training set, after which evaluation is done on the validation set, and when
the experiment seems to be successful, final evaluation can be done on the test set.
However, by partitioning the available data into three sets, we drastically reduce the number of
samples which can be used for learning the model, and the results can depend on a particular random
choice for the pair of (train, validation) sets.
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still
be held out for final evaluation, but the validation set is no longer needed when doing CV. In the
basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are
described below, but generally follow the same principles). The following procedure is followed for
each of the k “folds”:
A model is trained using k−1 of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
The cross_validate function returns a dictionary containing fit times, score times (and optionally training scores as well as fitted estimators) in addition to the test scores.
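A minimal sketch of k-fold cross-validation with scikit-learn's cross_validate on generated toy data:

```python
# Minimal sketch of k-fold cross-validation with cross_validate, which returns the
# fit times, score times and test scores mentioned above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

results = cross_validate(clf, X, y, cv=5, scoring="recall",
                         return_train_score=True)
print(results["test_score"].mean())        # average of the k fold scores
print(results.keys())                      # fit_time, score_time, train/test scores
```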
Step 1 — Importing Scikit-learn
Let’s begin by installing the Python module Scikit-learn, one of the best and most documented
machine learning libraries for Python.
To begin our coding project, let’s activate our Python 3 programming environment.
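A minimal check that the library is available in the active environment (it can be installed with conda or pip if missing):

```python
# Verify that Scikit-learn is importable in the active Python 3 environment.
# If this fails, install it first, e.g. `conda install scikit-learn` or
# `pip install scikit-learn`.
import sklearn
print(sklearn.__version__)
```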
BIBLIOGRAPHY
[1] P. Richhariya and P. K. Singh, "Evaluating and emerging payment card fraud challenges and resolution," Int. J. Comput. Appl., vol. 107, no. 14, pp. 5–10, Jan. 2014.
[2] S. Bhattacharyya, S. Jha, K. Tharakunnel, and J. C. Westland, "Data mining for credit card fraud: A comparative study," Decis. Support Syst., vol. 50, no. 3, pp. 602–613, 2011.
[3] A. Dal Pozzolo, O. Caelen, Y.-A. L. Borgne, S. Waterschoot, and G. Bontempi, "Learned lessons in credit card fraud detection from a practitioner perspective," Expert Syst. Appl., vol. 41, no. 10, pp. 4915–4928, 2014.
[4] C. Phua, D. Alahakoon, and V. Lee, "Minority report in fraud detection: Classification of skewed data," ACM SIGKDD Explorations Newslett., vol. 6, no. 1, pp. 50–59, 2004.
[5] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Trans. Knowl. Data Eng., vol. 18, no. 1, pp. 63–77, Jan. 2006.
[6] S. Ertekin, J. Huang, and C. L. Giles, "Active learning for class imbalance problem," in Proc. 30th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2007, pp. 823–824.
[7] M. Wasikowski and X. Chen, "Combating the small sample class imbalance problem using feature selection," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1388–1400, Oct. 2010.
[8] S. Wang and X. Yao, "Multiclass imbalance problems: Analysis and potential solutions," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 4, pp. 1119–1130, Aug. 2012.
[9] R. J. Bolton and D. J. Hand, "Statistical fraud detection: A review," Stat. Sci., vol. 17, no. 3, pp. 235–249, Aug. 2002.
[10] D. J. Weston, D. J. Hand, N. M. Adams, C. Whitrow, and P. Juszczak, "Plastic card fraud detection using peer group analysis," Adv. Data Anal. Classification, vol. 2, no. 1, pp. 45–62, 2008.
[11] E. Duman and M. H. Ozcelik, "Detecting credit card fraud by genetic algorithm and scatter search," Expert Syst. Appl., vol. 38, no. 10, Sep. 2011.