
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072

CREDIT CARD FRAUD DETECTION USING RANDOM FOREST


Devi Meenakshi. B¹, Janani. B², Gayathri. S³, Mrs. Indira. N⁴
¹,²,³Student, Dept. of Computer Science and Engineering, Panimalar Engineering College, Tamil Nadu, India
⁴Associate Professor, Dept. of Computer Science and Engineering, Panimalar Engineering College, Tamil Nadu, India
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - This project focuses on credit card fraud detection in the real world. The phenomenal growth in the number of credit card transactions has recently led to a considerable rise in fraudulent activity. The purpose of such fraud is to obtain goods without paying, or to obtain unauthorized funds from an account. Implementing efficient fraud detection systems has become imperative for all credit card issuing banks in order to minimize their losses. One of the most crucial challenges is that neither the card nor the cardholder needs to be present when a purchase is made, which makes it impossible for the merchant to verify whether the customer making the purchase is the authentic cardholder. With the proposed scheme, which uses the random forest algorithm, the accuracy of fraud detection can be improved. The classification process of the random forest algorithm is used to analyse the data set and the user's current data set, and the accuracy of the result is then optimized. Processing some of the provided attributes identifies the fraud, and the result is presented as a graphical visualization. The performance of the technique is evaluated on accuracy, sensitivity, specificity, and precision.

Keywords: Credit Card, Fraud Detection, Random Forest.

1. INTRODUCTION

Various techniques for detecting fraudulent activity in credit card transactions have been implemented, and researchers have sought methods to develop models based on artificial intelligence, data mining, fuzzy logic and machine learning. Credit card fraud detection is a significantly difficult, but also popular, problem to solve. In our proposed system we build credit card fraud detection using machine learning. With the advancement of machine learning techniques, machine learning has been identified as a successful measure for fraud detection. A large amount of data is transferred during online transaction processes, resulting in a binary result: genuine or fraudulent. Within the sample fraudulent datasets, features are constructed. These are data points such as the age and value of the customer account, as well as the origin of the credit card. There are hundreds of features and each contributes, to varying extents, towards the fraud probability. Note that the level at which each feature contributes to the fraud score is generated by the artificial intelligence of the machine, which is driven by the training set, and is not determined by a fraud analyst. So, with regard to card fraud, if the use of cards to commit fraud is proven to be high, the fraud weighting of a transaction that uses a credit card will be equally high. However, if this were to shrink, the contribution level would fall in parallel. Simply put, these models self-learn without explicit programming, unlike manual review.

Credit card fraud detection using machine learning is done by deploying classification and regression algorithms. We use a supervised learning algorithm, random forest, to classify fraudulent card transactions, whether online or offline. Random forest is an advanced version of the decision tree. It often achieves better accuracy than a single decision tree and is competitive with other machine learning algorithms. Random forest aims to reduce the correlation between its trees by considering only a random subsample of the feature space at each split. Essentially, it aims to de-correlate the trees, and the trees can be pruned by fixing a stopping criterion for node splits. The working of the algorithm is described in Section 8.

1.1 PROBLEM DEFINITION

Billions of dollars are lost every year to fraudulent credit card transactions. Fraud is as old as humanity itself and can take an unlimited variety of forms. The PwC global economic crime survey of 2017 suggests that approximately 48% of organizations experienced economic crime. Therefore, there is a definite need to solve the problem of credit card fraud detection. Moreover, the development of new technologies provides additional ways in which criminals may commit fraud. The use of credit cards is prevalent in modern society, and credit card fraud has kept growing in recent years. Huge financial losses from fraud affect not only merchants and banks, but also the individuals who use the cards. Fraud may also affect the reputation and image of a merchant, causing non-financial losses that, though difficult to quantify in the short term, may become visible in the long run. For example, if a cardholder is a victim of fraud with a certain company, he may no longer trust its business and may choose a competitor.

1.2 SCOPE OF THE PROJECT

In this project we design a model to detect fraudulent activity in credit card transactions. The system provides most of the essential features required to distinguish fraudulent from legitimate transactions.

© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6662

As technology changes, it becomes difficult to track the behaviour and pattern of fraudulent transactions.

With the rise of machine learning, artificial intelligence and other relevant fields of information technology, it becomes feasible to automate the process and to save some of the considerable labor that is put into detecting fraudulent credit card activity.

2. RELATED WORK

[1] The Use of Predictive Analytics Technology to Detect Credit Card Fraud in Canada. “Kosemani Temitayo Hafiz, Dr. Shaun Aghili, Dr. Pavol Zavarsky”.

This research paper focuses on the creation of a scorecard from relevant evaluation criteria, features, and capabilities of predictive analytics vendor solutions currently being used to detect credit card fraud. The scorecard provides a side-by-side comparison of five credit card predictive analytics vendor solutions adopted in Canada. From the ensuing research findings, a list of challenges, risks, and limitations of credit card fraud PAT vendor solutions was outlined.

[2] BLAST-SSAHA Hybridization for Credit Card Fraud Detection. “Amlan Kundu, Suvasini Panigrahi, Shamik Sural, Senior Member, IEEE, and Arun K. Majumdar”.

This paper proposes a two-stage sequence alignment in which a profile analyser (PA) first determines the similarity of an incoming sequence of transactions on a given credit card with the genuine cardholder’s past spending sequences. The unusual transactions traced by the profile analyser are next passed on to a deviation analyser (DA) for possible alignment with past fraudulent behaviour. The final decision about the nature of a transaction is taken on the basis of the observations of these two analysers. To achieve online response time for both PA and DA, the authors suggest a new approach that combines two sequence alignment algorithms, BLAST and SSAHA.

[3] Research on Credit Card Fraud Detection Model Based on Distance Sum. “Wen-Fang YU, Na Wang”.

With the increasing number of credit cards and the growing trade volume in China, credit card fraud is rising sharply. How to enhance the detection and prevention of credit card fraud has become the focus of risk control for banks. The paper proposes a credit card fraud detection model that uses outlier detection based on distance sum, exploiting the infrequency and unconventionality of fraud in credit card transaction data, and so applies outlier mining to credit card fraud detection. Experiments show that this model is feasible and accurate in detecting credit card fraud.

[4] Fraudulent Detection in Credit Card System Using SVM & Decision Tree. “Vijayshree B. Nipane, Poonam S. Kalinge, Dipali Vidhate, Kunal War, Bhagyashree P. Deshpande”.

With growing advancement in the electronic commerce field, fraud is spreading all over the world, causing major financial losses. At present, a major cause of financial loss is credit card fraud; it affects not only tradespeople but also individual clients. Decision trees, genetic algorithms, meta-learning strategies, neural networks and HMMs are among the methods that have been presented to detect credit card fraud. In the contemplated system for fraud detection, the Support Vector Machine (SVM) and the decision tree are used together to solve the problem. By implementing this hybrid approach, financial losses can be reduced to a greater extent.

[5] Supervised Machine (SVM) Learning for Credit Card Fraud Detection. “Sitaram patel, Sunita Gond”.

This thesis proposes an SVM (Support Vector Machine) based method with multiple kernel involvement, which also includes several fields of the user profile instead of only the spending profile. The simulation results show an improvement in the TP (true positive) and TN (true negative) rates, and a decrease in the FP (false positive) and FN (false negative) rates.

[6] Detecting Credit Card Fraud by Decision Trees and Support Vector Machines. “Y. Sahin and E. Duman”.

In this study, classification models based on decision trees and support vector machines (SVM) are developed and applied to the credit card fraud detection problem. The study is one of the first to compare the performance of SVM and decision tree methods in credit card fraud detection with a real data set.

3. SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

In the existing system, a case study on credit card fraud detection applied data normalization before cluster analysis. Its results, obtained with cluster analysis and artificial neural networks, showed that clustering the attributes can minimize the number of neuronal inputs, and that promising results can be obtained by training an MLP on normalized data. This research was based on unsupervised learning. Its significance was to find new methods for fraud detection and to increase the accuracy of the results. The data set for this paper is based on real-life transactional data from a large European company, and personal details in the data are kept confidential. The accuracy of the algorithm is around 50%. The significance of this paper was to find an algorithm that reduces the cost measure. The reported improvement was 23%, and the algorithm found was Bayes minimum risk.
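
To make the Bayes minimum risk idea mentioned above concrete, the following is a minimal sketch of the decision rule: a transaction is flagged when flagging has the lower expected cost. The cost matrix is an assumption made for illustration (a missed fraud costs the transaction amount, any flagged transaction costs a fixed review fee, a correctly passed transaction costs nothing), and the function name and numbers are hypothetical; the source does not give these details.

```python
def bayes_min_risk_decision(p_fraud, amount, admin_cost):
    """Return 1 (flag as fraud) if flagging has the lower expected cost, else 0.

    Assumed cost matrix (not taken from the paper):
      - flagged transaction, fraud or not: fixed review cost `admin_cost`
      - missed fraud: the full transaction `amount` is lost
      - legitimate transaction that is passed: no cost
    """
    risk_if_flagged = admin_cost           # paid regardless of the true class
    risk_if_passed = p_fraud * amount      # expected loss if a fraud slips through
    return 1 if risk_if_flagged < risk_if_passed else 0


# Hypothetical inputs: p_fraud would come from a trained classifier.
large_purchase = bayes_min_risk_decision(p_fraud=0.05, amount=1000.0, admin_cost=10.0)
small_purchase = bayes_min_risk_decision(p_fraud=0.05, amount=100.0, admin_cost=10.0)
print(large_purchase, small_purchase)
```

Under these assumptions the same fraud probability can lead to different decisions, because the rule weighs the probability by the amount at stake rather than applying a fixed probability threshold.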


3.1.1 DISADVANTAGES

1. In the existing work, a new collative comparison measure that reasonably represents the gains and losses due to fraud detection is proposed.
2. A cost-sensitive method based on Bayes minimum risk is presented using the proposed cost measure.

3.2 PROPOSED SCHEME

In the proposed system, we apply the random forest algorithm to classify the credit card dataset. Random forest is an algorithm for classification and regression. In summary, it is a collection of decision tree classifiers. Random forest has an advantage over a single decision tree in that it corrects the tree's habit of overfitting to its training set. To train each individual tree, a subset of the training set is sampled randomly and a decision tree is built on it; each node then splits on a feature selected from a random subset of the full feature set. Even for large data sets with many features and data instances, training is fast, because each tree is trained independently of the others. The random forest algorithm has been found to provide a good estimate of the generalization error and to be resistant to overfitting.

3.3 ADVANTAGES OF PROPOSED SYSTEM

• Random forest ranks the importance of variables in a regression or classification problem in a natural way.
• The 'amount' feature is the transaction amount. The 'class' feature is the target class for the binary classification; it takes value 1 for the positive case (fraud) and 0 for the negative case (not fraud).

4. REQUIREMENT SPECIFICATIONS

The requirements specification is a technical specification of the requirements for the software product. It is the first step in the requirements analysis process; it lists the requirements of a particular software system, including functional, performance and security requirements. The purpose of the software requirements specification is to provide a detailed overview of the software project, its parameters and goals.

4.1 HARDWARE REQUIREMENTS

• Processor - Intel
• RAM - 4 GB
• Hard Disk - 260 GB
• Keyboard - Standard Windows Keyboard
• Mouse - Two or Three Button Mouse

4.2 SOFTWARE REQUIREMENTS

• Python
• Anaconda
• OS - Windows 7, 8 and 10 (32 and 64 bit)

5. FEASIBILITY STUDY

5.1 TECHNICAL FEASIBILITY

It is evident that the necessary hardware and software are available for the development and implementation of the proposed system. It uses Anaconda.

5.2 ECONOMICAL FEASIBILITY

The cost of the proposed system is comparatively low relative to other existing software.

5.3 OPERATIONAL FEASIBILITY

To operate the system, the necessary software has to be configured.

6. SYSTEM ARCHITECTURE

First, the credit card dataset is taken from the source, and cleaning and validation are performed on it, which includes removing redundancy, filling empty spaces in columns, and converting the necessary variables into factors or classes. The data is then divided into two parts: a training dataset and a test dataset. The original sample is randomly partitioned into the test and train datasets.

Figure 6.1- ARCHITECTURE OF THE PROPOSED SYSTEM

7. SYSTEM MODULES

7.1 MODULE DESCRIPTION

7.1.1 MODULE 1: DATA COLLECTION

The data used in this paper is a set of records collected from credit card transactions. This step is concerned with selecting the subset of all available data that you will be working with. ML problems start with data, preferably lots of data (examples or observations) for which you already know


the target answer. Data for which you already know the target answer is called labelled data.

7.1.2 MODULE 2: DATA PRE-PROCESSING

Organize your selected data by formatting, cleaning and sampling from it.

Three common data pre-processing steps are:

Formatting: The data you have selected may not be in a format that is suitable for you to work with. The data may be in a relational database and you would like it in a flat file, or the data may be in a proprietary file format and you would like it in a relational database or a text file.

Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances that are incomplete and do not carry the data you believe you need to address the problem. These instances may need to be removed. Additionally, there may be sensitive information in some of the attributes, and these attributes may need to be removed from the data entirely.

Sampling: There may be far more selected data available than you need to work with. More data can result in much longer running times for algorithms and larger computational and memory requirements. You can take a smaller representative sample of the selected data that may be much faster for exploring and prototyping solutions before considering the whole dataset.

7.1.3 MODULE 3: FEATURE EXTRACTION

Feature extraction is an attribute reduction process. Unlike feature selection, which ranks the existing attributes according to their predictive significance, feature extraction actually transforms the attributes. The transformed attributes, or features, are linear combinations of the original attributes. Finally, our models are trained using a classifier algorithm. We use the classify module of the Natural Language Toolkit library in Python. The models are trained on the labelled dataset gathered, and the rest of our labelled data is used to evaluate the models. The pre-processed data was classified with a machine learning algorithm; the chosen classifier was random forest. These algorithms are very popular in text classification tasks.

7.1.4 MODULE 4: EVALUATION MODEL

Model evaluation is an integral part of the model development process. It helps to find the best model that represents our data and to estimate how well the chosen model will work in the future. Evaluating model performance with the data used for training is not acceptable in data science because it can easily generate overoptimistic and overfitted models. There are two methods of evaluating models in data science, hold-out and cross-validation. To avoid overfitting, both methods use a test set (not seen by the model) to evaluate model performance. The performance of each classification model is estimated based on its averaged performance. The results are presented in visual form, with the classified data shown as graphs. Accuracy is defined as the percentage of correct predictions for the test data. It can be calculated easily by dividing the number of correct predictions by the number of total predictions.

8. ALGORITHM UTILIZED

8.1 RANDOM FOREST

Random forest is a type of supervised machine learning algorithm based on ensemble learning. Ensemble learning is a type of learning where you join different types of algorithms, or the same algorithm multiple times, to form a more powerful prediction model. The random forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name "Random Forest". The random forest algorithm can be used for both regression and classification tasks.

8.2 WORKING OF RANDOM FOREST

The following are the basic steps involved in performing the random forest algorithm:

1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
4. For a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally, the new record is assigned to the category that wins the majority vote.

8.3 ADVANTAGES OF USING RANDOM FOREST

The pros of using random forest for classification and regression are:

1. The random forest algorithm is less prone to bias, since there are multiple trees and each tree is trained on a subset of the data. Basically, the random forest algorithm relies on the power of "the crowd"; therefore, the overall bias of the algorithm is reduced.
2. The algorithm is very stable. Even if a new data point is introduced in the dataset, the overall algorithm is not affected much, since new data may impact one tree but is very unlikely to impact all the trees.
3. The random forest algorithm works well when you have both categorical and numerical features.

The random forest algorithm also works well when data has missing values or has not been scaled well.
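
The steps listed in Section 8.2, together with the random feature subset described in Section 3.2, can be sketched in plain Python. This is a minimal illustration rather than the project's implementation: each "tree" is reduced to a single split (a decision stump) to keep the example short, and the function names, the two features (transaction amount and a hypothetical distance value) and the tiny data set are all made up for the example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_stump(rows, labels, feature_ids):
    """Fit a one-split tree, searching only the given candidate features."""
    best = None  # (impurity, feature, threshold, left label, right label)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on its own random sample of the records."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):                                  # step 3: repeat
        ids = [rng.randrange(n) for _ in range(n)]            # step 1: N random records
        feats = rng.sample(range(len(rows[0])), n_features)   # random feature subset
        forest.append(fit_stump([rows[i] for i in ids],
                                [labels[i] for i in ids], feats))  # step 2: build a tree
    return forest


def predict(forest, row):
    """Step 4: every tree votes and the majority class wins."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy data, invented for illustration: [amount, distance]; class 1 = fraud.
rows = [[12.0, 2.0], [15.5, 1.0], [9.9, 3.5], [20.0, 0.5],
        [950.0, 800.0], [870.0, 1200.0], [990.0, 650.0], [910.0, 900.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict(forest, [900.0, 700.0]), predict(forest, [9.0, 0.4]))
```

A real forest would grow deeper trees and draw a fresh feature subset at every node; with stumps there is only one node per tree, so the per-tree subset plays that role here.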

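
The paper names accuracy, sensitivity, specificity and precision as its evaluation measures and describes a hold-out split into training and test data. The sketch below shows one way these could be computed from confusion-matrix counts. The function names, the 25% test fraction and the label vectors are assumptions chosen for illustration, not values from the project.

```python
import random


def hold_out_split(rows, labels, test_fraction=0.25, seed=0):
    """Randomly partition the sample into a training part and a test part."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)        # fixed seed keeps the split repeatable
    n_test = max(1, int(len(order) * test_fraction))
    test_ids, train_ids = order[:n_test], order[n_test:]
    return ([rows[i] for i in train_ids], [labels[i] for i in train_ids],
            [rows[i] for i in test_ids], [labels[i] for i in test_ids])


def evaluate(y_true, y_pred):
    """Compute the four measures from confusion-matrix counts (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0      # avoid division by zero

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),  # correct / total predictions
        "sensitivity": ratio(tp, tp + fn),              # share of frauds that were caught
        "specificity": ratio(tn, tn + fp),              # share of genuine transactions passed
        "precision": ratio(tp, tp + fp),                # share of fraud alerts that were right
    }


# Invented labels: 3 frauds among 10 transactions, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)

# Hold-out split of ten placeholder records into 8 training and 2 test records.
rows = [[float(i)] for i in range(10)]
x_train, y_train, x_test, y_test = hold_out_split(rows, y_true)
```

Because fraud is rare, accuracy alone can look high even when most frauds are missed, which is why sensitivity and precision are reported alongside it.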

9. APPENDICES

9.1 SAMPLE SCREENSHOTS FROM THE PROJECT

Fig- 8.1: Exact figures of fake and original credit card transactions

10. CONCLUSION

The random forest algorithm will perform better with a larger amount of training data, but speed during testing and application will suffer. Applying more pre-processing techniques would also help. The SVM algorithm still suffers from the imbalanced dataset problem and requires more pre-processing to give better results; the results shown by SVM are good, but they could have been better if more pre-processing had been done on the data.

11. REFERENCES

[1] Sudhamathy G, “Credit Risk Analysis and Prediction Modelling of Bank Loans Using R”, vol. 8, no. 5, pp. 1954-1966.

[2] LI Changjian, HU Peng, “Credit Risk Assessment for Rural Credit Cooperatives based on Improved Neural Network”, International Conference on Smart Grid and Electrical Automation, vol. 60, no. 3, pp. 227-230, 2017.

[3] Wei Sun, Chen-Guang Yang, Jian-Xun Qi, “Credit Risk Assessment in Commercial Banks Based On Support Vector Machines”, vol. 6, pp. 2430-2433, 2006.

[4] Amlan Kundu, Suvasini Panigrahi, Shamik Sural, Senior Member, IEEE, “BLAST-SSAHA Hybridization for Credit Card Fraud Detection”, vol. 6, no. 4, pp. 309-315, 2009.

[5] Y. Sahin and E. Duman, “Detecting Credit Card Fraud by Decision Trees and Support Vector Machines”, Proceedings of International Multi Conference of Engineers and Computer Scientists, vol. I, 2011.

[6] Sitaram patel, Sunita Gond, “Supervised Machine (SVM) Learning for Credit Card Fraud Detection”, International Journal of Engineering Trends and Technology, vol. 8, no. 3, pp. 137-140, 2014.

[7] Snehal Patil, Harshada Somavanshi, Jyoti Gaikwad, Amruta Deshmane, Rinku Badgujar, “Credit Card Fraud Detection Using Decision Tree Induction Algorithm”, International Journal of Computer Science and Mobile Computing, vol. 4, issue 4, April 2015, pp. 92-95.

[8] Dahee Choi and Kyungho Lee, “Machine Learning based Approach to Financial Fraud Detection Process in Mobile Payment System”, vol. 5, no. 4, December 2017, pp. 12-24.
