As technology changes, it becomes difficult to track the behaviour and patterns of fraudulent transactions.

With the upsurge of machine learning, artificial intelligence and other relevant fields of information technology, it becomes feasible to automate the process and to save much of the labour that goes into detecting fraudulent credit card activities.

2. RELATED WORK

[1] The Use of Predictive Analytics Technology to Detect Credit Card Fraud in Canada. "Kosemani Temitayo Hafiz, Dr. Shaun Aghili, Dr. Pavol Zavarsky."

This research paper focuses on the creation of a scorecard from relevant evaluation criteria, features, and capabilities of predictive analytics vendor solutions currently being used to detect credit card fraud. The scorecard provides a side-by-side comparison of five credit card predictive analytics vendor solutions adopted in Canada. From the ensuing research findings, a list of credit card fraud PAT vendor solution challenges, risks, and limitations was outlined.

[2] BLAST-SSAHA Hybridization for Credit Card Fraud Detection. "Amlan Kundu, Suvasini Panigrahi, Shamik Sural, Senior Member, IEEE, and Arun K. Majumdar."

This paper proposes a two-stage sequence alignment in which a profile analyser (PA) first determines the similarity of an incoming sequence of transactions on a given credit card with the genuine cardholder's past spending sequences. The unusual transactions traced by the profile analyser are then passed on to a deviation analyser (DA) for possible alignment with past fraudulent behaviour. The final decision about the nature of a transaction is taken on the basis of the observations of these two analysers. In order to achieve online response time for both PA and DA, the authors suggest a new approach for combining the two sequence alignment algorithms BLAST and SSAHA.

[3] Research on Credit Card Fraud Detection Model Based on Distance Sum. "Wen-Fang Yu, Na Wang."

Along with increasing credit card use and growing trade volume in China, credit card fraud has risen sharply, and enhancing its detection and prevention has become the focus of risk control in banks. The paper proposes a credit card fraud detection model using outlier detection based on distance sum, exploiting the infrequency and unconventionality of fraud in credit card transaction data and applying outlier mining to credit card fraud detection. Experiments show that this model is feasible and accurate in detecting credit card fraud.

[4] Fraudulent Detection in Credit Card System Using SVM & Decision Tree. "Vijayshree B. Nipane, Poonam S. Kalinge, Dipali Vidhate, Kunal War, Bhagyashree P. Deshpande."

With growing advancement in the electronic commerce field, fraud is spreading all over the world, causing major financial losses. In the current scenario, the major cause of financial losses is credit card fraud; it affects not only merchants but also individual clients. Decision trees, genetic algorithms, meta-learning strategies, neural networks and HMMs are the methods commonly used to detect credit card fraud. In the contemplated fraud detection system, the artificial intelligence concepts of Support Vector Machine (SVM) and decision tree are used to solve the problem. By implementing this hybrid approach, financial losses can be reduced to a greater extent.

[5] Supervised Machine (SVM) Learning for Credit Card Fraud Detection. "Sitaram Patel, Sunita Gond."

This thesis proposes an SVM (Support Vector Machine) based method with multiple kernel involvement which also includes several fields of the user profile instead of only the spending profile. The simulation results show an improvement in the TP (true positive) and TN (true negative) rates and a decrease in the FP (false positive) and FN (false negative) rates.

[6] Detecting Credit Card Fraud by Decision Trees and Support Vector Machines. "Y. Sahin and E. Duman."

In this study, classification models based on decision trees and support vector machines (SVM) are developed and applied to the credit card fraud detection problem. This study is one of the first to compare the performance of SVM and decision tree methods on credit card fraud detection with a real data set.

3. SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

The existing system is based on a case study of credit card fraud detection in which data normalization is applied before cluster analysis. The results obtained from the use of cluster analysis and artificial neural networks show that clustering attributes can minimize the number of neuronal inputs, and that promising results can be obtained when the data is normalized and MLP-trained. This research was based on unsupervised learning; its significance was to find new methods for fraud detection and to increase the accuracy of the results. The data set is based on real-life transactional data from a large European company, with personal details kept confidential. The accuracy of the algorithm is around 50%. A further aim was to find an algorithm that reduces the cost measure; the result obtained was an improvement of about 23%, and the algorithm found was Bayes minimum risk.
3.2 PROPOSED SCHEME

In the proposed system, we apply the random forest algorithm to classify the credit card dataset.

Random Forest is an algorithm for classification and regression; in short, it is a collection of decision tree classifiers. Random forest has an advantage over a single decision tree in that it corrects the tendency of decision trees to overfit their training set. A subset of the training set is sampled randomly to train each individual tree, and each node of that tree then splits on a feature selected from a random subset of the full feature set. Even for large data sets with many features and data instances, training is extremely fast in random forest because each tree is trained independently of the others. The Random Forest algorithm has been found to provide a good estimate of the generalization error and to be resistant to overfitting.
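As an illustration of this classification step, the following is a minimal sketch using scikit-learn's RandomForestClassifier; the file name creditcard.csv and the exact column names ('Class', 'Amount') are assumptions about the dataset layout, not details stated in this paper.

# Minimal sketch of the proposed classification step (assumptions noted above).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("creditcard.csv")              # hypothetical input file
X = data.drop(columns=["Class"])                  # predictor features
y = data["Class"]                                 # 1 = fraud, 0 = not fraud

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Each tree is grown on a random sample of the training set, and each node
# splits on a feature chosen from a random subset of the full feature set.
model = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                               n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))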
3.3 ADVANTAGES OF PROPOSED SYSTEM

Random Forest ranks the importance of variables in a regression or classification problem in a natural way.

In the dataset, the 'amount' feature is the transaction amount. The 'class' feature is the target class for the binary classification and takes the value 1 for the positive case (fraud) and 0 for the negative case (not fraud).

4. REQUIREMENT SPECIFICATIONS

5.1 TECHNICAL FEASIBILITY

It is evident that the necessary hardware and software are available for the development and implementation of the proposed system. The system uses Anaconda.

5.2 ECONOMICAL FEASIBILITY

The cost of the proposed system is comparatively lower than that of other existing software.

5.3 OPERATIONAL FEASIBILITY

The project only requires configuring the necessary software in order to work with it.

6. SYSTEM ARCHITECTURE

First, the credit card dataset is taken from the source, and cleaning and validation are performed on it; this includes removal of redundancy, filling empty spaces in columns, and converting the necessary variables into factors or classes. The data is then divided into two parts, a training dataset and a test dataset: the original sample is randomly partitioned into the test and train datasets.
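A short sketch of this preparation flow, assuming the data is loaded into a pandas DataFrame with a 'Class' target column (the file name and the 75/25 split ratio are illustrative choices):

# Sketch of the cleaning, conversion and train/test partitioning steps.
import pandas as pd

df = pd.read_csv("creditcard.csv")                 # hypothetical source file

df = df.drop_duplicates()                          # removal of redundancy
df = df.fillna(df.median(numeric_only=True))       # filling empty spaces in columns
df["Class"] = df["Class"].astype("category")       # converting the target into a factor

# Randomly partition the original sample into train and test datasets.
train_df = df.sample(frac=0.75, random_state=1)    # 75% of the rows for training
test_df = df.drop(train_df.index)                  # the remaining 25% for testing
print(len(train_df), "training rows,", len(test_df), "test rows")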
the target answer. Data for which you already know the target answer is called labelled data.

7.1.2 MODULE 2: DATA PRE-PROCESSING

Organize the selected data by formatting, cleaning and sampling from it. Three common data pre-processing steps are:

Formatting: The data you have selected may not be in a format that is suitable to work with. The data may be in a relational database when you would like it in a flat file, or in a proprietary file format when you would like it in a relational database or a text file.

Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances that are incomplete and do not carry the data you believe you need to address the problem; these instances may need to be removed. Additionally, there may be sensitive information in some of the attributes, and these attributes may need to be removed from the data entirely.

Sampling: There may be far more selected data available than you need to work with. More data can result in much longer running times for algorithms and larger computational and memory requirements. You can take a smaller representative sample of the selected data that may be much faster for exploring and prototyping solutions before considering the whole dataset.
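A minimal sketch of these three steps with pandas; the file name and the 'card_number' and 'Class' column names are illustrative assumptions.

# Formatting: load the exported file into a flat, tabular DataFrame.
import pandas as pd

df = pd.read_csv("transactions.csv")                       # hypothetical export

# Cleaning: drop incomplete instances and remove a sensitive attribute.
df = df.dropna()
df = df.drop(columns=["card_number"], errors="ignore")

# Sampling: keep a smaller representative sample for prototyping,
# preserving the fraud / non-fraud ratio of the 'Class' column.
sample = df.groupby("Class").sample(frac=0.1, random_state=0)
print(sample.shape, sample["Class"].value_counts(normalize=True))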
7.1.3 MODULE 3: FEATURE EXTRACTION

The next step is feature extraction, which is an attribute reduction process. Unlike feature selection, which ranks the existing attributes according to their predictive significance, feature extraction actually transforms the attributes; the transformed attributes, or features, are linear combinations of the original attributes. Finally, our models are trained using a classifier algorithm. We use the classify module of the Natural Language Toolkit (NLTK) library in Python together with the labelled dataset gathered earlier; the rest of the labelled data is used to evaluate the models. Machine learning algorithms were used to classify the pre-processed data, and the chosen classifier was Random Forest. These algorithms are very popular in text classification tasks.
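This module does not name a concrete extraction technique; principal component analysis (PCA) is one standard method that produces features which are linear combinations of the original attributes, sketched here on synthetic data.

# Illustrative only: PCA-derived features are linear combinations
# of the original attributes, as described above.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))         # stand-in for the original attributes

pca = PCA(n_components=5)              # reduce 10 attributes to 5 features
features = pca.fit_transform(X)

# Each row of components_ holds the weights of one linear combination.
print(features.shape)                  # (500, 5)
print(pca.components_.shape)           # (5, 10)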
7.1.4 MODULE 4: EVALUATION MODEL

Model evaluation is an integral part of the model development process. It helps to find the best model that represents our data and shows how well the chosen model will work in the future. Evaluating model performance with the data used for training is not acceptable in data science because it can easily generate overoptimistic and overfitted models. There are two methods of evaluating models in data science, Hold-Out and Cross-Validation. To avoid overfitting, both methods use a test set (not seen by the model) to evaluate model performance. The performance of each classification model is estimated based on its averaged results. The result is presented in a visualized form, with the classified data represented as graphs. Accuracy is defined as the percentage of correct predictions for the test data; it can be calculated easily by dividing the number of correct predictions by the number of total predictions.
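A brief sketch of the Hold-Out and Cross-Validation strategies and of the accuracy measure, with synthetic data standing in for the credit card transactions:

# Hold-Out vs. Cross-Validation on a synthetic, imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Hold-Out: train on one part, measure accuracy on the unseen part.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf.fit(X_tr, y_tr)
# accuracy = correct predictions / total predictions
print("Hold-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Cross-Validation: average accuracy over several train/test folds.
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())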
8. ALGORITHM UTILIZED

8.1 RANDOM FOREST

Random forest is a type of supervised machine learning algorithm based on ensemble learning. Ensemble learning is a type of learning in which you join different types of algorithms, or the same algorithm multiple times, to form a more powerful prediction model. The random forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name "Random Forest". The random forest algorithm can be used for both regression and classification tasks.

8.2 WORKING OF RANDOM FOREST

The following are the basic steps involved in performing the random forest algorithm (a minimal sketch of these steps is given after Section 8.3):

1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
4. For a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally, the new record is assigned to the category that wins the majority vote.

8.3 ADVANTAGES OF USING RANDOM FOREST

Pros of using random forest for classification and regression:

1. The random forest algorithm is not biased, since there are multiple trees and each tree is trained on a subset of the data. Basically, the random forest algorithm relies on the power of "the crowd"; therefore, the overall bias of the algorithm is reduced.
2. The algorithm is very stable. Even if a new data point is introduced in the dataset, the overall algorithm is not affected much, since new data may impact one tree, but it is very hard for it to impact all the trees.
3. The random forest algorithm works well when you have both categorical and numerical features.

The random forest algorithm also works well when data has missing values or has not been scaled well.
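The sketch referred to in Section 8.2 follows; the helper names build_forest and forest_predict are hypothetical, and the per-split feature randomization is delegated to scikit-learn's max_features="sqrt". In practice, scikit-learn's RandomForestClassifier packages the same procedure.

# Bagging-plus-majority-vote illustration of the steps in Section 8.2.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):                            # step 3: repeat for each tree
        idx = rng.integers(0, len(X), size=len(X))      # step 1: pick N random records
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])                        # step 2: build a decision tree
        forest.append(tree)
    return forest

def forest_predict(forest, X):
    votes = np.stack([tree.predict(X) for tree in forest])  # step 4: each tree votes
    return (votes.mean(axis=0) >= 0.5).astype(int)           # majority vote wins

# Tiny usage example with random data standing in for transactions.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
forest = build_forest(X, y)
print("Training accuracy:", (forest_predict(forest, X) == y).mean())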
10. CONCLUSION
11. REFERENCES