Credit Card Fraud Detection

Govt.Boys Senior Sec.


How can credit card fraud happen?
Some of the most common ways it may happen are:

Firstly, and most obviously, when your card details are seen by someone else.
When your card is lost or stolen and the person possessing it knows how to get
things done.
A fake phone call convincing you to share the details.
And lastly, and most improbably, a high-level hack of the bank account details.

Main challenges involved in credit card fraud detection are:
Enormous amounts of data are processed every day, and the model built must be
fast enough to respond to the scam in time.
Imbalanced data, i.e. most of the transactions (99.8%) are not fraudulent, which
makes it really hard to detect the fraudulent ones.
Data availability, as the data is mostly private.
Misclassified data can be another major issue, as not every fraudulent transaction
is caught and reported.
And last but not least, adaptive techniques used against the model by the
scammers.

How to tackle these challenges?
The model used must be simple and fast enough to detect the anomaly and
classify it as a fraudulent transaction as quickly as possible.
Imbalance can be dealt with by properly using some methods, which we will talk
about in the next paragraph.
To protect the privacy of the user, the dimensionality of the data can be
reduced.
A more trustworthy source must be used, one which double-checks the data, at
least for training the model.
We can make the model simple and interpretable so that when the scammer
adapts to it, with just some tweaks we can have a new model up and running to
deploy.

Dealing with Imbalance
We will see in the later parts of the article that the data we received is highly
imbalanced, i.e. only 0.17% of the total credit card transactions are fraudulent.
Class imbalance is a very common problem in real life and needs to be handled
before applying any algorithm to the data.

There are three common ways to deal with the imbalance of data:

Undersampling: one-sided selection by Kubat and Matwin (ICML 1997)
Oversampling: SMOTE (Synthetic Minority Oversampling Technique)
Combining the above two.

Handling the imbalance is not within the scope of this article. Here is another
article guiding you to deal with this problem specifically.
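
Purely as an illustration of the undersampling idea, here is a minimal sketch of plain random undersampling. Note that this is not the Kubat and Matwin method cited above, and it is not SMOTE either. The function name random_undersample, its ratio parameter, and the toy data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

def random_undersample(df, label_col="Class", ratio=1.0, seed=0):
    """Keep every minority (label 1) row and a random subset of majority rows.

    ratio = number of majority rows kept per minority row.
    """
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    n_keep = min(len(majority), int(len(minority) * ratio))
    majority_kept = majority.sample(n=n_keep, random_state=seed)
    # Shuffle so the two classes are interleaved
    return pd.concat([minority, majority_kept]).sample(frac=1.0, random_state=seed)

# Toy data: 1,000 rows with 1% "fraud" (hypothetical, not the Kaggle dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.exponential(50.0, size=1000),
    "Class": [1] * 10 + [0] * 990,
})

balanced = random_undersample(toy, ratio=1.0)
print(balanced["Class"].value_counts())
```

Random undersampling throws away most of the majority class, so it is usually applied to the training split only, never to the data used for evaluation.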

For those of you wondering why we should even bother if fraudulent
transactions are so rare, here is another fact. The amount of money involved in
fraudulent transactions reaches billions of USD, and by increasing the specificity by
0.1% we can save millions of USD. Whereas higher sensitivity means fewer people

The importance of Machine Learning and Data Science cannot
be overstated. If you are interested in studying past trends and
training machines to learn with time how to define scenarios,
identify and label events, or predict a value in the present or
future, data science is of the essence. It is essential to study the
underlying data and model it by selecting an appropriate
algorithm to approach any such use case. The various control
parameters of the algorithm need to be tweaked to fit the data
set. As a result, the developed application improves and
becomes more efficient in solving the problem.
In this blog, we have attempted to illustrate the modeling of a
data set using a machine learning paradigm classification, with
Credit Card Fraud Detection being the base. Classification is a
machine learning paradigm that involves deriving a function that
will separate data into categories, or classes, characterized by a
training set of data containing observations (instances) whose
category membership is known. This function is then used in
identifying in which of the categories a new observation belongs.
Problem Statement:
The Credit Card Fraud Detection Problem includes modeling
past credit card transactions with the knowledge of the ones that
turned out to be fraud. This model is then used to identify
whether a new transaction is fraudulent or not. Our aim here is
to detect 100% of the fraudulent transactions while minimizing
the incorrect fraud classifications.
Data Set Analysis:
This problem has been picked from Kaggle.
1. The data set is highly skewed, consisting of 492 frauds in a
total of 284,807 observations. This resulted in only 0.172%
fraud cases. This skewed set is justified by the low number
of fraudulent transactions.
2. The dataset consists of numerical values from the 28
‘Principal Component Analysis (PCA)’ transformed features,
namely V1 to V28. Furthermore, there is no metadata about
the original features provided, so pre-analysis or feature
study could not be done.
3. The ‘Time’ and ‘Amount’ features are not transformed data.
4. There is no missing value in the dataset.
Inferences drawn:
1. Owing to such imbalance in data, an algorithm that does not
do any feature analysis and predicts all the transactions as
non-frauds will still achieve an accuracy of about 99.83%.
Therefore, accuracy is not a correct measure of efficiency in
our case. We need some other standard of correctness
while classifying transactions as fraud or non-fraud.
2. The ‘Time’ feature does not indicate the actual time of the
transaction and is more of a list of the data in chronological
order. So we assume that the ‘Time’ feature has little or no
significance in classifying a fraud transaction. Therefore, we
eliminate this column from further analysis.
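
The arithmetic behind the first inference can be checked in a few lines. This is only a sketch using the two counts quoted above (492 frauds out of 284,807 observations); the variable names are made up for the example.

```python
n_total = 284_807   # observations in the dataset
n_fraud = 492       # fraud cases

fraud_rate = n_fraud / n_total
# A model that labels everything "non-fraud" is right on every non-fraud row...
baseline_accuracy = (n_total - n_fraud) / n_total
# ...but it catches none of the frauds
baseline_recall = 0 / n_fraud

print(f"fraud rate:        {fraud_rate:.3%}")
print(f"baseline accuracy: {baseline_accuracy:.3%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

The baseline looks excellent on accuracy while detecting no fraud at all, which is why the metrics that focus on the fraud class matter more here.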
Credit Card Fraud Detection is a typical example of
classification. In this process, we have focused more on
analyzing the feature modeling and possible business use cases
of the algorithm’s output than on the algorithm itself. We used
the implementation of Binomial Logistic Regression Algorithm in
the ‘ROCR’ package on the PCA transformed Credit Card Fraud

Some Definitions:
The following are essential definitions – in the current problem’s
context – needed to understand the approaches mentioned.
• True Positive: The fraud cases that the model predicted as
‘fraud.’
• False Positive: The non-fraud cases that the model
predicted as ‘fraud.’
• True Negative: The non-fraud cases that the model
predicted as ‘non-fraud.’
• False Negative: The fraud cases that the model predicted
as ‘non-fraud.’
• Threshold Cutoff Probability: Probability at which the true
positive ratio and true negative ratio are both highest. It
can be noted that this probability is very small, which is
reasonable as the probability of fraud is low.
• Accuracy: The measure of correct predictions made by the
model – that is, the ratio of fraud transactions classified as
fraud and non-fraud classified as non-fraud to the total
transactions in the test data.
• Sensitivity: Sensitivity, or True Positive Rate, or Recall, is
the ratio of correctly identified fraud cases to total fraud
cases.
• Specificity: Specificity, or True Negative Rate, is the ratio of
correctly identified non-fraud cases to total non-fraud cases.
• Precision: Precision is the ratio of correctly predicted fraud
cases to total predicted fraud cases.
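
To make these definitions concrete, here is a small sketch that computes the four ratios from confusion-matrix counts. The helper name classification_metrics and the counts passed to it are hypothetical and chosen only for illustration; they are not results from this article. The sketch also assumes none of the denominators are zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the ratios defined above from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # recall / true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),
    }

# Hypothetical counts for illustration only
metrics = classification_metrics(tp=80, fp=20, tn=9880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, accuracy and specificity are both above 99% while sensitivity and precision are only 80%, which shows how the majority class can mask weak fraud detection.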

Hello coders, in case you jumped directly to this part, here is what
you need to know. Credit card fraud is bad, and we have to find a
way to identify fraud using some of the features given to us in the
data, on which you can rely completely for now. So without
further ado, let’s get started.

First choose a platform. I prefer Google Colab, but Kaggle is amazing
too. You can compare these two from this article in terms of GPU
configuration, as the price is not a factor (both of them are free to
use).

If you want me to write an article on how to use Google Colab, the
Kaggle platform, or your local machine to build your classifier, then
please let me know in the comments below 😉.

Here is the GitHub link to the repository of the Notebook. You can
fork it and even push to suggest some changes in the repository.
Feel free to try it out.

Importing dependencies

Here is the code to import all the dependencies needed

# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import gridspec

In case you have not installed all these dependencies, I would
recommend installing the Anaconda distribution, which
includes most of the packages you will come across. You can
also watch this video for a Windows guide to installing Anaconda
or read this article by Analytics Vidhya for Mac or Linux.

Loading the Data

You have to first download the data from the Kaggle website.
Click the download button next to the new Notebook button
in the middle of the screen.

Now you can use this code to load the dataset into the IPython
notebook you are working on.

Note: The path in the parentheses must be the path where
you stored the dataset on your machine. If you are using
Colab, then you can mount your drive to the notebook and
provide your Google Drive’s directory path for the dataset.

# Load the dataset from the csv file using pandas
data = pd.read_csv('/content/drive/My Drive/creditcard.csv')

Understanding the Data
Grab a peek at the data

Due to some confidentiality issues, the original features are replaced
with V1, V2, … V28 columns, which are the result of a PCA
transformation applied to the original ones. The only features which
have not been transformed with PCA are ‘Time’ and ‘Amount’.
The feature ‘Class’ is the response variable, and it takes the value 1 in case of
fraud and 0 otherwise.


Time: Number of seconds elapsed between this transaction and the first
transaction in the dataset.

Amount: Transaction amount.

Class: 1 for fraudulent transactions, 0 otherwise.

Know the numbers

You can choose to uncomment the second line if you want to work on a
smaller dataset first, and then when everything is working fine, comment it
out once again and run all the cells.

# Print the shape of the data
# data = data.sample(frac=0.1, random_state = 48)
print(data.shape)
print(data.describe())

Fig.2 Describing the data

Let’s separate the Fraudulent cases from the authentic ones and
compare their occurrences in the dataset.
# Determine number of fraud cases in dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]
# Ratio of fraud to valid cases (assumed definition of outlier_fraction)
outlier_fraction = len(Fraud) / float(len(Valid))
print(outlier_fraction)
print('Fraud Cases: {}'.format(len(Fraud)))
print('Valid Transactions: {}'.format(len(Valid)))

Only 0.17% of all the transactions are fraudulent. The data is highly
unbalanced. Let’s first apply our
models without balancing it, and if we don’t get good accuracy, then
we can find a way to balance this dataset.
Fig.5 percentage of fraudulent Cases

print("Amount details of fraudulent transaction")
data[data['Class'] == 1].Amount.describe()

Fig.6 Amount details of fraudulent transaction

print("details of valid transaction")
data[data['Class'] == 0].Amount.describe()

Fig.7 Amount details of a valid transaction

As we can clearly see from this, the average transaction amount
is higher for the fraudulent ones. This makes this problem crucial to
deal with.

The correlation matrix graphically gives us an idea of how features
correlate with each other and can help us predict which
features are most relevant for the prediction.

# Correlation matrix
corrmat = data.corr()
fig = plt.figure(figsize = (12, 9))
sns.heatmap(corrmat, vmax = .8, square = True)
plt.show()

Fig.8 Correlation Matrix

In the heatmap we can clearly see that most of the features do not
correlate with other features, but there are some features that
have either a positive or a negative correlation with each other. For example,
“V2” and “V5” are highly negatively correlated with the feature
called “Amount”. We also see some correlation between “V20” and
“Amount”. This gives us a deeper understanding of the data
available to us.
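
Reading exact values off a heatmap can be hard, so one option is to list each feature's correlation with “Amount” directly. The sketch below uses a small synthetic frame as a stand-in for the real data; the values and the construction of the toy “Amount” column are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical values, same column style)
rng = np.random.default_rng(0)
n = 500
v2 = rng.normal(size=n)
toy = pd.DataFrame({
    "V1": rng.normal(size=n),
    "V2": v2,
    # Amount is built to fall as V2 rises, so the two correlate negatively
    "Amount": 100 - 30 * v2 + rng.normal(scale=5, size=n),
})

# Correlation of every feature with Amount, strongest (by magnitude) first
corr_with_amount = (
    toy.corr()["Amount"]
    .drop("Amount")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_amount)
```

On the real dataset the same expression, applied to data instead of toy, would rank V1 to V28 by how strongly they move with the transaction amount.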

With that out of the way let’s proceed with dividing the data values
into Features and Target.
# dividing the X and the Y from the dataset
X = data.drop(['Class'], axis=1)
Y = data['Class']
# getting just the values for the sake of processing (it's a numpy array with no columns)
X_data = X.values
Y_data = Y.values

Using scikit-learn to split the data into training and testing sets.

# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X_data, Y_data, test_size = 0.2, random_state = 42)
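
With so few fraud cases, a purely random split can leave the test set with an unrepresentative share of frauds. One possible variation, not used in the original notebook, is to pass the stratify argument of train_test_split. The toy arrays below are hypothetical and only demonstrate the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10,000 rows, 0.5% positives (hypothetical, for illustration)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(10_000, 3))
y_toy = np.zeros(10_000, dtype=int)
y_toy[:50] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both splits keep the same share of positives
print("train fraud share:", y_tr.mean())
print("test fraud share: ", y_te.mean())
```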

Building the Isolation Forest Model

Isolation forest is generally used for anomaly detection. Feel free to
have a look at this video if you want to learn more about this
algorithm.

# Building another model/classifier: ISOLATION FOREST
from sklearn.ensemble import IsolationForest
# Create and fit the model (assumed setup with default hyperparameters)
ifc = IsolationForest(random_state=1)
ifc.fit(X_train)
scores_pred = ifc.decision_function(X_train)
y_pred = ifc.predict(X_test)
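
One detail that is easy to miss is that IsolationForest.predict returns +1 for inliers and -1 for outliers, while the ‘Class’ column uses 0 for valid and 1 for fraud. The sketch below shows one way to remap the predictions before comparing them with the labels. It runs on hypothetical toy data, and the contamination value of 0.01 is an assumption chosen to match that toy data, not a recommendation for the real dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data (hypothetical): 990 "normal" points near the origin, 10 far-away outliers
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
outliers = rng.uniform(low=6.0, high=10.0, size=(10, 2))
X_toy = np.vstack([normal, outliers])
y_true = np.array([0] * 990 + [1] * 10)

model = IsolationForest(contamination=0.01, random_state=1)
model.fit(X_toy)

# predict() returns +1 for inliers and -1 for outliers;
# remap to the dataset's convention: 0 = valid, 1 = fraud
raw = model.predict(X_toy)
y_hat = np.where(raw == -1, 1, 0)

n_errors = int((y_hat != y_true).sum())
print("flagged as fraud:", int(y_hat.sum()))
print("misclassified:   ", n_errors)
```

Once the predictions are in the 0/1 convention, they can be compared against Y_test with the sensitivity, specificity, and precision ratios rather than accuracy alone.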

