
Credit Card Default Prediction

Big Data Technology (BIA 678-C)

Prof. Seyed Mohammad Nikouei

Final Project Report

Team – 2

Aditya More

Omkar Kurhade

Viraj Jadhao

Adit Ghanekar

CONTENTS

ABSTRACT……………………………...……………………………………………………… 3

INTRODUCTION………………………………………………………….…………………… 4

DATASET DESCRIPTION……….……………………...………………….………………… 5

DATA PROCESSING APPROACH……………….…………………….…………………… 6

PYSPARK IMPLEMENTATION……………….……………………………….…………… 8

METHODOLOGY…………………….…….…………………………………….…………… 9

EXPLORATORY DATA ANALYSIS………………………………………………...……… 11

CREDIT LIMIT PROBABILITY……………………...……………………………….……..12

PAYMENT CORRELATION PLOT……….…………………………………………………14

MARITAL STATUS – BOX PLOT……………………………………………………………16

CREDIT LIMIT BY SEX………………………………………………………………………18

BASELINE MODEL…………………………………………………………………...………20

OPTIMIZATION………………………………………………………………………….……23

FEATURE SELECTION………………………………….………………………………...…24

HYPER-PARAMETER TUNING…………………….………………………………………25

CLASS IMBALANCE……………………………………………………………………….…26

RESULTS…………………………………….……………………………………………...… 27

CONCLUSION………………………………….……………………………………………...28

REFERENCES………………………………………………………………………………….28

ABSTRACT

The use of machine learning algorithms in predicting credit card defaults has become increasingly

popular in recent years. Our report discusses the implementation of such algorithms to build a

predictive model that can accurately identify customers who are likely to default on their credit

card payments. The project explains how the data used to train the model was collected from a

financial institution's credit card database. The dataset contained information about the customers'

demographics, payment history, credit utilization, and other factors that may influence their

creditworthiness. The report then describes the steps taken to prepare the data for analysis,

including feature engineering, data cleaning, and data normalization. Once the data was prepared,

several machine learning algorithms were tested to find the best-performing model. The report

concludes with a discussion of the results obtained from the implementation of the predictive

model. The models were able to accurately predict credit card defaults with an accuracy of over

80%, demonstrating the effectiveness of machine learning in credit risk assessment. Overall, the

report provides an informative overview of the application of machine learning in predicting credit

card defaults, highlighting the importance of data preparation and the benefits of using advanced

algorithms to identify high-risk customers.

INTRODUCTION

According to data from the Federal Reserve, credit card delinquency rates have been on the rise

since 2016, and while there was a sharp decrease in Q1 2020, it was largely due to COVID relief

measures. When credit card holders fall behind on their payments, banks must charge off these

accounts and assume the financial loss. However, with the help of machine learning classification

techniques, it may be possible to predict which customers are most likely to default on their credit

card payments, allowing banks to take preventative measures to minimize losses.

By analyzing data on credit card holders, such as their demographic information, credit card usage

history, and payment patterns, machine learning algorithms can identify patterns and risk factors

associated with credit card defaults. By leveraging these insights, banks can provide customers

with alternative options to help them avoid falling behind on their payments, such as forbearance

or debt consolidation.

With the ability to predict credit card defaults, banks can potentially save significant amounts of

money by avoiding charge-offs and reducing the number of delinquent accounts. Additionally, this

technology could help customers by providing them with more personalized and proactive support,

ultimately leading to better financial outcomes for everyone involved.

DATASET DESCRIPTION
Features:

• The credit card dataset consists of various features that provide information about the credit

card holder. One of the key features is the credit limit, which is the maximum amount of

credit that the individual and their family can use.

• Other features include sex, education, marital status, and age. Sex is represented as 1 for

male and 2 for female. Education is categorized as 1 for graduate school, 2 for university,

3 for high school, and 4 for others. Marital status is represented as 1 for married, 2 for

single, and 3 for others.

• The dataset also includes the history of past payments, which is an important factor in

determining creditworthiness. This feature is measured on a scale where -1 indicates that

the payment was made duly, and 1-9 indicates the number of months of delay in payment.

• The amount of the bill statement and previous payments for the past 6 months are also

included in the dataset. This information provides insight into the spending and repayment

habits of the credit card holder.

These features can be used to build predictive models to determine the likelihood of default on a

credit card. By analyzing this information, credit card companies can make informed decisions

about granting credit and setting credit limits. The dataset is a valuable resource for researchers

and data scientists interested in exploring credit card defaults and developing new models to

predict default rates.
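As a brief illustration (not code from the original report), the sketch below loads the data with pandas and maps the categorical codes described above to readable labels. The file name credit_default.csv and the UCI-style column names (LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE) are assumptions for the example.

```python
import pandas as pd

# Hypothetical file name; the UCI source ships as a spreadsheet and may need conversion first.
df = pd.read_csv("credit_default.csv")

# Map the categorical codes described above to readable labels.
df["SEX_LABEL"] = df["SEX"].map({1: "male", 2: "female"})
df["EDUCATION_LABEL"] = df["EDUCATION"].map(
    {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
)
df["MARRIAGE_LABEL"] = df["MARRIAGE"].map({1: "married", 2: "single", 3: "others"})

print(df[["LIMIT_BAL", "SEX_LABEL", "EDUCATION_LABEL", "MARRIAGE_LABEL", "AGE"]].head())
```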

Snapshot of Dataset

DATA PROCESSING APPROACH

The vast amount of data generated every second from various sources requires efficient processing

and analysis to derive meaningful insights. Real-time data processing is critical for many

applications, including Google search results. Spark Streaming is an extension of the basic Spark

API that enables scalable and fault-tolerant processing of large-scale live data streams.

Spark Streaming offers a high-level abstraction called the DStream (discretized stream), which represents a continuous stream of data. DStreams are processed in batches and support complex algorithms such as machine learning and graph processing. To ingest data from sources such as Kafka, Kinesis, or TCP sockets,

Spark Streaming uses a receiver-based approach. The data is processed in micro-batches,

providing near-real-time processing of live streams.

The processed data can then be stored in file systems or databases for further analysis or presented

on live dashboards for monitoring. Spark Streaming collects incoming data streams and separates

them into batches, which are then processed by the Spark engine to produce the final batch of

results. This batch processing approach enables scalable and efficient processing of live data

streams, making it an essential tool for data analysis and decision-making.

Furthermore, Spark Streaming is fault-tolerant, meaning that it can recover from any failures

during processing. This is achieved through RDD lineage, where every RDD (Resilient Distributed

Dataset) in a Spark Streaming application is associated with the RDDs that created it. Therefore,

if a node fails, Spark Streaming can recover lost data from RDD lineage.

Overall, Spark Streaming is an essential tool for processing live data streams in a scalable,

efficient, and fault-tolerant manner. Its support for complex algorithms and various data sources

makes it a versatile platform for real-time data analysis and decision-making.
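A minimal DStream sketch along these lines is shown below. It assumes records arrive as CSV lines over a TCP socket on localhost:9999 with the default label as the last field; the host, port, and field layout are illustrative assumptions, not the project's actual pipeline.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "CreditCardStream")
ssc = StreamingContext(sc, batchDuration=5)  # process micro-batches every 5 seconds

# Assumed source: CSV lines streamed over a TCP socket (host/port are placeholders).
lines = ssc.socketTextStream("localhost", 9999)

# Count defaults vs. non-defaults per micro-batch, assuming the label is the last field.
labels = lines.map(lambda line: (line.split(",")[-1], 1))
counts = labels.reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```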

PYSPARK IMPLEMENTATION

Spark Structured Streaming is a powerful stream processing engine that is built on top of Spark

SQL. It is designed to handle large-scale streaming data and update the final results incrementally

as more data arrives.

To use Spark Structured Streaming, a Spark session is created and the data source is streamed. The

data stream is then registered as a temporary table, which can be queried using Spark SQL

functions. For example, you can use grouping functions to count the number of values for specific

columns in the data stream.

Once the Spark SQL function has been applied to the data stream, the result is returned as a Spark

DataFrame. This DataFrame can be converted to a pandas DataFrame using the ‘toPandas()’

function. Pandas is a popular data analysis library in Python that provides high-performance data

manipulation tools.

By converting the Spark DataFrame to a pandas DataFrame, you can easily manipulate and analyze

the data. You can also visualize the data using various plotting libraries available in Python. This

enables you to gain insights into the streaming data and make informed decisions based on the

results.
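A minimal Structured Streaming sketch of this workflow is shown below. The source directory, schema, and column names (including default_flag for the target) are assumptions for illustration; the report's actual code is not reproduced here.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CreditCardDefaultStream").getOrCreate()

# Assumed source: CSV files arriving in a directory are treated as a stream.
schema = "LIMIT_BAL DOUBLE, SEX INT, EDUCATION INT, MARRIAGE INT, AGE INT, default_flag INT"
stream_df = (
    spark.readStream.schema(schema)
    .option("header", "true")
    .csv("data/incoming/")  # placeholder path
)

# Register the stream as a temporary view and query it with Spark SQL grouping functions.
stream_df.createOrReplaceTempView("credit_stream")
counts = spark.sql(
    "SELECT default_flag, COUNT(*) AS n FROM credit_stream GROUP BY default_flag"
)

# Write the incrementally updated counts to an in-memory table for inspection.
query = (
    counts.writeStream.outputMode("complete")
    .format("memory")
    .queryName("default_counts")
    .start()
)
query.processAllAvailable()

# Convert the result to a pandas DataFrame for analysis or plotting, as described above.
pdf = spark.sql("SELECT * FROM default_counts").toPandas()
print(pdf)
```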

METHODOLOGY

Exploratory Data Analysis:

The first step in any machine learning project is to explore and understand the data. In this case,

the credit card default dataset should be thoroughly examined to identify any missing or erroneous

data and to gain an understanding of the variables and their relationships. This analysis can include

statistical summaries, visualization of distributions and correlations, and identification of any

outliers or anomalies in the data.

Baseline Model:

The baseline model is a simple, initial model that serves as a benchmark for the performance of

more complex models. In this case, a baseline model could be a basic logistic regression model

that uses only a few variables. This model can be used to assess the difficulty of the credit card

default prediction problem and provide a baseline level of performance against which to compare

more advanced models.

Optimization:

Once a baseline model has been established and performance metrics have been identified, the

next step is to optimize the model. This can involve a variety of techniques, such as feature

selection, feature engineering, and model selection. Feature selection involves selecting the most

relevant variables for predicting credit card defaults, while feature engineering involves creating

new variables based on the existing data. Model selection involves choosing the most appropriate

machine learning algorithm to fit the data.

Feature Importance:

After optimizing the model, it is important to understand which features are the most important for

predicting credit card defaults. This can be done using feature importance techniques, such as

permutation importance or SHAP values. These techniques help identify the most influential

features in the model and can provide insights into the underlying patterns in the data.

Hyperparameter Tuning:

Hyperparameter tuning involves adjusting the settings of the machine learning algorithm to

optimize its performance. This step can be time-consuming but is essential for achieving the best

possible results. Techniques such as grid search or randomized search can be used to find the

optimal hyperparameters for the selected model.

Class Imbalance:

Credit card default datasets often suffer from class imbalance, meaning that there are many more

non-defaulters than defaulters in the data. This can cause issues with model performance and

accuracy, as the model may be biased towards the majority class. Techniques such as oversampling

or undersampling can be used to address this imbalance and improve model performance.

Analyze Results:

Finally, the results of the machine learning analysis should be carefully analyzed and interpreted.

This involves looking at the model's performance metrics, identifying any areas for improvement,

and interpreting the feature importance results. These insights can be used to refine the model or

to identify potential actions that can be taken to prevent credit card defaults.

EXPLORATORY DATA ANALYSIS

Target distribution

• The adjacent graph displays the distribution of default and non-default payments in a

dataset. In this dataset, 0 represents "No Default" and 1 represents "Default." The graph

reveals that out of approximately 30,000 records, roughly 24,000 have no default payment and only about 6,000 have a default payment. This suggests

that the target classes are highly imbalanced, as non-defaults far outnumber defaults in this

dataset.

• Imbalanced datasets are common in credit card payment datasets since most people pay

their credit cards on time. This is particularly true in the absence of an economic crisis. As

a result, most of the data will be non-default cases, while the number of default cases will

be relatively small.

• Imbalanced datasets pose a challenge for machine learning models since they can skew the

model's predictions. In this case, a model that is trained on this dataset may be biased

toward predicting non-default cases since they are the majority class. Consequently, the

model may not perform well in predicting default cases, which can result in severe

consequences, particularly in credit card payment datasets.

• To address this challenge, several techniques can be employed to balance the dataset, such

as under-sampling, oversampling, and Synthetic Minority Over-sampling Technique

(SMOTE). These techniques aim to either reduce the number of non-default cases or

increase the number of default cases to create a more balanced dataset.

• In summary, the adjacent graph demonstrates the imbalanced distribution of default and

non-default cases in a credit card payment dataset. The prevalence of non-default cases is

a common occurrence in these datasets, posing a challenge for machine learning models.

Employing balancing techniques can help mitigate this challenge and improve model

performance.

Credit Limit Amount - Probability Density

• Credit limit is the maximum amount of money that a credit card company allows a

cardholder to borrow. It is an important factor in determining an individual's

creditworthiness and financial stability. Understanding the distribution of credit limit

amounts can provide valuable insights into the credit card usage patterns of the

population.

• In this case, the distribution of credit limit amounts is revealed through the observation

that the three largest credit limit amount groups are $50k, $20k, and $30k. This suggests

that a significant portion of the population holds credit cards with these specific credit

limit amounts. The frequency of these three groups can help credit card companies tailor

their marketing strategies and promotions to target these groups.

• Furthermore, the statement that most of the population has a credit limit balance of

approximately 100k implies that there is a wide range of credit limit amounts across the

population. While the three largest credit limit amount groups indicate some clustering,

there is still significant variability in the credit limit amounts held by individuals.

• Knowing the distribution of credit limit amounts is also essential for assessing the risk of

default or delinquency. Higher credit limits generally indicate higher creditworthiness

and a more robust financial standing. However, it also means that the potential loss in the

event of a default is greater. On the other hand, lower credit limits may indicate a lower

risk of default but may limit the cardholder's purchasing power.

• In conclusion, the observation of the three largest credit limit amount groups and the

majority of the population holding a credit limit balance of approximately 100k provides

valuable insights into the credit card usage patterns of the population. This information

can be useful for credit card companies in tailoring their marketing strategies and

assessing the risk of default or delinquency.

Payment Correlation Plot

• Correlation is a statistical measure that describes the degree to which two variables are

related to each other. Correlation strength can vary depending on the time frame in which

the variables are measured. It is generally assumed that the closer the time frame, the

stronger the correlation. This is because variables tend to have a greater impact on each

other when measured over a shorter period.

• For example, if a person has a late payment on their credit card in August, it is likely that

they will also have a late payment in September. This is because the impact of the late

payment in August will carry over to the following month. As a result, the correlation

between the two months is likely to be strong.

• On the other hand, it is less clear if the same assumption can be made for April and

September. The time gap between the two months is greater, and there are likely to be more

intervening factors that could affect credit card payments during that time. For instance, a

person may have received a bonus in May that allowed them to pay their credit card balance

in full in June, thus having no impact on their credit card payments in September.

• Furthermore, there may be seasonal or cyclical factors that could impact credit card

payments at different times of the year. For example, a person may have more expenses

during the summer months due to vacations, while their expenses in the winter months may

be more focused on holiday shopping. These factors could impact credit card payments

differently across the months, making it harder to assume a strong correlation between

April and September.

• In summary, correlation strength tends to increase the closer the time frame of the variables

being measured is. While it may be reasonable to assume a strong correlation between

credit card payments in August and September, it may be less clear to make the same

assumption for April and September due to intervening factors and seasonal or cyclical

factors that could impact credit card payments differently across the months.

Marital Status - Box Plot

• The graph that depicts the percentile of people who own a credit card according to their

marital status and color coded according to their sex is an informative visual representation

of a dataset that provides insights into the credit card usage patterns of different groups of

people. The graph helps to understand how credit card usage is correlated with marital

status and sex.

• It is important to note that the dataset mainly consists of two groups of people: couples in

their mid-30s to mid-40s and single people in their mid-20s to early-30s. This observation

is significant because it highlights the demographic group that makes up the majority of

the dataset. The age range of the dataset is important as it shows that individuals in their

mid-30s to mid-40s are likely to be more financially stable and have a higher

creditworthiness than younger people in their mid-20s to early-30s. This is because

individuals in their mid-30s to mid-40s are likely to have more work experience, higher

salaries, and more financial stability, which makes them more likely to be approved for a

credit card.

• The graph further illustrates that married individuals are more likely to own a credit card

than single individuals. This may be because married couples are more financially stable and

have more purchasing power than single individuals. Additionally, married couples are

more likely to have a joint bank account, which makes it easier for them to apply for and

use credit cards together.

• The color-coding of the graph according to sex also provides insights into how credit card

usage patterns differ between men and women. The graph shows that men are more likely

to own a credit card than women, regardless of their marital status. This may be due to

societal factors such as the gender pay gap, which could affect women's financial stability

and creditworthiness.

• Overall, the graph depicting the percentile of people who own a credit card according to

their marital status and color-coded according to their sex provides valuable insights into

the credit card usage patterns of different groups of people. The dataset mainly consists of

couples in their mid-30s to mid-40s and single people in their mid-20s to early-30s.

Additionally, the graph illustrates that married individuals are more likely to own a credit

card than single individuals, and men are more likely to own a credit card than women.

These insights can be useful for credit card companies in tailoring their marketing strategies

and promotions to target specific demographic groups.

Credit Limit by Sex

• The graph that depicts the credit limit balance across males and females is a visual

representation of a dataset that provides insights into the credit limits of different genders.

The graph helps to understand how credit limit balances are distributed across males and

females.

• It is important to note that the dataset is evenly distributed among males and females. This

observation is significant because it highlights that there is no gender bias in the dataset.

This means that the dataset is representative of both males and females, which is essential

for providing accurate insights into credit limit balances.

• The graph further illustrates that there is a difference in credit limit balances between males

and females. The credit limit balances for males are slightly higher than those for females.

This difference in credit limit balances could be due to several factors. For instance, it could

be due to differences in income levels between males and females. Men may earn more on

average than women, which would allow them to qualify for higher credit limits.

• Another factor that could explain the difference in credit limit balances is the difference in

credit scores between males and females. Credit scores are an essential factor in

determining credit limits, and men may have higher credit scores on average than women.

This could be due to a variety of reasons, including differences in credit utilization rates,

payment histories, and length of credit history.

• It is important to note that while there is a difference in credit limit balances between males

and females, the difference is relatively small and may not be statistically significant, so there is no strong evidence of gender bias in

credit limit allocations.

• Overall, the graph depicting the credit limit balance across males and females provides

valuable insights into how credit limit balances are distributed across different genders.

The dataset is evenly distributed among males and females, which is essential for providing

accurate insights into credit limit balances. The graph illustrates that there is a small

difference in credit limit balances between males and females, which could be due to

differences in income levels or credit scores. However, there is no evidence to suggest that

there is any gender bias in credit limit allocations.

BASELINE MODEL

Prepare features and target:

Scale the data so that the model can easily digest information.
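A minimal sketch of this step is shown below, using scikit-learn and assuming the pandas DataFrame df from earlier with a target column named default_flag; the column name and split parameters are assumptions, not the report's exact code.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed target column name; the report does not show the exact schema.
X = df.drop(columns=["default_flag"])
y = df["default_flag"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features so distance-based and gradient-based models digest them easily.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```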

Score several models and choose one to improve upon; a hedged comparison sketch follows the list of models below.

Calculating Validation Score:

Models used:

• K-nearest neighbors

• Logistic Regression

• Random Forest

• XG Boost

• SVM
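The sketch below scores the five models listed above with cross-validated recall. It assumes the scaled training data from the previous sketch and that the xgboost package is installed; none of this is the report's exact code.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier  # assumes the xgboost package is available

models = {
    "K-nearest neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "SVM": SVC(),
}

# Cross-validated recall on the scaled training data from the previous step.
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring="recall")
    print(f"{name}: mean recall = {scores.mean():.3f}")
```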

OPTIMIZATION

In this scenario, optimization refers to the process of selecting the best model parameters and

tuning the model to improve its performance. The aim is to find the optimal combination of model

parameters that maximizes the recall score, which is a key performance metric in this case. The

focus is on optimizing the recall performance metric, which is a way to measure how well a

predictive model can correctly identify actual positive cases. For instance, in the context of

predicting credit card defaults, recall measures the proportion of credit card holders who actually default that are correctly identified as such by the model.

Recall score = TP / (TP + FN)

To calculate recall, the number of True Positive (TP) and False Negative (FN) predictions are

considered. True Positive represents the number of correctly predicted default cases, while False

Negative represents the number of actual default cases that are predicted as non-default.
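As a small illustration of the formula above, recall can be computed from a fitted model's predictions. The sketch assumes the train/test split and scaled features from the baseline section, and uses logistic regression only as an example model.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score

# Fit one of the baseline models (logistic regression here, purely as an example).
clf = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)

# Recall = TP / (TP + FN): the share of actual defaults the model catches.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("recall (manual):", tp / (tp + fn))
print("recall (sklearn):", recall_score(y_test, y_pred))
```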

The goal is to minimize the number of False Negatives as much as possible, as any defaults that

are not correctly predicted can lead to significant financial losses for credit card companies.

Therefore, the optimal model should have a high recall score to ensure that as many default cases

as possible are correctly identified.

By focusing on optimizing the recall performance metric, credit card companies can develop

effective strategies to prevent credit card defaults and minimize their financial losses. This

approach can also lead to the development of more accurate predictive models, enabling credit

card companies to make informed decisions about granting credit, setting credit limits, and

managing their risk effectively.

FEATURE SELECTION

When trying to identify which features are the most beneficial, there are a number of different

feature selection scores that one might apply. In this particular instance, we shall be making use of

Feature Importance.

In a nutshell, the process of giving scores to each feature in order to assess the utility of that feature

in predicting the target variable is what we mean when we talk about feature importance.

The figure above ranks the features by importance. Interestingly, 'age' is the second most important feature. We keep all features except the last six, which are the categorical variables. Dropping the least important features while retaining the most significant ones may help the model better predict the target variable.
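A minimal sketch of producing such importance scores is shown below, using a random forest's impurity-based importances as one common way to score features; the exact model behind the figure is not shown in the report, and the training data is assumed from the baseline section.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Impurity-based importance scores, sorted from most to least useful.
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(
    ascending=False
)
print(importances)

# Keep every feature except the six least important ones, as described above.
selected = list(importances.index[:-6])
X_train_sel, X_test_sel = X_train[selected], X_test[selected]
```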

HYPERPARAMETER TUNING
Every dataset and model calls for its own set of hyperparameters, which can be thought of as configuration variables that are not learned from the data. The only way to find good values is to run several separate experiments, each of which selects a set of hyperparameters and evaluates the model with them. This process is referred to as hyperparameter tuning.
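A minimal grid search sketch for the random forest is shown below; the parameter grid and the choice of model are illustrative assumptions, not the grid actually used in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values actually searched in the project may differ.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="recall",  # optimize the metric discussed above
    cv=5,
)
search.fit(X_train_sel, y_train)
print("best params:", search.best_params_)
print("best recall:", search.best_score_)
```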

CLASS IMBALANCE
During the Exploratory Data Analysis phase, we found that the target variable is highly

imbalanced. Imbalanced data sets create difficulties for predictive modeling as most machine

learning algorithms are built around the assumption that there is an equal number of examples for

each class. To address this issue, we can use various techniques, such as under-sampling or over-

sampling, to balance the target variable. In this analysis, we will use random under-sampling and

oversampling techniques to address this imbalance. Random under-sampling involves removing

data from the negative class to balance the target distribution, while random over-sampling

duplicates data from the positive class to balance the target distribution. By using both these

techniques, we can compare their effectiveness and choose the one that provides the best results.
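A hedged sketch of both techniques is shown below. It assumes the imbalanced-learn package and the selected training features from the feature-selection step; simple pandas resampling would work just as well.

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Random under-sampling: drop majority-class (non-default) rows.
under = RandomUnderSampler(random_state=42)
X_under, y_under = under.fit_resample(X_train_sel, y_train)

# Random over-sampling: duplicate minority-class (default) rows.
over = RandomOverSampler(random_state=42)
X_over, y_over = over.fit_resample(X_train_sel, y_train)

print("under-sampled class counts:", y_under.value_counts().to_dict())
print("over-sampled class counts:", y_over.value_counts().to_dict())
```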

RESULTS
The model with the fewest modifications produced the highest recall score of 0.95. After feature selection and hyperparameter optimization, the recall dropped to 0.79. The accuracy of the model is 82%.

The area under the receiver operating characteristic (ROC) curve summarizes how well the model separates defaulters from non-defaulters. The ROC curve plots the true positive rate against the false positive rate.

We measure the area under the blue ROC curve and compare it with the pink dotted diagonal, which represents random guessing. The area under the curve is a number between 0 and 1: a value near 0.5 means the model predicts no better than chance, while a value of 1 means the model correctly separates every default from every non-default.
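A minimal sketch of computing this area (the ROC AUC) with scikit-learn is shown below, assuming the tuned model and test split from the earlier sketches; the exact model behind the report's figure is not reproduced.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Probability that each test customer defaults, from the tuned model.
proba = search.best_estimator_.predict_proba(X_test_sel)[:, 1]

fpr, tpr, _ = roc_curve(y_test, proba)  # points on the ROC curve
auc = roc_auc_score(y_test, proba)      # area under that curve
print(f"ROC AUC: {auc:.3f}")
```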

CONCLUSION

The project demonstrates the potential of machine learning algorithms in predicting credit card

defaults. The project highlights the importance of feature selection and engineering in building

accurate prediction models. The variables with the highest predictive power were found to be

age, credit limit, and payment history. The results also show that gradient boosting, a popular

ensemble learning technique, outperformed other models such as logistic regression, decision

trees, and random forests. This is not surprising as gradient boosting has been shown to be

effective in handling high-dimensional data and reducing bias in model predictions. The use of

machine learning algorithms can help financial institutions and credit card companies better

assess credit risk and make more informed decisions.

REFERENCES

1) https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+client

2) https://www.hindawi.com/journals/complexity/2021/6618841/

