Credit Card Default Prediction: Final Project Report
Team – 2
Aditya More
Omkar Kurhade
Viraj Jadhao
Adit Ghanekar
CONTENTS
ABSTRACT
INTRODUCTION
DATASET DESCRIPTION
PYSPARK IMPLEMENTATION
METHODOLOGY
BASELINE MODEL
OPTIMIZATION
FEATURE SELECTION
HYPER-PARAMETER TUNING
CLASS IMBALANCE
RESULTS
CONCLUSION
REFERENCES
ABSTRACT
The use of machine learning algorithms in predicting credit card defaults has become increasingly
popular in recent years. Our report discusses the implementation of such algorithms to build a
predictive model that can accurately identify customers who are likely to default on their credit
card payments. The report explains how the data used to train the model was collected from a
financial institution's credit card database. The dataset contains information about the customers'
demographics, payment history, credit utilization, and other factors that may influence their
creditworthiness. The report then describes the steps taken to prepare the data for analysis,
including feature engineering, data cleaning, and data normalization. Once the data was prepared,
several machine learning algorithms were tested to find the best-performing model. The report
concludes with a discussion of the results obtained from the predictive
model. The models predicted credit card defaults with an accuracy of over
80%, demonstrating the effectiveness of machine learning in credit risk assessment. Overall, the
report provides an overview of the application of machine learning in predicting credit
card defaults, highlighting the importance of data preparation and the benefits of using advanced
INTRODUCTION
According to data from the Federal Reserve, credit card delinquency rates have been on the rise
since 2016, and while there was a sharp decrease in Q1 2020, it was largely due to COVID relief
measures. When credit card holders fall behind on their payments, banks must charge off these
accounts and assume the financial loss. However, with the help of machine learning classification
techniques, it may be possible to predict which customers are most likely to default on their credit
card payments.
By analyzing data on credit card holders, such as their demographic information, credit card usage
history, and payment patterns, machine learning algorithms can identify patterns and risk factors
associated with credit card defaults. By leveraging these insights, banks can provide customers
with alternative options to help them avoid falling behind on their payments, such as forbearance
or debt consolidation.
With the ability to predict credit card defaults, banks can potentially save significant amounts of
money by avoiding charge-offs and reducing the number of delinquent accounts. Additionally, this
technology could help customers by providing them with more personalized and proactive support,
DATASET DESCRIPTION
Features:
• The credit card dataset consists of various features that provide information about the credit
card holder. One of the key features is the credit limit, which is the maximum amount of
• Other features include sex, education, marital status, and age. Sex is represented as 1 for
male and 2 for female. Education is categorized as 1 for graduate school, 2 for university,
3 for high school, and 4 for others. Marital status is represented as 1 for married, 2 for
single, and 3 for others.
• The dataset also includes the history of past payments, which is an important factor in
predicting default. In the repayment status fields, -1 indicates that
the payment was made duly, and 1-9 indicates the number of months of delay in payment.
• The amount of the bill statement and previous payments for the past 6 months are also
included in the dataset. This information provides insight into the spending and repayment
These features can be used to build predictive models to determine the likelihood of default on a
credit card. By analyzing this information, credit card companies can make informed decisions
about granting credit and setting credit limits. The dataset is a valuable resource for researchers
and data scientists interested in exploring credit card defaults and developing new models to
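As a small illustration of the coding scheme described above, the following Python sketch maps the numeric codes for sex and education to readable labels. The column names (LIMIT_BAL, SEX, EDUCATION, AGE) and the sample row are assumptions made for this example, and any code not described in this section is reported as "unknown" rather than guessed.

```python
from typing import Dict

# Code-to-label maps taken from the feature descriptions in this section.
SEX = {1: "male", 2: "female"}
EDUCATION = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}


def decode(record: Dict[str, int]) -> Dict[str, object]:
    """Return a copy of record with SEX and EDUCATION codes replaced by labels."""
    decoded: Dict[str, object] = dict(record)
    decoded["SEX"] = SEX.get(record["SEX"], "unknown")
    decoded["EDUCATION"] = EDUCATION.get(record["EDUCATION"], "unknown")
    return decoded


# Hypothetical row; the column names are assumptions for illustration.
row = {"LIMIT_BAL": 20000, "SEX": 2, "EDUCATION": 2, "AGE": 24}
print(decode(row))
```

Decoding the categorical codes this way makes exploratory plots and tables easier to read, while the numeric codes can still be kept for modelling.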
Snapshot of Dataset
The vast amount of data generated every second from various sources requires efficient processing
and analysis to derive meaningful insights. Real-time data processing is critical for many
applications, including Google search results. Spark Streaming is an extension of the basic Spark
API that enables scalable and fault-tolerant processing of large-scale live data streams.
Spark Streaming offers a high-level abstraction called a DStream (discretized stream), which represents a continuous
stream of data. DStreams are processed in batches and support complex algorithms like machine
learning and graph processing. To ingest data from sources such as Kafka, Kinesis, or TCP sockets,
The processed data can then be stored in file systems or databases for further analysis or presented
on live dashboards for monitoring. Spark Streaming collects incoming data streams and separates
them into batches, which are then processed by the Spark engine to produce the final batch of
results. This batch processing approach enables scalable and efficient processing of live data
Furthermore, Spark Streaming is fault-tolerant, meaning that it can recover from any failures
during processing. This is achieved through RDD lineage, where every RDD (Resilient Distributed
Dataset) in a Spark Streaming application is associated with the RDDs that created it. Therefore,
if a node fails, Spark Streaming can recover lost data from RDD lineage.
Overall, Spark Streaming is an essential tool for processing live data streams in a scalable,
efficient, and fault-tolerant manner. Its support for complex algorithms and various data sources
PYSPARK IMPLEMENTATION
Spark Structured Streaming is a powerful stream processing engine that is built on top of Spark
SQL. It is designed to handle large-scale streaming data and update the final results incrementally
To use Spark Structured Streaming, a Spark session is created and the data source is streamed. The
data stream is then registered as a temporary table, which can be queried using Spark SQL
functions. For example, you can use grouping functions to count the number of values for specific
Once the Spark SQL function has been applied to the data stream, the result is returned as a Spark
DataFrame. This DataFrame can be converted to a pandas DataFrame using the ‘toPandas()’
function. Pandas is a popular data analysis library in Python that provides high-performance data
manipulation tools.
By converting the Spark DataFrame to a pandas DataFrame, you can easily manipulate and analyze
the data. You can also visualize the data using various plotting libraries available in Python. This
enables you to gain insights into the streaming data and make informed decisions based on the
results.
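The workflow described above can be sketched in PySpark roughly as follows. This is a minimal illustration rather than the project's actual code: the directory argument, the three column names, the view name "payments" and the query name "default_counts" are all hypothetical. Because a streaming DataFrame cannot be converted to pandas directly, the sketch writes the running counts to an in-memory table first and converts that table instead.

```python
def stream_counts_to_pandas(csv_dir: str):
    """Count streamed rows per default flag and return the result as a pandas DataFrame."""
    # Imported lazily so this sketch can be loaded where PySpark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType, StructField, StructType

    spark = SparkSession.builder.appName("credit-default-stream").getOrCreate()

    # File-based streaming sources require an explicit schema.
    # These three column names are hypothetical.
    schema = StructType([
        StructField("ID", IntegerType()),
        StructField("EDUCATION", IntegerType()),
        StructField("default_flag", IntegerType()),
    ])
    stream = spark.readStream.schema(schema).option("header", True).csv(csv_dir)

    # Register the stream as a temporary view and query it with Spark SQL.
    stream.createOrReplaceTempView("payments")
    counts = spark.sql(
        "SELECT default_flag, COUNT(*) AS n FROM payments GROUP BY default_flag"
    )

    # Write the running counts to an in-memory table, then read it back.
    query = (
        counts.writeStream
        .outputMode("complete")
        .format("memory")
        .queryName("default_counts")
        .start()
    )
    query.processAllAvailable()  # block until the files present so far are processed
    result = spark.sql("SELECT * FROM default_counts").toPandas()
    query.stop()
    return result
```

Calling the function with a directory of CSV files that match the assumed schema returns a small pandas DataFrame that can be plotted with any Python plotting library.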
METHODOLOGY
The first step in any machine learning project is to explore and understand the data. In this case,
the credit card default dataset should be thoroughly examined to identify any missing or erroneous
data and to gain an understanding of the variables and their relationships. This analysis can include
Baseline Model:
The baseline model is a simple, initial model that serves as a benchmark for the performance of
more complex models. In this case, a baseline model could be a basic logistic regression model
that uses only a few variables. This model can be used to assess the difficulty of the credit card
default prediction problem and provide a baseline level of performance against which to compare
Optimization:
Once a baseline model has been established and performance metrics have been identified, the
next step is to optimize the model. This can involve a variety of techniques, such as feature
selection, feature engineering, and model selection. Feature selection involves selecting the most
relevant variables for predicting credit card defaults, while feature engineering involves creating
new variables based on the existing data. Model selection involves choosing the most appropriate
Feature Importance:
After optimizing the model, it is important to understand which features are the most important for
predicting credit card defaults. This can be done using feature importance techniques, such as
permutation importance or SHAP values. These techniques help identify the most influential
features in the model and can provide insights into the underlying patterns in the data.
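To make the idea of permutation importance concrete, here is a minimal pure-Python sketch. It is not the implementation used in this project: the toy model, the two features (months of payment delay and age) and the six sample rows are hypothetical. The importance of a feature is estimated as the average drop in accuracy when that feature's column is shuffled.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def accuracy(predict: Callable[[Row], int], X: Sequence[Row], y: Sequence[int]) -> float:
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict: Callable[[Row], int], X: Sequence[Row],
                           y: Sequence[int], n_repeats: int = 10,
                           seed: int = 0) -> List[float]:
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy model (hypothetical): predicts default when feature 0 (months of delay) is >= 2.
def toy_predict(row: Row) -> int:
    return int(row[0] >= 2)


X = [[0, 35], [3, 41], [1, 29], [2, 52], [0, 47], [4, 33]]
y = [0, 1, 0, 1, 0, 1]
print(permutation_importance(toy_predict, X, y))
```

Because the toy model ignores the second feature, shuffling it never changes a prediction and its importance is exactly zero, which is the behaviour one would expect from an uninformative feature.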
Hyperparameter Tuning:
Hyperparameter tuning involves adjusting the settings of the machine learning algorithm to
optimize its performance. This step can be time-consuming but is essential for achieving the best
possible results. Techniques such as grid search or randomized search can be used to find the
Class Imbalance:
Credit card default datasets often suffer from class imbalance, meaning that there are many more
non-defaulters than defaulters in the data. This can cause issues with model performance and
accuracy, as the model may be biased towards the majority class. Techniques such as oversampling
or undersampling can be used to address this imbalance and improve model performance.
Analyze Results:
Finally, the results of the machine learning analysis should be carefully analyzed and interpreted.
This involves looking at the model's performance metrics, identifying any areas for improvement,
and interpreting the feature importance results. These insights can be used to refine the model or
to identify potential actions that can be taken to prevent credit card defaults.
EXPLORATORY DATA ANALYSIS
Target distribution
• The adjacent graph displays the distribution of default and non-default payments in the
dataset, where 0 represents "No Default" and 1 represents "Default." The graph
reveals that out of approximately 30,000 records, roughly 24,000 have
no default payment and only 6,000 have a default payment. This suggests
that the target classes are highly imbalanced, as non-defaults outnumber defaults about four to one in this
dataset.
• Imbalanced datasets are common in credit card payment datasets since most people pay
their credit cards on time. This is particularly true in the absence of an economic crisis. As
a result, most of the data will be non-default cases, while the number of default cases will
be relatively small.
• Imbalanced datasets pose a challenge for machine learning models since they can skew the
model's predictions. In this case, a model that is trained on this dataset may be biased
toward predicting non-default cases since they are the majority class. Consequently, the
model may not perform well in predicting default cases, which can result in severe
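• To address this challenge, several techniques can be employed to balance the dataset, such
as undersampling, oversampling, and the Synthetic Minority Over-sampling Technique
(SMOTE). These techniques aim to either reduce the number of non-default cases or
increase the number of default cases.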
• In summary, the adjacent graph demonstrates the imbalanced distribution of default and
non-default cases in a credit card payment dataset. The prevalence of non-default cases is
a common occurrence in these datasets, posing a challenge for machine learning models.
Employing balancing techniques can help mitigate this challenge and improve model
performance.
• Credit limit is the maximum amount of money that a credit card company allows a
amounts can provide valuable insights into the credit card usage patterns of the
population.
• In this case, the three most common credit limit amounts
are $50k, $20k, and $30k. This suggests
that a significant portion of the population holds credit cards with these specific credit
limit amounts. The frequency of these three groups can help credit card companies tailor
• Furthermore, most of the population has a credit limit balance of
approximately 100k, well above the three most common amounts, which shows that credit limits vary widely across the
population. While the three largest credit limit amount groups indicate some clustering,
there is still significant variability in the credit limit amounts held by individuals.
• Knowing the distribution of credit limit amounts is also essential for assessing the risk of
and a more robust financial standing. However, it also means that the potential loss in the
event of a default is greater. On the other hand, lower credit limits may indicate a lower
• In conclusion, the observation of the three largest credit limit amount groups and the
majority of the population holding a credit limit balance of approximately 100k provides
valuable insights into the credit card usage patterns of the population. This information
can be useful for credit card companies in tailoring their marketing strategies and
• Correlation is a statistical measure that describes the degree to which two variables are
related to each other. Correlation strength can vary depending on the time frame in which
the variables are measured. It is generally assumed that the closer the time frame, the
stronger the correlation. This is because variables tend to have a greater impact on each
• For example, if a person has a late payment on their credit card in August, it is likely that
they will also have a late payment in September. This is because the impact of the late
payment in August will carry over to the following month. As a result, the correlation
• On the other hand, it is less clear if the same assumption can be made for April and
September. The time gap between the two months is greater, and there are likely to be more
intervening factors that could affect credit card payments during that time. For instance, a
person may have received a bonus in May that allowed them to pay their credit card balance
in full in June, thus having no impact on their credit card payments in September.
• Furthermore, there may be seasonal or cyclical factors that could impact credit card
payments at different times of the year. For example, a person may have more expenses
during the summer months due to vacations, while their expenses in the winter months may
be more focused on holiday shopping. These factors could impact credit card payments
differently across the months, making it harder to assume a strong correlation between
• In summary, correlation tends to be stronger the closer together in time the variables
are measured. While it may be reasonable to assume a strong correlation between
credit card payments in August and September, it may be less clear to make the same
assumption for April and September due to intervening factors and seasonal or cyclical
factors that could impact credit card payments differently across the months.
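The comparison between adjacent and distant months can be checked numerically with the Pearson correlation coefficient. The sketch below is only an illustration: the repayment-status values for the eight customers are hypothetical and were chosen to show the expected pattern, not taken from the dataset.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical repayment-status values (months of delay) for eight customers.
april = [0, 2, 0, 1, 0, 3, 0, 0]
august = [0, 1, 2, 0, 0, 2, 1, 0]
september = [0, 1, 2, 1, 0, 3, 1, 0]

print("Aug vs Sep:", round(pearson(august, september), 2))
print("Apr vs Sep:", round(pearson(april, september), 2))
```

Running the same function on the real monthly repayment columns would show whether the correlation really does weaken as the gap between months grows.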
• The graph shows the percentage of people who own a credit card according to their
marital status, color-coded according to their sex. It is an informative visual representation
of the dataset that provides insights into the credit card usage patterns of different groups of
people. The graph helps to understand how credit card usage is correlated with marital
• It is important to note that the dataset mainly consists of two groups of people: couples in
their mid-30s to mid-40s and single people in their mid-20s to early-30s. This observation
is significant because it highlights the demographic group that makes up the majority of
the dataset. The age range of the dataset is important as it shows that individuals in their
mid-30s to mid-40s are likely to be more financially stable and have a higher
individuals in their mid-30s to mid-40s are likely to have more work experience, higher
salaries, and more financial stability, which makes them more likely to be approved for a
credit card.
• The graph further illustrates that married individuals are more likely to own a credit card
than single individuals. This may be because married couples are more financially stable and
have more purchasing power than single individuals. Additionally, married couples are
more likely to have a joint bank account, which makes it easier for them to apply for and
• The color-coding of the graph according to sex also provides insights into how credit card
usage patterns differ between men and women. The graph shows that men are more likely
to own a credit card than women, regardless of their marital status. This may be due to
societal factors such as the gender pay gap, which could affect women's financial stability
and creditworthiness.
• Overall, the graph depicting the percentage of people who own a credit card according to
their marital status and color-coded according to their sex provides valuable insights into
the credit card usage patterns of different groups of people. The dataset mainly consists of
couples in their mid-30s to mid-40s and single people in their mid-20s to early-30s.
Additionally, the graph illustrates that married individuals are more likely to own a credit
card than single individuals, and men are more likely to own a credit card than women.
These insights can be useful for credit card companies in tailoring their marketing strategies
• The graph that depicts the credit limit balance across males and females is a visual
representation of a dataset that provides insights into the credit limits of different genders.
The graph helps to understand how credit limit balances are distributed across males and
females.
• It is important to note that the dataset is evenly distributed among males and females. This
observation is significant because it highlights that there is no gender bias in the dataset.
This means that the dataset is representative of both males and females, which is essential
• The graph further illustrates that there is a difference in credit limit balances between males
and females. The credit limit balances for males are slightly higher than those for females.
This difference in credit limit balances could be due to several factors. For instance, it could
be due to differences in income levels between males and females. Men may earn more on
average than women, which would allow them to qualify for higher credit limits.
• Another factor that could explain the difference in credit limit balances is the difference in
credit scores between males and females. Credit scores are an essential factor in
determining credit limits, and men may have higher credit scores on average than women.
This could be due to a variety of reasons, including differences in credit utilization rates,
• It is important to note that while there is a difference in credit limit balances between males
and females, the difference is relatively small. A formal test would be needed to judge whether the difference is
statistically significant, and the graph alone provides no evidence to suggest that there is any gender bias in
• Overall, the graph depicting the credit limit balance across males and females provides
valuable insights into how credit limit balances are distributed across different genders.
The dataset is evenly distributed among males and females, which is essential for providing
accurate insights into credit limit balances. The graph illustrates that there is a small
difference in credit limit balances between males and females, which could be due to
differences in income levels or credit scores. However, there is no evidence to suggest that
BASELINE MODEL
Scale the data so that the model can easily digest information.
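The report does not state which scaling method was used. As one common possibility, the sketch below standardizes a column to zero mean and unit standard deviation; the sample credit limits and ages are hypothetical.

```python
from math import sqrt
from typing import List


def standardize(column: List[float]) -> List[float]:
    """Rescale a column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - mean) / std for v in column]


# Hypothetical values; credit limits and ages live on very different scales.
limits = [20000.0, 50000.0, 120000.0, 300000.0]
ages = [24.0, 37.0, 45.0, 29.0]
print(standardize(limits))
print(standardize(ages))
```

After scaling, both columns have comparable magnitudes, which matters for distance-based models such as k-nearest neighbors and for models fitted by gradient descent. In practice the mean and standard deviation should be computed on the training split only and then reused for validation data.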
Score several models and choose one to improve upon.
Calculating Validation Score:
Models used:
• K-nearest neighbors
• Logistic Regression
• Random Forest
• XGBoost
• SVM
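Scoring several candidate models on held-out data can be sketched as follows. This is a simplified pure-Python illustration, not the project's code: it compares only two stand-in models (a majority-class baseline and a 1-nearest-neighbor classifier) on hypothetical data, using a simple k-fold split and accuracy as the validation score.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]
Model = Callable[[Row], int]
Trainer = Callable[[List[Row], List[int]], Model]


def train_majority(X: List[Row], y: List[int]) -> Model:
    """Baseline that always predicts the most common training label."""
    majority = int(sum(y) * 2 > len(y))
    return lambda row: majority


def train_1nn(X: List[Row], y: List[int]) -> Model:
    """1-nearest-neighbor classifier using squared Euclidean distance."""
    def predict(row: Row) -> int:
        distances = [sum((a - b) ** 2 for a, b in zip(row, other)) for other in X]
        return y[distances.index(min(distances))]
    return predict


def k_fold_accuracy(trainer: Trainer, X: List[Row], y: List[int], k: int = 3) -> float:
    """Mean validation accuracy over k folds (fold i holds out rows i, i+k, ...)."""
    scores = []
    for fold in range(k):
        train_idx = [i for i in range(len(X)) if i % k != fold]
        val_idx = [i for i in range(len(X)) if i % k == fold]
        model = trainer([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(model(X[i]) == y[i] for i in val_idx)
        scores.append(correct / len(val_idx))
    return sum(scores) / k


# Hypothetical data: one feature (months of payment delay), label 1 = default.
X = [[0], [3], [1], [5], [0], [4], [0], [3], [1]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0]

for name, trainer in [("majority", train_majority), ("1-NN", train_1nn)]:
    print(name, round(k_fold_accuracy(trainer, X, y), 3))
```

The same loop extends to any number of candidates, as long as each one is wrapped as a function that takes training data and returns a predictor.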
Optimization
In this scenario, optimization refers to the process of selecting the best model parameters and
tuning the model to improve its performance. The aim is to find the combination of model
parameters that maximizes the recall score, the key performance metric in this case. Recall
measures how well a predictive model identifies actual positive cases. In the context of
predicting credit card defaults, recall is the proportion of credit card holders who
actually default that the model correctly identifies as defaulters.
To calculate recall, the numbers of True Positive (TP) and False Negative (FN) predictions are
considered: recall = TP / (TP + FN). True Positive represents the number of correctly predicted default cases, while False
Negative represents the number of actual default cases that are predicted as non-default.
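The calculation can be written out in a few lines of Python. The labels and predictions below are hypothetical, with 1 marking a default.

```python
from typing import Sequence


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Recall = TP / (TP + FN), with 1 marking a default."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp + fn == 0:
        raise ValueError("recall is undefined when there are no actual defaults")
    return tp / (tp + fn)


# Four actual defaults, of which the model catches three.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(recall(y_true, y_pred))  # 3 true positives, 1 false negative -> 0.75
```

Note that the false positive in the fifth position does not affect recall at all, which is why recall is usually read alongside precision or another metric that penalizes false alarms.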
The goal is to minimize the number of False Negatives as much as possible, as any defaults that
are not correctly predicted can lead to significant financial losses for credit card companies.
Therefore, the optimal model should have a high recall score to ensure that as many default cases
By focusing on optimizing the recall performance metric, credit card companies can develop
effective strategies to prevent credit card defaults and minimize their financial losses. This
approach can also lead to the development of more accurate predictive models, enabling credit
card companies to make informed decisions about granting credit, setting credit limits, and
Feature Selection
When trying to identify which features are the most beneficial, there are a number of different
feature selection scores that one might apply. In this particular instance, we shall be making use of
Feature Importance.
In a nutshell, feature importance assigns a score to each feature that reflects how useful that feature
is in predicting the target variable.
The graph above ranks the features by importance. Interestingly, 'age' is
the second most important feature. We keep all features except for the final six,
which are the categorical variables. It is possible that we can better predict our target
variable if we remove the least important features and keep the features that
Hyperparameter Tuning
Every dataset and model calls for its own set of hyperparameters, which are
settings chosen before training rather than learned from the data. The only way to find good values is to run several
separate experiments, each of which consists of selecting a set of hyperparameters and putting
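One systematic way to run such experiments is an exhaustive grid search, sketched below in plain Python. The hyperparameter names (max_depth, learning_rate), the grid values and the scoring function are hypothetical stand-ins; in a real run the scoring function would train a model with the given settings and return its validation recall.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Optional, Tuple


def grid_search(param_grid: Dict[str, List[Any]],
                score: Callable[..., float]) -> Tuple[Optional[Dict[str, Any]], float]:
    """Try every combination in param_grid and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Stand-in for "train the model with these settings and return its validation score".
def validation_score(max_depth: int, learning_rate: float) -> float:
    return -(max_depth - 4) ** 2 - 100 * (learning_rate - 0.1) ** 2


grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
best, best_value = grid_search(grid, validation_score)
print(best, best_value)
```

The number of experiments grows multiplicatively with each added hyperparameter (nine combinations here), which is why randomized search is often preferred for larger grids.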
Class Imbalance
During the Exploratory Data Analysis phase, we found that the target variable is highly
imbalanced. Imbalanced data sets create difficulties for predictive modeling as most machine
learning algorithms are built around the assumption that there is an equal number of examples for
each class. To address this issue, we can use various techniques, such as under-sampling or
over-sampling, to balance the target variable. In this analysis, we will use random under-sampling and
random over-sampling. Random under-sampling removes
data from the negative class to balance the target distribution, while random over-sampling
duplicates data from the positive class to balance the target distribution. By using both these
techniques, we can compare their effectiveness and choose the one that provides the best results.
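Both resampling strategies can be sketched in a few lines of Python. This is an illustrative version under simple assumptions: labels are 0 (non-default, the majority) and 1 (default, the minority), the majority class is strictly larger, and the ten sample rows are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Rows = List[Tuple[Sequence[float], int]]  # (features, label) pairs


def random_under_sample(data: Rows, seed: int = 0) -> Rows:
    """Keep a random subset of label-0 rows so both classes are the same size."""
    rng = random.Random(seed)
    negatives = [r for r in data if r[1] == 0]
    positives = [r for r in data if r[1] == 1]
    return rng.sample(negatives, len(positives)) + positives


def random_over_sample(data: Rows, seed: int = 0) -> Rows:
    """Duplicate random label-1 rows until both classes are the same size."""
    rng = random.Random(seed)
    negatives = [r for r in data if r[1] == 0]
    positives = [r for r in data if r[1] == 1]
    extra = rng.choices(positives, k=len(negatives) - len(positives))
    return negatives + positives + extra


# Hypothetical data: eight non-defaults and two defaults.
data: Rows = [([float(i)], 0) for i in range(8)] + [([10.0], 1), ([11.0], 1)]

under = random_under_sample(data)
over = random_over_sample(data)
print(len(under), sum(label for _, label in under))
print(len(over), sum(label for _, label in over))
```

Resampling should be applied to the training split only, so that validation scores still reflect the original class distribution.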
Results
The model that underwent the fewest changes produced the highest recall score
of 0.95. Following the selection of features and optimization of hyperparameters, the recall
How well the model separates the two classes can be seen by examining the area under
the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false
positive rate.
We are calculating the area under the curved line in blue, with the dotted line in pink as the reference.
This area is a number that falls between 0 and 1, with 0 indicating
that the model ranked every example incorrectly and 1 indicating that the model
ranked every example correctly.
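For readers who want to see how this area is obtained, the following sketch builds the ROC points from predicted scores and integrates them with the trapezoidal rule. The labels and scores are hypothetical, and this is a from-scratch illustration rather than the routine used to produce the plot.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Point]:
    """(FPR, TPR) pairs obtained by lowering the decision threshold one score at a time."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, y_true), reverse=True)
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last row of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: List[Point]) -> float:
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical predicted default probabilities for six customers.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(y_true, scores)), 3))
```

An area of 0.5 corresponds to a model that ranks defaulters no better than chance, so values should be judged by how far they rise above 0.5 rather than above 0.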
Conclusion
The project demonstrates the potential of machine learning algorithms in predicting credit card
defaults. The project highlights the importance of feature selection and engineering in building
accurate prediction models. The variables with the highest predictive power were found to be
age, credit limit, and payment history. The results also show that gradient boosting, a popular
ensemble learning technique, outperformed other models such as logistic regression, decision
trees, and random forests. This is not surprising as gradient boosting has been shown to be
effective in handling high-dimensional data and reducing bias in model predictions. The use of
machine learning algorithms can help financial institutions and credit card companies better
REFERENCES
1) https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+client
2) https://www.hindawi.com/journals/complexity/2021/6618841/