Capstone Project - Credit Risk Analysis

CredX
Customer Identification & Financial Benefits
ABSTRACT
Introduction : “CredX” is a leading credit card provider that gets thousands of credit card
applicants every year. But in the past few years, it has experienced an increase in credit loss. The
CEO believes that the best strategy to mitigate credit risk is to ‘acquire the right customers’.
Problem statement :
• Identifying the right Customers using Predictive Modelling.
• Utilizing past bank data to identify Risky Customers and avoid Credit Risk.
• Create Strategies to mitigate the Acquisition Risks
• Assess the Final Financial benefits of the model.
Business Understanding :
• Analyzing and understanding the Demographic and Credit Bureau data sets.
• Performance Tag indicates whether a customer has gone 90 days past due.
• The missing/NA values are identified and imputed using WOE/Information Value analysis.
• Building a predictive model using both Demographic and Credit Bureau data in order to understand the predictive power.
• Evaluating the model and validating the likelihood of default for rejected candidates.
• Building an Application Scorecard with Good to Bad odds of 10 to 1.
• Assessing the financial benefits to the organization basis the pr
Business Understanding
Demographic Dataset:
• Consists of customer-related data such as Age, Marital Status, Gender, No of Dependents, Income, Education, Profession, No of Months in Current Residence and Current Job, and Performance Tag.
• Performance Tag represents whether the customer has gone beyond 90 days past due.

Variables                                      Type
Application ID                                 <int>
Age                                            <int>
Gender                                         <chr>
Marital Status (at the time of application)    <chr>
No of dependents                               <chr>
Income                                         <dbl>
Education                                      <chr>
Profession                                     <chr>
Type of residence                              <chr>
No of months in current residence              <int>
No of months in current company                <int>
Performance Tag                                <int>
Credit Bureau Dataset:
• Consists of transaction-related data of customers such as No of times DPD in 6 months (30, 60, 90), No of times DPD in 12 months (30, 60, 90), Avg CC Utilization, No of Trades opened (6, 12 months), No of PL Trades opened (6, 12 months), No of Inquiries (6, 12 months), Presence of Open Home Loan, Outstanding Balance, Open Auto Loan, and Performance Tag.
EDA Analysis
Cleaning and preparing the shared data for analysis
RAW Data: Getting the basic Business and Data Understanding.
Step 1: Removing duplicates and cleaning the dataset.
Step 2: Identifying the missing/NA variables and replacing them, with outlier treatment.
Step 3: Merging the Demographic and Credit Bureau data basis Application ID.
Step 4: Binning the categorical variables such as Age to analyse the same in detail.
Step 5: Capping the Age and Income variables basis the analysis.
Step 6: Imputing missing variables using Information Value and identifying important variables.
Data Cleaning
Duplicate ID Identification and Cleansing:
• Both the Demographic and Credit Bureau data sets had duplicate Application IDs (IDs: 653287861, 671989187, 765011468), which have been removed from both data sets.
NA and Null Value Treatment:

• NA and NULL values have been identified in both data sets, and NULLs have been converted to NA for further treatment.
• These could be blank values, NA values, NULL values or junk values (like negative values in the age variable).
• These can be found by doing multiple types of checks:
  - Checking the type and structure of the data, and the maximum and minimum values in the case of continuous variables.
  - The objective is to convert all types of missing values to NA for easy analysis.
  - Further missing value treatment can be done using techniques like removing rows or columns / multiple imputation / WOE, depending upon which gives the best result.
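The conversion of blank, NULL and junk values to a single NA marker can be sketched in Python as follows. This is a minimal illustration, not the deck's own R code: the token list, the `to_na` helper and the sample values are all assumptions, and `None` stands in for NA.

```python
import math

# Hypothetical set of strings treated as missing; adjust to the data at hand.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "nan"}

def to_na(value, min_valid=None):
    """Return None (the stand-in for NA) for blank, NULL-like or junk values."""
    if value is None:
        return None
    if isinstance(value, str):
        stripped = value.strip()
        return None if stripped.lower() in MISSING_TOKENS else stripped
    if isinstance(value, float) and math.isnan(value):
        return None
    # Junk check for numeric fields, e.g. a negative or zero age.
    if min_valid is not None and value < min_valid:
        return None
    return value

# Made-up sample values, for illustration only.
ages = [34, -3, 0, 51, None, float("nan")]
genders = ["M", " ", "NULL", "F", "NA"]

clean_ages = [to_na(a, min_valid=1) for a in ages]
clean_genders = [to_na(g) for g in genders]
print(clean_ages)
print(clean_genders)
```

Once every kind of missing value maps to one marker, counting or imputing missing values becomes a single check per column.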
Outlier Treatment:
• Identifying outlier values through various techniques:
  - Plotting boxplots
  - Calculating data percentiles
  - Calculating Cook's distance
An example showing an approach to visualize missing values
Missing values were identified in the predictor variables, with Education having the maximum number of missing values. A total of 0.2% of values are missing in the entire demographic data set.
Missing Value Treatment
• Missing values identified in: Gender, Marital Status, No of Dependents, Education, Profession, Residence Type, AVg_CC_util_12, No._trades_6, Open_homeloan, Outstanding Bal.
• The mice package was used to impute the missing values.
An example showing an approach to visualize outlier values
Outliers have been identified in the Age column. The lower Age values were entered incorrectly and have been capped at 18 years. Data with Age < 16 is only 0.05%.
Outliers have been identified in the Income column. The lower Income values were entered incorrectly and have been capped at 5K.
Outliers have also been identified in Months in Current Company, which has been capped at 75 as there was one outlier of 133 months.
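The capping described above amounts to clamping each value to a floor and/or a ceiling. A minimal Python sketch, using the floor of 18 for Age and the ceiling of 75 for Months in Current Company quoted above; the `cap` helper and the sample values are hypothetical.

```python
def cap(values, lower=None, upper=None):
    """Clamp each value into [lower, upper]; a bound of None is ignored."""
    capped = []
    for v in values:
        if lower is not None and v < lower:
            v = lower
        if upper is not None and v > upper:
            v = upper
        capped.append(v)
    return capped

# Made-up sample values, for illustration only.
ages = [-3, 0, 15, 27, 65]
months_in_company = [3, 40, 74, 133]

capped_ages = cap(ages, lower=18)                   # floor at 18 years
capped_months = cap(months_in_company, upper=75)    # ceiling at 75 months
print(capped_ages)
print(capped_months)
```

Capping keeps the row in the data set, which matters here because default cases are scarce.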
EDA Analysis
Uni-variate and Bivariate Analysis
• Check whether the data is balanced or sparse
• In this case the data is sparse – only 4.2% of cases are defaults
• 1425 rows have Performance Tag as NA, which means these customers were not issued a credit card
• Creating bins of data using the WOE technique for continuous variables
• Plotting Performance Tag against the predictor variable bins to understand the importance of variables
• Validating the same through the Information Values of the predictor variables
• Replacing the original values of predictor variables with the WOE values of their respective bins
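For reference, WOE and Information Value for one binned variable can be computed as in the Python sketch below. It assumes the common convention WOE = ln(% of goods / % of bads) per bin, with "good" meaning non-default; the source does not state which sign convention it used. The bin counts are invented, and every bin is assumed to contain at least one good and one bad.

```python
import math

def woe_iv(bins):
    """bins maps a bin label to (n_good, n_bad). Returns ({label: woe}, iv)."""
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe = {}
    iv = 0.0
    for label, (good, bad) in bins.items():
        pct_good = good / total_good   # share of all goods falling in this bin
        pct_bad = bad / total_bad      # share of all bads falling in this bin
        woe[label] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv

# Hypothetical income bins: (non-defaulters, defaulters).
income_bins = {"low": (900, 60), "mid": (1000, 40), "high": (1100, 20)}
woe, iv = woe_iv(income_bins)
print(woe, iv)
```

Under this convention a negative WOE marks a bin with more than its share of defaulters, and the IV of a variable can be compared against the 0.02 selection threshold used in the model building steps.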
Age Wise Distribution – Defaulters
• No significant pattern was noticed in Default.

• Equally split across age groups from 28-65.
Gender Wise Distribution – Defaulters
• Female default rate is 4.3% and male default rate is 4.1%.
• Chi-Square test conducted to understand the significance.
• P Value is coming to 0.347 – Non significant.
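A chi-square test of independence on a 2x2 table (for example gender against default) can be sketched in plain Python as below. The counts are invented and will not reproduce the p-value above. The sketch uses the uncorrected Pearson statistic, whereas R's `chisq.test` applies a continuity correction to 2x2 tables by default, so results may differ slightly from the deck's.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    table = [[a, b], [c, d]]; returns (statistic, p_value) for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = female, male; columns = default, non-default.
table = [[43, 957], [41, 959]]
stat, p_value = chi_square_2x2(table)
print(stat, p_value)
```

A p-value above 0.05 means the difference in default rates between the two groups is not statistically significant at the 5% level.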
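Marital Status Wise Distribution – Defaulters
• No significant pattern was noticed in Default.
• Married default rate is 4.2% and Single default rate is 4.3%.
• Chi-Square test conducted to understand the significance.
• P Value is coming to 0.621 – Non significant.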
No of Dependents Distribution – Defaulters
• No significant pattern was noticed in Default.
• Basis the no of dependents, the mean is coming to 2.85 for both Default and Non-Default.
Income Distribution – Defaulters
• Major default is noticed in Lower income levels.

• Low P Value, hence Income is a Significant Variable
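Education Distribution – Defaulters
• No significant pattern was noticed in Default.
• Chi-Squared test done to understand the significance level.
• P Value (0.62) – Not Significant
Profession Distribution – Defaulters
• No significant pattern was noticed in Default.
• Chi-Squared test done to understand the significance level.
• P Value (0.039) – Significant Variable
Residence Type Distribution – Defaulters
• No significant pattern was noticed in Default.
• Chi-Squared test done to understand the significance level.
• P Value (0.68) – Not Significant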
Month Curr. Residence Distribution – Defaulters
• Default observed mainly in Higher income levels.
• P Value shows the variable is Significant.
Month Curr. Company Distribution – Defaulters
• Default is equally split across.
• P Value shows the variable is Significant.
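Times 90_DPD_6 Distribution – Defaulters
• Default observed in Higher Levels
• P Value checked to understand the significance level.
• Significant Variable
Avg CC_Utilisation Distribution – Defaulters
• Default observed in Higher Levels
• P Value checked to understand the significance level.
No of Trades_6 Distribution – Defaulters
• Default observed in Higher Levels
• P Value checked to understand the significance level.
No of Inq_6 Distribution – Defaulters
• Default observed in Higher Levels
• P Value checked to understand the significance level.
No of Inq_12 Distribution – Defaulters
• Default observed in Higher Levels
• P Value checked to understand the significance level.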
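Open Home Loans Distribution – Defaulters
• No pattern was noticed in Default.
• Chi-Square test conducted to understand the significance.
• P Value is coming to 0.00 – Significant.
Outstanding Balance Distribution – Defaulters
• Default observed in Higher Levels
• P Value checked to understand the significance level.
• Non-Significant Variable
Total No of Trade Distribution – Defaulters
• Default observed in Higher Levels
• P Value checked to understand the significance level.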
Open Auto Loans Distribution – Defaulters
• No pattern was noticed in Default.
• Chi-Square test conducted to understand the significance.
• P Value is coming to 0.03 – Significant.
Model Building
Building a Model from the Cleaned Data after EDA Analysis
Step 1: Outlier removal using Cook's distance.
Step 2: Running a logistic regression model, using P value and VIF for checking multicollinearity.
Step 3: Cut-off identification and creating a Lift and Gain chart.
Step 4: WOE/IV for all predictor variables, then building a model on imputed and WOE data.
Step 5: Checking the accuracy, sensitivity and specificity on normal, imputed and WOE data.
Step 6: Building a Random Forest model to understand the accuracy.
Step 7: Creating a scorecard, creating bins and understanding the cut-off score.
Outlier Identification- Cooks Distance
• The points above the red line are outliers and can be removed.
• The red line is drawn at 60 times the mean Cook's distance.
• A total of 14 outliers were identified.
• All the Outliers are defaulters.
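The "60 times the mean distance" rule can be illustrated with a small Python sketch. For simplicity it computes Cook's distance for an ordinary least-squares line with a single predictor, which is a simplification swapped in here: the deck applies the rule to its own model in R. The function names and the data are hypothetical.

```python
def cooks_distance_simple_ols(x, y):
    """Cook's distance of each point for a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        distances.append((e * e) / (p * mse) * leverage / (1 - leverage) ** 2)
    return distances

def flag_outliers(distances, multiplier=60):
    """Indices whose distance exceeds multiplier times the mean distance."""
    threshold = multiplier * sum(distances) / len(distances)
    return [i for i, d in enumerate(distances) if d > threshold]

# Made-up data: a clean line with one distorted observation at the end.
x = list(range(100))
y = [2 * xi + 1 for xi in x]
y[99] += 50

flagged = flag_outliers(cooks_distance_simple_ols(x, y))
print(flagged)
```

Because the threshold is relative to the mean, at most a handful of points can exceed it, which matches the small number of outliers reported above.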
Demographic - Model Building
Steps to build First Model
• To start with, build a model with only demographic data

• Split the data into train and test sets such that both sets have nearly equal proportions of the good to bad ratio of Performance Tag
• Run the first logistic regression model using predictor variables which have Information Value greater than or equal to 0.02
• Check the model summary and VIF values, then run stepAIC to get the optimum model
• Again check the model summary and VIF
• Remove variables which are not significant or have high VIF values, and run the model again
• Arrive at the final model by repeating the above step till all variables are significant and VIF <= 2.0
• Use the final model to predict probabilities on the test data
• Once the probabilities for the test data are calculated, the next task is to choose the cut-off value
• Calculate specificity, sensitivity and accuracy, and select the probability where all three coincide on the test data.
• Basis the cutoff probability, classify the test data into good and bad
• Calculate model performance on the test data. Calculate AUC (area under the ROC curve) to assess the model performance
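The cut-off step above can be sketched in Python as a scan over candidate cut-offs, keeping the one where sensitivity and specificity are closest. Accuracy is a weighted average of the two, so it coincides with them at the same point. The helper names, the predicted probabilities and the labels are invented, and the data is assumed to contain both classes.

```python
def confusion_rates(probs, labels, cutoff):
    """Sensitivity, specificity and accuracy; label 1 = default (positive class)."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 0)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(labels)
    return sensitivity, specificity, accuracy

def best_cutoff(probs, labels, steps=100):
    """Cut-off on a 1/steps grid where sensitivity and specificity are closest."""
    def gap(cutoff):
        sensitivity, specificity, _ = confusion_rates(probs, labels, cutoff)
        return abs(sensitivity - specificity)
    return min((i / steps for i in range(1, steps)), key=gap)

# Hypothetical predicted default probabilities and true labels.
probs = [0.02, 0.03, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08]
labels = [0, 0, 1, 0, 0, 1, 0, 1]

cutoff = best_cutoff(probs, labels)
sensitivity, specificity, accuracy = confusion_rates(probs, labels, cutoff)
print(cutoff, sensitivity, specificity, accuracy)
```

With a default rate of only a few percent, the balanced cut-off lands far below 0.5, which is why a cut-off near the base default rate is not surprising.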
Optimal Probability Cut Off & ROC Curve
• The optimal probability cutoff has been identified as 0.042
• Capping the lower income group at 74.
• The final model details are (Accuracy – 0.55, Sensitivity – 0.53, Specificity – 0.55)
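The area under the ROC curve can be computed without drawing the curve, as the share of (defaulter, non-defaulter) pairs in which the defaulter receives the higher predicted probability, with ties counting half. A minimal Python sketch with invented probabilities and labels:

```python
def auc(probs, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    negatives = [p for p, y in zip(probs, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted default probabilities and true labels (1 = default).
probs = [0.02, 0.03, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08]
labels = [0, 0, 1, 0, 0, 1, 0, 1]

model_auc = auc(probs, labels)
print(model_auc)
```

An AUC of 0.5 corresponds to random ranking, so a model with accuracy, sensitivity and specificity all near 0.55 would be expected to sit only modestly above that.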
Lift & Gain Chart
• Cumulative Lift and Gain chart.
• In the fifth decile we are getting a gain of 59%
• Gain – 59%

Bucket  Total  Totalresp  Cumresp  Gain      Cumlift
1       1747   115        115      15.68895  1.568895
2       1746   85         200      27.28513  1.364256
3       1746   85         285      38.88131  1.296044
4       1746   73         358      48.84038  1.22101
5       1746   76         434      59.20873  1.184175
6       1747   61         495      67.5307   1.125512
7       1746   62         557      75.98909  1.085558
8       1746   55         612      83.4925   1.043656
9       1746   62         674      91.95089  1.021677
10      1746   59         733      100       1
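The Gain and Cumlift columns follow directly from the Totalresp column. The Python sketch below recomputes them from the responder counts in the table, assuming buckets of equal size (the table's buckets hold 1746 or 1747 rows, so this is a close approximation). The `gain_lift` helper is hypothetical.

```python
def gain_lift(responders_per_bucket):
    """Rows of (bucket, responders, cumulative responders, gain %, cumulative lift)."""
    total = sum(responders_per_bucket)
    n_buckets = len(responders_per_bucket)
    rows = []
    cumulative = 0
    for bucket, responders in enumerate(responders_per_bucket, start=1):
        cumulative += responders
        gain = 100 * cumulative / total           # % of all defaulters captured so far
        population = 100 * bucket / n_buckets     # % of population covered so far
        rows.append((bucket, responders, cumulative, gain, gain / population))
    return rows

# Totalresp column from the table above.
totalresp = [115, 85, 85, 73, 76, 61, 62, 55, 62, 59]
rows = gain_lift(totalresp)
for row in rows:
    print(row)
```

A gain of 59% at the fifth bucket means that the half of applicants the model ranks as riskiest contains 59% of all defaulters, a lift of about 1.18 over random selection.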
Model Building – Demographic Data
Build First Model – Model Performance
Logistic Regression Model on Demographic data

model_dem3 <- glm(Perf_Tag ~ Income + ProfessionSE + Months_curr_Co,
                  family = "binomial", data = train1)
In the fifth decile we are getting a gain of 59%

This was a basic model using only demographic data, giving the following performance values:
  - Accuracy - 0.5588
  - Sensitivity - 0.5347
  - Specificity - 0.5599
  - Gain - 59%
Full Data - Model Building
Data Preparation
• Merge both the demographic and credit bureau data
• Some quality checks while merging the data: use the setdiff function to check that App_ID (the primary key) matches in both data sets
• After merging, the total number of rows in the merged data is the same as in the demographic / credit bureau data.
• Performance Tag values are the same in both data sets
• Calculate Information value and WOE
• Replace original predictor values with WOE values
• Select variables which have IV >= 0.02
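The key check done before merging (the role played by setdiff in the steps above) can be sketched in Python with set differences, together with a duplicate-key check. The helper names and the IDs below are made up for illustration.

```python
def key_mismatches(left_ids, right_ids):
    """IDs present in only one of the two data sets (each list sorted)."""
    left, right = set(left_ids), set(right_ids)
    return sorted(left - right), sorted(right - left)

def duplicate_ids(ids):
    """IDs that occur more than once, sorted."""
    seen, duplicates = set(), set()
    for identifier in ids:
        if identifier in seen:
            duplicates.add(identifier)
        seen.add(identifier)
    return sorted(duplicates)

# Hypothetical application IDs from the two sources.
demographic_ids = [101, 102, 103, 103, 104]
bureau_ids = [101, 102, 103, 103, 105]

only_in_demographic, only_in_bureau = key_mismatches(demographic_ids, bureau_ids)
duplicates = duplicate_ids(demographic_ids)
print(only_in_demographic, only_in_bureau, duplicates)
```

Two empty difference lists and no duplicates mean a one-to-one merge on the key will keep the row count unchanged.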
Model Building – Full Data
Model Performance
Logistic Regression Model on full data

model_full12 <- glm(formula = Perf_Tag ~ Income + ProfessionSE +
                      Months_curr.Res + Months_curr_Co + Times_30DPD_6 +
                      AVg_CC_util_12 + No._PLtrades_12,

In the fifth decile we are getting a gain of 76%
  - Gain - 76%
  - The number of defaults decreases as we move from the 1st bucket to the 10th bucket, reflecting the stability of the model
Build First Model
• Take the training data set
• Select variables which have IV >= 0.02
• Build first logistic regression model on WOE data
• Check model summary and VIF values
• Run stepAIC to get the optimum model
• Check model summary and AIC values
• Remove variables which are not significant or have high VIF values, and run the model again
• Arrive at the final model by repeating the above step till all variables are significant and VIF <= 2.0
• Use the final model to predict probabilities on the test data
• Once the probabilities for the test data are calculated, the next task is to choose the cut-off value
• Calculate specificity, sensitivity and accuracy, and select the probability where all three coincide on the test data.
• Basis the cutoff probability, classify the test data into good and bad
Model Building – Full Data on WOE Analysis
Model Performance
Logistic Regression Model on Full data WOE values
model_woe5 <- glm(formula = Perf_Tag ~ AVg_CC_util_12_woe + No._trades_12_woe +
                    No._inq_12_woe + Times_30DPD_6_woe +
                    Income_woe, family = "binomial",
                  data = train3)
  - Gain - 74%
Model Performance
Random Forest Model on Full data
The model does not perform better than the logistic regression model
  - Sensitivity – 0.5970
  - Gain - 72%
Building Credit Scorecard– Full Data
Build Credit Score card
Model selected for building scorecard

Logistic Regression Model on full data
model_full12 <- glm(formula = Perf_Tag ~ Income + ProfessionSE +
                      Months_curr.Res + Months_curr_Co + Times_30DPD_6 +
                      AVg_CC_util_12 + No._PLtrades_12,
Base Score computed - 429
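A scorecard typically maps the model's predicted probability to a score that is linear in the log of the good-to-bad odds. The Python sketch below shows one common scaling. The source only states the 10 to 1 odds and the resulting base score of 429; the anchor of 400 points at 10:1 odds and the 20 points to double the odds are assumptions chosen for illustration, so the numbers here will not reproduce 429.

```python
import math

# Assumed scaling constants (not given in the source).
TARGET_SCORE = 400   # score assigned at the target odds
TARGET_ODDS = 10     # good-to-bad odds of 10 to 1
PDO = 20             # points needed to double the odds

FACTOR = PDO / math.log(2)
OFFSET = TARGET_SCORE - FACTOR * math.log(TARGET_ODDS)

def credit_score(p_default):
    """Score from a predicted default probability, 0 < p_default < 1."""
    odds_good = (1 - p_default) / p_default
    return OFFSET + FACTOR * math.log(odds_good)

score_at_target = credit_score(1 / 11)   # odds of 10:1
score_at_double = credit_score(1 / 21)   # odds of 20:1
print(score_at_target, score_at_double)
```

Under this scaling a higher score always means a lower predicted probability of default, so a single cut-off score splits applicants into accept and reject.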

Scorecard for Rejected Applicants
Using the selected model, credit scores were computed for rejected applicants
• Data above shows the percentile of rejected applicants against credit score

• This shows that 429 can be taken as the cut-off score for accepting / rejecting applicants
• Recommendation – Reject applicants with a credit score below 429
Business Benefits
Assessing Financial Impact
• Basis the cut-off score of 429

• The credit score model will additionally help to identify 476 probable defaulters.
• Assume the loss with each default is 15 units while the profit from each non-default is 1 unit
• Loss avoided from 476 additional defaults = 476 * 15 = 7140
• Opportunity lost in profit from additionally rejecting = 5228
• Net business benefit from this model = 7140 - 5228 = 1912 units
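The arithmetic above can be written as a small Python function so the assumptions are easy to vary. The 15:1 loss-to-profit ratio and the counts come from the bullets above. Reading the 5228 units of lost opportunity as 5228 rejected non-defaulters at 1 unit each is an interpretation, and the function name is hypothetical.

```python
def net_benefit(defaults_avoided, loss_per_default, goods_rejected, profit_per_good):
    """Return (loss avoided, profit forgone, net benefit) in the same units."""
    loss_avoided = defaults_avoided * loss_per_default
    profit_forgone = goods_rejected * profit_per_good
    return loss_avoided, profit_forgone, loss_avoided - profit_forgone

loss_avoided, profit_forgone, net = net_benefit(
    defaults_avoided=476,
    loss_per_default=15,
    goods_rejected=5228,   # assumed: 5228 units of lost profit at 1 unit each
    profit_per_good=1,
)
print(loss_avoided, profit_forgone, net)
```

The result is sensitive to the assumed loss per default: at a ratio below roughly 11 to 1, the profit forgone would exceed the loss avoided.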
