
mlproj


DEVISHREE B

PERUMAL K
JANESH PURUSHOTTAMAN
VISHWAA R

CONTENTS
1. Problem statement
2. Data Summary
3. Data Preprocessing
4. Exploratory Data Analysis
5. Univariate Analysis
6. Bivariate Analysis
7. Multivariate Analysis
8. Treatment of Outliers
9. Building of Model
10. Random forest model
11. Adaboost Model
12. Gradient Boosting Model
13. Compare the performance of the Models
14. Business recommendation

Problem Statement

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good
source of income for banks because of the different kinds of fees they charge, such as annual fees, balance
transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are
charged to every user irrespective of usage, while others are charged under specified circumstances. Customers
leaving the credit card service would lead to a loss for the bank, so the bank wants to analyse its customer
data, identify the customers who will leave its credit card services, and understand the reasons, so that it can
improve in those areas. As a data scientist at Thera Bank, you need to come up with a classification model that
will help the bank improve its services so that customers do not renounce their credit cards. You need to
identify the best possible model that gives the required performance.

DATA SUMMARY:

The dataset “BankChurners” contains information on 10,127 customers, with 21 attributes such as age, gender,
education level, marital status, income category, and credit card details.

The dataset comprises various types of variables:

1. Numerical Variables:
a. CLIENTNUM: Unique identifier (int64).
b. Customer_Age: Age of the customer (int64).
c. Dependent_count: Number of dependents (int64).
d. Months_on_book: Number of months as a customer (int64).
e. Total_Relationship_Count: Total number of products held by the customer (int64).
f. Months_Inactive_12_mon: Number of inactive months in the last 12 months (int64).
g. Contacts_Count_12_mon: Number of contacts in the last 12 months (int64).
h. Credit_Limit: Credit limit on the credit card (float64).
i. Total_Revolving_Bal: Total revolving balance on the credit card (int64).
j. Avg_Open_To_Buy: Average open to buy on the credit card (float64).
k. Total_Amt_Chng_Q4_Q1: Change in transaction amount, expressed as the ratio of Q4 to Q1 (float64).
l. Total_Trans_Amt: Total transaction amount (int64).
m. Total_Trans_Ct: Total transaction count (int64).
n. Total_Ct_Chng_Q4_Q1: Change in transaction count, expressed as the ratio of Q4 to Q1 (float64).
o. Avg_Utilization_Ratio: Average utilization ratio of the credit card (float64).

2. Categorical Variables:
a. Attrition_Flag: Indicates whether the customer has churned or not (object).
b. Gender: Gender of the customer (object).
c. Education_Level: Education level of the customer (object).
d. Marital_Status: Marital status of the customer (object).
e. Income_Category: Income category of the customer (object).
f. Card_Category: Type of credit card the customer holds (object).

3. Missing Values:
a. Education_Level: 1,519 missing values.
b. Marital_Status: 749 missing values.
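
The missing-value counts above can be obtained with a null check on the loaded data. The sketch below is a minimal illustration in pandas; the small frame `df` is a made-up stand-in, since the report does not show how the file was loaded.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the BankChurners data (values are illustrative only).
df = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 40],
    "Education_Level": ["Graduate", np.nan, "High School", np.nan],
    "Marital_Status": ["Married", "Single", np.nan, "Married"],
})

# Count missing values per column and keep only the columns that have any.
missing = df.isna().sum()
missing = missing[missing > 0]
print(missing)

# Column types, comparable to the int64 / float64 / object listing above.
print(df.dtypes)
```

On the real data the same two calls would be expected to list Education_Level and Marital_Status as the only columns with missing values.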

Null Data:

Descriptive Statistics of the Data of Ordinal/Numerical Variable:

Descriptive Statistics of the Data of Categorical Variable:

Data Preprocessing

Dropping CLIENTNUM: We exclude the CLIENTNUM column from the dataset because it does not contribute to the
analysis goals. The column is a unique identifier for each customer and contains no data about customer
attributes or behaviours.

Converting Variables to Category Type:

• Categorical data typically have a finite number of unique values, so storing them as 'Category' type can
reduce memory usage compared to storing them as 'Object' type, which can be particularly advantageous
for large datasets.
• We also convert the discrete count variables 'Total_Relationship_Count', 'Months_Inactive_12_mon', and
'Contacts_Count_12_mon' to the category type, so that each count is treated as a distinct group or level.
This makes the group-wise analysis easier to carry out and to interpret.
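
The two preprocessing steps above could look roughly like this in pandas. This is a sketch on made-up rows, not the authors' code; the identifier values are placeholders.

```python
import pandas as pd

# Illustrative rows only; CLIENTNUM values here are placeholders.
df = pd.DataFrame({
    "CLIENTNUM": [1001, 1002, 1003],
    "Gender": ["M", "F", "M"],
    "Total_Relationship_Count": [5, 6, 4],
    "Months_Inactive_12_mon": [1, 1, 2],
    "Contacts_Count_12_mon": [3, 2, 0],
})

# The identifier carries no behavioural information, so it is dropped.
df = df.drop(columns=["CLIENTNUM"])

# Object columns and the discrete count columns are stored as 'category'.
cat_cols = ["Gender", "Total_Relationship_Count",
            "Months_Inactive_12_mon", "Contacts_Count_12_mon"]
df = df.astype({col: "category" for col in cat_cols})
print(df.dtypes)
```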

Income Category: Within the Income_Category column there is a placeholder label "abc" that accounts for
approximately 10% of the dataset. This proportion is significant, and imputing these values with the mode or
another imputer could distort the data analysis and the decisions based on it. Therefore, during data
preprocessing, we relabel the "abc" values as "Unknown."

Marital Status: Given that approximately 7% of the dataset lacks information regarding marital status, which
represents a relatively small portion of the dataset, we opt to impute the missing values using the mode.

Education Level: Around 15% of the Education_Level data is missing, a proportion large enough to influence
decision-making. Replacing or imputing these values could substantially alter the dataset and consequently
affect the insights derived from it.
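
The handling described for these three columns can be sketched as follows. The rows and category labels are illustrative assumptions; Education_Level is deliberately left untouched here because the report does not state a replacement value for it.

```python
import numpy as np
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "Income_Category": ["Less than $40K", "abc", "$120K +", "abc"],
    "Marital_Status": ["Married", np.nan, "Single", "Married"],
    "Education_Level": ["Graduate", np.nan, "Doctorate", "Graduate"],
})

# Relabel the placeholder instead of imputing it.
df["Income_Category"] = df["Income_Category"].replace("abc", "Unknown")

# Impute the small share of missing marital statuses with the mode.
mode_status = df["Marital_Status"].mode()[0]
df["Marital_Status"] = df["Marital_Status"].fillna(mode_status)

print(df)
```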

Exploratory Data Analysis:

Univariate Analysis:

Customer_Age

• The distribution of age is approximately normal.
• The boxplot shows a few outliers in the age data and a slight skew towards the right, indicating wider
dispersion towards the older age group.
• The median and mean age of our customers are both around 46 years, while the upper end of the range
extends to approximately 73 years.

Months_on_book

• Months_on_book peaks at around 35-36 months; most customers have held their credit card for about this
long. There are many outliers at both the lower and the higher end.

Credit_Limit

• The distribution of Credit_Limit is right-skewed, with a peak at 35,000, as seen in the above plot. This is
the maximum limit and seems to be some kind of default value. There are many outliers at the higher end.
• The fitted normal curve confirms that the distribution is highly skewed towards the right, with outliers.

Total_Revolving_Bal

• The histogram shows that Total_Revolving_Bal has an unusual distribution: many customers have a
revolving balance of about 0, the values in between follow an almost normal distribution, and there is a
sudden peak at about 2,500.

Avg_Open_To_Buy

• Avg_Open_To_Buy has the same shape of distribution as Credit_Limit: it is highly skewed towards the
right tail.

Total_Amt_Chng_Q4_Q1

• Total_Amt_Chng_Q4_Q1 has many outliers at both the lower and the upper end. Some customers have a
Q4-to-Q1 ratio of total transaction amount as high as about 3.5.
• The values are roughly normally distributed, with long tails on both sides.

Total_Trans_Amt

• Total_Trans_Amt also has a very unusual distribution, with clusters of values between 0-2,500, 2,500-5,000,
7,500-10,000, and 12,500-17,500. It has many outliers in the right tail.

Total_Trans_Ct

• The majority of customers made around 65 transactions in the last 12 months.
• There are some outliers at the far right end.

Total_Ct_Chng_Q4_Q1

• The distribution of the total transaction count change is similar to that of the total transaction amount,
with extreme skewness on the right side.
• The mean and median of the distribution are both around 0.7, where the value compares the Q4 transaction count with the Q1 count.

Avg_Utilization_Ratio

• Avg_Utilization_Ratio is a measure of how much of the available credit limit the customer is actually
using. It ranges from 0 to 1.

Bivariate Analysis
Correlation Matrix:

• There is a very high correlation between credit limit and average open-to-buy.
• Total transaction count shows a high correlation with total transaction amount.
• The duration a customer has been with the bank (months on book) exhibits a positive correlation with
customer age; as age increases, the duration of the customer's relationship with the bank also tends to
increase.
• The average utilization ratio is positively correlated with total revolving balance and negatively correlated
with credit limit and average open-to-buy.
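
A correlation matrix like the one described above can be computed with `DataFrame.corr()`. The sketch below uses synthetic data, and it assumes (for illustration only, not as a definition taken from this report) that open-to-buy behaves like credit limit minus revolving balance and that utilization behaves like revolving balance divided by credit limit. Under those assumptions the same sign pattern appears.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data; the two derived columns are assumptions.
credit_limit = rng.uniform(1500, 35000, size=n)
revolving_bal = rng.uniform(0, 1500, size=n)
toy = pd.DataFrame({
    "Credit_Limit": credit_limit,
    "Total_Revolving_Bal": revolving_bal,
    "Avg_Open_To_Buy": credit_limit - revolving_bal,
    "Avg_Utilization_Ratio": revolving_bal / credit_limit,
})

# Pearson correlation matrix, i.e. the table behind a correlation heatmap.
corr = toy.corr()
print(corr.round(2))
```

Because credit limit and open-to-buy carry almost the same information in such a setup, a model gains little from having both.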

BAR PLOT ANALYSIS

• A bar plot is drawn for every categorical column.
• A bar graph represents categorical data, which is data that can be grouped into categories. Each category
is represented by a separate bar, and the height or length of each bar corresponds to the frequency or
value of the category it represents. Bar graphs are useful for comparing the sizes or frequencies of different
categories.

ATTRITION FLAG

GENDER

EDUCATION LEVEL

CARD CATEGORY

MARITAL STATUS

CONTACTS COUNT 12 MONTH

Attrition_Flag Vs Customer_Age

Attrition_Flag Vs Months_on_book

Attrition_Flag Vs Credit_Limit

Attrition_Flag Vs Total_Revolving_Bal

Attrition_Flag Vs Avg_Open_To_Buy

Attrition_Flag Vs Total_Amt_Chng_Q4_Q1

Attrition_Flag Vs Total_Trans_Amt

Attrition_Flag Vs Total_Trans_Ct

Attrition_Flag Vs Total_Ct_Chng_Q4_Q1

Attrition_Flag Vs Avg_Utilization_Ratio

STACKED BAR PLOT REPRESENTATION:

• The stacked bar chart (aka stacked bar graph) extends the standard bar chart from looking at numeric
values across one categorical variable to two. Each bar in a standard bar chart is divided into a number of
sub-bars stacked end to end, each one corresponding to a level of the second categorical variable.
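
The table behind a stacked bar plot is a cross-tabulation of the two categorical variables. The sketch below, on made-up rows with assumed class labels, builds both the raw counts and the row-normalised shares. The distinction matters when comparing groups of very different sizes.

```python
import pandas as pd

# Illustrative rows; the label strings are assumptions.
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Attrition_Flag": ["Attrited Customer", "Existing Customer",
                       "Attrited Customer", "Existing Customer",
                       "Existing Customer", "Existing Customer",
                       "Attrited Customer", "Existing Customer"],
})

# Raw counts per (Gender, Attrition_Flag) pair.
counts = pd.crosstab(toy["Gender"], toy["Attrition_Flag"])

# Share of each attrition class within a gender; each row sums to 1.
rates = pd.crosstab(toy["Gender"], toy["Attrition_Flag"], normalize="index")
print(counts)
print(rates)
```

Calling `rates.plot(kind="bar", stacked=True)` would draw the stacked bars if matplotlib is installed.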

Attrition_Flag Vs Gender

• The stacked bar plot shows that female customers are more inclined towards credit card attrition than
male customers.

Attrition vs Age

Attrition_Flag Vs Dependent_Count

• Customers with three and two dependents exhibit a higher attrition rate compared to those with zero, one,
four, and five dependents.

Attrition_Flag Vs Marital_Status

• Married customers are more susceptible to attrition compared to single individuals, with divorced
customers exhibiting the lowest attrition rates.

Attrition_Flag Vs Income_Category

• Customers earning less than $40K have a higher chance of attrition than other customers, while
customers earning more than $120K have a very low chance of attrition.

Attrition_Flag Vs Card_Category

• Since the majority of cardholders fall under the Blue category, the number of attrited customers is
consequently highest there, followed by the Silver and Gold categories.

Attrition_Flag Vs Total_Relationship_Count

• Customers with a product/relationship count of three are the most prone to attrition, and those with two
products show a similar level of susceptibility.

Attrition_Flag Vs Months_Inactive_12_mon

• Customers who were inactive for three months are more prone to attrition, followed by those inactive for
two months, and then four months of inactivity.

Attrition_Flag Vs Contacts_Count_12_mon

• Similar to inactive customers, those who were contacted three times are more prone to attrition, followed
by those contacted two times and then four times.

Multivariate Analysis:

Attrition_Flag Vs Gender Vs Marital Status

• The catplot above reveals that married female customers have a higher likelihood of attrition compared to
their male counterparts.

Treatment of Outliers

• There are many outliers in the data; however, we are not going to treat them.
• We refrain from treating outliers because doing so could hinder model performance and impede our
ability to identify customers prone to attrition. Outliers play a vital role in identifying both strong and
weak customers. Our main objective is to cover the complete spectrum of customer behaviours and
attributes, including outliers, while maintaining good model performance.
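
Even when outliers are kept, it can help to count them. The sketch below flags values with the usual boxplot rule (more than 1.5 times the interquartile range beyond the quartiles). The rule, the function name `iqr_outlier_mask`, and the sample values are assumptions for illustration; the report does not state which rule its boxplots used.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up credit limits with one very large value.
credit_limit = np.array([1500, 2000, 2500, 3000, 3500, 4000, 34500])
mask = iqr_outlier_mask(credit_limit)
print(mask.sum(), "of", mask.size, "values flagged")  # 1 of 7 values flagged
```

The flagged rows are only reported here, not removed or capped, which matches the decision above to keep them.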

Building of Model:
Encoding of Categorical Variables:
We have encoded the categorical variables as numeric columns so that the models can use them.
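
The report does not say which encoding was applied. One common option, shown here purely as an assumption, is one-hot encoding with `pandas.get_dummies`; dropping the first level of each variable is also a choice made for this sketch.

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Card_Category": ["Blue", "Silver", "Blue"],
    "Customer_Age": [45, 49, 51],
})

# One indicator column per category level, minus the first level of each variable.
encoded = pd.get_dummies(toy, columns=["Gender", "Card_Category"], drop_first=True)
print(encoded.columns.tolist())
```

One-hot encoding increases the column count, which would be consistent with a wider feature table after encoding.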

Splitting of Data into Test and Train

• The data has been partitioned to allocate 70% for training the model and reserve the remaining 30% for
testing its performance.
• 7,088 rows and 49 columns go to training the model, while the remaining 3,039 rows are used for testing
its performance.
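
The row counts follow from a 70/30 split of the 10,127 customers when the test share is rounded up, which is the convention scikit-learn's `train_test_split` follows. Whether that function was used here is an assumption, and the helper `split_indices` below is a hypothetical pure-Python stand-in.

```python
import math
import random

def split_indices(n_rows, test_size=0.3, seed=1):
    """Shuffle row indices and split them into train and test parts."""
    n_test = math.ceil(test_size * n_rows)  # the test share is rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10127)
print(len(train_idx), len(test_idx))  # 7088 3039
```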

Imbalanced Data

Imbalanced data refers to a situation in a dataset where the distribution of classes or categories is highly skewed,
meaning that one class is significantly more prevalent than the others.

Imbalanced data can pose challenges for machine learning models because they tend to be biased towards the
majority class, leading to poor performance in predicting minority classes.

In this case the data is imbalanced: the target variable consists of only ~16% attrited customers, while the
remaining ~84% are existing customers.

Upsampling the data using SMOTE:

In this instance, we employed upsampling on the data by generating synthetic entries for class 1 through the
SMOTE technique, which involves creating new instances within the minority class by interpolating between
existing samples. Following the application of SMOTE, the resulting dataset consists of the following number of
rows
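
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation the authors may have used; the function name `smote_sketch`, the neighbour count `k`, and the sample points are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)         # starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                           # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.6]])
X_new = smote_sketch(X_minority, n_new=10)
print(X_new.shape)  # (10, 2)
```

Each synthetic row lies on the line segment between a minority sample and one of its nearest minority neighbours. Oversampling is normally applied to the training portion only, so that the test set keeps the original class balance; the report does not state which portion was resampled.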

Random Forest Model

The Random Forest model performs well, achieving perfect accuracy (100%) on the training data and
approximately 94.8% accuracy on the test data. The gap between the two suggests some overfitting to the
training set. The model excels at correctly identifying non-attrited customers (class 0), with a precision of 96%
and recall of 98%. However, it shows lower precision (88%) and recall (79%) for attrited customers (class 1).
The weighted average F1-score remains high at approximately 95%, but this figure is dominated by class 0
because non-attrited customers make up most of the dataset. In summary, the Random Forest model is accurate
overall, particularly in identifying non-attrited customers, but it may benefit from further optimization to
improve its performance on attrited customers.

AdaBoost model

• The AdaBoost model demonstrates strong performance, achieving a train accuracy of approximately
96.8% and a test accuracy of around 94.7%. It excels in correctly identifying non-attrited customers (class
0), with a precision of 98% and recall of 96%. For attrited customers (class 1), the precision is lower at
81%, but the recall is higher at 88%, meaning that more of the customers who actually attrite are detected.
The weighted average F1-score remains high, indicating good overall performance. Class imbalance is
present, with non-attrited customers outnumbering attrited ones. In summary, the AdaBoost model performs
well, particularly in detecting attrited customers, but may still benefit from further optimization to improve
precision for class 1 without sacrificing performance on class 0.

Gradient Boosting model

The Gradient Boosting model demonstrates exceptional performance, boasting a train accuracy of approximately
98.1% and a test accuracy of around 96.3%. It excels in correctly identifying both non-attrited (class 0) and attrited
(class 1) customers, with high precision and recall scores for both classes. Specifically, for class 0, it achieves a
precision of 98% and recall of 97%, while for class 1, it achieves a precision of 86% and recall of 91%. The
weighted average F1-score remains high, indicating excellent overall performance. Additionally, class imbalance
is present, with non-attrited customers outnumbering attrited ones. In summary, the Gradient Boosting model
performs exceptionally well, demonstrating superior accuracy and effectiveness in identifying both classes
accurately.
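
The three models discussed above can be fitted and scored in one loop with scikit-learn. The sketch below runs on synthetic imbalanced data, and the hyperparameters and random seeds are arbitrary assumptions, so its numbers will not match the figures reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 16% positives, standing in for the real features.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.84, 0.16], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, test_pred),
        "recall_class_1": recall_score(y_test, test_pred),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Reporting train and test accuracy side by side makes an overfitting gap easy to spot, and the class 1 recall shows how many attrited customers each model actually catches.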

COMPARING THE PERFORMANCE OF THE MODELS

Random Forest Model:

Achieves perfect accuracy on the training data and approximately 94.8% accuracy on the test data. It has a
precision of 96% and recall of 98% for class 0, and a precision of 88% and recall of 79% for class 1.

AdaBoost Model:

Achieves a train accuracy of approximately 96.8% and a test accuracy of around 94.7%. It has a precision of 98%
and recall of 96% for class 0, and a precision of 81% and recall of 88% for class 1.

Gradient Boosting Model:

Achieves a train accuracy of approximately 98.1% and a test accuracy of around 96.3%. It has a precision of 98%
and recall of 97% for class 0, and a precision of 86% and recall of 91% for class 1.

Considering accuracy, precision, recall, and overall performance, the Gradient Boosting model appears
to be the best among the three. It achieves high accuracy on both training and test data and demonstrates
balanced precision and recall for both classes.
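
Since the goal is to catch attrited customers, the class 1 F1-score gives a single number for comparing the models. The values below are derived from the rounded precision and recall figures reported above; they are not figures reported by the models themselves.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 (attrited) precision and recall as reported above.
reported = {
    "Random Forest": (0.88, 0.79),
    "AdaBoost": (0.81, 0.88),
    "Gradient Boosting": (0.86, 0.91),
}

f1_class_1 = {name: round(f1_score_from(p, r), 2)
              for name, (p, r) in reported.items()}
print(f1_class_1)
# {'Random Forest': 0.83, 'AdaBoost': 0.84, 'Gradient Boosting': 0.88}
```

On this measure Gradient Boosting is also ahead, which agrees with the conclusion drawn from accuracy.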

Business Recommendation

Marketing Campaigns:

Targeted marketing is the strategy of tailoring and personalizing online advertising according to data
acquired from the intended audience.

Develop targeted marketing campaigns tailored towards married customers, particularly focusing on those with
incomes less than $40K to mitigate attrition rates. Additionally, prioritize efforts towards female customers,
especially those aged between 40 and 50, who exhibit a higher likelihood of attrition.

Diversification of product offerings:

Consider diversifying product offerings to attract customers with different demographic profiles. For example,
introduce credit card packages specifically designed for married individuals, offering benefits or rewards that cater
to their needs and preferences.

Personalized Engagement:

Implement personalized engagement strategies to re-engage inactive customers. Provide incentives or special
offers to encourage them to resume card usage. Furthermore, optimize contact frequency based on customer
behaviour to prevent over-contacting and potential annoyance.

Risk prediction and management:

Develop risk evaluation models that consider demographic factors such as marital status, income level, and
product count to identify customers at higher risk of attrition. Implement proactive measures, such as targeted
retention efforts or loyalty programs, to retain these high-risk customers.

Need to educate customers and earn customer loyalty:

Provide educational resources and support services to assist customers in better understanding credit card
benefits and usage. Enhance communication channels to address customer concerns promptly and proactively
resolve issues to improve overall satisfaction and loyalty.
