Data Mining Business Report-Clustering & CART


DATA MINING PROJECT – BUSINESS REPORT

Data Mining: Clustering & CART

July-2021

Sangeeta M Chandel.

CONTENTS: -

 PROBLEM 01 - Bank_marketing_part1_Data

 PROBLEM 02 - Insurance_part2_data


PROBLEM-01

Problem 1A:

Problem Statement: Clustering

A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past few
months.

You are given the task to identify the segments based on credit card usage.

Data Dictionary for Market Segmentation:

 spending: Amount spent by the customer per month (in 1000s)
 advance_payments: Amount paid by the customer in advance by cash (in 100s)
 probability_of_full_payment: Probability of the customer paying the bank in full
 current_balance: Balance amount left in the account to make purchases (in 1000s)
 credit_limit: Limit of the amount on the credit card (in 10000s)
 min_payment_amt: Minimum amount paid by the customer while making monthly payments for purchases (in 100s)
 max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)

Question:-
1.1 Read the data and do exploratory data analysis. Describe the data briefly

 Answer:-
HEAD of DATA

TAIL of the DATA


Observation
The data looks good based on the initial records seen in the top 5 and bottom 5 rows.

Info of the Data:

Observation:
7 variables and 210 records. No missing values based on the initial analysis. All the variables are of numeric type.
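A minimal sketch of the ingestion and inspection steps above; the file name bank_marketing_part1_Data.csv is an assumption taken from the problem title and may differ:

import pandas as pd

# Read the market-segmentation sample (file name assumed from the problem title)
df = pd.read_csv("bank_marketing_part1_Data.csv")

print(df.head())          # top 5 records
print(df.tail())          # bottom 5 records
df.info()                 # expect 210 rows, 7 numeric columns, no nulls
print(df.describe().T)    # descriptive summary used for the univariate analysis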

Univariate analysis

Observation
 Based on the descriptive summary, the data looks good.
 For most of the variables, the mean and median are nearly equal.
 A 90th percentile was included to check variation; the data looks evenly distributed.
 The standard deviation is high for the spending variable.


Spending variable


max_spent_in_single_shopping variable


Observations
 The average credit limit is around 3.258 (in 10000s), i.e. roughly $32,580.
 The distribution has a right tail for all variables except probability_of_full_payment, which has a left tail.


Multivariate analysis
Check for multicollinearity

Observation - Strong positive correlation between:

 spending & advance_payments
 advance_payments & current_balance
 credit_limit & spending
 spending & current_balance
 credit_limit & advance_payments
 max_spent_in_single_shopping & current_balance

#correlation matrix


Strategy to remove outliers: We choose to replace outlier values with their respective column medians instead of dropping the records, since dropping would lose the information in the other columns, and the outliers are present in only two variables and fewer than 5 records.

Let us treat the outliers:

# Replace values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR with the column median
for column in clean_dataset.columns.tolist():
    Q1 = clean_dataset[column].quantile(0.25)   # 1st quartile
    Q3 = clean_dataset[column].quantile(0.75)   # 3rd quartile
    IQR = Q3 - Q1                               # inter-quartile range
    outliers = (clean_dataset[column] > Q3 + 1.5 * IQR) | (clean_dataset[column] < Q1 - 1.5 * IQR)
    clean_dataset.loc[outliers, column] = clean_dataset[column].median()

Observation
Most of the outliers have been treated and now we are good to go.


Observation
Though we treated the outliers, the boxplot still shows one; this is acceptable, as it is not extreme and lies on the lower band.

Question:-
1.2 Do you think scaling is necessary for clustering in this case? Justify
Answer:-

1. Scaling needs to be done, as the variables are on different scales.
2. spending and advance_payments have larger values and would otherwise get more weight in the distance calculations.
3. Plots of the data before and after scaling are shown below.
4. Scaling brings all the variables into relatively the same range.
5. I have used the z-score to standardise the data to roughly the same scale (-3 to +3).

prior to scaling


# Scaling the attributes.
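A minimal sketch of the scaling step, assuming the outlier-treated dataframe from the previous step is named clean_dataset:

from scipy.stats import zscore

# Standardise every column to mean 0 and standard deviation 1
# (values end up roughly in the -3 to +3 range)
scaled_df = clean_dataset.apply(zscore)
print(scaled_df.describe().T)   # means ~0 and std ~1 after scaling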

#after scaling

Question:-
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them

Answer:-

Creating the Dendrogram

Importing dendrogram and linkage module


Importing fcluster module to create clusters
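A minimal sketch of these hierarchical-clustering steps, assuming the scaled dataframe is named scaled_df and the original dataframe is named df (both names are assumptions):

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Ward linkage on the scaled data and the corresponding dendrogram
wardlink = linkage(scaled_df, method='ward')
dendrogram(wardlink, truncate_mode='lastp', p=10)   # show only the last 10 merges
plt.show()

# Cut the tree into 3 flat clusters and attach the labels to the original data
clusters = fcluster(wardlink, 3, criterion='maxclust')
df['H_Cluster'] = clusters
print(df.groupby('H_Cluster').mean())   # cluster-wise means for profiling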


Cluster 3

Observation
Both linkage methods give almost similar cluster means, with only minor variation, which is expected.

For cluster grouping based on the dendrogram, 3 or 4 clusters look reasonable. After further analysis, and based on the dataset, a 3-group cluster solution was chosen for the hierarchical clustering.

In a real-world setting, more variable values could have been captured - tenure, BALANCE_FREQUENCY, balance, purchases, purchase instalments, and others.

The three-group cluster solution gives a pattern based on high/medium/low spending together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).


Question:-
1.4 Apply K-Means clustering on scaled data and determine the optimum number of clusters. Apply the elbow curve
and silhouette score.

Answer:-

Cluster 01:

Cluster 02:

Cluster 03:

Cluster 04:

WSS:
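A minimal sketch of how the WSS values for the elbow curve can be computed, assuming the scaled dataframe is named scaled_df:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Within-cluster sum of squares (inertia) for k = 1..10; look for the elbow
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WSS (inertia)')
plt.show()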


silhouette_score

#Insights

From the silhouette (SC) score, the optimal number of clusters could be 3 or 4.
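A minimal sketch of the silhouette checks, again assuming the scaled dataframe is named scaled_df:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Average silhouette score for candidate values of k
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled_df)
    print(k, silhouette_score(scaled_df, labels))

# Per-record silhouette widths for the chosen 3-cluster solution
labels_3 = KMeans(n_clusters=3, random_state=1).fit_predict(scaled_df)
print(silhouette_samples(scaled_df, labels_3).min())   # ideally stays above 0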


silhouette_samples

3 Cluster Solution

K-Means Clustering & Cluster Information


transposing the cluster

Note
I am going with 3 clusters via K-Means, but I am also showing the analysis of the 4- and 5-cluster K-Means solutions. Based on the current dataset, the 3-cluster solution makes the most sense, given the spending pattern (High, Medium, Low).
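A minimal sketch of how the cluster profile table can be produced, assuming labels_3 holds the 3-cluster K-Means labels from the previous step and df is the unscaled dataframe (names are assumptions):

# Attach the 3-cluster K-Means labels to the unscaled data and profile each segment
df['Clus_kmeans3'] = labels_3                        # column name is an assumption
cluster_profile = df.groupby('Clus_kmeans3').mean()
cluster_profile['freq'] = df['Clus_kmeans3'].value_counts().sort_index()
print(cluster_profile.T)                             # transposed view of the profiles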

4-Cluster Solution


cluster_4_T


5-Cluster Solution

cluster_5_T


Question:-
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.

Answer:-

3 group cluster via Kmeans

3 group cluster via hierarchical clustering


Cluster Group Profiles

Group 1 : High Spending

Group 3 : Medium Spending

Group 2 : Low Spending

Promotional strategies for each cluster

Group 1 : High Spending Group


 Giving reward points might increase their purchases.
 max_spent_in_single_shopping is highest for this group, so they can be offered a discount/offer on the next transaction upon full payment.
 Increase their credit limit.
 Encourage their spending habits.
 Give loans against the credit card, as they are customers with a good repayment record.
 Tie up with luxury brands, which will drive more one-time maximum spending.

Group 3 : Medium Spending Group


 They are potential target customers who are paying their bills, making purchases, and maintaining a comparatively good credit score, so we can increase their credit limit or lower the interest rate.
 Promote premium cards/loyalty cards to increase transactions.
 Encourage their spending habits by tying up with premium e-commerce sites, travel portals, and airlines/hotels, as this will encourage them to spend more.

Group 2 : Low Spending Group


 These customers should be given reminders for payments. Offers can be provided on early payments to improve their payment rate.
 Encourage their spending habits by tying up with grocery stores and utilities (electricity, phone, gas, others).


Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task to make a model which
predicts the claim status and provide recommendations to management. Use CART, RF & ANN and
compare the models' performances in train and test sets.

Attribute Information:
Target: Claim Status (Claimed)
Code of tour firm (Agency_Code)
Type of tour insurance firms (Type)
Distribution channel of tour insurance agencies (Channel)
Name of the tour insurance products (Product)
Duration of the tour (Duration)
Destination of the tour (Destination)
Amount of sales of tour insurance policies (Sales)
The commission received for tour insurance firm (Commission)
Age of insured (Age)

Question:-
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, write an inference on it.

Answer:-


Observation
 10 variables
 Age, Commission, Duration, and Sales are numeric variables
 The rest are categorical variables
 3000 records, no missing ones
 9 independent variables and one target variable - Claimed

Check for missing value in any column

Observation
No missing value

Descriptive Statistics Summary

Observation
 Duration has a negative value, which is not possible - a wrong entry.
 For Commission & Sales, the mean and median differ significantly.


Observation
The maximum unique count among the categorical variables is 5.

Observation
 Data looks good at first glance


Observation
 Data looks good at first glance

Getting unique counts of all Nominal Variables


Check for duplicate data

Removing Duplicates - not removing them, as there is no unique identifier and they could be different customers.

Although 139 records appear duplicated, they can belong to different customers; there is no customer ID or any other unique identifier, so I am not dropping them.
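A minimal sketch of the duplicate check, assuming the insurance dataframe is named df_ins (an assumed name):

# Count fully duplicated rows; they are reported but not dropped (no unique customer ID)
dups = df_ins.duplicated()
print(dups.sum())            # 139 duplicated records in this dataset
print(df_ins[dups].head())   # inspect a few of them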

Univariate Analysis

Age variable


Commission variable


Duration variable


Sales variable


Observation
There are outliers in all the variables, but the Sales and Commission values can be genuine business values. Random Forest and CART can handle outliers, hence the outliers are not treated for now and we keep the data as it is.

I will treat the outliers for the ANN model and compare it after all the steps, just for comparison.


Categorical Variables

Agency Code
Count Plot:

Box Plot:


Swarm plot:

Combined violin plot and swarm plot:

Type


Channel


Product Name


Destination:


Checking pairwise distribution of the continuous variables


Checking for Correlations


Converting all objects to categorical codes
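A minimal sketch of the conversion, again assuming the insurance dataframe is named df_ins:

import pandas as pd

# Convert every object (string) column to pandas categorical codes (integers)
for col in df_ins.select_dtypes(include='object').columns:
    df_ins[col] = pd.Categorical(df_ins[col]).codes

print(df_ins['Claimed'].value_counts(normalize=True))   # proportion of 1s and 0s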


Proportion of 1s and 0s

Question:-

2.2 Data Split: Split the data into train and test sets, and build the classification models: CART, Random Forest, Artificial
Neural Network.

Answer:-

Prior to Scaling:


After Scaling:

Splitting data into training and test set

Checking the dimensions of the training and test data

Building a Decision Tree Classifier
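A minimal sketch of the split and the CART model, assuming df_ins holds the encoded data with Claimed as the target; the 70:30 split ratio and random_state=1 are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Separate predictors and target, then split 70:30 (stratified on the target)
X = df_ins.drop('Claimed', axis=1)
y = df_ins['Claimed']
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)

# CART model (gini criterion); depth/leaf settings would normally come from a grid search
dt_model = DecisionTreeClassifier(criterion='gini', random_state=1)
dt_model.fit(X_train, train_labels)

# Variable importance of the fitted tree (DTCL)
print(pd.Series(dt_model.feature_importances_, index=X.columns).sort_values(ascending=False))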


Generating Tree

Variable Importance - DTCL

Building a Random Forest Classifier


param_grid_rfcl = {
    'max_depth': [5, 10, 15],           # 20, 30, 40
    'max_features': [4, 5, 6, 7],       # 7, 8, 9
    'min_samples_leaf': [10, 50, 70],   # 50, 100
    'min_samples_split': [30, 50, 70],  # 60, 70
    'n_estimators': [200, 250, 300]     # 100, 200
}

rfcl = RandomForestClassifier(random_state=1)

grid_search_rfcl = GridSearchCV(estimator=rfcl, param_grid=param_grid_rfcl, cv=5)

grid_search_rfcl.fit(X_train, train_labels)
print(grid_search_rfcl.best_params_)
best_grid_rfcl = grid_search_rfcl.best_estimator_
best_grid_rfcl

{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}


Predicting the Training and testing data

Getting the Predicted Classes and Probs

Building a Neural Network Classifier
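A minimal sketch of the neural-network classifier using sklearn's MLPClassifier, assuming the scaled split from above; the hyperparameter values shown are assumptions, not the tuned values used in this report:

from sklearn.neural_network import MLPClassifier

# Simple multi-layer perceptron on the scaled features
nn_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=5000, random_state=1)
nn_model.fit(X_train, train_labels)

# Predicted classes and class probabilities for the train and test sets
ytrain_predict_nn = nn_model.predict(X_train)
ytest_predict_nn = nn_model.predict(X_test)
ytest_prob_nn = nn_model.predict_proba(X_test)[:, 1]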

Getting the Predicted Classes and Probs


Question:-

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model

Answer:-

CART - AUC and ROC for the training data
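A minimal sketch of the metric computations applied to each model, assuming a fitted CART model named dt_model and the split from above:

from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

# AUC and ROC curve on the training data
train_probs = dt_model.predict_proba(X_train)[:, 1]
print('AUC:', roc_auc_score(train_labels, train_probs))
fpr, tpr, _ = roc_curve(train_labels, train_probs)
plt.plot(fpr, tpr)
plt.show()

# Confusion matrix and classification report on the training data
train_preds = dt_model.predict(X_train)
print(confusion_matrix(train_labels, train_preds))
print(classification_report(train_labels, train_preds))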

CART -AUC and ROC for the test data

CART Confusion Matrix and Classification Report for the training data


CART Confusion Matrix and Classification Report for the testing data

CART Conclusion

Train Data:
 AUC: 82%
 Accuracy: 79%
 Precision: 70%
 f1-Score: 60%


Test Data:
 AUC: 80%
 Accuracy: 77%
 Precision: 80%
 f1-Score: 84%

Training and test set results are almost similar; with the overall measures reasonably high, this is a good model.

The variable importance output above identifies the most influential variable for predicting the claim status.

RF Model Performance Evaluation on Training data


RF Model Performance Evaluation on Test data

Random Forest Conclusion

Train Data:
 AUC: 86%
 Accuracy: 80%
 Precision: 72%
 f1-Score: 66%


Test Data:
 AUC: 82%
 Accuracy: 78%
 Precision: 68%
 f1-Score: 62%

Training and test set results are almost similar; with the overall measures reasonably high, this is a good model.

The RF variable importance output again points to the same variable as the most influential for predicting the claim status.

NN Model Performance Evaluation on Training data


NN Model Performance Evaluation on Test data

Neural Network Conclusion

Train Data:
 AUC: 82%
 Accuracy: 78%
 Precision: 68%
 f1-Score: 59%


Test Data:
 AUC: 80%
 Accuracy: 77%
 Precision: 67%
 f1-Score: 57%

Training and test set results are almost similar; with the overall measures reasonably high, this is a good model.

Question:-

2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.

Answer:-

Comparison of the performance metrics from the 3 models
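A minimal sketch of how such a comparison table can be assembled, using the headline values reported in the model conclusions above (the table layout is illustrative):

import pandas as pd

# Headline metrics of the three models, taken from the conclusions above
comparison = pd.DataFrame({
    'CART':          {'Train AUC': 0.82, 'Test AUC': 0.80, 'Train Accuracy': 0.79, 'Test Accuracy': 0.77},
    'Random Forest': {'Train AUC': 0.86, 'Test AUC': 0.82, 'Train Accuracy': 0.80, 'Test Accuracy': 0.78},
    'Neural Net':    {'Train AUC': 0.82, 'Test AUC': 0.80, 'Train Accuracy': 0.78, 'Test Accuracy': 0.77},
})
print(comparison)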

ROC Curve for the 3 models on the Training data


ROC Curve for the 3 models on the Test data

CONCLUSION:
I am selecting the RF model, as its accuracy, precision, recall, and F1 score are better than those of the other two models (CART & NN).


Question:-
2.5 Inference: Based on these predictions, what are the business insights and recommendations?

Answer:-

I strongly recommend we collect more real-time unstructured data and, if possible, more past data.

This can be understood by looking at the insurance data, drawing relations between different variables such as day of the incident, time, and age group, and associating them with other external information such as location, behaviour patterns, weather information, airline/vehicle types, etc.

• Streamlining online experiences benefited customers, leading to an increase in conversions, which subsequently raised profits.
• As per the data, 90% of the insurance is sold through the online channel.
• Another interesting fact: almost all the offline business has a claim associated with it; we need to find out why.
• The JZI agency resources need to be trained to pick up sales, as they are at the bottom; we need to run a promotional marketing campaign or evaluate whether to tie up with an alternate agency.
• Based on the model we are getting about 80% accuracy, so when a customer books airline tickets or plans a trip, we can cross-sell the insurance based on the claim data pattern.
• More sales happen via Agency than Airlines, yet the trend shows claims are processed more for Airlines, so we may need to deep dive into the process to understand the workflow and the reasons why.

Key performance indicators (KPIs) of insurance claims are:
• Reduce claims cycle time
• Increase customer satisfaction
• Combat fraud
• Optimize claims recovery
• Reduce claim-handling costs

Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise to new risk transfer solutions in areas like non-damage business interruption and reputational damage.

END OF PROJECT

