Data Mining Business Report-Clustering & CART


DATA MINING PROJECT – BUSINESS REPORT

Data Mining: Clustering & CART

July-2021

Sangeeta M Chandel.

CONTENTS: -

 PROBLEM 01 - Bank_marketing_part1_Data

 PROBLEM 02 - Insurance_part2_data


PROBLEM-01

Problem 1A:

Problem Statement: Clustering

A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past few
months.

You are given the task to identify the segments based on credit card usage.

Data Dictionary for Market Segmentation:

 spending: Amount spent by the customer per month (in 1000s)
 advance_payments: Amount paid by the customer in advance by cash (in 100s)
 probability_of_full_payment: Probability of the customer paying the bank in full
 current_balance: Balance amount left in the account to make purchases (in 1000s)
 credit_limit: Limit of the amount on the credit card (in 10000s)
 min_payment_amt: Minimum amount paid by the customer while making monthly payments for purchases (in 100s)
 max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)

Question:-
1.1 Read the data and do exploratory data analysis. Describe the data briefly

 Answer:-
HEAD of DATA

TAIL of the DATA


Observation
The data looks good based on the initial records seen in the top 5 and bottom 5 rows.

Info of the Data:

Observation:
7 variables and 210 records. No missing values based on the initial analysis. All the variables are of numeric type.
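A minimal sketch of the ingestion and inspection steps above; the file name bank_marketing_part1_Data.csv is an assumption taken from the problem title and may differ:

import pandas as pd

# Read the market-segmentation sample (file name assumed from the problem title)
df = pd.read_csv("bank_marketing_part1_Data.csv")

print(df.head())          # top 5 records
print(df.tail())          # bottom 5 records
df.info()                 # expect 210 rows, 7 numeric columns, no nulls
print(df.describe().T)    # descriptive summary used for the univariate analysis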

Univariate analysis

Observation
 Based on the descriptive summary, the data looks good.
 For most of the variables, the mean and median are nearly equal.
 A 90th percentile was included to check variation; the data looks evenly distributed.
 The standard deviation is high for the spending variable.


Spending variable


max_spent_in_single_shopping variable


Observations
 The average credit limit is around 3.258 (in 10000s), i.e. roughly $32,580.
 The distribution has a right tail for all variables except probability_of_full_payment, which has a left tail.


Multivariate analysis
Check for multicollinearity

Observation - Strong positive correlation between:

 spending & advance_payments
 advance_payments & current_balance
 credit_limit & spending
 spending & current_balance
 credit_limit & advance_payments
 max_spent_in_single_shopping & current_balance

#correlation matrix


Strategy to remove outliers: We choose to replace outlier values with their respective column medians instead of dropping the records, since dropping would lose the information in the other columns, and the outliers are present in only two variables and fewer than 5 records.

Let us treat the outliers:

# Replace values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR with the column median
for column in clean_dataset.columns.tolist():
    Q1 = clean_dataset[column].quantile(0.25)   # 1st quartile
    Q3 = clean_dataset[column].quantile(0.75)   # 3rd quartile
    IQR = Q3 - Q1                               # inter-quartile range
    outliers = (clean_dataset[column] > Q3 + 1.5 * IQR) | (clean_dataset[column] < Q1 - 1.5 * IQR)
    clean_dataset.loc[outliers, column] = clean_dataset[column].median()

Observation
Most of the outliers have been treated and now we are good to go.


Observation
Though we treated the outliers, the boxplot still shows one; this is acceptable, as it is not extreme and lies on the lower band.

Question:-
1.2 Do you think scaling is necessary for clustering in this case? Justify
Answer:-

1. Scaling needs to be done, as the variables are on different scales.
2. spending and advance_payments have larger values and would otherwise get more weight in the distance calculations.
3. Plots of the data before and after scaling are shown below.
4. Scaling brings all the variables into relatively the same range.
5. I have used the z-score to standardise the data to roughly the same scale (-3 to +3).

prior to scaling


# Scaling the attributes.
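A minimal sketch of the scaling step, assuming the outlier-treated dataframe from the previous step is named clean_dataset:

from scipy.stats import zscore

# Standardise every column to mean 0 and standard deviation 1
# (values end up roughly in the -3 to +3 range)
scaled_df = clean_dataset.apply(zscore)
print(scaled_df.describe().T)   # means ~0 and std ~1 after scaling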

#after scaling

Question:-
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them

Answer:-

Creating the Dendrogram

Importing dendrogram and linkage module


Importing fcluster module to create clusters
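A minimal sketch of these hierarchical-clustering steps, assuming the scaled dataframe is named scaled_df and the original dataframe is named df (both names are assumptions):

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Ward linkage on the scaled data and the corresponding dendrogram
wardlink = linkage(scaled_df, method='ward')
dendrogram(wardlink, truncate_mode='lastp', p=10)   # show only the last 10 merges
plt.show()

# Cut the tree into 3 flat clusters and attach the labels to the original data
clusters = fcluster(wardlink, 3, criterion='maxclust')
df['H_Cluster'] = clusters
print(df.groupby('H_Cluster').mean())   # cluster-wise means for profiling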


Cluster 3

Observation
Both linkage methods give almost similar cluster means, with only minor variation, which is expected.

For cluster grouping based on the dendrogram, 3 or 4 clusters look reasonable. After further analysis, and based on the dataset, a 3-group cluster solution was chosen for the hierarchical clustering.

In a real-world setting, more variable values could have been captured - tenure, BALANCE_FREQUENCY, balance, purchases, purchase instalments, and others.

The three-group cluster solution gives a pattern based on high/medium/low spending together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).


Question:-
1.4 Apply K-Means clustering on scaled data and determine the optimum number of clusters. Apply the elbow curve
and silhouette score.

Answer:-

Cluster 01:

Cluster 02:

Cluster 03:

Cluster 04:

WSS:
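A minimal sketch of how the WSS values for the elbow curve can be computed, assuming the scaled dataframe is named scaled_df:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Within-cluster sum of squares (inertia) for k = 1..10; look for the elbow
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WSS (inertia)')
plt.show()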


silhouette_score

#Insights

From the silhouette (SC) score, the optimal number of clusters could be 3 or 4.
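A minimal sketch of the silhouette checks, again assuming the scaled dataframe is named scaled_df:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Average silhouette score for candidate values of k
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled_df)
    print(k, silhouette_score(scaled_df, labels))

# Per-record silhouette widths for the chosen 3-cluster solution
labels_3 = KMeans(n_clusters=3, random_state=1).fit_predict(scaled_df)
print(silhouette_samples(scaled_df, labels_3).min())   # ideally stays above 0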


silhouette_samples

3 Cluster Solution

K-Means Clustering & Cluster Information


transposing the cluster

Note
I am going with 3 clusters via K-Means, but I am also showing the analysis of the 4- and 5-cluster K-Means solutions. Based on the current dataset, the 3-cluster solution makes the most sense, given the spending pattern (High, Medium, Low).
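A minimal sketch of how the cluster profile table can be produced, assuming labels_3 holds the 3-cluster K-Means labels from the previous step and df is the unscaled dataframe (names are assumptions):

# Attach the 3-cluster K-Means labels to the unscaled data and profile each segment
df['Clus_kmeans3'] = labels_3                        # column name is an assumption
cluster_profile = df.groupby('Clus_kmeans3').mean()
cluster_profile['freq'] = df['Clus_kmeans3'].value_counts().sort_index()
print(cluster_profile.T)                             # transposed view of the profiles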

4-Cluster Solution


cluster_4_T


5-Cluster Solution

cluster_5_T


Question:-
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.

Answer:-

3 group cluster via Kmeans

3 group cluster via hierarchical clustering


Cluster Group Profiles

Group 1 : High Spending

Group 3 : Medium Spending

Group 2 : Low Spending

Promotional strategies for each cluster

Group 1 : High Spending Group


 Giving reward points might increase their purchases.
 max_spent_in_single_shopping is highest for this group, so they can be offered a discount/offer on the next transaction upon full payment.
 Increase their credit limit.
 Encourage their spending habits.
 Give loans against the credit card, as they are customers with a good repayment record.
 Tie up with luxury brands, which will drive more one-time maximum spending.

Group 3 : Medium Spending Group


 They are potential target customers who are paying their bills, making purchases, and maintaining a comparatively good credit score, so we can increase their credit limit or lower the interest rate.
 Promote premium cards/loyalty cards to increase transactions.
 Encourage their spending habits by tying up with premium e-commerce sites, travel portals, and airlines/hotels, as this will encourage them to spend more.

Group 2 : Low Spending Group


 These customers should be given reminders for payments. Offers can be provided on early payments to improve their payment rate.
 Encourage their spending habits by tying up with grocery stores and utilities (electricity, phone, gas, others).


Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task to make a model which
predicts the claim status and provide recommendations to management. Use CART, RF & ANN and
compare the models' performances in train and test sets.

Attribute Information:
Target: Claim Status (Claimed)
Code of tour firm (Agency_Code)
Type of tour insurance firms (Type)
Distribution channel of tour insurance agencies (Channel)
Name of the tour insurance products (Product)
Duration of the tour (Duration)
Destination of the tour (Destination)
Amount of sales of tour insurance policies (Sales)
The commission received for tour insurance firm (Commission)
Age of insured (Age)

Question:-
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, write an inference on it.

Answer:-


Observation
 10 variables
 Age, Commission, Duration, and Sales are numeric variables
 The rest are categorical variables
 3000 records, no missing ones
 9 independent variables and one target variable - Claimed

Check for missing value in any column

Observation
No missing value

Descriptive Statistics Summary

Observation
 Duration has a negative value, which is not possible - a wrong entry.
 For Commission & Sales, the mean and median differ significantly.


Observation
The maximum unique count among the categorical variables is 5.

Observation
 Data looks good at first glance


Observation
 Data looks good at first glance

Getting unique counts of all Nominal Variables


Check for duplicate data

Removing Duplicates - not removing them, as there is no unique identifier and they could be different customers.

Although 139 records appear duplicated, they can belong to different customers; there is no customer ID or any other unique identifier, so I am not dropping them.
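A minimal sketch of the duplicate check, assuming the insurance dataframe is named df_ins (an assumed name):

# Count fully duplicated rows; they are reported but not dropped (no unique customer ID)
dups = df_ins.duplicated()
print(dups.sum())            # 139 duplicated records in this dataset
print(df_ins[dups].head())   # inspect a few of them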

Univariate Analysis

Age variable


Commission variable


Duration variable


Sales variable


Observation
There are outliers in all the variables, but the Sales and Commission values can be genuine business values. Random Forest and CART can handle outliers, hence the outliers are not treated for now and we keep the data as it is.

I will treat the outliers for the ANN model and compare it after all the steps, just for comparison.


Categorical Variables

Agency Code
Count Plot:

Box Plot:


Swarm plot:

Combined violin plot and swarm plot:

Type


Channel


Product Name


Destination:


Checking pairwise distribution of the continuous variables


Checking for Correlations


Converting all objects to categorical codes
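A minimal sketch of the conversion, again assuming the insurance dataframe is named df_ins:

import pandas as pd

# Convert every object (string) column to pandas categorical codes (integers)
for col in df_ins.select_dtypes(include='object').columns:
    df_ins[col] = pd.Categorical(df_ins[col]).codes

print(df_ins['Claimed'].value_counts(normalize=True))   # proportion of 1s and 0s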


Proportion of 1s and 0s

Question:-

2.2 Data Split: Split the data into train and test sets, and build the classification models: CART, Random Forest, Artificial
Neural Network.

Answer:-

Prior to Scaling:


After Scaling:

Splitting data into training and test set

Checking the dimensions of the training and test data

Building a Decision Tree Classifier
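A minimal sketch of the split and the CART model, assuming df_ins holds the encoded data with Claimed as the target; the 70:30 split ratio and random_state=1 are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Separate predictors and target, then split 70:30 (stratified on the target)
X = df_ins.drop('Claimed', axis=1)
y = df_ins['Claimed']
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)

# CART model (gini criterion); depth/leaf settings would normally come from a grid search
dt_model = DecisionTreeClassifier(criterion='gini', random_state=1)
dt_model.fit(X_train, train_labels)

# Variable importance of the fitted tree (DTCL)
print(pd.Series(dt_model.feature_importances_, index=X.columns).sort_values(ascending=False))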


Generating Tree

Variable Importance - DTCL

Building a Random Forest Classifier


param_grid_rfcl = {
    'max_depth': [5, 10, 15],           # 20, 30, 40
    'max_features': [4, 5, 6, 7],       # 7, 8, 9
    'min_samples_leaf': [10, 50, 70],   # 50, 100
    'min_samples_split': [30, 50, 70],  # 60, 70
    'n_estimators': [200, 250, 300]     # 100, 200
}

rfcl = RandomForestClassifier(random_state=1)

grid_search_rfcl = GridSearchCV(estimator=rfcl, param_grid=param_grid_rfcl, cv=5)

grid_search_rfcl.fit(X_train, train_labels)
print(grid_search_rfcl.best_params_)
best_grid_rfcl = grid_search_rfcl.best_estimator_
best_grid_rfcl

{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}


Predicting the Training and testing data

Getting the Predicted Classes and Probs

Building a Neural Network Classifier
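A minimal sketch of the neural-network classifier using sklearn's MLPClassifier, assuming the scaled split from above; the hyperparameter values shown are assumptions, not the tuned values used in this report:

from sklearn.neural_network import MLPClassifier

# Simple multi-layer perceptron on the scaled features
nn_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=5000, random_state=1)
nn_model.fit(X_train, train_labels)

# Predicted classes and class probabilities for the train and test sets
ytrain_predict_nn = nn_model.predict(X_train)
ytest_predict_nn = nn_model.predict(X_test)
ytest_prob_nn = nn_model.predict_proba(X_test)[:, 1]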

Getting the Predicted Classes and Probs


Question:-

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model

Answer:-

CART - AUC and ROC for the training data
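A minimal sketch of the metric computations applied to each model, assuming a fitted CART model named dt_model and the split from above:

from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

# AUC and ROC curve on the training data
train_probs = dt_model.predict_proba(X_train)[:, 1]
print('AUC:', roc_auc_score(train_labels, train_probs))
fpr, tpr, _ = roc_curve(train_labels, train_probs)
plt.plot(fpr, tpr)
plt.show()

# Confusion matrix and classification report on the training data
train_preds = dt_model.predict(X_train)
print(confusion_matrix(train_labels, train_preds))
print(classification_report(train_labels, train_preds))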

CART -AUC and ROC for the test data

CART Confusion Matrix and Classification Report for the training data


CART Confusion Matrix and Classification Report for the testing data

CART Conclusion

Train Data:
 AUC: 82%
 Accuracy: 79%
 Precision: 70%
 f1-Score: 60%


Test Data:
 AUC: 80%
 Accuracy: 77%
 Precision: 80%
 f1-Score: 84%

Training and test set results are almost similar; with the overall measures reasonably high, this is a good model.

The variable importance output above identifies the most influential variable for predicting the claim status.

RF Model Performance Evaluation on Training data


RF Model Performance Evaluation on Test data

Random Forest Conclusion

Train Data:
 AUC: 86%
 Accuracy: 80%
 Precision: 72%
 f1-Score: 66%


Test Data:
 AUC: 82%
 Accuracy: 78%
 Precision: 68%
 f1-Score: 62%

Training and test set results are almost similar; with the overall measures reasonably high, this is a good model.

The RF variable importance output again points to the same variable as the most influential for predicting the claim status.

NN Model Performance Evaluation on Training data


NN Model Performance Evaluation on Test data

Neural Network Conclusion

Train Data:
 AUC: 82%
 Accuracy: 78%
 Precision: 68%
 f1-Score: 59%


Test Data:
 AUC: 80%
 Accuracy: 77%
 Precision: 67%
 f1-Score: 57%

Training and test set results are almost similar; with the overall measures reasonably high, this is a good model.

Question:-

2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.

Answer:-

Comparison of the performance metrics from the 3 models
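A minimal sketch of how such a comparison table can be assembled, using the headline values reported in the model conclusions above (the table layout is illustrative):

import pandas as pd

# Headline metrics of the three models, taken from the conclusions above
comparison = pd.DataFrame({
    'CART':          {'Train AUC': 0.82, 'Test AUC': 0.80, 'Train Accuracy': 0.79, 'Test Accuracy': 0.77},
    'Random Forest': {'Train AUC': 0.86, 'Test AUC': 0.82, 'Train Accuracy': 0.80, 'Test Accuracy': 0.78},
    'Neural Net':    {'Train AUC': 0.82, 'Test AUC': 0.80, 'Train Accuracy': 0.78, 'Test Accuracy': 0.77},
})
print(comparison)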

ROC Curve for the 3 models on the Training data


ROC Curve for the 3 models on the Test data

CONCLUSION:
I am selecting the RF model, as its accuracy, precision, recall, and F1 score are better than those of the other two models (CART & NN).


Question:-
2.5 Inference: Based on these predictions, what are the business insights and recommendations?

Answer:-

I strongly recommend we collect more real-time unstructured data and, if possible, more past data.

This can be understood by looking at the insurance data, drawing relations between different variables such as day of the incident, time, and age group, and associating them with other external information such as location, behaviour patterns, weather information, airline/vehicle types, etc.

• Streamlining online experiences benefited customers, leading to an increase in conversions, which subsequently raised profits.
• As per the data, 90% of the insurance is sold through the online channel.
• Another interesting fact: almost all the offline business has a claim associated with it; we need to find out why.
• The JZI agency resources need to be trained to pick up sales, as they are at the bottom; we need to run a promotional marketing campaign or evaluate whether to tie up with an alternate agency.
• Based on the model we are getting about 80% accuracy, so when a customer books airline tickets or plans a trip, we can cross-sell the insurance based on the claim data pattern.
• More sales happen via Agency than Airlines, yet the trend shows claims are processed more for Airlines, so we may need to deep dive into the process to understand the workflow and the reasons why.

Key performance indicators (KPIs) of insurance claims are:
• Reduce claims cycle time
• Increase customer satisfaction
• Combat fraud
• Optimize claims recovery
• Reduce claim-handling costs

Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise to new risk transfer solutions in areas like non-damage business interruption and reputational damage.

END OF PROJECT

