Data Mining Business Report-Clustering & CART
July-2021
Sangeeta M Chandel.
DATA MINING PROJECT – BUSINESS REPORT
PROBLEM-01
Problem 1A:
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past few
months.
You are given the task to identify the segments based on credit card usage.
Question:-
1.1 Read the data and do exploratory data analysis. Describe the data briefly
Answer:-
HEAD of DATA
Observation
The data looks good based on the initial records seen in the top 5 and bottom 5 rows.
Observation:
7 variables and 210 records. No missing records based on the initial analysis. All the variables are of numeric
type.
Univariate analysis
Observation
Based on the descriptive summary, the data looks good.
For most of the variables, the mean and median are nearly equal.
Including the 90th percentile in the summary shows the variation; the values look evenly distributed.
The standard deviation is high for the spending variable.
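The summary checks described above can be sketched as follows. This is a minimal illustration on synthetic data, not the bank dataset itself: the two column names come from the report, while the generated values, the random seed and the variable names df and summary are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bank data (assumed values, 210 rows as in the report).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spending": rng.normal(15.0, 3.0, 210),
    "probability_of_full_payment": rng.normal(0.87, 0.02, 210),
})

# Add the 90th percentile to the default quartiles of the summary.
summary = df.describe(percentiles=[0.25, 0.5, 0.75, 0.9]).T

# Put median and skewness next to the mean to compare them per variable.
summary["median"] = df.median()
summary["skew"] = df.skew()

print(summary[["mean", "median", "std", "90%", "skew"]])
```

A mean close to the median together with a skewness near zero supports the "evenly distributed" reading; a large std relative to the mean flags the more variable columns.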
Spending variable
max_spent_in_single_shopping variable
Observations
The average credit limit is around $3.258 (in 10000s).
The distribution is skewed to the right for all the variables except probability_of_full_payment,
which has a left tail.
Multivariate analysis
Check for multicollinearity
- max_spent_in_single_shopping current_balance
#correlation matrix
Observation
Most of the outliers have been treated, and we are now good to go.
Observation
Though we treated the outliers, we still see one in the boxplot. This is acceptable, as it is not extreme
and lies on the lower band.
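The report does not state which outlier treatment was applied. One common choice, assumed here purely for illustration, is to cap values at the boxplot whiskers (Q1 - 1.5*IQR and Q3 + 1.5*IQR). The function name cap_outliers_iqr and the sample values are hypothetical.

```python
from statistics import quantiles


def cap_outliers_iqr(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Illustrative values with one low-side outlier (0.60).
sample = [0.81, 0.85, 0.86, 0.87, 0.88, 0.89, 0.90, 0.91, 0.60]
capped = cap_outliers_iqr(sample)

print("min before:", min(sample), "min after:", round(min(capped), 4))
```

Capping keeps every record in the dataset, which matters with only 210 rows; values that sit exactly on a whisker can still be drawn as points by some boxplot implementations.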
Question:-
1.2 Do you think scaling is necessary for clustering in this case? Justify
Answer:-
2 spending and advance_payments take values on a different scale from the other variables, so they may get more weightage in the distance calculation.
3 The plots below show the data prior to and after scaling.
4 Scaling brings all the values into relatively the same range.
5 I have used the z-score to standardise the data to relatively the same scale, roughly -3 to +3.
prior to scaling
#after scaling
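The z-score standardisation used in this answer can be sketched in a few lines. This is a minimal illustration with made-up values; it uses the population standard deviation, which is also the default of scipy.stats.zscore. The function name zscore and the sample lists are assumptions for the example.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardise values to mean 0 and standard deviation 1 (population std)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Illustrative values only: two variables on very different scales.
spending = [12.0, 14.5, 16.0, 19.5, 21.0]
probability_of_full_payment = [0.84, 0.86, 0.87, 0.88, 0.90]

for name, column in [("spending", spending),
                     ("probability_of_full_payment", probability_of_full_payment)]:
    scaled = zscore(column)
    print(name, [round(z, 2) for z in scaled])
```

After scaling, both variables are expressed in standard deviations from their own mean, so neither dominates a Euclidean distance simply because of its units.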
Question:-
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
Answer:-
Cluster 3
Observation
Both methods give almost similar means, with minor variation, which is expected.
For cluster grouping based on the dendrogram, 3 or 4 clusters look good. After further analysis, and
based on the dataset, I went with the 3-cluster solution from the hierarchical clustering.
In a real-world setting, more variables could have been captured: tenure,
BALANCE_FREQUENCY, balance, purchase, installment of purchase, and others.
The three-cluster solution gives a pattern based on high/medium/low spending with
max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
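The comparison of two linkage methods and the cut into three clusters can be sketched as below. The report does not name the two methods, so Ward and average linkage are assumed here; the data are three synthetic, well-separated groups, not the scaled bank data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the scaled data: three groups of 70 points each.
rng = np.random.default_rng(1)
centers = np.array([[-3.0, -3.0], [0.0, 0.0], [3.0, 3.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(70, 2)) for c in centers])

results = {}
for method in ("ward", "average"):  # assumed pair of linkage methods
    Z = linkage(X, method=method)
    # Cut the tree so that at most 3 flat clusters remain (labels start at 1).
    labels = fcluster(Z, t=3, criterion="maxclust")
    results[method] = np.bincount(labels)[1:]
    print(method, "cluster sizes:", results[method])
```

The linkage matrix Z is what scipy.cluster.hierarchy.dendrogram takes as input when the tree is plotted; the flat labels can then be attached to the original records to compute per-cluster means.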
Question:-
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve
and silhouette score.
Answer:-
Cluster 01:
Cluster 02:
Cluster 03:
Cluster 04:
WSS:
silhouette_score
#Insights
silhouette_samples
3 Cluster Solution
Note
I am going with 3 clusters via k-means, but I am also showing the analysis of the 4- and 5-cluster k-means solutions.
Based on the current dataset, the 3-cluster solution makes sense given the spending pattern
(High, Medium, Low).
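The elbow (WSS) and silhouette checks behind this choice can be sketched as follows. This is an illustration on synthetic data with three clear groups, so the numbers it prints are not the report's; n_init, random_state and the range of k are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: three groups of 70 points each.
rng = np.random.default_rng(1)
centers = np.array([[-3.0, -3.0], [0.0, 0.0], [3.0, 3.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(70, 2)) for c in centers])

wss = {}  # within-cluster sum of squares per k (elbow curve)
sil = {}  # mean silhouette score per k (defined for k >= 2 only)
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss[k] = km.inertia_
    if k >= 2:
        sil[k] = silhouette_score(X, km.labels_)

best_k = max(sil, key=sil.get)
print("WSS:", {k: round(v, 1) for k, v in wss.items()})
print("silhouette:", {k: round(v, 3) for k, v in sil.items()})
print("k with the highest silhouette score:", best_k)
```

On real data the two criteria do not always agree, which is why the 4- and 5-cluster solutions are still worth inspecting before settling on 3.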
4-Cluster Solution
cluster_4_T
5 cluster
cluster_5_T
Question:-
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.
Answer:-
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task to make a model which
predicts the claim status and provide recommendations to management. Use CART, RF & ANN and
compare the models' performances in train and test sets.
Attribute Information:
Target: Claim Status (Claimed)
Code of tour firm (Agency_Code)
Type of tour insurance firms (Type)
Distribution channel of tour insurance agencies (Channel)
Name of the tour insurance products (Product)
Duration of the tour (Duration)
Destination of the tour (Destination)
Amount of sales of tour insurance policies (Sales)
The commission received for tour insurance firm (Commission)
Age of insured (Age)
Question:-
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, write an inference on it.
Answer:-
Observation
10 variables
Age, Commission, Duration and Sales are numeric variables
The rest are categorical variables
3000 records, none missing
9 independent variables and one target variable - Claimed
Observation
No missing value
Observation
Duration has a negative value, which is not possible; it is a wrong entry.
For Commission and Sales, the mean and median differ significantly.
Observation
The maximum unique count among the categorical variables is 5
Observation
Data looks good at first glance
Observation
Data looks good at first glance
Univariate Analysis
Age variable
Commission variable
Duration variable
Sales variable
Observation
There are outliers in all the variables, but those in Sales and Commission can carry genuine business value.
Random Forest and CART can handle outliers. Hence, the outliers are not treated for now, and we will
keep the data as it is.
I will treat the outliers for the ANN model after all the steps, just for
comparison.
Categorical Variables
Agency Code
Count Plot:
Box Plot:
Swarm plot:
Type
Channel
Product Name
Destination:
Proportion of 1s and 0s
Question:-
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial
Neural Network
Answer:-
Prior to Scaling:
After Scaling:
Generating Tree
rfcl = RandomForestClassifier(random_state=1)
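The split-and-fit workflow for the three models can be sketched as below. Only RandomForestClassifier(random_state=1) appears in the report; the synthetic data from make_classification, the 70/30 split, the tree depth, the network size and the decision to scale inputs for the ANN only are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 3000 records, 9 predictors, imbalanced binary target.
X, y = make_classification(n_samples=3000, n_features=9, n_informative=5,
                           weights=[0.7, 0.3], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "CART": DecisionTreeClassifier(max_depth=5, random_state=1),
    "RF": RandomForestClassifier(random_state=1),
    "ANN": MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=1),
}

scores = {}
for name, model in models.items():
    # Tree-based models are scale-invariant; the ANN gets the scaled inputs.
    Xtr, Xte = (X_train_s, X_test_s) if name == "ANN" else (X_train, X_test)
    model.fit(Xtr, y_train)
    scores[name] = (model.score(Xtr, y_train), model.score(Xte, y_test))
    print(f"{name}: train accuracy {scores[name][0]:.3f}, "
          f"test accuracy {scores[name][1]:.3f}")
```

A large gap between train and test accuracy, typical of an untuned Random Forest or a deep tree, is the usual sign that hyperparameters such as depth or minimum leaf size need tuning.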
Question:-
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
Answer:-
CART Conclusion
Train Data:
AUC: 82%
Accuracy: 79%
Precision: 70%
f1-Score: 60%
Test Data:
AUC: 80%
Accuracy: 77%
Precision: 80%
f1-Score: 84%
Training and Test set results are almost similar, and with the overall measures high, the model is a good model.
Train Data:
AUC: 86%
Accuracy: 80%
Precision: 72%
f1-Score: 66%
Test Data:
AUC: 82%
Accuracy: 78%
Precision: 68%
f1-Score: 62%
Training and Test set results are almost similar, and with the overall measures high, the model is a good model.
Train Data:
AUC: 82%
Accuracy: 78%
Precision: 68%
f1-Score: 59%
Test Data:
AUC: 80%
Accuracy: 77%
Precision: 67%
f1-Score: 57%
Training and Test set results are almost similar, and with the overall measures high, the model is a good model.
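For reference, accuracy, precision, recall and F1 all follow directly from the four cells of a confusion matrix, as the sketch below shows. The counts are hypothetical and are not taken from the report's confusion matrices; the function name classification_metrics is likewise an assumption. AUC is not included because it needs predicted probabilities rather than counts.

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive the headline metrics for the positive class from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a 900-record test set.
m = classification_metrics(tn=550, fp=70, fn=130, tp=150)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

Because F1 is the harmonic mean of precision and recall, it always lies between them, which gives a quick consistency check on any reported set of metrics.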
Question:-
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.
Answer:-
CONCLUSION :
I am selecting the RF model, as it has better accuracy, precision, recall and F1 score than the other
two models, CART and NN.
Question:-
2.5 Inference: Based on these predictions, what are the business insights and recommendations?
Answer:-
I strongly recommend that we collect more real-time unstructured data, and past data if possible.
This can be understood by looking at the insurance data and drawing relations between different variables
such as day of the incident, time and age group, and associating them with other external information such
as location, behavior patterns, weather information, airline/vehicle types, etc.
Key performance indicators (KPIs) of insurance claims are:
• Reduce claims cycle time
• Increase customer satisfaction
• Combat fraud
• Optimize claims recovery
• Reduce claim handling costs
Insights gained from data and AI-powered analytics could expand the boundaries of
insurability, extend existing products, and give rise to new risk transfer solutions in areas like
non-damage business interruption and reputational damage.
END OF PROJECT