Customer Churn Prediction in Telecom Industry: Sourav Sarkar (Group 7)
Presented By
Sourav Sarkar (Group 7)
CUSTOMER CHURN PREDICTION
Contents
1. Project Objective
2. Exploratory Data Analysis – Step-by-step Approach
3.1 Environment Set-up and Data Import
3.1.1 Install Necessary Packages and Invoke Libraries
3.1.2 Set up Working Directory
3.1.3 Import and Read the Dataset
3.2. Data Cleaning & Arranging
3.3. EDA
3.3.1. Univariate and Bivariate Analysis – Basic Data Summary, Graphs
3.3.2. Outliers and Missing Values
3.3.3. Multicollinearity
3.3.4. Summary from EDA
3.4. Building Models
3.4.1. Logistic Regression
3.4.2. kNN
3.4.3. Naïve Bayes
3.4.4. Model Comparison
3.5. Actionable Insights
3.6. Code
1 Project Objective
Build a model that will help the telecom company identify customers who have a
higher probability of churning.
Customer churn is a burning problem for telecom companies. In this project, we simulate
one such case, working with data on postpaid customers who are on a contract. The data
contains information about customer usage behaviour, contract details and payment
details, and it indicates which customers cancelled their service. Based on this past
data, we need to build a model that can predict whether a customer will cancel their
service in the future.
The department wants a model that identifies the customers with a higher probability
of churning. This will increase the success ratio of retention efforts while at the
same time reducing losses for the company.
The project is performed in RStudio Cloud, with data imported via built-in functions.
3.1.1 Install necessary Packages and Invoke Libraries
library(plyr)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(ggthemes)
library(caret)
library(MASS)
library(randomForest)
library(party)
library(tibble)
library(dplyr)
3.1.2 Set up working Directory
Setting a working directory at the start of the R session makes importing and exporting
data files and code files easier. The working directory is simply the location/folder on
the PC where you keep the data, code, etc. related to the project. The full source code
is reproduced in Section 3.6.
3.2. Data Cleaning & Arranging
Since the minimum tenure is 0 months and the maximum tenure is 56 months, we can
group customers into four tenure groups: “0–12 Month”, “12–24 Month”, “24–48 Month”
and “> 48 Month”.
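As a sketch, the same bucketing can be done with base R's `cut()`; this mirrors the `group_tenure()` helper used later in the code (the toy tenure vector here is illustrative, not the report's data):

```r
# Tenure bucketing with cut(); intervals are right-closed, matching the
# group_tenure() logic (e.g. tenure 12 falls in "0-12 Month").
tenure <- c(0, 5, 13, 30, 50)                     # toy tenures (months)
tenure_group <- cut(tenure,
                    breaks = c(-Inf, 12, 24, 48, Inf),
                    labels = c("0-12 Month", "12-24 Month",
                               "24-48 Month", "> 48 Month"))
as.character(tenure_group)
# "0-12 Month" "0-12 Month" "12-24 Month" "24-48 Month" "> 48 Month"
```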
Change the values in column “Churn” from 0 or 1 to “No” or “Yes”.
Change the values in column “ContractRenewal” from 0 or 1 to “No” or “Yes”.
Change the values in column “DataPlan” from 0 or 1 to “No” or “Yes”.
Remove the columns we do not need for the analysis.
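The 0/1 → “No”/“Yes” recoding above is done in the report with `plyr::mapvalues()`; a minimal base-R equivalent, on a toy indicator vector, looks like this:

```r
# Recode a 0/1 indicator to a No/Yes factor in one step
flag <- c(0, 1, 1, 0)                                   # toy churn indicator
flag_f <- factor(flag, levels = c(0, 1), labels = c("No", "Yes"))
as.character(flag_f)                                    # "No" "Yes" "Yes" "No"
```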
3.3. EDA
The raw data contains 3333 rows (customers) and 12 columns (features). The “Churn”
column is our target.
The table shows the churn values obtained by predicting churn with a J48
decision tree.
The graph shows customer service calls against the churn factor, which takes
just two values, True and False.
The graph shows the churn factor against roaming call minutes.
Missing Values:
We use sapply to check the number of missing values in each column. There are no
missing values in the data set.
> sapply(churn, function(x) sum(is.na(x)))
Churn tenure ContractRenewal DataPlan DataUsage CustServCalls
DayMins DayCalls
0 0 0 0 0 0 0 0
MonthlyCharge OverageFee RoamMins tenure_group
0 0 0 0
Outliers
Most of the data in tenure is between 0 and 48 months; there are very few
customers in the > 48-month segment.
Most of the data in CustServCalls is between 0 and 3 calls; very few customers
made more than 3 calls.
Most of the data in Day Calls is between 48 and 149 calls; there are very few
customers with fewer than 25 or more than 150 day calls.
Most of the data in day-call minutes in a month is between 48 and 335 minutes;
there are very few customers with fewer than 48 or more than 335 minutes.
Most monthly data usage is between 0 and 1.7 GB; there are very few customers
using more than 4.5 GB a month, and some customers use no data at all.
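The outlier cut-offs above come from boxplot whiskers, which flag points outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A base-R sketch of that rule on a toy vector (not the report's data):

```r
# IQR rule behind boxplot outlier flagging
x <- c(rep(1:3, each = 30), 25)    # toy CustServCalls-like values, one extreme
q <- quantile(x, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
outliers <- x[x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr]
outliers                           # only the extreme value 25 is flagged
```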
3.3.3. Multicollinearity
Churn and contract renewal are negatively correlated: a customer who churns is
not going to renew the contract, so this is not a deciding factor.
Churn and CustServCalls are positively correlated: customers decide to churn
based on the number of calls made to customer care, which points to issues not
being resolved for customers.
Churn is also positively correlated with MonthlyCharge, OverageFee and RoamMins,
which are deciding factors for a customer leaving the connection.
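The sign checks above can be illustrated with base R's `cor()` on simulated data (not the report's data): the churn flag is generated so that its probability rises with customer service calls, and the correlation comes out positive.

```r
set.seed(1)
calls <- rpois(500, 2)                       # simulated CustServCalls
p     <- plogis(-2 + 0.8 * calls)            # churn probability rises with calls
churn_flag <- rbinom(500, 1, p)
cor(churn_flag, calls)                       # positive, as described above
```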
The graph shows the relationship between customer service calls and day calls
with respect to the churn factor, represented by two colours; blue represents
the customers who churned. The colouring shows that churners are concentrated
among customers with a high number of customer service calls.
The same relationship plotted on a subset of the data, with smoothed lines,
shows clearly that churn increases with customer service calls, whereas it does
not vary much with day calls.
All of the categorical variables have a reasonably broad distribution, so all of
them will be kept for further analysis.
3.3.4. Summary from EDA
Customers mostly churn when they make many calls to customer care, pay high
monthly charges, or have high average daytime usage.
3.4. Building Models
I will create a trainControl object so that all of the models use the same
10-fold cross-validation on the training set (70% of the data); the remaining
30% of the data will be used to test model accuracy.
I will use both the area under the ROC curve (AUC) and accuracy as metrics for
assessing the models.
The data set will be split in a 70:30 ratio: observations 1 to 2333 will make up
the training set and the remaining 1,000 observations will be the test set.
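A sketch of that setup (the stand-in data frame below is only so the snippet runs on its own; in the report, the `churn` frame from Section 3.1 plays this role, and `trainControl` comes from caret):

```r
# Stand-in frame; in the report, `churn` (3,333 rows) is used instead
dat <- data.frame(Churn = factor(rep(c("No", "Yes"), length.out = 3333)))

# 70:30 index split described in the text
train_idx <- 1:2333
train <- dat[train_idx, , drop = FALSE]
test  <- dat[-train_idx, , drop = FALSE]
c(nrow(train), nrow(test))          # 2333 and 1000

# Shared caret resampling scheme (assumes the caret package is loaded):
# control <- trainControl(method = "cv", number = 10,
#                         classProbs = TRUE, summaryFunction = twoClassSummary)
```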
3.4.1. Logistic Regression
The Logistic Regression model on the training data gives an ROC value of 0.81;
it identifies non-churners well (sensitivity 0.98) but churners poorly
(specificity 0.16).
Generalized Linear Model
3333 samples
11 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 3000, 3000, 2999, 2999, 3000, 3000, ...
Resampling results:
Reference
Prediction No Yes
No 846 115
Yes 17 22
Accuracy : 0.868
95% CI : (0.8454, 0.8884)
No Information Rate : 0.863
P-Value [Acc > NIR] : 0.3429
Kappa : 0.2015
Sensitivity : 0.9803
Specificity : 0.1606
Pos Pred Value : 0.8803
Neg Pred Value : 0.5641
Prevalence : 0.8630
Detection Rate : 0.8460
Detection Prevalence : 0.9610
Balanced Accuracy : 0.5704
'Positive' Class : No
3.4.2. kNN
3333 samples
11 predictor
2 classes: 'No', 'Yes'
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 17
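The mechanics of kNN classification can be sketched with `class::knn()` (the `class` package ships with standard R installs; the well-separated toy data below is illustrative, not the report's data):

```r
library(class)
set.seed(42)
# Two toy classes in 2-D, labelled by which side of the line x1 + x2 = 0 they fall on
train_x <- matrix(rnorm(200), ncol = 2)               # 100 training points
train_y <- factor(ifelse(rowSums(train_x) > 0, "Yes", "No"))
test_x  <- rbind(c(3, 3), c(-3, -3))                  # deep inside each region
pred <- knn(train_x, test_x, cl = train_y, k = 17)    # majority vote of 17 neighbours
as.character(pred)
```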
Reference
Prediction No Yes
No 862 93
Yes 1 44
Accuracy : 0.906
95% CI : (0.8862, 0.9234)
No Information Rate : 0.863
P-Value [Acc > NIR] : 2.11e-05
Kappa : 0.446
Sensitivity : 0.9988
Specificity : 0.3212
Pos Pred Value : 0.9026
Neg Pred Value : 0.9778
Prevalence : 0.8630
Detection Rate : 0.8620
Detection Prevalence : 0.9550
Balanced Accuracy : 0.6600
'Positive' Class : No
3.4.3. Naïve Bayes
Yes, a Naïve Bayes model is applicable to this data: it performs well with
categorical input variables, and our dataset contains several categorical
variables.
The NB model has a predictive accuracy of 87.7%, but it correctly classifies
only about 10% of the customers who went on to leave (specificity 0.10).
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 863 123
Yes 0 14
Accuracy : 0.877
95% CI : (0.855, 0.8967)
No Information Rate : 0.863
P-Value [Acc > NIR] : 0.106
Kappa : 0.1642
Sensitivity : 1.0000
Specificity : 0.1022
Pos Pred Value : 0.8753
Neg Pred Value : 1.0000
Prevalence : 0.8630
Detection Rate : 0.8630
Detection Prevalence : 0.9860
Balanced Accuracy : 0.5511
'Positive' Class : No
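With categorical inputs, Naïve Bayes reduces to counting. A hand-rolled sketch of the Bayes rule for a single predictor, using toy counts (not the report's data):

```r
# P(Churn = Yes | DataPlan = No) from a 2x2 contingency table of toy counts
tab <- matrix(c(700, 100,     # DataPlan = No : stayed, churned
                250,  50),    # DataPlan = Yes: stayed, churned
              nrow = 2, byrow = TRUE,
              dimnames = list(DataPlan = c("No", "Yes"),
                              Churn    = c("No", "Yes")))
p_churn_no_plan <- tab["No", "Yes"] / sum(tab["No", ])
p_churn_no_plan               # 100 / 800 = 0.125
```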
3.4.4. Model Comparison
The kNN and Naïve Bayes models showed the greatest overall predictive accuracy.
Further evaluation would be required to declare a single best model; for this
data set, the Logistic Regression model is the preferred fit given its
interpretable coefficients.
ROC Curve
All models show greater predictive accuracy than the null model that predicts
‘No’ for every customer (accuracy 86.3%, the no-information rate). The Logistic
Regression and kNN methods give almost identical ROC results.
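AUC has a handy interpretation: the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner (the Mann-Whitney form). A base-R sketch with toy scores:

```r
scores <- c(0.9, 0.8, 0.4, 0.35, 0.1)   # toy predicted churn probabilities
labels <- c(1, 1, 0, 1, 0)              # 1 = churned
pos <- scores[labels == 1]
neg <- scores[labels == 0]
# Fraction of (churner, non-churner) pairs ranked correctly; ties count 1/2
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc                                     # 5/6 = 0.8333...
```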
ContractRenewal, DataPlan, CustServCalls and RoamMins are the most significant
predictors in the logistic model.
Having renewed the contract and holding a data plan are strong indicators that a
customer will not leave the service, so customers who have not renewed their
contract are the most likely to churn.
3.5. Actionable Insights
Provide better data packs at a lower monthly rental; this will also reduce
overage fees for extra data usage.
Resolve customer queries faster, which encourages customers to extend the
contract or keep the connection.
Provide better roaming benefits, which will reduce the monthly charge: customers
who roam spend more minutes on calls, resulting in overage fees.
3.6. Code
> ##Libraries
>
> library(plyr)
> library(corrplot)
> library(ggplot2)
> library(gridExtra)
> library(ggthemes)
> library(caret)
> library(MASS)
> library(randomForest)
> library(party)
> library(tibble)    ##needed for add_column() below
> library(dplyr)     ##needed for %>% and bind_rows() below
>
> churn <- read.csv('Cellphone.csv')
> str(churn)
'data.frame': 3333 obs. of 12 variables:
$ Churn : int 0 0 0 0 0 0 0 0 0 0 ...
$ AccountWeeks : int 128 107 137 84 75 118 121 147 117 141 ...
$ tenure : int 29 25 32 19 17 27 28 34 27 32 ...
$ ContractRenewal: int 1 1 1 0 0 0 1 0 1 0 ...
$ DataPlan : int 1 1 0 0 0 0 1 0 0 1 ...
$ DataUsage : num 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : int 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num 265 162 243 299 167 ...
$ DayCalls : int 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
> head(churn)
Churn AccountWeeks tenure ContractRenewal DataPlan DataUsage CustServCalls
DayMins DayCalls MonthlyCharge OverageFee
1 0 128 29 1 1 2.7 1
265.1 110 89 9.87
2 0 107 25 1 1 3.7 1
161.6 123 82 9.78
3 0 137 32 1 0 0.0 0
243.4 114 52 6.06
4 0 84 19 0 0 0.0 2
299.4 71 57 3.10
5 0 75 17 0 0 0.0 3
166.7 113 41 7.42
6 0 118 27 0 0 0.0 0
223.4 98 57 11.03
RoamMins
1 10.0
2 13.7
3 12.2
4 6.6
5 10.1
6 6.3
> View(churn)
>
> ##Data Cleaning & arranging
>
>
>
> ##############################################################################
> ##The raw data contains 3333 rows (customers) and 12 columns (features).
> ##The “Churn” column is our target.
> ##We use sapply to check the number of missing values in each column; there
> ##are none, but we keep the complete.cases() call below as a safeguard.
>
> sapply(churn, function(x) sum(is.na(x)))
Churn AccountWeeks tenure ContractRenewal
DataPlan DataUsage CustServCalls
0 0 0 0
0 0 0
DayMins DayCalls MonthlyCharge OverageFee
RoamMins
0 0 0 0
0
>
> churn <- churn[complete.cases(churn), ]
>
> ##3. Since the minimum tenure is 0 months and maximum tenure is 56 months,
> ##we can group them into four tenure groups: “0-12 Month”, “12-24 Month”,
> ##“24-48 Month”, “> 48 Month”
> min(churn$tenure); max(churn$tenure)
[1] 0
[1] 56
>
> group_tenure <- function(tenure){
+ if (tenure >= 0 & tenure <= 12){
+ return('0-12 Month')
+ }else if(tenure > 12 & tenure <= 24){
+ return('12-24 Month')
+ }else if (tenure > 24 & tenure <= 48){
+ return('24-48 Month')
+ }else if (tenure > 48){
+ return('> 48 Month')
+ }
+ }
> churn$tenure_group <- sapply(churn$tenure,group_tenure)
> churn$tenure_group <- as.factor(churn$tenure_group)
>
>
> ##4. Change the values in column “Churn” from 0 or 1 to “No” or “Yes”.
> churn$Churn <- as.factor(mapvalues(churn$Churn,
+ from=c("0","1"),
+ to=c("No", "Yes")))
>
> ##5. Change the values in column “ContractRenewal” from 0 or 1 to “No” or
“Yes”.
> churn$ContractRenewal <- as.factor(mapvalues(churn$ContractRenewal,
+ from=c("0","1"),
+ to=c("No", "Yes")))
>
> ##6. Change the values in column “DataPlan” from 0 or 1 to “No” or “Yes”.
> churn$DataPlan <- as.factor(mapvalues(churn$DataPlan,
+ from=c("0","1"),
+ to=c("No", "Yes")))
>
>
>
> ##Remove the columns we do not need for the analysis.
> churn$AccountWeeks <- NULL
>
> #########################################
>
> ##Missing Values
>
> sum(is.na(churn))
[1] 0
>
>
##############################################################################
########################
>
> ##EDA
>
> ##Exploratory data analysis and feature selection
> ##Correlation between numeric variables
>
> numeric.var <- sapply(churn, is.numeric)
> corr.matrix <- cor(churn[,numeric.var])
> corrplot(corr.matrix, main="\n\nCorrelation Plot for Numerical Variables",
method="number")
>
> ##Exploring data
> ##How many churns are in this dataset?
>
> ggplot(churn, aes(x = Churn)) +
+ geom_bar(fill = c("sky blue", "orange"))
>
> ##When do customers churn?
>
> churn %>% filter(Churn == "Yes") %>%
+ ggplot( aes(x= tenure))+
+ geom_bar(fill = "orange" )
>
>
> ##Bar plots of categorical variables
>
> p1 <- ggplot(churn, aes(x=ContractRenewal)) + ggtitle("Contract Renewal") +
xlab("Contract") +
+ geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
+ ylab("Percentage") + coord_flip() + theme_minimal()
> plot(p1)
>
> library(RWeka)
>
> ##Decision Tree for Churn (using J48)
> ##The figure depicts the churn values from the table formed by predicting
> ##churn with the J48 decision tree
> m2 <- J48(Churn ~ ., data = churn)
> m2
J48 pruned tree
------------------
Number of Leaves : 48

Reference
Prediction No Yes
No 2840 10
Yes 146 337
> plot(m2)
>
> ##Classification Tree for all the Calls (using rpart )
>
>
> library(rpart)
> f<-rpart(Churn ~CustServCalls+MonthlyCharge+DayCalls+tenure_group
+ +OverageFee,method="class", data=churn)
> plot(f, uniform=TRUE,main="Classification Tree for Churn")
> text(f, use.n=TRUE, all=TRUE, cex=.7)
>
>
> plot(Churn ~ ., data = churn, type = "c")
> lines(Churn ~ CustServCalls, type = "l")
>
> qplot(DayCalls, CustServCalls, data = churn, colour = Churn)
>
> qplot(DayCalls, CustServCalls, data = churn, geom = c("point", "smooth"),
+ color = Churn)
>
> dsc <- churn[sample(nrow(churn), 100), ]
> qplot(DayCalls, CustServCalls, data = dsc, geom = c("point", "smooth"),
+ color = Churn)
There were 11 warnings (use warnings() to see them)
>
> qplot(tenure_group, CustServCalls, data =
+ churn,colour=Churn)
>
> qplot(ContractRenewal, CustServCalls, data =
+ churn,colour=Churn)
>
> ggplot(churn) +
+ geom_bar(aes(x = DataPlan, fill = Churn), position = "dodge")
>
> ggplot(churn) +
+ geom_bar(aes(x = CustServCalls, fill = Churn), position = "dodge")
>
> library("car")
> scatterplot(CustServCalls ~ DayMins, data = churn)
>
> #################################################################
>
> ##Identifying outlier
>
> boxplot(churn$tenure, main="Tenure Outliers",
+ ylab="Tenure in Months")
> boxplot(churn$CustServCalls, main="Cust Serv Outliers",
+ ylab="Calls Made")
> boxplot(churn$DayCalls, main="Day Calls Outliers",
+ ylab="Number of Calls")
> boxplot(churn$DayMins, main="Day Mins Outliers",
+ ylab="Duration of Calls (mins)")
>
> boxplot(churn$DataUsage, main="Data Usage Outliers",
+ ylab="Usage in GB")
>
>
>
> ##################################################################
>
> ##Logistic Regression Model
>
> ##Define the shared resampling scheme and the 1,000-row hold-out test set
> control <- trainControl(method = "cv", number = 10,
+ classProbs = TRUE, summaryFunction = twoClassSummary)
> test <- churn[2334:3333, ]
>
> glm_model <- train(Churn ~ ., data = churn,
+ method="glm",
+ trControl = control
+ )
Warning message:
In train.default(x, y, weights = w, ...) :
The metric "Accuracy" was not in the result set. ROC will be used instead.
>
> glm_model
Generalized Linear Model
3333 samples
11 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 3000, 2999, 3000, 3000, 2999, 3000, ...
Resampling results:
>
> ##The Logistic Regression model on the training data gives an ROC value of
> ##0.81; it identifies non-churners well (sensitivity 0.98) but churners
> ##poorly (specificity 0.16).
>
> ##Predictive Capability of Logistic Regression Model
>
> glm_pred <- predict(glm_model, newdata = test)
> glmcm <- confusionMatrix(glm_pred, test[["Churn"]])
> glmaccuracy <- glmcm$overall[c(1,3,4)]
> glmcm
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 846 115
Yes 17 22
Accuracy : 0.868
95% CI : (0.8454, 0.8884)
No Information Rate : 0.863
P-Value [Acc > NIR] : 0.3429
Kappa : 0.2015
Sensitivity : 0.9803
Specificity : 0.1606
Pos Pred Value : 0.8803
Neg Pred Value : 0.5641
Prevalence : 0.8630
Detection Rate : 0.8460
Detection Prevalence : 0.9610
Balanced Accuracy : 0.5704
'Positive' Class : No
>
>
> ##When the model is applied to the test data it yields accuracy of 86.8%;
> ##88% of customers the model identified as staying did stay, and 56.4% of
> ##customers it identified as leaving did leave.
>
>
> ##K-Nearest Neighbours
> ##Cross validation on the training set is used to find the optimal value for
K
>
> knn_model <- train(Churn ~ ., data = churn,
+ method = "knn", trControl = control,
+ preProcess = c("center","scale"), tuneLength = 50)
Warning message:
In train.default(x, y, weights = w, ...) :
The metric "Accuracy" was not in the result set. ROC will be used instead.
>
> knn_model
k-Nearest Neighbors
3333 samples
11 predictor
2 classes: 'No', 'Yes'
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 17.
>
> ##Predictive Capability
> knn_pred <- predict(knn_model, newdata = test)
> knncm <- confusionMatrix(knn_pred, test[["Churn"]])
> knnaccuracy <- knncm$overall[c(1,3,4)]
> knncm
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 862 93
Yes 1 44
Accuracy : 0.906
95% CI : (0.8862, 0.9234)
No Information Rate : 0.863
P-Value [Acc > NIR] : 2.11e-05
Kappa : 0.446
Sensitivity : 0.9988
Specificity : 0.3212
Pos Pred Value : 0.9026
Neg Pred Value : 0.9778
Prevalence : 0.8630
Detection Rate : 0.8620
Detection Prevalence : 0.9550
Balanced Accuracy : 0.6600
'Positive' Class : No
>
> ##The kNN model has predictive accuracy of 90.6% but correctly classifies
> ##only 32% of customers who went on to leave (specificity 0.32).
>
>
> ##Naive Bayes
>
> nb_model <- train(Churn ~ ., data = churn,
+ method = "nb", trControl = control,
+ preProcess = c("center","scale"), tuneLength = 50)
There were 50 or more warnings (use warnings() to see the first 50)
>
>
>
> ##Predictive Capability
> nb_pred <- predict(nb_model, newdata = test)
Warning messages:
1: In FUN(X[[i]], ...) :
Numerical 0 probability for all classes with observation 189
2: In FUN(X[[i]], ...) :
Numerical 0 probability for all classes with observation 258
> nbcm <- confusionMatrix(nb_pred, test[["Churn"]])
> nbaccuracy <- nbcm$overall[c(1,3,4)]
> nbcm
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 863 123
Yes 0 14
Accuracy : 0.877
95% CI : (0.855, 0.8967)
No Information Rate : 0.863
P-Value [Acc > NIR] : 0.106
Kappa : 0.1642
Sensitivity : 1.0000
Specificity : 0.1022
Pos Pred Value : 0.8753
Neg Pred Value : 1.0000
Prevalence : 0.8630
Detection Rate : 0.8630
Detection Prevalence : 0.9860
Balanced Accuracy : 0.5511
'Positive' Class : No
>
> ##The NB model has predictive accuracy of 87.7% but correctly classifies
> ##only 10% of customers who went on to leave (specificity 0.10).
>
> model_list <- list("Logistic Regression" = glm_model, "kNN" = knn_model,
"NB" = nb_model)
> resamples <- resamples(model_list)
>
> dotplot(resamples, metric="ROC", main = "Area Under Curve with 95% CI")
>
> models <- c("Logistic Regression", "KNN", "NB")
>
> accuracysummary <- bind_rows(Logistic = glmaccuracy, kNN = knnaccuracy, NB =
nbaccuracy)
>
> accuracysummary2 <- add_column(accuracysummary, Model = models,
+ .before = "Accuracy")
>
> accuracysummary2
# A tibble: 3 x 4
  Model               Accuracy AccuracyLower AccuracyUpper
  <chr>                  <dbl>         <dbl>         <dbl>
1 Logistic Regression    0.868         0.845         0.888
2 KNN                    0.906         0.886         0.923
3 NB                     0.877         0.855         0.897
>
>
> ggplot(accuracysummary2, aes(x = Model, y = Accuracy)) + geom_bar(stat =
"identity") +
+ geom_errorbar(width = 0.5, aes(ymin = AccuracyLower, ymax =
AccuracyUpper), color = "black") +
+ coord_cartesian(ylim = c(0.84, 0.93)) +
+ labs(y = "Accuracy %", x = "Model", title = "Model Prediction Accuracy
with 95% CI") +
+ theme_minimal()
>
> ##################Conclusion
>
> ##Identifying attributes of customers likely to churn
> summary(glm_model)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9916 -0.5103 -0.3490 -0.2080 3.0143
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.315887 1.142700 -3.777 0.000159 ***
tenure -0.005033 0.013059 -0.385 0.699920
ContractRenewalYes -1.987817 0.143711 -13.832 < 2e-16 ***
DataPlanYes -1.170428 0.536632 -2.181 0.029179 *
DataUsage 0.434520 1.926339 0.226 0.821537
CustServCalls 0.509554 0.038988 13.070 < 2e-16 ***
DayMins 0.018782 0.032536 0.577 0.563752
DayCalls 0.003456 0.002755 1.255 0.209658
MonthlyCharge -0.035050 0.191203 -0.183 0.854551
OverageFee 0.199282 0.326232 0.611 0.541293
RoamMins 0.079964 0.022106 3.617 0.000298 ***
`tenure_group0-12 Month` -1.561340 0.961034 -1.625 0.104238
`tenure_group12-24 Month` -1.490958 0.873711 -1.706 0.087921 .
`tenure_group24-48 Month` -1.394545 0.810196 -1.721 0.085206 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(knn_model)
Length Class Mode
learn 2 -none- list
k 1 -none- numeric
theDots 0 -none- list
xNames 13 -none- character
problemType 1 -none- character
tuneValue 1 data.frame list
obsLevels 2 -none- character
param 0 -none- list
> summary(nb_model)
Length Class Mode
apriori 2 table numeric
tables 13 -none- list
levels 2 -none- character
call 6 -none- call
x 13 data.frame list
usekernel 1 -none- logical
varnames 13 -none- character
xNames 13 -none- character
problemType 1 -none- character
tuneValue 3 data.frame list
obsLevels 2 -none- character
param 0 -none- list