Customer Churn Prediction in Telecom Industry: Sourav Sarkar (Group 7)


Customer Churn Prediction in

Telecom Industry

Presented By
Sourav Sarkar (Group 7)


1 Project Objective
2. Exploratory Data Analysis – Step by step approach
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
3.1.2 Set up working Directory
3.1.3 Import and Read the Dataset
3.2. Data Cleaning & arranging
3.3. EDA
3.3.1. Univariate and bivariate analysis - Basic data summary, Univariate, Bivariate analysis, graphs
3.3.2. outliers and missing values
3.3.3. Multicollinearity
3.3.4. Summary from EDA
3.4. Building Models
3.4.1. LR
3.4.2. kNN
3.4.3. Naïve Bayes
3.4.4. Model Comparison
3.5. Actionable Insights

1 Project Objective

Build a model that helps the telecom company identify customers who have a high
probability of churning (cancelling the connection).
Customer churn is a pressing problem for telecom companies. In this project, we simulate
one such case, working with data on postpaid customers who are under contract. The data
covers customer usage behaviour, contract details, and payment details, and it indicates
which customers cancelled their service. Based on this historical data, we need to build a
model that predicts whether a customer will cancel their service in the future.

The department wants a model that identifies customers with a high probability of
churning. This will raise the success rate of retention efforts and, at the same time,
reduce losses for the company.

2. Exploratory Data Analysis – Step by step approach

1. Environment Set up and Data Import
2. Variable Identification
3. Different overaction with graphs
4. Univariate Analysis
5. Bivariate Analysis
6. Distribution of Numeric Data
7. Correlation Scatterplot
8. Find missing values
3.1 Environment Set up and Data Import

The project is carried out in RStudio Cloud, which is easy to set up and access. Data can
be imported using its built-in functions.
3.1.1 Install necessary Packages and Invoke Libraries
library(plyr)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(ggthemes)
library(caret)
library(MASS)
library(randomForest)
library(party)
library(tibble)
library(dplyr)
3.1.2 Set up working Directory
Setting a working directory at the start of the R session makes it easier to import and
export data files and code files. The working directory is the folder on the PC that holds
the data, code, and other files related to the project. Please refer to Appendix A for the
source code.

3.1.3 Import and Read the Dataset

The dataset is provided in .csv format, so the ‘read.csv’ command is used to import the
file. Please refer to Appendix A for the source code. In RStudio Cloud, data can also be
imported directly through the built-in import function.
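
The two steps above can be sketched as follows. This is a minimal illustration, not the project's actual import: the file name cellphone_sample.csv and its three rows are hypothetical stand-ins written to a temporary folder so the example runs on its own.

```r
# Create a tiny stand-in CSV (hypothetical values) so the example is self-contained.
tmp_dir <- tempdir()
sample_rows <- data.frame(
  Churn = c(0L, 1L, 0L),
  tenure = c(29L, 5L, 32L),
  MonthlyCharge = c(89.5, 41.2, 52.3)
)
write.csv(sample_rows, file.path(tmp_dir, "cellphone_sample.csv"), row.names = FALSE)

# Set the working directory, then import with a path relative to it.
old_wd <- setwd(tmp_dir)
churn_sample <- read.csv("cellphone_sample.csv")
setwd(old_wd)  # restore the previous working directory

# Inspect the structure of the imported data frame.
str(churn_sample)
```

Restoring the previous working directory keeps the sketch from affecting the rest of the session; in a project script the setwd() call would normally be made once at the top.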

3.2. Data Cleaning & arranging

• Since the minimum tenure is 0 months and the maximum tenure is 56 months, we group
customers into four tenure groups: “0–12 Month”, “12–24 Month”, “24–48 Month”, and
“> 48 Month”.
• Change the values in column “Churn” from 0 or 1 to “No” or “Yes”.
• Change the values in column “ContractRenewal” from 0 or 1 to “No” or “Yes”.
• Change the values in column “DataPlan” from 0 or 1 to “No” or “Yes”.
• Remove the columns we do not need for the analysis (AccountWeeks).
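
The cleaning steps listed above can be sketched in base R as follows. This is an alternative to the group_tenure() and mapvalues() calls in the code listing, shown under the assumption that the tenure bins are closed on the right (0 to 12, above 12 to 24, and so on). The five-row data frame is a hypothetical sample.

```r
# Hypothetical sample with the same column names as the project data.
churn_sample <- data.frame(
  Churn = c(0, 1, 0, 1, 0),
  ContractRenewal = c(1, 0, 1, 1, 0),
  DataPlan = c(1, 0, 0, 1, 0),
  tenure = c(0, 12, 13, 48, 56),
  AccountWeeks = c(1, 52, 57, 208, 243)
)

# Bin tenure into the four groups; include.lowest = TRUE keeps tenure == 0.
churn_sample$tenure_group <- cut(
  churn_sample$tenure,
  breaks = c(0, 12, 24, 48, Inf),
  labels = c("0-12 Month", "12-24 Month", "24-48 Month", "> 48 Month"),
  include.lowest = TRUE
)

# Recode the 0/1 indicator columns as No/Yes factors.
for (col in c("Churn", "ContractRenewal", "DataPlan")) {
  churn_sample[[col]] <- factor(churn_sample[[col]],
                                levels = c(0, 1), labels = c("No", "Yes"))
}

# Drop a column that is not needed for the analysis.
churn_sample$AccountWeeks <- NULL

print(churn_sample)
```

Using cut() yields a factor directly and avoids calling a grouping function once per row.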

3.3. EDA

The EDA proceeds as follows.

3.3.1. Univariate and bivariate analysis - Basic data summary, Univariate, Bivariate analysis, graphs

The raw data contains 3333 rows (customers) and 12 columns (features). The “Churn”
column is our target.

• 483 customers cancelled the connection.

• 2850 customers did not cancel the connection.
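
As a small worked example, the class balance implied by these counts can be computed directly. The variable names are illustrative only.

```r
# Counts reported above.
churned <- 483
retained <- 2850
total <- churned + retained

# Share of customers who churned, and the accuracy of always predicting "No"
# on the full data set.
churn_rate <- churned / total
always_no_accuracy <- retained / total

round(100 * c(churn_rate = churn_rate, always_no_accuracy = always_no_accuracy), 1)
```

Roughly one customer in seven churned, so a model has to beat a high "always No" baseline before its accuracy is informative.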

• Churn values from the table formed by comparing the predictions of the J48 decision
tree with the actual churn values.

• Graph of customer service calls against churn, which takes just two values (Yes and No).

• Graph of churn against data plan.

• Graph of churn against Day Mins.

• Graph of churn against Monthly Charge.

• Graph of churn against Roaming Mins.

• Graph of churn against tenure in months.

• Contract renewal percentage.

3.3.2. outliers and missing values 

Missing Values:
We use sapply to check the number of missing values in each column. We found that there
are no missing values in the data set.
> sapply(churn, function(x) sum(is.na(x)))
          Churn          tenure ContractRenewal        DataPlan       DataUsage   CustServCalls
              0               0               0               0               0               0
        DayMins        DayCalls   MonthlyCharge      OverageFee        RoamMins    tenure_group
              0               0               0               0               0               0
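
A minimal sketch of the same check on a hypothetical three-row data frame that does contain missing values, to show what the counts and the complete.cases() filter look like when there is something to find:

```r
# Hypothetical sample with two missing entries.
df <- data.frame(
  tenure = c(29, NA, 32),
  DataUsage = c(2.7, 3.7, NA),
  DataPlan = c(1, 1, 0)
)

# Number of missing values per column.
na_counts <- sapply(df, function(x) sum(is.na(x)))
print(na_counts)

# Keep only the rows without any missing value.
complete_df <- df[complete.cases(df), ]
print(complete_df)
```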


• Most tenure values lie between 0 and 48 months. There are very few observations in
the > 48 months segment.

• Most CustServCalls values lie between 0 and 3 calls. There are very few observations
with more than 3 calls.

• Most DayCalls values lie between 48 and 149 calls. There are very few observations
with fewer than 25 or more than 150 day calls.

• Most DayMins values lie between 48 and 335 minutes a month. There are very few
observations with fewer than 48 or more than 335 minutes.

• Most monthly data usage lies between 0 and 1.7 GB. There are very few observations
above 4.5 GB a month or with no data usage at all.
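
The boxplots behind these observations flag a point as an outlier when it lies more than 1.5 times the hinge spread beyond a hinge. A minimal sketch with boxplot.stats(), using a hypothetical vector of customer service call counts:

```r
# Hypothetical call counts; the value 9 is deliberately extreme.
cust_serv_calls <- c(0, 1, 1, 1, 2, 2, 2, 3, 3, 9)

# boxplot.stats() returns the five box statistics and the flagged outliers.
stats <- boxplot.stats(cust_serv_calls)

# stats$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker.
lower_hinge <- stats$stats[2]
upper_hinge <- stats$stats[4]
upper_fence <- upper_hinge + 1.5 * (upper_hinge - lower_hinge)

print(upper_fence)  # values above this are drawn as outlier points
print(stats$out)    # the flagged observations
```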

3.3.3. Multicollinearity

• Churn and contract renewal are negatively correlated: a customer who churns is not
going to renew the contract, hence contract renewal is not treated as a deciding factor.
• Churn and CustServCalls are positively correlated: customers decide to churn based
on the number of calls they make to customer care, which points directly to issues
that were not resolved for them.
• Churn is also positively correlated with Monthly Charge, Overage Fee, and Roaming
Mins, which are deciding factors for a customer to leave the connection.
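
Multicollinearity itself concerns correlation among the predictors. A minimal sketch of how highly correlated predictor pairs can be listed from a correlation matrix follows. The data are simulated, and the assumption that the monthly charge is driven by data usage and day minutes is made purely for illustration; the 0.7 cut-off is likewise an arbitrary choice.

```r
set.seed(1)
n <- 200

# Simulated predictors (illustrative only, not the project data).
day_mins <- rnorm(n, mean = 180, sd = 50)
data_usage <- runif(n, min = 0, max = 4)
cust_serv_calls <- rpois(n, lambda = 1.5)
monthly_charge <- 20 + 0.15 * day_mins + 12 * data_usage + rnorm(n, sd = 2)

predictors <- data.frame(
  DayMins = day_mins,
  DataUsage = data_usage,
  CustServCalls = cust_serv_calls,
  MonthlyCharge = monthly_charge
)

# Pairwise Pearson correlations between the numeric predictors.
corr_matrix <- cor(predictors)

# List each predictor pair once (upper triangle) where |r| exceeds the cut-off.
high <- which(abs(corr_matrix) > 0.7 & upper.tri(corr_matrix), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr_matrix)[high[, "row"]],
  var2 = colnames(corr_matrix)[high[, "col"]],
  r = corr_matrix[high]
)
print(high_pairs)
```

Pairs listed this way are candidates for dropping one variable or combining them before fitting a regression model.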

3.3.4. Summary from EDA

• The graph shows the relationship between customer service calls and day calls, with
churn represented by two colors. The colors show that churners are concentrated at
high numbers of customer service calls. Blue represents the customers who churned.

• The graph shows the relationship between the number of customer service calls and
day calls on a subset of the data, with churn as a third parameter represented by
color. The smooth lines in the graph show clearly that churn is more frequent at high
numbers of customer service calls, whereas it does not vary much with day calls.
• Churn and contract renewal are negatively correlated: a customer who churns is not
going to renew the contract, hence contract renewal is not treated as a deciding factor.
• Churn and CustServCalls are positively correlated: customers decide to churn based
on the number of calls they make to customer care, which points directly to issues
that were not resolved for them.
• Churn is also positively correlated with Monthly Charge, Overage Fee, and Roaming
Mins, which are deciding factors for a customer to leave the connection.
• All of the categorical variables have a reasonably broad distribution, so all of them
are kept for further analysis.

• In summary, customers mostly churn after making many calls to customer care, with
high monthly charges, and with a high average number of daytime calls.

3.4. Building Models

I will create a trainControl object so that all of the models use the same 10-fold
cross-validation on the training set (70% of the data). I will then use the remaining 30%
of the data to test model accuracy.
I will use both the area under the ROC curve (AUC) and the accuracy percentage as
metrics for assessing the models.
The data set is split in a 70:30 ratio: observations 1 to 2333 make up the training set
and the remaining 1000 observations form the test set.
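
The split and resampling set-up described above can be sketched as follows. The code listing does not show how the control and test objects were created, so this is a reconstruction under assumptions: a positional split, a random fold assignment, and a caret trainControl() call with twoClassSummary, which is consistent with the ROC, Sens, and Spec columns in the model output. The caret part only runs if the package is installed.

```r
set.seed(42)
n <- 3333

# Positional 70:30 split as described in the text.
train_idx <- 1:2333
test_idx <- 2334:n

# Assign each training row to one of 10 cross-validation folds at random.
fold_id <- sample(rep(1:10, length.out = length(train_idx)))
print(table(fold_id))

# Assumed definition of the shared resampling control (not shown in the listing).
if (requireNamespace("caret", quietly = TRUE)) {
  control <- caret::trainControl(
    method = "cv", number = 10,
    classProbs = TRUE,                         # needed to compute ROC
    summaryFunction = caret::twoClassSummary   # reports ROC, Sens, Spec
  )
}
```

With a data frame named churn, the corresponding subsets would be churn[train_idx, ] and churn[test_idx, ].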

3.4.1. LR

In 10-fold cross-validation, the Logistic Regression model gives a ROC value of 0.81.
With 'No' as the positive class, it correctly identifies 97.5% of customers who stayed
(sensitivity) but only 18.2% of customers who left (specificity).
Generalized Linear Model

3333 samples
11 predictor
2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 3000, 3000, 2999, 2999, 3000, 3000, ...
Resampling results:

ROC Sens Spec

0.8114457 0.9747368 0.182398

Predictive Capability of Logistic Regression Model

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  846 115
       Yes  17  22

Accuracy : 0.868
95% CI : (0.8454, 0.8884)
No Information Rate : 0.863
P-Value [Acc > NIR] : 0.3429

Kappa : 0.2015

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.9803
Specificity : 0.1606
Pos Pred Value : 0.8803
Neg Pred Value : 0.5641
Prevalence : 0.8630
Detection Rate : 0.8460
Detection Prevalence : 0.9610
Balanced Accuracy : 0.5704

'Positive' Class : No
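
As a worked example, the statistics above can be recomputed from the four cells of the confusion matrix. Because 'No' is the positive class, sensitivity describes customers who stayed and specificity describes customers who churned. The object names are illustrative.

```r
# Confusion matrix from the Logistic Regression test results (rows = prediction).
cm <- matrix(c(846, 17, 115, 22), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference = c("No", "Yes")))

accuracy <- sum(diag(cm)) / sum(cm)

# Positive class is "No".
sensitivity <- cm["No", "No"] / sum(cm[, "No"])      # stayers correctly identified
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # churners correctly identified
pos_pred_value <- cm["No", "No"] / sum(cm["No", ])   # predicted "No" that stayed
neg_pred_value <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # predicted "Yes" that left
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        pos_pred_value = pos_pred_value, neg_pred_value = neg_pred_value,
        balanced_accuracy = balanced_accuracy), 4)
```

Only 22 of the 137 churners in the test set are caught, which is why the balanced accuracy is far lower than the headline accuracy.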

3.4.2. kNN

The kNN model is applicable to this data.

Cross-validation is used to find the optimal value of k. On the test set, the kNN model
has a predictive accuracy of 90.6% and correctly classifies 32.1% of customers who
went on to leave (44 of 137).
k-Nearest Neighbors

3333 samples
11 predictor
2 classes: 'No', 'Yes'

Pre-processing: centered (13), scaled (13)

Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 3000, 2999, 3000, 3000, 3000, 2999, ...
Resampling results across tuning parameters:

k ROC Sens Spec

5 0.8406125 0.9796491 0.399447279
7 0.8638593 0.9831579 0.374957483
9 0.8713614 0.9870175 0.341751701
11 0.8739489 0.9891228 0.316794218
13 0.8773208 0.9898246 0.283673469
15 0.8792518 0.9915789 0.252636054
17 0.8815504 0.9929825 0.244345238
19 0.8789064 0.9919298 0.231930272
21 0.8790745 0.9947368 0.221683673
23 0.8798011 0.9954386 0.194727891
25 0.8806830 0.9975439 0.186522109
27 0.8805258 0.9975439 0.165731293
29 0.8802212 0.9975439 0.155484694
31 0.8811277 0.9975439 0.147151361
33 0.8806776 0.9978947 0.132610544
35 0.8802820 0.9978947 0.113903061
37 0.8794389 0.9978947 0.115943878

39 0.8783408 0.9975439 0.111819728

41 0.8788874 0.9989474 0.105527211
43 0.8789838 0.9982456 0.088988095
45 0.8772669 0.9978947 0.084906463
47 0.8770335 0.9978947 0.076615646
49 0.8755518 0.9985965 0.070408163
51 0.8748780 0.9982456 0.059991497
53 0.8753791 0.9985965 0.057950680
55 0.8746037 0.9985965 0.049659864
57 0.8754845 0.9989474 0.047619048
59 0.8742601 0.9989474 0.045535714
61 0.8743958 0.9989474 0.043452381
63 0.8733002 0.9989474 0.043452381
65 0.8732624 0.9989474 0.035119048
67 0.8726458 0.9989474 0.033035714
69 0.8729846 0.9989474 0.026828231
71 0.8726033 0.9989474 0.018579932
73 0.8713635 0.9989474 0.016496599
75 0.8708805 0.9992982 0.016539116
77 0.8709801 0.9992982 0.008290816
79 0.8706274 0.9996491 0.006207483
81 0.8705588 0.9996491 0.006207483
83 0.8705911 0.9996491 0.008290816
85 0.8702447 0.9996491 0.010374150
87 0.8701569 0.9996491 0.008290816
89 0.8698441 0.9996491 0.008290816
91 0.8697138 0.9996491 0.008290816
93 0.8695739 0.9996491 0.006207483
95 0.8677119 0.9996491 0.006207483
97 0.8677432 0.9996491 0.004124150
99 0.8678821 0.9996491 0.004124150
101 0.8668879 0.9996491 0.006207483
103 0.8665417 0.9996491 0.006207483

ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 17
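
A minimal sketch of the kNN step with centering and scaling, using knn() from the class package instead of caret. The data are simulated with two well-separated groups and hypothetical parameter values, so this illustrates the mechanics only; k = 17 is taken from the tuning result above.

```r
library(class)  # recommended package that ships with standard R installations

set.seed(7)

# Simulate one group of customers (illustrative values, not the project data).
make_group <- function(n, mins_mean, calls_mean, label) {
  data.frame(DayMins = rnorm(n, mins_mean, 20),
             CustServCalls = rnorm(n, calls_mean, 0.5),
             Churn = label)
}
train <- rbind(make_group(100, 150, 1, "No"), make_group(100, 300, 5, "Yes"))
test <- rbind(make_group(30, 150, 1, "No"), make_group(30, 300, 5, "Yes"))

# Center and scale using training statistics only, then apply them to the test set.
train_scaled <- scale(train[, c("DayMins", "CustServCalls")])
test_scaled <- scale(test[, c("DayMins", "CustServCalls")],
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))

# Classify each test row by majority vote among its 17 nearest training rows.
pred <- knn(train_scaled, test_scaled, cl = factor(train$Churn), k = 17)
accuracy <- mean(pred == test$Churn)
print(accuracy)
```

Scaling matters here because DayMins is measured in hundreds while CustServCalls is a small count; without it the distance would be dominated by DayMins.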

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  862  93
       Yes   1  44

Accuracy : 0.906
95% CI : (0.8862, 0.9234)
No Information Rate : 0.863
P-Value [Acc > NIR] : 2.11e-05

Kappa : 0.446

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.9988

Specificity : 0.3212
Pos Pred Value : 0.9026
Neg Pred Value : 0.9778
Prevalence : 0.8630
Detection Rate : 0.8620
Detection Prevalence : 0.9550
Balanced Accuracy : 0.6600

'Positive' Class : No

3.4.3. Naïve Bayes

The Naïve Bayes model is applicable to this data. It performs well with categorical
input variables compared to numerical ones, and our dataset contains categorical
variables.
On the test set, the NB model has a predictive accuracy of 87.7% and correctly
classifies 10.2% of customers who went on to leave (14 of 137).
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  863 123
       Yes   0  14

Accuracy : 0.877
95% CI : (0.855, 0.8967)
No Information Rate : 0.863
P-Value [Acc > NIR] : 0.106

Kappa : 0.1642

Mcnemar's Test P-Value : <2e-16

Sensitivity : 1.0000
Specificity : 0.1022
Pos Pred Value : 0.8753
Neg Pred Value : 1.0000
Prevalence : 0.8630
Detection Rate : 0.8630
Detection Prevalence : 0.9860
Balanced Accuracy : 0.5511

'Positive' Class : No

3.4.4. Model Comparison

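kNN showed the greatest overall predictive accuracy on the test set (90.6%), followed by
Naïve Bayes (87.7%) and Logistic Regression (86.8%). All three models have high
sensitivity for the 'No' class (98.0% to 100%), but they identify only a minority of the
customers who actually churned: kNN 32.1%, Logistic Regression 16.1%, and Naïve Bayes
10.2%. Further evaluation is required before declaring any one model the best fit. For
this data set, the Logistic Regression model is taken as the working model, and its
coefficients are used to identify the attributes of customers likely to churn.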

ROC Curve

3.5. Actionable Insights

• On the test set, the null model that predicts ‘No’ for every customer has an accuracy
of 86.3% (the No Information Rate). Only kNN (90.6%) is significantly more accurate
than this baseline (p = 2.11e-05); Logistic Regression (86.8%, p = 0.34) and Naïve
Bayes (87.7%, p = 0.11) are not.
• In the Logistic Regression model, Contract Renewal, Customer Service Calls, and
Roaming Mins are the most significant predictors (p < 0.001), followed by Data Plan
(p < 0.05). Customers who have renewed their contract or who have a data plan are the
least likely to churn.
• Having renewed the contract is a strong indicator that a customer will not leave the
service, so customers who have not renewed are the most likely to churn.
• Provide better data packs at a lower monthly rental. This will also reduce overage
fees for extra data usage.
• Resolve customer queries faster, which encourages customers to extend the contract
or keep the connection.
• Provide better roaming benefits, which will reduce the monthly charge: customers who
roam spend more minutes on calls, resulting in overage fees.
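
As a worked example, the logistic regression coefficients reported by summary(glm_model) in the code listing can be converted to odds ratios, which are easier to act on. Only the four statistically significant coefficients are used; the vector name is illustrative.

```r
# Estimates copied from the summary(glm_model) output in the code listing.
coefs <- c(
  ContractRenewalYes = -1.987817,
  DataPlanYes = -1.170428,
  CustServCalls = 0.509554,
  RoamMins = 0.079964
)

# exp(coefficient) is the multiplicative change in the odds of churn for a
# one-unit increase in the predictor, holding the other predictors fixed.
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

Read this way, each additional customer service call multiplies the odds of churn by roughly 1.66, while having renewed the contract multiplies them by roughly 0.14.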

3.6. Code

> ##Library
> library(plyr)
> library(corrplot)
> library(ggplot2)
> library(gridExtra)
> library(ggthemes)
> library(caret)
> library(MASS)
> library(randomForest)
> library(party)
> library(tibble)  # add_column() is used in the model comparison
> library(dplyr)   # %>%, filter() and bind_rows() are used below
> churn <- read.csv('Cellphone.csv')
> str(churn)
'data.frame': 3333 obs. of 12 variables:
$ Churn : int 0 0 0 0 0 0 0 0 0 0 ...
$ AccountWeeks : int 128 107 137 84 75 118 121 147 117 141 ...
$ tenure : int 29 25 32 19 17 27 28 34 27 32 ...
$ ContractRenewal: int 1 1 1 0 0 0 1 0 1 0 ...
$ DataPlan : int 1 1 0 0 0 0 1 0 0 1 ...
$ DataUsage : num 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : int 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num 265 162 243 299 167 ...
$ DayCalls : int 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
> head(churn)
Churn AccountWeeks tenure ContractRenewal DataPlan DataUsage CustServCalls
DayMins DayCalls MonthlyCharge OverageFee
1 0 128 29 1 1 2.7 1
265.1 110 89 9.87
2 0 107 25 1 1 3.7 1
161.6 123 82 9.78
3 0 137 32 1 0 0.0 0
243.4 114 52 6.06
4 0 84 19 0 0 0.0 2
299.4 71 57 3.10
5 0 75 17 0 0 0.0 3
166.7 113 41 7.42
6 0 118 27 0 0 0.0 0
223.4 98 57 11.03
1 10.0
2 13.7
3 12.2
4 6.6
5 10.1
6 6.3
> View(churn)

> ##Data Cleaning & arranging
> ##The raw data contains 3333 rows (customers) and 12 columns (features). The “Churn” column is our target.
> ##We use sapply to check the number of missing values in each column. There are none,
> ##so the complete.cases() filter below leaves the data unchanged.
> sapply(churn, function(x) sum(is.na(x)))
          Churn    AccountWeeks          tenure ContractRenewal        DataPlan       DataUsage   CustServCalls
              0               0               0               0               0               0               0
        DayMins        DayCalls   MonthlyCharge      OverageFee
              0               0               0               0
> churn <- churn[complete.cases(churn), ]
> ##3. Since the minimum tenure is 0 months and the maximum tenure is 56 months, we
> ##can group them into four tenure groups: “0–12 Month”, “12–24 Month”, “24–48 Month”, “> 48 Month”
> min(churn$tenure); max(churn$tenure)
[1] 0
[1] 56
> group_tenure <- function(tenure){
+ if (tenure >= 0 & tenure <= 12){
+ return('0-12 Month')
+ }else if(tenure > 12 & tenure <= 24){
+ return('12-24 Month')
+ }else if (tenure > 24 & tenure <= 48){
+ return('24-48 Month')
+ }else if (tenure > 48){
+ return('> 48 Month')
+ }
+ }
> churn$tenure_group <- sapply(churn$tenure,group_tenure)
> churn$tenure_group <- as.factor(churn$tenure_group)
> ##4. Change the values in column “Churn” from 0 or 1 to “No” or “Yes”.
> churn$Churn <- as.factor(mapvalues(churn$Churn,
+ from=c("0","1"),
+ to=c("No", "Yes")))
> ##5. Change the values in column “ContractRenewal” from 0 or 1 to “No” or “Yes”.
> churn$ContractRenewal <- as.factor(mapvalues(churn$ContractRenewal,
+ from=c("0","1"),
+ to=c("No", "Yes")))
> ##6. Change the values in column “DataPlan” from 0 or 1 to “No” or “Yes”.
> churn$DataPlan <- as.factor(mapvalues(churn$DataPlan,
+ from=c("0","1"),
+ to=c("No", "Yes")))
> ##Remove the columns we do not need for the analysis.
> churn$AccountWeeks <- NULL
> #########################################
> ##Missing Values
> sum(is.na(churn))
[1] 0
> ##EDA
> ##Exploratory data analysis and feature selection
> ##Correlation between numeric variables
> numeric.var <- sapply(churn, is.numeric)
> corr.matrix <- cor(churn[,numeric.var])
> corrplot(corr.matrix, main="\n\nCorrelation Plot for Numerical Variables",
> ##Exploring data
> ##How many churns in this dataset ?
> ggplot(churn, aes(x = Churn))+
+ geom_bar(fill = c("sky blue", "orange"))
> ##When do customer churns ?
> churn %>% filter(Churn == "Yes") %>%
+ ggplot( aes(x= tenure))+
+ geom_bar(fill = "orange" )
> ##Bar plots of categorical variables
> p1 <- ggplot(churn, aes(x=ContractRenewal)) + ggtitle("Contract Renewal") +
xlab("Contract") +
+ geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) +
+ ylab("Percentage") + coord_flip() + theme_minimal()
> plot(p1)
> library(RWeka)

> ##Decision Tree for Churn (using J48)
> ##Fig. 1 depicts the churn values from table formed by predicting the
> ##values of J48 decision tree on churn parameter
> m2 <- J48(Churn ~ ., data = churn)
> m2
J48 pruned tree

DayMins <= 264.4

| CustServCalls <= 3
| | ContractRenewal = No
| | | RoamMins <= 13.1
| | | | DataPlan = No
| | | | | MonthlyCharge <= 59.9: No (144.0/32.0)
| | | | | MonthlyCharge > 59.9
| | | | | | tenure_group = > 48 Month: Yes (0.0)
| | | | | | tenure_group = 0-12 Month: No (3.0/1.0)
| | | | | | tenure_group = 12-24 Month: Yes (5.0)
| | | | | | tenure_group = 24-48 Month
| | | | | | | OverageFee <= 12.29: No (4.0/1.0)
| | | | | | | OverageFee > 12.29: Yes (5.0)
| | | | DataPlan = Yes: No (58.0/9.0)
| | | RoamMins > 13.1: Yes (48.0)
| | ContractRenewal = Yes
| | | DayMins <= 223.2: No (2221.0/60.0)
| | | DayMins > 223.2
| | | | OverageFee <= 12.06: No (295.0/22.0)
| | | | OverageFee > 12.06
| | | | | DataPlan = No
| | | | | | MonthlyCharge <= 62.5
| | | | | | | RoamMins <= 13.4: No (12.0/1.0)
| | | | | | | RoamMins > 13.4: Yes (2.0)
| | | | | | MonthlyCharge > 62.5
| | | | | | | RoamMins <= 7.8: No (10.0/3.0)
| | | | | | | RoamMins > 7.8
| | | | | | | | DataUsage <= 0.22: Yes (36.0/1.0)
| | | | | | | | DataUsage > 0.22
| | | | | | | | | MonthlyCharge <= 65.5: No (3.0)
| | | | | | | | | MonthlyCharge > 65.5: Yes (5.0)
| | | | | DataPlan = Yes: No (20.0)
| CustServCalls > 3
| | MonthlyCharge <= 45.9: Yes (88.0/5.0)
| | MonthlyCharge > 45.9
| | | DayMins <= 160.2
| | | | ContractRenewal = No: Yes (5.0)
| | | | ContractRenewal = Yes
| | | | | DataPlan = No
| | | | | | CustServCalls <= 4: No (4.0)
| | | | | | CustServCalls > 4: Yes (3.0/1.0)
| | | | | DataPlan = Yes
| | | | | | MonthlyCharge <= 70.1: Yes (10.0)
| | | | | | MonthlyCharge > 70.1
| | | | | | | DataUsage <= 3.08: No (4.0)
| | | | | | | DataUsage > 3.08
| | | | | | | | DayCalls <= 103: Yes (4.0)

| | | | | | | | DayCalls > 103: No (3.0/1.0)

| | | DayMins > 160.2
| | | | OverageFee <= 6.75
| | | | | DataPlan = No
| | | | | | CustServCalls <= 4: No (4.0/1.0)
| | | | | | CustServCalls > 4: Yes (2.0)
| | | | | DataPlan = Yes: Yes (3.0)
| | | | OverageFee > 6.75
| | | | | DataPlan = No: No (81.0/9.0)
| | | | | DataPlan = Yes
| | | | | | OverageFee <= 10.27
| | | | | | | DayMins <= 178.1: Yes (5.0)
| | | | | | | DayMins > 178.1: No (8.0/1.0)
| | | | | | OverageFee > 10.27: No (27.0/1.0)
DayMins > 264.4
| DataPlan = No
| | MonthlyCharge <= 64.9
| | | OverageFee <= 7.21: No (14.0)
| | | OverageFee > 7.21
| | | | CustServCalls <= 0: Yes (3.0)
| | | | CustServCalls > 0
| | | | | ContractRenewal = No: No (7.0/2.0)
| | | | | ContractRenewal = Yes
| | | | | | MonthlyCharge <= 62.5
| | | | | | | DayMins <= 267.4: Yes (3.0/1.0)
| | | | | | | DayMins > 267.4: No (6.0)
| | | | | | MonthlyCharge > 62.5
| | | | | | | DataUsage <= 0.26
| | | | | | | | CustServCalls <= 1
| | | | | | | | | DayMins <= 280.4: No (2.0)
| | | | | | | | | DayMins > 280.4: Yes (2.0)
| | | | | | | | CustServCalls > 1: Yes (11.0/1.0)
| | | | | | | DataUsage > 0.26: No (2.0)
| | MonthlyCharge > 64.9
| | | DataUsage <= 0: Yes (80.0)
| | | DataUsage > 0
| | | | DataUsage <= 0.27
| | | | | MonthlyCharge <= 68.6: No (5.0)
| | | | | MonthlyCharge > 68.6: Yes (7.0/1.0)
| | | | DataUsage > 0.27: Yes (16.0)
| DataPlan = Yes
| | ContractRenewal = No
| | | DayMins <= 276.2: Yes (4.0)
| | | DayMins > 276.2: No (4.0/1.0)
| | ContractRenewal = Yes: No (45.0/1.0)

Number of Leaves : 48

Size of the tree : 93

> m3 <- table(churn$Churn, predict(m2))

> m3

No Yes
No 2840 10
Yes 146 337

> plot(m3)
> ##Classification Tree for all the Calls (using rpart )
> library(rpart)
> f<-rpart(Churn ~CustServCalls+MonthlyCharge+DayCalls+tenure_group
+ +OverageFee,method="class", data=churn)
> plot(f, uniform=TRUE,main="Classification Tree for Churn")
> text(f, use.n=TRUE, all=TRUE, cex=.7)
> plot(Churn ~., data = churn, type = "c")
> lines(Churn ~ CustServCalls, data = churn, type = "l")
> qplot(DayCalls, CustServCalls, data = churn, colour = Churn)
> qplot(DayCalls, CustServCalls, data = churn,
+ geom = c("point", "smooth"), color = Churn)
> dsc <- churn[sample(nrow(churn), 100), ]
> qplot(DayCalls, CustServCalls, data = dsc,
+ geom = c("point", "smooth"), color = Churn)
> qplot(tenure_group, CustServCalls, data =
+ churn,colour=Churn)
> qplot(ContractRenewal, CustServCalls, data =
+ churn,colour=Churn)
> ggplot(churn) +
+ geom_bar(aes(x = DataPlan, fill = Churn), position = "dodge")
> ggplot(churn) +
+ geom_bar(aes(x = CustServCalls, fill = Churn), position = "dodge")
> library("car")
> ##scatterplot() needs both x and y; a scatterplot matrix of the numeric columns is drawn instead
> scatterplotMatrix(churn[, numeric.var])
> #################################################################
> ##Identifying outlier
> boxplot(churn$tenure, main="Tenure Outliers",
+ ylab="Tenure in Months")
> boxplot(churn$CustServCalls, main="Cust Serv Outliers",
+ ylab="Calls Made")
> boxplot(churn$DayCalls, main="Day Calls Outliers",
+ ylab="Number of Calls")
> boxplot(churn$DayMins, main="Day Mins Outliers",
+ ylab="Duration of Calls")
> boxplot(churn$DataUsage, main="Data Usage Outliers",
+ ylab="Usage in GB")
> ##################################################################
> ##Logistic Regression Model
> glm_model <- train(Churn ~ ., data = churn,
+ method="glm",
+ trControl = control
+ )
Warning message:
In train.default(x, y, weights = w, ...) :
The metric "Accuracy" was not in the result set. ROC will be used instead.
> glm_model
Generalized Linear Model

3333 samples
11 predictor
2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 3000, 2999, 3000, 3000, 2999, 3000, ...
Resampling results:

ROC Sens Spec

0.8118937 0.974386 0.1760204

> ##The Logistic Regression model gives a cross-validated ROC value of 0.81. With 'No' as the
> ##positive class, it correctly identifies 97.4% of customers who stayed but only 17.6% of
> ##customers who left.
> ##Predictive Capability of Logistic Regression Model
> glm_pred <- predict(glm_model, newdata = test)
> glmcm <- confusionMatrix(glm_pred, test[["Churn"]])
> glmaccuracy <- glmcm$overall[c(1,3,4)]
> glmcm
Confusion Matrix and Statistics

Prediction No Yes
No 846 115
Yes 17 22

Accuracy : 0.868
95% CI : (0.8454, 0.8884)
No Information Rate : 0.863
P-Value [Acc > NIR] : 0.3429

Kappa : 0.2015

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.9803
Specificity : 0.1606
Pos Pred Value : 0.8803
Neg Pred Value : 0.5641
Prevalence : 0.8630
Detection Rate : 0.8460
Detection Prevalence : 0.9610
Balanced Accuracy : 0.5704

'Positive' Class : No

> ##When the model is applied to the test data it yields accuracy of 87%: 88.0% of customers
> ##that the model identified as staying did stay (Pos Pred Value), and 56.4% of customers
> ##that the model identified as leaving did leave (Neg Pred Value).
> ##K-Nearest Neighbours
> ##Cross validation on the training set is used to find the optimal value for K.
> knn_model <- train(Churn ~ ., data = churn,
+ method = "knn", trControl = control,
+ preProcess = c("center","scale"), tuneLength = 50)
Warning message:
In train.default(x, y, weights = w, ...) :
The metric "Accuracy" was not in the result set. ROC will be used instead.
> knn_model
k-Nearest Neighbors

3333 samples
11 predictor
2 classes: 'No', 'Yes'

Pre-processing: centered (13), scaled (13)

Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 2999, 3000, 3000, 3000, 3000, 3000, ...
Resampling results across tuning parameters:

k ROC Sens Spec

5 0.8481891 0.9814035 0.393112245
7 0.8732730 0.9873684 0.372534014
9 0.8771251 0.9870175 0.360119048
11 0.8813202 0.9898246 0.314710884

13 0.8831816 0.9908772 0.294047619

15 0.8829543 0.9922807 0.271173469
17 0.8856169 0.9933333 0.265136054
19 0.8854264 0.9947368 0.254719388
21 0.8832985 0.9950877 0.225765306
23 0.8834157 0.9950877 0.207100340
25 0.8851906 0.9961404 0.194727891
27 0.8834919 0.9968421 0.163605442
29 0.8834786 0.9975439 0.159438776
31 0.8828395 0.9978947 0.144897959
33 0.8821930 0.9975439 0.149064626
35 0.8825335 0.9968421 0.132525510
37 0.8818193 0.9968421 0.124234694
39 0.8813701 0.9978947 0.115901361
41 0.8810189 0.9978947 0.107780612
43 0.8798679 0.9982456 0.103656463
45 0.8796974 0.9989474 0.089115646
47 0.8797049 0.9985965 0.091156463
49 0.8779086 0.9989474 0.082823129
51 0.8780804 0.9989474 0.070323129
53 0.8775485 0.9992982 0.062117347
55 0.8766605 0.9992982 0.057993197
57 0.8750668 0.9992982 0.051785714
59 0.8753655 0.9992982 0.051828231
61 0.8747761 0.9996491 0.045578231
63 0.8736711 0.9996491 0.043537415
65 0.8733800 1.0000000 0.035204082
67 0.8740775 0.9996491 0.031037415
69 0.8739562 0.9996491 0.033120748
71 0.8733926 0.9996491 0.026955782
73 0.8719572 0.9996491 0.028996599
75 0.8726676 1.0000000 0.020705782
77 0.8720143 1.0000000 0.016581633
79 0.8728601 1.0000000 0.016581633
81 0.8724161 1.0000000 0.014540816
83 0.8721667 1.0000000 0.014540816
85 0.8713604 1.0000000 0.014540816
87 0.8706654 1.0000000 0.012500000
89 0.8709014 1.0000000 0.012500000
91 0.8700900 1.0000000 0.010416667
93 0.8691306 1.0000000 0.012500000
95 0.8679043 1.0000000 0.010416667
97 0.8676197 1.0000000 0.008333333
99 0.8677869 1.0000000 0.008333333
101 0.8685215 1.0000000 0.006250000
103 0.8677015 1.0000000 0.006250000

ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 17.
> ##Predictive Capability
> knn_pred <- predict(knn_model, newdata = test)
> knncm <- confusionMatrix(knn_pred, test[["Churn"]])
> knnaccuracy <- knncm$overall[c(1,3,4)]
> knncm
Confusion Matrix and Statistics

Prediction No Yes
No 862 93
Yes 1 44

Accuracy : 0.906
95% CI : (0.8862, 0.9234)
No Information Rate : 0.863
P-Value [Acc > NIR] : 2.11e-05

Kappa : 0.446

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.9988
Specificity : 0.3212
Pos Pred Value : 0.9026
Neg Pred Value : 0.9778
Prevalence : 0.8630
Detection Rate : 0.8620
Detection Prevalence : 0.9550
Balanced Accuracy : 0.6600

'Positive' Class : No

> ##The KNN model has predictive accuracy of 90.6% and correctly classifies
> ##32.1% of customers who went on to leave (44 of 137).
> ##Naive Bayes
> nb_model <- train(Churn ~ ., data = churn,
+ method = "nb", trControl = control,
+ preProcess = c("center","scale"), tuneLength = 50)
There were 50 or more warnings (use warnings() to see the first 50)
> ##Predictive Capability
> nb_pred <- predict(nb_model, newdata = test)
Warning messages:
1: In FUN(X[[i]], ...) :
Numerical 0 probability for all classes with observation 189
2: In FUN(X[[i]], ...) :
Numerical 0 probability for all classes with observation 258
> nbcm <- confusionMatrix(nb_pred, test[["Churn"]])
> nbaccuracy <- nbcm$overall[c(1,3,4)]
> nbcm
Confusion Matrix and Statistics

Prediction No Yes
No 863 123
Yes 0 14

Accuracy : 0.877
95% CI : (0.855, 0.8967)
No Information Rate : 0.863
P-Value [Acc > NIR] : 0.106

Kappa : 0.1642

Mcnemar's Test P-Value : <2e-16

Sensitivity : 1.0000
Specificity : 0.1022
Pos Pred Value : 0.8753
Neg Pred Value : 1.0000
Prevalence : 0.8630
Detection Rate : 0.8630
Detection Prevalence : 0.9860
Balanced Accuracy : 0.5511

'Positive' Class : No

> ##The NB model has predictive accuracy of 87.7% and correctly classifies 10.2%
> ##of customers who went on to leave (14 of 137).
> model_list <- list("Logistic Regression" = glm_model, "kNN" = knn_model,
"NB" = nb_model)
> resamples <- resamples(model_list)
> dotplot(resamples, metric="ROC", main = "Area Under Curve with 95% CI")
> models <- c("Logistic Regression", "KNN", "NB")
> accuracysummary <- bind_rows(Logistic = glmaccuracy, kNN = knnaccuracy, NB = nbaccuracy)
> accuracysummary2 <- add_column(accuracysummary, "Model" = models, .before = "Accuracy")
> accuracysummary2
# A tibble: 3 x 4
  Model               Accuracy AccuracyLower AccuracyUpper
  <chr>                  <dbl>         <dbl>         <dbl>
1 Logistic Regression    0.868         0.845         0.888
2 KNN                    0.906         0.886         0.923
3 NB                     0.877         0.855         0.897
> ggplot(accuracysummary2, aes(x = Model, y = Accuracy)) + geom_bar(stat =
"identity") +
+ geom_errorbar(width = 0.5, aes(ymin = AccuracyLower, ymax =
AccuracyUpper), color = "black") +
+ coord_cartesian(ylim = c(0.8, 0.95)) +
+ labs(y = "Accuracy %", x = "Model", title = "Model Prediction Accuracy
with 95% CI") +
+ theme_minimal()

> ##################Conclusion
> ##Identifying attributes of customers likely to churn
> summary(glm_model)


Deviance Residuals:
Min 1Q Median 3Q Max
-1.9916 -0.5103 -0.3490 -0.2080 3.0143

Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.315887 1.142700 -3.777 0.000159 ***
tenure -0.005033 0.013059 -0.385 0.699920
ContractRenewalYes -1.987817 0.143711 -13.832 < 2e-16 ***
DataPlanYes -1.170428 0.536632 -2.181 0.029179 *
DataUsage 0.434520 1.926339 0.226 0.821537
CustServCalls 0.509554 0.038988 13.070 < 2e-16 ***
DayMins 0.018782 0.032536 0.577 0.563752
DayCalls 0.003456 0.002755 1.255 0.209658
MonthlyCharge -0.035050 0.191203 -0.183 0.854551
OverageFee 0.199282 0.326232 0.611 0.541293
RoamMins 0.079964 0.022106 3.617 0.000298 ***
`tenure_group0-12 Month` -1.561340 0.961034 -1.625 0.104238
`tenure_group12-24 Month` -1.490958 0.873711 -1.706 0.087921 .
`tenure_group24-48 Month` -1.394545 0.810196 -1.721 0.085206 .
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2758.3 on 3332 degrees of freedom

Residual deviance: 2185.8 on 3319 degrees of freedom
AIC: 2213.8

Number of Fisher Scoring iterations: 5

> summary(knn_model)
Length Class Mode
learn 2 -none- list
k 1 -none- numeric
theDots 0 -none- list
xNames 13 -none- character
problemType 1 -none- character
tuneValue 1 data.frame list
obsLevels 2 -none- character
param 0 -none- list
> summary(nb_model)
Length Class Mode
apriori 2 table numeric
tables 13 -none- list
levels 2 -none- character
call 6 -none- call

x 13 data.frame list
usekernel 1 -none- logical
varnames 13 -none- character
xNames 13 -none- character
problemType 1 -none- character
tuneValue 3 data.frame list
obsLevels 2 -none- character
param 0 -none- list
