Customer Churn Prediction
Customer Churn Prediction
likely to stop using a product or service in the near future. It is a valuable predictive analytics technique
used by businesses to forecast customer behavior and take proactive measures to retain customers.
Objective : objective of this project is to predict wather customer is about to churn or not.
Out[2]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetS
7590- No phone
0 Female 0 Yes No 1 No
VHVEG service
5575-
1 Male 0 No No 34 Yes No
GNVDE
3668-
2 Male 0 No No 2 Yes No
QPYBK
7795- No phone
3 Male 0 No No 45 No
CFOCW service
9237-
4 Female 0 No No 2 Yes No Fibe
HQITU
5 rows × 21 columns
In [3]: 1 #print concise summary of the dataset
2 df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 St i M i 7043 ll bj t
In [4]: 1 #check for missing values
2 df.isnull().sum()
Out[4]: customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64
Out[5]: 0
In [6]: 1 #check datatype
2 df.dtypes
In [8]: 1 #since total changes is having numerical value but dtype is object to change it i
2 df['TotalCharges']=pd.to_numeric(df['TotalCharges'],errors='coerce')
Out[9]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService O
No phone
7040 Female 0 Yes Yes 11 No DSL
service
as we can see 83.8 % of the customers are senior citizen and only 16.2% are adult customer.
Since our dataset is highly imbalance we need to balance before fitting it into model
In [13]: 1 #how much loss we are having because of customer churn
2 churn_customers=df[df["Churn"]=="Yes"]
3 loss=churn_customers["TotalCharges"].sum()
4 total_revenue=df["TotalCharges"].sum()
5 print("We have lost arround {}$ due to customer churn".format(loss))
6 print("We have lost arround {} percentage of revengue due to customer churn".form
after plotting histogram and boxplot we found that there is no outlier present in numeric dataset so we
don't need to do any kind of outlier treatment.
In [16]: 1 sns.pairplot(df.drop(columns="SeniorCitizen"),hue="Churn",kind="scatter")
2 plt.show()
Univariate Analysis
In [17]: 1 #plot cateogrical features :
2 cat_features=list(df.select_dtypes(include='object').columns)
3 cat_features.remove('Churn')
4 cat_features.append('SeniorCitizen')
5
6 fig,axs=plt.subplots(nrows=4,ncols=4,figsize=(20,10))
7 axes=axs.flatten()
8 for i,col in enumerate(cat_features):
9 sns.countplot(x=col,hue="Churn",data=df,ax=axes[i])
10 #adjust spacing between subplots
11 fig.tight_layout()
12 plt.show()
Data Cleaning
In [18]: 1 df.head(5)
Out[18]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService Onlin
No phone
0 Female 0 Yes No 1 No DSL
service
No phone
3 Male 0 No No 45 No DSL
service
Out[19]: gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 11
Churn 0
dtype: int64
In [20]: 1 df["TotalCharges"].fillna(df["TotalCharges"].mean(),inplace=True)
In [21]: 1 df.isnull().sum().sum()
Out[21]: 0
In [23]: 1 df.head()
Out[23]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService Onlin
0 0 0 1 0 1 0 1 0
1 1 0 0 0 34 1 0 0
2 1 0 0 0 2 1 0 0
3 1 0 0 0 45 0 1 0
4 0 0 0 0 2 1 0 1
In [24]: 1 df.dtypes
since we are using ensemble methods for model building so there is no need of feature scaling as its
prediction is based on creating multiple decision tree
In [28]: 1 x.shape
Feature Selection
selecting only 10 features which has higher correlation with churn
Out[29]: ▾ SelectKBest
SelectKBest()
In [31]: 1 x=x[select_feature.get_feature_names_out()]
In [32]: 1 x.shape
according to the feature selection we have selected 10 top features out of 19 features
In [33]: 1 x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
In [34]: 1 x_train.shape,y_train.shape,x_test.shape,y_test.shape
Out[35]: 0 5174
1 1869
Name: Churn, dtype: int64
In [37]: 1 #Random Forest Model without balancing dataset and without hyper paramter tuning
2 rand_forest=RandomForestClassifier()
3 rand_forest.fit(x_train,y_train)
Out[37]: ▾ RandomForestClassifier
RandomForestClassifier()
Out[39]: ▾ GradientBoostingClassifier
GradientBoostingClassifier()
as we can see our model is not performing up to the mark because of imbalance nature of dataset so we
will balance it to reduce TN,FN and increase TP,FP
In [41]: 1 plt.figure(figsize=(8,4))
2 y.value_counts().plot(kind="pie",autopct="%1.f%%",labels=['No','Yes'])
3 plt.show()
we have 2 classes class 0 and class 1. class 0 - majority class class 1 -minority class
In [42]: 1 smote=SMOTEENN()
2 x_st,y_st=smote.fit_resample(x,y)
In [43]: 1 y_st.value_counts().plot(kind="bar")
2 plt.title("target class distribution after under sampling")
3 plt.show()
In [44]: 1 y_st.value_counts()
Out[44]: 1 3101
0 2666
Name: Churn, dtype: int64
since we have performed SMOTEENN (combination of Smote + ENN) sampling method and we can see
our dataset is nearly balanced
In [45]: 1 #now split training and validation set using balanced dataset
2 x_train,x_test,y_train,y_test=train_test_split(x_st,y_st,test_size=0.2,random_sta
In [46]: 1 x_train.shape,y_train.shape,x_test.shape,y_test.shape
Building Model with Balanced Dataset and performance hyper parameter tuning using
RandomSearchCV
In [47]: 1 param_grid={'n_estimators':[40,80,120,160,200],
2 'max_depth':[2,4,6,8,10],
3 "criterion":['gini'],
4 "random_state":[27,42,43]
5 }
6 random_search_cv=RandomizedSearchCV( estimator=RandomForestClassifier(), param_di
7 random_search_cv.fit(x_train,y_train)
Out[47]: ▸ RandomizedSearchCV
▸ estimator: RandomForestClassifier
▸ RandomForestClassifier
In [48]: 1 random_search_cv.best_params_
In [52]: 1 random_search_cv2=RandomizedSearchCV(estimator=GradientBoostingClassifier(random_
2 random_search_cv2.fit(x_train,y_train)
Out[52]: ▸ RandomizedSearchCV
▸ estimator: GradientBoostingClassifier
▸ GradientBoostingClassifier
In [53]: 1 random_search_cv2.best_params_
In [54]: 1 gb_final_model=random_search_cv2.best_estimator_
In [55]: 1 #evaluate final GradientBoostingClassifier Performance
2 evaluate_model_performance(gb_final_model,x_test)
In [56]: 1 file=open("trained_model.pkl","wb")
2 pickle.dump(gb_final_model,file)
3 file.close()
Conclusion : after balancing the dataset using smootenn and hyper paramter tuning model performance
has increase and the highest f1 score we are getting is 97%.