
DEGREE PROJECT IN ENGINEERING PHYSICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Customer Churn Analysis and Prediction using Machine Learning for a B2B SaaS company

MARIE SERGUE

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES
Degree Projects in Mathematical Statistics (30 ECTS credits)


Degree Programme in Engineering Physics
KTH Royal Institute of Technology year 2020
Supervisor at Aircall: Edouard Flouriot
Supervisor at KTH: Jimmy Olsson
Examiner at KTH: Jimmy Olsson
TRITA-SCI-GRU 2020:032
MAT-E 2020:010

Royal Institute of Technology


School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci
Abstract
This past decade, the majority of services have been digitalized, and data have become more and more available and easy to store and process in order to understand customer behavior. In order to be leaders in their industries, subscription-based businesses must focus on their Customer Relationship Management and in particular churn management, that is, understanding why customers cancel their subscriptions. In this thesis, churn analysis is performed on real-life data from a Software as a Service (SaaS) company selling an advanced cloud-based business phone system, Aircall. This use case has the particularity that the available dataset gathers customer data on a monthly basis and has a very imbalanced distribution of the target: a large majority of customers do not churn. Therefore, several methods are tried in order to diminish the impact of the imbalance while remaining as close as possible to the real world and the temporal framework. These methods include oversampling and undersampling (SMOTE and Tomek's link) and time series cross-validation. Then logistic regression and random forest models are used with the aim of both predicting and explaining churn. The non-linear method performed better than logistic regression, suggesting the limitation of linear models for our use case. Moreover, mixing oversampling with undersampling gives better performance in terms of the precision/recall trade-off. Time series cross-validation also turns out to be an efficient method to improve the performance of the model. Overall, the resulting model is more useful for explaining churn than for predicting it. It highlighted some features that strongly influence churn, mostly related to product usage.

Sammanfattning
Churn analysis and prediction using machine learning for a B2B SaaS company

Over the past decade, many services have been digitalized and data have become more and more available, easy to store and process with the aim of understanding customer behavior. In order to be leaders in their industries, subscription-based companies must focus on customer relationship management and in particular churn management, that is, understanding why customers cancel their subscriptions. In this thesis, churn analysis is performed on real data from a SaaS (software as a service) company selling an advanced cloud-based business phone system, Aircall. This case study is special in that the available dataset consists of monthly customer data with a very uneven distribution: a large majority of the customers do not cancel their subscriptions. Therefore, several methods are examined to reduce the effect of this imbalance, while staying as close as possible to the real world and the temporal framework. These methods include oversampling and undersampling (SMOTE and Tomek's link) and time series cross-validation. Then logistic regression and random forests are used with the aim of both predicting and explaining subscription churn. The non-linear method performed better than logistic regression, indicating a limitation of linear models in our use case. In addition, combining oversampling with undersampling gives better performance in terms of precision and recall. Time series cross-validation is also an effective method for improving the performance of the model. Overall, the resulting model is more useful for explaining churn than for predicting it. With the help of the model, some factors influencing churn, mainly related to product usage, could be identified.

Acknowledgements
First of all, I would like to express my deepest gratitude to my industrial and
academic supervisors, Edouard Flouriot and professor Anja Janssen, respectively.
Their guidance and assistance have been of great value for the completion of this
work. I would also like to thank my colleagues at Aircall, who supported me and my work a lot while integrating me into their team with kindness. Finally, I would like to thank professor Jimmy Olsson for being my examiner, and my classmates, Louis and Pierre, for their continuous reviews and discussions.

Contents

1 Introduction
  1.1 Context and Terminology
  1.2 Research Question
  1.3 Delimitation
  1.4 Outline

2 Literature

3 Theory
  3.1 Classification
    3.1.1 Logistic Regression
    3.1.2 Decision Trees
    3.1.3 Bagging
    3.1.4 Random Forests
  3.2 Model Evaluation
    3.2.1 Confusion Matrix
    3.2.2 ROC Curve and AUC
    3.2.3 Precision - Recall Curve
  3.3 Imbalanced Data
    3.3.1 Cross validation
    3.3.2 Resampling

4 Methods
  4.1 Goal
  4.2 Data Procurement
  4.3 Data Processing
    4.3.1 Flatten temporal data
    4.3.2 Categorical data
    4.3.3 Outliers
  4.4 Dealing with data set imbalance
    4.4.1 Resampling
    4.4.2 Time-series cross-validation
  4.5 Modelling and hyperparameter search
    4.5.1 Modelling
    4.5.2 Classification threshold
  4.6 Evaluation

5 Results and Analysis
  5.1 Exploratory Data Analysis
    5.1.1 Correlation matrix
    5.1.2 Time Series analysis
    5.1.3 Feature distribution
  5.2 Basic Model
    5.2.1 Logistic Regression
    5.2.2 Random Forest
  5.3 Resampling
    5.3.1 Oversampling: Random oversampling
    5.3.2 Oversampling: SMOTE
    5.3.3 Oversampling and Undersampling: SMOTE-Tomek
  5.4 Time series Cross-Validation

6 Discussion
  6.1 Temporal framework
  6.2 Available data
  6.3 Sequential data set

7 Conclusions

Bibliography
Chapter 1

Introduction

In the past years, companies have been able to store and process huge amounts of
data while realizing that being customer-centric was becoming a main requirement
to stand out of the competition. Indeed, due to saturated markets, focusing on
Customer Relationship Management (CRM) in order to retain existing customer
base is not optional anymore, but an absolute necessity for competitive survival. In
its research for Bain and Company, Frederick Reichheld [1] stated that the cost of
acquiring a new customer could be higher than that of retaining a customer by as
much as 700%, and that increasing customer retention rates by a mere 5% could
increase profits by 25% to 95%.

More generally, data-driven decision making is a way for businesses to make sure their next move will benefit both them and their customers. Almost every company, especially in the tech ecosystem, has now put in place a tracking process to gather data related to their customers' behavior. The data to track varies along with the specific business model of each company and the problem their service aims to address. By analyzing how, when and why customers behave a certain way, it is possible to predict their next steps and have time to work on fixing issues beforehand.

Churn prediction is the activity of trying to predict the phenomenon of customer loss. This prediction and quantification of the risk of losing customers can be done globally or individually and is mainly used in areas where the product or service is marketed on a subscription basis. The prediction of churn is generally done by studying consumer behaviour or by observing individual behaviour that indicates a risk of attrition. It involves the use of modelling and machine learning techniques that can sometimes use a considerable amount of data.


These behaviours can be:


• variations in consumption or usage behaviour

• a change to inactive client status or a drop in service usage

• the formulation of a claim (number, frequency and types of claims)

• an increase in consumption leading to a sharp rise in the bill

1.1 Context and Terminology


Aircall is a SaaS (Software As A Service) B2B company on a mission to redefine
the business phone. It is an advanced, cloud-based business phone system and call
center software, all wrapped up in a single tool.

Aircall’s customer base consists in almost 5000 international small businesses and
start-ups, which represent the customers. All those customers can add how many
users they need to their account and assign them to one or more phone numbers.
When the customers subscribe to the service, they choose between several pricing
plans, that each propose a price per user added.
The main specificity and competitive advantage of Aircall is that it can be con-
nected to many other business tools so that each customer can build their own
custom workflows. The connection that is built by a customer between its Aircall
account and any other software that it uses is called an integration. Each customer
can create as many integrations as it wishes. The use of Aircall with integrations is perceived as increasing the customer's adherence to the product.
For now, product usage is not yet precisely tracked on the product side, but a trend can be highlighted by just having a look at the number of calls that are received by the customer (inbound calls), the number of calls made by the customer (outbound calls) and the number of integrations they configured in their Aircall account, as well as the evolution of these metrics over time for a given customer.
Finally, the customers can assess the quality of Aircall's product and service by two different means. First, a form is sent to each of their users every 3 months, in which they can express how likely they would be to recommend the product to someone (ranking from 0 to 10, 0 being the most negative answer). Depending on their grade, they are then qualified as being either a promoter (graded 9 or 10), a detractor (graded from 0 to 6) or neutral. Aircall then computes the Net Promoter Score (NPS), which is calculated by subtracting the percentage of customers who are detractors from the percentage of customers who are promoters. An NPS can be as low as -100 (every respondent is a detractor) or as high as +100 (every respondent is a promoter).
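As a minimal illustration of this formula, the NPS can be computed in a few lines of Python (the response scores below are hypothetical, chosen only for the example):

    def nps(scores):
        """Compute the Net Promoter Score from a list of 0-10 ratings."""
        promoters = sum(1 for s in scores if s >= 9)
        detractors = sum(1 for s in scores if s <= 6)
        return 100 * (promoters - detractors) / len(scores)

    # Hypothetical responses: 4 promoters, 3 detractors, 3 neutrals
    print(nps([10, 9, 9, 10, 3, 5, 6, 7, 8, 8]))  # -> 10.0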

The business teams at Aircall are divided into five groups: Sales, Onboarding, Customer Success, Support and Marketing. They all have their importance at different moments during the customer lifetime. First, the Sales team is in charge of finding potential clients, making sure they are qualified (meaning Aircall's product could actually be useful for them), and signing the deal. Once the targeted company has become a customer, the Onboarding team has to help them configure the product on their own system so that they can have the best experience with it. From the time the company becomes a customer, it becomes the responsibility of the Customer Success team. They are the point of contact between Aircall and their customers, and are the key stakeholders when talking about churn. Indeed, their job is split between trying to upgrade the customers to a better plan, or having them add more users, and preventing them from churning. Finally, the Support team is the point of contact for any kind of technical issue. They can be reached through tickets, chat or phone.

Aircall’s customers are divided into two distinct categories depending on how
much they bring to the company. The ones that represent more than $1K of monthly
recurring revenue are defined as VIP accounts, and the other ones as Mid-Market
accounts. VIP accounts represent less than 10% of Aircall’s total number of cus-
tomers but 50% of total monthly recurring revenue and they are assigned to 70%
of the Customer Success team. All these accounts are carefully cared about so the
Customer Success team successfully manage to prevent their churn. On the contrary,
there are too many Mid-Markets accounts, and too little Customer Success Man-
agers to handle their churn. Human resources are too limited to conduct this work
which is why the company decided to invest in some Data Analysis and Machine
Learning to get insights about the Mid-Market accounts, and give those information
to Customer Success Managers so that they can contact the sensitive customers
before they leave.

1.2 Research Question


Based on the background of the company, the fact that it is the first time that
this type of research is conducted at Aircall and that not much data is currently
available, this thesis focuses on investigating the following research question:

To what extent can statistical analysis and machine learning algorithms be used to highlight churn drivers for a B2B company with little data available, so that churn can be explained?

1.3 Delimitation
In order to narrow down the scope of this thesis and to match Aircall's needs and context, the following delimitations have been made:

• Aircall has been able to gather and store basic customer data since its creation in 2014. However, due to many changes in the business in the first 4 years, and the absence of a Data Team until early 2018, it was highlighted that the data before 2018 would not be relevant enough to be considered in the model. The data used for the analysis is thus historical data gathered from the beginning of 2018, to reflect modern behaviour.

• The analysis is limited to Mid-Market accounts, as specified in Section 1.1, because VIP accounts usually have their own specific behaviour and are already well handled by the business teams.

1.4 Outline
In Chapter 2, relevant literature is reviewed in order to get familiar with the work that has been conducted in this area. Chapter 3 gathers all the theoretical principles that have been used during this research. In Chapter 4, the whole process and methodology of the thesis is detailed to get the full picture of the project, which leads to the results and analysis presented in Chapter 5. Finally, this thesis ends with a discussion of the findings in Chapter 6 and a conclusion in Chapter 7.
Chapter 2

Literature

Customer churn, as a relatively recently developed concept, has been widely studied in the telecommunication, e-retailing and banking industries in recent years. The definition of churn may vary for each organization with regards to the duration that a customer is apart from a company. Many existing studies highlight that, in order to boost customer loyalty and retention, having a customized and differentiated relationship with one's customers is fundamental. To answer this problem, several ways of performing analysis capturing customer behaviors exist.
Among others, RFM (Recency, Frequency, Monetary) is a commonly used frame-
work for interpreting customers’ prior behavioural actions and applied for organiza-
tions. The Pareto/NBD model is then usually performed by optimizing the param-
eters to provide a best fit to a set of RFM data. In RFM, Recency refers to the time
interval between the latest purchase and the determined time period. Frequency
refers to the number of purchases in a particular time period. Finally, Monetary
refers to the amount of total spending in a particular time period. A customer with high frequency, high monetary value and low recency can be defined as a valuable customer. For large data sets, RFM is a good, fast and practical method which helps to identify the most and least valuable customers. Two studies proposed to group customers on each RFM variable by adding weights into the calculation. Hughes [2] suggests that the three RFM variables have equal importance, so weights should be equal, while Stone [3] suggests that the importance of the RFM variables varies among sectors. Stone's suggestion has been supported in some studies. These weights, which sum to 1, are multiplied with each RFM variable to create a score for each customer. The score stands for the value of each customer, where the highest score means loyal and the lowest score means disloyal. Unfortunately, even though the RFM framework is very suitable for B2C companies, it is less intuitive for subscription-based B2B companies, especially SaaS. Indeed, the service is billed monthly, so recency
and frequency are the same for all customers. The only relevant indicator is thus the
Monetary value, which is highlighted by looking at the Monthly Recurring Revenue
(MRR).
To expand the analysis, numerous studies have explored various machine learning algorithms and their potential for modelling churn. Because predicting whether a customer will churn or not is a binary classification problem, multiple models such as logistic regression [4], decision trees [5], random forests, support vector machines and neural networks have been tested. Logistic regression has the advantage of being easy to use and of producing robust results. Therefore, it is often applied as a benchmark for more advanced techniques. However, these waves of studies have usually focused on purely methodological improvements, the research being conducted with the aim of improving the accuracy of customer churn models. According to these studies, some of the methods indeed provided better predictions when tested on a particular data set. However, the improvements are normally limited to an industry or even a company. We cannot observe any generic improvements from simply employing different methods. This shows the necessity to explore not only methodological improvements but also conceptual development [6].
Nevertheless, these studies raised the main challenges encountered when trying to model churn. The main one concerns the fact that the data is usually imbalanced, that is, the proportion of churn cases is small in the overall data set. The most commonly proposed solution is to apply an oversampling method called SMOTE, which creates synthetic samples of the minority class to even out the balance in the data set.
While business-to-customer (B2C) companies, in the telecom sector for instance,
have been making use of customer churn prediction for many years, churn prediction
in the business-to-business (B2B) domain receives much less attention in existing
literature [7]. Some studies have proposed data-mining approaches, comparing logis-
tic regression, decision trees and boosting, and showed that logistic regression could
provide some robust results but was outperformed by the aggregation of decision
trees [8].
Chapter 3

Theory

This chapter presents the theoretical backgrounds of the methods and models used
in the thesis.

3.1 Classification
When applying a model, the response variable Y can be either quantitative or qualitative. The process of predicting qualitative responses involves assigning the observation to a category, or class, and is thus known as classification. The methods used for classification often first predict the probability of each category of a qualitative variable. There exist many classification techniques, named classifiers, that make it possible to predict a qualitative response. In this thesis, two of the most widely used classifiers are discussed: logistic regression and random forests. [9]

3.1.1 Logistic Regression


When it comes to classification, we determine the probability that an observation belongs to a certain class or not. In order to generate values between 0 and 1, we express the probability using the logistic equation:
\[ p(X) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} \]
After a bit of manipulation, we find that:
\[ \frac{p(X)}{1 - p(X)} = \exp(\beta_0 + \beta_1 X) \]

The left side of the equation is called the odds, and can take on any value between 0 and ∞. Values of the odds close to 0 and ∞ indicate respectively very low and
very high probabilities. By taking the logarithm of both sides we arrive at:
\[ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \]
There, the left side is called the log-odds or logit. This function is linear in X.
Hence, if the coefficients are positive, then an increase in X will result in a higher
probability.
The coefficients β0 and β1 in the logistic equation are unknown and must be
estimated based on the available training data. To fit the model, we use a method
called maximum likelihood. The basic intuition behind using maximum likelihood
to fit a logistic regression model is as follows: we seek estimates for β0 and β1 such
that the predicted probability p̂(xi ) of the target for each sample corresponds as
closely as possible to the sample's observed status. In other words, the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize the likelihood function:

\[ \ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \left(1 - p(x_{i'})\right) \]

After applying logistic regression, the accuracy of the coefficient estimates can be
measured by computing their standard errors. Another performance metric is the z-statistic. For example, the z-statistic associated with β1 is equal to $\hat{\beta}_1 / SE(\hat{\beta}_1)$, and so a large absolute value of the z-statistic indicates evidence against the null hypothesis $H_0: \beta_1 = 0$. [10]
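As a minimal sketch of how such a model is fit in practice with scikit-learn (the library used in this thesis), on synthetic stand-in data rather than the actual churn data set:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for an imbalanced churn data set (~5% positives)
    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.95], random_state=0)

    # The coefficients are estimated by maximizing the (regularized)
    # likelihood; C is the inverse of the regularization strength.
    clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
    print(clf.intercept_, clf.coef_)       # estimated beta_0 and beta_1..10
    print(clf.predict_proba(X[:5])[:, 1])  # p(X) for the first five samples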

3.1.2 Decision Trees


Decision trees and random forests are tree-based methods that involve segmenting the predictor space into several simple regions. The mean or mode of the training observations in their own region is used to compute the prediction of a given observation. This process is composed of a succession of splitting rules to segment the space, which mimics the branches of a tree and is referred to as a decision tree. Even if tree-based methods do not compete with more advanced supervised learning approaches, they are often preferred thanks to their simple interpretation. However, it is possible to improve prediction accuracy by combining a large number of trees, at the expense of some loss in interpretation.
To grow a classification tree, we use what is called the classification error rate as
a criterion for making recursive binary splits. The goal is to assign an observation in
a given region to the most commonly occurring class of training observations in that
region. Therefore, the classification error rate is simply the fraction of the training
observations in that region that do not belong to the most common class:
\[ E = 1 - \max_{k}\left(\hat{p}_{mk}\right), \]

where p̂mk is the proportion of observations in the mth region that are from the
kth class. In practice, this criterion is not sensitive enough for growing the trees,
which leads us to two other measures that are usually preferred: the Gini index and
entropy. The Gini index is a measure of total variance across the K classes and is
defined by:
\[ G = \sum_{k=1}^{K} \hat{p}_{mk}\left(1 - \hat{p}_{mk}\right). \]
If all the p̂mk are close to 0 or 1, the Gini index will be small meaning that a small
value of G indicates that a node mainly contains observations from only one class,
which can be characterized as node purity. As was mentioned before, an alternative
to Gini index is entropy. [11]

3.1.3 Bagging
As we mentioned in the previous section, decision trees suffer from high variance,
meaning that if we fit a decision tree to two different subsets of the training data,
we might get quite different results. Bootstrap aggregation, or bagging, is a method
aiming at reducing the variance, and therefore is commonly used when decision
trees are implemented. Given a set of n independent observations $Z_1, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2/n$.
This means that averaging a set of observations reduces variance. Hence, by taking
many training sets from the population, building a separate prediction model using
each training set and averaging the resulting predictions, we can reduce the variance
and consequently increase the prediction accuracy of the method. In particular, we
calculate $\hat{f}^1(x), \hat{f}^2(x), \ldots, \hat{f}^B(x)$ using B separate training sets, and average them in order to obtain a single low-variance statistical model, given by:

\[ \hat{f}_{\mathrm{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{b}(x). \]
However, in most use cases, it might not be possible to access multiple training
sets. That is where the bootstrap method becomes useful. Bootstrap consists in
taking repeated samples from the original training data set. It generates B different
bootstrapped training data sets. Then, the model is fit on the bth bootstrapped training set, resulting in the prediction $\hat{f}^{*b}(x)$. All the predictions are averaged to obtain:

\[ \hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x). \]
Bagging can easily be applied to a classification problem, to predict a qualitative
outcome Y. For a given test observation, each of the B trees predicts a class, and the overall prediction is chosen as the most commonly occurring class among the B predictions. [9]
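A minimal sketch of bagging with scikit-learn's BaggingClassifier, which by default bootstraps the training set and aggregates decision trees by majority vote (synthetic data, for illustration only):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    # B = 100 trees, each fit on a bootstrap sample of the training data;
    # the predicted class is the majority vote over the 100 trees.
    bag = BaggingClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(bag.predict(X[:5]))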

3.1.4 Random Forests


The main drawback when bagging several decision trees is that the trees are correlated. Random forests provide a way to fix this issue while still using individual trees as building blocks. A random forest builds multiple decision trees using bootstrapped samples from the training data, each tree having high variance, and averages these trees, which reduces variance. To prevent correlation between the trees, it randomly selects a subset of variables to use for each tree, whose size is usually set to m = √N, N being the total number of features. Correlation is avoided because each tree does not consider all variables, but only subsets of them. The problem of overfitting is also addressed by this technique. The main disadvantage of using multiple trees is that it lowers the interpretability of the model [12]. The Gini index, presented in Section 3.1.2, can also be used here in order to measure feature importance. The depth of a feature used for a split can also be used to indicate the importance of a given feature. That is, as intuition confirms, features used at the top splits of a tree will influence the final predicted observations more than features used for splits at the bottom of the tree. [9]
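A minimal sketch with scikit-learn, where max_features="sqrt" implements the m = √N rule and feature_importances_ exposes the Gini-based importances mentioned above (synthetic data, for illustration only):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=16, random_state=0)

    # m = sqrt(N) candidate features are drawn at each split, which
    # decorrelates the individual trees.
    rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0).fit(X, y)
    print(rf.feature_importances_)  # Gini-based feature importances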

3.2 Model Evaluation


3.2.1 Confusion Matrix
The confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. It is a table with the four different combinations of predicted and actual values. [13]
The labels TP, FP, FN and TN respectively refer to True Positive, False Positive,
False Negative and True Negative and have the following interpretation:

• True Positive (TP): Observation is positive, and is predicted to be positive.

• False Positive (FP or Type I Error): Observation is negative, but is predicted to be positive.

• False Negative (FN or Type II Error): Observation is positive, but is predicted to be negative.

• True Negative (TN): Observation is negative, and is predicted to be negative.



Figure 3.1: Confusion Matrix

Given the values of these four entries, several other metrics can be derived,
namely accuracy, recall, precision and F-score.

Accuracy

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Accuracy computes the number of correctly classified items out of all classified items.

Recall

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
Recall measures how many of the actual positive cases the model predicted correctly. It should be as high as possible, as a high recall indicates that the positive class is correctly recognized (small number of FN). It is usually used when the goal is to limit the number of false negatives.

Precision

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
Precision measures, out of all the cases predicted as positive, how many are actually positive. High precision indicates that an example labeled as positive is indeed positive (small number of FP). It is usually used when the goal is to limit the number of false positives. A model with high recall but low precision means that most of the positive examples are correctly recognized (low FN) but that there are a lot of false positives.

F-score

\[ \text{F-score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \]
F-score is a way to represent precision and recall at the same time and is therefore
widely used for measuring model performances. Indeed, if you try to only optimize
recall, the algorithm will predict most examples to belong to the positive class, but
that will result in many false positives and, hence, low precision. On the other hand,
optimizing precision will lead the model to predict very few examples as positive
results (the ones with highest probability), but recall will be very low. [14]
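All four metrics can be checked against scikit-learn on a small example (the labels below are made up purely for illustration):

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 f1_score, precision_score, recall_score)

    y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # hypothetical ground truth
    y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]  # hypothetical predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tn, fp, fn, tp)                   # 5 1 1 3
    print(accuracy_score(y_true, y_pred))   # (TP+TN)/total = 0.8
    print(precision_score(y_true, y_pred))  # TP/(TP+FP)    = 0.75
    print(recall_score(y_true, y_pred))     # TP/(TP+FN)    = 0.75
    print(f1_score(y_true, y_pred))         # harmonic mean = 0.75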

3.2.2 ROC Curve and AUC


The AUC - ROC curve is a more visual way to measure the performance of a classifier at various threshold settings. The ROC (Receiver Operating Characteristic) curve is created by plotting the recall against the false positive rate (FPR), defined by:

\[ \mathrm{FPR} = \frac{FP}{FP + TN} \]

Figure 3.2: Example of ROC Curve

AUC (Area Under the Curve) represents the degree of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate. The dotted diagonal represents the 'no information' classifier, which is what we would expect from a 'random guessing' classifier. [14]

3.2.3 Precision - Recall Curve


The precision-recall curve illustrates the trade-off between precision and recall that was mentioned in the previous sections. As with the ROC curve, each point on the plot corresponds to a different threshold. A threshold equal to 0 implies that the recall is 1, whereas a threshold equal to 1 implies that the recall is 0. With this curve, the closer it is to the top right corner, the better the algorithm; hence, a larger area under the curve indicates that the algorithm has both higher recall and higher precision. In this context, the area is known as the average precision. [14]

Figure 3.3: Example of Precision-Recall Curve
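Both curves and the corresponding areas (AUC and average precision) can be computed from predicted probabilities; a minimal sketch on synthetic imbalanced data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (average_precision_score, roc_auc_score,
                                 precision_recall_curve, roc_curve)
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    probs = (LogisticRegression(max_iter=1000)
             .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

    fpr, tpr, _ = roc_curve(y_te, probs)                # points of the ROC curve
    prec, rec, _ = precision_recall_curve(y_te, probs)  # points of the PR curve
    print(roc_auc_score(y_te, probs))                   # area under the ROC curve
    print(average_precision_score(y_te, probs))         # area under the PR curve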

3.3 Imbalanced Data


Imbalanced data refers to a situation where one of the classes forms a high majority and dominates the other classes. This type of distribution might cause an accuracy bias in machine learning algorithms and prevent the performance of the model from being evaluated correctly. Indeed, suppose there are two classes, A and B. Class A makes up 90% of the data set and class B the other 10%, but it is most interesting to identify instances of class B. Then, a model that always predicts class A will be successful 90% of the time in terms of basic accuracy. However, this is impractical for the intended use case because the costs of false positive (Type I Error) and false negative (Type II Error) predictions are not equal. Instead, a properly calibrated method may achieve a lower accuracy, but would have a substantially higher true positive rate (or recall), which is really the metric that should be optimized. There are a couple of solutions to the imbalanced data problem, but only the ones that were tested during this project will be mentioned.

3.3.1 Cross validation


Train - test split
In order to estimate the test error associated with fitting a particular model on a set of observations, it is very common to perform what is called a hold-out method. In this method, the dataset is randomly divided into two sets: a training set and a test/validation set, i.e. a hold-out set. The model is then trained on the training set and evaluated on the test/validation set. This method is only used when there is only one model to evaluate and no hyper-parameters to tune. If one wants to compare multiple models and tune their hyper-parameters, another form of the hold-out method is used. It involves splitting the data into not two but three separate sets: the training set is further divided into a training and a validation set, so the original dataset is divided into training, validation and test sets, as shown in Figure 3.4. This approach is conceptually simple and easy to implement, but its main drawback is that the evaluation of the model depends heavily on precisely which observations are included in the training set and which observations are included in the test set.

Figure 3.4: Hold-out method

k-fold Cross-Validation
Cross-validation is a refinement of the train-test split approach that addresses the
issue highlighted in the previous subsection. This approach consists in randomly
dividing the set of observations into k groups, or folds, of approximately equal size.
The first fold acts as a test set, and the method is fit on the remaining k − 1
folds. The mean squared error is then computed on the observations of the test
set. The overall procedure is computed k times with each time a different subset of
observations taking the role of the test set. As a result, the test error is estimated
k times and, by averaging these values, it gives the k-fold cross-validation (CV) estimate:

\[ \mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i \]
The advantage of cross-validation is that it can be applied to almost any statistical
learning method. For computationally intensive models, one typically performs k-
fold CV using k = 5 or k = 10 which respectively requires fitting the learning
problem only five and ten times.
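A minimal sketch of k-fold cross-validation with scikit-learn (k = 5, synthetic data for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)

    # Each of the 5 folds serves once as the held-out test set; the mean
    # of the 5 scores is the cross-validation estimate.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores, scores.mean())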

3.3.2 Resampling
Random Oversampling
A simple way to fix an imbalanced data set is to balance it, either by oversampling instances of the minority class or undersampling instances of the majority class. Random resampling provides a naive technique for rebalancing the class distribution of an imbalanced data set. In particular, random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training data set. It is referred to as a 'naive resampling' method because it assumes nothing about the data and no heuristics are used. This makes it simple to implement and fast to execute, which is desirable for very large and complex data sets.
data sets. Importantly, the change to the class distribution is only applied to the
training data set. The intent is to influence the fit of the models. The resampling is
not applied to the test data set used to evaluate the performance of a model. How-
ever, in some cases, seeking a balanced distribution for a severely imbalanced data
set can cause affected algorithms to overfit the minority class, leading to increased
generalization error. The effect can be better performance on the training data-set,
but worse performance on the test data set. [15]

SMOTE
To avoid the overfitting problem, the Synthetic Minority Over-sampling Technique
(SMOTE) is proposed. This method generates synthetic data based on the feature
space similarities between existing minority instances. In order to create a synthetic
instance, it finds the K-nearest neighbors of each minority instance, randomly selects
one of them, and then calculates linear interpolations to produce a new minority
instance in the neighborhood. In other words, to create a synthetic data point, it takes the vector between one of those k neighbors and the current data point, multiplies this vector by a random number which lies between 0 and 1, and adds the result to the current data point to create the new synthetic data point [16]. In other words, the synthetic sample $x_{\mathrm{new}}$ is generated by interpolating between $x$ and $\tilde{x}$ as follows:

\[ x_{\mathrm{new}} = x + \mathrm{rand}(0, 1) \times (\tilde{x} - x), \]

where rand(0, 1) refers to a random number between 0 and 1.
For most basic versions of SMOTE, there are two parameters that can be ad-
justed. The first is the SMOTE multiplier m. The second parameter is the number of nearest neighbors to use, k. In the original SMOTE paper, the five nearest neighbors were used, and one to five of those nearest neighbors are randomly selected depending upon the amount of oversampling desired. [17]

Figure 3.5: Visualization of SMOTE

From a geometric point of view, SMOTE can be regarded as interpolating between two minority class samples. The decision space for the minority class is thereby expanded, which allows the classifier to make better predictions on unknown minority class samples. In Figure 3.5, SMOTE is applied to generate synthetic samples for the minority class sample x1. The k (k = 5) nearest minority class samples of x1 are denoted by x2, x3, x4, x5 and x6.
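In practice, SMOTE is available in the imbalanced-learn library used in this thesis; a minimal sketch on synthetic data (the 0.5 ratio is an arbitrary choice for illustration):

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
    print(Counter(y))  # heavily imbalanced before resampling

    # k_neighbors=5 as in the original SMOTE paper; sampling_strategy=0.5
    # oversamples the minority class up to half the majority class size.
    X_res, y_res = SMOTE(k_neighbors=5, sampling_strategy=0.5,
                         random_state=0).fit_resample(X, y)
    print(Counter(y_res))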

SMOTE - Tomek Link


SMOTE - Tomek Link is a combination pre-processing method whereby two algo-
rithms are applied back-to-back. For a data-set with some target class, a Tomek link
is a pair of examples that are nearest neighbors of one another and have different
target class labels. Examples that belong to Tomek link pairs are likely to be either
noise points or points that lie close to the optimal decision boundary. Removing
these points can result in more well-defined class clusters in the training data, which
can lead to better classifiers. In Tomek link undersampling, only the majority class
example in each Tomek link pair is removed. There are two reasons for this. First, in
an imbalanced data set, the minority class examples may be too valuable to waste,
especially if the minority class is underrepresented. Second, recall on the minority
class is frequently of greater importance than precision, so it is worth inadvertently
retaining some noisy data in order to avoid losing rare cases. The two methods are
combined by first applying SMOTE in order to generate synthetic minority class samples, and subsequently applying Tomek link undersampling to the data set composed of both the original and the new, synthetic observations.
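The combined method is implemented in imbalanced-learn as SMOTETomek; a minimal sketch on synthetic data:

    from collections import Counter
    from imblearn.combine import SMOTETomek
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

    # SMOTE first generates synthetic minority samples; Tomek links are
    # then removed from the combined data set.
    X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
    print(Counter(y), Counter(y_res))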
Chapter 4

Methods

The work conducted in this thesis follows the typical framework of a machine learning classification problem. However, because the business problem had not been explored at all before starting the thesis, there were many iterations in the process, following the insights and findings that emerged. The analysis was entirely carried out in Python, through Jupyter Notebook, using the Python libraries numpy, pandas, scikit-learn and imbalanced-learn.

4.1 Goal
The churn model intends to provide Aircall with the level of risk a customer has of churning, so that the Customer Relations team can contact them and implement an action plan to retain them. The objective of this thesis is to create an acceptable churn prediction model in order to highlight effective levers that Aircall could implement to significantly reduce its churn rate. Therefore, perfect accuracy of the model is not what is aimed at; a model producing a satisfactory recall will be enough to give some first insights.

4.2 Data Procurement


Because it was crucial for the company to get correct and unbiased insights from
this model, it was mandatory to take some time at the beginning of the thesis to
grasp the core of Aircall’s product, as well as the operation of Customer Relations
teams.
As a recent B2B company, Aircall's customer database is not very rich. With about 5000 customers today, having very specific behaviours and of which around 3% churn, gathering more data points is mandatory to build a model that would give satisfying results. There is not a single simple rule to determine whether the collected data is enough to perform the chosen machine learning models, but given the specificity of Aircall's business and the strong imbalance, it was decided to find a way to get more data samples. Aircall's customer data is gathered and stored every day, and aggregated per month to get an overview of each customer's characteristics while they are still subscribers. Because in the telecommunications industry the behaviour of a customer is not very volatile, monthly data is preferred.

Table 4.1: Example of an extract of the raw data set

extract date   company id   nb users   plan           ...
2019-12-31     1            10         essentials     ...
2020-01-31     1            12         essentials     ...
2020-02-28     1            20         professional   ...

One specificity of the data is that, while some features are static (customer region
for example), the majority can vary over time. Dealing with these features requires creating a temporal framework. As seen in the literature, and in line with the studied business problem, two non-overlapping time frames are created [18][19]. First, an observation period for each customer ought to be specified. This observation window
is the period in which customer activity is gathered. In order to amass sufficient data, for each customer, the observation window consists of the 3 months before each extract date. The reasoning behind this is that customers exhibit a certain behaviour during the quarter before they eventually churn. A more restrictive period could be affected by seasonality or some specificity of the customer's industry.
In order to handle the temporal features, several approaches have been studied
in literature. One could build a sequential data set, and perform adapted models on
it, such as a Long Short-Term Memory (LSTM) or a Recurrent Neural Network (RNN). However, these methods tend to perform very poorly on small data sets, which makes them unsuitable in the context of this thesis. The chosen method consists in flattening the non-static features, turning each time step into a different feature for the complete observation period. With each sequence consisting of 3 time steps, each described by several features, the flattened version of each observation has 3 times the initial number of temporal features, which can then be considered independent, neglecting the time dimension.
Following the same idea, an activation window of 1 month is chosen to assign churn labels. A customer is labeled as a churner at a specific extract month if it churned in the following month. This time step was chosen in agreement with the Customer Success team. Indeed, for a given customer predicted as a churner, they need some time to start a retention process, but widening the activation window to more than 1 month would prevent observing the behaviour of the customer in the months right before it churns, losing some significant information.
To sum up, for each month from early 2018 to now, we consider the batch of customers that were still subscribers that specific month, every batch having its own observation and activation period.

Figure 4.1: Temporal framework used for churn classification
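A minimal pandas sketch of the 1-month activation window (the column names, dates and schema are hypothetical, not Aircall's actual data model):

    import pandas as pd

    # Hypothetical monthly snapshots: one row per (customer, extract month)
    df = pd.DataFrame({
        "company_id":   [1, 1, 1, 2, 2],
        "extract_date": pd.to_datetime(["2019-12-31", "2020-01-31",
                                        "2020-02-29", "2019-12-31",
                                        "2020-01-31"]),
        "churn_date":   pd.to_datetime(["2020-03-15", "2020-03-15",
                                        "2020-03-15", None, None]),
    })

    # Label a snapshot as churn if the customer churned within the month
    # following the extract date (the 1-month activation window).
    horizon = df["extract_date"] + pd.DateOffset(months=1)
    df["churn_label"] = ((df["churn_date"] > df["extract_date"]) &
                         (df["churn_date"] <= horizon)).astype(int)
    print(df)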

From the data collected each month for each customer, the features are selected
using an iterative approach. Starting from a dataset containing some basic charac-
teristics of the customers, some more actionable features have been added in order
to see their influence on the model and their influence on churn. Also, in order to
reduce the risks of multicollinearity, features that had no correlation with the target
as well as redundant features were removed by looking at the distributions of the
features, independently and in relation with the target, and their correlations. In
the end, the dataset consisted of features including information about:

• customer attributes: billing period, region, stage (time since subscription), time to close the deal, whether the customer has been onboarded

• product usage: number of users, number of phone numbers per user, number of inbound calls, number of outbound calls, missed inbound calls rate, number of integrations used

• customer satisfaction: call quality, NPS

• evolution during the rolling quarter: for the product usage metrics



4.3 Data Processing


4.3.1 Flatten temporal data
When flattening the temporal data as described in the previous section, it is important to still reflect its evolution compared to the values at the extract month. For instance, for each numerical feature, a new feature has been added by computing the percentage of growth between month i − n and month i. For categorical features, the historical value reflects whether the value changed between month i − n and month i.
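A minimal pandas sketch of this flattening for a single hypothetical usage feature (column and value names are made up for illustration):

    import pandas as pd

    # Hypothetical monthly usage for one customer
    df = pd.DataFrame({"month": [1, 2, 3, 4],
                       "nb_calls": [100, 120, 90, 110]})

    # Flatten the 3-month observation window into lagged columns ...
    for n in (1, 2):
        df[f"nb_calls_m{n}"] = df["nb_calls"].shift(n)

    # ... and add the percentage of growth between month i-2 and month i
    df["nb_calls_growth"] = (df["nb_calls"] - df["nb_calls_m2"]) / df["nb_calls_m2"]
    print(df)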

4.3.2 Categorical data


The categorical variables are handled by creating dummy variables, where k − 1
binary variables are created given a feature with k classes. The created number of
dummies is one less than the number of classes in order to avoid multicollinearity.
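A minimal sketch with pandas, where drop_first=True yields the k − 1 dummies (the region values are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"region": ["EMEA", "NA", "APAC", "EMEA"]})

    # k - 1 binary columns for a k-class feature, avoiding multicollinearity
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
    print(dummies)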

4.3.3 Outliers
Machine learning algorithms are very sensitive to the range and distribution of data
points. Data outliers can deceive the training process resulting in longer training
times and less accurate models. The outliers were detected by looking at boxplots
and performing extreme value analysis. The key of extreme value analysis is to
determine the statistical tails of the underlying distribution of the variable and find
the values at the extremes of the tails. As the variables are not normally distributed, a general approach is to calculate the quantiles and then the inter-quartile range. If a data point is above the upper boundary or below the lower boundary, it can be considered an outlier. Because the studied data set is already small, removing samples with outliers would waste even more data and was not considered. Instead, the extreme value is replaced by the mean value, or the most represented value, of the related feature.
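A minimal sketch of this inter-quartile rule, here replacing the detected extreme value with the mean of the remaining observations (the values are hypothetical):

    import pandas as pd

    s = pd.Series([3.0, 4.0, 5.0, 4.0, 6.0, 5.0, 250.0])  # one clear outlier

    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Replace values outside the boundaries with the mean of the inliers
    mask = (s < lower) | (s > upper)
    s[mask] = s[~mask].mean()
    print(s)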

4.4 Dealing with data set imbalance


4.4.1 Resampling
As described in Section 3.3, training any model on unresampled data that is very imbalanced will ultimately lead to poor or biased results. That is why the resampling methods covered in Section 3.3.2 are used on all the monthly training sets. Because the size of the data set is already limited, it was decided that pure undersampling methods would not be explored, as they would result in an even smaller amount of available churn observations. Studies have shown that perfectly balanced training sets do not necessarily lead to optimal performance, so other class distributions are experimented with, using different partitions between the two target classes. For the three oversampling methods tested, the class proportions of the training sets resulting from oversampling are given in Table 4.2.

Table 4.2: Class proportions of the training set after resampling

                class proportions
Churn           0.2   0.3   0.4   0.5
Do not churn    0.8   0.7   0.6   0.5
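With imbalanced-learn, these class proportions map to the sampling_strategy parameter, defined as the minority/majority ratio after resampling; a hedged sketch on synthetic data (proportions 0.2, 0.3, 0.4 and 0.5 correspond approximately to ratios 0.25, 0.43, 0.67 and 1.0):

    from collections import Counter
    from imblearn.over_sampling import RandomOverSampler
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, weights=[0.97], random_state=0)

    # One resampled training set per target class distribution of Table 4.2
    for ratio in (0.25, 0.43, 0.67, 1.0):
        X_res, y_res = RandomOverSampler(sampling_strategy=ratio,
                                         random_state=0).fit_resample(X, y)
        print(ratio, Counter(y_res))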

4.4.2 Time-series cross-validation


In order to train and evaluate the model, the simplest procedure is to split the data
set into two parts. The first part is used for training a model and the second part
for testing the model. These two parts are respectively named train set and test set,
in consistency with their function.
This process is highly effective if the collected data set is very large and repre-
sentative of the problem. However, because churn prediction is a temporal use case,
train - test split needs to be carefully done to be operationally valuable. Also, there
is not enough data to get an unbiased estimate of performance, especially when
using the typical train-test split evaluation of the models. In other words, when
considering a small data set, it can be very likely that one subset of the data (let’s
say the train set) contains only samples with the same label. Therefore, the model
will tend to give a better prediction on training data but fail to generalize on test
data, leading to a low training error rate but a high test error rate.
To fix this caveat, the most used model evaluation scheme for classifiers is the k-
fold cross-validation procedure. It is a popular method as it generally results in a less
biased or less optimistic estimate of the model skill than other methods, especially
a simple train-test split. Performed with a value of k = 10, it has been shown to be
effective across a wide range of data set sizes and model types. However, k-fold cross-validation is not appropriate for evaluating imbalanced classifiers, on which resampling strategies have to be applied (see Section 3.3.2) [20]. The reasoning behind this is that the data is split into k folds with a uniform probability distribution. Data with a balanced class distribution would not be affected by it, but when the distribution is severely altered, it is likely that one or more folds will have few or no samples from the minority class. This would lead to a model only predicting the majority class correctly, which is not of interest in a churn business case.
When dealing with time series data, traditional cross-validation should not be used, for two major reasons. The first one refers to temporal dependencies. Indeed, when splitting time series data, it often happens that information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and can then produce overly optimistic, if not invalid, estimates of the performance of the model being constructed. In order to accurately simulate a real-world forecasting environment, in which we stand in the present and forecast the future (Tashman 2000), all withheld data must represent events occurring chronologically after the events used for fitting the model. Therefore, instead of using k-fold cross-validation, hold-out cross-validation is preferred for time series data. A subset of data is reserved for validating the model performance. The test set always comes chronologically after the training set, and the validation set always comes chronologically after the training subset. In order not to choose the test set arbitrarily, which may lead to a poor estimate of the test error, the method of nested cross-validation can be used. Nested cross-validation contains an outer loop for error estimation and an inner loop for parameter tuning. In the inner loop, the training set is split into a training subset and a validation set, the model is trained on the training subset, and the parameters that minimize the error on the validation set are chosen. In the outer loop, the dataset is split into multiple different training and test sets, and the error on each split is averaged in order to compute a robust estimate of the model error. In the case of time series data, we use a method based on forward-chaining, also referred to in the literature as rolling-origin evaluation [21], on a monthly basis. It consists in successively considering each month as the test set and assigning all previous data to the training set. For example, if the data set had six months, it would produce four different training and test splits, as can be seen in Figure 4.2. This method produces several train-test splits, and the error on each split is averaged in order to compute a robust estimate of the test error.

Figure 4.2: Time-series Nested Cross-Validation on a subset of 6 months
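Forward-chaining splits of this kind can be produced with scikit-learn's TimeSeriesSplit, assuming the monthly batches are sorted chronologically; a minimal sketch reproducing the six-month example:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    months = np.arange(6)  # six monthly batches, in chronological order

    # Each split tests on one later month and trains on all earlier ones
    # (4 splits for 6 months, as in Figure 4.2).
    for train_idx, test_idx in TimeSeriesSplit(n_splits=4,
                                               test_size=1).split(months):
        print("train:", months[train_idx], "test:", months[test_idx])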

4.5 Modelling and hyperparameter search


4.5.1 Modelling
Two different classification models are fit on each of the training sets created by the cross-validation procedure. For each of these models, there exist sets of hyperparameters that ought to be tuned. In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. In contrast, the values of the parameters are derived via training. Depending on the trained model, the importance of these hyperparameters varies with respect to performance. For the logistic regression model, the regularization strength λ will be tuned. The documentation of the random forest in scikit-learn indicates that the most important settings are the number of trees in the forest (n_estimators) and the number of features considered for splitting at each node (max_features), which are the ones that will be tuned using grid search.
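A minimal sketch of such a grid search, used here with forward-chaining splits as the inner tuning loop (the parameter values are illustrative, not the ones retained in the thesis):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

    X, y = make_classification(n_samples=1000, random_state=0)

    param_grid = {"n_estimators": [100, 200, 500],
                  "max_features": ["sqrt", "log2"]}

    # Inner loop of the nested procedure: tune hyperparameters on
    # chronologically ordered folds, optimizing recall.
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          cv=TimeSeriesSplit(n_splits=4), scoring="recall")
    search.fit(X, y)
    print(search.best_params_)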

4.5.2 Classification threshold


Because the data is imbalanced, with less than 4% of churn, the default value of
0.5 of the probability threshold needs to be altered. The precision-recall and ROC
curves are useful to define where to set the decision threshold of the model to
maximize either sensitivity or specificity. By plotting the precision and recall scores
as functions of the decision threshold, it is possible to view the trade-off between
the two and determine the best classification threshold.
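A minimal sketch of moving the threshold away from the default 0.5 using the precision-recall curve (here the threshold maximizing the F1 score is picked, one possible criterion among others):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.96], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    probs = (LogisticRegression(max_iter=1000)
             .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    precision, recall, thresholds = precision_recall_curve(y_te, probs)

    # F1 at every candidate threshold (the last precision/recall point
    # has no associated threshold, hence the [:-1]).
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    best = thresholds[np.argmax(f1)]
    y_pred = (probs >= best).astype(int)  # classify with the tuned threshold
    print("threshold:", best)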

4.6 Evaluation
The models will be assessed with the metrics presented in Section 3.2: accuracy, precision, recall, F-score, AUC and precision-recall curves. However, the focus will be put on producing a satisfying recall score and then aiming for as high a precision score as possible. This reasoning is based on the studied business problem. Indeed, Aircall's main motivation is to detect the maximum number of churners, that is, to have a low number of false negatives. On the other hand, it is much more acceptable to classify non-churners as churners, given that the action taken towards expected churners would only result in an extra phone call. Therefore, having a low precision, even if not optimal, is much less important than maximizing the recall.
Chapter 5

Results and Analysis

In this chapter, the results of the different models and methods presented in Chapter 4 are detailed and analyzed. These results consist of different evaluation metrics as well as some plots that give a more thorough understanding of the results. Because the focus of the thesis was put on handling a very imbalanced small data set, the results are presented in order to highlight the effect of the processing techniques, more than the choice of the model itself. Therefore, the first part gives the performance of the two studied models (Logistic Regression and Random Forest) without any resampling technique. The second part highlights the effect of oversampling and undersampling methods on the model, and the final part presents the effects of time-series cross-validation.

5.1 Exploratory Data Analysis


5.1.1 Correlation matrix
The correlation matrix is computed with the original features. The coefficient is the Pearson correlation coefficient, which is a measure of the linear correlation between two variables X and Y. By the Cauchy–Schwarz inequality it has a value between −1 and +1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. Features that had a correlation coefficient larger than 0.4 were removed or rebuilt. The initial correlation matrix is shown in Figure 5.1.
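A minimal sketch of this screening step with pandas (synthetic features; the 0.4 cut-off is the one used above):

    import pandas as pd
    from sklearn.datasets import make_classification

    X, _ = make_classification(n_samples=500, n_features=6, random_state=0)
    df = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

    # Pearson correlation matrix; flag feature pairs with |r| > 0.4
    corr = df.corr(method="pearson")
    high = (corr.abs() > 0.4) & (corr.abs() < 1.0)
    print(corr.round(2))
    print(high)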


Figure 5.1: Correlation matrix - Initial features

5.1.2 Time Series analysis


The analysis is made on the time-variant features: number of inbound and outbound
calls, number of users and numbers, missed-call rate and number of integrations.
The line plots of these temporal features are shown in Figure 5.2. Trend and
seasonal components can be highlighted, mainly for the features related to the
number of calls. Indeed, fewer business calls are usually made and received during
the Christmas and summer breaks. The number of calls, the number of users and
numbers, and the number of integrations all show an increasing trend over the months.
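
Such trend and seasonal components can be surfaced with a classical seasonal decomposition. The sketch below uses statsmodels on a hypothetical monthly call-volume series and is only meant to illustrate the technique:

    import matplotlib.pyplot as plt
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Hypothetical monthly series of outbound calls over 36 months:
    # a linear trend plus dips during summer and Christmas breaks.
    idx = pd.date_range("2017-01-01", periods=36, freq="MS")
    calls = pd.Series(
        [100 + 3 * i - (20 if m in (7, 8, 12) else 0)
         for i, m in enumerate(idx.month)],
        index=idx)

    result = seasonal_decompose(calls, model="additive", period=12)
    result.plot()
    plt.show()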

Figure 5.2: Time series analysis of a batch of temporal features



5.1.3 Feature distribution


The distributions of 6 major features are shown in Figure 5.3, along with their
churn rate breakdown. The figure shows the most relevant features. Indeed, except
for the number of integrations, all the features have an almost uniform
distribution, but with a churn rate that varies considerably along the
distribution. These features can thus be considered good explanatory features.
Further modelling will show that they are also of great importance for churn
prediction.

Figure 5.3: Distributions of a batch of features - Up: Churn rate breakdown - Down:
Total number of samples (in blue) and total number of samples with target = 1 (in
red)

5.2 Basic Model


5.2.1 Logistic Regression
The following model is a basic logistic regression model fit on a training set
resulting from a simple stratified train-test split. No resampling technique is
implemented.
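
For reference, a minimal sketch of this baseline, using synthetic data with a churn rate comparable to the real one (about 4%):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Synthetic data mimicking the ~4% churn rate of the real set.
    X, y = make_classification(n_samples=5000, weights=[0.96], random_state=0)

    # A stratified split keeps the churn proportion identical in
    # the training and test sets.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))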

Confusion Matrix

Figure 5.4: Confusion matrix - Logistic Regression - No resampling

Figure 5.5: Precision, Recall, F1 - Logistic Regression - No resampling

The model has an accuracy of 99.55%, but the confusion matrix shows that none of
the true churners have been detected.
As intuitively presumed, the model is strongly biased towards the majority class,
and therefore always predicts the negative class in order to reach the highest
accuracy.

5.2.2 Random Forest


As explained in Section 3.1.4, Random Forest is an ensemble bagging technique in
which several decision trees are combined to give the result. The process is a
combination of bootstrapping and aggregation. The main idea is that many
high-variance, low-bias trees combine into a low-bias, low-variance random
forest. Since the prediction is distributed over different trees, each tree
seeing a different subset of the data, the model is less prone to overfitting
than logistic regression. In this section, the confusion matrix and the
precision, recall and F1 scores are presented for a random forest model fit on
data that was not resampled.
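
A corresponding baseline sketch for the random forest, again on synthetic imbalanced data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.96], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Each of the 500 trees is fit on a bootstrap sample of the
    # training data (bagging); Scikit-Learn aggregates the trees by
    # averaging their predicted class probabilities.
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X_tr, y_tr)
    print(confusion_matrix(y_te, rf.predict(X_te)))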

Figure 5.6: Confusion matrix - Random Forest - No resampling

Figure 5.7: Precision, Recall, F1 - Random Forest - No resampling

Because of the class imbalance, using a random forest model, even if it tends to
overfit less than a linear model, is not sufficient to prevent the bias towards
the majority class.

5.3 Resampling
5.3.1 Oversampling: Random oversampling
The performances of the logistic regression and random forest models trained on the
oversampled data set (using random oversampling) are displayed in the following.

Figure 5.8: Confusion matrix - Logistic Regression (left) Random Forest (right) -
Random Oversampling

Figure 5.9: Precision, Recall, F1 - Logistic Regression (left) Random Forest (right)
- Random Oversampling

Random oversampling helps increase recall but gives almost zero precision, which
leads to a low F1-score. There is effectively no trade-off anymore: recall simply
becomes the performance metric that is maximized. It can be noticed that logistic
regression performs a bit better than random forest in terms of F1-score, but
random forest finds a better balance between precision and recall.

Figure 5.10: ROC Curve - Logistic Regression (left) Random Forest (right) - Random Oversampling
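
For reference, the resampling step itself is straightforward with the imbalanced-learn library; the sketch below uses synthetic data and resamples only the training set:

    from imblearn.over_sampling import RandomOverSampler
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.96], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Only the training set is resampled; the test set keeps the
    # real class proportions so that the evaluation stays honest.
    X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    print(classification_report(y_te, clf.predict(X_te)))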

5.3.2 Oversampling: SMOTE


The performances of the logistic regression and random forest models trained on
the oversampled data set (using the SMOTE method), for each of the class
proportions, are displayed in the following.

Figure 5.11: Confusion matrix - Logistic Regression (left) Random Forest (right) -
SMOTE

SMOTE performs in much the same way as random oversampling: logistic regression
maximizes recall and random forest maximizes precision, without either being able
to find a good trade-off between the two. There is no clear improvement in
performance compared to random oversampling.

Figure 5.12: Precision, Recall, F1 - Logistic Regression (left) Random Forest (right)
- SMOTE

Figure 5.13: ROC Curve - Logistic Regression (left) Random Forest (right) - SMOTE
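
The only change with respect to the previous sketch is the resampler; a minimal SMOTE version, again on synthetic data:

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.96], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # SMOTE interpolates between a minority sample and one of its
    # k nearest minority neighbours instead of duplicating rows.
    X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_tr, y_tr)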

5.3.3 Oversampling and Undersampling: SMOTE-Tomek


The performances of the logistic regression and random forest models trained on
the oversampled and undersampled data set (using the SMOTE-Tomek method), for
each of the class proportions, are displayed in the following.

Figure 5.14: Confusion matrix - Logistic Regression (left) Random Forest (right) -
SMOTE-Tomek

Figure 5.15: Precision, Recall, F1 - Logistic Regression (left) Random Forest (right)
- SMOTE-Tomek

Figure 5.16: ROC Curve - Logistic Regression (left) Random Forest (right) -
SMOTE-Tomek
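
A minimal sketch of the combined method with imbalanced-learn, on synthetic data:

    from imblearn.combine import SMOTETomek
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.96], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # SMOTE first oversamples the minority class; samples forming
    # Tomek's links are then removed to clean the class boundary.
    X_res, y_res = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)
    rf = RandomForestClassifier(random_state=0).fit(X_res, y_res)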

5.4 Time Series Cross-Validation


The influence of the time series cross-validation is presented in this section.
Time series cross-validation is applied to a random forest resampled using the
SMOTE-Tomek method. Random forest is chosen for its interpretability and
simplicity, because feature importance can easily be analysed, and because it
produced slightly less biased results than logistic regression.
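
A sketch of this setup, using Scikit-Learn's TimeSeriesSplit on synthetic data assumed to be ordered chronologically; note that resampling is applied inside each fold, on past observations only:

    from imblearn.combine import SMOTETomek
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import recall_score
    from sklearn.model_selection import TimeSeriesSplit

    # Synthetic data; rows are assumed ordered chronologically.
    X, y = make_classification(n_samples=3000, weights=[0.96], random_state=0)

    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        # Resample only the past observations used for training; the
        # future fold is evaluated on its true, imbalanced distribution.
        X_res, y_res = SMOTETomek(random_state=0).fit_resample(
            X[train_idx], y[train_idx])
        rf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
        print(recall_score(y[test_idx], rf.predict(X[test_idx])))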

Figure 5.17: Confusion matrix - Random Forest - Cross Validation + SMOTE-Tomek

Figure 5.18: ROC Curve and Precision-Recall Curve - Random Forest - Cross Validation + SMOTE-Tomek

Figure 5.19: Feature Importance (left) and Probability distribution (right) - Random Forest - Cross Validation + SMOTE-Tomek

The feature importance graph highlights that three features predominantly drive
the prediction model: the number of users, the number of integrations and the
call quality. The probability plot is also interesting to look at. Even if the
performance of the model in terms of recall and precision is still poor, the plot
shows that samples classified as churners with a probability higher than 0.6 are
twice as likely to be real churners as the others. This suggests that the model
could be of use if only the samples with the highest probability of churning are
considered, even if that represents only a few customers.
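
Operationally, such a high-probability filter is a one-liner on top of the fitted model; a sketch with a synthetic stand-in for the real data (the 0.6 cutoff is the one read off Figure 5.19):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.96], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

    # Flag only the customers whose churn probability exceeds 0.6,
    # the cutoff suggested by the probability plot in Figure 5.19.
    probas = rf.predict_proba(X_te)[:, 1]
    high_risk = np.where(probas > 0.6)[0]
    print(f"{high_risk.size} customers flagged for a follow-up call")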
Chapter 6

Discussion

In this chapter the general findings of this thesis are discussed and analyzed.

6.1 Temporal framework


When defining a churn model, several parameters have to be chosen, including the
definition of the target variable (how exactly a churning customer is defined),
the temporal framework considered for collecting training data, and the lag
period between cutoff and observation of the target outcome. In order for the
Customer Management teams to be as efficient as possible, it is of course
desirable to predict churners as soon as possible. However, too short an interval
between cutoff and observation of the outcome would lead the training set to
contain old observations that might not reflect the behaviour of a customer when
it churns. In this thesis, the chosen time frame is determined to be a reasonable
trade-off between predicting in the near future and allowing more time for churn
to happen. Nonetheless, different temporal frameworks could be investigated,
especially if increasing the lag could give better predictive performance.

6.2 Available data


The analysis in this thesis highlighted the necessity of a precision-recall
trade-off, as it appeared that no model could predict churn with both high recall
and high precision. More complex models such as neural networks could be
investigated in order to determine whether this limitation comes from the models
or is intrinsic to the studied data. What remains unsure in our use case is
whether there is enough distinction between customers that churn and those that
do not in order to predict churn with sufficient accuracy. This consideration is
also supported by the fact that the customer database is limited, which increases
the risk of discrepancies between customers' behaviours [22]. In addition,
Aircall's customers are of various types but not yet identified as such. It would
thus be interesting to perform customer segmentation first [23], and then adapt
the churn prediction to the different customer segments. Another improvement
would be to include more data related to product usage, in addition to
deterministic data that remains constant regardless of consumption.

6.3 Sequential data set


In this thesis, the process of flattening the temporal data and using time series
cross-validation was chosen, but more complex data processing techniques and
their associated models could be investigated. In particular, building a
sequential data set of the form shown in Figure 6.1 could enable the use of
sequential deep learning models such as LSTMs or RNNs [24].

Figure 6.1: Example of sequential data structure where, for a customer X, τ is the
number of months and d is the number of features
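
As an illustration, such a (customers × τ × d) tensor could be assembled from a flat monthly table as sketched below; the column names are hypothetical:

    import pandas as pd

    # Hypothetical flat table: one row per customer per month.
    df = pd.DataFrame({
        "customer_id": [1, 1, 1, 2, 2, 2],
        "month":       [1, 2, 3, 1, 2, 3],
        "n_users":     [4, 5, 5, 2, 2, 1],
        "n_calls":     [80, 95, 90, 30, 25, 10],
    })

    features = ["n_users", "n_calls"]
    tau = df["month"].nunique()  # number of months per customer
    d = len(features)            # number of features

    # Shape (n_customers, tau, d): the input format LSTMs/RNNs expect.
    seq = (df.sort_values(["customer_id", "month"])[features]
             .to_numpy()
             .reshape(-1, tau, d))
    print(seq.shape)  # (2, 3, 2)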
Chapter 7

Conclusions

In this chapter the research question is answered and the business and academic
contributions are presented.
After completing the analysis presented in this thesis report, the research
question can be answered: to what extent can statistical analysis and machine
learning algorithms be used to highlight churn drivers for a B2B company with
little data available, so that churn can be explained? All the models are unable
to predict churn well when both precision and recall are taken into account.
However, they enable a better understanding of churn in the context of the
business the thesis was conducted for. Some metrics appeared to have a strong
implication in the classification of a customer as a churner, which gives a solid
base for further investigation. These results are valuable insights for Aircall
and offer a better understanding of their customer base, as well as a foundation
for creating improved customer relations.
There is a constant exchange of information and collaboration between the
Customer Relationship Management (CRM) team and the Data team at Aircall. Being
able to customize the relationships with customers is important, and the insights
that this thesis brings will hopefully contribute further to Aircall's customer
relations. Furthermore, this thesis offers a foundation for the Data team to
build upon when continuing the churn analysis. The scientific contribution of
this thesis lies in the definition and design of the churn model. Based on the
studied literature, few scientific articles aim to detect churn for
subscription-based B2B companies using historical temporal data. Moreover, this
thesis focuses on a certain type of customers, representing the smallest
businesses. Because customers that pay the most for the product are less likely
to churn, and easier to follow up with, it could be more important for a business
to focus on predicting churn for less mature customers that might not use the
product to its full potential. This is especially important for companies
operating in competitive and digital markets, where the same kind of service can
easily be accessed through other suppliers.


Bibliography

[1] Frederick Reichheld. Loyalty-Based Management. url: https://www.bain.com/insights/loyalty-based-management-hbr/ (accessed: 03.01.1993).
[2] A. M. Hughes. Strategic database marketing. Probus Publishing Company, 1994. isbn: 9781557385512.
[3] Bob Stone. Successful direct marketing methods. NTC Business Books, 1994. isbn: 9780844230047.
[4] Scott Neslin et al. “Defection Detection: Measuring and Understanding the Predictive Accuracy of Customer Churn Models”. In: Journal of Marketing Research 43 (Apr. 2006), pp. 204–211. doi: 10.1509/jmkr.43.2.204.
[5] Chih-Ping Wei and I-Tang Chiu. “Turning telecommunications call details to churn prediction: A data mining approach”. In: Expert Systems with Applications 23 (Aug. 2002), pp. 103–112. doi: 10.1016/S0957-4174(02)00030-1.
[6] A. Hiziroglu and Omer Seymen. “Modelling customer churn using segmentation and data mining”. In: Frontiers in Artificial Intelligence and Applications 270 (Jan. 2014), pp. 259–271. doi: 10.3233/978-1-61499-458-9-259.
[7] Iris Figalist et al. “Customer Churn Prediction in B2B Contexts”. In: Jan. 2020.
[8] Ali Tamaddoni, Stanislav Stakhovych, and Michael Ewing. “Managing B2B customer churn, retention and profitability”. In: Industrial Marketing Management 43 (July 2014). doi: 10.1016/j.indmarman.2014.06.016.
[9] Gareth James. An Introduction to Statistical Learning with Applications in R. Springer, 2017. isbn: 9781461471370.
[10] D. R. Cox. “The Regression Analysis of Binary Sequences”. In: Journal of the Royal Statistical Society. Series B (Methodological) 20.2 (1958), pp. 215–242. doi: 10.2307/2983890.
[11] L. Breiman et al. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984. isbn: 9780412048418.
[12] L. Breiman. “Random Forests”. In: Machine Learning 45.1 (2001), pp. 5–32. doi: 10.1023/A:1010933404324.
[13] Abhishek Sharma. Confusion Matrix in Machine Learning. url: https://www.geeksforgeeks.org/confusion-matrix-machine-learning/.
[14] Tom Fawcett. “Introduction to ROC analysis”. In: Pattern Recognition Letters 27 (June 2006), pp. 861–874. doi: 10.1016/j.patrec.2005.10.010.
[15] Jason Brownlee. Random Oversampling and Undersampling for Imbalanced Classification. url: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/.
[16] Dataman. Using Over-Sampling Techniques for Extremely Imbalanced Data. url: https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-part-ii-over-sampling-d61b43bc4879.
[17] Nitesh Chawla et al. “SMOTE: Synthetic Minority Over-sampling Technique”. In: J. Artif. Intell. Res. (JAIR) 16 (Jan. 2002), pp. 321–357. doi: 10.1613/jair.953.
[18] Guilherme Dinis Chaliane Junior. “Churn Analysis in a Music Streaming Service”. MA thesis. KTH Royal Institute of Technology, 2017.
[19] Filip Stojanovski. “Churn Prediction using Sequential Activity Patterns in an On-Demand Music Streaming Service”. MA thesis. KTH Royal Institute of Technology, 2017.
[20] Haibo He and Yunqian Ma. Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley, 2013. isbn: 978-1-118-07462-6.
[21] Len Tashman. “Out-of-sample tests of forecasting accuracy: a tutorial and review”. In: International Journal of Forecasting 16 (Jan. 2000).
[22] Kajsa Barr and Hampus Pettersson. “Predicting and Explaining Customer Churn for an Audio/ebook Subscription Service using Statistical Analysis and Machine Learning”. MA thesis. KTH Royal Institute of Technology, 2019.
[23] Dalia Kandeil, Amani Saad, and Sherin Youssef. “A Two-Phase Clustering Analysis for B2B Customer Segmentation”. In: Proceedings - 2014 International Conference on Intelligent Networking and Collaborative Systems, IEEE INCoS 2014 (Mar. 2015), pp. 221–228. doi: 10.1109/INCoS.2014.49.
[24] Andreas Brynolfsson Borg. “Non-Contractual Churn Prediction with Limited User Information”. MA thesis. KTH Royal Institute of Technology, 2019.