MARIE SERGUE
Abstract

Customer Analysis and Prediction with Machine Learning for a B2B SaaS Company

Over the past decade, many services have been digitized and data has become increasingly available, easy to store and to process, with the aim of understanding customer behaviour. To remain leaders in their industries, subscription-based companies must focus on customer relationship management and, in particular, churn management, that is, understanding how and why customers cancel their subscriptions. In this thesis, churn analysis is performed on real data from a SaaS (software as a service) company that sells an advanced cloud-based business phone system, Aircall. This case study is particular in that the available data set consists of monthly customer data with a very uneven distribution: a large majority of the customers do not cancel their subscriptions. Therefore, several methods are investigated to reduce the effect of this imbalance while staying as close as possible to the real world and its temporal framework. These methods include oversampling and undersampling (SMOTE and Tomek links) and time-series cross-validation. Logistic regression and random forests are then applied with the aim of both predicting and explaining churn. The non-linear method performed better than logistic regression, indicating a limitation of linear models in our use case. Moreover, combining oversampling with undersampling yields better performance in terms of precision and recall. Time-series cross-validation is also an effective method for improving model performance. Overall, the resulting model is more useful for explaining churn than for predicting it. With the help of the model, certain factors influencing churn, mainly related to product usage, could be identified.
Acknowledgements
First of all, I would like to express my deepest gratitude to my industrial and
academic supervisors, Edouard Flouriot and professor Anja Janssen, respectively.
Their guidance and assistance have been of great value for the completion of this
work. I would also like to thank my colleagues at Aircall, who supported me and my work greatly while welcoming me into their team with kindness. Finally, I would like to thank professor Jimmy Olsson for being my examiner, and my classmates, Louis and Pierre, for their continuous reviews and discussions.
Contents
1 Introduction
  1.1 Context and Terminology
  1.2 Research Question
  1.3 Delimitation
  1.4 Outline
2 Literature
3 Theory
  3.1 Classification
    3.1.1 Logistic Regression
    3.1.2 Decision Trees
    3.1.3 Bagging
    3.1.4 Random Forests
  3.2 Model Evaluation
    3.2.1 Confusion Matrix
    3.2.2 ROC Curve and AUC
    3.2.3 Precision - Recall Curve
  3.3 Imbalanced Data
    3.3.1 Cross validation
    3.3.2 Resampling
4 Methods
  4.1 Goal
  4.2 Data Procurement
  4.3 Data Processing
    4.3.1 Flatten temporal data
    4.3.2 Categorical data
    4.3.3 Outliers
  4.4 Dealing with data set imbalance
    4.4.1 Resampling
    4.4.2 Time-series cross-validation
  4.5 Modelling and hyperparameter search
    4.5.1 Modelling
    4.5.2 Classification threshold
  4.6 Evaluation
5 Results and Analysis
6 Discussion
  6.1 Temporal framework
  6.2 Available data
  6.3 Sequential data set
7 Conclusions
Bibliography
Chapter 1
Introduction
In recent years, companies have been able to store and process huge amounts of data while realizing that being customer-centric was becoming a main requirement to stand out from the competition. Indeed, due to saturated markets, focusing on Customer Relationship Management (CRM) in order to retain the existing customer base is not optional anymore, but an absolute necessity for competitive survival. In his research for Bain and Company, Frederick Reichheld [1] stated that the cost of acquiring a new customer could be higher than that of retaining a customer by as much as 700%, and that increasing customer retention rates by a mere 5% could increase profits by 25% to 95%.
More generally, data-driven decision making is a way for businesses to make sure their next move will benefit both them and their customers. Almost every company, especially in the Tech ecosystem, has now put a tracking process into place to gather data related to their customers' behavior. The data to track varies along with the specific business model of each company and the problem they aim to address.
By analyzing how, when and why customers behave a certain way, it is possible to
predict their next steps and have time to work on fixing issues beforehand.
1.1 Context and Terminology

Aircall's customer base consists of almost 5000 international small businesses and start-ups. Each of these customers can add as many users as they need to their account and assign them to one or more phone numbers. When customers subscribe to the service, they choose between several pricing plans, each of which proposes a price per added user.
The main specificity and competitive advantage of Aircall is that it can be connected to many other business tools, so that each customer can build their own custom workflows. The connection built by a customer between their Aircall account and any other software they use is called an integration. Each customer can create as many integrations as they wish. Using Aircall with integrations is perceived as a driver of adherence to the product.
Product usage is not yet precisely tracked on the product side, but a trend can be highlighted simply by looking at the number of calls made by the customer (outbound calls), the number of calls received by the customer (inbound calls) and the number of integrations configured in their Aircall account, as well as the evolution of these metrics over time for a given customer.
Finally, customers can assess the quality of Aircall's product and service by two different means. First, a form is sent to each of their users every 3 months, in which they can state how likely they would be to recommend the product to someone (on a scale from 0 to 10, 0 being the most negative answer). Depending on their grade, they are then qualified as either promoters (graded 9 or 10), detractors (graded 0 to 6) or neutral. Aircall then computes the Net Promoter Score (NPS), calculated by subtracting the percentage of respondents who are detractors from the percentage of respondents who are promoters. An NPS can be as low as -100 (every respondent is a detractor) or as high as +100 (every respondent is a promoter).
The business teams at Aircall are divided into five groups: Sales, Onboarding, Customer Success, Support and Marketing. Each plays its part at a different moment of the customer lifetime. First, the Sales team is in charge of finding potential clients, making sure they are qualified (meaning Aircall's product could actually be useful for them) and signing the deal. Once the targeted company has become a customer, the Onboarding team helps them configure the product on their own system so that they have the best possible experience with it. From then on, the company becomes the responsibility of the Customer Success team. They are the point of contact between Aircall and its customers, and the key stakeholders when talking about churn. Indeed, their job is split between trying to upgrade customers to a better plan or have them add more users, and preventing them from churning. Finally, the Support team is the point of contact for any kind of technical issue; they can be reached through tickets, chat or phone.
Aircall's customers are divided into two distinct categories depending on how much revenue they bring to the company. Those that represent more than $1K of monthly recurring revenue are defined as VIP accounts, and the others as Mid-Market accounts. VIP accounts represent less than 10% of Aircall's total number of customers but 50% of total monthly recurring revenue, and 70% of the Customer Success team is assigned to them. These accounts are so carefully looked after that the Customer Success team successfully manages to prevent their churn. On the contrary, there are too many Mid-Market accounts and too few Customer Success Managers to handle their churn. Human resources are too limited to conduct this work, which is why the company decided to invest in data analysis and machine learning to get insights about the Mid-Market accounts and give that information to Customer Success Managers, so that they can contact at-risk customers before they leave.
1.3 Delimitation
In order to narrow down the scope of this thesis and to match Aircall’s needs and
context, the following delimitation has been made:
• Aircall has been able to gather and store basic customer data since its creation in 2014. However, due to many changes in the business during the first 4 years, and the absence of a Data team until early 2018, the data before 2018 was judged not relevant enough to be considered in the model. The data used for the analysis is therefore historical data gathered from the beginning of 2018 onwards, to reflect modern behaviour.
1.4 Outline
In Chapter 2, relevant literature is reviewed in order to get familiar with the work that has been conducted in this area. Chapter 3 gathers all the theoretical principles used during this research. In Chapter 4, the whole process and methodology of the thesis is detailed to give the full picture of the project, which leads to the results and analysis presented in Chapter 5. Finally, this paper ends with a discussion of the findings in Chapter 6 and a conclusion in Chapter 7.
Chapter 2
Literature
Customer churn, as a relatively recent concept, has been widely employed in the telecommunication, e-retailing and banking industries in recent years. The definition of churn may vary for each organization with regard to how long a customer must stay away from a company before being counted as churned. Many existing studies highlight that, in order to boost customer loyalty and retention, having a customized and differentiated relationship with one's customers is fundamental. To answer this problem, several ways of performing analysis that captures customer behavior exist.
Among others, RFM (Recency, Frequency, Monetary) is a commonly used framework for interpreting customers' prior behavioural actions. The Pareto/NBD model is then usually applied by optimizing its parameters to provide a best fit to a set of RFM data. In RFM, Recency refers to the time interval between the latest purchase and the determined time period. Frequency refers to the number of purchases in a particular time period. Finally, Monetary refers to the amount of total spending in a particular time period. A customer with high frequency, high monetary value and low recency can be defined as a valuable customer. For large data sets, RFM is a good, fast and practical method which helps to identify the most and least valuable customers. Two studies proposed to score customers on each RFM variable by adding weights to the calculation. Hughes [2] suggests the three RFM variables have equal importance, so the weights should be equal, while Stone [3] suggests that the importance of the RFM variables varies among sectors. Stone's suggestion has been supported by some studies. These weights, which sum to 1, are multiplied with each RFM variable to create a score for each customer. The score stands for the value of each customer, where the highest scores indicate loyal customers and the lowest scores disloyal ones. Unfortunately, even though the RFM framework is very suitable for B2C companies, it is less intuitive for subscription-based B2B companies, especially SaaS. Indeed, the service is billed monthly, so recency and frequency are the same for all customers. The only relevant indicator is thus the Monetary value, which is captured by the Monthly Recurring Revenue (MRR).
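To make the weighted RFM scoring concrete, the following is a minimal sketch in Python with pandas, assuming a hypothetical data frame with recency, frequency and monetary columns; the equal weights follow Hughes [2], and the quintile scoring is one common convention rather than the method of any specific cited study.

```python
import pandas as pd

def rfm_score(df, weights=(1/3, 1/3, 1/3)):
    """Score each customer 1-5 on R, F and M, then combine with weights summing to 1."""
    w_r, w_f, w_m = weights  # Hughes: equal importance for the three variables
    # Low recency is good (a recent purchase), so its quintile labels are reversed.
    r = pd.qcut(df["recency"].rank(method="first"), 5, labels=[5, 4, 3, 2, 1]).astype(int)
    f = pd.qcut(df["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
    m = pd.qcut(df["monetary"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
    return w_r * r + w_f * f + w_m * m
```

Under Stone's suggestion, the weights would instead be chosen per sector rather than kept equal.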
To expand the analysis, numerous studies have explored various machine learning algorithms and their potential for modelling churn. Because predicting whether a customer will churn or not is a binary classification problem, multiple models such as logistic regression [4], decision trees [5], random forests, support vector machines and neural networks have been tested. Logistic regression has the advantage of being easy to use and of producing robust results. Therefore, it is often applied as a benchmark for more advanced techniques. However, these waves of studies have usually focused on pure methodological improvements, conducted with the aim of improving the accuracy of customer churn models. According to these studies, some of the methods indeed provided better predictions when tested on a particular data set. However, the improvements are normally limited to one industry or even one company; no generic improvement can be observed from simply employing different methods. This shows the necessity of exploring not only methodological improvements but also conceptual development [6].
Nevertheless, these studies raised the main challenges encountered when trying to model churn. The main one concerns the fact that the data is usually imbalanced, that is, the proportion of churn cases is small in the overall data set. The most commonly proposed solution is to apply an oversampling method called SMOTE, which creates synthetic samples of the minority class to even out the balance in the data set.
While business-to-customer (B2C) companies, in the telecom sector for instance, have been making use of customer churn prediction for many years, churn prediction in the business-to-business (B2B) domain receives much less attention in the existing literature [7]. Some studies have proposed data-mining approaches comparing logistic regression, decision trees and boosting, and showed that logistic regression could provide robust results but was outperformed by the aggregation of decision trees [8].
Chapter 3
Theory
This chapter presents the theoretical backgrounds of the methods and models used
in the thesis.
3.1 Classification
When applying a model, the response variable Y can be either quantitative or qualitative. The process of predicting qualitative responses involves assigning the observation to a category, or class, and is thus known as classification. The methods used for classification often first predict the probability of each category of a qualitative variable. There exist many classification techniques, known as classifiers, that can be used to predict a qualitative response. In this thesis, two of the most widely used classifiers are discussed: logistic regression and random forests. [9]
3.1.1 Logistic Regression

In logistic regression, the probability of the positive class is modelled with the logistic function,

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}},$$

which always produces an S-shaped curve between 0 and 1, and therefore never yields negative or very high probabilities. Rearranging into the odds and taking the logarithm of both sides, we arrive at:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$

The left-hand side is called the log-odds or logit. This function is linear in X; hence, if the coefficient is positive, an increase in X will result in a higher probability.
The coefficients $\beta_0$ and $\beta_1$ in the logistic equation are unknown and must be estimated from the available training data. To fit the model, we use a method called maximum likelihood. The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(x_i)$ of the target for each sample corresponds as closely as possible to the sample's observed status. In other words, the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize the likelihood function:

$$\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr)$$

After fitting the logistic regression, the accuracy of the coefficient estimates can be measured by computing their standard errors. Another useful quantity is the z-statistic. For example, the z-statistic associated with $\beta_1$ is equal to $\hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$, so a large absolute value of the z-statistic indicates evidence against the null hypothesis $H_0 : \beta_1 = 0$. [10]
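As an illustration, a logistic regression can be fit and its coefficients inspected in a few lines of scikit-learn, the library used in this thesis; the synthetic data below is only a stand-in for the real customer features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the customer data: roughly 3% positive (churn) class.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.97], random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)

# model.coef_ holds the estimated beta coefficients; exponentiating them
# gives odds ratios: the multiplicative change in the odds per unit of X.
odds_ratios = np.exp(model.coef_[0])
print(model.intercept_[0], odds_ratios.round(2))
```

Scikit-learn does not report standard errors or z-statistics directly; a statistics-oriented package would be needed for those.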
3.1.2 Decision Trees

In a classification tree, a natural criterion for making the binary splits is the classification error rate, the fraction of training observations in a region that do not belong to the most common class:

$$E = 1 - \max_k(\hat{p}_{mk}),$$

where $\hat{p}_{mk}$ is the proportion of observations in the mth region that are from the kth class. In practice, this criterion is not sensitive enough for growing the trees, which leads us to two other measures that are usually preferred: the Gini index and entropy. The Gini index is a measure of total variance across the K classes and is defined by:

$$G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}).$$

If all the $\hat{p}_{mk}$ are close to 0 or 1, the Gini index will be small; a small value of G thus indicates that a node mainly contains observations from a single class, which is referred to as node purity. As mentioned before, an alternative to the Gini index is entropy. [11]
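A minimal numeric illustration of the Gini index, computed directly from the class proportions in a node (the proportions are hypothetical, not values from the thesis data):

```python
import numpy as np

def gini_index(p_hat):
    """Gini index G = sum_k p_hat_k * (1 - p_hat_k) for a single node."""
    p = np.asarray(p_hat, dtype=float)
    return float(np.sum(p * (1.0 - p)))

print(gini_index([0.5, 0.5]))    # 0.5   -> maximally impure two-class node
print(gini_index([0.95, 0.05]))  # 0.095 -> nearly pure node, small G
```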
3.1.3 Bagging
As we mentioned in the previous section, decision trees suffer from high variance,
meaning that if we fit a decision tree to two different subsets of the training data,
we might get quite different results. Bootstrap aggregation, or bagging, is a method
aiming at reducing the variance, and therefore is commonly used when decision
trees are implemented. Given a set of n independent observations $Z_1, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2/n$. This means that averaging a set of observations reduces variance. Hence, by taking many training sets from the population, building a separate prediction model using each training set and averaging the resulting predictions, we can reduce the variance and consequently increase the prediction accuracy of the method. In particular, we calculate $\hat{f}^1(x), \hat{f}^2(x), \ldots, \hat{f}^B(x)$ using B separate training sets, and average them in order to obtain a single low-variance statistical model, given by:

$$\hat{f}_{\mathrm{avg}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^b(x).$$
However, in most use cases, it might not be possible to access multiple training sets. That is where the bootstrap method becomes useful. Bootstrapping consists of taking repeated samples from the original training data set, generating B different bootstrapped training data sets. The model is then fit on the bth bootstrapped training set, resulting in the prediction $\hat{f}^{*b}(x)$. All the predictions are averaged to obtain:

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x).$$
Bagging can easily be applied to a classification problem in order to predict a qualitative outcome Y. For a given test observation, each of the B trees predicts a class, and the overall prediction is the most commonly occurring class among the B predictions. [9]
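In scikit-learn, bagged trees with majority voting follow directly from this description; a minimal sketch, with synthetic data again standing in for the real set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# BaggingClassifier trains B=100 models on bootstrapped samples (its default
# base estimator is a decision tree) and predicts by majority vote.
bag = BaggingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(bag.score(X_test, y_test))
```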
Given the values of these four entries, several other metrics can be derived, namely accuracy, recall, precision and F-score.

Accuracy

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Accuracy computes the number of correctly classified items out of all classified items.
Recall

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Recall measures how many of the actual positive cases the model predicted correctly. It should be as high as possible, as a high recall indicates the class is correctly recognized (a small number of FN). It is usually optimized when the goal is to limit the number of false negatives.
Precision

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Precision measures, out of all the examples predicted as positive, how many are actually positive. High precision indicates that an example labeled as positive is indeed positive (a small number of FP). It is usually optimized when the goal is to limit the number of false positives. A model with high recall but low precision correctly recognizes most of the positive examples (low FN) but produces many false positives.
F-score

$$\text{F-score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$

The F-score represents precision and recall at the same time and is therefore widely used for measuring model performance. Indeed, optimizing only recall leads the algorithm to predict most examples as belonging to the positive class, which results in many false positives and hence low precision. On the other hand, optimizing only precision leads the model to predict very few examples as positive (the ones with the highest probability), but recall will be very low. [14]
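All four metrics follow directly from the confusion-matrix entries; a short sketch with scikit-learn, using small hypothetical label vectors:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # hypothetical ground truth
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]  # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print("accuracy ", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN) = 0.7
print("recall   ", recall_score(y_true, y_pred))     # TP/(TP+FN) = 0.75
print("precision", precision_score(y_true, y_pred))  # TP/(TP+FP) = 0.6
print("F-score  ", f1_score(y_true, y_pred))         # harmonic mean of the two
```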
3.2.2 ROC Curve and AUC

AUC (Area Under the Curve) represents the degree of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. The ideal ROC curve hugs the top-left corner, indicating a high true positive rate and a low false positive rate. The dotted diagonal represents the 'no information' classifier, which is what we would expect from a random-guessing classifier. [14]
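The ROC curve and AUC are computed from the model's predicted probabilities; a sketch, reusing synthetic data as a stand-in for the thesis data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, proba)  # the points of the ROC curve
print("AUC =", roc_auc_score(y_test, proba))     # 0.5 = random guessing, 1.0 = perfect
```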
k-fold Cross-Validation

Cross-validation is a refinement of the train-test split approach that addresses the issue highlighted in the previous subsection. This approach consists of randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold acts as a test set, and the method is fit on the remaining k - 1 folds. The mean squared error is then computed on the observations of the test set. The procedure is repeated k times, each time with a different fold taking the role of the test set. As a result, the test error is estimated k times, and averaging these values gives the k-fold cross-validation (CV) estimate:

$$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i$$

The advantage of cross-validation is that it can be applied to almost any statistical learning method. For computationally intensive models, one typically performs k-fold CV with k = 5 or k = 10, which requires fitting the learning procedure only five or ten times, respectively.
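The CV estimate corresponds to a one-liner in scikit-learn; a sketch, assuming any estimator and data (here a ridge regression on synthetic data):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# 10-fold CV: scikit-learn reports negated MSE, so flip the sign before averaging.
scores = cross_val_score(Ridge(), X, y, cv=10, scoring="neg_mean_squared_error")
cv_estimate = -scores.mean()  # CV_(k) = (1/k) * sum of the k per-fold MSEs
print(cv_estimate)
```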
3.3.2 Resampling
Random Oversampling
A simple way to deal with imbalanced data sets is to rebalance them, either by oversampling instances of the minority class or by undersampling instances of the majority class. Random resampling provides a naive technique for rebalancing the class distribution of an imbalanced data set. In particular, random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training data set. It is referred to as a 'naive' resampling method because it assumes nothing about the data and uses no heuristics, which makes it simple to implement and fast to execute; this is desirable for very large and complex data sets. Importantly, the change to the class distribution is only applied to the training data set, with the intent of influencing the fit of the models; resampling is not applied to the test data used to evaluate performance. However, in some cases, seeking a balanced distribution for a severely imbalanced data set can cause the affected algorithms to overfit the minority class, leading to increased generalization error: better performance on the training data set, but worse performance on the test data set. [15]
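With imbalanced-learn, the library used in this thesis, random oversampling of the training set only can be sketched as follows; the 0.5 sampling ratio is an illustrative choice, not the setting used in the thesis.

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.97], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Duplicate minority samples until they reach 50% of the majority count.
# Only the training split is resampled; the test split keeps the true distribution.
ros = RandomOverSampler(sampling_strategy=0.5, random_state=0)
X_res, y_res = ros.fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_res))
```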
SMOTE
To avoid this overfitting problem, the Synthetic Minority Over-sampling Technique (SMOTE) has been proposed. This method generates synthetic data based on feature-space similarities between existing minority instances. To create a synthetic instance, it finds the k nearest neighbors of a minority instance, randomly selects one of them, and then linearly interpolates to produce a new minority instance in the neighborhood. In other words, to create a synthetic data point, it takes the vector between the selected neighbor and the current data point, multiplies this vector by a random number between 0 and 1, and adds the result to the current data point [16]. Formally, the synthetic sample $x_{\mathrm{new}}$ is generated by interpolating between x and a neighbor $\tilde{x}$ as follows:

$$x_{\mathrm{new}} = x + \mathrm{rand}(0, 1) \times (\tilde{x} - x),$$

where rand(0, 1) denotes a random number between 0 and 1.
For most basic versions of SMOTE, two parameters can be adjusted. The first is the SMOTE multiplier m. The second is the number of nearest neighbors to use, k. In the original SMOTE paper, the five nearest neighbors were used, and one to five of those nearest neighbors were randomly selected depending on the amount of oversampling desired. [17]
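The interpolation formula can be demonstrated directly with numpy; a toy sketch with hypothetical points, omitting the k-nearest-neighbor search for brevity:

```python
import numpy as np

rng = np.random.default_rng(42)

x = np.array([1.0, 2.0])        # a minority-class point
x_tilde = np.array([3.0, 1.0])  # one of its k nearest minority-class neighbors

# x_new = x + rand(0, 1) * (x_tilde - x): a point on the segment between them.
x_new = x + rng.random() * (x_tilde - x)
print(x_new)
```

In practice one would call imbalanced-learn's SMOTE (for example SMOTE(k_neighbors=5)) rather than re-implementing the interpolation.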
Tomek Links

A Tomek link is a pair of examples from opposite classes that are each other's nearest neighbors. Such pairs are typically noise points or points that lie close to the optimal decision boundary. Removing these points can result in more well-defined class clusters in the training data, which can lead to better classifiers. In Tomek link undersampling, only the majority class example in each Tomek link pair is removed. There are two reasons for this. First, in an imbalanced data set, the minority class examples may be too valuable to waste, especially if the minority class is underrepresented. Second, recall on the minority class is frequently of greater importance than precision, so it is worth inadvertently retaining some noisy data in order to avoid losing rare cases. The two methods are combined by first applying SMOTE to generate synthetic minority class samples, and subsequently applying Tomek link undersampling to the data set composed of both the original and the new, synthetic observations.
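imbalanced-learn ships this combination as a single transformer; a minimal sketch under the same synthetic-data assumption as before:

```python
from collections import Counter

from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.97], random_state=0)

# SMOTETomek first applies SMOTE, then removes the majority member of each Tomek link.
smt = SMOTETomek(random_state=0)
X_res, y_res = smt.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```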
Chapter 4
Methods
The work conducted in this thesis follows the typical framework of a machine learning classification problem. However, because the business problem had not been explored at all before the start of the thesis, there were many iterations in the process, guided by the insights and findings that emerged along the way. The analysis was carried out entirely in Python, through Jupyter Notebook, using the libraries numpy, pandas, scikit-learn and imbalanced-learn.
4.1 Goal
The churn model intends to provide Aircall with the level of risk a customer has of churning, so that the Customer Relation team can contact them and implement an action plan to retain them. The objective of this thesis is to create an acceptable churn prediction model in order to highlight effective levers that Aircall could use to significantly reduce its churn rate. Therefore, perfect accuracy is not the aim; a model producing a satisfactory recall will be enough to give some first insights.
4.2 Data Procurement

With only about 3% of customers churning, gathering more data points is mandatory to build a model that would give satisfying results. There is no single simple rule to determine whether the data collected is enough to run the chosen machine learning models, but given the specificity of Aircall's business and the strong imbalance, it was decided to find a way to get more data samples. Aircall's customer data is gathered and stored every day, and aggregated per month to get an overview of each customer's characteristics while they are still subscribers. Because the behaviour of a customer in the telecommunications industry is not very volatile, monthly data is preferred.
One specificity of the data is that, while some features are static (customer region, for example), the majority can vary over time. Dealing with these features requires creating a temporal framework. As seen in the literature, and in line with the studied business problem, two non-overlapping time frames are created [18] [19]. First, an observation period has to be specified for each customer. This observation window is the period over which customer activity is gathered. In order to amass sufficient data, the observation window for each customer consists of the 3 months before each extract date. The reasoning behind this is that customers exhibit a certain behaviour during the quarter before they eventually churn, while a more restrictive period could be affected by seasonality or by some specificity of the customer's industry.
Several approaches to handling the temporal features have been studied in the literature. One could build a sequential data set and apply models adapted to it, such as Long Short-Term Memory (LSTM) networks or Recurrent Neural Networks (RNN). However, these methods tend to perform very poorly on small data sets, which makes them unsuitable in the context of this thesis. The chosen method instead consists of flattening the non-static features, with one feature per time step, across the complete observation period. With each sequence consisting of 3 time steps, each described by several features, the flattened version of each observation has 3 times the initial number of temporal features, which are then treated as independent, neglecting the time dimension. A minimal sketch of this flattening is given below.
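The sketch assumes a hypothetical long-format pandas frame with one row per customer and month offset (1 to 3) within the observation window; the column names are illustrative, not the actual feature names.

```python
import pandas as pd

df = pd.DataFrame({  # hypothetical long-format monthly data
    "customer_id": [1, 1, 1, 2, 2, 2],
    "month_offset": [1, 2, 3, 1, 2, 3],
    "n_calls": [40, 35, 20, 80, 82, 85],
    "n_integrations": [2, 2, 1, 4, 4, 5],
})

# One row per customer; each temporal feature becomes 3 columns, one per month.
flat = df.set_index(["customer_id", "month_offset"]).unstack("month_offset")
flat.columns = [f"{feat}_m{off}" for feat, off in flat.columns]
print(flat)
```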
Following the same idea, an activation window of 1 month is chosen for assigning churn labels: a customer is labeled as a churner at a specific extract month if it churned in the following month. This time step was chosen in agreement with the Customer Success team. Indeed, for a given customer predicted as a churner, they need some time to start a retention process, but widening the activation window to more than 1 month would prevent observing the customer's behaviour in the months right before it churns, losing significant information.
To sum up, for each month from early 2018 to now, we consider the batch of customers that were still subscribers that specific month, every batch having its own observation and activation period.
From the data collected each month for each customer, the features are selected using an iterative approach. Starting from a dataset containing some basic characteristics of the customers, more actionable features were added in order to observe their influence on the model and on churn. Also, in order to reduce the risk of multicollinearity, features that had no correlation with the target, as well as redundant features, were removed by examining the distributions of the features, independently and in relation to the target, and their correlations. In the end, the dataset consisted of features including information about:
4.3.3 Outliers
Machine learning algorithms are very sensitive to the range and distribution of data points. Outliers can deceive the training process, resulting in longer training times and less accurate models. The outliers were detected by looking at boxplots and performing extreme value analysis. The key to extreme value analysis is to determine the statistical tails of the underlying distribution of the variable and find the values at the extremes of the tails. As the variables are not normally distributed, a general approach is to compute the quartiles and then the inter-quartile range (IQR): any data point above the upper boundary or below the lower boundary can be considered an outlier. Because the studied data set is already small, removing samples with outliers would waste even more data and is not considered. Instead, each extreme value is replaced by the mean, or the most represented value, of the related feature.
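A sketch of this IQR-based replacement on a hypothetical numeric column; the 1.5 x IQR fences are the usual convention, assumed here rather than stated in the thesis.

```python
import pandas as pd

s = pd.Series([3, 4, 5, 4, 6, 5, 4, 120])  # hypothetical feature with one outlier

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Replace extreme values with the mean of the non-outlier values
# (for categorical features the mode would be used instead).
is_outlier = (s < lower) | (s > upper)
s_capped = s.mask(is_outlier, s[~is_outlier].mean())
print(s_capped.tolist())
```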
4.4.1 Resampling

Because the data set is already small, pure undersampling methods were not explored, as they would result in even fewer available observations. Studies have shown that perfectly balanced training sets do not necessarily lead to optimal performance, so other class distributions were experimented with, using different partitions between the two target classes. For the three oversampling methods tested, the class proportions of the training sets resulting from oversampling are given in table 4.2.
4.4.2 Time-series cross-validation

When standard k-fold cross-validation is applied to a data set whose class distribution is severely imbalanced, it is likely that one or more folds will contain few or no samples from the minority class. This would lead to a model only predicting the majority class correctly, which is of no interest in a churn business case.
Moreover, when dealing with time series data, traditional cross-validation should not be used, for two major reasons. The first refers to temporal dependencies: when splitting time series data randomly, information from outside the training data set is often used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and can produce overly optimistic, if not invalid, estimates of the performance of the model being constructed. In order to accurately simulate a real-world forecasting environment, in which we stand in the present and forecast the future [21], all withheld data must represent events occurring chronologically after the events used for fitting the model. Therefore, instead of k-fold cross-validation, hold-out cross-validation is preferred for time series data: a subset of data is reserved for validating model performance, the test set always comes chronologically after the training set, and the validation set always comes chronologically after the training subset. In order not to choose the test set arbitrarily, which may lead to a poor estimate of the test error, nested cross-validation can be used. Nested cross-validation contains an outer loop for error estimation and an inner loop for parameter tuning. In the inner loop, the training set is split into a training subset and a validation set; the model is trained on the training subset, and the parameters that minimize the error on the validation set are chosen. In the outer loop, the data set is split into multiple different training and test sets, and the error on each split is averaged in order to compute a robust estimate of the model error. In the case of time series data, we use a method based on forward-chaining, also referred to in the literature as rolling-origin evaluation [21], on a monthly basis. It consists of successively considering each month as the test set and assigning all previous data to the training set. For example, if the data set spans six months, it produces four different training and test splits, as can be seen in Figure 4.2. This method produces several train-test splits, and the error on each split is averaged in order to compute a robust estimate of the test error.
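A sketch of this monthly forward-chaining (rolling-origin) splitter; extract_month is a hypothetical column name, and the minimum of two training months is chosen to match the six-months/four-splits example above.

```python
import pandas as pd

def monthly_forward_chaining(df, month_col="extract_month", min_train_months=2):
    """Yield (train_index, test_index): each month is tested after training on all earlier ones."""
    months = sorted(df[month_col].unique())
    for m in months[min_train_months:]:
        yield df.index[df[month_col] < m], df.index[df[month_col] == m]

# Example: 6 months of data produce 4 train/test splits.
data = pd.DataFrame({"extract_month": ["2018-0%d" % m for m in range(1, 7)] * 10})
for train_idx, test_idx in monthly_forward_chaining(data):
    print(len(train_idx), "train rows -> test month",
          data.loc[test_idx, "extract_month"].iloc[0])
```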
4.5.1 Modelling

A hyperparameter is a parameter whose value is set before the learning process begins; by contrast, the values of the model's parameters are derived via training. Depending on the trained model, the importance of these hyperparameters varies with respect to performance. For the logistic regression model, the regularization strength λ will be tuned. The scikit-learn documentation on the random forest indicates that the most important settings are the number of trees in the forest (n_estimators) and the number of features considered for splitting at each node (max_features); these are the ones that will be tuned using grid search, as sketched below.
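The grids below are illustrative, not the values searched in the thesis, and recall is used as the scoring metric in line with section 4.6.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

X, y = make_classification(n_samples=3000, weights=[0.9], random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],        # number of trees in the forest
    "max_features": ["sqrt", "log2", 0.3],  # features considered at each split
}
# TimeSeriesSplit keeps every test fold chronologically after its training fold,
# provided the rows are sorted by extract month.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="recall", cv=TimeSeriesSplit(n_splits=4))
search.fit(X, y)
print(search.best_params_)  # for logistic regression, C = 1/lambda would be tuned instead
```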
4.6 Evaluation
The models will be assessed with the metrics presented in section 3.2: accuracy, precision, recall, F-score, AUC and precision-recall curves. However, the focus will be put on producing a satisfying recall score, and then on aiming for as high a precision score as possible. This reasoning is based on the studied business problem: Aircall's main motivation is to detect the maximum number of churners, that is, to keep the number of false negatives low. On the other hand, it is much more acceptable to classify non-churners as churners, given that the action taken towards an expected churner only results in an extra phone call. Therefore, a low precision, even if not optimal, is much less of a concern than maximizing the recall.
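One way to operationalize this recall-first policy is to move the classification threshold; a sketch, assuming a fitted classifier and a hypothetical recall target of 0.8:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, proba)
# recall decreases as the threshold rises, so take the highest threshold
# that still meets the recall target; precision then follows as it may.
target_recall = 0.8
ok = recall[:-1] >= target_recall
threshold = thresholds[ok][-1] if ok.any() else 0.5
y_pred = (proba >= threshold).astype(int)
print(f"threshold={threshold:.2f}, precision there={precision[:-1][ok][-1]:.2f}")
```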
Chapter 5
Results and Analysis
In this chapter, the results of the different models and methods presented in Chapter 4 are detailed and analyzed. These results consist of different evaluation metrics as well as some plots that give a more thorough understanding of the results. Because the focus of the thesis was put on handling a very imbalanced, small data set, the results are presented so as to highlight the effect of the processing techniques more than the choice of the model itself. Therefore, the first part gives the performance of the two studied models (Logistic Regression and Random Forest) without any resampling technique. The second part highlights the effect of oversampling and undersampling methods on the models, and the final part presents the effects of time-series cross-validation.
Figure 5.3: Distributions of a batch of features. Top: churn rate breakdown. Bottom: total number of samples (blue) and number of samples with target = 1 (red).
Confusion Matrix
The model has an accuracy of 99.55%, but the confusion matrix shows that none of the true churners have been detected. As intuitively presumed, the model is strongly overfitting and therefore always predicts the negative class to reach the highest accuracy. Because of the class imbalance, using a random forest model, even though it tends to overfit less than a linear model, is not sufficient to prevent overfitting.
5.3 Resampling
5.3.1 Oversampling: Random oversampling
The performances of the logistic regression and random forest models trained on the oversampled data set (using random oversampling) are displayed below.
Figure 5.8: Confusion matrix - Logistic Regression (left) Random Forest (right) -
Random Oversampling
Figure 5.9: Precision, Recall, F1 - Logistic Regression (left) Random Forest (right)
- Random Oversampling
Random oversampling helps increase recall but gives almost zero precision, which leads to a low F1-score; there is no trade-off when recall becomes the only performance metric being maximized. It can be noticed that Logistic Regression performs a bit better than Random Forest in terms of F1-score, but Random Forest finds a better trade-off between precision and recall.
Figure 5.10: ROC Curve - Logistic Regression (left) Random Forest (right) - Ran-
dom Oversampling
Figure 5.11: Confusion matrix - Logistic Regression (left) Random Forest (right) -
SMOTE
SMOTE performs much like Random Oversampling: logistic regression maximizes recall and random forest maximizes precision, without either finding a good trade-off between the two. There is no clear improvement in performance compared to Random Oversampling.
Figure 5.12: Precision, Recall, F1 - Logistic Regression (left) Random Forest (right)
- SMOTE
Figure 5.13: ROC Curve - Logistic Regression (left) Random Forest (right) - SMOTE
Figure 5.14: Confusion matrix - Logistic Regression (left) Random Forest (right) -
SMOTE-Tomek
Figure 5.15: Precision, Recall, F1 - Logistic Regression (left) Random Forest (right)
- SMOTE-Tomek
Figure 5.16: ROC Curve - Logistic Regression (left) Random Forest (right) -
SMOTE-Tomek
Figure 5.18: ROC Curve and Precision-Recall Curve - Random Forest - Cross Validation + SMOTE-Tomek
Figure 5.19: Feature Importance (left) and Probability distribution (right) - Random Forest - Cross Validation + SMOTE-Tomek
The feature importance graph highlights that three features predominantly drive the prediction model: number of users, number of integrations and call quality. The probability plot is also interesting to look at: even if the performance of the model in terms of recall and precision is still poor, the plot shows that samples classified as churners with a probability higher than 0.6 are twice as likely to be real churners as the other ones. This suggests that the model could be of use if we only consider the samples with the highest probability of churning, even if that represents only a few customers.
Chapter 6
Discussion
In this chapter the general findings of this thesis are discussed and analyzed.
order to be able to predict churn with enough accuracy. This consideration is also supported by the fact that the customer database is poor, which increases the risk of discrepancies between customers' behaviours [22]. In addition, Aircall's customers are of various types, but these types are not yet identified. It would thus be interesting to first perform customer segmentation [23], and then adapt the churn prediction to the different segments of customers. Another improvement would be to include more data related to product usage, in addition to the deterministic data that remains constant regardless of consumption.
Figure 6.1: Example of a sequential data structure where, for a customer X, τ is the number of months and d is the number of features.
Chapter 7
Conclusions
In this chapter the research question is answered and the business and academic contributions are presented.
Having completed the analysis presented in this thesis report, the research question can be answered: to what extent can statistical analysis and machine learning algorithms be used to highlight churn drivers for a B2B company with little data available, so that churn can be explained? All the models are unable to predict churn well when both precision and recall are taken into account. However, they enable a better understanding of churn in the context of the business the thesis was conducted for. Some metrics appear to play a strong role in the classification of a customer as a churner, which gives a solid base for further investigation. These results are valuable insights for Aircall and offer a better understanding of its customer base, as well as a foundation for creating improved customer relations.
There is a constant exchange of information and collaboration between the Customer Relationship Management (CRM) team and the Data team at Aircall. Being able to customize the relationships with customers is important, and the insights that this thesis brings will hopefully contribute further to Aircall's customer relations. Furthermore, this thesis offers a foundation for the Data team to build upon when continuing the churn analysis. The scientific contribution of this thesis lies in the definition and design of the churn model. Based on the studied literature, not many scientific articles aim to detect churn for subscription-based B2B companies using historical temporal data. Moreover, this thesis focuses on a certain type of customer, representing the smallest businesses. Because customers that pay the most for the product are less likely to churn, and easier to follow up with, it could be more important for a business to focus on predicting churn for less mature customers that might not use the product to its full potential. This is especially important for companies operating in competitive and digital markets, where the same kind of service
Bibliography
[11] L. Breiman et al. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984. ISBN: 9780412048418.
[12] L. Breiman. "Random Forests". In: Machine Learning 45.1 (2001), pp. 5-32. DOI: 10.1023/A:1010933404324.
[13] Abhishek Sharma. Confusion Matrix in Machine Learning. URL: https://www.geeksforgeeks.org/confusion-matrix-machine-learning/.
[14] Tom Fawcett. "An introduction to ROC analysis". In: Pattern Recognition Letters 27 (June 2006), pp. 861-874. DOI: 10.1016/j.patrec.2005.10.010.
[15] Jason Brownlee. Random Oversampling and Undersampling for Imbalanced Classification. URL: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/.
[16] Dataman. Using Over-Sampling Techniques for Extremely Imbalanced Data. URL: https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-part-ii-over-sampling-d61b43bc4879.
[17] Nitesh Chawla et al. "SMOTE: Synthetic Minority Over-sampling Technique". In: J. Artif. Intell. Res. (JAIR) 16 (Jan. 2002), pp. 321-357. DOI: 10.1613/jair.953.
[18] Guilherme Dinis Chaliane Junior. "Churn Analysis in a Music Streaming Service". MA thesis. KTH Royal Institute of Technology, 2017.
[19] Filip Stojanovski. "Churn Prediction using Sequential Activity Patterns in an On-Demand Music Streaming Service". MA thesis. KTH Royal Institute of Technology, 2017.
[20] Haibo He and Yunqian Ma. Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley, 2013. ISBN: 978-1-118-07462-6.
[21] Len Tashman. "Out-of-sample tests of forecasting accuracy: a tutorial and review". In: International Journal of Forecasting 16 (Jan. 2000).
[22] Kajsa Barr and Hampus Pettersson. "Predicting and Explaining Customer Churn for an Audio/ebook Subscription Service using Statistical Analysis and Machine Learning". MA thesis. KTH Royal Institute of Technology, 2019.
[23] Dalia Kandeil, Amani Saad, and Sherin Youssef. "A Two-Phase Clustering Analysis for B2B Customer Segmentation". In: Proceedings - 2014 International Conference on Intelligent Networking and Collaborative Systems, IEEE INCoS 2014 (Mar. 2015), pp. 221-228. DOI: 10.1109/INCoS.2014.49.
[24] Andreas Brynolfsson Borg. "Non-Contractual Churn Prediction with Limited User Information". MA thesis. KTH Royal Institute of Technology, 2019.