ADA Assignment - Final - 2024
ASSIGNMENT
Instructions to candidates
This question paper consists of five (5) printed pages, excluding the cover page.
Part A [42 marks]
The dataset for this question (churn_real.xlsx) is provided in the Microsoft Teams files under the
ADA module space. It contains variables that have a bearing on predicting whether a
customer is likely to churn from a telecommunication service. In each case, write Python code
snippets to achieve the following:
b) Load the dataset into a Python data frame and check its contents. [3 marks]
h) Create a set of count plot visualizations for each categorical variable in the dataset
against churn labels, ensuring each plot includes a legend and customized titles, with
the entire collection of plots organized into a grid layout and each bar labeled with its
count. [6 marks]
i) Display the frequency and percentage of each churn reason in the dataset, organized
in a DataFrame that shows the count alongside the share of each reason, with the
shares formatted as percentages. [4 marks]
j) Reflect on the insights provided by the top 3 churn influencers from the results in i) above.
What interventions could be put in place to address these findings? [6 marks]
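
A minimal sketch of one possible approach to i), shown on a small made-up frame rather than churn_real.xlsx. The column name "Churn Reason" and the toy values are assumptions for illustration only; the real column name should be checked after loading the data.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are illustrative, not real results.
# The column name "Churn Reason" is an assumption.
df = pd.DataFrame(
    {"Churn Reason": ["Price", "Price", "Support", None, "Coverage", "Price"]}
)

# Frequency of each reason (missing values are dropped by default).
counts = df["Churn Reason"].value_counts()

# Share of each reason among customers who gave a reason.
percent = counts / counts.sum() * 100

# Assemble a tidy table with the percentage rendered as a string like "60.0%".
reason_table = (
    pd.DataFrame({"Count": counts, "Percentage": percent.map("{:.1f}%".format)})
    .rename_axis("Churn Reason")
    .reset_index()
)
print(reason_table)
```

Because `value_counts()` sorts in descending order, the first three rows of the table are the top 3 reasons referred to in j).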
Part B [21 marks]
Data preprocessing
a) Display the list of the variables alongside their total missing values. [2 marks]
b) Using the most frequently occurring value (mode), perform imputation to deal with the
missing values in the “Total Charges” column. [3 marks]
c) Analyze the skewness of the numeric columns in the dataset, excluding Latitude,
Longitude, Churn Value, and Churn Score. Visualize each column's relationship
with 'Churn Score' using regression plots, compare the distributions by 'Churn Label'
using KDE plots, and create box plots to assess the spread and identify
outliers in these numeric columns. [8 marks]
d) Giving examples from the plots, identify any outliers in any of the variables.
[2 marks]
e) Reduce the skewness of any variable whose skewness exceeds 0.8. [6 marks]
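
The preprocessing steps in a), b) and e) could be sketched as follows. This is an illustration under stated assumptions, not a model answer: the frame is a toy stand-in with made-up values, and the log transform (`np.log1p`) is only one of several possible ways to reduce right skew. If "Total Charges" is read from the Excel file as text, it may first need converting with `pd.to_numeric(..., errors="coerce")`.

```python
import numpy as np
import pandas as pd

# Toy stand-in; column names follow the question text, values are illustrative.
df = pd.DataFrame(
    {
        "Tenure Months": [2, 10, 20, 30, 40, 48],
        "Total Charges": [20.0, 20.0, np.nan, 400.0, 1500.0, 8000.0],
    }
)

# a) Total missing values per variable.
missing = df.isnull().sum()
print(missing)

# b) Mode imputation: mode() returns a Series, so take its first entry.
mode_value = df["Total Charges"].mode()[0]
df["Total Charges"] = df["Total Charges"].fillna(mode_value)

# e) Find columns whose skewness exceeds 0.8 and apply log1p to them.
skew_before = df.skew(numeric_only=True)
high_skew = skew_before[skew_before > 0.8].index
transformed = df.copy()
transformed[high_skew] = np.log1p(transformed[high_skew])
skew_after = transformed.skew(numeric_only=True)

print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

Comparing the "before" and "after" columns shows whether the transform actually moved the skewness toward zero; if it did not, a different transform (for example a square root or Box-Cox) may be worth trying.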
Part C [12 marks]
Hypothesis Testing
a) Test the following hypotheses:
i) Null Hypothesis (H0): There is no significant relationship between Phone Service
and Churn. Alternative Hypothesis (H1): There is a significant relationship
between Phone Service and Churn.
ii) Null Hypothesis (H0): The type of contract does not affect the likelihood of
churn. Alternative Hypothesis (H1): The type of contract significantly influences
the likelihood of churn.
iii) Null Hypothesis (H0): Senior Citizen status does not affect the likelihood of churn.
Alternative Hypothesis (H1): Senior Citizen status significantly influences the
likelihood of churn.
[6 marks]
b) From the outcomes of the hypothesis tests, what management conclusions can be
reached? [6 marks]
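
Since each hypothesis in a) relates two categorical variables, one common choice is a chi-square test of independence. The paper does not name a test, so this choice is an assumption. The sketch below uses made-up data, a hypothetical helper `chi_square_test`, and an assumed significance level of 0.05; the column names "Contract" and "Churn Label" should be checked against the real file.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy stand-in; the counts are invented for illustration, not real results.
df = pd.DataFrame(
    {
        "Contract": ["Month-to-month"] * 40 + ["Two year"] * 40,
        "Churn Label": ["Yes"] * 30 + ["No"] * 10 + ["Yes"] * 5 + ["No"] * 35,
    }
)


def chi_square_test(data, feature, target="Churn Label", alpha=0.05):
    """Chi-square test of independence between `feature` and `target`."""
    # Contingency table of observed counts.
    table = pd.crosstab(data[feature], data[target])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    # Reject H0 (no relationship) when the p-value falls below alpha.
    return {
        "chi2": chi2,
        "p_value": p_value,
        "dof": dof,
        "reject_h0": bool(p_value < alpha),
    }


result = chi_square_test(df, "Contract")
print(result)
```

The same helper can be called with "Phone Service" or "Senior Citizen" as the feature. Rejecting H0 indicates an association only; it does not by itself show that the feature causes churn, which matters for the conclusions in b).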
Part D [25 marks]
Predictive model development
a) Perform data normalization on the ‘Tenure Months’, ‘Monthly Charges’ and the ‘Total
Charges’ columns. [3 marks]
b) Split the dataset into a training set and a test set using an appropriate proportion.
[5 marks]
c) Compile and train a Logistic Regression model with appropriate hyperparameters.
[6 marks]
d) Print a classification report from the model. [4 marks]
e) Print the confusion matrix from the model. [3 marks]
f) Interpret the confusion matrix. [4 marks]
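
Steps a) to e) could be sketched end to end as below. Everything here is illustrative: the data is synthetic (generated from a made-up rule), the 80/20 split and the hyperparameters `C=1.0`, `max_iter=1000` and `class_weight="balanced"` are assumptions rather than prescribed values, and Min-Max scaling is only one of several normalization options. The scaler is fitted on the training split only, a design choice that avoids leaking test-set information.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real dataset; values and the churn rule are made up.
rng = np.random.default_rng(0)
n = 200
tenure = rng.integers(1, 73, size=n)
monthly = rng.uniform(20, 120, size=n)
df = pd.DataFrame(
    {
        "Tenure Months": tenure,
        "Monthly Charges": monthly,
        "Total Charges": tenure * monthly,
        "Churn Value": (tenure < 24).astype(int),  # invented rule
    }
)
features = ["Tenure Months", "Monthly Charges", "Total Charges"]

# b) 80/20 split, stratified so both sets keep the same churn ratio.
X_train, X_test, y_train, y_test = train_test_split(
    df[features],
    df["Churn Value"],
    test_size=0.2,
    stratify=df["Churn Value"],
    random_state=42,
)

# a) Min-Max normalization, fitted on the training data only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# c) Train the logistic regression model.
model = LogisticRegression(C=1.0, max_iter=1000, class_weight="balanced")
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# d) and e) Classification report and confusion matrix.
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
```

For f), scikit-learn lays out the binary confusion matrix with true labels as rows and predicted labels as columns, so `cm.ravel()` yields true negatives, false positives, false negatives and true positives in that order. In a churn setting, false negatives (churners predicted to stay) are usually the costlier error.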
Marking key
a) Libraries 4
b) Dataset loading 3
d) Descriptive statistics 3
f) Insights summaries 6
g) Pie chart 2
c) Skewness 8
d) Outlier detection 2
e) Skewness improvement 6
a) Hypothesis testing 6
b) Management conclusions 6
a) Data Normalization 3
c) LR model training 6
d) Classification report 4
e) Confusion matrix 3