machine-learning-module-3-logistic-regression

Univariate Logistic Regression

Finding the Best Fit Sigmoid Curve - I


Likelihood
Now, let’s say that for the ten points in our example, the labels are as follows:

Point no. 1 2 3 4 5 6 7 8 9 10
Diabetes no no no yes no yes yes yes yes yes
In this case, the likelihood would be equal to:

(1 − P1)(1 − P2)(1 − P3)(1 − P5)(P4)(P6)(P7)(P8)(P9)(P10) ✓ Correct
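
As a rough illustration of the product above, the sketch below multiplies P for every 'yes' point and (1 − P) for every 'no' point. The labels come from the table; the predicted probabilities in `probs` are made-up values used only for this example and are not taken from the lecture.

```python
from math import prod

def likelihood(labels, probs):
    """Product of P_i for 'yes' points and (1 - P_i) for 'no' points."""
    return prod(p if label == "yes" else 1 - p
                for label, p in zip(labels, probs))

# Labels for points 1..10, as in the table above.
labels = ["no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"]
# Hypothetical predicted probabilities of diabetes (assumed, not from the source).
probs = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.8, 0.9, 0.9]

print(likelihood(labels, probs))
```

A better-fitting sigmoid curve would assign probabilities that make this product larger.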

Odds and Log Odds


Log Odds
So, let’s say that the equation for the log odds is:

ln(P / (1 − P)) = −13.5 + 0.06x

For x = 220, the log odds are equal to -13.5+(0.06*220) = -0.3. For x = 231.5, log odds are equal to:

ans: 0.39

Log Odds
So, let’s say that the equation for the log odds is:

ln(P / (1 − P)) = −13.5 + 0.06x

For x = 220, the log odds are equal to -0.3 and for x = 231.5, the log odds are equal to 0.39. For x = 243, the log
odds are equal to:

ans: 1.08
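
The three answers above can be checked with a short sketch. The intercept −13.5 and slope 0.06 are read off the worked calculation for x = 220; the function names are illustrative, and the conversion to a probability uses the standard sigmoid, which the questions themselves do not ask for.

```python
import math

B0, B1 = -13.5, 0.06  # intercept and slope implied by the worked example

def log_odds(x):
    """Log odds as a linear function of x."""
    return B0 + B1 * x

def probability(x):
    """Invert the log odds with the sigmoid: P = 1 / (1 + e^(-log odds))."""
    return 1 / (1 + math.exp(-log_odds(x)))

for x in (220, 231.5, 243):
    print(x, round(log_odds(x), 2), round(probability(x), 3))
```

Note that equal steps in x (11.5 here) add an equal amount (0.69) to the log odds, which is what makes the log odds linear in x.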

Multivariate Logistic Regression - Model Building


Data Cleaning and Preparation - I
Level counts
In the text above, you saw that for the variable ‘MultipleLines’ the value counts of the levels ‘Yes’, ‘No’, and ‘No
phone service’ are 3390, 2971, and 682 respectively. When you run the same command for the column
‘OnlineBackup’, what will the value count for its level ‘No internet service’ turn out to be?

1526 ✓ Correct

Levels of Dummy Variables


If you check the value counts of the levels ‘OnlineBackup’, ‘OnlineSecurity’, ‘DeviceProtection’, and all the others for
which one of the levels was dropped manually, you can see that the count of the level ‘No internet service’ is the
same for all, i.e. 1526. Can you explain briefly why this has happened?

ans: This happens because the level ‘No internet service’ just tells you whether a user has internet service or not.
Now because the number of users not having an internet service is the same, the count of this level in all of these
variables will be the same. You can also check the value counts of the variable ‘InternetService’ and you’ll see that
the output you’ll get is:

Fiber Optic 3096
DSL 2421
No 1526

Coincidence? No!
This information is already contained in the variable ‘InternetService’ and hence, the count will be the same in all
the variables with the level ‘No internet service’. This is actually also the reason we chose to drop this particular
level.

Data Cleaning and Preparation - II


Standardising Variables
In a dataset with mean 50 and standard deviation 12, what will be the value of a variable with an initial value of 20
after you standardise it?

1.9

-1.9

2.5

-2.5 ✓ Correct

Standardising the train and test sets


As Rahim mentioned in the lecture, you use 'fit_transform' on the train set but just 'transform' on the test set. Recall
you had learnt this in linear regression as well. Why do you think this is done?

Suggested Answer
The 'fit_transform' command first fits the scaler on the data, i.e. it learns the mean and standard deviation of each
variable, and then scales all the variables to have a mean of 0 and a standard deviation of 1 using:

x_scaled = (x − mean) / (standard deviation)

Now, when you move on to the test set, you don't want the scaler to learn anything new. You want to reuse the mean
and standard deviation that were learnt when you used fit on the train dataset. This is why you don't apply 'fit' on
the test data, just 'transform'.
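
The idea can be sketched without any library. `SimpleScaler` below is a hypothetical stand-in written for this example, not a real library class, and the train and test values are made up. It only shows that the mean and standard deviation are learnt once, on the train values, and then reused unchanged on the test values.

```python
from statistics import mean, pstdev

class SimpleScaler:
    """Hypothetical stand-in for a standard scaler; not a library class."""

    def fit(self, values):
        # Learn the centring and scaling constants from the given data.
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the constants learnt in fit(); nothing new is learnt here.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

train = [38, 50, 62]  # assumed example values
test = [20, 74]       # assumed example values

scaler = SimpleScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train mean and std
print(train_scaled, test_scaled)
```

If `fit` were called again on the test values, the two sets would be scaled with different constants and would no longer be comparable.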

Building your First Model


Correlation Table
Which of the following commands can be used to view the correlation table for the dataframe telecom?

telecom.corr() ✓ Correct

Checking Correlations
Take a look at the heatmap provided above. Which of the variables have the highest correlation between them?

StreamingTV_Yes and StreamingMovies_Yes

StreamingTV_No and StreamingMovies_No

MultipleLines_No and MultipleLines_Yes ✓ Correct

Significant Variables

Which of the following variables are insignificant as of now based on the summary statistics above? (More than
one option may be correct.)

Note: Use the p-value to determine the insignificant variables.

PhoneService ✓ Correct

MultipleLines_Yes

TechSupport_Yes ✓ Correct

Negatively Correlated Variables


Which of the following variables are negatively correlated with the target variable based on the summary statistics
given above? (More than one option may be correct.)

tenure ✓ Correct

TotalCharges

MonthlyCharges ✓ Correct

p-values
After learning the coefficients of each variable, the model also produces a ‘p-value’ for each coefficient. Fill in the
blanks so that the statement is correct:

“The null hypothesis is that the coefficient is __. If the p-value is small, you can say that the coefficient is significant
and hence the null hypothesis ____.”

zero, can be rejected ✓ Correct


Feature Elimination using RFE
Threshold Value
You saw that Rahim chose a cut-off of 0.5. What can be said about this threshold?

It was arbitrarily chosen by us, i.e. there’s nothing special about 0.5. We could have chosen something else as well.
✓ Correct

Significance based on RFE

Based on the RFE output shown above, which of the variables is least significant?

OnlineBackup_Yes

Partner

gender_Male ✓ Correct

Churn based on Threshold


Suppose the following table shows the predicted values for the probabilities for 'Churn'. Assuming you chose an
arbitrary cut-off of 0.5 wherein a probability of greater than 0.5 means the customer would churn and a
probability of less than or equal to 0.5 means the customer wouldn't churn, which of these customers do you think
will churn? (More than one option may be correct.)

Customer Probability(Churn)
A 0.45
B 0.67
C 0.98
D 0.49
E 0.03

B ✓ Correct

C ✓ Correct
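
A minimal sketch of applying the cut-off to the table above. The probabilities are copied from the table; the function name `predict_churn` is illustrative.

```python
# Predicted churn probabilities from the table above.
probabilities = {"A": 0.45, "B": 0.67, "C": 0.98, "D": 0.49, "E": 0.03}

def predict_churn(probs, cutoff=0.5):
    """Return the customers whose probability is strictly above the cut-off."""
    return [name for name, p in probs.items() if p > cutoff]

print(predict_churn(probabilities))
```

Lowering the cut-off, for example `predict_churn(probabilities, 0.4)`, labels more customers as churners.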

Confusion Matrix and Accuracy


Confusion Matrix and Accuracy
Given the confusion matrix below, can you tell how many 'Churns' were correctly identified, i.e. if the person has
actually churned, it is predicted as a churn?
Actual/Predicted Not Churn Churn
Not Churn 80 30
Churn 20 70

80

30

20

70 ✓ Correct

Calculating Accuracy
From the confusion matrix you saw in the last question, compute the accuracy of the model.

Actual/Predicted Not Churn Churn
Not Churn 80 30
Churn 20 70

70%

75% ✓ Correct

Confusion Matrix
Suppose you built a logistic regression model to predict whether a patient has lung cancer or not and you get the
following confusion matrix as the output.

Actual/Predicted No Yes
No 400 100
Yes 50 150
How many of the patients were wrongly identified as a 'Yes'?

400

100 ✓ Correct

Confusion Matrix
Take a look at the table again.

Actual/Predicted No Yes
No 400 100
Yes 50 150
How many of these patients were correctly labelled, i.e. if the patient had lung cancer it was actually predicted as a
'Yes' and if they didn't have lung cancer, it was actually predicted as a 'No'?

150

400

500

550 ✓ Correct

Accuracy Calculation
From the table you used for the last two questions, what will be the accuracy of the model?

Actual/Predicted No Yes
No 400 100
Yes 50 150

57.14%

64.29%

71.43%

78.57% ✓ Correct
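
The accuracy answers in this segment follow from one formula: correct predictions divided by all predictions. A small sketch, with the four cells passed in the order true negatives, false positives, false negatives, true positives (the argument order is a choice made for this example):

```python
def accuracy(tn, fp, fn, tp):
    """Share of all predictions that match the actual label."""
    return (tn + tp) / (tn + fp + fn + tp)

# Churn matrix: rows are actual, columns are predicted.
print(round(accuracy(tn=80, fp=30, fn=20, tp=70) * 100, 2))
# Lung cancer matrix.
print(round(accuracy(tn=400, fp=100, fn=50, tp=150) * 100, 2))
```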

Manual Feature Elimination


Multivariate Logistic Regression (Variable Selection)
Based on the above information, what can you say about the log odds of these two customers?

PS: Recall the log odds for univariate logistic regression was given as:

ln(P / (1 − P)) = β0 + β1x

log odds (customer A) < log odds (customer B)

log odds (customer A) = log odds (customer B)

log odds (customer A) > log odds (customer B) ✓ Correct

Multivariate Logistic Regression (Variable Selection)


Now, what can you say about the odds of churn for these two customers?

For customer A, the odds of churning are lower than for customer B

For customer A, the odds of churning are equal to those for customer B

For customer A, the odds of churning are higher than for customer B ✓ Correct

Multivariate Logistic Regression - Log Odds


Now, suppose two customers, customer C and customer D, are such that their behaviour is exactly the same,
except for the fact that customer C has OnlineSecurity, while customer D does not. What can you say about the
odds of churn for these two customers?
For customer C, the odds of churning are lower than for customer D ✓ Correct

Graded Questions
Logistic Regression in Python
Which of these methods is used for fitting a logistic regression model using statsmodels?

OLS()

GLM() ✓ Correct

Confusion Matrix
Given the following confusion matrix, calculate the accuracy of the model.

Actual/Predicted Nos Yeses
Nos 1000 50
Yeses 250 1200

96%

88% ✓ Correct

Diabetic based on Threshold


Suppose you are building a logistic regression model to determine whether a person has diabetes or not. Following
are the values of predicted probabilities of 10 patients.

6 ✓ Correct

Log Odds
Suppose you are working for a media services company like Netflix. They're launching a new show called 'Sacred
Games' and you are building a logistic regression model which will predict whether a person will like it or not based
on whether consumers have liked/disliked some previous shows. You have the data of five of the previous shows
and you're just using the dummy variables for these five shows to build the model. If the variable is 1, it means that
the consumer liked the show and if the variable is zero, it means that the consumer didn't like the show. The
following table shows the values of the coefficients for these five shows that you got after building the logistic
regression model.

Variable Name Coefficient Value
TrueDetective_Liked 0.47
ModernFamily_Liked -0.45
Mindhunter_Liked 0.39
Friends_Liked -0.23
Narcos_Liked 0.55

Now, you have the data of three consumers Reetesh, Kshitij, and Shruti for these 5 shows indicating whether or
not they liked these shows. This is shown in the table below:

Based on this data, which one of these three consumers is most likely to like the new show 'Sacred Games'?

Reetesh ✓ Correct
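
The comparison boils down to adding up the coefficients of the shows each consumer liked: the intercept is the same for everyone, so the consumer with the largest sum has the highest log odds. The table of likes is not reproduced in this document, so the sketch below uses two made-up viewers (`viewer_1`, `viewer_2`) purely for illustration; only the coefficients come from the table above.

```python
# Coefficients from the table above.
coefficients = {
    "TrueDetective_Liked": 0.47,
    "ModernFamily_Liked": -0.45,
    "Mindhunter_Liked": 0.39,
    "Friends_Liked": -0.23,
    "Narcos_Liked": 0.55,
}

def log_odds_contribution(liked):
    """Sum of coefficient * dummy value; the shared intercept is left out."""
    return sum(coef * liked.get(show, 0) for show, coef in coefficients.items())

# Hypothetical viewers (assumed data, not the consumers from the question).
viewers = {
    "viewer_1": {"TrueDetective_Liked": 1, "Narcos_Liked": 1},
    "viewer_2": {"ModernFamily_Liked": 1, "Friends_Liked": 1, "Mindhunter_Liked": 1},
}

most_likely = max(viewers, key=lambda name: log_odds_contribution(viewers[name]))
print(most_likely)
```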

Multivariate Logistic Regression - Model Evaluation


Metrics Beyond Accuracy: Sensitivity & Specificity
False Positives
What is the number of False Positives for the model given below?

Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150

400

100 ✓ Correct

Sensitivity
Sensitivity is defined as the ratio of the number of correctly predicted positives to the total number of actual
positives, i.e.

Sensitivity = TP / (TP + FN)

What is the sensitivity of the following model?

Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150

60%

75% ✓ Correct

Evaluation Metrics
Among the three metrics that you've learnt about, which one is the highest for the model below?
Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150

Accuracy

Sensitivity

Specificity ✓ Correct
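
A short sketch comparing the three metrics for this matrix. The cell names follow the usual convention with 'Churn' as the positive class; the variable names are illustrative.

```python
# Cells of the matrix above, with 'Churn' as the positive class.
tn, fp, fn, tp = 400, 100, 50, 150

metrics = {
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
    "sensitivity": tp / (tp + fn),   # correctly predicted churns / actual churns
    "specificity": tn / (tn + fp),   # correctly predicted non-churns / actual non-churns
}

highest = max(metrics, key=metrics.get)
print(metrics, highest)
```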

Sensitivity and Specificity in Python


False Negatives
What is the number of False Negatives for the model given below?

Actual/Predicted Not Churn Churn
Not Churn 80 40
Churn 30 50

80

40

30 ✓ Correct

Specificity
Specificity is defined as the ratio of the number of correctly predicted negatives to the total number of actual
negatives, i.e.

Specificity = TN / (TN + FP)

What is the approximate specificity of the following model?

Actual/Predicted Not Churn Churn
Not Churn 80 40
Churn 30 50

60%

67% ✓ Correct

Evaluation Metrics
Which among accuracy, sensitivity, and specificity is the highest for the model below?

Actual/Predicted Not Churn Churn
Not Churn 80 40
Churn 30 50

Accuracy

Sensitivity

Specificity ✓ Correct

Other Metrics
In the code, you saw Rahim evaluate some other metrics as well. These were the False Positive Rate, the Positive
Predictive Value, and the Negative Predictive Value.

As you can see, the 'False Positive Rate' is basically (1 - Specificity). Check the formula and the values in the code
to verify.
The positive predictive value is the number of positives correctly predicted divided by the total number of positives
predicted. This is also known as 'Precision', which you'll learn more about soon.
Similarly, the negative predictive value is the number of negatives correctly predicted divided by the total number of
negatives predicted. There's no particular term for this as such.
Calculate the given three metrics for the model below and identify which one is the largest among them.

Negative Predictive Value ✓ Correct

Understanding ROC Curve


TPR and FPR
Given the following confusion matrix, calculate the value of True Positive Rate (TPR) and False Positive Rate
(FPR).

Actual/Predicted Not Churn Churn
Not Churn 300 200
Churn 100 400

TPR = 40%

FPR = 80%

TPR = 40%

FPR = 60%

TPR = 80%

FPR = 40% ✓ Correct

True Positive Rate


You have the following table showcasing the actual 'Churn' labels and the predicted probabilities for 5 customers.

Customer Churn Predicted Churn Probability
Thulasi 1 0.52
Aditi 0 0.56
Jaideep 1 0.78
Ashok 0 0.45
Amulya 0 0.22

Calculate the True Positive Rate and False Positive rate for the cutoffs of 0.4 and 0.5. Which of these cutoffs, will
give you a better model?

Note: The good model is the one in which TPR is high and FPR is low.

Cutoff of 0.4

Cutoff of 0.5 ✓ Correct
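
The comparison can be reproduced with a small sketch. The rows are copied from the table above; a customer is labelled 'Churn' when the probability is strictly above the cutoff, which is an assumption consistent with the earlier threshold questions.

```python
# (customer, actual churn, predicted churn probability) from the table above.
customers = [
    ("Thulasi", 1, 0.52),
    ("Aditi", 0, 0.56),
    ("Jaideep", 1, 0.78),
    ("Ashok", 0, 0.45),
    ("Amulya", 0, 0.22),
]

def tpr_fpr(rows, cutoff):
    """Return (TPR, FPR) when probabilities above the cutoff are labelled churn."""
    tp = sum(1 for _, actual, p in rows if actual == 1 and p > cutoff)
    fp = sum(1 for _, actual, p in rows if actual == 0 and p > cutoff)
    positives = sum(1 for _, actual, _p in rows if actual == 1)
    negatives = len(rows) - positives
    return tp / positives, fp / negatives

for cutoff in (0.4, 0.5):
    print(cutoff, tpr_fpr(customers, cutoff))
```

Both cutoffs catch every actual churner, but 0.5 mislabels one non-churner instead of two, so its FPR is lower.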

Changing the Threshold


You initially chose a threshold of 0.5 wherein a churn probability of greater than 0.5 would result in the customer
being identified as 'Churn' and a churn probability of less than 0.5 would result in the customer being identified
as 'Not Churn'.

Now, suppose you decreased the threshold to a value of 0.3. What will be its effect on the classification?

More customers would now be classified as 'Churn'. ✓ Correct

TPR and FPR


Fill in the blanks:

When the value of TPR increases, the value of FPR ______.

increases ✓ Correct

Area Under the Curve


You have the following five AUCs (Area under the curve) for ROCs plotted for five different models. Which of these
models is the best?

Model AUC
A 0.54
B 0.82
C 0.79
D 0.66
E 0.56

B ✓ Correct
ROC Curve in Python
ROC Curve
Following is the ROC curve that you got.

As you can see, when the 'True Positive Rate' is 0.8, the 'False Positive Rate' is about 0.24. What will be the value of
specificity, then?

0.8

0.2

0.76 ✓ Correct

ROC Curve
Which of the following ROC curve represents the best model?

C ✓ Correct

Finding the Optimal Threshold


Choosing the Optimal Cut-off
Suppose you created a dataframe to find out the optimal cut-off point for a model you built. The dataframe looks
like the following:

Threshold Probability Accuracy Sensitivity Specificity
0.0 0.0 0.21 1.00 0.00
0.1 0.1 0.39 0.96 0.22
0.2 0.2 0.56 0.88 0.49
0.3 0.3 0.59 0.81 0.53
0.4 0.4 0.62 0.78 0.63
0.5 0.5 0.74 0.73 0.74
0.6 0.6 0.81 0.64 0.79
0.7 0.7 0.78 0.42 0.83
0.8 0.8 0.63 0.21 0.92
0.9 0.9 0.56 0.03 0.98

Based on the table above, what will the approximate value of the optimal cut-off be?

0.4

0.5 ✓ Correct
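
One simple way to approximate the point where the three curves meet is to pick the cut-off at which accuracy, sensitivity, and specificity are closest to each other. The sketch below does that for the rows of the table above; the "smallest spread" criterion is an assumption made for this example, not necessarily the exact rule used in the lecture.

```python
# Rows copied from the table above: (cutoff, accuracy, sensitivity, specificity).
rows = [
    (0.0, 0.21, 1.00, 0.00),
    (0.1, 0.39, 0.96, 0.22),
    (0.2, 0.56, 0.88, 0.49),
    (0.3, 0.59, 0.81, 0.53),
    (0.4, 0.62, 0.78, 0.63),
    (0.5, 0.74, 0.73, 0.74),
    (0.6, 0.81, 0.64, 0.79),
    (0.7, 0.78, 0.42, 0.83),
    (0.8, 0.63, 0.21, 0.92),
    (0.9, 0.56, 0.03, 0.98),
]

def spread(row):
    """Gap between the largest and smallest of the three metrics."""
    _, *metrics = row
    return max(metrics) - min(metrics)

best_cutoff = min(rows, key=spread)[0]
print(best_cutoff)
```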
Choosing a model evaluation metric
As you learnt, there is usually a trade-off between various model evaluation metrics, and you cannot maximise all
of them simultaneously. For example, if you increase sensitivity (% of correctly predicted churns), the specificity (% of
correctly predicted non-churns) will reduce.

Let's say that you are building a telecom churn prediction model with the business objective that your company
wants to implement an aggressive customer retention campaign to retain the 'high churn-risk' customers. This is
because a competitor has launched extremely low-cost mobile plans, and you want to avoid churn as much as
possible by incentivising the customers. Assume that budget is not a constraint.

Which of the following metrics should you choose to maximise?

Accuracy

Sensitivity ✓ Correct

Model Evaluation Metrics - Exercise


Accuracy of the Model
Using the threshold of 0.3, what is the approximate accuracy of the model now?

72%

77% ✓ Correct

Confusion Matrix
Get the confusion matrix after using the cut-off 0.3. What is the number of 'False Negatives' now?

2793

842

283 ✓ Correct

Sensitivity
In the last question you saw that in the confusion matrix, the Churns are being captured better now. Using the
confusion matrix, can you tell what will the approximate sensitivity of the model now be?

67%

72%

76%

78% ✓ Correct

Precision and Recall


Calculating Precision
Calculate the precision value for the following model.

Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150

60% ✓ Correct

F1-score
There is a measure known as F1-score which essentially combines both precision and recall. It is basically the
harmonic mean of precision and recall and its formula is given by:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The F1-score is useful when you want to look at the performance of precision and recall together.

Calculate the F1-score for the model below:

Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150

33%

67% ✓ Correct
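
The precision and F1 answers for this matrix can be verified with a few lines. Recall here is the same quantity as sensitivity; the function name is illustrative.

```python
def precision_recall_f1(tn, fp, fn, tp):
    """Return (precision, recall, F1) with 'Churn' as the positive class."""
    precision = tp / (tp + fp)   # correct churn predictions / all churn predictions
    recall = tp / (tp + fn)      # correct churn predictions / all actual churns
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tn=400, fp=100, fn=50, tp=150)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```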

Optimal Cut-off
When using the sensitivity-specificity tradeoff, you found out that the optimal cutoff point was 0.3. Now, when you
plotted the precision-recall tradeoff, you got the following curve:

What is the optimal cutoff point according to the curve given above?

0.24

0.42 ✓ Correct

Making Predictions
Calculating Accuracy
Recall that in the last segment you saw that the cutoff based on the precision-recall tradeoff curve was
approximately 0.42. When you take this cut-off, you get the following confusion matrix on the test set.

Actual/Predicted Not Churn Churn
Not Churn 1294 234
Churn 223 359

What will the approximate value of accuracy be on the test set now?

60%

72%

75%

78% ✓ Correct

Calculating Recall
For the confusion matrix you saw in the last question, what will the approximate value of recall be?

Actual/Predicted Not Churn Churn
Not Churn 1294 234
Churn 223 359

62% ✓ Correct

Graded Questions
Calculating Sensitivity
Suppose you got the following confusion matrix for a model by using a cutoff of 0.5.

Actual/Predicted Not Churn Churn
Not Churn 1200 400
Churn 350 1050

Calculate the sensitivity for the model above. Now suppose for the same model, you changed the cutoff from 0.5
to 0.4 such that your number of true positives increased from 1050 to 1190. What will be the change in
sensitivity?

Note: Report the answer in terms of new_value - old_value, i.e. if the sensitivity was, say, 0.6 earlier and then
changed to 0.8, report it as (0.8 - 0.6), i.e. 0.2.

0.05

-0.05

0.1 ✓ Correct
Calculating Precision
Consider the confusion matrix you had in the last question.

Actual/Predicted Not Churn Churn
Not Churn 1200 400
Churn 350 1050

Calculate the values of precision and recall for the model and determine which of the two is higher.

Precision

Recall ✓ Correct

True Positive Rate


Fill in the blanks.

The True Positive Rate (TPR) metric is exactly the same as ______.

Sensitivity ✓ Correct

Threshold
Suppose someone built a logistic regression model to predict whether a person has a heart disease or not. All you
have from their model is the following table which contains data of 10 patients.

Patient ID Heart Disease Predicted Probability for Heart Disease Predicted Label
1001 0 0.34 0
1002 1 0.58 1
1003 1 0.79 1
1004 0 0.68 1
1005 0 0.21 0
1006 0 0.04 0
1007 1 0.48 0
1008 1 0.64 1
1009 0 0.61 1
1010 1 0.86 1

Now, you want to find out the cutoff that was used to predict the classes, but it isn't given. Can you
identify which of the following cutoffs would be a valid cutoff for the model above based on the 10 data points
given in the table? (More than one option may be correct.)

0.50

✓ Correct

0.55

✓ Correct
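
A cutoff is valid if applying it to the predicted probabilities reproduces every predicted label in the table. The sketch below checks that directly; the rule "probability above the cutoff means label 1" is an assumption consistent with the earlier threshold questions, and the candidates 0.45 and 0.60 are added here only for contrast.

```python
# (predicted probability, predicted label) pairs copied from the table above.
predictions = [(0.34, 0), (0.58, 1), (0.79, 1), (0.68, 1), (0.21, 0),
               (0.04, 0), (0.48, 0), (0.64, 1), (0.61, 1), (0.86, 1)]

def is_valid_cutoff(cutoff, rows):
    """True if labelling p > cutoff as 1 reproduces every predicted label."""
    return all((p > cutoff) == bool(label) for p, label in rows)

# 0.45 and 0.60 are hypothetical candidates, not options from the question.
for cutoff in (0.45, 0.50, 0.55, 0.60):
    print(cutoff, is_valid_cutoff(cutoff, predictions))
```

Any cutoff from 0.48 up to, but not including, 0.58 passes this check, because 0.48 is the largest probability labelled 0 and 0.58 is the smallest labelled 1.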
Evaluation Metrics
Consider the same model given in the last question.

Patient ID Heart Disease Predicted Probability for Heart Disease Predicted Label
1001 0 0.34 0
1002 1 0.58 1
1003 1 0.79 1
1004 0 0.68 1
1005 0 0.21 0
1006 0 0.04 0
1007 1 0.48 0
1008 1 0.64 1
1009 0 0.61 1
1010 1 0.86 1

Calculate the values of Accuracy, Sensitivity, Specificity, and Precision. Which of these four metrics is the highest
for the model?

Accuracy

Sensitivity ✓ Correct
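
The four metrics can be counted straight from the actual and predicted labels in the table. A minimal sketch, with the pairs listed in the order of patient IDs 1001 to 1010:

```python
# (actual, predicted) label pairs for patients 1001..1010, from the table above.
pairs = [(0, 0), (1, 1), (1, 1), (0, 1), (0, 0),
         (0, 0), (1, 0), (1, 1), (0, 1), (1, 1)]

tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
tn = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)

metrics = {
    "accuracy": (tp + tn) / len(pairs),
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "precision": tp / (tp + fp),
}

highest = max(metrics, key=metrics.get)
print(metrics, highest)
```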

Logistic Regression - Industry Applications - Part I


Nuances of Logistic Regression - Variable Transformation-II
Woe Analysis
What information would you infer from the woe trend of tenure variable?

As tenure increases, the chances of churning decrease ✓ Correct

Woe Analysis
Choose the correct option:

Coarse binning is required for tenure variable as there is no monotonic trend in fine binning

Coarse binning is not required for tenure variable as there is a clear monotonic trend in fine binning ✓ Correct

Woe Analysis
What does negative woe signify in 'contract' variable (refer sheet-3)?
% of churners (bad customers) are more than % of no-churners (good customers) ✓ Correct

Woe Analysis
Compare the woe trends of both variables (tenure and contract).

Based on the woe trend, which variable when increased in value, might decrease the likelihood of churn?

Tenure

Contract

Both ✓ Correct

Information Value
What is the total information value of both the variables?

Contract = 0.83, Tenure = 1.24

Contract = 1.24 , Tenure = 0.83 ✓ Correct

Information Value
Choose the correct option?

Contract variable has stronger predictive power than tenure ✓ Correct
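
For reference, the sketch below computes weight of evidence and information value for a binned variable. It assumes the common convention WOE = ln(% of good / % of bad) per bin and IV = sum of (% of good − % of bad) × WOE, which matches the sign interpretation above (WOE is negative when a bin holds a larger share of churners than of non-churners). The bin counts are made up for illustration and do not come from the course sheet.

```python
import math

# Hypothetical bins: (label, non-churners "good", churners "bad"). Assumed data.
bins = [("short tenure", 100, 150), ("long tenure", 100, 50)]

def woe_and_iv(rows):
    """Return ({bin label: WOE}, total information value)."""
    total_good = sum(good for _, good, _bad in rows)
    total_bad = sum(bad for _, _good, bad in rows)
    woe, iv = {}, 0.0
    for label, good, bad in rows:
        pct_good = good / total_good   # share of all good customers in this bin
        pct_bad = bad / total_bad      # share of all bad customers in this bin
        woe[label] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv

woe, iv = woe_and_iv(bins)
print(woe, round(iv, 4))
```

Every bin contributes a non-negative amount to the IV, so a variable whose bins separate churners from non-churners more sharply ends up with a higher total.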

Nuances of Logistic Regression - Variable Transformation-III


WOE Missing Value
Woe value for NA bucket is:

0.51

0.41

-0.41

-0.51 ✓ Correct

Missing value
NA bucket can be merged with -
1-1 Bucket

2-2 Bucket

7-9 Bucket

None ✓ Correct

Graded Questions
Logistic Regression
What do you infer from the woe plot of the 'Grade' variable?

As the loan grade varies from A to G, the woe values gradually decrease from +0.99 to -1.09 ✓ Correct

Logistic Regression
Choose the correct option:

Woe graph shows monotonic nature ✓ Correct

Logistic Regression
Information value of the 'Grade' variable is:

0.56

0.43

0.34 ✓ Correct

Logistic Regression: Industry Applications - Part II


Coding Practice (Optional)
Fibonacci Series
Description
Compute and display Fibonacci series upto n terms where n is a positive integer entered
You can go here to read about Fibonacci series.

n = int(input())
# first two terms
n1, n2 = 0, 1
count = 0

# check if the number of terms is valid
if n <= 0:
    print("Please enter a positive integer")
# if there is only one term, print n1
elif n == 1:
    print("Fibonacci sequence upto", n, ":")
    print(n1)
# generate fibonacci sequence
else:
    print("Fibonacci sequence:")
    while count < n:
        print(n1)
        nth = n1 + n2
        # update values
        n1 = n2
        n2 = nth
        count += 1

Prime Numbers
Description
Determine whether a positive integer n is a prime number or not. Assume n>1.
Display “number entered is prime” if n is prime, otherwise display “number entered is n

n = int(input())
out = True
# try every candidate divisor from 2 to n - 1 (n > 1 is assumed)
for i in range(2, n):
    if n % i == 0:
        out = False
        break
if out:
    print("number entered is prime")
else:
    print("number entered is not prime")

Armstrong number
Description
Any number, say n is called an Armstrong number if it is equal to the sum of its digits

n = int(input())
# Python program to check if the number is an Armstrong number or not
# initialize the running total (named total to avoid shadowing the built-in sum)
total = 0

# find the sum of the cube of each digit
temp = n
while temp > 0:
    digit = temp % 10
    total += digit ** 3
    temp //= 10

# display the result
if n == total:
    print(True)
else:
    print(False)

Selecting dataframe columns


Description
Write a program to select all columns of a dataframe except the ones specified.
The input will contain a list of columns that you should skip.
You should print the first five rows of the dataframe as output where the columns are a

import pandas as pd
import ast,sys
df=pd.read_csv("https://media-doselect.s3.amazonaws.com/generic/X0kvr3wEYXRzONE5W37xWWY
input_str = sys.stdin.read()
to_omit = ast.literal_eval(input_str)
#write your code here
df=df[df.columns[~df.columns.isin(to_omit)]] #### check before submit
print(df.loc[:, sorted(list(df.columns))].head())

Two series
Description
Given two pandas series, find the position of elements in series2 in series1.
You can assume that all elements in series2 will be present in series1.
The input will contain two lines with series1 and series2 respectively.
The output should be a list of indexes indicating elements of series2 in series 1.
Note: In the output list, the indexes should be in ascending order.

import ast,sys
import pandas as pd
input_str = sys.stdin.read()
input_list = ast.literal_eval(input_str)
series1=pd.Series(input_list[0])
series2=pd.Series(input_list[1])
# sort so that the positions come out in ascending order, as the description requires
out_list=sorted(pd.Index(series1).get_loc(num) for num in series2)
print(list(map(int,out_list)))#do not alter this step, list must be int type for evalua

Cleaning columns
Description
For the given dataframe, you have to clean the "Installs" column and print its correlat
You have to do the following:
1. Remove characters like ',' from the number of installs.
2. Delete rows where the Installs column has irrelevant strings like 'Free'
3. Convert the column to int type
You can access the dataframe using the following URL in your Jupyter notebook:
https://media-doselect.s3.amazonaws.com/generic/8NMooe4G0ENEe8z9q5ZvaZA7/googleplaystor

import pandas as pd
df=pd.read_csv("https://media-doselect.s3.amazonaws.com/generic/8NMooe4G0ENEe8z9q5ZvaZA

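# ',' and '+' are literal characters here, so disable regex matching
df.Installs = df.Installs.str.replace(',', '', regex=False)
df.Installs = df.Installs.str.replace('+', '', regex=False)
# drop rows where Installs holds an irrelevant string such as 'Free'
df = df[df.Installs != 'Free']
df.Installs = df.Installs.astype(int)

# correlate numeric columns only; recent pandas versions raise an error on string columns
print(df.select_dtypes('number').corr())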
