machine-learning-module-3-logistic-regression
Point no. 1 2 3 4 5 6 7 8 9 10
Diabetes no no no yes no yes yes yes yes yes
In this case, the likelihood would be equal to:
(1−P1)(1−P2)(1−P3)(1−P5)(P4)(P6)(P7)(P8)(P9)(P10) ✓ Correct
For x = 220, the log odds are equal to -13.5+(0.06*220) = -0.3. For x = 231.5, log odds are equal to:
ans: 0.39
Log Odds
So, let’s say that the equation for log odds is:
log(odds) = -13.5 + 0.06x
For x = 220, the log odds are equal to -0.3 and for x = 231.5, the log odds are equal to 0.39. For x = 243, the log
odds are equal to:
ans: 1.08
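To make the arithmetic above concrete, here is a minimal Python sketch (using the intercept -13.5 and slope 0.06 from the question) that computes the log odds and the corresponding probability for each value of x:

import numpy as np

def log_odds(x, beta_0=-13.5, beta_1=0.06):
    # log odds = beta_0 + beta_1 * x, coefficients taken from the question
    return beta_0 + beta_1 * x

def probability(x):
    # the sigmoid converts log odds back to a probability
    return 1 / (1 + np.exp(-log_odds(x)))

for x in [220, 231.5, 243]:
    print(x, round(log_odds(x), 2), round(probability(x), 3))
# prints -0.3, 0.39 and 1.08 as the log odds, with probabilities of roughly 0.43, 0.60 and 0.75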
1526 ✓ Correct
ans: This happens because the level ‘No internet service’ just tells you whether a user has an internet service or not. Since the number of users who don’t have an internet service is fixed, the count of this level will be the same across all of these variables. You can also check the value counts of the variable ‘InternetService’ and you’ll see that the output you get is:
Coincidence? No!
This information is already contained in the variable ‘InternetService’ and hence, the count will be the same in all
the variables with the level ‘No internet service’. This is actually also the reason we chose to drop this particular
level.
1.9
-1.9
2.5
-2.5 ✓ Correct
Suggested Answer
The 'fit_transform' command first fits the data to have a mean of 0 and a standard deviation of 1, i.e. it scales all the variables using:
X_scaled = (X − μ) / σ, where μ and σ are the mean and standard deviation of the variable on the train set.
Now, once this is done, all the variables are transformed using this formula. When you move to the test set, you don't want the variables to learn anything new; you want to use the same centring and scaling that was learnt when you applied 'fit' on the train dataset. This is why you don't apply 'fit' on the test data, just 'transform'.
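As an illustration of this fit/transform split, here is a small sketch assuming scikit-learn's StandardScaler and made-up train/test arrays (the course notebook's actual variable names may differ):

import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up numeric data standing in for the train/test split
X_train = np.array([[10.0], [20.0], [30.0], [40.0]])
X_test = np.array([[25.0], [35.0]])

scaler = StandardScaler()
# fit_transform on the train set: learn the mean and std, then scale
X_train_scaled = scaler.fit_transform(X_train)
# transform on the test set: reuse the train mean/std, no new fitting
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.ravel(), X_test_scaled.ravel())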
telecom.corr() ✓ Correct
Checking Correlations
Take a look at the heatmap provided above. Which of the variables have the highest correlation between them?
PhoneService ✓ Correct
MultipleLines_Yes
TechSupport_Yes ✓ Correct
tenure ✓ Correct
TotalCharges
MonthlyCharges ✓ Correct
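For reference, a correlation heatmap like the one this question refers to is typically drawn as in the sketch below; the small random dataframe is only a stand-in for the course's 'telecom' dataframe:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# stand-in data; the course uses the processed 'telecom' dataframe instead
telecom = pd.DataFrame(np.random.rand(50, 4),
                       columns=["tenure", "MonthlyCharges", "TotalCharges", "Churn"])

plt.figure(figsize=(8, 6))
sns.heatmap(telecom.corr(), annot=True)   # .corr() gives the pairwise correlations
plt.show()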
p-values
After learning the coefficients of each variable, the model also produces a ‘p-value’ for each coefficient. Fill in the blanks so that the statement is correct:
“The null hypothesis is that the coefficient is __. If the p-value is small, you can say that the coefficient is significant and hence the null hypothesis ____.”
It was arbitrarily chosen by us, i.e. there’s nothing special about 0.5. We could have chosen something else as well. ✓ Correct
OnlineBackup_Yes
Partner
gender_Male ✓ Correct
Customer Probability(Churn)
A 0.45
B 0.67
C 0.98
D 0.49
E 0.03
B ✓ Correct
C ✓ Correct
80
30
20
70 ✓ Correct
Calculating Accuracy
From the confusion matrix you saw in the last question, compute the accuracy of the model.
70%
75% ✓ Correct
Confusion Matrix
Suppose you built a logistic regression model to predict whether a patient has lung cancer or not and you get the
following confusion matrix as the output.
Actual/Predicted No Yes
No 400 100
Yes 50 150
How many of the patients were wrongly identified as a 'Yes'?
400
100 ✓ Correct
Confusion Matrix
Take a look at the table again.
Actual/Predicted No Yes
No 400 100
Yes 50 150
How many of these patients were correctly labelled, i.e. if the patient had lung cancer it was actually predicted as a
'Yes' and if they didn't have lung cancer, it was actually predicted as a 'No'?
150
400
500
550 ✓ Correct
Accuracy Calculation
From the table you used for the last two questions, what will be the accuracy of the model?
Actual/Predicted No Yes
No 400 100
Yes 50 150
57.14%
64.29%
71.43%
78.57% ✓ Correct
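A quick way to verify the accuracy arithmetic for the matrix above (actual classes in rows, predicted classes in columns):

# counts read off the confusion matrix above
TN, FP = 400, 100   # actual 'No'
FN, TP = 50, 150    # actual 'Yes'

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(round(accuracy * 100, 2))   # 78.57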
PS: Recall that the log odds for univariate logistic regression were given as:
log(odds) = β0 + β1x
For customer A, the odds of churning are lower than for customer B
For customer A, the odds of churning are equal to those for customer B
For customer A, the odds of churning are higher than for customer B ✓ Correct
Graded Questions
Logistic Regression in Python
Which of these methods is used for fitting a logistic regression model using statsmodels?
OLS()
GLM() ✓ Correct
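As a minimal sketch of fitting a logistic regression with statsmodels' GLM(), using a tiny made-up dataset rather than the course's telecom data:

import numpy as np
import statsmodels.api as sm

# made-up predictor and binary response
X = np.array([[200.0], [210.0], [220.0], [230.0], [240.0], [250.0]])
y = np.array([0, 0, 1, 0, 1, 1])

X = sm.add_constant(X)                               # add the intercept term
model = sm.GLM(y, X, family=sm.families.Binomial())  # binomial family = logistic regression
result = model.fit()
print(result.summary())                              # coefficients and their p-values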
Confusion Matrix
Given the following confusion matrix, calculate the accuracy of the model.
96%
88% ✓ Correct
6 ✓ Correct
Log Odds
Suppose you are working for a media services company like Netflix. They're launching a new show called 'Sacred Games' and you are building a logistic regression model which will predict whether a person will like it or not based on whether consumers have liked/disliked some previous shows. You have the data of five of the previous shows and you're just using the dummy variables for these five shows to build the model. If the variable is 1, it means that the consumer liked the show and if the variable is zero, it means that the consumer didn't like the show. The following table shows the values of the coefficients for these five shows that you got after building the logistic regression model.
Now, you have the data of three consumers, Reetesh, Kshitij, and Shruti, for these 5 shows indicating whether or not they liked these shows. This is shown in the table below:
Based on this data, which one of these three consumers is most likely to like the new show 'Sacred Games'?
Reetesh ✓ Correct
400
100 ✓ Correct
Sensitivity
Sensitivity is defined as the ratio of the number of correctly predicted positives to the total number of actual positives, i.e.
Sensitivity = TP / (TP + FN)
60%
75% ✓ Correct
Evaluation Metrics
Among the three metrics that you've learnt about, which one is the highest for the model below?
Actual/Predicted Not Churn Churn
Not Churn 400 100
Churn 50 150
Accuracy
Sensitivity
80
40
30 ✓ Correct
Specificity
Specificity is defined as the ratio of the number of correctly predicted negatives to the total number of actual negatives, i.e.
Specificity = TN / (TN + FP)
60%
67% ✓ Correct
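Putting the two definitions into code, using the lung-cancer confusion matrix shown earlier (TN = 400, FP = 100, FN = 50, TP = 150) purely as demo numbers; the 60%/67% options above refer to the matrix produced in the course notebook, which isn't reproduced here:

TN, FP, FN, TP = 400, 100, 50, 150   # demo values from the earlier table

sensitivity = TP / (TP + FN)   # correctly predicted positives / actual positives
specificity = TN / (TN + FP)   # correctly predicted negatives / actual negatives
print(round(sensitivity, 2), round(specificity, 2))   # 0.75 0.8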
Evaluation Metrics
Which among accuracy, sensitivity, and specificity is the highest for the model below?
Sensitivity
Other Metrics
In the code, you saw Rahim evaluate some other metrics as well. These were the false positive rate, the positive predictive value and the negative predictive value.
As you can see, the 'False Positive Rate' is basically (1 − Specificity). Check the formula and the values in the code to verify.
The positive predictive value is the number of correctly predicted positives divided by the total number of predicted positives. This is also known as 'Precision', which you'll learn more about soon.
Similarly, the negative predictive value is the number of correctly predicted negatives divided by the total number of predicted negatives. There's no particular term for this as such.
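The same demo confusion matrix used above (TN = 400, FP = 100, FN = 50, TP = 150) can be used to sketch these additional metrics:

TN, FP, FN, TP = 400, 100, 50, 150

false_positive_rate = FP / (FP + TN)   # equals 1 - specificity
positive_pred_value = TP / (TP + FP)   # also called precision
negative_pred_value = TN / (TN + FN)
print(false_positive_rate, positive_pred_value, round(negative_pred_value, 2))
# 0.2 0.6 0.89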
Calculate the given three metrics for the model below and identify which one is the largest among them.
TPR = 40%
FPR = 80%
TPR = 40%
FPR = 60%
TPR = 80%
Calculate the True Positive Rate and False Positive Rate for the cutoffs of 0.4 and 0.5. Which of these cutoffs will give you a better model?
Note: A good model is one in which the TPR is high and the FPR is low.
Cutoff of 0.4
Now, suppose you decreased the threshold to a value of 0.3. What will be its effect on the classification?
increases ✓ Correct
Model AUC
A 0.54
B 0.82
C 0.79
D 0.66
E 0.56
B ✓ Correct
ROC Curve in Python
ROC Curve
Following is the ROC curve that you got.
As you can see, when the 'True Positive Rate' is 0.8, the 'False Positive Rate' is about 0.24. What will be the value of specificity, then?
0.8
0.2
0.76 ✓ Correct
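For reference, the TPR/FPR pairs behind an ROC curve are usually obtained as in the sketch below (the labels and probabilities here are made up, not the course data); specificity at any point on the curve is simply 1 - FPR:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
for t, f, s in zip(thresholds, fpr, tpr):
    print(round(t, 2), round(f, 2), round(s, 2))   # threshold, FPR, TPR
print("AUC:", roc_auc_score(y_true, y_prob))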
ROC Curve
Which of the following ROC curves represents the best model?
C ✓ Correct
Based on the table above, what will the approximate value of the optimal cut-off be?
0.4
0.5 ✓ Correct
Choosing a model evaluation metric
As you learnt, there is usually a trade-off between various model evaluation metrics, and you cannot maximise all of them simultaneously. For example, if you increase sensitivity (% of correctly predicted churns), the specificity (% of correctly predicted non-churns) will reduce.
Let's say that you are building a telecom churn prediction model with the business objective that your company
wants to implement an aggressive customer retention campaign to retain the 'high churn-risk' customers. This is
because a competitor has launched extremely low-cost mobile plans, and you want to avoid churn as much as
possible by incentivising the customers. Assume that budget is not a constraint.
Accuracy
Sensitivity ✓ Correct
72%
77% ✓ Correct
Confusion Matrix
Get the confusion matrix after using the cut-off 0.3. What is the number of 'False Negatives' now?
2793
842
283 ✓ Correct
Sensitivity
In the last question, you saw from the confusion matrix that the churns are now being captured better. Using the confusion matrix, can you tell what the approximate sensitivity of the model will now be?
67
72
76
78 ✓ Correct
60% ✓ Correct
F1-score
There is a measure known as the F1-score which essentially combines both precision and recall. It is basically the harmonic mean of precision and recall, and its formula is given by:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F1-score is useful when you want to look at the performance of precision and recall together.
33%
67% ✓ Correct
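Using the earlier demo confusion matrix (TN = 400, FP = 100, FN = 50, TP = 150), the F1-score works out as follows:

TN, FP, FN, TP = 400, 100, 50, 150

precision = TP / (TP + FP)                         # 0.6
recall = TP / (TP + FN)                            # 0.75
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))                                # 0.67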
Optimal Cut-off
When using the sensitivity-specificity trade-off, you found out that the optimal cutoff point was 0.3. Now, when you plotted the precision-recall trade-off, you got the following curve:
What is the optimal cutoff point according to the curve given above?
0.24
0.42 ✓ Correct
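The precision-recall trade-off curve referred to above is typically produced along these lines (made-up labels and probabilities; the course plot itself isn't reproduced here). The cutoff is usually read off around the point where the precision and recall curves cross:

from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
for t, p, r in zip(thresholds, precision, recall):
    print(round(t, 2), round(p, 2), round(r, 2))   # threshold, precision, recall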
Making Predictions
Calculating Accuracy
Recall that in the last segment you saw that the cutoff based on the precision-recall tradeoff curve was
approximately 0.42. When you take this cut-off, you get the following confusion matrix on the test set.
60%
72%
75%
78% ✓ Correct
Calculating Recall
For the confusion matrix you saw in the last question, what will the approximate value of recall be?
62% ✓ Correct
Graded Questions
Calculating Sensitivity
Suppose you got the following confusion matrix for a model by using a cutoff of 0.5.
Calculate the sensitivity for the model above. Now suppose for the same model, you changed the cutoff from 0.5 to 0.4 such that your number of true positives increased from 1050 to 1190. What will be the change in sensitivity?
Note: Report the answer in terms of new_value - old_value, i.e. if the sensitivity was, say, 0.6 earlier and then
changed to 0.8, report it as (0.8 - 0.6), i.e. 0.2.
0.05
-0.05
0.1 ✓ Correct
Calculating Precision
Consider the confusion matrix you had in the last question.
Calculate the values of precision and recall for the model and determine which of the two is higher.
Precision
Recall ✓ Correct
The True Positive Rate (TPR) metric is exactly the same as ______.
Sensitivity ✓ Correct
Threshold
Suppose someone built a logistic regression model to predict whether a person has a heart disease or not. All you
have from their model is the following table which contains data of 10 patients.
Patient ID Heart Disease Predicted Probability for Heart Disease Predicted Label
1001 0 0.34 0
1002 1 0.58 1
1003 1 0.79 1
1004 0 0.68 1
1005 0 0.21 0
1006 0 0.04 0
1007 1 0.48 0
1008 1 0.64 1
1009 0 0.61 1
1010 1 0.86 1
Now, you wanted to find out the cutoff based on which the classes were predicted, but you can't. But can you identify which of the following cutoffs would be a valid cutoff for the model above based on the 10 data points given in the table? (More than one option may be correct.)
0.50 ✓ Correct
0.55 ✓ Correct
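One way to check a candidate cutoff against the table above is to see whether thresholding the predicted probabilities reproduces every predicted label. The sketch below does this; the candidates other than 0.50 and 0.55 are only illustrative, and treating a probability exactly equal to the cutoff as a 'Yes' is an assumption:

probs  = [0.34, 0.58, 0.79, 0.68, 0.21, 0.04, 0.48, 0.64, 0.61, 0.86]
labels = [0, 1, 1, 1, 0, 0, 0, 1, 1, 1]

for cutoff in [0.45, 0.50, 0.55, 0.60]:
    predicted = [1 if p >= cutoff else 0 for p in probs]
    print(cutoff, "valid" if predicted == labels else "not valid")
# any cutoff above 0.48 and up to 0.58 reproduces the labels, so 0.50 and 0.55 are valid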
Evaluation Metrics
Consider the same model given in the last question.
Patient ID Heart Disease Predicted Probability for Heart Disease Predicted Label
1001 0 0.34 0
1002 1 0.58 1
1003 1 0.79 1
1004 0 0.68 1
1005 0 0.21 0
1006 0 0.04 0
1007 1 0.48 0
1008 1 0.64 1
1009 0 0.61 1
1010 1 0.86 1
Calculate the values of Accuracy, Sensitivity, Specificity, and Precision. Which of these four metrics is the highest for the model?
Accuracy
Sensitivity ✓ Correct
WOE Analysis
Choose the correct option:
Coarse binning is required for the tenure variable as there is no monotonic trend in fine binning
Coarse binning is not required for the tenure variable as there is a clear monotonic trend in fine binning ✓ Correct
WOE Analysis
What does a negative WOE signify for the 'contract' variable (refer to sheet 3)?
The % of churners (bad customers) is more than the % of non-churners (good customers) ✓ Correct
WOE Analysis
Compare the WOE trends of both the variables (tenure and contract).
Based on the WOE trend, which variable, when increased in value, might decrease the likelihood of churn?
Tenure
Contract
Both ✓ Correct
Information Value
What is the total information value of both the variables?
Information Value
Choose the correct option:
0.51
0.41
-0.41
-0.51 ✓ Correct
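For reference, WOE and IV per bucket of a variable are generally computed as in the sketch below; the buckets and counts are made up and are not the course's tenure/contract data:

import numpy as np
import pandas as pd

binned = pd.DataFrame({
    "bucket": ["0-12", "13-24", "25-48", "49-72"],
    "good":   [400, 350, 500, 600],   # non-churners per bucket
    "bad":    [300, 150, 100, 50],    # churners per bucket
})

binned["dist_good"] = binned["good"] / binned["good"].sum()
binned["dist_bad"] = binned["bad"] / binned["bad"].sum()
binned["woe"] = np.log(binned["dist_good"] / binned["dist_bad"])   # negative => more churners than non-churners
binned["iv"] = (binned["dist_good"] - binned["dist_bad"]) * binned["woe"]
print(binned)
print("Total IV:", round(binned["iv"].sum(), 2))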
Missing Value
The NA bucket can be merged with:
1-1 Bucket
2-2 Bucket
7-9 Bucket
None ✓ Correct
Graded Questions
Logistic Regression
What do you infer from the WOE plot of the 'Grade' variable?
As the loan grade varies from A to G, the WOE values gradually decrease from +0.99 to -1.09 ✓ Correct
Logistic Regression
Choose the correct option:
Logistic Regression
Information value of the 'Grade' variable is:
0.56
0.43
0.34 ✓ Correct
n=int(input())
# first two terms; assumed task (description missing in source): print the first n Fibonacci terms
n1, n2 = 0, 1
count = 0
while count < n:
    print(n1)
    n1, n2 = n2, n1 + n2
    count += 1
Prime Numbers
Description
Determine whether a positive integer n is a prime number or not. Assume n > 1.
Display “number entered is prime” if n is prime, otherwise display “number entered is not prime”.
n=int(input())
out=True
for i in range(2,n):
    if(n%i==0):
        out=False
        break
if out==True:
    print("number entered is prime")
else:
    print("number entered is not prime")
Armstrong number
Description
Any number, say n, is called an Armstrong number if it is equal to the sum of its digits, each raised to the power of the number of digits in n.
n=int(input())
# Python program to check if the number is an Armstrong number or not
# initialize sum
sum = 0
order = len(str(n))            # number of digits in n
for digit in str(n):
    sum += int(digit) ** order
print(n == sum)                # True if n is an Armstrong number (exact output format assumed)
import pandas as pd
import ast,sys
df=pd.read_csv("https://media-doselect.s3.amazonaws.com/generic/X0kvr3wEYXRzONE5W37xWWY
input_str = sys.stdin.read()
to_omit = ast.literal_eval(input_str)
#write your code here
df=df[df.columns[~df.columns.isin(to_omit)]]  # keep only the columns not listed in to_omit
print(df.loc[:, sorted(list(df.columns))].head())
Two series
Description
Given two pandas series, find the position of elements in series2 in series1.
You can assume that all elements in series2 will be present in series1.
The input will contain two lines with series1 and series2 respectively.
The output should be a list of indexes indicating elements of series2 in series 1.
Note: In the output list, the indexes should be in ascending order.
import ast,sys
import pandas as pd
input_str = sys.stdin.read()
input_list = ast.literal_eval(input_str)
series1=pd.Series(input_list[0])
series2=pd.Series(input_list[1])
out_list=[pd.Index(series1).get_loc(num) for num in series2]
print(list(map(int,out_list)))  #do not alter this step, list must be int type for evaluation
Cleaning columns
Description
For the given dataframe, you have to clean the "Installs" column and print its correlation with the other numeric columns.
You have to do the following:
1. Remove characters like ',' from the number of installs.
2. Delete rows where the Installs column has irrelevant strings like 'Free'
3. Convert the column to int type
You can access the dataframe using the following URL in your Jupyter notebook:
https://media-doselect.s3.amazonaws.com/generic/8NMooe4G0ENEe8z9q5ZvaZA7/googleplaystor
import pandas as pd
df=pd.read_csv("https://media-doselect.s3.amazonaws.com/generic/8NMooe4G0ENEe8z9q5ZvaZA
df.Installs=df.Installs.str.replace(',','',regex=False)  # remove thousands separators
df.Installs=df.Installs.str.replace('+','',regex=False)  # '+' is a literal character here, not a regex
df=df[df.Installs!='Free']
df.Installs=df.Installs.astype(int)
print(df.corr())