4/22/25, 4:30 PM diabetes check CIE (3)
In [ ]: #importing the required libraries to build logistic regresion model
In [93]: import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
from sklearn import metrics
In [94]: #Importing the file
In [95]: d_check= pd.read_excel('/Users/sivamugunthanashok/Desktop/MAJORS/PA/diabetes check.xlsx')
d_check.head()
Out[95]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
In [96]: #Checking the total rows and columns
In [97]: d_check.shape
Out[97]: (768, 9)
In [98]: #General information about the dataset(d_check)
In [99]: d_check.info()
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 1/18
4/22/25, 4:30 PM diabetes check CIE (3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
In [100… d_check.columns
Out[100… Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
In [137… import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Display the correlation matrix
correlation_matrix = d_check.corr()
print("Correlation Matrix:")
print(correlation_matrix)
# Optional: Plot a heatmap for better visualization
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 2/18
4/22/25, 4:30 PM diabetes check CIE (3)
Correlation Matrix:
Pregnancies Glucose BloodPressure SkinThickness \
Pregnancies 1.000000 0.129459 0.141282 -0.081672
Glucose 0.129459 1.000000 0.152590 0.057328
BloodPressure 0.141282 0.152590 1.000000 0.207371
SkinThickness -0.081672 0.057328 0.207371 1.000000
Insulin -0.073535 0.331357 0.088933 0.436783
BMI 0.017683 0.221071 0.281805 0.392573
DiabetesPedigreeFunction -0.033523 0.137337 0.041265 0.183928
Age 0.544341 0.263514 0.239528 -0.113970
diabetes 0.221898 0.466581 0.065068 0.074752
Insulin BMI DiabetesPedigreeFunction \
Pregnancies -0.073535 0.017683 -0.033523
Glucose 0.331357 0.221071 0.137337
BloodPressure 0.088933 0.281805 0.041265
SkinThickness 0.436783 0.392573 0.183928
Insulin 1.000000 0.197859 0.185071
BMI 0.197859 1.000000 0.140647
DiabetesPedigreeFunction 0.185071 0.140647 1.000000
Age -0.042163 0.036242 0.033561
diabetes 0.130548 0.292695 0.173844
Age diabetes
Pregnancies 0.544341 0.221898
Glucose 0.263514 0.466581
BloodPressure 0.239528 0.065068
SkinThickness -0.113970 0.074752
Insulin -0.042163 0.130548
BMI 0.036242 0.292695
DiabetesPedigreeFunction 0.033561 0.173844
Age 1.000000 0.238356
diabetes 0.238356 1.000000
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 3/18
4/22/25, 4:30 PM diabetes check CIE (3)
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 4/18
4/22/25, 4:30 PM diabetes check CIE (3)
In [101… #Renaming the column name(outcome) to (diabetes)
In [102… d_check = d_check.rename(columns={'Outcome': 'diabetes'})
In [103… #To check the count occurrences of each unique value in the 'diabetes' column
In [104… d_check.diabetes.value_counts()
Out[104… diabetes
0 500
1 268
Name: count, dtype: int64
In [105… #Defining explantory variables
In [106… x_features=list(d_check.columns)
x_features.remove('diabetes')
x_features
Out[106… ['Pregnancies',
'Glucose',
'BloodPressure',
'SkinThickness',
'Insulin',
'BMI',
'DiabetesPedigreeFunction',
'Age']
In [107… #defining explantory(X) and outcome variable(Y),Adding constant to explantory variable(X) get (Bo)
In [108… Y=d_check.diabetes
X = sm.add_constant(d_check[x_features])
In [109… # Initialize the logistic regression model with outcome (Y) and explanatory (X) variables
# Fit the logistic regression model to the data
# Display a detailed summary of the logistic regression results
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 5/18
4/22/25, 4:30 PM diabetes check CIE (3)
In [110… logit=sm.Logit(Y,X)
logit_model=logit.fit()
logit_model.summary2()
Optimization terminated successfully.
Current function value: 0.470993
Iterations 6
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 6/18
4/22/25, 4:30 PM diabetes check CIE (3)
Out[110… Model: Logit Method: MLE
Dependent Variable: diabetes Pseudo R-squared: 0.272
Date: 2025-04-11 19:02 AIC: 741.4454
No. Observations: 768 BIC: 783.2395
Df Model: 8 Log-Likelihood: -361.72
Df Residuals: 759 LL-Null: -496.74
Converged: 1.0000 LLR p-value: 9.6516e-54
No. Iterations: 6.0000 Scale: 1.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
const -8.4047 0.7166 -11.7280 0.0000 -9.8093 -7.0001
Pregnancies 0.1232 0.0321 3.8401 0.0001 0.0603 0.1861
Glucose 0.0352 0.0037 9.4814 0.0000 0.0279 0.0424
BloodPressure -0.0133 0.0052 -2.5404 0.0111 -0.0236 -0.0030
SkinThickness 0.0006 0.0069 0.0897 0.9285 -0.0129 0.0141
Insulin -0.0012 0.0009 -1.3223 0.1861 -0.0030 0.0006
BMI 0.0897 0.0151 5.9453 0.0000 0.0601 0.1193
DiabetesPedigreeFunction 0.9452 0.2991 3.1596 0.0016 0.3589 1.5315
Age 0.0149 0.0093 1.5929 0.1112 -0.0034 0.0332
In [111… def get_significant_vars(lm):
# Step 1: Convert p-values into a table (DataFrame)
var_p_vals_df = pd.DataFrame(lm.pvalues)
# Step 2: Add variable names as a column in the table
var_p_vals_df['vars'] = var_p_vals_df.index
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 7/18
4/22/25, 4:30 PM diabetes check CIE (3)
# Step 3: Rename the columns to 'pvals' (for p-values) and 'vars' (for variable names)
var_p_vals_df.columns = ['pvals', 'vars']
# Step 4: Find the variables where p-value <= 0.05 and return their names as a list
return list(var_p_vals_df[var_p_vals_df.pvals <= 0.05]['vars'])
In [112… #Printing the significant variables
In [113… significant_vars=get_significant_vars(logit_model)
significant_vars
Out[113… ['const',
'Pregnancies',
'Glucose',
'BloodPressure',
'BMI',
'DiabetesPedigreeFunction']
In [114… # Fit a logistic regression model using significant variables and adding constant to the (X) explanatory variable
In [115… final_logit=sm.Logit(Y,sm.add_constant(X[significant_vars])).fit()
Optimization terminated successfully.
Current function value: 0.474323
Iterations 6
In [116… #Final summary of the model (With only significant variables)
final_logit.summary2()
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 8/18
4/22/25, 4:30 PM diabetes check CIE (3)
Out[116… Model: Logit Method: MLE
Dependent Variable: diabetes Pseudo R-squared: 0.267
Date: 2025-04-11 19:02 AIC: 740.5596
No. Observations: 768 BIC: 768.4223
Df Model: 5 Log-Likelihood: -364.28
Df Residuals: 762 LL-Null: -496.74
Converged: 1.0000 LLR p-value: 3.4421e-55
No. Iterations: 6.0000 Scale: 1.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
const -7.9550 0.6758 -11.7708 0.0000 -9.2795 -6.6304
Pregnancies 0.1535 0.0278 5.5143 0.0000 0.0989 0.2080
Glucose 0.0347 0.0034 10.2130 0.0000 0.0280 0.0413
BloodPressure -0.0120 0.0050 -2.3868 0.0170 -0.0219 -0.0021
BMI 0.0848 0.0141 6.0059 0.0000 0.0571 0.1125
DiabetesPedigreeFunction 0.9106 0.2940 3.0971 0.0020 0.3343 1.4869
In [117… #Printing actual value vs predicted value for the significant variables from the final summary
Y_pred=pd.DataFrame({'actual':Y,
'predicted_prob':final_logit.predict(
sm.add_constant(X[significant_vars]))})
In [118… # Sample 10 random predictions from the predicted values, ensuring the same random sample every time by setting ran
Y_pred.sample(10,random_state=7)
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 9/18
4/22/25, 4:30 PM diabetes check CIE (3)
Out[118… actual predicted_prob
353 0 0.069714
236 1 0.876866
323 1 0.762600
98 0 0.160798
701 1 0.313795
61 1 0.513703
600 0 0.079305
242 1 0.312677
744 0 0.942662
644 0 0.143922
In [119… # Create a new column 'predicted' in Y_pred DataFrame by converting predicted probabilities to binary outcomes
# If the predicted probability is greater than 0.5, assign 1 (positive class), otherwise assign 0 (negative class)
Y_pred['predicted']=Y_pred.predicted_prob.map(
lambda x:1 if x>0.5 else 0)
Y_pred.sample(10, random_state=7)
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 10/18
4/22/25, 4:30 PM diabetes check CIE (3)
Out[119… actual predicted_prob predicted
353 0 0.069714 0
236 1 0.876866 1
323 1 0.762600 1
98 0 0.160798 0
701 1 0.313795 0
61 1 0.513703 1
600 0 0.079305 0
242 1 0.312677 0
744 0 0.942662 1
644 0 0.143922 0
In [138… # Define a function to draw the confusion matrix
def draw_cm(actual, predicted):
# Generate the confusion matrix using actual and predicted labels
cm= metrics.confusion_matrix(actual,predicted, labels=[0,1])
# Use seaborn's heatmap to visualize the confusion matrix
sn.heatmap(cm,annot=True,fmt='.2f',
xticklabels=['Negative','Positive'],
yticklabels=['Negative','Positive'])
# Set the labels for the axes
plt.ylabel('True lable')
plt.xlabel('predicted label')
# Display the plot
plt.show()
In [139… # Call the draw_cm function to visualize the confusion matrix using the 'actual' and 'predicted' columns from the Y
draw_cm(Y_pred['actual'],Y_pred['predicted'])
"""
Interpretation
True Positive (Top-Left): 436 instances were correctly predicted as "negative."
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 11/18
4/22/25, 4:30 PM diabetes check CIE (3)
False Positive (Top-Right): 64 instances were incorrectly predicted as "positive" when they were actually "negative
False Negative (Bottom-Left): 113 instances were incorrectly predicted as "Negative" when they were actually "posit
True Negative (Bottom-Right): 155 instances were correctly predicted as "positive"
"""
Out[139… '\nInterpretation \nTrue Positive (Top-Left): 436 instances were correctly predicted as "Not Subscribed."\nFalse P
ositive (Top-Right): 64 instances were incorrectly predicted as "Subscribed" when they were actually "Not Subscrib
ed."\nFalse Negative (Bottom-Left): 113 instances were incorrectly predicted as "Not Subscribed" when they were ac
tually "Subscribed."\nTrue Negative (Bottom-Right): 155 instances were correctly predicted as "Subscribed."\n'
In [122… # Print the classification report using actual and predicted labels from the Y_pred DataFrame
print(metrics.classification_report(Y_pred.actual, Y_pred.predicted))
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 12/18
4/22/25, 4:30 PM diabetes check CIE (3)
precision recall f1-score support
0 0.79 0.88 0.84 500
1 0.72 0.57 0.64 268
accuracy 0.77 768
macro avg 0.76 0.73 0.74 768
weighted avg 0.77 0.77 0.77 768
In [123… import matplotlib.pyplot as plt
import seaborn as sns
#Set figure size
plt.figure(figsize=(8, 6))
#Plot distribution of predicted probabilities for Bad Credit
sns.histplot(Y_pred[Y_pred.actual == 1]["predicted_prob"], bins=20, color="b", label="Bad Credit", alpha=0.6)
#Plot distribution of predicted probabilities for Good Credit
sns.histplot(Y_pred[Y_pred.actual == 0]["predicted_prob"], bins=20, color="g", label="Good Credit", alpha=0.6)
# Adding Legend plt.legend()
#Adding labels and title
plt.xlabel("Predicted Probability")
plt.ylabel("Frequency")
plt.title("Distribution of Predicted Probabilities")
# Display plot
plt.show()
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 13/18
4/22/25, 4:30 PM diabetes check CIE (3)
(ROC)Reciver operator curve (AUC)Area under the curve
In [124… import matplotlib.pyplot as plt
from sklearn import metrics
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 14/18
4/22/25, 4:30 PM diabetes check CIE (3)
In [125… def draw_roc(actual, predicted_prob):
# Obtain fpr, tpr, thresholds
fpr, tpr, thresholds = metrics.roc_curve(actual, predicted_prob, drop_intermediate=False)
auc_score = metrics.roc_auc_score(actual, predicted_prob)
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label="ROC curve (area = %0.2f)" % auc_score)
# Draw a diagonal line (random classifier line)
plt.plot([0, 1], [0, 1], "k--")
# Set axis limits
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
# Add labels and legend
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend(loc="lower right")
# Show the plot
plt.show()
# Return fpr, tpr, thresholds
return fpr, tpr, thresholds
In [126… fpr, tpr, thresholds = draw_roc(Y_pred.actual, Y_pred.predicted_prob)
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 15/18
4/22/25, 4:30 PM diabetes check CIE (3)
In [127… auc_score = metrics.roc_auc_score(Y_pred.actual, Y_pred.predicted_prob)
round(float(auc_score),2)
Out[127… 0.84
Youndens index:
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 16/18
4/22/25, 4:30 PM diabetes check CIE (3)
In [128… tpr_fpr=pd.DataFrame({"tpr":tpr,"fpr":fpr,"thresholds":thresholds})
tpr_fpr["diff"]=tpr_fpr.tpr - tpr_fpr.fpr
tpr_fpr.sort_values("diff",ascending=False)[0:5]
Out[128… tpr fpr thresholds diff
335 0.794776 0.244 0.319596 0.550776
341 0.802239 0.252 0.312677 0.550239
324 0.779851 0.230 0.328583 0.549851
333 0.791045 0.242 0.321644 0.549045
336 0.794776 0.246 0.318831 0.548776
In [129… Y_pred["predicted_new"] = Y_pred.predicted_prob.map(lambda x: 1 if x>0.22 else 0)
draw_cm(Y_pred.actual, Y_pred.predicted_new)
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 17/18
4/22/25, 4:30 PM diabetes check CIE (3)
In [130… print(metrics.classification_report(Y_pred.actual, Y_pred.predicted_new))
precision recall f1-score support
0 0.89 0.59 0.71 500
1 0.53 0.87 0.66 268
accuracy 0.69 768
macro avg 0.71 0.73 0.69 768
weighted avg 0.77 0.69 0.69 768
In [ ]:
localhost:8964/lab/tree/Desktop/MAJORS/ML/CIE 3/diabetes check CIE (3).ipynb 18/18