Business Report: by Sreenath Radhakrishnan
Problem 1:
The given dataset is about the health and economic conditions in different states of a country.
Group the states based on how similar their situation is, so that these groups can be provided
to the government and appropriate measures can be taken to improve their health and economic
conditions.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, etc, etc)
First, we read the file and check the top 5 rows of the dataset
using the head function.
import pandas as pd

df = pd.read_csv("State_wise_Health_income.csv")
df.head()
We use the shape attribute to check the number of rows and columns
in the dataset.
df.shape
(297, 6)
To check the datatypes and to summarize the data we use the info
function
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 6 columns):
Unnamed: 0 297 non-null int64
States 297 non-null object
Health_indeces1 297 non-null int64
Health_indices2 297 non-null int64
Per_capita_income 297 non-null int64
GDP 297 non-null int64
dtypes: int64(5), object(1)
memory usage: 14.0+ KB
Check for any null values in the dataset using the isnull function.
df.isnull().sum()
Unnamed: 0 0
States 0
Health_indeces1 0
Health_indices2 0
Per_capita_income 0
GDP 0
dtype: int64
df.describe()
Perform Univariate and Bivariate Analysis for the given data set.
Univariate Analysis:
Bi-Variate Analysis:
Also, we use the heatmap to understand the correlation between the
variables.
1.2 Do you think scaling is necessary for clustering in this case? Justify
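Yes. The income and GDP figures are on a much larger numeric scale than the health indices,
so a distance-based clustering would be dominated by them unless the data is scaled. A minimal
sketch of how the scaled_df used in the later questions could be built (dropping the identifier
columns is an assumption):

from sklearn.preprocessing import StandardScaler

num_df = df.drop(['Unnamed: 0', 'States'], axis=1)   # keep only the numeric indicators
scaler = StandardScaler()                            # zero mean, unit variance per column
scaled_df = pd.DataFrame(scaler.fit_transform(num_df), columns=num_df.columns)
scaled_df.head()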
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.
Because of the large number of observations, the full dendrogram is very difficult to
read and interpret directly, so the tree is cut into clusters and the resulting labels
are examined instead.
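A minimal sketch of how the dendrogram and the cluster labels below could be produced,
assuming Ward linkage on the scaled data; the two-cluster cut is inferred from the 1/2
values in the label array that follows:

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

wardlink = linkage(scaled_df, method='ward')             # agglomerative clustering, Ward linkage
dendrogram(wardlink, truncate_mode='lastp', p=10)        # truncated tree is far easier to read
plt.show()
clusters = fcluster(wardlink, 2, criterion='maxclust')   # cut the tree into 2 clusters
clusters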
array([2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve
and find silhouette score.
from sklearn.cluster import KMeans

wss = []                          # within-cluster sum of squares for each k
for i in range(1, 15):
    KM = KMeans(n_clusters=i)
    KM.fit(scaled_df)
    wss.append(KM.inertia_)
wss
[1188.0000000000005,
471.3102140867778,
260.5729408376231,
183.60983976801245,
149.787873629525,
117.21156155994824,
90.7040780005266,
79.62086308917355,
70.83287890866458,
64.46674739014672,
56.30758119028052,
51.35374692455001,
46.99615066493063,
45.24802616666356]
for i in range(1,15):
print('The WSS value for',i,'clusters is',wss[i-1])
From the above WSS values, the reduction in within-cluster distance becomes
small after the fourth value. Hence we choose 4 clusters, i.e. the states are
grouped into 4 clusters based on their health and income indicators.
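A minimal sketch of how the elbow curve referred to next can be plotted from the wss
values above (matplotlib is assumed):

import matplotlib.pyplot as plt

plt.plot(range(1, 15), wss, marker='o')            # WSS (inertia) against the number of clusters
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster sum of squares (WSS)')
plt.show()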
From the above elbow curve we see that there is a slight variation
between n=3 and n=4, but the curve starts to flatten from n=4.
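The final model is then fitted with 4 clusters to obtain the labels used for the silhouette
score and for the Clus_kmeans4 column in section 1.5; a minimal sketch (random_state is an
illustrative choice):

km_4 = KMeans(n_clusters=4, random_state=1)   # final model with the chosen number of clusters
labels = km_4.fit_predict(scaled_df)          # cluster label for every state
df['Clus_kmeans4'] = labels                   # attach the labels for profiling later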
from sklearn.metrics import silhouette_score

silhouette_score(scaled_df, labels)
0.5520464132164421
1.5. Describe cluster profiles for the clusters defined. Recommend the different priority-based
actions that need to be taken for the different clusters on the basis of their vulnerability
according to their economic and health conditions.
A:
clust_profile = df.drop(['States'], axis=1)
clust_profile = clust_profile.groupby('Clus_kmeans4').mean()
clust_profile['freq'] = df.Clus_kmeans4.value_counts().sort_index()
clust_profile
Clus_kmeans4  Health_indeces1  Health_indices2  Per_capita_income            GDP  clusters  freq
0                  499.158416       116.356436         693.772277    9428.099010  2.000000   101
1                 2597.089109       783.019802        2464.128713  141264.138614  1.960396   101
2                 4799.355932      1142.288136        2372.220339  396907.237288  1.000000    59
3                 5146.444444      1327.138889        5047.083333  367196.916667  1.000000    36
Cluster 1:
This cluster represents the states with the worst health systems and the lowest
average income.
Cluster 2:
This cluster represents states with slightly better health systems and average
income compared to cluster 1.
Cluster 3:
This cluster represents states with noticeably better health systems and the highest
GDP, although the per capita income is similar to that of cluster 2.
Cluster 4:
This cluster represents the states with the best health systems, along with the
highest average income.
Conclusion:
Problem 2:
The mifem data frame has 1295 rows and 10 columns. This is a dataset of females having coronary
heart disease (CHD). You have to predict, with the given information, whether the female is dead or
alive, so as to discover important factors that should be considered crucial in the treatment of the
disease. Use CART, RF & ANN and compare the models' performances on the train and test sets.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, write an inference on it.
A: Load the dataset and check the data using the head function.
mifem_df = pd.read_csv('mifem.csv')
mifem_df.head()
To check the datatypes and to summarize the data we use the info
function
mifem_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1295 entries, 0 to 1294
Data columns (total 12 columns):
Unnamed: 0 1295 non-null int64
Unnamed: 0.1 1295 non-null int64
outcome 1295 non-null object
age 1295 non-null int64
yronset 1295 non-null int64
premi 1295 non-null object
smstat 1295 non-null object
diabetes 1295 non-null object
highbp 1295 non-null object
hichol 1295 non-null object
angina 1295 non-null object
stroke 1295 non-null object
dtypes: int64(4), object(8)
memory usage: 121.5+ KB
mifem_df.shape
(1295, 12)
mifem_df.isnull().sum()
Unnamed: 0 0
Unnamed: 0.1 0
outcome 0
age 0
yronset 0
premi 0
smstat 0
diabetes 0
highbp 0
hichol 0
angina 0
stroke 0
dtype: int64
From the above result we confirm that there are no null values in
the dataset.
mifem_df.describe()
dups = mifem_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
From the above result we found that there are 75 duplicates in the
dataset.
Univariate Analysis:
Bi-Variate Analysis:
We also draw a heatmap plot to visually represent the degree of correlation
between the columns.
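A minimal sketch of the correlation heatmap, assuming seaborn and using only the numeric columns:

import seaborn as sns
import matplotlib.pyplot as plt

corr = mifem_df.select_dtypes(include='number').corr()   # pairwise correlations of numeric columns
sns.heatmap(corr, annot=True)
plt.show()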
In the above heat map the diagonal values are all 1 because those squares correlate
each variable with itself. We also observe that age and yronset have a negative
correlation, indicated by the dark colour coding with a value close to 0.00.
2.2. Encode the data (having string values) for Modelling. Data Split: Split the data into test and
train, build classification model CART, Random Forest, Artificial Neural Network
A: For modelling, we convert the object variables into integers and then check
the datatypes using the info function.
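A minimal sketch of one way the encoding could be done: the two 'Unnamed' index columns are
dropped (inferred from the 10-column output below), and each object column is label-encoded
with pandas categorical codes, which matches the int8 dtypes shown below:

mifem_df = mifem_df.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1)   # drop the index-like columns
for col in mifem_df.select_dtypes(include='object').columns:
    mifem_df[col] = mifem_df[col].astype('category').cat.codes     # string labels -> small integers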
mifem_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1295 entries, 0 to 1294
Data columns (total 10 columns):
outcome 1295 non-null int8
age 1295 non-null int64
yronset 1295 non-null int64
premi 1295 non-null int8
smstat 1295 non-null int8
diabetes 1295 non-null int8
highbp 1295 non-null int8
hichol 1295 non-null int8
angina 1295 non-null int8
stroke 1295 non-null int8
dtypes: int64(2), int8(8)
memory usage: 30.5 KB
mifem_df.head()
   outcome  age  yronset  premi  smstat  diabetes  highbp  hichol  angina  stroke
0        1   63       85      0       3         0       2       1       0       0
1        1   55       85      0       0         0       2       1       0       0
2        1   68       85      2       2         1       2       1       2       0
3        1   64       85      0       3         0       2       0       2       0
4        0   67       85      0       2         1       1       1       1       1
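A minimal sketch of the train/test split that produces the shapes printed below; a 70:30
split is inferred from the 906/389 row counts, and random_state is an illustrative choice:

from sklearn.model_selection import train_test_split

X = mifem_df.drop('outcome', axis=1)   # the 9 predictor columns
y = mifem_df['outcome']                # target: dead / alive
X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=0.30, random_state=1)
print('X_train', X_train.shape)
print('X_test', X_test.shape)
print('train_labels', train_labels.shape)
print('test_labels', test_labels.shape)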
X_train (906, 9)
X_test (389, 9)
train_labels (906,)
test_labels (389,)
CART Model:
First we import the DecisionTreeClassifier. The model is fitted on the training
data and then used to make predictions on the test data.
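The best_grid_dtcl estimator used below suggests a grid search was run; a minimal sketch
with an illustrative parameter grid (the actual search space is not shown in the report):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid_dtcl = {                                 # illustrative search space
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 7, 10],
    'min_samples_leaf': [10, 20, 50],
    'min_samples_split': [30, 60, 100],
}
dtcl = DecisionTreeClassifier(random_state=1)
grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=10)
grid_search_dtcl.fit(X_train, train_labels)         # tune on the training data only
best_grid_dtcl = grid_search_dtcl.best_estimator_   # best tree found by the search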
print (pd.DataFrame(best_grid_dtcl.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))
Imp
stroke 0.809294
age 0.090348
angina 0.086010
premi 0.011889
smstat 0.002459
yronset 0.000000
diabetes 0.000000
highbp 0.000000
hichol 0.000000
ytrain_predict_dtcl = best_grid_dtcl.predict(X_train)   # predictions on the training data
ytest_predict_dtcl = best_grid_dtcl.predict(X_test)     # predictions on the test data
ytest_predict_prob_dtcl = best_grid_dtcl.predict_proba(X_test)
ytest_predict_prob_dtcl
pd.DataFrame(ytest_predict_prob_dtcl).head()
          0         1
0  0.043956  0.956044
1  0.890909  0.109091
2  0.155642  0.844358
3  0.370370  0.629630
4  0.155172  0.844828
Random Forest Model:
Next we import the RandomForestClassifier. The model is fitted on the training
data and then used to make predictions on the test data.
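As with CART, the best_grid_rfcl name suggests a grid search; a minimal sketch with an
illustrative parameter grid (the search space shown here is an assumption):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid_rfcl = {                                 # illustrative search space
    'n_estimators': [100, 300],
    'max_depth': [5, 7, 10],
    'max_features': [3, 5],
    'min_samples_leaf': [10, 20],
}
rfcl = RandomForestClassifier(random_state=1)
grid_search_rfcl = GridSearchCV(estimator=rfcl, param_grid=param_grid_rfcl, cv=10)
grid_search_rfcl.fit(X_train, train_labels)
best_grid_rfcl = grid_search_rfcl.best_estimator_
ytrain_predict_rfcl = best_grid_rfcl.predict(X_train)   # predictions used in section 2.3
ytest_predict_rfcl = best_grid_rfcl.predict(X_test)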
print (pd.DataFrame(best_grid_rfcl.feature_importances_,
columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))
Imp
stroke 0.391011
angina 0.195552
age 0.110486
diabetes 0.104443
yronset 0.053794
smstat 0.047519
highbp 0.045090
premi 0.036655
hichol 0.015450
Artificial Neural Network Model:
Finally we import the MLPClassifier (multi-layer perceptron). The model is fitted on
the training data and then used to make predictions on the test data.
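A minimal sketch of how grid_search_nncl, which is fitted in the next snippet, could be
set up (the parameter grid is illustrative):

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

param_grid_nncl = {                                 # illustrative search space
    'hidden_layer_sizes': [(50,), (100,)],
    'activation': ['relu', 'tanh'],
    'max_iter': [5000],
    'tol': [0.01],
}
nncl = MLPClassifier(random_state=1)
grid_search_nncl = GridSearchCV(estimator=nncl, param_grid=param_grid_nncl, cv=10)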
grid_search_nncl.fit(X_train, train_labels)
grid_search_nncl.best_params_
best_grid_nncl = grid_search_nncl.best_estimator_
best_grid_nncl
ytrain_predict_nncl = best_grid_nncl.predict(X_train)   # predictions on the training data
ytest_predict_nncl = best_grid_nncl.predict(X_test)     # predictions on the test data
ytest_predict_prob_nncl = best_grid_nncl.predict_proba(X_test)
ytest_predict_prob_nncl
pd.DataFrame(ytest_predict_prob_nncl).head()
          0         1
0  0.217712  0.782288
1  0.390688  0.609312
2  0.267153  0.732847
3  0.608886  0.391114
4  0.365039  0.634961
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
CART Model:
Training Data:
confusion_matrix(train_labels, ytrain_predict_dtcl)
cart_train_acc=best_grid_dtcl.score(X_train, train_labels)
cart_train_acc
0.8013245033112583
print(classification_report(train_labels, ytrain_predict_dtcl))
ROC Curve:
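A minimal sketch of how the ROC curve and AUC reported below can be computed for the CART
training data (the same pattern is repeated for the test set and the other models;
matplotlib is assumed):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

probs_train = best_grid_dtcl.predict_proba(X_train)[:, 1]   # probability of the positive class
fpr, tpr, thresholds = roc_curve(train_labels, probs_train)
plt.plot(fpr, tpr)                                           # ROC curve
plt.plot([0, 1], [0, 1], linestyle='--')                     # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
cart_train_auc = roc_auc_score(train_labels, probs_train)
print('AUC: %.3f' % cart_train_auc)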
AUC: 0.747
Testing Data:
confusion_matrix(test_labels, ytest_predict_dtcl)
cart_test_acc=best_grid_dtcl.score(X_test, test_labels)
cart_test_acc
0.794344473007712
print(classification_report(test_labels, ytest_predict_dtcl))
ROC Curve:
AUC: 0.613
CART Conclusion:
Training Data:
cart_train_precision: 0.8
cart_train_recall: 0.99
cart_train_f1: 0.88
AUC: 0.74
Accuracy: 0.80
Testing Data:
cart_test_precision: 0.79
cart_test_recall: 0.99
cart_test_f1: 0.88
AUC: 0.61
Accuracy: 0.79
Training and test results are quite similar, although the AUC is lower on the test
data than on the training data.
Random Forest Model:
Training Data:
confusion_matrix(train_labels, ytrain_predict_rfcl)
rf_train_acc=best_grid_rfcl.score(X_train, train_labels)
rf_train_acc
0.8145695364238411
print(classification_report(train_labels, ytrain_predict_rfcl))
ROC Curve:
AUC is 0.8083928067284272
Testing Data:
confusion_matrix(test_labels, ytest_predict_rfcl)
rf_test_acc=best_grid_rfcl.score(X_test, test_labels)
rf_test_acc
0.7994858611825193
print(classification_report(test_labels, ytest_predict_rfcl))
ROC Curve:
AUC is 0.6893891577249457
Random Forest Conclusion:
Training Data:
rf_train_precision: 0.81
rf_train_recall: 0.99
rf_train_f1: 0.89
AUC: 0.80
Accuracy: 0.81
Testing Data:
rf_test_precision: 0.79
rf_test_recall: 0.99
rf_test_f1: 0.88
AUC: 0.68
Accuracy: 0.79
Training and testing results are similar to those of the CART model, but the AUC
on the test data is higher for this model than for the CART model.
Artificial Neural Network Model:
Training Data:
confusion_matrix(train_labels, ytrain_predict_nncl)
nn_train_acc=best_grid_nncl.score(X_train, train_labels)
nn_train_acc
0.7384105960264901
print(classification_report(train_labels, ytrain_predict_nncl))
ROC Curve:
AUC is 0.6983500646711619
Testing Data:
confusion_matrix(test_labels, ytest_predict_nncl)
nn_test_acc=best_grid_nncl.score(X_test, test_labels)
nn_test_acc
0.7403598971722365
print(classification_report(test_labels, ytest_predict_nncl))
ROC Curve:
AUC is 0.6391752577319589
Neural Network Conclusion:
Training Data:
nn_train_precision: 0.78
nn_train_recall: 0.9
nn_train_f1: 0.84
AUC: 0.69
Accuracy: 0.73
Testing Data:
nn_test_precision: 0.78
nn_test_recall: 0.9
nn_test_f1: 0.84
AUC: 0.63
Accuracy: 0.74
Training and testing results are quite similar, but the overall measures are lower
than for the other two models.
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.
A:
index = ['Accuracy', 'AUC', 'Recall', 'Precision', 'F1 Score']
data = pd.DataFrame({
    'CART Train': [cart_train_acc, cart_train_auc, cart_train_recall, cart_train_precision, cart_train_f1],
    'CART Test': [cart_test_acc, cart_test_auc, cart_test_recall, cart_test_precision, cart_test_f1],
    'Random Forest Train': [rf_train_acc, rf_train_auc, rf_train_recall, rf_train_precision, rf_train_f1],
    'Random Forest Test': [rf_test_acc, rf_test_auc, rf_test_recall, rf_test_precision, rf_test_f1],
    'Neural Network Train': [nn_train_acc, nn_train_auc, nn_train_recall, nn_train_precision, nn_train_f1],
    'Neural Network Test': [nn_test_acc, nn_test_auc, nn_test_recall, nn_test_precision, nn_test_f1]},
    index=index)
round(data, 3)
           CART Train  CART Test  Random Forest Train  Random Forest Test  Neural Network Train  Neural Network Test
Accuracy        0.801      0.794                0.815               0.799                 0.738                0.740
AUC             0.747      0.613                0.808               0.689                 0.698                0.639
Recall          0.990      0.990                0.990               0.990                 0.900                0.900
Precision       0.800      0.790                0.810               0.790                 0.780                0.780
F1 Score        0.880      0.880                0.890               0.880                 0.840                0.840
From the above results we can infer that the Random Forest model is the best choice
for this case study: its accuracy and AUC (Area Under Curve) are higher than those of
the other models on both the training and test data, so in general it can be expected
to perform better than the other models.
2.5 Inference: Based on these predictions, what are the insights and recommendations?
print (pd.DataFrame(best_grid_rfcl.feature_importances_,
columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))
Imp
stroke 0.391011
angina 0.195552
age 0.110486
diabetes 0.104443
yronset 0.053794
smstat 0.047519
highbp 0.045090
premi 0.036655
hichol 0.015450
Females with a past stroke, higher age, and diabetes are less likely to recover from
the disease, so these factors should be treated as the most crucial in the treatment
of CHD.