Business Report: by Sreenath Radhakrishnan
Problem 1:
The given dataset is about the health and economic conditions in different states of a country.
Group the states based on how similar their situation is, so that these groups can be provided
to the government and appropriate measures can be taken to improve their health and economic
conditions.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, etc, etc)
First, we read the file and check the top 5 rows of the dataset
using the head function.
import pandas as pd

df = pd.read_csv("State_wise_Health_income.csv")
df.head()
We use the shape attribute to check the number of rows and columns
in the dataset.
df.shape
(297, 6)
To check the datatypes and to summarize the data we use the info
function
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 6 columns):
Unnamed: 0 297 non-null int64
States 297 non-null object
Health_indeces1 297 non-null int64
Health_indices2 297 non-null int64
Per_capita_income 297 non-null int64
GDP 297 non-null int64
dtypes: int64(5), object(1)
memory usage: 14.0+ KB
Check for any null values in the dataset using the isnull function.
df.isnull().sum()
Unnamed: 0 0
States 0
Health_indeces1 0
Health_indices2 0
Per_capita_income 0
GDP 0
dtype: int64
df.describe()
Perform Univariate and Bivariate Analysis for the given data set.
Univariate Analysis:
Bi-Variate Analysis:
Also, we use the heatmap to understand the correlation between the
variables.
1.2 Do you think scaling is necessary for clustering in this case? Justify
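Yes. The income and GDP figures are on a much larger numeric scale than the health indices,
so a distance-based clustering would be dominated by them unless the data is scaled. A minimal
sketch of how the scaled_df used in the later questions could be built (dropping the identifier
columns is an assumption):

from sklearn.preprocessing import StandardScaler

num_df = df.drop(['Unnamed: 0', 'States'], axis=1)   # keep only the numeric indicators
scaler = StandardScaler()                            # zero mean, unit variance per column
scaled_df = pd.DataFrame(scaler.fit_transform(num_df), columns=num_df.columns)
scaled_df.head()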
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.
Because of the large number of observations, the full dendrogram is very difficult to
read and interpret directly, so the tree is cut into clusters and the resulting labels
are examined instead.
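A minimal sketch of how the dendrogram and the cluster labels below could be produced,
assuming Ward linkage on the scaled data; the two-cluster cut is inferred from the 1/2
values in the label array that follows:

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

wardlink = linkage(scaled_df, method='ward')             # agglomerative clustering, Ward linkage
dendrogram(wardlink, truncate_mode='lastp', p=10)        # truncated tree is far easier to read
plt.show()
clusters = fcluster(wardlink, 2, criterion='maxclust')   # cut the tree into 2 clusters
clusters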
array([2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve
and find silhouette score.
from sklearn.cluster import KMeans

wss = []                          # within-cluster sum of squares for each k
for i in range(1, 15):
    KM = KMeans(n_clusters=i)
    KM.fit(scaled_df)
    wss.append(KM.inertia_)
wss
[1188.0000000000005,
471.3102140867778,
260.5729408376231,
183.60983976801245,
149.787873629525,
117.21156155994824,
90.7040780005266,
79.62086308917355,
70.83287890866458,
64.46674739014672,
56.30758119028052,
51.35374692455001,
46.99615066493063,
45.24802616666356]
for i in range(1,15):
print('The WSS value for',i,'clusters is',wss[i-1])
From the above WSS values, the reduction in within-cluster distance becomes
small after the fourth value. Hence we choose 4 clusters, i.e. the states are
grouped into 4 clusters based on their health and income indicators.
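A minimal sketch of how the elbow curve referred to next can be plotted from the wss
values above (matplotlib is assumed):

import matplotlib.pyplot as plt

plt.plot(range(1, 15), wss, marker='o')            # WSS (inertia) against the number of clusters
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster sum of squares (WSS)')
plt.show()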
From the above elbow curve we see that there is a slight variation
between n=3 and n=4, but the curve starts to flatten from n=4.
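The final model is then fitted with 4 clusters to obtain the labels used for the silhouette
score and for the Clus_kmeans4 column in section 1.5; a minimal sketch (random_state is an
illustrative choice):

km_4 = KMeans(n_clusters=4, random_state=1)   # final model with the chosen number of clusters
labels = km_4.fit_predict(scaled_df)          # cluster label for every state
df['Clus_kmeans4'] = labels                   # attach the labels for profiling later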
from sklearn.metrics import silhouette_score

silhouette_score(scaled_df, labels)
0.5520464132164421
1.5. Describe cluster profiles for the clusters defined. Recommend the different priority-based
actions that need to be taken for the different clusters on the basis of their vulnerability
according to their economic and health conditions.
A:
clust_profile = df.drop(['States'], axis=1)
clust_profile = clust_profile.groupby('Clus_kmeans4').mean()
clust_profile['freq'] = df.Clus_kmeans4.value_counts().sort_index()
clust_profile
Clus_kmeans4  Health_indeces1  Health_indices2  Per_capita_income            GDP  clusters  freq
0                  499.158416       116.356436         693.772277    9428.099010  2.000000   101
1                 2597.089109       783.019802        2464.128713  141264.138614  1.960396   101
2                 4799.355932      1142.288136        2372.220339  396907.237288  1.000000    59
3                 5146.444444      1327.138889        5047.083333  367196.916667  1.000000    36
Cluster 1:
This cluster represents the states with the worst health systems and the lowest
average income.
Cluster 2:
This cluster represents states with slightly better health systems and average
income compared to cluster 1.
Cluster 3:
This cluster represents states with noticeably better health systems and the highest
GDP, although the per capita income is similar to that of cluster 2.
Cluster 4:
This cluster represents the states with the best health systems, along with the
highest average income.
Conclusion:
Problem 2:
The mifem data frame has 1295 rows and 10 columns. This is a dataset of females having coronary
heart disease (CHD). You have to predict, with the given information, whether the female is dead or
alive, so as to discover important factors that should be considered crucial in the treatment of the
disease. Use CART, RF & ANN and compare the models' performances on the train and test sets.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, write an inference on it.
A: Load the dataset and check the data using the head function.
mifem_df = pd.read_csv('mifem.csv')
mifem_df.head()
To check the datatypes and to summarize the data we use the info
function
mifem_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1295 entries, 0 to 1294
Data columns (total 12 columns):
Unnamed: 0 1295 non-null int64
Unnamed: 0.1 1295 non-null int64
outcome 1295 non-null object
age 1295 non-null int64
yronset 1295 non-null int64
premi 1295 non-null object
smstat 1295 non-null object
diabetes 1295 non-null object
highbp 1295 non-null object
hichol 1295 non-null object
angina 1295 non-null object
stroke 1295 non-null object
dtypes: int64(4), object(8)
memory usage: 121.5+ KB
mifem_df.shape
(1295, 12)
mifem_df.isnull().sum()
Unnamed: 0 0
Unnamed: 0.1 0
outcome 0
age 0
yronset 0
premi 0
smstat 0
diabetes 0
highbp 0
hichol 0
angina 0
stroke 0
dtype: int64
From the above result we confirm that there are no null values in
the dataset.
mifem_df.describe()
dups = mifem_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
From the above result we found that there are 75 duplicates in the
dataset.
Univariate Analysis:
Bi-Variate Analysis:
We also draw a heatmap plot to visually represent the degree of correlation
between the columns.
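A minimal sketch of the correlation heatmap, assuming seaborn and using only the numeric columns:

import seaborn as sns
import matplotlib.pyplot as plt

corr = mifem_df.select_dtypes(include='number').corr()   # pairwise correlations of numeric columns
sns.heatmap(corr, annot=True)
plt.show()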
In the above heat map the diagonal values are all 1 because those squares correlate
each variable with itself. We also observe that age and yronset have a negative
correlation, indicated by the dark colour coding with a value close to 0.00.
2.2. Encode the data (having string values) for Modelling. Data Split: Split the data into test and
train, build classification model CART, Random Forest, Artificial Neural Network
A: For modelling, we convert the object variables into integers and then check
the datatypes using the info function.
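A minimal sketch of one way the encoding could be done: the two 'Unnamed' index columns are
dropped (inferred from the 10-column output below), and each object column is label-encoded
with pandas categorical codes, which matches the int8 dtypes shown below:

mifem_df = mifem_df.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1)   # drop the index-like columns
for col in mifem_df.select_dtypes(include='object').columns:
    mifem_df[col] = mifem_df[col].astype('category').cat.codes     # string labels -> small integers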
mifem_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1295 entries, 0 to 1294
Data columns (total 10 columns):
outcome 1295 non-null int8
age 1295 non-null int64
yronset 1295 non-null int64
premi 1295 non-null int8
smstat 1295 non-null int8
diabetes 1295 non-null int8
highbp 1295 non-null int8
hichol 1295 non-null int8
angina 1295 non-null int8
stroke 1295 non-null int8
dtypes: int64(2), int8(8)
memory usage: 30.5 KB
mifem_df.head()
   outcome  age  yronset  premi  smstat  diabetes  highbp  hichol  angina  stroke
0        1   63       85      0       3         0       2       1       0       0
1        1   55       85      0       0         0       2       1       0       0
2        1   68       85      2       2         1       2       1       2       0
3        1   64       85      0       3         0       2       0       2       0
4        0   67       85      0       2         1       1       1       1       1
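A minimal sketch of the train/test split that produces the shapes printed below; a 70:30
split is inferred from the 906/389 row counts, and random_state is an illustrative choice:

from sklearn.model_selection import train_test_split

X = mifem_df.drop('outcome', axis=1)   # the 9 predictor columns
y = mifem_df['outcome']                # target: dead / alive
X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=0.30, random_state=1)
print('X_train', X_train.shape)
print('X_test', X_test.shape)
print('train_labels', train_labels.shape)
print('test_labels', test_labels.shape)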
X_train (906, 9)
X_test (389, 9)
train_labels (906,)
test_labels (389,)
CART Model:
First we import the DecisionTreeClassifier. The model is fitted on the training
data and then used to make predictions on the test data.
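The best_grid_dtcl estimator used below suggests a grid search was run; a minimal sketch
with an illustrative parameter grid (the actual search space is not shown in the report):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid_dtcl = {                                 # illustrative search space
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 7, 10],
    'min_samples_leaf': [10, 20, 50],
    'min_samples_split': [30, 60, 100],
}
dtcl = DecisionTreeClassifier(random_state=1)
grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=10)
grid_search_dtcl.fit(X_train, train_labels)         # tune on the training data only
best_grid_dtcl = grid_search_dtcl.best_estimator_   # best tree found by the search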
print (pd.DataFrame(best_grid_dtcl.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))
Imp
stroke 0.809294
age 0.090348
angina 0.086010
premi 0.011889
smstat 0.002459
yronset 0.000000
diabetes 0.000000
highbp 0.000000
hichol 0.000000
ytrain_predict_dtcl = best_grid_dtcl.predict(X_train)   # predictions on the training data
ytest_predict_dtcl = best_grid_dtcl.predict(X_test)     # predictions on the test data
ytest_predict_prob_dtcl = best_grid_dtcl.predict_proba(X_test)
ytest_predict_prob_dtcl
pd.DataFrame(ytest_predict_prob_dtcl).head()
          0         1
0  0.043956  0.956044
1  0.890909  0.109091
2  0.155642  0.844358
3  0.370370  0.629630
4  0.155172  0.844828
Random Forest Model:
Next we import the RandomForestClassifier. The model is fitted on the training
data and then used to make predictions on the test data.
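As with CART, the best_grid_rfcl name suggests a grid search; a minimal sketch with an
illustrative parameter grid (the search space shown here is an assumption):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid_rfcl = {                                 # illustrative search space
    'n_estimators': [100, 300],
    'max_depth': [5, 7, 10],
    'max_features': [3, 5],
    'min_samples_leaf': [10, 20],
}
rfcl = RandomForestClassifier(random_state=1)
grid_search_rfcl = GridSearchCV(estimator=rfcl, param_grid=param_grid_rfcl, cv=10)
grid_search_rfcl.fit(X_train, train_labels)
best_grid_rfcl = grid_search_rfcl.best_estimator_
ytrain_predict_rfcl = best_grid_rfcl.predict(X_train)   # predictions used in section 2.3
ytest_predict_rfcl = best_grid_rfcl.predict(X_test)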
print (pd.DataFrame(best_grid_rfcl.feature_importances_,
columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))
Imp
stroke 0.391011
angina 0.195552
age 0.110486
diabetes 0.104443
yronset 0.053794
smstat 0.047519
highbp 0.045090
premi 0.036655
hichol 0.015450
Artificial Neural Network Model:
Finally we import the MLPClassifier (multi-layer perceptron). The model is fitted on
the training data and then used to make predictions on the test data.
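A minimal sketch of how grid_search_nncl, which is fitted in the next snippet, could be
set up (the parameter grid is illustrative):

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

param_grid_nncl = {                                 # illustrative search space
    'hidden_layer_sizes': [(50,), (100,)],
    'activation': ['relu', 'tanh'],
    'max_iter': [5000],
    'tol': [0.01],
}
nncl = MLPClassifier(random_state=1)
grid_search_nncl = GridSearchCV(estimator=nncl, param_grid=param_grid_nncl, cv=10)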
grid_search_nncl.fit(X_train, train_labels)
grid_search_nncl.best_params_
best_grid_nncl = grid_search_nncl.best_estimator_
best_grid_nncl
ytrain_predict_nncl = best_grid_nncl.predict(X_train)   # predictions on the training data
ytest_predict_nncl = best_grid_nncl.predict(X_test)     # predictions on the test data
ytest_predict_prob_nncl = best_grid_nncl.predict_proba(X_test)
ytest_predict_prob_nncl
pd.DataFrame(ytest_predict_prob_nncl).head()
          0         1
0  0.217712  0.782288
1  0.390688  0.609312
2  0.267153  0.732847
3  0.608886  0.391114
4  0.365039  0.634961
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
CART Model:
Training Data:
confusion_matrix(train_labels, ytrain_predict_dtcl)
cart_train_acc=best_grid_dtcl.score(X_train, train_labels)
cart_train_acc
0.8013245033112583
print(classification_report(train_labels, ytrain_predict_dtcl))
ROC Curve:
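A minimal sketch of how the ROC curve and AUC reported below can be computed for the CART
training data (the same pattern is repeated for the test set and the other models;
matplotlib is assumed):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

probs_train = best_grid_dtcl.predict_proba(X_train)[:, 1]   # probability of the positive class
fpr, tpr, thresholds = roc_curve(train_labels, probs_train)
plt.plot(fpr, tpr)                                           # ROC curve
plt.plot([0, 1], [0, 1], linestyle='--')                     # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
cart_train_auc = roc_auc_score(train_labels, probs_train)
print('AUC: %.3f' % cart_train_auc)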
AUC: 0.747
Testing Data:
confusion_matrix(test_labels, ytest_predict_dtcl)
cart_test_acc=best_grid_dtcl.score(X_test, test_labels)
cart_test_acc
0.794344473007712
print(classification_report(test_labels, ytest_predict_dtcl))
ROC Curve:
AUC: 0.613
CART Conclusion:
Training Data:
cart_train_precision: 0.8
cart_train_recall: 0.99
cart_train_f1: 0.88
AUC: 0.74
Accuracy: 0.80
Testing Data:
cart_test_precision: 0.79
cart_test_recall: 0.99
cart_test_f1: 0.88
AUC: 0.61
Accuracy: 0.79
Training and test results are quite similar, although the AUC is lower on the test
data than on the training data.
Random Forest Model:
Training Data:
confusion_matrix(train_labels, ytrain_predict_rfcl)
rf_train_acc=best_grid_rfcl.score(X_train, train_labels)
rf_train_acc
0.8145695364238411
print(classification_report(train_labels, ytrain_predict_rfcl))
ROC Curve:
AUC is 0.8083928067284272
Testing Data:
confusion_matrix(test_labels, ytest_predict_rfcl)
rf_test_acc=best_grid_rfcl.score(X_test, test_labels)
rf_test_acc
0.7994858611825193
print(classification_report(test_labels, ytest_predict_rfcl))
ROC Curve:
AUC is 0.6893891577249457
Random Forest Conclusion:
Training Data:
rf_train_precision: 0.81
rf_train_recall: 0.99
rf_train_f1: 0.89
AUC: 0.80
Accuracy: 0.81
Testing Data:
rf_test_precision: 0.79
rf_test_recall: 0.99
rf_test_f1: 0.88
AUC: 0.68
Accuracy: 0.79
Training and testing results are similar to those of the CART model, but the AUC
on the test data is higher for this model than for the CART model.
Artificial Neural Network Model:
Training Data:
confusion_matrix(train_labels, ytrain_predict_nncl)
nn_train_acc=best_grid_nncl.score(X_train, train_labels)
nn_train_acc
0.7384105960264901
print(classification_report(train_labels, ytrain_predict_nncl))
ROC Curve:
AUC is 0.6983500646711619
Testing Data:
confusion_matrix(test_labels, ytest_predict_nncl)
nn_test_acc=best_grid_nncl.score(X_test, test_labels)
nn_test_acc
0.7403598971722365
print(classification_report(test_labels, ytest_predict_nncl))
ROC Curve:
AUC is 0.6391752577319589
Neural Network Conclusion:
Training Data:
nn_train_precision: 0.78
nn_train_recall: 0.9
nn_train_f1: 0.84
AUC: 0.69
Accuracy: 0.73
Testing Data:
nn_test_precision: 0.78
nn_test_recall: 0.9
nn_test_f1: 0.84
AUC: 0.63
Accuracy: 0.74
Training and testing results are quite similar, but the overall measures are lower
than for the other two models.
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.
A:
index = ['Accuracy', 'AUC', 'Recall', 'Precision', 'F1 Score']
data = pd.DataFrame({
    'CART Train': [cart_train_acc, cart_train_auc, cart_train_recall, cart_train_precision, cart_train_f1],
    'CART Test': [cart_test_acc, cart_test_auc, cart_test_recall, cart_test_precision, cart_test_f1],
    'Random Forest Train': [rf_train_acc, rf_train_auc, rf_train_recall, rf_train_precision, rf_train_f1],
    'Random Forest Test': [rf_test_acc, rf_test_auc, rf_test_recall, rf_test_precision, rf_test_f1],
    'Neural Network Train': [nn_train_acc, nn_train_auc, nn_train_recall, nn_train_precision, nn_train_f1],
    'Neural Network Test': [nn_test_acc, nn_test_auc, nn_test_recall, nn_test_precision, nn_test_f1]},
    index=index)
round(data, 3)
           CART Train  CART Test  Random Forest Train  Random Forest Test  Neural Network Train  Neural Network Test
Accuracy        0.801      0.794                0.815               0.799                 0.738                0.740
AUC             0.747      0.613                0.808               0.689                 0.698                0.639
Recall          0.990      0.990                0.990               0.990                 0.900                0.900
Precision       0.800      0.790                0.810               0.790                 0.780                0.780
F1 Score        0.880      0.880                0.890               0.880                 0.840                0.840
From the above results we can infer that the Random Forest model is the best choice
for this case study: its accuracy and AUC (Area Under Curve) are higher than those of
the other models on both the training and test data, so in general it can be expected
to perform better than the other models.
2.5 Inference: Based on these predictions, what are the insights and recommendations?
print (pd.DataFrame(best_grid_rfcl.feature_importances_,
columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))
Imp
stroke 0.391011
angina 0.195552
age 0.110486
diabetes 0.104443
yronset 0.053794
smstat 0.047519
highbp 0.045090
premi 0.036655
hichol 0.015450
Females with a past stroke, higher age, and diabetes are less likely to recover from
the disease, so these factors should be treated as the most crucial in the treatment
of CHD.