Business Report: by Sreenath Radhakrishnan

Problem 1:
The dataset given is about the health and economic conditions in different states of a country.
Group the states based on how similar their situations are, so that these groups can be provided to the
government and appropriate measures can be taken to improve their health and economic conditions.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, etc, etc)

A: Exploratory Data Analysis:

First, we read the file and check the first 5 rows of the dataset
using the head function.

df = pd.read_csv("State_wise_Health_income.csv")

   Unnamed: 0       States  Health_indeces1  Health_indices2  Per_capita_income    GDP
0           0      Bachevo              417               66                564   1823
1           1  Balgarchevo             1485              646               2710  73662
2           2    Belasitsa              654              299               1104  27318
3           3    Belo_Pole              192               25                573    250
4           4       Beslen               43                8                528     22

We use the shape attribute to check the number of rows and columns
in the dataset.


(297, 6)

To check the data types and summarize the data we use the info function.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 6 columns):
Unnamed: 0 297 non-null int64
States 297 non-null object
Health_indeces1 297 non-null int64
Health_indices2 297 non-null int64
Per_capita_income 297 non-null int64
GDP 297 non-null int64
dtypes: int64(5), object(1)
memory usage: 14.0+ KB

Check for any null values in the dataset using the isnull function.


Unnamed: 0 0
States 0
Health_indeces1 0
Health_indices2 0
Per_capita_income 0
dtype: int64

From the above result we confirm that there are no missing values.

To summarize the data we use the describe function.


       Unnamed: 0  Health_indeces1  Health_indices2  Per_capita_income            GDP
count  297.000000       297.000000       297.000000         297.000000     297.000000
mean   148.000000      2630.151515       693.632997        2156.915825  174601.117845
std     85.880731      2038.505431       468.944354        1491.854058  167167.992863
min      0.000000       -10.000000         0.000000         500.000000      22.000000
25%     74.000000       641.000000       175.000000         751.000000    8721.000000
50%    148.000000      2451.000000       810.000000        1865.000000  137173.000000
75%    222.000000      4094.000000      1073.000000        3137.000000  313092.000000
max    296.000000     10219.000000      1508.000000        7049.000000  728575.000000

Perform Univariate and Bivariate Analysis for the given data set.

Univariate Analysis:

To perform univariate analysis for the provided dataset we
considered the 4 continuous variables and used distplots and
boxplots. From the plots, Per_capita_income, GDP and
Health_indeces1 show positive skewness, while Health_indices2
is closer to a normal distribution. The box plots show outliers
in Health_indeces1 and Per_capita_income.

Bi-Variate Analysis:

Bi-variate analysis is performed to understand interactions between
different fields in the data set. We use a pair plot to visually
represent the degree of correlation between each pair of columns.

We also use a heatmap to understand the correlation between the
variables.

From the heat map we observe a positive correlation between GDP
and Per_capita_income and also between Health_indeces1 and
Per_capita_income. The diagonal values are all 1 because those
squares correlate each variable with itself.

1.2 Do you think scaling is necessary for clustering in this case? Justify

A: Scaling should be done before clustering this dataset. The
variables are on very different scales, which makes them hard to
compare directly.

If we observe the dataset, Health_indeces1 ranges from -10 to
10219, Health_indices2 from 0 to 1508, Per_capita_income from 500
to 7049, and GDP from 22 to 728575.

We need to bring the variables onto a common scale for a better
analysis. If we don't, clustering will be biased because the
Euclidean distance is dominated by the columns with the larger
ranges of values, here mainly GDP.
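
To illustrate the point, here is a minimal sketch that compares the Euclidean distance between two states before and after z-score scaling. The three rows are copied from the head output above; using StandardScaler is an assumption, since the report does not show which scaler was applied.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Three rows copied from the head() output above (two columns only).
sample = pd.DataFrame(
    {"Per_capita_income": [564, 2710, 1104], "GDP": [1823, 73662, 27318]},
    index=["Bachevo", "Balgarchevo", "Belasitsa"],
)

def distance(frame, a, b):
    """Euclidean distance between two rows of a DataFrame."""
    return float(np.linalg.norm(frame.loc[a].to_numpy() - frame.loc[b].to_numpy()))

# Unscaled: the distance is almost entirely the GDP gap.
raw_dist = distance(sample, "Bachevo", "Balgarchevo")
gdp_gap = abs(int(sample.loc["Bachevo", "GDP"]) - int(sample.loc["Balgarchevo", "GDP"]))

# Scaled (assumed scaler): every column gets mean 0 and standard deviation 1.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(sample),
    columns=sample.columns,
    index=sample.index,
)
scaled_dist = distance(scaled, "Bachevo", "Balgarchevo")

print(f"raw distance    = {raw_dist:.1f} (GDP gap alone = {gdp_gap})")
print(f"scaled distance = {scaled_dist:.3f}")
```

On the raw values the income difference barely changes the distance; after scaling, both columns contribute on comparable terms.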

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.

A: By applying hierarchical clustering to the scaled dataset using
Ward's method, we get the full dendrogram below.

Because all 297 observations are drawn, it is impossible to read
the clusters and make an inference from it.

We therefore truncate the dendrogram to show only the last 25
leaves of the hierarchy, which can be used to make further
inferences.

Using the 'maxclust' criterion we cut the dendrogram obtained
above into flat clusters. Below is the array of cluster labels,
one per state; only labels 1 and 2 occur in it.

array([2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
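
The steps described in this answer can be sketched as follows. This is only an illustration on synthetic points, not the report's data; the variable names and the choice of two flat clusters are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in for the scaled data: two well-separated groups.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 4)),
    rng.normal(loc=5.0, scale=0.3, size=(15, 4)),
])

# Ward linkage merges, at each step, the pair of clusters that gives
# the smallest increase in within-cluster variance.
merges = linkage(points, method="ward")

# Truncated dendrogram (truncate_mode='lastp', p=25), as in the text.
# no_plot=True returns the plot data without drawing anything.
tree = dendrogram(merges, truncate_mode="lastp", p=25, no_plot=True)

# Cut the tree into at most 2 flat clusters; labels start at 1.
labels = fcluster(merges, t=2, criterion="maxclust")
print(np.unique(labels, return_counts=True))
```

The `labels` array has one entry per observation, in the same row order as the input, so it can be attached to the original data frame as a new column.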

1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve
and find silhouette score.

A: We apply K-means to the scaled data and record the
within-cluster sum of squares (WSS) for each number of clusters
from 1 to 14.

from sklearn.cluster import KMeans

wss = []
for i in range(1,15):
    KM = KMeans(n_clusters = i)
    KM.fit(scaled_df)
    wss.append(KM.inertia_)   # inertia_ is the WSS for this fit

for i in range(1,15):
    print('The WSS value for',i,'clusters is',wss[i-1])

The WSS value for 1 clusters is 1188.0000000000005
The WSS value for 2 clusters is 471.3102140867778
The WSS value for 3 clusters is 260.5729408376231
The WSS value for 4 clusters is 183.60983976801245
The WSS value for 5 clusters is 149.787873629525
The WSS value for 6 clusters is 117.21156155994824
The WSS value for 7 clusters is 90.7040780005266
The WSS value for 8 clusters is 79.62086308917355
The WSS value for 9 clusters is 70.83287890866458
The WSS value for 10 clusters is 64.46674739014672
The WSS value for 11 clusters is 56.30758119028052
The WSS value for 12 clusters is 51.35374692455001
The WSS value for 13 clusters is 46.99615066493063
The WSS value for 14 clusters is 45.24802616666356

From the above WSS values, the drop in WSS becomes small after
four clusters (260.6 to 183.6 from 3 to 4 clusters, then 183.6 to
149.8 from 4 to 5). Hence we choose 4 clusters to group the states
based on health and income.

Please find below elbow curve:

From the elbow curve we see that there is a slight variation
between n=3 and n=4, and the curve starts to flatten from n=4.

Therefore the optimum number of clusters for segmenting the states
by health and economic condition is 4.

The silhouette score is 0.55, which is adequate to support the
chosen number of clusters.

silhouette_score(scaled_df, labels)
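
As a rough illustration of how the elbow values and the silhouette score fit together, the sketch below runs both for several cluster counts. It uses synthetic blobs rather than the report's data, and n_init, random_state and the blob centers are assumptions chosen only to make the example repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled_df: 4 well-separated blobs in 4 dimensions.
centers = [[0, 0, 0, 0], [6, 6, 0, 0], [0, 6, 6, 0], [6, 0, 0, 6]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                  random_state=1)

wss = {}   # k -> within-cluster sum of squares (inertia_)
sil = {}   # k -> silhouette score (only defined for k >= 2)
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss[k] = km.inertia_
    if k >= 2:
        sil[k] = silhouette_score(X, km.labels_)

for k in wss:
    print(k, round(wss[k], 1), round(sil[k], 3) if k in sil else "n/a")

# The k with the highest silhouette score is a second opinion on the elbow.
best_k = max(sil, key=sil.get)
print("k with the highest silhouette score:", best_k)
```

The silhouette score ranges from -1 to 1; values closer to 1 mean points sit well inside their own cluster and far from the neighbouring one.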


1.5. Describe cluster profiles for the clusters defined. Recommend different priority based actions
need to be taken for different clusters on the bases of their vulnerability situations according to
their Economic and Health Conditions.




   Health_indeces1  Health_indices2  Per_capita_income            GDP  clusters  freq
0       499.158416       116.356436         693.772277    9428.099010  2.000000   101
1      2597.089109       783.019802        2464.128713  141264.138614  1.960396   101
2      4799.355932      1142.288136        2372.220339  396907.237288  1.000000    59
3      5146.444444      1327.138889        5047.083333  367196.916667  1.000000    36

Cluster 1 (row 0 of the table):

This cluster has the worst health systems and the lowest
average income.

Cluster 2 (row 1):

This cluster has better health systems and a higher average
income than cluster 1.

Cluster 3 (row 2):

This cluster has the highest GDP but a slightly lower
Per_capita_income than cluster 2.

Cluster 4 (row 3):

This cluster has the best health systems and the highest
average income.


Cluster 1 has the worst health systems, and its average income
and GDP are the lowest as well. This cluster needs immediate
attention. Cluster 2 is the next priority: its health systems are
better than those of cluster 1, but it lags clusters 3 and 4 on
the health indices and GDP, and cluster 4 on per-capita income.
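
A cluster profile table like the one in this section can be built with a groupby. The sketch below uses a handful of made-up rows and a hypothetical `kmeans_label` column; it is not the code used for the report.

```python
import pandas as pd

# Toy data: a few states with a K-means label attached (values are made up).
states = pd.DataFrame({
    "Per_capita_income": [560, 580, 2700, 2500, 5100],
    "GDP": [1800, 250, 73000, 140000, 360000],
    "kmeans_label": [0, 0, 1, 1, 2],
})

# Mean of every variable per cluster, plus the number of states in each.
profile = states.groupby("kmeans_label").mean()
profile["freq"] = states["kmeans_label"].value_counts().sort_index()

# Rank clusters from most to least vulnerable (lowest income first).
priority = profile.sort_values("Per_capita_income").index.tolist()
print(profile)
print("priority order:", priority)
```

Sorting the profile by the variable of interest gives a simple priority order for intervention.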

Problem 2:

The mifem data frame has 1295 rows and 10 columns. This is a dataset of females having coronary
heart disease (CHD). You have to predict from the given information whether the female is dead or
alive, so as to discover important factors that should be considered crucial in the treatment of the
disease. Use CART, RF & ANN and compare the models' performances on the train and test sets.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, write an inference on it.

A: Load the dataset and check the data using the head function.

mifem_df = pd.read_csv('mifem.csv')

   Unnamed: 0  Unnamed: 0.1  outcome  age  yronset  premi  smstat  diabetes  highbp  hichol  angina  stroke
0           0             1     live   63       85      n       x         n       y       y       n       n
1           1             6     live   55       85      n       c         n       y       y       n       n
2           2             8     live   68       85      y      nk        nk       y       y       y       n
3           3            10     live   64       85      n       x         n       y       n       y       n
4           4            11     dead   67       85      n      nk        nk      nk       y      nk      nk

To check the data types and summarize the data we use the info function.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1295 entries, 0 to 1294
Data columns (total 12 columns):
Unnamed: 0 1295 non-null int64
Unnamed: 0.1 1295 non-null int64
outcome 1295 non-null object
age 1295 non-null int64
yronset 1295 non-null int64
premi 1295 non-null object
smstat 1295 non-null object
diabetes 1295 non-null object
highbp 1295 non-null object
hichol 1295 non-null object
angina 1295 non-null object
stroke 1295 non-null object
dtypes: int64(4), object(8)
memory usage: 121.5+ KB

We use the shape attribute to find the number of rows and columns
in the dataset.


(1295, 12)

Check for missing values in any column.


Unnamed: 0 0
Unnamed: 0.1 0
outcome 0
age 0
yronset 0
premi 0
smstat 0
diabetes 0
highbp 0
hichol 0
angina 0
stroke 0
dtype: int64

From the above result we confirm that there are no null values in
the dataset.

We use the describe function to get statistical details such as
percentiles, mean and standard deviation of the dataset.


        Unnamed: 0  Unnamed: 0.1          age      yronset
count  1295.000000   1295.000000  1295.000000  1295.000000
mean    647.000000   3156.392278    60.922008    88.785328
std     373.978609   1821.587708     7.042327     2.553647
min       0.000000      1.000000    35.000000    85.000000
25%     323.500000   1577.000000    57.000000    87.000000
50%     647.000000   3030.000000    63.000000    89.000000
75%     970.500000   4706.500000    66.000000    91.000000
max    1294.000000   6366.000000    69.000000    93.000000

Check for duplicate data:

dups = mifem_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 75

From the above result we find that there are 75 duplicate rows in
the dataset.

Although these 75 rows are duplicates, they may belong to
different persons, since the unnamed id column differs between
them. Hence I'm not removing the duplicates.
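
The effect of an id column on the duplicate check can be seen in a small sketch. The records below are made up for illustration, and the column name `id` is a stand-in for the unnamed id column.

```python
import pandas as pd

# Toy records: rows 0 and 2 have identical clinical values but
# carry different ids (all values here are made up).
records = pd.DataFrame({
    "id": [1, 6, 8],
    "outcome": ["live", "dead", "live"],
    "age": [63, 55, 63],
    "stroke": ["n", "y", "n"],
})

# With the id column included, no row is an exact repeat of another.
dups_with_id = int(records.duplicated().sum())

# Ignoring the id column, the third row repeats the first.
dups_without_id = int(records.drop(columns="id").duplicated().sum())

print("duplicates with id:", dups_with_id)
print("duplicates without id:", dups_without_id)
```

Rows that match on every clinical field but differ in id can be distinct patients who happen to share the same profile, which is the reasoning for keeping them.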

Perform Univariate and Bi-Variate Analysis:

Univariate Analysis:

To perform the univariate analysis of the given dataset we
considered the two continuous variables and used a distplot and
box plot to analyze the distribution of observations. The
variable 'age' is negatively skewed (its mean of 60.9 is below
the median of 63, with a long tail down to 35), while 'yronset'
is roughly symmetric. The box plot also shows outliers for age.

Bi-Variate Analysis:

Bi-variate analysis is performed to understand interactions between
different fields in the data set. We use a pair plot to visually
represent the degree of correlation between the two columns.

We also use a heatmap to understand the correlation between the
variables.

In the heat map the diagonal values are all 1 because those
squares correlate each variable with itself. We also observe
that age and yronset are almost uncorrelated: the coefficient is
slightly negative and close to 0.00, as indicated by the dark
color coding.

2.2. Encode the data (having string values) for Modelling. Data Split: Split the data into test and
train, build classification model CART, Random Forest, Artificial Neural Network

A: For modeling, convert the object variables into integer codes
and check the data types using the info function.

for feature in mifem_df.columns:
    if mifem_df[feature].dtype == 'object':
        mifem_df[feature] = pd.Categorical(mifem_df[feature]).codes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1295 entries, 0 to 1294
Data columns (total 10 columns):
outcome 1295 non-null int8
age 1295 non-null int64
yronset 1295 non-null int64
premi 1295 non-null int8
smstat 1295 non-null int8
diabetes 1295 non-null int8
highbp 1295 non-null int8
hichol 1295 non-null int8
angina 1295 non-null int8
stroke 1295 non-null int8
dtypes: int64(2), int8(8)
memory usage: 30.5 KB


   outcome  age  yronset  premi  smstat  diabetes  highbp  hichol  angina  stroke
0        1   63       85      0       3         0       2       1       0       0
1        1   55       85      0       0         0       2       1       0       0
2        1   68       85      2       2         1       2       1       2       0
3        1   64       85      0       3         0       2       0       2       0
4        0   67       85      0       2         1       1       1       1       1
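
To make the encoding easier to read back, the sketch below shows how pd.Categorical assigns codes. The values are taken from the first five rows shown earlier; because only those rows are used, the smstat mapping here can differ from the one obtained on the full column.

```python
import pandas as pd

# Values taken from the smstat and outcome columns in the head() output.
smstat = pd.Series(["x", "c", "nk", "x", "nk"])
outcome = pd.Series(["live", "live", "live", "live", "dead"])

for name, column in [("smstat", smstat), ("outcome", outcome)]:
    cat = pd.Categorical(column)
    # Categories are sorted, and each code is a position in that sorted list.
    mapping = dict(enumerate(cat.categories))
    print(name, mapping, cat.codes.tolist())

outcome_codes = pd.Categorical(outcome).codes.tolist()
smstat_categories = list(pd.Categorical(smstat).categories)
```

Since the categories are sorted alphabetically, "dead" becomes 0 and "live" becomes 1, which matches the outcome column in the encoded table. The codes carry no clinical ordering.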

Split the dataset in to Train and Test Data


X_train (906, 9)
X_test (389, 9)
train_labels (906,)
test_labels (389,)
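
The shapes above are consistent with a 70/30 split of 1295 rows. A minimal sketch of such a split follows; the data are random placeholders, and test_size=0.30 and random_state=1 are assumptions rather than values stated in the report.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with the same shape as the encoded data:
# 1295 rows, 9 predictors plus the 'outcome' target (values are random).
rng = np.random.default_rng(1)
data = pd.DataFrame(
    rng.integers(0, 3, size=(1295, 10)),
    columns=["outcome", "age", "yronset", "premi", "smstat",
             "diabetes", "highbp", "hichol", "angina", "stroke"],
)

X = data.drop(columns="outcome")
y = data["outcome"]

# test_size and random_state are assumed values for this illustration.
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1)

print("X_train", X_train.shape)
print("X_test", X_test.shape)
print("train_labels", train_labels.shape)
print("test_labels", test_labels.shape)
```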

CART Model:

First import the Decision Tree Classifier. We fit the model on the
training data and use the test data for evaluation.

print (pd.DataFrame(best_grid_dtcl.feature_importances_, columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))

stroke 0.809294
age 0.090348
angina 0.086010
premi 0.011889
smstat 0.002459
yronset 0.000000
diabetes 0.000000
highbp 0.000000
hichol 0.000000

The variables above are ranked in decreasing order of their
importance for the prediction.

The model also returns the predicted class and the class
probabilities (columns 0 and 1).
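
The name best_grid_dtcl suggests the tree was tuned with a grid search. A hedged sketch of that workflow is given below on synthetic data; the parameter grid, cv value and variable names are hypothetical, since the report does not list them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the encoded predictors.
X, y = make_classification(n_samples=600, n_features=9, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Hypothetical search grid; the report does not list the values it tried.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [10, 30],
    "min_samples_split": [30, 90],
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                           param_grid, cv=3)
grid_search.fit(X_train, y_train)        # fit on the training data only
best_tree = grid_search.best_estimator_

predicted_class = best_tree.predict(X_test)        # hard 0/1 labels
predicted_prob = best_tree.predict_proba(X_test)   # one column per class

print(grid_search.best_params_)
print(best_tree.feature_importances_.round(3))
print(predicted_prob[:5])
```

The two columns of `predicted_prob` correspond to classes 0 and 1, and each row sums to 1.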

Random Forest Classifier:

First import the Random Forest Classifier. We fit the model on the
training data and use the test data for evaluation.

print (pd.DataFrame(best_grid_rfcl.feature_importances_,
columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))

stroke 0.391011
angina 0.195552
age 0.110486
diabetes 0.104443
yronset 0.053794
smstat 0.047519
highbp 0.045090
premi 0.036655
hichol 0.015450

The variables above are ranked in decreasing order of their
importance for the prediction.

Artificial Neural Network Classifier:

First import the MLPClassifier. We fit the model on the training
data and use the test data for evaluation.

grid_search_nncl.fit(X_train, train_labels)
best_grid_nncl = grid_search_nncl.best_estimator_

The best parameters we get from grid search are as below:

MLPClassifier(hidden_layer_sizes=200, max_iter=2500, random_state=1, tol=0.01)



2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model

A: The performance metrics for all models are given below:

CART Model:

Training Data:

confusion_matrix(train_labels, ytrain_predict_dtcl)

array([[ 49, 174],
       [  6, 677]], dtype=int64)

cart_train_acc=best_grid_dtcl.score(X_train, train_labels)


print(classification_report(train_labels, ytrain_predict_dtcl))

precision recall f1-score support

0 0.89 0.22 0.35 223

1 0.80 0.99 0.88 683

accuracy 0.80 906

macro avg 0.84 0.61 0.62 906
weighted avg 0.82 0.80 0.75 906

ROC Curve:

AUC: 0.747

Testing Data:

confusion_matrix(test_labels, ytest_predict_dtcl)

array([[ 21,  77],
       [  3, 288]], dtype=int64)

cart_test_acc=best_grid_dtcl.score(X_test, test_labels)


print(classification_report(test_labels, ytest_predict_dtcl))

precision recall f1-score support

0 0.88 0.21 0.34 98

1 0.79 0.99 0.88 291

accuracy 0.79 389

macro avg 0.83 0.60 0.61 389
weighted avg 0.81 0.79 0.74 389

ROC Curve:

AUC: 0.613

CART Conclusion:

Training Data:

cart_train_precision: 0.8
cart_train_recall: 0.99
cart_train_f1: 0.88
AUC: 0.75
Accuracy: 0.80

Testing Data:

cart_test_precision: 0.79
cart_test_recall: 0.99
cart_test_f1: 0.88
AUC: 0.61
Accuracy: 0.79

Training and test results are almost identical on accuracy,
precision, recall and F1. The AUC, however, drops from 0.75 on the
training data to 0.61 on the test data.
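
The summary figures can be checked by hand from the confusion matrix. The worked sketch below uses the CART training matrix reported above and assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes), with 0 = dead and 1 = live as in the encoded data.

```python
# CART training confusion matrix from the report:
# rows = actual class, columns = predicted class (0 = dead, 1 = live).
cm = [[49, 174],
      [6, 677]]

tn, fp = cm[0]   # actual 0 predicted as 0 / predicted as 1
fn, tp = cm[1]   # actual 1 predicted as 0 / predicted as 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # share of 'live' predictions that are correct
recall_1 = tp / (tp + fn)      # share of actual 'live' cases that are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
recall_0 = tn / (tn + fp)      # share of actual 'dead' cases that are found

print(f"accuracy    = {accuracy:.2f}")
print(f"precision_1 = {precision_1:.2f}")
print(f"recall_1    = {recall_1:.2f}")
print(f"f1_1        = {f1_1:.2f}")
print(f"recall_0    = {recall_0:.2f}")
```

The precision, recall and F1 values quoted in the conclusion are those of class 1 (live). Recall for class 0 (dead) is far lower, as the classification report shows.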

Random Forest Model:

Training Data:

confusion_matrix(train_labels, ytrain_predict_rfcl)

array([[ 60, 163],
       [  5, 678]], dtype=int64)

rf_train_acc=best_grid_rfcl.score(X_train, train_labels)


print(classification_report(train_labels, ytrain_predict_rfcl))

precision recall f1-score support

0 0.92 0.27 0.42 223

1 0.81 0.99 0.89 683

accuracy 0.81 906

macro avg 0.86 0.63 0.65 906
weighted avg 0.83 0.81 0.77 906

ROC Curve:

AUC is 0.8083928067284272

Testing Data:

confusion_matrix(test_labels, ytest_predict_rfcl)

array([[ 23,  75],
       [  3, 288]], dtype=int64)

rf_test_acc=best_grid_rfcl.score(X_test, test_labels)


print(classification_report(test_labels, ytest_predict_rfcl))

precision recall f1-score support

0 0.88 0.23 0.37 98

1 0.79 0.99 0.88 291

accuracy 0.80 389

macro avg 0.84 0.61 0.63 389
weighted avg 0.82 0.80 0.75 389

ROC Curve:

AUC is 0.6893891577249457

Random Forest Conclusion:

Training Data:

rf_train_precision: 0.81
rf_train_recall: 0.99
rf_train_f1: 0.89
AUC: 0.81
Accuracy: 0.81

Testing Data:

rf_test_precision: 0.79
rf_test_recall: 0.99
rf_test_f1: 0.88
AUC: 0.69
Accuracy: 0.80

Training and testing results are almost the same as for the CART
model, but the test AUC is higher for this model (0.69) than for
the CART model (0.61).
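
For reference, a train-versus-test AUC comparison of this kind can be computed as sketched below. The data are synthetic and imbalanced, and the forest hyperparameters are hypothetical, so the numbers will not match the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (about 75% in class 1); not the mifem data.
X, y = make_classification(n_samples=1000, n_features=9, n_informative=4,
                           weights=[0.25, 0.75], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)

# Hypothetical hyperparameters, chosen only to keep the example small.
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
forest.fit(X_train, y_train)

# AUC is computed from the predicted probability of class 1, not from labels.
train_auc = roc_auc_score(y_train, forest.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

print(f"train AUC = {train_auc:.3f}")
print(f"test AUC  = {test_auc:.3f}")
```

A large gap between the two values is a sign of overfitting; a small gap suggests the model generalises to unseen data.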

Neural Network Model:

Training Data:

confusion_matrix(train_labels, ytrain_predict_nncl)

array([[ 54, 169],
       [ 68, 615]], dtype=int64)

nn_train_acc=best_grid_nncl.score(X_train, train_labels)


print(classification_report(train_labels, ytrain_predict_nncl))

precision recall f1-score support

0 0.44 0.24 0.31 223

1 0.78 0.90 0.84 683

accuracy 0.74 906

macro avg 0.61 0.57 0.58 906
weighted avg 0.70 0.74 0.71 906

ROC Curve:

AUC is 0.6983500646711619

Testing Data:

confusion_matrix(test_labels, ytest_predict_nncl)

array([[ 26,  72],
       [ 29, 262]], dtype=int64)

nn_test_acc=best_grid_nncl.score(X_test, test_labels)


print(classification_report(test_labels, ytest_predict_nncl))

precision recall f1-score support

0 0.47 0.27 0.34 98

1 0.78 0.90 0.84 291

accuracy 0.74 389

macro avg 0.63 0.58 0.59 389
weighted avg 0.71 0.74 0.71 389

ROC Curve:

AUC is 0.6391752577319589

Neural Network Conclusion:

Training Data:

nn_train_precision: 0.78
nn_train_recall: 0.9
nn_train_f1: 0.84
AUC: 0.70
Accuracy: 0.74

Testing Data:

nn_test_precision: 0.78
nn_test_recall: 0.9
nn_test_f1: 0.84
AUC: 0.64
Accuracy: 0.74

Training and testing results are almost similar, but the overall
measures are lower than those of the other two models.

2.4 Final Model: Compare all the models and write an inference which model is best/optimized.

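index=['Accuracy', 'AUC', 'Recall', 'Precision', 'F1 Score']
data = pd.DataFrame({
    'CART Train': [cart_train_acc,cart_train_auc,cart_train_recall,cart_train_precision,cart_train_f1],
    'CART Test': [cart_test_acc,cart_test_auc,cart_test_recall,cart_test_precision,cart_test_f1],
    'Random Forest Train': [rf_train_acc,rf_train_auc,rf_train_recall,rf_train_precision,rf_train_f1],
    'Random Forest Test': [rf_test_acc,rf_test_auc,rf_test_recall,rf_test_precision,rf_test_f1],
    'Neural Network Train': [nn_train_acc,nn_train_auc,nn_train_recall,nn_train_precision,nn_train_f1],
    'Neural Network Test': [nn_test_acc,nn_test_auc,nn_test_recall,nn_test_precision,nn_test_f1]}, index=index)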

           CART Train  CART Test  Random Forest Train  Random Forest Test  Neural Network Train  Neural Network Test
Accuracy        0.801      0.794                0.815               0.799                 0.738                0.740
AUC             0.747      0.613                0.808               0.689                 0.698                0.639
Recall          0.990      0.990                0.990               0.990                 0.900                0.900
Precision       0.800      0.790                0.810               0.790                 0.780                0.780
F1 Score        0.880      0.880                0.890               0.880                 0.840                0.840

From the above results we infer that the Random Forest model can be
selected as the best model for this case study: it has the highest
accuracy and AUC (Area Under Curve) on both the train and test sets,
so in general it will tend to perform better than the other models.

2.5 Inference: Basis on these predictions, what are the insights and recommendations

A: We split the provided dataset into training and testing data.
After fitting all three models and comparing their performance, we
conclude that the Random Forest model is best suited for predicting
mortality.

The most important features for predicting whether females are dead
or alive are:

print (pd.DataFrame(best_grid_rfcl.feature_importances_,
columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))

stroke 0.391011
angina 0.195552
age 0.110486
diabetes 0.104443
yronset 0.053794

These factors should be considered while treating the disease; if
they are neglected during treatment, the chance of mortality among
females will be high.

Females with a past stroke, higher age or diabetes are less likely
to recover from the disease.


