EDA Assignment
About the Data Set: The dataset contains cases from a study that was conducted between 1958 and 1970
at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for
breast cancer.
Database Information:
Source of Data Set: https://www.kaggle.com/gilsousa/habermans-survival-data-set
Objective: To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age,
year of treatment and number of positive lymph nodes.
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
In [2]:
# 1 = positive, 2 = negative
haberman["survival_status"] = haberman["survival_status"].map({1: 'positive', 2: 'negative'})
haberman.head()
Out[2]:
   age  year_of_operation  positive_axillary_nodes survival_status
0   30                 64                        1        positive
1   30                 62                        3        positive
2   30                 65                        0        positive
3   31                 59                        2        positive
4   31                 65                        4        positive
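The cell above uses `haberman` without showing how it was loaded. A minimal sketch of that step follows, assuming the Kaggle CSV ships without a header row and that the column names are the ones reported by `haberman.info()`. The inline text stands in for the file and reuses only the five rows shown in `Out[2]`, written with the original 1/2 coding; `sample_csv` and `haberman_sample` are illustrative names, not names from the notebook.

```python
import io
import pandas as pd

# Stand-in for the CSV file: the five rows shown in Out[2], with the
# survival status written in its original numeric coding (1 or 2).
sample_csv = io.StringIO(
    "30,64,1,1\n"
    "30,62,3,1\n"
    "30,65,0,1\n"
    "31,59,2,1\n"
    "31,65,4,1\n"
)

column_names = ["age", "year_of_operation", "positive_axillary_nodes", "survival_status"]

# header=None because the file is assumed to have no header row.
haberman_sample = pd.read_csv(sample_csv, header=None, names=column_names)

# Same mapping as in the notebook cell: 1 = positive, 2 = negative.
haberman_sample["survival_status"] = haberman_sample["survival_status"].map(
    {1: "positive", 2: "negative"}
)
print(haberman_sample.head())
```

With the real data, `sample_csv` would be replaced by the path to the downloaded file.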
In [3]:
haberman.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year_of_operation 306 non-null int64
positive_axillary_nodes 306 non-null int64
survival_status 306 non-null object
dtypes: int64(3), object(1)
memory usage: 9.6+ KB
Observations:
Conclusions:
In [4]:
print (haberman.shape)
(306, 4)
Number of Columns: 4
Attribute Information:
In [5]:
haberman["survival_status"].value_counts()
Out[5]:
positive 225
negative 81
Name: survival_status, dtype: int64
Observation: The dataset is imbalanced: 225 patients survived 5 years or longer and 81 patients died
within 5 years.
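To put a number on the imbalance, the class counts from `Out[5]` can be turned into shares. This is a small sketch that rebuilds the counts by hand so it runs on its own; `class_counts`, `class_share` and `imbalance_ratio` are illustrative names.

```python
import pandas as pd

# Class counts copied from Out[5].
class_counts = pd.Series({"positive": 225, "negative": 81})

# Share of each class among the 306 records.
class_share = class_counts / class_counts.sum()
print(class_share.round(3))

# Ratio of the majority class to the minority class.
imbalance_ratio = class_counts.max() / class_counts.min()
print(round(imbalance_ratio, 2))
```

On the full DataFrame the same shares should come out of `haberman["survival_status"].value_counts(normalize=True)`.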
In [6]:
print (haberman.describe())
Observation:
The maximum age of the patients is 83 years and the minimum is 30 years, with a mean of
52.46 years.
The median age of the patients is 52 years.
All operations in the data were performed between 1958 and 1969.
The maximum number of positive axillary nodes found in a patient is 52 and the minimum is 0, with a
mean of 4.03.
In [7]:
In [8]:
print (haberman_positive.describe())
Conclusion: Hence we can assume that a lower number of nodes may signify a lower health risk.
In [9]:
print (haberman_negative.describe())
Conclusion: Hence we can assume that a higher number of nodes may signify a higher health
risk.
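The per-class summaries above rely on two subsets, `haberman_positive` and `haberman_negative`, whose definition does not appear in the cells shown. One plausible way to build them is boolean filtering on `survival_status`. The sketch below uses a toy DataFrame (the five rows from `Out[2]` plus one made-up negative row) so that it runs on its own; the `toy*` names and the extra row are hypothetical.

```python
import pandas as pd

# Toy stand-in for the full DataFrame: the rows of Out[2] plus one
# made-up negative row so that both classes are present.
toy = pd.DataFrame({
    "age": [30, 30, 30, 31, 31, 45],
    "year_of_operation": [64, 62, 65, 59, 65, 60],
    "positive_axillary_nodes": [1, 3, 0, 2, 4, 12],
    "survival_status": ["positive"] * 5 + ["negative"],
})

# Boolean masks select the rows belonging to each class.
toy_positive = toy.loc[toy["survival_status"] == "positive"]
toy_negative = toy.loc[toy["survival_status"] == "negative"]

# Summary statistics of the node counts, one class at a time.
print(toy_positive["positive_axillary_nodes"].describe())
print(toy_negative["positive_axillary_nodes"].describe())
```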
In [10]:
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(plt.scatter, "positive_axillary_nodes", "age") \
.add_legend();
plt.show();
Observation: Most patients with a positive survival status have few positive axillary nodes.
Conclusion: The chance of recovery is greater when fewer positive axillary nodes are found.
In [11]:
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(plt.scatter, "year_of_operation", "age") \
.add_legend();
plt.show();
In [12]:
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(plt.scatter, "positive_axillary_nodes", "year_of_operation") \
.add_legend();
plt.show();
Pairplots
Here we get 3C2 = 3 distinct pair plots, because we have 3 features and choose 2 of them at a time.
The plots on the principal diagonal of the pair-plot grid are not considered.
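The count of feature pairs can be checked by enumerating them. This short sketch uses only the standard library; `features` and `feature_pairs` are illustrative names.

```python
from itertools import combinations
from math import comb

features = ["age", "year_of_operation", "positive_axillary_nodes"]

# All unordered pairs of distinct features.
feature_pairs = list(combinations(features, 2))
for x_name, y_name in feature_pairs:
    print(x_name, "vs", y_name)

# The number of pairs equals the binomial coefficient C(3, 2).
print(len(feature_pairs), comb(len(features), 2))
```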
In [13]:
sns.set_style('whitegrid')
sns.pairplot(haberman, hue='survival_status',
             vars=['age', 'year_of_operation', 'positive_axillary_nodes'], height=4)
plt.show()
Conclusion:
We were unable to draw any conclusion from the plots of year of operation vs age and year of
operation vs positive axillary nodes, because the two classes mostly overlap.
From the plot of positive axillary nodes vs age, however, it can be assumed
that the chance of recovery is greater when fewer positive axillary nodes are found.
1D scatter
In [14]:
# Patient Age
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(sns.distplot,"age")\
.add_legend();
plt.show();
C:\Users\Shiladitya\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Observation: Patients in the 30-40 age range survived the most.
In [15]:
# Year of Operation
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(sns.distplot,"year_of_operation")\
.add_legend();
plt.show()
C:\Users\Shiladitya\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Observation:
In [16]:
C:\Users\Shiladitya\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Conclusion: Positive axillary nodes is the most significant of the three features, as its class distributions overlap the least.
In [17]:
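# Node counts of patients who survived 5 years or longer.
# bins=10 is an assumption; the original histogram call is missing from this cell.
counts, bin_edges = np.histogram(haberman_positive['positive_axillary_nodes'], bins=10)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)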
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.xlabel('positive_axillary_nodes')
plt.show()
Observation: About 85% of the patients who survived more than 5 years after the operation had 10 or
fewer positive axillary nodes.
In [18]:
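# Node counts of patients who died within 5 years.
# bins=10 is an assumption; the original histogram call is missing from this cell.
counts, bin_edges = np.histogram(haberman_negative['positive_axillary_nodes'], bins=10)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)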
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.xlabel('positive_axillary_nodes')
plt.show()
Observation: About 85% of the patients who survived less than 5 years after the operation had 20 or
fewer positive axillary nodes.
Conclusion: The number of positive axillary nodes is closely related to the chance of survival: the more nodes, the lower the chance.
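Percentages of this kind are read off a cumulative distribution. The sketch below shows one way such a curve can be computed with NumPy: bin the values, normalise the bin counts so they sum to 1, then take a running total. The node counts are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical node counts, made up for illustration only.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 25])

# Histogram counts per bin, normalised so that the bin masses sum to 1.
counts, bin_edges = np.histogram(nodes, bins=5)
pdf = counts / counts.sum()

# Running total of the bin masses: the empirical CDF at each right bin edge.
cdf = np.cumsum(pdf)

for right_edge, cumulative in zip(bin_edges[1:], cdf):
    print(f"up to bin edge {right_edge:.0f}: {cumulative:.0%}")

# The same kind of figure can be computed directly from the raw values.
share_at_most_10 = np.mean(nodes <= 10)
print(share_at_most_10)
```

The binned curve depends on where the bin edges fall, so the direct computation on the raw values is the safer way to quote a figure for a specific threshold.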
In [19]:
sns.boxplot(x='survival_status',y='positive_axillary_nodes', data=haberman)
plt.show()
sns.boxplot(x='survival_status',y='year_of_operation', data=haberman)
plt.show()
sns.boxplot(x='survival_status',y='age', data=haberman)
plt.show()
In [20]:
sns.violinplot(x='survival_status',y='positive_axillary_nodes', data=haberman)
plt.show()
sns.violinplot(x='survival_status',y='year_of_operation', data=haberman)
plt.show()
sns.violinplot(x='survival_status',y='age', data=haberman)
plt.show()
Observation: Of the 3 features, positive_axillary_nodes has the most distinct distributions
between the two classes. From the above plots we can only conclude that the higher the
positive_axillary_nodes, the higher the chance of death. The age of the patient does not seem to have any
relation to survival status.
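The box plots can be backed by numbers by computing the quartiles per class, since the 25th and 75th percentiles are the box edges and the 50th percentile is the median line. This is a minimal sketch on made-up values; `toy` and `quartiles` are illustrative names, and the numbers are not from the dataset.

```python
import pandas as pd

# Hypothetical values, made up for illustration only.
toy = pd.DataFrame({
    "survival_status": ["positive"] * 5 + ["negative"] * 5,
    "positive_axillary_nodes": [0, 0, 1, 2, 4, 1, 3, 7, 11, 20],
})

# 25th, 50th and 75th percentiles per class, one row per class.
quartiles = (
    toy.groupby("survival_status")["positive_axillary_nodes"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Applied to `haberman`, the same call should give the values that the box plots display graphically.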
Final Conclusion:
From the dataset we can say that the majority of operations were performed on people aged between
40 and 70 (from the scatter plot of year_of_operation vs age).
We can say that a large number of operations were done between 1960 and 1965 (from the box plot
of year_of_operation vs survival_status).
We conclude that patients with 0 positive axillary nodes are more likely to survive
irrespective of their age (positive_axillary_nodes vs age).
Patients in the 30-40 age range survived the most.
From the box plot we can say that the more positive axillary nodes a patient has, the higher the chance
that the patient will die.