EDA Assignment


3/15/2020 EDA Assignment

Haberman's Survival Data Set


Survival of patients who had undergone surgery for breast cancer.

About the Data Set: The dataset contains cases from a study that was conducted between 1958 and 1970
at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for
breast cancer.

Database Information:
Source of Data Set: https://www.kaggle.com/gilsousa/habermans-survival-data-set

Objective: To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age,
year of treatment and number of positive lymph nodes.

In [1]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

haberman = pd.read_csv('haberman.csv', header=0,
                       names=['age', 'year_of_operation',
                              'positive_axillary_nodes', 'survival_status'])

In [2]:

# 1 = positive (survived 5 years or longer), 2 = negative (died within 5 years)
haberman["survival_status"] = haberman["survival_status"].map({1: 'positive', 2: 'negative'})
haberman.head()

Out[2]:

   age  year_of_operation  positive_axillary_nodes survival_status
0   30                 64                        1        positive
1   30                 62                        3        positive
2   30                 65                        0        positive
3   31                 59                        2        positive
4   31                 65                        4        positive

file:///C:/Users/Shiladitya/Desktop/EDA Assignment.html 1/15



In [3]:

haberman.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year_of_operation 306 non-null int64
positive_axillary_nodes 306 non-null int64
survival_status 306 non-null object
dtypes: int64(3), object(1)
memory usage: 9.6+ KB

Observations:

The dataset has 306 entries, indexed from 0 to 305.
There are 4 columns: 3 features and 1 output (the class label).
Each column has 306 non-null values.
The output column, survival_status, takes only two values, 1 or 2, which were mapped to 'positive' and
'negative'.

Conclusions:

There are no missing values in this dataset.
Predicting the survival_status column is a binary classification problem.
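
The missing-value check above can also be run after the label mapping. The sketch below is only an illustration: instead of reading haberman.csv it rebuilds the five rows shown in Out[2], with the raw code 1 they were mapped from, and the name `sample` is an assumption made for this example.

```python
import pandas as pd

# The five rows shown in Out[2], with the raw status code 1 (= 'positive').
sample = pd.DataFrame({
    'age': [30, 30, 30, 31, 31],
    'year_of_operation': [64, 62, 65, 59, 65],
    'positive_axillary_nodes': [1, 3, 0, 2, 4],
    'survival_status': [1, 1, 1, 1, 1],
})

# Series.map returns NaN for any code missing from the dictionary, so checking
# for nulls after the mapping also catches unexpected status codes.
sample['survival_status'] = sample['survival_status'].map({1: 'positive', 2: 'negative'})

missing_per_column = sample.isna().sum()
print(missing_per_column)
print('any missing:', bool(missing_per_column.any()))
```

Running the same two lines on the full `haberman` frame should report zero missing values per column if the file contains only the codes 1 and 2.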

In [4]:

print (haberman.shape)

(306, 4)

Number of Rows: 306

Number of Columns: 4

Attribute Information:

Age of patient at time of operation (numerical)
Patient's year of operation (year - 1900, numerical)
Number of positive axillary nodes detected (numerical)
Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within
5 years
Missing Attribute Values: None

Positive axillary nodes: https://en.wikipedia.org/wiki/Positive_axillary_lymph_node


In [5]:

haberman["survival_status"].value_counts()

Out[5]:

positive 225
negative 81
Name: survival_status, dtype: int64

Observation: The dataset is imbalanced: 225 patients survived 5 years or longer, while 81 patients died
within 5 years.
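
To put a number on the imbalance, the class shares can be computed from the counts in Out[5]. This is a small sketch; the variable names are assumptions introduced here, and the "majority baseline" is just a reference point, not part of the original analysis.

```python
import pandas as pd

# Class counts copied from Out[5].
class_counts = pd.Series({'positive': 225, 'negative': 81})

# Share of each class in the whole dataset.
class_share = class_counts / class_counts.sum()
print(class_share.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = class_share.max()
print(f'majority-class baseline accuracy: {majority_baseline:.3f}')
```

About 73.5% of the patients are in the positive class, so any classifier built on this data should be judged against that baseline and not against 50%.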

In [6]:

print (haberman.describe())

              age  year_of_operation  positive_axillary_nodes
count  306.000000         306.000000               306.000000
mean    52.457516          62.852941                 4.026144
std     10.803452           3.249405                 7.189654
min     30.000000          58.000000                 0.000000
25%     44.000000          60.000000                 0.000000
50%     52.000000          63.000000                 1.000000
75%     60.750000          65.750000                 4.000000
max     83.000000          69.000000                52.000000

Observation:

The maximum age of the patients is 83 years and the minimum is 30 years, with a mean of 52.46 years.
The median age of the patients is 52 years.
All operations in the dataset were performed between 1958 and 1969.
The maximum number of positive axillary nodes found in a patient is 52 and the minimum is 0, with a
mean of 4.03.
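
Two details of this summary can be made explicit with a few lines of arithmetic. The sketch below uses only values copied from the describe() output; the variable names are assumptions for this example.

```python
# Values copied from the describe() output for the full dataset.
year_min, year_max = 58, 69
nodes_mean, nodes_median, nodes_q3 = 4.026144, 1.0, 4.0

# year_of_operation stores "year - 1900", so add 1900 to get calendar years.
first_year, last_year = 1900 + year_min, 1900 + year_max
print(f'operations span {first_year}-{last_year}')

# A mean well above the median (here even above the 75th percentile)
# points to a right-skewed distribution with a few very large values.
is_right_skewed = nodes_mean > nodes_median
mean_above_q3 = nodes_mean > nodes_q3
print(f'mean > median: {is_right_skewed}, mean > 75th percentile: {mean_above_q3}')
```

Because positive_axillary_nodes is this skewed, the median and quartiles describe a typical patient better than the mean does.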

In [7]:

haberman_positive = haberman.loc[haberman.survival_status == 'positive']
haberman_negative = haberman.loc[haberman.survival_status == 'negative']

In [8]:

print (haberman_positive.describe())

              age  year_of_operation  positive_axillary_nodes
count  225.000000         225.000000               225.000000
mean    52.017778          62.862222                 2.791111
std     11.012154           3.222915                 5.870318
min     30.000000          58.000000                 0.000000
25%     43.000000          60.000000                 0.000000
50%     52.000000          63.000000                 0.000000
75%     60.000000          66.000000                 3.000000
max     77.000000          69.000000                46.000000


Observation: For patients who survived 5 years or longer, the mean number of positive axillary nodes is 2.79.

Conclusion: A lower number of positive axillary nodes may indicate a lower health risk.

In [9]:

print (haberman_negative.describe())

             age  year_of_operation  positive_axillary_nodes
count  81.000000          81.000000                81.000000
mean   53.679012          62.827160                 7.456790
std    10.167137           3.342118                 9.185654
min    34.000000          58.000000                 0.000000
25%    46.000000          59.000000                 1.000000
50%    53.000000          63.000000                 4.000000
75%    61.000000          65.000000                11.000000
max    83.000000          69.000000                52.000000

Observation: For patients who died within 5 years, the mean number of positive axillary nodes is 7.46.

Conclusion: A higher number of positive axillary nodes may indicate a higher health risk.
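
The two per-class summaries can also be produced in a single call with groupby, which keeps the classes side by side. The sketch below runs on a handful of toy rows that were invented for illustration and are not taken from haberman.csv; `toy` and `summary` are hypothetical names.

```python
import pandas as pd

# Toy rows invented for illustration only; they are NOT taken from haberman.csv.
toy = pd.DataFrame({
    'positive_axillary_nodes': [0, 1, 3, 4, 11, 9],
    'survival_status': ['positive', 'positive', 'positive',
                        'negative', 'negative', 'negative'],
})

# One row of summary statistics per class, instead of filtering each class separately.
summary = toy.groupby('survival_status')['positive_axillary_nodes'].agg(['mean', 'median'])
print(summary)
```

Applied to the real frame, `haberman.groupby('survival_status')['positive_axillary_nodes'].agg(['mean', 'median'])` should reproduce the means 2.79 and 7.46 reported above.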

2-D Scatter plot


In [10]:

sns.set_style("whitegrid")
# Note: seaborn 0.9 renamed the `size` argument of FacetGrid to `height`.
sns.FacetGrid(haberman, hue="survival_status", size=5) \
   .map(plt.scatter, "positive_axillary_nodes", "age") \
   .add_legend()
plt.show()


Observation: Most patients with a positive survival status have few positive axillary nodes.

Conclusion: The chance of survival appears to be greater when fewer positive axillary nodes are found.

In [11]:

sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="survival_status", size=5) \
.map(plt.scatter, "year_of_operation", "age") \
.add_legend();
plt.show();

Conclusion: Most of the operations were performed on patients between 40 and 70 years of age.


In [12]:

sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="survival_status", size=5) \
.map(plt.scatter, "positive_axillary_nodes", "year_of_operation") \
.add_legend();
plt.show();

Conclusion: No significant conclusion can be drawn because the two classes overlap heavily.

Pairplots

Here we get 3C2 = 3 pair plots, because we have 3 features and each plot uses 2 of them.
The plots on the principal diagonal of the pair plot are not counted.
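
The count of 3 can be checked by listing the feature pairs directly. This is a minimal sketch using only the standard library; the list name `features` is an assumption for this example.

```python
from itertools import combinations
from math import comb

features = ['age', 'year_of_operation', 'positive_axillary_nodes']

# Every unordered pair of distinct features corresponds to one off-diagonal plot.
pairs = list(combinations(features, 2))
for x, y in pairs:
    print(f'{x} vs {y}')

print('number of pairs:', len(pairs), '| 3C2 =', comb(len(features), 2))
```

The pair plot grid itself draws each pair twice (once above and once below the diagonal, with the axes swapped), so only 3 of the 6 off-diagonal panels carry distinct information.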


In [13]:

sns.set_style('whitegrid')
sns.pairplot(haberman, hue='survival_status',
             vars=['age', 'year_of_operation', 'positive_axillary_nodes'], size=4)
plt.show()

Conclusion:

We were unable to draw any conclusion from the plots of year of operation vs age and year of
operation vs positive axillary nodes, because the two classes mostly overlap.
From the plot of positive axillary nodes vs age, however, it appears that the chance of survival is
greater when fewer positive axillary nodes are found.

1D scatter


In [14]:

# Patient Age
sns.FacetGrid(haberman,hue="survival_status",size=5)\
.map(sns.distplot,"age")\
.add_legend();
plt.show();

C:\Users\Shiladitya\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6
462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced
by the 'density' kwarg.
warnings.warn("The 'normed' kwarg is deprecated, and has been "

Observation: Patients in the 30-40 age range appear to have the highest survival rate.


In [15]:

# Year of Operation
sns.FacetGrid(haberman,hue="survival_status",size=5)\
.map(sns.distplot,"year_of_operation")\
.add_legend();
plt.show()

C:\Users\Shiladitya\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6
462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced
by the 'density' kwarg.
warnings.warn("The 'normed' kwarg is deprecated, and has been "

Observation:

Operations performed in the years 63-66 had the highest rate of unsuccessful outcomes.
Operations performed in the years 60-62 had the highest rate of successful outcomes.


In [16]:

# Positive axillary nodes


sns.FacetGrid(haberman,hue="survival_status",size=5)\
.map(sns.distplot,"positive_axillary_nodes")\
.add_legend();
plt.show()

C:\Users\Shiladitya\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6
462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced
by the 'density' kwarg.
warnings.warn("The 'normed' kwarg is deprecated, and has been "

Observation: Patients with 0 positive axillary nodes have the highest survival rate.

Conclusion: positive_axillary_nodes is the most informative of the three features, because its distributions
for the two classes overlap the least.

PDF & CDF


In [17]:

counts, bin_edges = np.histogram(haberman_positive['positive_axillary_nodes'], bins=20,
                                 density=True)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)

cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.xlabel('positive_axillary_nodes')
plt.show()

[0.73333333 0.10222222 0.02666667 0.05333333 0.01333333 0.00888889
 0.02222222 0.00444444 0.00888889 0.00888889 0.00444444 0.
 0.00444444 0.00444444 0.         0.         0.         0.
 0.         0.00444444]
[ 0.   2.3  4.6  6.9  9.2 11.5 13.8 16.1 18.4 20.7 23.  25.3 27.6 29.9
 32.2 34.5 36.8 39.1 41.4 43.7 46. ]

Observation: Among patients who survived more than 5 years after the operation, about 84% had fewer than
5 positive axillary nodes and about 92% had fewer than 10.
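
The percentages read from the CDF curve can be checked numerically instead of by eye. The sketch below re-enters the per-bin fractions printed by In [17]; rebuilding the bin edges with `np.linspace` is an assumption that matches the printed edges (20 equal-width bins from 0 to 46).

```python
import numpy as np

# Per-bin fractions copied from the printed output of In [17].
pdf = np.array([0.73333333, 0.10222222, 0.02666667, 0.05333333, 0.01333333,
                0.00888889, 0.02222222, 0.00444444, 0.00888889, 0.00888889,
                0.00444444, 0., 0.00444444, 0.00444444, 0., 0., 0., 0., 0.,
                0.00444444])
bin_edges = np.linspace(0, 46, 21)  # 20 equal-width bins from 0 to 46

cdf = np.cumsum(pdf)

# For all bins but the last, cdf[i] is the fraction of patients whose
# node count lies below bin_edges[i + 1].
for upper_edge, fraction in zip(bin_edges[1:5], cdf[:4]):
    print(f'nodes < {upper_edge:.1f}: {fraction:.1%}')
```

Since node counts are integers, "below 4.6" means at most 4 nodes and "below 9.2" means at most 9 nodes.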


In [18]:

counts, bin_edges = np.histogram(haberman_negative['positive_axillary_nodes'], bins=20,
                                 density=True)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)

cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.xlabel('positive_axillary_nodes')
plt.show()

[0.39506173 0.17283951 0.0617284  0.08641975 0.04938272 0.08641975
 0.01234568 0.03703704 0.0617284  0.01234568 0.         0.
 0.         0.01234568 0.         0.         0.         0.
 0.         0.01234568]
[ 0.   2.6  5.2  7.8 10.4 13.  15.6 18.2 20.8 23.4 26.  28.6 31.2 33.8
 36.4 39.  41.6 44.2 46.8 49.4 52. ]

Observation: Among patients who died within 5 years of the operation, about 85% had 15 or fewer positive
axillary nodes and about 90% had 20 or fewer.

Conclusion: The number of positive axillary nodes is inversely related to the chance of survival.
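
The relationship can be made concrete by turning the two histograms into survival rates. The sketch below uses only the class sizes from Out[5] and the first-bin fractions printed by In [17] and In [18]; the variable names and the split into "0-2 nodes" and "3+ nodes" are assumptions made for this example.

```python
# Class sizes from Out[5] and first-bin fractions from the printed PDFs.
n_survived, n_died = 225, 81
frac_survived_low = 0.73333333  # survivors in the first bin [0, 2.3)
frac_died_low = 0.39506173      # non-survivors in the first bin [0, 2.6)

# Node counts are integers, so both first bins contain exactly the counts 0, 1 and 2.
survived_low = round(frac_survived_low * n_survived)
died_low = round(frac_died_low * n_died)
survived_high = n_survived - survived_low
died_high = n_died - died_low

rate_low = survived_low / (survived_low + died_low)
rate_high = survived_high / (survived_high + died_high)
print(f'0-2 nodes: {survived_low} of {survived_low + died_low} survived ({rate_low:.1%})')
print(f'3+ nodes:  {survived_high} of {survived_high + died_high} survived ({rate_high:.1%})')
```

Under these assumptions the survival rate is roughly 84% for patients with at most two positive nodes and roughly 55% for patients with three or more.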

Box Plot & Whiskers


In [19]:

sns.boxplot(x='survival_status',y='positive_axillary_nodes', data=haberman)
plt.show()
sns.boxplot(x='survival_status',y='year_of_operation', data=haberman)
plt.show()
sns.boxplot(x='survival_status',y='age', data=haberman)
plt.show()


In [20]:

sns.violinplot(x='survival_status',y='positive_axillary_nodes', data=haberman)
plt.show()
sns.violinplot(x='survival_status',y='year_of_operation', data=haberman)
plt.show()
sns.violinplot(x='survival_status',y='age', data=haberman)
plt.show()


Observation: Out of the 3 features, positive_axillary_nodes shows the most distinct distributions
between the two classes. From the above plots we can only conclude that the higher the number of
positive axillary nodes, the higher the chance of death. The age of the patient does not seem to have any
relation with survival status.
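
The difference between the two boxes can be quantified from the quartiles already printed in Out[8] and Out[9]. The sketch below assumes the default box plot rule, in which points above Q3 + 1.5 * IQR are drawn as outliers; the dictionary names are hypothetical.

```python
# Quartiles of positive_axillary_nodes per class, copied from Out[8] and Out[9].
quartiles = {
    'positive': {'q1': 0.0, 'median': 0.0, 'q3': 3.0},
    'negative': {'q1': 1.0, 'median': 4.0, 'q3': 11.0},
}

upper_fences = {}
for status, q in quartiles.items():
    iqr = q['q3'] - q['q1']
    # Points above Q3 + 1.5 * IQR are drawn as outliers under the default rule.
    upper_fences[status] = q['q3'] + 1.5 * iqr
    print(f"{status}: IQR = {iqr}, upper fence = {upper_fences[status]}")
```

The box for the negative class is more than three times as tall as the one for the positive class, and its median of 4 lies above the positive class's 75th percentile of 3.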

Final Conclusion:

From the dataset we can say that the majority of operations were performed on patients between 40 and
70 years of age (scatter plot of year_of_operation vs age).
A large number of operations were performed between 1960 and 1965 (box plot of
year_of_operation vs survival_status).
Patients with 0 positive axillary nodes are more likely to survive, irrespective of their age
(positive_axillary_nodes vs age).
Patients in the 30-40 age range appear to have the highest survival rate.
From the box plot we can say that the more positive axillary nodes a patient has, the higher the chance
that the patient dies within 5 years.
