EDA HabermanDataset
January 6, 2018
(305, 4)
In [4]: # The data set has no column names, so add headers to the columns.
haberman.columns = ["Age","Year","Axillary nodes","Survival status"]
print (haberman.columns)
In [5]: haberman.head()
Out[5]: Age Year Axillary nodes Survival status
0 30 62 3 1
1 30 65 0 1
2 31 59 2 1
3 31 65 4 1
4 33 58 10 1
In [6]: # How many patients survived 5 years or longer, and how many died within 5 years?
haberman["Survival status"].value_counts()
Out[6]: 1 224
2 81
Name: Survival status, dtype: int64
1.1.1 Observation:
1. The data set is imbalanced: 224 patients survived 5 years or longer and 81 patients died
within 5 years.
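The imbalance noted above can also be expressed as class proportions. The sketch below rebuilds only the label column from the counts printed in Out[6] (224 and 81), so it runs without the data file. The names `status` and `proportions` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Rebuild the label column from the counts shown in Out[6]:
# 224 patients with status 1 and 81 patients with status 2.
status = pd.Series([1] * 224 + [2] * 81, name="Survival status")

# normalize=True returns the fraction of rows per class instead of raw counts.
proportions = status.value_counts(normalize=True)
print(proportions.round(3))
```

About 73% of the patients are in class 1 and about 27% in class 2. On the loaded data the same call would be haberman["Survival status"].value_counts(normalize=True).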
In [8]: sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="Survival status", size=6) \
.map(plt.scatter, "Age", "Axillary nodes") \
.add_legend();
plt.show();
1.2.1 Observation:
1. Most of the patients appear to have 0 axillary nodes detected.
1.3.1 Observation:
1. Axillary nodes versus Age is the most useful plot: it at least shows that most people who
survived have 0 axillary nodes detected.
2. The classes cannot be separated easily with the scatter plots above, as most of the points
overlap.
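Because the points overlap, the share of patients with 0 axillary nodes is easier to read as a number than from the plot. The following is a minimal sketch: `zero_node_share` is a hypothetical helper, and `sample` holds only the five rows printed in Out[5], not the full data set, so the printed value says nothing about the real distribution.

```python
import pandas as pd

# First five rows of the data set, copied from Out[5] (not the full data).
sample = pd.DataFrame({
    "Age": [30, 30, 31, 31, 33],
    "Year": [62, 65, 59, 65, 58],
    "Axillary nodes": [3, 0, 2, 4, 10],
    "Survival status": [1, 1, 1, 1, 1],
})

def zero_node_share(df):
    """Fraction of patients with 0 axillary nodes, per survival status."""
    is_zero = df["Axillary nodes"] == 0
    # The mean of a boolean column is the fraction of True values.
    return is_zero.groupby(df["Survival status"]).mean()

print(zero_node_share(sample))
```

On the loaded data the call would be zero_node_share(haberman), which returns one fraction per survival status.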
In [11]: sns.FacetGrid(haberman, hue="Survival status", size=5) \
.map(sns.distplot, "Age") \
.add_legend();
plt.show();
In [12]: sns.FacetGrid(haberman, hue="Survival status", size=5) \
.map(sns.distplot, "Year") \
.add_legend();
plt.show();
1.4.1 Observation:
1. From the PDFs above (univariate analysis), neither Age nor Year is a good feature for useful
insights, as the distributions are similar for patients who survived and patients who died.
2. Axillary nodes is the only feature that helps indicate the survival status of patients, as
the distributions differ between the two classes (labels). From that distribution we can infer
that most patients who survived had zero axillary nodes.
3. In the Year distribution, the density for patients who did not survive drops and then rises
again between 1958 and 1960. Let's check the summary statistics to get more insight.
2 CDF
In [13]: # Split the data set in two according to the label "Survival status":
# alive means status == 1 and dead means status == 2
alive=haberman.loc[haberman["Survival status"]==1]
dead=haberman.loc[haberman["Survival status"]==2]
In [14]: counts, bin_edges = np.histogram(alive['Axillary nodes'], bins=30,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(['PDF for patients who survived more than 5 years',
'CDF for patients who survived more than 5 years'])
plt.show()
In [15]: counts, bin_edges = np.histogram(dead['Axillary nodes'], bins=30,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(['PDF for patients who died within 5 years',
'CDF for patients who died within 5 years'])
plt.show()
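The PDF and CDF computation used in these cells can be wrapped in one function. This is a sketch rather than the notebook's own code: `histogram_pdf_cdf` is a hypothetical helper, and it uses raw counts instead of density=True. For equal-width bins both give the same per-bin probabilities after dividing by the sum. The example input is the Axillary nodes column of the five rows shown in Out[5].

```python
import numpy as np

def histogram_pdf_cdf(values, bins=30):
    """Return bin edges, per-bin probabilities (PDF) and their running sum (CDF)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations in each bin
    cdf = np.cumsum(pdf)          # fraction of observations up to each bin's right edge
    return bin_edges, pdf, cdf

# Axillary nodes of the first five patients (from Out[5]); 5 bins of width 2.
nodes = np.array([3, 0, 2, 4, 10])
bin_edges, pdf, cdf = histogram_pdf_cdf(nodes, bins=5)
print(bin_edges)
print(pdf)
print(cdf)
```

As in the cells above, the curves would be drawn against bin_edges[1:], so each CDF value reads as the fraction of patients with at most that many nodes.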
In [16]: # Also check the summary statistics below to help distinguish
# patients who survived from those who did not
3 Mean, Variance and Std-dev
In [17]: print("Summary Statistics of Patients who are alive for more than 5 years:")
alive.describe()
Summary Statistics of Patients who are alive for more than 5 years:
3.0.1 Observations:
1. From both tables, the statistics are similar for almost all features except Axillary nodes.
2. The mean (average) number of axillary nodes is higher for patients who died within 5 years
than for patients who lived more than 5 years.
3. From the CDFs, no patient who survived had more than 46 axillary nodes, so in this data set
patients with more than 46 axillary nodes detected can be considered to have died within 5 years.
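The per-class comparison of Axillary nodes can also be produced in a single table instead of two separate describe() calls. The sketch below assumes the column names set in In [4]. `sample` contains only the five rows printed in Out[5], all of which have status 1, so it illustrates the call and not the full-data result.

```python
import pandas as pd

# First five rows of the data set, copied from Out[5] (not the full data).
sample = pd.DataFrame({
    "Axillary nodes": [3, 0, 2, 4, 10],
    "Survival status": [1, 1, 1, 1, 1],
})

# One row per survival status, with the mean and the maximum node count.
summary = sample.groupby("Survival status")["Axillary nodes"].agg(["mean", "max"])
print(summary)
```

On the loaded data, replacing sample with haberman would give one row for status 1 and one for status 2, so the means and the largest node counts can be compared side by side.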
In [20]: sns.boxplot(x='Survival status',y='Age', data=haberman)
plt.show()
In [21]: sns.boxplot(x='Survival status',y='Year', data=haberman)
plt.show()
In [23]: sns.violinplot(x='Survival status',y='Axillary nodes', data=haberman)
plt.show()
In [24]: sns.violinplot(x='Survival status',y='Age', data=haberman)
plt.show()
4.1.1 Observation:
1. From the box and violin plots, the bulk of the patients who died were aged 46-62 and were
operated on between 1959 and 1965, while the bulk of the patients who survived were aged 42-60
and were operated on between 1960 and 1966.
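The box in a box plot spans the first to the third quartile, so ranges like the ones above can be computed directly instead of being read off the figure. This is a minimal sketch assuming the column names set in In [4]. `sample` holds only the five rows printed in Out[5], so the numbers it prints are not the full-data ranges.

```python
import pandas as pd

# First five rows of the data set, copied from Out[5] (not the full data).
sample = pd.DataFrame({
    "Age": [30, 30, 31, 31, 33],
    "Survival status": [1, 1, 1, 1, 1],
})

# 25th and 75th percentiles of Age per survival status (the edges of the box).
quartiles = (
    sample.groupby("Survival status")["Age"]
    .quantile([0.25, 0.75])
    .unstack()  # one row per status, one column per quantile
)
print(quartiles)
```

Running the same expression on haberman, for "Age" and then for "Year", would give the box edges for both classes.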
In [25]: # contour plot
sns.jointplot(x="Age", y="Year", data=haberman, kind="kde");
plt.show();