EDA HabermanDataset

The document provides an exploratory data analysis of the Haberman's breast cancer dataset using Python. Key observations from the analysis include: (1) the data is imbalanced with more patients surviving than dying; (2) axillary nodes is the most useful feature for distinguishing survival status as patients who survived generally had zero nodes detected; (3) patients with more than 46 axillary nodes detected can be considered at higher risk of dying within 5 years based on the CDFs and summary statistics.

Uploaded by

gopisai


EDA-HabermanDataset

January 6, 2018

1 Exploratory Data Analysis on the Haberman Dataset

1.1 Dataset Information:
Number of Instances: 306
Number of Attributes: 4 (including the class attribute)
Attribute Information:
Age of patient at time of operation (numerical)
Patient's year of operation (year - 1900, numerical)
Number of positive axillary nodes detected (numerical)
Survival status (class attribute):
1 = the patient survived 5 years or longer
2 = the patient died within 5 years

In [1]: import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [2]: # download the data set from
# https://www.kaggle.com/gilsousa/habermans-survival-data-set/data
# load the data set
haberman=pd.read_csv("haberman.csv")

In [3]: # data-points and features

print (haberman.shape)

(305, 4)

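In [4]: # The file has no header row, so add column names.
# Note: read_csv above used the first record as the header, which is why the
# shape is (305, 4) rather than the 306 instances listed in the dataset information.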
haberman.columns = ["Age","Year","Axillary nodes","Survival status"]
print (haberman.columns)

Index(['Age', 'Year', 'Axillary nodes', 'Survival status'], dtype='object')
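
The difference between 306 listed instances and 305 loaded rows can be reproduced on a tiny example. The sketch below is only an illustration: the inline CSV is a made-up stand-in for haberman.csv, and the names sample_csv, lost_row and all_rows are hypothetical. It compares the default read_csv call with header=None.

```python
import io
import pandas as pd

# Illustrative stand-in for a header-less CSV (not the real haberman.csv).
sample_csv = "30,64,1,1\n30,62,3,1\n30,65,0,1\n31,59,2,1\n"
columns = ["Age", "Year", "Axillary nodes", "Survival status"]

# Default behaviour: the first record is consumed as the header row.
lost_row = pd.read_csv(io.StringIO(sample_csv))
print(lost_row.shape)

# header=None keeps every record; names= supplies the column labels.
all_rows = pd.read_csv(io.StringIO(sample_csv), header=None, names=columns)
print(all_rows.shape)
```

With header=None and names= the column labels are set at load time, so the separate assignment to haberman.columns would not be needed.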

In [5]: haberman.head()

Out[5]: Age Year Axillary nodes Survival status
0 30 62 3 1
1 30 65 0 1
2 31 59 2 1
3 31 65 4 1
4 33 58 10 1

In [6]: # how many patients survived 5 years or longer, and how many died within 5 years
haberman["Survival status"].value_counts()

Out[6]: 1 224
2 81
Name: Survival status, dtype: int64

1.1.1 Observation:
1. The data set is imbalanced.

2. 224 patients survived 5 years or longer, while only 81 patients died within 5 years.
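
To put a number on the imbalance, the class counts can be turned into percentages and a majority-to-minority ratio. This is a minimal sketch that copies the two counts from the value_counts() output above; class_counts, class_share and imbalance_ratio are names chosen here for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
class_counts = pd.Series({1: 224, 2: 81}, name="Survival status")

# Share of each class, as a percentage of all 305 loaded rows.
class_share = (class_counts / class_counts.sum() * 100).round(1)
print(class_share)

# Ratio of the majority class to the minority class.
imbalance_ratio = class_counts.max() / class_counts.min()
print(round(imbalance_ratio, 2))
```

On the real DataFrame, haberman["Survival status"].value_counts(normalize=True) should give the same shares directly.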

1.2 2-D ScatterPlot

In [7]: # let's draw a plain scatter plot of age against axillary nodes
haberman.plot(kind='scatter', x='Age', y='Axillary nodes')
plt.show()

In [8]: sns.set_style("whitegrid");
# height is the current name of the former size argument of FacetGrid
sns.FacetGrid(haberman, hue="Survival status", height=6) \
.map(plt.scatter, "Age", "Axillary nodes") \
.add_legend();
plt.show();

1.2.1 Observation:
1. It seems that most of the patients have 0 axillary nodes detected.

1.3 Pair Plot

In [9]: plt.close();
sns.set_style("whitegrid");
sns.pairplot(haberman, hue="Survival status",
vars=['Age','Year','Axillary nodes'], height=3)
plt.show()
# The diagonal elements are PDFs for each feature.

1.3.1 Observation:
1. Axillary nodes versus Age is the most useful plot: it at least shows that most patients who
survived had 0 axillary nodes detected.

2. The classes cannot be separated easily with the scatter plots above, because most of the
points overlap.

1.4 Histogram, PDF

In [10]: sns.FacetGrid(haberman, hue="Survival status", height=5) \
.map(sns.distplot, "Axillary nodes") \
.add_legend();
plt.show();

In [11]: sns.FacetGrid(haberman, hue="Survival status", height=5) \
.map(sns.distplot, "Age") \
.add_legend();
plt.show();

In [12]: sns.FacetGrid(haberman, hue="Survival status", height=5) \
.map(sns.distplot, "Year") \
.add_legend();
plt.show();

1.4.1 Observation:
1. From the PDFs above (univariate analysis), neither Age nor Year is a useful feature, because
the distributions are similar for patients who survived and patients who died.

2. Axillary nodes is the only feature that helps to indicate survival status, because its
distribution differs between the two classes (labels). From that distribution we can infer that
most patients who survived had zero axillary nodes.

3. In the Year distribution, the density for patients who did not survive dips and then rises
between 1958 and 1960. Let's check the summary statistics to get more insight.

2 CDF
In [13]: #divide the data set in two according to the label Survival status
# alive means status=1 and dead means status =2
alive=haberman.loc[haberman["Survival status"]==1]
dead=haberman.loc[haberman["Survival status"]==2]

In [14]: counts, bin_edges = np.histogram(alive['Axillary nodes'], bins=30,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(['Pdf for the patients who survive more than 5 years',
'Cdf for the patients who survive more than 5 years'])
plt.show()

[ 0.66517857 0.125 0.04464286 0.02678571 0.02232143 0.03125
0.00892857 0.00892857 0.00446429 0.01785714 0.00446429 0.00446429
0.00446429 0.00446429 0.00892857 0. 0.00446429 0.
0.00446429 0.00446429 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.00446429]
[ 0. 1.53333333 3.06666667 4.6 6.13333333
7.66666667 9.2 10.73333333 12.26666667 13.8 15.33333333
16.86666667 18.4 19.93333333 21.46666667 23. 24.53333333
26.06666667 27.6 29.13333333 30.66666667 32.2 33.73333333
35.26666667 36.8 38.33333333 39.86666667 41.4 42.93333333
44.46666667 46. ]

In [15]: counts, bin_edges = np.histogram(dead['Axillary nodes'], bins=30, density=True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(['Pdf for the patients who died within 5 years',
'Cdf for the patients who died within 5 years'])
plt.show()

[ 0.33333333 0.14814815 0.08641975 0.03703704 0.04938272 0.0617284
0.04938272 0.04938272 0.03703704 0.01234568 0.02469136 0.01234568
0.02469136 0.04938272 0. 0. 0. 0. 0.
0. 0.01234568 0. 0. 0. 0. 0.
0. 0. 0. 0.01234568]
[ 0. 1.73333333 3.46666667 5.2 6.93333333
8.66666667 10.4 12.13333333 13.86666667 15.6 17.33333333
19.06666667 20.8 22.53333333 24.26666667 26. 27.73333333
29.46666667 31.2 32.93333333 34.66666667 36.4 38.13333333
39.86666667 41.6 43.33333333 45.06666667 46.8 48.53333333
50.26666667 52. ]
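
The two cells above read the CDF from 30 histogram bins. As a complement, the sketch below computes the CDF at a chosen threshold directly from the raw values, and then repeats the binned construction on the same array. The array example_nodes is made up for illustration and is not the Haberman data; empirical_cdf is a helper name introduced here.

```python
import numpy as np

def empirical_cdf(values, threshold):
    """Fraction of observations less than or equal to threshold."""
    values = np.asarray(values)
    return np.mean(values <= threshold)

# Made-up node counts for illustration only (not the Haberman data).
example_nodes = np.array([0, 0, 0, 1, 2, 4, 7, 13, 22, 46])

share_zero = empirical_cdf(example_nodes, 0)      # share with no positive nodes
share_up_to_10 = empirical_cdf(example_nodes, 10)  # share with at most 10 nodes
print(share_zero, share_up_to_10)

# Binned version, as in the cells above: counts normalised to sum to 1,
# then accumulated. The last CDF value is therefore 1.
counts, bin_edges = np.histogram(example_nodes, bins=5)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
print(cdf)
```

Because the counts are divided by their sum, passing density=True to np.histogram does not change the resulting pdf when the bins have equal width, so it is left out here.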

In [16]: # the summary statistics below also help to distinguish
# patients who survived from patients who did not

3 Mean, Variance and Std-dev
In [17]: print("Summary Statistics of Patients who are alive for more than 5 years:")
alive.describe()

Summary Statistics of Patients who are alive for more than 5 years:

Out[17]: Age Year Axillary nodes Survival status

count 224.000000 224.000000 224.000000 224.0
mean 52.116071 62.857143 2.799107 1.0
std 10.937446 3.229231 5.882237 0.0
min 30.000000 58.000000 0.000000 1.0
25% 43.000000 60.000000 0.000000 1.0
50% 52.000000 63.000000 0.000000 1.0
75% 60.000000 66.000000 3.000000 1.0
max 77.000000 69.000000 46.000000 1.0

In [18]: print("Summary Statistics of Patients who are dead within 5 years:")

dead.describe()

Summary Statistics of Patients who are dead within 5 years:

Out[18]: Age Year Axillary nodes Survival status

count 81.000000 81.000000 81.000000 81.0
mean 53.679012 62.827160 7.456790 2.0
std 10.167137 3.342118 9.185654 0.0
min 34.000000 58.000000 0.000000 2.0
25% 46.000000 59.000000 1.000000 2.0
50% 53.000000 63.000000 4.000000 2.0
75% 61.000000 65.000000 11.000000 2.0
max 83.000000 69.000000 52.000000 2.0

3.0.1 Observations:
1. Both tables show that the statistics are similar for almost all features, except for Axillary
nodes.

2. The mean number of axillary nodes is higher for patients who died within 5 years (7.46) than
for patients who survived 5 years or longer (2.80).

3. From the CDFs and the maxima above, no patient in this data set with more than 46 axillary
nodes survived 5 years. Only one patient (52 nodes) exceeds that count, so this is weak evidence
rather than a reliable threshold.
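
The per-class comparison can also be produced in a single table with groupby instead of two separate describe() calls. The sketch below uses a small made-up DataFrame (toy) rather than the Haberman data, so the numbers it prints are not results of this analysis.

```python
import pandas as pd

# Made-up rows for illustration only (not the Haberman data).
toy = pd.DataFrame({
    "Axillary nodes": [0, 0, 1, 3, 2, 9, 14, 23],
    "Survival status": [1, 1, 1, 1, 2, 2, 2, 2],
})

# One row per class, one column per statistic.
per_class = toy.groupby("Survival status")["Axillary nodes"].agg(
    ["count", "mean", "median", "max"]
)
print(per_class)
```

Applied to the notebook's own DataFrame, the same call would be haberman.groupby("Survival status")["Axillary nodes"].agg(...). The median is included because node counts are strongly skewed, which makes the mean sensitive to a few large values.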

4 Box plot and Whiskers

In [19]: sns.boxplot(x='Survival status',y='Axillary nodes', data=haberman)
plt.show()

In [20]: sns.boxplot(x='Survival status',y='Age', data=haberman)
plt.show()

In [21]: sns.boxplot(x='Survival status',y='Year', data=haberman)
plt.show()

4.1 Violin plots

In [22]: # Denser regions of the data are fatter, and sparser ones thinner
# in a violin plot
sns.violinplot(x='Survival status',y='Year', data=haberman)
plt.show()

In [23]: sns.violinplot(x='Survival status',y='Axillary nodes', data=haberman)
plt.show()

In [24]: sns.violinplot(x='Survival status',y='Age', data=haberman)
plt.show()

4.1.1 Observation:
1. From the box and violin plots, most patients who died were aged 46-62 and were operated on
in years 59-65, while most patients who survived were aged 42-60 and were operated on in
years 60-66.
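
The box plots can be read more precisely by working out where the upper whisker limit falls. The sketch below assumes the common Q3 + 1.5 * IQR convention (1.5 is the default whis value of sns.boxplot) and copies the Axillary nodes quartiles from the describe() tables above; upper_fence and fences are names introduced here.

```python
# Quartiles (25%, 75%) of "Axillary nodes" copied from the describe() tables above.
quartiles = {
    "survived (status 1)": (0.0, 3.0),
    "died (status 2)": (1.0, 11.0),
}

def upper_fence(q1, q3, k=1.5):
    """Upper whisker limit under the Q3 + k * IQR convention."""
    return q3 + k * (q3 - q1)

fences = {label: upper_fence(q1, q3) for label, (q1, q3) in quartiles.items()}
for label, fence in fences.items():
    print(label, fence)
```

The drawn whisker ends at the largest observation that does not exceed the fence, and points beyond it are plotted individually as outliers.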

In [25]: # contour plot
sns.jointplot(x="Age", y="Year", data=haberman, kind="kde");
plt.show();
