AIML Expt

1. The document discusses exploratory data analysis (EDA) of a breast cancer dataset containing 305 patient cases from 1958-1970. EDA involves visually and statistically examining the data from different perspectives without assumptions.
2. The dataset has 4 attributes: patient age, year of operation, number of positive lymph nodes, and survival status. EDA includes checking data types and distributions, identifying outliers, handling imbalances, and determining relationships between variables.
3. Univariate analysis of each variable is done through distribution, box, and violin plots to understand their relationships to the survival class label. Bivariate analysis using pair plots, joint plots, and heatmaps examines correlations between attribute pairs.

Uploaded by

D Slm


Subject- AIML

Practical No. 2
Aim- To acquire, visualize, and analyze the dataset

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the process of describing data by means
of statistical and visualization techniques in order to bring important aspects of
that data into focus for further analysis. This involves inspecting the dataset
from many angles, and describing and summarizing it without making any
assumptions about its contents.

Exploratory data analysis is a significant step to take before diving into statistical
modeling or machine learning, to ensure the data is really what it is claimed to
be and that there are no obvious errors. It should be part of data science
projects in every organization.

Why is Exploratory Data Analysis important?

Like everything else in this world, data has its imperfections. Raw data is often
skewed and may contain outliers or too many missing values. A model built on such
data performs sub-optimally. In a hurry to get to the machine learning
stage, some data professionals either skip the exploratory data analysis
process entirely or do a mediocre job of it. This is a mistake with many implications,
including generating inaccurate models, generating accurate models on
the wrong data, not creating the right types of variables in data preparation, and
using resources inefficiently.

Data set description

The dataset contains cases from the research carried out between the years
1958 and 1970 at the University of Chicago’s Billings Hospital on the survival
of patients who had undergone surgery for breast cancer.

Attribute information:

1. Patient’s age at the time of operation (numerical).
2. Year of operation (year - 1900, numerical).
3. Number of positive axillary nodes detected (numerical).
4. Survival status (class attribute):
   1: the patient survived 5 years or longer post-operation.
   2: the patient died within 5 years post-operation.

Attributes 1, 2, and 3 form our features (independent variables), while attribute
4 is our class label (dependent variable).

1. Importing libraries and loading data

Import all necessary packages —

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

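Load the dataset into a pandas dataframe:

# header=0 treats the first line of the file as a header row;
# the column names are then replaced with descriptive ones.
df = pd.read_csv('haberman.csv', header=0)
df.columns = ['patient_age', 'operation_year',
              'positive_axillary_nodes', 'survival_status']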
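
One thing worth checking at this step is whether the file actually has a header row. The sketch below uses a hypothetical three-line sample (made-up values in the dataset's four-column layout, read from an in-memory string rather than from haberman.csv) to show how `header=0` and `header=None` differ. Whether the real file has a header row is an assumption to verify, since the source does not say.

```python
import io
import pandas as pd

# Hypothetical sample in the layout: age, year, nodes, status.
# These values are illustrative only.
SAMPLE = "30,64,1,1\n30,62,3,1\n34,59,0,2\n"

COLUMNS = ['patient_age', 'operation_year',
           'positive_axillary_nodes', 'survival_status']

# header=0 consumes the first line as column names, so one record
# is lost if the file has no real header row.
with_header = pd.read_csv(io.StringIO(SAMPLE), header=0)
with_header.columns = COLUMNS

# header=None keeps every line as data; names= supplies the labels.
no_header = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)

print(with_header.shape)  # (2, 4)
print(no_header.shape)    # (3, 4)
```

If the row count looks one short of what the data source documents, this is a likely cause.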

2. Understanding data

Shape of the dataframe —

df.shape

Output:
(305, 4)

There are 305 rows and 4 columns. But how many data points are present for each
class label in our dataset?

df['survival_status'].value_counts()

• The dataset is imbalanced, as expected.

• Out of a total of 305 patients, the number who survived 5 years or longer
post-operation is nearly 3 times the number who died within 5 years.
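
The imbalance can also be expressed as proportions and as a majority-to-minority ratio. This is a small sketch on a hypothetical label column (eight made-up values, not the real data), using `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical labels: 1 = survived 5+ years, 2 = died within 5 years.
labels = pd.Series([1, 1, 1, 2, 1, 2, 1, 1], name='survival_status')

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # majority / minority

print(counts.to_dict())
print(shares.round(2).to_dict())
print(imbalance_ratio)
```

On the real dataframe the same three lines would be applied to `df['survival_status']`.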

df.info()

• All the columns are of integer type.

• There are no missing values in the dataset.
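
The same two checks can be made explicitly rather than read off the `info()` listing. The sketch below uses a hypothetical three-row frame with one deliberately missing value, to show what a failing check looks like; the frame and the name `df_check` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows; the None simulates a missing node count.
df_check = pd.DataFrame({
    'patient_age': [30, 34, 41],
    'positive_axillary_nodes': [1, None, 0],
})

# Number of missing values in each column.
missing_per_column = df_check.isnull().sum()

# Columns that pandas stores with an integer dtype.
integer_columns = [c for c in df_check.columns
                   if pd.api.types.is_integer_dtype(df_check[c])]

print(missing_per_column.to_dict())
print(integer_columns)
```

A column containing a missing value is stored as float here, so a column that is unexpectedly non-integer can itself be a hint that values are missing.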

2.1 Data preparation

Before we move on to statistical analysis and visualization, note that the original
class labels, 1 (survived 5 years or longer) and 2 (died within 5 years), are
not self-explanatory.

So, we map the values 1 and 2 in the column survival_status to the categorical
values ‘yes’ and ‘no’ respectively, such that:
survival_status = 1 → survival_status = ‘yes’
survival_status = 2 → survival_status = ‘no’

df['survival_status'] = df['survival_status'].map({1:"yes", 2:"no"})
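
Note that `Series.map` with a dictionary turns any value missing from the dictionary into NaN, so it is worth confirming that nothing was silently dropped. A minimal sketch on a hypothetical series (the value 3 is a deliberately invalid label added for illustration):

```python
import pandas as pd

# Hypothetical labels; 3 is not a valid class and is included on purpose.
status = pd.Series([1, 2, 1, 3], name='survival_status')

mapped = status.map({1: "yes", 2: "no"})

# Original values that had no entry in the mapping dictionary.
unmapped = status[mapped.isna()]

print(mapped.tolist())
print(unmapped.tolist())
```

An empty `unmapped` result on the real column would confirm that only the labels 1 and 2 occur.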

2.2 General statistical analysis

df.describe()

• On average, patients were operated on at an age of about 52; the mean operation
year is about 63, i.e. 1963.
• The average number of positive axillary nodes detected is about 4.
• As indicated by the 50th percentile, the median number of positive axillary nodes is 1.
The mean being well above the median indicates a right-skewed distribution.
• As indicated by the 75th percentile, 75% of the patients have 4 or fewer nodes
detected.
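
To make the reading of the percentiles concrete, here is a small sketch on a hypothetical node-count series. The nine values are made up, chosen only so that the mean is 4, the median is 1, and the 75th percentile is 4, mirroring the figures quoted above.

```python
import pandas as pd

# Hypothetical node counts, right-skewed like the figures quoted above.
nodes = pd.Series([0, 0, 0, 1, 1, 2, 4, 8, 20],
                  name='positive_axillary_nodes')

summary = nodes.describe()

print(summary['mean'])  # 4.0
print(summary['50%'])   # 1.0
print(summary['75%'])   # 4.0

# The 75th percentile bounds "4 or fewer", not "fewer than 4".
share_at_most_4 = (nodes <= 4).mean()
share_below_4 = (nodes < 4).mean()
print(round(share_at_most_4, 3), round(share_below_4, 3))
```

A few large values pull the mean up while the median stays low, which is why the two are reported together.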

3. Uni-variate data analysis

3.1 Distribution Plots

Uni-variate analysis, as the name suggests, is an analysis carried out by
considering one variable at a time. Let’s say our aim is to correctly
determine the survival status given the features: patient’s age, operation year,
and positive axillary nodes count. Which of these 3 variables is most useful
for distinguishing between the class labels ‘yes’ and
‘no’? To answer this, we plot distribution plots (estimates of the probability
density function, or PDF) with each feature in turn on the X-axis. The
values on the Y-axis in each case represent the normalized density.

3.2 Box plots and Violin plots

A box plot, also known as a box and whisker plot, displays a summary of the data in
five numbers: minimum, lower quartile (25th percentile), median (50th
percentile), upper quartile (75th percentile), and maximum data values.

A violin plot displays the same information as the box and whisker plot;
additionally, it shows a density-smoothed plot of the underlying
distribution.
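
The sketch below draws both plot types for one feature and also computes the five-number summary per class, so the picture can be checked against numbers. It assumes seaborn's `boxplot` and `violinplot` and uses a synthetic, hypothetical dataframe `demo` rather than the real data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the prepared dataframe (hypothetical values).
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    'positive_axillary_nodes': rng.poisson(4, n),
    'survival_status': rng.choice(['yes', 'no'], n, p=[0.73, 0.27]),
})

# Box plot and violin plot of the same feature, split by class label.
fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=demo, x='survival_status',
            y='positive_axillary_nodes', ax=ax_box)
sns.violinplot(data=demo, x='survival_status',
               y='positive_axillary_nodes', ax=ax_violin)
fig.tight_layout()

# Five-number summary per class: min, Q1, median, Q3, max.
five_numbers = (demo.groupby('survival_status')['positive_axillary_nodes']
                    .describe()[['min', '25%', '50%', '75%', 'max']])
print(five_numbers)
```

By default seaborn extends the whiskers to at most 1.5 times the interquartile range and draws points beyond that individually, so the whisker ends can differ from the true minimum and maximum in the table.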

4. Bi-variate data analysis

4.1 Pair plot

Next, we plot a pair plot to visualize the relationships between the features
in a pairwise manner. A pair plot shows both the distribution of each
single variable and the relationship between each pair of variables.

4.2 Joint plot

While the pair plot gives a visual overview of all pairwise relationships, the
joint plot shows a single pair of variables as a bivariate plot together with
their univariate marginal distributions.

4.3 Heatmap

Heatmaps are used to observe the correlations among the feature variables.
This is particularly important when we are trying to obtain feature
importance in regression analysis. Although correlated features do not
necessarily hurt the predictive performance of a statistical model, they can
distort post-modeling analysis such as the interpretation of coefficients.
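
A minimal correlation heatmap sketch, assuming pandas' `DataFrame.corr` (Pearson correlation by default) and seaborn's `heatmap`. The dataframe `demo` is synthetic and hypothetical; its features are generated independently, so the off-diagonal values will be close to zero and do not reflect the real data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the prepared dataframe (hypothetical values).
rng = np.random.default_rng(4)
n = 200
demo = pd.DataFrame({
    'patient_age': rng.integers(30, 84, n),
    'operation_year': rng.integers(58, 70, n),
    'positive_axillary_nodes': rng.poisson(4, n),
})

features = ['patient_age', 'operation_year', 'positive_axillary_nodes']

# Pairwise Pearson correlation between the numeric features.
corr = demo[features].corr()

# Fix the color scale to [-1, 1] so colors are comparable across datasets.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
```

The matrix is symmetric with ones on the diagonal, so only the cells on one side of the diagonal carry information.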

Conclusion- We learned some common steps involved in exploratory data
analysis. We also saw several types of charts and plots and what information
each of them conveys.
