AIML Expt

1. The document discusses exploratory data analysis (EDA) of a breast cancer dataset containing 305 patient cases from 1958-1970. EDA involves visually and statistically examining the data from different perspectives without assumptions. 2. The dataset has 4 attributes - patient age, year of operation, number of positive lymph nodes, and survival status. EDA includes checking data types and distributions, identifying outliers, handling imbalances, and determining relationships between variables. 3. Univariate analysis of each variable is done through distribution, box, and violin plots to understand their relationships to the survival class label. Bivariate analysis using pair plots, joint plots, and heatmaps examines correlations between attribute pairs.

Uploaded by

D Slm

Subject- AIML

Practical No. 2
Aim- To acquire, visualize, and analyze the dataset

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is a process of describing the data by means
of statistical and visualization techniques in order to bring important aspects of
that data into focus for further analysis. This involves inspecting the dataset
from many angles, describing & summarizing it without making any
assumptions about its contents.

Exploratory data analysis is a significant step to take before diving into statistical
modeling or machine learning, to ensure the data is really what it is claimed to
be and that there are no obvious errors. It should be part of data science
projects in every organization.

Why is Exploratory Data Analysis important?

Just like everything in this world, data has its imperfections. Raw data is usually
skewed and may have outliers or too many missing values. A model built on such
data performs sub-optimally. In a hurry to get to the machine learning
stage, some data professionals either skip the exploratory data analysis
process entirely or do a very mediocre job. This is a mistake with many implications,
including generating inaccurate models, generating accurate models on
the wrong data, not creating the right types of variables in data preparation, and
using resources inefficiently.

Data set description

The dataset contains cases from the research carried out between the years
1958 and 1970 at the University of Chicago’s Billings Hospital on the survival
of patients who had undergone surgery for breast cancer.

Attribute information :

1. Patient’s age at the time of operation (numerical).
2. Year of operation (year - 1900, numerical).
3. Number of positive axillary nodes detected (numerical).
4. Survival status (class attribute):
   1: the patient survived 5 years or longer post-operation.
   2: the patient died within 5 years post-operation.

Attributes 1, 2, and 3 form our features (independent variables), while attribute
4 is our class label (dependent variable).

1. Importing libraries and loading data

Import all necessary packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

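Load the dataset into a pandas DataFrame:

# header=0 treats the first line of the file as a header row;
# the names read from it are then replaced with descriptive ones.
df = pd.read_csv('haberman.csv', header=0)
df.columns = ['patient_age', 'operation_year',
              'positive_axillary_nodes', 'survival_status']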
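Whether a given copy of haberman.csv starts with a header row is worth checking before loading, because header=0 always consumes the first line as column names. The sketch below illustrates the difference on a made-up three-record sample (not real patient data) held in memory; the variable names raw, cols, df_header0 and df_no_header are only illustrative.

```python
import io
import pandas as pd

# Made-up sample in the same 4-column layout; NOT real patient data.
raw = "30,64,1,1\n34,60,0,2\n41,58,3,1\n"
cols = ['patient_age', 'operation_year',
        'positive_axillary_nodes', 'survival_status']

# header=0: the first line is consumed as column names, so if the file
# has no header row, one record is silently lost.
df_header0 = pd.read_csv(io.StringIO(raw), header=0)
df_header0.columns = cols

# header=None with names=: every line is kept as data.
df_no_header = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(df_header0.shape)    # (2, 4)
print(df_no_header.shape)  # (3, 4)
```

If the file does contain a header row, header=0 is the right choice; if it does not, header=None with names= keeps the first record.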

2. Understanding data

Shape of the dataframe:

df.shape

Output:
(305, 4)

There are 305 rows and 4 columns. But how many data points for each class
label are present in our dataset?

df['survival_status'].value_counts()

Output:

• The dataset is imbalanced as expected.
• Out of a total of 305 patients, the number of patients who survived over 5 years
post-operation is nearly 3 times the number of patients who died within 5 years.
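The degree of imbalance can be quantified rather than read off by eye. This is a minimal sketch on a small hypothetical label column (the values are made up, not taken from the dataset), showing absolute counts, class shares, and the majority-to-minority ratio.

```python
import pandas as pd

# Hypothetical labels for illustration only: 1 = survived, 2 = died.
status = pd.Series([1, 1, 1, 2, 1, 2, 1, 1], name='survival_status')

counts = status.value_counts()                 # absolute count per class
shares = status.value_counts(normalize=True)   # fraction of rows per class
ratio = counts.max() / counts.min()            # majority-to-minority ratio

print(counts)
print(shares)
print(ratio)
```

The same three lines can be applied to df['survival_status'] to check the "nearly 3 times" observation.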

df.info()

Output:

• All the columns are of integer type.
• There are no missing values in the dataset.
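Both observations can also be checked programmatically instead of reading the df.info() printout. The sketch below uses a tiny made-up frame with one deliberately missing value, so that the check has something to find; on the real dataframe the missing counts would be expected to be zero.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration; one node count is deliberately missing.
toy = pd.DataFrame({
    'patient_age': [30, 34, 41, 52],
    'positive_axillary_nodes': [1, np.nan, 3, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Columns whose dtype is an integer type. A column containing NaN is
# stored as float, so it drops out of this list.
integer_columns = [c for c in toy.columns
                   if pd.api.types.is_integer_dtype(toy[c])]

print(missing_per_column)
print(integer_columns)
```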

2.1 Data preparation

Before we go for statistical analysis and visualization, note that the original
class labels, 1 (survived 5 years and above) and 2 (died within 5 years), are
bare numeric codes that do not convey their meaning.

So, we map survival status values 1 and 2 in the column survival_status to
the categorical values ‘yes’ and ‘no’ respectively, such that
survival_status = 1 → survival_status = ‘yes’
survival_status = 2 → survival_status = ‘no’

df['survival_status'] = df['survival_status'].map({1:"yes", 2:"no"})
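One property of Series.map with a dictionary is that any value not present in the dictionary becomes NaN. A quick check after mapping, sketched here on a hypothetical column that contains an unexpected code 3, guards against silently losing labels.

```python
import pandas as pd

# Hypothetical column with one unexpected code (3) for illustration.
status = pd.Series([1, 2, 1, 3])

mapped = status.map({1: "yes", 2: "no"})

# Values missing from the mapping dictionary come back as NaN.
unmapped = int(mapped.isnull().sum())

print(mapped.tolist())
print(unmapped)
```

On the real column, a non-zero count here would indicate codes other than 1 and 2.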

2.2 General statistical analysis

df.describe()

Output:

• On average, patients were operated on at about age 52, and the mean operation year is about 63 (i.e. 1963).
• The average number of positive axillary nodes detected is about 4.
• As indicated by the 50th percentile, the median number of positive axillary nodes is 1.
• As indicated by the 75th percentile, 75% of the patients have 4 or fewer nodes
detected.
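A mean that sits well above the median, as with the node counts, points to a right-skewed column. The following sketch reproduces that pattern on a small made-up frame and also summarizes the node count per class, which is an extra step not shown in the original; all values are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    'positive_axillary_nodes': [0, 1, 2, 1, 4, 10],
    'survival_status': ['yes', 'yes', 'yes', 'yes', 'no', 'no'],
})

summary = toy.describe()   # count, mean, std, min, quartiles, max

# Mean and median of the node count within each class.
by_class = (toy.groupby('survival_status')['positive_axillary_nodes']
               .agg(['mean', 'median']))

print(summary)
print(by_class)
```

In this toy frame the overall mean (3.0) is twice the median (1.5) because of a single large value.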

3. Uni-variate data analysis

3.1 Distribution Plots

Uni-variate analysis, as the name suggests, is an analysis carried out by
considering one variable at a time. Let’s say our aim is to correctly
determine the survival status given the features: patient’s age, operation year,
and positive axillary nodes count. Which of these 3 variables is most useful
for distinguishing between the class labels ‘yes’ and
‘no’? To answer this, we’ll draw distribution plots (also called probability
density function or PDF plots) with each feature as the variable on the X-axis. The
values on the Y-axis in each case represent the normalized density.

3.2 Box plots and Violin plots

A box plot, also known as a box and whisker plot, displays a summary of the data in
five numbers: minimum, lower quartile (25th percentile), median (50th
percentile), upper quartile (75th percentile), and maximum.

A violin plot displays the same information as the box and whisker plot;
additionally, it shows a density-smoothed plot of the underlying
distribution.
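The five-number summary and the two plot types can be sketched as follows. The data is again synthetic (generated at random, not the real dataset), and plotting the node count against the class label is only one possible choice of variables.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in data (NOT the Haberman dataset).
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    'positive_axillary_nodes': rng.poisson(4, size=n),
    'survival_status': rng.choice(['yes', 'no'], size=n, p=[0.75, 0.25]),
})

# Five-number summary: min, lower quartile, median, upper quartile, max.
five_numbers = toy['positive_axillary_nodes'].quantile([0, 0.25, 0.5, 0.75, 1])
print(five_numbers)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=toy, x='survival_status',
            y='positive_axillary_nodes', ax=axes[0])
sns.violinplot(data=toy, x='survival_status',
               y='positive_axillary_nodes', ax=axes[1])
axes[0].set_title('Box plot')
axes[1].set_title('Violin plot')
fig.tight_layout()
```

Note that seaborn's box plot, by default, draws points beyond 1.5 times the interquartile range as individual outliers, so the whisker ends need not coincide with the minimum and maximum.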

4. Bi-variate data analysis

4.1 Pair plot

Next, we shall plot a pair plot to visualize the relationship between the features
in a pairwise manner. A pair plot shows both the distributions of
single variables and the relationships between pairs of variables.

4.2 Joint plot

While the pair plot provides visual insight into all pairwise relationships at once, a
joint plot shows a single bivariate plot together with the univariate marginal
distributions of its two variables.

4.3 Heatmap

Heatmaps are used to observe the correlations among the feature variables.
This is particularly important when we are trying to obtain feature
importance in regression analysis. Although correlated features do not necessarily hurt
the predictive performance of a statistical model, they can distort the post-modeling
analysis, for example by making individual coefficients hard to interpret.
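The correlation heatmap can be sketched as below, assuming Pearson correlation (the pandas default) and seaborn's heatmap. The data is synthetic, so the coefficients shown are meaningless apart from the diagonal, which is always 1.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in data (NOT the Haberman dataset).
rng = np.random.default_rng(4)
n = 200
toy = pd.DataFrame({
    'patient_age': rng.integers(30, 84, size=n),
    'operation_year': rng.integers(58, 70, size=n),
    'positive_axillary_nodes': rng.poisson(4, size=n),
})

# Pairwise Pearson correlation between the numeric features.
corr = toy.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# annot=True writes each coefficient into its cell;
# vmin/vmax pin the color scale to the full correlation range.
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap='coolwarm', ax=ax)
fig.tight_layout()
```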

Conclusion- We learned some common steps involved in exploratory data
analysis. We also saw several types of charts & plots and what information is
conveyed by each of these.
