AIML Expt
Practical No. 2
Aim- To acquire, visualize, and analyze the dataset
Exploratory data analysis is a significant step to take before diving into statistical
modeling or machine learning, to ensure the data is really what it is claimed to
be and that there are no obvious errors. It should be part of data science
projects in every organization.
Just like everything in this world, data has its imperfections. Raw data is often
skewed and may contain outliers or too many missing values. A model built on such
data performs sub-optimally. In a hurry to get to the machine learning
stage, some data professionals either skip the exploratory data analysis
process entirely or do it superficially. This is a mistake with many implications,
including generating inaccurate models, generating accurate models on
the wrong data, not creating the right types of variables in data preparation, and
using resources inefficiently.
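The three imperfections named above (missing values, skew, and outliers) can each be checked in a line or two of pandas. The following is a minimal sketch on a small made-up table; the column names and values are hypothetical and are not taken from the dataset used in this practical. The 1.5 * IQR rule is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the practical's dataset): "nodes" is
# right-skewed, has one missing value and one obvious outlier.
toy = pd.DataFrame({
    "age": [34, 41, 45, 52, 58, 63, 70, 47],
    "nodes": [0, 0, 1, 1, 2, 3, np.nan, 46],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Sample skewness per column; values well above 0 suggest a long right tail.
skewness = toy.skew()

# Flag outliers in "nodes" with the 1.5 * IQR rule (one common convention).
q1, q3 = toy["nodes"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["nodes"] < q1 - 1.5 * iqr) | (toy["nodes"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

print(missing_counts)
print(skewness)
print(outliers)
```

Running the same three checks on a real DataFrame right after loading it gives a quick first impression of how much cleaning the data may need.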
Data set description
The dataset contains cases from the research carried out between the years
1958 and 1970 at the University of Chicago’s Billings Hospital on the survival
of patients who had undergone surgery for breast cancer.
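Attribute information:
patient_age: age of the patient at the time of the operation
operation_year: year of the operation
positive_axillary_nodes: number of positive axillary nodes detected
survival_status: 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years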
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
df = pd.read_csv('haberman.csv', header = 0)
df.columns = ['patient_age', 'operation_year',
'positive_axillary_nodes', 'survival_status']
2. Understanding data
df.shape
Output:
(305, 4)
There are 305 rows and 4 columns. But how many data points for each class
label are present in our dataset?
df['survival_status'].value_counts()
Output:
df.info()
Output:
df.describe()
Output:
On average, patients were operated on at about age 52; the mean of about 63 belongs to operation_year (that is, 1963), not to patient_age.
The average number of positive axillary nodes detected is about 4.
As indicated by the 50th percentile, the median number of positive axillary nodes is 1.
As indicated by the 75th percentile, 75% of the patients have at most 4 nodes
detected.
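The observations above are the kind of values that pandas reports through DataFrame.describe(). The sketch below shows how the mean, median, and 75th percentile can be read from that summary. It uses a small made-up table with the same column names; the values are hypothetical and were chosen only so that the mean sits well above the median, as it does in the text.

```python
import pandas as pd

# Hypothetical values for illustration only, NOT rows from haberman.csv.
toy = pd.DataFrame({
    "patient_age": [30, 38, 45, 50, 54, 60, 66, 73],
    "positive_axillary_nodes": [0, 0, 0, 1, 1, 2, 4, 24],
})

# describe() returns count, mean, std, min, 25%, 50%, 75% and max per column.
summary = toy.describe()
nodes = summary["positive_axillary_nodes"]

mean_nodes = nodes["mean"]    # arithmetic mean
median_nodes = nodes["50%"]   # the 50th percentile is the median
p75_nodes = nodes["75%"]      # 75th percentile

# Share of rows at or below the 75th percentile.
share_at_most_p75 = (toy["positive_axillary_nodes"] <= p75_nodes).mean()

print(summary)
print(mean_nodes, median_nodes, p75_nodes, share_at_most_p75)
```

A mean that is several times larger than the median, as in this toy column, is a typical sign of a right-skewed distribution in which a few large values pull the average up.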
A violin plot displays the same information as the box and whisker plot;
additionally, it shows a density-smoothed plot of the underlying
distribution.
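A violin plot of this kind could be drawn with seaborn roughly as follows. This is a sketch on made-up values, not the plot from the practical: the rows are hypothetical, the class labels are stored as strings so that seaborn treats them as categories, and the output file name is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical values for illustration only, NOT rows from haberman.csv.
toy = pd.DataFrame({
    "survival_status": ["1"] * 6 + ["2"] * 6,
    "positive_axillary_nodes": [0, 0, 1, 1, 2, 4, 1, 3, 6, 9, 13, 22],
})

fig, ax = plt.subplots(figsize=(6, 4))

# One violin per class: the width of each shape follows a kernel density
# estimate of the values in that class.
sns.violinplot(data=toy, x="survival_status", y="positive_axillary_nodes", ax=ax)
ax.set_title("Positive axillary nodes by survival status (toy data)")

fig.savefig("violin_toy.png")  # arbitrary output file name
```

With the real DataFrame, the same call can be repeated for each numeric column to compare how its distribution differs between the two classes.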
4.3 Heatmap
Heatmaps are used to observe the correlations among the feature variables.
This is particularly important when we are trying to obtain the feature
importance in regression analysis. Although correlated features generally do not
hurt the predictive performance of a statistical model, they can distort the
post-modeling analysis, for example by making individual coefficients hard to interpret.
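A correlation heatmap could be produced along the following lines. This is a sketch under stated assumptions: the rows are made up, Pearson correlation (the default of DataFrame.corr()) is used, and the colour map and output file name are arbitrary choices rather than settings from the practical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical values for illustration only, NOT rows from haberman.csv.
toy = pd.DataFrame({
    "patient_age": [30, 38, 45, 50, 54, 60, 66, 73],
    "operation_year": [64, 58, 66, 61, 69, 59, 63, 67],
    "positive_axillary_nodes": [0, 2, 0, 1, 4, 1, 24, 0],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()

fig, ax = plt.subplots(figsize=(5, 4))

# annot=True prints each coefficient in its cell; fixing vmin and vmax to
# -1 and 1 keeps the colour scale comparable between datasets.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Feature correlations (toy data)")

fig.tight_layout()
fig.savefig("corr_heatmap_toy.png")  # arbitrary output file name
```

Cells close to 1 or -1 away from the diagonal point to pairs of features that carry largely the same information.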
Conclusion- We learned some common steps involved in exploratory data
analysis. We also saw several types of charts and plots and what information
each of them conveys.