Week - 6-7
2. Exploring Basic Statistical Analysis Tools in Python Pandas Library
The Pandas library in Python is one of the most popular tools for data analysis due
to its versatility and powerful functions. It provides a range of functions that make
data exploration easy and flexible. Below are some essential tools and steps
used for exploratory data analysis (EDA) in Pandas:
Data Overview:
Loading the Data: Use read_csv() , read_excel() , etc., to load data into a
Pandas DataFrame.
Basic Inspection:
Head and Tail: The head() and tail() functions allow you to preview
the first or last few records of your dataset, helping in the initial
understanding of its structure.
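import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('data.csv')

print(df.head())      # first five rows
df.info()             # prints column types and non-null counts itself (returns None)
print(df.describe())  # summary statistics for numeric columns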
Data Types: The dtypes attribute can be used to verify the data types of
columns, helping identify whether transformations are needed.
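As a minimal sketch of this check, the example below builds a small made-up DataFrame (the column names and values are illustrative assumptions, not course data) in which a numeric column has been stored as text, inspects dtypes, and converts the column with pd.to_numeric().

```python
import pandas as pd

# Hypothetical data: 'price' holds numbers stored as strings
df = pd.DataFrame({
    "item": ["a", "b", "c"],
    "price": ["10.5", "20.0", "n/a"],
})

print(df.dtypes)  # 'price' is not a numeric dtype yet

# errors="coerce" turns entries that cannot be parsed (here "n/a") into NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)  # 'price' is now float64
```

Coercing unparseable entries to NaN is one possible choice; it makes the bad values visible to the missing-data checks instead of raising an error.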
Strategies to Handle Missing Data:
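A minimal sketch of three commonly used options in Pandas: counting missing values, dropping incomplete rows, and filling gaps. The DataFrame, its column names, and the choice of the median and "Unknown" as fill values are assumptions made for illustration; the right strategy depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

# 1. Count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# 2. Drop every row that contains at least one missing value
dropped = df.dropna()

# 3. Fill gaps instead: median for the numeric column, a placeholder for text
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna("Unknown")
print(filled)
```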
Summary Statistics:
Spread: Quantitative measures like Range, Variance, and Standard
Deviation help understand the dispersion of the data.
print('Mean:', df['column'].mean())
print('Median:', df['column'].median())
print('Standard Deviation:', df['column'].std())
print('Skewness:', df['column'].skew())
print('Kurtosis:', df['column'].kurt())
Correlation:
Types of Correlation:
Kendall Tau: Used for ordinal data and helps understand relationships
between ranked variables.
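Pearson correlation coefficient:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$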
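To make the two coefficients concrete, the sketch below computes Pearson's r from its textbook definition, checks it against the Pandas corr() method, and computes Kendall's tau by counting concordant and discordant pairs. The data and column names are made up, and the hand-written tau is the simple tau-a form, which is only appropriate here because the sample has no ties.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical data: study hours and exam scores
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 70, 68, 80],
})
x, y = df["hours"], df["score"]

# Pearson r: co-deviation divided by the root of the product of squared deviations
dx, dy = x - x.mean(), y - y.mean()
pearson_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Pandas built-in (method="pearson" is the default)
pearson_pandas = x.corr(y, method="pearson")

# Kendall tau-a: (concordant pairs - discordant pairs) / total pairs
pairs = list(combinations(range(len(df)), 2))
signs = [np.sign(x[i] - x[j]) * np.sign(y[i] - y[j]) for i, j in pairs]
kendall_manual = sum(signs) / len(pairs)

print("Pearson (manual):", pearson_manual)
print("Pearson (pandas):", pearson_pandas)
print("Kendall tau:", kendall_manual)
```

Pandas also accepts method="kendall" and method="spearman" in corr(); the Kendall option relies on SciPy being installed, which is why the rank-based coefficient is computed by hand here.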
Causation:
Definition: Causation implies that one event is the result of the occurrence
of the other event. Unlike correlation, causation requires evidence that
changing one variable will produce a change in another.
Important Consideration: Correlation does not imply causation. A high
correlation between two variables may be coincidental or influenced by an
unseen third variable (confounder).
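The effect of a confounder can be illustrated with simulated data. In the sketch below, every variable name and coefficient is a made-up assumption: temperature drives both of the other variables, which have no direct effect on each other. Their raw correlation is high, but it largely disappears once the linear effect of temperature is removed from both.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Hypothetical setup: temperature influences both outcomes independently
temperature = rng.normal(25, 5, n)
df = pd.DataFrame({
    "temperature": temperature,
    "ice_cream_sales": 2.0 * temperature + rng.normal(0, 3, n),
    "sunburn_cases": 1.5 * temperature + rng.normal(0, 3, n),
})

# Raw correlation between the two outcomes
raw_corr = df["ice_cream_sales"].corr(df["sunburn_cases"])


def residuals(target, confounder):
    """Return what is left of target after a straight-line fit on confounder."""
    slope, intercept = np.polyfit(confounder, target, 1)
    return target - (slope * confounder + intercept)


# Correlation of the residuals (a simple partial correlation)
partial_corr = residuals(df["ice_cream_sales"], df["temperature"]).corr(
    residuals(df["sunburn_cases"], df["temperature"])
)

print("Raw correlation:", round(raw_corr, 3))
print("After controlling for temperature:", round(partial_corr, 3))
```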
Formula:
$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Median: The middle value when all observations are sorted in ascending
or descending order. Median is particularly useful in skewed distributions
where the mean can be misleading.
Formula:
Mode: The value that appears most frequently in the data, particularly
helpful for categorical data analysis.
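A short sketch of how the three measures behave, using made-up values: a single extreme observation pulls the mean far from the bulk of the data while the median barely moves, and mode() works directly on a categorical column.

```python
import pandas as pd

# Hypothetical incomes with one extreme value
incomes = pd.Series([30, 32, 35, 36, 38, 40, 400])

mean_income = incomes.mean()      # pulled upward by the value 400
median_income = incomes.median()  # middle of the sorted values

# Hypothetical categorical data
colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])
most_common = colours.mode()      # returns a Series because ties are possible

print("Mean:", mean_income)
print("Median:", median_income)
print("Mode:", most_common.tolist())
```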
Measures of Dispersion:
Variance: The average of the squared differences from the mean.
Variance is a crucial measure of how spread out the data is. Dividing by
n − 1 rather than n gives the sample variance.
Formula:
$$\text{Variance}(\sigma^2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
$$\text{Standard Deviation}(\sigma) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
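As a quick check, the sketch below computes the variance and standard deviation by hand with the n − 1 denominator and compares them with the Pandas var() and std() methods, which use the same denominator by default (ddof=1). The sample values are made up.

```python
import numpy as np
import pandas as pd

x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample, mean = 5
n = len(x)

# Sum of squared deviations divided by n - 1
variance_manual = ((x - x.mean()) ** 2).sum() / (n - 1)
std_manual = np.sqrt(variance_manual)

print("Manual variance:", variance_manual)
print("pandas variance:", x.var())  # default ddof=1, same denominator
print("Manual std:", std_manual)
print("pandas std:", x.std())

# Passing ddof=0 divides by n instead (population variance)
population_variance = x.var(ddof=0)
print("Population variance:", population_variance)
```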
Interquartile Range (IQR): The range between the first quartile (Q1) and
third quartile (Q3). It measures the spread of the middle 50% of the data
and helps identify outliers.
Formula:
IQR = Q3 − Q1
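A minimal sketch of IQR-based outlier detection on made-up values. The 1.5 × IQR fences are a widely used convention and are an assumption here, not something fixed by the definition of the IQR.

```python
import pandas as pd

values = pd.Series([12, 13, 14, 15, 15, 16, 17, 18, 60])  # hypothetical data

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers.tolist())
```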
Measures of Distribution Shape:
Formula:
$$\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1) \cdot \sigma^3}$$
Formula:
$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1) \cdot \sigma^4}$$
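The sketch below evaluates the skewness and kurtosis definitions given in these notes by hand and compares them with the Pandas skew() and kurt() methods on a simulated right-skewed sample (the exponential distribution and the sample size are assumptions chosen for illustration). Two differences are worth knowing: Pandas applies small-sample bias corrections, so the values agree only approximately, and kurt() reports excess kurtosis (a normal distribution scores 0), so 3 is subtracted from the hand-computed value before comparing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.exponential(scale=1.0, size=5000))  # right-skewed sample

n = len(x)
dev = x - x.mean()
sigma = x.std()  # sample standard deviation (n - 1 denominator)

# Direct evaluation of the formulas from the notes
skew_formula = (dev ** 3).sum() / ((n - 1) * sigma ** 3)
kurt_formula = (dev ** 4).sum() / ((n - 1) * sigma ** 4)

# pandas estimators (bias-adjusted; kurt() is excess kurtosis)
skew_pandas = x.skew()
kurt_pandas = x.kurt()

print("Skewness (formula):", skew_formula)
print("Skewness (pandas): ", skew_pandas)
print("Kurtosis (formula) - 3:", kurt_formula - 3)
print("Kurtosis (pandas):     ", kurt_pandas)
```

With a sample this large the corrections are tiny; on small samples the gap between the two versions is more noticeable.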
Data Visualization:
Density Plots: Provide a smooth distribution of data values, helping to
understand data concentration and shape.
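To show what a density plot actually draws, the sketch below computes a Gaussian kernel density estimate with NumPy only: each observation contributes a small normal curve, and the curves are averaged. The simulated data, the grid size, and the normal-reference bandwidth rule are all assumptions made for illustration. In practice, Pandas can draw the curve directly with df['column'].plot(kind='kde'), which needs Matplotlib and SciPy installed.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=500)  # hypothetical sample

# Bandwidth from the normal-reference rule of thumb (one choice among many)
n = data.size
h = 1.06 * data.std(ddof=1) * n ** (-1 / 5)

# Evaluate the density on a grid that extends a little past the data
grid = np.linspace(data.min() - 4 * h, data.max() + 4 * h, 512)

# Gaussian KDE: average of normal curves of width h centred on each observation
z = (grid[:, None] - data[None, :]) / h
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# A valid density integrates to about 1
area = density.sum() * (grid[1] - grid[0])
print("Bandwidth:", h)
print("Area under the curve:", area)
```

Plotting grid against density gives the smooth curve; a smaller bandwidth follows the data more closely, while a larger one smooths more.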