
Week 6-7

Exploratory Data Analysis and Descriptive Statistics


(Graduate-Level Detail)
1. Introduction to Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the initial process of investigating datasets to
summarize their key characteristics, often through visualizations and summary
statistics. It helps data analysts and scientists to detect patterns, spot anomalies,
test hypotheses, and check assumptions. EDA is typically one of the first steps
taken in any data analysis project to better understand the data before applying
advanced models. The insights gained from EDA directly inform data cleaning,
feature engineering, and model selection.
Objectives of EDA:

Understand the structure and quality of data.

Identify trends, relationships, and anomalies.

Prepare data for modeling through cleaning, transformation, and feature selection.

Discover important variables, relationships, and hidden patterns.

2. Exploring Basic Statistical Analysis Tools in Python's Pandas Library
The Pandas library in Python is one of the most popular tools for data analysis due
to its versatility and powerful functions. It provides a range of functions that allow
data exploration with ease and flexibility. Below are some essential tools and steps
used for EDA in Pandas:

Data Overview:

Loading the Data: Use read_csv() , read_excel() , etc., to load data into a
Pandas DataFrame.

Basic Inspection:

Head and Tail: The head() and tail() functions allow you to preview
the first or last few records of your dataset, helping in the initial
understanding of its structure.

Information Summary: info() provides data types, non-null counts, and memory usage, which is useful for understanding data completeness and the types of columns.

Summary Statistics: describe() gives statistical summaries of numeric columns, including mean, standard deviation, min, max, and quartiles.

import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df.info())
print(df.describe())

Data Types: The dtypes attribute can be used to verify the data types of columns, helping identify whether transformations are needed.
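For instance, a minimal sketch of checking and converting types, assuming hypothetical date_col and category_col columns in df:

# Inspect the dtype of every column
print(df.dtypes)

# Convert columns whose inferred types are wrong (hypothetical column names)
df['date_col'] = pd.to_datetime(df['date_col'])
df['category_col'] = df['category_col'].astype('category')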

Handling Missing Values:

Missing values are common in real-world datasets and can introduce biases or inaccuracies in modeling if not handled correctly.

Identify Missing Values: Use isnull() and sum() to get an overview of where missing data exists.

Strategies to Handle Missing Data:

Removal: If the missing values are minimal and randomly distributed, rows or columns with missing values can be dropped using dropna() . This approach is effective when the impact on data quality is negligible.

Imputation: Fill missing values using central tendency metrics ( mean() , median() , mode() ) or predictive models.

Forward/Backward Filling: Methods like ffill() or bfill() can be used in time series data to fill missing values based on neighboring data points (a short sketch follows the code below).

# Count missing values in each column
print(df.isnull().sum())

# Fill missing values with the column mean
df['column'] = df['column'].fillna(df['column'].mean())

# Drop rows with missing values
df.dropna(inplace=True)
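For the forward/backward filling mentioned above, a minimal sketch (assuming df is a time series sorted by its index) might look like:

# Propagate the last valid observation forward,
# then fill any remaining leading gaps backward
df['column'] = df['column'].ffill()
df['column'] = df['column'].bfill()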

Advanced Techniques: Use machine learning models for imputation (e.g., KNNImputer from Scikit-Learn) for more sophisticated handling of missing data.
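A minimal sketch of such model-based imputation, assuming the numeric columns of df are suitable for distance computations:

from sklearn.impute import KNNImputer

# Impute each missing value from its 5 nearest neighbors
imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])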

Summary Statistics:

Central Tendency: Calculations such as Mean, Median, and Mode give insights into the typical values of a dataset.

Mean provides the average of all values.

Median is less affected by outliers and provides a central point of the distribution.

Mode is used particularly for categorical data to determine the most frequent category.

Spread: Quantitative measures like Range, Variance, and Standard
Deviation help understand the dispersion of the data.

Range: Indicates the difference between maximum and minimum values.

Variance and Standard Deviation: Variance represents the spread of data points around the mean, while standard deviation is its square root, making it more interpretable.

Skewness and Kurtosis:

Skewness tells us about the asymmetry in data distribution.

Kurtosis provides insight into the peakedness of the distribution and the presence of outliers.

print('Mean:', df['column'].mean())
print('Median:', df['column'].median())
print('Standard Deviation:', df['column'].std())
print('Skewness:', df['column'].skew())
print('Kurtosis:', df['column'].kurt())

Application: High skewness and kurtosis can affect the performance of statistical models, and transformations like log or square root may be applied to normalize the data.
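A minimal sketch of such transformations, assuming the column contains only non-negative values:

import numpy as np

# log1p computes log(1 + x), which tolerates zeros
df['column_log'] = np.log1p(df['column'])
df['column_sqrt'] = np.sqrt(df['column'])
print('Skewness after log transform:', df['column_log'].skew())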

3. Correlation and Methods to Establish Causation

Correlation:

Definition: Correlation measures the strength and direction of a relationship between two variables. The correlation coefficient (r) ranges from -1 to 1.

Positive Correlation: As one variable increases, the other also increases.

Negative Correlation: As one variable increases, the other decreases.

No Correlation: No apparent relationship between the variables.

Types of Correlation:

Pearson Correlation: Measures linear relationships and is sensitive to outliers.

Spearman Rank Correlation: Measures monotonic relationships using ranks, less sensitive to outliers.

Kendall Tau: Used for ordinal data and helps understand relationships between ranked variables (a sketch follows the code below).

# Calculate Pearson correlation matrix (numeric columns only)
correlation_matrix = df.corr(method='pearson', numeric_only=True)
print(correlation_matrix)

# Calculate Spearman correlation
spearman_corr = df.corr(method='spearman', numeric_only=True)
print(spearman_corr)
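The Kendall tau correlation mentioned above is computed the same way by switching the method argument:

# Calculate Kendall rank correlation
kendall_corr = df.corr(method='kendall', numeric_only=True)
print(kendall_corr)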

Correlation Coefficient Formula:

Pearson Correlation Coefficient (r):

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
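As an illustration, the formula can be computed by hand for two hypothetical numeric columns, x_col and y_col, and checked against Pandas:

import numpy as np

x, y = df['x_col'], df['y_col']  # hypothetical column names
r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)
print('Manual r:', r)
print('Pandas r:', x.corr(y))  # should agree up to floating-point error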

Causation:

Definition: Causation implies that one event is the result of the occurrence
of the other event. Unlike correlation, causation requires evidence that
changing one variable will produce a change in another.

Proving Causation: Controlled experiments such as Randomized Controlled Trials (RCTs) and A/B tests, or statistical tests such as Granger causality, are required to demonstrate causation.

Important Consideration: Correlation does not imply causation. A high
correlation between two variables may be coincidental or influenced by an
unseen third variable (confounder).
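As a sketch of the A/B-testing idea using a two-sample t-test, assuming hypothetical group and outcome columns in df:

from scipy import stats

# Compare the outcomes of a control group (A) and a treatment group (B)
group_a = df.loc[df['group'] == 'A', 'outcome']
group_b = df.loc[df['group'] == 'B', 'outcome']
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
print('t =', t_stat, 'p =', p_value)

Within a properly randomized experiment, a small p-value supports a causal interpretation; in observational data, it only indicates association.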

4. Descriptive Statistics: Types and Formulas

Descriptive statistics summarize and describe the features of a dataset through numerical and graphical summaries. These measures are key to understanding the nature and distribution of the data.

Measures of Central Tendency:

Mean (Average): The sum of all values divided by the number of observations.

Formula:

\[ \text{Mean}(\bar{x}) = \frac{\sum_{i=1}^{n} x_i}{n} \]
Median: The middle value when all observations are sorted in ascending
or descending order. Median is particularly useful in skewed distributions
where the mean can be misleading.

Formula:

If \(n\) is odd, Median = value at position \((n + 1)/2\).

If \(n\) is even, Median = average of values at positions \(n/2\) and \((n/2) + 1\).

Mode: The value that appears most frequently in the data, particularly
helpful for categorical data analysis.

Formula: No specific formula; determined based on frequency count.

Measures of Dispersion:

Range: The difference between the maximum and minimum values, showing the spread of the data.

Formula:

\[ \text{Range} = \max(x) - \min(x) \]

Variance: The average of the squared differences from the mean. Variance is a crucial measure of how spread out the data is.

Formula:

\[ \text{Variance}(\sigma^2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

(Dividing by \(n - 1\) gives the sample variance, which is what Pandas computes by default.)

Standard Deviation: The square root of variance, providing a measure of the average distance of each data point from the mean.

Formula:

\[ \text{Standard Deviation}(\sigma) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
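These definitions match Pandas' defaults: var() and std() divide by n - 1 (ddof=1), as a quick check on a hypothetical column illustrates:

col = df['column'].dropna()
n = len(col)

# Manual sample variance (divide by n - 1) vs. Pandas' default
manual_var = ((col - col.mean()) ** 2).sum() / (n - 1)
print(manual_var, col.var())         # should agree
print(manual_var ** 0.5, col.std())  # should agree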

Interquartile Range (IQR): The range between the first quartile (Q1) and
third quartile (Q3). It measures the spread of the middle 50% of the data
and helps identify outliers.

Formula:

\[ \text{IQR} = Q_3 - Q_1 \]
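A common application is the 1.5 × IQR outlier rule, sketched here for a hypothetical column:

Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as potential outliers
outliers = df[(df['column'] < Q1 - 1.5 * IQR) | (df['column'] > Q3 + 1.5 * IQR)]
print('Potential outliers:', len(outliers))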
Measures of Distribution Shape:

Skewness: Measures the asymmetry of the data distribution.

Formula:

\[ \text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\,\sigma^3} \]

A positive skewness value indicates a right-skewed distribution, while a negative value indicates a left-skewed distribution. Highly skewed data may require transformations for modeling.

Kurtosis: Measures the "tailedness" of the distribution, which helps


identify the presence of outliers.

Formula:

\[ \text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)\,\sigma^4} \]

High kurtosis indicates a distribution with heavy tails, while low kurtosis indicates light tails. Excess kurtosis (kurtosis − 3) helps determine whether data has heavier or lighter tails compared to a normal distribution.
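Note that Pandas' kurt() already reports excess kurtosis (Fisher's definition), so values near 0 indicate normal-like tails, as a quick check shows:

import numpy as np
import pandas as pd

# kurt() returns excess kurtosis: approximately 0 for normal data
normal_sample = pd.Series(np.random.default_rng(0).normal(size=100_000))
print('Excess kurtosis of a normal sample:', normal_sample.kurt())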

5. Additional Tools in Python for Descriptive Analysis

Quantiles and Percentiles:

Use quantile() to find different quantiles and percentiles of a dataset, which are useful for understanding data spread and identifying potential outliers.

# Calculate 25th, 50th, and 75th percentiles
Q1 = df['column'].quantile(0.25)
Q2 = df['column'].quantile(0.50)
Q3 = df['column'].quantile(0.75)
print('Q1:', Q1)
print('Median (Q2):', Q2)
print('Q3:', Q3)

Application: Quantiles are particularly helpful for creating boxplots, detecting outliers, and understanding the data's distribution.

Data Visualization:

Visualizing descriptive statistics is crucial for understanding data properties intuitively:

Boxplots: Useful for visualizing the spread, median, and potential outliers of a dataset. A boxplot uses quartiles and highlights the IQR, giving a view of the data distribution.

Histograms: Help visualize the frequency distribution of data and give a sense of skewness and kurtosis.

Density Plots: Provide a smooth distribution of data values, helping to understand data concentration and shape.

import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot to visualize distribution
sns.boxplot(x=df['column'])
plt.title('Boxplot of Column')
plt.show()

# Histogram for frequency distribution
df['column'].hist(bins=30)
plt.title('Histogram of Column')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

# Density plot for distribution (fill=True replaces the deprecated shade=True)
sns.kdeplot(df['column'], fill=True)
plt.title('Density Plot of Column')
plt.show()
