
Week 6-7

Exploratory Data Analysis and Descriptive Statistics


(Graduate-Level Detail)
1. Introduction to Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the initial process of investigating datasets to
summarize their key characteristics, often through visualizations and summary
statistics. It helps data analysts and scientists to detect patterns, spot anomalies,
test hypotheses, and check assumptions. EDA is typically one of the first steps
taken in any data analysis project to better understand the data before applying
advanced models. The insights gained from EDA directly inform data cleaning,
feature engineering, and model selection.
Objectives of EDA:

Understand the structure and quality of data.

Identify trends, relationships, and anomalies.

Prepare data for modeling through cleaning, transformation, and feature selection.

Discover important variables, relationships, and hidden patterns.

2. Exploring Basic Statistical Analysis Tools in Python's Pandas Library
The Pandas library in Python is one of the most popular tools for data analysis due
to its versatility and powerful functions. It provides a range of functions that allow
data exploration with ease and flexibility. Below are some essential tools and steps
used for EDA in Pandas:

Data Overview:

Loading the Data: Use read_csv() , read_excel() , etc., to load data into a
Pandas DataFrame.

Basic Inspection:

Head and Tail: The head() and tail() functions allow you to preview
the first or last few records of your dataset, helping in the initial
understanding of its structure.

Information Summary: info() provides data types, non-null counts, and memory usage, which is useful for understanding data completeness and the types of columns.

Summary Statistics: describe() gives statistical summaries of numeric columns, including mean, standard deviation, min, max, and quartiles.

import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df.info())
print(df.describe())

Data Types: The dtypes attribute can be used to verify the data types of columns, helping identify whether transformations are needed.
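For instance, a minimal sketch of checking and converting types, assuming hypothetical date_col and category_col columns in df:

# Inspect the dtype of every column
print(df.dtypes)

# Convert columns whose inferred types are wrong (hypothetical column names)
df['date_col'] = pd.to_datetime(df['date_col'])
df['category_col'] = df['category_col'].astype('category')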

Handling Missing Values:

Missing values are common in real-world datasets and can introduce biases or inaccuracies in modeling if not handled correctly.

Identify Missing Values: Use isnull() and sum() to get an overview of where missing data exists.

Strategies to Handle Missing Data:

Removal: If the missing values are minimal and randomly distributed, rows or columns with missing values can be dropped using dropna() . This approach is effective when the impact on data quality is negligible.

Imputation: Fill missing values using central tendency metrics ( mean() , median() , mode() ) or predictive models.

Forward/Backward Filling: Methods like ffill() or bfill() can be used in time series data to fill missing values based on neighboring data points (a short sketch follows the code below).

# Count missing values in each column
print(df.isnull().sum())

# Fill missing values with the column mean
df['column'] = df['column'].fillna(df['column'].mean())

# Drop rows with missing values
df.dropna(inplace=True)
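For the forward/backward filling mentioned above, a minimal sketch (assuming df is a time series sorted by its index) might look like:

# Propagate the last valid observation forward,
# then fill any remaining leading gaps backward
df['column'] = df['column'].ffill()
df['column'] = df['column'].bfill()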

Advanced Techniques: Use machine learning models for imputation (e.g., KNNImputer from Scikit-Learn) for more sophisticated handling of missing data.
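A minimal sketch of such model-based imputation, assuming the numeric columns of df are suitable for distance computations:

from sklearn.impute import KNNImputer

# Impute each missing value from its 5 nearest neighbors
imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])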

Summary Statistics:

Central Tendency: Calculations such as Mean, Median, and Mode give insights into the typical values of a dataset.

Mean provides the average of all values.

Median is less affected by outliers and provides a central point of the distribution.

Mode is used particularly for categorical data to determine the most frequent category.

Spread: Quantitative measures like Range, Variance, and Standard
Deviation help understand the dispersion of the data.

Range: Indicates the difference between maximum and minimum values.

Variance and Standard Deviation: Variance represents the spread of data points around the mean, while standard deviation is its square root, making it more interpretable.

Skewness and Kurtosis:

Skewness tells us about the asymmetry in data distribution.

Kurtosis provides insight into the peakedness of the distribution and the presence of outliers.

print('Mean:', df['column'].mean())
print('Median:', df['column'].median())
print('Standard Deviation:', df['column'].std())
print('Skewness:', df['column'].skew())
print('Kurtosis:', df['column'].kurt())

Application: High skewness and kurtosis can affect the performance of statistical models, and transformations like log or square root may be applied to normalize the data.
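A minimal sketch of such transformations, assuming the column contains only non-negative values:

import numpy as np

# log1p computes log(1 + x), which tolerates zeros
df['column_log'] = np.log1p(df['column'])
df['column_sqrt'] = np.sqrt(df['column'])
print('Skewness after log transform:', df['column_log'].skew())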

3. Correlation and Methods to Establish Causation

Correlation:

Definition: Correlation measures the strength and direction of a relationship between two variables. The correlation coefficient (r) ranges from -1 to 1.

Positive Correlation: As one variable increases, the other also increases.

Negative Correlation: As one variable increases, the other decreases.

No Correlation: No apparent relationship between the variables.

Types of Correlation:

Pearson Correlation: Measures linear relationships and is sensitive to outliers.

Spearman Rank Correlation: Measures monotonic relationships using ranks, less sensitive to outliers.

Kendall Tau: Used for ordinal data and helps understand relationships between ranked variables (a sketch follows the code below).

# Calculate Pearson correlation matrix (numeric columns only)
correlation_matrix = df.corr(method='pearson', numeric_only=True)
print(correlation_matrix)

# Calculate Spearman correlation
spearman_corr = df.corr(method='spearman', numeric_only=True)
print(spearman_corr)
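The Kendall tau correlation mentioned above is computed the same way by switching the method argument:

# Calculate Kendall rank correlation
kendall_corr = df.corr(method='kendall', numeric_only=True)
print(kendall_corr)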

Correlation Coefficient Formula:

Pearson Correlation Coefficient (r):

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
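As an illustration, the formula can be computed by hand for two hypothetical numeric columns, x_col and y_col, and checked against Pandas:

import numpy as np

x, y = df['x_col'], df['y_col']  # hypothetical column names
r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)
print('Manual r:', r)
print('Pandas r:', x.corr(y))  # should agree up to floating-point error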

Causation:

Definition: Causation implies that one event is the result of the occurrence
of the other event. Unlike correlation, causation requires evidence that
changing one variable will produce a change in another.

Proving Causation: Controlled experiments such as Randomized Controlled Trials (RCTs) and A/B tests, or statistical tests such as Granger causality, are required to demonstrate causation.

Important Consideration: Correlation does not imply causation. A high
correlation between two variables may be coincidental or influenced by an
unseen third variable (confounder).
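As a sketch of the A/B-testing idea using a two-sample t-test, assuming hypothetical group and outcome columns in df:

from scipy import stats

# Compare the outcomes of a control group (A) and a treatment group (B)
group_a = df.loc[df['group'] == 'A', 'outcome']
group_b = df.loc[df['group'] == 'B', 'outcome']
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
print('t =', t_stat, 'p =', p_value)

Within a properly randomized experiment, a small p-value supports a causal interpretation; in observational data, it only indicates association.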

4. Descriptive Statistics: Types and Formulas

Descriptive statistics summarize and describe the features of a dataset through numerical and graphical summaries. These measures are key to understanding the nature and distribution of the data.

Measures of Central Tendency:

Mean (Average): The sum of all values divided by the number of observations.

Formula:

\[ \text{Mean}(\bar{x}) = \frac{\sum_{i=1}^{n} x_i}{n} \]
Median: The middle value when all observations are sorted in ascending
or descending order. Median is particularly useful in skewed distributions
where the mean can be misleading.

Formula:

If \(n\) is odd, Median = value at position \((n + 1)/2\).

If \(n\) is even, Median = average of values at positions \(n/2\) and \((n/2) + 1\).

Mode: The value that appears most frequently in the data, particularly
helpful for categorical data analysis.

Formula: No specific formula; determined based on frequency count.

Measures of Dispersion:

Range: The difference between the maximum and minimum values, showing the spread of the data.

Formula:

\[ \text{Range} = \max(x) - \min(x) \]

Variance: The average of the squared differences from the mean. Variance is a crucial measure of how spread out the data is.

Formula:

\[ \text{Variance}(\sigma^2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

(Dividing by \(n - 1\) gives the sample variance, which is what Pandas computes by default.)

Standard Deviation: The square root of variance, providing a measure of the average distance of each data point from the mean.

Formula:

\[ \text{Standard Deviation}(\sigma) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
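These definitions match Pandas' defaults: var() and std() divide by n - 1 (ddof=1), as a quick check on a hypothetical column illustrates:

col = df['column'].dropna()
n = len(col)

# Manual sample variance (divide by n - 1) vs. Pandas' default
manual_var = ((col - col.mean()) ** 2).sum() / (n - 1)
print(manual_var, col.var())         # should agree
print(manual_var ** 0.5, col.std())  # should agree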

Interquartile Range (IQR): The range between the first quartile (Q1) and
third quartile (Q3). It measures the spread of the middle 50% of the data
and helps identify outliers.

Formula:

\[ \text{IQR} = Q_3 - Q_1 \]
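A common application is the 1.5 × IQR outlier rule, sketched here for a hypothetical column:

Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as potential outliers
outliers = df[(df['column'] < Q1 - 1.5 * IQR) | (df['column'] > Q3 + 1.5 * IQR)]
print('Potential outliers:', len(outliers))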
Measures of Distribution Shape:

Skewness: Measures the asymmetry of the data distribution.

Formula:

\[ \text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\,\sigma^3} \]

A positive skewness value indicates a right-skewed distribution, while a negative value indicates a left-skewed distribution. Highly skewed data may require transformations for modeling.

Kurtosis: Measures the "tailedness" of the distribution, which helps


identify the presence of outliers.

Formula:

\[ \text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)\,\sigma^4} \]

High kurtosis indicates a distribution with heavy tails, while low kurtosis indicates light tails. Excess kurtosis (kurtosis − 3) helps determine whether data has heavier or lighter tails compared to a normal distribution.
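Note that Pandas' kurt() already reports excess kurtosis (Fisher's definition), so values near 0 indicate normal-like tails, as a quick check shows:

import numpy as np
import pandas as pd

# kurt() returns excess kurtosis: approximately 0 for normal data
normal_sample = pd.Series(np.random.default_rng(0).normal(size=100_000))
print('Excess kurtosis of a normal sample:', normal_sample.kurt())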

5. Additional Tools in Python for Descriptive Analysis

Quantiles and Percentiles:

Use quantile() to find different quantiles and percentiles of a dataset, which are useful for understanding data spread and identifying potential outliers.

# Calculate 25th, 50th, and 75th percentiles
Q1 = df['column'].quantile(0.25)
Q2 = df['column'].quantile(0.50)
Q3 = df['column'].quantile(0.75)
print('Q1:', Q1)
print('Median (Q2):', Q2)
print('Q3:', Q3)

Application: Quantiles are particularly helpful for creating boxplots, detecting outliers, and understanding the data's distribution.

Data Visualization:

Visualizing descriptive statistics is crucial for understanding data properties intuitively:

Boxplots: Useful for visualizing the spread, median, and potential outliers of a dataset. A boxplot uses quartiles and highlights the IQR, giving a view of the data distribution.

Histograms: Help visualize the frequency distribution of data and give a sense of skewness and kurtosis.

Density Plots: Provide a smooth distribution of data values, helping to understand data concentration and shape.

import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot to visualize distribution
sns.boxplot(x=df['column'])
plt.title('Boxplot of Column')
plt.show()

# Histogram for frequency distribution
df['column'].hist(bins=30)
plt.title('Histogram of Column')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

# Density plot for distribution (fill=True replaces the deprecated shade=True)
sns.kdeplot(df['column'], fill=True)
plt.title('Density Plot of Column')
plt.show()
