Lecture Notes

Uploaded by

kyaligonzaerick


Exploratory Data Analysis

and Descriptive Statistics

Today

• What are descriptive statistics and exploratory data analysis?

• Basic numerical summaries of data

• Basic graphical summaries of data

“Central Dogma” of Statistics

[Diagram relating Population and Sample: Probability (population to sample), Inferential Statistics (sample to population), Descriptive Statistics (summarizing the sample)]
EDA
Before making inferences from data it is
essential to examine all your variables.

Why?

To listen to the data:

- to catch mistakes
- to see patterns in the data
- to find violations of statistical assumptions
- to generate hypotheses
…and because if you don’t, you will have trouble later
Types of Data

• Categorical
- binary: 2 categories
- nominal: more categories
- ordinal: order matters

• Quantitative
- discrete: numerical
- continuous: uninterrupted
Dimensionality of Data Sets

• Univariate: Measurement made on one variable per subject

• Bivariate: Measurement made on two variables per subject

• Multivariate: Measurement made on many variables per subject
Numerical Summaries of Data

• Central Tendency measures. They are computed to give a “center” around which the measurements in the data are distributed.

• Variation or Variability measures. They describe “data spread” or how far away the measurements are from the center.
Location: Mean

1. The Mean

To calculate the average $\bar{x}$ of a set of $n$ observations, add their values and divide by the number of observations:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Location: Median
• Median – the exact middle value

• Calculation:
- If there are an odd number of observations, find the middle value

- If there are an even number of observations, find the middle two values and average them

• Example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

Median = (22+23)/2 = 22.5
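
The two location measures can be checked on the ages above. This is a minimal sketch in Python (the language is a choice made here, not part of the slides); the helper names `mean_by_hand` and `median_by_hand` are hypothetical.

```python
from statistics import mean, median

# Ages from the slide's median example.
ages = [17, 19, 21, 22, 23, 23, 23, 38]

def mean_by_hand(values):
    """Add the values and divide by the number of observations."""
    return sum(values) / len(values)

def median_by_hand(values):
    """Middle value (odd n) or average of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

age_mean = mean_by_hand(ages)
age_median = median_by_hand(ages)
print(age_mean, age_median)  # 23.25 22.5
```

The single age of 38 pulls the mean (23.25) above the median (22.5).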

Which Location Measure Is Best?

• Mean is best for symmetric distributions without outliers

• Median is useful for skewed distributions or data with outliers

[Figure: two data sets plotted on a 0 to 10 axis. Left panel: Mean = 3, Median = 3. Right panel: Mean = 4, Median = 3.]
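
A small sketch of why the median resists outliers. The two lists are hypothetical values chosen to give the same summary numbers as the comparison above; they are not the data behind the slide's figure.

```python
from statistics import mean, median

# Hypothetical values (not the data behind the slide's figure).
symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]  # largest value moved from 5 to 10

print(mean(symmetric), median(symmetric))        # mean 3, median 3
print(mean(with_outlier), median(with_outlier))  # mean 4, median 3
```

Changing one extreme value shifts the mean by a full unit while the median stays put.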
Scale: Variance

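• Average of squared deviations of values from the mean (for a sample, the sum of squared deviations divided by n-1):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$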
Why Squared Deviations?

• Adding deviations will yield a sum of 0: positive and negative deviations cancel

• Absolute values do not have nice
mathematical properties
• Squares eliminate the negatives

• Result:
– Increasing contribution to the variance as
you go farther from the mean.
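
The cancellation of raw deviations can be shown in one line, writing $\bar{x}$ for the mean of the $n$ observations $x_1, \dots, x_n$, so that $\sum x_i = n\bar{x}$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

Squaring each deviation before summing removes the cancellation, which is why the variance is built from squared deviations.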
Scale: Standard Deviation
• Variance is hard to interpret on its own because it is in squared units

• What does it mean to have a variance of 10.8? Or 2.2? Or 1459.092? Or 0.000001?

• Not much by itself. But if you “standardize” that value, you can talk about any spread (i.e. deviation) in the units of the original measurements

• The standard deviation is simply the square root of the variance
Scale: Standard Deviation

1. Score (in the units that are meaningful)

2. Mean
3. Each score’s deviation from the mean
4. Square that deviation
5. Sum all the squared deviations (Sum of Squares)
6. Divide by n-1
7. Square root – now the value is in the units we started with!!!
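
The seven steps above can be sketched directly in Python. The scores reuse the ages from the median example; the variable names are illustrative choices, and the standard library's `statistics.stdev` is imported only as a cross-check.

```python
import math
from statistics import stdev

scores = [17, 19, 21, 22, 23, 23, 23, 38]        # 1. scores (ages, in years)
mean_score = sum(scores) / len(scores)           # 2. mean
deviations = [x - mean_score for x in scores]    # 3. each score's deviation
squared = [d ** 2 for d in deviations]           # 4. square each deviation
sum_of_squares = sum(squared)                    # 5. sum of squares
variance = sum_of_squares / (len(scores) - 1)    # 6. divide by n - 1
sd = math.sqrt(variance)                         # 7. square root, back in years

print(sum_of_squares)  # 281.5
```

The result of step 7 should agree with `stdev(scores)`, which also divides by n - 1.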
Scale: Quartiles and IQR
[Figure: the ordered observations split into four groups of 25% each by Q1, Q2, and Q3; the IQR spans Q1 to Q3, so IQR = Q3 - Q1]

• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger

• Q2 is the same as the median (50% are smaller, 50% are larger)

• Only 25% of the observations are greater than the third quartile
Percentiles (aka Quantiles)
In general, the nth percentile is a value such that n% of the observations fall at or below it

Q1 = 25th percentile
Median = 50th percentile
Q3 = 75th percentile
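
A short sketch of quartiles and the IQR using Python's standard library, applied to the ages from the median example. Quartile conventions differ between textbooks and software, so Q1 and Q3 may not match a hand calculation exactly; only the default method of `statistics.quantiles` is used here.

```python
from statistics import quantiles

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# n=4 asks for the three cut points that split the data into quarters.
# Quartile conventions differ between textbooks and software; this uses
# the default method of statistics.quantiles.
q1, q2, q3 = quantiles(ages, n=4)
iqr = q3 - q1

print(q2)  # 22.5, the median
```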
Graphical Summaries of Data

A (Good) Picture Is Worth 1,000 Words
Univariate Data: Histograms
and Bar Plots
• What’s the difference between a histogram and bar plot?
Bar plot
• Used for categorical variables to show frequency or
proportion in each category.
• Translate the data from frequency tables into a
pictorial representation…

Histogram
• Used to visualize distribution (shape, center, range,
variation) of continuous variables
• “Bin size” important
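
As a sketch of the frequency table that a bar plot draws, the snippet below counts hypothetical categorical observations (the blood-type values are made up for illustration) and prints a text bar per category.

```python
from collections import Counter

# Hypothetical categorical observations (illustrative only).
blood_types = ["O", "A", "A", "B", "O", "O", "AB", "A", "O"]

counts = Counter(blood_types)                 # frequency table
total = sum(counts.values())
proportions = {k: v / total for k, v in counts.items()}

# Text rendering of a bar plot: one bar per category, length = frequency.
for category, count in counts.most_common():
    print(f"{category:>2} {'#' * count} ({count})")
```

Each bar stands for a category, so the order of bars and the gaps between them carry no numerical meaning, unlike the bins of a histogram.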
Effect of Bin Size on Histogram
• Simulated 1000 N(0,1) and 500 N(1,1)

[Figure: histograms of the same simulated data drawn with different bin sizes; y-axis: Frequency]
More on Histograms
• What’s the difference between a frequency histogram and a density histogram?
[Figure: Frequency Histogram (left) and Density Histogram (right)]
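
A sketch of both histogram ideas in plain Python. A frequency histogram plots counts per bin, while a density histogram rescales the bar heights so the total bar area equals 1. The helpers `histogram` and `to_density`, the bin widths, and the random seed are assumptions made for illustration; the simulated mixture follows the slide (1000 draws from N(0,1) and 500 from N(1,1)).

```python
import random

def histogram(values, bin_width):
    """Count observations per bin; bin k covers [k*w, (k+1)*w)."""
    counts = {}
    for v in values:
        k = int(v // bin_width)           # index of the bin containing v
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))

def to_density(counts, bin_width):
    """Rescale counts so the bar areas (height * width) sum to 1."""
    n = sum(counts.values())
    return {k: c / (n * bin_width) for k, c in counts.items()}

random.seed(0)  # arbitrary seed, only for reproducibility
data = [random.gauss(0, 1) for _ in range(1000)] + \
       [random.gauss(1, 1) for _ in range(500)]

coarse = histogram(data, bin_width=2.0)   # few wide bins
fine = histogram(data, bin_width=0.25)    # many narrow bins
density = to_density(fine, 0.25)
total_area = sum(h * 0.25 for h in density.values())
```

Wide bins hide the shape of the mixture and narrow bins make it noisy; the density version has the same shape as the frequency version but a total area of 1.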
Box Plots
[Figure: box plot of AGE (y-axis: Years, 0 to 100) marking the minimum, Q1, median, Q3, maximum, and IQR]
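
The values a box plot marks can be collected as a five-number summary. This is a sketch using the ages from the median example rather than the data in the figure; `five_number_summary` is a hypothetical helper, and the quartiles follow the default convention of `statistics.quantiles`.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

ages = [17, 19, 21, 22, 23, 23, 23, 38]   # ages from the median example
lo, q1, med, q3, hi = five_number_summary(ages)
```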
Bivariate Data

Variable 1     Variable 2     Display

Categorical    Categorical    Crosstabs, Stacked Box Plot
Categorical    Continuous     Boxplot
Continuous     Continuous     Scatterplot, Stacked Box Plot
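
A crosstab for two categorical variables is a table of counts for each pair of categories. The sketch below builds one from hypothetical paired observations; the variable names and values are made up for illustration.

```python
from collections import Counter

# Hypothetical paired categorical observations: (smoker, outcome).
pairs = [("yes", "case"), ("yes", "case"), ("yes", "control"),
         ("no", "case"), ("no", "control"), ("no", "control")]

table = Counter(pairs)   # cell counts keyed by (row, column)
rows = sorted({r for r, _ in pairs})
cols = sorted({c for _, c in pairs})

# Print the table with row labels on the left and column labels on top.
print(f"{'':>8}" + "".join(f"{c:>10}" for c in cols))
for r in rows:
    print(f"{r:>8}" + "".join(f"{table[(r, c)]:>10}" for c in cols))
```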
Multivariate Data
Clustering
• Organize units into clusters
• Descriptive, not inferential
• Many approaches
• “Clusters” always produced

Data Reduction Approaches (PCA)

• Reduce an n-dimensional dataset to a much smaller number of dimensions
• Finds a new (smaller) set of variables that retains
most of the information in the total sample
• Effective way to visualize multivariate data
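
To make the data-reduction idea concrete, here is a minimal sketch of PCA for just two variables, using the closed-form eigenvalues of the 2x2 sample covariance matrix. The function `pca_2d` and the data are hypothetical; real analyses would normally use a numerical library and often standardize the variables first.

```python
import math

def pca_2d(xs, ys):
    """Eigenvalues of the 2x2 sample covariance matrix, largest first."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return half_trace + root, half_trace - root

# Hypothetical data: y is roughly 2x, so one direction carries most variation.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
lam1, lam2 = pca_2d(xs, ys)
explained = lam1 / (lam1 + lam2)   # share of total variance on the first PC
```

Because the two variables are almost perfectly linearly related, nearly all of the variance lies along the first principal component, so one new variable retains most of the information in the original two.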
How to Make a Bad Graph
The aim of good data graphics:
Display data accurately and clearly

Some rules for displaying data badly:

– Display as little information as possible
– Obscure what you do show (with chart junk)
– Use pseudo-3d and color gratuitously
– Make a pie chart (preferably in color and 3d)
– Use a poorly chosen scale

From Karl Broman: http://www.biostat.wisc.edu/~kbroman/

Examples 1 to 5 [figures not reproduced]
R Tutorial

• Calculating descriptive statistics in R

• Useful R commands for working with multivariate data (apply and its derivatives)

• Creating graphs for different types of data (histograms, boxplots, scatterplots)

• Basic clustering and PCA
