Lecture Notes
[Diagram: Population, Sample, Probability, Descriptive Statistics, Inferential Statistics]
Exploratory Data Analysis (EDA)
Before making inferences from data it is
essential to examine all your variables.
Why?
Categorical: 2 categories; more categories; order matters
Quantitative: numerical; uninterrupted
Dimensionality of Data Sets
1. The Mean
2. The Median
• Calculation:
- Sort the observations; if there is an odd number of observations, the median is the middle value
• Example
Some data:
Age of participants: 17 19 21 22 23 23 23 38
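As a rough sketch, the mean and median of the ages above can be computed with Python's standard statistics module. The variable names are illustrative choices. With an even number of observations, as here, the median is taken as the average of the two middle values.

```python
import statistics

# Ages of participants, copied from the example above
ages = [17, 19, 21, 22, 23, 23, 23, 38]

mean_age = statistics.mean(ages)      # sum of the values divided by their count
median_age = statistics.median(ages)  # even count: average of the two middle values

print(mean_age, median_age)
```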
[Figure: two dot plots on a 0 to 10 axis. First plot: Mean = 3, Median = 3. Second plot: Mean = 4, Median = 3.]
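The Mean and Median values above differ only in the mean, which suggests that one extreme observation moves the mean much more than the median. The data behind that figure is not recoverable here, so this sketch illustrates the same idea with the ages example instead, comparing the data with and without the largest value (38).

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]
ages_without_38 = [a for a in ages if a != 38]  # drop the one extreme value

# How far each summary moves when the extreme value is included
mean_shift = statistics.mean(ages) - statistics.mean(ages_without_38)
median_shift = statistics.median(ages) - statistics.median(ages_without_38)

print(mean_shift, median_shift)
```

In this example the mean moves by more than 2 years while the median moves by half a year.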
Scale: Variance
• Result:
– Increasing contribution to the variance as
you go farther from the mean.
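To make this result concrete, the sketch below lists each observation's squared deviation from the mean, using the ages example. The n - 1 divisor (sample variance) is an assumption; the notes do not say which divisor is intended.

```python
ages = [17, 19, 21, 22, 23, 23, 23, 38]
n = len(ages)
mean_age = sum(ages) / n

# Squared deviation of each observation from the mean
squared_devs = [(x - mean_age) ** 2 for x in ages]

# Assumption: sample variance, dividing by n - 1
variance = sum(squared_devs) / (n - 1)

for x, d in zip(ages, squared_devs):
    print(x, d)
print(variance)
```

The observation 38, the farthest from the mean, contributes far more to the sum than any other value.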
Scale: Standard Deviation
• Variance is somewhat arbitrary: it is measured in squared units of the data
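A minimal sketch of the relationship, again on the ages example: the standard deviation is the square root of the variance, which brings the measure of spread back to the original units (years). The sample versions (n - 1 divisor) are assumed here.

```python
import math
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

variance = statistics.variance(ages)  # sample variance, in years squared
std_dev = statistics.stdev(ages)      # sample standard deviation, in years

print(variance, std_dev)
print(math.isclose(std_dev, math.sqrt(variance)))
```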
Quartiles: Q1, Q2, Q3
Q1 = 25th percentile
Median = Q2 = 50th percentile
Q3 = 75th percentile
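The quartiles can be sketched in code with statistics.quantiles from the standard library. Several conventions for computing percentiles exist and give slightly different answers on small samples. The "inclusive" method used here (linear interpolation between data points) is an assumption, not necessarily the convention used in the lecture.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# n=4 asks for the three cut points that split the data into quarters.
# Assumption: the "inclusive" interpolation method.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")

print(q1, q2, q3)
```

The middle cut point q2 agrees with the median of the same data.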
Graphical Summaries of Data
A (Good) Picture Is Worth 1,000 Words
Univariate Data: Histograms and Bar Plots
• What’s the difference between a histogram and a bar plot?
Bar plot
• Used for categorical variables to show frequency or
proportion in each category.
• Translate the data from frequency tables into a
pictorial representation…
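As an illustration of going from a frequency table to bars, the sketch below counts categories and prints one text bar per category. The blood-type data is hypothetical and made up for this example. A plotting library would normally draw the bars; plain text keeps the sketch self-contained.

```python
from collections import Counter

# Hypothetical categorical observations, made up for this illustration
blood_types = ["A", "O", "B", "O", "A", "O", "AB", "O", "A", "B"]

# Frequency table: number of observations in each category
freq = Counter(blood_types)
n = len(blood_types)

# One bar per category; bar length is the frequency, proportion shown alongside
for category, count in sorted(freq.items()):
    print(f"{category:>2} | {'#' * count} {count} ({count / n:.0%})")
```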
Histogram
• Used to visualize distribution (shape, center, range,
variation) of continuous variables
• “Bin size” is important
Effect of Bin Size on Histogram
• Simulated data: 1000 draws from N(0,1) and 500 draws from N(1,1)
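A rough sketch of the simulation described above, counting points per bin for a few bin widths. The seed and the three widths are arbitrary choices for illustration, not values from the lecture, and bin_counts is a hypothetical helper.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed, only for reproducibility

# Mixture described above: 1000 draws from N(0,1) and 500 draws from N(1,1)
sample = ([random.gauss(0, 1) for _ in range(1000)]
          + [random.gauss(1, 1) for _ in range(500)])

def bin_counts(values, width):
    """Count values per bin; bin k covers [k * width, (k + 1) * width)."""
    return Counter(int(v // width) for v in values)

for width in (0.1, 0.5, 2.0):  # illustrative bin widths
    counts = bin_counts(sample, width)
    print(f"bin width {width}: {len(counts)} non-empty bins, "
          f"largest bin holds {max(counts.values())} points")
```

Narrow bins give many sparsely filled bars and a noisy shape; wide bins give a few tall bars that can hide the shape of the distribution.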
More on Histograms
• What’s the difference between a frequency
histogram and a density histogram?
[Figure panels: Frequency Histogram, Density Histogram]
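One common way to state the difference, sketched under the assumption that this is the convention the lecture intends: in a frequency histogram the bar height is the count in each bin, while in a density histogram the height is count / (n × bin width), so the bar areas sum to 1. The bin width of 5 years is an illustrative choice.

```python
from collections import Counter

ages = [17, 19, 21, 22, 23, 23, 23, 38]
width = 5  # illustrative bin width in years
n = len(ages)

# Frequency histogram: bar height is the count in each bin (keyed by left edge)
freq = Counter((a // width) * width for a in ages)

# Density histogram: bar height is count / (n * width)
density = {left: count / (n * width) for left, count in freq.items()}

for left in sorted(freq):
    print(f"[{left}, {left + width}): frequency={freq[left]}, "
          f"density={density[left]:.3f}")

# Total area of the density bars (height times width, summed over bins)
total_area = sum(height * width for height in density.values())
print(total_area)
```

Both histograms have the same shape; only the vertical scale changes.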
Box Plots
[Figure: box plot of AGE (y-axis: Years, 0.0 to 100.0; x-axis: Variables), with minimum, Q1, median, Q3, maximum, and IQR labeled]
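The quantities labeled on the box plot can be sketched as a five-number summary. This uses the ages example from earlier in these notes rather than the data behind the figure, and it assumes the "inclusive" quartile method, one of several conventions.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# Assumption: quartiles from the "inclusive" interpolation method
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Minimum, Q1, median, Q3, maximum: the values a box plot like this one displays
five_number_summary = (min(ages), q1, median, q3, max(ages))
print(five_number_summary, iqr)
```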
Bivariate Data