Lecture 1ASADA Descriptive Stats
16/10/2023
The class sessions will not be recorded. All the materials will be posted
on Moodle.
No late registrations are allowed, either for the course or the exam.
Figure: https://www.google.de/books/edition/Even_You_Can_Learn_Statistics_and_Analyt/5y2tBQAAQBAJ?hl=en&gbpv=1
Oksana Chernova, Ph.D. Descriptive Statistics 16/10/2023 3 / 33
Overview
1 Introduction
3 Measures of Dispersion
5 Descriptive Statistics in R
A sample is the specific group that you will collect data from.
Figure: https://www.omniconvert.com/what-is/sample-size/
Figure: https://www.questionpro.com/blog/population-vs-sample/
The market researcher analyzes the data and finds that 61% of survey
respondents are willing to switch their regular drink to something new.
What is the 61% referred to as?
a) Parameter
b) Statistic
c) Sampling error
d) Standard error
Answer: b) Statistic. The 61% is computed from the survey respondents,
who are a sample, so it is a statistic; a parameter would describe the
whole population.
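To make the distinction between a parameter and a statistic concrete, here is a minimal Python sketch. The population of 10,000 consumers, the 60% true share, and the sample size of 500 are all made-up assumptions for illustration; they are not taken from the survey in the question.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: True means "willing to switch drinks".
population = [True] * 6000 + [False] * 4000

# Parameter: a number describing the whole population.
parameter = sum(population) / len(population)

# Statistic: the same quantity computed from a sample only.
sample = random.sample(population, 500)
statistic = sum(sample) / len(sample)

# Sampling error: the gap between the statistic and the parameter.
sampling_error = statistic - parameter

print(f"parameter = {parameter:.3f}")
print(f"statistic = {statistic:.3f}")
print(f"sampling error = {sampling_error:+.3f}")
```

In practice the parameter is unknown, which is why the sample statistic is used as its estimate.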
To gain intuition for any data set, one can use numerical summary
measures.
There are three main measures of central tendency: the mean, the
median, and the mode.
         sample                           population
size     $n$                              $N$
mean     $\bar{x} = \frac{\sum x}{n}$     $\mu = \frac{\sum x}{N}$
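The median is the middle value of the ordered data. When the number of
observations is even, it is the average of the two middle values.
The mode is the value that occurs with the highest frequency in a data
set.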
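The three measures can be compared on a small example. The sketch below uses Python's standard `statistics` module; the eight data values are invented for illustration.

```python
import statistics

# Hypothetical sample of eight observations (made up for illustration).
data = [2, 3, 3, 5, 7, 8, 9, 11]

mean = statistics.mean(data)      # sum of the values divided by n
median = statistics.median(data)  # n is even: average of the two middle values (5 and 7)
mode = statistics.mode(data)      # value with the highest frequency (3 appears twice)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```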
We also need a measure that can provide some information about the
variation among data values.
The range generally gives you a good indicator of variability when you
have a distribution without extreme values.
But the range can be misleading when you have outliers in your data
set.
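The effect of a single outlier on the range can be sketched as follows. The data are hypothetical, and the interquartile range is computed with `statistics.quantiles` using its default quartile convention, which is only one of several conventions in use.

```python
import statistics

def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range Q3 - Q1 (default convention of statistics.quantiles)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical data without extreme values, then with one outlier added.
typical = [48, 50, 51, 52, 53, 55, 56]
with_outlier = typical + [120]

print("range without outlier:", value_range(typical))
print("range with outlier:   ", value_range(with_outlier))
print("IQR without outlier:  ", iqr(typical))
print("IQR with outlier:     ", iqr(with_outlier))
```

One extreme value inflates the range from 8 to 72, while the interquartile range barely moves.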
Figure: https://www.scribbr.com/statistics/interquartile-range/
Variance and Standard Deviation
(deviation)
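The variance is the sum of squared deviations from the mean, divided by
the number of observations (N for a population, n − 1 for a sample). The
variance for population data is

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

and the variance calculated for sample data is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

The standard deviation is the positive square root of the variance. The
population standard deviation is

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

and the sample standard deviation is

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$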
The quantity $x_i - \mu$ or $x_i - \bar{x}$ in the above formulas is called the
deviation of the $x_i$ value from the mean.
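The formulas can be checked step by step on a small data set. This is a minimal Python sketch with five invented values; the same numbers are treated once as a whole population (divide by N) and once as a sample (divide by n − 1).

```python
import math
import statistics

# Hypothetical data (made up for illustration).
data = [4, 8, 6, 5, 7]
n = len(data)

mean = sum(data) / n                                 # 30 / 5 = 6.0
squared_deviations = [(x - mean) ** 2 for x in data]  # 4, 4, 0, 1, 1
sum_of_squares = sum(squared_deviations)             # 10.0

population_variance = sum_of_squares / n       # divide by N
sample_variance = sum_of_squares / (n - 1)     # divide by n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library implementations.
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))

print("population variance:", population_variance, "sd:", round(population_sd, 4))
print("sample variance:    ", sample_variance, "sd:", round(sample_sd, 4))
```

The sample variance is always the larger of the two for the same data, because the sum of squares is divided by a smaller number.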
Figure: https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/
Figure: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
A rectangle is drawn from the lower quartile Q1 (i.e., the 1st quartile)
to the upper quartile Q3 (the 3rd quartile), calculated from the data.
The line inside the rectangle represents the median.
The whiskers extending from the box mark the range of data that is
not considered outliers. The upper whisker corresponds to the largest
non-outlier, and the lower one to the smallest. Each individual data
point outside this range is depicted as a separate point and is
considered an outlier.
Why 1.5? (Points more than 1.5 × IQR below Q1 or above Q3 are treated
as outliers.)
John W. Tukey: Because 1 is too small and 2 is too large.
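The box plot construction described above can be sketched numerically. The ten data values are hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`; other software may use a slightly different quartile convention and therefore give slightly different fences.

```python
import statistics

# Hypothetical data (made up for illustration); 95 is a suspiciously large value.
data = [52, 55, 57, 58, 60, 61, 63, 64, 66, 95]

# Quartiles: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the box counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers end at the smallest and largest non-outlier values.
lower_whisker = min(inside)
upper_whisker = max(inside)

print("Q1, median, Q3:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("whiskers:", lower_whisker, upper_whisker)
print("outliers:", outliers)
```

Note that the whiskers stop at actual data points, not at the fences themselves.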
http://www.r-project.org/
https://posit.co/downloads/
A survey of adult heights was conducted, and the results have been
compiled in the file Height_Survey.csv, where each row includes
observational data for height in centimeters (Height_cm) and gender
(Sex).
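A per-group summary of such a file might look like the sketch below. The lecture itself works in R; this sketch uses the Python standard library only to illustrate the computation. The contents of Height_Survey.csv are not shown here, so the six rows and the category labels "Female" and "Male" are invented stand-ins; only the column names Height_cm and Sex come from the description above.

```python
import csv
import io
import statistics
from collections import defaultdict

# Stand-in for Height_Survey.csv: these rows are invented for illustration.
# With the real file, pass open("Height_Survey.csv", newline="") instead.
sample_csv = """Height_cm,Sex
162,Female
170,Female
158,Female
180,Male
175,Male
185,Male
"""

def summarize_by_sex(handle):
    """Return n, mean, median and sample sd of Height_cm for each value of Sex."""
    heights = defaultdict(list)
    for row in csv.DictReader(handle):
        heights[row["Sex"]].append(float(row["Height_cm"]))
    return {
        sex: {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "sd": statistics.stdev(values),  # sample sd, divides by n - 1
        }
        for sex, values in heights.items()
    }

summary = summarize_by_sex(io.StringIO(sample_csv))
for sex, stats in summary.items():
    print(sex, stats)
```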