Applied Statistics and Data Analysis

Descriptive Statistics Review

Oksana Chernova, Ph.D.


Technical University of Munich

16/10/2023

Oksana Chernova, Ph.D. Descriptive Statistics 16/10/2023 1 / 33


Applied Statistics and Data Analysis [CIT5130001] is offered in both
semesters:
in Freising during winter semesters,
at Garching Forschungszentrum during summer semesters.

The class sessions will not be recorded. All the materials will be posted
on Moodle.

Your final course grade will be determined solely by your performance
in the final exam; no grade bonuses are available.

Exam: 12.02.2024 (registration until 15.01.2024). The retake takes place in the next
semester.

No late registrations are allowed, either for the course or the exam.

Only emails from a TUM address will be read.


Motivation

Figure: https://www.google.de/books/edition/Even_You_Can_Learn_Statistics_and_Analyt/5y2tBQAAQBAJ?hl=en&gbpv=1

Overview

1 Introduction

2 Measures of Central Tendency

3 Measures of Dispersion

4 Graphical Data Analysis

5 Descriptive Statistics in R



Introduction

Applied statistics can be divided into two areas:

descriptive statistics (methods for organizing, displaying, and
describing data by using tables, graphs, and summary measures)
inferential statistics (methods that use sample results to
make decisions or predictions about a population)

Today we give a gentle introduction to univariate descriptive statistics.



Population vs sample

A population is the entire group that you want to draw conclusions
about.

A sample is the specific group that you will collect data from.

Figure: https://www.omniconvert.com/what-is/sample-size/

Figure: https://www.questionpro.com/blog/population-vs-sample/


Example 1.

A market researcher surveys 85 people on their coffee-drinking habits.
The aim is to know whether people in the local region are willing to
switch their regular drink to something new. What is the sample?
The sample is the 85 people surveyed, while the population is all the
people in the local region.


Example 2.

The market researcher analyzes the data and finds that 61% of survey
respondents are willing to switch their regular drink to something new.
What is the 61% referred to as?
a) Parameter
b) Statistic
c) Sampling error
d) Standard error


Answer: b) Statistic

The 61% is referred to as a statistic because it is a measure computed from
the sample. A parameter, by contrast, describes the whole population.


Measures of Central Tendency

To gain intuition for any data set, one can use numerical summary
measures.

There are three main measures of central tendency: the mean, the
median, and the mode.


Mean

$\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of all values}}$

The arithmetic mean calculated for sample data is denoted by $\bar{x}$,
and the mean for population data is denoted by $\mu$.

        sample                          population
size    $n$                             $N$
mean    $\bar{x} = \frac{\sum x}{n}$    $\mu = \frac{\sum x}{N}$
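
As a small illustration of the formula above, the following R sketch computes a sample mean by hand and with the built-in mean(). The data vector is hypothetical and not taken from the lecture.

```r
# Hypothetical sample (illustrative values, not from the lecture)
x <- c(4, 8, 15, 16, 23)

n <- length(x)                # sample size n
x_bar_manual <- sum(x) / n    # sum of all values / number of all values
x_bar_builtin <- mean(x)      # built-in arithmetic mean

print(c(manual = x_bar_manual, builtin = x_bar_builtin))
```

Both numbers should agree; the manual version only spells out what mean() does for a plain numeric vector without missing values.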

Median
The median is the value of the middle term in a data set that has been
ranked in increasing order.
Odd sample size: the median is the single middle value.
Even sample size: the median is the average of the two middle values.
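
The odd and even cases can be sketched in R as follows. The helper median_manual() and both data vectors are hypothetical names and values introduced only for this illustration; in practice the built-in median() would be used.

```r
# Median by the definition on this slide (assumes no missing values)
median_manual <- function(x) {
  s <- sort(x)                        # rank the data in increasing order
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even n: average of the two middle values
  }
}

odd_sample  <- c(7, 1, 5, 3, 9)       # hypothetical data, n = 5
even_sample <- c(7, 1, 5, 3, 9, 11)   # hypothetical data, n = 6

print(median_manual(odd_sample))      # should match median(odd_sample)
print(median_manual(even_sample))     # should match median(even_sample)
```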



Mean vs Median
Unlike the mean, the median is hardly influenced by outliers. Consequently, the median is
preferred over the mean as a measure of central tendency for data sets
that contain outliers.
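
To see the effect of a single outlier, one can compare both measures before and after adding an extreme value. The numbers below are made up for illustration (think of salaries in thousands).

```r
# Hypothetical values; one extreme observation is appended afterwards
salaries <- c(30, 32, 35, 38, 40)
with_outlier <- c(salaries, 400)

comparison <- rbind(
  original     = c(mean = mean(salaries),     median = median(salaries)),
  with_outlier = c(mean = mean(with_outlier), median = median(with_outlier))
)
print(comparison)
```

In this toy data the mean moves from 35 to roughly 95.8, while the median only moves from 35 to 36.5.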



Mode

The mode represents the most common value in a data set, that is, the
value that occurs with the highest frequency.
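
Base R has no built-in function for the statistical mode (the function mode() reports the storage type of an object instead). A minimal sketch with a hypothetical helper, mode_values(), that returns all most frequent values could look like this:

```r
# Hypothetical helper: returns every value that attains the highest frequency
mode_values <- function(x) {
  ux <- unique(x)                      # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))     # frequency of each distinct value
  ux[counts == max(counts)]
}

print(mode_values(c(2, 3, 3, 5, 7, 3, 5)))           # one mode
print(mode_values(c("a", "b", "a", "b", "c")))       # two modes (a tie)
```

Returning all tied values is a design choice made here; other conventions, such as reporting only the first mode, are equally possible.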

Figure: The Flaw of Averages; Sam L. Savage, 2009
The measures of central tendency do not reveal the whole picture of the
distribution of a data set.
Measures of Dispersion

We also need a measure that can provide some information about the
variation among data values.

Therefore, to get the full picture we need to consider both kinds of measures:
central tendency and dispersion.


Range

Range = Largest value − Smallest value

The range generally gives you a good indicator of variability when you
have a distribution without extreme values.

But the range can be misleading when you have outliers in your data
set.
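
A short R sketch of both situations, using made-up data. Note that range() in R returns the smallest and largest value, so the range in the sense of this slide is obtained with diff(range(x)).

```r
x <- c(12, 15, 11, 18, 14)     # hypothetical data without extreme values
x_out <- c(x, 95)              # the same data plus one outlier

print(range(x))                # smallest and largest value
range_x <- diff(range(x))      # Range = largest value - smallest value
range_out <- diff(range(x_out))

print(c(without_outlier = range_x, with_outlier = range_out))
```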



IQR
The interquartile range gives the range of the middle half of a data set:
IQR = Q3 − Q1,
Q1 = 1st quartile or 25th percentile,
Q3 = 3rd quartile or 75th percentile.

Figure: https://www.scribbr.com/statistics/interquartile-range/
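
In R the quartiles can be obtained with quantile() and the interquartile range with IQR(). The data below are hypothetical. Several definitions of sample quantiles exist (the type argument of quantile()); this sketch relies on R's default, so other software may give slightly different quartiles for the same data.

```r
x <- c(2, 4, 4, 5, 7, 8, 10, 12, 15)   # hypothetical data, n = 9

q1 <- unname(quantile(x, 0.25))   # 1st quartile (25th percentile)
q3 <- unname(quantile(x, 0.75))   # 3rd quartile (75th percentile)
iqr_manual <- q3 - q1             # IQR = Q3 - Q1
iqr_builtin <- IQR(x)             # built-in, same default quantile definition

print(c(Q1 = q1, Q3 = q3, IQR = iqr_manual))
```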

Variance and Standard Deviation

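The variance is based on the sum of squared deviations from the mean: this
sum is divided by N for population data and by n − 1 for sample data. The
variance for population data is
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
and the variance calculated for sample data is
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$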

The standard deviation for population data is
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
and the standard deviation for sample data is
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
The quantity $x_i - \mu$ or $x_i - \bar{x}$ in the above formulas is called the
deviation of the $x_i$ value from the mean.
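
The difference between the two denominators can be checked numerically. In the sketch below the data are hypothetical; R's var() and sd() use the sample formulas with n − 1, so the population versions are computed by hand.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)        # hypothetical data
n <- length(x)

deviations <- x - mean(x)             # deviation of each value from the mean
ss <- sum(deviations^2)               # sum of squared deviations

var_sample     <- ss / (n - 1)        # s^2, same as var(x)
var_population <- ss / n              # sigma^2, treating x as a whole population
sd_sample      <- sqrt(var_sample)    # s, same as sd(x)
sd_population  <- sqrt(var_population)

print(c(var_sample = var_sample, var_population = var_population,
        sd_sample = sd_sample, sd_population = sd_population))
```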



                     sample                                         population
size                 $n$                                            $N$
mean                 $\bar{x} = \frac{\sum x}{n}$                   $\mu = \frac{\sum x}{N}$
variance             $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$   $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
standard deviation   $s = \sqrt{s^2}$                               $\sigma = \sqrt{\sigma^2}$


The value of the standard deviation tells how closely the values of a
data set are clustered around the mean.
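
As an illustration, two hypothetical data sets with the same mean can have very different standard deviations. The values below are invented for this sketch.

```r
tight <- c(48, 49, 50, 51, 52)   # values clustered closely around 50
wide  <- c(10, 30, 50, 70, 90)   # values spread far from 50

spread <- rbind(
  tight = c(mean = mean(tight), sd = sd(tight)),
  wide  = c(mean = mean(wide),  sd = sd(wide))
)
print(spread)
```

Both means equal 50, so the mean alone cannot distinguish the two data sets; the standard deviation can.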

Figure: https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/


Graphical Data Analysis

We've covered statistics that provide a summary of data using a single
value to describe either its central tendency or its variability. Exploring
the distribution of data is also valuable. To do this, you can use:
Boxplot
Histogram
Frequency table
Density plot



Boxplot

A boxplot, or a box-and-whisker plot, summarizes a data set visually
using a five-number summary: lowest value, Q1, median, Q3, highest
value.

Figure: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51


Boxplot (Explanation)

A rectangle is drawn from the lower quartile Q1 (i.e., the 1st quartile)
to the upper quartile Q3 (the 3rd quartile), calculated from the data.
The line inside the rectangle represents the median.

The whiskers extending from the box mark the range of data that is
not considered outliers. The upper whisker corresponds to the largest
non-outlier, and the lower one to the smallest. Each individual data
point outside this range is depicted as a separate point and is
considered an outlier.



Outliers

Observations that do not lie in
[Q1 − 1.5 × IQR ; Q3 + 1.5 × IQR]
are potential outliers.

Why 1.5?
John W. Tukey: Because 1 is too small and 2 is too large.
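
The 1.5 × IQR rule can be sketched in R as follows. The data are hypothetical, and the quartiles come from quantile() with its default definition; R's boxplot() computes its box edges as hinges, which may differ slightly from these quartiles for some sample sizes.

```r
x <- c(2, 4, 4, 5, 7, 8, 10, 12, 15, 40)   # hypothetical data with one extreme value

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr   # Q1 - 1.5 x IQR
upper_fence <- q3 + 1.5 * iqr   # Q3 + 1.5 x IQR

# Observations outside [lower_fence, upper_fence] are potential outliers
potential_outliers <- x[x < lower_fence | x > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(potential_outliers)
```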



Histogram
Observe $(X_1, \dots, X_n)$. We define an interval $[a, b]$ that encompasses all
observed values. This interval is divided into $K$ subintervals $A_1, \dots, A_K$,
each with the same width $h = (b - a)/K$. With $t_i = a + ih$, $i = 0, \dots, K$, set
$A_1 = [t_0, t_1]$ and $A_i = (t_{i-1}, t_i]$ for $i = 2, \dots, K$. Define
$n_i = \sum_{j=1}^{n} I\{X_j \in A_i\}$
This represents the number of observations falling within interval $A_i$.

The quantity $n_i$ is known as the absolute frequency of interval $A_i$
within the sample. The value $\nu_i = n_i / n$ is the relative frequency:
$\nu_i = \frac{1}{n} \sum_{j=1}^{n} I\{X_j \in A_i\} \approx P\{X_1 \in A_i\} = \int_{t_{i-1}}^{t_i} f(t)\,dt$,
where $f$ denotes the probability density of the observations.
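
The construction can be reproduced step by step in R and compared with hist(). The data, the interval [a, b] and the number of bins K below are arbitrary choices made for this sketch. With its default settings, hist() also uses right-closed bins with the first bin closed on the left, and with freq = FALSE it plots the relative frequency divided by the bin width, so that the bars approximate the density.

```r
x <- c(1.2, 2.5, 2.7, 3.1, 3.8, 4.0, 4.4, 5.9, 6.3, 7.5)  # hypothetical sample
a <- 0; b <- 8; K <- 4          # assumed interval [a, b] and number of bins
h <- (b - a) / K                # common bin width
breaks_t <- a + (0:K) * h       # t_0, ..., t_K

# Absolute frequencies n_i: count the X_j in A_i = (t_{i-1}, t_i], with A_1 closed on the left
n_i <- sapply(1:K, function(i) {
  above_lower <- if (i == 1) x >= breaks_t[i] else x > breaks_t[i]
  sum(above_lower & x <= breaks_t[i + 1])
})
nu_i <- n_i / length(x)         # relative frequencies

# Same bins via hist(), without drawing anything
h_obj <- hist(x, breaks = breaks_t, plot = FALSE)

print(rbind(n_i = n_i, hist_counts = h_obj$counts))
print(rbind(nu_i_over_h = nu_i / h, hist_density = h_obj$density))
```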



R and RStudio

http://www.r-project.org/

Once you have installed R you are ready to go; however, we highly
recommend installing RStudio as well. RStudio is an open-source
integrated development environment for R, which includes a console, a
syntax-highlighting editor that supports direct code execution, as well
as tools for plotting, history, debugging and workspace management.

https://posit.co/downloads/

Getting started - Installing R and RStudio:
https://www.geo.fu-berlin.de/en/v/soga-r/Introduction-to-R/Getting-Started/index.html


Descriptive Statistics in R



Height survey example

A survey of adult heights was conducted, and the results have been
compiled in the file Height_Survey.csv, where each row includes
observational data for height in centimeters (Height_cm) and gender
(Sex).
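
A minimal sketch of how such a file might be read and split by gender. The real Height_Survey.csv is not reproduced here, so the sketch first writes a tiny stand-in file with invented values to a temporary directory; the labels "Female" and "Male" in the Sex column are an assumption about the coding.

```r
# Stand-in for the real file: invented values, assumed Sex labels
csv_path <- file.path(tempdir(), "Height_Survey.csv")
toy <- data.frame(
  Height_cm = c(158, 162, 165, 167, 170, 175, 178, 180, 183),
  Sex = c(rep("Female", 5), rep("Male", 4))
)
write.csv(toy, csv_path, row.names = FALSE)

survey <- read.csv(csv_path)        # with the real data, point this at the actual file
str(survey)                         # check column names and types

# Heights of the female and male respondents as separate vectors
height.f <- survey$Height_cm[survey$Sex == "Female"]
height.m <- survey$Height_cm[survey$Sex == "Male"]

print(summary(height.f))            # min, quartiles, median, mean, max
print(c(sd_f = sd(height.f), sd_m = sd(height.m)))
```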

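# height.f is assumed to hold the Height_cm values of the female respondents
hist(height.f, freq = FALSE)                     # histogram on the density scale
lines(density(height.f), lwd = 3, col = 'red')   # overlay a kernel density estimate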



Boxplot
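
The plot from this slide is not preserved. A grouped boxplot of height by gender could be sketched as follows; the data frame is invented, and the Sex labels are assumptions. The call uses plot = FALSE so that only the underlying numbers are returned; dropping that argument draws the plot.

```r
# Invented stand-in for the survey data (assumed column names and Sex labels)
survey <- data.frame(
  Height_cm = c(158, 162, 165, 167, 170, 172, 190,
                170, 175, 178, 180, 183, 186, 189),
  Sex = rep(c("Female", "Male"), each = 7)
)

# One box per gender; plot = FALSE returns the statistics without drawing
bp <- boxplot(Height_cm ~ Sex, data = survey, plot = FALSE)

print(bp$names)   # group labels
print(bp$stats)   # per group: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$out)     # points that would be drawn individually as outliers
```

In this toy data the value 190 in the first group lies above the upper whisker limit and would be drawn as a separate point.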



References

∗ Hartmann, K., Krois, J., Waske, B. (2018): E-Learning Project
SOGA: Statistics and Geospatial Data Analysis. Department of
Earth Sciences, Freie Universitaet Berlin.
https://www.geo.fu-berlin.de/en/v/soga-r/Basics-of-statistics/index.html
∗ http://flawofaverages.com/
∗ https://www.scribbr.com/statistics/descriptive-statistics/
∗ https://towardsdatascience.com/understanding-descriptive-statistics-c9c2b0641291
