
MDS 202: Data Science II with R

Lectures 4-6: What are moments in Statistics?

Dr. Shatrughan Singh (Amity University Rajasthan, Jaipur; [email protected])

Week 2 (13-17 February) 2023

1 LEARNING OBJECTIVES
1.1 What are you expected to learn from this lecture?
• Definitions and notions of statistics in data science.
• What are the moments in statistics?
• How to formulate and calculate these in R?

2 Moments in Statistics
Moments play a vital role in statistics, especially when we work with the probability distribution of data. With the help of moments, we can describe the properties of a statistical distribution. Further, statistical moments are required when conducting "Statistical Estimation" and "Hypothesis Testing", both of which are based on numerical values derived for each distribution. In conclusion, moments are mainly used to describe the characteristics of a distribution. Let's assume the random variable of our interest is X; then the moments are defined as the expected values of powers of X.
For example, $E(X^1)$, $E(X^2)$, $E(X^3)$, $E(X^4)$, etc.

2.1 What is the use of Moments in Statistics?


• These are very useful in statistics because they describe the data.
• The four commonly used moments in statistics are the mean (or median, mode), variance (or standard deviation), skewness, and kurtosis.
• They are the basic characteristics of a distribution and can be used to compare different datasets.

2.1.1 The First Moment


• The first moment is the expected value, also known as the expectation, the mathematical expectation, the mean, or the average.
• It represents the central point of the data set.
• Median: the middle value of a dataset after arranging the data in ascending order.
• Mode: the most frequently occurring value in a given dataset.

2.1.1.1 R Commands to compute the first moments


• mean()



• median()
• There is no direct command to compute the mode in R. However, one can create a function if one needs to compute the mode of a given dataset, as in the sketch below.
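A minimal sketch of all three first-moment measures; the get_mode() helper is our own name, since base R's mode() returns an object's storage type, not the statistical mode:

```r
# Our own helper: the most frequent value in a vector
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

x <- c(2, 4, 4, 7, 9, 4, 7)
mean(x)       # 5.285714
median(x)     # 4 (middle value of the sorted data)
get_mode(x)   # 4 (occurs three times)
```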

2.1.2 The Second Moment


• The second central moment is the variance.
• It represents the spread of the values in the distribution, i.e., how far the values lie from the mean.
• Variance represents how a set of data points is spread out around its mean value.

Figure 1: Example: Variance calculation in Excel.

STANDARD DEVIATION
The standard deviation is simply the square root of the variance. Since the variable X and its standard deviation have the same units, the standard deviation is the more commonly used measure.
For instance, in a normal distribution the data points are divided across standard deviations as follows:

Figure 2: Standard deviations in a Normal Distribution.

– 68.27% of the data points lie within 1 standard deviation of the mean
– 95.45% of the data points lie within 2 standard deviations of the mean
– 99.73% of the data points lie within 3 standard deviations of the mean
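These percentages can be checked in R directly from the normal cumulative distribution function:

```r
# Fraction of a normal distribution lying within k standard deviations of the mean
for (k in 1:3) {
  p <- pnorm(k) - pnorm(-k)
  cat(sprintf("Within %d SD: %.2f%%\n", k, 100 * p))
}
# Within 1 SD: 68.27%
# Within 2 SD: 95.45%
# Within 3 SD: 99.73%
```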

2.1.3 The Third Moment


• The third statistical moment is Skewness.



• It measures how asymmetric the distribution is about its mean.
We can differentiate three types of distribution with respect to its skewness:
1. Symmetrical distribution
If both tails of a distribution mirror each other, the skewness is equal to zero and the distribution is symmetrical.
2. Positively Skewed
In these types of distributions, the right tail (with larger values) is longer. So, this also tells us about
outliers that have values higher than the mean. Sometimes, this is also referred to as:
– Right-skewed

– Right-tailed

– Skewed to the Right


3. Negatively skewed
In these types of distributions, the left tail (with small values) is longer. So, this also tells us about outliers
that have values lower than the mean. Sometimes, this is also referred to as:
– Left-skewed

– Left-tailed

– Skewed to the Left


This can be further seen from the below figure.

Figure 3: Types of skewness.

For instance, a normal distribution is symmetric, so its skewness equals 0.
In general, Skewness will impact the relationship of mean, median, and mode in the following manner:
– For a Symmetrical distribution: Mean = Median = Mode



– For a positively skewed distribution: Mode < Median < Mean (large tail of high values)

– For a negatively skewed distribution: Mean < Median < Mode (large tail of small values)
An alternative formula for calculating skewness:

$$Skewness = \frac{Mean - Mode}{SD} = \frac{3\,(Mean - Median)}{SD}$$

because, empirically, $Mode = 3 \cdot Median - 2 \cdot Mean$.
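Skewness is straightforward to compute in R from its definition as the average cubed standardized deviation; a minimal sketch (the skew() helper is ours, and is equivalent to skewness() from the 'moments' package):

```r
# Moment-based skewness: average of the standardized data cubed
skew <- function(x) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))   # population-style SD used in the moment definition
  mean(((x - m) / s)^3)
}

set.seed(1)
skew(rexp(10000))    # ~2: exponential data are strongly right-skewed
skew(rnorm(10000))   # ~0: normal data are symmetric
```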


If the given data are not normally distributed, we can transform the data to make their distribution approximately normal.

2.1.3.1 Reasons to transform data


• To more closely approximate a theoretical distribution that has nice statistical properties
• To spread data out more evenly
• To make data distributions more symmetrical
• To make relationships between variables more linear
• To make data more constant in variance (homoscedastic)

2.1.3.2 Ladder of powers A useful organizing concept for data transformations is the ladder of powers
(P.F. Velleman and D.C. Hoaglin, Applications, Basics, and Computing of Exploratory Data Analysis, 354
pp., Duxbury Press, 1981).
Data transformations are commonly power transformations, $x' = x^{\theta}$ (where $x'$ is the transformed $x$). One can visualize these as a continuous series of transformations:

θ       transformation   name
3       x^3              cube
2       x^2              square
1       x^1              identity (no transformation)
1/2     x^0.5            square root
1/3     x^(1/3)          cube root
0       log(x)           logarithmic (holds the place of zero)
-1/2    -1/x^0.5         reciprocal root
-1      -1/x             reciprocal
-2      -1/x^2           reciprocal square

NOTE:
• higher and lower powers can be used
• fractional powers (other than those shown) can be used
• minus sign in reciprocal transformations can (optionally) be used to preserve the order (relative ranking)
of the data, which would otherwise be inverted by transformations for 𝜃 < 0.
To use the ladder of powers, visualize the original, untransformed data as starting at 𝜃 = 1. Then if the data
are right-skewed (clustered at lower values) move down the ladder of powers (that is, try square root, cube
root, logarithmic, etc. transformations). If the data are left-skewed (clustered at higher values) move up the
ladder of powers (cube, square, etc.).
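A brief R sketch of moving down the ladder for right-skewed data; the lognormal sample is simulated purely for illustration:

```r
set.seed(42)
x <- rlnorm(1000)   # simulated right-skewed (lognormal) data

# Moving down the ladder (theta = 1 -> 1/2 -> 0) to reduce right skew
par(mfrow = c(1, 3))
hist(x,       main = "theta = 1 (identity)")
hist(sqrt(x), main = "theta = 1/2 (square root)")
hist(log(x),  main = "theta = 0 (log)")   # lognormal data become normal
```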

2.1.4 The Fourth Moment


• The fourth statistical moment is Kurtosis.
• It measures the weight of the tails of the distribution, i.e., how prone the distribution is to producing outliers.



• It focuses on the tails of the distribution and indicates whether the distribution is flat or sharply peaked. This measure tells us whether our distribution is richer in extreme values than the normal distribution.
For a normal distribution, the accepted value of Kurtosis is equal to 3.
However, for Kurtosis not equal to 3, there are the following scenarios:
• Kurtosis < 3 (lighter tails): negative excess kurtosis indicates a broad, flat distribution.
• Kurtosis > 3 (heavier tails): positive excess kurtosis indicates a thin, pointed distribution.
In general, we can differentiate three types of distributions based on the kurtosis:
1. Mesokurtic
These distributions have a kurtosis of 3 (or an excess kurtosis of 0). This category includes the normal distribution and some specific binomial distributions.
2. Leptokurtic
These distributions have a kurtosis greater than 3 (or an excess kurtosis greater than 0). They have fatter tails and a narrower peak.
3. Platykurtic
These distributions have a kurtosis smaller than 3 (or a negative excess kurtosis). They have very thin tails compared to the normal distribution.

Figure 4: Types of kurtosis.

Now, let's define excess kurtosis:

$$Excess\ Kurtosis = Kurtosis - 3$$
Understanding kurtosis in relation to outliers:
• Kurtosis is defined as the average of the standardized data raised to the fourth power. Standardized values with absolute value less than 1 (i.e., data within one standard deviation of the mean) contribute little to the kurtosis.
• The standardized values that contribute immensely are the outliers.
• Therefore, a high value of kurtosis alerts us to the presence of outliers.
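That definition translates directly into a few lines of R; a minimal sketch (the kurt() helper is ours, equivalent to kurtosis() from the 'moments' package):

```r
# Moment-based kurtosis: average of the standardized data raised to the 4th power
kurt <- function(x) {
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
  mean(z^4)
}

set.seed(7)
kurt(rnorm(1e5))        # ~3: mesokurtic
kurt(rt(1e5, df = 5))   # > 3: leptokurtic (heavy tails, prone to outliers)
kurt(runif(1e5))        # ~1.8, i.e. < 3: platykurtic (thin tails)
```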

3 DATA
The word 'data' originates from the Latin word 'datum'. Data is defined as a collection of facts in the form of numbers, words, measurements, observations, etc., that can be translated into some numeric form for computer processing. Data is divided broadly into two classes: Categorical (Qualitative) and Numerical (Quantitative). Different types of data and their classification are shown in the figure below.

Figure 5: Types of Data commonly known.

Note to Remember:
1. Population vs. Sample
• Population: total number of all possible specimens (N) in a study.
– Example:
∗ pebbles on a beach
∗ human population living in a city/state/nation
• Sample: a subset of elements/specimens (n) taken from a population (N). Statistically - a subset of
a population (group of data).
– Example:
∗ human population in a colony/area for a given city and so on …
∗ samples can be representative or biased

Figure 6: Samples are drawn from the population; a sample may or may not be representative of the population.

2. Parameter vs. Statistic


• Parameter: a property of a population that usually cannot be measured directly but must be estimated by a statistic; it describes a characteristic of a population. Represented as Mean = $\mu$; Std. Dev. = $\sigma$.
• Statistic: something calculated from a set of data (a sample); it describes a characteristic of a sample. Represented as Mean = $\bar{x}$ or $\hat{\mu}$; Std. Dev. = $s$ ($\hat{\mu}$ is read as 'mu-hat').
3. Accuracy vs. Precision
• Accuracy:
– How close you are to the actual value
– The degree of perfection in measurement — relates to the quality of the result
– You can be precise and not accurate
∗ generally NOT OK
– You can be accurate and not precise
∗ generally OK, but …
• Precision:
– How well you can measure something
– The reproducibility of the observation
– The quality of the operation by which a result is obtained
– The number of decimal points on an analysis

Figure 7: Difference between Accuracy and Precision in data analysis.

4. Response vs. Explanatory variable


• Response: The response variable is the variable that goes on the y–axis of the graph (the ordinate).
It is the variable whose variation we are attempting to understand.
• Explanatory: The explanatory variable goes on the x–axis of the graph (the abscissa). We are
interested in the extent to which variation in the response variable is associated with variation in the
explanatory variable.



Table 2: Keys to understanding the kind of statistical analysis.

1. The explanatory variables (pick one of the rows)
(a) All explanatory variables continuous → Regression
(b) All explanatory variables categorical → Analysis of Variance (ANOVA)
(c) Some explanatory variables continuous and some categorical → Analysis of Covariance (ANCOVA)

2. The response variable (pick one of the rows)
(a) Continuous → Regression, ANOVA, ANCOVA
(b) Proportion → Logistic Regression
(c) Count → Log-Linear Models
(d) Binary → Binary Logistic Analysis
(e) Time at death → Survival Analysis

4 STATISTICS
Statistics is a sub-discipline of mathematics that emphasizes the collection and summarization of data. The study of statistics focuses on the collection/acquisition, organization, analysis, presentation, and interpretation of data.



Types Of Statistics:
Broadly, there are two types of statistics.
• Descriptive
• Inferential
Descriptive Statistics – describes the population (or sample) data using numerical summaries (e.g., mean, median, mode) or graphs/tables.
Example: We are interested in the average weight of students in a class. Using descriptive statistics, we record the weight of each and every student in the class and then calculate the maximum, minimum, and average (mean) weight of the class.
Inferential Statistics – draws inferences and makes predictions about a population based on sample data drawn from that population.
Example: We are interested in the average weight of all the students registered at the University. Using a random sample of the population (a few students out of all students in the University), we calculate the weight of each student in the sample and infer results about the whole population (its average weight).
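Both ideas can be sketched in R; the weights below are simulated for illustration, not real data:

```r
set.seed(10)
# Simulated weights (kg) of all students in the University: the population
population <- rnorm(5000, mean = 62, sd = 8)

# Descriptive statistics: summarize data we fully observe
mean(population); min(population); max(population)

# Inferential statistics: estimate the population mean from a random sample
weights_sample <- sample(population, size = 50)
mean(weights_sample)              # point estimate of the population mean
t.test(weights_sample)$conf.int   # 95% confidence interval for that mean
```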

Figure 8: Broadly defined types of statistics.

4.1 Measures of Central Tendency


Measures of central tendency provide information about the typical or average values of a sample. They describe the center of the data.
1. Mean
→ It is the average of the data values of the sample (n) or population (N): the sum of the observations divided by the total number of observations. The mean is susceptible to outliers; when unusually large values are added to the dataset, the mean gets pulled away from the typical central value.

$$Mean\ (\bar{x}) = \frac{\sum_{i=1}^{n} x_i}{n}$$

where $x_i$ = observations from $i = 1$ to $n$.



$$Mean\ (\mu) = \frac{\sum_{i=1}^{N} x_i}{N}$$

where $x_i$ = observations from $i = 1$ to $N$.

2. Median
→ It is the central value of the sample data, obtained after sorting the data in ascending order. The median is a better alternative to the mean, especially when the dataset is affected by outliers and skewness.
When the total number of data values n is odd:

$$Median = \left[\frac{n+1}{2}\right]^{th} obs.$$

When the total number of data values n is even:

$$Median = \frac{\left[\frac{n}{2}\right]^{th} obs. + \left[\frac{n}{2}+1\right]^{th} obs.}{2}$$
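R's built-in median() handles both cases automatically; a quick check with example vectors of our own:

```r
x_odd <- c(3, 1, 7, 5, 9)   # n = 5 (odd)
median(x_odd)               # 5: the ((5 + 1) / 2) = 3rd sorted value

x_even <- c(3, 1, 7, 5)     # n = 4 (even)
median(x_even)              # 4: average of the 2nd and 3rd sorted values
```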

3. Mode
→ The most frequently occurring data value in a given sample, i.e., the value with the highest frequency in the sample.

Table 3: Valid numeric measures for the main variable types.

Variable Type    Mean   Median   Mode
Nominal          NO     NO       YES
Ordinal          NO     YES      YES
Interval/Ratio   YES    YES      YES

4.2 Measures of Spread/Variability


Measures of spread or variability provide information about variation in the dataset. Typically, the range
(difference between maximum and minimum), interquartile range, standard deviation, or dispersion of data
values of a sample represent the measures of spread.
1. Range
→ It is the measure of spread showing how far apart the values in a given dataset are. It is calculated as:

$$Range = Highest\ value - Lowest\ value$$


2. Interquartile Range (IQR)
→ The quartiles divide the sorted dataset into four equal parts, each containing 25% of the data values. The IQR contains the middle 50% of the data.
• Q1 = the lowest 25% of the data [position 0.25(n+1)].
• Q2 = the next lowest 25% of the data (up to the median); Q1 and Q2 together constitute the lower 50% of the data, bounded by the median [position 0.50(n+1)].
• Q3 = the second highest 25% of the data (above the median) [position 0.75(n+1)].
• Q4 = the highest 25% of the data [position 1.0(n+1)].
IQR is calculated as the difference:

$$IQR = Q3 - Q1$$

Outliers are identified as data points below/above the whiskers of a boxplot:

$$LEFT\ whisker = Q1 - 1.5 \cdot IQR$$

$$RIGHT\ whisker = Q3 + 1.5 \cdot IQR$$
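A small sketch of these calculations in R, with an example vector of our own. Note that R's quantile() defaults to an interpolation rule slightly different from the 0.25(n+1)-style positions given above:

```r
x <- c(2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 40)   # 40 looks like an outlier

q   <- quantile(x, probs = c(0.25, 0.75))      # Q1 and Q3
iqr <- IQR(x)                                  # same as q[2] - q[1]

lower <- q[1] - 1.5 * iqr   # left whisker bound
upper <- q[2] + 1.5 * iqr   # right whisker bound
x[x < lower | x > upper]    # flags 40 as an outlier
boxplot(x)                  # draws the same fences graphically
```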


3. Variance
→ The average of the squared deviations from the mean. The variance for a sample ($s^2$, $\bar{x}$, $n$) and for a population ($\sigma^2$, $\mu$, $N$) is calculated as:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$

NOTE: The population variance divides by the N data points, while the sample variance divides by (n − 1). The (n − 1) is called Bessel's Correction and is used to reduce bias in the estimate.
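R's var() and sd() implement the sample (n − 1) versions; a population version must be computed by hand. A minimal sketch:

```r
x <- c(4, 8, 6, 5, 3)
n <- length(x)

var(x)                           # sample variance: divides by n - 1
sum((x - mean(x))^2) / (n - 1)   # identical, by definition

pop_var <- sum((x - mean(x))^2) / n   # population variance: divides by N
c(sd(x), sqrt(pop_var))          # sample vs. population standard deviation
```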
4. Standard Deviation
→ The square root of the variance. The standard deviation of a sample (s) is calculated as the square root of the sample variance, and that of a population (σ) as the square root of the population variance.


$$s = \sqrt{s^2}$$

$$\sigma = \sqrt{\sigma^2}$$
5. Coefficient of Variation (CV)
→ The coefficient of variation is also called the relative standard deviation. It is the ratio of the standard deviation to the mean of the dataset, and it can be used for comparing two datasets measured on different scales.

$$CV\,(\%) = \frac{Standard\ Deviation}{Mean} \times 100$$

$$CV_{(sample)} = \frac{s}{\bar{x}} \times 100$$

$$CV_{(population)} = \frac{\sigma}{\mu} \times 100$$
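A minimal sketch in R; the cv() helper and the sample vectors are our own:

```r
# Coefficient of variation in percent
cv <- function(x) 100 * sd(x) / mean(x)

heights_cm <- c(162, 170, 158, 175, 168)
weights_kg <- c(55, 72, 60, 80, 65)

cv(heights_cm)   # ~4%: heights vary little relative to their mean
cv(weights_kg)   # ~15%: weights vary far more, relative to their mean
```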



4.3 Measures Of Asymmetry
Skewness and kurtosis are measures of the shape of a distribution in statistics. However, only skewness is discussed here; you are encouraged to read up on 'kurtosis' on your own in any introductory statistics book.
1. Skewness
→ Skewness is the asymmetry of a statistical distribution, where the curve appears distorted (skewed) to the left or to the right (see the figure below). Generally, skewness indicates whether the data are concentrated on the left or the right side of the curve. In some cases, transformation techniques are useful for reshaping skewed data for analysis purposes.
Example: Right-skewed data can be transformed using a logarithmic transformation to achieve approximate normality in the distribution.

Figure 9: Broadly defined types of skewness in statistics.

Positive Skewness: positive skewness occurs when MEAN > MEDIAN > MODE. The outliers lie to the right, i.e., the tail extends to the right. A single observation at the extreme right can skew the curve.
Negative Skewness: negative skewness occurs when MEAN < MEDIAN < MODE. The outliers lie to the left, i.e., the tail extends to the left. A single observation at the extreme left can skew the curve.

4.4 Measures Of Relationship


These measures describe the relationship between two variables.
1. Covariance
→ It is a measure of the joint variability of two variables: it measures the degree to which, when one variable changes, a similar change is seen in the other variable. However, since the variables used for calculating covariance are not 'normalized', the covariance does not give effective information about the strength of the relationship between the two variables.

$$Cov(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)}{N}$$

where $x_i, y_i$ are observed values, $\mu_x, \mu_y$ are the population means, and $N$ is the population size.



$$Cov(x, y) = s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

where $x_i, y_i$ are observed values, $\bar{x}, \bar{y}$ are the sample means, and $n$ is the sample size.
2. Correlation
→ It is a measure of the relationship between the variability of two variables. However, it is calculated on normalized variables and is hence a better estimator of the relationship. Correlation is a standardized covariance, often called the Pearson correlation coefficient. Its value ranges from -1 to +1: -1 indicates a negative relationship (as one variable increases, the other decreases), +1 indicates a positive relationship (an increase in one variable is associated with an increase in the other), and 0 indicates no linear relationship between the variables.

$$Cor(x, y) = \rho_{xy} = \frac{Cov(x, y)}{\sigma_x \, \sigma_y}$$

EXAMPLE: Let’s take an example where we are interested in knowing if there exists any relationship
between weight (x) and height (y) of people in a given sample.

Table 4: Example data for a given weight & height sample.

Weight (x)   Height (y)   (x − x̄)   (y − ȳ)   (x − x̄)(y − ȳ)   (x − x̄)²   (y − ȳ)²
45           5.0          −5         −0.14      0.7               25          0.0196
53           5.5           3          0.36      1.08               9          0.1296
70           6.0          20          0.86     17.2              400          0.7396
42           4.7          −8         −0.44      3.52              64          0.1936
40           4.5         −10         −0.64      6.4              100          0.4096
Σ = 250      Σ = 25.7                           Σ = 28.9          Σ = 598     Σ = 1.49
(x̄ = 50)    (ȳ = 5.14)

$$Cor(r_{xy}) = \frac{28.9}{\sqrt{598} \cdot \sqrt{1.49}} = 0.97$$

Here, the positive correlation of 0.97 indicates a strong linear relationship between the weight and height of people: as a person's height increases, their weight also tends to increase.
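The hand calculation above can be reproduced directly with R's built-in cov() and cor():

```r
weight <- c(45, 53, 70, 42, 40)
height <- c(5.0, 5.5, 6.0, 4.7, 4.5)

cov(weight, height)   # sample covariance: 28.9 / (5 - 1) = 7.225
cor(weight, height)   # Pearson correlation ~ 0.97, matching the table above
```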
NOTE: Correlation does not imply causation. Remember the example discussed in class: a strong correlation may exist between the 'rate of potato' in the vegetable market and the 'number of accidents' occurring on roads. However, physically, this correlation does not make any sense.

