DS-2, Week 2 - Lectures
1 LEARNING OBJECTIVES
1.1 What are you expected to learn from this lecture?
• Definitions and notions of statistics in data science.
• What are the moments in statistics?
• How to formulate and calculate these in R?
2 Moments in Statistics
Moments play a vital role in statistics, especially when we work with the probability distribution of data. With the help of moments, we can describe the properties of a statistical distribution. Further, "Statistical Estimation" and "Hypothesis Testing", which are based on the numerical values computed for each distribution, require the statistical moments. In short, moments are mainly used to describe the characteristics of a distribution. If the random variable of interest is X, then the moments are defined as the expected values of powers of X.
For example, $E(X^1), E(X^2), E(X^3), E(X^4), \dots$ etc.
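In R, the k-th raw moment can be estimated as the sample average of the k-th powers of the observations. A minimal sketch, using an illustrative vector (not data from the lecture):

```r
# Illustrative sample data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# k-th raw (sample) moment: the average of x^k, estimating E(X^k)
raw_moment <- function(x, k) mean(x^k)

raw_moment(x, 1)  # 5  (first raw moment = the sample mean)
raw_moment(x, 2)  # 29 (second raw moment, estimates E(X^2))
raw_moment(x, 3)
raw_moment(x, 4)
```

Higher moments weight large observations more heavily, which is why they capture the shape (skewness, tail heaviness) of a distribution.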
STANDARD DEVIATION
Standard deviation is simply the square root of the variance. Since the variable X and its standard deviation have the same units, the standard deviation is the most commonly used measure of spread.
SKEWNESS
Skewness measures the asymmetry of a distribution. A skewed distribution can be:
– Right-tailed (positively skewed)
– Left-tailed (negatively skewed)
For instance, for a Normal Distribution, which is symmetric, the value of Skewness equals 0.
In general, Skewness affects the relationship of mean, median, and mode in the following manner:
– For a symmetrical distribution: Mean = Median = Mode
– For a positively skewed distribution: Mean > Median > Mode (long tail of large values)
– For a negatively skewed distribution: Mean < Median < Mode (long tail of small values)
Other Formulas for Calculating Skewness:

$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{SD} = 3 \cdot \frac{\text{Mean} - \text{Median}}{SD}$$
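The median-based version of this formula can be computed directly in R; base R has no built-in skewness function, so the helper below is a small sketch with illustrative data:

```r
# Illustrative sample data (slightly right-skewed)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Pearson's second skewness coefficient: 3 * (Mean - Median) / SD
pearson_skew <- function(x) 3 * (mean(x) - median(x)) / sd(x)

pearson_skew(x)  # about 0.70: positive, so the tail is on the right
```

A positive value indicates right skew (mean pulled above the median), a negative value indicates left skew, and values near 0 indicate rough symmetry.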
2.1.3.2 Ladder of powers A useful organizing concept for data transformations is the ladder of powers
(P.F. Velleman and D.C. Hoaglin, Applications, Basics, and Computing of Exploratory Data Analysis, 354
pp., Duxbury Press, 1981).
Data transformations are commonly power transformations, $x' = x^{\theta}$ (where $x'$ is the transformed $x$). One can visualize these as a continuous series of transformations:
| $\theta$ | transformation | name |
|---|---|---|
| 3 | $x^3$ | cube |
| 2 | $x^2$ | square |
| 1 | $x^1$ | identity (no transformation) |
| 1/2 | $x^{0.5}$ | square root |
| 1/3 | $x^{1/3}$ | cube root |
| 0 | $\log(x)$ | logarithmic (holds the place of zero) |
| -1/2 | $-1/x^{0.5}$ | reciprocal root |
| -1 | $-1/x^{1}$ | reciprocal |
| -2 | $-1/x^{2}$ | reciprocal square |
NOTE:
• higher and lower powers can be used
• fractional powers (other than those shown) can be used
• minus sign in reciprocal transformations can (optionally) be used to preserve the order (relative ranking)
of the data, which would otherwise be inverted by transformations for 𝜃 < 0.
To use the ladder of powers, visualize the original, untransformed data as starting at $\theta = 1$. Then, if the data are right-skewed (clustered at lower values), move down the ladder of powers (that is, try square root, cube root, logarithmic, etc. transformations). If the data are left-skewed (clustered at higher values), move up the ladder of powers (square, cube, etc.).
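The effect of moving down the ladder on right-skewed data can be checked in R. A sketch using a synthetic log-normal sample (`rlnorm` is used only to generate illustrative right-skewed data), with Pearson's median-based skewness as a quick symmetry check:

```r
set.seed(42)
x <- rlnorm(1000)  # log-normal data: strongly right-skewed, all values > 0

# Pearson's second skewness coefficient: 3 * (mean - median) / sd
pearson_skew <- function(x) 3 * (mean(x) - median(x)) / sd(x)

pearson_skew(x)        # theta = 1: clearly positive (right-skewed)
pearson_skew(sqrt(x))  # theta = 1/2: skewness reduced
pearson_skew(log(x))   # theta = 0: approximately symmetric
```

Each step down the ladder compresses the long right tail more strongly; for log-normal data, the log transform ($\theta = 0$) recovers an (approximately) symmetric normal distribution.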
3 DATA
Data originates from the Latin word 'datum'. Data is defined as a collection of facts in the form of numbers, words, measurements, observations, etc. that can be translated into some numeric form for computer processing.
Note to Remember:
1. Population vs. Sample
• Population: total number of all possible specimens (N) in a study.
– Example:
∗ pebbles on a beach
∗ human population living in a city/state/nation
• Sample: a subset of elements/specimens (n) taken from a population (N). Statistically - a subset of
a population (group of data).
– Example:
∗ human population in a colony/area for a given city and so on …
∗ samples can be representative or biased
Figure 6: Samples are drawn from the population. A sample may or may not represent the population.
4 STATISTICS
Statistics is a sub-discipline of mathematics that emphasizes the collection and summarization of data. While studying statistics, the user focuses on the collection/acquisition, organization, analysis, presentation, and interpretation of data.
1. Mean
→ The arithmetic average of the data values in a sample.

$$\text{Mean}\,(\bar{x}) = \frac{\sum_{i=1}^{n} x_i}{n}$$

where $x_i$ = observations from $i = 1$ to $n$.
2. Median
→ It is the central value of the sample data, obtained after sorting the data in ascending order. The median is a better alternative to the mean for a given dataset, especially when the dataset is affected by outliers and skewness.
When the total number of data values $n$ is odd:

$$\text{Median} = \left[\frac{n+1}{2}\right]^{th} \text{obs.}$$

When $n$ is even, the median is the average of the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2}+1\right)^{th}$ observations.
3. Mode
→ Most frequently occurring data value in a given sample. It is the value that has the highest frequency in
the given sample.
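The three measures above can be computed in R with `mean()` and `median()`; base R has no mode function for data (its `mode()` returns the storage type), so a small helper is sketched here with illustrative values:

```r
# Illustrative sample data
x <- c(3, 5, 5, 6, 7, 8, 9)

mean(x)    # about 6.14: the arithmetic average
median(x)  # 6: the middle value after sorting

# Helper for the statistical mode: the most frequent value
stat_mode <- function(x) {
  tab <- table(x)                        # frequency of each value
  as.numeric(names(tab)[which.max(tab)]) # value with the highest count
}
stat_mode(x)  # 5
```

Note that `which.max` returns only the first maximum, so for multimodal data this helper reports just one of the modes.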
Interquartile Range (IQR)
→ The spread of the middle 50% of the data, where $Q_1$ and $Q_3$ are the first and third quartiles:

$$IQR = Q_3 - Q_1$$
Variance
→ The average squared deviation of the data values from their mean.

Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
NOTE: The population variance divides by $N$, while the sample variance divides by $(n-1)$ even though the sample has $n$ data points. The $(n-1)$ divisor is called Bessel's Correction and is used to reduce bias.
4. Standard Deviation
→ Square root of the variance. The sample standard deviation ($s$) is calculated as the square root of the sample variance, and the population standard deviation ($\sigma$) is calculated as the square root of the population variance.
$$s = \sqrt{\text{variance}\,(s^2)}$$
$$\sigma = \sqrt{\text{variance}\,(\sigma^2)}$$
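In R, `var()` and `sd()` always use the sample $(n-1)$ denominator, so the population versions must be computed by hand. A short sketch with illustrative data:

```r
# Illustrative sample data
x <- c(4, 8, 6, 5, 3, 7)
n <- length(x)

var(x)                    # 3.5: sample variance, divides by (n - 1)
sum((x - mean(x))^2) / n  # about 2.917: population variance, divides by N

sd(x)         # sample standard deviation
sqrt(var(x))  # identical by definition: sd is the square root of variance
```

For large samples the two denominators give nearly identical results, but for small samples the $(n-1)$ version is noticeably larger, correcting the downward bias of the population formula applied to a sample.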
5. Coefficient of Variation (CV)
→ The coefficient of variation is also called the relative standard deviation. It is the ratio of the standard deviation to the mean of the dataset, and it can be used to compare the spread of two datasets even when they are measured on different scales.
$$CV\,(\%) = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100$$

$$CV_{(\text{sample})} = \frac{s}{\bar{x}} \times 100 \qquad CV_{(\text{population})} = \frac{\sigma}{\mu} \times 100$$
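A CV helper in R is a one-liner built from `sd()` and `mean()`; a sketch with illustrative data:

```r
# Illustrative sample data
x <- c(10, 12, 11, 14, 13)

# Sample coefficient of variation, as a percentage
cv <- function(x) sd(x) / mean(x) * 100

cv(x)  # about 13.18 (%)
```

Because the units cancel in the ratio, the CV is dimensionless, which is what makes it suitable for comparing variability across datasets with different units or scales.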
Positive Skewness: Positive skewness is when MEAN > MEDIAN > MODE. The outliers lie to the right, i.e., the tail extends to the right. A single observation at the extreme right can skew the curve.
Negative Skewness: Negative skewness is when MEAN < MEDIAN < MODE. The outliers lie to the left, i.e., the tail extends to the left. A single observation at the extreme left can skew the curve.
1. Covariance
→ A measure of how two variables vary together.

Population covariance:
$$Cov(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$

where $x_i, y_i$ are observed values, $\mu_x, \mu_y$ are population means, and $N$ is the population size.

Sample covariance:
$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

where $x_i, y_i$ are observed values, $\bar{x}, \bar{y}$ are sample means, and $n$ is the sample size.
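R's `cov()` uses the sample $(n-1)$ denominator; the population version must be computed manually. A sketch with illustrative data:

```r
# Illustrative paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

cov(x, y)  # 2: sample covariance, divides by (n - 1)

# Population covariance divides by N instead
n <- length(x)
sum((x - mean(x)) * (y - mean(y))) / n  # 1.6
```

A positive covariance means the two variables tend to move in the same direction; its magnitude, however, depends on the units of x and y, which is why the standardized version (correlation) is usually preferred.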
2. Correlation
→ It is a measure of the relationship between the variability of two variables. Because it is calculated on normalized variables, it is a better estimator of the relationship: correlation is simply a standardized covariance. It is often called the Pearson Correlation Coefficient. The value of a correlation coefficient ranges from -1 to +1: -1 indicates a perfect negative relationship (as one variable increases, the other decreases), +1 indicates a perfect positive relationship (an increase in one variable is associated with an increase in the other), and 0 indicates no linear relationship between the two variables.
$$Cor(x, y) = \rho_{xy} = \frac{Cov(x, y)}{\sigma_x \sigma_y}$$
EXAMPLE: Let’s take an example where we are interested in knowing if there exists any relationship
between weight (x) and height (y) of people in a given sample.
$$Cor(r_{xy}) = \frac{28.9}{\sqrt{598} \times \sqrt{1.49}} = 0.97$$
Here, the positive correlation of 0.97 indicates a strong linear relationship between the weight and height of people: as the height of a person increases, the weight also tends to increase.
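The arithmetic above can be reproduced in R, as can the general calculation with `cor()`. The values 28.9, 598, and 1.49 are the lecture's covariance and variances; the weight/height vectors below are illustrative, not the lecture's data:

```r
# Reproduce the arithmetic from the lecture's summary values:
# Cov(x, y) = 28.9, Var(x) = 598, Var(y) = 1.49
r <- 28.9 / (sqrt(598) * sqrt(1.49))
round(r, 2)  # 0.97

# With raw data, cor() standardizes the covariance directly
weight <- c(60, 65, 70, 75, 80)           # illustrative values
height <- c(1.60, 1.65, 1.72, 1.78, 1.83) # illustrative values
cor(weight, height)  # strong positive Pearson correlation, close to 1
```

Note that `cor(x, y)` is exactly `cov(x, y) / (sd(x) * sd(y))`, matching the formula above.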
NOTE: Correlation does not imply causation. Remember the example discussed in class:
– A strong correlation may exist between the 'rate of potatoes' in a vegetable market and the 'number of accidents' occurring on roads. However, physically this correlation does not make any sense.