Statistics Unit1 Notes.docx

Msc ds stats

Uploaded by

Shubham Wagh
Copyright
© All Rights Reserved

Unit 1 : Descriptive Statistics

Measures of Central Tendency


A measure of central tendency is a single value that attempts to describe a set of data by identifying
the central position within that set of data.

The mean (often called the average) is probably the measure of central tendency that you are most
familiar with, but there are others, such as the median and the mode.

The mean, median and mode are all measures of central tendency, but under different conditions,
some measures of central tendency become more appropriate to use than others.

MEAN

The mean is the average of a data set.

The mean is equal to the sum of all the values in the data set divided by the number of values in the
data set.

If x1, x2, …, xn are the n values in the data, the sample mean, denoted by $\bar{x}$, is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \quad \text{or} \quad \bar{x} = \frac{\sum x_i}{n}$$

Mean is not the best measure of central tendency when

i. The dataset contains outliers

ii. The distribution is skewed.
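
The following is a minimal R sketch of the definition of the mean and of the outlier problem. The salary figures are made up for illustration and are not part of the notes.

```r
# Hypothetical monthly salaries (illustrative values only)
salaries <- c(15000, 16000, 14500, 15500, 16500)

# Mean from the definition: sum of the values divided by their count
mean_by_formula <- sum(salaries) / length(salaries)
mean_builtin <- mean(salaries)

# Add one extreme value and compare how the mean and the median react
with_outlier <- c(salaries, 200000)
mean_with_outlier <- mean(with_outlier)
median_with_outlier <- median(with_outlier)

print(c(mean_by_formula, mean_builtin))
print(c(mean_with_outlier, median_with_outlier))
```

A single extreme salary pulls the mean far above most of the values, while the median barely moves.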

Properties of mean:

1. If a fixed value d is added to each of the observations in the data, then

Mean of the new data = d + mean of the old data.

e.g. Suppose the mean monthly salary of the employees working for a company is Rs. 15000/-. If
each employee gets a monthly raise of Rs.1500/-, then the mean monthly salary after the raise will
be 15000+1500= Rs. 16500

2. If each observation in the data is multiplied by a fixed constant c, then

Mean of the new data = c × mean of the old data.

e.g. mean height of students in a class is 4.8 feet. If the height is expressed in inches, the mean
height would be 12(4.8) = 57.6 inches.
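
Both properties can be checked numerically. The sketch below uses made-up salaries and heights chosen so that their means match the figures in the examples above; the variable names are illustrative.

```r
# Hypothetical salaries in rupees with mean 15000
salary <- c(12000, 15000, 18000)
d <- 1500                              # fixed raise added to every observation

# Property 1: adding d to every value adds d to the mean
shifted_mean <- mean(salary + d)

# Hypothetical heights in feet with mean 4.8, converted to inches (factor 12)
height_ft <- c(4.5, 4.8, 5.1)
scaled_mean <- mean(12 * height_ft)

print(c(shifted_mean, mean(salary) + d))
print(c(scaled_mean, 12 * mean(height_ft)))
```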

Median

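The median is the middle score for a set of data that has been arranged in order of magnitude.

If the number of observations in the data set is n, then

$$\text{Median} = \text{value of the } \left(\frac{n+1}{2}\right)^{\text{th}} \text{ observation in the arranged dataset.}$$

When n is even, (n+1)/2 is not a whole number, and the median is the average of the two middle observations.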
e.g. Consider the data set 65,55,89,56,35,14,56,55,87,44,92. Median=56

For the dataset 65,55,89,56,35,14,56,55,87,44. Median=55.5
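
As a rough check of the two examples, the rule can be written as a small R function. The name median_by_rule is a hypothetical helper, not a built-in.

```r
# Median by the (n + 1)/2 rule, averaging the two middle values when n is even
median_by_rule <- function(x) {
  s <- sort(x)
  n <- length(s)
  pos <- (n + 1) / 2
  # floor(pos) and ceiling(pos) coincide when n is odd
  (s[floor(pos)] + s[ceiling(pos)]) / 2
}

odd_data  <- c(65, 55, 89, 56, 35, 14, 56, 55, 87, 44, 92)
even_data <- c(65, 55, 89, 56, 35, 14, 56, 55, 87, 44)

print(median_by_rule(odd_data))   # 56
print(median_by_rule(even_data))  # 55.5
```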

The median is less affected by outliers and is the preferred measure of central tendency for skewed
data.

MODE

The mode is the most frequent score in the data set.

e.g. If the data set is 45,42,56,42,56,67,43,42,56,45,42,40,42. Mode= 42.

Normally, the mode is used for categorical data where we wish to know which is the most common
category.

The main problem with mode is that it may not be unique.

It will not be an appropriate measure of central tendency if the most common value is away from the
rest of the data set.

We can have datasets with more than one mode or datasets with no mode.
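
A frequency table makes these cases easy to see. The sketch below tabulates the data set from the mode example and two small made-up data sets, one with two modes and one in which no value repeats.

```r
# Data set from the mode example above
x <- c(45, 42, 56, 42, 56, 67, 43, 42, 56, 45, 42, 40, 42)
freq <- table(x)                 # frequency of each distinct value
print(freq)

# Made-up data: two values tie for the highest frequency
two_modes <- c(1, 1, 2, 2, 3)
print(table(two_modes))

# Made-up data: every value occurs once, so no value stands out as a mode
no_repeat <- c(4, 7, 9)
print(table(no_repeat))
```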

Mean, Median and Mode using R

Mean

Syntax

The basic syntax for calculating mean in R is −

mean(x, trim = 0, na.rm = FALSE, ...)

Following is the description of the parameters used −

● x is the input vector.

● trim is used to drop some observations from both ends of the sorted vector.

● na.rm is used to remove the missing values from the input vector.

Example:

x <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

mean(x)

Applying Trim Option

When the trim parameter is supplied, the values in the vector are sorted and the required number
of observations is dropped from each end before the mean is calculated.

When trim = 0.3, floor(0.3n) values are dropped from each end before the mean is calculated,
where n is the total number of observations.

x <- c(-21, -5, 2, 3, 4.2, 7, 8, 12, 18, 54)


In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18, 54) and the values removed from the
vector for calculating mean are (−21,−5,2) from left and (12,18,54) from right.

mean(x,trim=0.3)
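
To see what the trim option does, the trimmed mean can be reproduced by hand. This is a sketch using the same vector; the names k, kept, manual and builtin are illustrative.

```r
x <- c(-21, -5, 2, 3, 4.2, 7, 8, 12, 18, 54)
n <- length(x)
k <- floor(0.3 * n)                # observations dropped from each end (3 here)

kept <- sort(x)[(k + 1):(n - k)]   # the middle values 3, 4.2, 7, 8
manual <- mean(kept)
builtin <- mean(x, trim = 0.3)

print(kept)
print(c(manual, builtin))          # both 5.55
```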

Applying NA Option

If there are missing values, then the mean function returns NA.

To drop the missing values from the calculation, use na.rm = TRUE, which means remove the NA
values.

x <- c(-21, -5, 2, 3, NA, 4.2, 7, 8, 12, NA, 18, 54)

mean(x,na.rm=TRUE)
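
A short sketch comparing the two calls on the same vector; the variable names are illustrative.

```r
x <- c(-21, -5, 2, 3, NA, 4.2, 7, 8, 12, NA, 18, 54)

mean_default <- mean(x)               # NA, because x contains missing values
mean_clean <- mean(x, na.rm = TRUE)   # mean of the 10 non-missing values
n_missing <- sum(is.na(x))            # number of values that were dropped

print(mean_default)
print(mean_clean)
print(n_missing)
```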

Median using R

Syntax

The basic syntax for calculating median in R is −

median(x, na.rm = FALSE)

Following is the description of the parameters used −

● x is the input vector.

● na.rm is used to remove the missing values from the input vector.

Example:

x <- c(2,1,2,3,1,2,3,4,1,5,5,3,2)

med <- median(x); med

Mode using R

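R does not have a standard built-in function to calculate the statistical mode (the built-in mode()
function returns the storage mode of an object instead). So we create a user function to calculate
the mode of a data set in R. This function takes a vector as input and gives the mode value(s) as
output.

getmode <- function(x) {
  xt <- table(x)              # frequency of each distinct value
  names(xt)[xt == max(xt)]    # most frequent value(s), returned as character strings
}

Example:

1. x <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3); getmode(x)

[1] "2" "3"

Here 2 and 3 each occur four times, so the data set has two modes.

2. x <- c("o","i","a","i","i","u","e"); getmode(x)

[1] "i"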

Partition Values

Partition Values divide the set of observations into several equal parts.

• Quartiles

• Deciles

• Percentiles

Quartiles

Quartiles are the values that divide a given data set into four equal parts. There are three quartiles
Q1,Q2 and Q3.

Q1: Lower or first quartile: the value for which 1/4th of the observations are less than or
equal to it.

Q2: Middle or second quartile: the value for which 1/2 of the observations are less than or
equal to it. It is the median of the data set.

Q3: Upper or third quartile: the value for which 3/4th of the observations are less than or
equal to it.

We need to arrange all the observations in the data set in ascending order. Then, for N observations,

$$Q_i = \text{value of the } \left(\frac{i(N+1)}{4}\right)^{\text{th}} \text{ observation}, \quad i = 1, 2, 3$$

Deciles

Deciles are those values that divide any dataset into ten equal parts. Therefore, there are a total of
nine deciles.

D1, D2, D3, D4, ……… D9.

Di is the value for which i/10th of the total observations are less than or equal to Di,
i = 1, 2, …, 9, after all the observations are arranged in ascending order.

$$D_i = \text{value of the } \left(\frac{i(N+1)}{10}\right)^{\text{th}} \text{ observation}, \quad i = 1, 2, \ldots, 9$$

Percentiles

Percentiles divide any given dataset into 100 equal parts. There are a total of 99 percentiles.

P1, P2, P3, P4, ……… P99.

Pi is the value for which i/100th of the total observations are less than or equal to Pi,
i = 1, 2, …, 99, after all the observations are arranged in ascending order.

$$P_i = \text{value of the } \left(\frac{i(N+1)}{100}\right)^{\text{th}} \text{ observation}, \quad i = 1, 2, \ldots, 99$$

Example:

Calculate Q1, D3 and P55 for the dataset given below: 42,45,41,48,50,52,43,44,42,50,56,52,55,43.

Arranged data: 41,42,42,43,43,44,45,48,50,50,52,52,55,56.

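Q1 = Value of 3.75th observation = 42 + 0.75 × (43 − 42) = 42.75

D3 = Value of 4.5th observation = 43 + 0.5 × (43 − 43) = 43

P55 = Value of 8.25th observation = 48 + 0.25 × (50 − 48) = 48.5

A fractional position is resolved by interpolating linearly between the two neighbouring observations.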
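
The manual calculation can be sketched in R as follows. The helper value_at is hypothetical (not a built-in) and assumes the position lies between 1 and N.

```r
# Value at a fractional position in the sorted data, by linear interpolation
value_at <- function(x, pos) {
  s <- sort(x)
  lo <- floor(pos)
  frac <- pos - lo
  if (frac == 0) return(s[lo])
  s[lo] + frac * (s[lo + 1] - s[lo])
}

x <- c(42, 45, 41, 48, 50, 52, 43, 44, 42, 50, 56, 52, 55, 43)
N <- length(x)

q1  <- value_at(x, 1 * (N + 1) / 4)      # 3.75th observation
d3  <- value_at(x, 3 * (N + 1) / 10)     # 4.5th observation
p55 <- value_at(x, 55 * (N + 1) / 100)   # 8.25th observation

print(c(Q1 = q1, D3 = d3, P55 = p55))    # 42.75, 43, 48.5
```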


Using R

There is a single command for the calculation of all partition values.

If the array of observations is x and we need to compute the kth percentile, pass k/100 as the probability:

quantile(x, k/100)

For example, quantile(x, 0.25) gives Q1 and quantile(x, 0.30) gives D3.

Note:

Manual calculations and values from R may not match exactly, since quantile() offers 9 different
algorithms, selected with the type argument (the default is type = 7). The (N+1) formulas above
correspond to type = 6, the method also used by SPSS and Minitab.
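
The difference between the algorithms can be seen on the data set from the worked example. This sketch compares type = 6 with R's default; the variable names are illustrative.

```r
x <- c(42, 45, 41, 48, 50, 52, 43, 44, 42, 50, 56, 52, 55, 43)
probs <- c(0.25, 0.30, 0.55)            # Q1, D3 and P55 as proportions

type6 <- quantile(x, probs, type = 6)   # the (N + 1) rule used in these notes
type7 <- quantile(x, probs)             # R's default, type = 7

print(type6)   # 42.75, 43, 48.5
print(type7)   # 43, 43, 48.3
```

Only type = 6 reproduces all three manual answers; the default differs for Q1 and P55.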

Box Plot (Box and Whisker Plot)

The box and whiskers chart shows you how your data is spread out. Five pieces of information are
generally included in the chart:

• The minimum (the smallest number in the data set), shown at the far left of the chart, at the end of the left “whisker”.

• First quartile, Q1, shown at the far left of the box (the right end of the left whisker).

• The median, shown as a line inside the box.

• Third quartile, Q3, shown at the far right of the box (the left end of the right whisker).

• The maximum (the largest number in the data set), shown at the far right of the chart, at the end of
the right “whisker”.

Outliers

A box plot can be used to identify outliers in the data.

• Interquartile range: IQR = Q3 − Q1

• Upper limit (maximum whisker end) = Q3 + 1.5 × IQR

• Lower limit (minimum whisker end) = Q1 − 1.5 × IQR

• Points outside the lower and upper limits are outliers.


Example

Draw a box-and-whisker plot for the data set {3, 7, 8, 5, 12, 14, 21, 13,18,10,11}.

Arranged data: 3,5,7,8,10,11,12,13,14,18,21

Min =3 ,Max =21 , Q1= 7 , Q2= 11 , Q3=14
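
The five values and the 1.5 × IQR rule can be checked with a short R sketch. It assumes the (N+1) quartile rule, which corresponds to type = 6 in quantile(); the variable names are illustrative.

```r
x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18, 10, 11)

# Quartiles by the (N + 1) rule
q <- unname(quantile(x, c(0.25, 0.5, 0.75), type = 6))
iqr <- q[3] - q[1]

lower_limit <- q[1] - 1.5 * iqr
upper_limit <- q[3] + 1.5 * iqr
outliers <- x[x < lower_limit | x > upper_limit]

print(c(min = min(x), Q1 = q[1], Q2 = q[2], Q3 = q[3], max = max(x)))
print(c(lower_limit, upper_limit))   # -3.5 and 24.5
print(outliers)                      # empty: no point lies outside the limits
```

Note that boxplot() computes the box edges with Tukey's hinges, so the drawn box may differ slightly from these quartiles.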

Box Plot using R

• A box plot drawn in R also shows the outliers.

• Command: boxplot(x)

boxplot(x, main = "Box Plot")   # adds a title

boxplot(x, y, z)                # several data sets side by side, for comparison

Uses of a box plot

It is used to identify

• outliers and their values

• symmetry of data

• tight grouping of data

• data skewness: whether the data are skewed and, if so, in which direction and by how much

Measures of Dispersion

Statistical dispersion is the extent to which numerical data are likely to vary about an
average value. It helps to understand the distribution of the data.

Variance

Variance measures variability from the average or mean.

It is the average of the squares of the differences between each number in the data set and its mean.

$$\text{Variance} = \frac{\sum (x_i - \mu)^2}{n}$$

where µ is the population mean. If it is unknown, then it is replaced with the sample mean.


$$\text{Variance} = \frac{\sum (x_i - \bar{x})^2}{n}$$

It is expressed in the squares of the units in which the original data is given.

For calculations we use the simplified version

$$\text{Variance} = \frac{\sum x_i^2 - n\bar{x}^2}{n}$$

It is denoted by σ².
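
The simplified version follows from expanding the square in the definition and using $\sum x_i = n\bar{x}$. A short derivation of the numerator:

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2\bar{x}\,(n\bar{x}) + n\bar{x}^2 \\
  &= \sum x_i^2 - n\bar{x}^2
\end{aligned}
```

Dividing both sides by n gives the simplified formula for the variance.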

Standard Deviation

It is defined as the square root of the variance and is given by

$$\sigma = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n}}$$

It is expressed in the same units in which the original data is given.

Coefficient of Variation

The coefficient of variation is the ratio of the standard deviation to the mean, multiplied by 100.

$$\text{C.V.} = \frac{\sigma}{\bar{x}} \times 100$$

It is expressed as a percentage.

It is a useful statistic for comparing the degree of variation from one data set to another, even if the
means are drastically different from one another or units of measurement are different.
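
As an illustration, the sketch below compares two made-up data sets measured in different units. The helpers pop_sd and cv are hypothetical and use the divisor-n standard deviation from these notes.

```r
# Population standard deviation (divisor n), matching the formulas in these notes
pop_sd <- function(x) sqrt(sum((x - mean(x))^2) / length(x))
cv <- function(x) pop_sd(x) / mean(x) * 100

height_cm <- c(150, 160, 170, 180, 190)   # made-up heights in cm
weight_kg <- c(50, 60, 70, 80, 90)        # made-up weights in kg

cv_height <- cv(height_cm)
cv_weight <- cv(weight_kg)

print(c(pop_sd(height_cm), pop_sd(weight_kg)))   # identical standard deviations
print(c(cv_height, cv_weight))                   # but different relative variation
```

Both data sets have the same standard deviation, yet the weights vary more relative to their mean, which the C.V. makes visible.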

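Variance, SD and CV using R

Let the array of observations be x.

variance <- var(x)

std_dev <- sd(x)

cv <- sd(x) / mean(x) * 100

Note: var() and sd() divide by n − 1 (the sample versions), not by n as in the formulas above, so the results differ slightly from manual calculations. Multiply var(x) by (n − 1)/n to obtain the divisor-n variance.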
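
The sketch below, on a small made-up data set, compares the divisor-n formulas from these notes with the results of R's var() and sd(), which divide by n − 1.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up data
n <- length(x)

pop_var <- sum((x - mean(x))^2) / n           # definition, divisor n
shortcut <- (sum(x^2) - n * mean(x)^2) / n    # simplified version
sample_var <- var(x)                          # R divides by n - 1

print(c(pop_var, shortcut, sample_var))       # 4, 4, 4.571429
print(c(sqrt(pop_var), sd(x)))
```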

Measures of Skewness and Kurtosis


Skewness

Skewness refers to asymmetry in the distribution curve of a dataset.

If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified to
define the extent to which a distribution departs from symmetry (a normal distribution, for example, is symmetric).

A distribution can be

• Symmetric

• Positively skewed

• Negatively skewed

Symmetric distribution

For a symmetric distribution with a single peak, mean = median = mode.

In the box plot of a symmetric distribution, Q3 − Q2 = Q2 − Q1, and the lengths of the two whiskers are equal.

Positively Skewed distribution

A positively skewed distribution is a distribution with the tail on its right side. For these
distributions, Mean > Median > Mode.

In the box plot of a positively skewed distribution, Q3 − Q2 > Q2 − Q1.

Negatively Skewed distribution

A negatively skewed distribution is a distribution with the tail on its left side. For these distributions,
Mean < Median < Mode.

In the box plot of a negatively skewed distribution, Q3 − Q2 < Q2 − Q1.

Measures of Skewness

1. Karl Pearson's Coefficient of Skewness

It is denoted by SKp and is defined by

$$SK_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$$

If the mode is indeterminate, use the empirical relationship between mean, median and mode, Mean − Mode ≈ 3(Mean − Median):

$$SK_p = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$$

SKp = 0: Distribution is symmetric

SKp > 0: Distribution is positively skewed

SKp < 0: Distribution is negatively skewed
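
A minimal sketch of both versions of Karl Pearson's coefficient, computed on a made-up data set with a long right tail. It assumes the divisor-n standard deviation from these notes; the variable names are illustrative.

```r
x <- c(2, 3, 3, 4, 4, 4, 5, 6, 9, 20)   # made-up data with a long right tail

pop_sd <- sqrt(sum((x - mean(x))^2) / length(x))     # divisor n
mode_val <- as.numeric(names(which.max(table(x))))   # most frequent value (first one if tied)

skp_mode <- (mean(x) - mode_val) / pop_sd
skp_median <- 3 * (mean(x) - median(x)) / pop_sd

print(c(mean = mean(x), median = median(x), mode = mode_val))
print(c(skp_mode, skp_median))   # both positive: positively skewed
```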

2. Bowley's Coefficient of Skewness

It is denoted by SKb and is defined by

$$SK_b = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$

SKb = 0: Distribution is symmetric

SKb > 0: Distribution is positively skewed

SKb < 0: Distribution is negatively skewed

−1 ≤ SKb ≤ +1
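
Bowley's coefficient needs only the three quartiles. The sketch below reuses the data set from the partition values example and assumes the (N+1) quartile rule, which is type = 6 in quantile().

```r
x <- c(42, 45, 41, 48, 50, 52, 43, 44, 42, 50, 56, 52, 55, 43)

# Quartiles by the (N + 1) rule
q <- unname(quantile(x, c(0.25, 0.5, 0.75), type = 6))
q1 <- q[1]; q2 <- q[2]; q3 <- q[3]

skb <- (q3 + q1 - 2 * q2) / (q3 - q1)

print(c(Q1 = q1, Q2 = q2, Q3 = q3))   # 42.75, 46.5, 52
print(skb)                            # positive: positively skewed
```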

3. Pearson’s Measure of Skewness (Based on moments)


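It is denoted by $\gamma_1 = \pm\sqrt{\beta_1}$, where

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$

and $\mu_r$ is the rth central moment, given by

$$\mu_r = \frac{\sum (x_i - \bar{x})^r}{n}, \quad r = 1, 2, 3, \ldots$$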

Interpretation

γ1 is the positive root of β1 if µ3 is positive, and the negative root of β1 if µ3 is negative.

The distribution is symmetric if γ1 = 0

It is positively skewed if γ1 > 0

It is negatively skewed if γ1 < 0
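
The moment-based measure can be computed directly from its definition. In this sketch central_moment is a hypothetical helper and the data are made up.

```r
# r-th central moment with divisor n, as defined above
central_moment <- function(x, r) sum((x - mean(x))^r) / length(x)

x <- c(2, 3, 3, 4, 4, 4, 5, 6, 9, 20)   # made-up, right-skewed data
mu2 <- central_moment(x, 2)
mu3 <- central_moment(x, 3)

beta1 <- mu3^2 / mu2^3
gamma1 <- sign(mu3) * sqrt(beta1)       # root of beta1 carrying the sign of mu3

print(c(mu2 = mu2, mu3 = mu3, beta1 = beta1, gamma1 = gamma1))
```

Here µ3 is positive, so γ1 is the positive root and the data are positively skewed.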

Kurtosis

Kurtosis provides information about the peakedness of a distribution. Peakedness in a distribution is the
degree to which data values are concentrated around the mean.

Datasets with high kurtosis tend to have a distinct peak near the mean, decline rapidly, and have heavy tails.

Datasets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.

Types of Kurtosis

There are 3 types of kurtosis:

• Mesokurtic: Distributions with a moderately peaked curve and moderate breadth, such as the normal distribution.

• Leptokurtic: Distributions whose curve is sharply peaked with heavy tails.

• Platykurtic: Distributions whose curve has a flat peak and more dispersed scores with
lighter tails.

Measure of Kurtosis

Pearson's measure of Kurtosis (Based on moments)

It is denoted by $\gamma_2 = \beta_2 - 3$, where

$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$

Interpretation

The distribution is mesokurtic if γ2 = 0 (β2 = 3)

It is leptokurtic if γ2 > 0 (β2 > 3)

It is platykurtic if γ2 < 0 (β2 < 3)
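
A sketch of the kurtosis calculation and its interpretation, again using a hypothetical central_moment helper and made-up data.

```r
# r-th central moment with divisor n
central_moment <- function(x, r) sum((x - mean(x))^r) / length(x)

x <- c(2, 3, 3, 4, 4, 4, 5, 6, 9, 20)   # made-up data with one extreme value

beta2 <- central_moment(x, 4) / central_moment(x, 2)^2
gamma2 <- beta2 - 3

# Classify according to the interpretation rules above
kind <- if (gamma2 > 0) "leptokurtic" else if (gamma2 < 0) "platykurtic" else "mesokurtic"

print(c(beta2 = beta2, gamma2 = gamma2))
print(kind)
```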

Skewness and Kurtosis using R

Draw a histogram and box plot to get an idea about skewness. Base R has no function that calculates
these measures directly.

Use the summary function to get the mean, Q1, Q2 and Q3. Find sd(x).

Using the formulas, compute SKp and SKb.

Alternatively, load the moments package with library(moments):

skewness(x) gives the value of γ1

kurtosis(x) gives the value of β2
