Statistics Unit1 Notes.docx
The mean (often called the average) is probably the measure of central tendency that you are most
familiar with, but there are others, such as the median and the mode.
The mean, median and mode are all measures of central tendency, but under different conditions,
some measures of central tendency become more appropriate to use than others.
MEAN
The mean is equal to the sum of all the values in the data set divided by the number of values in the
data set.
If x1, x2, …, xn are the n values in the data, the sample mean, denoted by $\bar{x}$, is given by
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$ or $\bar{x} = \frac{\sum x_i}{n}$
Properties of mean:
e.g. Suppose the mean monthly salary of the employees working for a company is Rs. 15000/-. If
each employee gets a monthly raise of Rs.1500/-, then the mean monthly salary after the raise will
be 15000+1500= Rs. 16500
e.g. The mean height of students in a class is 4.8 feet. If the height is expressed in inches, the mean
height would be 12(4.8) = 57.6 inches.
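The two examples above can be checked numerically. The sketch below uses made-up salary and height vectors (the individual values are assumptions chosen only so that the means match the examples) to show that adding a constant to every value shifts the mean by that constant, and multiplying every value by a constant scales the mean by it.

```r
# Hypothetical monthly salaries in Rs. (illustrative values only)
salary <- c(12000, 14000, 15000, 16000, 18000)
mean(salary)           # 15000
mean(salary + 1500)    # 16500: the mean shifts by the raise

# Hypothetical heights in feet (illustrative values only)
height_ft <- c(4.5, 4.7, 4.8, 5.0, 5.0)
mean(height_ft)        # 4.8
mean(height_ft * 12)   # 57.6: the mean is scaled by the conversion factor
```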
Median
The median is the middle score for a set of data that has been arranged in order of magnitude.
Median = value of the $\left(\frac{n+1}{2}\right)^{\text{th}}$ observation in the arranged dataset.
e.g. Consider the data set 65,55,89,56,35,14,56,55,87,44,92. Median=56
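The steps behind this example can be sketched in R as follows: sort the data, locate the (n+1)/2-th position, and compare with the built-in median function. The variable names are illustrative.

```r
x <- c(65, 55, 89, 56, 35, 14, 56, 55, 87, 44, 92)
sorted_x <- sort(x)     # 14 35 44 55 55 56 56 65 87 89 92
n <- length(x)          # 11
pos <- (n + 1) / 2      # 6, the middle position
sorted_x[pos]           # 56
median(x)               # 56, same result from the built-in function
```

When n is even, the position (n+1)/2 is fractional, and median() returns the average of the two middle values.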
The median is less affected by outliers and is the preferred measure of central tendency for skewed
data.
MODE
Normally, the mode is used for categorical data where we wish to know which is the most common
category.
It will not be an appropriate measure of central tendency if the most common value is away from the
rest of the data set.
We can have datasets with more than one mode or datasets with no
mode.
Mean using R
Syntax
mean(x, trim = 0, na.rm = FALSE)
● x is the input vector.
● trim is the fraction of observations dropped from each end of the sorted vector.
● na.rm is used to remove the missing values from the input vector.
x <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
mean(x)
When the trim parameter is supplied, the values in the vector are sorted and the required number
of observations is dropped from each end before the mean is calculated.
When trim = 0.3, floor(0.3n) values from each end are dropped from the calculation of the mean,
where n is the total number of observations.
mean(x,trim=0.3)
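To see what the trimmed mean does, the sketch below reproduces it by hand: sort the vector, drop floor(0.3n) values from each end, and average what is left. The names k and kept are illustrative.

```r
x <- c(2, 1, 2, 3, 1, 2, 3, 4, 1, 5, 5, 3, 2, 3)
n <- length(x)                     # 14
k <- floor(0.3 * n)                # 4 values dropped from each end
kept <- sort(x)[(k + 1):(n - k)]   # 2 2 2 3 3 3
mean(kept)                         # 2.5
mean(x, trim = 0.3)                # 2.5, the same value
```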
Applying NA Option
If there are missing values, then the mean function returns NA.
To drop the missing values from the calculation, use na.rm = TRUE, which means remove the NA
values.
mean(x,na.rm=TRUE)
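A small illustration of the difference, using a made-up vector y that contains one missing value:

```r
y <- c(2, 4, NA, 6)      # illustrative vector with one missing value
mean(y)                  # NA, because the missing value propagates
mean(y, na.rm = TRUE)    # 4, the mean of 2, 4 and 6
```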
Median using R
Syntax
median(x, na.rm = FALSE)
● x is the input vector.
● na.rm is used to remove the missing values from the input vector.
x <- c(2,1,2,3,1,2,3,4,1,5,5,3,2)
med <- median(x)
med
Mode using R
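R does not have a standard built-in function to calculate the mode (the built-in mode() returns the
storage type of an object). So we create a user function to calculate the mode of a data set in R. This
function takes the vector as input and gives the mode value(s) as output.
get_mode <- function(x) {
  xt <- table(x)
  names(xt)[xt == max(xt)]  # value(s) with the highest frequency, as labels
}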
Example:
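As an illustration, here is a self-contained sketch of a mode function that also covers the cases of several modes and no mode. The name mode_values is hypothetical, and treating a data set in which every value occurs equally often as having no mode is a convention assumed here, not something R defines.

```r
# Returns the most frequent value(s) of v, in the same type as v.
# Assumed convention: if all distinct values are equally frequent
# (and there is more than one), the data set has no mode.
mode_values <- function(v) {
  u <- unique(v)
  counts <- tabulate(match(v, u))   # frequency of each distinct value
  if (length(u) > 1 && all(counts == counts[1])) {
    return(u[0])                    # empty vector: no mode
  }
  u[counts == max(counts)]
}

mode_values(c(2, 1, 2, 3, 1, 2, 3, 4, 1, 5, 5, 3, 2))   # 2
mode_values(c(1, 1, 2, 2, 3))                           # 1 2 (two modes)
mode_values(c(1, 2, 3))                                 # numeric(0), no mode
mode_values(c("a", "b", "a"))                           # "a"
```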
Partition Values
Partition Values divide the set of observations into several equal parts.
• Quartiles
• Deciles
• Percentiles
Quartiles
Quartiles are the values that divide a given data set into four equal parts. There are three quartiles:
Q1, Q2 and Q3.
Q1: Lower or first quartile: the value for which 1/4th of the observations are less than or
equal to it.
Q2: Middle or second quartile: the value for which 1/2 of the observations are less than or
equal to it. It is the median of the data set.
Q3: Upper or third quartile: the value for which 3/4th of the observations are less than or
equal to it.
We need to arrange all the observations in the data set in ascending order.
$Q_i$ = value of the $\left(\frac{i(N+1)}{4}\right)^{\text{th}}$ observation, i = 1, 2, 3
Deciles
Deciles are those values that divide any dataset into ten equal parts. Therefore, there are a total of
nine deciles.
Di is the value for which i/10th of the total observations are less than or equal to Di, i = 1, 2, …, 9.
Percentiles
Percentiles divide any given dataset into 100 equal parts. There are a total of 99 percentiles.
Pi is the value for which i/100th of the total observations are less than or equal to Pi, i = 1, 2, …, 99.
Example:
Calculate Q1, D3 and P55 for the dataset given below: 42,45,41,48,50,52,43,44,42,50,56,52,55,43.
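A worked solution may help here. It assumes that deciles and percentiles are located in the same way as quartiles, that is, D_i at position i(N+1)/10 and P_i at position i(N+1)/100 (these two position formulas are not stated in the notes and are an assumption), with linear interpolation between neighbouring observations when the position is fractional. Here x_(k) denotes the kth observation in the arranged data.

```latex
\begin{align*}
&\text{Arranged data } (N = 14):\; 41, 42, 42, 43, 43, 44, 45, 48, 50, 50, 52, 52, 55, 56 \\[4pt]
Q_1 &: \text{position } \tfrac{1(14+1)}{4} = 3.75, &
Q_1 &= x_{(3)} + 0.75\,(x_{(4)} - x_{(3)}) = 42 + 0.75(43 - 42) = 42.75 \\
D_3 &: \text{position } \tfrac{3(14+1)}{10} = 4.5, &
D_3 &= x_{(4)} + 0.5\,(x_{(5)} - x_{(4)}) = 43 + 0.5(43 - 43) = 43 \\
P_{55} &: \text{position } \tfrac{55(14+1)}{100} = 8.25, &
P_{55} &= x_{(8)} + 0.25\,(x_{(9)} - x_{(8)}) = 48 + 0.25(50 - 48) = 48.5
\end{align*}
```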
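If the array of observations is x and we need to compute the kth percentile, use the command below.
The second argument is a probability between 0 and 1, so the kth percentile is requested as k/100.
quantile(x, k/100)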
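Note:
Manual calculations and values from R may not match exactly, since quantile() offers 9 different
algorithms for the calculation of percentiles, selected with the type argument (the default is type = 7).
Using type = 6 gives the method used by SPSS and Minitab.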
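The effect of the type argument can be seen on the example data set. In this sketch, type = 6 places the pth quantile at position p(N+1) in the sorted data, which is the same rule as the i(N+1)/4 quartile formula; the variable names are illustrative.

```r
x <- c(42, 45, 41, 48, 50, 52, 43, 44, 42, 50, 56, 52, 55, 43)

# type = 6: position p * (N + 1), with linear interpolation
q1  <- quantile(x, 0.25, type = 6)   # 42.75
d3  <- quantile(x, 0.30, type = 6)   # 43
p55 <- quantile(x, 0.55, type = 6)   # 48.5

# Default algorithm (type = 7), shown for comparison; values may differ
quantile(x, c(0.25, 0.30, 0.55))
```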
The box and whiskers chart shows you how your data is spread out. Five pieces of information are
generally included in the chart:
• The minimum (the smallest number in the data set), shown at the far left of the chart, at the end
of the left “whisker”.
• The first quartile, Q1, shown at the far left of the box (at the far right of the left whisker).
• The median, Q2, shown as a line inside the box.
• The third quartile, Q3, shown at the far right of the box (at the far left of the right
whisker).
• The maximum (the largest number in the data set), shown at the far right of the chart, at the end
of the right “whisker”.
Outliers
Draw a box-and-whisker plot for the data set {3, 7, 8, 5, 12, 14, 21, 13,18,10,11}.
• Command: boxplot(x)
boxplot(x, y, z) draws the box plots of x, y and z side by side for comparison
It is used to know
• symmetry of data
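For the data set given above, the five values behind the plot can be inspected before drawing it. This is a minimal sketch; note that fivenum() returns Tukey's hinges, which need not equal the quartiles obtained from the i(N+1)/4 formula.

```r
x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18, 10, 11)

# Minimum, lower hinge, median, upper hinge, maximum
fivenum(x)                      # 3.0  7.5 11.0 13.5 21.0

# horizontal = TRUE lays the plot out left to right, so the minimum
# is at the far left and the maximum at the far right
boxplot(x, horizontal = TRUE)
```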
Measures of Dispersion
Statistical dispersion means the extent to which numerical data is likely to vary about an
average value. It helps to understand the distribution of the data.
Variance
It is expressed in the squares of the units in which the original data is given.
It is denoted by $\sigma^2$.
Standard Deviation
It is defined as the square root of the variance and is given by $\sigma = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n}}$
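The formula above divides by n, whereas R's var() and sd() divide by n - 1 (the sample versions), so the two will not agree exactly. The sketch below computes both on a made-up data set; the names pop_var and pop_sd are illustrative.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative data
n <- length(x)

pop_var <- (sum(x^2) - n * mean(x)^2) / n   # 4, variance with divisor n
pop_sd  <- sqrt(pop_var)                    # 2, standard deviation with divisor n

sd(x)                                       # about 2.138, divisor n - 1
sd(x) * sqrt((n - 1) / n)                   # 2, rescaled to the divisor-n version
```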
Coefficient of Variation
It is expressed as a percentage.
It is a useful statistic for comparing the degree of variation from one data set to another, even if the
means are drastically different from one another or units of measurement are different.
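A minimal sketch of this comparison, assuming the usual definition CV = (standard deviation / mean) × 100 with the divisor-n standard deviation. The function name cv and both data sets are made up for illustration.

```r
# Coefficient of variation as a percentage (assumed definition: 100 * sigma / mean)
cv <- function(v) {
  sigma <- sqrt(sum((v - mean(v))^2) / length(v))   # SD with divisor n
  100 * sigma / mean(v)
}

heights_cm <- c(150, 160, 170, 180, 190)   # illustrative heights
weights_kg <- c(50, 60, 70, 80, 90)        # illustrative weights

cv(heights_cm)   # about 8.32
cv(weights_kg)   # about 20.2
```

Both data sets have the same standard deviation, but the weights vary more relative to their mean, which the coefficient of variation makes visible despite the different units.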
If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified to
define the extent to which a distribution differs from a normal distribution.
A distribution can be
Symmetric
Positively skewed
Negatively skewed
Symmetric distribution
Positively Skewed distribution
A positively skewed distribution is the distribution with the tail on its right side. For these
distributions mean > median > mode.
In the box plot of a positively skewed distribution, Q3-Q2 > Q2-Q1.
Negatively Skewed distribution
A negatively skewed distribution is the distribution with the tail on its left side. For these distributions
mean < median < mode.
In the box plot of a negatively skewed distribution, Q3-Q2 < Q2-Q1.
Measures of Skewness
-1 ≤ SKb ≤ +1
Where $\mu_r$ = rth central moment, given by $\mu_r = \frac{\sum (x_i - \bar{x})^r}{n}$,
r = 1, 2, 3, …
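The central moments can be computed directly from their definition. The sketch below also evaluates two common skewness measures, which are assumptions here because the notes do not show their formulas: Bowley's quartile coefficient, (Q3 + Q1 - 2·Q2) / (Q3 - Q1), which lies between -1 and +1, and the moment coefficient µ3 / µ2^(3/2). The data set and all names are illustrative.

```r
# r-th central moment, following the definition above (divisor n)
central_moment <- function(v, r) sum((v - mean(v))^r) / length(v)

x <- c(1, 2, 2, 3, 3, 4, 6, 9, 12)   # illustrative data with a long right tail

mu1 <- central_moment(x, 1)          # 0 up to rounding, as for any data set
mu2 <- central_moment(x, 2)          # the variance with divisor n
mu3 <- central_moment(x, 3)

# Moment coefficient of skewness (assumed form)
gamma1 <- mu3 / mu2^(3 / 2)

# Bowley's coefficient (assumed form), using type = 6 quartiles
q <- unname(quantile(x, c(0.25, 0.50, 0.75), type = 6))   # 2, 3, 7.5
sk_b <- (q[3] + q[1] - 2 * q[2]) / (q[3] - q[1])

gamma1   # positive for this right-tailed data
sk_b     # positive, and between -1 and +1
```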
Interpretation
negative.
Kurtosis
Datasets with high kurtosis tend to have a distinct peak near the mean.
Datasets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.
Types of Kurtosis
• Mesokurtic: Distributions that are moderate in breadth and curves with a medium peaked
height.
Measure of Kurtosis
Interpretation
Draw a histogram and a box plot to get an idea about skewness. Base R has no built-in function that
calculates the measures of skewness and kurtosis directly.
Use the summary function to get the mean, Q1, Q2 and Q3, and use sd(x) to find the standard deviation.
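Putting these steps together, a minimal sketch might look like the following. The data set is made up, and the kurtosis measure µ4 / µ2² (equal to 3 for a normal distribution) is an assumed definition, since the notes do not show the formula.

```r
x <- c(1, 2, 2, 3, 3, 4, 6, 9, 12)   # illustrative data

summary(x)   # minimum, Q1, median, mean, Q3, maximum
sd(x)        # sample standard deviation (divisor n - 1)

hist(x)                          # shape of the distribution
boxplot(x, horizontal = TRUE)    # compare Q3 - Q2 with Q2 - Q1

# Kurtosis from central moments (assumed measure: mu4 / mu2^2)
m <- function(r) sum((x - mean(x))^r) / length(x)
beta2 <- m(4) / m(2)^2
beta2        # compare with 3, the value for a normal distribution
```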