Lec_4 (Summary Data)

The document provides an overview of numerical summary measures in statistics, focusing on methods for data summarization, including measures of central tendency (mean, median, mode) and measures of dispersion (range, interquartile range, variance, standard deviation). It outlines the properties and limitations of these measures, emphasizing their appropriate applications based on data types and distributions. The document aims to equip students with the skills to compute and interpret these summary values effectively.

Uploaded by

Begidu Yilma

Arba-Minch University

College of Medicine and Health Sciences

School of Public Health

Numerical Summary Measures

By: Etenesh K. (BSc, MPH (Epidemiology & Biostatistics))

02/09/2025 1
Learning objectives
At the end of this chapter, the student will be able to:
– Identify the different methods of data summarization
– Compute appropriate summary values for a set of data
– Identify the properties & limitations of summary values

Introduction
• Although frequency distributions serve useful purposes, they do not summarize the data by indicating the average (middle) value and the spread of the values.
• A descriptive measure that summarizes the data by means of a single number is also required.

Cont..
• Descriptive measures can be computed from sample data or from population data. To distinguish between them, we use the following definitions:
– Statistic: a descriptive measure computed from sample data
– Parameter: a descriptive measure computed from population data

Numerical summary measures

• Single numbers that quantify the characteristics of a distribution of values are called summary measures.
– Measures of central tendency (location)
– Measures of dispersion (variability)

Measures of Central Tendency

• Measures that summarize, in a single number or statistic, the point around which the data tend to cluster are called measures of location or measures of central tendency (MCT).
• The objective of calculating an MCT is to determine a single value that may be used to represent the whole data set.

Cont..
• In that sense, a measure of central tendency is an even more compact and concise description of the statistical data than the frequency distribution.
• Since an MCT represents the entire data set, it facilitates comparison within one group or between groups of data.
• The most commonly used measures of central tendency are the mean, median, and mode.

1. Arithmetic mean

• The arithmetic mean is the "average" of the data set.
• It is the most familiar measure of central tendency.
• It is the sum of all the observations in a data set divided by the total number of observations.

A. Ungrouped Data
If the variable x takes n values x1, x2, …, xn, then the mean, x̄, is given by:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

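The definition of the arithmetic mean can be sketched in a few lines of Python. The function name `arithmetic_mean` is an illustrative choice, and the ten values are borrowed from the variance example later in this lecture.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Ten observations taken from the variance example in this lecture.
data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
print(arithmetic_mean(data))  # 56.0
```
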
B. Grouped Data
• In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the midpoint of the interval.
• It is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

• where,
– k = the number of class intervals
– mi = the mid-point of the ith class interval
– fi = the frequency of the ith class interval

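The grouped-data mean (each midpoint weighted by its class frequency) might be sketched as follows. The three-class table is hypothetical and is not the lecture's age table; `grouped_mean` is an illustrative name.

```python
def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    if len(midpoints) != len(frequencies):
        raise ValueError("each class needs one midpoint and one frequency")
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    return sum(m * f for m, f in zip(midpoints, frequencies)) / n


# Hypothetical classes 10-19, 20-29, 30-39 (assumed for illustration only).
midpoints = [14.5, 24.5, 34.5]
frequencies = [2, 5, 3]
print(grouped_mean(midpoints, frequencies))  # 25.5
```
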
Example. Compute the mean age of 169 subjects from the grouped data.
Mean = Σ(mi fi)/Σ fi = 5810.5/169 ≈ 34.38 years

The mean can be thought of as the "balancing point" or "center of gravity" of the data.

Properties of the Arithmetic Mean

1. Can be used for both discrete and continuous data. However, it is not appropriate for either nominal or ordinal data.
2. For a given set of data there is one and only one arithmetic mean.
3. It is easily understood and easy to compute.
4. The algebraic sum of the deviations of the given values from their arithmetic mean is always zero.
5. It is greatly affected by extreme values.

Cont..
Advantages
– It is based on all values given in the distribution.
– It does not ignore any information.
– It is the most easily understood.
– It is the most amenable to algebraic treatment.

2. Median

• In addition to being a measure of central tendency, the median is a measure of position (location).
• The median is the value which divides the data set into two equal parts.
• If the number of values is odd, the median is the middle value when all values are arranged in order of magnitude.

Cont..

• When the number of observations is even, there is no single middle value but two middle observations.
• In this case the median is the mean of these two middle observations, when all observations have been arranged in order of magnitude.

Cont..
E.g. 19 20 20 21 22 23 24 27 27 27 34 (n = 11)

• Median = [(n+1)/2]th value = [(11+1)/2]th value = 6th value = 23

E.g. 19 20 20 21 22 24 27 27 27 34 (n = 10)

• Median = [(n/2)th value + ((n/2)+1)th value]/2 = [5th value + 6th value]/2 = (22 + 24)/2 = 23

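The odd and even cases of the positional rule can be sketched in Python as below, using the two data sets from the examples. The function name `median` is an illustrative choice; it sorts the values first, as the rule requires.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th ordered value.
        return ordered[middle]
    # Even n: mean of the (n / 2)th and ((n / 2) + 1)th ordered values.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = [19, 20, 20, 21, 22, 23, 24, 27, 27, 27, 34]
even_example = [19, 20, 20, 21, 22, 24, 27, 27, 27, 34]
print(median(odd_example))   # 23
print(median(even_example))  # 23.0
```
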
cont..

• The median is a better description (than the mean) of the majority when the distribution is skewed.
Example
– Data: 14, 89, 93, 95, 96
– Skewness is reflected in the outlying low value of 14
– The sample mean is 77.4
– The median is 93

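The effect of the outlying value on the two measures can be checked with Python's standard `statistics` module. This is only a sketch on the five values from the example; dropping the value 14 is an illustrative step that is not part of the lecture.

```python
import statistics

data = [14, 89, 93, 95, 96]
mean_value = statistics.mean(data)
median_value = statistics.median(data)
print(mean_value, median_value)  # 77.4 93

# Dropping the outlying value 14 moves the mean far more than the median.
trimmed = [x for x in data if x != 14]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)
print(trimmed_mean, trimmed_median)  # 93.25 94.0
```
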
Exercise

1. The numbers of rooms in the seven hotels in X town are 713, 300, 618, 595, 311, 401, and 292. Find the median.

Cont..
Solution
• Step 1: Arrange the data in order.

292, 300, 311, 401, 595, 618, 713

• Step 2: Select the middle value (the 4th of the 7 ordered values).

292, 300, 311, [401], 595, 618, 713

Hence, the median is 401 rooms.

Median for Grouped data

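• The first step is to locate the class interval in which the median lies, using the following procedure.
• Find n/2 and identify the first class interval whose cumulative frequency is greater than or equal to n/2.
• Then, use the following formula:

$$\text{Median} = L_m + \left(\frac{n/2 - F_c}{f_m}\right) i$$

where Lm = lower true class boundary of the median class, Fc = cumulative frequency of the class just before the median class, fm = frequency of the median class, and i = class interval width.
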
E.g. Compute the median age of 169 subjects from the grouped data.

Cont..
n/2 = 169/2 = 84.5, which falls in the 3rd class interval
Lower class boundary = 29.5, upper class boundary = 39.5
Class width (i) = 39.5 − 29.5 = 10
Frequency of the class = 47
Fc (cumulative frequency before the median class) = 70
Median = 29.5 + ((84.5 − 70)/47) × 10 = 32.59 ≈ 33

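The interpolation used for the grouped median can be sketched as a small Python function and checked against the figures quoted in the example. The function and parameter names are illustrative, not taken from the slides.

```python
def grouped_median(lower_boundary, n, cum_freq_before, class_freq, width):
    """Interpolate the median inside the median class of a frequency table."""
    if class_freq <= 0:
        raise ValueError("the median class must have a positive frequency")
    return lower_boundary + ((n / 2 - cum_freq_before) / class_freq) * width


# Values quoted in the example: L = 29.5, n = 169, Fc = 70, f = 47, i = 10.
median_age = grouped_median(29.5, 169, 70, 47, 10)
print(round(median_age, 2))  # 32.59
```
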
Properties of median

1. Can be used for ordinal, discrete and continuous data. However, it is not appropriate for nominal data.
2. There is only one median for a given set of data.
3. The median is easy to calculate.
4. The median is a positional average and hence is not drastically affected by extreme values.
5. It is not a good representative of the data if the number of items is small.

3. Mode

• It is the value (observation) which occurs most frequently.
• It is possible to have more than one mode or no mode.
– Unimodal: a distribution with one mode
– Bimodal: a distribution with two modes
– Trimodal: a distribution with three modes

cont..
[Figure: chart illustrating the mode; vertical axis from 0 to 20]

Cont..
Example-1
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 ("unimodal")

Example-2
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different

Cont..
Example-3
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes: 2 and 5
• This distribution is said to be "bimodal"

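The three examples can be reproduced with a short Python sketch built on `collections.Counter`. The function name `modes` is illustrative. Returning an empty list when every value occurs exactly once follows the convention used in Example-2 and is an assumption about how "no mode" should be represented.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs once: treated here as "no mode".
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6]))           # [4]
print(modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12]))     # []
print(modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8]))     # [2, 5]
```
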
Mode of grouped data

• To find the mode of grouped data, we usually refer to the modal class, which is the class interval with the highest frequency.
• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.

E.g. Find the mode for the following data

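Cont..

Solution
Lmo = 19.5, F = 66, Fa = 4, Fb = 47, i = 10
(Lmo = lower boundary of the modal class, F = its frequency, Fa and Fb = frequencies of the classes just before and just after it, i = class width)
Mode = Lmo + [(F − Fa)/((F − Fa) + (F − Fb))] × i
= 19.5 + [(66 − 4)/((66 − 4) + (66 − 47))] × 10
= 19.5 + (62/81) × 10 ≈ 27.2 ≈ 27
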
Properties of mode

1. Can be used for nominal, ordinal, discrete and continuous data. However, it is more appropriate for nominal and ordinal data.
2. It is not affected by extreme values.
3. Often its value is not unique.
4. The main drawback of the mode is that often it does not exist.

Which measure of central tendency is best for a given set of data?

• An investigator may naturally ask which measure of central tendency is best to use with the data.
• Two factors are important in making this decision: the scale of measurement of the data and the shape of its distribution.

Cont..

• The arithmetic mean is used for interval and ratio data with a symmetric distribution.
• The median and quartiles are used for ordinal data and for interval and ratio data whose distribution is skewed.
• For nominal data, the mode is the appropriate MCT.
• For discrete or continuous data, the "modal class" can be used.

cont..
Symmetric and unimodal distribution: the mean, median, and mode should all be approximately the same.
[Figure: symmetric curve with the mean, median, and mode marked together]

cont..
Bimodal: the mean and median should be about the same, but may take a value that is unlikely to occur; reporting the two modes might be best.

cont..
Skewed to the right (positively skewed): the mean is sensitive to extreme values, so the median might be more appropriate.
[Figure: right-skewed curve with, from left to right, the mode, median, and mean]

cont..
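Skewed to the left (negatively skewed): as with right skew, the mean is pulled toward the extreme values, so the median might be more appropriate.
[Figure: left-skewed curve with, from left to right, the mean, median, and mode]
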
cont..
• When the data are skewed, the mean is "dragged" in the direction of the skewness.
• In extreme cases, all but one of the sample points can lie on one side of the arithmetic mean. In this case, the mean is a poor measure of central location and does not reflect the center of the sample.

Measures of variation (dispersion)

• Measures that quantify the variation or dispersion of a set of data around its central location.
• The dispersion of a set of observations is the variety exhibited by the observations:
1. If all the values are the same → there is no dispersion
2. If the values differ → there is dispersion
3. If the values are close to each other → the amount of dispersion is small
4. If the values are widely scattered → the dispersion is greater

Cont..
Common measures of dispersion
1. Range
2. Inter quartile range
3. Variance
4. Standard deviation
5. Coefficient of variation

Range
• The difference between the largest and smallest observations in a sample.

Range = Maximum value – Minimum value

Example:

• Data values: 5, 9, 12, 16, 23, 34, 37, 42

Range = 42 − 5 = 37

• The range depends on only two values.
• A data set with a higher range exhibits more variability.

Cont..
Properties of range

• It is the simplest, crude measure and can be easily understood.
• It takes into account only two values, which makes it a poor measure of dispersion.
• It is very sensitive to extreme outliers.

Inter-quartile range (IQR)
• Just as the median is the value above and below which half of the data set lies, one can define measures above or below which other fractional parts of the data lie.
• The median divides the data into two equal parts.
• Quartiles divide the data into four equal parts.

Cont..

a) The first quartile (Q1): 25% of the observations are less than or equal to the first quartile and 75% of the observations are greater than or equal to it.

b) The second quartile (Q2): 50% of all the ranked observations are less than or equal to Q2.
– The second quartile is the median.

Cont..
c) The third quartile (Q3): 75% of the observations are less than or equal to the third quartile and 25% of the observations are greater than or equal to it.

d) The inter-quartile range is the difference between the third and the first quartiles.

Cont..
• The IQR is used when the median is used as the measure of central tendency.
• It gives the range in which the middle 50% of the distribution lies.
• The inter-quartile range quantifies the difference between the third and first quartiles.

IQR = Q3 − Q1

• A large IQR indicates a large amount of variability among the middle 50% of the observations, and a small IQR indicates a small amount of variability.

Cont..

To identify the position of the quartiles:

• 1st quartile = the {(n+1)/4}th observation
• 2nd quartile = the {(n+1)/2}th observation
• 3rd quartile = the {3(n+1)/4}th observation
• Interquartile range (IQR) = Q3 − Q1

Cont..
E.g. 1: Given the data 13, 7, 9, 15, 11, 5, 8, 4, find the IQR.
a. Arrange the observations in increasing order:
4, 5, 7, 8, 9, 11, 13, 15
b. Find the positions of the 1st and 3rd quartiles (n = 8).
• Position of Q1 = ¼(n+1) = ¼(8+1) = 2.25th
• Q1 lies between the 2nd and 3rd observations.
• Position of Q3 = ¾(n+1) = ¾(8+1) = 6.75th
• Q3 lies between the 6th and 7th observations.

Cont..
c. Identify the values of the 1st and 3rd quartiles.
• The value of Q1 is equal to the value of the 2nd observation plus one-fourth of the difference between the values of the 3rd and 2nd observations:
• Value of the 3rd observation = 7
• Value of the 2nd observation = 5

Q1 = 5 + 1/4 (7 − 5) = 5 + 2/4 = 5.5

Cont..
• The value of Q3 is equal to the value of the 6th observation plus three-fourths of the difference between the values of the 7th and 6th observations:
• Value of the 7th observation = 13
• Value of the 6th observation = 11

Q3 = 11 + 3/4 (13 − 11) = 11 + 6/4 = 12.5

Cont..

d. Calculate the inter-quartile range.

Q3 = 12.5; Q1 = 5.5
IQR = Q3 − Q1 = 12.5 − 5.5 = 7

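The (n + 1) position rule with linear interpolation between neighbouring observations can be sketched in Python as follows. The function name `quartile_by_position` is illustrative, and clamping to the smallest or largest value when the position falls outside the data is an assumption that the slides do not address.

```python
def quartile_by_position(values, fraction):
    """Value at position fraction * (n + 1) in the ordered data, interpolated."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("quartiles of an empty data set are undefined")
    position = fraction * (n + 1)   # 1-based position, e.g. 2.25
    whole = int(position)           # observation just below the position
    remainder = position - whole    # fractional part, e.g. 0.25
    if whole < 1:
        return ordered[0]           # assumed behaviour below the first value
    if whole >= n:
        return ordered[-1]          # assumed behaviour above the last value
    lower = ordered[whole - 1]
    upper = ordered[whole]
    return lower + remainder * (upper - lower)


data = [13, 7, 9, 15, 11, 5, 8, 4]
q1 = quartile_by_position(data, 0.25)
q3 = quartile_by_position(data, 0.75)
print(q1, q3, q3 - q1)  # 5.5 12.5 7.0
```
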
E.g. 2: Suppose we have a small data set of twelve observations:
15 18 19 20 20 20 21 23 23 24 24 25
• We want to divide the data into four equal sets.
• First, we find the median:
15 18 19 20 20 20 ↑ (median) 21 23 23 24 24 25
• The median = 20.5 (halfway between the 6th and 7th observations) divides the data into two equal sets with exactly 50% of the observations in each: the 1st to the 6th observations in the first set and the 7th to the 12th observations in the other.

Cont..
• To find the first quartile, we consider the observations less than the median:
15 18 19 20 20 20
• The first quartile is the median of these data.
• In this case, the first quartile is halfway between the 3rd and 4th observations and is equal to 19.5.

Cont..
• Now, we consider the observations which are greater than the median:

21 23 23 24 24 25

• The third quartile is the median of these data and is equal to 23.5.

15 18 19 ↑(Q1) 20 20 20 ↑(Q2) 21 23 23 ↑(Q3) 24 24 25

IQR = Q3 − Q1 = 23.5 − 19.5 = 4

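The "median of each half" approach used in this example might be sketched as below. The function name is illustrative, and leaving the middle observation out of both halves when n is odd is an assumption, since the example only covers an even n. This method and the (n + 1) position rule can give slightly different quartiles for the same data.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two observations are needed")
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2:]   # skips the middle value when n is odd
    return (
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
    )


data = [15, 18, 19, 20, 20, 20, 21, 23, 23, 24, 24, 25]
q1, q2, q3 = quartiles_by_halves(data)
print(q1, q2, q3, q3 - q1)  # 19.5 20.5 23.5 4.0
```
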
Quartiles for grouped data

Apply the same method as for the median of grouped data:

Q1 = LQ1 + ((n/4 − Fc)/fQ1) × i
Q3 = LQ3 + ((3n/4 − Fc)/fQ3) × i

where
LQ = lower true class boundary of the interval containing the quartile
Fc = cumulative frequency of the interval just above (preceding) the quartile class interval
fQ = frequency of the interval containing the quartile
i = class interval width
n = total number of observations

To find the class containing each quartile, locate n/4 (for Q1) and 3n/4 (for Q3) in the cumulative frequencies.

IQR = Q3 − Q1

Properties of IQR

• It is a simple and versatile measure.
• It encloses the central 50% of the observations.
• It is not based on all observations but only on two specific values.
• It is important in selecting cut-off points in the formulation of clinical standards.
• Since it excludes the lowest and highest 25% of values, it is not affected by outliers.
• It is less sensitive to the size of the sample.

Variance (σ², s²)

• The variance is the average of the squares of the deviations taken from the mean.
• A good measure of dispersion should make use of all the data.
• The variance achieves this by averaging the sum of the squares of the deviations from the mean.
• The sample variance of the set x1, x2, ..., xn of n observations with mean x̄ is

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

cont..
• The deviations are squared because the sum of the deviations of the individual observations of a sample about the sample mean is always 0:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

• The variance can be thought of as an average of squared deviations.

Cont..

• The denominator is not the sample size (n), because when the population is large and the sample is small (usually less than 30), the variance computed by dividing by n usually underestimates the population variance.

Cont..

• Therefore, instead of dividing by n, find the variance of the sample by dividing by n − 1, giving a slightly larger value and an unbiased estimate of the population variance.

Cont..
Degrees of freedom
• In computing the variance there are (n − 1) degrees of freedom, because only (n − 1) of the deviations are independent of each other.
• This is because the deviations from the mean, (xi − x̄), must add to zero.
• The last deviation can always be calculated from the others (it is not free to vary).

Cont..

Example
Data: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50.
Find the sample variance of the data (mean = 56).
s² = [(43 − 56)² + (66 − 56)² + … + (50 − 56)²]/(10 − 1)
= 810/9 = 90

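The worked example can be checked with a short Python sketch that also confirms the deviations from the mean add to zero. The function name `sample_variance` is illustrative. Python's standard `statistics.variance` uses the same n − 1 denominator.

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    return sum(d * d for d in deviations) / (n - 1)


data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
mean = sum(data) / len(data)
deviation_sum = sum(x - mean for x in data)
print(deviation_sum)           # 0.0
print(sample_variance(data))   # 90.0
```
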
Cont..
Variance for grouped data

$$s^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{\sum_{i=1}^{k} f_i - 1}$$

where

mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval
x̄ = the sample mean
k = the number of class intervals

Ex. Compute the variance of the age of 169 subjects from the grouped data.
Mean = 5810.5/169 = 34.38 years
s² = 20199.22/(169 − 1) = 120.23

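A grouped-data variance might be sketched in Python as below. The three-class table is hypothetical (the lecture's age table is not reproduced here), and `grouped_variance` is an illustrative name. The last two lines simply recompute the quotients from the totals quoted in the example.

```python
def grouped_variance(midpoints, frequencies):
    """Sample variance from class midpoints and frequencies (n - 1 denominator)."""
    n = sum(frequencies)
    if n < 2:
        raise ValueError("total frequency must be at least 2")
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    squared = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))
    return squared / (n - 1)


# Hypothetical classes 10-19, 20-29, 30-39 (assumed for illustration only).
hypothetical = grouped_variance([14.5, 24.5, 34.5], [2, 5, 3])
print(round(hypothetical, 2))  # 54.44

# Totals quoted in the lecture example.
lecture_mean = 5810.5 / 169
lecture_variance = 20199.22 / (169 - 1)
print(round(lecture_mean, 2))      # 34.38
print(round(lecture_variance, 2))  # 120.23
```
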
Properties of variance
• The main disadvantage of the variance is that its unit is the square of the unit of the original measurement values.
• The variance gives more weight to extreme values than to those near the mean, because the differences are squared.
• The drawbacks of the variance are overcome by the standard deviation.

Standard Deviation

• Shows variation about the mean.
• It is the square root of the variance.
• This produces a measure having the same scale as that of the individual values.
• It is the most commonly used measure of dispersion.

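Taking the square root of the sample variance can be sketched in Python as follows, reusing the ten values from the variance example (whose variance is 90). The function name `sample_sd` is illustrative.

```python
import math


def sample_sd(values):
    """Square root of the sample variance (n - 1 denominator)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)


data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
sd = sample_sd(data)
print(round(sd, 2))  # 9.49, in the same units as the data
```
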
Properties of SD

• Has the advantage of being expressed in the same units of measurement as the mean.
• It is the best measure of dispersion and is widely used because of the properties of the theoretical normal curve.
• However, if the units of measurement of the variables in two data sets are not the same, then their variability cannot be compared by comparing the values of the SD.

Coefficient of variation (CV)

• When two data sets have different units of measurement, or their means differ sufficiently in size, the CV should be used as a measure of dispersion.
• It is the best measure for comparing the variability of two sets of observations.
• Data with a smaller coefficient of variation are considered more consistent.

Cont..

• The CV is the ratio of the SD to the mean, multiplied by 100: CV = (SD / mean) × 100%
• It measures relative variation.
• It is always expressed as a percentage (%).
• It shows variation relative to the mean.
• The CV is also used to compare two or more sets of data measured in different units.

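A comparison of relative variability using the CV could be sketched as below. `lecture_data` reuses the ten values from the variance example; `weights_kg` is a hypothetical data set in different units, invented only for this illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Sample SD expressed as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


lecture_data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
weights_kg = [60, 62, 65, 70, 73]  # hypothetical values

cv_lecture = coefficient_of_variation(lecture_data)
cv_weights = coefficient_of_variation(weights_kg)
print(round(cv_lecture, 1))  # 16.9
print(round(cv_weights, 1))  # 8.2
```

Under these assumed values, the hypothetical weights show less relative variation than the lecture data, even though the two sets are measured in different units.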
Distributions

Normal distribution
• It is symmetric about its mean: one half of the curve is the mirror image of the other half.
• The highest point is at its mean.
• The height of the curve decreases as one moves away from the mean in either direction, approaching, but never reaching, zero.

Cont..
Skewed distributions
• The data are not distributed symmetrically in skewed distributions.
• The mean, median, and mode are not equal and are in different positions.
• Scores are clustered at one end of the distribution.
• A small number of extreme values are located at the opposite end.

Cont..
• Skew is always toward the direction of the longer tail.

Positively skewed distribution
• Occurs when the majority of scores are at the left end of the curve and a few extreme large scores are scattered at the right end.

Negatively skewed distribution
• Occurs when the majority of scores are at the right end of the curve and a few extreme small scores are scattered at the left end.

Which measures to use?

• When the distribution is symmetric and unimodal, summarize the data using the mean and standard deviation.
• When the data are skewed, it is preferable to use the median and quartiles as summary statistics.
• Unlike means and standard deviations, the median and quartiles are not easily influenced by extreme values in a skewed distribution.

Thank you!!!
