
Summarizing Data-Measures of Dispersion

The document discusses various measures of dispersion used to numerically summarize data, including range, interquartile range, variance, standard deviation, and coefficient of variation. It also covers measures of distribution shape such as skewness, which indicates the symmetry of a distribution, and z-scores for determining outliers. The measures of dispersion and distribution shape help analyze the variability and properties of a data set beyond just measures of central tendency.

Uploaded by

Rakesh Choudhary
Copyright
© All Rights Reserved

Summarizing Data: Measures of Dispersion

Prepared by:
Dr Vijendra Singh
Department of Informatics
School of Computer Science, UPES
Numerically Summarizing Data

Dispersion
Measures of Variability

• It is often desirable to consider measures of variability (dispersion), as well as measures of location.
• For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
Measures of Variability
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation
Range

• The range of a data set is the difference between the largest and smallest data values.
• It is the simplest measure of variability.
• It is very sensitive to the smallest and largest data values.
Range

Range = largest value - smallest value


Range = 615 - 425 = 190

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
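The range calculation above can be checked with a few lines of Python. This is only an illustrative sketch; the names `values` and `data_range` are chosen here and do not come from the slides.

```python
# The 70 data values listed above, in ascending order.
values = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]


def data_range(data):
    """Range = largest value - smallest value."""
    return max(data) - min(data)


rent_range = data_range(values)
print(rent_range)  # 615 - 425
```

Because only the two extreme values enter the calculation, changing any single extreme observation changes the range directly, which is the sensitivity noted above.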
Interquartile Range

• The interquartile range of a data set is the difference between the third quartile and the first quartile.
• It is the range for the middle 50% of the data.
• It overcomes the sensitivity to extreme data values.
Interquartile Range
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
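The quartiles quoted above can be reproduced with a short sketch. The slides do not state which quartile convention they use, so the rule below is an assumption: compute the position i = (p/100)·n, round up when i is fractional, and average the i-th and (i+1)-th values when i is a whole number. It was chosen because it gives Q1 = 445 and Q3 = 525 for this data; other conventions, such as linear interpolation, give slightly different quartiles. The function name `percentile` is hypothetical.

```python
import math

values = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]


def percentile(sorted_values, p):
    """p-th percentile (0 < p < 100) using the assumed position rule."""
    n = len(sorted_values)
    i = p * n / 100
    if i != int(i):
        # Fractional position: round up and take that observation (1-based).
        return sorted_values[math.ceil(i) - 1]
    # Whole position: average the i-th and (i+1)-th observations (1-based).
    i = int(i)
    return (sorted_values[i - 1] + sorted_values[i]) / 2


q1 = percentile(values, 25)
q3 = percentile(values, 75)
iqr = q3 - q1
print(q1, q3, iqr)
```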
Variance

The variance is a measure of variability that utilizes all the data.

It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
Variance

The variance is the average of the squared differences between each data value and the mean.

The variance is computed as follows:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{(for a sample)} \qquad \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \quad \text{(for a population)}$$
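As a rough illustration of the two variance formulas, the sketch below applies them to the 70 data values shown earlier and cross-checks the results against Python's standard `statistics` module. The variable names are chosen here for illustration only.

```python
import statistics

values = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean, shared by both formulas.
sum_sq_dev = sum((x - mean) ** 2 for x in values)

sample_variance = sum_sq_dev / (n - 1)   # treats the data as a sample
population_variance = sum_sq_dev / n     # treats the data as a full population

print(round(sample_variance, 2), round(population_variance, 2))

# Cross-check against the standard library implementations.
assert abs(sample_variance - statistics.variance(values)) < 1e-6
assert abs(population_variance - statistics.pvariance(values)) < 1e-6
```

The sample version divides by n - 1 rather than n, so it is always slightly larger than the population version for the same data.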
Standard Deviation

The standard deviation of a data set is the positive square root of the variance.

It is measured in the same units as the data, making it more easily interpreted than the variance.
Standard Deviation

The standard deviation is computed as follows:

$$s = \sqrt{s^2} \quad \text{(for a sample)} \qquad \sigma = \sqrt{\sigma^2} \quad \text{(for a population)}$$
Coefficient of Variation

The coefficient of variation indicates how large the standard deviation is in relation to the mean.

The coefficient of variation is computed as follows:

$$\left( \frac{s}{\bar{x}} \times 100 \right)\% \quad \text{(for a sample)} \qquad \left( \frac{\sigma}{\mu} \times 100 \right)\% \quad \text{(for a population)}$$
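The standard deviation and coefficient of variation can be sketched in the same way. The example below treats the 70 data values shown earlier as a sample (an assumption, consistent with the sample standard deviation of 54.74 used later in the slides) and uses `statistics.stdev`, which divides by n - 1.

```python
import statistics

values = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

sample_mean = statistics.mean(values)
sample_std = statistics.stdev(values)  # positive square root of the sample variance

# Coefficient of variation: standard deviation as a percentage of the mean.
cv_percent = sample_std / sample_mean * 100

print(round(sample_std, 2), round(cv_percent, 2))
```

Because the coefficient of variation is a ratio, it has no units, which makes it convenient for comparing the variability of data sets measured on different scales.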
Descriptive Statistics: Numerical Measures
• Measures of Distribution Shape, Relative Location, and Detecting Outliers
Measures of Distribution Shape and Relative Location

• Distribution Shape
• z-Scores
• Detecting Outliers
Distribution Shape: Skewness
• An important measure of the shape of a distribution is called skewness.
• The formula for computing skewness for a data set is somewhat complex.
• Skewness can be easily computed using statistical software.
Skewness: a bit of history

Empirical relationship between location measures (approximate, for moderately skewed distributions):

mean − mode ≈ 3(mean − median)

Coefficient of skewness (independent of measurement units):

$$sk = \frac{\bar{x} - x_M}{\sigma}$$

Combining both:

$$sk = \frac{3(\bar{x} - \text{median})}{\sigma}$$

• We will be using it.

Karl Pearson (1857-1938)
$x_M$ is the mode, a value that occurs most frequently in the sample or population.
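As a hedged illustration of the combined coefficient above, 3(mean − median)/σ, the sketch below evaluates it for the 70 data values shown earlier. Using the sample standard deviation (`statistics.stdev`) for σ is an assumption; the slide does not say which standard deviation is intended.

```python
import statistics

values = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]


def pearson_median_skewness(data):
    """Pearson's coefficient based on the median: 3 * (mean - median) / std."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    std = statistics.stdev(data)  # assumption: sample standard deviation
    return 3 * (mean - median) / std


sk_pearson = pearson_median_skewness(values)
print(round(sk_pearson, 3))
```

The result is positive because the mean (490.8) lies above the median (475), which is what one expects for data with a longer right tail.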
Skewness formulas

Skewness: the sum of cubed deviations from the mean value, divided by the cubed standard deviation.

$$sk = \frac{\sum_{i=1}^{T} (x_i - \bar{x})^3}{\sigma^3}$$

where $\bar{x}$ is the mean, $\sigma$ is the standard deviation, and $T$ is the number of data points. The conventional moment coefficient of skewness additionally divides this sum by $T$.
Skewness formulas

Adjusted Fisher-Pearson standardised moment coefficient:

$$sk = \frac{T}{(T-1)(T-2)} \sum_{i=1}^{T} \frac{(x_i - \bar{x})^3}{\sigma^3}$$

where $\bar{x}$ is the mean, $\sigma$ is the standard deviation, and $T$ is the number of data points.
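The adjusted Fisher-Pearson coefficient above can be sketched in a few lines. The example assumes that the standard deviation in the formula is the sample standard deviation (n - 1 denominator); the function name is hypothetical. Applied to the 70 data values shown earlier, it gives a value close to the .92 quoted for the apartment rents elsewhere in these slides, although the slides do not say which formula produced that figure.

```python
import statistics

values = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]


def adjusted_skewness(data):
    """Adjusted Fisher-Pearson coefficient: T/((T-1)(T-2)) * sum(((x - mean)/s)^3)."""
    t = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # assumption: sample standard deviation
    cubed_z_sum = sum(((x - mean) / s) ** 3 for x in data)
    return t / ((t - 1) * (t - 2)) * cubed_z_sum


sk_adjusted = adjusted_skewness(values)
print(round(sk_adjusted, 2))
```

The formula needs at least three data points, since the factor (T - 1)(T - 2) is zero for T = 1 or T = 2.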
Galton skewness (also known as Bowley's skewness) is defined as

Galton skewness=(Q1 + Q3 −2Q2 )/(Q3 − Q1)

where Q1 is the lower quartile, Q3 is the upper quartile, and Q2 is the median.
Skewness Example
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
2nd Quartile (Q2) = 475

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Galton skewness=(Q1 + Q3 −2Q2 )/(Q3 − Q1)


=(445+525-2*475)/(525-445)
=20/80
=0.25
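The Galton (Bowley) calculation above translates directly into a small helper. This is an illustrative sketch; the function name `galton_skewness` is chosen here and the quartiles are passed in as given on the slide.

```python
def galton_skewness(q1, q2, q3):
    """Galton (Bowley) skewness: (Q1 + Q3 - 2*Q2) / (Q3 - Q1)."""
    if q3 == q1:
        raise ValueError("Q3 and Q1 are equal, so the measure is undefined")
    return (q1 + q3 - 2 * q2) / (q3 - q1)


# Quartiles quoted on the slide.
sk_galton = galton_skewness(q1=445, q2=475, q3=525)
print(sk_galton)
```

Because the numerator can never exceed the interquartile range in absolute value, this measure always lies between -1 and +1, and it depends only on the quartiles, so extreme values do not affect it.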
Distribution Shape: Skewness

• Symmetric (not skewed)
  • Skewness is zero.
  • Mean and median are equal.

[Figure: relative frequency histogram of a symmetric distribution, Skewness = 0]
Distribution Shape: Skewness

• Moderately Skewed Left
  • Skewness is negative.
  • Mean will usually be less than the median.

[Figure: relative frequency histogram of a moderately left-skewed distribution, Skewness = -.31]
Distribution Shape: Skewness

• Moderately Skewed Right
  • Skewness is positive.
  • Mean will usually be more than the median.

[Figure: relative frequency histogram of a moderately right-skewed distribution, Skewness = .31]
Distribution Shape: Skewness

• Highly Skewed Right
  • Skewness is positive (often above 1.0).
  • Mean will usually be more than the median.

[Figure: relative frequency histogram of a highly right-skewed distribution, Skewness = 1.25]
Distribution Shape: Skewness

• Example: Apartment Rents

Seventy efficiency apartments were randomly sampled in a small college town. The monthly rent prices for these apartments are listed in ascending order on the next slide.
Distribution Shape: Skewness

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Distribution Shape: Skewness

[Figure: relative frequency histogram of the apartment rents, Skewness = .92]
z-Scores

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value $x_i$ is from the mean.

Sample z-score:

$$z_i = \frac{x_i - \bar{x}}{s}$$

Population z-score:

$$z_i = \frac{x_i - \mu}{\sigma}$$
z-Scores

• An observation’s z-score is a measure of the relative location of the observation in a data set.
• A data value less than the sample mean will have a z-score less than zero.
• A data value greater than the sample mean will have a z-score greater than zero.
• A data value equal to the sample mean will have a z-score of zero.
z-Scores
• z-Score of Smallest Value (425)

$$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$$

Standardized Values for Apartment Rents

-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
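The table of standardized values can be regenerated with a short sketch. It assumes the sample formula z = (x - mean)/s with the sample standard deviation, which matches the worked value of -1.20 for the smallest rent; the variable names are illustrative.

```python
import statistics

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

mean = statistics.mean(rents)
s = statistics.stdev(rents)  # sample standard deviation

# One z-score per observation, in the same order as the data.
z_scores = [(x - mean) / s for x in rents]

# Print in rows of ten, mirroring the layout of the table above.
for start in range(0, len(z_scores), 10):
    print(" ".join(f"{z:5.2f}" for z in z_scores[start:start + 10]))
```

By construction the z-scores have mean zero, so negative entries correspond to rents below the sample mean and positive entries to rents above it.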
EXAMPLE Using Z-Scores

The mean height of males 20 years or older is 69.1 inches with a standard deviation of 2.8 inches. The mean height of females 20 years or older is 63.7 inches with a standard deviation of 2.7 inches. Data based on information obtained from the National Health and Nutrition Examination Survey.
Who is relatively taller: Shaquille O’Neal, whose height is 85 inches, or Lisa Leslie, whose height is 77 inches?
Answer:

• Shaquille O’Neal's z-score: (85 - 69.1)/2.8 = 5.68
• Lisa Leslie's z-score: (77 - 63.7)/2.7 = 4.93

Because O’Neal's z-score is greater than Leslie's, O’Neal is relatively taller: his height lies further above the mean of his group, in standard deviations, than Leslie's does above the mean of hers.
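The comparison above can be written as a small population z-score helper. This is a sketch using the means and standard deviations quoted in the example; the function name `z_score` is chosen here for illustration.

```python
def z_score(value, mean, std):
    """Number of standard deviations that value lies from mean."""
    return (value - mean) / std


# Heights in inches, with the group means and standard deviations from the example.
z_oneal = z_score(85, mean=69.1, std=2.8)
z_leslie = z_score(77, mean=63.7, std=2.7)

print(round(z_oneal, 2), round(z_leslie, 2))
print("O'Neal" if z_oneal > z_leslie else "Leslie", "is relatively taller")
```

Standardizing each height against its own group is what makes the two values comparable, even though the raw heights come from distributions with different means and spreads.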
Empirical Rule
For data having a bell-shaped distribution:

Approximately 68.27% of the values of a normal random variable are within +/- 1 standard deviation of its mean.

Approximately 95.45% of the values of a normal random variable are within +/- 2 standard deviations of its mean.

Approximately 99.73% of the values of a normal random variable are within +/- 3 standard deviations of its mean.
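The percentages in the Empirical Rule come from the normal distribution. As a quick check, the probability that a normal random variable falls within k standard deviations of its mean equals erf(k/√2), which Python's `math.erf` can evaluate. The helper name below is hypothetical; the exact values round to 68.27%, 95.45% and 99.73%.

```python
import math


def prob_within(k):
    """P(|Z| <= k) for a standard normal variable Z, via the error function."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within +/- {k} standard deviations: {prob_within(k) * 100:.2f}%")
```

These figures apply exactly only to a normal distribution; for real data that are merely bell-shaped they are approximations.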
EXAMPLE Using the Empirical Rule

The following data represent the serum HDL cholesterol of the 54 female patients of a family doctor.
41 48 43 38 35 37 44 44 44
62 75 77 58 82 39 85 55 54
67 69 69 70 65 72 74 74 74
60 60 60 61 62 63 64 64 64
54 54 55 56 56 56 57 58 59
45 47 47 48 48 50 52 52 53

(a) Compute the population mean and standard deviation.
(b) Draw a histogram to verify the data is bell-shaped.
(c) Determine the percentage of patients that have serum HDL within 3 standard deviations of the mean according to the Empirical Rule.
(d) Determine the percentage of patients that have serum HDL between 34 and 80.8 according to the Empirical Rule.
(e) Determine the actual percentage of patients that have serum HDL between 34 and 80.8.
(a) Using a TI-83 Plus graphing calculator, we find

$\mu = 57.4$ and $\sigma = 11.7$

(b) [Figure: histogram of the serum HDL cholesterol data]

(c) According to the Empirical Rule, approximately 99.7% of the patients will have serum HDL cholesterol levels within 3 standard deviations of the mean. That is, approximately 99.7% of the patients will have serum HDL cholesterol levels greater than or equal to 57.4 - 3(11.7) = 22.3 and less than or equal to 57.4 + 3(11.7) = 92.5.

(d) Because 34 is 2 standard deviations below the mean (57.4 - 2(11.7) = 34) and 80.8 is 2 standard deviations above the mean (57.4 + 2(11.7) = 80.8), the Empirical Rule states that approximately 95% of the data will lie between 34 and 80.8.
(e) There are no observations below 34. There are 2 observations greater than 80.8. Therefore, 52/54 = 96.3% of the data lie between 34 and 80.8.
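Parts (a), (d) and (e) can be reproduced with a short sketch in place of the graphing calculator. It treats the 54 patients as a population, as the exercise does, and so uses `statistics.pstdev` (N denominator); the variable names are illustrative.

```python
import statistics

hdl = [
    41, 48, 43, 38, 35, 37, 44, 44, 44,
    62, 75, 77, 58, 82, 39, 85, 55, 54,
    67, 69, 69, 70, 65, 72, 74, 74, 74,
    60, 60, 60, 61, 62, 63, 64, 64, 64,
    54, 54, 55, 56, 56, 56, 57, 58, 59,
    45, 47, 47, 48, 48, 50, 52, 52, 53,
]

# (a) Population mean and standard deviation.
mu = statistics.mean(hdl)
sigma = statistics.pstdev(hdl)
print(round(mu, 1), round(sigma, 1))

# (d)/(e) Interval of 2 standard deviations around the mean and the actual share inside it.
lower = mu - 2 * sigma
upper = mu + 2 * sigma
inside = [x for x in hdl if lower <= x <= upper]
actual_percent = len(inside) / len(hdl) * 100
print(round(lower, 1), round(upper, 1), round(actual_percent, 1))
```

The observed share is close to the roughly 95% suggested by the Empirical Rule, which is consistent with the data being approximately bell-shaped.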
Detecting Outliers

• An outlier is an unusually small or unusually large value in a data set.
• A data value with a z-score less than -3 or greater than +3 might be considered an outlier.
• It might be:
  • an incorrectly recorded data value
  • a data value that was incorrectly included in the data set
  • a correctly recorded data value that belongs in the data set
Detecting Outliers

• The most extreme z-scores are -1.20 and 2.27.
• Using |z| > 3 as the criterion for an outlier, there are no outliers in this data set.
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
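The |z| > 3 criterion can be sketched as a small filter. The function name `find_outliers` and its `threshold` parameter are chosen here for illustration, and the sample standard deviation is assumed, as in the standardized values above.

```python
import statistics

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]


def find_outliers(data, threshold=3.0):
    """Return the values whose z-score exceeds the threshold in absolute value."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # assumption: sample standard deviation
    return [x for x in data if abs((x - mean) / s) > threshold]


outliers = find_outliers(rents)
print(outliers)  # empty list when no value has |z| > 3
```

A flagged value is only a candidate for review: as the list above notes, it may be a recording error, a value that does not belong in the data set, or a legitimate observation.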
