Lecture 3 Statistics

Uploaded by

phen zanuth
Chapter 3

Data Description

Danet Hak, PhD December 3, 2024


Course objective
After completing this chapter, you should be able to:
• Summarize data, using measures of central tendency, such as the mean, median, mode, and midrange.
• Describe data, using measures of variation, such as the range, variance, and standard deviation.
• Identify the position of a data value in a data set, using percentiles, deciles, and quartiles.
• Use exploratory data analysis, including boxplots and five-number summaries, to discover various aspects of data.
Outline

1. Measurement of “Central Tendency”
2. Measurement of “Variation”
3. Measurement of “Position”
4. Distribution shape
5. Exploratory Data Analysis
Measurement of “Central Tendency”
=> Mean or arithmetic average
=> Median
=> Midrange
=> Mode
=> Weighted mean
=> Other means: the harmonic mean, the geometric mean, and the quadratic mean
The Mean (arithmetic average)

=> The mean, also called the arithmetic average, is defined to be the sum of the data values divided by the total number of values.
o The mean of a sample is called the sample mean.
o The mean of a population is called the population mean.
The Mean (arithmetic average)

The symbol $\bar{X}$ represents the sample mean. $\bar{X}$ is read as "X-bar". The Greek symbol $\Sigma$ is read as "sigma" and it means "to sum".

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum X}{n}$$
The Mean (arithmetic average)

Ex. Below are the ages of students who attend a statistics class.
18 19 30 23 25 22 18 20 21 23 22 21 20 18 23 24 24 25 19 24
19 20 30 23 25 22 18 20 21 23 22 21 20 18 23 24 24 25 19 23
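As a quick check of the definition, the sketch below sums the 40 ages listed above and divides by the number of values. The variable names are illustrative only.

```python
from statistics import mean

# Ages of the 40 students, copied from the two rows above.
ages = [
    18, 19, 30, 23, 25, 22, 18, 20, 21, 23, 22, 21, 20, 18, 23, 24, 24, 25, 19, 24,
    19, 20, 30, 23, 25, 22, 18, 20, 21, 23, 22, 21, 20, 18, 23, 24, 24, 25, 19, 23,
]

n = len(ages)                  # number of values
sample_mean = sum(ages) / n    # sum of the values divided by n

print(n, sum(ages), sample_mean)
print(mean(ages))              # the standard library gives the same result
```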
The Mean (arithmetic average)
For an ungrouped Frequency Distribution

The mean for an ungrouped frequency distribution is given by

$$\bar{X} = \frac{\sum (f \cdot X)}{n}$$

Here $f$ is the frequency for the corresponding value of $X$, and $n = \sum f$.
The Mean (arithmetic average)
For an ungrouped Frequency Distribution

The scores for 25 students on a 4-point quiz are given in the table. Find the mean score.
Score (X)    Frequency (f)    f·X
0            2
1            4
2            12
3            4
4            3

$\bar{X}$ =
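The f·X column and the mean for the quiz table can be worked out with a short sketch like the one below, which multiplies each score by its frequency and divides the total by n = Σf. Variable names are illustrative.

```python
# Quiz scores and their frequencies, taken from the table above.
scores = [0, 1, 2, 3, 4]
freqs = [2, 4, 12, 4, 3]

n = sum(freqs)                                  # n = sum of the frequencies
fx = [f * x for f, x in zip(freqs, scores)]     # the f·X column
mean_score = sum(fx) / n                        # sum of f·X divided by n

print(fx, n, mean_score)
```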
The Mean (arithmetic average)
For a grouped Frequency Distribution

The mean for a grouped frequency distribution is given by

$$\bar{X} = \frac{\sum (f \cdot X_m)}{n}$$

Here $X_m$ is the corresponding class midpoint.
The Mean (arithmetic average)
For a grouped Frequency Distribution

Table with class midpoints, $X_m$.

Score (X)      Frequency (f)    Xm    f·Xm
15.5 - 20.5    3
20.5 - 25.5    5
25.5 - 30.5    4
30.5 - 35.5    3
35.5 - 40.5    2

$\bar{X}$ =
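For the grouped table, the same idea applies once each class is represented by its midpoint. The sketch below takes the midpoint as the average of the two class boundaries, which is the usual convention and is assumed here.

```python
# (lower boundary, upper boundary, frequency) for each class in the table above.
classes = [
    (15.5, 20.5, 3),
    (20.5, 25.5, 5),
    (25.5, 30.5, 4),
    (30.5, 35.5, 3),
    (35.5, 40.5, 2),
]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]       # the Xm column
n = sum(f for _, _, f in classes)                          # total frequency
f_xm = [f * xm for (_, _, f), xm in zip(classes, midpoints)]
grouped_mean = sum(f_xm) / n

print(midpoints, f_xm, round(grouped_mean, 2))
```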
The Median
• The median is the midpoint of the data array. The symbol for the median is MD.
• When there is an even number of values in the data set, the median is obtained by taking the average of the two middle numbers.
• Steps in computing the median of a data array:
o Step 1: Arrange the data in order.
o Step 2: Select the middle point.
Ex 1: The number of children with asthma during a specific year in seven local districts is shown. Find the median. 253, 125, 328, 417, 201, 70, 90
Ex 2: Six customers purchased these numbers of magazines: 1, 7, 3, 2, 3, 4. Find the median.
Ex 3: Find the median for the daily vehicle pass charge for five U.S. National Parks. The costs are $25, $15, $15, $20, and $15.
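The two steps (arrange the data, then select the middle point) can be sketched as follows. The helper name `median_of` is hypothetical; the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median

def median_of(values):
    """Step 1: arrange the data in order. Step 2: select the middle point."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average of the two middle values

asthma = [253, 125, 328, 417, 201, 70, 90]
magazines = [1, 7, 3, 2, 3, 4]
park_fees = [25, 15, 15, 20, 15]

print(median_of(asthma), median_of(magazines), median_of(park_fees))
```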
The Median
For an ungrouped Frequency Distribution

• For an ungrouped frequency distribution, find the median by examining the cumulative frequencies to locate the middle value.
• If n is the sample size, compute n/2. Locate the data point where n/2 values fall below and n/2 values fall above.
Ex 1: LRJ Appliance recorded the number of VCRs sold per week over a one-year period. The data are given below.
No. Sets Sold    Frequency    Cumulative Frequency
1                4            4
2                9            13
3                6            19
4                2            21
5                3            24
Solution
• Find the cumulative frequency.
• To locate the middle point, divide n by 2: 24/2 = 12.
• Locate the point where 12 values would fall below and 12 values would fall above.
• Consider the cumulative distribution.
• The 12th and 13th values fall in class 2. Hence MD = 2.
The Median
For an ungrouped Frequency Distribution

• Find the median for the frequency distribution below:


Score (X) Frequency (f) Cumulative frequency
0 2 2
1 4 6
2 12 18
3 4 22
4 3 25
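The cumulative-frequency approach can be sketched as below. The function names are hypothetical, and the handling of odd n (taking the single middle position) is the usual convention rather than something stated on the slide.

```python
from itertools import accumulate

def value_at(position, values, cumulative):
    """Return the data value that occupies the given 1-based position."""
    for value, cf in zip(values, cumulative):
        if position <= cf:
            return value
    raise ValueError("position is larger than the total frequency")

def median_from_frequencies(values, freqs):
    cumulative = list(accumulate(freqs))   # cumulative frequency column
    n = cumulative[-1]
    if n % 2 == 1:
        return value_at((n + 1) // 2, values, cumulative)
    lower = value_at(n // 2, values, cumulative)
    upper = value_at(n // 2 + 1, values, cumulative)
    return (lower + upper) / 2

quiz_median = median_from_frequencies([0, 1, 2, 3, 4], [2, 4, 12, 4, 3])
vcr_median = median_from_frequencies([1, 2, 3, 4, 5], [4, 9, 6, 2, 3])
print(quiz_median, vcr_median)
```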
The Median
For a grouped Frequency Distribution
Example: Given the grouped frequency distribution table below, identify the median of the distribution.
Score (X)      Frequency (f)
15.5 - 20.5    3
20.5 - 25.5    5
25.5 - 30.5    4
30.5 - 35.5    3
35.5 - 40.5    2
Computation rule:
Step 1: Identify the median class.
Step 2: Calculate the median class width.
Step 3: Find the median.
The Median
For a grouped Frequency Distribution
Solution:
1. Identify the median class
- Find the cumulative frequency.
- To locate the halfway point, divide n by 2: 17/2 = 8.5 ≈ 9.
- Find the class that contains the (n/2)th value. This will be the median class.
- The median class will then be 25.5 – 30.5.
2. Calculate the median class width
W = 30.5 - 25.5 = 5
3. Identify the median of the distribution
Score (X)      Frequency (f)    Cumulative frequency
15.5 - 20.5    3                3
20.5 - 25.5    5                8
25.5 - 30.5    4                12
30.5 - 35.5    3                15
35.5 - 40.5    2                17
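The formula for the final step did not survive in the text. A common textbook interpolation rule, assumed here, is MD = L + ((n/2 − cf) / f) × W, where L is the lower boundary of the median class, cf is the cumulative frequency of the class before it, f is the median class frequency, and W is the class width. The sketch below applies that assumed rule to the table above.

```python
# (lower boundary, upper boundary, frequency) for each class.
classes = [
    (15.5, 20.5, 3),
    (20.5, 25.5, 5),
    (25.5, 30.5, 4),
    (30.5, 35.5, 3),
    (35.5, 40.5, 2),
]

def grouped_median(classes):
    """Interpolated median, MD = L + ((n/2 - cf) / f) * W (assumed textbook rule)."""
    n = sum(f for _, _, f in classes)
    half = n / 2
    cf_before = 0                      # cumulative frequency before the current class
    for lower, upper, f in classes:
        if cf_before + f >= half:      # this class contains the (n/2)th value
            width = upper - lower
            return lower + (half - cf_before) / f * width
        cf_before += f
    raise ValueError("empty distribution")

print(grouped_median(classes))
```

With n/2 = 8.5, cf = 8, f = 4 and W = 5, this gives 25.5 + (0.5/4)(5) = 26.125.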
The Mode
• The value that occurs most often in a data set is called the mode.
• A data set that has only one value that occurs with the greatest frequency is said to be unimodal.
• If a data set has two values that occur with the same greatest frequency, both values are considered to be the mode and the data set is said to be bimodal.
• If a data set has more than two values that occur with the same greatest frequency, each value is used as the mode, and the data set is said to be multimodal.
• When no data value occurs more than once, the data set is said to have no mode.

Ex 1: 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
Ex 2: 401, 344, 209, 201, 227, 353
Ex 3: 104 104 104 104 104 107 109 109 109 110 109 111 112 111 109
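The definitions above can be checked against the three examples with a small sketch. The helper `modes` is hypothetical; it returns an empty list when no value occurs more than once, matching the "no mode" rule.

```python
from collections import Counter

def modes(data):
    """Return every value that occurs with the greatest frequency, or [] if there is no mode."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:                       # no value repeats: no mode
        return []
    return sorted(value for value, count in counts.items() if count == top)

ex1 = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
ex2 = [401, 344, 209, 201, 227, 353]
ex3 = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109]

print(modes(ex1))   # unimodal
print(modes(ex2))   # no mode
print(modes(ex3))   # bimodal
```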
The Mode
For an ungrouped Frequency Distribution

• The mode for ungrouped data is the data value with the highest frequency.
Example: Find the mode for the ungrouped frequency distribution below.
Score (X)    Frequency (f)
0            2
1            4
2            12
3            4
4            3
The score of 2 has the highest frequency. Therefore, it is the mode of the data set.
The Mode
For a grouped Frequency Distribution

• The mode for grouped data is the modal class.
• The modal class is the class with the largest frequency.
• Sometimes the midpoint of the class is used rather than the boundaries.
Example: The mode could also be given as the midpoint of the modal class: (20.5 + 25.5)/2 = 23
The Midrange
• The midrange, symbolized MR, is a rough estimate of the middle. It is found by adding the lowest and highest values in the data set and dividing by 2.
• If the data set contains one extremely large value or one extremely small value, a higher or lower midrange value will result and may not be a typical description of the middle.

Ex 1. In the last two winter seasons, the city of Brownsville, Minnesota, reported these numbers of water-line breaks per month. Find the midrange.
2, 3, 6, 8, 4, 1
Ex 2. Find the midrange of data for the NFL signing bonuses. The bonuses in millions of dollars are 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
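A minimal sketch of the midrange for the two examples follows. Note how the single large bonus (34.5) pulls the second midrange well above most of the values, as the second bullet warns.

```python
def midrange(values):
    """MR = (lowest value + highest value) / 2."""
    return (min(values) + max(values)) / 2

water_line_breaks = [2, 3, 6, 8, 4, 1]
signing_bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]

print(midrange(water_line_breaks), midrange(signing_bonuses))
```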
The Weighted Mean

• The weighted mean is used when the values in a data set are not all equally represented.
• The weighted mean of a variable X is found by multiplying each value by its corresponding weight and dividing the sum of the products by the sum of the weights.

$$\bar{X} = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wX}{\sum w}$$

where $w_1, w_2, \ldots, w_n$ are the weights for the values $X_1, X_2, \ldots, X_n$.
Ex: Average score for selecting a scholarship candidate:
o GPA: 50%
o English proficiency: 25%
o Economic background: 25%

                       Student A    Student B    Student C
GPA                    3.0          4.0          3.5
English proficiency    4            2.5          3.0
Economic background    3            3.0          3.0
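The scholarship example can be worked through with a short sketch of the weighted mean: each score is multiplied by its weight and the sum of the products is divided by the sum of the weights. The dictionary layout and names are illustrative.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Weights: GPA 50%, English proficiency 25%, economic background 25%.
weights = [0.50, 0.25, 0.25]

# Scores in the order (GPA, English proficiency, economic background).
students = {
    "A": [3.0, 4.0, 3.0],
    "B": [4.0, 2.5, 3.0],
    "C": [3.5, 3.0, 3.0],
}

results = {name: weighted_mean(scores, weights) for name, scores in students.items()}
best = max(results, key=results.get)
print(results, best)
```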
The Weighted Mean
Ex 1. Find the mean score of the students from an English class.
Score    Frequency (number of students who got each score)    Weight
20       2
22       4
25       12
28       4
30       3
The Weighted Mean
Ex 2. The evaluation criteria for selecting new staff at company A are given below:
- CV preparation: 30%
- Ability to answer interview questions: 50%
- Overall presentation: 20%
Which candidate will be hired? (The candidate with the top score will be selected.)

Candidate    Score for CV preparation    Score for ability to answer interview questions    Score for overall presentation
A            45                          30                                                 18
B            50                          20                                                 20
C            30                          30                                                 20
D            25                          30                                                 15
F            50                          15                                                 17
The Weighted Mean
Ex 3. The evaluation criteria for a statistics class are given below:
- Attendance: 10%
- Homework: 15%
- In-class activities: 15%
- Mid-term: 30%
- Final: 30%
Which student gets the highest grade?
Student    Attendance    Homework    Class activity    Mid-term    Final
A          100           100         75                80          80
B          100           80          75                90          80
C          80            100         100               90          100
D          50            80          100               75          85
F          0             100         0                 75          80
Distribution Shapes
Frequency distributions can appear in many shapes.

• The three most important shapes are:
o positively skewed,
o symmetric, and
o negatively skewed.
Positively skewed or right-skewed distribution
The majority of the data values fall to the left of the
mean and cluster at the lower end of the
distribution;
The “tail” is to the right.
The mean is to the right of the median, and the
mode is to the left of the median.
Symmetric distribution
The data values are evenly distributed on both
sides of the mean.
In addition, when the distribution is unimodal, the
mean, median, and mode are the same and are at
the center of the distribution.
Negatively skewed or left-skewed distribution
The majority of the data values fall to the right of the mean and cluster at the upper end of the distribution.
The “tail” is to the left.
The mean is to the left of the median, and the mode is to the right of the median.
Measures of Variation
Range
Variance
Standard deviation
What is range?
The range is defined to be the
highest value minus the lowest value.
The symbol R is used for the range.
R = highest value – lowest value.
Extremely large or extremely small
data values can drastically affect the
range.
What is variance? Standard deviation?
• The standard deviation is a measure of how spread out the numbers are.
• Variance: the average of the squared differences from the mean. The standard deviation is the square root of the variance.

=> Using the standard deviation, we have a "standard" way of knowing what is normal, and what is extra large or extra small.
Population Variance & standard deviation
Example: Find the variance and standard deviation for the amount of European auto sales in 6 years (values in millions of dollars shown below). Assume the data represent a population.
11.2, 11.9, 12.0, 12.8, 13.4, 14.3
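The slide formula is not present in the text. The sketch below assumes the standard population definitions, σ² = Σ(X − μ)² / N and σ = √σ², and cross-checks the result against the standard library's `statistics.pvariance`.

```python
from math import sqrt
from statistics import pvariance

sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]   # millions of dollars

N = len(sales)
mu = sum(sales) / N                                   # population mean
squared_diffs = [(x - mu) ** 2 for x in sales]        # (X - mu)^2 for each value
pop_variance = sum(squared_diffs) / N                 # divide by N for a population
pop_std_dev = sqrt(pop_variance)

print(round(mu, 2), round(pop_variance, 3), round(pop_std_dev, 3))
```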
Sample Variance and Standard Deviation

• In computing the sample variance, the sum of the squared differences between each value and the sample mean is divided by n − 1 (the sample size minus 1) instead of n. This gives a better estimate of the population variance.
Sample Variance and Standard deviation
Example: Find the variance and standard deviation for the amount of European auto sales in 6 years (values in millions of dollars shown below). Assume the data represent a sample.
11.2, 11.9, 12.0, 12.8, 13.4, 14.3
Shortcut (Computational) Formula for Sample Variance and Standard Deviation

Example: Find the sample variance and standard deviation for the amount of European auto sales for a sample of 6 years shown. The data are in millions of dollars.
11.2, 11.9, 12.0, 12.8, 13.4, 14.3
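The formulas themselves are missing from the text, so the sketch below assumes the usual pair: the definition s² = Σ(X − X̄)² / (n − 1) and the shortcut s² = (nΣX² − (ΣX)²) / (n(n − 1)). It shows that both give the same value for the auto sales sample.

```python
from math import sqrt
from statistics import variance

sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]   # millions of dollars

n = len(sales)
x_bar = sum(sales) / n

# Definition: squared differences from the sample mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in sales) / (n - 1)

# Shortcut: needs only the sum of X and the sum of X squared.
sum_x = sum(sales)
sum_x2 = sum(x ** 2 for x in sales)
s2_shortcut = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

s = sqrt(s2_shortcut)
print(round(s2_definition, 3), round(s2_shortcut, 3), round(s, 3))
```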
Sample Variance for Grouped Data
• For grouped data, use the class midpoints for the observed values in the different classes.

– f : class frequency
– Xm : class midpoint
– n : total frequency
Sample Variance for Grouped Data
Find the variance and standard deviation of the grouped data below:

s² =
s =
Sample Variance for Ungrouped Data
• For ungrouped data, use the same formula with the class midpoints, Xm, replaced with the actual observed X value.
Sample Variance for Ungrouped Data
Find the variance and standard deviation of the ungrouped data below.
Class (X)    Frequency (f)    f·X    f·X²
5            2
7            2
9            5
11           8
13           2
15           4
17           1

s² =
s =
* Replace Xm with X
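The frequency-table formula is not shown in the text. The sketch below assumes the common shortcut form s² = (nΣ(f·X²) − (Σf·X)²) / (n(n − 1)) and checks it against the variance of the fully expanded data set.

```python
from math import sqrt
from statistics import variance

values = [5, 7, 9, 11, 13, 15, 17]
freqs = [2, 2, 5, 8, 2, 4, 1]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, values))          # total of the f·X column
sum_fx2 = sum(f * x ** 2 for f, x in zip(freqs, values))    # total of the f·X² column

s2 = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
s = sqrt(s2)

# Cross-check: expand the table into raw observations.
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]
print(n, sum_fx, sum_fx2, round(s2, 2), round(s, 2))
```

For grouped data, the same sketch applies with `values` holding the class midpoints Xm.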
Summary
Coefficient of Variation
=> The coefficient of variation, denoted by CVar, is the standard deviation divided by the mean.
=> The result is expressed as a percentage.

=> It is used to compare the spread of two or more variables that have different units or scales of measurement.
Ex.: the spread of student ages vs. the spread of their English level (score).
Coefficient of Variation
• Temperature (°C): 17, 18, 20, 22, 22, 25, 27,
25, 23, 18, 17, 17
• Precipitation (mm): 500, 510, 520, 505, 500,
520, 530, 540, 550, 560, 500,
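A sketch of the comparison follows. Two assumptions are made: the data are treated as samples (so the sample standard deviation is used), and the precipitation list is taken exactly as printed above, with the 11 values that are shown.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CVar = (standard deviation / mean) * 100, using the sample standard deviation."""
    return stdev(values) / mean(values) * 100

temperature = [17, 18, 20, 22, 22, 25, 27, 25, 23, 18, 17, 17]             # degrees C
precipitation = [500, 510, 520, 505, 500, 520, 530, 540, 550, 560, 500]   # mm, as listed

cv_temp = coefficient_of_variation(temperature)
cv_precip = coefficient_of_variation(precipitation)

print(round(cv_temp, 1), round(cv_precip, 1))
```

Because both results are percentages, they can be compared directly even though the two variables use different units; here the temperature values are relatively more variable than the precipitation values.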
Chebyshev’s Theorem
Chebyshev’s theorem (named after the Russian mathematician, 1821–1894) specifies the proportions of the spread in terms of the standard deviation.
=> The proportion of values from a data set that will fall within k standard deviations of the mean will be at least 1 − 1/k², where k is any number greater than 1.
e.g., k = 2 => at least 75% of the values will lie within 2 standard deviations of the mean.
k = 3 => at least approximately 89% will lie within 3 standard deviations.

=> Chebyshev’s theorem can be used to find the minimum percentage of data values that will fall between any two given values.
Chebyshev’s Theorem
Ex.: Prices of Homes. The mean price of houses in the Toul Kork neighborhood is $50,000, and the standard deviation is $5,000. Find the minimum percentage of houses that cost within 2 standard deviations of the mean price.
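The bound 1 − 1/k² is easy to evaluate directly. The sketch below applies it to the house price example and also reports the price interval that corresponds to k = 2; the function name is illustrative.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

mean_price = 50_000
std_dev = 5_000
k = 2

low = mean_price - k * std_dev
high = mean_price + k * std_dev
min_percent = chebyshev_min_proportion(k) * 100

print(low, high, min_percent)   # at least this percentage of prices lie between low and high
```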
Chebyshev’s Theorem
Ex.: Scores of an English class:
38 45 50 48 50 49 39 40 45 50 38 45 48 50 49 39 40 45
35 44 50 46 50 45 39 40 47 50 35 45 48 50 48 37 42 44
Assume the data set above contains the English scores of 36 students in an English class.
1. Find the standard deviation.
2. At least how many students got a score between 37.26 and 51.79? Use Chebyshev’s theorem to solve it.
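One way to approach the second question is to express the interval as the mean plus or minus k standard deviations and then apply the bound. The sketch below assumes the sample standard deviation is intended and derives k from the interval endpoints, taking the smaller distance to stay on the conservative side.

```python
from statistics import mean, stdev

scores = [
    38, 45, 50, 48, 50, 49, 39, 40, 45, 50, 38, 45, 48, 50, 49, 39, 40, 45,
    35, 44, 50, 46, 50, 45, 39, 40, 47, 50, 35, 45, 48, 50, 48, 37, 42, 44,
]

x_bar = mean(scores)
s = stdev(scores)                    # sample standard deviation (assumption)

low, high = 37.26, 51.79
k = min(x_bar - low, high - x_bar) / s     # number of standard deviations to the nearer endpoint
min_proportion = 1 - 1 / k ** 2            # Chebyshev's lower bound

print(round(x_bar, 2), round(s, 2), round(k, 2), round(min_proportion * 100, 1))
```

Multiplying `min_proportion` by the 36 students gives the minimum number of scores guaranteed to lie in the interval.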
The Empirical (Normal) Rule
Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a distribution is bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are true.
=> Approximately 68% of the data values will fall within 1 standard deviation of the mean.
=> Approximately 95% of the data values will fall within 2 standard deviations of the mean.
=> Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.
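The three percentages can be checked against the normal distribution itself. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) and compares the result with the Chebyshev bound for the same k.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    line = f"k={k}: normal {share_within(k):.4f}"
    if k > 1:
        line += f", Chebyshev at least {1 - 1 / k ** 2:.4f}"
    print(line)
```

The normal-distribution shares are well above the Chebyshev minimums, which is expected because Chebyshev's theorem must hold for every distribution shape.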
Measures of Position
• In addition to measures of central tendency and measures of variation, there are measures of position or location. These measures include:
o standard scores (z scores),
o percentiles,
o deciles, and
o quartiles.
• They are used to locate the relative position of a data value in the data set.
Example
o What is your position in class based on your score?
Standard Scores or z score
=> A standard score or z score tells how many standard deviations a data value is above or below the mean for a specific distribution of values.
=> If the z score is positive, the score is above the mean.
=> If the z score is 0, the score is the same as the mean.
=> If the z score is negative, the score is below the mean.
Standard Scores or z score - Example
Ex: A student scored 65 on a statistics exam that had a mean of 50 and a standard deviation of 10. She scored 30 on a history test with a mean of 25 and a standard deviation of 2.
a. Compute the z score for each test score.
b. Did the student do better in the statistics class or in the history class?
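The z score formula is not written out in the text. The sketch below assumes the standard definition, z = (value − mean) / standard deviation, and applies it to the two test scores.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

z_statistics = z_score(65, 50, 10)
z_history = z_score(30, 25, 2)

better = "history" if z_history > z_statistics else "statistics"
print(z_statistics, z_history, better)
```

The higher z score marks the better relative position, even though the raw statistics score is larger.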
Standard Scores or z score - Example
Ex: The average teacher’s salary in a particular state is $54,166. If the standard deviation is $10,200, find the salaries corresponding to the following z scores.
a. 2
b. -1
c. 0
d. 2.5
e. -1.6
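Going from a z score back to a data value means rearranging the standard definition to value = mean + z × standard deviation. A minimal sketch for the five z scores listed:

```python
mean_salary = 54_166
std_dev = 10_200

def value_from_z(z, mean, std_dev):
    """Invert z = (value - mean) / std_dev."""
    return mean + z * std_dev

z_scores = [2, -1, 0, 2.5, -1.6]
salaries = [value_from_z(z, mean_salary, std_dev) for z in z_scores]

for z, salary in zip(z_scores, salaries):
    print(f"z = {z:>4}: ${salary:,.0f}")
```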
Percentiles
=> Percentiles divide the data set into 100 equal groups.

=> The kth percentile, Pk, is defined to be the numerical value such that at most k% of the values are smaller than Pk and at most (100 – k)% are larger than Pk in an ordered data set.
Percentiles - Example
A teacher gives a 20-point test to 10 students. The scores of the students are: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10. Find the percentile ranks of the scores 12 and 15.
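The percentile rank formula is not given in the text. The sketch below assumes a common textbook version: percentile = ((number of values below X) + 0.5) / (total number of values) × 100. Other conventions exist and can give slightly different ranks.

```python
def percentile_rank(data, x):
    """Percentile rank of x using the assumed (values below + 0.5) / n * 100 rule."""
    below = sum(1 for value in data if value < x)
    return (below + 0.5) * 100 / len(data)

test_scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

print(percentile_rank(test_scores, 12), percentile_rank(test_scores, 15))
```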
Finding the value Corresponding to a Given
Percentile
Finding the Value Corresponding to a Given Percentile - Example
A teacher gives a 20-point test to 10 students. The scores of the students are: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10. Find the scores corresponding to the 10th, 25th, 50th, and 80th percentiles.
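The procedure itself is missing from the text. The sketch below assumes one common textbook procedure: arrange the data, compute c = n·p/100, round c up to the next whole number if it is not whole, and if c is whole use the average of the cth and (c + 1)th values. Software packages often use different interpolation rules, so their answers may differ.

```python
import math

def value_at_percentile(data, p):
    """Value at the p-th percentile (integer p with 0 < p < 100), assumed textbook procedure."""
    ordered = sorted(data)
    n = len(ordered)
    if (n * p) % 100 == 0:                    # c = n*p/100 is a whole number
        c = n * p // 100
        return (ordered[c - 1] + ordered[c]) / 2
    c = math.ceil(n * p / 100)                # otherwise round c up
    return ordered[c - 1]

test_scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

for p in (10, 25, 50, 80):
    print(p, value_at_percentile(test_scores, p))
```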
Example
Below is the distribution of monthly overtime, in hours, for 12 garment staff: 10, 11, 12, 6, 8, 2, 3, 5, 20, 10, 11, 12.
a. Find the percentile rank of the staff member whose overtime was 8 hours.
b. Identify the staff whose overtime is more than the 90th percentile.
Quartiles
=> Quartiles divide the distribution into four groups, separated by Q1, Q2, Q3.
- Q1 = the 25th percentile;
- Q2 = the 50th percentile, or the median;
- Q3 = the 75th percentile.
Quartiles - Example
Find Q1, Q2, and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18.
=> Prepare the data array:
=> Find the median of the data array: MD = Q2 =
=> Find the median of the first half of the data array: Q1 =
=> Find the median of the second half of the data array: Q3 =
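The steps above can be sketched as follows. For a data set with an odd number of values, this sketch leaves the median out of both halves, which is one common convention and is an assumption here.

```python
from statistics import median

def quartiles(data):
    """Q1, Q2, Q3 by the median-of-halves method."""
    ordered = sorted(data)                 # prepare the data array
    n = len(ordered)
    lower_half = ordered[: n // 2]         # values below the median position
    upper_half = ordered[(n + 1) // 2 :]   # values above the median position
    return median(lower_half), median(ordered), median(upper_half)

data = [15, 13, 6, 5, 12, 50, 22, 18]
q1, q2, q3 = quartiles(data)
print(sorted(data), q1, q2, q3)
```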
Deciles
=> Deciles divide the distribution into 10 groups, denoted by D1, D2, etc.

=> Note that D1 corresponds to P10; D2 corresponds to P20; etc. Deciles can be found by using the formulas given for percentiles.
Relationships among percentiles, deciles, and quartiles:
o Deciles are denoted by D1, D2, D3, . . . , D9, and they correspond to P10, P20, P30, . . . , P90.
o Quartiles are denoted by Q1, Q2, Q3 and they correspond to P25, P50, P75.
o The median is the same as P50 or Q2 or D5.
Outliers
=> An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
=> An outlier can strongly affect the mean and standard deviation of a variable.
One way to check a data set for outliers (there are also other ways):

• When a distribution is normal or bell-shaped, data values that are beyond 3 standard deviations of the mean can be considered suspected outliers.
Outliers
There are several reasons why outliers may occur.
o First, the data value may have resulted from a
measurement or observational error. (Perhaps the researcher
measured the variable incorrectly)
o Second, the data value may have resulted from a recording
error. (That is, it may have been written or typed incorrectly)
o Third, the data value may have been obtained from a
subject that is not in the defined population. (For example,
suppose test scores were obtained from a seventh-grade class, but a student
in that class was actually in the sixth grade and had special permission to
attend the class. This student might have scored extremely low on that
particular exam on that day.)
o Fourth, the data value might be a legitimate value that
occurred by chance (although the probability is extremely
small).
Outliers
Example: Check the following data set for outliers: 5, 6, 12, 13, 15, 18, 22, 50
Solution:
1. Find Q1 and Q3:
2. Find the interquartile range (IQR), which is Q3 - Q1 =
3. Multiply the IQR value by 1.5: d = IQR x 1.5 =
4. Subtract the value obtained in step 3 from Q1 => Q1 - d =
and add the value obtained in step 3 to Q3 => Q3 + d =
5. Check the data set for any data values that fall outside the interval [Q1 - d, Q3 + d].
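The five steps can be sketched as below. Q1 and Q3 are found with the median-of-halves method, written here for an even number of values as in this data set; the function name is illustrative.

```python
from statistics import median

def iqr_outliers(data):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (even-sized data set assumed)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])          # step 1: Q1 and Q3
    q3 = median(ordered[-half:])
    iqr = q3 - q1                        # step 2: interquartile range
    d = 1.5 * iqr                        # step 3
    low, high = q1 - d, q3 + d           # step 4: Q1 - d and Q3 + d
    flagged = [x for x in ordered if x < low or x > high]   # step 5
    return q1, q3, iqr, low, high, flagged

data = [5, 6, 12, 13, 15, 18, 22, 50]
q1, q3, iqr, low, high, flagged = iqr_outliers(data)
print(q1, q3, iqr, low, high, flagged)
```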
Stem and Leaf Plot
Stem and Leaf plot is a data plot that uses part of a data value
as the stem and part of the data value as the leaf to form groups
or classes.
Ex: At an outpatient testing center, a sample of 20 days showed the following
number of cardiograms done each day: 25, 31, 20, 32, 13, 14, 43, 02, 57, 23, 36, 32,
33, 32, 44, 32, 52, 44, 51, 45. The stem and leaf plot for the above data is shown
below.
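The plot itself is not reproduced in the text. The sketch below rebuilds a stem and leaf plot for the cardiogram data, using the tens digit as the stem and the units digit as the leaf; the value listed as 02 is entered as 2.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem), listing units digits (leaves) in order."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23, 36, 32,
               33, 32, 44, 32, 52, 44, 51, 45]

plot = stem_and_leaf(cardiograms)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```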
Stem and Leaf Plot - Example

A teacher gives a 20-point test to 15 students. The scores of the students are: 12, 15, 9, 10, 18, 18, 15, 12, 6, 8, 2, 3, 5, 20, 10. Draw a stem and leaf plot to represent this data set.
Boxplot
A boxplot is a graph of a data set obtained by drawing a horizontal line
from the minimum data value to Q1, drawing a horizontal line from Q3 to the
maximum data value, and drawing a box whose vertical sides pass through Q1
and Q3 with a vertical line inside the box passing through the median or Q2.
Boxplot

Ex: The number of meteorites found in 10 states of the United States is 89, 47, 164, 296, 30, 215, 138, 78, 48, 39. Construct a boxplot for the data. (Data array: 30, 39, 47, 48, 78, 89, 138, 164, 215, 296)
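A boxplot is drawn from the five-number summary: minimum, Q1, median, Q3, and maximum. The sketch below computes those five values for the meteorite data with the median-of-halves method, written for an even number of values; drawing the plot itself is left out.

```python
from statistics import median

def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum (even-sized data set assumed)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])      # median of the lower half
    q3 = median(ordered[-half:])     # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
summary = five_number_summary(meteorites)
print(summary)
```

The whiskers run from the minimum to Q1 and from Q3 to the maximum, and the box spans Q1 to Q3 with a line at the median.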
Box Plot - Example

A teacher gives a 20-point test to 15 students. The scores of the students are: 12, 15, 9, 10, 18, 18, 15, 12, 6, 8, 2, 3, 5, 20, 10. Draw a box plot to represent this data set.
Additional Exercise
1. Below are the responses of 30 female citizens in a downtown area regarding the age at which a woman should get married.
25 28 30 22 20 24 20 25 23 24 26 28 30 25 18
27 23 26 25 25 23 24 25 25 26 27 28 30 20 20
a. Does this data set represent a sample or a population?
b. Find the mean, median, and mode, and draw a box plot and a stem and leaf plot to represent the data.
c. Does this data set contain any outliers?
d. What is the percentile rank of 22?
e. What are the values corresponding to the 28th percentile and the 50th percentile?
End of Lecture
