Module 1
Learning Objective
• To convert raw data to useful information
• Descriptive statistics: Collecting, summarizing and describing the data.
• Inferential statistics: Drawing conclusions and/or making decisions concerning a population based only on sample data.
Example: A recent study examined the math and verbal SAT scores of high school seniors
across the country. Which of the following statements are descriptive in nature and
which are inferential?
• 80% of all students taking the exam were headed for college.
• The math SAT scores are higher than they were 10 years ago.
Introduction to Basic Terms
Population: A collection, or set, of individuals or objects or events
whose properties are to be analyzed.
Two kinds of populations: finite or infinite.
Data (singular): The value of the variable associated with one element of a population
or sample. This value may be a number, a word, or a symbol.
Data (plural): The set of values collected for the variable from each of the elements
belonging to the sample.
▪ We will study samples in order to be able to describe populations. Our hospital may study a small,
representative group of X-ray records rather than examining each record for the last 50 years. The Gallup
Poll may interview a sample of only 2,500 adult Americans in order to predict the opinion of all adults
living in the United States.
▪ Studying samples is easier than studying the whole population; it costs less and takes less time. Often,
testing an airplane part for strength destroys the part; thus, testing fewer parts is desirable. Sometimes
testing involves human risk; thus, use of sampling reduces that risk to an acceptable level.
Presenting Data in Tables and Charts
Organizing Numerical Data
[Diagram: Numerical Data → Ordered Array; Frequency Distributions and Cumulative Distributions]
Weight of sample of daily production in kg
Day Shift:
22 17 25 42 18 32
38 19 20 27 21 22
16 17 20 18 19 18
Night Shift:
18 23 19 32 33 41
18 28 19 20 21 45
Organizing Numerical Data:
Ordered Array
▪ An ordered array is a sequence of data, in rank order, from the smallest value to the largest
value.
▪ Shows range (minimum value to maximum value)
▪ May help identify outliers (unusual observations)
Weight of sample of daily production in kg (ordered)
Day Shift:
16 17 17 18 18 18
19 19 20 20 21 22
22 25 27 32 38 42
Night Shift:
18 18 19 19 20 21
23 28 32 33 41 45
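As a minimal sketch, the day-shift ordered array can be reproduced by sorting the raw weights; the variable names are illustrative only.

```python
# Day-shift sample weights (kg), copied from the raw table earlier in this module.
day_shift = [22, 17, 25, 42, 18, 32,
             38, 19, 20, 27, 21, 22,
             16, 17, 20, 18, 19, 18]

ordered = sorted(day_shift)                 # ordered array, smallest to largest
lowest, highest = ordered[0], ordered[-1]   # the two ends give the range

print(ordered)
print("range:", lowest, "to", highest)
```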
Organizing Numerical Data:
Ordered Array
• Advantages:
• We can quickly notice the lowest and highest values in the data.
• We can easily divide the data into sections.
• We can see whether any values appear more than once in the array
• We can observe the distance between succeeding values in the data.
Disadvantage:
▪You must give attention to selecting the appropriate number of class groupings for the table, determining a
suitable width of a class grouping, and establishing the boundaries of each class grouping to avoid
overlapping.
Organizing Numerical Data:
Frequency Distribution
Relative Frequency Distribution
• We can also express the frequency of each value as a fraction or a percentage of the total number of
observations.
• A relative frequency distribution presents frequencies in terms of fractions or percentages.
Guidelines for Selecting Width of Classes
17, 13, 12, 24, 24, 21, 27, 26, 27, 35, 30, 32, 43, 41, 38, 44, 43, 58, 46, 53
Organizing Numerical Data:
Frequency Distribution Example
• Data in Ordered Array:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
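The tally behind a frequency distribution and a relative frequency distribution can be sketched as follows. The class width of 10 and the starting boundary of 10 are assumptions chosen for illustration; the slides do not state them at this point.

```python
from collections import Counter

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

WIDTH = 10   # assumed class width
START = 10   # assumed lower boundary of the first class

# Map each value to the lower boundary of its class, then count per class.
counts = Counter(START + ((x - START) // WIDTH) * WIDTH for x in data)

n = len(data)
for lower in sorted(counts):
    freq = counts[lower]
    print(f"{lower} but less than {lower + WIDTH}: "
          f"frequency {freq}, relative frequency {freq / n:.2f}")
```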
Graphing Numerical Data:
The Histogram
• Data in Ordered Array:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: Histogram. Horizontal axis: class midpoints 5, 15, 25, 35, 45, 55, More (bars span the class boundaries); vertical axis: frequency, 0 to 7; bar heights 0, 3, 6, 5, 4, 2, 0. No gaps between bars.]
Graphing Numerical Data:
The Frequency Polygon
• Data in Ordered Array:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: Frequency polygon. Horizontal axis: class midpoints 5, 15, 25, 35, 45, 55, More; vertical axis: frequency, 0 to 7.]
Tabulating Numerical Data:
Cumulative Frequency
▪ Shows the number of items with values less than or equal to the
upper limit of each class
▪ Shows the proportion of items with values less than or equal to the
upper limit of each class.
▪ Shows the percentage of items with values less than or equal to the
upper limit of each class.
Tabulating Numerical Data:
Cumulative Frequency
• Data in Ordered Array:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
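A short sketch of the cumulative count, proportion and percentage for this data follows. The upper class boundaries 20, 30, ..., 60 are an assumption (classes of width 10), and "below the boundary" is used as the counting rule.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

upper_bounds = [20, 30, 40, 50, 60]   # assumed upper class boundaries
n = len(data)

# Number of observations falling below each upper boundary.
cumulative = [sum(1 for x in data if x < ub) for ub in upper_bounds]
cumulative_prop = [c / n for c in cumulative]        # proportions
cumulative_pct = [100 * c / n for c in cumulative]   # percentages (ogive heights)

for ub, c, pct in zip(upper_bounds, cumulative, cumulative_pct):
    print(f"less than {ub}: {c} observations ({pct:.0f}%)")
```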
Graphing Numerical Data:
The Ogive (Cumulative % Polygon)
• A cumulative frequency distribution enables us to see how many
observations lie above or below certain values, rather than merely
recording the number of items within intervals.
Graphing Numerical Data:
The Ogive (Cumulative % Polygon)
• Data in Ordered Array:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: Ogive. Horizontal axis: values 10, 20, 30, 40, 50, 60; vertical axis: cumulative percentage, 0 to 100.]
Mean: 6.97
Central Tendency
In the following figure, the central location of curve B lies to the right of
those of curve A and curve C. Notice that the central location of curve
A is equal to that of curve C.
• Kurtosis: When we measure the kurtosis of a distribution, we are measuring its peakedness.
• For example, curves A and B differ only in that one is more peaked than the other. They have the same central location
and dispersion, and both are symmetrical. Statisticians say that the two curves have different degrees of kurtosis.
TWO CURVES WITH THE SAME CENTRAL LOCATION BUT DIFFERENT KURTOSIS
Ungrouped vs Grouped Data
• Ungrouped data
• Ungrouped data, also known as raw data, is data that has not been placed in any
group or category after collection.
• Data is categorized by numbers or characteristics; data that has not been
put into any such category is ungrouped.
• For example, the number of individuals residing in an area is ungrouped data, or raw information,
because nothing has been categorized.
• We can therefore conclude that ungrouped data shows information on the
individual members of a sample or population.
Ungrouped vs Grouped Data
• Grouped data
• Grouped data is the type of data which is classified into groups after collection.
• The number of individuals residing in that area is ungrouped data or raw information
because nothing has been categorized. But if it is categorized as no. of man, no. if
• The raw data is categorized into various groups and a table is created.
• The primary purpose of the table is to show the data points occurring in each group.
A MEASURE OF CENTRAL TENDENCY
1. The Arithmetic Mean
2. The Weighted Mean
3. The Median
4. The Mode
Calculating the Mean from Ungrouped Data
A MEASURE OF CENTRAL TENDENCY
The Arithmetic Mean
• Suppose the monthly income (in Rs) of six families is given as:
1600, 1500, 1400, 1525, 1625, 1630.
• Remember that the measures we compute for a sample are called statistics.
• The notation is different when we are computing measures for the entire
population, that is, for the group containing every element we are describing.
• $\bar{x} = \frac{9+7+7+6+4+4+2}{7} = \frac{39}{7} \approx 5.6$ ← Sample mean
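As a quick sketch, the same calculation applied to the six family incomes listed above (sum of the observations divided by their number):

```python
# Monthly income (Rs) of six families, from the example above.
incomes = [1600, 1500, 1400, 1525, 1625, 1630]

# Arithmetic mean of ungrouped data: sum of values / number of values.
mean_income = sum(incomes) / len(incomes)

print(round(mean_income, 2))  # 1546.67
```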
Calculating the Mean from Grouped Data
• A frequency distribution consists of data that are grouped by classes.
• Each value of an observation falls somewhere in one of the classes.
• Unlike the last example, we do not know the separate values of every observation.
• Steps to be followed:
• Calculate the midpoint of each class. To make midpoints come out in whole cents, we round
up. For example, the midpoint for the first class becomes 25.00, rather than 24.995.
• Multiply each midpoint by the frequency of observations in that class.
• Sum all these results, and divide the sum by the total number of observations in the sample.
Calculating the Mean from Grouped Data
• Formula
• $\bar{x} = \frac{\sum (f \times x)}{n}$
• Where
• $\bar{x}$ = sample mean
• $\sum$ = symbol meaning “the sum of”
• $f$ = frequency (number of observations) in each class
• $x$ = midpoint for each class in the sample
• $n$ = number of observations in the sample
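The formula can be sketched in a few lines. The frequency table below is an assumption made for illustration: classes of width 10 from 10 to 60, with frequencies tallied from the 20-value ordered array used earlier in this module.

```python
# Assumed illustrative frequency table: class midpoints and class frequencies.
midpoints = [15, 25, 35, 45, 55]
frequencies = [3, 6, 5, 4, 2]

n = sum(frequencies)   # total number of observations

# Grouped mean: sum of (frequency * midpoint) divided by n.
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

print(grouped_mean)  # 33.0
```

Because every observation is represented by its class midpoint, the result is an approximation of the mean of the raw values.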
Example
Calculating the Mean from Grouped Data-
Coding
• Using a technique called coding, we eliminate the problem of large or inconvenient midpoints.
• Instead of using the actual midpoints to perform our calculations, we can assign small-value consecutive
integers (whole numbers) called codes to each of the midpoints.
• The integer zero can be assigned anywhere, but to keep the integers small, we will assign zero to the midpoint
in the middle (or the one nearest to the middle) of the frequency distribution.
• Then we can assign negative integers to values smaller than that midpoint and positive integers to those
larger, as follows:
Class      1-5   6-10   11-15   16-20   21-25   26-30   31-35   36-40   41-45
Code (u)   -4    -3     -2      -1      0       1       2       3       4
($x_0$ is the midpoint of the class assigned the code 0, here class 21-25.)
Calculating the Mean from Grouped Data-
Coding
• Formula
• $\bar{x} = x_0 + w \, \frac{\sum (u \times f)}{n}$
Symbolically, statisticians use $x_0$ to represent the midpoint that is assigned the code 0, and $u$ for the coded midpoint.
• Where
• $\bar{x}$ = mean of sample
• $x_0$ = value of the midpoint assigned the code 0
• $w$ = numerical width of the class interval
• $u$ = code assigned to each class
• $f$ = frequency or number of observations in each class
• $n$ = total number of observations in the sample
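A minimal sketch of the coding method, checked against the direct grouped-mean calculation. The frequency table is an assumed illustration (midpoints 15 to 55, class width 10); it is not the rainfall table of the example that follows.

```python
# Assumed illustrative frequency table.
midpoints = [15, 25, 35, 45, 55]
frequencies = [3, 6, 5, 4, 2]
w = 10                       # class width

x0 = midpoints[2]            # midpoint assigned the code 0 (the middle class)
codes = [(x - x0) // w for x in midpoints]   # -2, -1, 0, 1, 2

n = sum(frequencies)
coded_mean = x0 + w * sum(u * f for u, f in zip(codes, frequencies)) / n

# Direct calculation with the actual midpoints, for comparison.
direct_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

print(coded_mean, direct_mean)
```

Both routes give the same value; coding only makes the hand arithmetic smaller.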
Calculating the Mean from Grouped Data-
Coding Example
• The following table shows how to code the midpoints and find the sample mean of annual rainfall (in inches)
over 20 years in Kochi, Kerala
* $w = 8$
• $Y = mx + c$
• $\bar{Y} = m\bar{x} + c$
The Weighted Mean
o The arithmetic mean, as discussed earlier, gives equal importance (or weight) to each observation
in the data set.
o However, there are situations in which the values of individual observations in the data set are not of
equal importance.
o If values occur with different frequencies, then computing the A.M. of values (as opposed to the
A.M. of observations) may not be truly representative of the data set and thus may
be misleading.
o Under these circumstances, we may attach to each observation value a ‘weight’ $w_1, w_2, \ldots, w_N$ as
an indicator of its importance, perhaps because of size, and compute a weighted
mean or average, denoted by $\bar{x}_w$, as
$\bar{x}_w = \frac{\sum (w \times x)}{\sum w}$
$\bar{x}_w$ = symbol for the weighted mean
$w$ = weight assigned to each observation
When to use weighted arithmetic mean
• (i) when the importance of all the numerical values in the given data
set is not equal.
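A small sketch of the weighted mean. The values and weights are hypothetical, chosen only to show the arithmetic.

```python
# Hypothetical observations and their weights (illustration only).
values = [80, 90, 70]
weights = [2, 3, 5]

# Weighted mean: sum of (weight * value) divided by the sum of the weights.
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Unweighted arithmetic mean of the same values, for comparison.
simple_mean = sum(values) / len(values)

print(weighted_mean, simple_mean)  # 78.0 80.0
```

The heavily weighted value 70 pulls the weighted mean below the simple mean.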
• The median is the middle value in an ordered sequence of data, in the sense that half of
the observations are smaller and half are larger than this value.
• The median can be calculated for both ungrouped and grouped data sets.
The Median – for ungrouped data
In this case the data is arranged in either ascending or descending order of
magnitude.
If the number of observations (n) is an odd number, the median value is
$\text{Med} = \left(\frac{n+1}{2}\right)\text{th observation}$
1. Calculate the median of the following data that relates to the service time
(in minutes) per customer for 7 customers at a railway reservation
counter:
2. Calculate the median of the following data that relates to the number of
patients examined per hour in the outpatient ward (OPD) in a hospital:
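A sketch of the median for ungrouped data. The data values are hypothetical, since the example tables are not reproduced here. The even-n branch (average of the two middle values) is the usual convention and is an addition of this sketch; the text above states only the odd case.

```python
def median(values):
    """Median of ungrouped data."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((n + 1) / 2)th observation
    # Even n (assumed convention): average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

service_times = [3.5, 4.5, 3, 3.8, 5.0, 5.5, 4]   # hypothetical, n = 7
patients_per_hour = [10, 12, 15, 20]              # hypothetical, n = 4

print(median(service_times))      # 4
print(median(patients_per_hour))  # 13.5
```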
The Median – for grouped data
• $\text{Med} = l + \frac{\frac{n}{2} - cf}{f} \times h$
• where
• l = lower class limit (or boundary) of the median class interval.
• cf = cumulative frequency of the class prior to the median class
interval, that is, the sum of all the class frequencies up to, but not
including, the median class interval
• f = frequency of the median class
• h = width of the median class interval
• n = total number of observations in the distribution.
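The formula can be sketched as a function. The frequency table used here is hypothetical (classes 0-10, 10-20, ... of width 10), not the automobile survey of the example that follows.

```python
def grouped_median(lower_limits, frequencies, h):
    """Median of grouped data: l + ((n/2 - cf) / f) * h."""
    n = sum(frequencies)
    cf = 0  # cumulative frequency of the classes before the current one
    for l, f in zip(lower_limits, frequencies):
        if cf + f >= n / 2:   # first class whose cumulative frequency reaches n/2
            return l + (n / 2 - cf) / f * h
        cf += f
    raise ValueError("empty frequency table")

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 8, 12, 10, 5]

med = grouped_median(lower_limits, frequencies, h=10)
print(round(med, 2))  # 25.83
```

Here n = 40, so n/2 = 20 falls in the class 20-30, with cf = 13 and f = 12.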
The Median (grouped data) – Example
• A survey was conducted to determine the age (in years) of 120 automobiles. The
result of such a survey is as follows:
• Range
• Interquartile Range
• Variance and Standard deviation
Range
• The range is the simplest measure of dispersion and is based on the
location of the largest and the smallest values in the data.
• Thus the range is defined to be the difference between the largest and smallest
observed values in a data set.
Months 1 2 3 4 5 6 7 8 9 10 11 12
Sales (Rs ’000) 80 82 82 84 84 86 86 88 88 90 90 92
• Coefficient of range: given by (H – L) / (H + L), where H and L are the highest and lowest observed values
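As a sketch, the range and the ratio (H − L) / (H + L), commonly called the coefficient of range, for the monthly sales figures above:

```python
# Monthly sales (Rs '000) for months 1 to 12, from the table above.
sales = [80, 82, 82, 84, 84, 86, 86, 88, 88, 90, 90, 92]

H, L = max(sales), min(sales)                 # highest and lowest values
value_range = H - L                           # range
coefficient_of_range = (H - L) / (H + L)      # relative measure

print(value_range)                      # 12
print(round(coefficient_of_range, 4))   # 0.0698
```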
Example
Advantage
• (i) It is independent of the measure of central tendency and easy to calculate and understand.
• (ii) It is quite useful in cases where the purpose is only to find out the extent of extreme variation, such as industrial quality control, temperature, rainfall, and so on.
Disadvantage
• (i) The calculation of range is based on only two values, the largest and smallest in the data set, and fails to take account of any other observations.
• (ii) It is largely influenced by two extreme values and completely independent of the other values. For example, the range of the two data sets {1, 2, 3, 7, 12} and {1, 1, 1, 12, 12} is 11, but the two data sets differ in terms of overall dispersion of values.
• (iii) It does not describe the variation among values in the data between two extremes.
Quartiles
It is often desirable to divide data into four parts, with each part
containing approximately one-fourth, or 25% of the observations.
Step 3.
(a) If i is not an integer, round up. The next integer greater than i denotes the
position of the pth percentile.
(b) If i is an integer, the pth percentile is the average of the values in positions i and
i + 1.
Example: Determine the 85th percentile of
3310, 3355, 3450, 3480, 3480, 3490, 3520, 3540, 3550, 3650, 3730, 3925
The data are already in ascending order and n = 12, so i = (85/100) × 12 = 10.2.
Because i is not an integer, round up. The position of the 85th percentile is the
next integer greater than 10.2, the 11th position.
We see that the 85th percentile is the data value in the 11th position, or 3730.
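The rounding rule in Step 3 can be sketched as a function. It assumes the index is computed as i = (p/100) × n on data sorted in ascending order, which matches the value 10.2 used in the example; the function name and the restriction to integer p are choices made for this sketch.

```python
import math

def percentile(values, p):
    """pth percentile (integer p, 0 < p < 100) using the round-up / average rule."""
    if not 0 < p < 100:
        raise ValueError("p must be between 0 and 100 (exclusive)")
    ordered = sorted(values)
    n = len(ordered)
    if (p * n) % 100 != 0:
        # i is not an integer: round up to get the position (1-based).
        position = math.ceil(p * n / 100)
        return ordered[position - 1]
    # i is an integer: average the values in positions i and i + 1.
    i = p * n // 100
    return (ordered[i - 1] + ordered[i]) / 2

data = [3310, 3355, 3450, 3480, 3480, 3490,
        3520, 3540, 3550, 3650, 3730, 3925]

print(percentile(data, 85))  # 3730
print(percentile(data, 50))  # 3505.0
```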
Interquartile Range
• The limitations of the range can partially be overcome by using another measure
of variation, one that measures the spread over the middle half of the values in the data set and so
minimizes the influence of outliers (extreme values).
• Since a large number of values in the data set lie in the central part of the frequency distribution,
it is useful to study the interquartile range (also called the midspread).
• To compute this value, the entire data set is divided into four parts, each of which contains 25 per
cent of the observed values. The quartiles Q1, Q2 and Q3 are the highest values in the first three of these four parts.
• The interquartile range is a measure of dispersion or spread of values in the data set between the
third quartile, Q3, and the first quartile, Q1.
• In other words, the interquartile range (IQR) is the range for the middle 50 per cent
of the data.
Interquartile range (IQR) = Q3 – Q1
Interquartile Range
• The concept of IQR is shown in Fig.
Example
(i) Find the interquartile range of the given data
5, 8, 15, 26, 10, 18, 3, 12, 6, 14, 11
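A sketch of the calculation for this data, taking Q1 and Q3 as the 25th and 75th percentiles under the round-up / average position rule described for percentiles earlier. Other quartile conventions exist and can give different values for other data sets; the helper name is hypothetical.

```python
import math

def percentile(values, p):
    """pth percentile (integer p) using the round-up / average position rule."""
    ordered = sorted(values)
    n = len(ordered)
    if (p * n) % 100 != 0:
        return ordered[math.ceil(p * n / 100) - 1]   # round the position up
    i = p * n // 100
    return (ordered[i - 1] + ordered[i]) / 2         # average positions i and i + 1

data = [5, 8, 15, 26, 10, 18, 3, 12, 6, 14, 11]

q1 = percentile(data, 25)   # first quartile
q3 = percentile(data, 75)   # third quartile
iqr = q3 - q1               # interquartile range = Q3 - Q1

print(q1, q3, iqr)  # 6 15 9
```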
Upper quartile: $Q_3 = l + \frac{\frac{3N}{4} - f_c}{f_q} \times w$
Population mean: $\mu = \frac{\sum x}{N}$
Example
$\sigma^2 = \frac{\sum x^2}{N} - \mu^2$
Standard Score
Population standard score $= \frac{x - \mu}{\sigma}$
Suppose we observe a vial of compound that is 0.108 percent impure. Because our population has a mean
of 0.166 and a standard deviation of 0.058, an observation of 0.108 would have a standard score of –1:
$\frac{0.108 - 0.166}{0.058} = -1$
The standard score likewise indicates that an observed impurity of 0.282 percent deviates from the mean by 2(0.058) = 0.116 unit,
which is equal to +2 in terms of the number of standard deviations away from the mean.
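Both standard scores in this passage can be checked with a short sketch (the function name is illustrative):

```python
def standard_score(x, mu, sigma):
    """Number of standard deviations that x lies from the population mean."""
    return (x - mu) / sigma

mu, sigma = 0.166, 0.058    # population mean and standard deviation from the text

z_low = standard_score(0.108, mu, sigma)    # about -1
z_high = standard_score(0.282, mu, sigma)   # about +2

print(round(z_low, 2), round(z_high, 2))
```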
Example
• The wholesale prices of a commodity for seven consecutive days in a month are as follows:
• Days : 1 2 3 4 5 6 7
• Commodity price/quintal : 240 260 270 245 255 286 264
• Calculate the variance and standard deviation.
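A sketch of this calculation, treating the seven days as the whole population (divide by N). That choice is an assumption; if the prices are regarded as a sample, divide by n − 1 instead.

```python
import math

# Commodity price per quintal for days 1 to 7.
prices = [240, 260, 270, 245, 255, 286, 264]

N = len(prices)
mu = sum(prices) / N                               # mean = 260.0
sum_sq_dev = sum((x - mu) ** 2 for x in prices)    # sum of squared deviations = 1442.0

pop_variance = sum_sq_dev / N                      # population variance = 206.0
pop_sd = math.sqrt(pop_variance)                   # population standard deviation

print(mu, pop_variance, round(pop_sd, 2))  # 260.0 206.0 14.35
```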
Average Deviation Measures - Sample
Variance and Standard deviation
Sample mean: $\bar{x} = \frac{\sum x}{n}$
Example
• Calculate the sample variance and standard deviation for the
following data
Observation
863
903
957
1,041
1,138
1,204
1,354
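A sketch of the sample calculation for these seven observations, using the n − 1 divisor that applies to sample data:

```python
import math

observations = [863, 903, 957, 1041, 1138, 1204, 1354]

n = len(observations)
x_bar = sum(observations) / n                       # sample mean

# Sample variance: squared deviations from the sample mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in observations) / (n - 1)
s = math.sqrt(s2)                                   # sample standard deviation

print(round(x_bar, 2), round(s2, 2), round(s, 2))
```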
Average Deviation Measures - Population
Variance and Standard deviation
$\mu = \frac{\sum (f \times x)}{N}$
Average Deviation Measures - Population
Variance and Standard deviation (Coding)
Standard Deviation
• For grouped data:
• $\sigma = \sqrt{\sigma^2} = w \times \sqrt{\frac{\sum (f \times u^2)}{N} - \left(\frac{\sum (f \times u)}{N}\right)^2}$
• $\mu = x_0 + w \, \frac{\sum (f \times u)}{N}$
Where
$\mu$ = mean of the population
$x_0$ = value of the midpoint assigned the code 0
$w$ = numerical width of the class interval
$u$ = code assigned to each class
$f$ = frequency or number of observations in each class
$N$ = total number of observations in the population
Example
• Calculate the mean, variance and standard deviation for the given
data
Value Frequency
30-39 2
40-49 12
50-59 22
60-69 20
70-79 14
80-89 4
90-99 1
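One way to work this example is the coding method, sketched below. The choices of midpoint (average of the stated class limits), class width w = 10, and the class 60-69 as code 0 are assumptions of this sketch; the result is cross-checked against the direct grouped formula.

```python
import math

classes = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
freqs = [2, 12, 22, 20, 14, 4, 1]

midpoints = [(lo + hi) / 2 for lo, hi in classes]   # 34.5, 44.5, ..., 94.5
w = 10                                              # class width (assumed)
x0 = midpoints[3]                                   # 64.5, the class coded 0
codes = [round((x - x0) / w) for x in midpoints]    # -3, -2, ..., 3

N = sum(freqs)
mean_u = sum(f * u for f, u in zip(freqs, codes)) / N
mean_u2 = sum(f * u * u for f, u in zip(freqs, codes)) / N

mu = x0 + w * mean_u                          # grouped mean via coding
variance = w ** 2 * (mean_u2 - mean_u ** 2)   # grouped variance via coding
sigma = math.sqrt(variance)

# Cross-check with the direct formula: sum of f * (x - mu)^2 divided by N.
direct_variance = sum(f * (x - mu) ** 2 for f, x in zip(freqs, midpoints)) / N

print(round(mu, 2), round(variance, 2), round(sigma, 2))
```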
Summary
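Type of Data | Mean | Standard Deviation
Ungrouped, population | $\mu = \frac{\sum x}{N}$ | $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Ungrouped, sample | $\bar{x} = \frac{\sum x}{n}$ | $s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Grouped, population | $\mu = \frac{\sum (f \times x)}{N}$ | $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum f (x - \mu)^2}{N}}$
Grouped, population (coding) | $\mu = x_0 + w \, \frac{\sum (f \times u)}{N}$ | $\sigma = \sqrt{\sigma^2} = w \times \sqrt{\frac{\sum (f \times u^2)}{N} - \left(\frac{\sum (f \times u)}{N}\right)^2}$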
Coefficient of Variation
• Standard deviation is an absolute measure of variation and expresses variation in the same unit of measurement
as the arithmetic mean or the original data.
• A relative measure called the coefficient of variation (CV), developed by Karl Pearson, is a very useful measure for
• (i) comparing two or more data sets expressed in different units of measurement
• (ii) comparing data sets that are in the same unit of measurement but whose mean values are widely
dissimilar
$\text{Coefficient of variation (CV)} = \frac{\text{Standard deviation}}{\text{Mean}} \times 100 = \frac{\sigma}{\mu} \times 100$
Example
Each day, laboratory technician A completes on average 40 analyses with a standard
deviation of 5. Technician B completes on average 160 analyses per day with a
standard deviation of 15. Which technician shows less variability?
$\text{CV} = \frac{\sigma}{\mu} \times 100$
$\text{CV} = \frac{5}{40} \times 100 = 12.5\%$ for Technician A
$\text{CV} = \frac{15}{160} \times 100 \approx 9.4\%$ for Technician B
So, we find that Technician B, who has more absolute variation in output than Technician A, has less relative
variation because the mean output for B is much greater than that for A.
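The comparison can be reproduced with a short sketch (the function name is illustrative):

```python
def coefficient_of_variation(sd, mean):
    """Relative variation: standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_a = coefficient_of_variation(5, 40)     # Technician A
cv_b = coefficient_of_variation(15, 160)   # Technician B

print(cv_a)            # 12.5
print(round(cv_b, 1))  # 9.4
print("less relative variation:", "B" if cv_b < cv_a else "A")
```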