Module 1

The document discusses organizing and presenting numerical data through various statistical methods. It covers ordered arrays, frequency distributions, histograms, polygons and cumulative distributions as ways to summarize and visualize numerical data.

Uploaded by

Anoop Thomas
Copyright
© All Rights Reserved

Module 1

Learning Objective
• To convert raw data to useful information

• To construct and use data arrays

• To construct and use frequency distributions

• To graph frequency distributions with histograms, polygons, and ogives

• To use frequency distributions to make decisions


What is Statistics?
Statistics: The science of collecting, describing, and interpreting data.

Two areas of statistics:

Descriptive Statistics: collection, presentation, and description of sample data.

Inferential Statistics: making decisions and drawing conclusions about populations.
Introduction
• The field of statistics is the science of learning from data.
• Statistical knowledge helps you use the proper methods to collect
the data, employ the correct analyses, and effectively present the
results.
• Statistics is a crucial process behind how we make discoveries in
science, make decisions based on data, and make predictions.
• Statistics allows you to understand a subject much more deeply.

Statistics uses Numerical Evidence to draw valid Conclusions


Types of Statistics

Descriptive statistics: Collecting, summarizing and describing the data.
Inferential statistics: Drawing conclusions and/or making decisions concerning a population based only on sample data.
Example: A recent study examined the math and verbal SAT scores of high school seniors
across the country. Which of the following statements are descriptive in nature and
which are inferential?

• The mean math SAT score was 492.

• The mean verbal SAT score was 475.

• Students in the Northeast scored higher in math but lower in verbal.

• 80% of all students taking the exam were headed for college.

• 32% of the students scored above 610 on the verbal SAT.

• The math SAT scores are higher than they were 10 years ago.
Introduction to Basic Terms
Population: A collection, or set, of individuals or objects or events
whose properties are to be analyzed.
Two kinds of populations: finite or infinite.

Sample: A subset of the population.


Variable: A characteristic about each individual element of a population or sample.

Data (singular): The value of the variable associated with one element of a population
or sample. This value may be a number, a word, or a symbol.

Data (plural): The set of values collected for the variable from each of the elements
belonging to the sample.

Experiment: A planned activity whose results yield a set of data.

Parameter: A numerical value summarizing all the data of an entire population.

Statistic: A numerical value summarizing the sample data.


Example: A college dean is interested in learning about the average age of faculty.
Identify the basic terms in this situation.

The population is the age of all faculty members at the college.


A sample is any subset of that population. For example, we might select 10 faculty
members and determine their age.
The variable is the “age” of each faculty member.
One data value would be the age of a specific faculty member.
The data would be the set of values in the sample.
The experiment would be the method used to select the ages forming the sample and
determining the actual age of each faculty member in the sample.
The parameter of interest is the “average” age of all faculty at the college.
The statistic is the “average” age for all faculty in the sample.
Difference between Samples and Populations
▪ Statisticians gather data from a sample. They use this information to make inferences about the
population that the sample represents. Thus, a population is a whole, and a sample is a fraction or
segment of that whole.

▪ We will study samples in order to be able to describe populations. Our hospital may study a small,
representative group of X-ray records rather than examining each record for the last 50 years. The Gallup
Poll may interview a sample of only 2,500 adult Americans in order to predict the opinion of all adults
living in the United States.

▪ Studying samples is easier than studying the whole population; it costs less and takes less time. Often,
testing an airplane part for strength destroys the part; thus, testing fewer parts is desirable. Sometimes
testing involves human risk; thus, use of sampling reduces that risk to an acceptable level.
Presenting Data in Tables and Charts
Organizing Numerical Data

Numerical data can be organized as:
▪ Ordered array
▪ Frequency distributions and cumulative distributions, presented as tables, histograms, polygons, and ogives

Organizing Numerical Data:
Ordered Array
▪ An ordered array is a sequence of data, in rank order, from the smallest value to the largest
value.
▪ Shows range (minimum value to maximum value)
▪ May help identify outliers (unusual observations)

Weight of sample of daily production, in kg (raw data)

Day Shift
22 17 25 42 18 32
38 19 20 27 21 22
16 17 20 18 19 18

Night Shift
18 23 19 32 33 41
18 28 19 20 21 45
Organizing Numerical Data:
Ordered Array
Weight of sample of daily production, in kg (sorted from smallest to largest)

Day Shift
16 17 17 18 18 18
19 19 20 20 21 22
22 25 27 32 38 42

Night Shift
18 18 19 19 20 21
23 28 32 33 41 45
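A minimal Python sketch of building an ordered array from the raw shift weights. The variable names are illustrative and not part of the original slides.

```python
# Raw weights (kg) copied from the slide; names are illustrative.
day_shift = [22, 17, 25, 42, 18, 32, 38, 19, 20, 27, 21, 22, 16, 17, 20, 18, 19, 18]
night_shift = [18, 23, 19, 32, 33, 41, 18, 28, 19, 20, 21, 45]

# An ordered array is simply the data sorted from smallest to largest.
day_ordered = sorted(day_shift)
night_ordered = sorted(night_shift)

# The ends of the ordered array give the range (minimum to maximum).
day_min, day_max = day_ordered[0], day_ordered[-1]
night_min, night_max = night_ordered[0], night_ordered[-1]

print(day_ordered)
print(night_ordered)
print(f"Day shift: {day_min} to {day_max}; night shift: {night_min} to {night_max}")
```

Once the data is sorted, repeated values and gaps between neighbouring values are easy to spot.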
Organizing Numerical Data:
Ordered Array
• Advantages:

• We can quickly notice the lowest and highest values in the data.
• We can easily divide the data into sections.
• We can see whether any values appear more than once in the array
• We can observe the distance between succeeding values in the data.

Disadvantage:

• It is a cumbersome form for displaying large quantities of data.


• We need to compress the information and still be able to use it for
interpretation and decision making
Organizing Numerical Data:
Frequency Distribution
▪ The frequency distribution is a summary table in which the data are arranged into numerically ordered classes.

▪ You must give attention to selecting the appropriate number of class groupings for the table, determining a suitable width of a class grouping, and establishing the boundaries of each class grouping to avoid overlapping.
Organizing Numerical Data:
Frequency Distribution
Relative Frequency Distribution
• We can also express the frequency of each value as a fraction or a percentage of the total number of
observations.
• A relative frequency distribution presents frequencies in terms of fractions or percentages.
Guidelines for Selecting Width of Classes

• Approximate Class Width = $\dfrac{\text{Largest Data Value} - \text{Smallest Data Value}}{\text{Number of Classes}}$
Organizing Numerical Data:
Frequency Distribution Example

Example: A manufacturer of insulation randomly selects 20 samples of iron rod and records the temperature of each.

17, 13, 12, 24, 24, 21, 27, 26, 27, 35, 30, 32, 43, 41, 38, 44, 37, 58, 46, 53
Organizing Numerical Data:
Frequency Distribution Example
• Data in Ordered Array:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
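A hedged Python sketch of how a frequency distribution and a relative frequency distribution could be built from the ordered array above. The choice of 5 classes, the rounded width of 10, the first boundary of 10, and the "lower but less than upper" class convention are assumptions made for illustration.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

num_classes = 5                                        # assumption
approx_width = (max(data) - min(data)) / num_classes   # guideline formula: (58 - 12) / 5
width = 10                                             # assumption: rounded up to a convenient value
first_boundary = 10                                    # assumption: convenient starting boundary

# Count the observations falling in each class [lower, upper).
frequency = {}
for k in range(num_classes):
    lower = first_boundary + k * width
    upper = lower + width
    frequency[(lower, upper)] = sum(1 for x in data if lower <= x < upper)

n = len(data)
print(f"approximate class width = {approx_width:.1f}")
for (lower, upper), f in frequency.items():
    print(f"{lower} but less than {upper}: frequency = {f}, relative frequency = {f / n:.2f}")
```

Half-open classes ensure that every observation falls in exactly one class, which is the "avoid overlapping" requirement.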
Graphing Numerical Data:
The Histogram
• Data in Ordered Array:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: Histogram. Horizontal axis: class midpoints (5, 15, 25, 35, 45, 55, More), with bars meeting at the class boundaries; vertical axis: frequency (0 to 7). There are no gaps between bars.]
Graphing Numerical Data:
The Frequency Polygon
• Data in Ordered Array:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: Frequency polygon. Horizontal axis: class midpoints (5, 15, 25, 35, 45, 55, More); vertical axis: frequency (0 to 7).]
Tabulating Numerical Data:
Cumulative Frequency
▪ Shows the number of items with values less than or equal to the
upper limit of each class

▪ Shows the proportion of items with values less than or equal to the
upper limit of each class.

▪ Shows the percentage of items with values less than or equal to the
upper limit of each class.
Tabulating Numerical Data:
Cumulative Frequency
• Data in Ordered Array:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
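A small Python sketch of the cumulative frequency and cumulative percentage for this data, which are also the points an ogive would plot. The upper class boundaries 20, 30, ..., 60 and the "less than the upper boundary" convention are assumptions chosen to match classes of width 10.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
upper_boundaries = [20, 30, 40, 50, 60]   # assumption: classes of width 10 starting at 10

n = len(data)
# Number of observations below each upper class boundary.
cumulative_frequency = [sum(1 for x in data if x < ub) for ub in upper_boundaries]
# The same counts expressed as a percentage of all observations.
cumulative_percentage = [100 * cf / n for cf in cumulative_frequency]

for ub, cf, pct in zip(upper_boundaries, cumulative_frequency, cumulative_percentage):
    print(f"less than {ub}: cumulative frequency = {cf}, cumulative % = {pct:.0f}")
```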
Graphing Numerical Data:
The Ogive (Cumulative % Polygon)
• A cumulative frequency distribution enables us to see how many
observations lie above or below certain values, rather than merely
recording the number of items within intervals.
Graphing Numerical Data:
The Ogive (Cumulative % Polygon)
• Data in Ordered Array:
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: Ogive. Horizontal axis: class boundaries, not midpoints (10, 20, 30, 40, 50, 60); vertical axis: cumulative percentage (0 to 100).]


Stem and Leaf Plot
A Stem and Leaf Plot is a special table where each data value is split into a "stem" (the first digit
or digits) and a "leaf" (usually the last digit).
Measures of Central Tendency and
Dispersion in Frequency Distributions
Central Tendency
• Central tendency is the middle point of a distribution. Measures of
central tendency are also called measures of location.

Mean: 6.97
Central Tendency
In the following figure, the central location of curve B lies to the right of
those of curve A and curve C. Notice that the central location of curve
A is equal to that of curve C.

COMPARISON OF CENTRAL LOCATION OF THREE CURVES


Excel Module 1 Test results
Dispersion
• Dispersion is the spread of the data in a distribution, that is, the
extent to which the observations are scattered. Notice that curve A in
the figure has a wider spread, or dispersion, than curve B.

COMPARISON OF DISPERSION OF TWO CURVES


• Skewness: Curves representing the data points in the data set may be either symmetrical or skewed. Symmetrical curves,
like the one in the following figure, are such that a vertical line drawn from the center of the curve to the horizontal axis
divides the area of the curve into two equal parts. Each part is the mirror image of the other.

COMPARISON OF CENTRAL LOCATION OF THREE CURVES

• Kurtosis: When we measure the kurtosis of a distribution, we are measuring its peakedness.
• For example, curves A and B differ only in that one is more peaked than the other. They have the same central location
and dispersion, and both are symmetrical. Statisticians say that the two curves have different degrees of kurtosis.

TWO CURVES WITH THE SAME CENTRAL LOCATION BUT DIFFERENT KURTOSIS
Ungrouped vs Grouped Data
• Ungrouped data
• Ungrouped data, also known as raw data, is data that has not been placed in any
group or category after collection.
• Data is categorized by numbers or characteristics; data that has not been
put into any such category is ungrouped.
• For example, the number of individuals residing in an area is ungrouped data, or raw information,
because nothing has been categorized.
• We can therefore conclude that ungrouped data is data used to show information on an
individual member of a sample or population.
Ungrouped vs Grouped Data
• Grouped data

• Grouped data is data that is classified into groups after collection.

• The number of individuals residing in an area is ungrouped data, or raw information, because nothing has been categorized. But if it is categorized as number of men, number of women and number of children, then it is considered grouped data.

• The raw data is categorized into various groups and a table is created.

• The primary purpose of the table is to show the data points occurring in each group.
A MEASURE OF CENTRAL TENDENCY
1. The Arithmetic Mean
2. The Weighted Mean
3. The Median
4. The Mode
Calculating the Mean from Ungrouped Data
A MEASURE OF CENTRAL TENDENCY
The Arithmetic Mean
• Suppose the monthly income (in Rs) of six families is given as:
1600, 1500, 1400, 1525, 1625, 1630.

• The mean family income is obtained by adding up the incomes and dividing by the number of families.

$\text{Arithmetic Mean} = \dfrac{1600 + 1500 + 1400 + 1525 + 1625 + 1630}{6} = \dfrac{9280}{6} \approx 1546.67$



A MEASURE OF CENTRAL TENDENCY
The Arithmetic Mean
• A sample of a population consists of n observations (a lowercase $n$) with a mean of $\bar{x}$.

• Remember that the measures we compute for a sample are called statistics.

• The notation is different when we are computing measures for the entire
population, that is, for the group containing every element we are describing.

• The mean of a population is symbolized by $\mu$, which is the Greek letter mu.

• The number of elements in a population is denoted by the capital italic letter $N$.

Calculating the Mean from Ungrouped Data

$\mu = \dfrac{\sum x}{N}$ (population arithmetic mean)

$\bar{x} = \dfrac{\sum x}{n}$ (sample arithmetic mean)


Example
• The following table shows the scores obtained by seven different students
taking an online preparatory quiz.

Table: Quiz Marks
Student          1  2  3  4  5  6  7
Marks Obtained   9  7  7  6  4  4  2

• We calculate the mean of this sample of seven students as follows:

• $\bar{x} = \dfrac{\sum x}{n}$

• $\bar{x} = \dfrac{9 + 7 + 7 + 6 + 4 + 4 + 2}{7} = \dfrac{39}{7} \approx 5.6$ ← sample mean
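The same calculation as a short Python sketch, applied to the quiz marks and to the six family incomes from the earlier slide. The function name `arithmetic_mean` is illustrative.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

quiz_marks = [9, 7, 7, 6, 4, 4, 2]
family_incomes = [1600, 1500, 1400, 1525, 1625, 1630]

quiz_mean = arithmetic_mean(quiz_marks)
income_mean = arithmetic_mean(family_incomes)

print(round(quiz_mean, 1))    # 5.6
print(round(income_mean, 2))  # 1546.67
```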
Calculating the Mean from Grouped Data
• A frequency distribution consists of data that are grouped by classes.
• Each value of an observation falls somewhere in one of the classes.
• Unlike the last example, we do not know the separate values of every observation.

• Steps to be followed:
• Calculate the midpoint of each class. To make midpoints come out in whole cents, we round
up. For example, the midpoint for the first class becomes 25.00, rather than 24.995.
• Multiply each midpoint by the frequency of observations in that class.
• Sum all these results, and divide the sum by the total number of observations in the sample.
Calculating the Mean from Grouped Data
• Formula
• $\bar{x} = \dfrac{\sum (f \times x)}{n}$
• Where
• $\bar{x}$ = sample mean
• $\sum$ = symbol meaning “the sum of”
• $f$ = frequency (number of observations) in each class
• $x$ = midpoint for each class in the sample
• $n$ = number of observations in the sample
Example
Calculating the Mean from Grouped Data-
Coding
• Using a technique called coding, we eliminate the problem of large or inconvenient midpoints.
• Instead of using the actual midpoints to perform our calculations, we can assign small-value consecutive
integers (whole numbers) called codes to each of the midpoints.
• The integer zero can be assigned anywhere, but to keep the integers small, we will assign zero to the midpoint
in the middle (or the one nearest to the middle) of the frequency distribution.
• Then we can assign negative integers to values smaller than that midpoint and positive integers to those
larger, as follows:

Class      1-5  6-10  11-15  16-20  21-25  26-30  31-35  36-40  41-45
Code (u)    -4    -3     -2     -1      0      1      2      3      4

Here $x_0$ is the midpoint of the class assigned the code 0 (class 21-25).
Calculating the Mean from Grouped Data-
Coding
• Formula
• $\bar{x} = x_0 + w \dfrac{\sum (u \times f)}{n}$

Symbolically, statisticians use $x_0$ to represent the midpoint that is assigned the code 0, and $u$ for the coded midpoint.
• Where
• $\bar{x}$ = mean of sample
• $x_0$ = value of the midpoint assigned the code 0
• $w$ = numerical width of the class interval
• $u$ = code assigned to each class
• $f$ = frequency or number of observations in each class
• $n$ = total number of observations in the sample
Calculating the Mean from Grouped Data-
Coding Example
• The following table is used to code the midpoints and find the sample mean of annual rainfall (in inches)
over 20 years in Kochi, Kerala.

Annual Rainfall (Class) Frequency Lower Class Boundary


0-7 2
8-15 6
16-23 3
24-31 5
32-39 2
40-47 2

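* w = 8 (class width)
• If $Y = mx + c$, then $\bar{Y} = m\bar{x} + c$. Coding is a linear transformation of this kind, which is why the mean can be recovered from the coded values.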
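A hedged Python sketch of the grouped mean for the rainfall table, computed both directly from the midpoints and with coding. Assigning the code 0 to the class 16-23 is an assumption (with six classes there is no single middle class), and the variable names are illustrative.

```python
classes = [(0, 7), (8, 15), (16, 23), (24, 31), (32, 39), (40, 47)]
freqs = [2, 6, 3, 5, 2, 2]
w = 8                       # class width, as given on the slide
n = sum(freqs)              # 20 years

# Direct method: mean = sum(f * x) / n, with x the class midpoints.
midpoints = [(lo + hi) / 2 for lo, hi in classes]
direct_mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Coding method: mean = x0 + w * sum(u * f) / n.
zero_index = 2              # assumption: class 16-23 receives code 0
x0 = midpoints[zero_index]
codes = [i - zero_index for i in range(len(classes))]   # -2, -1, 0, 1, 2, 3
coded_mean = x0 + w * sum(u * f for u, f in zip(codes, freqs)) / n

print(direct_mean, coded_mean)   # both methods give 21.5
```

Both routes agree because the codes are a linear transformation of the midpoints, so the choice of the zero class does not change the result.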
A MEASURE OF CENTRAL TENDENCY
1. The Arithmetic Mean
2. The Weighted Mean
3. The Median
4. The Mode
The Weighted Mean
o The arithmetic mean, as discussed earlier, gives equal importance (or weight) to each observation
in the data set.
o However, there are situations in which the values of individual observations in the data set are not of
equal importance.
o If values occur with different frequencies, then computing the A.M. of values (as opposed to the
A.M. of observations) may not be truly representative of the data set characteristic and thus may
be misleading.
o Under these circumstances, we may attach to each observation value a ‘weight’ $w_1, w_2, \ldots, w_N$ as
an indicator of its importance, perhaps because of size, and compute a weighted
mean or average denoted by $\bar{x}_w$ as

$\bar{x}_w = \dfrac{\sum (w \times x)}{\sum w}$

$\bar{x}_w$ = symbol for the weighted mean
$w$ = weight assigned to each observation
When to use weighted arithmetic mean

• (i) when the importance of all the numerical values in the given data
set is not equal.

• (ii) when the frequencies of various classes are widely varying.

• (iii) where there is a change either in the proportion of numerical values or in the proportion of their frequencies.
Example
• A quiz was held to decide the award of a scholarship. The weights of various subjects were different. The
marks obtained by 3 candidates (out of 100 in each subject) are given below:

                                        Students
Subjects               Weights   Ron   Harry   Hermione
Microeconomics            4       60     57       62
Financial Accounting      3       62     61       67
Business Statistics       2       55     53       60
Business Ethics           1       67     77       49

• Calculate the weighted A.M. to award the scholarship


Example
• The owner of a general store was interested in knowing the mean contribution (sales price minus variable cost)
of his stock of 5 items. The data is given below:

Product Contribution Quantity Sold


1 6 160
2 11 60
3 8 260
4 4 460
5 14 110
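A minimal Python sketch of the weighted mean applied to the two examples above. Using the quantity sold as the weight for each product's contribution is an assumption about how the store example is meant to be solved; the function and variable names are illustrative.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Scholarship example: subject weights and marks per candidate.
subject_weights = [4, 3, 2, 1]
marks = {
    "Ron": [60, 62, 55, 67],
    "Harry": [57, 61, 53, 77],
    "Hermione": [62, 67, 60, 49],
}
scores = {name: weighted_mean(m, subject_weights) for name, m in marks.items()}
winner = max(scores, key=scores.get)

# General store example: contribution per product, weighted by quantity sold (assumption).
contribution = [6, 11, 8, 4, 14]
quantity_sold = [160, 60, 260, 460, 110]
mean_contribution = weighted_mean(contribution, quantity_sold)

print(scores, winner)
print(round(mean_contribution, 2))
```

With these weights the candidate with the highest weighted A.M. is Hermione (61.8), ahead of Ron (60.3) and Harry (59.4).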
A MEASURE OF CENTRAL TENDENCY
1. The Arithmetic Mean
2. The Weighted Mean
3. The Median
4. The Mode
The Median
• The median may be defined as the middle value in the data set when its elements are
arranged in a sequential order, that is, in either ascending or descending order of
magnitude.

• It is called a middle value in an ordered sequence of data in the sense that half of
the observations are smaller and half are larger than this value.

• The median is thus a measure of the location or centrality of the observations.

• The median can be calculated for both ungrouped and grouped data sets.
The Median – for ungrouped data
In this case the data is arranged in either ascending or descending order of
magnitude.

Median Value
If the number of observations (n) is an odd number:
$Med = \left(\dfrac{n+1}{2}\right)\text{th observation}$

If the number of observations (n) is an even number:
$Med = \dfrac{\left(\dfrac{n}{2}\right)\text{th observation} + \left(\dfrac{n}{2} + 1\right)\text{th observation}}{2}$
The Median (ungrouped data) – Examples

1. Calculate the median of the following data that relates to the service time
(in minutes) per customer for 7 customers at a railway reservation
counter:

3.5, 4.5, 3, 3.8, 5.0, 5.5, 4

2. Calculate the median of the following data that relates to the number of
patients examined per hour in the outpatient ward (OPD) in a hospital:

10, 12, 15, 20, 13, 24, 17, 18
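A short Python sketch of the ungrouped median rule applied to the two examples above. The function name is illustrative.

```python
def median_ungrouped(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation (index mid in 0-based terms).
        return ordered[mid]
    # Even n: average of the (n / 2)th and (n / 2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

service_times = [3.5, 4.5, 3, 3.8, 5.0, 5.5, 4]
patients_per_hour = [10, 12, 15, 20, 13, 24, 17, 18]

print(median_ungrouped(service_times))      # 4
print(median_ungrouped(patients_per_hour))  # 16.0
```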


The Median – for grouped data
• To find the median value for grouped data, first identify the class interval which
contains the median value or (n/2)th observation of the data set.
• To identify such class interval, find the cumulative frequency of each class until the
class for which the cumulative frequency is equal to or greater than the value of
(n/2)th observation.
• The value of the median within that class is found by using interpolation. That is, it
is assumed that the observation values are evenly spaced over the entire class
interval.
• The following formula is used to determine the median of grouped data:

• $Med = l + \dfrac{\frac{n}{2} - cf}{f} \times h$
The Median – for grouped data
• $Med = l + \dfrac{\frac{n}{2} - cf}{f} \times h$
• where
• l = lower class limit (or boundary) of the median class interval.
• cf = cumulative frequency of the class prior to the median class
interval, that is, the sum of all the class frequencies up to, but not
including, the median class interval
• f = frequency of the median class
• h = width of the median class interval
• n = total number of observations in the distribution.
The Median (grouped data) – Example
• A survey was conducted to determine the age (in years) of 120 automobiles. The
result of such a survey is as follows:

Age of Auto No of Auto Lcb cf


0-4 13
4-8 29
8-12 48
12-16 22
16-20 8
The Median (grouped data) – Example

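Age of Auto   No of Auto   Lcb    cf
0-4               13         0    13
4-8               29         4    42
8-12              48         8    90
12-16             22        12   112
16-20              8        16   120

The median class is 8-12, the first class whose cumulative frequency (90) reaches n/2 = 60.
Median = 8 + (((120/2) − 42)/48) × 4 = 9.5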
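A hedged Python sketch of the grouped-median interpolation for the automobile age table. The function name and argument layout are illustrative, and equal class widths are assumed.

```python
def grouped_median(lower_boundaries, freqs, width):
    """Med = l + ((n/2 - cf) / f) * h, locating the median class by cumulative frequency."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no observations")
    cf = 0  # cumulative frequency of the classes before the current one
    for l, f in zip(lower_boundaries, freqs):
        if f > 0 and cf + f >= n / 2:
            return l + ((n / 2 - cf) / f) * width
        cf += f
    raise ValueError("median class not found")

lower_boundaries = [0, 4, 8, 12, 16]
freqs = [13, 29, 48, 22, 8]

median_age = grouped_median(lower_boundaries, freqs, width=4)
print(median_age)   # 9.5
```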
A MEASURE OF CENTRAL TENDENCY
1. The Arithmetic Mean
2. The Weighted Mean
3. The Median
4. The Mode
The Mode
• The mode is the value that is repeated most often in the data set.
Mode for the ungrouped data
Example
DISPERSION
Suppose that over a six-year period the net profits (in percent) of two firms are as follows:
Firm 1: 5.8, 5.5, 5.0, 5.7, 5.1, 5.4 (mean ≈ 5.42)
Firm 2: 5.6, 3.2, 3.3, 4.3, 4.0, 12.1 (mean ≈ 5.42)
USEFUL MEASURES OF DISPERSION

• Range
• Interquartile Range
• Variance and Standard deviation
Range
• The range is the simplest measure of dispersion and is based on the
location of the largest and the smallest values in the data.

• Thus the range is defined to be the difference between the largest and smallest
observed values in a data set.

• Range (R) = Highest value of an observation − Lowest value of an observation = H − L
Range – Example (Ungrouped Data)
• The following are the sales figures of a firm for the last 12 months

Months 1 2 3 4 5 6 7 8 9 10 11 12
Sales (Rs ’000) 80 82 82 84 84 86 86 88 88 90 90 92

• Calculate the range of the given data.


Range – Example (Grouped Data)
• The following data show the waiting time of telephone calls to be matured:

Waiting Time (Seconds)   Frequency
10-25                        6
26-50                       10
51-75                        8
76-100                       4
101-125                      4
Coefficient of Range
• This is a relative measure of dispersion and is based on the
value of the range. It is also called range coefficient of
dispersion.

• Given by (H – L) / (H + L)
Example

• The following are the wages of 8 workers in a factory. Find the range and coefficient of range. Wages are in dollars:
1400, 1450, 1520, 1380, 1485, 1495, 1575, 1440.
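A small Python sketch of the range and the coefficient of range for the wage data. Variable names are illustrative.

```python
wages = [1400, 1450, 1520, 1380, 1485, 1495, 1575, 1440]

highest = max(wages)   # H
lowest = min(wages)    # L

value_range = highest - lowest                               # R = H - L
coefficient_of_range = (highest - lowest) / (highest + lowest)   # (H - L) / (H + L)

print(value_range)                      # 195
print(round(coefficient_of_range, 3))   # 0.066
```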
Range

Advantages
• (i) It is independent of the measure of central tendency and easy to calculate and understand.
• (ii) It is quite useful in cases where the purpose is only to find out the extent of extreme variation, such as industrial quality control, temperature, rainfall, and so on.

Disadvantages
• (i) The calculation of range is based on only two values, the largest and the smallest in the data set, and fails to take account of any other observations.
• (ii) It is largely influenced by two extreme values and completely independent of the other values. For example, the range of the two data sets {1, 2, 3, 7, 12} and {1, 1, 1, 12, 12} is 11, but the two data sets differ in terms of overall dispersion of values.
• (iii) It does not describe the variation among values in the data between two extremes.
Quartiles
It is often desirable to divide data into four parts, with each part
containing approximately one-fourth, or 25% of the observations.

Q1 = First quartile, or 25th percentile
Q2 = Second quartile, or 50th percentile (also the median)
Q3 = Third quartile, or 75th percentile
PERCENTILE

• The pth percentile is a value such that at least p percent of the
observations are less than or equal to this value and at least (100 - p)
percent of the observations are greater than or equal to this value.
PERCENTILE calculation steps
Step 1. Arrange the data in ascending order (smallest value to largest value).

Step 2. Compute an index i = (p/100) * n (n = sample size)

Step 3.

(a) If i is not an integer, round up. The next integer greater than i denotes the
position of the pth percentile.

(b) If i is an integer, the pth percentile is the average of the values in positions i and
i + 1.
Example: Determine the 85th percentile
3310, 3355, 3450, 3480, 3480, 3490, 3520, 3540, 3550, 3650, 3730, 3925
Example
3310, 3355, 3450, 3480, 3480, 3490, 3520, 3540, 3550, 3650, 3730, 3925

i = (85/100) × 12 = 10.2

Because i is not an integer, round up. The position of the 85th percentile is the
next integer greater than 10.2, the 11th position.

We see that the 85th percentile is the data value in the 11th position, or 3730.
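The three calculation steps can be sketched in Python as follows. The function name is illustrative, and the sketch assumes an integer p with 0 < p < 100; integer arithmetic is used to decide whether the index i is a whole number, which avoids floating-point surprises.

```python
def percentile(values, p):
    """pth percentile using the index rule i = (p / 100) * n."""
    ordered = sorted(values)                 # Step 1: ascending order
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)    # Step 2: i = whole + remainder / 100
    if remainder != 0:
        # Step 3(a): i is not an integer, so round up to the next position.
        position = whole + 1
        return ordered[position - 1]         # positions are 1-based
    # Step 3(b): i is an integer, so average the values in positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2

data = [3310, 3355, 3450, 3480, 3480, 3490, 3520, 3540, 3550, 3650, 3730, 3925]

p85 = percentile(data, 85)
p50 = percentile(data, 50)
print(p85)   # 3730
print(p50)   # 3505.0
```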
Interquartile Range
• The limitations or disadvantages of the range can partially be overcome by using another measure
of variation which measures the spread over the middle half of the values in the data set so as to
minimize the influence of outliers (extreme values) in the calculation of range.

• Since a large number of values in the data set lie in the central part of the frequency distribution,
therefore it is necessary to study the Interquartile Range (also called midspread).

• To compute this value, the entire data set is divided into four parts each of which contains 25 per
cent of the observed values. The quartiles are the highest values in each of these four parts.

• The interquartile range is a measure of dispersion or spread of values in the data set between the
third quartile, Q3 and the first quartile, Q1 .

• In other words, the interquartile range or deviation (IQR) is the range for the middle 50 per cent
of the data.
Interquartile range (IQR) = Q3 – Q1
Interquartile Range
• The concept of IQR is shown in Fig.
Example
(i) Find the interquartile range of the given data
5, 8, 15, 26, 10, 18, 3, 12, 6, 14, 11

(ii) Find the interquartile range of the given data
11, 31, 21, 19, 8, 54, 35, 26, 29, 31, 35, 54
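A hedged Python sketch of the interquartile range for the two data sets, taking Q1 and Q3 as the 25th and 75th percentiles under the index rule i = (p/100) × n. This is an assumption: other quartile conventions exist and can give slightly different values. The function names are illustrative.

```python
def percentile(values, p):
    """pth percentile (integer p, 0 < p < 100) using the index rule i = (p / 100) * n."""
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)
    if remainder != 0:
        return ordered[whole]                        # round up: 1-based position whole + 1
    return (ordered[whole - 1] + ordered[whole]) / 2  # average of positions i and i + 1

def interquartile_range(values):
    """IQR = Q3 - Q1."""
    return percentile(values, 75) - percentile(values, 25)

data_i = [5, 8, 15, 26, 10, 18, 3, 12, 6, 14, 11]
data_ii = [11, 31, 21, 19, 8, 54, 35, 26, 29, 31, 35, 54]

iqr_i = interquartile_range(data_i)     # Q1 = 6, Q3 = 15
iqr_ii = interquartile_range(data_ii)   # Q1 = 20.0, Q3 = 35.0
print(iqr_i, iqr_ii)                    # 9 15.0
```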
Quartiles of grouped data

Lower quartile: $Q_1 = l + \dfrac{\frac{N}{4} - f_c}{f_q} \times w$

Upper quartile: $Q_3 = l + \dfrac{\frac{3N}{4} - f_c}{f_q} \times w$

where
$l$ is the lower class boundary of the quartile class,
$N$ is the total frequency of the distribution,
$f_c$ is the cumulative frequency before the quartile class,
$f_q$ is the frequency of the quartile class,
$w$ is the class width.
Example: Step 1
Marks Freq. C.F LCB
21-30 12
31-40 21
41-50 34
51-60 20
61-70 6
71-80 4
81-90 3
Example: Step 1
Marks Freq. C.F LCB
21-30 12 12 20.5
31-40 21 33 30.5
41-50 34 67 40.5
51-60 20 87 50.5
61-70 6 93 60.5
71-80 4 97 70.5
81-90 3 100 80.5
Example: Step 1
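Value    Fq    Fc     L
21-30    12    12   20.5
31-40    21    33   30.5
41-50    34    67   40.5
51-60    20    87   50.5
61-70     6    93   60.5
71-80     4    97   70.5
81-90     3   100   80.5

$Q_3 = l + \dfrac{\frac{3N}{4} - f_c}{f_q} \times w$

Now, $\frac{3N}{4} = 75$, so the quartile class is 51-60 and $l = 50.5$.
$f_c = 67$; $f_q = 20$; $w = 10$

$Q_3 = 50.5 + \dfrac{75 - 67}{20} \times 10 = 54.5$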
Example: Step 1
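Value    Fq    Fc     L
21-30    12    12   20.5
31-40    21    33   30.5
41-50    34    67   40.5
51-60    20    87   50.5
61-70     6    93   60.5
71-80     4    97   70.5
81-90     3   100   80.5

$Q_1 = l + \dfrac{\frac{N}{4} - f_c}{f_q} \times w$

Now, $\frac{N}{4} = 25$, so the quartile class is 31-40 and $l = 30.5$.
$f_c = 12$; $f_q = 21$; $w = 10$

$Q_1 = 30.5 + \dfrac{25 - 12}{21} \times 10 \approx 36.69$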
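A short Python sketch of the grouped-quartile formula applied to the marks table. The function name and the `fraction` parameter (0.25 for Q1, 0.75 for Q3) are illustrative, and equal class widths are assumed.

```python
def grouped_quantile(lower_boundaries, freqs, width, fraction):
    """Q = l + ((fraction * N - fc) / fq) * w for the class containing the target position."""
    total = sum(freqs)
    target = fraction * total
    fc = 0  # cumulative frequency before the current class
    for l, fq in zip(lower_boundaries, freqs):
        if fq > 0 and fc + fq >= target:
            return l + ((target - fc) / fq) * width
        fc += fq
    raise ValueError("quartile class not found")

lower_boundaries = [20.5, 30.5, 40.5, 50.5, 60.5, 70.5, 80.5]
freqs = [12, 21, 34, 20, 6, 4, 3]

q1 = grouped_quantile(lower_boundaries, freqs, width=10, fraction=0.25)
q3 = grouped_quantile(lower_boundaries, freqs, width=10, fraction=0.75)
print(round(q1, 2), round(q3, 2))   # 36.69 54.5
print(round(q3 - q1, 2))            # interquartile range of the grouped data
```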
Average Deviation Measures
• Two of these measures are important to our study of statistics: the variance
and the standard deviation. Both describe how far, on average, the observations
in the data set lie from the mean of the distribution: the variance is the average
squared distance, and the standard deviation is its square root.

• In statistics, the standard deviation is the usual way of measuring how far
observations are spread around the mean.
Average Deviation Measures - Population
Variance and Standard deviation

Variance (for ungrouped data)
• Every population has a variance, which is symbolized by $\sigma^2$ (sigma squared).
• The formula for calculating the variance is
• $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$

Standard Deviation (for ungrouped data)
• The population standard deviation, or $\sigma$, is simply the square root of the population variance.
• Because the variance is the average of the squared distances of the observations from the mean, the standard deviation is the square root of the average of the squared distances of the observations from the mean.
• $\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$

$\mu = \dfrac{\sum x}{N}$ = population mean
Example
Equivalent computational form: $\sigma^2 = \dfrac{\sum x^2}{N} - \mu^2$
Standard Score

Population standard score = $\dfrac{x - \mu}{\sigma}$

Suppose we observe a vial of compound that is 0.108 percent impure. Because our population has a mean
of 0.166 and a standard deviation of 0.058, an observation of 0.108 would have a standard score of −1:

$\dfrac{0.108 - 0.166}{0.058} = -1$

An observed impurity of 0.282 percent would have a standard score of +2.

The standard score indicates that an impurity of 0.282 percent deviates from the mean by 2(0.058) = 0.116 unit,
which is equal to +2 in terms of the number of standard deviations away from the mean.
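The two standard scores above can be checked with a few lines of Python. The function name is illustrative.

```python
def standard_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

mu, sigma = 0.166, 0.058

z_low = standard_score(0.108, mu, sigma)
z_high = standard_score(0.282, mu, sigma)

# Rounded to absorb floating-point noise in the decimal inputs.
print(round(z_low, 6), round(z_high, 6))   # -1.0 2.0
```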
Example
• The wholesale prices of a commodity for seven consecutive days in a month are as follows:

Day                        1    2    3    4    5    6    7
Commodity price/quintal  240  260  270  245  255  286  264

• Calculate the variance and standard deviation.
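A hedged Python sketch of this exercise. It treats the seven prices as a complete population (dividing by N), which is an assumption based on the slide appearing in the population section; variable names are illustrative.

```python
import math

prices = [240, 260, 270, 245, 255, 286, 264]

N = len(prices)
mu = sum(prices) / N                                   # population mean
variance = sum((x - mu) ** 2 for x in prices) / N      # sigma^2 = sum((x - mu)^2) / N
std_dev = math.sqrt(variance)                          # sigma

print(mu)                  # 260.0
print(variance)            # 206.0
print(round(std_dev, 2))   # 14.35
```

Dividing by n − 1 instead of N would give the sample variance, which is larger.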
Average Deviation Measures - Sample
Variance and Standard deviation

Variance (for ungrouped data)
• $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

Standard Deviation (for ungrouped data)
• $s = \sqrt{s^2} = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

$\bar{x} = \dfrac{\sum x}{n}$ = sample mean
Example
• Calculate the sample variance and standard deviation for the
following data
Observation
863
903
957
1,041
1,138
1,204
1,354
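A minimal Python sketch of the sample variance and standard deviation for these seven observations, with a cross-check against the standard library's `statistics.variance`, which also divides by n − 1. Variable names are illustrative.

```python
import math
import statistics

observations = [863, 903, 957, 1041, 1138, 1204, 1354]

n = len(observations)
x_bar = sum(observations) / n                                        # sample mean
s_squared = sum((x - x_bar) ** 2 for x in observations) / (n - 1)    # sample variance
s = math.sqrt(s_squared)                                             # sample standard deviation

print(round(x_bar, 2), round(s_squared, 2), round(s, 2))

# Cross-check with the standard library (also uses the n - 1 divisor).
print(math.isclose(s_squared, statistics.variance(observations)))
```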
Average Deviation Measures - Population
Variance and Standard deviation

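Variance (for grouped data)
• $\sigma^2 = \dfrac{\sum f (x - \mu)^2}{N}$

Standard Deviation (for grouped data)
• $\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum f (x - \mu)^2}{N}}$

$\mu = \dfrac{\sum (f \times x)}{N}$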
Average Deviation Measures - Population
Variance and Standard deviation (Coding)

Standard Deviation (for grouped data)
• $\sigma = \sqrt{\sigma^2} = w \times \sqrt{\dfrac{\sum (f \times u^2)}{N} - \left(\dfrac{\sum (f \times u)}{N}\right)^2}$

$\mu = x_0 + w \dfrac{\sum (f \times u)}{N}$

Where
$\mu$ = mean of the population
$x_0$ = value of the midpoint assigned the code 0
$w$ = numerical width of the class interval
$u$ = code assigned to each class
$f$ = frequency or number of observations in each class
$N$ = total number of observations in the population
Example
• Calculate the mean, variance and standard deviation for the given
data
Value Frequency
30-39 2
40-49 12
50-59 22
60-69 20
70-79 14
80-89 4
90-99 1
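A hedged Python sketch of this exercise using the coding formulas, with a direct calculation as a cross-check. It treats the table as a population, assigns the code 0 to the middle class 60-69, and takes the class width as the distance between successive midpoints; these choices and the variable names are assumptions for illustration.

```python
import math

classes = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
freqs = [2, 12, 22, 20, 14, 4, 1]

N = sum(freqs)
midpoints = [(lo + hi) / 2 for lo, hi in classes]
w = midpoints[1] - midpoints[0]          # class width: 10.0

zero_index = 3                           # assumption: middle class 60-69 gets code 0
x0 = midpoints[zero_index]
codes = [i - zero_index for i in range(len(classes))]

sum_fu = sum(f * u for f, u in zip(freqs, codes))
sum_fu2 = sum(f * u * u for f, u in zip(freqs, codes))

mean = x0 + w * sum_fu / N
variance = w ** 2 * (sum_fu2 / N - (sum_fu / N) ** 2)
std_dev = math.sqrt(variance)

# Cross-check: direct grouped formula sum(f * (x - mu)^2) / N.
direct_variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / N

print(round(mean, 2), round(variance, 2), round(std_dev, 2))   # 60.9 156.37 12.5
```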
Summary
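Type of Data: Mean; Standard Deviation

Ungrouped, population: $\mu = \dfrac{\sum x}{N}$; $\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$

Ungrouped, sample: $\bar{x} = \dfrac{\sum x}{n}$; $s = \sqrt{s^2} = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

Grouped, population: $\mu = \dfrac{\sum (f \times x)}{N}$; $\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum f (x - \mu)^2}{N}}$

Grouped, population (coding): $\mu = x_0 + w \dfrac{\sum (f \times u)}{N}$; $\sigma = \sqrt{\sigma^2} = w \times \sqrt{\dfrac{\sum (f \times u^2)}{N} - \left(\dfrac{\sum (f \times u)}{N}\right)^2}$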
Coefficient of Variation
• Standard deviation is an absolute measure of variation and expresses variation in the same unit of measurement
as the arithmetic mean or the original data.
• A relative measure called the coefficient of variation (CV), developed by Karl Pearson, is a very useful measure for
• (i) comparing two or more data sets expressed in different units of measurement
• (ii) comparing data sets that are in the same unit of measurement but whose mean values in a comparable
field are widely dissimilar

$\text{Coefficient of variation (CV)} = \dfrac{\text{Standard Deviation}}{\text{Mean}} \times 100 = \dfrac{\sigma}{\mu} \times 100$
Example
Each day, laboratory technician A completes on average 40 analyses with a standard
deviation of 5. Technician B completes on average 160 analyses per day with a
standard deviation of 15. Which employee shows less variability?

$\text{Coefficient of variation (CV)} = \dfrac{\sigma}{\mu} \times 100$

$\text{CV} = \dfrac{5}{40} \times 100 = 12.5\%$ for Technician A

$\text{CV} = \dfrac{15}{160} \times 100 \approx 9.4\%$ for Technician B

So, we find that Technician B, who has more absolute variation in output than Technician A, has less relative
variation because the mean output for B is much greater than that for A.
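The comparison can be reproduced with a short Python sketch. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = (standard deviation / mean) * 100, expressed as a percentage."""
    return std_dev / mean * 100

cv_a = coefficient_of_variation(5, 40)     # Technician A
cv_b = coefficient_of_variation(15, 160)   # Technician B

print(cv_a)   # 12.5
print(cv_b)   # 9.375

less_variable = "A" if cv_a < cv_b else "B"
print(f"Technician {less_variable} shows less relative variability")
```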
