Prob. & stat note (Chs 1 & 2) (1)

This document provides an overview of basic statistical concepts, methods of data collection, and presentation. It distinguishes between descriptive and inferential statistics, outlines the stages of statistical investigation, and defines key terms such as population, sample, and variable. Additionally, it discusses measurement scales, applications, uses, limitations of statistics, and various data collection methods including primary and secondary sources.

Uploaded by

2abelj8383ni3i3
Copyright
© © All Rights Reserved

CHAPTER 1

Basic Concepts, Methods of data collection and presentation

1.1. Basic Concepts


We can define statistics in two ways.
1. Plural sense: Statistics are aggregates of numerical facts, figures, or quantitative information that
describe aspects of social and economic phenomena. In this sense, statistics are the raw data themselves,
such as statistics of births, statistics of deaths, statistics of students, statistics of imports and exports,
etc.
2. Singular sense: Statistics is the science of collecting, organizing, presenting, analyzing,
and interpreting numerical data for the purpose of assisting in making more effective decisions.

Classifications of statistics:
Depending on how the data are used, statistics is divided into two main areas or branches.
1. Descriptive Statistics: It is concerned with summarizing or describing the important features of
the available data without going beyond the data themselves. It is the area of statistics that is
mainly concerned with the methods and techniques used in the collection, organization, presentation,
and analysis of a set of data without drawing any conclusions or inferences. It is concerned with
summary calculations, graphs, charts, and tables.
2. Inferential Statistics: It deals with methods of inferring or drawing
conclusions about the characteristics of a population based upon the results of a sample. It is a
method used to generalize from a sample to a population. It involves the use of data from
samples to make inferences about the population from which the samples are drawn. For example, the
average income of all families (the population) in Ethiopia can be estimated from figures
obtained from a few hundred families (the sample).
• It is important because statistical data usually arise from samples.
• Statistical techniques based on probability theory are required.
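
The idea of estimating a population parameter from a sample statistic can be illustrated with a minimal Python sketch. The population here is synthetic (randomly generated incomes, not real Ethiopian data), and the sample size of 300 and the random seed are arbitrary assumptions made only for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 synthetic family incomes (not real data).
population = [random.lognormvariate(10, 0.5) for _ in range(100_000)]

# Parameter: a measure computed from the whole population (usually unknown).
population_mean = statistics.mean(population)

# Statistic: the same measure computed from a simple random sample.
sample = random.sample(population, 300)
sample_mean = statistics.mean(sample)

relative_error = abs(sample_mean - population_mean) / population_mean
print(f"population mean = {population_mean:.0f}")
print(f"sample mean     = {sample_mean:.0f}")
print(f"relative error  = {relative_error:.2%}")
```

In practice only the sample is observed; the population mean is computed here just to show how close the estimate typically is.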

Stages in Statistical Investigation


There are five stages or steps in any statistical investigation.
1. Collection of data: The process of measuring, gathering, and assembling the raw data upon which
the statistical investigation is to be based. Data can be collected in a variety of ways.
2. Organization of data: Summarization of data in some meaningful way, e.g. in table form.
3. Presentation of data: The process of reorganization, classification, compilation, and
summarization of data to present it in a meaningful form.
4. Analysis of data: The process of extracting relevant information from the summarized data,
mainly through the use of elementary mathematical operations.
5. Interpretation of data: The process of drawing conclusions and making inferences from the
statistical measures obtained in the analysis of the data.
• Statistical techniques based on probability theory are required.

Definitions of some terms
a. Population: It is the collection of all possible measurements or observations of a specified
characteristic of interest (possessing certain common property) and being under study.
Examples:
• Population of trees under specified climatic conditions
• Population of animals fed a certain type of diet
• Population of farms having a certain type of natural fertility
• Population of households, etc.
b. Sample: It is a subset of the population, selected using some sampling technique in such a way
that they represent the population.
c. Sampling: The process or method of sample selection from the population.
d. Sample size: The number of elements or observations to be included in the sample.
e. Census: Complete enumeration or observation of the elements of the population, that is, the
collection of data from every element in a population.
f. Parameter: Characteristic or measure obtained from a population.
g. Statistic: Characteristic or measure obtained from a sample.
h. Variable: It is an item of interest that can take on different values, whether numerical or categorical.

Types of Variables and scale of measurements


Types of variables
1. Qualitative Variables are nonnumeric variables and can't be measured. Examples include
gender, religious affiliation, and state of birth.
2. Quantitative Variables are numerical variables and can be measured. Examples include balance
in checking account, number of children in family.
Note that quantitative variables are either discrete (which can assume only certain values, and there
are usually "gaps" between the values, such as the number of bedrooms in your house) or
continuous (which can assume any value within a specific range, such as the air pressure in a tire.)
Scales of measurement
Proper knowledge about the nature and type of data to be dealt with is essential in order to specify
and apply the proper statistical method for their analysis and inferences. Measurement scale refers to
the property of value assigned to the data based on the properties of order, distance and fixed zero.

In mathematical terms measurement is a functional mapping from the set of objects {Oi} to the set of
real numbers {M(Oi)}.

The goal of measurement systems is to structure the rule for assigning numbers to objects in such a
way that the relationship between the objects is preserved in the numbers assigned to the objects.
The different kinds of relationships preserved are called properties of the measurement system.

Order

The property of order exists when an object that has more of the attribute than another object, is
given a bigger number by the rule system. This relationship must hold for all objects in the "real
world".

The property of ORDER exists

When for all i, j if Oi > Oj, then M(Oi) > M(Oj).

Distance

The property of distance is concerned with the relationship of differences between objects. If a
measurement system possesses the property of distance it means that the unit of measurement means
the same thing throughout the scale of numbers. That is, an inch is an inch, no matter where it falls:
immediately ahead or a mile down the road.

More precisely, an equal difference between two numbers reflects an equal difference in the "real
world" between the objects that were assigned the numbers. In order to define the property of
distance in the mathematical notation, four objects are required: Oi, Oj, Ok, and Ol . The difference
between objects is represented by the "-" sign; Oi - Oj refers to the actual "real world" difference
between object i and object j, while M(Oi) - M(Oj) refers to differences between numbers.

The property of DISTANCE exists, for all i, j, k, l

If Oi-Oj ≥ Ok- Ol then M(Oi)-M(Oj) ≥ M(Ok)-M( Ol ).

Fixed Zero

A measurement system possesses a rational zero (fixed zero) if an object that has none of the attribute
in question is assigned the number zero by the system of rules. The object does not need to really
exist in the "real world", as it is somewhat difficult to visualize a "man with no height". The
requirement for a rational zero is this: if objects with none of the attribute did exist, would they be
given the value zero? Defining O0 as the object with none of the attribute in question, the definition
of a rational zero becomes:

The property of FIXED ZERO exists if M(O0) = 0.

The property of fixed zero is necessary for ratios between numbers to be meaningful.

MEASUREMENT SCALE TYPES

Measurement is the assignment of numbers to objects or events in a systematic fashion. Four levels of
measurement scales are commonly distinguished: nominal, ordinal, interval, and ratio, each
possessing different properties of measurement systems.

1) Nominal Scales

Nominal scales are measurement systems that possess none of the three properties stated above.

• Level of measurement which classifies data into mutually exclusive, all-inclusive categories in
which no order or ranking can be imposed on the data.
• No arithmetic or relational operations can be applied.

Examples:

o Political party preference (Republican, Democrat, or Other)
o Sex (Male or Female)
o Marital status (married, single, widowed, divorced)
o Country code
o Regional differentiation of Ethiopia

2) Ordinal Scales

Ordinal scales are measurement systems that possess the property of order, but not the property of
distance. The property of fixed zero is not important if the property of distance is not satisfied.

• Level of measurement which classifies data into categories that can be ranked. Differences
between the ranks are not meaningful.
• Arithmetic operations are not applicable, but relational operations are applicable.
• Ordering is the sole property of the ordinal scale.

Examples:

o Letter grades (A, B, C, D, F)
o Rating scales (Excellent, Very good, Good, Fair, Poor)
o Military rank

3) Interval Scales

Interval scales are measurement systems that possess the properties of Order and distance, but not
the property of fixed zero.

• Level of measurement which classifies data that can be ranked and for which differences are
meaningful. However, there is no meaningful zero, so ratios are meaningless.
• All arithmetic operations except division are applicable.
• Relational operations are also possible.
Examples:

o IQ
o Temperature in °F

4) Ratio Scales

Ratio scales are measurement systems that possess all three properties: order, distance, and fixed
zero. The added power of a fixed zero allows ratios of numbers to be meaningfully interpreted; e.g.
the ratio of Bekele's height to Martha's height is 1.32, whereas such a statement is not possible with interval scales.

• Level of measurement which classifies data that can be ranked, for which differences are meaningful, and
for which there is a true zero. True ratios exist between the different units of measure.
• All arithmetic and relational operations are applicable.

Examples:

o Weight
o Height
o Number of students
o Age
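
The difference between interval and ratio scales can be checked numerically. The sketch below is only an illustration: the temperatures and the two heights (185 cm and 140 cm) are hypothetical values chosen for the example, not data from these notes. It shows that a ratio of temperatures changes with the unit (no fixed zero), while ratios of differences and ratios of heights do not.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (both are interval scales)."""
    return celsius * 9 / 5 + 32


def cm_to_inch(cm):
    """Convert centimetres to inches (both are ratio scales)."""
    return cm / 2.54


# Interval scale: the ratio of two readings depends on the unit used.
temp_ratio_c = 20 / 10                    # 2.0 in Celsius
temp_ratio_f = c_to_f(20) / c_to_f(10)    # 68 / 50 = 1.36 in Fahrenheit

# Distance property: ratios of differences are the same in both units.
diff_ratio_c = (30 - 20) / (20 - 10)
diff_ratio_f = (c_to_f(30) - c_to_f(20)) / (c_to_f(20) - c_to_f(10))

# Ratio scale: the ratio of two heights is the same in any unit.
height_ratio_cm = 185 / 140
height_ratio_in = cm_to_inch(185) / cm_to_inch(140)

print(temp_ratio_c, temp_ratio_f)
print(diff_ratio_c, diff_ratio_f)
print(round(height_ratio_cm, 2), round(height_ratio_in, 2))
```

Because 20 °C is "twice" 10 °C only in Celsius, statements such as "twice as hot" are not meaningful on an interval scale.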

Exercise:

The following is a list of different attributes and rules for assigning numbers to objects. Try to
classify each measurement system into one of the four types of scales.

1. Your checking account number as a name for your account.
2. Your checking account balance as a measure of the amount of money you have in that account.
3. Your score on the first statistics test as a measure of your knowledge of statistics.
4. A response to the statement "Abortion is a woman's right" where "Strongly Disagree" = 1,
"Disagree" = 2, "No Opinion" = 3, "Agree" = 4, and "Strongly Agree" = 5, as a measure of
attitude toward abortion.
5. Times for swimmers to complete a 50-meter race.
6. Months of the year (Meskerm, Tikimit, …).
7. Socioeconomic status of a family when classified as low, middle, and upper class.
8. Blood type of individuals: A, B, AB, and O.
9. Pollen counts provided as numbers between 1 and 10, where 1 implies there is almost no
pollen and 10 that it is rampant, but for which the values do not represent actual counts of
grains of pollen.

10. Region numbers of Ethiopia (1, 2, 3, etc.).
11. The number of students in a college.
12. The net wages of a group of workers.
13. The heights of the men in the same town.

Applications, Uses and Limitations of statistics


Applications of statistics:
• Statistics is applied in almost all fields of human endeavor.
• Almost all human beings encounter numerical facts in their daily life, e.g. about
prices.
• It is applicable in processes such as the invention of certain drugs and the assessment of the extent of environmental pollution.
• It is used in industry, especially in the area of quality control.
Uses of statistics:
The main function of statistics is to enlarge our knowledge of complex phenomena. The following
are some uses of statistics:
1. Presenting facts in a definite and precise form.
2. Data reduction.
3. Measuring the magnitude of variations in data.
4. Furnishing a technique of comparison.
5. Estimating unknown population characteristics.
6. Formulating and testing hypotheses.
7. Studying the relationship between two or more variables.
8. Forecasting future events.
Limitations of statistics
As a science, statistics has its own limitations. The following are some of the limitations:
• It deals only with aggregates of facts and not with individual data items.
• Statistical results are only approximately, not mathematically, correct.
• Statistics can easily be misused and therefore should be used by experts.

1.2. Methods of data collection and presentation


1.2.1. Methods of data collection
Source of data: There are two sources of data (Primary and Secondary source).
1. Primary Data
 Data are measured or collected by the investigator or the user directly from the source.
 If secondary data are outdated, then primary data have an advantage over secondary data.
 Two activities involved: planning and measuring (collecting).
a) Planning:
 Identify source and elements of the data.
 Decide whether to consider sample or census.
 If sampling is preferred, decide on sample size, selection method,… etc
 Decide measurement procedure.
 Set up the necessary organizational structure.

b) Measuring (collecting) data: There are different options.
i. Observation:
• It includes all methods, from simple visual observation to the use of high-level
machines and measurements, sophisticated equipment, or facilities.
• An observation guide should be prepared prior to data collection.
Advantages: Gives relatively more detailed, accurate, and context-related information.
Disadvantages: It is subject to the investigator's or observer's own biases, prejudices, and desires, and it
needs more resources and skilled human power when high-level machines are used.
ii. Interview
 Could be face to face /telephone interview
Advantage:
- suitable for use with illiterates.
- permits clarifications of questions.
- higher response rate than self-administered questionnaire.
Disadvantage:
- presence of interviewer can influence the response
- more costly than self-administered questionnaire
iii. Questionnaire (Self-administered and Mailed questionnaires)
• A questionnaire is a list of questions arranged in a predetermined sequence for a
predetermined purpose.
• Self-administered questionnaires: Under this method, the questionnaire is
distributed by hand to the respondents. The use of self-administered questionnaires
is simpler and cheaper; such questionnaires can be administered to many persons
simultaneously (e.g. to a class of students).
• Mailed questionnaires: The questionnaires are sent by post, e-mail, etc. to the
informants.
• Limitations of questionnaires:
- The method can be used only if the respondents are educated.
- The response rates tend to be relatively low.
- Informants may not return the completed questionnaire, and even if they
do, they may have filled it in incorrectly.
- It may not give the investigator a chance to explain the questions or ask
supplementary and follow-up questions.
iv. Focus Group discussions
v. Other data collection techniques – life histories, case studies, etc. are some of the sources for
collecting the primary data.

Types of questions:
Depending on how questions are asked and recorded, we can distinguish two major possibilities:
open-ended questions and closed-ended questions.
a) Open-ended questions: Open-ended questions permit free responses that should be recorded in the
respondent's own words. The respondent is not given any possible answers to choose from. Such
questions are useful to obtain information on:
• Facts with which the researcher is not very familiar
• Opinions, attitudes, and suggestions of informants, or sensitive issues
b) Closed-ended questions: Closed questions offer a list of possible options or answers from which the
respondents must choose. When designing closed questions, one should try to:
• Offer a list of options that are exhaustive and mutually exclusive.
• Keep the number of options as few as possible.

2. Secondary Data
• Data gathered or compiled from published and unpublished sources or files.
• Secondary data are less expensive than primary data.
• When our source is secondary data, check:
- The type and objective of the situation.
- That the purpose for which the data were collected is compatible with the present
problem.
- That the nature and classification of the data are appropriate to our problem.
- That there are no biases or misreporting in the published data.
Note: Data which are primary for one may be secondary for the other.

According to the role of time, data are classified into cross-sectional and time series data.
Cross-sectional data are a set of observations taken at one point in time, while time series data are a set of
observations collected over a sequence of times, usually at equal intervals, which may be weekly,
monthly, quarterly, yearly, etc.

Before any statistical work can be done data must be collected. Depending on the type of variable and the
objective of the study different data collection methods can be employed. In the collection of data we
have to be systematic. If data are collected haphazardly, it will be difficult to answer our research
questions in a conclusive way.

1.2.2. Methods of data Presentation

Having collected and edited the data, the next important step is to organize it, that is, to present it in a
readily comprehensible, condensed form that helps in drawing inferences from it. It is also necessary
that like items be separated from unlike ones.

The presentation of data is broadly classified in to the following two categories:

 Tabular presentation
 Diagrammatic and Graphic presentation.

Tabular presentation:

The process of arranging data into classes or categories according to similarities is technically called
classification.

Classification is a preliminary step that prepares the ground for proper presentation of data.

Definitions:

• Raw data: recorded information in its original collected form, whether counts or
measurements.
• Frequency: the number of values in a specific class of the distribution.
• Frequency distribution: the organization of raw data in table form using classes and frequencies.

There are three basic types of frequency distributions:

• Categorical frequency distribution
• Ungrouped frequency distribution
• Grouped frequency distribution

There are specific procedures for constructing each type.

1) Categorical frequency Distribution:

Used for data that can be placed in specific categories, such as nominal or ordinal data, e.g. marital status.
Example: a social worker collected the following data on marital status for 25 persons.
(M=married, S=single, W=widowed, D=divorced)
M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
Solution:

Since the data are categorical, discrete classes can be used. There are four types of marital status: M, S, D, and
W. These types will be used as the classes of the distribution. To construct the frequency
distribution, first tally the data; then count the tallies and find the percentage of values in each class using
$\% = \frac{f}{n} \times 100$, where f = frequency of the class and n = total number of values.

Percentages are not normally a part of a frequency distribution, but they can be added since they are used in
certain types of diagrams, such as pie charts.

Combining all the steps, one can construct the following frequency distribution.

Class   Tally       Frequency   Percent
M       ///// /     6           24
S       ///// //    7           28
D       ///// //    7           28
W       /////       5           20
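
The tally-and-count procedure can also be carried out programmatically. The following is a minimal Python sketch using the 25 marital-status codes exactly as listed in the example; the variable names are illustrative only.

```python
from collections import Counter

# Marital-status codes for the 25 persons, row by row as listed in the example.
data = """
M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
""".split()

counts = Counter(data)   # frequency of each class
n = len(data)            # total number of values

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for status in ["M", "S", "D", "W"]:
    f = counts[status]
    print(f"{status:<6}{f:>10}{f / n * 100:>9.0f}")
```

Counting by program is a convenient check on a hand tally, since the frequencies must add up to n and the percentages to 100.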

2) Ungrouped frequency Distribution:

• It is a table of all the potential raw score values that could possibly occur in the data, along with the
number of times each actually occurred.
• It is often constructed for a small data set or for data on a discrete variable.

Constructing ungrouped frequency distribution:

 First find the smallest and largest raw score in the collected data.
 Arrange the data in order of magnitude and count the frequency.
 To facilitate counting one may include a column of tallies.

Example:

The following data represent the mark of 20 students.

80 76 90 85 80
70 60 62 70 85
65 60 63 74 75
76 70 70 80 85

Construct a frequency distribution, which is ungrouped.


Solution:
Step 1: Find the range, Range=Max-Min=90-60=30.
Step 2: Make a table as shown
Step 3: Tally the data.
Step 4: Compute the frequency.
Mark   Tally   Frequency
60     //      2
62     /       1
63     /       1
65     /       1
70     ////    4
74     /       1
75     /       1
76     //      2
80     ///     3
85     ///     3
90     /       1

Each individual value is presented separately, that is why it is named ungrouped frequency distribution.
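
As a cross-check on the hand tally, the same ungrouped distribution can be obtained with a few lines of Python. This is only a sketch; it uses the 20 marks as listed in the example.

```python
from collections import Counter

# Marks of the 20 students, as listed in the example.
marks = [80, 76, 90, 85, 80,
         70, 60, 62, 70, 85,
         65, 60, 63, 74, 75,
         76, 70, 70, 80, 85]

frequency = Counter(marks)               # how many times each mark occurs
data_range = max(marks) - min(marks)     # Range = Max - Min

print(f"Range = {data_range}")
print("Mark  Tally  Frequency")
for mark in sorted(frequency):
    print(f"{mark:>4}  {'/' * frequency[mark]:<6}{frequency[mark]:>3}")
```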

3) Grouped frequency Distribution:


When the range of the data is large, the data must be grouped in to classes that are more than one unit in
width.

Definitions:
• Grouped frequency distribution: a frequency distribution in which several numbers are grouped in one
class.
• Class limits: Separate one class in a grouped frequency distribution from another. The limits could
actually appear in the data, and there are gaps between the upper limit of one class and the lower limit of the
next.
• Units of measurement (U): the distance between two possible consecutive measures. It is usually
taken as 1, 0.1, 0.01, 0.001, ....
• Class boundaries: Separate one class in a grouped frequency distribution from another. The
boundaries have one more decimal place than the raw data and therefore do not appear in the data.
There is no gap between the upper boundary of one class and the lower boundary of the next class. The
lower class boundary is found by subtracting U/2 from the corresponding lower class limit, and the
upper class boundary is found by adding U/2 to the corresponding upper class limit.
• Class width: the difference between the upper and lower class boundaries of any class. It is also the
difference between the lower limits of any two consecutive classes or the difference between any two
consecutive class marks.
• Class mark (midpoint): the average of the lower and upper class limits, or the average of the upper
and lower class boundaries.
• Cumulative frequency: the number of observations less than/more than or equal to a specific value.
• Cumulative frequency above: the total frequency of all values greater than or equal to the lower
class boundary of a given class.
• Cumulative frequency below: the total frequency of all values less than or equal to the upper class
boundary of a given class.
• Cumulative Frequency Distribution (CFD): the tabular arrangement of class intervals together
with their corresponding cumulative frequencies. It can be of the more than or less than type, depending on
the type of cumulative frequency used.
• Relative frequency (rf): the frequency divided by the total frequency.
• Relative cumulative frequency (rcf): the cumulative frequency divided by the total frequency.

Guidelines for classes

1. There should be between 5 and 20 classes.
2. The classes must be mutually exclusive. This means that no data value can fall into two different
classes.
3. The classes must be all inclusive or exhaustive. This means that all data values must be included.
4. The classes must be continuous. There are no gaps in a frequency distribution.
5. The classes must be equal in width. The exception here is the first or last class. It is possible to
have a "below ..." or "... and above" class. This is often used with ages.

Steps for constructing Grouped frequency Distribution:

• Fix the number of classes (k) to use. (We may approximate k by Sturges' formula: $k \approx 1 + 3.32 \log_{10} n$.)
• Determine the class size (class width), assuming we want equal class widths:
W = (Maximum value – Minimum value)/k = Range/k.
• Pick a suitable starting point less than or equal to the minimum value. The starting point is called
the lower limit of the first class. Continue to add the class width to this lower limit to get the rest of
the lower limits.
• To find the upper limit of the first class, subtract U from the lower limit of the second class. Then
continue to add the class width to this upper limit to find the rest of the upper limits.
• Find the boundaries by subtracting U/2 units from the lower limits and adding U/2 units to the
upper limits.
• Find the frequency and relative frequency of each class.
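
The steps above can be sketched in Python as follows. This is an illustrative sketch, not a general-purpose routine: the function name `build_classes` is hypothetical, the data are assumed to be whole numbers (U = 1), the starting point is taken to be the minimum value, and both k and the class width are rounded up.

```python
import math


def build_classes(data, unit=1):
    """Sketch of the construction steps for whole-number data (U = 1)."""
    n = len(data)
    low, high = min(data), max(data)
    k = math.ceil(1 + 3.32 * math.log10(n))          # Sturges' formula, rounded up
    width = max(unit, math.ceil((high - low) / k))   # class width, rounded up
    limits = []
    lower = low                                      # start at the minimum value
    while not limits or limits[-1][1] < high:        # stop once the maximum is covered
        limits.append((lower, lower + width - unit))
        lower += width
    boundaries = [(lo - unit / 2, hi + unit / 2) for lo, hi in limits]
    return k, width, limits, boundaries


data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        18, 17, 22, 38, 23, 21, 26, 34, 39, 27]
k, width, limits, boundaries = build_classes(data)
print(k, width)
print(limits)
print(boundaries)
```

The loop keeps adding classes until the largest value is covered, so in some data sets the number of classes produced may differ by one from the value of k given by the formula.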

Example: Construct a frequency distribution for the following data.


11 29 6 33 14 31 22 27 19 20
18 17 22 38 23 21 26 34 39 27

Solutions:

Step 1: Find the highest and the lowest value H=39, L=6

Step 2: Find the range; R=H-L=39-6=33

Step 3: Select the number of classes desired using Sturges' formula:

$k = 1 + 3.32 \log_{10} n = 1 + 3.32 \log_{10}(20) \approx 5.32$, which is rounded up to $k = 6$.

Step 4: Find the class width: w = R/k = 33/6 = 5.5, which is rounded up to 6.

Step 5: Select the starting point; let it be the minimum observation.

• 6, 12, 18, 24, 30, 36 are the lower class limits.

Step 6: Find the upper class limits; e.g. the first upper class limit = 12 - U = 12 - 1 = 11.

• 11, 17, 23, 29, 35, 41 are the upper class limits.

So combining step 5 and step 6, one can construct the following classes.

Class limits
6 – 11
12 – 17
18 – 23
24 – 29
30 – 35
36 – 41

Step 7: Find the class boundaries;

E.g. for class 1: lower class boundary = 6 - U/2 = 5.5

Upper class boundary = 11 + U/2 = 11.5

• Then continue adding w to both boundaries to obtain the remaining boundaries. By doing so, one
obtains the following classes.

Class boundary
5.5 – 11.5
11.5 – 17.5
17.5 – 23.5
23.5 – 29.5
29.5 – 35.5
35.5 – 41.5

Step 8: tally the data.

Step 9: Write the numeric values for the tallies in the frequency column.

Step 10: Find cumulative frequency.

Step 11: Find relative frequency or/and relative cumulative frequency.

The complete frequency distribution follows:

Class limit   Class boundary   Class mark   Tally      Freq.   Cf (less than type)   Cf (more than type)   rf     rcf (less than type)
6 – 11        5.5 – 11.5       8.5          //         2       2                     20                    0.10   0.10
12 – 17       11.5 – 17.5      14.5         //         2       4                     18                    0.10   0.20
18 – 23       17.5 – 23.5      20.5         ///// //   7       11                    16                    0.35   0.55
24 – 29       23.5 – 29.5      26.5         ////       4       15                    9                     0.20   0.75
30 – 35       29.5 – 35.5      32.5         ///        3       18                    5                     0.15   0.90
36 – 41       35.5 – 41.5      38.5         //         2       20                    2                     0.10   1.00
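
The frequency, cumulative frequency, and relative frequency columns can be reproduced with a short Python sketch. It takes the class limits of the example as given; the variable names are illustrative only.

```python
from itertools import accumulate

data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        18, 17, 22, 38, 23, 21, 26, 34, 39, 27]
limits = [(6, 11), (12, 17), (18, 23), (24, 29), (30, 35), (36, 41)]

n = len(data)
# Frequency: number of observations falling within each pair of class limits.
freq = [sum(lo <= x <= hi for x in data) for lo, hi in limits]
# Class mark: average of the lower and upper class limits.
marks = [(lo + hi) / 2 for lo, hi in limits]
# Less-than type: running total from the first class downwards.
cf_less = list(accumulate(freq))
# More-than type: running total from the last class upwards.
cf_more = list(accumulate(reversed(freq)))[::-1]
rf = [f / n for f in freq]
rcf_less = [c / n for c in cf_less]

for row in zip(limits, marks, freq, cf_less, cf_more, rf, rcf_less):
    print(row)
```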

Diagrammatic and graphical presentation of data

These are techniques for presenting data in visual displays using geometric figures and pictures.
They are more attractive, facilitate comparison, and are easily understood.

Diagrammatic presentation of data

Diagrams are appropriate for presenting data on qualitative variables. The most commonly used
diagrammatic presentations for qualitative data are:

 Pie charts
 Bar charts

Pie chart

A pie chart is a circle that is divided in to sections or wedges according to the percentage of frequencies in
each category of the distribution. The angle of the sector is obtained using:

$\text{Angle of sector} = \frac{\text{Value of the part}}{\text{The whole quantity}} \times 360^\circ$

Example: Draw a suitable diagram to represent the following population in a town.

Men Women Girls Boys

2500 2000 4000 1500

Solutions:

Step 1: Find the percentage.

Step 2: Find the number of degrees for each class.

Step 3: Using a protractor and compass, draw each section and label it with its name and corresponding percentage.

Class   Frequency   Percent   Degree
Men     2500        25        90
Women   2000        20        72
Girls   4000        40        144
Boys    1500        15        54
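
The percentages and sector angles of the pie chart can be computed as in the following Python sketch, which uses the town population figures from the example. The dictionary and variable names are illustrative only.

```python
# Population of the town by class, from the example.
population = {"Men": 2500, "Women": 2000, "Girls": 4000, "Boys": 1500}
total = sum(population.values())

# For each class: (percentage of the whole, angle of the sector in degrees).
rows = {
    name: (value * 100 / total, value * 360 / total)
    for name, value in population.items()
}

for name, (percent, degrees) in rows.items():
    print(f"{name:<6}{population[name]:>6}{percent:>8.0f}{degrees:>8.0f}")

total_degrees = sum(degrees for _, degrees in rows.values())
```

The sector angles must add up to 360 degrees, which is a quick check on the arithmetic.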

Fig. 1: Pie chart of the population by class (Men, Women, Girls, Boys)

Bar Chart:
There are different types of bar charts; the most important ones are the simple bar chart, the component bar chart,
and the multiple bar chart.

a) Simple bar chart: It is a one-dimensional chart in which the bar represents the whole of the
magnitude. The height or length of each bar indicates the size (frequency) of the figure
represented.
Example: Draw a bar-chart to represent the following data related to students’ enrolment in a university.
Year 1990 1991 1992 1993 1994 1995
No. of students 2005 2338 3412 3900 4967 5788

Fig. 2
b) Multiple bar chart: In this type of chart, the component figures are shown as separate bars adjoining
each other. The height of each bar represents the actual value of the component figure. It is used to depict the
distributional pattern of more than one variable when comparisons of each component are desired.
Example: Represent the following data on faculty-wise enrolment of students in a college by using a
multiple bar chart.
Faculty \ Year             1990   1991   1992
No. of Art students        120    115    132
No. of Science students    160    165    190
No. of Health students     80     90     94
Solution:
Since year-wise data is compared in three aspects (Art, Science and Health), the appropriate diagram is a
multiple bar chart.

Fig. 3
c) Component Bar chart: Bars are sub-divided into component parts of the figure. These sorts of diagrams
are constructed when each total is built up from two or more component figures. This is done by dividing
the bars into parts representing the components and shading them accordingly.
Example: Consider the above example and give the Component bar chart.

Fig. 4
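
To draw a component bar chart, the height of each whole bar (the yearly total) and the points at which the bar is divided are needed. The Python sketch below computes both from the faculty enrolment data of the example; the stacking order (Art, Science, Health) is an assumption made for illustration.

```python
years = [1990, 1991, 1992]
enrolment = {
    "Art": [120, 115, 132],
    "Science": [160, 165, 190],
    "Health": [80, 90, 94],
}

# Height of each whole bar: the total enrolment in that year.
totals = [sum(counts[i] for counts in enrolment.values())
          for i in range(len(years))]

# Division points inside each bar: running totals of the components.
segments = {}
for i, year in enumerate(years):
    bottom = 0
    segments[year] = []
    for faculty, counts in enrolment.items():
        top = bottom + counts[i]
        segments[year].append((faculty, bottom, top))
        bottom = top

print(totals)
print(segments[1990])
```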

Graphical Presentation of data


The histogram, frequency polygon, and cumulative frequency graph (ogive) are the most commonly applied
graphical representations for continuous data.
Procedures for constructing statistical graphs:
• Draw and label the X and Y axes.
• Choose a suitable scale for the frequencies or cumulative frequencies and label it on the Y axis.
• Represent the class boundaries (for the histogram or ogive) or the midpoints (for the frequency polygon) on
the X axis.
• Plot the points.
• Draw the bars or lines to connect the points.

Histogram:
A histogram is a graph which displays the data by using vertical bars of various heights to represent frequencies. Class
boundaries are placed along the horizontal axis. Class marks and class limits are sometimes used as the quantity
on the X axis.

Example:
The histogram of the following grouped frequency distribution is given below.
Class Boundary 14.5 - 24.5 24.5 - 34.5 34.5 - 44.5 44.5 - 54.5 54.5 - 64.5
Frequency 3 4 8 6 7

Fig. 5
Frequency Polygon:
If we join the midpoints of the tops of the adjacent rectangles of the histogram with line segments, a
frequency polygon is obtained. It is a line graph. The frequency is placed along the vertical axis and the class
midpoints are placed along the horizontal axis. It is customary to add the next lower and next higher class intervals,
each with a corresponding frequency of zero, to make it a complete polygon.

Example: Draw the frequency polygon for the following grouped data

Class limit   Class boundary   Class mark   Freq.
6 – 11        5.5 – 11.5       8.5          2
12 – 17       11.5 – 17.5      14.5         2
18 – 23       17.5 – 23.5      20.5         7
24 – 29       23.5 – 29.5      26.5         4
30 – 35       29.5 – 35.5      32.5         3
36 – 41       35.5 – 41.5      38.5         2

Fig. 6: Frequency polygon (X axis: class midpoints, from 2.5 to 44.5; Y axis: frequency)
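
The points to be plotted for the frequency polygon can be listed with a short Python sketch. It uses the class marks and frequencies of the example and adds one empty class at each end, with frequency zero, to close the polygon; the variable names are illustrative only.

```python
marks = [8.5, 14.5, 20.5, 26.5, 32.5, 38.5]   # class midpoints
freq = [2, 2, 7, 4, 3, 2]                     # class frequencies

width = marks[1] - marks[0]                   # distance between consecutive class marks

# Add the midpoints of the next lower and next higher classes, with frequency zero.
xs = [marks[0] - width] + marks + [marks[-1] + width]
ys = [0] + freq + [0]

vertices = list(zip(xs, ys))
print(vertices)
```

Joining these vertices in order with straight line segments gives the frequency polygon.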

Chapter 2
Descriptive Statistics (Summarizing data)
2.1. Measures of central tendency
When we want to make comparison between groups of numbers it is good to have a single value that is
considered to be a good representative of each group. This single value is called the average of the group.
Averages are also called measures of central tendency.
Objectives
Since the number of sample points is frequently large and it is easy to lose track of the overall picture by
looking at all the data at once, the data must be summarized as briefly as possible.
Some objectives of measuring central tendency:
 To comprehend (understand) the data easily.
 To facilitate comparison.
 To make further statistical analysis.
The Summation Notation
Let X1, X2, X3, …, Xn be a set of measurements, where n is the total number of observations and Xi is the ith observation.
The symbol $\sum_{i=1}^{n} X_i$ (read as “the sum of Xi where i goes from 1 to n”) is mathematical shorthand for X1 + X2 + X3 + ... + Xn. That is,
$\sum_{i=1}^{n} X_i = X_1 + X_2 + \dots + X_n$

Example: Suppose the following were scores made on the first homework assignment for five students in the class: 5, 7, 7, 6, and 8. Then
$\sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 7 + 7 + 6 + 8 = 33$

Properties of Summation
1) $\sum_{i=1}^{n} k = nk$, where k is any constant.
2) $\sum_{i=1}^{n} kX_i = k\sum_{i=1}^{n} X_i$, where k is any constant.
3) $\sum_{i=1}^{n} (a + bX_i) = na + b\sum_{i=1}^{n} X_i$, where a and b are constants.
4) $\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i$

Example: Consider the following data and determine the sums below.

Xi   5   7   7   6   8
Yi   6   7   8   7   8

a) $\sum_{i=1}^{5} X_i = 5 + 7 + 7 + 6 + 8 = 33$
b) $\sum_{i=1}^{5} Y_i = 36$
c) $\sum_{i=1}^{5} 10 = 10 \times 5 = 50$
d) $\sum_{i=1}^{5} (X_i + Y_i) = \sum_{i=1}^{5} X_i + \sum_{i=1}^{5} Y_i = 69$
e) $\sum_{i=1}^{5} (X_i - Y_i) = -3$
f) $\sum_{i=1}^{5} X_i Y_i = 241$
g) $\sum_{i=1}^{5} X_i^2 = 223$
h) $\left(\sum_{i=1}^{5} X_i\right)\left(\sum_{i=1}^{5} Y_i\right) = 1188$
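The sums in this example can be checked numerically. The short Python sketch below recomputes each part; the variable names are chosen here for illustration only.

```python
# Scores from the example: X and Y values for five students.
X = [5, 7, 7, 6, 8]
Y = [6, 7, 8, 7, 8]

sum_x = sum(X)                                    # a) sum of X_i
sum_y = sum(Y)                                    # b) sum of Y_i
sum_const = sum(10 for _ in X)                    # c) the constant 10 added n = 5 times
sum_x_plus_y = sum(x + y for x, y in zip(X, Y))   # d) sum of (X_i + Y_i)
sum_x_minus_y = sum(x - y for x, y in zip(X, Y))  # e) sum of (X_i - Y_i)
sum_xy = sum(x * y for x, y in zip(X, Y))         # f) sum of X_i * Y_i
sum_x_sq = sum(x * x for x in X)                  # g) sum of X_i squared
product_of_sums = sum_x * sum_y                   # h) (sum of X_i) * (sum of Y_i)

print(sum_x, sum_y, sum_const, sum_x_plus_y,
      sum_x_minus_y, sum_xy, sum_x_sq, product_of_sums)
```

Note that parts f) and h) differ: the sum of products is not the product of the sums.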

Types of measures of central tendency


The different measures of central tendency are the Mean (Arithmetic, Geometric and Harmonic), the
Mode, the Median.

The Arithmetic Mean:


It is defined as the sum of the magnitudes of the items divided by the number of items.
Suppose X1, X2, X3, …, Xn are n observed values in a sample of size n. Then the arithmetic mean of the sample, denoted by $\bar{X}$, is given as:
$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$

If we take an entire population, the mean is denoted by μ and is given by:
$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$, where N stands for the total number of observations in the population.

Example: Suppose the sample consists of birth weights (in grams) of live born infants at a private hospital
in a certain city during a 1-week period. These sample birth weights are:
3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200, 3609, 3314,
3484, 3031, 2838, 3101, 4146, 2069, 3541, 2834.
Then find arithmetic mean for the sample birth weights.

Solution: Arithmetic mean = $\bar{X} = \frac{\sum_{i=1}^{20} X_i}{20} = \frac{3265 + 3323 + \dots + 2834}{20} = \frac{63338}{20} = 3166.9$ grams.
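As a quick check of the arithmetic, the sample mean can be computed with Python's standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean

# Sample birth weights (grams) from the example.
birth_weights = [
    3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200,
    3609, 3314, 3484, 3031, 2838, 3101, 4146, 2069, 3541, 2834,
]

total = sum(birth_weights)          # sum of all observations
sample_mean = mean(birth_weights)   # total divided by n = 20
print(total, sample_mean)
```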

If X is a variable having values x1, x2, …, xk occurring with frequencies f1, f2, …, fk respectively, then its arithmetic mean is given by:
$\bar{X} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$

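Example: Suppose the X values are 3, 5, 4, 2, 7 and 6 with corresponding frequencies of 2, 1, 3, 2, 1 and 1 respectively. Find the mean of the data.

Xi              3   5   4   2   7   6
Frequency, fi   2   1   3   2   1   1

Solution: $\bar{X} = \frac{\sum x_i f_i}{\sum f_i} = \frac{3(2) + 5(1) + 4(3) + 2(2) + 7(1) + 6(1)}{2 + 1 + 3 + 2 + 1 + 1} = \frac{40}{10} = 4$.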

Mean for Grouped Data


This method is applicable where the entire range of observations has been grouped into a continuous frequency distribution. In such cases the mean of the distribution is computed as:
$\bar{X} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$,
where k is the number of classes, mi is the midpoint of the ith class and fi is the ith class frequency.
Example: Calculate the mean for the grouped data below on the amount of time (in hours) that 80 college students devoted to leisure activities during a typical school week:

Time spent (hours)   Frequency
10 - 14              8
15 - 19              28
20 - 24              27
25 - 29              12
30 - 34              3
35 - 39              1
40 - 44              1

Solution:
The class marks of the distribution are: 12, 17, 22, 27, 32, 37, 42.
Then the mean of the data is computed as:
$\bar{X} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i} = \frac{12(8) + 17(28) + 22(27) + 27(12) + 32(3) + 37(1) + 42(1)}{80} = \frac{1665}{80} \approx 20.8$ hours.
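The grouped-mean computation can be sketched in a few lines of Python. The function name `grouped_mean` and the tuple layout (lower limit, upper limit, frequency) are assumptions made for this illustration. For the leisure-time data the weighted sum of class marks is 1665, so the mean works out to 1665/80, about 20.8 hours.

```python
def grouped_mean(classes):
    """Mean of a grouped distribution.

    classes: list of (lower_limit, upper_limit, frequency) tuples.
    The class mark is taken as the midpoint of the two class limits.
    """
    total_frequency = sum(f for _, _, f in classes)
    weighted_sum = sum(((lower + upper) / 2) * f for lower, upper, f in classes)
    return weighted_sum / total_frequency

# Leisure-time data from the example (hours, frequency).
leisure_classes = [
    (10, 14, 8), (15, 19, 28), (20, 24, 27), (25, 29, 12),
    (30, 34, 3), (35, 39, 1), (40, 44, 1),
]

leisure_mean = grouped_mean(leisure_classes)
print(round(leisure_mean, 1))
```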

Special Properties of the Arithmetic Mean


1) The sum of the deviations about the mean is zero, i.e. $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$.

2) If we have means $\bar{X}_1, \bar{X}_2, \bar{X}_3, \dots, \bar{X}_k$ of k groups having the same unit of measurement of a variable, based on n1, n2, n3, …, nk observations respectively, then the mean of all the observations in all groups, often called the combined mean, is given by
$\bar{X}_c = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + \dots + n_k\bar{X}_k}{n_1 + n_2 + \dots + n_k}$
Example: The mean final exam mark of one class of 50 students is 30, and the mean mark of another class of 100 students in the same final exam is 40. What is the mean mark of all 150 students?
Solution: $\bar{X}_c = \frac{50 \times 30 + 100 \times 40}{50 + 100} = \frac{5500}{150} \approx 36.7$.

3) If a wrong figure has been used when calculating the mean, then the correct mean can be obtained without repeating the whole process using:
$\text{Correct mean} = \text{wrong mean} + \frac{\text{correct value} - \text{wrong value}}{n}$,
where n = number of observations.
Example: The average weight of 10 students was calculated to be 65 kg. Later it was discovered that one weight was misread as 40 kg instead of 80 kg. Calculate the correct average weight.
Correct mean = $65 + \frac{80 - 40}{10} = 65 + 4 = 69$ kg.
4) The effect of transforming the original series on the mean.
a) If a constant k is added to (or subtracted from) every observation, then the new mean will be the old mean plus (or minus) k.
b) If every observation is multiplied by a constant k, then the new mean will be k times the old mean.
Example: The mean of a set of numbers is 500.
a. If 10 is added to each of the numbers in the set, then what will be the mean of the new set?
New mean = 500 + 10 = 510
b. If each of the numbers in the set is multiplied by 5, then what will be the mean of the new set?
New mean = 5 * 500 = 2500
Example: The mean of n observations X1, X2, …, Xn is known to be 12. A new set of observations is obtained by the linear transformation Yi = 2Xi - 0.5 (i = 1, 2, …, n). What will be the mean of the new set of observations?
Solution: New mean = 2 * old mean - 0.5 = 2 * 12 - 0.5 = 23.5.
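Properties 2 to 4 can be illustrated with a small Python sketch. The helper names `combined_mean` and `corrected_mean` are made up for this illustration, and the three-value list used for the linear transformation is a hypothetical data set chosen only because its mean is 12.

```python
from statistics import mean

def combined_mean(means, sizes):
    """Combined mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)

def corrected_mean(wrong_mean, n, wrong_value, correct_value):
    """Correct a mean that was computed with one misread observation."""
    return wrong_mean + (correct_value - wrong_value) / n

# Property 2: two classes of 50 and 100 students with means 30 and 40.
exam_mean = combined_mean([30, 40], [50, 100])

# Property 3: one weight read as 40 kg instead of 80 kg among 10 students.
weight_mean = corrected_mean(65, 10, wrong_value=40, correct_value=80)

# Property 4: hypothetical data with mean 12, transformed by Y = 2X - 0.5.
x_values = [10, 12, 14]
y_values = [2 * x - 0.5 for x in x_values]

print(round(exam_mean, 1), weight_mean, mean(y_values))
```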

Advantages of arithmetic mean
 It is based on all values
 It is easy to calculate and simple to understand
 It is suitable for further mathematical treatment.
 It is stable average, i.e. it is not affected by fluctuations of sampling to some extent.
Disadvantages of arithmetic mean
 It is affected by extreme observations.
 It cannot be used in the case of open end classes.
 It cannot be determined by the method of inspection.
 It cannot be used when dealing with qualitative characteristics, such as intelligence, honesty, beauty.
 Sometimes it leads to wrong conclusion if the details of the data from which it is obtained are not
available.

Weighted Mean
In the computation of the arithmetic mean we gave equal importance to each observation. When averaging quantities, however, it is often necessary to account for the fact that not all of them are equally important in the phenomenon being described. In order to give the quantities being averaged their proper degree of importance, it is necessary to assign them relative importance, called weights, and then calculate a weighted mean.
In general, the weighted mean $\bar{X}_w$ of a set of values x1, x2, …, xn, whose relative importance is expressed numerically by a corresponding set of weights W1, W2, …, Wn, is given by:
$\bar{X}_w = \frac{\sum_{i=1}^{n} W_i x_i}{\sum_{i=1}^{n} W_i}$

Example: A student obtained results of 60, 75, 63, 59, and 55 in English, Biology, Mathematics, Physics and Chemistry examinations respectively. Find the student's weighted arithmetic mean if weights 1, 2, 1, 3, 3 respectively are allotted to the subjects.
Solution: $\bar{X}_w = \frac{60(1) + 75(2) + 63(1) + 59(3) + 55(3)}{1 + 2 + 1 + 3 + 3} = \frac{615}{10} = 61.5$
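A weighted mean is straightforward to compute directly from its definition. The sketch below uses a hypothetical helper named `weighted_mean` and applies it to the student's results from the example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# English, Biology, Mathematics, Physics, Chemistry
marks = [60, 75, 63, 59, 55]
subject_weights = [1, 2, 1, 3, 3]

student_mean = weighted_mean(marks, subject_weights)
print(student_mean)
```

With equal weights the result reduces to the ordinary arithmetic mean.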
Geometric mean
If the observed values are measured as ratios, proportions or percentages and the series of observations contains one or more unusually large values, the geometric mean gives a better measure of central tendency than other means. It is obtained by taking the nth root of the product of the n values, i.e. if the values of the observations are denoted by X1, X2, …, Xn, then
$GM = \sqrt[n]{X_1 \times X_2 \times \dots \times X_n}$

Example: A person has invested Rs 5,000 in the stock market. At the end of the first year the amount has
grown to Rs 6,250; he has had a 25 percent profit. If at the end of the second year his principal has grown
to Rs 8,750, the rate of increase is 40 percent for the year. What is the average rate of increase of his
investment during the two years?
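Solution: The growth factors for the two years are 1.25 and 1.40, so
$GM = \sqrt{1.25 \times 1.40} = \sqrt{1.75} \approx 1.323$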
The average rate of increase in the value of investment is therefore 1.323 - 1 = 0.323, which if multiplied
by 100, gives the rate of increase as 32.3 percent.
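The following Python sketch recomputes the geometric mean of the two yearly growth factors and confirms that growing Rs 5,000 at this average rate for two years reproduces the final amount. The function name `geometric_mean` is defined here for illustration.

```python
from math import prod

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return prod(values) ** (1 / len(values))

growth_factors = [6250 / 5000, 8750 / 6250]   # 1.25 and 1.40
gm = geometric_mean(growth_factors)

average_rate_percent = (gm - 1) * 100          # average yearly rate of increase
final_amount = 5000 * gm ** 2                  # two years at the average rate

print(round(gm, 3), round(average_rate_percent, 1), round(final_amount, 2))
```

The arithmetic mean of 25% and 40% (32.5%) would slightly overstate the growth, which is why the geometric mean is used for rates.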

Harmonic Mean
The harmonic mean is a suitable measure of central tendency when the data pertain to speeds, rates and times.
The harmonic mean is defined as the reciprocal of the mean of the reciprocals of a series of observations.
That is, let X1, X2, …, Xn be the values of a set of observations. Then the harmonic mean is given by:
$HM = \frac{n}{\frac{1}{X_1} + \frac{1}{X_2} + \dots + \frac{1}{X_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$

Example: In a small company, two typists are employed. Typist A types one page in ten minutes while
typist B takes twenty minutes for the same. Both are asked to type for one hour. What is the average time
taken by them for typing one page?
Solution:
$HM = \frac{2}{\frac{1}{10} + \frac{1}{20}} = \frac{40}{3} \approx 13.33$ minutes per page
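The harmonic mean for the typists can be checked in two ways: with `statistics.harmonic_mean` from the Python standard library, and by counting pages directly. Variable names are illustrative.

```python
from statistics import harmonic_mean

minutes_per_page = [10, 20]                  # typist A and typist B
hm = harmonic_mean(minutes_per_page)

# Direct check: each typist works for 60 minutes.
pages_typed = sum(60 / m for m in minutes_per_page)   # 6 pages + 3 pages
total_minutes = 60 * len(minutes_per_page)            # 120 typist-minutes
direct_average = total_minutes / pages_typed

print(round(hm, 2), round(direct_average, 2))
```

Both approaches give about 13.33 minutes per page, whereas the arithmetic mean (15 minutes) would not match the total work done.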

The mode
The mode is the value of the observation that occurs with the greatest frequency. A particular
disadvantage is that, with a small number of observations, there may be no mode. In addition, sometimes,
there may be more than one mode such as when dealing with a bimodal (two-peak)
distribution.
Example: Find the modal values for the following data:
(a) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg).
(b) 10, 10, 9, 9, 8, 12, 15, 5 (modal value = 9 and 10). Hence, it is possible for a frequency distribution to
have more than one mode.

Note: Distributions with one mode are called unimodal, those with two modes are called bimodal, and
those with more than two modes are called multimodal.
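As a small illustration, `statistics.multimode` (available in Python 3.8 and later) returns every value that attains the highest frequency, so it handles both the unimodal and the bimodal data sets in the example.

```python
from statistics import multimode

data_a = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]
data_b = [10, 10, 9, 9, 8, 12, 15, 5]

modes_a = multimode(data_a)   # a single mode
modes_b = multimode(data_b)   # two modes, listed in order of first appearance

print(modes_a, modes_b)
```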

For a grouped (continuous) frequency distribution, we can identify a modal class, that is, the class with the highest frequency.
The Median
An alternative measure of location, perhaps second in popularity to the arithmetic mean, is the median. In
a distribution, median is the value of the variable which divides it in to two equal halves. In an ordered
series of data median is an observation lying exactly in the middle of the series. It is the middle most value
in the sense that the number of values less than the median is equal to the number of values greater than it.
Suppose there are n observations in a sample. If these observations are ordered from smallest to largest, then the sample median for ungrouped data is defined as:
(1) The $\left(\frac{n+1}{2}\right)$th observation if n is odd.
(2) The average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations if n is even.

Example: Find the median of the following numbers.
(a) 6, 2, 8, 9, 4     (b) 5, 2, 1, 8, 3, 7, 8, 9
Solution:
a) Ascending order: 2, 4, 6, 8, 9 (n = 5)
Median = $\left(\frac{5+1}{2}\right)$th value = 3rd value = 6
b) Ascending order: 1, 2, 3, 5, 7, 8, 8, 9 (n = 8)
Median = $\frac{\text{4th value} + \text{5th value}}{2} = \frac{5 + 7}{2} = 6$
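The odd and even cases can be written out explicitly, as in the sketch below, and compared with `statistics.median`. The function `median_ungrouped` is a hypothetical helper written for this illustration.

```python
from statistics import median

def median_ungrouped(values):
    """Median by position: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                 # ((n + 1)/2)-th value, 1-based
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2   # (n/2)-th and (n/2 + 1)-th values

set_a = [6, 2, 8, 9, 4]
set_b = [5, 2, 1, 8, 3, 7, 8, 9]

print(median_ungrouped(set_a), median_ungrouped(set_b))
print(median(set_a), median(set_b))
```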

Median for Grouped Data


For a grouped (continuous) frequency distribution, the median is calculated as:
$\text{Median} = L + \frac{w}{f}\left(\frac{n}{2} - cf\right)$, where
L = lower class boundary of the median class,
w = width of the median class,
n = total frequency of the sample,
cf = cumulative frequency preceding the median class,
f = frequency of the median class.
The median class is the class with the smallest cumulative frequency (less than type) greater than or equal to $\frac{n}{2}$.

Example: Find the median for the following distribution.

Class limit   Frequency   Cumulative freq. (less than type)
40 - 44       7           7
45 - 49       10          17
50 - 54       22          39
55 - 59       15          54
60 - 64       12          66
65 - 69       6           72
70 - 74       3           75

$\frac{n}{2} = \frac{75}{2} = 37.5$
39 is the first cumulative frequency to be greater than or equal to 37.5.
Therefore, 50 - 54 is the median class. L = 49.5, n = 75, w = 5, cf = 17, f = 22
Hence, Median = $L + \frac{w}{f}\left(\frac{n}{2} - cf\right) = 49.5 + \frac{(37.5 - 17)5}{22} \approx 54.16$
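The steps of the grouped-median calculation can be sketched as a small Python function. The name `grouped_median`, the tuple layout and the `gap` parameter (the distance between one class's upper limit and the next class's lower limit, assumed to be 1 here) are choices made for this illustration.

```python
def grouped_median(classes, gap=1.0):
    """Median of a grouped distribution.

    classes: list of (lower_limit, upper_limit, frequency), in ascending order.
    gap: distance between consecutive class limits, used to form class boundaries.
    """
    n = sum(f for _, _, f in classes)
    target = n / 2
    cf = 0                                # cumulative frequency before the current class
    for lower, upper, f in classes:
        if cf + f >= target:              # first class whose cumulative frequency reaches n/2
            L = lower - gap / 2           # lower class boundary
            w = upper - lower + gap       # class width
            return L + (target - cf) * w / f
        cf += f
    raise ValueError("empty distribution")

distribution = [
    (40, 44, 7), (45, 49, 10), (50, 54, 22), (55, 59, 15),
    (60, 64, 12), (65, 69, 6), (70, 74, 3),
]

grouped_med = grouped_median(distribution)
print(round(grouped_med, 2))
```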
Note:
 Median is a positional average and hence not influenced by extreme observations.
 Median can be calculated in the case of open end intervals.
 Median can be located even if the data are incomplete.

2.2. Other measures of locations (Quantiles: quartiles, deciles, percentiles)

When a distribution is arranged in order of magnitude of items, the median is the value of the middle term. Other measures that depend on their positions in the distribution are the quartiles, deciles and percentiles, which are collectively called quantiles.
Quartiles: Quartiles are measures that divide the frequency distribution into four equal parts. The values of the variable corresponding to these divisions are denoted Q1, Q2 and Q3, often called the first, the second and the third quartile respectively.
Q1 is a value such that 25% of the items are less than or equal to it. Q2 has 50% of the items with values less than or equal to it, and Q3 has 75% of the items with values less than or equal to it.
The kth quartile Qk for ungrouped data is the value of the item in the $\frac{k(n+1)}{4}$th position, where k = 1, 2, 3 and n is the total number of observations.
The three quartiles for grouped data can be computed as follows:
• Calculate $\frac{kn}{4}$ and search for the minimum cumulative frequency which is greater than or equal to $\frac{kn}{4}$, k = 1, 2, 3.
• The class corresponding to this cumulative frequency is the kth quartile class. This is the class where Qk lies.
• Thus, $Q_k = L + \frac{w}{f}\left(\frac{kn}{4} - cf\right)$, k = 1, 2, 3, where
L = lower class boundary of the kth quartile class,
n = the total number of observations,
cf = the less than cumulative frequency corresponding to the class immediately preceding the kth quartile class,
w = the class width of the quartile class, and
f = frequency of the kth quartile class.
Deciles: Deciles are measures that divide the frequency distribution into ten equal parts. The values of the variable corresponding to these divisions are denoted D1, D2, …, D9, often called the first, the second, …, the ninth decile respectively.
To find Dk (k = 1, 2, …, 9) we count $\frac{kn}{10}$ of the observations, beginning from the lowest class.
For grouped data we have the following formula:
$D_k = L + \frac{w}{f}\left(\frac{kn}{10} - cf\right)$, k = 1, 2, 3, …, 9, where
L = lower class boundary of the kth decile class,
n = the total number of observations,
cf = the less than cumulative frequency corresponding to the class immediately preceding the kth decile class,
w = the class width of the decile class,
f = frequency of the kth decile class.
Percentiles: Percentiles are measures that divide the frequency distribution into one hundred equal parts. The values of the variable corresponding to these divisions are denoted P1, P2, …, P99, often called the first, the second, …, the ninety-ninth percentile respectively.
To find Pk (k = 1, 2, …, 99) we count $\frac{kn}{100}$ of the observations, beginning from the lowest class.
For grouped data we have the following formula:
$P_k = L + \frac{w}{f}\left(\frac{kn}{100} - cf\right)$, k = 1, 2, 3, …, 99, where
L = lower class boundary of the kth percentile class,
n = the total number of observations,
cf = the less than cumulative frequency corresponding to the class immediately preceding the kth percentile class,
w = the class width of the percentile class,
f = frequency of the kth percentile class.
Note: To compute quantiles, we first sort the data in ascending order.
Q2 = D5 = P50 = median, P25 = Q1, P75 = Q3, and $D_i = P_{10i}$, i = 1, 2, 3, …, 9.

Example: Consider the following distribution and calculate:
a) All quartiles   b) The 7th decile   c) The 90th percentile.

Class limit   Frequency   Cumulative freq. (less than type)
141 - 150     17          17
151 - 160     29          46
161 - 170     42          88
171 - 180     72          160
181 - 190     84          244
191 - 200     107         351
201 - 210     49          400
211 - 220     34          434
221 - 230     31          465
231 - 240     16          481
241 - 250     12          493


Solution: a) Quartiles
Q1: Determine the class containing the first quartile.
$\frac{n}{4} = \frac{493}{4} = 123.25$. The smallest cumulative frequency greater than or equal to 123.25 is 160. Hence, 171 - 180 is the class containing the first quartile.
L = 170.5, n = 493, w = 10, cf = 88, f = 72
$Q_1 = L + \frac{w}{f}\left(\frac{n}{4} - cf\right) = 170.5 + \frac{10(123.25 - 88)}{72} \approx 175.40$

Q2: Determine the class containing the second quartile.
$\frac{2n}{4} = 246.5$. Hence, 191 - 200 is the class containing the second quartile.
L = 190.5, n = 493, w = 10, cf = 244, f = 107
$Q_2 = L + \frac{w}{f}\left(\frac{2n}{4} - cf\right) = 190.5 + \frac{10(246.5 - 244)}{107} \approx 190.73$

Q3: Determine the class containing the third quartile.
$\frac{3n}{4} = 369.75$. Hence, 201 - 210 is the class containing the third quartile.
L = 200.5, n = 493, w = 10, cf = 351, f = 49
$Q_3 = L + \frac{w}{f}\left(\frac{3n}{4} - cf\right) = 200.5 + \frac{10(369.75 - 351)}{49} \approx 204.33$

b) D7: Determine the class containing the 7th decile.
$\frac{7n}{10} = 345.1$. Hence, 191 - 200 is the class containing the seventh decile.
L = 190.5, n = 493, w = 10, cf = 244, f = 107
$D_7 = L + \frac{w}{f}\left(\frac{7n}{10} - cf\right) = 190.5 + \frac{10(345.1 - 244)}{107} \approx 199.95$

c) P90: Determine the class containing the 90th percentile.
$\frac{90n}{100} = 443.7$. Hence, 221 - 230 is the class containing the 90th percentile.
L = 220.5, n = 493, w = 10, cf = 434, f = 31
$P_{90} = L + \frac{w}{f}\left(\frac{90n}{100} - cf\right) = 220.5 + \frac{10(443.7 - 434)}{31} \approx 223.63$
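Because quartiles, deciles and percentiles share one formula, a single function can cover all three. The sketch below is one possible way to write it; the name `grouped_quantile`, its parameters and the assumed gap of 1 between class limits are illustrative choices. For this distribution it gives Q1 of about 175.40, since 170.5 + 10(35.25)/72 = 175.396.

```python
def grouped_quantile(classes, k, parts, gap=1.0):
    """k-th quantile when the distribution is divided into `parts` equal parts.

    classes: list of (lower_limit, upper_limit, frequency), in ascending order.
    parts: 4 for quartiles, 10 for deciles, 100 for percentiles.
    gap: distance between consecutive class limits, used to form class boundaries.
    """
    n = sum(f for _, _, f in classes)
    target = k * n / parts                # kn/4, kn/10 or kn/100
    cf = 0                                # cumulative frequency before the current class
    for lower, upper, f in classes:
        if cf + f >= target:              # first class whose cumulative frequency reaches the target
            L = lower - gap / 2           # lower class boundary
            w = upper - lower + gap       # class width
            return L + (target - cf) * w / f
        cf += f
    raise ValueError("no class reaches the target position")

distribution = [
    (141, 150, 17), (151, 160, 29), (161, 170, 42), (171, 180, 72),
    (181, 190, 84), (191, 200, 107), (201, 210, 49), (211, 220, 34),
    (221, 230, 31), (231, 240, 16), (241, 250, 12),
]

q1, q2, q3 = (grouped_quantile(distribution, k, 4) for k in (1, 2, 3))
d7 = grouped_quantile(distribution, 7, 10)
p90 = grouped_quantile(distribution, 90, 100)

print([round(v, 2) for v in (q1, q2, q3, d7, p90)])
```

Calling the function with k = 5 and parts = 10, or k = 50 and parts = 100, returns the same value as Q2, in line with the note that Q2 = D5 = P50.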

2.3. Measures of variation (dispersion)
Introduction
The measures of central tendency help us describe a set of data by a single number or typical value. However, they do not provide any information about the extent to which the values differ from one another or from the average value. Hence, to increase our understanding of the pattern of the data, we must also measure its dispersion, which indicates the degree to which the numerical data tend to spread about an average value. The scatter or spread of the items of a distribution is known as dispersion or variation. The measures of dispersion also enable us to compare several samples with similar averages.
Consider the following data sets:
Set 1: 60 40 30 50 60 40 70 50
Set 2: 50 49 49 51 48 50 53 50
Set 3: 50 50 50 50 50 50 50 50
The three data sets have a mean of 50, but obviously data set 1 is more “spread out” than set 2 and set 3
has no variability.
Objectives
The general object of measuring dispersion is to obtain a single summary figure which adequately exhibits
whether the distribution is compact or spread out.
• To judge the reliability of measures of central tendency
• To control variability itself.
• To compare two or more groups of numbers in terms of their variability.
• To make further statistical analysis.

Absolute and Relative Measures of Dispersion


The measures of dispersion which are expressed in terms of the original unit of a series are termed as
absolute measures. Such measures are not suitable for comparing the variability of two distributions which
are expressed in different units of measurement and different average size. Relative measures of
dispersions are a ratio or percentage of a measure of absolute dispersion to an appropriate measure of
central tendency and are thus pure numbers independent of the units of measurement. For comparing the
variability of two distributions (even if they are not measured in the same unit), we compute the relative
measure of dispersion instead of absolute measures of dispersion.

Types of Measures of Dispersion


Absolute measures are useful for comparing variation in two or more distributions whose units of measurement are the same. Various measures of dispersion are in use. The most commonly used measures of dispersion are:

1) Range and Relative Range
2) Quartile Deviation and Coefficient of Quartile Deviation
3) Mean Deviation
4) Standard Deviation and Coefficient of Variation.

The Range (R)


The range is the largest value minus the smallest value in a data set. The range is greatly affected by
extreme values. Range = largest value – smallest value.
The following two distributions have the same range, 13, yet appear to differ greatly in the amount of
variability.

Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45
Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45
For this reason, among others, the range is not the most important measure of variability.
Merits and Demerits of range
Merits:
• It is rigidly defined.
• It is easy to calculate and simple to understand.
Demerits:
• It is not based on all observation.
• It is highly affected by extreme observations.
• It is affected by fluctuation in sampling.
• It cannot be computed in the case of open end distribution.
• It is very sensitive to the size of the sample.
Relative Range (RR)
It is also sometimes called the coefficient of range and is given by:
$RR = \frac{\text{Highest value} - \text{lowest value}}{\text{Highest value} + \text{lowest value}}$

Exercise:
1. Find the relative range of the above two distribution.
2. If the range and relative range of a series are 4 and 0.25 respectively. Then what is the value of:
a) Smallest observation (Ans. 6)
b) Largest observation (Ans. 10)

The Quartile Deviation (Semi-inter quartile range), Q.D


The inter quartile range is the difference between the third and the first quartiles of a set of items:
IQR = Q3 - Q1. The semi-inter quartile range is half of the inter quartile range:
$Q.D = \frac{Q_3 - Q_1}{2}$

Coefficient of Quartile Deviation (C.Q.D)

$C.Q.D = \frac{(Q_3 - Q_1)/2}{(Q_3 + Q_1)/2} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$

Remark: Q.D or C.Q.D includes only the middle 50% of the observation.

The Mean Deviation (M.D):


The mean deviation of a set of items is defined as the arithmetic mean of the values of the absolute
deviations from a given average.
Mean deviation about the mean for a data set x1, x2, …, xn:
$MD = \frac{\sum_{i=1}^{n} |x_i - \bar{X}|}{n}$
For a frequency distribution in which the values X1, X2, X3, …, Xk occur f1, f2, f3, …, fk times respectively, the mean deviation is obtained by:
$MD = \frac{\sum_{i=1}^{k} f_i |X_i - \bar{X}|}{\sum_{i=1}^{k} f_i}$
If the data are given in the form of a frequency distribution of k classes in which mi and fi are the class mark and frequency of the ith class respectively, then the mean deviation is given by:
$MD = \frac{\sum_{i=1}^{k} f_i |m_i - \bar{X}|}{\sum_{i=1}^{k} f_i}$

Steps to calculate M.D:


1. Find the arithmetic mean,
2. Find the deviations of each reading from X and
3. Find the arithmetic mean of the deviations, ignoring sign.
Example: Calculate the mean deviation for the following data:
Xi 10 8 9 7 6
fi 8 9 13 6 3

Solution: First find the mean: $\bar{X} = \frac{\sum f_i X_i}{\sum f_i} = \frac{10(8) + 8(9) + 9(13) + 7(6) + 6(3)}{8 + 9 + 13 + 6 + 3} = \frac{329}{39} \approx 8.4$. Then

Xi                   10     8      9      7      6
fi                   8      9      13     6      3
|Xi - mean|          1.6    0.4    0.6    1.4    2.4
fi |Xi - mean|       12.8   3.6    7.8    8.4    7.2

Thus, $MD = \frac{12.8 + 3.6 + 7.8 + 8.4 + 7.2}{8 + 9 + 13 + 6 + 3} = \frac{39.8}{39} \approx 1.02$
Interpretation: Each value deviates on average by 1.02 from the arithmetic mean, 8.4.
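The three steps for the mean deviation translate directly into code. In the sketch below the function name `mean_deviation` is an illustrative choice, and the mean is kept unrounded (329/39, about 8.436) rather than rounded to 8.4, so intermediate values differ slightly from a hand calculation while the final result still rounds to 1.02.

```python
def mean_deviation(values, frequencies):
    """Mean absolute deviation about the mean for a frequency distribution."""
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / n        # step 1
    abs_deviations = [abs(x - mean) for x in values]                   # step 2
    return sum(f * d for f, d in zip(frequencies, abs_deviations)) / n # step 3

x_values = [10, 8, 9, 7, 6]
f_values = [8, 9, 13, 6, 3]

md = mean_deviation(x_values, f_values)
print(round(md, 2))
```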

The Variance and Standard Deviation


The variance
The variance is the "average squared deviation from the mean": it measures the average of the squared deviations of the observations from the mean.
Suppose we have a population of N observations, say X1, X2, X3, …, XN. Then we define the population variance as:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N}$
But most of the time we have a sample of n observations, say X1, X2, X3, …, Xn, from the population of N. Then we define the sample variance as:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$, or $S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}$, or $S^2 = \frac{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n - 1)}$
This measure of variation is universally used to show the scatter of the individual measurements around
the mean of all the measurements in a given distribution. But the disadvantage is that the units of variance
are the square of the units of the original observations. The easiest way for this difficulty is to use the
square root of the variance as a measure of variability called the standard deviation.
Standard deviation
The population and the sample standard deviations, denoted by σ and S respectively, are defined as:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$, where μ is the population mean,
$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$, where $\bar{X}$ is the sample mean.

For frequency distribution data, the population and sample variances are given as:
$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N}$, where $N = \sum f_i$
$S^2 = \frac{\sum f_i (x_i - \bar{X})^2}{n - 1}$, where $n = \sum f_i$

Variance and Standard Deviation for Grouped Data


The sample variance for a grouped frequency distribution is given by
$S^2 = \frac{\sum f_i (m_i - \bar{X})^2}{n - 1}$, where $n = \sum f_i$ and mi = midpoint of the ith class.

Example: The areas of surfaces sprayable with DDT from a sample of 15 houses are as follows (in m²): 101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145.
Find the variance and standard deviation.
Solution: The mean of the sample is 125 ($\bar{X} = 125$). Then
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(101 - 125)^2 + (105 - 125)^2 + \dots + (145 - 125)^2}{14} = \frac{2502}{14} \approx 178.71$
Hence, the standard deviation is $S = \sqrt{178.71} \approx 13.37$.
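This result can be reproduced with the standard library: `statistics.variance` and `statistics.stdev` use the n - 1 divisor, matching the sample formulas. The sketch also checks the computational shortcut form of the sample variance; variable names are illustrative.

```python
from statistics import mean, stdev, variance

areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145]

n = len(areas)
sample_mean = mean(areas)
sample_var = variance(areas)     # divides by n - 1
sample_sd = stdev(areas)         # square root of the sample variance

# Shortcut form: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))
shortcut_var = (n * sum(x * x for x in areas) - sum(areas) ** 2) / (n * (n - 1))

print(sample_mean, round(sample_var, 2), round(sample_sd, 2), round(shortcut_var, 2))
```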
Example: Find the variance and standard deviation of the following grouped sample data.

Class     Frequency
40 - 44   7
45 - 49   10
50 - 54   22
55 - 59   15
60 - 64   12
65 - 69   6
70 - 74   3

Sample mean $\bar{X} = 55$, n = 75

mi (midpoint)      42     47    52    57   62    67    72    Total
fi (mi - mean)²    1183   640   198   60   588   864   867   4400

Then $S^2 = \frac{\sum f_i (m_i - \bar{X})^2}{n - 1} = \frac{4400}{74} \approx 59.46$
and $S = \sqrt{59.46} \approx 7.71$
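A minimal sketch of the grouped computation follows, with a hypothetical helper `grouped_sample_variance` that returns both the grouped mean and the sample variance from (lower limit, upper limit, frequency) tuples.

```python
from math import sqrt

def grouped_sample_variance(classes):
    """Return (mean, sample variance) for grouped data using class midpoints."""
    n = sum(f for _, _, f in classes)
    midpoints = [((lower + upper) / 2, f) for lower, upper, f in classes]
    mean = sum(m * f for m, f in midpoints) / n
    squared_dev_sum = sum(f * (m - mean) ** 2 for m, f in midpoints)
    return mean, squared_dev_sum / (n - 1)

distribution = [
    (40, 44, 7), (45, 49, 10), (50, 54, 22), (55, 59, 15),
    (60, 64, 12), (65, 69, 6), (70, 74, 3),
]

grouped_mean_value, grouped_var = grouped_sample_variance(distribution)
grouped_sd = sqrt(grouped_var)
print(grouped_mean_value, round(grouped_var, 2), round(grouped_sd, 2))
```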

Note:
1. If the standard deviation of X1, X2, …, Xn is S, then the standard deviation of
a) X1 + k, X2 + k, …, Xn + k will also be S (where k is a constant),
b) kX1, kX2, …, kXn will be |k|S,
c) c + kX1, c + kX2, …, c + kXn will be |k|S (c and k are constants).
2. Chebyshev's Theorem
For any data set with mean $\bar{X}$ and standard deviation S, no matter what the pattern of variation, the proportion of the values that fall within k standard deviations of the mean, that is, in the interval $(\bar{X} - kS, \bar{X} + kS)$, will be at least $1 - \frac{1}{k^2}$ for k > 1. Equivalently, the proportion of items falling beyond k standard deviations of the mean is at most $\frac{1}{k^2}$.
Example: Suppose a distribution has mean 50 and standard deviation 6. What percent of the numbers are:
a) Between 38 and 62
b) Less than 38 or more than 62.
Solutions: a) 38 and 62 are at equal distance from the mean, 50, and this distance is 12.
kS = 12, so 6k = 12 and k = 2.
Applying the above theorem, at least $\left(1 - \frac{1}{k^2}\right) \times 100\% = 75\%$ of the numbers lie between 38 and 62.
b) This is just the complement of a), i.e. at most $\frac{1}{k^2} \times 100\% = 25\%$ of the numbers are less than 38 or more than 62.
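The Chebyshev bound for an interval that is symmetric about the mean can be computed as in the sketch below. The function name `chebyshev_bounds` is made up for this illustration, and it assumes the interval endpoints are equidistant from the mean.

```python
def chebyshev_bounds(mean, sd, low, high):
    """Return (k, minimum proportion inside, maximum proportion outside).

    Assumes the interval (low, high) is symmetric about the mean.
    """
    if abs((mean - low) - (high - mean)) > 1e-9:
        raise ValueError("interval must be symmetric about the mean")
    k = (high - mean) / sd          # number of standard deviations
    inside = 1 - 1 / k ** 2         # at least this proportion lies inside
    return k, inside, 1 - inside

k, at_least_inside, at_most_outside = chebyshev_bounds(50, 6, 38, 62)
print(k, at_least_inside, at_most_outside)
```

The bound holds for any distribution, which is why it is weaker than what a specific distribution, such as the normal, would give.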

Example: The standard deviation of n observations X1, X2, ....,Xn is known to be 3. New set of
observations are obtained by the linear transformation Yi = 2Xi– 0.5 ( i = 1, 2, …, n ), then what will be
the standard deviation of the new set of observations.
Solution: new standard deviation = |k|S= 2*3 =6
Example: The mean and the standard deviation of a set of numbers are respectively 500 and 10.
a) If 10 is added to each of the numbers in the set, then what will be the variance and standard
deviation of the new set?
b) If each of the numbers in the set are multiplied by -5, then what will be the variance and standard
deviation of the new set?

Solutions: a) The variance and standard deviation will remain the same: variance = 10² = 100 and standard deviation = 10.
b) New standard deviation = |k|S = 5 * 10 = 50, and new variance = 50² = 2500.

Coefficient of Variation (CV)
The coefficient of variation (CV) is defined by
$CV = \frac{\text{standard deviation}}{\text{mean}} \times 100\% = \frac{S}{\bar{X}} \times 100\%$.
The coefficient of variation is most useful in comparing the variability of several different samples, each
with different means. This is because a higher variability is usually expected when the mean increases,
and the CV is a measure that accounts for this variability.
CV is a relative measure free from unit of measurement.
Example: An analysis of the weekly wages paid (in Birr) to workers in two firms A and B belonging to the same industry gives the following results.
In which firm are the wages more variable?

Value       Firm A   Firm B
Mean wage   56       72
Variance    100      121

Solution: $C.V_A = \frac{S_A}{\bar{X}_A} \times 100\% = \frac{10}{56} \times 100\% = 17.86\%$
$C.V_B = \frac{S_B}{\bar{X}_B} \times 100\% = \frac{11}{72} \times 100\% = 15.28\%$
Since C.V_A > C.V_B, there is greater variability in individual wages in firm A.
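The comparison of the two firms can be checked with a short sketch. The function name `coefficient_of_variation` is illustrative; it takes the variance, since that is what the table reports, and converts it to a standard deviation first.

```python
from math import sqrt

def coefficient_of_variation(mean, variance):
    """CV in percent: standard deviation divided by the mean, times 100."""
    return sqrt(variance) / mean * 100

cv_a = coefficient_of_variation(mean=56, variance=100)
cv_b = coefficient_of_variation(mean=72, variance=121)

more_variable = "A" if cv_a > cv_b else "B"
print(round(cv_a, 2), round(cv_b, 2), more_variable)
```

Firm B has the larger variance in absolute terms, yet firm A is more variable relative to its mean, which is the point of using a relative measure.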

The standard Score (Z-score):


It is the number of standard deviations that a given value X is below or above the mean.
The standard score of any value Xi is defined as
$Z_i = \frac{X_i - \text{mean}}{\text{standard deviation}} = \frac{X_i - \bar{X}}{S}$ (for sample data sets)
Values above the mean have positive Z-scores and values below the mean have negative Z-scores. Z-scores are generally meaningless by themselves unless they are compared to the distribution of scores from some reference group.
Note: A Z-score less than -2 or greater than 2 is considered an unusually low or unusually high value.
Example 1: Two sections were given introduction to statistics examinations. The following information was given.

Value                Section 1   Section 2
Mean                 78          90
Standard deviation   6           5

Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking, who performed better?
Solution: $Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{90 - 78}{6} = 2$ and $Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{95 - 90}{5} = 1$
Student A performed better relative to his section, because the score of student A is two standard deviations above the mean score of his section, while the score of student B is only one standard deviation above the mean score of his section.
Example 2: Two groups of people were trained to perform a certain task and tested to find out which
group is faster to learn the task. For the two groups the following information was given:
Value Group one Group two
Mean 10.4 min 11.9 min
Standard deviation 1.2 min 1.3 min
Relatively speaking:
a) Which group is more consistent(less variable) in its performance?
b) Suppose a person A from group one takes 9.2 minutes while person B from
Group two takes 9.3 minutes, who was faster in performing the task? Why?
Solutions:
a) Use the coefficient of variation.
$CV_1 = \frac{S_1}{\bar{X}_1} \times 100\% = \frac{1.2}{10.4} \times 100\% = 11.54\%$
$CV_2 = \frac{S_2}{\bar{X}_2} \times 100\% = \frac{1.3}{11.9} \times 100\% = 10.92\%$
Since CV_2 < CV_1, group 2 is more consistent (less variable).
b) Calculate the standard scores of A and B.
$Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{9.2 - 10.4}{1.2} = -1$ and $Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{9.3 - 11.9}{1.3} = -2$
Person B is faster, because the time taken by person B is two standard deviations shorter than the average time taken by group 2, while the time taken by person A is only one standard deviation shorter than the average time taken by group 1.
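Both worked examples reduce to one formula, sketched below with an illustrative helper `z_score`. For exam marks a larger Z-score is better, while for task times a smaller (more negative) Z-score means a faster performance relative to the group.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Example 1: exam marks (higher is better).
z_student_a = z_score(90, mean=78, sd=6)
z_student_b = z_score(95, mean=90, sd=5)

# Example 2: task times in minutes (lower is better).
z_person_a = z_score(9.2, mean=10.4, sd=1.2)
z_person_b = z_score(9.3, mean=11.9, sd=1.3)

print(z_student_a, z_student_b, round(z_person_a, 2), round(z_person_b, 2))
```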

2.4. Measures of shape of distribution


a) Measures of skewness
Skewness is the degree of asymmetry or departure from symmetry of a distribution. A skewed frequency
distribution is one that is not symmetrical. Skewness is concerned with the shape of the curve not size.

If the frequency curve (smoothed frequency polygon) of a distribution has a longer tail to the right of the
central maximum than to the left, the distribution is said to be skewed to the right or said to have positive
skewness. If it has a longer tail to the left of the central maximum than to the right, it is said to be skewed
to the left or said to have negative skewness.
The Pearsonian coefficient of skewness:
$\alpha_3 = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$

If α3 > 0, then the distribution is positively skewed.
If α3 = 0, then the distribution is symmetric.
If α3 < 0, then the distribution is negatively skewed.

Moments
If X is a variable that assumes the values X1, X2, …, Xn, then the rth moment about the mean is defined as
$M_r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^r}{n}$

For a frequency distribution this is expressed as:
$M_r = \frac{\sum f_i (X_i - \bar{X})^r}{n}$, where $n = \sum f_i$
If r = 2, this is the second central moment, which is the population variance. If n is large enough that n - 1 ≈ n, it is also approximately the sample variance.

The moment coefficient of skewness is defined as
$\alpha_3 = \frac{M_3}{M_2^{3/2}} = \frac{M_3}{(\sigma^2)^{3/2}} = \frac{M_3}{\sigma^3}$, where $M_3 = \frac{\sum (X_i - \bar{X})^3}{n}$ (or $M_3 = \frac{\sum f_i (X_i - \bar{X})^3}{n}$ for a frequency distribution) and σ is the population standard deviation.
If α3 > 0, then the distribution is positively skewed.
If α3 = 0, then the distribution is symmetric.
If α3 < 0, then the distribution is negatively skewed.

b) Measures of Kurtosis
Kurtosis is the degree of peakedness of a distribution, usually taken relative to a normal distribution. A distribution having a relatively high peak is called leptokurtic. If a curve representing a distribution is flat topped, it is called platykurtic. The normal distribution, which is neither very high peaked nor flat topped, is called mesokurtic.
Measures of kurtosis
The moment coefficient of kurtosis is denoted by α4 and given by
$\alpha_4 = \frac{M_4}{M_2^2} = \frac{M_4}{\sigma^4}$
where M4 is the fourth moment about the mean, M2 is the second moment about the mean, and σ is the population standard deviation.
The peakedness depends on the value of α4: if α4 > 3 the distribution is leptokurtic, if α4 = 3 it is mesokurtic, and if α4 < 3 it is platykurtic.
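The moment coefficients can be computed from raw data as in the sketch below. The helper names are illustrative, and the five-value sample is a hypothetical data set chosen only so that the arithmetic is easy to follow (its mean is 5).

```python
def central_moment(data, r):
    """r-th moment about the mean, using the divisor n."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)

def moment_skewness(data):
    """alpha_3 = M3 / M2^(3/2)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def moment_kurtosis(data):
    """alpha_4 = M4 / M2^2."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

sample = [2, 4, 4, 5, 10]   # hypothetical data with one large value on the right

m2 = central_moment(sample, 2)
alpha_3 = moment_skewness(sample)
alpha_4 = moment_kurtosis(sample)

print(round(m2, 2), round(alpha_3, 3), round(alpha_4, 3))
```

For this sample α3 is positive, reflecting the longer right tail, and α4 is below 3, so by the moment criterion the sample is flatter than a normal curve.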

Example: If the first four central moments of a distribution are:
$M_1 = 0, \; M_2 = 16, \; M_3 = -60, \; M_4 = 162$
a) Compute a measure of skewness.
b) Compute a measure of kurtosis and give your interpretation.

Solutions:
a) $\alpha_3 = \frac{M_3}{M_2^{3/2}} = \frac{-60}{16^{3/2}} = \frac{-60}{64} \approx -0.94 < 0$
Therefore the distribution is negatively skewed.
b) $\alpha_4 = \frac{M_4}{M_2^2} = \frac{162}{16^2} = \frac{162}{256} \approx 0.63 < 3$
Therefore the curve is platykurtic.
