
DEPARTMENT OF MATHEMATICS AND STATISTICS

ALEX EKWUEME FEDERAL UNIVERSITY NDUFU-ALIKE


LECTURE NOTE ON DESCRIPTIVE STATISTICS (STA 111)
Prepared by
Dr C. J. Nweke

Course Content
1. Statistical Data: Types, Sources and methods of collection.
2. Presentation of Data: Tables, charts and graphs
3. Error and approximations
4. Frequency and cumulative distributions
5. Measures of location, partition, dispersion, skewness and kurtosis
6. Rates, ratios and index numbers

Recommended Text:

1. Illowsky, B. and Dean, S. (2008). Collaborative Statistics. ©2008 Maxfield Foundation, Rice University, Houston, Texas.

A. INTRODUCTION
1. NATURE OF STATISTICS
Statistics means different things to different people, and several definitions have been given to it. Two common but broad definitions arise when the word is:
1. Used with a plural verb (facts and data): Here statistics is defined as facts or data, either numerical or non-numerical, organized and summarized so as to provide useful and accessible information about a particular subject, such as employment data, number of marriages, school enrolment, etc.
2. Used with a singular verb (mathematical or scientific techniques): Here statistics is the scientific method for collecting, organizing, summarizing, presenting and analyzing data, as well as drawing valid conclusions and making reasonable decisions on the basis of such analysis.

2. DIVISIONS OF STATISTICS
Statistics is divided into two broad areas: (a) Descriptive Statistics and (b) Inferential Statistics.
Descriptive Statistics: Descriptive statistics, also known as deductive statistics, deals with the collection and summarization of data. It is the branch of statistics concerned with describing the characteristics of known data, either by providing summaries about the population data or about the sample data. Descriptive statistics provides numerical and graphical procedures to summarize a collection of data in a clear and understandable way; it helps us to simplify large amounts of data in a sensible way, each descriptive statistic reducing a lot of data into a simpler summary. The purpose of descriptive statistics is to display and pass on information from which conclusions can easily be drawn and decisions made.
Inferential Statistics: This deals with the techniques or methods used to select a sample from a population, analyze the data from such a sample, and make estimates and/or draw conclusions about the population based on the information in the sample.
Remark: The focus of this course is DESCRIPTIVE STATISTICS.

STATISTICAL DATA: CLASSIFICATION AND COLLECTION
Data refer to a series of observations from a population, collected at regular intervals or instances; they are the values of the variables under consideration. When they are arranged in a particular order (either ascending or descending), they are called an ARRAY.
CLASSIFICATION OF DATA
Data are classified on several bases:
1. Based on time dimension
• Cross-sectional data: data collected on several related subjects at a point in time, e.g. the price of PMS (petrol) in 20 petrol stations in Abakaliki yesterday, or the number of students in each department in AEFUNAI in the 2021/2022 academic session.
• Time series data: data on one subject collected sequentially over time, e.g. the annual profit of UBA from 2000 to 2022, or the annual GDP of Nigeria from 1980 to 2022.
• Panel (longitudinal or pooled) data: data collected on related subjects and sequentially over time. It involves the pooling of cross-sectional and time series data, e.g. the GDP of 10 African countries from 1980 to 2022, or the profit of 15 deposit banks in Nigeria from 2000 to 2022.
2. Based on degree of measurability
• Quantitative data: data from quantitative variables. These are data that are measurable or can be quantified. They can be either continuous data (data from continuous variables, e.g. height in cm, weight in kg, age in months) or discrete data (data from discrete variables, e.g. the number of students in the STA 203 class for 2022/2023, or the number of lecturers in the ACC department).
• Qualitative data: these are data from qualitative (categorical) variables. They are non-numerical but classifiable data. Examples are sex, colour, race, etc.
3. Based on sources
Basically, there are two sources of statistical data: (a) primary data and (b) secondary data.
• Primary Data: This is the name given to data that are collected by the researcher from the original source and are used for the purpose for which they were originally collected. Sources of such data include surveys, experiments, censuses, etc.
• Secondary Data: Secondary data refers to data that are extracted from other sources for purposes other than those for which they were originally collected. The sources of this kind of data include journals, newspapers, annual reports, etc., from which we extract them.
4. Based on degree of aggregation
• Micro data: this is data in its unmerged (disaggregated) form, e.g. your GPA in the first semester.
• Macro data: this is data in its aggregated form, e.g. CGPA, average income, sum of the total temperature, etc.

SCALES OF MEASUREMENT OF DATA


The four common scales of data measurement are:
Nominal Scale: This is the scale used to describe categorical data that have no natural ordering. Here, numbers are used simply as labels for groups or classes. If our data set consists of blue, green and red items, we may designate blue as 1, green as 2 and red as 3. In this case, the numbers 1, 2 and 3 stand only for the category to which a data point belongs. Examples are sex, ethnicity, religion, marital status, etc.
Ordinal Scale: This involves categorical data with natural ordering. Here data elements
are ordered according to their relative size or quality. Four products ranked by a
consumer
may be ranked as 1, 2, 3, and 4, where 4 is the best and 1 is the worst. In this scale of
measurement we do not know how much better one product is than others, only that it is
better. Other examples are level of education, academic performance based on class of
CGPA. etc
Interval Scale: This scale is adopted for quantitative variables whose difference has
meaning but quotient does not make any sense. In the interval scale of measurement
the value of zero is assigned arbitrarily and therefore we cannot take ratios of two
measurements. But we can take ratios of intervals. A good example is how we measure
time of day, which is in an interval scale. We cannot say 10:00 A.M. is twice as long as
5:00 A.M. But we can say that the interval between 0:00 A.M. (midnight) and 10:00
A.M., which is a duration of 10 hours, is twice as long as the interval between 0:00 A.M.

and 5:00 A.M., which is a duration of 5 hours. This is because 0:00 A.M. does not mean
absence of any time. Another example is temperature. When we say 0°F, we do not
mean zero heat. A temperature of 100°F is not twice as hot as 50°F.
Ratio Scale: If two measurements are in ratio scale, then we can take ratios of those
measurements which will be meaningful. The zero in this scale is an absolute zero.
Money, for example, is measured in a ratio scale. A sum of N100 is twice as large as
N50. A sum of N0 means absence of any money and is thus an absolute zero. We have
already seen that measurement of duration (but not time of day) is in a ratio scale. In
general, the interval between two interval scale measurements will be in ratio scale.
Other examples of the ratio scale are measurements of weight, volume, area, or length.

METHODS OF DATA COLLECTION


Primary Methods
There are several ways of collecting data. These include:
(1) Direct Observation, (2) Personal Interview, (3) Mail Questionnaire, (4) Experimental Survey

Direct Observation: This involves a close and direct monitoring of the process and
recording the result of the observation at an appropriate time as they occur. Example is
the observing and recording of number of cars that enters a particular place for a given
period or interval of time.
Advantages: (i) It may yield an accurate and reliable result. (ii) The chances of incorrect
data being recorded are reduced.
Disadvantages: (i) It is very expensive to carry out in terms of cost, time and energy. (ii)
Sometimes, the information sought for cannot be directly observed.

Personal Interview: In this method, a respondent is usually asked questions and


his/her responses recorded/noted. It can also make use of questionnaires that are filled
by enumerators or field workers. This is the method used for public inquiry.
Advantages: (i) The personal judgement of the investigator comes into play. (ii) One can obtain information about something that happened when he or she was not there.
Disadvantages: (i) Inaccurate or false data may be given by the respondent, because he may have forgotten, may have misunderstood the question, or may even intend to mislead the interviewer. (ii) Where several enumerators are engaged, they may not all record the information in the same way as the investigator himself would have.

Mail Questionnaire: This involves sending questionnaires by post or any other means
to respondents. The respondent in turn fills and returns the questionnaire by post or any
other means. Its major drawback is the low rate of response, as the respondents may not be interested in the subject matter. The method is also prone to delay and, to some extent, questionnaires may be lost in transit.
Advantages: (i) It is by far the cheapest method. (ii) A large field of inquiry may be studied by this method.
Disadvantages: (i) Respondents may not send back the questionnaires and, when they do, they may not be accurately filled.
Experimental Survey/Experimentation: This is mostly used in Science and
technology where experiments are designed, conducted and observations are recorded.
There is often a control (a control experiment is that which the experimenter is trying to
improve upon) experiment with which the result of the main experiment will be
compared.
Advantages: (i) It appears to be the most reliable of all the methods. (ii) It yields a high
degree of accuracy.
Disadvantage: (i) It involves a lot of time, energy and cost. (ii) It requires a high level of
meticulousness and care in order to be effective.

Secondary Methods of Data Collection


As earlier stated, users of secondary data may not have as thorough an understanding of the background as the original investigator and so may be unaware of its limitations. Secondary data are obtained mainly from government, government ministries and parastatals' records, trade associations, journals, market surveys, newspapers, magazines, etc.

PRESENTATION OF STATISTICAL DATA
Data presentation is a process by which raw data are arranged in tabular or
diagrammatical forms. Collected data may be complex in nature. One way of simplifying
and making them more intelligible is to represent them by means of tables, diagrams,
charts and graphs.

FREQUENCY DISTRIBUTION
Data collected using any of the methods may be cumbersome and too large to handle, hence the need to summarize them in an easy-to-comprehend form. One way of doing this is to summarize the data in a frequency distribution table, so that the data, though compressed, look more meaningful.
Frequency Table: A frequency table is a convenient way of handling the data collected. It shows the number of times each value of a variable occurs in a data set. The distribution of the values of a variable in a frequency table is called a frequency distribution. A frequency table can be constructed either as an ungrouped frequency table, resulting in an ungrouped frequency distribution, or as a grouped frequency table, resulting in a grouped frequency distribution.

Construction of Frequency Table: In the construction of a frequency table, the


following must be considered
• Array: Arrange the data in specific rows and columns for easy identification.
• Order of Size or Magnitude: The array can be arranged in order of size or magnitude.
• Counting the Data: Counting the data on the spot is usually done using a tally counter (a button which records each observation as it occurs). An individual may also do the counting. We can also count using the old tally system, wherein we cross every fifth count. Whichever method is used has its own peculiarities and can become monotonous, hence the need for concentration and care.
• Tabulation Process and Large Surveys: This involves the use of sophisticated materials and equipment. For instance, sensitized questionnaire paper enables answers to be checked automatically, while the use of punch cards enables the cards to be electronically sorted into the required groups.

Much more information becomes apparent when the raw data is checked and arranged
in order of size or magnitude.
Constructing a Grouped Frequency Table: Generally, the quality of the frequency
distribution table is determined by a wise choice of number of classes and the width of
the class. Hence, in constructing a grouped frequency table, two things that are very
necessary are (i) deciding the number and width of classes and (ii) determination of the
class limits. In determining the class limits, it should be noted that class limit should be
definitely and clearly stated and the lower limit of the first class should be well defined
so as to cover all the data. This is done in case one needs to use the table to construct
graphs such as Ogive or frequency distribution curve. Other things to be noted include:
(i) The number of classes should not be less than six or more than twenty.
(ii) The range should be found, that is, the numerical difference between the largest and
the least figure in the data.
(iii) Class intervals are usually in multiples of five.
(iv) A class interval with an odd number of units is easier to work with than one with an even number, because the midpoint of an odd interval is an integer.
Uses of Frequency Table
(i) It allows required figures to be located easily and quickly.
(ii) It allows comparisons to be made easily between different classes of the
group.
(iii) It reveals patterns within the figures which cannot be seen in the ordinary
form.
(iv) It allows for easier data analysis and interpretations.
General rules for forming a Frequency Distribution Table
(i) Determine the range, that is, the difference between the smallest and the largest number.
(ii) Determine the number of classes, that is, the range divided by the class size.
(iii) Find the upper class limit using U1 = L1 + C − 1 (for discrete data), U1 = L1 + C − 0.1 (for data with one decimal place), U1 = L1 + C − 0.01 (for data with two decimal places) and, generally, U1 = L1 + C − 10^(-k) (for data with k decimal places).
Example: Given the following data on fifty electric lamps

16 13 7 21 23 28 23 1 20 30 29 18 18 33 20 11 23 8 29 20 12 27 25 18 22 16 21 20
24 13 34 23 17 26 2 20 17 15 39 22 34 5 21 35 23 28 17 16 26 24.

Using the minimum value in the data as the lower limit of the first class and 6 as the
class size, form a frequency distribution table for the data above.
Solution:
Class size (C) = 6, Range R = 39 − 1 = 38, Number of classes = R/C = 38/6 = 6.33 ≈ 7
The Frequency table is formed thus
Table I: Frequency distribution table of fifty Electric Lamps

Class Interval Tally Frequency


1–6 ||| 3
7 – 12 |||| 4
13 – 18 |||| |||| ||| 13
19 – 24 |||| |||| |||| | 16
25 – 30 |||| |||| 10
31 – 36 ||| 3
37 – 42 | 1
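As a quick check of the construction rules above, the following Python sketch (not part of the original note; the variable names and use of Python are my own) tallies the fifty electric-lamp observations into classes of width 6 starting from the minimum value. Because it counts directly from the data as listed, the frequencies it prints may differ slightly from the tallies in Table I if any value was mis-transcribed.

```python
# Build a grouped frequency table: classes of width 6 starting at the minimum value.
data = [16, 13, 7, 21, 23, 28, 23, 1, 20, 30, 29, 18, 18, 33, 20, 11, 23, 8,
        29, 20, 12, 27, 25, 18, 22, 16, 21, 20, 24, 13, 34, 23, 17, 26, 2, 20,
        17, 15, 39, 22, 34, 5, 21, 35, 23, 28, 17, 16, 26, 24]

class_size = 6
lower = min(data)                       # lower limit of the first class

print("Class interval   Frequency")
while lower <= max(data):
    upper = lower + class_size - 1      # U = L + C - 1 for discrete data
    freq = sum(lower <= x <= upper for x in data)
    print(f"{lower:>3} - {upper:<3}        {freq}")
    lower = upper + 1                   # next class starts just after this one
```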
Diagrammatic Representation of Data
An alternative means of data presentation is pictorial. This technique is sometimes simpler and easier to understand than the frequency distribution table, as it gives a pictorial representation of the subject under consideration. This is quite necessary to enable comparative study.
To be able to draw a meaningful diagram, it must be noted that a diagram must have a
title, the proportion of length and breadth must be specified, scale must be decided after
taking into account the magnitude of the data and the diagram must be attractive and
have footnotes. Some of the uses of diagrams include: (i) diagrams have greater attraction than mere figures; (ii) they help in delivering the required information in less time and without mental stress; (iii) they have greater memorising value than mere figures; and (iv) they facilitate comparison.

Types of Diagrams: Bar chart, Pie chart, Graphs, Histogram, Polygon, Stem and leaf,
Box plot and Dot plot. Etc.

Remark: While bar charts and pie charts are commonly used for categorical data, the others are mostly adopted for quantitative data.

Bar Chart
A bar chart consists of rectangular bars, with each bar representing the frequency (proportion, ratio, percentage, etc.) with which the different values of a variable occur. Bar charts are divided into (i) simple bar charts, (ii) component bar charts and (iii) multiple bar charts.
We shall use the data below to show each of these charts.

Case Rates of some reported notifiable diseases by Sex in Nigeria (2010 - 2013)
2010 2011 2012 2013
Disease Male Female Male Female Male Female Male Female
HIV/AIDS 13.7 24.4 12.3 23.4 5.9 6.3 10.2 13.0
Pneumonia 4.9 3.5 5.1 4.3 3.9 2.9 2.9 2.6
Malaria 66.9 59.9 70.3 62.1 68.6 73.3 56.8 56.8
Diarrhoea 12.0 9.0 9.3 7.2 6.1 4.0 22.3 19.5
Source: National Bureau of Statistics (NBS) Demographic Statistics Bulletin 2013.

Simple Bar Chart


[Figure: simple bar chart; vertical axis Rates (0 – 150), horizontal axis Years (2010 – 2013).]

Fig 1: A Simple Bar Chart showing case rate of Malaria in Nigeria (2010 - 2013)
Component Bar Chart: This shows the component parts which make up the total. It has
a limitation that the actual components represented cannot be read from the chart
directly. Thus, the difference between the top and the base has to be worked out before
any meaningful comparison can be made.

[Figure: component bar chart; vertical axis Rates (0 – 120), horizontal axis Sex (M/F) within each year 2010 – 2013; components HIV/AIDS, Pneumonia, Malaria and Diarrhoea.]

Fig 2: Component Bar Chart showing case rates of some notifiable disease by Sex in
Nigeria (2010 - 2013)

Multiple Bar Chart: This is used to compare changes in more than one variable. Here
the group of bars conveying information pertaining to a particular variable are joined
together.

[Figure: multiple bar chart; vertical axis Rates (0 – 80), horizontal axis Sex (M/F) within each year 2010 – 2013; bars for HIV/AIDS, Pneumonia, Malaria and Diarrhoea.]

Fig 3: A Multiple Bar Chart showing case rates of some notifiable diseases by Sex in
Nigeria (2010 - 2013)

Pie Chart: This is also used to present collected data. Each component part is represented by a slice (sector) whose angle at the centre of the pie is obtained by working out the proportion that the component bears to the whole of 360°.

Table 1: New Jobs created in Q2, 2015 (All Sectors)

Sector             | Frequency | Angular Sector
Formal             | 51,070    | (51070/141368) × 360 ≈ 130°
Informal           | 83,903    | (83903/141368) × 360 ≈ 214°
Public Institution | 6,395     | (6395/141368) × 360 ≈ 16°
Total              | 141,368   | 360°

[Figure: pie chart with sectors Formal ≈ 130.1°, Informal ≈ 213.7° and Public Institution ≈ 16.3°.]

Fig 4: Pie chart of new jobs created in different sectors
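The short Python sketch below (illustrative only, not from the note; the dictionary name is mine) shows how the angular sectors in Table 1 are computed: each frequency is divided by the total and multiplied by 360 degrees.

```python
# Compute the angle of each pie-chart sector: (frequency / total) * 360 degrees.
jobs = {"Formal": 51070, "Informal": 83903, "Public Institution": 6395}

total = sum(jobs.values())
for sector, freq in jobs.items():
    angle = freq / total * 360
    print(f"{sector:<20} {freq:>7}  {angle:6.1f} degrees")

print(f"{'Total':<20} {total:>7}  {360.0:6.1f} degrees")
```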

Histogram: This is used to represent a frequency distribution in the form of a diagram.


Here, the horizontal axis shows values of the class boundaries while the vertical axis
records the class frequency on each interval. It is the graph of class boundaries versus
the frequencies. We must note that the interval for each column should be constant and the bars should adjoin each other. That is, a histogram displays the classes of the quantitative
data on a horizontal axis and the frequencies (relative frequencies, percents) of those
classes on a vertical axis. The frequency (relative frequency, percent) of each class is
represented by a vertical bar whose height is equal to the frequency (relative frequency,
percent) of that class. The bars should be positioned so that they touch each other. For
single-value grouping, we use the distinct values of the observations to label the bars,
with each such value centered under its bar. For limit grouping or cutpoint grouping, we
use the lower class limits (or, equivalently, lower class cutpoints) to label the bars. Note:
Some statisticians and technologies use class marks or class midpoints centered under
the bars.

Steps
Step 1: Obtain a frequency (relative-frequency, percent) distribution of the data.
Step 2: Draw a horizontal axis on which to place the bars and a vertical axis on which to
display the frequencies (relative frequencies, percents).
Step 3: For each class, construct a vertical bar whose height equals the frequency
(relative frequency, percent) of that class.
Step 4: Label the bars with the classes, as explained above, the horizontal axis with the name of the variable, and the vertical axis with "Frequency" ("Relative frequency," "Percent").
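As an illustration of these steps, here is a minimal Python sketch (an assumption on my part, not from the note) that draws a histogram of the electric-lamp data using the matplotlib library, with the class boundaries as bin edges so that adjacent bars touch.

```python
# Histogram of the electric-lamp data, using class boundaries as bin edges.
import matplotlib.pyplot as plt

data = [16, 13, 7, 21, 23, 28, 23, 1, 20, 30, 29, 18, 18, 33, 20, 11, 23, 8,
        29, 20, 12, 27, 25, 18, 22, 16, 21, 20, 24, 13, 34, 23, 17, 26, 2, 20,
        17, 15, 39, 22, 34, 5, 21, 35, 23, 28, 17, 16, 26, 24]

boundaries = [0.5, 6.5, 12.5, 18.5, 24.5, 30.5, 36.5, 42.5]  # class boundaries

plt.hist(data, bins=boundaries, edgecolor="black")  # bars touch at the boundaries
plt.xlabel("Life of lamp")                          # horizontal axis: the variable
plt.ylabel("Frequency")                             # vertical axis: class frequencies
plt.title("Histogram of fifty electric lamps")
plt.show()
```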

Frequency Polygon: This is constructed by drawing ruled lines connecting the


midpoints of the top of each column of a histogram. This line is continued down to cut
the horizontal axis at what would be the midpoint of an extra class having zero
frequency.

Dot-plots: Another type of graphical display for quantitative data is the dot-plot. Dot-plots
are particularly useful for showing the relative positions of the data in a data set or for
comparing two or more data sets. The following procedure presents a method for constructing a dot-plot.
Steps
Step 1: Draw a horizontal axis that displays the possible values of the quantitative data.
Step 2: Record each observation by placing a dot over the appropriate value on the
horizontal axis.
Step 3: Label the horizontal axis with the name of the variable

Stem-and-Leaf Diagrams
With a stem-and-leaf diagram, we think of each observation as a stem (consisting of all but the rightmost digit) and a leaf (the rightmost digit). In general, stems may use as many digits as required, but each leaf must contain only one digit.
To Construct a Stem-and-Leaf Diagram

Step 1: Think of each observation as a stem (consisting of all but the rightmost digit) and a leaf (the rightmost digit).
Step 2: Write the stems from smallest to largest in a vertical column to the left of a
vertical rule.
Step 3: Write each leaf to the right of the vertical rule in the row that contains the
appropriate stem.
Step 4: Arrange the leaves in each row in ascending order.
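A minimal Python sketch of these four steps follows (illustrative only; it reuses the electric-lamp data from earlier, and the variable names are mine): each value is split into a stem and a leaf, leaves are grouped by stem, and the rows are printed in order.

```python
# Construct a simple stem-and-leaf display: stem = all but the last digit, leaf = last digit.
from collections import defaultdict

data = [16, 13, 7, 21, 23, 28, 23, 1, 20, 30, 29, 18, 18, 33, 20, 11, 23, 8,
        29, 20, 12, 27, 25, 18, 22, 16, 21, 20, 24, 13, 34, 23, 17, 26, 2, 20,
        17, 15, 39, 22, 34, 5, 21, 35, 23, 28, 17, 16, 26, 24]

leaves = defaultdict(list)
for x in sorted(data):            # sorting first keeps each row's leaves in ascending order
    stem, leaf = divmod(x, 10)    # e.g. 23 -> stem 2, leaf 3
    leaves[stem].append(leaf)

for stem in sorted(leaves):       # stems listed from smallest to largest
    print(f"{stem} | {' '.join(str(l) for l in leaves[stem])}")
```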

Cumulative Frequency Curve (Ogive): For a given frequency distribution, an ogive is


a curve drawn by finding the cumulative frequency of the data. If the frequencies of all
succeeding classes are added to the frequency of a class, we have the “less than
ogive”. If however, the frequencies of preceding classes are added to the frequency of a
class, we have the “greater than ogive”. For the less than ogive, the curve is plotted
against the upper boundaries. It can be used to find characteristics of a distribution such
as the mean, median, and quartiles.

ERRORS AND APPROXIMATION IN DATA COLLECTION AND ANALYSIS


Whichever method we use in collecting data, some circumstances obviously arise that give room to possible errors of different kinds. Statistically, we can identify two kinds of error: sampling and non-sampling error.
Sampling Error: This occurs as a result of making estimates of population parameters from samples. Because samples, even when properly chosen and of adequate size, are used to estimate population parameters, this type of error is often inevitable. It is often a result of fluctuations in samples. However, this kind of error is not commonly found in a general census, wherein samples are not used. The error can be obtained by measuring the difference between the expected and the actual values. Other causes of sampling error include:

• Improper selection of the samples (especially when samples are selected with some bias).
• Substitution (where a sample unit is absent and is thus substituted with a mean value or another value).
• Faulty demarcation of the statistical unit (leading to overlapping, thereby causing double counting).
• Errors due to variability and wrong methods of estimation.

Non-Sampling Error: This occurs when data are not properly observed, measured,
approximated and processed. Such errors are present in both census as well as sample
survey. This error can be avoided if adequate care and attention are observed. Such errors arise due to:

• Incomplete questionnaires.
• Defective methods of sampling.
• Personal bias of the investigator.
• Lack of trained/qualified enumerators.
• Failure of the respondents to respond.
• Errors in compilation and tabulation.

Approximations in Statistical Analysis: Because exactness is not always possible in data collection and analysis, we sometimes approximate. Such approximations include:

a. Model Assumptions: Most statistical models rely on assumptions (e.g., normality,


independence, linearity) that may not fully align with real-world data.

b. Rounding and Truncation: Rounding data to fewer decimal places introduces minor inaccuracies. Truncating data can lead to a loss of significant information.

c. Estimation Techniques: Using approximations (e.g., point estimates, interpolation)


introduces a margin of error. Confidence intervals quantify uncertainty in estimates.

d. Computational Limitations: Complex calculations may involve numerical


approximations. Rounding errors in software can accumulate in large datasets.

Implications of Errors and Approximations

• Reduced Accuracy: Errors and approximations can distort results.
• Misleading Conclusions: Systematic errors or poor approximations may lead to incorrect interpretations.
• Increased Uncertainty: Random errors and approximations contribute to variability in outcomes.

Mitigating Errors and Improving Approximations

Improve Data Collection Methods, Increase Sample Size, Use Proper Sampling Techniques, Address Non-Response Bias, Validate Assumptions, Quantify and Communicate Uncertainty, and Perform Sensitivity Analyses.

STATISTICAL MEASURES
The three common measures used to describe data features are measures of central tendency or location, measures of partition, and measures of dispersion or variation. Other measures are measures of shape: skewness and peakedness (kurtosis).

MEASURES OF CENTRAL TENDENCY


Measures of location also called measures of central tendency summarize a distribution
or data by its typical value. Most batches of data show a distinct tendency to cluster
about a certain central point, thus, for any particular batch of data, it usually becomes
possible to select some typical value or average to describe the entire batch. Such a
descriptive typical value is a measure of central tendency or location. Some measures
of central tendency are the Mean (Arithmetic mean, Harmonic mean and Geometric
mean), Median and Mode.

Arithmetic Mean

The arithmetic mean or briefly the mean is the most commonly used measure of
location and is obtained by adding all the values and dividing by the total number of
values or items.
Definition
Let x1, x2, ..., xn be the numerical measurements on a variable X from a sample of n elements; then the sample mean of X, denoted by x̄, is

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

Example 1
The mean of the measurements 2, 6, 3, 2, 1, 2, 7, 5, 4, 1, 4, 0, 5, 2, 4 is

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{48}{15} = 3.2

If each of the measurements xi is observed fi times, i = 1, 2, 3, ..., n, then the distribution of the xi's is called a frequency distribution. The sample mean x̄ for a frequency distribution is given by

\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}

A frequency distribution can be arranged in a frequency distribution table as follows:

xi | x1 | x2 | x3 | ... | xn
fi | f1 | f2 | f3 | ... | fn

Example 2
Find the mean of the following data:
14, 17, 15, 13, 18, 15, 16, 16, 15, 17, 17, 15, 17, 13, 17

Solution
The data can first be arranged in a frequency distribution table before computing the
formula as follows
xi   | 13 | 14 | 15 | 16 | 17 | 18 | Total
fi   |  2 |  1 |  4 |  2 |  5 |  1 | 15
fixi | 26 | 14 | 60 | 32 | 85 | 18 | 235

\bar{x} = \frac{\sum_{i=1}^{6} f_i x_i}{\sum_{i=1}^{6} f_i} = \frac{235}{15} = 15.667

Note: If the data are arranged in a grouped frequency distribution table, the formula for the mean remains the same as that for a frequency distribution table, except that xi now becomes the class mark (class average) for the ith group or class interval.
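A short Python sketch of the frequency-distribution mean follows (illustrative; not from the note, and the variable names are mine), using the values and frequencies of Example 2.

```python
# Sample mean from a frequency distribution: x_bar = sum(f_i * x_i) / sum(f_i).
values      = [13, 14, 15, 16, 17, 18]   # x_i
frequencies = [ 2,  1,  4,  2,  5,  1]   # f_i

total_fx = sum(f * x for f, x in zip(frequencies, values))
total_f  = sum(frequencies)

mean = total_fx / total_f
print(f"Mean = {total_fx} / {total_f} = {mean:.3f}")   # 235 / 15 = 15.667
```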

HARMONIC MEAN (HM)


The harmonic mean of a set of observations is defined as the reciprocal of the arithmetic mean of the reciprocals of the given values. If X1, X2, ..., Xn are observations, then

H.M = \frac{n}{\sum \left(\frac{1}{X_i}\right)}

For a frequency distribution,

H.M = \frac{N}{\sum f_i \left(\frac{1}{X_i}\right)}, \quad \text{where } N = \sum f_i

Example 2.2: Calculate the H.M of the data 5, 10, 17, 24, 30.
Solution:

X   | 5   | 10  | 17     | 24     | 30     | Total
1/X | 0.2 | 0.1 | 0.0588 | 0.0417 | 0.0333 | 0.4338

H.M = \frac{n}{\sum (1/X)} = \frac{5}{0.4338} = 11.526

Example 2.3: The marks scored by some students of a class are given below. Calculate
the H.M
Marks 20 21 22 23 24 25
No. of Students 4 2 7 1 3 1
Solution:

Marks (x) | No. of Students (f) | 1/x    | f(1/x)
20        | 4                   | 0.0500 | 0.2000
21        | 2                   | 0.0476 | 0.0952
22        | 7                   | 0.0454 | 0.3178
23        | 1                   | 0.0435 | 0.0435
24        | 3                   | 0.0417 | 0.1251
25        | 1                   | 0.0400 | 0.0400
Total     | 18                  |        | 0.8216

H.M = \frac{N}{\sum f_i (1/x_i)} = \frac{18}{0.8216} = 21.91

The harmonic mean is the most suitable average when it is desired to give greater weight to smaller observations and less weight to the larger ones. Some of its demerits are:
i. It is not easily understood.
ii. It is difficult to compare.
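A small Python sketch (illustrative only; variable names are mine) reproduces both harmonic-mean calculations of Examples 2.2 and 2.3.

```python
# Harmonic mean: n / sum(1/x) for raw data, and N / sum(f/x) for a frequency distribution.
raw = [5, 10, 17, 24, 30]
hm_raw = len(raw) / sum(1 / x for x in raw)
print(f"H.M of raw data = {hm_raw:.3f}")                     # about 11.53

marks = [20, 21, 22, 23, 24, 25]
freqs = [4, 2, 7, 1, 3, 1]
N = sum(freqs)
hm_grouped = N / sum(f / x for f, x in zip(freqs, marks))
print(f"H.M of frequency distribution = {hm_grouped:.2f}")   # about 21.91
```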
Geometric Mean
The geometric mean of a series of n observations is the nth root of the product of the values. If X1, X2, ..., Xn are observations, then

G.M = \sqrt[n]{x_1 x_2 x_3 \cdots x_n} = (x_1 x_2 x_3 \cdots x_n)^{1/n}

Taking logarithms of both sides,

\log G.M = \frac{1}{n}\log(x_1 x_2 x_3 \cdots x_n) = \frac{1}{n}(\log x_1 + \log x_2 + \cdots + \log x_n) = \frac{\sum \log x_i}{n}

so that

G.M = \operatorname{antilog}\left(\frac{\sum \log x_i}{n}\right)

For grouped data,

G.M = \operatorname{antilog}\left(\frac{\sum f \log x_i}{N}\right)

Example 2.4: Calculate the geometric mean of the following series of monthly incomes of a batch of families: 180, 250, 490, 1400, 1050.
Solution:

X     | 180    | 250    | 490    | 1400   | 1050   | Total
Log X | 2.2553 | 2.3979 | 2.6902 | 3.1461 | 3.0212 | 13.5107

G.M = \operatorname{antilog}\left(\frac{\sum \log x}{n}\right) = \operatorname{antilog}\left(\frac{13.5107}{5}\right) = \operatorname{antilog}(2.7021) = 503.6
Merits of Geometric Mean
1. It is suitable for averaging ratios, rates and percentages.
2. Unlike the arithmetic mean (A.M), it is less affected by the presence of extreme values.

Demerits of Geometric Mean


1. It cannot be used when the values are negative or when any of the observations is zero.
2. It is difficult to calculate, particularly when the items are very large or when a frequency distribution is involved.
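The following Python sketch (illustrative; not from the note) computes the geometric mean via logarithms as in Example 2.4; the math module is assumed to be available in the standard library.

```python
# Geometric mean via logarithms: G.M = antilog(sum(log10 x) / n).
import math

incomes = [180, 250, 490, 1400, 1050]

log_sum = sum(math.log10(x) for x in incomes)
gm = 10 ** (log_sum / len(incomes))     # antilog of the mean of the logs
print(f"Geometric mean = {gm:.1f}")     # about 503.7 (503.6 with the 4-decimal logs used above)
```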

WEIGHTED AND COMBINED MEAN


Sometimes we associate with the numbers X1, X2, ..., Xk certain weighting factors (or weights) W1, W2, ..., Wk, depending on the significance or importance attached to the numbers. In this case,

\bar{X} = \frac{W_1 X_1 + W_2 X_2 + \cdots + W_k X_k}{W_1 + W_2 + \cdots + W_k} = \frac{\sum W_i X_i}{\sum W_i}

is called the weighted arithmetic mean.

Example 2.1: If a final exam in a course is weighted 3 times as much as a quiz, and a student has a final exam grade of 85 and quiz grades of 70 and 90, calculate the weighted mean grade.

\bar{X} = \frac{(1 \times 70) + (1 \times 90) + (3 \times 85)}{1 + 1 + 3} = \frac{415}{5} = 83
Weighted mean is mainly used in
a. Construction of index numbers
b. Computation of standardized death and birth rates in demographic studies.
If the arithmetic averages and the numbers of items in two or more related groups are known, the combined or composite mean of the entire group can be obtained by

\text{Combined mean } \bar{X} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}
One major advantage of combined arithmetic mean is that we can determine the overall
mean of a combined data without going back to the original data.
Example 2.5: Find the combined mean for the data given below.
n1 = 20, x̄1 = 4, n2 = 30, x̄2 = 3

\bar{X} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} = \frac{(20 \times 4) + (30 \times 3)}{20 + 30} = \frac{80 + 90}{50} = \frac{170}{50} = 3.4
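A compact Python sketch of both ideas follows (illustrative; the helper function names are mine), reproducing Examples 2.1 and 2.5.

```python
# Weighted mean: sum(w_i * x_i) / sum(w_i); combined mean of two groups:
# (n1*mean1 + n2*mean2) / (n1 + n2).
def weighted_mean(values, weights):
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def combined_mean(n1, mean1, n2, mean2):
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)

# Example 2.1: quizzes of 70 and 90 (weight 1 each), final exam of 85 (weight 3).
print(weighted_mean([70, 90, 85], [1, 1, 3]))   # 83.0

# Example 2.5: n1 = 20, mean1 = 4, n2 = 30, mean2 = 3.
print(combined_mean(20, 4, 30, 3))              # 3.4
```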
THE MEDIAN

The median of a set of measurements or items is usually the middle value or item after
the measurements or items have been arranged in their order of magnitude. If there is an even number of measurements or values, two values will be in the middle and the
median in this case will be obtained as an average of the two middle values.
Definition
The median is the observation that occupies the middle position when the observations
are arranged in ascending or descending order of magnitude.

For ordered observations X(1) ≤ X(2) ≤ X(3) ≤ ... ≤ X(n),

\text{Median} = \tilde{X} = X_{\left(\frac{n+1}{2}\right)} \quad \text{if } n \text{ is odd}

\text{Median} = \tilde{X} = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2} \quad \text{if } n \text{ is even}

Example 3
Find the median value of the set of measurements: 0, 2, 1, 4, 2, 2, 4, 1, 3, 5,
4, 2, 7, 6, 5.

Solution
Arranging in order of magnitude gives:
Observations: 0 1 1 2 2 2 2 3 4 4 4 5 5 6 7
Rank:         1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

X_{\left(\frac{n+1}{2}\right)} = X_{\left(\frac{15+1}{2}\right)} = X_{(8)} = 3

Hence the median is the 8th observation = 3.


Example 4
Find the median for the following data. 5, 8, 12, 30, 18, 10, 2, 22
Solution:
Arranging in ascending order: 2, 5, 8, 10, 12, 18, 22, 30
Ranks: 1, 2, 3, 4, 5, 6, 7, 8
For n even,

\text{Median} = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2} = \frac{X_{(4)} + X_{(5)}}{2} = \frac{10 + 12}{2} = 11

Median for grouped data
For grouped data, the median is obtained by:

\text{Median} = L_1 + C\left(\frac{\frac{N}{2} - (\sum f)_1}{f_{\text{median}}}\right)

where
L1 = lower class boundary of the median class,
N = number of items in the data (i.e. the total frequency),
(Σf)1 = sum of the frequencies of all classes preceding the median class,
fmedian = frequency of the median class,
C = size of the median class.
Example 5
Find the median of the grouped data on the life span (in days) of 101 organisms of the
same specie exposed to the same treatment as presented in the following table.
Life span 10-19 20-29 30-39 40-49 50-59 60-69 Total

Frequency 5 12 24 35 18 7 101

Steps:
Step 1: Find the cumulative frequencies.
Step 2: Find N/2.
Step 3: Find in the cumulative frequency column the first value greater than N/2; the corresponding class interval is called the median class.
Then apply the formula:

\text{Median} = L_1 + C\left(\frac{\frac{N}{2} - (\sum f)_1}{f_m}\right)

Class Interval | Class Boundaries | F   | CF
10 – 19        | 9.5 – 19.5       | 5   | 5
20 – 29        | 19.5 – 29.5      | 12  | 17
30 – 39        | 29.5 – 39.5      | 24  | 41
40 – 49        | 39.5 – 49.5      | 35  | 76   *Median class
50 – 59        | 49.5 – 59.5      | 18  | 94
60 – 69        | 59.5 – 69.5      | 7   | 101
Total          |                  | 101 |

N = 101, N/2 = 101/2 = 50.5, (Σf)1 = 41, L1 = 39.5, C = 49.5 − 39.5 = 10, fm = 35.

Substituting, we have

\text{Median} = 39.5 + 10\left(\frac{50.5 - 41}{35}\right) = 39.5 + \frac{95}{35} = 39.5 + 2.71 = 42.21
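A short Python sketch of the grouped-data median follows (illustrative; the variable names are mine), reproducing Example 5.

```python
# Median of grouped data: L1 + C * (N/2 - F_before) / f_median.
boundaries  = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5]  # class boundaries
frequencies = [5, 12, 24, 35, 18, 7]

N = sum(frequencies)
half = N / 2

cum = 0
for i, f in enumerate(frequencies):
    if cum + f >= half:                      # first class whose cumulative frequency reaches N/2
        L1 = boundaries[i]                   # lower boundary of the median class
        C = boundaries[i + 1] - boundaries[i]
        median = L1 + C * (half - cum) / f
        break
    cum += f

print(f"Median = {median:.2f}")              # about 42.21
```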

MODE
A sample mode is a measurement which occurs most frequently in the sample. (It is
possible for two or more modes to exist in one sample if two or more different
measurements tie as most frequent). For ungrouped data or series of individual
observations, mode is often found by mere inspection.

Example 6.0
Find the mode of the set of observations: 2, 7, 10, 17, 8, 10, 2

Mode = 10
In some cases the mode may be absent while in some cases there may be more than
one mode.

Example 6.1:
1. 12, 10, 15, 24, 30 (no mode)
2. 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10: here the modes are 7 and 10; hence the set of data is said to be bimodal.

Example 7
Consider the following data and obtain the mode:
Number of children 0 1 2 3 4 5 6 7
Number of families 1 2 4 1 3 2 1 1
(frequency)

Mode = 2 since more families have two children than any other specified number of
children.

Mode for Grouped Data


For a continuous (grouped) distribution, locate the highest frequency; the corresponding class interval is called the modal class. Then apply the formula:

\text{Mode} = \tilde{X} = L_1 + C\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)

where
L1 = lower class boundary of the modal class,
Δ1 = excess of the modal frequency over the frequency of the next lower class = f1 − f0,
Δ2 = excess of the modal frequency over the frequency of the next higher class = f1 − f2,
C = size of the modal class,
f1 = frequency of the modal class,
f0 = frequency of the class preceding the modal class,
f2 = frequency of the class just above (succeeding) the modal class.
Hence the expression for the mode for grouped data can be written as

\tilde{X} = L_1 + C\left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right)

Example 8
Consider the grouped data in example 5. Obtain the mode.
Solution
Class Interval Class F
Boundaries
10 – 19 9.5 – 19.5 5
20 – 29 19.5 – 29.5 12
30 – 39 29.5 – 39.5 24
40 – 49 39.5 – 49.5 35* Modal class
50 – 59 49.5 – 59.5 18
60 – 69 59.5 – 69.5 7

Modal class = class with highest frequency, which is 40-49.


L1 = 39.5, Δ1 = 35 − 24 = 11, Δ2 = 35 − 18 = 17, C = 10.

\text{Mode} = \tilde{X} = 39.5 + 10\left(\frac{11}{11 + 17}\right) = 39.5 + \frac{110}{28} = 39.5 + 3.93 = 43.43
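A minimal Python sketch of the grouped-data mode follows (illustrative; variable names are mine), reproducing Example 8.

```python
# Mode of grouped data: L1 + C * d1 / (d1 + d2).
boundaries  = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5]
frequencies = [5, 12, 24, 35, 18, 7]

i = frequencies.index(max(frequencies))                      # position of the modal class
f1 = frequencies[i]                                          # modal frequency
f0 = frequencies[i - 1] if i > 0 else 0                      # class before the modal class
f2 = frequencies[i + 1] if i + 1 < len(frequencies) else 0   # class after the modal class

d1, d2 = f1 - f0, f1 - f2
L1 = boundaries[i]
C = boundaries[i + 1] - boundaries[i]

mode = L1 + C * d1 / (d1 + d2)
print(f"Mode = {mode:.2f}")                                  # about 43.43
```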

MEASURES OF PARTITION
Quartiles, Deciles and Percentiles
If a set of data is arranged in order of magnitude, the middle value (or arithmetic mean
of the two middle values) which divides the set into two equal parts is the median. By
extending this idea, we can think of those values which divide the set into four equal

parts. These values, denoted by Q1, Q2, Q3, are called the first, second and third quartiles respectively, the value Q2 being equal to the median. Similarly, the values which divide the data into ten equal parts are called deciles and are denoted by D1, D2, ..., D9, while the values dividing the data into one hundred equal parts are called percentiles and are denoted by P1, P2, P3, ..., P99. The 5th decile and the 50th percentile correspond to the median, and the 25th and 75th percentiles correspond to the 1st and 3rd quartiles respectively. Collectively, quartiles, deciles, percentiles and other values obtained by equal subdivision of the data are called quantiles.

The ith quartile (i = 1, 2, 3) is given as

Q_i = LCB_{q_i} + \left(\frac{\frac{iN}{4} - f_{b_{q_i}}}{f_{q_i}}\right) C

The ith decile (i = 1, 2, ..., 9) is given as

D_i = LCB_{d_i} + \left(\frac{\frac{iN}{10} - f_{b_{d_i}}}{f_{d_i}}\right) C

The ith percentile (i = 1, 2, ..., 99) is given as

P_i = LCB_{p_i} + \left(\frac{\frac{iN}{100} - f_{b_{p_i}}}{f_{p_i}}\right) C

where LCB is the lower class boundary of the class containing the required quantile, f_b is the cumulative frequency of all classes before that class, f is the frequency of that class and C is the class size.

Interquartile Range: Q_3 - Q_1

Semi-Interquartile Range: \frac{Q_3 - Q_1}{2}

Illustration
Using our data on the fifty electric lamps (class size C = 6; cumulative frequencies 3, 7, 20, 36, 46, 49, 50), we compute as follows.

Third Quartile: 3N/4 = 37.5, so the quartile class is 25 – 30 (LCB = 24.5, f = 10, cumulative frequency before the class = 36):

Q_3 = LCB_{q_3} + \left(\frac{\frac{3N}{4} - f_{b_{q_3}}}{f_{q_3}}\right) C = 24.5 + \left(\frac{37.5 - 36}{10}\right) 6 = 25.4

Third Decile: 3N/10 = 15, so the decile class is 13 – 18 (LCB = 12.5, f = 13, cumulative frequency before the class = 7):

D_3 = LCB_{d_3} + \left(\frac{\frac{3N}{10} - f_{b_{d_3}}}{f_{d_3}}\right) C = 12.5 + \left(\frac{15 - 7}{13}\right) 6 = 16.19

Twenty-third Percentile: 23N/100 = 11.5, so the percentile class is 13 – 18:

P_{23} = LCB_{p_{23}} + \left(\frac{\frac{23N}{100} - f_{b_{p_{23}}}}{f_{p_{23}}}\right) C = 12.5 + \left(\frac{11.5 - 7}{13}\right) 6 = 14.58
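The Python sketch below (illustrative only; the function name grouped_quantile is mine) applies the same interpolation formula to the electric-lamp frequency table, so any quartile, decile or percentile can be obtained by changing only the proportion.

```python
# General quantile for grouped data: LCB + C * (p*N - F_before) / f,
# where p = i/4 for quartiles, i/10 for deciles and i/100 for percentiles.
boundaries  = [0.5, 6.5, 12.5, 18.5, 24.5, 30.5, 36.5, 42.5]  # electric-lamp class boundaries
frequencies = [3, 4, 13, 16, 10, 3, 1]

def grouped_quantile(p):
    """Value below which a proportion p (0 < p < 1) of the data falls."""
    N = sum(frequencies)
    target = p * N
    cum = 0
    for i, f in enumerate(frequencies):
        if cum + f >= target:                       # class containing the quantile
            LCB = boundaries[i]
            C = boundaries[i + 1] - boundaries[i]
            return LCB + C * (target - cum) / f
        cum += f

print(f"Q3  = {grouped_quantile(3 / 4):.2f}")       # third quartile
print(f"D3  = {grouped_quantile(3 / 10):.2f}")      # third decile
print(f"P23 = {grouped_quantile(23 / 100):.2f}")    # 23rd percentile
```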

MEASURES OF DISPERSION
Dispersion is the second property which describes a set of data. Although the measures of central tendency, i.e. the mean, median, mode and geometric mean, are useful clues to the value of central items, they do not tell us how the items are spread or dispersed throughout the distribution. To get a clue of this spread, we need a measure of dispersion, because two or more distributions may have the same value of central tendency but differ greatly in dispersion. Thus, a measure of dispersion is a measure of the degree to which numerical data tend to spread about an average value. Measures of dispersion fall into two main categories, namely measures of absolute dispersion and measures of relative dispersion.

Measures of Absolute Dispersion: These are measures computed to serve as cues to


the amount of spread in the given distribution. They include Range, Variance, Standard
Deviation, Mean Deviation etc.
Range: The Range of a set of data is the difference between the largest and the
smallest value in the given distribution.
Variance and Standard Deviation
These are two commonly used measures of dispersion which take into account how all the values are distributed. They evaluate how the values fluctuate about the mean.
Standard Deviation is simply the square root of the variance. The sample variance is
denoted by S2 while the sample standard deviation is denoted by S. Given a set of raw
data, these measures can be computed as follows

S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}

S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}
Computation of Variance and Standard Deviation for a grouped distribution: For
computing S2 and S from a grouped data, we can use either the definitional or the
computational method.
The definitional formulae are given as

S^2 = \frac{\sum f_i (X_i - \bar{X})^2}{n - 1}

S = \sqrt{\frac{\sum f_i (X_i - \bar{X})^2}{n - 1}}

The computational formulae are given as

S^2 = \frac{\left[\sum f u^2 - \frac{(\sum f u)^2}{n}\right] C^2}{n - 1}

S = \sqrt{\frac{\left[\sum f u^2 - \frac{(\sum f u)^2}{n}\right] C^2}{n - 1}}

Here u is the coded (transformed) variable used when computing the mean by the short-cut method, and C is the usual class size.
The mean deviation is given as

MD = \frac{\sum f |x - \bar{x}|}{N}
Example: Using the data on the fifty electric lamps above, and taking the fourth class (19 – 24) as the origin for the coded variable u, we compute the following (class mark x, with mean x̄ = 1009/50 = 20.18):

Class Interval | f  | Class mark (x) | fx    | u  | u² | fu  | fx²     | fu² | f(x − x̄)² | f|x − x̄|
1 – 6          | 3  | 3.5            | 10.5  | −3 | 9  | −9  | 36.75   | 27  | 834.6672  | 50.04
7 – 12         | 4  | 9.5            | 38    | −2 | 4  | −8  | 361     | 16  | 456.2496  | 42.72
13 – 18        | 13 | 15.5           | 201.5 | −1 | 1  | −13 | 3123.25 | 13  | 284.7312  | 60.84
19 – 24        | 16 | 21.5           | 344   | 0  | 0  | 0   | 7396    | 0   | 27.8784   | 21.12
25 – 30        | 10 | 27.5           | 275   | 1  | 1  | 10  | 7562.5  | 10  | 535.824   | 73.2
31 – 36        | 3  | 33.5           | 100.5 | 2  | 4  | 6   | 3366.75 | 12  | 532.2672  | 39.96
37 – 42        | 1  | 39.5           | 39.5  | 3  | 9  | 3   | 1560.25 | 9   | 373.2624  | 19.32
Total          | 50 |                | 1009  |    |    | −11 | 23406.5 | 87  | 3044.88   | 307.2

S^2 = \frac{\sum f (X - \bar{X})^2}{n - 1} = \frac{3044.88}{49} = 62.1404

S = \sqrt{\frac{\sum f (X - \bar{X})^2}{n - 1}} = \sqrt{62.1404} = 7.8829

The variance can also be computed using another form of the definitional formula:

S^2 = \frac{\sum f x^2 - \frac{(\sum f x)^2}{n}}{n - 1} = \frac{23406.5 - \frac{1009^2}{50}}{49} = \frac{23406.5 - 20361.62}{49} = \frac{3044.88}{49} = 62.1404

The variance using the computational formula is

S^2 = \frac{\left[\sum f u^2 - \frac{(\sum f u)^2}{n}\right] C^2}{n - 1} = \frac{\left[87 - \frac{(-11)^2}{50}\right] \times 6^2}{49} = 62.1404

The standard deviation is

S = \sqrt{62.1404} = 7.8829

The mean deviation is

MD = \frac{\sum f |x - \bar{x}|}{N} = \frac{307.2}{50} = 6.144
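To check the grouped-data calculations above, here is a short Python sketch (illustrative; not from the original note) that recomputes the mean, variance, standard deviation and mean deviation from the class marks and frequencies of the electric-lamp table.

```python
# Variance, standard deviation and mean deviation for the grouped lamp data,
# using the class marks as the x values (definitional formulae).
import math

class_marks = [3.5, 9.5, 15.5, 21.5, 27.5, 33.5, 39.5]
frequencies = [3, 4, 13, 16, 10, 3, 1]

n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, class_marks)) / n      # 20.18

ss = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, class_marks))
variance = ss / (n - 1)                                              # about 62.14
std_dev = math.sqrt(variance)                                        # about 7.88

mean_dev = sum(f * abs(x - mean) for f, x in zip(frequencies, class_marks)) / n

print(f"Mean = {mean:.2f}, S^2 = {variance:.4f}, S = {std_dev:.4f}, MD = {mean_dev:.3f}")
```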
Moment, Skewness and Kurtosis
Moment
The rth moment of a variable X about a constant C is defined as

E(X - C)^r = \frac{\sum (X - C)^r}{n} \quad \text{for an ungrouped dataset}

E(X - C)^r = \frac{\sum f (X - C)^r}{n} \quad \text{for a grouped dataset, where } n = \sum f
Skewness
Any set of data has a shape describing it. A sketch of the data on the graph would
usually give an insight about the distribution usually described as being symmetrical or
skewed. A distribution is symmetrical if there are no extreme values in a particular
direction so that low and high values balance each other. On the other hand, if the peak
lies to one or other side of the centre of the histogram, then the distribution is said to be
skewed, positively or negatively depending on the direction of the skewness. Skewness
can be measured and expressed as degree of skewness and also direction of
skewness. The degree of skewness as given by Karl Pearson are

𝑀𝑒𝑎𝑛 − 𝑀𝑜𝑑𝑒
𝑆𝑘 =
𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛
or
3(𝑀𝑒𝑎𝑛 − 𝑀𝑒𝑑𝑖𝑎𝑛)
𝑆𝑘 =
𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛

Where Sk is the Pearson coefficient of skewness. Sk is a dimensionless quantity that ranges over −1 ≤ Sk ≤ 1. If Sk is negative (Sk < 0), it implies negative skewness (mean < median < mode); a positive Sk (Sk > 0) implies positive skewness (mean > median > mode), while a zero value of Sk (Sk = 0) implies symmetry (mean = median = mode).
Other measures of skewness are
(1) Quartile coefficient of skewness:

QCS = \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}

(2) (10 – 90) percentile coefficient of skewness:

PCS = \frac{P_{90} - 2P_{50} + P_{10}}{P_{90} - P_{10}}
Example: With mean = 55, median = 55.06 and standard deviation = 17.53,

S_k = \frac{3(55 - 55.06)}{17.53} = -0.01

Sk = −0.01 implies that the distribution is skewed to the left; however, since the value is small and close to zero, there is a possibility of symmetry in the data set.
Kurtosis
This is the degree of peakedness of a distribution and it is usually discussed and
measured relative to the degree of the normal distribution. A distribution that is peaked
as the normal distribution is said to be Mesokurtic. If a distribution is more peaked than
the normal distribution then it is said to be Leptokurtic. If the distribution is less peaked
than the normal distribution then it is said to be Platykurtic.
The value of coefficient of kurtosis for every normal distribution is K=3, for platykurtic
distribution, K<3 and for the Leptokurtic distribution, K>3.
The coefficient of kurtosis is

K = \frac{\frac{1}{2}(Q_3 - Q_1)}{P_{90} - P_{10}}
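The small Python sketch below (illustrative only; the function names are mine, and the only numerical input shown is taken from the skewness example above) collects the shape measures defined in this section: Pearson's coefficient of skewness, the quartile coefficient of skewness and the percentile coefficient of kurtosis.

```python
# Shape measures: Pearson skewness, quartile skewness, percentile kurtosis.
def pearson_skewness(mean, median, std_dev):
    # Sk = 3(mean - median) / standard deviation
    return 3 * (mean - median) / std_dev

def quartile_skewness(q1, q2, q3):
    # QCS = (Q3 - 2*Q2 + Q1) / (Q3 - Q1)
    return (q3 - 2 * q2 + q1) / (q3 - q1)

def percentile_kurtosis(q1, q3, p10, p90):
    # K = (1/2)(Q3 - Q1) / (P90 - P10)
    return 0.5 * (q3 - q1) / (p90 - p10)

# Values from the skewness example above: mean 55, median 55.06, s.d. 17.53.
print(round(pearson_skewness(55, 55.06, 17.53), 2))   # -0.01
```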
