Sta 111 Lecture Note
Course Content
1. Statistical Data: Types, Sources and methods of collection.
2. Presentation of Data: Tables, charts and graph
3. Error and approximations
4. Frequency and cumulative distributions
5. Measures of location, partition, dispersion, skewness and kurtosis
6. Rates, ratios and index numbers
Recommended Text:
A. INTRODUCTION
1. NATURE OF STATISTICS
Statistics means different things to different people, and several definitions have been
given. Two common but broad definitions depend on how the word is used:
1. Used with a plural verb (facts and data): Here statistics is defined as facts or data,
either numerical or non-numerical, organized and summarized so as to provide
useful and accessible information about a particular subject, such as employment
data, number of marriages, school enrolment, etc.
2. Used with a singular verb (mathematical or scientific techniques): Here statistics
is the scientific method for collecting, organizing, summarizing, presenting, and
analyzing data, as well as drawing valid conclusions and making reasonable
decisions on the basis of such analysis.
3. DIVISIONS OF STATISTICS
Statistics is divided into two broad areas: (a) Descriptive Statistics and (b) Inferential
Statistics.
Descriptive Statistics: Descriptive statistics, also known as deductive statistics,
deals with the collection and summarization of data. It is the branch of statistics
concerned with describing the characteristics of known data by providing summaries
of either population data or sample data. It provides numerical and graphical
procedures for summarizing a collection of data in a clear and understandable way,
so that large amounts of data are reduced to simpler summaries. The purpose of
descriptive statistics is to display and pass on information from which conclusions
can easily be drawn and decisions made.
Inferential Statistics: This deals with the techniques or methods used to select
a sample from a population, analyze data from that sample, and make estimates and/or
draw conclusions about the population based on the information in the sample.
Remark: This course focuses on DESCRIPTIVE STATISTICS.
STATISTICAL DATA: CLASSIFICATION AND COLLECTION
Data are series of observations from a population, collected at regular intervals or
instances. They are the values of the variables under consideration. When they are arranged in
order of magnitude (either ascending or descending), they are called an ARRAY.
CLASSIFICATION OF DATA
Data are classified on several bases:
1. Based on time dimension
Cross-sectional data: data collected on several related subjects at a point in time,
e.g. the price of PMS (petrol) at 20 petrol stations in Abakaliki yesterday, or the number of students
in each department in AEFUNAI in the 2021/2022 academic session.
Time series data: data on one subject collected sequentially over time, e.g. the
annual profit of UBA from 2000 to 2022, or the annual GDP of Nigeria from 1980 to 2022.
Panel (longitudinal or pooled) data: data collected on several related subjects
sequentially over time. It involves pooling cross-sectional and time series data,
e.g. the GDP of 10 African countries from 1980 to 2022, or the profit of 15 deposit banks
in Nigeria from 2000 to 2022.
2. Based on degree of measurability
Quantitative data: Data from quantitative variables. These are data that are
measurable or can be quantified. They can be either continuous data (data from
continuous variables, e.g. height in cm, weight in kg, age in months) or discrete data
(data from discrete variables, e.g. number of students in the STA 203 class for
2022/2023, number of lecturers in the ACC department).
Qualitative data: These are data from qualitative (categorical) variables. They are
non-numerical but classifiable data. Examples are sex, colour, race, etc.
3. Based on sources
Basically, there are two sources of Statistical Data: (a) Primary Data and (b)
Secondary Data
Primary Data: This is the name given to the data that are collected by the
researcher from the original source and are used for the purpose for which they
were originally collected. Sources of such data include: surveys, experiment,
census etc.
Secondary Data: Secondary data are data extracted from existing
sources and used for a purpose other than that for which they were originally
collected. Sources of this kind of data include journals, newspapers,
annual reports, etc.
4. Based on degree of aggregation
Micro data: data in its unmerged (individual) form, e.g. your GPA in the
first semester.
Macro data: data in its aggregated form, e.g. CGPA, average income, total
temperature, etc.
and 5:00 A.M., which is a duration of 5 hours. This is because 0:00 A.M. does not mean
absence of any time. Another example is temperature. When we say 0°F, we do not
mean zero heat. A temperature of 100°F is not twice as hot as 50°F.
Ratio Scale: If two measurements are in ratio scale, then we can take ratios of those
measurements which will be meaningful. The zero in this scale is an absolute zero.
Money, for example, is measured in a ratio scale. A sum of N100 is twice as large as
N50. A sum of N0 means absence of any money and is thus an absolute zero. We have
already seen that measurement of duration (but not time of day) is in a ratio scale. In
general, the interval between two interval scale measurements will be in ratio scale.
Other examples of the ratio scale are measurements of weight, volume, area, or length.
Direct Observation: This involves close and direct monitoring of a process and
recording the results of the observation as they occur. An example is
observing and recording the number of cars that enter a particular place during a given
period of time.
Advantages: (i) It may yield an accurate and reliable result. (ii) The chances of incorrect
data being recorded are reduced.
Disadvantages: (i) It is very expensive to carry out in terms of cost, time and energy. (ii)
Sometimes, the information sought for cannot be directly observed.
still, with intent to mislead the interviewer. (ii) Whereas the interviewer engages many
respondents, they may not record the information in the same way, as he would have
himself.
Mail Questionnaire: This involves sending questionnaires by post or any other means
to respondents. The respondent in turn fills in and returns the questionnaire by post or any
other means. Its major drawback is a low response rate, as the respondents may
not be interested in the subject matter. This method is also prone to delay and, to some
extent, questionnaires may get lost in transit.
Advantages: (i) It is by far the cheapest method. (ii) A large field of inquiry may be
studied by this method.
Disadvantage: (i) Respondents may not send back the questionnaires and, when they
do, the questionnaires may not be accurately filled in.
Experimental Survey/Experimentation: This is mostly used in science and
technology, where experiments are designed and conducted and the observations are recorded.
There is often a control experiment (the existing procedure that the experimenter is trying to
improve upon) with which the result of the main experiment is
compared.
Advantage: (i) It appears to be most reliable of all the methods. (ii) It yields a high
degree of accuracy.
Disadvantage: (i) It involves a lot of time, energy and cost. (ii) It requires a high level of
meticulousness and care in order to be effective.
PRESENTATION OF STATISTICAL DATA
Data presentation is a process by which raw data are arranged in tabular or
diagrammatical forms. Collected data may be complex in nature. One way of simplifying
and making them more intelligible is to represent them by means of tables, diagrams,
charts and graphs.
FREQUENCY DISTRIBUTION
Data collected using any of the methods may be too large and cumbersome to handle,
hence the need to summarize them in an easy-to-comprehend form. One way of doing
this is to summarize the data in a frequency distribution table so that the
data, though compressed, become more meaningful.
Frequency Table: A frequency table shows the number of times each value of a
variable occurs in a data set.
The distribution of the values of the variable in a frequency table is called a frequency
distribution. A frequency table can be constructed as either an ungrouped frequency table,
resulting in an ungrouped frequency distribution, or a grouped frequency table, resulting in a
grouped frequency distribution.
Much more information becomes apparent when the raw data is checked and arranged
in order of size or magnitude.
Constructing a Grouped Frequency Table: Generally, the quality of the frequency
distribution table is determined by a wise choice of number of classes and the width of
the class. Hence, in constructing a grouped frequency table, two things that are very
necessary are (i) deciding the number and width of classes and (ii) determination of the
class limits. In determining the class limits, it should be noted that class limit should be
definitely and clearly stated and the lower limit of the first class should be well defined
so as to cover all the data. This is done in case one needs to use the table to construct
graphs such as Ogive or frequency distribution curve. Other things to be noted include:
(i) The number of classes should not be less than six or more than twenty.
(ii) The range should be found, that is, the numerical difference between the largest and
the least figure in the data.
(iii) Class intervals are usually in multiples of five.
(iv) A class interval with an odd number of units is easier to work with than one with an even
number, because the midpoint of an interval with an odd number of units is an integer.
Uses of Frequency Table
(i) It allows required figures to be located easily and quickly.
(ii) It allows comparisons to be made easily between different classes of the
group.
(iii) It reveals patterns within the figures which cannot be seen in the ordinary
form.
(iv) It allows for easier data analysis and interpretations.
General rules for forming a Frequency Distribution Table
(i) Determine the range, that is, the difference between the smallest and the largest
number.
(ii) Determine the number of classes, that is, the range divided by the class size (rounded up to a whole number).
(iii) Find the upper class limit using U1 = L1 + C - 1 (for discrete data), U1 = L1 + C -
0.1 (for data with one decimal place), U1 = L1 + C - 0.01 (for data with two
decimal places) and, in general, U1 = L1 + C - 10^(-k) (for data with k decimal places).
Example: Given the following data on fifty electric lamps
16 13 7 21 23 28 23 1 20 30 29 18 18 33 20 11 23 8 29 20 12 27 25 18 22 16 21 20
24 13 34 23 17 26 2 20 17 15 39 22 34 5 21 35 23 28 17 16 26 24.
Using the minimum value in the data as the lower limit of the first class and 6 as the
class size, form a frequency distribution table for the data above.
Solution:
Class size (C) = 6, Range, R = 39 – 1 = 38, Number of classes = R/C = 38/6 = 6.33, rounded up to 7
The Frequency table is formed thus
Table I: Frequency distribution table of fifty Electric Lamps
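The class-limit rule U1 = L1 + C - 10^(-k) and the tallying step can be sketched in Python. This is only an illustrative sketch: the function names `class_limits` and `grouped_frequencies` are made up for this example, and for brevity it tallies only the first ten lamp observations rather than all fifty.

```python
def class_limits(lowest, width, n_classes, decimals=0):
    """Return (lower, upper) class limits using U = L + C - 10**(-k)."""
    step = 10 ** (-decimals)      # 1 for whole numbers, 0.1 for one decimal place, ...
    limits = []
    lower = lowest
    for _ in range(n_classes):
        upper = round(lower + width - step, decimals)
        limits.append((lower, upper))
        lower = round(lower + width, decimals)
    return limits


def grouped_frequencies(data, limits):
    """Count the observations that fall inside each (lower, upper) class."""
    return [sum(1 for x in data if lower <= x <= upper) for lower, upper in limits]


# Lowest value 1, class size 6 and 7 classes, as in the example above.
limits = class_limits(lowest=1, width=6, n_classes=7)
sample = [16, 13, 7, 21, 23, 28, 23, 1, 20, 30]   # first ten lamp observations only
for (lower, upper), f in zip(limits, grouped_frequencies(sample, limits)):
    print(f"{lower:>2} - {upper:<2}  {f}")
```

Passing the full list of fifty observations as `sample` tallies the whole data set in the same way.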
Types of Diagrams: Bar chart, Pie chart, Graphs, Histogram, Polygon, Stem-and-leaf,
Box plot, Dot plot, etc.
Remark: While bar charts and pie charts are commonly used for categorical data, the others are
mostly used for quantitative data.
Bar Chart
A bar chart consists of rectangular bars, with each bar representing the frequency
(proportion, ratio, percentage, etc.) with which a value of a variable occurs.
Bar charts are divided into (i) Simple bar chart (ii) Component bar chart (iii) Multiple bar
chart.
We shall use the data below to show each of these charts.
Case Rates of some reported notifiable diseases by Sex in Nigeria (2010 - 2013)
             2010             2011             2012             2013
Disease      Male   Female    Male   Female    Male   Female    Male   Female
HIV/AIDS     13.7   24.4      12.3   23.4      5.9    6.3       10.2   13.0
Pneumonia    4.9    3.5       5.1    4.3       3.9    2.9       2.9    2.6
Malaria      66.9   59.9      70.3   62.1      68.6   73.3      56.8   56.8
Diarrhoea    12.0   9.0       9.3    7.2       6.1    4.0       22.3   19.5
Source: National Bureau of Statistics (NBS) Demographic Statistics Bulletin 2013.
[Figure: simple bar chart; vertical axis: Rates (0 to 100); horizontal axis: Years (2010 - 2013)]
Fig 1: A Simple Bar Chart showing case rate of Malaria in Nigeria (2010 - 2013)
Component Bar Chart: This shows the component parts which make up the total. It has
a limitation that the actual components represented cannot be read from the chart
directly. Thus, the difference between the top and the base has to be worked out before
any meaningful comparison can be made.
[Figure: component bar chart with one bar for M and one for F in each year, 2010 to 2013; vertical axis: Rates (0 to 120); horizontal axis: Sex (2010 - 2013); legend: HIV/AIDS, Pneumonia, Malaria, Diarrhoea]
Fig 2: Component Bar Chart showing case rates of some notifiable diseases by Sex in
Nigeria (2010 - 2013)
Multiple Bar Chart: This is used to compare changes in more than one variable. Here
the group of bars conveying information pertaining to a particular variable are joined
together.
[Figure: multiple bar chart with grouped bars for M and F in each year, 2010 to 2013; vertical axis: Rates (0 to 80); horizontal axis: Sex (2010 - 2013); legend: HIV/AIDS, Pneumonia, Malaria, Diarrhoea]
Fig 3: A Multiple Bar Chart showing case rates of some notifiable diseases by Sex in
Nigeria (2010 - 2013)
Pie Chart: This is also used to present collected data. It shows the component
parts of a total as slices (sectors) of a circle. The angle of each sector at the centre of the pie
is obtained by working out the proportion the component bears to the whole and multiplying by 360 degrees.
Table 1: New Jobs created in Q2, 2015 (All Sectors)
Sector               Frequency   Angular Sector (degrees)
Formal               51,070      (51070 / 141368) * 360 = 130
Informal             83,903      (83903 / 141368) * 360 = 214
Public Institution   6,395       (6395 / 141368) * 360 = 16
Total                141,368     360
[Figure: pie chart with sectors Formal (130.05 degrees), Informal (213.66 degrees) and Public Institution (16.29 degrees)]
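The angular sectors can be checked with a short Python sketch. The helper name `sector_angles` is an assumption made for this illustration; the frequencies are those of Table 1.

```python
def sector_angles(frequencies):
    """Angle of each pie sector: (frequency / total) * 360 degrees."""
    total = sum(frequencies.values())
    return {name: f / total * 360 for name, f in frequencies.items()}


jobs = {"Formal": 51070, "Informal": 83903, "Public Institution": 6395}
angles = sector_angles(jobs)
for name, angle in angles.items():
    print(f"{name}: {angle:.2f} degrees")
```

Rounding each angle to the nearest degree gives the whole-number sectors listed in the table.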
Steps
Step 1: Obtain a frequency (relative-frequency, percent) distribution of the data.
Step 2: Draw a horizontal axis on which to place the bars and a vertical axis on which to
display the frequencies (relative frequencies, percents).
Step 3: For each class, construct a vertical bar whose height equals the frequency
(relative frequency, percent) of that class.
Step 4: Label the bars with the classes, as explained in Definition 2.9, the horizontal
axis with the name of the variable, and the vertical axis with “Frequency” (“Relative
frequency,” “Percent”).
Dot-plots: Another type of graphical display for quantitative data is the dot-plot. Dot-plots
are particularly useful for showing the relative positions of the data in a data set or for
comparing two or more data sets. Procedure 2.6 presents a method for constructing a
dot-plot.
Steps
Step 1: Draw a horizontal axis that displays the possible values of the quantitative data.
Step 2: Record each observation by placing a dot over the appropriate value on the
horizontal axis.
Step 3: Label the horizontal axis with the name of the variable
Stem-and-Leaf Diagrams
With a stem-and-leaf diagram, we think of each observation as a stem, consisting of all
but the rightmost digit, and a leaf, the rightmost digit. In general, stems may use as
many digits as required, but each leaf must contain only one digit.
To Construct a Stem-and-Leaf Diagram
Step 1: Think of each observation as a stem, consisting of all but the rightmost digit, and
a leaf, the rightmost digit.
Step 2: Write the stems from smallest to largest in a vertical column to the left of a
vertical rule.
Step 3: Write each leaf to the right of the vertical rule in the row that contains the
appropriate stem.
Step 4: Arrange the leaves in each row in ascending order.
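The four steps can be sketched in Python as follows. The function name `stem_and_leaf` is hypothetical, the sketch assumes non-negative whole numbers, and the sample is the first ten values from the electric-lamp example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its sorted leaves (last digit)."""
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)    # Step 1: split into stem and leaf
        rows[stem].append(leaf)       # Step 3: put the leaf in its stem's row
    # Steps 2 and 4: stems from smallest to largest, leaves in ascending order
    return {stem: sorted(rows[stem]) for stem in sorted(rows)}


sample = [16, 13, 7, 21, 23, 28, 23, 1, 20, 30]
diagram = stem_and_leaf(sample)
for stem, leaves in diagram.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Stems with no observations are simply omitted in this sketch; a fuller version might print them with an empty row.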
Improper selection of the samples (especially when samples are selected with
some bias).
Substitution (where a sample unit is absent thus substituted with a mean value or
another value).
Faulty demarcation of the statistical unit (leading to overlapping, thereby causing
double counting).
Errors due to variability and wrong method of estimation.
Non-Sampling Error: This occurs when data are not properly observed, measured,
approximated or processed. Such errors are present in both censuses and sample
surveys. They can be avoided if adequate care and attention are given. Other
errors arise due to:
Incomplete questionnaire.
Defective method of sampling.
Personal bias of the investigator.
Lack of trained/qualified enumerators.
Failure to respond by the respondents.
Errors in compilation and tabulation.
b. Rounding and Truncation: Rounding data to fewer decimal places introduces minor
inaccuracies. Truncating data can lead to a loss of significant information.
Implications of Errors and Approximations
Improve Data Collection Methods, Increase Sample Size, Use Proper Sampling
Techniques, Address Non-Response Bias, Validate Assumptions, Quantify and
Communicate Uncertainty, Perform Sensitivity Analyses.
STATISTICAL MEASURES
The three common measures used to describe data features are: measures of central
tendency or location, measures of partition and measures of dispersion or variation.
Other measures are measures of shape: skewness and peakedness (kurtosis).
Arithmetic Mean
The arithmetic mean or briefly the mean is the most commonly used measure of
location and is obtained by adding all the values and dividing by the total number of
values or items.
Definition
Let x1, x2, ..., xn be the numerical measurements on a variable X from a sample of size n. The mean is
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Example 1
The mean of the measurements: 2, 6, 3, 2, 1, 2, 7, 5, 4, 1, 4, 0, 5, 2, 4
is
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{48}{15} = 3.2 \]
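For data arranged in a frequency distribution, where the value xi occurs fi times, the mean is
\[ \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \]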
Example 2
Find the mean of the following data:
14, 17, 15, 13, 18, 15, 16, 16, 15, 17, 17, 15, 17, 13, 17
Solution
The data can first be arranged in a frequency distribution table before applying the
formula, as follows
xi       13   14   15   16   17   18   Total
fi       2    1    4    2    5    1    15
fi xi    26   14   60   32   85   18   235
\[ \bar{x} = \frac{\sum_{i=1}^{6} f_i x_i}{\sum_{i=1}^{6} f_i} = \frac{235}{15} = 15.667 \]
Note: If the data are arranged in a grouped frequency distribution table the formula for
the mean remains the same as that of frequency distribution table except that now xi
becomes the class mark or class average for the ith group or class interval.
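Both forms of the mean can be checked with a few lines of Python. The function names below are chosen for this illustration only; the data are those of Example 1 and Example 2.

```python
def mean(values):
    """Arithmetic mean of raw data: sum of the values divided by their number."""
    return sum(values) / len(values)


def frequency_mean(xs, fs):
    """Mean of a frequency table: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)


example_1 = [2, 6, 3, 2, 1, 2, 7, 5, 4, 1, 4, 0, 5, 2, 4]
print(mean(example_1))                       # 48 / 15

xs = [13, 14, 15, 16, 17, 18]                # distinct values in Example 2
fs = [2, 1, 4, 2, 5, 1]                      # their frequencies
print(round(frequency_mean(xs, fs), 3))      # 235 / 15
```

For a grouped table the same `frequency_mean` applies with the class marks passed as `xs`.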
\[ H.M = \frac{n}{\sum (1/X_i)} \]
Example 2.2: From the given data calculate H.M 5, 10, 17, 24, 30
Solution:
X      5     10    17       24       30       Total
1/X    0.2   0.1   0.0588   0.0417   0.0333   0.4338
\[ H.M = \frac{n}{\sum (1/x)} = \frac{5}{0.4338} = 11.526 \]
Example 2.3: The marks scored by some students of a class are given below. Calculate
the H.M
Marks             20   21   22   23   24   25
No. of Students   4    2    7    1    3    1
Soln:
Marks (x)   f    1/x      f(1/x)
20          4    0.0500   0.2000
21          2    0.0476   0.0952
22          7    0.0454   0.3178
23          1    0.0435   0.0435
24          3    0.0417   0.1251
25          1    0.0400   0.0400
Total       18            0.8216
\[ H.M = \frac{N}{\sum f_i (1/x_i)} = \frac{18}{0.8216} = 21.91 \]
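A minimal Python sketch of the harmonic mean for raw data and for a frequency table is given below. The function names are assumptions made for this illustration. The code does not round the reciprocals, so its results may differ slightly in the second or third decimal place from hand calculations that round each reciprocal to four places.

```python
def harmonic_mean(values):
    """H.M = n / sum(1/x) for raw (non-zero) observations."""
    return len(values) / sum(1 / x for x in values)


def harmonic_mean_freq(xs, fs):
    """H.M = N / sum(f/x) for a frequency table, with N = sum(f)."""
    return sum(fs) / sum(f / x for x, f in zip(xs, fs))


print(harmonic_mean([5, 10, 17, 24, 30]))          # data of Example 2.2

marks = [20, 21, 22, 23, 24, 25]                   # data of Example 2.3
students = [4, 2, 7, 1, 3, 1]
print(harmonic_mean_freq(marks, students))
```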
The harmonic mean is the most suitable average when it is desired to give greater weight to
smaller observations and less weight to larger ones. Some of its demerits are:
i. It is not easily understood
ii. It is difficult to compare
Geometric Mean
The geometric mean of a series containing n observations is the nth root of the product of
the values. If X1, X2, ..., Xn are the observations, then:
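\[ G.M = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n} \]
\[ \log G.M = \frac{\sum \log x_i}{n} \]
\[ G.M = \text{Antilog}\left(\frac{\sum \log x}{n}\right) \]
\[ G.M = \text{Antilog}\left(\frac{13.5107}{5}\right) = \text{Antilog}(2.7021) = 503.6 \]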
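The logarithmic form of the geometric mean can be sketched in Python as below. The function name `geometric_mean` and the two-value data set are hypothetical; the last line only reproduces the antilog step Antilog(2.7021) from the worked figures.

```python
import math


def geometric_mean(values):
    """Antilog of the mean of the base-10 logarithms (positive values only)."""
    return 10 ** (sum(math.log10(x) for x in values) / len(values))


# Hypothetical data: the nth-root form gives sqrt(2 * 8) = sqrt(16).
print(geometric_mean([2, 8]))

# Antilog(2.7021), i.e. 10 raised to the power 2.7021.
print(round(10 ** 2.7021, 1))
```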
Merits of Geometric Mean
1. It is suitable for averaging ratios, rates and percentages.
2. Unlike the arithmetic mean (A.M), it is less affected by the presence of extreme
values.
2. It is difficult to calculate particularly when the items are very large or when there
is frequency distribution.
\[ \bar{X} = \frac{W_1 X_1 + W_2 X_2 + \cdots + W_k X_k}{W_1 + W_2 + \cdots + W_k} = \frac{\sum W_i X_i}{\sum W_i} \]
is called the weighted arithmetic mean.
Example 2.1: If a final exam in a course is weighted 3 times as much as a quiz and a
student has a final exam grade of 85 and quiz grades of 70 and 90, calculate the
weighted mean grade.
\[ \bar{X} = \frac{(1 \times 70) + (1 \times 90) + (3 \times 85)}{1 + 1 + 3} = \frac{415}{5} = 83 \]
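The same calculation as a small Python sketch (the function name `weighted_mean` is an assumption for this illustration):

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(W * X) / sum(W)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


grades = [70, 90, 85]     # two quizzes, then the final exam
weights = [1, 1, 3]       # the final exam counts three times as much as a quiz
print(weighted_mean(grades, weights))
```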
Weighted mean is mainly used in
a. Construction of index numbers
b. Computation of standardized death and birth rates in demographic studies.
If the arithmetic means and the number of items in two or more related groups
are known, the combined or composite mean of the entire group can be obtained by
\[ \text{combined mean } \bar{X} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} \]
One major advantage of combined arithmetic mean is that we can determine the overall
mean of a combined data without going back to the original data.
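Example 2.5: Find the combined mean for the data given below.
\[ n_1 = 20, \quad \bar{x}_1 = 4, \quad n_2 = 30, \quad \bar{x}_2 = 3 \]
\[ \text{Combined mean } \bar{X} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} = \frac{20 \times 4 + 30 \times 3}{20 + 30} = \frac{80 + 90}{50} = \frac{170}{50} = 3.4 \]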
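As a sketch, the combined mean extends naturally to any number of groups. The function name `combined_mean` is hypothetical; the inputs are the group sizes and group means of Example 2.5.

```python
def combined_mean(sizes, means):
    """Combined mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


print(combined_mean([20, 30], [4, 3]))
```

Only the group sizes and group means are needed, which is the advantage noted above: the original observations are never consulted.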
THE MEDIAN
The median of a set of measurements or items is usually the middle value or item after
the measurements or items have been arranged in their order of magnitude. If there is
even number of measurements or values, two values will be in the middle and the
median in this case will be obtained as an average of the two middle values.
Definition
The median is the observation that occupies the middle position when the observations
are arranged in ascending or descending order of magnitude.
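Let the ordered observations be \( X_{(1)} \le X_{(2)} \le X_{(3)} \le \cdots \le X_{(n)} \). Then
\[ \text{Median} = X_{\left(\frac{n+1}{2}\right)} \quad \text{if } n \text{ is odd} \]
\[ \text{Median} = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2} \quad \text{if } n \text{ is even} \]
Example 3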
Find the median value of the set of measurements: 0, 2, 1, 4, 2, 2, 4, 1, 3, 5,
4, 2, 7, 6, 5.
Solution
Arranging in order of magnitude gives:
Observations:  0   1   1   2   2   2   2   3   4   4   4   5   5   6   7
Rank:          1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
\[ \text{Median} = X_{\left(\frac{n+1}{2}\right)} = X_{\left(\frac{15+1}{2}\right)} = X_{(8)} = 3 \]
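The odd and even cases of the definition can be sketched in Python (the function name `median` is chosen for this illustration; Python's standard `statistics.median` behaves the same way):

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # position (n + 1) / 2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2   # positions n/2 and n/2 + 1


example_3 = [0, 2, 1, 4, 2, 2, 4, 1, 3, 5, 4, 2, 7, 6, 5]
print(median(example_3))
```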
Median for grouped data
For grouped data, the median is obtained by:
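\[ \text{Median} = L_1 + \left( \frac{\frac{N}{2} - (\sum f)_1}{f_{median}} \right) C \]
Where,
L1 = lower class boundary of the median class
N = number of items in the data (i.e. the total frequency)
(Σf)1 = sum of the frequencies of all classes below the median class
fmedian = frequency of the median class
C = size (width) of the median class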
Frequency 5 12 24 35 18 7 101
Steps:
Step 1: Find Cumulative Frequencies
Step 2: Find N/2
Step 3: Locate in the cumulative frequencies the first value greater than N/2. The
class it belongs to is the median class.
Class Interval   Class Boundaries   F     CF
10 – 19          9.5 – 19.5         5     5
20 – 29          19.5 – 29.5        12    17
30 – 39          29.5 – 39.5        24    41
40 – 49          39.5 – 49.5        35    76   * Median class
50 – 59          49.5 – 59.5        18    94
60 – 69          59.5 – 69.5        7     101
\[ \text{Median} = 39.5 + \left(\frac{\frac{101}{2} - 41}{35}\right) \times 10 = 39.5 + \frac{9.5}{35} \times 10 = 39.5 + 2.71 = 42.21 \]
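The interpolation for the grouped median can be sketched in Python. The function name `grouped_median` and its argument layout are assumptions for this illustration; the boundaries and frequencies are those of the table above.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median = L1 + ((N/2 - cumulative frequency below) / f_median) * C."""
    n = sum(frequencies)
    cumulative = 0
    for lower, f in zip(lower_boundaries, frequencies):
        if cumulative + f > n / 2:        # first cumulative frequency greater than N/2
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must contain at least one observation")


boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
freqs = [5, 12, 24, 35, 18, 7]
print(round(grouped_median(boundaries, freqs, 10), 2))
```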
MODE
A sample mode is a measurement which occurs most frequently in the sample. (It is
possible for two or more modes to exist in one sample if two or more different
measurements tie as most frequent). For ungrouped data or series of individual
observations, mode is often found by mere inspection.
Example 6.0
Find the mode of the set of observations below: 2, 7, 10, 17, 8, 10, 2
The values 2 and 10 each occur twice, more often than any other value, so the modes are 2 and 10.
In some cases the mode may be absent while in some cases there may be more than
one mode.
Example: 6.1:
1. 12, 10, 15, 24, 30 (no mode)
2. 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10. Here the modes are 7 and 10; hence the
set of data is said to be bimodal.
Example 7
Consider the following data and obtain the mode:
Number of children               0   1   2   3   4   5   6   7
Number of families (frequency)   1   2   4   1   3   2   1   1
Mode = 2 since more families have two children than any other specified number of
children.
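\[ \text{Mode } \tilde{X} = L_1 + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) C, \]
where,
L1 = lower class boundary of the modal class
C = size (width) of the modal class
Δ1 = f1 – f0
Δ2 = f1 – f2
f1 = freq. of the modal class
f0 = freq. of the class preceding the modal class
f2 = freq. of the class following the modal class
Hence the expression for the mode of grouped data can be written as
\[ \tilde{X} = L_1 + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) C \]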
Example 8
Consider the grouped data in example 5. Obtain the mode.
Solution
Class Interval   Class Boundaries   F
10 – 19          9.5 – 19.5         5
20 – 29          19.5 – 29.5        12
30 – 39          29.5 – 39.5        24
40 – 49          39.5 – 49.5        35   * Modal class
50 – 59          49.5 – 59.5        18
60 – 69          59.5 – 69.5        7
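The remaining arithmetic for this example can be sketched in Python using Mode = L1 + ((f1 - f0) / (2 f1 - f0 - f2)) C. The function name `grouped_mode` is hypothetical, and the sketch assumes a single modal class (if several classes tie, it uses the first).

```python
def grouped_mode(lower_boundaries, frequencies, width):
    """Mode = L1 + (f1 - f0) / (2*f1 - f0 - f2) * C for grouped data."""
    i = frequencies.index(max(frequencies))                       # modal class
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0                       # class before it
    f2 = frequencies[i + 1] if i + 1 < len(frequencies) else 0    # class after it
    return lower_boundaries[i] + (f1 - f0) / (2 * f1 - f0 - f2) * width


boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
freqs = [5, 12, 24, 35, 18, 7]
print(round(grouped_mode(boundaries, freqs, 10), 2))
```

Here L1 = 39.5, f1 = 35, f0 = 24, f2 = 18 and C = 10, so the printed value is 39.5 + (11/28) × 10.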
MEASURES OF PARTITION
Quartiles, Deciles and Percentiles
If a set of data is arranged in order of magnitude, the middle value (or arithmetic mean
of the two middle values) which divides the set into two equal parts is the median. By
extending this idea, we can think of those values which divide the set into four equal
parts. These values, denoted by Q1, Q2, Q3, are called the first, second and third quartiles
respectively, the value Q2 being equal to the median. Similarly, the values which divide
the data into ten equal parts are called deciles and are denoted by D1, D2, ..., D9, while the
values dividing the data into one hundred equal parts are called percentiles and are
denoted by P1, P2, P3, ..., P99. The 5th decile and the 50th percentile correspond to the
median. The 25th and 75th percentiles correspond to the 1st and the 3rd quartiles
respectively. Collectively, quartiles, deciles, percentiles and other values obtained
by equal subdivision of the data are called quantiles.
Interquartile Range: \( Q_3 - Q_1 \)
Semi Interquartile Range: \( \frac{Q_3 - Q_1}{2} \)
Illustration
Using our data on fifty electric lamps, we compute as follows
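Third Quartile: with N = 50, 3N/4 = 37.5, which falls in the class 25 – 30 (frequency 10, cumulative frequency 36 below it).
\[ Q_3 = LCB_{q_3} + \left(\frac{\frac{3N}{4} - \sum f_{bq_3}}{f_{q_3}}\right) C = 24.5 + \left(\frac{37.5 - 36}{10}\right) \times 6 = 25.4 \]
Third Decile: 3N/10 = 15, which falls in the class 13 – 18 (frequency 13, cumulative frequency 7 below it).
\[ D_3 = LCB_{d_3} + \left(\frac{\frac{3N}{10} - \sum f_{bd_3}}{f_{d_3}}\right) C = 12.5 + \left(\frac{15 - 7}{13}\right) \times 6 = 16.1923 \]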
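Quartiles, deciles and percentiles of grouped data all use the same interpolation, which can be sketched in one Python function. The name `grouped_quantile` and its parameters are assumptions for this illustration, and the frequencies (3, 4, 13, 16, 10, 3, 1) are taken from the lamp frequency table used in the dispersion example of these notes.

```python
def grouped_quantile(lower_boundaries, frequencies, width, k, parts):
    """k-th of `parts` equal divisions: parts=4 quartiles, 10 deciles, 100 percentiles."""
    target = k * sum(frequencies) / parts          # e.g. 3N/4 for the third quartile
    cumulative = 0
    for lower, f in zip(lower_boundaries, frequencies):
        if f > 0 and cumulative + f >= target:     # class containing the quantile
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("k must not exceed parts")


boundaries = [0.5, 6.5, 12.5, 18.5, 24.5, 30.5, 36.5]   # lower class boundaries
freqs = [3, 4, 13, 16, 10, 3, 1]
q3 = grouped_quantile(boundaries, freqs, 6, 3, 4)       # third quartile
d3 = grouped_quantile(boundaries, freqs, 6, 3, 10)      # third decile
print(round(q3, 4), round(d3, 4))
```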
MEASURES OF DISPERSION
Dispersion is the second property which describes a set of data. Although the measures
of central tendency, i.e. the mean, median, mode and geometric mean, are useful clues
to the value of central items, they do not tell us how items are spread or dispersed
throughout the distribution. To get a clue to this spread, we need a measure of
dispersion, because two or more distributions may have the same value of central
tendency but differ greatly in dispersion. Thus, a measure of dispersion is a measure of
the degree to which numerical data tend to spread about an average value.
Measures of dispersion fall into two main categories, namely measures of absolute
dispersion and measures of relative dispersion.
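The variance (S2) and standard deviation (S) of ungrouped data are given as
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} \]
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} \]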
Computation of Variance and Standard Deviation for a grouped distribution: For
computing S2 and S from a grouped data, we can use either the definitional or the
computational method.
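The definitional formulae are given as
\[ S^2 = \frac{\sum f (X - \bar{X})^2}{n-1} \]
\[ S = \sqrt{\frac{\sum f (X - \bar{X})^2}{n-1}} \]
where n = Σf is the total frequency. The computational formulae are given as
\[ S^2 = \frac{C^2\left[\sum fu^2 - \frac{(\sum fu)^2}{n}\right]}{n-1} \]
\[ S = C\sqrt{\frac{\sum fu^2 - \frac{(\sum fu)^2}{n}}{n-1}} \]
Here u is the same transformed (coded) variable used when computing the mean by the short
cut method, and C is the usual class size.
Mean Deviation is given as
\[ MD = \frac{\sum f\,|x - \bar{x}|}{N} \]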
Example: Using the data on fifty electric lamps above and taking the fourth class as
origin, we compute the following:
Class Interval   f    x      fx      U    U^2   fU    fx^2       fU^2   f(x - x̄)^2   f|x - x̄|
1 – 6            3    3.5    10.5    -3   9     -9    36.75      27     834.6672     50.04
7 – 12           4    9.5    38      -2   4     -8    361        16     456.2496     42.72
13 – 18          13   15.5   201.5   -1   1     -13   3123.25    13     284.7312     60.84
19 – 24          16   21.5   344     0    0     0     7396       0      27.8784      21.12
25 – 30          10   27.5   275     1    1     10    7562.5     10     535.824      73.2
31 – 36          3    33.5   100.5   2    4     6     3366.75    12     532.2672     39.96
37 – 42          1    39.5   39.5    3    9     3     1560.25    9      373.2624     19.32
Total            50          1009               -11   23406.5    87     3044.88      307.2
Here f is the frequency, x is the class mark, and x̄ = 1009/50 = 20.18.
\[ S^2 = \frac{\sum f (X - \bar{X})^2}{n-1} = \frac{3044.88}{49} = 62.1404 \]
\[ S = \sqrt{\frac{\sum f (X - \bar{X})^2}{n-1}} = \sqrt{62.1404} = 7.8829 \]
Variance can also be computed using another form of the definitional formula as
\[ S^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n-1} = \frac{23406.5 - \frac{1009^2}{50}}{49} = \frac{23406.5 - 20361.62}{49} = \frac{3044.88}{49} = 62.1404 \]
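Variance using the computational formula is
\[ S^2 = \frac{C^2\left[\sum fu^2 - \frac{(\sum fu)^2}{n}\right]}{n-1} = \frac{6^2\left[87 - \frac{(-11)^2}{50}\right]}{49} = 62.1404 \]
The standard deviation is:
\[ S = C\sqrt{\frac{\sum fu^2 - \frac{(\sum fu)^2}{n}}{n-1}} = \sqrt{62.1404} = 7.8829 \]
The Mean Deviation is
\[ MD = \frac{\sum f\,|x - \bar{x}|}{N} = \frac{307.2}{50} = 6.144 \]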
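The grouped computations above can be cross-checked with a short Python sketch. The variable names are chosen for this illustration; the class marks and frequencies are those of the lamp table, with origin A = 21.5 (the fourth class mark) and class size C = 6 for the coded form.

```python
import math

marks = [3.5, 9.5, 15.5, 21.5, 27.5, 33.5, 39.5]    # class marks x
freqs = [3, 4, 13, 16, 10, 3, 1]                    # frequencies f

n = sum(freqs)
mean = sum(f * x for x, f in zip(marks, freqs)) / n

# Definitional formulae
ss = sum(f * (x - mean) ** 2 for x, f in zip(marks, freqs))   # sum of f(x - mean)^2
variance = ss / (n - 1)
std_dev = math.sqrt(variance)
mean_dev = sum(f * abs(x - mean) for x, f in zip(marks, freqs)) / n

# Computational (coded) formula with u = (x - A) / C
A, C = 21.5, 6
sum_fu = sum(f * (x - A) / C for x, f in zip(marks, freqs))
sum_fu2 = sum(f * ((x - A) / C) ** 2 for x, f in zip(marks, freqs))
variance_coded = C ** 2 * (sum_fu2 - sum_fu ** 2 / n) / (n - 1)

print(round(variance, 4), round(std_dev, 4), round(mean_dev, 3), round(variance_coded, 4))
```

Both routes to the variance give the same figure, which is a useful check on the hand-computed columns of the table.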
Moment, Skewness and Kurtosis
Moment
The rth moment of a variable X about a constant, say C, is defined as
\[ E(X - C)^r = \frac{\sum (X - C)^r}{n} \quad \text{for an ungrouped data set} \]
\[ E(X - C)^r = \frac{\sum f(X - C)^r}{n} \quad \text{for a grouped data set} \]
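Both definitions can be covered by one small Python function. This is a sketch: the name `moment` and its keyword arguments are assumptions, and the ungrouped data set [1, 2, 3] is hypothetical.

```python
def moment(values, r, c=0.0, freqs=None):
    """r-th moment about c; pass freqs for grouped data (values are then class marks)."""
    if freqs is None:
        freqs = [1] * len(values)          # ungrouped data: every value counted once
    n = sum(freqs)
    return sum(f * (x - c) ** r for x, f in zip(values, freqs)) / n


data = [1, 2, 3]                           # hypothetical ungrouped data
first = moment(data, 1)                    # first moment about 0 is the mean
second_central = moment(data, 2, c=first)  # second moment about the mean
print(first, second_central)
```

Note that the second moment about the mean divides by n, so it is slightly smaller than the variance S2 defined with n - 1.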
Skewness
Any set of data has a shape describing it. A sketch of the data on the graph would
usually give an insight about the distribution usually described as being symmetrical or
skewed. A distribution is symmetrical if there are no extreme values in a particular
direction so that low and high values balance each other. On the other hand, if the peak
lies to one or other side of the centre of the histogram, then the distribution is said to be
skewed, positively or negatively depending on the direction of the skewness. Skewness
can be measured and expressed as degree of skewness and also direction of
skewness. The degree of skewness as given by Karl Pearson is
\[ Sk = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \]
or
\[ Sk = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} \]
where Sk is the Pearson coefficient of skewness. Sk is a dimensionless quantity that
usually lies in the range -1 ≤ Sk ≤ 1. If Sk is negative (Sk < 0), it implies negative skewness (mean <
median < mode); a positive Sk (Sk > 0) implies positive skewness (mean > median >
mode), while a zero value of Sk (Sk = 0) implies symmetry (mean = median = mode).
Other measures of skewness are
(1) Quartile Coefficient of Skewness
\[ QCS = \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} \]
(2) (10 - 90) percentile coefficient of skewness
\[ PCS = \frac{P_{90} - 2P_{50} + P_{10}}{P_{90} - P_{10}} \]
Example: with mean = 55, median = 55.06 and standard deviation = 17.53,
\[ Sk = \frac{3(55 - 55.06)}{17.53} = -0.01 \]
Sk = -0.01 implies that the distribution is skewed to the left, although with a value this
close to zero the data set is nearly symmetrical.
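The two Pearson coefficients are simple enough to sketch in Python. The function names are assumptions for this illustration; the inputs are the mean, median and standard deviation of the example.

```python
def pearson_skewness_mode(mean, mode, std_dev):
    """Sk = (mean - mode) / standard deviation."""
    return (mean - mode) / std_dev


def pearson_skewness_median(mean, median, std_dev):
    """Sk = 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


sk = pearson_skewness_median(55, 55.06, 17.53)
print(round(sk, 2))
```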
Kurtosis
This is the degree of peakedness of a distribution, and it is usually discussed and
measured relative to the peakedness of the normal distribution. A distribution that is as peaked
as the normal distribution is said to be Mesokurtic. If a distribution is more peaked than
the normal distribution, it is said to be Leptokurtic. If the distribution is less peaked
than the normal distribution, it is said to be Platykurtic.
The value of the moment coefficient of kurtosis for every normal distribution is K = 3; for a platykurtic
distribution, K < 3, and for a leptokurtic distribution, K > 3.
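The percentile coefficient of kurtosis is
\[ K = \frac{\frac{1}{2}(Q_3 - Q_1)}{P_{90} - P_{10}} \]
For the normal distribution this percentile coefficient is approximately 0.263.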