Sta 111 (Introduction of Statistics)
By
Department of mathematics
1.0 Introduction
Statistics is the branch of science concerned with making effective use of numerical data
relating to groups of individuals or experiments. It deals with all aspects of such data,
including not only their collection, observation, recording, analysis and interpretation, but
also the planning of data collection, in terms of the design of surveys and experiments.
Statistics is a word derived from 'status' meaning 'state', 'condition'. The study of statistics
was popularized during the Roman Empire. In the days of the Roman Empire, statistics
was regarded as the science that studied the state population, economic resources, social
problems, politics etc. The theory of probability in statistics began in the 17th century and
provided the mathematical foundation for statistics.
1.1.1 Population: The word population is used to mean the entire set of individuals or
items of interest.
1.1.2 Sample: A subset of the population is called a sample; that is, a selected part
of the population.
In practice, the population of interest is often too large or too scattered to allow
measurements to be made on all the individuals. When a complete census is impossible, the
next option is to select a part of the population called a sample. The investigation is
thoroughly carried out on the sample of individuals, recording all relevant measurements and
information. Finally, the information gained from the sample, after all necessary
calculations, is used to make inferences concerning the population. Because the sample
information is not perfect, there is a risk of making incorrect inferences about the
population. Part of the method of statistical inference is to minimize such risks.
1.1.3 Example: Suppose that a medical researcher has developed a new vaccine for
HIV/AIDS. The population of interest may include not only those individuals who are
infected but also those who are at risk of contracting the disease in future. Thus it is clearly
impossible for the researcher to investigate the effect of the treatment on every individual in
the population. The statistical approach to this problem is to try the treatment on a
randomly selected sample of individuals and use the results to make inferences regarding
the overall effectiveness of the treatment.
Numerical measures are used to describe both populations and samples. The measures
used to describe populations are called parameters, while those that describe samples are
called statistics. Parameter values are usually unknown; however, values of statistics can
be calculated from the sample measurements. Using the calculated statistical values,
parameter values can be estimated. The use of calculated statistical values to estimate
unknown parameter values is an important part of statistical inference.
Note that the calculated value of a statistic serves as an estimate of the parameter, and
may not be equal to the parameter.
1.1.4 Example: The sample mean ($\bar{x}$) is a statistic that estimates the population
mean (µ), a parameter, and the sample proportion (p̂) is a statistic that estimates the
population proportion (P), a parameter.
1.1.5 Example: The campaign manager for a candidate say 'D' in a student union election
is interested in unknown parameter P, the proportion of students familiar with candidate
'D'. The school is large and the campaign manager does not have available the necessary
time or resources for a complete polling of the students' population. What might he do?
Solution:
The campaign manager might survey a randomly selected sample of students and
determine how many are familiar with candidate D. Suppose, for instance, that a sample of
200 students includes exactly 80 who are familiar with the candidate. Then the sample
proportion, $\hat{p} = \frac{80}{200} = 0.4$, is a statistic that can serve as an estimate of the
unknown population parameter P.
The objective of inferential statistics is to make inferences (i.e. draw conclusions, make
predictions, make decisions) about the characteristics of a population from information
contained in a sample.
In this case, inferences about data are made after the data are summarized and analysed.
These inferences may take the form of answering Yes/No questions about the data
(Hypothesis Testing), estimating numerical characteristics of the data (Estimation),
describing relationships within the data (Correlation), modelling relationships within the
data (Regression), and extrapolation, interpolation or other modelling techniques (like
ANOVA and Time Series).
1.2.3 Data: Data are facts that come up as a result of statistical inquiries or surveys.
Information like age, weight, prices of commodities, and the number of people and houses
in a community are in fact statistical data. Data are the important information needed in
statistics.
1.3 Sources of Statistical Data:
The data we collect for any statistical study may be classified into qualitative or
quantitative data. Qualitative data are those used for describing characteristics which
cannot be defined in numerical terms. For example colour of hair, colour of eyes,
defective or non-defective, performance graded as excellent, good, average, poor. These
characteristics are called attributes. Qualitative data may be divided into nominal or
ordinal data. Ordinal data have an inherent ordering while nominal data do not have any
ordering.
Quantitative data are data that are capable of numerical description. Examples include
data on height of students in meters, wages of workers in naira, scores of students in
percentages etc. Quantitative data are divided into discrete and continuous data. Discrete
data take on only a finite or countable (equivalent to the set of integers) number of values,
for example the number of children in a household, the number of visits to a doctor in a
year, etc. Discrete data are integers or fractions. Continuous data are data that can take on any real value in
an interval or over the whole real number line. For example age, height, heart rate, blood
pressure, etc.
Data collected by the investigator personally for the purpose of the investigation are called
primary data, while data collected from already published sources are called secondary
data.
Disadvantages of this method include the expensive cost of conducting the survey, the
time consuming nature of the survey, the informants may refuse to respond to questions
and the absence of respondents in their homes.
1.6.1 Example: The data below shows the distribution of students in 11 Departments at
Federal University Lafia in the 2011/2012 academic session.
1.6.2 Definition: A pie chart consists of a circle divided into sectors, where the angle of
each sector is proportional to the size of the data it represents.
1.6.3 Example: To draw the pie chart of Example 1.6.1, the following computations are
carried out:
Biology = $\frac{38}{247} \times 360^\circ \approx 55^\circ$, Chemistry = $\frac{15}{247} \times 360^\circ \approx 22^\circ$
1.6.3 Definition: A bar chart consists of rectangular bars which may be vertical or
horizontal and are proportional to the magnitude under consideration.
1.6.4 Example: To draw the bar chart of Example 1.6.1 each bar is proportional to the
magnitude considered.
[Figure: bar chart; the vertical axis runs from 0 to 45.]
Fig. 1.2 Bar chart of the enrolment of students in Departments at Federal University Lafia
in the 2011/2012 academic session.
Sometimes a pie chart can be drawn using angles as shown in the example below.
1.6.5 Example: A family shared out available money of N400 for the month as follows:
Food N180
School fees N100
Health care N50
Rent N60
Incidentals N10
(i) Represent the above information in angles on a pie chart.
(ii) What percentage of the income is given to school fees?
[Figure: pie chart with the following sector angles]
Food (162°)
School fees (90°)
Rent (54°)
Health care (45°)
Incidentals (9°)
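The sector angles of Example 1.6.5 can be checked with a short script. This is only an illustrative sketch: the function names `pie_angles` and `percentage_share` are assumptions introduced here and are not part of the course text.

```python
def pie_angles(amounts):
    """Return the sector angle (in degrees) for each category.

    Each angle is the category's share of the total multiplied by 360.
    """
    total = sum(amounts.values())
    return {name: value / total * 360 for name, value in amounts.items()}


def percentage_share(amounts, name):
    """Return the percentage of the total taken by one category."""
    total = sum(amounts.values())
    return amounts[name] / total * 100


# Figures from Example 1.6.5 (monthly budget of N400).
budget = {
    "Food": 180,
    "School fees": 100,
    "Health care": 50,
    "Rent": 60,
    "Incidentals": 10,
}

angles = pie_angles(budget)
school_fees_pct = percentage_share(budget, "School fees")

for name, angle in angles.items():
    print(f"{name}: {angle:.0f} degrees")
print(f"School fees share: {school_fees_pct:.0f}%")
```

For part (ii), school fees take 100 out of 400, that is 25% of the income.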
1.6.6 Raw Data: Data that have been collected but not organized numerically are called raw data.
1.6.7 Example: The scores of 20 students in a statistics examination are shown below: 52,
14, 23, 53, 21, 13, 27, 17, 44, 74, 91, 92, 19, 48, 63, 80, 70, 64, 50, 57.
Solution : 13, 14, 17, 19, 21, 23, 27, 44, 48, 50, 52, 53, 57, 63, 64, 70, 74, 80,
91, 92.
Class       Tally      Frequency
1 - 20      ||||       4
21 - 40     |||        3
41 - 60     ||||| |    6
61 - 80     |||||      5
81 - 100    ||         2
Data organized and summarized as in the above table is called grouped data.
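The tallying step can be sketched in code. The sketch below counts how many of the 20 scores of Example 1.6.7 fall in each class; the function name `grouped_frequencies` is an assumption introduced for illustration.

```python
def grouped_frequencies(scores, class_limits):
    """Count the scores falling within each (lower, upper) class limit pair."""
    return [
        sum(1 for score in scores if lower <= score <= upper)
        for lower, upper in class_limits
    ]


# Scores from Example 1.6.7.
scores = [52, 14, 23, 53, 21, 13, 27, 17, 44, 74,
          91, 92, 19, 48, 63, 80, 70, 64, 50, 57]

# Class limits used in the frequency table.
classes = [(1, 20), (21, 40), (41, 60), (61, 80), (81, 100)]

frequencies = grouped_frequencies(scores, classes)

for (lower, upper), count in zip(classes, frequencies):
    print(f"{lower} - {upper}: {count}")
```

The frequencies should add up to the number of observations, which is a quick check that no score has been missed or counted twice.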
1.6.9 Class Mark: The arithmetic mean of the upper and lower class limits or the upper
and lower class boundaries is called class mark.
1.6.11 Histogram
A histogram is made up of rectangles or bars of equal width with no space or gaps
between bars. The heights of rectangles or bars correspond to the class frequencies.
The class boundaries are marked on the horizontal axis while the frequencies are
marked on the vertical axis.
1.6.12 Example: Construct a histogram using the example on the scores of 20 students
in an examination.
[Figure: histogram of the scores; horizontal axis: class boundary (0.5, 20.5, 40.5, 60.5, 80.5, 100.5); vertical axis: frequency.]
1.6.13 Frequency Polygon: A line graph of frequency against class mark. It is obtained
by joining the midpoints of the tops of the rectangles of a histogram.
[Figure: cumulative frequency curve; horizontal axis: upper class boundaries (20.5, 40.5, 60.5, 80.5, 100.5); vertical axis: frequency (0 to 20).]
The graph of cumulative frequency against upper class boundaries is called Ogive.
EXERCISES
1. Consider the budget allocations of a Local Government Council
The notations for the arithmetic mean of unordered (raw) data and of frequency
distribution data are as follows:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$ for samples from raw data, and
$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum f_i x_i}{\sum f_i}$ for samples from discrete (ungrouped) and continuous
(grouped) distributions, where $x_i$ is the ith observation and $f_i$ is the corresponding ith
frequency, and $n$ and $\sum f_i$ are the total numbers of observations for raw data and frequency
distribution data respectively.
Example 2.1:
Find the arithmetic mean of the following ages (in months) of children brought for
immunization:
6, 12, 4, 16 and 2
Solution:
$\bar{x} = \frac{6 + 12 + 4 + 16 + 2}{5} = \frac{40}{5} = 8$.
Sometimes, when a set of data is large, there will be a need to form a frequency
table. The format of frequency tables for grouped and ungrouped frequency has
been discussed in chapter four. The method of obtaining their arithmetic mean is shown
in sections 6.1.1 and 6.1.2 of this chapter.
Table 2.1: The Frequency Distribution of Ages of Children who Died in a Motor Accident in
a Luxurious Bus.
Age (x)    1   2   3   4   5   6   7
Freq (f)   2   7   5   4   9   7   6
Solution
x       f     fx
1       2     2
2       7     14
3       5     15
4       4     16
5       9     45
6       7     42
7       6     42
Total   40    176
$\bar{x} = \frac{\sum fx}{\sum f} = 176/40 = 4.4$.
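Class       1-10   11-20   21-30   31-40   41-50   51-60
Frequency   6      6       12      11      10      5
Solution
Class    f     Class mark (x)   fx
1-10     6     5.5              33
11-20    6     15.5             93
21-30    12    25.5             306
31-40    11    35.5             390.5
41-50    10    45.5             455
51-60    5     55.5             277.5
Total    50                     1555
$\bar{x} = \frac{\sum fx}{\sum f} = \frac{1555}{50} = 31.1$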
The assumed mean method is adopted when the data under consideration consist
of large values. A number chosen arbitrarily from the data being considered, for the
purpose of calculating the arithmetic mean, is the Assumed Mean. It is generally
suggested that the assumed mean should be a number very close to the
middle score. If A denotes the Assumed Mean, then the Arithmetic Mean using an
Assumed Mean is calculated by
$\bar{x} = A + \frac{\sum (x_i - A)}{n}$
Example 2.3:
Using an assumed mean of 6 for the data in Example 2.1, find the mean of the distribution.
Solution
x       x - A
6       0
12      6
4       -2
16      10
2       -4
Total   10
$\bar{x} = 6 + \frac{10}{5} = 8$
Example 2.3(b)
Using the data in Table 2.1, obtain the arithmetic mean by taking the assumed mean to
be 5 years.
Solution
x       f     x - A   f(x - A)
1       2     -4      -8
2       7     -3      -21
3       5     -2      -10
4       4     -1      -4
5       9     0       0
6       7     1       7
7       6     2       12
Total   40            -24
$\bar{x} = 5 + \frac{-24}{40} = 4.4$
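Example 2.4:
Calculate the arithmetic mean of the grouped frequency distribution above (classes 1-10
to 51-60) using an assumed mean of 45.5.
Solution
x       f     d = x - A   u = d/C   fu
5.5     6     -40         -4        -24
15.5    6     -30         -3        -18
25.5    12    -20         -2        -24
35.5    11    -10         -1        -11
45.5    10    0           0         0
55.5    5     10          1         5
TOTAL   50                          -72
C = 15.5 - 5.5 = 10
$\bar{x} = A + C\left(\frac{\sum fu}{\sum f}\right) = 45.5 + 10\left(\frac{-72}{50}\right)$
= 45.5 - 14.4
= 31.1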
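The direct method and the assumed mean method can be compared in code. The following is a minimal sketch using the ages 1 to 7 and the frequencies 2, 7, 5, 4, 9, 7, 6 from the frequency distribution of ages; the function names are assumptions introduced for illustration.

```python
def frequency_mean(values, freqs):
    """Direct method: sum of f*x divided by sum of f."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)


def assumed_mean_method(values, freqs, assumed_mean):
    """Assumed mean method: A + sum of f*(x - A) divided by sum of f."""
    total_deviation = sum(f * (x - assumed_mean) for x, f in zip(values, freqs))
    return assumed_mean + total_deviation / sum(freqs)


ages = [1, 2, 3, 4, 5, 6, 7]
freqs = [2, 7, 5, 4, 9, 7, 6]

direct = frequency_mean(ages, freqs)
via_assumed = assumed_mean_method(ages, freqs, assumed_mean=5)

print(f"Direct method:       {direct:.3f}")
print(f"Assumed mean method: {via_assumed:.3f}")
```

Both routes give the same mean, because the assumed mean only shifts the origin of the calculation; any choice of A leads to the same answer.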
2.2 MEDIAN
Example 2.5:
Consider the following heights of trees in metres: 2.0, 3.5, 4.2, 3.7, 2.6 and
5.1. Obtain the median of the distribution.
Solution
Arranging the data in ascending order, we have 2.0, 2.6, 3.5, 3.7, 4.2 and 5.1.
There are six elements altogether; the median is therefore the average of the values in
the 3rd and 4th positions, (3.5 + 3.7)/2 = 3.6.
When the number of items is large, it may be necessary to use a method other than
counting from the left and right of an ordered array.
The median may be obtained by taking the observation that falls into the
$\left(\frac{N+1}{2}\right)$th position if N is odd, or the average of the two observations
that fall into the $\left(\frac{N}{2}\right)$th and $\left(\frac{N}{2} + 1\right)$th positions
if N is even, where N is the terminal cumulative frequency.
Example 2.6:
Number of trees 4 3 5 1 3 2
Cumulative frequency 4 7 12 13 16 18
Median = (18/2)th = 9th and (18/2+1)th = 10th positions. Hence from the cumulative
frequency column the Median is 3.0m
Where;
Example 2.7:
Use the data in Table 2.4 below to obtain the median of the distribution.
Table 2.4: Matches per Box and Corresponding Frequency of Randomly Selected
Number of Matches
Matches per box   39-41   42-44   45-47   48-50   51-53   54-56
Frequency         3       13      26      38      15      5
Solution
Class boundary   Frequency   Cumulative frequency
38.5 - 41.5      3           3
41.5 - 44.5      13          16
44.5 - 47.5      26          42
47.5 - 50.5      38          80
50.5 - 53.5      15          95
53.5 - 56.5      5           100
[Figure: cumulative frequency curve; horizontal axis: class boundary (41.5 to 56.5); vertical axis: cumulative frequency (0 to 120).]
Figure 2.1: Demonstration of Median from Cumulative Curve
• Unlike Arithmetic Mean, extreme values (outliers) do not affect the median.
• It is useful when comparing two or more sets of data, especially in
non-parametric tests.
• The median is easy to compute and easy to understand, as it does not
involve serious calculation.
• It can be obtained from graphs as discussed in section 6.2.3
• It is unique as it gives only one figure.
2.2.3.2 DISADVANTAGES
2.3 MODE
The mode is the most easily computed and the simplest to interpret of the measures of
central tendency. The mode of a given data set is the item which occurs most often in the
distribution. For data presented as a frequency distribution, the mode is the value that
has the highest frequency. Data with one mode are referred to as unimodal, data with two
modes as bimodal, and data with more than two modes as multimodal. However, if all items
are different there is no mode.
Example 2.8:
From the following data: 50, 45, 50, 25, 45, 45 and 30, the mode is 45, which occurs three times (unimodal).
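The mode of grouped data may be obtained simply by taking the class mark (the average
of the class limits or class boundaries) of the modal class. On the other hand, a more
exact value of the mode is obtained by interpolation or by a graphical method.
The interpolation method provides a single value that represents the
whole data set. It uses the formula
$\text{Mode} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)c$, where $L_{mo}$ is the lower class boundary of the
modal class, $\Delta_1$ is the frequency of the modal class minus the frequency of the class
before it, $\Delta_2$ is the frequency of the modal class minus the frequency of the class
after it, and $c$ is the class size.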
Example 2.9:
$\Delta_1 = 38 - 26 = 12$
$\Delta_2 = 38 - 15 = 23$
$c = 3$
$\text{Mode} = 47.5 + \left(\frac{12}{12 + 23}\right)3 = 48.5$
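The interpolation step of Example 2.9 can be sketched as a small function. The function name and parameter names below are assumptions introduced for illustration; the inputs are the lower class boundary of the modal class, the frequencies of the modal class and of its two neighbouring classes, and the class size.

```python
def interpolated_mode(lower_boundary, f_modal, f_before, f_after, class_size):
    """Mode of grouped data by interpolation within the modal class."""
    delta_1 = f_modal - f_before   # excess over the class before the modal class
    delta_2 = f_modal - f_after    # excess over the class after the modal class
    return lower_boundary + delta_1 / (delta_1 + delta_2) * class_size


# Figures used in Example 2.9 (modal class 47.5 - 50.5 with frequency 38).
mode = interpolated_mode(
    lower_boundary=47.5, f_modal=38, f_before=26, f_after=15, class_size=3
)
print(f"Mode = {mode:.1f}")
```

The larger the excess of the modal class over the class before it, relative to the class after it, the further the mode is pushed towards the upper boundary of the modal class.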
The mode can be read off from the histogram by the following steps
The geometric mean of a set of N positive observations is the Nth root of
their product. Let x1, x2, ..., xN be a set of positive numbers; then,
Example 2.10:
A sample of five batteries is tested and found to last for 2, 4, 3, 8 and 6 hours.
Find the geometric mean of the distribution.
Solution
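$\text{Geometric Mean (GM)} = \sqrt[N]{x_1^{f_1} \times x_2^{f_2} \times \cdots \times x_n^{f_n}} = \left(\prod_{i} x_i^{f_i}\right)^{1/N}$,
where $N = \sum f_i$; for raw data every $f_i = 1$, so that N is the number of observations.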
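The geometric mean of the battery data in Example 2.10 can be computed as follows. This is a minimal sketch: the function name `geometric_mean` and its optional `freqs` argument are assumptions introduced here, and `math.prod` requires Python 3.8 or later.

```python
import math


def geometric_mean(values, freqs=None):
    """Nth root of the product of the values, each raised to its frequency.

    With no frequencies supplied, every value is counted once.
    """
    if freqs is None:
        freqs = [1] * len(values)
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs positive observations")
    n = sum(freqs)
    product = math.prod(v ** f for v, f in zip(values, freqs))
    return product ** (1 / n)


# Battery lives (hours) from Example 2.10; their product is 1152.
gm = geometric_mean([2, 4, 3, 8, 6])
print(f"Geometric mean = {gm:.2f} hours")
```

For long data sets the product can become very large, so in practice the mean of the logarithms is often exponentiated instead; the direct product is kept here because it mirrors the definition.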
Score 1 2 3 4
Frequency 10 2 3 2
Solution
• The geometric mean is useful when the data contain only positive values
• It is useful in calculating relative values such as index numbers
• It is suitable for skewed distributions since taking the logarithm of the
observation makes it more symmetrical and the mean then becomes a good
measure of centre.
The harmonic mean (HM) of a set of N positive observations is the reciprocal of the
arithmetic mean of the reciprocals of observations
Example 2.12:
Find the harmonic mean of the following scores of students in a test, 5, 3, 10, 12 and 2
Solution
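$HM = \frac{5}{\frac{1}{5} + \frac{1}{3} + \frac{1}{10} + \frac{1}{12} + \frac{1}{2}} = \frac{5}{73/60} = \frac{300}{73} \approx 4.11$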
The formula for the Harmonic Mean (HM) of a frequency distribution is given as follows:
$\text{Harmonic Mean (HM)} = \frac{1}{\frac{\sum_i f_i (1/x_i)}{\sum_i f_i}} = \frac{\sum_i f_i}{\sum_i (f_i / x_i)}$
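The definition of the harmonic mean as the reciprocal of the arithmetic mean of the reciprocals translates directly into code. The sketch below uses the test scores of Example 2.12; the function name `harmonic_mean` and its optional `freqs` argument are assumptions introduced for illustration.

```python
def harmonic_mean(values, freqs=None):
    """Sum of frequencies divided by the sum of f/x.

    With no frequencies supplied, every value is counted once.
    """
    if freqs is None:
        freqs = [1] * len(values)
    if any(v <= 0 for v in values):
        raise ValueError("the harmonic mean needs positive observations")
    return sum(freqs) / sum(f / x for x, f in zip(values, freqs))


# Scores from Example 2.12.
hm = harmonic_mean([5, 3, 10, 12, 2])
print(f"Harmonic mean = {hm:.2f}")
```

The sum of the reciprocals here is 73/60, so the harmonic mean is 300/73, a little over 4.1.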
Example 2.13:
MEASURES OF PARTITION
3.0 INTRODUCTION
Other measures to be considered in this chapter are deciles and percentiles, which
split the observations into ten and hundred parts respectively.
3.1 QUARTILE
In descriptive statistics, a quartile is any of the three values which divide the sorted
data into four equal parts, so that each part represents one-fourth of the sample data. The
middle one is also called the median.
First quartile: designated $Q_1$; it cuts off the lowest 25% of the data and is the middle
value of the lower half. Second quartile: designated $Q_2$; it is the median of the
distribution, cuts the data set into two equal halves, and is the 50th percentile.
Third quartile: designated $Q_3$; it cuts off the highest 25% of the data set, or the lowest
75%, and is equivalent to the 75th percentile. It is the middle value of the upper half. Note
that the first and third quartiles are also referred to as the lower and upper quartiles
respectively. The difference between the two is known as the inter-quartile range.
Therefore our major concern in this section shall be $Q_1$ and $Q_3$. The formula for
calculating the quartiles for ungrouped data is similar to that of the median given in section 2.2.
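The first and third quartiles of an ungrouped frequency distribution are given as
follows:
First quartile ($Q_1$) = the value in the $\left(\frac{N+1}{4}\right)$th position, and
Third quartile ($Q_3$) = the value in the $3\left(\frac{N+1}{4}\right)$th position of the ungrouped data.
If there is an even number of data items, then we need the average of the data that
fall into the $\left(\frac{N}{4}\right)$th and $\left(\frac{N}{4} + 1\right)$th positions for $Q_1$, and the
$3\left(\frac{N}{4}\right)$th and $\left[3\left(\frac{N}{4}\right) + 1\right]$th positions for $Q_3$.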
Example 3.1: Find the lower and upper quartiles of the following set of data:
12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25, (65)
Solution
Arrangement of data: 5, 7, 12, 14, 15, 22, 25, 30, 36, 42, 53
Lower quartile $Q_1$ = 12 (3rd position)
Median $Q_2$ = 22 (6th position)
Upper quartile $Q_3$ = 36 (9th position)
If there is an even number of data items, as when the value in parentheses in Example 3.1 is included:
Arrangement: 5, 7, 12, 14, 15, 22, 25, 30, 36, 42, 53, 65
$Q_1 = \frac{12 + 14}{2} = 13$, $Q_2 = \frac{22 + 25}{2} = 23.5$, $Q_3 = \frac{36 + 42}{2} = 39$.
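The position rules used in Example 3.1 can be sketched as follows. This is a simplified illustration, not a general-purpose routine: it assumes that the positions come out as whole numbers (N + 1 divisible by 4 for an odd count, N divisible by 4 for an even count), and the function name `quartiles_by_position` is introduced here only for illustration.

```python
def quartiles_by_position(data):
    """Return (Q1, Q2, Q3) using the position rules for ungrouped data."""
    ordered = sorted(data)
    n = len(ordered)

    def value_at(position):
        # Positions in the notes are counted from 1, Python lists from 0.
        return ordered[position - 1]

    if n % 2 == 1:
        if (n + 1) % 4 != 0:
            raise ValueError("this sketch needs N + 1 to be divisible by 4")
        k = (n + 1) // 4
        return value_at(k), value_at(2 * k), value_at(3 * k)

    if n % 4 != 0:
        raise ValueError("this sketch needs N to be divisible by 4")
    k = n // 4

    def average_at(position):
        # Average of the values in this position and the next one.
        return (value_at(position) + value_at(position + 1)) / 2

    return average_at(k), average_at(2 * k), average_at(3 * k)


data = [12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25]
odd_case = quartiles_by_position(data)           # 11 observations
even_case = quartiles_by_position(data + [65])   # 12 observations
print(odd_case)
print(even_case)
```

When the positions are not whole numbers, statistical packages interpolate between neighbouring values, and different packages use slightly different conventions, so their quartiles may differ a little from hand calculations.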
The lower and upper quartiles of a grouped frequency distribution are given as follows:
Lower Quartile $Q_1 = L_{Q_1} + \left(\frac{N/4 - \sum f}{f_{Q_1}}\right)c$
Upper Quartile $Q_3 = L_{Q_3} + \left(\frac{3N/4 - \sum f}{f_{Q_3}}\right)c$
where $L_{Q_1}$ and $L_{Q_3}$ are the respective lower class boundaries of the lower and upper
quartile classes, $f_{Q_1}$ and $f_{Q_3}$ are the respective frequencies of the lower and upper
quartile classes, $\sum f$ is the cumulative frequency before the quartile class, and $c$ is the
class size (upper boundary - lower boundary).
Inter-quartile range = $Q_3 - Q_1$.
Example 3.2:
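Find the first and third quartiles of the data in Table 3.1; hence, obtain the
inter-quartile range for the distribution.
AGE         10-19   20-29   30-39   40-49   50-59
FREQUENCY   8       12      13      32      35
Solution
Boundary      Frequency   Cumulative frequency
9.5 - 19.5    8           8
19.5 - 29.5   12          20
29.5 - 39.5   13          33
39.5 - 49.5   32          65
49.5 - 59.5   35          100
For the first quartile: $\frac{N}{4} = \frac{100}{4}$ = 25th position, $L_{Q_1} = 29.5$, $\sum f = 20$, $f_{Q_1} = 13$, $c = 39.5 - 29.5 = 10$
$Q_1 = 29.5 + \left(\frac{25 - 20}{13}\right)10$
= 29.5 + 3.85
= 33.35
For the third quartile: $\frac{3N}{4}$ = 75th position, $L_{Q_3} = 49.5$, $\sum f = 65$, $f_{Q_3} = 35$
$Q_3 = 49.5 + \left(\frac{75 - 65}{35}\right)10$
= 49.5 + 2.86
= 52.36.
Inter-quartile range = $Q_3 - Q_1$ = 52.36 - 33.35 = 19.01.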
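The grouped-data calculation of Example 3.2 follows one pattern: locate the class containing the required position, then interpolate within it. The sketch below expresses that pattern as a single function; the name `grouped_quantile` and its `fraction` parameter are assumptions introduced for illustration.

```python
def grouped_quantile(boundaries, freqs, fraction):
    """Interpolated partition value of a grouped frequency distribution.

    boundaries: list of (lower, upper) class boundaries in ascending order
    freqs:      class frequencies
    fraction:   0.25 for Q1, 0.75 for Q3, 0.1 for the first decile, etc.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must lie strictly between 0 and 1")
    total = sum(freqs)
    target = fraction * total          # position of the required value
    cumulative = 0                     # cumulative frequency before the class
    for (lower, upper), f in zip(boundaries, freqs):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("no class contains the requested position")


# Age distribution used in Example 3.2.
boundaries = [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5), (39.5, 49.5), (49.5, 59.5)]
freqs = [8, 12, 13, 32, 35]

q1 = grouped_quantile(boundaries, freqs, 0.25)
q3 = grouped_quantile(boundaries, freqs, 0.75)
print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {q3 - q1:.2f}")
```

Because deciles and percentiles use the same interpolation with iN/10 or iN/100 in place of N/4, the same function can be called with fractions such as 0.1 or 0.9.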
3.5 DECILES
The measures which divide the data in a distribution into ten equal parts are called
deciles. Deciles are the percentiles that are multiples of 10. The first decile is the point
with 10% of the data below it and 90% of the data above it, while the ninth decile is the
point with 90% below it and 10% above it.
The first decile ($D_1$) is the (1/10)th, $D_2$ is the (1/5)th, $D_3$ is the (3/10)th, ..., and $D_9$ is
the (9/10)th point of the distribution.
These are the main points which divide a distribution into ten equal parts. The formula
for the deciles is generally given by
$D_i = L_{D_i} + \left(\frac{\frac{iN}{10} - \sum f_{D_i}}{f_{D_i}}\right)c$, $(i = 1, 2, \ldots, 9)$. The symbols are as explained
for quartiles.
3.3 PERCENTILES
Percentiles are measures of partition that divide the whole distribution into 100
equal parts. The procedure for determining percentiles is similar to that
for quartiles and deciles.
$\text{Percentiles } (P_i) = L_{P_i} + \left(\frac{\frac{iN}{100} - \sum f_{P_i}}{f_{P_i}}\right)c$
Example 3.3:
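Solution
(a) $D_i = L_{D_i} + \left(\frac{\frac{iN}{10} - \sum f_{D_i}}{f_{D_i}}\right)c$
$\frac{N}{10} = \frac{100}{10}$ = 10th position, $L_{D_1} = 19.5$, $\sum f_{D_1} = 8$, $f_{D_1} = 12$
First decile $D_1 = 19.5 + \left(\frac{10 - 8}{12}\right) \times 10 = 19.5 + 1.67 = 21.17$
$\frac{9N}{10} = \frac{9(100)}{10}$ = 90th position, $L_{D_9} = 49.5$, $\sum f_{D_9} = 65$, $f_{D_9} = 35$
$D_9 = 49.5 + \left(\frac{90 - 65}{35}\right) \times 10 = 49.5 + 7.14 = 56.64$
(b) $P_i = L_{P_i} + \left(\frac{\frac{iN}{100} - \sum f_{P_i}}{f_{P_i}}\right)c$
$\frac{N}{100} = \frac{100}{100}$ = 1st position, $L_{P_1} = 9.5$, $\sum f_{P_1} = 0$, $f_{P_1} = 8$
First percentile $P_1 = 9.5 + \left(\frac{1 - 0}{8}\right) \times 10 = 9.5 + 1.25 = 10.75$
$\frac{75N}{100} = \frac{75(100)}{100}$ = 75th position, $L_{P_{75}} = 49.5$, $\sum f_{P_{75}} = 65$, $f_{P_{75}} = 35$
$P_{75} = 49.5 + \left(\frac{75 - 65}{35}\right) \times 10 = 49.5 + 2.86 = 52.36$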
MEASURES OF DISPERSION
4.0 INTRODUCTION
Another feature of a set of data is its spread about an average. While measures of
central tendency are used to estimate the typical value of a data set, measures of
dispersion are important for describing the spread of the data, or its variation around a
central value. For instance, two distinct samples may have the same mean or median but
completely different levels of variability, or vice versa.
A proper description of a set of data should include both of these characteristics.
There are various methods that can be used to measure dispersion of a data set, each
with its own set of advantages and disadvantages.
4.1 RANGE
Range is the difference between the largest and smallest sample values. It is seen as
the distance from the highest to the lowest value in the set of numbers. For a grouped
frequency distribution, the range is the difference between the upper boundary of the
highest class and the lower boundary of the lowest class.
Examples 4.1:
Find the range of price of bag of rice in a selected market given below
This measure gives the length of the interval containing the middle (50%) of the
data. It is the difference between the upper and lower quartiles as discussed in chapter
seven
Example 4.2:
The mean deviation of a set of observations is the arithmetic mean of all absolute
deviations from the mean. It is a measure of the spread of the data about the mean. It is
obtained by finding the sum of the absolute values of the deviations from the mean
(changing all negative values to positive) and then dividing by the number of values.
Given the arithmetic mean of a set of data $x_1, x_2, \ldots, x_N$ to be $\bar{x}$, the mean deviation
for ungrouped data is
$M.D = \frac{\sum |x_i - \bar{x}|}{N}$, where $|x_i - \bar{x}|$ is the absolute value (taking all signs to be positive) of each
deviation from the mean.
Example 4.3:
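Find the mean deviation of the following heights of plants in an agricultural science
laboratory: 2, 5, 6, 7, 7 and 9 cm.
Solution
$\bar{x} = \frac{2 + 5 + 6 + 7 + 7 + 9}{6} = \frac{36}{6} = 6$
Mean deviation = $\frac{|2 - 6| + |5 - 6| + |6 - 6| + |7 - 6| + |7 - 6| + |9 - 6|}{6} = \frac{10}{6} \approx 1.7$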
Suppose we are given observations $x_1, x_2, \ldots, x_n$ with their respective frequencies
$f_1, f_2, \ldots, f_n$. The mean deviation for such data is given as:
Mean Deviation = $\frac{\sum f_i |x_i - \bar{x}|}{\sum f_i}$
Example 4.4.3:
No of spoiled sticks 6 3 4 8 2 7
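Solution
$x$     $f$    $fx$    $|x - \bar{x}|$   $f|x - \bar{x}|$
3       6      18      13                78
8       3      24      8                 24
13      4      52      3                 12
18      8      144     2                 16
23      2      46      7                 14
28      7      196     12                84
Total   30     480                       228
$\bar{x} = \frac{\sum fx}{\sum f} = \frac{480}{30} = 16$
Mean deviation = $\frac{\sum f|x - \bar{x}|}{\sum f} = \frac{228}{30} = 7.6$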
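The tabular calculation above (class marks 3, 8, ..., 28 with frequencies 6, 3, 4, 8, 2, 7) can be reproduced in a few lines. The function name `mean_deviation` is an assumption introduced for illustration.

```python
def mean_deviation(values, freqs):
    """Mean of the absolute deviations from the mean, weighted by frequency."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / total
    return sum(f * abs(x - mean) for x, f in zip(values, freqs)) / total


class_marks = [3, 8, 13, 18, 23, 28]
freqs = [6, 3, 4, 8, 2, 7]

md = mean_deviation(class_marks, freqs)
print(f"Mean deviation = {md:.1f}")
```

The absolute value is what prevents positive and negative deviations from cancelling; without it the weighted sum of deviations from the mean would always be zero.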
The variance may be viewed as an average of the squared distances of all observed
values from the mean (but not quite when it is computed from a sample, since we then
divide by n-1 rather than n).
If the variance is small, most of the sample values lie quite close to the sample
mean. However, if the variance is large, then the sample values lie rather far from the
sample mean.
Standard deviation measures the degree to which a set of values is
spread about its mean. If the value is large, then the values in the distribution are well
spread out about their mean; if it is small, they are clustered around it.
The variance is the arithmetic mean of the squares of the deviations of the
observations from the mean, and the standard deviation is the square root of the
variance.
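Given a set of observations $x_1, x_2, \ldots, x_N$ with mean $\bar{x}$, the variance and standard
deviation of ungrouped (raw) data are given as follows:
Variance $= \frac{\sum (x_i - \bar{x})^2}{N}$; this formula expands to $\frac{1}{N}\left(\sum x_i^2 - \frac{(\sum x_i)^2}{N}\right)$
S.D. $= \sqrt{\text{Variance}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$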
Example 4.4:
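Solution
Variance $= \frac{\sum (x_i - \bar{x})^2}{N} = \frac{28}{5} = 5.6$
Alternatively, using the second formula (it may be used without first obtaining the mean):
Variance $= \frac{1}{5}\left(2^2 + 5^2 + 7^2 + 7^2 + 9^2 - \frac{30^2}{5}\right)$
$= \frac{1}{5}(28) = 5.6$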
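Meanwhile, the procedure used in computing the variance and standard deviation is the
same for ungrouped and grouped frequency distribution data.
Variance $= \frac{\sum f(x - \bar{x})^2}{\sum f}$
By expansion, variance $= \frac{1}{\sum f}\left(\sum f x^2 - \frac{(\sum f x)^2}{\sum f}\right)$
and the standard deviation, being the square root of the variance, is then given as
S.D. $= \sqrt{\text{Variance}} = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}}$
or
S.D. $= \sqrt{\frac{1}{\sum f}\left(\sum f x^2 - \frac{(\sum f x)^2}{\sum f}\right)}$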
Example 4.5:
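Solution
$x$     $f$    $fx$    $(x - \bar{x})^2$   $f(x - \bar{x})^2$
3       6      18      169                 1014
8       3      24      64                  192
13      4      52      9                   36
18      8      144     4                   32
23      2      46      49                  98
28      7      196     144                 1008
Total   30     480                         2380
$\bar{x} = \frac{\sum fx}{\sum f} = \frac{480}{30} = 16$
Variance $= \frac{\sum f(x - \bar{x})^2}{\sum f} = \frac{2380}{30} = 79.33$
S.D. $= \sqrt{79.33} = 8.91$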
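The two forms of the variance for a frequency distribution (the definition and its expansion) can be checked against each other numerically. The sketch below assumes the divisor is the total frequency and uses the same class marks and frequencies as the table above; the function names are introduced here for illustration.

```python
import math


def frequency_variance(values, freqs):
    """Variance from the definition: sum of f*(x - mean)^2 over sum of f."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / total
    return sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / total


def frequency_variance_expanded(values, freqs):
    """Variance from the expanded form, which does not need the mean first."""
    total = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x ** 2 for x, f in zip(values, freqs))
    return (sum_fx2 - sum_fx ** 2 / total) / total


class_marks = [3, 8, 13, 18, 23, 28]
freqs = [6, 3, 4, 8, 2, 7]

variance = frequency_variance(class_marks, freqs)
variance_expanded = frequency_variance_expanded(class_marks, freqs)
std_dev = math.sqrt(variance)

print(f"Variance = {variance:.2f}, S.D. = {std_dev:.2f}")
```

The expanded form is convenient for hand calculation, but it subtracts two large numbers, so the definition is usually the safer choice on a computer when the values are large.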
• Since the variance is measured in the square of the units of the observations, it is
difficult to use it to compare the variation (spread) of two sets of data.
• It is relatively difficult to interpret the variance.
• The variance of constant observations (all observations having the same value) is zero.
• If the variance is small, most of the sample values lie quite close to the sample mean.
If the variance is large, most of the sample values lie rather far
from the sample mean.
• The standard deviation, being the square root of the variance, solves
the problem of squared units. Hence the standard deviation may be used
to compare the spread of two sets of data.
• It is used in further statistical tests or analyses, such as testing a difference of
location using the t or z test.
• It makes use of all the observations.
• As with the variance and the mean deviation, a small standard deviation means the
sample values lie close to the mean. The standard deviation is,
however, affected by the magnitude of, or a change in the unit of, the observations.
The Coefficient of Variation (C.V.) relates the variation within the sample values to their
magnitude. It corrects for differences in the magnitude of observations when comparing
spread. For example, consider the following sets of data on the prices of two commodities
in four markets
It can be observed from the two data sets that both the mean and the standard deviation
are different: (3.9, 0.9) for commodity 1 and (39, 9) for commodity 2. This appears to mean
that commodity 2 has a greater spread than commodity 1. However, the coefficient of
variation is used to correct for the magnitude in the variability of the two commodities.
The Coefficient of Variation (CV) is the ratio of the standard deviation (SD) to the mean
($\bar{x}$), i.e.
$C.V = \frac{SD}{\bar{x}}$
Example 4.6:
For commodity 1:
$CV = \frac{0.9}{3.9} = 0.23$
For commodity 2:
$CV = \frac{9}{39} = 0.23$
This indicates that the two data sets have equal variability.
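The comparison of the two commodities can be sketched in code using the means and standard deviations quoted above, (3.9, 0.9) and (39, 9). The function name `coefficient_of_variation` is an assumption introduced for illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """Ratio of the standard deviation to the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return std_dev / mean


cv_commodity_1 = coefficient_of_variation(0.9, 3.9)
cv_commodity_2 = coefficient_of_variation(9, 39)

print(f"Commodity 1: CV = {cv_commodity_1:.2f}")
print(f"Commodity 2: CV = {cv_commodity_2:.2f}")
```

Multiplying every price by 10 multiplies both the mean and the standard deviation by 10, so the ratio is unchanged; this is why the coefficient of variation is suitable for comparing data measured on different scales.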
5.0 INTRODUCTION
5.1 MOMENT
A moment of a set of values is the sum of a power of the values divided by the number
of values. In other words, moments are the expectations of the powers of the set of
values. They can be classified into two, namely raw and central moments.
Given a set of observations $x_1, x_2, \ldots, x_N$, the rth moment about the origin (zero) is
given by
$\mu'_r = \frac{\sum x_i^r}{N}$ for ungrouped (raw) data, and
$\mu'_r = \frac{\sum f_i x_i^r}{\sum f_i}$ for ungrouped/grouped frequency data.
Example 5.1:
Find the first, second and third moments about the origin for the following values: 2, 4,
5, 3 and 1
Solution
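One way to carry out this calculation is sketched below. The function name `raw_moment` and its optional `freqs` argument are assumptions introduced for illustration; the data are the five values of Example 5.1.

```python
def raw_moment(values, r, freqs=None):
    """rth moment about the origin: sum of f * x^r divided by sum of f.

    With no frequencies supplied, every value is counted once.
    """
    if freqs is None:
        freqs = [1] * len(values)
    return sum(f * x ** r for x, f in zip(values, freqs)) / sum(freqs)


values = [2, 4, 5, 3, 1]
first, second, third = (raw_moment(values, r) for r in (1, 2, 3))

print(f"First raw moment  = {first}")
print(f"Second raw moment = {second}")
print(f"Third raw moment  = {third}")
```

Here the sums of the values, their squares and their cubes are 15, 55 and 225, so dividing each by 5 gives 3, 11 and 45. The first raw moment is simply the arithmetic mean.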
Example 5.2
Find the first, second and third moments for the data in Table 9.2 below:
Time (min)   4   6   8   10   12
ATM          2   5   1   3    4
Solution
$x$     $f$    $fx$    $x^2$   $fx^2$   $x^3$   $fx^3$
4       2      8       16      32       64      128
6       5      30      36      180      216     1080
8       1      8       64      64       512     512
10      3      30      100     300      1000    3000
12      4      48      144     576      1728    6912
Total   15     124             1152             11632
First raw moment $\mu'_1 = \frac{\sum fx}{\sum f}$
= 124 / 15 = 8.27
Second raw moment $\mu'_2 = \frac{\sum fx^2}{\sum f}$
= 1152 / 15 = 76.8
Third raw moment $\mu'_3 = \frac{\sum fx^3}{\sum f}$
= 11632 / 15 = 775.47
5.2 CENTRAL MOMENT (MOMENT ABOUT THE MEAN)
Given a set of observations $x_1, x_2, \ldots, x_n$, the rth moment about the mean, or
rth central moment, is given by
$M_r = \frac{\sum (x_i - \bar{x})^r}{n}$ for ungrouped data, and
$M_r = \frac{\sum f_i (x_i - \bar{x})^r}{\sum f_i}$ for ungrouped and grouped frequency distribution data,
where $\bar{x}$ is the sample mean (the 1st raw moment), an unbiased estimate of the population mean.
Example 5.3: Obtain the 1st, 2nd and 3rd central moments of the data in Example 5.1.
Solution
$\bar{x} = 3$ from the first raw moment. From the table below,
$x$     $(x - \bar{x})$   $(x - \bar{x})^2$   $(x - \bar{x})^3$
2       -1                1                   -1
4       1                 1                   1
5       2                 4                   8
3       0                 0                   0
1       -2                4                   -8
Total   0                 10                  0
From the column totals of the table, we have
$\sum (x - \bar{x}) = 0$
First central moment $M_1 = \frac{0}{5} = 0$
$\sum (x - \bar{x})^2 = 10$
Second central moment $M_2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{10}{5} = 2$
$\sum (x - \bar{x})^3 = 0$
Third central moment $M_3 = \frac{\sum (x - \bar{x})^3}{n} = \frac{0}{5} = 0$
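Example 5.4: Obtain the first, second, third and fourth central moments of the data in Table 9.2.
Solution
$\bar{x} = \frac{\sum fx}{\sum f} = \frac{124}{15} = 8.27$
$\sum f(x - \bar{x}) = -0.05$
$M_1 = \frac{-0.05}{15} = -0.003 \approx 0$
$\sum f(x - \bar{x})^2 = 126.93$
$M_2 = \frac{126.93}{15} = 8.462$
$M_3 = \frac{\sum f(x - \bar{x})^3}{\sum f} = 0.519$
$\sum f(x - \bar{x})^4 = 1,597.0$
$\therefore M_4 = \frac{\sum f(x - \bar{x})^4}{\sum f} = 106.53$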
5.3 SKEWNESS
• The skewness for a normal distribution is zero and any symmetric data
should have a skewness near zero.
• Negative values of the skewness indicate data that are skewed left and
positive values of skewness indicate data that are skewed right.
• By skewed left, we mean that the left tail is longer than the right. Similarly,
skewed right means that the right tail is longer than the left. Some
measurements have a lower bound and are skewed right.
5.4 kurtosis:
Kurtosis is a measure of whether the data are peaked or flat relative to a normal
distribution. That is, data sets with high kurtosis tend to have a distinct peak
near the mean, decline rather rapidly, and have heavy tails. Data sets with low
kurtosis tend to have a flat top near the mean rather than a sharp peak. A
uniform distribution would be the extreme case.
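Example 5.5
Find the skewness and kurtosis for the data in Example 5.4, and hence interpret your
answers.
Solution
$S = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f - 1}}$
$S = \sqrt{\frac{126.93}{14}} = 3.01$
Therefore,
Skewness $= \frac{M_3}{S^3} = \frac{0.519}{27.27} = 0.02$. Since the value of the skewness is positive and close to
zero, the data are very slightly skewed to the right and can be taken as approximately symmetric.
Kurtosis $= \frac{M_4}{S^4} = \frac{106.53}{82.08} = 1.30$ (less than 3, the value for a normal distribution, so the
distribution is flatter than a normal distribution)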
6.2 Regression
Regression is a measure of the relationship between two or more variables, one
being the dependent variable or response and the others being independent variables
or predictors. The regression equation is generally given as follows
y = α + βx + e
Where,
y is the dependent variable or response
x is the independent variable or predictor
α is a constant value (the intercept)
β is the coefficient of the independent variable (the slope)
e is the error term, which is assumed to be independently and identically normally
distributed with mean 0 and variance σ². The model stated above is referred to as
the simple linear regression model. For the purpose of this course we shall restrict
ourselves to models of this form.
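Note that the parameters of the model, α and β, can be estimated using the least
squares method as follows.
From the model,
y = α + βx + e
e = y − α − βx
To minimise the error, we square the error terms and sum over them:
$\sum e^2 = \sum (y - \alpha - \beta x)^2$
Differentiating with respect to β and α and setting each derivative to zero,
$\frac{\partial \sum e^2}{\partial \beta} = -2\sum x(y - \alpha - \beta x) = 0$
$\sum xy = \alpha\sum x + \beta\sum x^2$ ...... (1)
$\frac{\partial \sum e^2}{\partial \alpha} = -2\sum (y - \alpha - \beta x) = 0$
$\sum y = n\alpha + \beta\sum x$ ...... (2)
Multiply equation 1 by n and 2 by $\sum x$; this gives
$n\sum xy = n\alpha\sum x + n\beta\sum x^2$ ...... (3)
$\sum x\sum y = n\alpha\sum x + \beta(\sum x)^2$ ...... (4)
Subtracting 4 from 3, we have
$n\sum xy - \sum x\sum y = n\beta\sum x^2 - \beta(\sum x)^2$, therefore
$\beta = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - (\sum x)^2}$ and
$\alpha = \frac{\sum y - \beta\sum x}{n} = \bar{y} - \beta\bar{x}$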
Example 6.3
Eight undergraduate students were surveyed in a study involving time spent
on the Internet and their grade point average (GPA). The results are shown
in Table below. x is the amount of time spent on the Internet weekly and y is
the GPA of the student.
Hours GPA
11 2.84
5 3.20
22 2.18
23 2.12
10 2.90
19 2.36
15 2.60
18 2.42
a. Fit a straight line regression to the data and give the values of α and β
b. What will be the GPA of a student who spends 40 hours on the
Internet weekly?
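Solution
a. $\beta = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - (\sum x)^2}$
$\beta = \frac{8(300.36) - 123(20.62)}{8(2169) - (123)^2} = -0.06$
$\alpha = \frac{\sum y - \beta\sum x}{n}$
$\alpha = \frac{20.62 - (-0.06)123}{8} = 3.5$
The model is therefore given as
$\hat{y} = 3.5 - 0.06x$
b. For a student who spends 40 hours weekly, $\hat{y} = 3.5 - 0.06(40) = 1.1$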
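The least squares estimates for Example 6.3 can be reproduced with a short script. This is a minimal sketch of the formulas for α and β given above; the function name `fit_line` is an assumption introduced for illustration, and no statistical library is used.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for the model y = alpha + beta * x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    beta = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    alpha = (sum_y - beta * sum_x) / n
    return alpha, beta


# Hours on the Internet per week and GPA, from Example 6.3.
hours = [11, 5, 22, 23, 10, 19, 15, 18]
gpa = [2.84, 3.20, 2.18, 2.12, 2.90, 2.36, 2.60, 2.42]

alpha, beta = fit_line(hours, gpa)
predicted_gpa_40 = alpha + beta * 40

print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
print(f"Predicted GPA at 40 hours = {predicted_gpa_40:.2f}")
```

Note that 40 hours lies well outside the observed range of 5 to 23 hours, so the prediction is an extrapolation and should be treated with caution.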
(1) It is used to show changes in the price of a single commodity over time
(2) It is used to know how the price level of a group of related commodities change
with time
(3) Index series serve as a basis of comparison
(4) It aids decision making
Example 7.1
Consider the prices of a fish for three years: 1990, 2000 and 2010
Year    1990   2000   2010
Price   20     30     80
Using 1990 as the base period, calculate the relative price index for 1990, 2000 and 2010.
This is to say that the prices of fish for 1990, 2000 and 2010 were 100%, 150% and 400%
respectively of what the price was in 1990. There was an increase of 50% and 300% respectively in 2000 and 2010.
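The price relatives of Example 7.1 are each year's price expressed as a percentage of the base year price. A minimal sketch follows; the function name `price_relatives` is an assumption introduced for illustration.

```python
def price_relatives(prices, base_year):
    """Express each year's price as a percentage of the base year price."""
    base_price = prices[base_year]
    return {year: price / base_price * 100 for year, price in prices.items()}


# Prices of fish from Example 7.1.
fish_prices = {1990: 20, 2000: 30, 2010: 80}

relatives = price_relatives(fish_prices, base_year=1990)
for year, index in relatives.items():
    print(f"{year}: {index:.0f}%")
```

The index for the base year is always 100, and the percentage change relative to the base year is the index minus 100.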
Example 7.2
Compute the arithmetic mean of price relatives index and the simple aggregate price index
of the following items using 2011 as the base period
Item     Year
         2011   2012
Maize    85     110
Beans    60     125
Solution
a. The mean of relatives index = $\frac{\sum (p_n / p_0)}{N} \times 100$
= 162.58%
This shows that prices have risen by 62.58%
b. Simple aggregate price index = $\frac{\sum p_n}{\sum p_0} \times 100$
= 157.14%
Under the weighted aggregate method, the aggregate is found for each period by
multiplying the prices by their respective weights, and the new total is
expressed as a percentage of the base total. Hence, in calculating the price index for a
group of commodities using the weighted aggregate method, the quantities of the
respective items serve as the weights. A weighted aggregate price index can be
constructed in two ways, namely Laspeyres' and Paasche's methods.
Where,
Il is price index
7.4 Paasche's Method: The difference between this and Laspeyres' formula is that
Paasche's formula uses the current year quantities $q_n$ as weights instead of the base
year quantities $q_0$. Paasche's formula for the weighted aggregate index is given as
follows:
$I_p = \frac{\sum p_n q_n}{\sum p_0 q_n} \times 100$
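The two weighted aggregate methods differ only in which quantities are used as weights, as the sketch below shows. The function names are assumptions introduced for illustration, the Laspeyres form (base year quantities as weights) is the standard one, and the prices and quantities are hypothetical figures made up for this sketch, not data from the course.

```python
def laspeyres_price_index(p0, pn, q0):
    """Weighted aggregate price index with base year quantities as weights."""
    return sum(p * q for p, q in zip(pn, q0)) / sum(p * q for p, q in zip(p0, q0)) * 100


def paasche_price_index(p0, pn, qn):
    """Weighted aggregate price index with current year quantities as weights."""
    return sum(p * q for p, q in zip(pn, qn)) / sum(p * q for p, q in zip(p0, qn)) * 100


# Hypothetical data for two items (illustration only).
base_prices = [2, 3]
current_prices = [3, 4]
base_quantities = [10, 5]
current_quantities = [8, 6]

laspeyres = laspeyres_price_index(base_prices, current_prices, base_quantities)
paasche = paasche_price_index(base_prices, current_prices, current_quantities)

print(f"Laspeyres index = {laspeyres:.1f}%")
print(f"Paasche index   = {paasche:.1f}%")
```

The two indices generally differ because buyers change the quantities they purchase when prices change, so the choice of weights affects the measured price increase.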
Solution
The Laspeyres' index of 153.4% shows that there has been an increase of 53.4% in the
price of the group of items in 1982 relative to the 1980 price levels, while by Paasche's
index of 149.4% this increase is 49.4%.
The Laspeyres' Quantity index indicates that there is a 21.9% increase in the quantity of
items purchased in 1982 over the 1980 level, while the Paasche's Quantity index indicates
that there is an 18.8% increase in the quantity of items purchased in 1982 over the 1980 level.