Copy of STATS NOTE Edited Upto Regression
Prepared by Prof.Balaram P. Sharma
MEANING OF STATISTICS
The word ‘Statistics’ is derived from the Latin word ‘status’ and the Italian word
‘statista’ (meaning ‘political state’). In earlier days statistics was regarded as a
science used to fulfil the needs of the state or administration.
2. COMPREHENSIVE DEFINITION:
3. OTHER DEFINITIONS:
# NARROW DEFINITION :
Mr. Webster defined statistics as the classified facts respecting the conditions of
the people in the given state.
#COMPREHENSIVE DEFINITION :
FUNCTIONS OF STATISTICS :
IMPORTANCE OF STATISTICS :
Statistics is useful –
LIMITATIONS OF STATISTICS :
Variable
An attribute that describes a person, place, thing or idea under study is known as a variable or
characteristic. A variable is usually denoted by X, Y, Xi, Yi etc. Examples of variables are the age of
the teachers in a college, marks obtained by students in a certain examination, height and weight
of people in a group, blood pressure, temperature (℃), rainfall (mm), education level, export
and import, caste and religion, hair color, eye color etc. The types of variables are: (a) qualitative
variables, (b) quantitative variables.
Qualitative variables (or categorical variables) take on values that are names or labels. Examples:
sex, place, country, nationality, religion, color of balls or marbles, breeds of cow, major tree
species in the Hilly region, major fruits in Gandaki province, corn produced in Narayani zone etc.
If the data of the variable can be expressed numerically, such variables are said to be quantitative
variables. Since quantitative variables represent a measurable quantity, such types of data are
said to be numerical data or quantitative data.
Examples: height (cm, m, ft or inch), weight, blood pressure, rainfall, room
temperature, cost, revenue, export, import, profit, age, income, death count etc. Quantitative
variables should have definite units. There are two types of quantitative variables.
[i]. Discrete variables: a variable which takes only countable values (or whole numbers) is known
as a discrete variable. Number of students, number of telephone calls in a day, number of vehicles
running in a day, number of planes grounded at an airport, family size etc. are examples of
discrete variables.
[ii]. Continuous variables: continuous variables are those which can take all possible real values
within a certain range. Such real values may be whole numbers as well as fractional
numbers. Height, weight, blood pressure, rainfall, temperature, marks of a student and profit on a
certain commodity are examples of continuous variables.
For further clarification of the difference between discrete and continuous variables, we take two
examples: (i) if we count the number of telephone calls received in a day or an hour, it may be
0 or 1 or 2 or 3 ... (but not a fractional number such as 3.6 or 4.5). It will always be an integer or whole
number. Hence, the number of telephone calls is a discrete variable. (ii) if we measure the height of the
students in a class, it may be 4.2 ft. or 5.4 ft. or 5 ft. etc. Height may take an integral or a fractional
value. So the height of a student is a continuous variable.
Population: Population or statistical population means the totality of cases under
investigation. It is also the entire group of individuals or members or items of interest in a study.
A population is finite if it is possible to count its individuals. It is also known as a countable
population. Sometimes it is not possible to count the number of units contained in a population.
Such a population is called an infinite or uncountable population.
For example: if we want to know the quality of the umbrellas produced by a factory, the
entire umbrella output of the factory under study in a certain time period is called a
population. All the patients in a certain hospital, all the flower plants in a garden, all the
teachers in a college and all the goats in a goat farm are examples of populations in different
studies. All the above examples are also examples of countable or finite populations. The number
of vehicles crossing the Rapti bridge (Hetauda) in a day is another example of a finite population.
The number of germs in the body of a COVID patient and the number of chemical (liquid) drops in a
beaker are examples of uncountable or infinite populations.
If all the items or individuals can be regarded as the same type, such population is known
as homogeneous population. The heterogeneous population means the population containing
sub-population of different types of characteristics under study.
Sample:
Sample is some of the units selected from population. The process of selecting such units from
the population to draw conclusion about the population is known as sampling. So the sampling is
a process of choosing a representative portion of the population.
Parameters are the statistical measures which describe or characterize a population.
In general, complete data for the entire population are not available. We have to rely on samples
to arrive at conclusions about populations. Thus, the parameters need to be estimated. An estimate
of a population parameter is called a statistic. In other words, statistical measures calculated from
the sample data, which serve as estimates of population parameters, are called statistics.
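Parameter (population of size N)                                   Statistic (sample of size n)
Mean, $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$                        Mean, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Variance, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$      Variance, $S^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})^2$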
Diagrams and Graphs are the presentation of statistical data in the form of geometrical figures
like points, lines, bars , rectangles, circles etc.
The most common and easiest method of presenting the data is bar diagram. Bar diagram
consists of a rectangle or a set of rectangles corresponding to data in which the magnitudes or
values are represented by the length or height of the bar.
The diagram for only one variable is a simple bar diagram. The magnitudes of the data are
presented through the height of the bars. It can be used for the comparative study of two or more
categories of a single variable. It consists of a set of equidistant rectangles having equal width.
Example [1]: The following is the production of some of the crops in a certain province of Nepal
in the fiscal year 2071/072:
S.N.   Crop type   Quantity (tonnes)
1      Rice        130,000
2      Maize       40,000
3      Wheat       25,000
4      Others      20,000
[Figure: Simple bar diagram of crop production (tonnes) for Rice, Maize, Wheat and Others; vertical axis from 0 to 140,000.]
A sub-divided or component bar diagram is a bar diagram which presents two or
more components of a total. It is drawn when it is necessary to compare the
different components with one another and to see the relation between the components and
their total. In a component bar diagram, a rectangle of total length is drawn. The
rectangle is then divided into parts, each representing one component of the total. Each
component can be colored or shaded differently.
From example 1:
[Figure: Sub-divided bar diagram of crop production. A single bar for Crops with segments 130,000, 40,000, 25,000 and 20,000; vertical axis from 0 to 250,000.]
Example [2]: The following table represents the number of passed out students in B.Sc. Forestry
from FoF, Agriculture and Forestry University, Nepal in the year 2074 and 2075.
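[Figure: Sub-divided bar diagram showing the passed out students in B.Sc. Forestry, by sex (Male, Female), for the years 2074 and 2075. Segment labels read from the figure: Male 40 and Female 25 in 2074; Male 35 and Female 40 in 2075.]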
A sub-divided bar diagram in which the data are presented as percentages is known as a percentage bar
diagram. Every bar has the same height, 100. To show the relative changes in the data, a
percentage bar diagram is more appropriate than simple and sub-divided bar diagrams. The
percentage bar diagram of example [1] above is as below:
[Figure: Percentage bar diagram of crop production. A single bar for Crops with segments 60.46, 18.6, 11.62 and 9.3 per cent.]
[Figure: Percentage bar diagram showing the passed out students in B.Sc. Forestry. 2074: Male 61.54, Female 38.46; 2075: Male 46.67, Female 53.33 per cent.]
[Figure: Multiple bar diagram of crop production. Rice 130,000; Maize 40,000; Wheat 25,000; Others 20,000; Total 215,000.]
[Figure: Multiple bar diagram showing the passed out students in B.Sc. Forestry. 2074: Male 40, Female 25, Total 65; 2075: Male 35, Female 40, Total 75.]
The lengths as well as width of the bars are considered and the area of the bars are compared in
rectangular diagram. The area of two or more circles of two variables can be compared in
circular diagram. The rectangular diagram & Pie chart (or circular diagram) are the examples of
two dimensional diagram.
It is the modified form of bar diagram. The only difference is that the areas of two
rectangles of different categories of the same variables are compared. The height or length of the
rectangle is generated by data and the breadth or width is same for each rectangle. The gaps of
two rectangles are taken same.
A pie chart or circular diagram is a two dimensional diagram. This method is popular and widely
used. It is also known as an angular diagram. All the given values are converted into
central angles of a circle. The sum of all the angles should be 360°. The conversion of each of
the items into an angle is as follows:
Pie chart (or circular diagram) of example [1] above is as drawn below:
[Figure: Pie chart of crop production. Rice 130,000; Maize 40,000; Wheat 25,000; Others 20,000.]
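The conversion into central angles can be sketched in a few lines of Python. The rule assumed here is the usual one, angle = (value / total) × 360°, since the conversion itself is not written out in these notes; the function name `pie_angles` is only illustrative.

```python
# Crop production (tonnes) from Example [1].
production = {"Rice": 130_000, "Maize": 40_000, "Wheat": 25_000, "Others": 20_000}


def pie_angles(values):
    """Return the central angle in degrees of each item: value / total * 360."""
    total = sum(values.values())
    return {name: value / total * 360 for name, value in values.items()}


angles = pie_angles(production)
for name, angle in angles.items():
    print(f"{name}: {angle:.2f} degrees")
```

The angles add up to 360°, which is a quick check that no item has been left out.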
[i]. Histogram.
If only mid values are given, the upper and lower limits of each of the intervals should be
obtained for drawing histograms.
Example [i]. The following are the numbers of forest enterprises classified by annual income in
Rs. ‘000’.
[Figure: Histogram of the no. of forest enterprises against income (Rs. '000') for the classes 0-15, 15-30, 30-45, 45-60, 60-75 and 75-90; vertical axis from 0 to 30.]
Note that the histogram, frequency polygon, frequency curve and ogive are drawn for
continuous series.
[Figure: No. of forest enterprises plotted against the income mid-points 7.5, 22.5, 37.5, 52.5, 67.5 and 82.5; vertical axis from 0 to 35.]
A smooth free hand curve drawn through the vertices of a frequency polygon is known as
the frequency curve. The frequency curve smoothes out the corners and the peaks of frequency
polygon in such a way that area enclosed by frequency curve is same as that of frequency
polygon but its shape must be smooth but not with sharp edges. The frequency curve is a
limiting form of frequency polygon when the number of observations become very large and
class intervals are made smaller and smaller.
Note: [1] cumulative frequency curve or ogive means less than ogive (if otherwise is not stated.)
[2]. If both more than ogive and less than ogive are drawn on the same paper, they intersect at a
point. The foot of the perpendicular drawn from their point of intersection on the x-axis gives the
value of the median.
Examples of less than and more than ogive can be given from Example[i].
Income less than   No. of enterprises   Income more than   No. of enterprises
15000              12                   0                  142
30000              32                   15000              130
45000              57                   30000              110
60000              97                   45000              85
75000              127                  60000              45
90000              142                  75000              15
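The two cumulative columns can be rebuilt from the class frequencies, as the following Python sketch shows. The frequencies 12, 20, 25, 40, 30, 15 are not listed in these notes; they are inferred here from the differences of the "less than" column, so treat them as an assumption.

```python
from itertools import accumulate

# Class frequencies inferred from the "less than" column (12, 32-12, 57-32, ...).
frequencies = [12, 20, 25, 40, 30, 15]

# "Less than" cumulative frequency at each upper class limit.
less_than = list(accumulate(frequencies))
total = less_than[-1]

# "More than" cumulative frequency at each lower class limit:
# the total minus everything that lies below that limit.
more_than = [total - below for below in [0] + less_than[:-1]]

print(less_than)
print(more_than)
```

Plotting `less_than` against the upper limits and `more_than` against the lower limits gives the two ogives.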
[Figure: More than ogive. No. of enterprises against "Income more than (Rs. '000')"; vertical axis from 0 to 160.]
[Figure: Less than ogive. No. of enterprises against "Income less than (Rs. '000')"; vertical axis from 0 to 120.]
The tendency of an unwieldy mass of data to cluster towards the middle of the group is
known as central tendency. Its measures are used to describe the middle or centre of a data set. The
value obtained from a measure of central tendency can be regarded as a single
typical value representing the data. Measures of central tendency also enable us to compare two or more sets of
data. We may define the central tendency as follows:
Central tendency is a single value within the range of the data which represents a group of
individual values in a simple and concise manner and lies in the middle of the
distribution.
The following characteristics should be applied or satisfied to select an ideal or suitable average
Types of average:
The measure of central tendency is designed to measure central value around which most of the
data tends to concentrate. The measure of central location or central tendency are as follows:
Mean, Median and Mode are considered generally as major measures, while geometric and
harmonic means are considered as minor measures or means of transformed data. H.M. & G.M.
are used in typical cases only.
It is the most commonly used and popular mean. Arithmetic mean is also known as
arithmetic average. ’Arithmetic mean’ or simply a ‘mean’ of a set of observations is the sum of
all the observations divided by the number of observations
Individual series: Ungrouped data in which the value of each individual item is listed
is known as an individual series. Let x1, x2, x3, ..., xn be the ‘n’ observations of the
variable X; then their arithmetic mean, denoted by $\bar{X}$, is defined as
$\bar{X} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum x}{n}$ ………………………[i]
When the number of observations are so large and the sizes of the items are so big,
we take their arithmetic mean by taking the deviations of the items from any
arbitrary number. This method is known as ‘short cut method.’ or ‘change of
origin method’ and this arbitrary number is known as ‘assumed mean’. The
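addition or subtraction of a constant is sometimes called change of origin. In this
case, writing d = x − a for the deviation of each item from the assumed mean a,
$\bar{X} = a + \frac{d_1 + d_2 + d_3 + \cdots + d_n}{n} = a + \frac{\sum d}{n}$ ………………………[ii]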
In discrete series:
If f1, f2, f3, ..., fn are the frequencies corresponding to the variate values x1, x2,
x3, ..., xn respectively, then the arithmetic mean $\bar{X}$ is defined as
$\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + f_3 + \cdots + f_n} = \frac{\sum fx}{\sum f} = \frac{\sum fx}{N}$ ………………………[iii]
In discrete series, the A.M. can also be obtained by taking an assumed mean ‘a’, as
follows:
$\bar{X} = a + \frac{\sum fd}{\sum f} = a + \frac{\sum fd}{N}$ ………………………[iv]
where a = assumed mean, N = Σf = total frequency, d = x − a, $d' = \frac{x - a}{h}$ and h = class size. In
other symbols, the formulae for the direct, short cut and step deviation methods are
summarized as:
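Direct method [individual series]: $\bar{X} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum x}{n}$
Direct method [discrete & continuous series]: $\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + f_3 + \cdots + f_n} = \frac{\sum fx}{N}$
Short cut method [individual series]: $\bar{X} = a + \frac{U_1 + U_2 + U_3 + \cdots + U_n}{n} = a + \frac{\sum U}{n}$
Short cut method [discrete & continuous series]: $\bar{X} = a + \frac{\sum fU}{\sum f} = a + \frac{\sum fU}{N}$
Step deviation method [discrete & continuous series]: $\bar{X} = a + \frac{\sum fU'}{\sum f} \times h = a + \frac{\sum fU'}{N} \times h$
Here U = x − a and $U' = \frac{x - a}{h}$.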
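The three methods always give the same mean, which can be checked with a short Python sketch. The mid-values and frequencies below are hypothetical figures chosen only for illustration, and the function names are not part of these notes.

```python
def mean_direct(x, f):
    """Direct method: sum(f*x) / N."""
    return sum(fi * xi for xi, fi in zip(x, f)) / sum(f)


def mean_shortcut(x, f, a):
    """Short cut method: a + sum(f*U) / N, with U = x - a."""
    return a + sum(fi * (xi - a) for xi, fi in zip(x, f)) / sum(f)


def mean_step_deviation(x, f, a, h):
    """Step deviation method: a + (sum(f*U') / N) * h, with U' = (x - a) / h."""
    return a + sum(fi * (xi - a) / h for xi, fi in zip(x, f)) / sum(f) * h


# Hypothetical grouped data: class mid-values and their frequencies.
mid_values = [5, 15, 25, 35, 45]
frequencies = [3, 7, 12, 5, 3]

print(mean_direct(mid_values, frequencies))
print(mean_shortcut(mid_values, frequencies, a=25))
print(mean_step_deviation(mid_values, frequencies, a=25, h=10))
```

The choice of the assumed mean a and the class size h changes only the size of the intermediate numbers, not the result.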
If w1, w2, w3, ..., wn are the weights given to the variate values x1, x2, x3,
..., xn respectively, then their weighted arithmetic mean is given by
$\bar{X}_w = \frac{x_1 w_1 + x_2 w_2 + \cdots + x_n w_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wx}{\sum w}$ ……….[vi]
1. The algebraic sum of the deviations of the items taken from the arithmetic mean is
zero. So, $\sum (X - \bar{X}) = 0$ [in individual series], $\sum f(X - \bar{X}) = 0$ [in discrete series],
and $\frac{1}{h}\sum f(X - \bar{X}) = 0$ [in continuous series], where h is the class size and X is the
mid-value of each class.
2. The sum of the squares of the deviations of the items is minimum when the
deviations are taken from the arithmetic mean. Mathematically, $\sum (X - a)^2$ is
minimum when $a = \bar{X}$.
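Both properties can be checked numerically. The Python sketch below uses a small set of illustrative values (an assumption, not data from these notes) and compares the sum of squared deviations about the mean with the sum about two nearby points.

```python
# Illustrative values only.
data = [3, 6, 7, 10, 11]
mean = sum(data) / len(data)

# Property 1: the deviations from the mean add up to zero.
deviation_sum = sum(x - mean for x in data)


def sum_sq_dev(values, a):
    """Sum of squared deviations of the values from the point a."""
    return sum((x - a) ** 2 for x in values)


# Property 2: the sum of squares is smallest when a equals the mean.
at_mean = sum_sq_dev(data, mean)
below = sum_sq_dev(data, mean - 0.5)
above = sum_sq_dev(data, mean + 0.5)

print(deviation_sum, at_mean, below, above)
```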
Merits:
• It is rigidly defined.
• It is based on all the observations.
• It is simple to understand and easy to compute.
• It is suitable for further mathematical analysis.
• It is least affected by fluctuation of sampling.
Demerits:
The geometric mean of the ‘n’ non zero and non negative variate values is the nth
root of their product.
In individual series, let x1, x2, x3, ..., xn be ‘n’ non-zero and non-negative
variate values; then their geometric mean (denoted by ‘G’) is defined as
$G = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \mathrm{antilog}\left[\frac{\sum \log x}{n}\right]$ ……………………[ix]
In discrete series, if f1, f2, f3, ..., fn are the frequencies corresponding to ‘n’ non-zero
and non-negative variate values x1, x2, x3, ..., xn, their geometric mean G is
defined as
$G = \mathrm{antilog}\left[\frac{\sum f \cdot \log x}{N}\right]$ ……………………[x]
In continuous series The formula for the G.M. (G) in the discrete series [x]can be
used by considering mid value of each of the class interval as the variate value x.
Merits:
➢ It is rigidly defined.
➢ It is based on all the observations.
➢ It is suitable for the further mathematical analysis.
➢ It is not affected very much by the fluctuation of sampling.
➢ It gives more weights comparatively to small items.
Demerits:
In individual series: Let x1, x2, x3, ..., xn be non-zero variate values of the variable
X; then their harmonic mean (H.M.), denoted by H, is defined as
$H = \frac{n}{\sum \frac{1}{x}}$ ………………………………….[xi]
In discrete & continuous series: If f1, f2, f3, ..., fn are the frequencies
corresponding to the n variate values x1, x2, x3, ..., xn, then their harmonic
mean H is defined as
$H = \frac{N}{\sum f \cdot \frac{1}{x}}$ ……………………………[xii]
(iv). MEDIAN:
Median is the variate value which divides total number of observations into two
equal parts. So, the number of observations in the first part is equal to the number
of observations in the second part in the condition when the data is arranged either
in ascending order in magnitude or in descending order in magnitude.
Individual series: First of all, arrange the data in either ascending or descending
order. If total number of observations ‘n’ is odd, there is only one middle value
which divides the whole items in two equal parts. If n is even, there are two central
or middle items. The single central value (median ‘Md’) can be obtained by taking
the A.M. of these two central items. In both cases,
$Md = \text{value of } \left[\frac{n+1}{2}\right]^{th} \text{ item}$ …………………………………………[xiii]
Discrete series: The median can be determined in discrete series. For this, we
follow the following steps:
(iii). Use the formula for Median ‘Md’ (where ‘N’ = total frequency)
$Md = \text{value of } \left[\frac{N+1}{2}\right]^{th} \text{ item}$ ……………………………………...[xiv]
(iv). Observe the cumulative frequency column, and note the value corresponding
to the c.f. either equal to or just greater than the value given by [xiv], i.e. $\frac{N+1}{2}$. This gives
the Median value Md.
(iv). Observe the cumulative frequency column, and note the class corresponding
to the c.f. either equal to or just greater than N/2. This gives the Median class, i.e. the
class in which the median belongs.
Merits:
Demerits:
(v). MODE:
The value (or variate value) which repeats maximum number of times, is known as
Mode. In other words, Mode is the value having maximum frequency. Mode is
denoted by Mo.
f1= frequency following the modal class, h=size of the modal class.
NOTE: Above formula and definition of the Mode suffer from the following
limitations:
Merits:
Demerits:
This last relationship is also known as the ‘Empirical relation’ among A.M., G.M.
and H.M.
PARTITION VALUES:
Those variate values which divide the total number of items into equal
number of parts, are said to be partition values. The important partition values are
Quartiles, Deciles and percentiles. There are 3 Quartiles which divide total number
of items into 4 equal parts. Similarly, there are 9 Deciles which divide total number
of observations into 10 equal parts. And the number of percentiles are 99 which
divide whole observations into 100 equal parts.
The three Quartiles are denoted by Q1, Q2 and Q3, which are known as the first,
second and third Quartiles respectively. The second Quartile Q2 is the Median itself, which
divides the whole set of observations into 2 equal parts. The nine Deciles are denoted by D1, D2,
..., D9. Similarly P1, P2, ..., P99 are the ninety-nine Percentiles. From
these ideas we can easily conclude that Md = Q2 = D5 = P50, Q1 = P25, Q3 = P75, D1 = P10,
D7 = P70 and so on.
The process of computing partition values (Quartiles, Deciles and percentiles) are
similar to the process of computing Median in individual, discrete and continuous
series. We discuss the process as follows:
Individual series
First of all, the given ‘n’ items should be arranged in ascending order. Then
the Quartiles, Deciles and Percentiles can be determined by the following formulae:
$Q_i = \text{value of } \left[\frac{i(n+1)}{4}\right]^{th} \text{ item}$, where i = 1, 2, 3.
$D_j = \text{value of } \left[\frac{j(n+1)}{10}\right]^{th} \text{ item}$, where j = 1, 2, 3, ..., 9.
$P_k = \text{value of } \left[\frac{k(n+1)}{100}\right]^{th} \text{ item}$, where k = 1, 2, 3, ..., 99.
Discrete series: For computing the partition values in discrete series, we may
proceed by the following steps:
(iii). Use the formula for partition values Qi, Dj & Pk (where ‘N’ = total frequency)
$Q_i = \text{value of } \left[\frac{i(N+1)}{4}\right]^{th} \text{ item}$, where i = 1, 2, 3.
$D_j = \text{value of } \left[\frac{j(N+1)}{10}\right]^{th} \text{ item}$, where j = 1, 2, 3, ..., 9.
$P_k = \text{value of } \left[\frac{k(N+1)}{100}\right]^{th} \text{ item}$, where k = 1, 2, 3, ..., 99.
(iv). Observe the cumulative frequency column, and note the value corresponding
to the c.f. either equal to or just greater than the value given by step (iii) [i.e. $\frac{i(N+1)}{4}$ or
$\frac{j(N+1)}{10}$ or $\frac{k(N+1)}{100}$]. This gives the required partition value (Qi or Dj or Pk).
(iv). Observe the cumulative frequency column, and note the class corresponding
to the c.f. either equal to or just greater than jN/10. This gives the Dj class, i.e. the class
in which Dj belongs.
EX.[1]. Find the arithmetic mean of (i) the first n natural numbers, (ii) $a, ar, ar^2, \ldots, ar^{n-1}$.
Solution: (i). We know the first ‘n’ natural numbers are 1, 2, 3, ..., n. Hence, the A.M.
of the first ‘n’ natural numbers is [since the sum of the first ‘n’ natural numbers is
$\frac{n(n+1)}{2}$]
$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + 2 + 3 + \cdots + n}{n} = \frac{1}{n} \cdot \frac{n(n+1)}{2} = \frac{n+1}{2}$.
(ii). We have, $\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{a + ar + ar^2 + \cdots + ar^{n-1}}{n} = \frac{a}{n}\left[1 + r + r^2 + \cdots + r^{n-1}\right]$
$= \frac{a}{n}\left[\frac{1 - r^n}{1 - r}\right] = \frac{a(1 - r^n)}{n(1 - r)}$, for r ≠ 1.
Ex.[2]. There are two series X and Y having same number of items. The
relationship of each member of X and Y is, 2x-y=5. If the arithmetic mean of Y is
15, what is the A.M. of series X?
12 15 28 25 20
Solution:
Ex.[4]. Find the arithmetic mean of daily expenditure of 150 students of Everest
Hostel.
Computation of Mean
500-600 550 12 2 24
600-700 650 0 3 0
Σf = N = 150        ΣfU′ = −128
We know, $\bar{X} = a + \frac{\sum fU'}{N} \times h = 350 + \frac{-128}{150} \times 100$
= 350 − 85.33 = 264.67
Ex.[5]. Find the missing frequencies from the incomplete distribution given below.
(given that average wage is 30.2).
Hourly wages (in Rs.):   0-10   10-20   20-30   30-40   40-50   Total
Number of workers (f):   4      -       10      20      -       50
Solution: Let f1 and f2 be the numbers of workers in the hourly wage classes
(Rs.) 10-20 and 40-50 respectively. We have, A.M. $\bar{X} = \frac{\sum fx}{N}$
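The two unknown frequencies follow from two conditions: the frequencies must add up to the total N, and Σfx must equal N times the given mean. A minimal Python sketch of this working is given below; the variable names are illustrative only.

```python
# Class mid-values of 0-10, 10-20, 20-30, 30-40, 40-50.
mid_values = [5, 15, 25, 35, 45]
known = {0: 4, 2: 10, 3: 20}      # position of class -> known frequency
N, mean = 50, 30.2

known_f = sum(known.values())                              # 34
known_fx = sum(mid_values[i] * f for i, f in known.items())  # 970

# Condition 1: f1 + f2 = N - known_f
# Condition 2: 15*f1 + 45*f2 = mean*N - known_fx
s = N - known_f
t = mean * N - known_fx
x1, x2 = mid_values[1], mid_values[4]

f2 = (t - x1 * s) / (x2 - x1)
f1 = s - f2
print(round(f1), round(f2))
```

Substituting the two values back into Σfx / N is a useful check on the answer.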
Ex.[6]. Mean of 100 observations was calculated as 50. Later on it was found that
two items were misread as 92 and 8 instead of 192 and 88. Find the correct mean.
Correct mean = ?
We have, $\bar{X} = \frac{\sum x}{n}$, or, 50 = Σx/100, so Σx (from the misread items) = 5000.
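The remaining step is the usual correction of the total: subtract the misread values from Σx, add the true values, and divide by n again. A short Python sketch, with an illustrative function name, is:

```python
def corrected_mean(n, wrong_mean, misread, actual):
    """Correct a mean after some items were misread.

    The wrong total is n * wrong_mean; remove the misread values,
    add the actual values, then divide by n.
    """
    wrong_total = n * wrong_mean
    correct_total = wrong_total - sum(misread) + sum(actual)
    return correct_total / n


result = corrected_mean(n=100, wrong_mean=50, misread=[92, 8], actual=[192, 88])
print(result)
```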
Ex.[7]. Find the Geometric mean of the growth factor of population in a capital
city within 5 years from the following data:
Year 1 2 3 4 5
Growth Factor 1.07 1.08 1.10 1.12 1.18
Solution: We know the formula for the G.M. is
$G = \mathrm{antilog}\left[\frac{1}{n} \sum \log x\right]$
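The log formula can be applied directly to the five growth factors. The Python sketch below uses base-10 logarithms, as in the antilog tables, and compares the result with the n-th root of the product; the function name is illustrative.

```python
import math

growth_factors = [1.07, 1.08, 1.10, 1.12, 1.18]


def geometric_mean(values):
    """G = antilog[(1/n) * sum(log x)], using base-10 logarithms."""
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


g = geometric_mean(growth_factors)
# Same quantity computed as the n-th root of the product.
root_of_product = math.prod(growth_factors) ** (1 / len(growth_factors))
print(round(g, 4), round(root_of_product, 4))
```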
Frequency: 20 40 30 10
Solution:
Computation of G. Mean
Ex.[9]. An enquiry into the budget of the certain family, provided the following
information. Calculate the weighted arithmetic mean.
Weighted mean of B, $\bar{x}'_w = \frac{\sum xw}{\sum w} = \frac{x_1 w_1 + x_2 w_2 + x_3 w_3}{w_1 + w_2 + w_3} = \frac{43 \times 5 + 38 \times 3 + 36 \times 2}{5 + 3 + 2} = 401/10 = 40.1$
Ex.[11]. There are 3 types of staffs in a research centre. The average salary of 8
field workers is Rs.50,000, 3 managers is Rs.75,000 and 9 supporting staff is
Rs.25,000 per month respectively. Find the average salary per month of all the
staffs of that research centre.
Solution: Here,
Numbers:   n1 = 3,   n2 = 8,   n3 = 9,   n = 20.
A. Mean:   $\bar{x}_1$ = 75,000,   $\bar{x}_2$ = 50,000,   $\bar{x}_3$ = 25,000,   $\bar{X}$ = ?
We have, Combined Arithmetic Mean, $\bar{X} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n_1 + n_2 + n_3}$
$= \frac{3 \times 75000 + 8 \times 50000 + 9 \times 25000}{20}$ = 850,000/20 = 42,500
Hence, the average salary per month of all the staffs = Rs. 42,500.
Ex.[12]. In a class of 100 students, 85 passed and their average marks is 58. The
total marks secured by the entire class is 5260. Find the average marks of the failed
students
Or, $\frac{5260}{100} = \frac{85 \times 58 + 15 \times \bar{X}_2}{100}$
X: 3 6 7 10 11
f: 15 25 30 20 18
Solution:
Computation of H.M
X      f      1/x       f · 1/x
3      15     0.3333    4.9999
6      25     0.1666    4.1666
7      30     0.1428    4.2840
10     20     0.1000    2.0000
11     18     0.0909    1.6362
       Σf = N = 108     Σ f · 1/x = 17.0867
Now, Harmonic Mean $H.M. = \frac{N}{\sum f \cdot \frac{1}{x}} = \frac{108}{17.0867} = 6.32$.
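The table above can be reproduced with a short Python sketch. It works with unrounded reciprocals, so the intermediate sum differs slightly in the later decimals from the hand calculation, but the harmonic mean agrees to two decimal places. The function name is illustrative.

```python
x = [3, 6, 7, 10, 11]
f = [15, 25, 30, 20, 18]


def harmonic_mean(values, freqs):
    """H = N / sum(f * 1/x) for a discrete frequency distribution."""
    return sum(freqs) / sum(fi / xi for xi, fi in zip(values, freqs))


h = harmonic_mean(x, f)
print(round(h, 2))
```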
(a). 27, 124, 54, 35, 61, 87, 78, 40. (b). 6, 3, 21, 31, 12, 24, 9, 17, 22, 11, 21.
(c)
(b). The given data in ascending order is, 3, 6, 9, 11, 12, 17, 21, 21, 22, 24, 31.
Here, N/2 = 75/2 = 37.5. The c.f. just greater than 37.5 is 40. So, the Median lies in
the corresponding class interval 20-30. From the Md class, L = 20, h = 10, f = 25,
c.f. = 15.
Now, $Md = L + \frac{\frac{N}{2} - c.f.}{f} \times h = 20 + \frac{37.5 - 15}{25} \times 10 = 20 + 9 = 29$.
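The interpolation formula for the median of a continuous series is easy to wrap in a small Python function. The sketch below takes the quantities of the median class as arguments (the argument names follow the symbols used above; the function name is illustrative) and reproduces the calculation.

```python
def grouped_median(L, N, cf, f, h):
    """Md = L + ((N/2 - c.f.) / f) * h.

    L  : lower limit of the median class
    N  : total frequency
    cf : cumulative frequency of the class preceding the median class
    f  : frequency of the median class
    h  : size of the median class
    """
    return L + (N / 2 - cf) / f * h


md = grouped_median(L=20, N=75, cf=15, f=25, h=10)
print(md)
```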
Ex.[16]. Calculate the appropriate measure of central tendency from the following
data of monthly income (Rs.) generated by a certain Community Forestry of 125
user groups.
Monthly income (Rs.):   Below 500   500-599   600-699   700-799   800-899   900 & more
No. of families:        5           35        50        15        12        8
Solution: The first and last intervals are open-ended (income below
and above a limit), and the data are monthly incomes (Rs.). So, the suitable measure of central
tendency is the Median. The classes are in inclusive form, so the calculation table is
changed into exclusive form.
Ex.[17]. Find the missing frequency from the following distribution. The median
of the distribution is given to be 180.
No. of persons: 15 25 40 - 10
Solution, It is given that, Md =180, so, median class is 150-200. The missing
frequency is assumed as f’.
Ex.[18]. Find the modal income from the following frequency distribution of daily
income.
Then, first class in exclusive form is 115-2.5 to 115+2.5, i.e., 112.5-117.5 & so on.
Calculation of Mode
=122.5 + 3 =125.5
Ex.[19]. The modal income (hourly) for a group of 100 workers is 140. The number of
workers earning between Rs. 0 and 50 is 10. Thirty workers earn between Rs. 100 and
150, and 15 workers earn Rs. 200-250. If the maximum income is Rs. 250, find the
no. of workers earning between Rs. 50-100 and Rs. 150-200.
Solution: Given that Mo = 140, N = 100. Suppose that f0 and f2 are the numbers of
workers earning between Rs. 50-100 and Rs. 150-200 respectively.
Or, $140 = 100 + \frac{30 - f_0}{60 - f_0 - f_2} \times 50$, or, $40[60 - f_0 - f_2] = 50[30 - f_0]$ ……………[i]
Since the total frequency is 100, 10 + f0 + 30 + f2 + 15 = 100, so
f2 + f0 = 45 ………………………………………………..[ii]
Putting [ii] in [i]: 40 × 15 = 50[30 − f0], so f0 = 18 and f2 = 45 − 18 = 27.
Hence, the required no. of workers earning between Rs. 50-100 is 18 and that
between Rs. 150-200 is 27.
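The same working can be sketched in Python. The mode relation used is the one substituted in the solution above, Mo = L + (f1 − f0) / (2f1 − f0 − f2) × h, with f1 = 30 the frequency of the modal class 100-150; the variable names are illustrative.

```python
N = 100
known = {"0-50": 10, "100-150": 30, "200-250": 15}
L, h, f1, mode = 100, 50, 30, 140

# Total frequency condition: f0 + f2 = N - (known frequencies).
s = N - sum(known.values())

# Mode condition with f0 + f2 = s:
#   (mode - L) * (2*f1 - s) = h * (f1 - f0)
f0 = f1 - (mode - L) * (2 * f1 - s) / h
f2 = s - f0
print(f0, f2)

# Check: substituting back should return the given mode.
check = L + (f1 - f0) / (2 * f1 - f0 - f2) * h
print(check)
```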
Ex.[20]. Find first quartile (Q1), Median (Md), fourth Decile (D4) and 85th
Percentile (P85) from the data, : 21, 17, 8, 12, 23, 16, 4,10
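A Python sketch of the individual-series formulae is given below. When the position k(n+1)/parts is fractional, the sketch interpolates linearly between the two neighbouring items; for a position ending in .5 this is the same as averaging the two items, but for other fractions it is an assumption of this sketch rather than a rule stated in these notes. The function name is illustrative.

```python
def partition_value(data, k, parts):
    """Value of the [k(n+1)/parts]-th item of the sorted data.

    Positions are counted from 1. Fractional positions are handled by
    linear interpolation between the neighbouring items (an assumption).
    """
    ordered = sorted(data)
    n = len(ordered)
    position = k * (n + 1) / parts
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part of the position
    if lower < 1:
        return float(ordered[0])
    if lower >= n:
        return float(ordered[-1])
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = [21, 17, 8, 12, 23, 16, 4, 10]
q1 = partition_value(data, 1, 4)       # first quartile
md = partition_value(data, 2, 4)       # median (= Q2)
d4 = partition_value(data, 4, 10)      # fourth decile
p85 = partition_value(data, 85, 100)   # 85th percentile
print(q1, md, d4, p85)
```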
EX.[21]. Find the median size of trousers which are bought from a supplier for 200
students in a hostel as shown below:
Size of trousers: 18 20 22 24 26 28 30 32
No. of trousers: 5 20 25 40 50 35 15 10
Also determine the third Quartile, 7th Decile & 65th Percentile.
Solution:
Size of trousers: 18 20 22 24 26 28 30 32
No. of trousers:f 5 20 25 40 50 35 15 10
c.f. 5 25 50 90 140 175 190 200
Now, $Md = \text{value of } \left[\frac{N+1}{2}\right]^{th} \text{ item} = \text{value of } \left[\frac{200+1}{2}\right]^{th} \text{ item}$.
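The discrete-series steps (cumulative frequency, position, first c.f. equal to or greater than the position) can be sketched in Python as follows. The function name is illustrative; the data are those of the table above.

```python
from itertools import accumulate

sizes = [18, 20, 22, 24, 26, 28, 30, 32]
freqs = [5, 20, 25, 40, 50, 35, 15, 10]


def discrete_partition(values, frequencies, k, parts):
    """Partition value of a discrete series.

    Position = k(N+1)/parts; the answer is the value whose cumulative
    frequency is the first one equal to or greater than that position.
    """
    cumulative = list(accumulate(frequencies))
    position = k * (cumulative[-1] + 1) / parts
    for value, cf in zip(values, cumulative):
        if cf >= position:
            return value
    return values[-1]


median = discrete_partition(sizes, freqs, 1, 2)
q3 = discrete_partition(sizes, freqs, 3, 4)
d7 = discrete_partition(sizes, freqs, 7, 10)
p65 = discrete_partition(sizes, freqs, 65, 100)
print(median, q3, d7, p65)
```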
EX.[22] The daily sale of potato by 100 retailers in Hetauda sub-metropolitan city
are as follows:
For Median, N/2 =100/2 = 50, So, Median lies on the class 60-80.
= 60 + 16.66 = 76.66
For Mode, Since maximum frequency is 30, the corresponding class 60-80 is the
modal class.
= 60 + 15 = 75.
Hence, mean sale= 76 kg., median =76.66 kg. & modal sale = 75 kg.
= 40 + 20 = 60.
For D4: 4.N/10 = 4 x 100/10 = 40. The c.f. just greater than 40 is 55 and
corresponding class is 60-80.
= 60+10 = 70.
For P70: 70.N/100 = 70 x 100/100 = 70. The c.f. just greater than 70 is 80 and
corresponding class is 80-100.
= 80+12 = 92.
Ex.[23]. The following are the distribution of marks obtained by 250 students of
Faculty of Forestry in the subject ‘Fire Ecology’. (a). find the minimum pass marks
if only 30% of the students had failed. (b) If top 20 % students were awarded for
scholarship, what is the minimum marks obtained by students who were awarded.
And (c) find out the range of the marks of the middle 60% students.
For P30: 30.N/100 = 30x250/100 =75. The c.f. column contains 75. So, the class
in which P30 belong is 10-20.
(b) The minimum marks obtained by top 20% students is P80 or D8.
For P80: We have, 80 x N/100 = 80 x 250/100 = 200. The c.f. just greater than
200 is 205. The corresponding class in which P80 belong is 25-30.
Hence, the minimum marks obtained by students who were awarded is 29.5
(c) Hint: The middle 60% students occupy 30% above Median [or positional
average] and 30% below Median. In other words, we have to find P20 &P80. Range
of the marks of the middle 60% students = P80 – P20. [Solve it!]
MEASURE OF DISPERSION
Two distributions may have the same Mean, Median and Mode and still not be
identical. They may differ in formation. Consider the following examples:
Mode, which doesn’t mean that the series are identical. But they differ in measure
of ‘Dispersion’ or ‘Variability’. It means they differ in entirely formations.
Actually, dispersion means the scatter of the items about the
central value. Thus, dispersion is defined as ‘the measure of variation of the
items from the central value’.
The measures of dispersion having the same unit as that of the given series, is
known as ‘the absolute measure of dispersion’. It can be used to compare the
variability of two distributions having the same units. Two distributions having
different units can be compared with the help of relative measures of dispersion.
The relative measure of dispersion is a ratio defined by
$\text{Relative measure of Dispersion} = \frac{\text{Absolute measure of Dispersion}}{\text{Suitable Average}}$
Measures of Dispersion
[i]. Range
of the lowest class or (ii) by taking the difference between middle point of the
highest class and the middle point of the lowest class.
The Range is the absolute measure of dispersion. It has the same unit of
measurement as that of the given data. Coefficient of Range is the relative measure
corresponding to Range.
Hence, $\text{Coefficient of Range} = \frac{L - S}{L + S}$, where L and S are the largest and smallest items.
If we compare two distributions having the same units and nearly equal means,
the distribution with the larger Coefficient of Range is the more dispersed or more
variable one.
The distribution has more variability (or less uniformity or less equality or
less consistency), if the distribution has more Q.D. or coefficient of Q.D.
The arithmetic mean of the deviations of the items from the mean, median or mode is
known as the Mean Deviation (M.D.) when all the deviations are considered as
positive. If $\bar{X}$, Md, Mo denote the arithmetic mean, median and mode respectively,
then the Mean Deviation from mean, median and mode are as follows:
$\text{Coefficient of M.D. from Median} = \frac{\text{M.D. from Median}}{\text{Median}}$
$\text{Coefficient of M.D. from Mode} = \frac{\text{M.D. from Mode}}{\text{Mode}}$
Standard Deviation:
The Standard Deviation (S.D.) is defined as the positive square root of the
mean of the square of the deviations taken from the arithmetic mean. It is denoted
by σ.
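$\sigma = \sqrt{\frac{\sum (x - \bar{X})^2}{n}} = \sqrt{\frac{\sum x^2}{n} - \left[\frac{\sum x}{n}\right]^2}$ ………………………….[i]
$\sigma = \sqrt{\frac{\sum f(x - \bar{X})^2}{N}} = \sqrt{\frac{\sum fx^2}{N} - \left[\frac{\sum fx}{N}\right]^2}$ ……………………….[ii]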
The formula [i] and [ii] represent the S.D. in individual and discrete + continuous
series respectively. In continuous series x represents the mid value of each class.
When the deviations are taken from an assumed mean, let u = x − a (change of
origin) and $u' = \frac{x - a}{h}$ (change of origin and scale), where a = assumed mean and h = class
size. Then, the S.D. (σ) is given by
$\sigma = \sqrt{\frac{\sum u^2}{n} - \left[\frac{\sum u}{n}\right]^2}$ ………………………………………………[iii]
$\sigma = \sqrt{\frac{\sum fu^2}{N} - \left[\frac{\sum fu}{N}\right]^2}$ ………………………………………….[iv]
$\sigma = h \times \sqrt{\frac{\sum fu'^2}{N} - \left[\frac{\sum fu'}{N}\right]^2}$ ……………………………………..[v]
Formula [iii] represents the short cut method (individual series). [iv]
represents the short cut method (discrete & continuous series). And [v] represents
the step deviation method (continuous series).
Notes:
Merits of Range
▪ It is rigidly defined.
▪ It is easy to calculate and simple to understand.
▪ It takes minimum time to know about dispersion with the help of it.
Demerits of Range
Merits of Q.D.
▪ It is rigidly defined.
▪ It is simple to understand and easy to calculate.
▪ It is not affected by extreme values.
▪ It can be calculated in the case of open end classes.
▪ It is more effective than range since it is based on 50% of central items.
Demerits of Q.D.
Merits of M.D.
Demerits of M.D.
Merits of S.D.
▪ It is rigidly defined.
▪ It is based on all observations.
▪ It is least affected by fluctuations of sampling.
▪ It is suitable for further mathematical analysis.
Demerits of S.D.
Ex.[1]. Calculate the following measures of dispersion and their coefficients: (a)
Range. (b) Semi-inter quartile range or Q.D. (c) Mean Deviation (M.D.). (d)
Standard Deviation (S.D.) from the data: 9, 15, 7, 14, 11, 9, 12, 10, 14.
Solution: The given data in ascending order is, 7, 9, 9, 10, 11, 12, 14, 14, 15.
(b). For Q1: Now, Q1 = value of $\left[\dfrac{n+1}{4}\right]^{th}$ item = value of $\left[\dfrac{9+1}{4}\right]^{th}$ item
= value of (2.5)th item = value of $\dfrac{2^{nd} + 3^{rd}}{2}$ item
= (9+9)/2 = 9
For Q3: Now, Q3 = value of $\left[\dfrac{3(n+1)}{4}\right]^{th}$ item = value of $\left[\dfrac{3(9+1)}{4}\right]^{th}$ item
= value of (7.5)th item = value of $\dfrac{7^{th} + 8^{th}}{2}$ item
= (14+14)/2 = 14.
(c). Mean, $\bar{X} = \dfrac{\sum X}{n} = \dfrac{1}{9}[7+9+9+10+11+12+14+14+15]$ = 101/9 = 11.22
M.D. from Mean = $\dfrac{\sum |X-\bar{X}|}{n}$
X      |x − X̄| = |x − 11.22|
7      |−4.22| = 4.22
9      |−2.22| = 2.22
9      |−2.22| = 2.22
10     |−1.22| = 1.22
11     |−0.22| = 0.22
12     |0.78| = 0.78
14     |2.78| = 2.78
14     |2.78| = 2.78
15     |3.78| = 3.78
       ∑|x − X̄| = 20.22
Hence, M.D. from Mean = $\dfrac{\sum |x-\bar{X}|}{n}$ = 20.22/9 = 2.246.
Further, coefficient of M.D. from mean = $\dfrac{\text{M.D. from mean}}{\text{Mean}}$ = 2.246/11.22 = 0.2002.
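(d). We have, S.D. $(\sigma) = \sqrt{\dfrac{\sum u^2}{n} - \left[\dfrac{\sum u}{n}\right]^2}$ where, u = x − a.
x      u = x − a      u²
14     3              9
15     4              16
       ∑u = 2         ∑u² = 60
Now, $\sigma = \sqrt{\dfrac{\sum u^2}{n} - \left[\dfrac{\sum u}{n}\right]^2} = \sqrt{\dfrac{60}{9} - \left[\dfrac{2}{9}\right]^2} = \sqrt{6.667 - 0.049}$
= √6.617 = 2.572.
Coefficient of S.D. = $\dfrac{\sigma}{\bar{X}}$ = 2.572/11.22 = 0.229.
Coefficient of Variation (C.V.) = $\dfrac{\sigma}{\bar{X}} \times 100\%$ = 2.572/11.22 × 100% = 22.9%.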
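The four measures asked for in Ex.[1] can be cross-checked with a short Python sketch. It uses the same positional rule for quartiles as the solution (value of the (n+1)/4-th item, interpolating between neighbouring items); the helper name `item` is a hypothetical choice made for this illustration.

```python
import math

data = [9, 15, 7, 14, 11, 9, 12, 10, 14]
xs = sorted(data)
n = len(xs)

def item(pos):
    """Value of the pos-th item (1-based), interpolating between neighbours."""
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return xs[lo - 1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

rng = xs[-1] - xs[0]                                   # Range = L - S
coeff_range = (xs[-1] - xs[0]) / (xs[-1] + xs[0])      # (L - S) / (L + S)
q1 = item((n + 1) / 4)
q3 = item(3 * (n + 1) / 4)
qd = (q3 - q1) / 2                                     # semi-interquartile range
mean = sum(xs) / n
md = sum(abs(x - mean) for x in xs) / n                # M.D. from mean
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)   # S.D. about the mean

print(rng, round(coeff_range, 3), q1, q3, qd)
print(round(mean, 2), round(md, 3), round(sd, 3), round(sd / mean * 100, 1))
```

Because the sketch keeps full precision for the mean, its last digits can differ slightly from hand calculations that round the mean to 11.22 first.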
Distribution A                         Distribution B
Class interval    f     c.f.           Class interval    f     c.f.
Below 17.5        5     5              Below 12.5        8     8
17.5-22.5         20    25             12.5-17.5         12    20
22.5-27.5         15    40             17.5-22.5         13    33
27.5-32.5         10    50             22.5-27.5         22    55
32.5-37.5         5     55             27.5-32.5         5     60
Series A:
For Q1: we have, N/4 = 55/4 = 13.75. The c.f. just greater than 13.75 is 25. So, Q1
lies in the class 17.5-22.5.
But $Q_1 = L + \dfrac{\frac{N}{4} - c.f.}{f} \times h = 17.5 + \dfrac{13.75-5}{20} \times 5$ = 17.5 + 2.187 = 19.687.
For Q3: we have, 3×N/4 = 3×55/4 = 41.25. The c.f. just greater than 41.25 is 50. So,
Q3 lies in the class 27.5-32.5.
But $Q_3 = L + \dfrac{\frac{3N}{4} - c.f.}{f} \times h = 27.5 + \dfrac{41.25-40}{10} \times 5$ = 27.5 + 0.625 = 28.125
Now, Coefficient of Q.D. = $\dfrac{Q_3-Q_1}{Q_3+Q_1} = \dfrac{28.125-19.687}{28.125+19.687}$ = 8.438/47.812 = 0.176.
Series B
For Q1: we have, N/4 = 60/4 = 15. The c.f. just greater than 15 is 20. So, Q1 lies in
the class 12.5-17.5.
But $Q_1 = L + \dfrac{\frac{N}{4} - c.f.}{f} \times h = 12.5 + \dfrac{15-8}{12} \times 5$ = 12.5 + 2.917 = 15.417.
For Q3: we have, 3×N/4 = 3×60/4 = 45. The c.f. just greater than 45 is 55. So, Q3
lies in the class 22.5-27.5.
But $Q_3 = L + \dfrac{\frac{3N}{4} - c.f.}{f} \times h = 22.5 + \dfrac{45-33}{22} \times 5$ = 22.5 + 2.727 = 25.227
Now, Coefficient of Q.D. = $\dfrac{Q_3-Q_1}{Q_3+Q_1} = \dfrac{25.227-15.417}{25.227+15.417}$ = 9.810/40.644 = 0.241.
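The quartile interpolation used for Series A and B can be written once and reused. The sketch below is a minimal illustration: the function names are hypothetical, and the open first class ("Below ...") is assumed to have the same width of 5 as the other classes, which does not affect the result here because neither quartile falls in it.

```python
def grouped_quantile(lowers, width, freqs, p):
    """Quantile of a grouped series: L + (p*N - c.f.) / f * h."""
    n = sum(freqs)
    target = p * n
    cum = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lowers, freqs):
        if cum + f > target:  # first class whose c.f. exceeds p*N
            return lower + (target - cum) / f * width
        cum += f
    raise ValueError("p must be smaller than 1")

def coeff_qd(lowers, width, freqs):
    """Coefficient of Q.D. = (Q3 - Q1) / (Q3 + Q1)."""
    q1 = grouped_quantile(lowers, width, freqs, 0.25)
    q3 = grouped_quantile(lowers, width, freqs, 0.75)
    return (q3 - q1) / (q3 + q1)

# Lower class boundaries and frequencies of the two distributions.
coeff_a = coeff_qd([12.5, 17.5, 22.5, 27.5, 32.5], 5, [5, 20, 15, 10, 5])
coeff_b = coeff_qd([7.5, 12.5, 17.5, 22.5, 27.5], 5, [8, 12, 13, 22, 5])

print(round(coeff_a, 3), round(coeff_b, 3))
print("B is more variable" if coeff_b > coeff_a else "A is more variable")
```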
Dancer A: 38 39 39 36 38 40 39 41
Dancer B: 42 40 43 45 43 41 40 44
Dancer A                          Dancer B
x     u = x − 38    u²            x     v = x − 42    v²
38    0             0             42    0             0
39    1             1             40    −2            4
39    1             1             43    1             1
36    −2            4             45    3             9
38    0             0             43    1             1
40    2             4             41    −1            1
39    1             1             40    −2            4
41    3             9             44    2             4
      ∑u = 6        ∑u² = 20            ∑v = 2        ∑v² = 24
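Dancer A:
Now, $\bar{X} = a + \dfrac{\sum u}{n}$ = 38 + 6/8 = 38.75
Also, $\sigma = \sqrt{\dfrac{\sum u^2}{n} - \left[\dfrac{\sum u}{n}\right]^2} = \sqrt{\dfrac{20}{8} - \left[\dfrac{6}{8}\right]^2} = \sqrt{2.5 - 0.5625}$
= √1.9375 = 1.392.
C.V. of A = $\dfrac{\sigma}{\bar{X}} \times 100\% = \dfrac{1.392}{38.75} \times 100\%$ = 3.59%
Dancer B:
Now, $\bar{X} = a + \dfrac{\sum v}{n}$ = 42 + 2/8 = 42.25
Also, $\sigma = \sqrt{\dfrac{\sum v^2}{n} - \left[\dfrac{\sum v}{n}\right]^2} = \sqrt{\dfrac{24}{8} - \left[\dfrac{2}{8}\right]^2} = \sqrt{3 - 0.0625}$
= √2.9375 = 1.714.
C.V. of B = $\dfrac{\sigma}{\bar{X}} \times 100\% = \dfrac{1.714}{42.25} \times 100\%$ = 4.06%
Since the C.V. of dancer A is less than that of dancer B, A is the more consistent dancer.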
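The comparison of the two dancers can be reproduced with a small Python sketch of the short-cut method. The function name `cv_percent` is an assumption made for this illustration; the assumed means 38 and 42 are the ones used in the table.

```python
import math

def cv_percent(xs, a):
    """C.V. (%) via the short-cut method with assumed mean a."""
    n = len(xs)
    us = [x - a for x in xs]                      # deviations from the assumed mean
    mean = a + sum(us) / n
    sigma = math.sqrt(sum(u * u for u in us) / n - (sum(us) / n) ** 2)
    return sigma / mean * 100

dancer_a = [38, 39, 39, 36, 38, 40, 39, 41]
dancer_b = [42, 40, 43, 45, 43, 41, 40, 44]

cv_a = cv_percent(dancer_a, a=38)
cv_b = cv_percent(dancer_b, a=42)
more_consistent = "A" if cv_a < cv_b else "B"

print(round(cv_a, 2), round(cv_b, 2), more_consistent)
```

The smaller C.V. identifies the more consistent series, whatever assumed mean is chosen.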
Number of logs : 50 44 28 12 5
If there is no any log having weight more than 50 kg, find out the mean, standard
deviation, coefficient of standard deviation and C.V.
Solution:
Class     Mid-value x     f     u' = (x − 25)/10     fu'     fu'²
0-10      5               6     −2                   −12     24
10-20     15              16    −1                   −16     16
20-30     25              16    0                    0       0
30-40     35              7     1                    7       7
40-50     45              5     2                    10      20
                          ∑f = N = 50                ∑fu' = −11    ∑fu'² = 67
We have, mean
$\bar{X} = a + \dfrac{\sum f u'}{N} \times h = 25 + \dfrac{-11}{50} \times 10$ = 25 − 2.2 = 22.8 kgs.
Finally, Coefficient of Variation C.V. = $\dfrac{\sigma}{\bar{X}} \times 100\% = \dfrac{11.36}{22.8} \times 100\%$ = 49.8%
Ex.[5] The marks of (a) 8 students of a class and (b) all 80 students of that
class in two subjects are as follows. Recommend, by a suitable method, which
subject has the more uniform marks in each case.
Marks in Cytology: 42 27 34 30 36 30 34 31
(b).
X      x − X̄    (x − X̄)²      Y      y − Ȳ    (y − Ȳ)²
34     0        0             42     9        81
43     9        81            27     −6       36
28     −6       36            34     1        1
35     1        1             30     −3       9
30     −4       16            36     3        9
37     3        9             30     −3       9
29     −5       25            34     1        1
36     2        4             31     −2       4
∑x = 272        ∑(x − X̄)² = 172      ∑y = 264        ∑(y − Ȳ)² = 150
Now, Mean marks in Statistics = $\dfrac{\sum x}{n}$ = 272/8 = 34
Mean marks in Cytology = $\dfrac{\sum y}{n}$ = 264/8 = 33
S.D. of Marks in Statistics, $\sigma_x = \sqrt{\dfrac{\sum (x-\bar{X})^2}{n}} = \sqrt{\dfrac{172}{8}} = \sqrt{21.5}$ = 4.64
S.D. of Marks in Cytology, $\sigma_y = \sqrt{\dfrac{\sum (y-\bar{Y})^2}{n}} = \sqrt{\dfrac{150}{8}} = \sqrt{18.75}$ = 4.33
C.V. of marks in Statistics, (C.V.)x = $\dfrac{\sigma_x}{\bar{X}} \times 100\% = \dfrac{4.64}{34} \times 100\%$ = 13.64%.
C.V. of marks in Cytology, (C.V.)y = $\dfrac{\sigma_y}{\bar{Y}} \times 100\% = \dfrac{4.33}{33} \times 100\%$ = 13.12%.
Here, C.V. of marks in Statistics, (C.V.)x > C.V. of marks in Cytology, (C.V.)y, so
the marks in Cytology are more uniform than those in Statistics.
(b).
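S.D. $(\sigma) = h \times \sqrt{\dfrac{\sum f u'^2}{N} - \left[\dfrac{\sum f u'}{N}\right]^2} = 10 \times \sqrt{\dfrac{129}{80} - \left[\dfrac{13}{80}\right]^2} = 10 \times \sqrt{1.612 - 0.0264}$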
Since, C.V. of marks in Statistics < C.V. of marks in Cytology, the marks in
statistics is more uniform than that of Cytology.
SKEWNESS
The word skewness refers to lack of symmetry. A distribution that is not
symmetrical is called a skewed distribution. In a symmetrical distribution, the
values of mean, median and mode coincide.
c. the curve drawn from the frequency distribution is not a bell shape type.
TYPES OF SKEWNESS :
a. NO SKEWNESS or SYMMETRICAL
A distribution of data is said to have no skewness if the curve drawn from the data is
neither elongated more to the left nor to the right side. In this case, the curve is
equally elongated to the right as well as to the left, and it behaves asymptotically
at both ends, i.e. an asymptotic tail appears on both sides. In
other words, the distribution has no skewness if MEAN = MEDIAN = MODE.
b. POSITIVELY SKEWED
A distribution of data is said to have right skewed or positively skewed if, the
curve drawn from the data is more elongated to the right side. In this case, we can
observe that:MEAN > MEDIAN > MODE.
C. NEGATIVELY SKEWED
If the curve drawn from the data is more elongated to the left side, such
distribution of data is said to have left skewed or negatively skewed. In this case
we can observe that : MEAN < MEDIAN < MODE.
2. The data when plotted on the graph, the normal bell-shaped form is obtained.
The asymptotic behavior is seen on both side tail.
3. Sum of the positive deviations from the median is equal to the sum of the
negative deviations.
5. Frequencies are equally distributed at the point of equal deviations from the
mode.
MEASURES OF SKEWNESS :
= 3 ( mean – median )
= Q3 + Q1 – 2. Md
The absolute measure of skewness are not widely used because they stand for
original units. The coefficient of skewness (which is the relative measure)
2. -1≤ SKb ≤ 1
EX.[2]. Compare the skewness of the wages distributed by two companies P & Q.
Also comment on the result. Some information determined with the data of wages
for the workers of P & Q are as follows:
P 60 65 10
Q 65 68 12
EX.[3]. Find the mean and mode of the frequency distribution, which gives the
following results : C.V.=7.5, S.D.=2 and Karl Pearson’s coefficient of skewness
(Sk)=0.5.
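Solution: We know, C.V. = $\dfrac{\sigma}{\bar{X}} \times 100\%$
Hence, $\bar{X} = \dfrac{\sigma}{C.V.} \times 100$ = 200/7.5 = 26.67
Also we know, $Sk = \dfrac{\bar{X} - Mo}{\sigma}$ or, $0.5 = \dfrac{26.67 - Mo}{2}$
∴ Mo = 26.67 − 0.5 × 2 = 25.67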
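The two rearrangements used in EX.[3] can be checked numerically. This is only a sketch; the function name `mean_and_mode` is a hypothetical helper, not something defined in the notes.

```python
def mean_and_mode(cv_percent, sigma, sk):
    """Recover mean and mode from C.V. (%), S.D. and Karl Pearson's Sk."""
    mean = sigma / cv_percent * 100   # from C.V. = sigma / mean * 100
    mode = mean - sk * sigma          # from Sk = (mean - mode) / sigma
    return mean, mode

mean, mode = mean_and_mode(cv_percent=7.5, sigma=2, sk=0.5)
print(round(mean, 2), round(mode, 2))
```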
EX.[4]. In a certain distribution, the following results are obtained: Mean (𝑥̅ ) =45,
Median (Mo) =48, Coefficient of skewness (Sk) =-0.4. Find s.d. ( 𝜎) and the
coefficient of variation. [solve it !]
EX.[5]. In a moderately skewed frequency distribution, the mean is 10 kg and the
median is 8.50 kg. Find Pearson’s coefficient of skewness of the distribution if
the coefficient of variation is 20%. [solve!]
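Here, Mean, $\bar{X} = \dfrac{\sum X}{n}$ = 452/10 = 45.2
S.D. $(\sigma) = \sqrt{\dfrac{1}{n}\sum X^2 - \left[\dfrac{\sum X}{n}\right]^2} = \sqrt{\dfrac{1}{10} \times 24270 - \left[\dfrac{452}{10}\right]^2} = \sqrt{2427 - 2043.04}$
= √383.96 = 19.59
Hence, Sk(P) = $\dfrac{\text{Mean} - \text{Mode}}{\text{S.D.}}$ = [45.2-47.7]/19.59 = 0.08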
EX.[7]. Find out the Bowley’s coefficient of skewness [or skewness based on
quartiles] from the following data of the length of the log ‘inch’ on a chatta.
Length of log (inch):   Less than 36   36-41   42-47   48-53   54-59   60-65   66 & above
No. of logs (f):        17             23      39      38      27      19      12
Solution:
Hence, median lies in the class 48-53 with real limit 47.5-53.5
Therefore, $Md = L + \dfrac{\frac{N}{2} - cf}{f} \times h = 47.5 + \dfrac{87.5-79}{38} \times 6$ = 48.84
In the same way, Q3 lies in the class 54-59 with real limit 53.5-59.5
Therefore, $Q_3 = L + \dfrac{\frac{3N}{4} - cf}{f} \times h = 53.5 + \dfrac{131.25-117}{27} \times 6$ = 56.67
= $\dfrac{98.75-97.68}{14.59}$ = 1.07/14.59 = 0.073
Hence, median lies in the class 28-32 with real limit 27.5-32.5
Therefore, $Md = L + \dfrac{\frac{N}{2} - cf}{f} \times h = 27.5 + \dfrac{27.5-22}{16} \times 5$ = 29.22
In the same way, Q3 lies in the class 33-37 with real limit 32.5-37.5
Therefore, $Q_3 = L + \dfrac{\frac{3N}{4} - cf}{f} \times h = 32.5 + \dfrac{41.25-38}{9} \times 5$ = 34.31
= $\dfrac{0.995}{9.185}$ = 0.11
EX.[9]. The coefficient of skewness for a certain distribution is -0.8. Calculate the
Median, if the upper and lower quartiles are respectively 56.6 & 44.1.
Or, $-0.8 = \dfrac{100.7 - 2\,Md}{12.5}$ or, −0.8 × 12.5 = 100.7 − 2 Md
∴ Median, Md =55.35
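Solving Bowley's formula for the median is a one-line rearrangement, sketched here in Python. The function name `median_from_bowley` is an illustrative assumption.

```python
def median_from_bowley(sk, q1, q3):
    """Rearrange Sk(B) = (Q3 + Q1 - 2 Md) / (Q3 - Q1) for Md."""
    return (q3 + q1 - sk * (q3 - q1)) / 2

md = median_from_bowley(sk=-0.8, q1=44.1, q3=56.6)
print(round(md, 2))
```

Substituting the result back into the formula returns the given coefficient of skewness, which is a convenient check on hand calculations.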
Solution: Here, Sk(B) =0.6, Q1+Q3 =100, Median, Md =38, Q1 =?, Q3=?
We know, $Sk(B) = \dfrac{Q_3 + Q_1 - 2\,Md}{Q_3 - Q_1} = \dfrac{100 - 2 \times 38}{Q_3 - Q_1}$
Now, $Mode = L + \dfrac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h = 30 + \dfrac{23-18}{2 \times 23 - 18 - 19} \times 10$ = 35.56
EX.[12]. The following data represents the height of the trees in a garden.
Calculate Karl Pearson’s coefficient of skewness from the given data:
Since, maximum frequency ‘f1’=82, the mode lies on the class interval 28-35.
So, Mode, $Mo = L + \dfrac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h = 28 + \dfrac{82-42}{2 \times 82 - 42 - 71} \times 7$ = 33.49.
= 7 x √3.473 = 13.045
Hence, Sk(P) = -0.31, which shows that the distribution is negatively skewed.
EX.[13]. From the following distribution, find out the Karl Pearson’s coefficient
of skewness.
Moment
Let the symbol ‘x’ represents the deviation of any item in a distribution from
the arithmetic mean of that distribution. The arithmetic mean of the various
powers of these deviations in any distribution is called the moment of the
distribution.
The arithmetic mean of the 1st power of the deviations is known as the
first moment about the mean.
The arithmetic mean of the squares of the deviations is known as the second
moment about the mean.
The arithmetic mean of the cubes of the deviations is known as the third
moment about the mean.
The moment about the mean is known as the central moment. These are
denoted by the Greek letter µ (mu). Thus the symbols µ1, µ2, µ3 etc. represent the
first moment, second moment, third moment respectively.
$\mu_1 = \dfrac{\sum (X-\bar{X})}{n} = \dfrac{\sum x}{n}$
µ1 = 0, since the sum of the deviations of items from their arithmetic mean is
always zero.
$\mu_2 = \dfrac{\sum (X-\bar{X})^2}{n} = \dfrac{\sum x^2}{n}$ {$\mu_2 = \sigma^2$, and $\sigma = \sqrt{\mu_2}$}
$\mu_3 = \dfrac{\sum (X-\bar{X})^3}{n} = \dfrac{\sum x^3}{n}$
In discrete series, $\mu_1 = \dfrac{\sum f(X-\bar{X})}{N} = \dfrac{\sum f x}{N}$
$\mu_2 = \dfrac{\sum f(X-\bar{X})^2}{N} = \dfrac{\sum f x^2}{N}$
$\mu_3 = \dfrac{\sum f(X-\bar{X})^3}{N} = \dfrac{\sum f x^3}{N}$
$\mu_4 = \dfrac{\sum f(X-\bar{X})^4}{N} = \dfrac{\sum f x^4}{N}$
TWO IMPORTANT CONSTANTS : $\beta_1 = \dfrac{\mu_3^2}{\mu_2^3}$ and $\beta_2 = \dfrac{\mu_4}{\mu_2^2}$
#note: The odd moments are always zero in symmetrical distribution, however this
rule doesn’t hold in the case of asymmetric distribution.
$\gamma_1 = \sqrt{\beta_1} = \sqrt{\dfrac{\mu_3^2}{\mu_2^3}} = \dfrac{\mu_3}{\sqrt{(\sigma^2)^3}} = \dfrac{\mu_3}{\sigma^3}$
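The central moments and the constants β1, β2 and γ1 can be computed directly from their definitions. The Python sketch below uses a small symmetric data set chosen only for illustration (it is not taken from the notes), so the odd moments come out as zero, in line with the note above.

```python
def central_moment(xs, k):
    """k-th moment about the mean: sum (X - mean)^k / n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n

xs = [2, 3, 4, 5, 6]  # illustrative symmetric data

mu1, mu2, mu3, mu4 = (central_moment(xs, k) for k in (1, 2, 3, 4))
beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2
gamma1 = mu3 / mu2 ** 1.5   # mu3 / sigma^3, keeps the sign of mu3

print(mu1, mu2, mu3, mu4)
print(beta1, beta2, gamma1)
```

Computing γ1 as µ3/σ³ rather than as the square root of β1 preserves the sign of the skewness.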
KURTOSIS
#If a curve is more flat than the normal curve, it is called platykurtic.
MEASURE OF KURTOSIS:
a. By Moment:
The measure of kurtosis depends upon the coefficient $\beta_2$, where $\beta_2 = \dfrac{\mu_4}{\mu_2^2}$.
#Note:1. If 𝛽2 > 3, the curve is more peaked than the normal curve.Such a curve is
leptokurtic.
#Note:2. If 𝛽2 < 3, the curve is less peaked than the normal curve.Such a curve is
platykurtic
The measure of kurtosis based on quartiles and deciles (or percentiles) is defined as:
$k = \dfrac{1}{2} \cdot \dfrac{Q_3 - Q_1}{D_9 - D_1} = \dfrac{1}{2} \cdot \dfrac{Q_3 - Q_1}{P_{90} - P_{10}}$
EX.[1]. Calculate the percentile coefficient of kurtosis from the following data also
interpret the result.
60-70 14 24
70-80 18 42
80-90 24 66
90-100 16 82
100-110 12 94
110-120 6 100
N=100
Hence the curve or distribution is leptokurtic (which is nearly normal but not
normal, being the value of k =0.2685).
EX.[2]. Calculate the percentile coefficient of kurtosis from the following data of
weekly increment of 100 fodder plants of the same species in a certain plantation
area. Also interpret the result.
For Q1: N/4 = 25, then Q1 lies in the class interval (c.i.) 120-130. So
$Q_1 = L + \dfrac{\frac{N}{4} - c.f.}{f} \times h = 120 + \dfrac{25-24}{18} \times 10$ = 120.56
Hence the curve or distribution is leptokurtic (which is nearly normal but not
normal, being the value of k =0.269).
CORRELATION ANALYSIS.
If two quantities vary in such a way that the movement in one are accompanied by
movement in the other, these quantities are correlated. Generally, correlation is
Examples of correlations: The amount of rain fall and the volume of production of
certain commodity, age and blood pressure of people in a certain community, the
increase(or decrease) of price accompanied by increase (or decrease) in the
quantities demanded, the amount of production of wheat under the amount of rain
fall as well as amount of chemical fertilizer used etc are the examples of
correlation.
Cause and effect relationship: In the study of correlation, the concept of cause and
effect relationship is important. The variable which makes the other variable
change is called the cause variable and the resulting variable is known as the effect
variable. There may be more than one cause variable affecting a single variable.
Mathematically, the dependent variable (effect) Y and independent variable (cause)
X can be related as
Y = f(X)----------------(i)
The first case is denoted for simple correlation and the second case for multiple
correlation.
TYPES OF CORRELATION :
a. Positive and negative correlation: If both the variables move in the same direction,
i.e. an increase (or decrease) in the value of one variable results in an increase (or
decrease) in the value of the other variable(s), the correlation is said to be positive. On
the other hand, if both the variables move in opposite directions, then the
b. Linear and non-linear correlation: When a unit change in one variable results in a
constant change in the other variable over the entire range of values, the correlation
is said to be linear. In other words, the amount of change in one variable bears a
constant ratio to the amount of change in the other variable. Otherwise, the
correlation is said to be non-linear.
For example:
Fertilizer used (kg) x:    25   30   35   40   45   50
Production (kg) y:         100  125  150  175  200  225

Age (in years) x:          20   30   40   50   60   70
Blood pressure (mm) y:     115  125  130  132  140  135
The first table shows the linear correlation and the second shows the non-linear
correlation.
The correlation between two variables is known as simple correlation. Partial
or multiple correlation arises when more than two variables are involved.
In multiple correlation, the correlation among three or more
variables is studied simultaneously. But in partial correlation, the
correlation between any two variables is studied while one or
more of the other variables are held constant. For example, (i) the yield of rice per
acre against both amount of rainfall and amount of fertilizer used is a case of
multiple correlation. (ii) If we want to study the effect of quality of seeds, chemical
fertilizer used, and soil fertility on the production of certain crop .
Another method of showing the relation between two variables is the scatter
diagram. The simplest device for ascertaining whether two variables are related is to
prepare a dot chart called a scatter diagram. To draw a scatter diagram, the variables
x and y are plotted along the x- and y-axis respectively of the graph paper and each
corresponding pair (x, y) is plotted as a dot on the graph.
GRAPHIC METHOD:
Individual values of the two variables are plotted on the graph paper. Two curves, one
for X and one for Y, are drawn. By examining the direction and closeness of the two
curves so drawn, we can see whether or not the variables are related. This method is
usually applied in the case of time series.
where, d1= X-A, d2= Y-B, Also, A, B are assumed mean of series X and Y
respectively
Where, $d_1' = \dfrac{X-A}{h_1}$, $d_2' = \dfrac{Y-B}{h_2}$, A and B are assumed means of series X and Y
respectively, h1 and h2 are class sizes of series X and Y respectively.
Where d1 = X – A1, d2 = Y –A2 (A1 and A2 are assumed mean of series X and Y
respectively.)
[Where, A1 and A2 are assumed mean of series X and Y respectively. h1, h2 are
common factors of series X and Y respectively.]
4. The value of r is symmetrical with respect to series X and Y. i.e., r12 = r21.
Probable error:
The probable error [P.E.] of the correlation coefficient is applicable for the
measurement of reliability of the computed value of the correlation coefficient ‘r’
which is defined as:
P.E. = $0.6745 \times \dfrac{1-r^2}{\sqrt{N}}$ = 0.6745 × S.E.
Note:[i]. If r < P.E., the value of r is not significant, no matter how high r value is.
Ex.[1]. The following data gives the marks of 10 students in mathematics and
statistics as follows:
Student                1    2    3    4    5    6    7    8    9    10
Marks in maths.        45   70   65   30   90   40   50   75   85   60
Marks in statistics    35   90   70   40   95   40   60   80   80   50
Find the correlation coefficient by
X      Y      X²      Y²      XY
45     35     2025    1225    1575
70     90     4900    8100    6300
65     70     4225    4900    4550
30     40     900     1600    1200
90     95     8100    9025    8550
40     40     1600    1600    1600
50     60     2500    3600    3000
75     80     5625    6400    6000
85     80     7225    6400    6800
60     50     3600    2500    3000
∑X = 610, ∑Y = 640, ∑X² = 40700, ∑Y² = 45350, ∑XY = 42575
= $\dfrac{425750-390400}{\sqrt{407000-372100}\,\sqrt{453500-409600}} = \dfrac{35350}{186.81 \times 209.52}$ = 0.9031
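The same computation can be sketched in Python from the raw sums, together with the probable error defined earlier. The function names are illustrative assumptions; only the standard library is used.

```python
import math

def pearson_r(xs, ys):
    """r = (n*SXY - SX*SY) / (sqrt(n*SXX - SX^2) * sqrt(n*SYY - SY^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    return num / den

def probable_error(r, n):
    """P.E. = 0.6745 * (1 - r^2) / sqrt(N)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

maths = [45, 70, 65, 30, 90, 40, 50, 75, 85, 60]
stats = [35, 90, 70, 40, 95, 40, 60, 80, 80, 50]

r = pearson_r(maths, stats)
pe = probable_error(r, len(maths))
print(round(r, 4), round(pe, 4))
```

Comparing r with its probable error gives a rough indication of whether the computed correlation is significant.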
It was later found, at the time of checking, that he had copied down two pairs of
observations as (6,14) and (8,6) instead of the correct values (8,12) and (6,8). Obtain
the correct value of the correlation coefficient between X and Y.
∑ 𝑋=125-6-8+6+8=125, ∑ 𝑌=100-14-6+12+8=100
∑ 𝑋𝑌=508-6x14-8x6+8x12+6x8 = 520
Ex.[3]. Find the total number of pairs of observations from the following
information
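∑xy = 60, r = 0.8, standard deviation of Y = 2.5 and ∑x² = 90, where x and y are the
deviations taken from their respective arithmetic means.
Solution:
Or, $r = \dfrac{\sum xy}{\sqrt{\sum x^2}\,\sqrt{\sum y^2}}$ or, $0.8 = \dfrac{60}{\sqrt{90}\,\sqrt{\sum y^2}}$
so that $\sum y^2 = \left[\dfrac{60}{0.8 \times \sqrt{90}}\right]^2$ = 62.5.
∴ $n = \dfrac{\sum y^2}{\sigma_y^2} = \dfrac{62.5}{(2.5)^2}$ = 10. Hence, total no. of observations n = 10.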
Ex.[4]. Calculate the Karl Pearson’s coefficient of correlation and interpret the
result of studying hour per day and marks obtained (out of full marks 20) of the
following 5 students:
Student A B C D E
Studying hour per day (X) 2 3 4 5 6
Marks obtained (Y) 7 9 10 14 15
Solution:
X      Y      x = X − X̄ = X − 4     x²     y = Y − Ȳ = Y − 11     y²     xy
2      7      −2                    4      −4                     16     8
3      9      −1                    1      −2                     4      2
4      10     0                     0      −1                     1      0
5      14     1                     1      3                      9      3
6      15     2                     4      4                      16     8
∑X = 20   ∑Y = 55   ∑x = 0   ∑x² = 10   ∑y = 0   ∑y² = 46   ∑xy = 21
We have, $\bar{X} = \dfrac{\sum X}{n} = \dfrac{20}{5}$ = 4 and $\bar{Y} = \dfrac{\sum Y}{n} = \dfrac{55}{5}$ = 11
Now, correlation coefficient, $r = \dfrac{\sum xy}{\sqrt{\sum x^2}\,\sqrt{\sum y^2}} = \dfrac{21}{\sqrt{10}\,\sqrt{46}}$ = 0.98
∴ r = 0.98. This shows that there is almost perfect positive correlation between the two
series.
X: 25 18 32 21 35 29
Y: 16 11 20 15 26 28
Determine the Karl Pearson’s coefficient of correlation and interpret the result.
Solution: using short cut method, take A=25(assumed mean of series X) and
B=20(assumed mean of series Y). Put U=X-25 and V=Y-20.
X Y U=X-25 V=Y-20 U2 V2 UV
25 16 0 -4 0 16 0
18 11 -7 -9 49 81 63
32 20 7 0 49 0 0
21 15 -4 -5 16 25 20
35 26 10 6 100 36 60
29 28 4 8 16 64 32
Hence, there is a very high degree of positive correlation between radio time (X)
and the forest product sold (Y).
Where, D = R1 – R2
m = no. of times an item is repeated. The no. of times the first item is repeated is m1,
the no. of times the second item is repeated is m2, and so on.
Ex.[1]. The ranking of 10 students in two subjects statistics and wild life
management are as follows:
Statistics:          3   5   8   4   7   10   2   1    6   9
Wild life mgnt.:     6   4   9   8   1   2    3   10   5   7
Solution: Let R1 be rank of marks in statistics and R2 that of in subject wild life
management.
R1     R2     D² = (R1 − R2)²
3      6      9
5      4      1
8      9      1
4      8      16
7      1      36
10     2      64
2      3      1
1      10     81
6      5      1
9      7      4
              ∑D² = 214
Here, $r_s = 1 - \dfrac{6 \sum D^2}{N(N^2-1)} = 1 - \dfrac{6 \times 214}{1000-10}$ = 1 − 1.297 = −0.297
Ex.[2]. Two forest officers were requested to give their ranking of 7 different species
of medicinal plants for plantation in a certain community-based forest. The priority
rankings are as follows:
Species:        A   B   C   D   E   F   G
Forester M :    2   1   4   3   5   7   6
Forester N :    1   3   2   4   5   6   7
Calculate Spearman’s rank correlation coefficient. Also discuss whether the directions
of ranking are similar or opposite.
Solution: Let R1 and R2 denote the ranks given by the two foresters M and N
respectively.
R1     R2     D = R1 − R2     D²
2      1      1               1
1      3      −2              4
4      2      2               4
3      4      −1              1
5      5      0               0
7      6      1               1
6      7      −1              1
                              ∑D² = 12
Here, Spearman’s rank coefficient $r_s = 1 - \dfrac{6 \sum D^2}{N(N^2-1)} = 1 - \dfrac{6 \times 12}{343-7}$ = 0.786
Hence the ranking patterns of the two foresters are in the same direction
(rs being positive). Furthermore, rs is strictly positive (since 0.75 ≤ rs ≤ +1).
Solution: Let R1 and R2 represent the rank prepared on the basis of marks assigned
to students in ascending serial. The ranks are not repeated in this case.
37 3 35 3 0 0
38 4 30 2 2 4
25 1 25 1 0 0
27 2 50 7 -5 25
∑ 𝐷2 = 76
Here, Spearman’s rank coefficient, $r_s = 1 - \dfrac{6 \sum D^2}{N(N^2-1)} = 1 - \dfrac{6 \times 76}{1000-10}$ = 1 − 0.461 = 0.539
Hence, the ranking pattern of two judges are in the same direction with high level
of positive correlation.(Being rs positive and 0.5 ≤𝑟𝑠 ≤ 0.75).
Ex.[4]. Calculate the Spearman’s rank coefficient of correlation from the following
data (also interpret the result ):
X : 80 78 75 75 68 67 60 59
Y : 12 13 14 14 14 16 15 17
X      Rank R1     Y      Rank R2     D = R1 − R2     D²
80     8           12     1           7               49
78     7           13     2           5               25
75     5.5         14     4           1.5             2.25
75     5.5         14     4           1.5             2.25
68     4           14     4           0               0
67     3           16     7           −4              16
60     2           15     6           −4              16
59     1           17     8           −7              49
                                                      ∑D² = 159.5
Now, $r_s = 1 - \dfrac{6\left[\sum D^2 + \frac{m_1^3 - m_1}{12} + \frac{m_2^3 - m_2}{12} + \cdots\right]}{N(N^2-1)} = 1 - \dfrac{6\left[159.5 + \frac{2^3-2}{12} + \frac{3^3-3}{12}\right]}{8^3-8}$
$= 1 - \dfrac{6(159.5 + 0.5 + 2)}{504}$ = 1 − 1.929 = −0.929
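The tie-corrected rank formula can be sketched in Python as follows. The helper names are assumptions made for this illustration; ranks are assigned in ascending order (rank 1 to the smallest value) and tied values share the average of their positions, as in the table above.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = smallest; tied values share the mean of their positions."""
    positions = {}
    for i, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(i)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_tied(xs, ys):
    """r_s = 1 - 6 * (sum D^2 + sum (m^3 - m)/12) / (N (N^2 - 1))."""
    n = len(xs)
    r1, r2 = average_ranks(xs), average_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    # One correction term for every group of m tied values in either series.
    correction = sum((m ** 3 - m) / 12
                     for series in (xs, ys)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (d2 + correction) / (n * (n ** 2 - 1))

x = [80, 78, 75, 75, 68, 67, 60, 59]
y = [12, 13, 14, 14, 14, 16, 15, 17]

rs = spearman_tied(x, y)
print(round(rs, 3))
```

When no value is repeated the correction term is zero and the function reduces to the ordinary formula rs = 1 − 6∑D²/(N(N² − 1)).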
Ex.[5]. Two teaching methods A and B were applied to 11 pairs of students, such that
the students in a pair have approximately equal scores on an intelligence test. In each
pair, one student was taught by method A and the other by method B. The marks are
as follows.
REGRESSION ANALYSIS
“ Regression analysis is a statistical tool (or device) with the help of which we can
estimate or predict the unknown value of one variable corresponding to the known
value of another variable.”
The variable whose values are known is called the independent variable and the
variable whose value is estimated or predicted is called the dependent variable.
LINE OF REGRESSION:
Whenever there is a relationship between the two variables, the dots of the
scatter diagram will concentrate around a certain curve. If the curve is a straight line,
it is called the line of regression. If the dots of the scatter diagram do not lie exactly
on a straight line, the line of best fit is the required line of regression.
The line of best fit is obtained by the method of least squares. It is a line from
which the sum of the deviations of the various points on either side is equal to zero
and the sum of the squares of these deviations is minimum.
There are two lines of regression. The first is the line of regression of y on x and the
other is the line of regression of x on y. The line of regression of y on x gives
the most probable value of y corresponding to a given value of x. Similarly, the
line of regression of x on y gives the most probable value of x for a given value of
y. The two lines of regression intersect at the point $(\bar{x}, \bar{y})$, where $\bar{x}$ and $\bar{y}$ are
the arithmetic means of the two series x and y respectively.
If there is a wide gap between the two lines of regression, the correlation between the
two variables is low. If the two regression lines are close to each other, the correlation
between the two variables is high. Whenever the two lines of regression coincide, there
is perfect correlation between the two variables. If the two lines of regression intersect
at right angles, there is no correlation between the two variables.
REGRESSION EQUATION OF Y ON X:
Y = a + bx ……………………………………..(i)
𝛴xy = a 𝛴x +b 𝛴𝑥 2 ……………………………………[v].
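$= \dfrac{\frac{\sum xy}{n} - \frac{\sum x}{n} \cdot \frac{\sum y}{n}}{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2} = \dfrac{cov(x,y)}{\sigma_x^2}$ [since, $r = \dfrac{cov(x,y)}{\sigma_x \cdot \sigma_y}$]
$= r \cdot \dfrac{\sigma_x \sigma_y}{\sigma_x^2}$
Hence, $b_{yx} = r \cdot \dfrac{\sigma_y}{\sigma_x}$ ……………………………………….[vii]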
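2# The regression equation can easily be obtained when the deviations of the items
of the X and Y series are taken from assumed means A and B respectively. Then,
$\bar{X} = A + \dfrac{\sum u}{n}$, $\bar{Y} = B + \dfrac{\sum v}{n}$, $b_{yx} = \dfrac{n \sum uv - \sum u \cdot \sum v}{n \sum u^2 - (\sum u)^2} = r \cdot \dfrac{\sigma_y}{\sigma_x}$
And, $b_{xy} = \dfrac{n \sum uv - \sum u \cdot \sum v}{n \sum v^2 - (\sum v)^2} = r \cdot \dfrac{\sigma_x}{\sigma_y}$
3# Since $\sigma_x$ and $\sigma_y$ are not independent of scale, if we take the
deviations $u' = \dfrac{X-A}{h}$, $v' = \dfrac{Y-B}{k}$, then,
$b_{yx} = \dfrac{k}{h}\left[\dfrac{N \sum u'v' - \sum u' \cdot \sum v'}{N \sum u'^2 - (\sum u')^2}\right]$ and, $b_{xy} = \dfrac{h}{k}\left[\dfrac{N \sum u'v' - \sum u' \cdot \sum v'}{N \sum v'^2 - (\sum v')^2}\right]$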
[1]. The regression coefficients are independent of change of origin but not of
scale.
It means the correlation coefficient between two variables is equal to the geometric
mean of two regression coefficients.
Hence, the product of two regression coefficient is less than or equal to 1(one).
This implies that if one of the regression coefficient is greater than 1, the other is
obviously less than 1. In other words, both regression coefficient can not be greater
than 1 together.
[4]. Since r² = bxy . byx, we conclude that either both the coefficients are positive
or both are negative; otherwise, r² would be negative (which is impossible).
Furthermore, r, bxy and byx all have the same sign. [by property 2]
[5]. The arithmetic mean of two regression coefficients is greater than or equal to
r (provided r > 0),
i.e. $\dfrac{1}{2}\,[b_{xy} + b_{yx}] \geq r$.
[6]. Two regression lines intersects at (𝑋̅,𝑌̅), where 𝑋̅ and 𝑌̅denotes the arithmetic
means of two series respectively.
Ex.[1]. From the following data between the age of husbands and wives calculate
the regression equation and find the approximate age of husband when wife’s age
is 20 also predict the age of wife when age of husband is 30.
Or, Y − 29 = 0.915 (X − 24)
Y’ = 0.915x20+7.04 = 25.34 ≈ 25
So, X =1.02 Y – 5.58 is the required equation. Put Y=30, the projected age of wife
is,
Ex.[2]. Find two regression coefficients and regression equations from the
following information given below:
Regression coefficient of Y on X,
$b_{yx} = r \cdot \dfrac{\sigma_y}{\sigma_x} = 0.8 \times \dfrac{3.5}{2.5}$ = 1.12
X – 𝑋̅ = bxy (Y-𝑌̅)
(iii) the regression coefficient bxy and byx. (iv) the coefficient of correlation
between X and Y.
Solving [i] and [ii], we get, x = 5, y = 1/3. Since the regression lines intersect at
$(\bar{X}, \bar{Y})$, so $(\bar{X}, \bar{Y})$ = (5, 1/3).
If possible, let 3x+12y = 19 be the line of regression of x on y & 9x+3y = 46 be that
of y on x. So, [i] and [ii] are
Testing whether the product $b_{xy} \cdot b_{yx} \leq 1$ or not. Now, $b_{xy} \cdot b_{yx}$ = (−4).(−3) = 12 > 1
[impossible]. So our supposition is incorrect. Hence, line [i] is the equation of
regression of y on x. Then, [i] gives, y = −1/4 x + 19/12, ∴ byx = −1/4.
Finally, $r = -\sqrt{b_{xy} \cdot b_{yx}} = -\sqrt{\dfrac{1}{12}} = -\dfrac{1}{2\sqrt{3}}$ ≈ −0.289, which is acceptable since |r| < 1.
The negative root is taken because both regression coefficients are negative.
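The test on the product of the regression coefficients can be automated. In the Python sketch below each line is written as a tuple (p, q, c) standing for p·x + q·y = c; this representation and the function names are assumptions made for the illustration.

```python
import math

def slopes(line_y_on_x, line_x_on_y):
    """Regression coefficients implied by one assignment of the two lines."""
    p1, q1, _ = line_y_on_x
    p2, q2, _ = line_x_on_y
    byx = -p1 / q1   # y = -(p/q) x + c/q
    bxy = -q2 / p2   # x = -(q/p) y + c/p
    return byx, bxy

def classify(line1, line2):
    """Try line1 as y on x; swap the roles if byx * bxy would exceed 1."""
    byx, bxy = slopes(line1, line2)
    if byx * bxy > 1:            # impossible, because byx * bxy = r^2 <= 1
        byx, bxy = slopes(line2, line1)
    r = math.copysign(math.sqrt(byx * bxy), byx)  # r takes the sign of the coefficients
    return byx, bxy, r

byx, bxy, r = classify((9, 3, 46), (3, 12, 19))
print(byx, round(bxy, 4), round(r, 4))
```

The first assignment gives a product of 12, so the roles are swapped, which identifies 3x + 12y = 19 as the regression of y on x.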
X: 6 2 10 4 8
Y: 9 11 5 8 7
Here, $\bar{X} = \dfrac{\sum X}{n}$ = 30/5 = 6, $\bar{Y} = \dfrac{\sum Y}{n}$ = 40/5 = 8
Y-𝑌̅ = byx(X-𝑋̅)
X- 𝑋̅ = bxy (Y-𝑌̅)
Ex.[5]. The following table shows the age X and blood pressure Y of 8 persons.
X: 52 63 45 36 72 65 47 25
Y: 62 53 51 25 79 43 60 33
Obtain the regression equation of Y on X. also find the expected blood pressure of
a person who is 49 years old.Solution: Take assumed mean of series X and Y, as
A&B. Take A =50, B =50.
X      Y      U = X − 50     V = Y − 50     U²      V²      UV
52     62     2              12             4       144     24
63     53     13             3              169     9       39
45     51     −5             1              25      1       −5
36     25     −14            −25            196     625     350
72     79     22             29             484     841     638
65     43     15             −7             225     49      −105
47     60     −3             10             9       100     −30
25     33     −25            −17            625     289     425
∑X = 405   ∑Y = 406   ∑U = 5   ∑V = 6   ∑U² = 1737   ∑V² = 2058   ∑UV = 1336
Here, $\bar{X} = \dfrac{\sum X}{n}$ = 405/8 = 50.625, and, $\bar{Y} = \dfrac{\sum Y}{n}$ = 406/8 = 50.75
$b_{yx} = \dfrac{n \sum uv - \sum u \cdot \sum v}{n \sum u^2 - (\sum u)^2} = \dfrac{8 \times 1336 - 5 \times 6}{8 \times 1737 - 25}$ = 0.768
Or, Y − 50.75 = 0.768 (X − 50.625) ∴ Y = 11.87 + 0.768 X, is the required regression
equation.
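The short-cut computation of byx and the prediction asked for in the question can be sketched in Python as below. The function name `regression_y_on_x` is a hypothetical helper; the assumed means A = 50 and B = 50 are those used in the table.

```python
def regression_y_on_x(xs, ys, a, b):
    """Return (intercept, byx) of the line of Y on X using assumed means a, b."""
    n = len(xs)
    us = [x - a for x in xs]
    vs = [y - b for y in ys]
    su, sv = sum(us), sum(vs)
    suv = sum(u * v for u, v in zip(us, vs))
    suu = sum(u * u for u in us)
    byx = (n * suv - su * sv) / (n * suu - su ** 2)
    x_bar, y_bar = a + su / n, b + sv / n
    intercept = y_bar - byx * x_bar   # from Y - y_bar = byx (X - x_bar)
    return intercept, byx

age = [52, 63, 45, 36, 72, 65, 47, 25]
bp = [62, 53, 51, 25, 79, 43, 60, 33]

intercept, slope = regression_y_on_x(age, bp, a=50, b=50)
predicted = intercept + slope * 49   # expected blood pressure at age 49

print(round(slope, 3), round(intercept, 2), round(predicted, 1))
```

Because the slope is kept at full precision, the intercept printed here can differ in the second decimal place from a hand calculation that first rounds byx to 0.768.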
solving A and B, we get the mean of the series X and Y, so we get, the point of
intersection of two regression lines (X,Y)=(13,17). Hence, the mean value of series
X, (𝑋̅) =13. And mean value of series Y, (𝑌̅)=17.
(ii). To find the correlation coefficients, we have to obtain bxy and byx. We may
assume that equation A is regression of X on Y and B is regression of Y on X. so
by A, 8x =10y -66
Here, bxy .byx =[5/4].[20/9] =25/9 >1. Which is impossible. Then we can conclude
that equation A i.e. 8x-10y+66=0,is a regression equation of Y on X.
and, y=8/10 x + 6.6 ∴ byx =8/10 = 4/5
Ex.[7]. For 50 students of a class the regression equation of the marks in statistics
(X) on the marks on accountancy(Y) is 3y-5x+180=0. The mean marks in
accountancy is 44 and the variance of marks in statistics is [9/16]th of variance of
marks in accountancy. Find the mean marks in statistics and the coefficient of
correlation between the marks in two subjects.
Since, $b_{xy} = r \cdot \dfrac{\sigma_x}{\sigma_y}$ ∴ $r = b_{xy} \cdot \dfrac{\sigma_y}{\sigma_x}$ = 0.6 × 4/3 = 0.8
Ex.[8]. Using the following data, obtain the regression equation taking X as the
independent variable and find the most likely value of Y when X = 24.