
Copy of STATS NOTE Edited Upto Regression

The document outlines a comprehensive course of study in statistics, covering topics such as definitions, measures of central tendency and dispersion, correlation, regression, probability, and hypothesis testing. It also discusses the meaning, functions, importance, and limitations of statistics, as well as the concepts of population, sample, and variables. Additionally, it includes methods for diagrammatic and graphical presentation of data, with examples of various types of diagrams and their applications.

Uploaded by

awasthirohitr

Prepared by Prof. Balaram P. Sharma

COURSE OF STUDY (THEORY)

1. Origin, definition, scope and limitations of statistics.

2. Statistical notation, population and sample, parameter and statistic, variables.

3. Diagrammatic and graphical presentation of data.

4. Measures of central tendency: introduction, types, properties, merits, demerits and uses of mean, median and mode.

5. Measures of dispersion: introduction, types, properties, merits, demerits and uses of range, quartile deviation, mean deviation, standard deviation, coefficient of variation and Lorenz curve.

6. Measures of skewness, kurtosis, moments and their uses.

7. Correlation: introduction, simple linear correlation (scatter diagram, Karl Pearson's and Spearman's), properties of the correlation coefficient.

8. Regression: introduction, simple linear regression, properties of regression.

9. Probability: introduction, addition and multiplication theorems.

10. Conditional probability and Bayes' theorem.

11. Random variables, mathematical expectation, discrete and continuous probability distributions.

12. Discrete probability distributions: binomial, Poisson and multinomial.

13. Continuous probability distributions: uniform and normal.

14. Point and interval estimation.

15. Testing of hypothesis: Z-test, t-test, F-test, Chi-square test.

MEANING OF STATISTICS
The word 'Statistics' is derived from the Latin word 'status' and the Italian word
'statista' (meaning political state). In earlier days, statistics was regarded as a
science serving the needs of the state or administration.

DEFINITION: (SINGULAR SENSE)

1. NARROW DEFINITION:

Statistics is the science of counting, or the science of averages.

- A.L. Bowley

2. COMPREHENSIVE DEFINITION:

Statistics is the collection, presentation, analysis and interpretation of numerical data.

- Croxton and Cowden

3. OTHER DEFINITIONS:

➢ Statistics may rightly be regarded as a body of methods for taking wise decisions in the face of uncertainty.
- Roberts & Wallis
➢ Statistics is the method of decision making in the face of uncertainty on the basis of numerical data and calculated risks.
- Ya-Lun Chou
➢ Statistics refers to the body of techniques and methodology which has been developed for the collection, presentation and analysis of quantitative data and for the use of such data in decision making.
- Neter and Wasserman

DEFINITION: (PLURAL SENSE)

Statistics means quantitative information or numerical facts collected systematically with a definite objective.

# NARROW DEFINITION:

Webster defined statistics as the classified facts respecting the conditions of the people in a state.

# COMPREHENSIVE DEFINITION:

'By statistics, we mean aggregates of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a pre-determined purpose and placed in relation to each other.'

- Prof. Horace Secrist

FUNCTIONS OF STATISTICS:

The main functions of statistics are as follows:

➢ Statistics simplifies complexity.
➢ Statistics presents facts in a definite form.
➢ Statistics facilitates comparison.
➢ Statistics helps in the formulation of policies.
➢ Statistics helps in forecasting.
➢ Statistics helps in formulating and testing hypotheses.

IMPORTANCE OF STATISTICS:

Statistics is useful:

➢ in the administration of various states.
➢ to the government in framing policies.
➢ to businessmen.
➢ to economists.
➢ to research workers.
➢ in planning.
➢ in supervision.
➢ to bankers, insurance companies etc.
➢ in helping the development of the other sciences.

LIMITATIONS OF STATISTICS:

The following are the limitations of statistics:

➢ Statistics doesn't study individuals.
➢ Statistics doesn't study qualitative phenomena.
➢ Statistics deals with averages.
➢ Statistics is only a means.
➢ Statistics is liable to be misused.
➢ Statistical results are only approximately correct.

POPULATION AND SAMPLE:

Variable

An attribute that describes a person, place, thing or idea under study is known as a variable or
characteristic. A variable is usually denoted by X, Y, Xi, Yi etc. Examples of variables are the age of
the teachers in a college, marks obtained by students in a certain examination, height and weight
of people in a group, blood pressure, temperature (℃), rainfall (mm), education level, export
and import, caste and religion, hair color, eye color etc. The types of variables are: (a) qualitative
variables and (b) quantitative variables.

(a). Qualitative variables:

Qualitative variables (or categorical variables) take on values that are names or labels. Examples:
sex, place, country, nationality, religion, color of balls or marbles, breeds of cow, major tree
species in the Hilly region, major fruits in Gandaki province, crops produced in Narayani zone etc.

(b). Quantitative variables:

If the values of a variable can be expressed numerically, the variable is said to be a quantitative
variable. Since quantitative variables represent a measurable quantity, such data are
said to be numerical data or quantitative data.

Examples: height (cm, m, ft or inch), weight, blood pressure, rainfall, room
temperature, cost, revenue, export, import, profit, age, income, death count etc. Quantitative
variables should have definite units. There are two types of quantitative variables.

[i]. Discrete variables: A variable which takes only countable values (whole numbers) is known
as a discrete variable. Number of students, number of telephone calls in a day, number of vehicles
running in a day, number of planes grounded at an airport, family size etc. are examples of
discrete variables.

[ii]. Continuous variables: Continuous variables are those which can take all possible real values
within a certain range. Such real values may be whole numbers as well as fractional
numbers. Height, weight, blood pressure, rainfall, temperature, marks of a student and profit on a
certain commodity are examples of continuous variables.

To further clarify the difference between discrete and continuous variables, we take two
examples: (i) If we count the number of telephone calls received in a day or an hour, it may be
0 or 1 or 2 or 3 … (but not a fractional number such as 3.6 or 4.5). It will always be an integer or whole
number. Hence, the number of telephone calls is a discrete variable. (ii) If we measure the height of the
students in a class, it may be 4.2 ft. or 5.4 ft. or 5 ft. etc. Height may take an integral or a fractional
value. So the height of a student is a continuous variable.

Population and sample:

Population: A population, or statistical population, is the totality of cases under
investigation, that is, the entire group of individuals, members or items of interest in a study.
A population is finite if it is possible to count its individuals; it is also known as a countable
population. Sometimes it is not possible to count the units contained in the population.
Such a population is called an infinite or uncountable population.

For example, if we want to know the quality of the umbrellas produced by a factory, all the
umbrellas produced by the factory in a certain time period form the
population. All the patients in a certain hospital, all the flower plants in a garden, all the
teachers in a college and all the goats in a goat farm are examples of populations in different
studies. All of these are also examples of countable or finite populations. The number
of vehicles crossing the Rapti bridge (Hetauda) in a day is another example of a finite population.
The number of germs in the body of a COVID patient and the number of drops of a liquid chemical in a
beaker are treated as examples of uncountable or infinite populations.

If all the items or individuals can be regarded as the same type, such population is known
as homogeneous population. The heterogeneous population means the population containing
sub-population of different types of characteristics under study.

Target population: It is the population for which representative information is desired.

Sampling population: It is the population from which a sample is taken, as determined by the sample frame. The frame is merely a list of sampling units.

Sample:

Sample is some of the units selected from population. The process of selecting such units from
the population to draw conclusion about the population is known as sampling. So the sampling is
a process of choosing a representative portion of the population.

Parameter and Statistic:

Parameters are the statistical measures which describe or characterize a population.
In general, data for the entire population are not available, so we have to rely on samples
to arrive at conclusions about populations and to estimate the parameters. An estimate
of a population parameter is called a statistic. In other words, a statistical measure calculated from
the sample data, which serves as an estimate of a population parameter, is called a statistic.

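Some parameters and statistics

| Measure | Parameter (population of size $N$) | Statistic (sample of size $n$) |
|---|---|---|
| Mean | $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ | $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Variance | $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$ | $S^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{X})^2$ |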

DIAGRAMMATIC & GRAPHIC REPRESENTATION OF DATA

Diagrams and graphs are presentations of statistical data in the form of geometrical figures
like points, lines, bars, rectangles, circles etc.

The differences between diagrams and graphs are as follows:

| S.N. | Diagrams | Graphs |
|---|---|---|
| 1 | Diagrams are constructed on plain paper. | Graphs are constructed on graph paper. The principle of the graph (x- and y-axes) should be applied in construction. |
| 2 | May be one-, two- or three-dimensional. | Generally two-dimensional. |
| 3 | Construction is not so easy. | Construction is easier in comparison to diagrams. |
| 4 | Numerical data are presented by bars, rectangles, circles, cubes, cuboids, spheres etc. | Numerical data are presented in terms of points and lines. |
| 5 | Diagrams are used only for comparison. | Graphs help in studying the mathematical relationship (not necessarily functional) between two variables. |
| 6 | Diagrams are not used to present frequency distributions. | Graphs are more appropriate for presenting frequency distributions and time series. |
| 7 | Diagrams are attractive to look at but add nothing to the meaning of the data, so researchers and statisticians rarely use them. | Researchers and statisticians use graphs frequently to analyze data. |

Different types of diagram.

The following are the commonly used diagram types.

[i]. 1-dimensional diagram or Bar diagram.

[ii]. 2-dimensional diagram.

[iii]. 3-dimensional diagram.

[i]. One dimensional diagram or bar diagram.



The most common and easiest method of presenting the data is bar diagram. Bar diagram
consists of a rectangle or a set of rectangles corresponding to data in which the magnitudes or
values are represented by the length or height of the bar.

The following are the types of bar diagram:

(a). Simple bar diagram.

(b). Sub-divided or component bar diagram.

(c). Percentage bar diagram.

(d). Multiple bar diagram.

(a). Simple bar diagram.

The diagram for only one variable is a simple bar diagram. The magnitudes of the data are
presented through the height of the bars. It can be used for the comparative study of two or more
categories of a single variable. It consists of a set of equidistant rectangles having equal width.

Example [1]: The following is the production of some crops in a certain province of Nepal
in the fiscal year 2071/072:

| S.N. | Crop type | Quantity (tonnes) |
|---|---|---|
| 1 | Rice | 130,000 |
| 2 | Maize | 40,000 |
| 3 | Wheat | 25,000 |
| 4 | Others | 20,000 |

[Figure: Simple bar diagram showing the production of certain crops. x-axis: crop type (Rice, Maize, Wheat, Others); y-axis: quantity in tonnes, 0 to 140,000.]

(b). Component (or sub-divided) bar diagram

A sub-divided or component bar diagram is a bar diagram which presents two or
more components of a total. It is drawn when it is necessary to compare
the different components with one another and to see the relation of the components to
their total. In a component bar diagram, a rectangle whose length represents the total is drawn and then
divided into parts, each representing one component of the total. The components
can be colored or shaded differently.

From example 1:

[Figure: Sub-divided bar diagram showing the production of certain crops. A single bar for crops with total height 215,000, divided into Rice (130,000), Maize (40,000), Wheat (25,000) and Others (20,000).]

Example [2]: The following table gives the number of students who graduated in B.Sc. Forestry
from FoF, Agriculture and Forestry University, Nepal in the years 2074 and 2075.

| Year | Male | Female | Total |
|---|---|---|---|
| 2074 | 40 | 25 | 65 |
| 2075 | 35 | 40 | 75 |

[Figure: Sub-divided bar diagram showing the B.Sc. Forestry graduates. One bar per year: 2074 (Male 40, Female 25) and 2075 (Male 35, Female 40).]

(c). Percentage bar diagram

A sub-divided bar diagram that presents data in terms of percentages is known as a percentage bar
diagram. Every bar has the same height, representing 100%. To show relative changes in the data,
a percentage bar diagram is more appropriate than simple and sub-divided bar diagrams. The
percentage bar diagram of example [1] above is as below:

[Figure: Percentage bar diagram showing the production of certain crops. A single bar of height 100% divided into Rice (about 60.5%), Maize (about 18.6%), Wheat (about 11.6%) and Others (about 9.3%).]

The percentage bar diagram of example [2] above is as below:

[Figure: Percentage bar diagram showing the B.Sc. Forestry graduates. 2074: Male 61.54%, Female 38.46%; 2075: Male 46.67%, Female 53.33%.]
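
The percentages shown in the two percentage bar diagrams can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `to_percentages` is illustrative and not part of the notes.

```python
def to_percentages(components):
    """Convert a dict of component values to percentages of their total."""
    total = sum(components.values())
    return {name: 100 * value / total for name, value in components.items()}

# Example [1]: crop production in tonnes
crops = {"Rice": 130_000, "Maize": 40_000, "Wheat": 25_000, "Others": 20_000}
crop_pct = to_percentages(crops)

# Example [2]: B.Sc. Forestry graduates by year
pct_2074 = to_percentages({"Male": 40, "Female": 25})
pct_2075 = to_percentages({"Male": 35, "Female": 40})

for name, pct in crop_pct.items():
    print(f"{name}: {pct:.2f}%")
```

Within each bar the percentages add up to 100, which is why every bar in a percentage bar diagram has the same height.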

(d). Multiple bar diagram

We draw a multiple bar diagram when two or more variables are to be compared at the same time.
It is also a one-dimensional bar diagram. A set of adjacent bars is drawn, and different colors or
shades are used to differentiate the bars. The multiple bar diagram of example [1] above is as
follows:

[Figure: Multiple bar diagram showing the production of certain crops. Adjacent bars for Rice (130,000), Maize (40,000), Wheat (25,000), Others (20,000) and Total (215,000).]

Example [2] above is presented in multiple bar diagram as:

[Figure: Multiple bar diagram showing the B.Sc. Forestry graduates. Adjacent bars per year: 2074 (Male 40, Female 25, Total 65) and 2075 (Male 35, Female 40, Total 75).]

[ii]. Two dimensional diagram

In a two-dimensional diagram, the length as well as the width of each figure is considered, so that areas are compared.
In a rectangular diagram the areas of the bars are compared; in a circular diagram the areas of two or more circles are compared.
The rectangular diagram and the pie chart (or circular diagram) are examples of
two-dimensional diagrams.

(a). Rectangular diagram.

It is a modified form of the bar diagram. The only difference is that the areas of the
rectangles for different categories of the same variable are compared. The height or length of each
rectangle is determined by the data, the breadth or width is the same for each rectangle, and the gaps
between rectangles are kept equal.

(b). Circular diagram (or Pie chart).

A pie chart or circular diagram is a two-dimensional diagram. It is a popular and widely
used method, also known as an angular diagram. All the given values are converted into
central angles of a circle, and the sum of all the angles should be 360°. Each item is converted
into an angle as follows:

Consider the total angle = 360°.

$$\text{Required angle corresponding to given value } X = \frac{360^\circ}{\text{Total value}} \times X$$

Pie chart can be constructed with any convenient radius.
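
As an illustration of the angle conversion, the following Python sketch applies the formula to the crop data of example [1]. The helper name `pie_angles` is an assumption made for this sketch.

```python
def pie_angles(values):
    """Central angle (in degrees) for each item: 360 / total * value."""
    total = sum(values.values())
    return {name: 360 * value / total for name, value in values.items()}

# Example [1]: crop production in tonnes
crops = {"Rice": 130_000, "Maize": 40_000, "Wheat": 25_000, "Others": 20_000}
angles = pie_angles(crops)

for name, angle in angles.items():
    print(f"{name}: {angle:.2f} degrees")
```

The angles sum to 360°, so the sectors fill the circle exactly; Rice, with about 60% of the total, takes a little under 218°.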

The pie chart (or circular diagram) of example [1] above is drawn below:

[Figure: Pie chart showing the production of certain crops. Sectors: Rice (130,000), Maize (40,000), Wheat (25,000), Others (20,000).]

Graph of the frequency distribution

The following are the types of graphs:

[i]. Histogram.

[ii]. Frequency polygon.

[iii]. Frequency curve.

[iv]. Ogive or cumulative frequency curve.

[i]. Histogram

Presentation of a frequency distribution in graphical form is known as a 'histogram'. A histogram
consists of a set of adjacent vertical rectangles on the x-axis, with bases equal to the widths of the
corresponding class intervals and heights proportional to the corresponding class frequencies.
Hence, when the class intervals are of equal width, the area of each rectangle is proportional to the frequency of the corresponding class.

If only mid values are given, the upper and lower limits of each interval should be
obtained before drawing the histogram.

Example [i]: The following are the numbers of forest enterprises by annual income (in Rs. '000).

| Income (Rs. '000) | 0-15 | 15-30 | 30-45 | 45-60 | 60-75 | 75-90 |
|---|---|---|---|---|---|---|
| No. of forest enterprises | 12 | 20 | 25 | 40 | 30 | 15 |

[Figure: Histogram showing the income of forest enterprises. x-axis: income (Rs. '000) in classes 0-15 to 75-90; y-axis: no. of enterprises.]

[ii]. Frequency polygon.

Another graphical method of presenting a frequency distribution is the 'frequency polygon'.
To construct a frequency polygon, join the mid points of the tops of the histogram rectangles by
straight lines. To draw a frequency polygon without a histogram, plot
the mid points of the class intervals on the x-axis against the corresponding frequencies on the y-axis and join the points.
A frequency polygon is simpler than a histogram.

Note that the histogram, frequency polygon, frequency curve and ogive are drawn for
continuous series.

[Figure: Frequency polygon for income and number of forest enterprises. x-axis: income mid point (7.5, 22.5, 37.5, 52.5, 67.5, 82.5); y-axis: no. of forest enterprises.]

[iii]. Frequency curve.

A smooth freehand curve drawn through the vertices of a frequency polygon is known as
the frequency curve. The frequency curve smooths out the corners and peaks of the frequency
polygon in such a way that the area enclosed by the curve is the same as that of the
polygon, but the curve has no sharp edges. The frequency curve is the
limiting form of the frequency polygon when the number of observations becomes very large and the
class intervals are made smaller and smaller.

[iv]. Ogive or cumulative frequency curve.

A cumulative frequency curve or ogive is a graphic presentation of a cumulative frequency distribution.
It shows how many observations lie above or below a certain value, rather than
the number of observations within each interval. Ogives are helpful for locating the partition
values (median, quartiles, deciles and percentiles). There are two types of cumulative frequency
curve (ogive).

(a). Less than c.f. curve (less than ogive)

(b). More than c.f. curve (more than ogive)

Note: [1]. 'Cumulative frequency curve' or 'ogive' means the less than ogive, unless otherwise stated.

[2]. If both the more than ogive and the less than ogive are drawn on the same paper, they intersect at a
point. The foot of the perpendicular drawn from their point of intersection to the x-axis gives the
value of the median.

Examples of less than and more than ogives can be given from Example [i].

| Income less than (Rs.) | No. of enterprises | Income more than (Rs.) | No. of enterprises |
|---|---|---|---|
| 15000 | 12 | 0 | 142 |
| 30000 | 32 | 15000 | 130 |
| 45000 | 57 | 30000 | 110 |
| 60000 | 97 | 45000 | 85 |
| 75000 | 127 | 60000 | 45 |
| 90000 | 142 | 75000 | 15 |
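
The two cumulative frequency columns can be rebuilt from the class frequencies of Example [i]. The following Python sketch shows one way to do it; the variable names are illustrative.

```python
from itertools import accumulate

# Example [i]: income classes in Rs. '000 and their frequencies
lower_limits = [0, 15, 30, 45, 60, 75]
upper_limits = [15, 30, 45, 60, 75, 90]
freqs = [12, 20, 25, 40, 30, 15]

# Less than c.f.: running total of frequencies, read against upper limits
less_than_cf = list(accumulate(freqs))

# More than c.f.: running total taken from the last class backwards,
# read against lower limits
more_than_cf = list(accumulate(reversed(freqs)))[::-1]

for upper, cf in zip(upper_limits, less_than_cf):
    print(f"less than {upper}: {cf}")
for lower, cf in zip(lower_limits, more_than_cf):
    print(f"more than {lower}: {cf}")
```

The less than ogive plots `less_than_cf` against the upper class limits, and the more than ogive plots `more_than_cf` against the lower class limits.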
[Figure: More than ogive. x-axis: income more than (Rs. '000); y-axis: no. of enterprises.]

[Figure: Less than ogive. x-axis: income less than (Rs. '000); y-axis: no. of enterprises.]

[Figure: Less than ogive and more than ogive drawn on the same axes, intersecting at one point.]

MEASURES OF CENTRAL TENDENCY

The tendency of a large, unwieldy mass of data to cluster around the middle of the group is
known as central tendency. Its measures are used to describe the middle or centre of a data set. The
value obtained from a measure of central tendency can be regarded as a single
typical value representing the data. Measures of central tendency also enable us to compare two or more sets of
data. We may define central tendency as follows:

A measure of central tendency is a single value within the range of the data which represents a group of
individual values in a simple and concise manner and lies around the middle of the
distribution.

Requisites of a good Average

An ideal or suitable average should satisfy the following requirements:

• It should be rigidly defined and its value should be definite.
• It should be easy to calculate.
• It should be simple to understand.
• It should be based on all the observations.
• It should be suitable for further mathematical analysis or treatment.
• It should be least affected by fluctuations of sampling.
• It should be least affected by extreme observations.

Types of average:

A measure of central tendency is designed to locate the central value around which most of the
data tend to concentrate. The measures of central location or central tendency are as follows:

(i). Arithmetic Mean (A.M.)

(ii). Geometric Mean (G.M.)

(iii). Harmonic Mean (H.M.)

(iv). Median (Md)

(v). Mode (Mo)

Mean, median and mode are generally considered the major measures, while the geometric and
harmonic means are considered minor measures or means of transformed data. H.M. and G.M.
are used in typical cases only.

(i) ARITHMETIC MEAN:

It is the most commonly used and popular mean. The arithmetic mean is also known as the
arithmetic average. The 'arithmetic mean', or simply the 'mean', of a set of observations is the sum of
all the observations divided by the number of observations.

Individual series: Ungrouped data in which the value of each individual item is listed
is known as an individual series. Let $x_1, x_2, x_3, \ldots, x_n$ be the $n$ observations of the
variable X; then their arithmetic mean, denoted by $\bar{X}$, is defined as

$$\bar{X} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum x}{n} \qquad \text{[i]}$$

This formula for calculating the arithmetic mean is known as the 'direct method'.

When the number of observations is large and the items are big,
the arithmetic mean is calculated by taking the deviations of the items from an
arbitrary number. This method is known as the 'short cut method' or 'change of
origin method', and the arbitrary number is known as the 'assumed mean'. The
addition or subtraction of a constant is sometimes called change of origin. In this
case,

$$\bar{X} = a + \frac{d_1 + d_2 + d_3 + \cdots + d_n}{n} = a + \frac{\sum d}{n} \qquad \text{[ii]}$$

where $\bar{X}$ = arithmetic mean, $n$ = number of observations, $a$ = assumed mean, $d = x - a$, and $\sum d$ = sum of the deviations of the items taken from the assumed mean.
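
The direct method [i] and the short cut method [ii] can be compared with a small Python sketch. The marks below are hypothetical values chosen for illustration; they do not come from the notes.

```python
def mean_direct(values):
    """Direct method [i]: sum of observations divided by their number."""
    return sum(values) / len(values)

def mean_shortcut(values, assumed_mean):
    """Short cut method [ii]: assumed mean plus the mean deviation from it."""
    deviations = [x - assumed_mean for x in values]  # d = x - a
    return assumed_mean + sum(deviations) / len(values)

# Hypothetical marks of five students
marks = [45, 50, 55, 60, 72]
direct = mean_direct(marks)
shortcut = mean_shortcut(marks, assumed_mean=55)
print(direct, shortcut)
```

With a = 55 the deviations are -10, -5, 0, 5 and 17, whose sum is 7, so both methods give 55 + 7/5 = 56.4. Any other choice of assumed mean leads to the same result.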

In discrete series:

If $f_1, f_2, f_3, \ldots, f_n$ are the frequencies corresponding to the variate values $x_1, x_2, x_3, \ldots, x_n$ respectively, then the arithmetic mean $\bar{X}$ is defined as

$$\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + f_3 + \cdots + f_n} = \frac{\sum fx}{\sum f} = \frac{\sum fx}{N} \qquad \text{[iii]}$$

where $\sum f = N$ = total frequency. This method of finding the A.M. in a discrete series is known as the direct method.

In a discrete series, the A.M. can also be obtained by taking an assumed mean 'a', as follows:

$$\bar{X} = a + \frac{\sum fd}{\sum f} = a + \frac{\sum fd}{N} \qquad \text{[iv]}$$

where $a$ = assumed mean, $N$ = total frequency, and $d = x - a$ = deviation of the item taken from the assumed mean 'a'.

In continuous series: A continuous series can be reduced to a discrete series by taking the mid-value of each class interval as the variable x. So the
'direct method' [iii] and the 'assumed mean method' [iv] are both applicable to
continuous series. Addition or subtraction of a constant is sometimes called change
of origin, whereas multiplication or division by a constant is termed change of
scale. The formula for calculating the A.M. by changing both origin and scale is known
as the 'step deviation method', which is as follows:

$$\bar{X} = a + \frac{\sum fd'}{\sum f} \times h = a + \frac{\sum fd'}{N} \times h \qquad \text{[v]}$$

where $a$ = assumed mean, $N = \sum f$ = total frequency, $d' = \frac{x - a}{h}$, and $h$ = class size.

In other symbols, the formulas for the direct, short cut and step deviation methods are
summarized as:

Direct method [individual series]: $\bar{X} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum x}{n}$

Direct method [discrete & continuous series]: $\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + f_3 + \cdots + f_n} = \frac{\sum fx}{N}$

Short cut method [individual series]: $\bar{X} = a + \frac{U_1 + U_2 + U_3 + \cdots + U_n}{n} = a + \frac{\sum U}{n}$

Short cut method [discrete & continuous series]: $\bar{X} = a + \frac{\sum fU}{\sum f} = a + \frac{\sum fU}{N}$

Step deviation method [discrete & continuous series]: $\bar{X} = a + \frac{\sum fU'}{\sum f} \times h = a + \frac{\sum fU'}{N} \times h$

where $a$ = assumed mean, $d = U = x - a$, and $d' = U' = \frac{x - a}{h}$.
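
As a worked illustration, the sketch below applies the direct method [iii] and the step deviation method [v] to the income distribution of Example [i] (classes of width 15, in Rs. '000). The choice of assumed mean a = 52.5 is arbitrary and made only for this sketch.

```python
# Example [i]: income classes (Rs. '000) and number of forest enterprises
lower_limits = [0, 15, 30, 45, 60, 75]
freqs = [12, 20, 25, 40, 30, 15]
h = 15  # class size

mid_values = [lower + h / 2 for lower in lower_limits]  # 7.5, 22.5, ...
n_total = sum(freqs)                                    # N = 142

# Direct method [iii]: sum(f * x) / N
mean_direct = sum(f * x for f, x in zip(freqs, mid_values)) / n_total

# Step deviation method [v]: a + sum(f * d') / N * h, with d' = (x - a) / h
a = 52.5  # assumed mean (arbitrary choice)
d_prime = [(x - a) / h for x in mid_values]             # -3, -2, -1, 0, 1, 2
sum_fd = sum(f * d for f, d in zip(freqs, d_prime))     # -41
mean_step = a + sum_fd / n_total * h

print(round(mean_direct, 2), round(mean_step, 2))
```

Here $\sum fx = 6840$ and $\sum fd' = -41$, so both routes give a mean income of about 48.17 (Rs. '000). The step deviation method only has to handle the small integers -3 to 2, which is its practical advantage in hand calculation.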

Weighted arithmetic mean:

The arithmetic mean discussed above is the simple arithmetic mean. In the
calculation of the simple A.M., all the items are considered equally
important. In practice, the importance of the items may not be the same; some
items may be more important than others. In such cases, proper
weights should be given to the various items. The weighted
arithmetic mean is defined as follows:

If $w_1, w_2, w_3, \ldots, w_n$ are the weights given to the variate values $x_1, x_2, x_3, \ldots, x_n$ respectively, then their weighted arithmetic mean is given by

$$\bar{X}_w = \frac{x_1 w_1 + x_2 w_2 + \cdots + x_n w_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wx}{\sum w} \qquad \text{[vi]}$$
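
A minimal Python sketch of formula [vi] follows. The scores and weights are hypothetical, chosen only to show how unequal weights pull the mean towards the more heavily weighted items.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean [vi]: sum(w * x) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical scores in three papers, weighted 3 : 2 : 1
scores = [70, 80, 90]
weights = [3, 2, 1]

simple = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
print(simple, round(weighted, 2))
```

The simple mean is 80, while the weighted mean is 460/6, about 76.67, because the lowest score carries the largest weight. With equal weights the two means coincide.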

Combined arithmetic mean: If $\bar{X}_1$ and $\bar{X}_2$ are the arithmetic means of two
component series having $n_1$ and $n_2$ items respectively, then the mean of
the combined series is known as the combined arithmetic mean, denoted by $\bar{X}_{12}$, and
is defined as

$$\bar{X}_{12} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} \qquad \text{[vii]}$$

Similarly, the combined arithmetic mean of three component series can be
easily formulated as

$$\bar{X}_{123} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + n_3 \bar{X}_3}{n_1 + n_2 + n_3} \qquad \text{[viii]}$$
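
Formulas [vii] and [viii] are the same calculation for two and three groups, so one small function covers both. The group sizes and means below are hypothetical values used only for illustration.

```python
def combined_mean(sizes, means):
    """Combined arithmetic mean: sum(n_i * mean_i) / sum(n_i)."""
    if len(sizes) != len(means):
        raise ValueError("sizes and means must have the same length")
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# Hypothetical sections of a class: 40 students averaging 60 marks
# and 60 students averaging 70 marks
two_groups = combined_mean([40, 60], [60, 70])

# Adding a third hypothetical section of 50 students averaging 64 marks
three_groups = combined_mean([40, 60, 50], [60, 70, 64])

print(two_groups, round(three_groups, 2))
```

For the two sections the combined mean is (2400 + 4200)/100 = 66, which lies closer to 70 than to 60 because the second section is larger.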

Properties of arithmetic mean: The following are the mathematical properties
of the arithmetic mean:

1. The algebraic sum of the deviations of the items taken from the arithmetic mean is
zero. So, $\sum (X - \bar{X}) = 0$ [in individual series], $\sum f(X - \bar{X}) = 0$ [in discrete series],
and $\frac{1}{h}\sum f(X - \bar{X}) = 0$ [in continuous series], where $h$ is the class size and $X$ is the
mid-value of each class.

2. The sum of the squares of the deviations of the items is minimum when the
deviations are taken from the arithmetic mean. Mathematically, $\sum (X - a)^2$ is
minimum when $a = \bar{X}$.
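
Both properties can be checked numerically. The sketch below uses a small hypothetical data set; it does not prove the properties, it only illustrates them.

```python
# Hypothetical observations
values = [2, 4, 7, 11]
n = len(values)
mean = sum(values) / n  # 6.0

# Property 1: deviations from the mean sum to zero
sum_dev = sum(x - mean for x in values)

def sum_sq(a):
    """Sum of squared deviations of the observations from a."""
    return sum((x - a) ** 2 for x in values)

# Property 2: the sum of squares is smallest when a equals the mean
trial_points = [0, 3, 5, 6.5, 10]
for a in trial_points:
    print(a, sum_sq(a), ">=", sum_sq(mean))
```

For these values the sum of squared deviations about the mean is 46, and every other trial point gives a larger sum. The excess is exactly $n(\bar{X} - a)^2$, which is zero only when $a = \bar{X}$.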

Merits and demerits of arithmetic mean:

Merits:

• It is rigidly defined.
• It is based on all the observations.
• It is simple to understand and easy to compute.
• It is suitable for further mathematical analysis.
• It is least affected by fluctuations of sampling.

Demerits:

▪ It is very much affected by extreme (largest and smallest) observations.
▪ It can't be calculated accurately in the case of open end classes.
▪ It can't be located by inspection or graphically.
▪ It can't be used for data on qualitative characteristics.
▪ It sometimes gives fallacious conclusions.

(ii) Geometric Mean

The geometric mean of n positive variate values is the nth
root of their product.

In individual series: Let $x_1, x_2, x_3, \ldots, x_n$ be n positive
variate values; then their geometric mean (denoted by 'G') is defined as

$$G = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \operatorname{antilog}\left[\frac{\sum \log x}{n}\right] \qquad \text{[ix]}$$

In discrete series: If $f_1, f_2, f_3, \ldots, f_n$ are the frequencies corresponding to n
positive variate values $x_1, x_2, x_3, \ldots, x_n$, their geometric mean G is
defined as

$$G = \operatorname{antilog}\left[\frac{\sum f \log x}{N}\right] \qquad \text{[x]}$$

where $N = \sum f$ = total frequency.

In continuous series: The formula for the G.M. in the discrete series [x] can be
used by taking the mid value of each class interval as the variate value x.
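
The antilog form of formulas [ix] and [x] translates directly into code. In this Python sketch the function name and the sample values are illustrative; base-10 logarithms are used to mirror a log table, though any base gives the same result.

```python
import math

def geometric_mean(values, freqs=None):
    """G.M. via logs: antilog(sum(f * log x) / N); all values must be positive."""
    if freqs is None:
        freqs = [1] * len(values)  # individual series: each value once
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs positive values")
    n_total = sum(freqs)
    log_sum = sum(f * math.log10(x) for x, f in zip(values, freqs))
    return 10 ** (log_sum / n_total)  # antilog

# Hypothetical individual series: cube root of 2 * 4 * 8 = 64
g_individual = geometric_mean([2, 4, 8])

# Hypothetical discrete series: value 2 once and value 8 twice
g_discrete = geometric_mean([2, 8], freqs=[1, 2])

print(round(g_individual, 4), round(g_discrete, 4))
```

For the individual series the product is 64 and its cube root is 4. The function rejects zero or negative values, in line with the restriction stated in the definition.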

Merits and demerits of G.M.

Merits:

➢ It is rigidly defined.
➢ It is based on all the observations.
➢ It is suitable for further mathematical analysis.
➢ It is not affected very much by fluctuations of sampling.
➢ It gives comparatively more weight to small items.

Demerits:

➢ It is not easy to calculate for researchers from non-mathematical fields.
➢ It can't be used if any of the variate values is zero or negative. If any one or more
values of x are zero, the G.M. will be zero. If any one or more values of x are negative,
'G' may be imaginary.

(iii). HARMONIC MEAN

Harmonic mean is defined as 'the reciprocal of the arithmetic mean of the
reciprocals of a set of non-zero variate values'.

In individual series: Let $x_1, x_2, x_3, \ldots, x_n$ be non-zero variate values of the variable
X; then their harmonic mean (H.M.), denoted by H, is defined as

$$H = \frac{n}{\sum \frac{1}{x}} \qquad \text{[xi]}$$

where $n$ is the number of observations.

In discrete & continuous series: If $f_1, f_2, f_3, \ldots, f_n$ are the frequencies corresponding
to the n variate values $x_1, x_2, x_3, \ldots, x_n$, then their harmonic
mean H is defined as

$$H = \frac{N}{\sum f \cdot \frac{1}{x}} \qquad \text{[xii]}$$

where $N$ = total frequency.

Relation among A.M., G.M. and H.M.

(1). For positive values, A.M. ≥ G.M. ≥ H.M., with equality only when all the values are equal.

(2). For two positive values, $[\text{A.M.}] \cdot [\text{H.M.}] = [\text{G.M.}]^2$, or $\frac{\text{A.M.}}{\text{G.M.}} = \frac{\text{G.M.}}{\text{H.M.}}$.
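
The relations among the three means can be checked with Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later). The data values are hypothetical. Note that the product relation is an identity for two positive values only; the three-value example is included to show that it need not hold in general.

```python
from statistics import mean, geometric_mean, harmonic_mean

def three_means(values):
    """Return (A.M., G.M., H.M.) of a list of positive values."""
    return mean(values), geometric_mean(values), harmonic_mean(values)

# Two hypothetical positive values: A.M. = 6.5, G.M. = 6, H.M. = 72/13
am2, gm2, hm2 = three_means([4, 9])

# Three hypothetical positive values
am3, gm3, hm3 = three_means([1, 2, 9])

print("two values:  ", am2, gm2, hm2, am2 * hm2, gm2 ** 2)
print("three values:", am3, gm3, hm3, am3 * hm3, gm3 ** 2)
```

In both cases the ordering A.M. > G.M. > H.M. holds. For the pair (4, 9) the product A.M. × H.M. equals G.M.² = 36, whereas for (1, 2, 9) the two quantities differ.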

(iv). MEDIAN:

Median is the variate value which divides the total number of observations into two
equal parts. That is, when the data are arranged in ascending or descending order of
magnitude, the number of observations in the first part is equal to the number
of observations in the second part.

From this definition, we can conclude that the median is a positional average
which locates the central position of the arranged data. It is denoted by Md.

The determination of the median depends upon the type of series, as follows:

Individual series: First of all, arrange the data in either ascending or descending
order. If total number of observations ‘n’ is odd, there is only one middle value
which divides the whole items in two equal parts. If n is even, there are two central
or middle items. The single central value (median ‘Md’) can be obtained by taking
the A.M. of these two central items. In both cases,
Md = value of $\left(\dfrac{n+1}{2}\right)^{\text{th}}$ item …………………………………………[xiii]

where ‘n’ denotes the number of items.

Discrete series: In a discrete series, the median is determined by the following steps:

(i). Arrange the data in ascending order according to their magnitude.

(ii). Prepare the cumulative frequency (C.F.) table.

(iii). Use the formula for Median ‘Md’ (where ‘N’ =total frequency)
Md = value of $\left(\dfrac{N+1}{2}\right)^{\text{th}}$ item ……………………………………...[xiv]

(iv). Observe the cumulative frequency column, and note the value of the variable corresponding to the c.f. equal to or just greater than $\dfrac{N+1}{2}$, the value given by [xiv]. This gives the median value Md.

Continuous series: We proceed through the following steps to determine the value of the median ‘Md’ in a continuous series:

(i). Arrange the classes in ascending order of magnitude.


(ii). Prepare the less than cumulative frequency (C.F.) table.

(iii). Find N/2

(iv). Observe the cumulative frequency column, and note the class corresponding to the c.f. equal to or just greater than N/2. This is the median class, i.e. the class in which the median lies.

(v). The actual value of the median is obtained by using the formula

$M_d = L + \dfrac{\frac{N}{2} - c.f.}{f} \times h$ ……………………………………………………………..[xv]

where L = lower limit of the median class, N = total frequency, f = frequency of the median class, h = class size of the median class, and c.f. = cumulative frequency of the class preceding the median class.
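
Steps (i) to (v) can be sketched in code as follows. This is only an illustrative sketch: the function name and the (lower limit, upper limit, frequency) representation of a class are assumptions, and the data are those of Ex.[14](d).

```python
def grouped_median(classes):
    """Median of a continuous series.

    classes: list of (lower, upper, frequency) tuples in ascending order.
    """
    total = sum(f for _, _, f in classes)   # N
    target = total / 2                       # N/2
    cum = 0                                  # c.f. of the preceding classes
    for lower, upper, f in classes:
        if f > 0 and cum + f >= target:
            # Median class found: apply Md = L + (N/2 - c.f.) / f * h
            return lower + (target - cum) / f * (upper - lower)
        cum += f
    raise ValueError("empty distribution")

marks = [(0, 10, 5), (10, 20, 12), (20, 30, 20), (30, 40, 15), (40, 50, 10)]
median_marks = grouped_median(marks)
print(median_marks)
```

The loop keeps a running cumulative frequency instead of building a separate c.f. table, which is equivalent to reading the c.f. column by hand.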

Merits and Demerits of median:

Merits:

▪ It is clearly or rigidly defined.


▪ It is simple to understand and easy to compute.
▪ It is not affected by extreme observations.
▪ It can be easily computed in the case of open end classes.
▪ It can be obtained sometimes by inspection.
▪ It can be found graphically.
▪ It can be used for averaging qualitative characteristics such as honesty, beauty, intelligence etc.

Demerits:

▪ Arrangement of data in ascending or descending order of magnitude is necessary.
▪ It is not based on all the observations.
▪ It can’t be used for further mathematical analysis or treatment.
▪ It can’t be determined exactly if the number of observations is even in ungrouped data.
▪ For a small sample, it is affected more by fluctuation of sampling than the mean.

(v). MODE:

The value (or variate value) which repeats the maximum number of times is known as the Mode. In other words, the Mode is the value having the maximum frequency. Mode is denoted by Mo.

The determination of the Mode in individual and discrete series is very simple. If the series is individual, we first convert the data into a discrete series in the form of a frequency distribution. In a discrete series, the Mode is the value of the variable having the maximum frequency or repetition.

Computation of Mode in continuous series:

Mode can be computed by the following formula:

$M_o = L + \dfrac{\Delta_1}{\Delta_1 + \Delta_2} \times h = L + \dfrac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h$ ……………………..[xvi]

where Mo = Mode, L = lower limit of the modal class, $\Delta_1 = f_1 - f_0$, $\Delta_2 = f_1 - f_2$,

f1 = frequency of the modal class (the maximum frequency), f0 = frequency of the class preceding the modal class,

f2 = frequency of the class following the modal class, h = size of the modal class.

NOTE: The above formula and definition of the Mode suffer from limitations in the following cases:

➢ If the maximum frequency occurs more than once.
➢ If the maximum frequency occurs at the very beginning or at the very end of the observations.
➢ If the given frequencies are not in regular order. In such cases the Mode can be obtained by using the method of grouping.
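
Formula [xvi] and the first two limitations in the note can be sketched as below. The function name and the choice to raise an error in the problematic cases are assumptions made for illustration (the notes recommend the method of grouping there, which is not implemented here); the data are those of Ex.[18].

```python
def grouped_mode(classes):
    """Mode of a continuous series by Mo = L + (f1 - f0) / (2*f1 - f0 - f2) * h.

    classes: list of (lower, upper, frequency) tuples in ascending order.
    """
    freqs = [f for _, _, f in classes]
    f1 = max(freqs)
    i = freqs.index(f1)
    # Cases where the simple formula is not applicable (see the NOTE above).
    if freqs.count(f1) > 1:
        raise ValueError("maximum frequency occurs more than once")
    if i == 0 or i == len(classes) - 1:
        raise ValueError("modal class is the first or the last class")
    lower, upper, _ = classes[i]
    f0, f2 = freqs[i - 1], freqs[i + 1]
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)

income = [(112.5, 117.5, 10), (117.5, 122.5, 15), (122.5, 127.5, 30),
          (127.5, 132.5, 20), (132.5, 137.5, 12), (137.5, 142.5, 8)]
modal_income = grouped_mode(income)
print(modal_income)
```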

Merits and demerits of Mode:

Merits:

▪ It is easy to calculate and simple to understand.


▪ It is not affected by extreme observations.
▪ It can be obtained in the case of open end classes.
▪ It can be calculated by graph or by inspection.

Demerits:

▪ It is not rigidly defined.


▪ It is not based on all the observations.
▪ It is not suitable for the further mathematical analysis.
▪ It is affected to a greater extent by fluctuation of sampling.
▪ There are some limitations in applying the simple formula for calculating the Mode.

Relation among Mean, Median & Mode:


When the distribution is symmetrical,
Mean = Median =Mode.

When the distribution is moderately asymmetrical (skewed or non-symmetrical distribution), Mean ≠ Median ≠ Mode.

But, approximately, Mode = 3.Median – 2.Mean.

This last relationship is also known as the ‘empirical relation’ among Mean, Median and Mode.

PARTITION VALUES:

Those variate values which divide the total number of items into a number of equal parts are said to be partition values. The important partition values are Quartiles, Deciles and Percentiles. There are 3 Quartiles, which divide the total number of items into 4 equal parts. Similarly, there are 9 Deciles, which divide the total number of observations into 10 equal parts. And there are 99 Percentiles, which divide the whole set of observations into 100 equal parts.

The three Quartiles are denoted by Q1, Q2 and Q3, which are known as the first, second and third Quartiles respectively. The second Quartile Q2 is the Median itself, which divides the whole set of observations into 2 equal parts. The nine Deciles are denoted by D1, D2, …, D9. Similarly, P1, P2, …, P99 are the ninety-nine Percentiles. From these ideas we can easily conclude that Md=Q2=D5=P50, Q1=P25, Q3=P75, D1=P10, D7=P70 and so on.

Generally, Qi (i = 1, 2, 3) denotes the 3 Quartiles, where Q1 ≤ Q2 ≤ Q3.

Dj (j = 1, 2, 3, …, 9) denotes the 9 Deciles, where D1 ≤ D2 ≤ … ≤ D9.

Pk (k = 1, 2, 3, …, 99) denotes the 99 Percentiles, where P1 ≤ P2 ≤ … ≤ P99.


Determination of partition values:

The process of computing partition values (Quartiles, Deciles and Percentiles) is similar to the process of computing the Median in individual, discrete and continuous series. We discuss the process as follows:

Individual series

First of all, the given ‘n’ items should be arranged in ascending order. Then the Quartiles, Deciles and Percentiles can be determined by the following formulae:

Qi = value of $\left[\dfrac{i(n+1)}{4}\right]^{\text{th}}$ item, where i = 1, 2, 3.

Dj = value of $\left[\dfrac{j(n+1)}{10}\right]^{\text{th}}$ item, where j = 1, 2, 3, …, 9.

Pk = value of $\left[\dfrac{k(n+1)}{100}\right]^{\text{th}}$ item, where k = 1, 2, 3, …, 99.
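
The three formulae share one pattern: the position i(n+1)/m with m = 4, 10 or 100. A minimal sketch of that pattern follows, using linear interpolation between neighbouring items for fractional positions as in Ex.[20]. The function name and the clamping of positions that fall outside 1..n are assumptions for illustration; the data are those of Ex.[20].

```python
def partition_value(data, i, parts):
    """i-th partition value of an individual series.

    parts = 4 (quartiles), 10 (deciles) or 100 (percentiles).
    """
    xs = sorted(data)
    n = len(xs)
    pos = i * (n + 1) / parts      # 1-based position of the required item
    k = int(pos)                   # whole part of the position
    frac = pos - k                 # fractional part
    if k < 1:                      # assumption: clamp to the smallest item
        return xs[0]
    if k >= n:                     # assumption: clamp to the largest item
        return xs[-1]
    # k-th item + fraction of the gap to the (k+1)-th item
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

data = [21, 17, 8, 12, 23, 16, 4, 10]
q1 = partition_value(data, 1, 4)
md = partition_value(data, 2, 4)
d4 = partition_value(data, 4, 10)
p85 = partition_value(data, 85, 100)
print(q1, md, d4, p85)
```

Other conventions for interpolating quantiles exist in statistical software, so results from library routines may differ slightly from this (n+1) rule.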

Discrete series: For computing the partition values in a discrete series, we proceed through the following steps:

(i). Arrange the data in ascending order according to their magnitude.

(ii). Prepare the cumulative frequency (C.F.) table.

(iii). Use the formula for partition values Qi, Dj & Pk (where ‘N’ = total frequency)

Qi = value of $\left[\dfrac{i(N+1)}{4}\right]^{\text{th}}$ item, where i = 1, 2, 3.

Dj = value of $\left[\dfrac{j(N+1)}{10}\right]^{\text{th}}$ item, where j = 1, 2, 3, …, 9.

Pk = value of $\left[\dfrac{k(N+1)}{100}\right]^{\text{th}}$ item, where k = 1, 2, 3, …, 99.

(iv). Observe the cumulative frequency column, and note the value of the variable corresponding to the c.f. equal to or just greater than the value given by step (iii), i.e. $\dfrac{i(N+1)}{4}$ or $\dfrac{j(N+1)}{10}$ or $\dfrac{k(N+1)}{100}$. This gives the required partition value (Qi or Dj or Pk).

Continuous series: We proceed through the following steps to determine the partition values in a continuous series (suppose we want to find Dj):

(i). Arrange the classes in ascending order of magnitude.


(ii). Prepare the less than cumulative frequency (C.F.) table.

(iii). Find jN/10

(iv). Observe the cumulative frequency column, and note the class corresponding to the c.f. equal to or just greater than jN/10. This is the Dj class, i.e. the class in which Dj lies.

(v). The actual value of Dj is obtained by using the formula

$D_j = L + \dfrac{\frac{jN}{10} - c.f.}{f} \times h$ …………………[xvii]

where L = lower limit of the Dj class, N = total frequency, f = frequency of the Dj class, h = class size of the Dj class, and c.f. = cumulative frequency of the class preceding the Dj class.
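
The same steps work for any quartile, decile or percentile by replacing jN/10 with iN/4 or kN/100. A minimal sketch under that assumption is given below; the function name and the tuple layout are illustrative, and the data are the class frequencies of Ex.[22].

```python
def grouped_partition_value(classes, i, parts):
    """i-th partition value of a continuous series.

    classes: list of (lower, upper, frequency) tuples in ascending order.
    parts = 4 (quartiles), 10 (deciles) or 100 (percentiles).
    """
    total = sum(f for _, _, f in classes)    # N
    target = i * total / parts               # iN/4, jN/10 or kN/100
    cum = 0                                  # c.f. preceding the current class
    for lower, upper, f in classes:
        if f > 0 and cum + f >= target:
            return lower + (target - cum) / f * (upper - lower)
        cum += f
    raise ValueError("empty distribution")

sales = [(20, 40, 10), (40, 60, 15), (60, 80, 30), (80, 100, 25), (100, 120, 20)]
q1 = grouped_partition_value(sales, 1, 4)
d4 = grouped_partition_value(sales, 4, 10)
p70 = grouped_partition_value(sales, 70, 100)
print(q1, d4, p70)
```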

Worked out examples:

EX.[1]. Find the arithmetic mean of (i) the first n natural numbers, (ii) $a, ar, ar^2, \dots, ar^{n-1}$.

Solution: (i). We know the first ‘n’ natural numbers are 1, 2, 3, …, n, and their sum is $\dfrac{n(n+1)}{2}$. Hence, the A.M. of the first ‘n’ natural numbers is

$\bar{X} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{1+2+3+\dots+n}{n} = \dfrac{1}{n} \cdot \dfrac{n(n+1)}{2} = \dfrac{n+1}{2}.$

(ii). We have, for r ≠ 1,

$\bar{X} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{a + ar + ar^2 + \dots + ar^{n-1}}{n} = \dfrac{a}{n}\left[1 + r + r^2 + \dots + r^{n-1}\right] = \dfrac{a}{n}\left[\dfrac{1-r^n}{1-r}\right] = \dfrac{a(1-r^n)}{n(1-r)}.$

Ex.[2]. There are two series X and Y having same number of items. The
relationship of each member of X and Y is, 2x-y=5. If the arithmetic mean of Y is
15, what is the A.M. of series X?

Solution: Given 2x − y = 5 for each pair of items. Summing over all n items, 2Σx − Σy = 5n.

Or, $\dfrac{2\sum x}{n} - \dfrac{\sum y}{n} = \dfrac{5n}{n}$. Or, $2\bar{X} - \bar{Y} = 5$.

Or, $2\bar{X} - 15 = 5$. Or, $\bar{X} = 20/2 = 10$.

Hence, A.M. of series X is 10.

Ex.[3]. A researcher collected secondary data of temperature (in ℃) recorded over 100 days from a geological station. Find the average temperature from the data:

Temp. (℃):        0-10    10-20    20-30    30-40    40-50
No. of days (f):   12      15       28       25       20

Solution:

Calculation of average temperature

Temperature    Mid-value (x)    f      U’ = (x−25)/10    fU’
0-10           5                12     -2                -24
10-20          15               15     -1                -15
20-30          25               28     0                 0
30-40          35               25     1                 25
40-50          45               20     2                 40
               𝛴f = N = 100                              𝛴fU’ = 26

Here, a = 25, N = 100, h = 10, 𝛴fU’ = 26. A.M. ($\bar{X}$) = ?

We have, $\bar{X} = a + \dfrac{\sum fU'}{N} \times h = 25 + \dfrac{26}{100} \times 10 = 25 + 2.6 = 27.6$

Hence, average temperature = 27.6 ℃.
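
As a cross-check, the step-deviation result can be compared with the direct formula Σfx/N. The sketch below uses the data of this example; the function names are illustrative assumptions.

```python
def mean_direct(mids, freqs):
    # X-bar = sum(f * x) / N
    return sum(f * x for x, f in zip(mids, freqs)) / sum(freqs)

def mean_step_deviation(mids, freqs, a, h):
    # X-bar = a + (sum(f * U') / N) * h, with U' = (x - a) / h
    n = sum(freqs)
    sum_fu = sum(f * (x - a) / h for x, f in zip(mids, freqs))
    return a + sum_fu / n * h

mids = [5, 15, 25, 35, 45]
freqs = [12, 15, 28, 25, 20]
direct = mean_direct(mids, freqs)
stepped = mean_step_deviation(mids, freqs, a=25, h=10)
print(direct, stepped)
```

Both methods give the same mean for any choice of the assumed mean a; the step-deviation method only reduces the arithmetic done by hand.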

Ex.[4]. Find the arithmetic mean of daily expenditure of 150 students of Everest
Hostel.

Daily expenditure (Rs.):   Above 0   Above 100   Above 200   Above 300   Above 400   Above 500   Above 600
No. of students ‘f’:       150       130         100         50          30          12          0
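Solution: The given data form a ‘more than’ cumulative frequency distribution. The class frequencies are obtained by successive subtraction, e.g. 150 − 130 = 20 for the class 0-100.

Computation of Mean

Expenditure    Mid-value (x)    f      U’ = (x−350)/100    fU’
0-100          50               20     -3                  -60
100-200        150              30     -2                  -60
200-300        250              50     -1                  -50
300-400        350              20     0                   0
400-500        450              18     1                   18
500-600        550              12     2                   24
600-700        650              0      3                   0
               𝛴f = N = 150                                𝛴fU’ = -128

We know, $\bar{X} = a + \dfrac{\sum fU'}{N} \times h = 350 + \dfrac{-128}{150} \times 100$

= 350 − 85.33 = 264.67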

Hence, average daily expenditure = Rs.264.67

Ex.[5]. Find the missing frequencies from the incomplete distribution given below.
(given that average wage is 30.2).

Hourly wages (in Rs.):    0-10    10-20    20-30    30-40    40-50    Total
Number of workers (f):    4       -        10       20       -        50

Solution: Let f1 and f2 be the numbers of workers in the hourly wage classes (Rs.) 10-20 and 40-50 respectively. We have, A.M. $\bar{X} = \dfrac{\sum fx}{N}$

Calculation of missing frequency by using mean

Hourly wages    Mid value (x)    f      fx
0-10            5                4      20
10-20           15               f1     15.f1
20-30           25               10     250
30-40           35               20     700
40-50           45               f2     45.f2
                𝛴f = N = 50             𝛴fx = 15f1 + 45f2 + 970

Now, $\bar{X} = \dfrac{\sum fx}{N}$

Or, $30.2 = \dfrac{15 f_1 + 45 f_2 + 970}{50}$

Or, 15.f1+45.f2 = 1510 – 970 = 540

Or, 3.f1+9.f2 =108……………….[i]

Also, f1+f2+34 = 50, gives f1+f2 = 16 …………..[ii]

Solving [i] & [ii] we get, f1 = 6, & f2 =10.

Hence, the required missing frequencies are 6 and 10 respectively.

Ex.[6]. Mean of 100 observations was calculated as 50. Later on it was found that
two items were misread as 92 and 8 instead of 192 and 88. Find the correct mean.

Solution: Given that,

Number of items, ‘n’ = 100; misread items: 92 and 8;

Correct items: 192 and 88; previous mean = 50;

Correct mean = ?

We have, $\bar{X} = \dfrac{\sum x}{n}$, or 50 = 𝛴x/100. So, 𝛴x (with the misread items) = 5000.

Now, correct 𝛴x = 5000 − 92 − 8 + 192 + 88 = 5180.

Finally, correct mean = $\dfrac{\text{correct } \sum x}{n} = \dfrac{5180}{100} = 51.8$

Hence, correct mean = 51.8.

Ex.[7]. Find the Geometric mean of the growth factor of population in a capital
city within 5 years from the following data:

Year 1 2 3 4 5
Growth Factor 1.07 1.08 1.10 1.12 1.18
Solution: We know the formula for G.M. is,

$G = \text{antilog}\left[\dfrac{1}{n} \sum \log x\right]$

Calculation of Geometric Mean

Year    Growth factor (x)    log x
1       1.07                 0.029384
2       1.08                 0.033424
3       1.10                 0.041393
4       1.12                 0.049218
5       1.18                 0.071882
                             ∑ log x = 0.2253

We have,

$G = \text{antilog}\left[\dfrac{1}{n} \sum \log x\right] = \text{antilog}\left[\dfrac{1}{5} \times 0.2253\right] = \text{antilog}[0.04506] = 1.1093$
𝑛

Hence, average growth factor of population = 1.1093 or, 110.93%
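
The log-table computation can be reproduced with base-10 logarithms. The sketch below is illustrative only; the function name is an assumption, and the optional frequencies cover the grouped case used in the next example.

```python
import math

def geometric_mean_by_logs(xs, fs=None):
    """G = antilog[(1/N) * sum(f * log10(x))]; fs defaults to 1 for each x."""
    if fs is None:
        fs = [1] * len(xs)
    mean_log = sum(f * math.log10(x) for x, f in zip(xs, fs)) / sum(fs)
    return 10 ** mean_log          # antilog to base 10

growth = [1.07, 1.08, 1.10, 1.12, 1.18]
g = geometric_mean_by_logs(growth)
print(round(g, 4))

# Same result as the n-th root of the product of the growth factors
product = math.prod(growth)
print(round(product ** (1 / len(growth)), 4))
```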

Ex.[8]. Find the G.M. of the following distribution:

Class:        2-4    4-6    6-8    8-10
Frequency:    20     40     30     10

Solution:

Computation of G. Mean

Class    Mid value (x)    Frequency (f)    log x     f. log x
2-4      3                20               0.4771    9.542
4-6      5                40               0.6990    27.960
6-8      7                30               0.8451    25.353
8-10     9                10               0.9542    9.542
                          𝛴f = N = 100               𝛴f. log x = 72.397

We know,

$G = \text{antilog}\left[\dfrac{1}{N} \sum f \log x\right] = \text{antilog}\left[\dfrac{72.397}{100}\right]$

= antilog[0.72397] = 5.296

Ex.[9]. An enquiry into the budget of the certain family, provided the following
information. Calculate the weighted arithmetic mean.

Expenses on:     Food    Rent    Clothing    Fuel    Others
Weight:          35%     10%     15%         10%     30%
Index number:    60      75      80          55      105

Solution: The weighted A.M. is given by

$\bar{X}_w = \dfrac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Calculation of weighted A. Mean

Expenses on    Weights ($w_i$)    Index number ($x_i$)    $w_i x_i$
Food           35                 60                      2100
Rent           10                 75                      750
Clothing       15                 80                      1200
Fuel           10                 55                      550
Others         30                 105                     3150
               ∑ $w_i$ = 100                              ∑ $w_i x_i$ = 7750

Hence,

$\bar{X}_w = \dfrac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \dfrac{7750}{100} = 77.50$

Ex.[10]. The Ministry of Forest and Environment has decided to award a scholarship to a student studying M.Sc. Forestry on the basis of the marks secured in three important B.Sc. subjects of 50 marks each. The weights of the subjects and the marks secured by 3 students A, B & C are as given. By calculating a suitable mean, decide who is awarded.

Subjects Weight Marks by A Marks by B Marks by C


Forest management 5 40 43 34
Silviculture. 3 42 38 43
Wild life management 2 38 36 44
Solution: Clearly, the weighted A.M. is suitable for the decision of the scholarship. Let $\bar{x}_w$, $\bar{x}'_w$ and $\bar{x}''_w$ denote the weighted mean marks of the students A, B & C respectively.

Weighted mean of A, $\bar{x}_w = \dfrac{\sum xw}{\sum w} = \dfrac{x_1 w_1 + x_2 w_2 + x_3 w_3}{w_1 + w_2 + w_3} = \dfrac{40 \times 5 + 42 \times 3 + 38 \times 2}{5+3+2} = 402/10 = 40.2$

Weighted mean of B, $\bar{x}'_w = \dfrac{\sum xw}{\sum w} = \dfrac{x_1 w_1 + x_2 w_2 + x_3 w_3}{w_1 + w_2 + w_3} = \dfrac{43 \times 5 + 38 \times 3 + 36 \times 2}{5+3+2} = 401/10 = 40.1$

Weighted mean of C, $\bar{x}''_w = \dfrac{\sum xw}{\sum w} = \dfrac{x_1 w_1 + x_2 w_2 + x_3 w_3}{w_1 + w_2 + w_3} = \dfrac{34 \times 5 + 43 \times 3 + 44 \times 2}{5+3+2} = 387/10 = 38.7$

Since the weighted mean marks of A is the largest, A is awarded the scholarship.
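
The comparison of the three students can be sketched in a few lines. The function and variable names are illustrative assumptions; the weights and marks are those given in this example.

```python
def weighted_mean(values, weights):
    # X-bar_w = sum(w * x) / sum(w)
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

weights = [5, 3, 2]   # Forest management, Silviculture, Wild life management
marks = {
    "A": [40, 42, 38],
    "B": [43, 38, 36],
    "C": [34, 43, 44],
}
results = {name: weighted_mean(ms, weights) for name, ms in marks.items()}
winner = max(results, key=results.get)
print(results, winner)
```

Note that the simple (unweighted) means would rank the students differently, which is why the weighted mean is the suitable average here.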

Ex.[11]. There are 3 types of staff in a research centre. The average monthly salary of 8 field workers is Rs.50,000, of 3 managers is Rs.75,000 and of 9 supporting staff is Rs.25,000. Find the average monthly salary of all the staff of that research centre.

Solution: Here,

                 Managers               Field workers          Supporting staff       Total
Numbers:         n1 = 3                 n2 = 8                 n3 = 9                 n = 20
A. Mean (Rs.):   $\bar{X}_1$ = 75,000   $\bar{X}_2$ = 50,000   $\bar{X}_3$ = 25,000   $\bar{X}$ = ?

We have, Combined Arithmetic Mean,

$\bar{X} = \dfrac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + n_3 \bar{X}_3}{n_1 + n_2 + n_3} = \dfrac{3 \times 75000 + 8 \times 50000 + 9 \times 25000}{20} = 850,000/20 = 42,500$

Hence, the average monthly salary of all the staff = Rs. 42,500.

Ex.[12]. In a class of 100 students, 85 passed and their average marks is 58. The
total marks secured by the entire class is 5260. Find the average marks of the failed
students

Solution:
              Passed students     Failed students       Total students
Numbers:      n1 = 85             n2 = 100−85 = 15      n = 100
A. Mean:      $\bar{X}_1$ = 58    $\bar{X}_2$ = ?       n.$\bar{X}$ = 5260

We have, $\bar{X} = \dfrac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$

Or, $\dfrac{5260}{100} = \dfrac{85 \times 58 + 15 \bar{X}_2}{100}$

Or, $\bar{X}_2$ = [5260 – 4930]/15 = 22.

Hence, average marks of the failed students = 22.

Ex.[13]. Calculate the harmonic mean of the following distribution.

X: 3 6 7 10 11

f: 15 25 30 20 18

Solution:

Computation of H.M

x        f        1/x       f . 1/x
3        15       0.3333    4.9999
6        25       0.1666    4.1666
7        30       0.1428    4.2840
10       20       0.1000    2.0000
11       18       0.0909    1.6362
         𝛴f = N = 108       𝛴f.1/x = 17.0867

Now, Harmonic Mean H.M. = $\dfrac{N}{\sum f \cdot \frac{1}{x}} = \dfrac{108}{17.0867} = 6.32$.

Ex.[14]. Find the Median from the following data:

(a). 27, 124, 54, 35, 61, 87, 78, 40. (b). 6, 3, 21, 31, 12, 24, 9, 17, 22, 11, 21.

(c).
Weekly wages (Rs.):    2500    3000    3500    4200    4500    4700
No. of workers (f):    5       12      20      25      15      8

(d).
Marks:              0-10    10-20    20-30    30-40    40-50
No. of students:    5       12       20       15       10
Solution: (a). The given data in ascending order are 27, 35, 40, 54, 61, 78, 87, 124.

Here, n = 8 (even), Median (Md) = ?

We have, Median (Md) = value of $\left[\dfrac{n+1}{2}\right]^{\text{th}}$ item = value of $\left[\dfrac{8+1}{2}\right]^{\text{th}}$ item

= value of (4.5)th item = $\dfrac{\text{4th item} + \text{5th item}}{2} = \dfrac{54+61}{2} = 57.5$

Hence median value = 57.5

(b). The given data in ascending order is, 3, 6, 9, 11, 12, 17, 21, 21, 22, 24, 31.

Here, n=11 (odd), Median (Md) =?


We have, Median (Md) = value of $\left[\dfrac{n+1}{2}\right]^{\text{th}}$ item = value of $\left[\dfrac{11+1}{2}\right]^{\text{th}}$ item

= value of 6th item = 17.

Hence, median value = 17.

(c). the data is a discrete series.

Calculation of Median wage

Weekly wages (Rs.)    No. of workers (f)    c.f.
2500                  5                     5
3000                  12                    17
3500                  20                    37
4200                  25                    62
4500                  15                    77
4700                  8                     85

Here, (N+1)/2 = 86/2 = 43. The c.f. just greater than 43 is 62, and the corresponding value of wage (x) = 4200.

Hence, Median = Rs.4200.



(d). The data is a continuous series (Exclusive form)

Calculation of Median marks

Marks No. of students (f) c.f.


0-10 5 5
10-20 12 17
20-30 20 37
30-40 15 52
40-50 10 62
N=62.
Here, N/2 =62/2 =31.The c.f. just greater than 31 is 37, whose corresponding class
is 20-30. So, the Median lies between 20 and 30.

Now, L =lower limit of Md class =20, f= frequency of Md class =20,

h =class width of Md class=10 c.f.=preceding c.f. of Md class=17.


We have, $M_d = L + \dfrac{\frac{N}{2} - c.f.}{f} \times h = 20 + \dfrac{31-17}{20} \times 10 = 27$.
𝑓 20

Hence, median marks = 27.

Ex.[15]. The following is the distribution of marks in Forest Extension obtained by 75 students. Find the Median marks.

Marks more than:     0     10    20    30    40    50
No. of students:     75    68    60    35    20    10
Solution: The given distribution is more than c.f distribution. Changing into less
than c.f. table in standard form, we get,

Computation of Median marks

Marks            No. of students (f)    c.f.
0-10             75−68 = 7              7
10-20            68−60 = 8              15
20-30            60−35 = 25             40
30-40            35−20 = 15             55
40-50            20−10 = 10             65
50 and above     10                     75
                 𝛴f = N = 75

Here, N/2 = 75/2 = 37.5. The c.f. just greater than 37.5 is 40. So, the Median lies in the corresponding class interval 20-30. From the Md class, L = 20, h = 10, f = 25, c.f. = 15.

Now, $M_d = L + \dfrac{\frac{N}{2} - c.f.}{f} \times h = 20 + \dfrac{37.5-15}{25} \times 10 = 20 + 9 = 29$.

Hence, median marks = 29.

Ex.[16]. Calculate the appropriate measure of central tendency from the following data of monthly income (Rs.) generated by a certain Community Forestry of 125 user groups.

Monthly income (Rs.):    Below 500    500-599    600-699    700-799    800-899    900 & more
No. of families:         5            35         50         15         12         8

Solution: The first and last classes are open-ended (‘below’ and ‘& more’), and the data are of monthly income (Rs.). So, the suitable measure of central tendency is the Median. The classes are in inclusive form, so they are changed into exclusive form in the calculation table.

Calculation of Median Income.

Monthly income (Rs.) No. of families (f) c.f.


Below 499.5 5 5
499.5-599.5 35 40
599.5-699.5 50 90
699.5-799.5 15 105
799.5-899.5 12 117
899.5 and above 8 125
∑ 𝑓=N=125.
Here, N/2 = 125/2 = 62.5. The c.f. just greater than 62.5 is 90 and the corresponding class (which is the Median class) is 599.5-699.5.

Now, L = 599.5, f = 50, c.f. = 40, h = 100.

We have, $M_d = L + \dfrac{\frac{N}{2} - c.f.}{f} \times h = 599.5 + \dfrac{62.5-40}{50} \times 100 = 599.5 + 45 = 644.5$
𝑓 50

Hence, median income = Rs.644.5



Ex.[17]. Find the missing frequency from the following distribution. The median of the distribution is given to be 180.

Income (Rs.) : 50-100 100-150 150-200 200-250 250-300

No. of persons: 15 25 40 - 10

Solution, It is given that, Md =180, so, median class is 150-200. The missing
frequency is assumed as f’.

Calculation of missing frequency

Income (Rs.) No. of persons (f) c.f.


50-100 15 15
100-150 25 40
150-200 40 80
200-250 f’ 80+f’
250-300 10 90+f’
∑ 𝒇=N=90+f’
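Since Md = 180, the Md class is 150-200, h = 50, L = 150, f = 40, c.f. = 40, N = 90+f’.

We have, $M_d = L + \dfrac{\frac{N}{2} - c.f.}{f} \times h = 150 + \dfrac{\frac{90+f'}{2} - 40}{40} \times 50$

Or, $180 = 150 + \dfrac{90 + f' - 80}{2} \times \dfrac{5}{4}$, or, $180 - 150 = \dfrac{[f'+10] \times 5}{8}$

Or, 30 x 8 = [f’+10] x 5, or, f’+10 = 48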

Hence, missing frequency f’ = 48-10 = 38.

Ex.[18]. Find the modal income from the following frequency distribution of daily
income.

Mid value (daily income):    115    120    125    130    135    140
Number of workers:           10     15     30     20     12     8

Solution: Here the mid values of the classes are given. Change them into the exclusive form of a continuous series. Difference of mid values = class size = 5, and difference/2 = 5/2 = 2.5.

Then, the first class in exclusive form is 115−2.5 to 115+2.5, i.e., 112.5-117.5, and so on.

Calculation of Mode

Daily income No. of workers (f)


112.5-117.5 10
117.5-122.5 15
122.5-127.5 30
127.5-132.5 20
132.5-137.5 12
137.5-142.5 8
Here, maximum frequency f1 = 30, the Modal class is 122.5-127.5, f0 = 15, f2 = 20, h = 5, L = 122.5.

We have, Mode $M_o = L + \dfrac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h = 122.5 + \dfrac{30-15}{2 \times 30 - 15 - 20} \times 5$

= 122.5 + 3 = 125.5

Ex.[19]. The modal income (hourly) for a group of 100 workers is Rs.140. The number of workers earning between Rs.0 and Rs.50 is 10. Thirty workers earn between Rs.100 and Rs.150, and 15 workers earn between Rs.200 and Rs.250. If the maximum income is Rs.250, find the number of workers earning between Rs.50-100 and Rs.150-200.

Solution: Given that Mo = 140, N = 100. Suppose that f0 and f2 are the numbers of workers earning between Rs.50-100 and Rs.150-200 respectively.

Calculation of missing frequencies

Hourly income No. of workers.(f).


0-50 10
50-100 f0
100-150 30
150-200 f2
200-250 15
𝛴f =N=55+f0+f2=100
Here, Mo = 140, N = 100, f1 = 30, h = 50, L = 100, and f0, f2 are unknown.

$M_o = L + \dfrac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h = 100 + \dfrac{30 - f_0}{2 \times 30 - f_0 - f_2} \times 50$

Or, $140 = 100 + \dfrac{30 - f_0}{60 - f_0 - f_2} \times 50$, or, 40[60−f0−f2] = 50[30−f0]

Or, 240 − 4.f0 − 4.f2 = 150 − 5.f0, or, f0 − 4.f2 = 150 − 240

Or, 4.f2 − f0 = 90 ………………………………………….[i]

Also, f0+f2+55 = N = 100 gives,

f2+f0 = 45 ………………………………………………..[ii]

Solving [i] and [ii], we get, f0 = 18 & f2 = 27.

Hence, the required number of workers earning between Rs.50-100 is 18 and that between Rs.150-200 is 27.

Ex.[20]. Find first quartile (Q1), Median (Md), fourth Decile (D4) and 85th
Percentile (P85) from the data, : 21, 17, 8, 12, 23, 16, 4,10

Solution: Here, No. of items ‘n’ =8.

The data in ascending order is : 4, 8, 10, 12, 16, 17, 21,23.


Now, first quartile (Q1) = value of $\left[\dfrac{1 \cdot (n+1)}{4}\right]^{\text{th}}$ item = value of [9/4]th item

= value of [2.25]th item = 2nd item + 0.25[3rd − 2nd] item

= 8 + 0.25 x [10−8] = 8 + 0.5 = 8.5

Again, Md = value of $\left[\dfrac{n+1}{2}\right]^{\text{th}}$ item = value of [9/2]th item

= value of 4.5th item = [12+16]/2 = 14.

Now, D4 = value of $\left[\dfrac{4(n+1)}{10}\right]^{\text{th}}$ item = value of [36/10]th item

= value of [3.6]th item = 3rd item + 0.6[4th − 3rd] item

= 10 + 0.6 x [12−10] = 10 + 1.2 = 11.2.

Finally, P85 = value of $\left[\dfrac{85(n+1)}{100}\right]^{\text{th}}$ item = value of [153/20]th item

= value of [7.65]th item = 7th item + 0.65[8th − 7th] item

= 21 + 0.65 x [23−21] = 21 + 1.3 = 22.3

Hence, Q1 = 8.5, Md = 14, D4 = 11.2, P85 = 22.3

EX.[21]. Find the median size of trousers which are bought from a supplier for 200 students in a hostel as shown below:

Size of trousers:    18    20    22    24    26    28    30    32
No. of trousers:     5     20    25    40    50    35    15    10

Also determine the third Quartile, 7th Decile & 65th Percentile.

Solution:

Computation of Median, Quartile, Deciles and percentile

Size of trousers: 18 20 22 24 26 28 30 32
No. of trousers:f 5 20 25 40 50 35 15 10
c.f. 5 25 50 90 140 175 190 200
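Now, Md = value of $\left[\dfrac{N+1}{2}\right]^{\text{th}}$ item = value of $\left[\dfrac{200+1}{2}\right]^{\text{th}}$ item

= value of [100.5]th item = 26, since the c.f. just greater than 100.5 is 140.

Also, Q3 = value of $\left[\dfrac{3(N+1)}{4}\right]^{\text{th}}$ item = value of $\left[\dfrac{3(200+1)}{4}\right]^{\text{th}}$ item

= value of [150.75]th item = 28, since the c.f. just greater than 150.75 is 175.

Again, D7 = value of $\left[\dfrac{7(N+1)}{10}\right]^{\text{th}}$ item = value of $\left[\dfrac{7(200+1)}{10}\right]^{\text{th}}$ item

= value of [140.7]th item = 28, since the c.f. just greater than 140.7 is 175.

Finally, P65 = value of $\left[\dfrac{65(N+1)}{100}\right]^{\text{th}}$ item = value of $\left[\dfrac{65(200+1)}{100}\right]^{\text{th}}$ item

= value of [130.65]th item = 26, since the c.f. just greater than 130.65 is 140.

Hence, Md = 26, Q3 = 28, D7 = 28, P65 = 26.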

EX.[22]. The daily sale of potato by 100 retailers in Hetauda sub-metropolitan city is as follows:

Daily sale of potato (kg):            20-40    20-60    20-80    20-100    20-120
Number of retailers (cumulative):     10       25       55       80        100

Find, (a). Mean, Median and Modal sale of potato.

(b). Q1, D4 and P70 of the sale.

Solution: This frequency distribution is of the ‘less than’ cumulative frequency type. Changing it into a simple frequency table with a c.f. column, we get,

Computation of Mean, Median, Mode & Partition Values.


Daily sale ‘kg’    Mid value (x)    f      U’ = (x−70)/20    fU’     c.f.
20-40              30               10     -2                -20     10
40-60              50               15     -1                -15     25
60-80              70               30     0                 0       55
80-100             90               25     1                 25      80
100-120            110              20     2                 40      100
                   𝛴f = N = 100                              𝛴fU’ = 30
(a). For Mean: a = 70, N = 100, h = 20, 𝛴f.U’ = 30, $\bar{X}$ = ?

We have, $\bar{X} = a + \dfrac{\sum fU'}{N} \times h = 70 + \dfrac{30}{100} \times 20 = 70 + 6 = 76$

For Median: N/2 = 100/2 = 50. The c.f. just greater than 50 is 55, so the Median lies in the class 60-80.

Here, L = 60, c.f. = 25, f = 30, h = 20, N/2 = 50, Md = ?

Now, we have, $M_d = L + \dfrac{\frac{N}{2} - c.f.}{f} \times h = 60 + \dfrac{50-25}{30} \times 20$

= 60 + 16.67 = 76.67

For Mode: Since the maximum frequency is 30, the corresponding class 60-80 is the modal class.

Now, L = 60, f1 = 30, f0 = 15, f2 = 25, h = 20, Mo = ?

Also, Mode $M_o = L + \dfrac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h = 60 + \dfrac{30-15}{60-15-25} \times 20$

= 60 + 15 = 75.

Hence, mean sale = 76 kg, median sale = 76.67 kg & modal sale = 75 kg.

(b). For Q1:

1.N/4 = 100/4 = 25. The c.f. column contains 25, so Q1 lies in the class 40-60.

Now, N/4 = 25, c.f. = 10, f = 15, h = 20, L = 40 and Q1 = ?

Formula, $Q_1 = L + \dfrac{\frac{1 \cdot N}{4} - c.f.}{f} \times h = 40 + \dfrac{25-10}{15} \times 20$

= 40 + 20 = 60.

For D4: 4.N/10 = 4 x 100/10 = 40. The c.f. just greater than 40 is 55 and the corresponding class is 60-80.

Also, L = 60, c.f. = 25, f = 30, h = 20, D4 = ?

Formula, $D_4 = L + \dfrac{\frac{4N}{10} - c.f.}{f} \times h = 60 + \dfrac{40-25}{30} \times 20$

= 60 + 10 = 70.

For P70: 70.N/100 = 70 x 100/100 = 70. The c.f. just greater than 70 is 80 and the corresponding class is 80-100.

Also, L = 80, c.f. = 55, f = 25, h = 20, P70 = ?

Formula, $P_{70} = L + \dfrac{\frac{70N}{100} - c.f.}{f} \times h = 80 + \dfrac{70-55}{25} \times 20$

= 80 + 12 = 92.

Hence, Q1 =60 kg, D4 =70 kg, P70 =92 kg.

Ex.[23]. The following are the distribution of marks obtained by 250 students of
Faculty of Forestry in the subject ‘Fire Ecology’. (a). find the minimum pass marks
if only 30% of the students had failed. (b) If top 20 % students were awarded for
scholarship, what is the minimum marks obtained by students who were awarded.
And (c) find out the range of the marks of the middle 60% students.

Marks in ‘Fire Ecology’:    0-10    10-20    20-25    25-30    30-40    40-50
Number of students:         30      45       80       50       25       20

Solution: (a). If 30% of the students failed, then 70% of the students passed. The minimum pass marks is therefore P30 or D3.

Calculation of Partition Values.

Marks No. of students (f) c.f.


0-10 30 30
10-20 45 75
20-25 80 155
25-30 50 205
30-40 25 230
40-50 20 250
𝛴f=N =250

For P30: 30.N/100 = 30 x 250/100 = 75. The c.f. column contains 75. So, the class in which P30 lies is 10-20.

Here, the required class is 10-20, L = 10, c.f. = 30, f = 45, h = 10, P30 = ?

We have, $P_{30} = L + \dfrac{\frac{30N}{100} - c.f.}{f} \times h = 10 + \dfrac{75-30}{45} \times 10$

= 10 + 10 = 20.

Hence, required minimum pass marks =20.

(b) The minimum marks obtained by top 20% students is P80 or D8.

For P80: We have, 80 x N/100 = 80 x 250/100 = 200. The c.f. just greater than 200 is 205. The corresponding class, in which P80 lies, is 25-30.

Here, L = 25, c.f. = 155, f = 50, h = 5, P80 = ?

We have, $P_{80} = L + \dfrac{\frac{80N}{100} - c.f.}{f} \times h = 25 + \dfrac{200-155}{50} \times 5$

= 25 + 4.5 = 29.5

Hence, the minimum marks obtained by students who were awarded is 29.5

(c) Hint: The middle 60% students occupy 30% above Median [or positional
average] and 30% below Median. In other words, we have to find P20 &P80. Range
of the marks of the middle 60% students = P80 – P20. [Solve it!]

MEASURE OF DISPERSION
Two distributions may have the same Mean, Median and Mode and still not be identical. They may differ in formation. Consider the following examples:

             Data                                          Mean    Median    Mode
Series A:    10  11  11  12  12  12  13  13  14            12      12        12
Series B:    8   10  10  12  12  12  14  14  16            12      12        12
Series C:    2   7   7   12  12  12  17  17  22            12      12        12

Each of the series A, B & C has the same value of central tendency (Mean, Median & Mode), namely 12. The three series have coincident Mean, Median & Mode, but this doesn’t mean that the series are identical. They differ in their measure of ‘Dispersion’ or ‘Variability’, that is, they differ entirely in formation.
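
The point can be checked with Python's standard `statistics` module. This is an illustrative sketch only; the range and the standard deviation used here as measures of spread are defined later in this chapter.

```python
from statistics import mean, median, mode, pstdev

series = {
    "A": [10, 11, 11, 12, 12, 12, 13, 13, 14],
    "B": [8, 10, 10, 12, 12, 12, 14, 14, 16],
    "C": [2, 7, 7, 12, 12, 12, 17, 17, 22],
}

for name, xs in series.items():
    spread = max(xs) - min(xs)      # range = largest - smallest
    sd = pstdev(xs)                 # standard deviation with divisor n
    print(name, mean(xs), median(xs), mode(xs), spread, round(sd, 3))
```

All three series report the same mean, median and mode, while the range and standard deviation increase from A to C.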

Actually, the meaning of dispersion is the scatter of the items about the central value. Thus, dispersion is defined as ‘the measure of variation of the items from the central value’.

Absolute and Relative measures:

A measure of dispersion having the same unit as that of the given series is known as an ‘absolute measure of dispersion’. It can be used to compare the variability of two distributions having the same units. Two distributions having different units can be compared with the help of relative measures of dispersion. The relative measure of dispersion is a ratio defined by

Relative measure of Dispersion = $\dfrac{\text{Absolute measure of Dispersion}}{\text{Suitable Average}}$

A relative measure is a pure number which is independent of the original units of measurement. So these are also called ‘Coefficients of dispersion’.

Measures of Dispersion

The following are the various measures of dispersion:

[i]. Range (R)

[ii]. Semi – Inter quartile Range or quartile deviation (Q.D.)

[iii]. Mean Deviation (M.D.)

[iv]. Standard Deviation (S.D.)

[i]. Range

The difference between the largest (maximum) and smallest (minimum) values in a given distribution is known as the Range.

Hence, Range = L−S

where L = value of the largest item, S = value of the smallest item.

In a continuous frequency distribution, the Range is determined either (i) by taking the difference between the upper limit of the highest class and the lower limit of the lowest class or (ii) by taking the difference between the middle point of the highest class and the middle point of the lowest class.

The Range is the absolute measure of dispersion. It has the same unit of
measurement as that of the given data. Coefficient of Range is the relative measure
corresponding to Range.
Hence, Coefficient of Range = $\dfrac{L-S}{L+S}$

If we compare two distributions having the same units and nearly equal means, the distribution with the larger Coefficient of Range is the more dispersed or more variable.

[ii] Semi-inter quartile range or Quartile Deviation (Q.D.)

The measure of dispersion based on the Quartiles is known as the Quartile Deviation (Q.D.). The difference between the upper Quartile (Q3) and the lower Quartile (Q1) is known as the inter quartile range. Half of the inter quartile range is known as the semi inter quartile range or Quartile Deviation (Q.D.).

Hence, Q.D. = $\dfrac{1}{2} [Q_3 - Q_1]$

The relative measure of dispersion based on the Quartile Deviation is called the Coefficient of Q.D. and is given by

Coefficient of Q.D. = $\dfrac{Q_3 - Q_1}{Q_3 + Q_1}$

The distribution has more variability (or less uniformity or less equality or
less consistency), if the distribution has more Q.D. or coefficient of Q.D.

[iii]. Mean Deviation

The arithmetic mean of the deviations of the items from the mean, median or mode is known as the Mean Deviation (M.D.) when all the deviations are considered as positive. If $\bar{X}$, Md, Mo denote the arithmetic mean, median and mode respectively, then the Mean Deviations from the mean, median and mode are as follows:

                      Individual series                  Discrete & continuous series
M.D. from Mean        $\dfrac{\sum |x - \bar{X}|}{n}$    $\dfrac{\sum f|x - \bar{X}|}{N}$
M.D. from Median      $\dfrac{\sum |x - M_d|}{n}$        $\dfrac{\sum f|x - M_d|}{N}$
M.D. from Mode        $\dfrac{\sum |x - M_o|}{n}$        $\dfrac{\sum f|x - M_o|}{N}$

Here x represents the mid value of the class intervals in a continuous series. All of the above are absolute measures of dispersion. The relative measures of dispersion based on the mean deviation are as follows:

Coefficient of M.D. from Mean = $\dfrac{\text{M.D. from Mean}}{\text{Mean}}$

Coefficient of M.D. from Median = $\dfrac{\text{M.D. from Median}}{\text{Median}}$

Coefficient of M.D. from Mode = $\dfrac{\text{M.D. from Mode}}{\text{Mode}}$

Note: Since the Mean Deviation is based on all the observations, it is a better measure of dispersion than the Range and Q.D.
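
The mean deviation and its coefficient can be sketched as follows. The function name is an illustrative assumption; the data are Series A and Series C from the introduction to this chapter, both with mean 12.

```python
def mean_deviation(xs, center, fs=None):
    """M.D. = sum(f * |x - center|) / N; center may be the mean, median or mode."""
    if fs is None:
        fs = [1] * len(xs)           # individual series: every frequency is 1
    return sum(f * abs(x - center) for x, f in zip(xs, fs)) / sum(fs)

series_a = [10, 11, 11, 12, 12, 12, 13, 13, 14]
series_c = [2, 7, 7, 12, 12, 12, 17, 17, 22]

md_a = mean_deviation(series_a, 12)
md_c = mean_deviation(series_c, 12)
coeff_a = md_a / 12                  # coefficient of M.D. from mean
coeff_c = md_c / 12
print(round(md_a, 3), round(md_c, 3), round(coeff_a, 3), round(coeff_c, 3))
```

Although both series have the same mean, Series C has the larger mean deviation, reflecting its greater variability.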

Standard Deviation:

The Standard Deviation (S.D.) is defined as the positive square root of the mean of the squares of the deviations taken from the arithmetic mean. It is denoted by σ.

If x denotes the variate values, $\bar{X}$ their arithmetic mean, n the number of items and N the total frequency, then the Standard Deviation (S.D.), denoted by σ or $\sigma_x$, is given by

$\sigma = \sqrt{\frac{\sum (x-\bar{X})^2}{n}} = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2}$ ........ [i]

$\sigma = \sqrt{\frac{\sum f(x-\bar{X})^2}{N}} = \sqrt{\frac{\sum fx^2}{N} - \left(\frac{\sum fx}{N}\right)^2}$ ........ [ii]

Formula [i] gives the S.D. of an individual series and formula [ii] that of a discrete or continuous series. In a continuous series x represents the mid value of each class.

When the deviations are taken from an assumed mean, let u = x − a (change of origin) and u' = $\frac{x-a}{h}$ (change of origin and scale), where a = assumed mean and h = class size. Then the S.D. (σ) is given by

$\sigma = \sqrt{\frac{\sum u^2}{n} - \left(\frac{\sum u}{n}\right)^2}$ ........ [iii]

$\sigma = \sqrt{\frac{\sum fu^2}{N} - \left(\frac{\sum fu}{N}\right)^2}$ ........ [iv]

$\sigma = h \times \sqrt{\frac{\sum fu'^2}{N} - \left(\frac{\sum fu'}{N}\right)^2}$ ........ [v]

Formula [iii] represents the short cut method (individual series), [iv] represents the short cut method (discrete & continuous series), and [v] represents the step deviation method (continuous series).
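
The following sketch suggests why the direct, short cut and step deviation formulas give the same S.D.: changing the assumed mean a or the divisor h does not alter the result. The mid values and frequencies are hypothetical, and the function names are chosen only for this illustration.

```python
from math import sqrt


def sd_direct(xs, fs):
    """Formula [ii]: deviations taken from the actual mean."""
    N = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / N
    return sqrt(sum(f * (x - mean) ** 2 for x, f in zip(xs, fs)) / N)


def sd_step_deviation(xs, fs, a, h):
    """Formula [v]; with h = 1 it reduces to the short cut formula [iv]."""
    N = sum(fs)
    us = [(x - a) / h for x in xs]
    mean_u = sum(f * u for u, f in zip(us, fs)) / N
    mean_u2 = sum(f * u * u for u, f in zip(us, fs)) / N
    return h * sqrt(mean_u2 - mean_u ** 2)


# Hypothetical mid values and frequencies of a continuous series.
mids = [5, 15, 25, 35, 45]
freqs = [2, 3, 5, 3, 2]

sd_a = sd_direct(mids, freqs)
sd_b = sd_step_deviation(mids, freqs, a=25, h=1)    # short cut method
sd_c = sd_step_deviation(mids, freqs, a=15, h=10)   # step deviation method
print(sd_a, sd_b, sd_c)
```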

Notes:

➢ S.D. is independent of change of origin but depends upon change of scale.
➢ In a discrete distribution, the standard deviation cannot be less than the Mean Deviation from the Mean.
➢ The coefficient of S.D. = $\frac{\sigma}{\bar{X}}$ (relative measure of standard deviation).
➢ The square of the Standard Deviation (S.D.) is known as the Variance of the distribution, which is denoted by $\sigma^2$.
➢ The Coefficient of Variation (C.V.) is always expressed as a percentage and is defined as C.V. = $\frac{\sigma}{\bar{X}} \times 100\%$.
➢ The root mean square deviation, denoted by 's', is defined as $s = \sqrt{\frac{\sum (x-a)^2}{N}}$, where 'a' is an arbitrary number or assumed mean. If 'a' is replaced by the A.M. ($\bar{X}$), the root mean square deviation 's' becomes the S.D. (σ).
➢ The least possible value of the root mean square deviation 's' is σ.
➢ Combined standard deviation: If n1 and n2 are the numbers of items in two series having means $\bar{X}_1$ and $\bar{X}_2$ and standard deviations σ1 and σ2 respectively, then the combined S.D. σ12 is defined as
$\sigma_{12} = \sqrt{\frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)}{n_1 + n_2}}$, where $d_1 = \bar{X}_1 - \bar{X}_{12}$ and $d_2 = \bar{X}_2 - \bar{X}_{12}$.
The combined A. mean is $\bar{X}_{12} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$.
➢ If C.V. of A < C.V. of B, then Series A is less variable than B [Variability],
or, Series A is more equitable than B [Equality],
or, Series A is more uniform than B [Uniformity],
or, Series A is more consistent than B [Consistency].
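
A small sketch of the combined mean and combined S.D. formulas from the notes above, checked against the S.D. of the pooled data. The two series are hypothetical and `combined_mean_sd` is a name invented for this example.

```python
from math import sqrt
from statistics import mean, pstdev


def combined_mean_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combined A.M. and combined S.D. of two series."""
    mean12 = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    d1, d2 = mean1 - mean12, mean2 - mean12
    var12 = (n1 * (sd1 ** 2 + d1 ** 2) + n2 * (sd2 ** 2 + d2 ** 2)) / (n1 + n2)
    return mean12, sqrt(var12)


# Hypothetical series, used only to check the formula.
series_a = [2, 4, 6]
series_b = [1, 3, 5, 7, 9]

mean12, sd12 = combined_mean_sd(
    len(series_a), mean(series_a), pstdev(series_a),
    len(series_b), mean(series_b), pstdev(series_b),
)
pooled = series_a + series_b
print(mean12, sd12, mean(pooled), pstdev(pooled))
```

Here `pstdev` is the population standard deviation (divisor n), which matches the definition of σ used in these notes.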

Merits and Demerits of deviations:

Merits of Range

▪ It is rigidly defined.
▪ It is easy to calculate and simple to understand.
▪ It takes minimum time to know about dispersion with the help of it.

Demerits of Range

▪ It is not based on all observations.


▪ It is affected by extreme values.
▪ It is affected by fluctuation of sampling.
▪ It is not suitable for further mathematical analysis.
▪ It can’t be calculated in the case of open end classes.

Merits of Q.D.

▪ It is rigidly defined.
▪ It is simple to understand and easy to calculate.
▪ It is not affected by extreme values.
▪ It can be calculated in the case of open end classes.
▪ It is more effective than range since it is based on 50% of central items.

Demerits of Q.D.

▪ It is affected by fluctuation of sampling.


▪ It is not suitable for further mathematical analysis.
▪ It is not based on all observations.

Merits of M.D.

▪ It is based on all observations.


▪ It is simple to understand and easy to calculate.
▪ It is reliable for comparison of two distributions about their formation, as the
deviations are taken from the central values.

Demerits of M.D.

▪ It is not capable of further algebraic treatment.


▪ It is not applicable in open end classes.
▪ It doesn’t give the satisfactory result, when deviations are taken from mode.
▪ The algebraic signs are ignored in the case of mean deviation.

Merits of S.D.

▪ It is rigidly defined.
▪ It is based on all observations.
▪ It is least affected by fluctuation of sampling.
▪ It is suitable for further mathematical analysis.

Demerits of S.D.

▪ It can’t be calculated for open end classes.


▪ It is difficult to calculate.
▪ It gives more weight to the extreme values and less weight to the items which
are near to the mean.

Worked out examples:

Ex.[1]. Calculate the following measures of dispersion and their coefficients: (a) Range, (b) Semi-inter quartile range or Q.D., (c) Mean Deviation (M.D.), (d) Standard Deviation (S.D.), from the data: 9, 15, 7, 14, 11, 9, 12, 10, 14.

Solution: The given data in ascending order are 7, 9, 9, 10, 11, 12, 14, 14, 15.

(a). Range = Largest item – Smallest item = 15 − 7 = 8.

Coefficient of Range = $\frac{L-S}{L+S} = \frac{15-7}{15+7}$ = 8/22 = 4/11 = 0.3636

(b). For Q1: Q1 = value of the $\left(\frac{n+1}{4}\right)^{th}$ item = value of the $\left(\frac{9+1}{4}\right)^{th}$ item

= value of the (2.5)th item = mean of the 2nd and 3rd items

= (9+9)/2 = 9

For Q3: Q3 = value of the $\left(\frac{3(n+1)}{4}\right)^{th}$ item = value of the $\left(\frac{3(9+1)}{4}\right)^{th}$ item

= value of the (7.5)th item = mean of the 7th and 8th items

= (14+14)/2 = 14.

Hence, Q.D. = $\frac{1}{2}(Q_3 - Q_1)$ = (14 − 9)/2 = 2.5.

And Coefficient of Q.D. = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{14-9}{14+9}$ = 5/23 = 0.217

(c). Mean, $\bar{X} = \frac{\sum X}{n}$ = (1/9)[7+9+9+10+11+12+14+14+15] = 101/9 = 11.22

M.D. from Mean = $\frac{\sum |x-\bar{X}|}{n}$

Computation of M.D. from Mean & its coefficient

x      |x − X̄| = |x − 11.22|
7      |−4.22| = 4.22
9      |−2.22| = 2.22
9      |−2.22| = 2.22
10     |−1.22| = 1.22
11     |−0.22| = 0.22
12     |0.78| = 0.78
14     |2.78| = 2.78
14     |2.78| = 2.78
15     |3.78| = 3.78
       Σ|x − X̄| = 20.22

Hence, M.D. from Mean = $\frac{\sum |x-\bar{X}|}{n}$ = 20.22/9 = 2.247.

Further, coefficient of M.D. from mean = $\frac{\text{M.D. from mean}}{\text{Mean}}$ = 2.247/11.22 = 0.200.

(d). We have, S.D. (σ) = $\sqrt{\frac{\sum u^2}{n} - \left(\frac{\sum u}{n}\right)^2}$, where u = x − a and the assumed mean a = 11.

Calculation of S.D. and its coefficient

Value (x)    u = x − 11    u²
7            -4            16
9            -2            4
9            -2            4
10           -1            1
11           0             0
12           1             1
14           3             9
14           3             9
15           4             16
             Σu = 2        Σu² = 60

Now, σ = $\sqrt{\frac{\sum u^2}{n} - \left(\frac{\sum u}{n}\right)^2} = \sqrt{\frac{60}{9} - \left(\frac{2}{9}\right)^2} = \sqrt{6.667 - 0.049}$

= $\sqrt{6.617}$ = 2.572.

Coefficient of S.D. = $\frac{\sigma}{\bar{X}}$ = 2.572/11.22 = 0.229.

Coefficient of Variation (C.V.) = $\frac{\sigma}{\bar{X}} \times 100\%$ = 2.572/11.22 × 100% = 22.9%.

Ex.[2]. Distributions A and B represent the marks obtained by a number of students in two distinct semesters of AFU. Examine which of the distributions has more variable marks by using a suitable coefficient.

Distribution A   Marks   Below 17.5   17.5-22.5   22.5-27.5   27.5-32.5   32.5-37.5
                 f       5            20          15          10          5
Distribution B   Marks   Below 12.5   12.5-17.5   17.5-22.5   22.5-27.5   27.5-32.5
                 f       8            12          13          22          5

Solution: Since both distributions have open-end classes, the mean deviation and the standard deviation are not appropriate. The quartile deviation is the appropriate measure, so we compare the coefficients of Q.D.

Computation of Q. D. and its coefficient

Distribution A Distribution B
Class f c.f. Class F c.f
interval interval
Below 17.5 5 5 Below 12.5 8 8
17.5-22.5 20 25 12.5-17.5 12 20
22.5-27.5 15 40 17.5-22.5 13 33
27.5-32.5 10 50 22.5-27.5 22 55
32.5-37.5 5 55 27.5-32.5 5 60
Series A:

For Q1: we have N/4 = 55/4 = 13.75. The c.f. just greater than 13.75 is 25, so Q1 lies in the class 17.5-22.5.

Q1 = $L + \frac{\frac{N}{4} - c.f.}{f} \times h$ = 17.5 + $\frac{13.75-5}{20}$ × 5 = 17.5 + 2.187 = 19.687.

For Q3: we have 3N/4 = 3 × 55/4 = 41.25. The c.f. just greater than 41.25 is 50, so Q3 lies in the class 27.5-32.5.

Q3 = $L + \frac{\frac{3N}{4} - c.f.}{f} \times h$ = 27.5 + $\frac{41.25-40}{10}$ × 5 = 27.5 + 0.625 = 28.125

Now, Coefficient of Q.D. = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{28.125-19.687}{28.125+19.687}$ = 8.438/47.812 = 0.176.

Series B:

For Q1: we have N/4 = 60/4 = 15. The c.f. just greater than 15 is 20, so Q1 lies in the class 12.5-17.5.

Q1 = $L + \frac{\frac{N}{4} - c.f.}{f} \times h$ = 12.5 + $\frac{15-8}{12}$ × 5 = 12.5 + 2.917 = 15.417.

For Q3: we have 3N/4 = 3 × 60/4 = 45. The c.f. just greater than 45 is 55, so Q3 lies in the class 22.5-27.5.

Q3 = $L + \frac{\frac{3N}{4} - c.f.}{f} \times h$ = 22.5 + $\frac{45-33}{22}$ × 5 = 22.5 + 2.727 = 25.227

Now, Coefficient of Q.D. = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{25.227-15.417}{25.227+15.417}$ = 9.810/40.644 = 0.241.

Since the coefficient of Q.D. of series B is greater than that of series A, series B is more variable than series A.
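
The interpolation formula used for the quartiles of a grouped series can be sketched as below. This is an illustrative helper, not part of the original notes: classes are written as (lower limit, upper limit, frequency), the open-end class is given a lower limit of `None`, and the function names are assumptions.

```python
def grouped_quantile(classes, fraction):
    """Quantile of a grouped series: L + (fraction*N - c.f.) / f * h."""
    total = sum(f for _, _, f in classes)
    target = fraction * total
    cumulative = 0                      # c.f. of the classes before the current one
    for lower, upper, f in classes:
        if cumulative + f >= target:    # first class whose c.f. reaches the target
            if lower is None:
                raise ValueError("quantile falls in an open-end class")
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")


def coef_qd(classes):
    """Coefficient of quartile deviation from grouped data."""
    q1 = grouped_quantile(classes, 0.25)
    q3 = grouped_quantile(classes, 0.75)
    return (q3 - q1) / (q3 + q1)


# Distributions A and B of the example above.
dist_a = [(None, 17.5, 5), (17.5, 22.5, 20), (22.5, 27.5, 15),
          (27.5, 32.5, 10), (32.5, 37.5, 5)]
dist_b = [(None, 12.5, 8), (12.5, 17.5, 12), (17.5, 22.5, 13),
          (22.5, 27.5, 22), (27.5, 32.5, 5)]

print(round(coef_qd(dist_a), 3), round(coef_qd(dist_b), 3))
```

Because neither quartile falls in the open-end class, the missing lower limit never enters the calculation, which is why the quartile method suits such data.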

Ex.[3]. The scores of two dancers in 8 rounds of a dance competition are as


follows:

Dancer A: 38 39 39 36 38 40 39 41

Dancer B: 42 40 43 45 43 41 40 44

Examine which may be considered to be more consistent dancer.

Solution: First of all we find the C.V. of the two dancers A and B.

Computation of C.V. of two dancers

Dancer A                          Dancer B
x     u = x−38    u²              x     v = x−42    v²
38    0           0               42    0           0
39    1           1               40    -2          4
39    1           1               43    1           1
36    -2          4               45    3           9
38    0           0               43    1           1
40    2           4               41    -1          1
39    1           1               40    -2          4
41    3           9               44    2           4
      Σu = 6      Σu² = 20              Σv = 2      Σv² = 24

Dancer A:

Now, $\bar{X} = a + \frac{\sum u}{n}$ = 38 + 6/8 = 38.75

Also, σ = $\sqrt{\frac{\sum u^2}{n} - \left(\frac{\sum u}{n}\right)^2} = \sqrt{\frac{20}{8} - \left(\frac{6}{8}\right)^2} = \sqrt{2.5 - 0.5625}$

= $\sqrt{1.9375}$ = 1.392.

C.V. of A = $\frac{\sigma}{\bar{X}} \times 100\%$ = $\frac{1.392}{38.75}$ × 100% = 3.59%

Dancer B:

Now, $\bar{X} = a + \frac{\sum v}{n}$ = 42 + 2/8 = 42.25

Also, σ = $\sqrt{\frac{\sum v^2}{n} - \left(\frac{\sum v}{n}\right)^2} = \sqrt{\frac{24}{8} - \left(\frac{2}{8}\right)^2} = \sqrt{3 - 0.0625}$

= $\sqrt{2.9375}$ = 1.714.

C.V. of B = $\frac{\sigma}{\bar{X}} \times 100\%$ = $\frac{1.714}{42.25}$ × 100% = 4.06%

Since the C.V. of dancer A is less than that of dancer B, A is the more consistent dancer.
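
The comparison of consistency through the C.V. can be sketched as follows, using the dancers' scores given in the example. The function name `coefficient_of_variation` is an assumption, and `pstdev` (divisor n) is used because it matches the definition of σ in these notes.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """C.V. = (sigma / mean) x 100, with sigma computed using divisor n."""
    return pstdev(values) / mean(values) * 100


dancer_a = [38, 39, 39, 36, 38, 40, 39, 41]
dancer_b = [42, 40, 43, 45, 43, 41, 40, 44]

cv_a = coefficient_of_variation(dancer_a)
cv_b = coefficient_of_variation(dancer_b)

# The series with the smaller C.V. is the more consistent one.
more_consistent = "A" if cv_a < cv_b else "B"
print(round(cv_a, 2), round(cv_b, 2), more_consistent)
```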

EX.[4]. The weights (in kg) of 50 logs of a chatta are as follows:

Weight (kg) above:   0    10   20   30   40

Number of logs:      50   44   28   12   5

If no log weighs more than 50 kg, find the mean, standard deviation, coefficient of standard deviation and C.V.

Solution:

Computation of mean, Standard deviation and C.V.

Weight (kg)   Mid value 'x'   f           u' = (x−25)/10   fu'          fu'²
0-10          5               6           -2               -12          24
10-20         15              16          -1               -16          16
20-30         25              16          0                0            0
30-40         35              7           1                7            7
40-50         45              5           2                10           20
                              Σf = N = 50                  Σfu' = -11   Σfu'² = 67

We have, mean

$\bar{X} = a + \frac{\sum fu'}{N} \times h$ = 25 + $\frac{-11}{50}$ × 10 = 25 − 2.2 = 22.8 kg.

S.D. (σ) = $h \times \sqrt{\frac{\sum fu'^2}{N} - \left(\frac{\sum fu'}{N}\right)^2}$ = 10 × $\sqrt{\frac{67}{50} - \left(\frac{-11}{50}\right)^2}$

= 10 × $\sqrt{1.34 - 0.0484}$ = 10 × 1.1364 = 11.36

Now, Coefficient of S.D. = $\frac{\sigma}{\bar{X}} = \frac{11.36}{22.8}$ = 0.498.

Finally, Coefficient of Variation, C.V. = $\frac{\sigma}{\bar{X}} \times 100\%$ = $\frac{11.36}{22.8}$ × 100% = 49.8%
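
The first step of this solution, turning the "more than" cumulative frequencies into class frequencies, is easy to get wrong by hand. A minimal sketch of that step followed by the step deviation method is given below; the variable names are illustrative and the assumed mean a = 25 follows the solution above.

```python
from math import sqrt

lower_limits = [0, 10, 20, 30, 40]     # "weight above" values
more_than_cf = [50, 44, 28, 12, 5]     # number of logs above each value
width = 10                             # class size h; last class ends at 50 kg

# Frequency of a class = logs above its lower limit minus logs above the next one.
freqs = [cf - nxt for cf, nxt in zip(more_than_cf, more_than_cf[1:] + [0])]
mids = [low + width / 2 for low in lower_limits]

a = 25                                 # assumed mean
N = sum(freqs)
us = [(x - a) / width for x in mids]   # step deviations u'
sum_fu = sum(f * u for f, u in zip(freqs, us))
sum_fu2 = sum(f * u * u for f, u in zip(freqs, us))

mean_weight = a + sum_fu / N * width
sd_weight = width * sqrt(sum_fu2 / N - (sum_fu / N) ** 2)
cv = sd_weight / mean_weight * 100
print(freqs, round(mean_weight, 2), round(sd_weight, 2), round(cv, 1))
```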

Ex.[5]. The marks of (a) 8 students of a class as well as (b) all 80 students of that
class in two subjects are as follows. Determine by a suitable method which
subject has more uniform marks in each case.

(a). Marks in Statistics: 34 43 28 35 30 37 29 36

Marks in Cytology: 42 27 34 30 36 30 34 31

(b).

Marks 0-10 10-20 20-30 30-40 40 -50


No. of students in Statistics 10 15 20 22 13
No. of students in Cytology 12 20 25 15 8
Solution: (a).

Calculation of mean, S.D. and C.V. of marks

Marks in Statistics                            Marks in Cytology
x     x − X̄ = x − 34    (x − X̄)²              y     y − Ȳ = y − 33    (y − Ȳ)²
34    0                 0                      42    9                 81
43    9                 81                     27    -6                36
28    -6                36                     34    1                 1
35    1                 1                      30    -3                9
30    -4                16                     36    3                 9
37    3                 9                      30    -3                9
29    -5                25                     34    1                 1
36    2                 4                      31    -2                4
Σx = 272                Σ(x − X̄)² = 172        Σy = 264                Σ(y − Ȳ)² = 150

Now, Mean marks in Statistics = $\frac{\sum x}{n}$ = 272/8 = 34

Mean marks in Cytology = $\frac{\sum y}{n}$ = 264/8 = 33

S.D. of marks in Statistics, $\sigma_x = \sqrt{\frac{\sum (x-\bar{X})^2}{n}} = \sqrt{\frac{172}{8}} = \sqrt{21.5}$ = 4.637

S.D. of marks in Cytology, $\sigma_y = \sqrt{\frac{\sum (y-\bar{Y})^2}{n}} = \sqrt{\frac{150}{8}} = \sqrt{18.75}$ = 4.33

C.V. of marks in Statistics, (C.V.)x = $\frac{\sigma_x}{\bar{X}} \times 100\%$ = $\frac{4.637}{34}$ × 100% = 13.64%.

C.V. of marks in Cytology, (C.V.)y = $\frac{\sigma_y}{\bar{Y}} \times 100\%$ = $\frac{4.33}{33}$ × 100% = 13.12%.

Here, (C.V.)x > (C.V.)y, so the marks in Cytology are more uniform than those in Statistics.

(b).

Calculation of mean, S.D. and C.V.

                          Marks in Statistics                       Marks in Cytology
Marks   Mid value (x)     f     u' = (x−25)/10   fu'    fu'²        f     v' = (x−25)/10   fv'    fv'²
0-10    5                 10    -2               -20    40          12    -2               -24    48
10-20   15                15    -1               -15    15          20    -1               -20    20
20-30   25                20    0                0      0           25    0                0      0
30-40   35                22    1                22     22          15    1                15     15
40-50   45                13    2                26     52          8     2                16     32
                          N=80                   Σ=13   Σ=129       N=80                   Σ=-13  Σ=115

Now, A. mean of marks in Statistics,

$\bar{X} = a + \frac{\sum fu'}{N} \times h$ = 25 + $\frac{13}{80}$ × 10 = 25 + 1.625 = 26.625

S.D. (σ) = $h \times \sqrt{\frac{\sum fu'^2}{N} - \left(\frac{\sum fu'}{N}\right)^2}$ = 10 × $\sqrt{\frac{129}{80} - \left(\frac{13}{80}\right)^2}$ = 10 × $\sqrt{1.6125 - 0.0264}$

= 10 × 1.259 = 12.59

C.V. = $\frac{\sigma}{\bar{X}} \times 100\%$ = $\frac{12.59}{26.625}$ × 100% = 47.29%

Now, A. mean of marks in Cytology,

$\bar{X} = a + \frac{\sum fv'}{N} \times h$ = 25 + $\frac{-13}{80}$ × 10 = 25 − 1.625 = 23.375

S.D. (σ) = $h \times \sqrt{\frac{\sum fv'^2}{N} - \left(\frac{\sum fv'}{N}\right)^2}$ = 10 × $\sqrt{\frac{115}{80} - \left(\frac{-13}{80}\right)^2}$ = 10 × $\sqrt{1.4375 - 0.0264}$

= 10 × $\sqrt{1.4111}$ = 10 × 1.188 = 11.88

C.V. = $\frac{\sigma}{\bar{X}} \times 100\%$ = $\frac{11.88}{23.375}$ × 100% = 50.82%

Since the C.V. of marks in Statistics < the C.V. of marks in Cytology, the marks in
Statistics are more uniform than those in Cytology.

SKEWNESS , MOMENT AND KURTOSIS

SKEWNESS

# When a series is not symmetrical, such a distribution is said to be skewed or asymmetrical. - CROXTON and COWDEN.

# Skewness refers to the asymmetry or lack of symmetry in the shape of a frequency distribution. - MORRIS HAMBURG.

The word skewness refers to lack of symmetry: a distribution that is not
symmetrical is called a skewed distribution. In a symmetrical distribution, the values
of the mean, median and mode coincide,

i.e., Mean = Median = Mode

When the distribution is skewed, it can be either positively skewed or negatively skewed.

Hence, a distribution of data is said to be skewed, if

a. arithmetic mean ≠ median ≠ mode.

b. quartiles are not equidistant from the median, i.e., Md − Q1 ≠ Q3 − Md.

c. the curve drawn from the frequency distribution is not a symmetrical bell-shaped curve.

TYPES OF SKEWNESS :

a. NO SKEWNESS or SYMMETRICAL

A distribution of data is said to have no skewness if the curve drawn from the data is
elongated neither more to the left nor more to the right. In this case, the curve is
equally elongated to the right and to the left, and both tails of the curve approach the
horizontal axis asymptotically. In other words, the distribution has no skewness if
MEAN = MEDIAN = MODE.

b. POSITIVELY SKEWED

A distribution of data is said to be right skewed or positively skewed if the
curve drawn from the data is more elongated to the right side. In this case, we can
observe that: MEAN > MEDIAN > MODE.

c. NEGATIVELY SKEWED

If the curve drawn from the data is more elongated to the left side, such a
distribution of data is said to be left skewed or negatively skewed. In this case
we can observe that: MEAN < MEDIAN < MODE.

TEST OF SYMMETRIC AND SKEWED DISTRIBUTION :

When the data are symmetrical, the following conditions are satisfied:

1. The values of the mean, median and mode coincide.

2. When the data are plotted on a graph, the normal bell-shaped form is obtained,
with both tails approaching the horizontal axis asymptotically.

3. The sum of the positive deviations from the median is equal to the sum of the
negative deviations.

4. Both quartiles are equidistant from the median.

5. Frequencies are equally distributed at points of equal deviation from the
mode.

If these conditions are not satisfied, the data are skewed.

MEASURES OF SKEWNESS:

1. FIRST ABSOLUTE MEASURE OF SKEWNESS = Mean − Mode

= Mean − (3 Median − 2 Mean)

= 3 (Mean − Median)

2. SECOND ABSOLUTE MEASURE OF SKEWNESS = (Q3 − Md) − (Md − Q1)

= Q3 + Q1 − 2 Md

The absolute measures of skewness are not widely used because they are expressed in the
original units of the data. The coefficient of skewness, which is the relative measure, is used instead.

3a. Karl Pearson's coefficient of skewness (SKp):

It is also known as the "Pearsonian coefficient of skewness".

$SK_p = \frac{\text{mean} - \text{mode}}{\sigma} = \frac{3(\text{mean} - \text{median})}{\sigma}$

Note: 1. SKp usually lies between −1 and +1; when it is computed from the median form, its limits are −3 ≤ SKp ≤ 3.

2. If Mean = Median = Mode, i.e. the distribution is symmetrical, SKp = 0.

3. If the data are negatively skewed, then SKp < 0 (usually −1 ≤ SKp < 0).

If the data are positively skewed, then SKp > 0 (usually 0 < SKp ≤ 1).

3b. Bowley's coefficient of skewness (SKb)

$SK_b = \frac{(Q_3 - M_d) - (M_d - Q_1)}{(Q_3 - M_d) + (M_d - Q_1)} = \frac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1}$

Note: 1. If the distribution is symmetrical, Q3 − Md = Md − Q1, i.e. Q3 + Q1 − 2 Md = 0.

2. −1 ≤ SKb ≤ 1

3. If the distribution is positively skewed, 0 < SKb ≤ 1 and Q3 − Md > Md − Q1.

4. If the distribution is negatively skewed, −1 ≤ SKb < 0 and Q3 − Md < Md − Q1.

3c. Kelly's coefficient of skewness (SKk):

$SK_k = \frac{(P_{90} - M_d) - (M_d - P_{10})}{(P_{90} - M_d) + (M_d - P_{10})} = \frac{P_{90} + P_{10} - 2M_d}{P_{90} - P_{10}} = \frac{D_9 + D_1 - 2M_d}{D_9 - D_1}$
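
The three coefficients of skewness can be written as small Python functions, as sketched below. The function names and the input values are hypothetical; the values are chosen so that Mode = 3 Median − 2 Mean holds, which makes the two forms of Pearson's coefficient agree.

```python
def pearson_skewness(mean, mode, sd):
    """Karl Pearson's coefficient based on the mode."""
    return (mean - mode) / sd


def pearson_skewness_median(mean, median, sd):
    """Karl Pearson's coefficient based on the median."""
    return 3 * (mean - median) / sd


def bowley_skewness(q1, median, q3):
    """Bowley's coefficient based on the quartiles."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


def kelly_skewness(p10, median, p90):
    """Kelly's coefficient based on the 10th and 90th percentiles."""
    return (p90 + p10 - 2 * median) / (p90 - p10)


# Hypothetical summary values.
sk_mode = pearson_skewness(mean=50, mode=44, sd=10)
sk_median = pearson_skewness_median(mean=50, median=48, sd=10)
sk_bowley = bowley_skewness(q1=40, median=48, q3=60)
sk_kelly = kelly_skewness(p10=30, median=48, p90=75)
print(sk_mode, sk_median, sk_bowley, sk_kelly)
```

A positive result from any of the functions indicates positive skewness and a negative result indicates negative skewness.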

Examples worked out:



EX.[1]. Calculate the coefficient of skewness from the following data of investment ('000' Rs.) in the forest based enterprises by a certain community forestry of Bagmati province.

Investment ('000' Rs.) x:     10-20   20-30   30-40   40-50   50-60   60-70   70-80
Number of enterprises f:      12      18      20      15      10      3       2

Solution: Suppose the assumed mean 'a' = 45

Calculation of coefficient of skewness

Investment   Mid-value x   Frequency f   d' = (x−45)/10   fd'          fd'²
10-20        15            12            -3               -36          108
20-30        25            18            -2               -36          72
30-40        35            20            -1               -20          20
40-50        45            15            0                0            0
50-60        55            10            1                10           10
60-70        65            3             2                6            12
70-80        75            2             3                6            18
                           N = Σf = 80                    Σfd' = -70   Σfd'² = 240

We have, mean investment, $\bar{X} = a + \frac{\sum fd'}{N} \times h$ = 45 + $\frac{-70}{80}$ × 10 = 45 − 8.75 = 36.25

Also, S.D. (σ) = $h \times \sqrt{\frac{\sum fd'^2}{N} - \left(\frac{\sum fd'}{N}\right)^2}$ = 10 × $\sqrt{\frac{240}{80} - \left(\frac{-70}{80}\right)^2}$

= 10 × $\sqrt{3 - 0.766}$ = 14.95 (approximately)

The modal class is 30-40, since the maximum frequency, 20, corresponds to this class. Now, the mode is

$M_o = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$ = 30 + $\frac{20-18}{2 \times 20 - 18 - 15}$ × 10 = 30 + 2.86 = 32.86

where L = 30, f1 = 20, f0 = 18, f2 = 15.

Hence, the coefficient of skewness is

$S_k = \frac{\bar{X} - M_o}{\sigma} = \frac{36.25 - 32.86}{14.95} = \frac{3.39}{14.95}$ = 0.227

EX.[2]. Compare the skewness of the wages distributed by two companies P & Q.
Also comment on the result. Some information determined from the data of wages
for the workers of P & Q is as follows:

Company    Md    Mo    S.D.
P          60    65    10
Q          65    68    12

Solution: For Company P, Mode = 3 Median − 2 Mean

or, 65 = 3 × 60 − 2$\bar{X}$, then $\bar{X}$ = (1/2)[180 − 65] = 57.5

Hence, $S_k = \frac{\bar{X} - \text{mode}}{\sigma}$ = [57.5 − 65]/10 = −0.75

For Company Q, Mode = 3 Median − 2 Mean

or, 68 = 3 × 65 − 2$\bar{X}$, then $\bar{X}$ = (1/2)[195 − 68] = 63.5

Hence, $S_k = \frac{\bar{X} - \text{mode}}{\sigma}$ = [63.5 − 68]/12 = −0.375

By comparing the skewness of the two distributions we conclude that both
distributions are negatively skewed and that distribution P is more skewed than Q.
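
The steps of this solution can be sketched in Python: the mean is recovered from the empirical relation Mode = 3 Median − 2 Mean, and Pearson's coefficient follows. The helper name `mean_from_median_mode` is an assumption made for this illustration.

```python
def mean_from_median_mode(median, mode):
    """Rearranged empirical relation: Mean = (3 Median - Mode) / 2."""
    return (3 * median - mode) / 2


# (median, mode, S.D.) for the two companies of the example.
companies = {"P": (60, 65, 10), "Q": (65, 68, 12)}

results = {}
for name, (md, mo, sd) in companies.items():
    mean = mean_from_median_mode(md, mo)
    results[name] = (mean, (mean - mo) / sd)   # (mean, Pearson's coefficient)

for name, (mean, sk) in results.items():
    print(name, mean, sk)
```

The empirical relation holds only approximately, and only for moderately skewed distributions, so the recovered mean is an estimate.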

EX.[3]. Find the mean and mode of the frequency distribution which gives the
following results: C.V. = 7.5, S.D. = 2 and Karl Pearson's coefficient of skewness
(Sk) = 0.5.

Solution: We know, C.V. = $\frac{\sigma}{\bar{X}} \times 100\%$

Hence, $\bar{X} = \frac{\sigma}{C.V.} \times 100$ = 200/7.5 = 26.67

Also we know, $S_k = \frac{\bar{X} - M_o}{\sigma}$, or, 0.5 = $\frac{26.67 - M_o}{2}$

Hence, Mode (Mo) = 26.67 − 1 = 25.67

EX.[4]. In a certain distribution, the following results are obtained: Mean ($\bar{x}$) = 45,
Median (Md) = 48, Coefficient of skewness (Sk) = −0.4. Find the s.d. (σ) and the
coefficient of variation. [solve it !]

EX.[5]. In a moderately skewed frequency distribution, the mean is 10 kg and the
median is 8.50 kg. Find Pearson's coefficient of skewness of the distribution if
the coefficient of variation is 20%. [solve!]

EX.[6]. From the following information, calculate Karl Pearson's coefficient of
skewness: Σx = 452, Σx² = 24270, Mode = 43.7 and number of observations = 10.

Solution: We have, Karl Pearson's coefficient of skewness, Sk(P) = $\frac{\text{Mean} - \text{Mode}}{\text{S.D.}}$

Here, Mean, $\bar{X} = \frac{\sum X}{n}$ = 452/10 = 45.2

S.D. (σ) = $\sqrt{\frac{1}{n}\sum X^2 - \left(\frac{\sum X}{n}\right)^2} = \sqrt{\frac{1}{10} \times 24270 - \left(\frac{452}{10}\right)^2} = \sqrt{2427 - 2043.04}$

= $\sqrt{383.96}$ = 19.59

Hence, Sk(P) = $\frac{\text{Mean} - \text{Mode}}{\text{S.D.}}$ = [45.2 − 43.7]/19.59 = 1.5/19.59 = 0.077

EX.[7]. Find Bowley's coefficient of skewness [or skewness based on
quartiles] from the following data of the length of the logs (inch) on a chatta.

Length of log (inch):   Less than 36   36-41   42-47   48-53   54-59   60-65   66 & above
No. of logs (f):        17             23      39      38      27      19      12

Solution:

Computation of coefficient of skewness

Length of log (inch)    Number of logs (f)    c.f.
Less than 36            17                    17
36-41                   23                    40
42-47                   39                    79
48-53                   38                    117
54-59                   27                    144
60-65                   19                    163
66 and above            12                    175

Here, N/2 = 175/2 = 87.5, N/4 = 175/4 = 43.75, and 3N/4 = 3 × 175/4 = 131.25

Hence, the median lies in the class 48-53 with real limits 47.5-53.5.

Therefore, $M_d = L + \frac{\frac{N}{2} - c.f.}{f} \times h$ = 47.5 + $\frac{87.5-79}{38}$ × 6 = 48.84

Now, Q1 lies in the class 42-47 with real limits 41.5-47.5.

Therefore, $Q_1 = L + \frac{\frac{N}{4} - c.f.}{f} \times h$ = 41.5 + $\frac{43.75-40}{39}$ × 6 = 42.08

In the same way, Q3 lies in the class 54-59 with real limits 53.5-59.5.

Therefore, $Q_3 = L + \frac{\frac{3N}{4} - c.f.}{f} \times h$ = 53.5 + $\frac{131.25-117}{27}$ × 6 = 56.67

Finally, the coefficient of skewness based on quartiles (Bowley's coefficient of skewness)

= $\frac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1} = \frac{56.67 + 42.08 - 2 \times 48.84}{56.67 - 42.08}$

= $\frac{98.75 - 97.68}{14.59}$ = 1.07/14.59 = 0.073

EX.[8]. From the following frequency distribution calculate Bowley's coefficient of skewness:

Monthly wages ('000' Rs.):   23-27   28-32   33-37   38-42   43-47   48-52
Number of workers (f):       22      16      9       4       3       1

Solution:

Computation of coefficient of skewness

Monthly wages ('000' Rs.)    Number of workers (f)    c.f.
23-27                        22                       22
28-32                        16                       38
33-37                        9                        47
38-42                        4                        51
43-47                        3                        54
48-52                        1                        55

Here, N/2 = 55/2 = 27.5, N/4 = 55/4 = 13.75, and 3N/4 = 3 × 55/4 = 41.25

Hence, the median lies in the class 28-32 with real limits 27.5-32.5.

Therefore, $M_d = L + \frac{\frac{N}{2} - c.f.}{f} \times h$ = 27.5 + $\frac{27.5-22}{16}$ × 5 = 29.22

Now, Q1 lies in the class 23-27 with real limits 22.5-27.5.

Therefore, $Q_1 = L + \frac{\frac{N}{4} - c.f.}{f} \times h$ = 22.5 + $\frac{13.75-0}{22}$ × 5 = 22.5 + 3.125 = 25.625

In the same way, Q3 lies in the class 33-37 with real limits 32.5-37.5.

Therefore, $Q_3 = L + \frac{\frac{3N}{4} - c.f.}{f} \times h$ = 32.5 + $\frac{41.25-38}{9}$ × 5 = 34.31

Finally, the coefficient of skewness based on quartiles (Bowley's coefficient of skewness)

= $\frac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1} = \frac{34.31 + 25.625 - 2 \times 29.22}{34.31 - 25.625}$

= $\frac{1.495}{8.685}$ = 0.172

EX.[9]. The coefficient of skewness for a certain distribution is −0.8. Calculate the
median, if the upper and lower quartiles are respectively 56.6 & 44.1.

Solution: Here, upper quartile Q3 = 56.6, lower quartile Q1 = 44.1,

Bowley's coefficient of skewness Sk(B) = −0.8, Median Md = ?

We know, Sk(B) = $\frac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1} = \frac{56.6 + 44.1 - 2M_d}{56.6 - 44.1}$

Or, −0.8 = $\frac{100.7 - 2M_d}{12.5}$, or, −0.8 × 12.5 = 100.7 − 2 Md

Or, 2 Md = 100.7 + 10 = 110.7

∴ Median, Md = 55.35

EX.[10]. In a certain distribution, the coefficient of skewness is 0.6. If the sum of the two
extreme quartiles is 100 and the median is 38, find the value of both quartiles.

Solution: Here, Sk(B) = 0.6, Q1 + Q3 = 100, Median Md = 38, Q1 = ?, Q3 = ?

We know, Sk(B) = $\frac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1} = \frac{100 - 2 \times 38}{Q_3 - Q_1}$

Or, 0.6 = 24/[Q3 − Q1], or, Q3 − Q1 = 24/0.6 = 40

Solving the two equations Q3 + Q1 = 100 & Q3 − Q1 = 40, we get

Q3 = 70 & Q1 = 30.

EX.[11]. Calculate Pearson's coefficient of skewness from the following frequency distribution.

Hourly wage (Rs.):     0-10   10-20   20-30   30-40   40-50   50-60
No. of workers (f):    13     15      18      23      19      12

Solution:

Computation of Pearson's coefficient of skewness

C.I.     Mid value (X)   f         U' = [X−35]/10   fU'          fU'²
0-10     5               13        -3               -39          117
10-20    15              15        -2               -30          60
20-30    25              18        -1               -18          18
30-40    35              23        0                0            0
40-50    45              19        1                19           19
50-60    55              12        2                24           48
                         N = 100                    ΣfU' = -44   ΣfU'² = 262

Here, $\bar{X} = a + \frac{\sum fU'}{N} \times h$ = 35 + $\frac{-44}{100}$ × 10 = 30.6

Now, Mode = $L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$ = 30 + $\frac{23-18}{2 \times 23 - 18 - 19}$ × 10 = 35.56

S.D. (σ) = $h \times \sqrt{\frac{\sum fU'^2}{N} - \left(\frac{\sum fU'}{N}\right)^2}$ = 10 × $\sqrt{\frac{262}{100} - \left(\frac{-44}{100}\right)^2}$ = 15.58

Hence, Pearson's coefficient of skewness,

Sk(P) = $\frac{\text{Mean} - \text{Mode}}{\text{S.D.}} = \frac{30.6 - 35.56}{15.58}$ = −0.32

EX.[12]. The following data represent the heights of the trees in a garden.
Calculate Karl Pearson's coefficient of skewness from the given data:

Height of trees (ft.):   Below 7   Below 14   Below 21   Below 28   Below 35   Below 42   Below 49   Below 56
No. of trees (c.f.):     26        57         92         134        216        287        341        350

Solution:

Computation of Karl Pearson's coefficient of skewness

Height (ft.)   Mid value X   Frequency f   U' = (X−31.5)/7   fU'           fU'²
0-7            3.5           26            -4                -104          416
7-14           10.5          31            -3                -93           279
14-21          17.5          35            -2                -70           140
21-28          24.5          42            -1                -42           42
28-35          31.5          82            0                 0             0
35-42          38.5          71            1                 71            71
42-49          45.5          54            2                 108           216
49-56          52.5          9             3                 27            81
                             Σf = N = 350                    ΣfU' = -103   ΣfU'² = 1245

Now, Mean, $\bar{X} = a + \frac{\sum fU'}{N} \times h$ = 31.5 + $\frac{-103}{350}$ × 7 = 31.5 − 2.06 = 29.44 ft.

Since the maximum frequency f1 = 82, the mode lies in the class interval 28-35.

So, Mode, $M_o = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$ = 28 + $\frac{82-42}{2 \times 82 - 42 - 71}$ × 7 = 33.49.

Also, S.D. (σ) = $h \times \sqrt{\frac{\sum fU'^2}{N} - \left(\frac{\sum fU'}{N}\right)^2}$ = 7 × $\sqrt{\frac{1245}{350} - \left(\frac{-103}{350}\right)^2}$

= 7 × $\sqrt{3.4705}$ = 13.04

Finally, Karl Pearson's coefficient of skewness,

Sk(P) = $\frac{\text{Mean} - \text{Mode}}{\text{S.D.}} = \frac{29.44 - 33.49}{13.04}$ = −0.31

Hence, Sk(P) = −0.31, which shows that the distribution is negatively skewed.

EX.[13]. From the following distribution, find Karl Pearson's coefficient
of skewness.

Length of log (inch):   Above 48   Above 50   Above 52   Above 54   Above 56   Above 58   Up to 60
Number of logs 'f':     100        80         65         30         17         8          100

[Solve it !]

Moment

Let the symbol 'x' represent the deviation of any item in a distribution from
the arithmetic mean of that distribution, i.e. $x = X - \bar{X}$. The arithmetic mean of the various
powers of these deviations in any distribution is called the moment of the
distribution.

The arithmetic mean of the 1st power of the deviations is known as the
first moment about the mean.

The arithmetic mean of the squares of the deviations is known as the second
moment about the mean.

The arithmetic mean of the cubes of the deviations is known as the third
moment about the mean.

The moments about the mean are known as the central moments. They are
denoted by the Greek letter µ (mu). Thus the symbols µ1, µ2, µ3, etc. represent the first
moment, second moment and third moment respectively.

$\mu_1 = \frac{\sum (X-\bar{X})}{n} = \frac{\sum x}{n}$

µ1 = 0, since the sum of the deviations of the items from their arithmetic mean is
always zero.

$\mu_2 = \frac{\sum (X-\bar{X})^2}{n} = \frac{\sum x^2}{n}$   {$\mu_2 = \sigma^2$, and $\sigma = \sqrt{\mu_2}$}

$\mu_3 = \frac{\sum (X-\bar{X})^3}{n} = \frac{\sum x^3}{n}$

In a discrete series,

$\mu_1 = \frac{\sum f(X-\bar{X})}{N} = \frac{\sum fx}{N}$

$\mu_2 = \frac{\sum f(X-\bar{X})^2}{N} = \frac{\sum fx^2}{N}$

$\mu_3 = \frac{\sum f(X-\bar{X})^3}{N} = \frac{\sum fx^3}{N}$

$\mu_4 = \frac{\sum f(X-\bar{X})^4}{N} = \frac{\sum fx^4}{N}$

TWO IMPORTANT CONSTANTS: $\beta_1 = \frac{\mu_3^2}{\mu_2^3}$ and $\beta_2 = \frac{\mu_4}{\mu_2^2}$

where β1 (beta one) represents skewness and β2 (beta two) represents kurtosis.

#note: The odd central moments are always zero in a symmetrical distribution; however, this
rule doesn't hold in the case of an asymmetric distribution.

MEASUREMENT OF SKEWNESS BASED ON MOMENTS:

A measure of skewness is obtained by using µ2 and µ3 about the mean:

$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$, where β1 is used as a relative measure of skewness. In a symmetric
distribution, β1 is zero. The greater the value of β1, the more skewed the distribution.
But β1 doesn't determine the direction (positive or negative) of the skewness: β1
is never negative, since $\mu_3^2$ is a square and $\mu_2^3$ is the cube of the variance
(which is always positive). To remove this drawback, the square root of β1 is taken, with the sign of µ3, as γ1
(gamma one):

$\gamma_1 = \sqrt{\beta_1} = \frac{\mu_3}{\sqrt{\mu_2^3}} = \frac{\mu_3}{\sqrt{(\sigma^2)^3}} = \frac{\mu_3}{\sigma^3}$

So the sign (positive or negative) of γ1 depends upon µ3. If µ3 > 0, the skewness
is positive. If µ3 < 0, the skewness is negative.
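
The central moments and the constants β1, γ1 and β2 can be computed for an individual series as sketched below. The two data sets are hypothetical (one symmetrical, one with a long left tail) and the function names are illustrative.

```python
def central_moment(values, r):
    """r-th moment about the mean of an individual series."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


def moment_measures(values):
    """Return beta1, gamma1 and beta2 computed from the central moments."""
    mu2 = central_moment(values, 2)
    mu3 = central_moment(values, 3)
    mu4 = central_moment(values, 4)
    return {
        "beta1": mu3 ** 2 / mu2 ** 3,
        "gamma1": mu3 / mu2 ** 1.5,     # carries the sign of mu3
        "beta2": mu4 / mu2 ** 2,
    }


symmetric = [1, 2, 3, 4, 5]       # hypothetical symmetrical data
left_tailed = [1, 7, 8, 9, 10]    # hypothetical data with a long left tail

sym = moment_measures(symmetric)
left = moment_measures(left_tailed)
print(sym)
print(left)
```

For the symmetrical series both β1 and γ1 vanish, while the left-tailed series gives a negative γ1 even though its β1 is positive.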

KURTOSIS

Kurtosis means the 'bulginess' of the curve, that is, the flatness or peakedness in
the region about the mode of the frequency curve.

The degree of kurtosis is measured relative to the peakedness of the normal curve.

# If a curve is more peaked than the normal curve, it is called leptokurtic.

# If a curve is flatter than the normal curve, it is called platykurtic.

# The normal curve itself is called mesokurtic.

The condition of peakedness or flatness is known as kurtosis.

MEASURE OF KURTOSIS:

a. By Moment:

The measure of kurtosis depends upon the coefficient β2, where $\beta_2 = \frac{\mu_4}{\mu_2^2}$.

The greater the value of β2, the greater the peakedness of the curve.

#Note: 1. If β2 > 3, the curve is more peaked than the normal curve. Such a curve is
leptokurtic.

#Note: 2. If β2 < 3, the curve is less peaked than the normal curve. Such a curve is
platykurtic.

#Note: 3. If β2 = 3, the curve is mesokurtic or normal.

If we take γ2 = β2 − 3, the following conditions are found related to kurtosis:
➢ If γ2 = 0, then the curve is normal or mesokurtic.
➢ If γ2 > 0, then the curve is leptokurtic.
➢ If γ2 < 0, then the curve is platykurtic.

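b. By partition value coefficient method:

The measure of kurtosis based on the quartiles and the deciles (or percentiles) is defined as:

$k = \frac{1}{2} \cdot \frac{Q_3 - Q_1}{D_9 - D_1} = \frac{1}{2} \cdot \frac{Q_3 - Q_1}{P_{90} - P_{10}}$

where k is called the partition value (percentile) coefficient of kurtosis.

➢ If k = 0.263, the distribution is normal or mesokurtic.
➢ If k < 0.263, the distribution is leptokurtic (the tails are long relative to the central half of the data).
➢ If k > 0.263, the distribution is platykurtic (the tails are short relative to the central half of the data).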
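
The reference value 0.263 can be checked numerically from the quartiles and percentiles of the standard normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `percentile_kurtosis` is an assumption.

```python
from statistics import NormalDist


def percentile_kurtosis(q1, q3, p10, p90):
    """k = (1/2) (Q3 - Q1) / (P90 - P10)."""
    return 0.5 * (q3 - q1) / (p90 - p10)


z = NormalDist()   # standard normal curve: mean 0, S.D. 1
k_normal = percentile_kurtosis(
    q1=z.inv_cdf(0.25), q3=z.inv_cdf(0.75),
    p10=z.inv_cdf(0.10), p90=z.inv_cdf(0.90),
)
print(round(k_normal, 3))
```

Since k is a ratio of two spreads, it does not change when the mean or the S.D. of the normal curve changes.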

SOME WORKED OUT EXAMPLES :

EX.[1]. Calculate the percentile coefficient of kurtosis from the following data. Also
interpret the result.

Daily wages (Rs.):   50-60   60-70   70-80   80-90   90-100   100-110   110-120
Frequency:           10      14      18      24      16       12        6

Solution: Calculation for k.

Wage (Rs.)    No. of persons (f)    c.f.
50-60         10                    10
60-70         14                    24
70-80         18                    42
80-90         24                    66
90-100        16                    82
100-110       12                    94
110-120       6                     100
              N = 100

For Q1: N/4 = 25, so Q1 lies in the class interval (c.i.) 70-80. So

$Q_1 = L + \frac{\frac{N}{4} - c.f.}{f} \times h$ = 70 + $\frac{25-24}{18}$ × 10 = 70.56

For Q3: 3N/4 = 75, so Q3 lies in the c.i. 90-100. So

$Q_3 = L + \frac{\frac{3N}{4} - c.f.}{f} \times h$ = 90 + $\frac{75-66}{16}$ × 10 = 95.63

For P10: 10N/100 = 10, so P10 lies in the class interval 50-60. So

$P_{10} = L + \frac{\frac{10N}{100} - c.f.}{f} \times h$ = 50 + $\frac{10-0}{10}$ × 10 = 60

For P90: 90N/100 = 90, so P90 lies in the class interval 100-110. So

$P_{90} = L + \frac{\frac{90N}{100} - c.f.}{f} \times h$ = 100 + $\frac{90-82}{12}$ × 10 = 106.67

Now, $k = \frac{1}{2} \times \frac{Q_3 - Q_1}{P_{90} - P_{10}} = \frac{1}{2} \times \frac{95.63 - 70.56}{106.67 - 60} = \frac{25.07}{2 \times 46.67}$ = 0.2686

Since k = 0.2686 is slightly greater than 0.263, the value of k for the normal curve, the distribution is
slightly platykurtic, although it is very nearly normal.

EX.[2]. Calculate the percentile coefficient of kurtosis from the following data of the
weekly increment of 100 fodder plants of the same species in a certain plantation
area. Also interpret the result.

Weekly increment (cm):   100-110   110-120   120-130   130-140   140-150   150-160   160-170
No. of plants f:         10        14        18        24        16        12        6

Solution: Calculation for k.

Weekly increment X    No. of fodder plants (f)    c.f.
100-110               10                          10
110-120               14                          24
120-130               18                          42
130-140               24                          66
140-150               16                          82
150-160               12                          94
160-170               6                           100
                      N = 100

For Q1: N/4 = 25, so Q1 lies in the class interval (c.i.) 120-130. So

$Q_1 = L + \frac{\frac{N}{4} - c.f.}{f} \times h$ = 120 + $\frac{25-24}{18}$ × 10 = 120.56

For Q3: 3N/4 = 75, so Q3 lies in the c.i. 140-150. So

$Q_3 = L + \frac{\frac{3N}{4} - c.f.}{f} \times h$ = 140 + $\frac{75-66}{16}$ × 10 = 145.63

For P10: 10N/100 = 10, so P10 lies in the class interval 100-110. So

$P_{10} = L + \frac{\frac{10N}{100} - c.f.}{f} \times h$ = 100 + $\frac{10-0}{10}$ × 10 = 110

For P90: 90N/100 = 90, so P90 lies in the class interval 150-160. So

$P_{90} = L + \frac{\frac{90N}{100} - c.f.}{f} \times h$ = 150 + $\frac{90-82}{12}$ × 10 = 156.67

Now, $k = \frac{1}{2} \times \frac{Q_3 - Q_1}{P_{90} - P_{10}} = \frac{1}{2} \times \frac{145.63 - 120.56}{156.67 - 110} = \frac{25.07}{2 \times 46.67}$ = 0.269

Since k = 0.269 is slightly greater than 0.263, the value of k for the normal curve, the distribution is
slightly platykurtic, although it is very nearly normal.

CORRELATION ANALYSIS.
If two quantities vary in such a way that movements in one are accompanied by
movements in the other, these quantities are correlated. Generally, correlation is
defined as the relationship or association between one dependent variable
and one or more independent variable(s).

The degree of relationship between (among) the variables under
consideration is measured through correlation analysis. The measure of
correlation is called the correlation coefficient, which summarizes the degree and
direction of movement in one figure.

Examples of correlation: the amount of rainfall and the volume of production of a
certain commodity; the age and blood pressure of people in a certain community; a
change in price accompanied by a change in the quantity demanded; and the amount
of production of wheat in relation to the amount of rainfall as well as the amount of
chemical fertilizer used.

Cause and effect relationship: In the study of correlation, the concept of the cause and
effect relationship is important. The variable which makes the other variable
change is called the cause variable and the resulting variable is known as the effect
variable. There may be more than one cause variable affecting a single effect variable.
Mathematically, the dependent variable (effect) Y and the independent variable (cause)
X can be related as

Y = f(X) ........ (i)

Y = f(Xi), i = 1, 2, 3, ..., n, or Y = f(X1, X2, X3, ..., Xn) ........ (ii)

The first case denotes simple correlation and the second case multiple
correlation.

TYPES OF CORRELATION :

a. positive and negative correlation (direct or indirect).

b. linear and non-linear correlation.

c. simple, partial and multiple correlation.

a. Positive and negative correlation: If both variables move in the same direction,
i.e. an increase (or decrease) in the value of one variable is accompanied by an increase (or
decrease) in the value of the other variable, the variables are said to be positively correlated. On
the other hand, if the variables move in opposite directions, the
correlation between them is said to be negative. For
example:

Income X (in Rs.):        200   300   400   500   600   700
Expenditure Y (in Rs.):   175   200   250   275   400   650

Price X (Rs.):     1500   2000   2500   2800   3000
Demand Y (No.):      75     50     40     35     28

The first table shows positive (direct) correlation and the second table shows
negative (indirect) correlation.

b. Linear and non-linear correlation: When a unit change in one variable results in a
constant change in the other variable over the entire range of values, the correlation is
said to be linear. In other words, the amount of change in one variable
bears a constant ratio to the amount of change in the other variable.
Otherwise, the correlation is said to be non-linear.
For example:

Fertilizer used X (kg):    25    30    35    40    45    50
Production Y (kg):        100   125   150   175   200   225

Age X (in years):          20    30    40    50    60    70
Blood pressure Y (mm):    115   125   130   132   140   135

The first table shows linear correlation (every 5 kg of additional fertilizer raises
production by 25 kg) and the second shows non-linear correlation.

c. Simple, partial and multiple correlation:

The correlation between two variables is known as simple correlation.
Partial and multiple correlation arise when more than two
variables are involved. In multiple correlation, the relationship among three or more
variables is studied simultaneously. In partial correlation, the
correlation between any two variables is studied while holding one or
more other variables constant. For example, (i) the yield of rice per acre studied against both
the amount of rainfall and the amount of fertilizer used is a case of multiple
correlation. (ii) Suppose we want to study the effect of quality of seeds, chemical
fertilizer used and soil fertility on the production of a certain crop.

The relationships between # production and soil fertility, # production and
quality of seed, # production and chemical fertilizer used are cases of simple
correlation.

If we study the relation of production with all three variables simultaneously,
it is a case of multiple correlation. If we study the correlation
between production and soil fertility (say), keeping quality of seed and chemical
fertilizer used constant, it is a case of partial correlation.

SCATTER DIAGRAM METHOD:

One method of showing the relation between two variables is the scatter
diagram. It is the simplest device for ascertaining whether two variables are related:
a dot chart is prepared, called a scatter diagram. To draw it, the variables x and y are
taken along the x- and y-axis respectively of the graph paper, and each corresponding pair
(x, y) is plotted as a dot on the graph.

Note that -1 ≤ r ≤ 1, where r represents the coefficient of correlation.

GRAPHIC METHOD:

The individual values of the two variables are plotted on graph paper, giving two curves,
one for X and one for Y. By examining the direction and closeness of the two curves so drawn,
we can see whether or not the variables are related. This method is usually applied
to time series data.

KARL PEARSON’S (or PEARSONEAN)COEFFICIENT OF CORRELATION:

The correlation coefficient between two variables X and Y, usually denoted by r,
rXY or r12, is a numerical measure of the relationship between them and is defined
as

$r = \frac{cov(X,Y)}{\sqrt{var(X)}\sqrt{var(Y)}} = \frac{cov(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2}\sqrt{\sum (Y-\bar{Y})^2}}$

where cov(X,Y) is the covariance
between the variables X and Y, and $\bar{X}$ and $\bar{Y}$ denote the arithmetic means of series
X and Y respectively. If $x = X - \bar{X}$ and $y = Y - \bar{Y}$, then the product moment formula is

$r = \frac{\sum xy}{N \sigma_x \sigma_y} = \frac{\sum xy}{\sqrt{\sum x^2}\sqrt{\sum y^2}}$

Other formulas:

# Direct method (actual data method):

$r_{12} = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{N \sum X^2 - (\sum X)^2}\sqrt{N \sum Y^2 - (\sum Y)^2}}$ …………..[1]

# Short cut method (assumed mean method):

$r_{12} = \frac{N \sum d_1 d_2 - (\sum d_1)(\sum d_2)}{\sqrt{N \sum d_1^2 - (\sum d_1)^2}\sqrt{N \sum d_2^2 - (\sum d_2)^2}}$ ……………[2]

where d1 = X - A, d2 = Y - B, and A, B are the assumed means of series X and Y
respectively.

# Step deviation method:

$r_{12} = \frac{N \sum d_1' d_2' - (\sum d_1')(\sum d_2')}{\sqrt{N \sum d_1'^2 - (\sum d_1')^2}\sqrt{N \sum d_2'^2 - (\sum d_2')^2}}$ …………….[3]

where $d_1' = \frac{X-A}{h_1}$, $d_2' = \frac{Y-B}{h_2}$, A and B are the assumed means of series X and Y
respectively, and h1 and h2 are the class sizes of series X and Y respectively.

Properties of the coefficient of correlation:

1. The coefficient of correlation lies between -1 and 1. Symbolically, -1 ≤ r ≤ 1 or
|r| ≤ 1.

2. The coefficient of correlation is independent of change of origin and scale of the
variables X and Y, i.e. $r_{xy} = r_{d_1 d_2}$ [independent of change of origin],

where d1 = X - A1, d2 = Y - A2 (A1 and A2 are the assumed means of series X and Y
respectively).

Also, $r_{12} = r_{d_1' d_2'}$ [independent of change of scale],

where $d_1' = \frac{X - A_1}{h_1}$, $d_2' = \frac{Y - A_2}{h_2}$

[A1 and A2 are the assumed means of series X and Y respectively; h1 and h2 are
common factors of series X and Y respectively.]

3. Being a relative measure, r has no unit.

4. The value of r is symmetrical with respect to series X and Y, i.e. r12 = r21.

5. The coefficient of correlation is the geometric mean of the two regression
coefficients, i.e. $r = \pm\sqrt{b_{xy} \cdot b_{yx}}$, where r takes the common sign of the two
regression coefficients.

6. Interpretation of the correlation coefficient (r):

Direction of correlation                    Degree of correlation
Positive            Negative
r = +1              r = -1                  Perfect
0.75 < r < 1        -1 < r < -0.75          Very high [significant]
0.50 < r < 0.75     -0.75 < r < -0.50       High
0.25 < r < 0.50     -0.50 < r < -0.25       Low
0 < r < 0.25        -0.25 < r < 0           Very low
r = 0               r = 0                   No correlation [absent]

Probable error:

The probable error [P.E.] of the correlation coefficient is used to
measure the reliability of the computed value of the correlation coefficient r.
It is defined as

$P.E. = 0.6745 \times \frac{1 - r^2}{\sqrt{N}} = 0.6745 \times S.E.$

where r = correlation coefficient, N = number of pairs of observations, and

$S.E. = \frac{1 - r^2}{\sqrt{N}}$ = standard error of the correlation coefficient.

Note: [i]. If |r| < P.E., the value of r is not significant.

[ii]. If P.E. ≤ |r| ≤ 6 (P.E.), nothing can be decided with certainty about r.

[iii]. If |r| > 6 (P.E.), the value of r is significant, i.e. the correlation is
significant.
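
The probable error rule can be sketched in a few lines of Python. This is only an illustration: the function names `probable_error` and `significance` are hypothetical, and the values r = 0.8, N = 25 are made up for the example, not taken from these notes.

```python
import math

def probable_error(r: float, n: int) -> float:
    """Probable error of r: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r * r) / math.sqrt(n)

def significance(r: float, n: int) -> str:
    """Apply the three-way rule on the size of r relative to P.E."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "significant"
    return "undecided"

# Hypothetical illustration: r = 0.8 computed from N = 25 pairs.
pe = probable_error(0.8, 25)
print(round(pe, 4), significance(0.8, 25))
```

Here P.E. = 0.6745 × 0.36 / 5, so 6 (P.E.) is well below 0.8 and the coefficient is reported as significant.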

Worked out examples:

Ex.[1]. The following data give the marks of 10 students in mathematics and
statistics:

Student               1    2    3    4    5    6    7    8    9    10
Marks in maths.       45   70   65   30   90   40   50   75   85   60
Marks in statistics   35   90   70   40   95   40   60   80   80   50

Find the correlation coefficient by

[a] direct method. [b] short cut method. [c] step deviation method.

Solution: [a]. Let X and Y denote the marks in mathematics and statistics
respectively.

Calculation of correlation coefficient by direct method.

X     Y     X^2     Y^2     XY
45    35    2025    1225    1575
70    90    4900    8100    6300
65    70    4225    4900    4550
30    40     900    1600    1200
90    95    8100    9025    8550
40    40    1600    1600    1600
50    60    2500    3600    3000
75    80    5625    6400    6000
85    80    7225    6400    6800
60    50    3600    2500    3000
ΣX = 610   ΣY = 640   ΣX^2 = 40700   ΣY^2 = 45350   ΣXY = 42575

Now the simple correlation coefficient between variables X and Y is

$r_{12} = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{N \sum X^2 - (\sum X)^2}\sqrt{N \sum Y^2 - (\sum Y)^2}} = \frac{10 \times 42575 - 610 \times 640}{\sqrt{10 \times 40700 - 610^2}\sqrt{10 \times 45350 - 640^2}}$

$= \frac{425750 - 390400}{\sqrt{407000 - 372100}\sqrt{453500 - 409600}} = \frac{35350}{186.81 \times 209.52} = 0.9031$
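
Parts [b] and [c] use the same expression applied to deviations instead of raw values. A minimal Python sketch can check that all three methods agree for this data; the assumed means A = 60, B = 60 and the common factor h = 5 are illustrative choices made here, and `pearson_from_sums` is a hypothetical helper name.

```python
import math

X = [45, 70, 65, 30, 90, 40, 50, 75, 85, 60]   # marks in mathematics
Y = [35, 90, 70, 40, 95, 40, 60, 80, 80, 50]   # marks in statistics

def pearson_from_sums(u, v):
    """Formulas [1], [2], [3]: one expression, fed raw or deviated values."""
    n = len(u)
    su, sv = sum(u), sum(v)
    suv = sum(a * b for a, b in zip(u, v))
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    return (n * suv - su * sv) / (
        math.sqrt(n * suu - su ** 2) * math.sqrt(n * svv - sv ** 2)
    )

A, B, h = 60, 60, 5   # assumed means and common factor (illustrative choices)

r_direct = pearson_from_sums(X, Y)
r_short = pearson_from_sums([x - A for x in X], [y - B for y in Y])
r_step = pearson_from_sums([(x - A) / h for x in X], [(y - B) / h for y in Y])

print(round(r_direct, 4), round(r_short, 4), round(r_step, 4))
```

All three values coincide, which is property 2 (independence of change of origin and scale) in numerical form.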

Ex.[2]. A student, while calculating the correlation coefficient between two
variables X and Y, obtained the following results:

N = 25, ΣX = 125, ΣY = 100, ΣX^2 = 650, ΣY^2 = 460, ΣXY = 508

It was later found, at the time of checking, that he had copied down two pairs of
observations as (6,14) and (8,6) instead of the correct values (8,12) and (6,8). Obtain the
correct value of the correlation coefficient between X and Y.

Solution: The corrected values are

ΣX = 125 - 6 - 8 + 8 + 6 = 125, ΣY = 100 - 14 - 6 + 12 + 8 = 100

ΣX^2 = 650 - 6^2 - 8^2 + 8^2 + 6^2 = 650

ΣY^2 = 460 - 14^2 - 6^2 + 12^2 + 8^2 = 436

ΣXY = 508 - 6×14 - 8×6 + 8×12 + 6×8 = 520

Using these corrected values in the formula,

$r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{N \sum X^2 - (\sum X)^2}\sqrt{N \sum Y^2 - (\sum Y)^2}} = \frac{25 \times 520 - 125 \times 100}{\sqrt{25 \times 650 - 125^2}\sqrt{25 \times 436 - 100^2}} = \frac{500}{25 \times 30} = 0.67$
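
The correction procedure (subtract the contribution of each wrong pair, add that of each correct pair) can be sketched as below. The variable names are illustrative only.

```python
import math

# Sums as first (wrongly) computed.
n, sx, sy, sxx, syy, sxy = 25, 125, 100, 650, 460, 508

wrong = [(6, 14), (8, 6)]    # pairs as copied down
right = [(8, 12), (6, 8)]    # pairs as they should have been

for x, y in wrong:           # remove the contribution of each wrong pair
    sx -= x
    sy -= y
    sxx -= x * x
    syy -= y * y
    sxy -= x * y

for x, y in right:           # add the contribution of each correct pair
    sx += x
    sy += y
    sxx += x * x
    syy += y * y
    sxy += x * y

r = (n * sxy - sx * sy) / (
    math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
)
print(sx, sy, sxx, syy, sxy, round(r, 2))
```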

Ex.[3]. Find the total number of pairs of observations from the following
information:

Σxy = 60, r = 0.8, standard deviation of Y = 2.5 and Σx^2 = 90, where x and y are the
deviations taken from their respective arithmetic means.

Solution:

Here, Σxy = 60, Σx^2 = 90, σy = 2.5, r = 0.8, n = ?

where $x = X - \bar{X}$, $y = Y - \bar{Y}$. Now, the correlation coefficient is

$r = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2}\sqrt{\sum (Y-\bar{Y})^2}} = \frac{\sum xy}{\sqrt{\sum x^2}\sqrt{\sum y^2}}$

or, $0.8 = \frac{60}{\sqrt{90}\sqrt{\sum y^2}}$

or, $0.64 \times 90 \times \sum y^2 = 3600$ [squaring both sides]

so $\sum y^2 = \frac{3600}{0.64 \times 90} = 62.5$. Since $\sigma_y^2 = \frac{\sum y^2}{n}$,

$n = \frac{\sum y^2}{\sigma_y^2} = \frac{62.5}{(2.5)^2} = 10$

Hence, the total number of pairs of observations is n = 10.

Ex.[4]. Calculate Karl Pearson's coefficient of correlation and interpret the
result for studying hours per day and marks obtained (out of full marks 20) of the
following 5 students:

Student                     A    B    C    D    E
Studying hours per day (X)  2    3    4    5    6
Marks obtained (Y)          7    9    10   14   15

Solution:

Table for calculation of correlation coefficient

X     Y     x = X - 4    x^2    y = Y - 11    y^2    xy
2     7        -2         4        -4         16     8
3     9        -1         1        -2          4     2
4     10        0         0        -1          1     0
5     14        1         1         3          9     3
6     15        2         4         4         16     8
ΣX = 20   ΣY = 55   Σx = 0   Σx^2 = 10   Σy = 0   Σy^2 = 46   Σxy = 21

We have $\bar{X} = \frac{\sum X}{n} = \frac{20}{5} = 4$ and $\bar{Y} = \frac{\sum Y}{n} = \frac{55}{5} = 11$

Now, the correlation coefficient is $r = \frac{\sum xy}{\sqrt{\sum x^2}\sqrt{\sum y^2}} = \frac{21}{\sqrt{10}\sqrt{46}} = 0.98$

∴ r = 0.98. This shows that there is an almost perfect positive correlation between the two
series.

Ex.[5]. A company formed by a community forestry user group manufactures
different types of forest products. It has been using radio for advertising its
products. The following table shows the amount of radio time (X, in minutes) and the
number of forest products sold (Y) over the last 6 days:

X:   25   18   32   21   35   29
Y:   16   11   20   15   26   28

Determine Karl Pearson's coefficient of correlation and interpret the result.

Solution: Using the short cut method, take A = 25 (assumed mean of series X) and
B = 20 (assumed mean of series Y). Put U = X - 25 and V = Y - 20.

Table for calculation of r

X     Y     U = X - 25    V = Y - 20    U^2    V^2    UV
25    16        0            -4           0     16      0
18    11       -7            -9          49     81     63
32    20        7             0          49      0      0
21    15       -4            -5          16     25     20
35    26       10             6         100     36     60
29    28        4             8          16     64     32
            ΣU = 10       ΣV = -4    ΣU^2 = 230   ΣV^2 = 222   ΣUV = 175

Now, the correlation coefficient is

$r = \frac{n \sum UV - (\sum U)(\sum V)}{\sqrt{n \sum U^2 - (\sum U)^2}\sqrt{n \sum V^2 - (\sum V)^2}} = \frac{6 \times 175 - 10 \times (-4)}{\sqrt{6 \times 230 - 10^2}\sqrt{6 \times 222 - (-4)^2}} = 0.84$

Hence, there is a very high degree of positive correlation between radio time (X)
and the forest products sold (Y).

RANK CORRELATION COEFFICIENT:


Karl Pearson's correlation coefficient is based on the assumption that the
variables being studied are measured quantitatively and that the population is normally
distributed. There are some statistical
series in which the variables under consideration are not capable of quantitative
measurement but can be arranged in rank or serial order. This happens
when the characteristics are qualitative (attributes), such as beauty, honesty, talent,
smartness, fitness, character etc.

The rank correlation coefficient measures the relationship between
two variables with respect to their ranks. It is also called Spearman's
rank correlation coefficient and is denoted by rs.

Formula: $r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \sum D^2}{N^3 - N}$

where rs = Spearman's rank correlation coefficient between variables X and Y,

D = R1 - R2, where R1 and R2 are the ranks of series X and Y respectively,

N = number of pairs of observations.

When ranks are repeated:

$r_s = 1 - \frac{6\left[\sum D^2 + \frac{m_1^3 - m_1}{12} + \frac{m_2^3 - m_2}{12} + \cdots\right]}{N(N^2 - 1)}$

where D = R1 - R2,

N = number of pairs of observations,

m = number of times an item is repeated: m1 is the number of times the first repeated
item occurs, m2 the number of times the second repeated item occurs, and so on.
Repeated items share the average of the ranks they would otherwise occupy.

Note: If ranks are not given, assign ranks starting from 1 in ascending
or descending order. The same direction of ranking must be used in
both series.

Worked out examples:

Ex.[1]. The rankings of 10 students in two subjects, statistics and wild life
management, are as follows. Calculate Spearman's rank correlation coefficient.

Statistics:        3   5   8   4   7   10   2   1    6   9
Wild life mgnt.:   6   4   9   8   1   2    3   10   5   7

Solution: Let R1 be the rank in statistics and R2 the rank in wild life
management.

Calculation of Spearman's rank correlation.

R1     R2     D^2 = (R1 - R2)^2
3      6       9
5      4       1
8      9       1
4      8      16
7      1      36
10     2      64
2      3       1
1      10     81
6      5       1
9      7       4
              ΣD^2 = 214

Here, $r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 214}{1000 - 10} = 1 - 1.297 = -0.297$
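
As a cross-check, the untied-rank formula is easy to sketch in Python. The function name `spearman_from_ranks` is a hypothetical helper, not something defined in these notes.

```python
def spearman_from_ranks(r1, r2):
    """Spearman's rs for untied ranks: 1 - 6*sum(D^2) / (N(N^2 - 1))."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

stats_rank = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]      # ranks in statistics
wildlife_rank = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]   # ranks in wild life management

rs = spearman_from_ranks(stats_rank, wildlife_rank)
print(round(rs, 3))
```

The result agrees with the hand calculation above up to rounding.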

Ex.[2]. Two forest officers were asked to give priority rankings of 7 different species of
medicinal plant for plantation in a certain community based forest. The priority
rankings are as follows:

Species:       A   B   C   D   E   F   G
Forester M:    2   1   4   3   5   7   6
Forester N:    1   3   2   4   5   6   7

Calculate Spearman's rank correlation coefficient. Also discuss whether the directions of
ranking are similar or opposite.

Solution: Let R1 and R2 denote the ranks given by the two foresters M and N
respectively.

Computation of Spearman's rank coefficient.

Species    R1    R2    D = R1 - R2    D^2
A          2     1         1           1
B          1     3        -2           4
C          4     2         2           4
D          3     4        -1           1
E          5     5         0           0
F          7     6         1           1
G          6     7        -1           1
                                   ΣD^2 = 12

Here, Spearman's rank coefficient is $r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 12}{343 - 7} = 0.786$

Hence the rankings of the two foresters run in the same direction
(rs being positive). Furthermore, the degree of positive correlation is very high (since 0.75 < rs < 1).

Ex.[3]. Calculate the coefficient of rank correlation between the marks assigned to 10 students by
judges X and Y in a dance competition as shown below. (Also interpret the result
based on the direction of the ranking pattern of the two judges.)

Student I.D. No.:    1    2    3    4    5    6    7    8    9    10
Marks by judge X:    52   53   42   60   45   41   37   38   25   27
Marks by judge Y:    65   68   43   38   77   48   35   30   25   50

Solution: Let R1 and R2 represent the ranks prepared on the basis of the marks assigned
to the students, in ascending order. The ranks are not repeated in this case.

Computation of Spearman's rank coefficient

Marks by judge X    R1    Marks by judge Y    R2    D = R1 - R2    D^2
52                  8     65                  8          0          0
53                  9     68                  9          0          0
42                  6     43                  5          1          1
60                  10    38                  4          6         36
45                  7     77                  10        -3          9
41                  5     48                  6         -1          1
37                  3     35                  3          0          0
38                  4     30                  2          2          4
25                  1     25                  1          0          0
27                  2     50                  7         -5         25
                                                               ΣD^2 = 76

Here, Spearman's rank coefficient is $r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 76}{1000 - 10} = 1 - 0.461 = 0.539$

Hence, the rankings of the two judges run in the same direction, with a high degree
of positive correlation (rs being positive and 0.50 < rs < 0.75).

Ex.[4]. Calculate Spearman's rank coefficient of correlation from the following
data (also interpret the result):

X:   80   78   75   75   68   67   60   59
Y:   12   13   14   14   14   16   15   17

Solution: Let R1 and R2 denote the ranks of the two series respectively.
This is a case of repeated ranks.

Calculation of Spearman's rank coefficient

X     Rank R1    Y     Rank R2    D = R1 - R2    D^2
80    8          12    1            7            49
78    7          13    2            5            25
75    5.5        14    4            1.5           2.25
75    5.5        14    4            1.5           2.25
68    4          14    4            0             0
67    3          16    7           -4            16
60    2          15    6           -4            16
59    1          17    8           -7            49
                                             ΣD^2 = 159.5

Now, $r_s = 1 - \frac{6\left[\sum D^2 + \frac{m_1^3 - m_1}{12} + \frac{m_2^3 - m_2}{12}\right]}{N(N^2 - 1)} = 1 - \frac{6\left[159.5 + \frac{2^3 - 2}{12} + \frac{3^3 - 3}{12}\right]}{8^3 - 8}$

$= 1 - \frac{6(159.5 + 0.5 + 2)}{504} = 1 - 1.929 = -0.929$

Hence, there is a very high degree of negative correlation
between the two series.
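
The repeated-rank procedure (average ranks for tied values, then a correction term of (m^3 - m)/12 per tie group) can be sketched as follows. The helper names `average_ranks`, `tie_correction` and `spearman_tied` are hypothetical, chosen only for this illustration.

```python
from collections import Counter

def average_ranks(values):
    """Ascending ranks starting at 1; tied values share the mean of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of (m^3 - m)/12 over every value repeated m > 1 times."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_tied(x, y):
    """Spearman's rs with the correction for repeated ranks."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = sum_d2 + tie_correction(x) + tie_correction(y)
    return 1 - 6 * adjusted / (n * (n * n - 1))

X = [80, 78, 75, 75, 68, 67, 60, 59]
Y = [12, 13, 14, 14, 14, 16, 15, 17]
rs = spearman_tied(X, Y)
print(round(rs, 3))
```

The ranks produced by `average_ranks` match the R1 and R2 columns of the table, and the result matches the hand calculation up to rounding.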

Ex.[5]. Two teaching methods A and B were applied to 11 pairs of students, such that
the students in each pair had approximately equal scores on an intelligence test. In each
pair, one student was taught by method A and the other by method B. The marks are
as follows. Calculate Spearman's rank correlation coefficient.

Marks obtained by method A:   24   29   19   14   30   19   27   30   20   28   11
Marks obtained by method B:   37   35   16   26   23   27   19   20   16   11   21

Hint: Prepare the ranks R1 and R2 for the two series (this is a case of repeated ranks). Use the
formula for Spearman's rank correlation coefficient rs for repeated ranks.

REGRESSION ANALYSIS

In a functional relation between two variables, if the value of one variable is
known, the value of the other variable can be determined exactly. But in a statistical
relation between two variables, if one variable is given, we cannot exactly
determine the corresponding value of the other variable; we can only
predict it.

"Regression analysis is a statistical tool (or device) with the help of which we can
estimate or predict the unknown value of one variable corresponding to the known
value of another variable."

The variable whose values are known is called the independent variable and the
variable whose value is estimated or predicted is called the dependent variable.

LINE OF REGRESSION:

Whenever there is a relationship between two variables, the dots of the
scatter diagram concentrate around a certain curve. If that curve is a straight line,
it is called the line of regression. If the dots do not lie exactly on a
straight line, the line of best fit is taken as the line of regression.

The line of best fit is obtained by the method of least squares. It is the line for
which the sum of the deviations of the points on either side is equal to zero
and the sum of the squares of these deviations is minimum.

There are two lines of regression. The first is the line of regression of y on x and the
other is the line of regression of x on y. The line of regression of y on x gives
the most probable value of y corresponding to a given value of x. Similarly, the
line of regression of x on y gives the most probable value of x for a given value of
y. The two lines of regression intersect at the point ($\bar{x}$, $\bar{y}$), where $\bar{x}$ and $\bar{y}$ are
the arithmetic means of the two series x and y respectively.
If there is a wide gap between the two lines of regression, the correlation between the two
variables is low. If the two regression lines are close to each other, the correlation
between the two variables is high. When the two lines of regression coincide, there is
perfect correlation between the two variables. If the two lines of regression intersect at
right angles, there is no correlation between the two variables.

REGRESSION EQUATION & REGRESSION COEFFICIENT:

Regression lines expressed in terms of algebraic relations are known as
regression equations. Since there are two regression lines, there are two
regression equations: [i] the regression equation of y on x expresses the variation of y
for a change in the value of x; [ii] the regression equation of x on y expresses the
variation of x for a change in the value of y.

REGRESSION EQUATION OF Y ON X:

Let the regression equation of y on x be

y = a + bx ……………………………………..(i)

Summing over the n pairs of observations,

Σy = na + bΣx

or, $\frac{\sum y}{n} = a + b\frac{\sum x}{n}$ [dividing both sides by n]

or, $\bar{y} = a + b\bar{x}$ ……………………………….[ii]

or, $y - \bar{y} = b(x - \bar{x})$ [subtracting ii from i] ……….[iii]

Equation [iii] is the equation of the line of regression of y on x. It shows
that the line of regression of y on x passes through ($\bar{x}$, $\bar{y}$), where $\bar{x}$ and $\bar{y}$ denote the
arithmetic means of series x and y respectively. The constant b is known as the
regression coefficient of y on x and is also denoted by byx.

To find the value of b or byx:

The regression equation of y on x is y = a + bx. Summing over the n observations,

Σy = na + bΣx ……………………………………..[iv]

Multiplying [i] by x and summing over the n observations, we get

Σxy = aΣx + bΣx^2 ……………………………………[v]

Multiplying [v] by n and [iv] by Σx, then subtracting, we get

nΣxy - Σx·Σy = nbΣx^2 - b(Σx)^2

or, nΣxy - Σx·Σy = b[nΣx^2 - (Σx)^2]

$\therefore b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$ …………………………………………[vi]

$= \frac{\frac{\sum xy}{n} - \frac{\sum x}{n} \cdot \frac{\sum y}{n}}{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2} = \frac{cov(x,y)}{\sigma_x^2}$

$= \frac{r\,\sigma_x \sigma_y}{\sigma_x^2}$ [since $r = \frac{cov(x,y)}{\sigma_x \sigma_y}$]

Hence, $b_{yx} = r\frac{\sigma_y}{\sigma_x}$ ……………………………………….[vii]
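
The derivation can be checked numerically. The sketch below reuses the study-hours data of correlation Ex.[4] (an illustrative choice), computes b from [vi] and a from [ii], and confirms that b equals r·σy/σx as in [vii].

```python
import math

X = [2, 3, 4, 5, 6]        # studying hours (data of correlation Ex.[4])
Y = [7, 9, 10, 14, 15]     # marks obtained

n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)
syy = sum(y * y for y in Y)

# Formula [vi] for the slope, then [ii] for the intercept.
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * sx / n

# Formula [vii]: b should equal r * sigma_y / sigma_x.
sigma_x = math.sqrt(sxx / n - (sx / n) ** 2)
sigma_y = math.sqrt(syy / n - (sy / n) ** 2)
r = (sxy / n - (sx / n) * (sy / n)) / (sigma_x * sigma_y)

print(round(a, 2), round(b, 2), round(r * sigma_y / sigma_x, 2))
```

For this data the fitted line is y = 2.6 + 2.1x, and both normal equations [iv] and [v] are satisfied by these values of a and b.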

Note: 1# Similarly, the regression equation of x on y is

$x - \bar{x} = b_{xy}(y - \bar{y})$ …………………………….[viii]

where bxy is the regression coefficient of x on y and is given by

$b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = r\frac{\sigma_x}{\sigma_y}$ …………………………[ix]

2# The regression equations can easily be obtained when the deviations of the items
of the X and Y series are taken from assumed means A and B respectively. Then,

$\bar{X} = A + \frac{\sum u}{n}$, $\bar{Y} = B + \frac{\sum v}{n}$, $b_{yx} = \frac{n\sum uv - \sum u \sum v}{n\sum u^2 - (\sum u)^2} = r\frac{\sigma_y}{\sigma_x}$

and $b_{xy} = \frac{n\sum uv - \sum u \sum v}{n\sum v^2 - (\sum v)^2} = r\frac{\sigma_x}{\sigma_y}$

where u = X - A, v = Y - B.

3# Since $\sigma_x$ and $\sigma_y$ are not independent of change of scale, if we take the
step deviations $u' = \frac{X - A}{h}$, $v' = \frac{Y - B}{k}$, then

$b_{yx} = \frac{k}{h}\left[\frac{N\sum u'v' - \sum u' \sum v'}{N\sum u'^2 - (\sum u')^2}\right]$ and $b_{xy} = \frac{h}{k}\left[\frac{N\sum u'v' - \sum u' \sum v'}{N\sum v'^2 - (\sum v')^2}\right]$

where h and k are the class sizes of series X and Y respectively.

Properties of regression coefficients:

[1]. The regression coefficients are independent of change of origin but not of
scale.

[2]. Since r^2 = bxy·byx, $r = \pm\sqrt{b_{xy} \cdot b_{yx}}$

This means the correlation coefficient between two variables is the geometric
mean of the two regression coefficients.

[3]. Since -1 ≤ r ≤ 1, property 2 gives bxy·byx = r^2 ≤ 1.

Hence, the product of the two regression coefficients is less than or equal to 1.
This implies that if one of the regression coefficients is greater than 1 in magnitude, the other
must be less than 1 in magnitude. In other words, both regression coefficients cannot be greater
than 1 together.

[4]. Since r^2 = bxy·byx, either both coefficients are positive
or both are negative; otherwise r^2 would be negative, which is impossible. Furthermore,
r, bxy and byx all have the same sign.

[5]. The arithmetic mean of the two regression coefficients is greater than or equal to
r (provided r > 0),

i.e. $\frac{1}{2}[b_{xy} + b_{yx}] \ge r$.

[6]. The two regression lines intersect at ($\bar{X}$, $\bar{Y}$), where $\bar{X}$ and $\bar{Y}$ denote the arithmetic
means of the two series respectively.

Examples worked out:

Ex.[1]. From the following data on the ages of husbands and wives, calculate
the regression equations and find the approximate age of the husband when the wife's age
is 20. Also predict the age of the wife when the age of the husband is 30.

Wife's age (X):      18   20   22   23   27   28   30
Husband's age (Y):   23   25   27   30   32   31   35

Solution: We have to find both lines of regression.

Wife's age (X)    Husband's age (Y)    X^2     Y^2     XY
18                23                   324     529     414
20                25                   400     625     500
22                27                   484     729     594
23                30                   529     900     690
27                32                   729     1024    864
28                31                   784     961     868
30                35                   900     1225    1050
ΣX = 168          ΣY = 203             ΣX^2 = 4150   ΣY^2 = 5993   ΣXY = 4980

Here, n = 7, $\bar{X} = \frac{\sum X}{n} = \frac{168}{7} = 24$ and $\bar{Y} = \frac{\sum Y}{n} = \frac{203}{7} = 29$

(i). Regression coefficient of Y (husband's age) on X (wife's age):

$b_{yx} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{7 \times 4980 - 168 \times 203}{7 \times 4150 - 168^2} = 0.915$

The equation of the regression line of Y (husband's age) on X (wife's age) is

$Y - \bar{Y} = b_{yx}(X - \bar{X})$

or, Y - 29 = 0.915(X - 24)

or, Y = 0.915X + 7.04

Putting X = 20, the approximate husband's age is

Y' = 0.915 × 20 + 7.04 = 25.34 ≈ 25

(ii). Regression coefficient of wife's age (X) on husband's age (Y):

$b_{xy} = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - (\sum Y)^2} = \frac{7 \times 4980 - 168 \times 203}{7 \times 5993 - 203^2} = 1.02$

The regression equation of wife's age (X) on husband's age (Y) is

$X - \bar{X} = b_{xy}(Y - \bar{Y})$

or, X - 24 = 1.02(Y - 29)

So, X = 1.02Y - 5.58 is the required equation. Putting Y = 30, the projected age of the wife
is

X' = 1.02 × 30 - 5.58 = 25.02 ≈ 25.
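
Both regression lines can be reproduced with a short Python sketch. The helper `regression_line` is a hypothetical name; it applies the direct formula for the slope and then forces the line through the two means.

```python
wife = [18, 20, 22, 23, 27, 28, 30]      # X
husband = [23, 25, 27, 30, 32, 31, 35]   # Y

def regression_line(x, y):
    """Least-squares line of y on x; returns (intercept, slope)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = sy / n - slope * sx / n
    return intercept, slope

a_yx, b_yx = regression_line(wife, husband)   # Y on X
a_xy, b_xy = regression_line(husband, wife)   # X on Y

husband_at_20 = a_yx + b_yx * 20   # husband's age when the wife is 20
wife_at_30 = a_xy + b_xy * 30      # wife's age when the husband is 30

print(round(b_yx, 3), round(b_xy, 3),
      round(husband_at_20, 2), round(wife_at_30, 2))
```

The values agree with the hand calculation up to rounding. The product of the two slopes is below 1, as property [3] requires, even though bxy alone is slightly above 1.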

Ex.[2]. Find the two regression coefficients and regression equations from the
information given below:

Mean of series X = 65, mean of series Y = 67

Standard deviation of X = 2.5, standard deviation of Y = 3.5

Coefficient of correlation (r) = 0.8

Solution: Given that $\bar{X}$ = 65, $\bar{Y}$ = 67, r = 0.8, σx = 2.5, σy = 3.5.

Regression coefficient of Y on X:

$b_{yx} = r\frac{\sigma_y}{\sigma_x} = 0.8 \times \frac{3.5}{2.5} = 1.12$

The regression equation of Y on X is

$Y - \bar{Y} = b_{yx}(X - \bar{X})$

or, Y - 67 = 1.12(X - 65)

So, Y = 1.12X - 5.8 is the required equation.

Also, the regression coefficient of X on Y is $b_{xy} = r\frac{\sigma_x}{\sigma_y} = 0.8 \times \frac{2.5}{3.5} = 0.57$

Hence, the regression equation of X on Y is

$X - \bar{X} = b_{xy}(Y - \bar{Y})$

or, X - 65 = 0.57(Y - 67)

or, X = 0.57Y + 26.81 is the required equation.

Ex.[3]. The equations of the regression lines obtained in a correlation analysis are
3x + 12y = 19 and 9x + 3y = 46. Obtain (i) the mean of series X, (ii) the mean of series Y,
(iii) the regression coefficients bxy and byx, (iv) the coefficient of correlation
between X and Y.

Solution: The equations of the regression lines are

3x + 12y = 19 …..[i] and 9x + 3y = 46 ……………………….[ii]

Solving [i] and [ii], we get x = 5, y = 1/3. Since the regression lines intersect at ($\bar{X}$, $\bar{Y}$),
($\bar{X}$, $\bar{Y}$) = (5, 1/3).

If possible, let 3x + 12y = 19 be the line of regression of x on y and 9x + 3y = 46 be that
of y on x. Then [i] and [ii] give

x = -4y + 19/3 ∴ bxy = -4

and y = -3x + 46/3 ∴ byx = -3

Testing whether bxy·byx ≤ 1: here bxy·byx = (-4)(-3) = 12 > 1,
which is impossible. So our supposition is incorrect. Hence line [i] is the
regression of y on x and line [ii] is the regression of x on y. Then [i] gives
y = -(1/4)x + 19/12, ∴ byx = -1/4,

and [ii] gives x = -(1/3)y + 46/9, ∴ bxy = -1/3.

Finally, since both regression coefficients are negative, r is negative:

$r = -\sqrt{b_{xy} \cdot b_{yx}} = -\sqrt{\frac{1}{12}} = -\frac{1}{2\sqrt{3}} = -0.289$ [acceptable, since |r| < 1].
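
The trial-and-check reasoning used here can be sketched as a small routine. This is an illustrative helper, not part of the notes: `analyse_regression_pair` is a hypothetical name, each line is passed as coefficients (p, q, c) of p·x + q·y = c, and the input is assumed to be a valid pair of regression lines.

```python
import math

def analyse_regression_pair(line1, line2):
    """Each line is (p, q, c) for p*x + q*y = c.
    Returns (x_mean, y_mean, b_yx, b_xy, r)."""
    p1, q1, c1 = line1
    p2, q2, c2 = line2

    # The two lines meet at the means (Cramer's rule).
    det = p1 * q2 - p2 * q1
    x_mean = (c1 * q2 - c2 * q1) / det
    y_mean = (p1 * c2 - p2 * c1) / det

    # Try line1 as "y on x" (slope -p/q) and line2 as "x on y" (slope -q/p).
    b_yx, b_xy = -p1 / q1, -q2 / p2
    if b_yx * b_xy > 1:
        # Impossible, so the roles must be the other way round.
        b_yx, b_xy = -p2 / q2, -q1 / p1

    # r carries the common sign of the two regression coefficients.
    r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
    return x_mean, y_mean, b_yx, b_xy, r

result = analyse_regression_pair((3, 12, 19), (9, 3, 46))
print([round(v, 3) for v in result])
```

The same routine applies to any example in which only the two regression equations are given, since exactly one assignment of roles keeps the product of the coefficients at or below 1.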

Ex.[4]. Obtain the two regression equations from the following data:

X:   6   2    10   4   8
Y:   9   11   5    8   7

Solution: Calculation of regression coefficients

X     Y     x = X - 6    y = Y - 8    x^2    y^2    xy
6     9         0            1          0      1      0
2     11       -4            3         16      9    -12
10    5         4           -3         16      9    -12
4     8        -2            0          4      0      0
8     7         2           -1          4      1     -2
ΣX = 30   ΣY = 40   Σx = 0   Σy = 0   Σx^2 = 40   Σy^2 = 20   Σxy = -26

Here, $\bar{X} = \frac{\sum X}{n} = \frac{30}{5} = 6$, $\bar{Y} = \frac{\sum Y}{n} = \frac{40}{5} = 8$

Regression coefficient, $b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{5 \times (-26) - 0}{5 \times 20 - 0} = -1.3$

and $b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{5 \times (-26) - 0}{5 \times 40 - 0} = -0.65$

Hence, the regression equation of Y on X is

$Y - \bar{Y} = b_{yx}(X - \bar{X})$

or, Y - 8 = -0.65(X - 6) ∴ Y = -0.65X + 11.9

Also, the regression equation of X on Y is

$X - \bar{X} = b_{xy}(Y - \bar{Y})$

or, X - 6 = -1.3(Y - 8) ∴ X = -1.3Y + 16.4

Ex.[5]. The following table shows the age X and blood pressure Y of 8 persons.

X:   52   63   45   36   72   65   47   25
Y:   62   53   51   25   79   43   60   33

Obtain the regression equation of Y on X. Also find the expected blood pressure of
a person who is 49 years old.

Solution: Take the assumed means of series X and Y as A = 50 and B = 50.

Calculation of regression equation of Y on X

X     Y     U = X - 50    V = Y - 50    U^2    V^2    UV
52    62        2            12           4    144     24
63    53       13             3         169      9     39
45    51       -5             1          25      1     -5
36    25      -14           -25         196    625    350
72    79       22            29         484    841    638
65    43       15            -7         225     49   -105
47    60       -3            10           9    100    -30
25    33      -25           -17         625    289    425
ΣX = 405   ΣY = 406   ΣU = 5   ΣV = 6   ΣU^2 = 1737   ΣV^2 = 2058   ΣUV = 1336

Here, $\bar{X} = \frac{\sum X}{n} = \frac{405}{8} = 50.625$ and $\bar{Y} = \frac{\sum Y}{n} = \frac{406}{8} = 50.75$

$b_{yx} = \frac{n\sum UV - \sum U \sum V}{n\sum U^2 - (\sum U)^2} = \frac{8 \times 1336 - 5 \times 6}{8 \times 1737 - 25} = 0.768$

Finally, the regression equation of Y on X is

$Y - \bar{Y} = b_{yx}(X - \bar{X})$

or, Y - 50.75 = 0.768(X - 50.625) ∴ Y = 11.87 + 0.768X is the required regression
equation.

When X = 49 years, the expected blood pressure Ŷ (or Ye) is

Ŷ = 11.87 + 0.768 × 49 = 49.50

Ex.[6]. In a partially destroyed laboratory record of an analysis of correlation data,
the following results are legible:

Variance of X = 9; regression equations: 8x - 10y + 66 = 0 and 40x - 18y = 214.

Find on the basis of the above information:

(i) the mean values of X and Y. (ii) the coefficient of correlation between X and Y.
(iii) the standard deviation of Y.

Solution: (i). The regression equations are 8x - 10y = -66 ……………A

and 40x - 18y = 214 ……………B

Since the two regression lines intersect at ($\bar{X}$, $\bar{Y}$), solving A and B gives the means
of the two series: the point of intersection is (13, 17). Hence, the mean value of series
X is $\bar{X}$ = 13 and the mean value of series Y is $\bar{Y}$ = 17.

(ii). To find the correlation coefficient, we have to obtain bxy and byx. Suppose first
that equation A is the regression of X on Y and B is the regression of Y on X. Then,
by A, 8x = 10y - 66

or, x = (5/4)y - 66/8 ∴ bxy = 5/4 = 1.25

Also by B, 40x - 18y = 214

or, y = (40/18)x - 214/18 ∴ byx = 40/18 = 20/9

Here, bxy·byx = (5/4)(20/9) = 25/9 > 1, which is impossible. So we conclude
that equation A, i.e. 8x - 10y + 66 = 0, is the regression equation of Y on X,
giving y = (8/10)x + 6.6 ∴ byx = 8/10 = 4/5

Also, equation B, i.e. 40x - 18y = 214, is the regression equation of X on Y,

giving x = (18/40)y + 214/40 ∴ bxy = 18/40 = 9/20

Hence, the correlation coefficient is $r = \sqrt{b_{xy} \cdot b_{yx}} = \sqrt{(9/20)(4/5)} = 0.6$
(positive, since both regression coefficients are positive).

(iii). Since $b_{xy} = r\frac{\sigma_x}{\sigma_y}$ and $\sigma_x = \sqrt{9} = 3$, we have $\sigma_y = \frac{r\,\sigma_x}{b_{xy}}$.

Hence, the standard deviation of Y is $\sigma_y$ = 0.6 × 3 × 20/9 = 4.

Ex.[7]. For 50 students of a class, the regression equation of the marks in statistics
(X) on the marks in accountancy (Y) is 3y - 5x + 180 = 0. The mean marks in
accountancy is 44 and the variance of marks in statistics is 9/16 of the variance of
marks in accountancy. Find the mean marks in statistics and the coefficient of
correlation between the marks in the two subjects.

Solution: We are given 3y - 5x + 180 = 0, which is the regression equation of X on Y.

So, x = (3/5)y + 36 …………………….(i)

Then the regression coefficient of x on y is bxy = 3/5 = 0.6.

Since the regression line passes through ($\bar{X}$, $\bar{Y}$), putting $\bar{Y}$ = 44 in (i) gives
$\bar{X}$ = (3/5) × 44 + 36 = 62.4

Hence, the mean marks in statistics is $\bar{X}$ = 62.4

Also, $\sigma_x^2 = \frac{9}{16}\sigma_y^2$ [given condition] ∴ $\frac{\sigma_x}{\sigma_y} = \sqrt{9/16} = \frac{3}{4}$

Since $b_{xy} = r\frac{\sigma_x}{\sigma_y}$, ∴ $r = b_{xy} \cdot \frac{\sigma_y}{\sigma_x} = 0.6 \times \frac{4}{3} = 0.8$
𝜎𝑦 𝜎𝑥

Ex.[8]. Using the following data, obtain the regression equation taking X as the independent
variable and find the most likely value of Y when X = 24.

$\bar{X}$ = 20, $\bar{Y}$ = 15, σx = 4, σy = 3, r = 0.7

Solution: The regression equation with X as the independent variable (i.e. Y on X) is

$Y - \bar{Y} = b_{yx}(X - \bar{X})$, or, $Y - 15 = r\frac{\sigma_y}{\sigma_x}(X - 20)$

or, Y - 15 = 0.7 × (3/4)(X - 20), or, Y - 15 = 0.525(X - 20)

or, Y = 15 + 0.525X - 10.5 ∴ Y = 0.525X + 4.5

Hence the best fit line of Y on X is Ŷ (or Ye) = 4.5 + 0.525X.

When X = 24, Ŷ (or Ye) = 4.5 + 0.525 × 24 = 17.1
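
When only summary statistics are given, as in this example, the line of Y on X follows directly from byx = r·σy/σx. A minimal sketch, with `line_y_on_x` as a hypothetical helper name:

```python
def line_y_on_x(x_mean, y_mean, sigma_x, sigma_y, r):
    """Return (intercept, slope) of the regression line of Y on X
    from summary statistics, using b_yx = r * sigma_y / sigma_x."""
    slope = r * sigma_y / sigma_x
    intercept = y_mean - slope * x_mean
    return intercept, slope

intercept, slope = line_y_on_x(20, 15, 4, 3, 0.7)
y_at_24 = intercept + slope * 24
print(round(slope, 3), round(intercept, 2), round(y_at_24, 2))
```

Swapping the roles of the two variables in the call (means and standard deviations exchanged) gives the line of X on Y in the same way.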
