Bhattacharya - 1967 - Simple Method Resolution Distribution Into Gaussian Components
Author(s): C. G. Bhattacharya
Source: Biometrics, Vol. 23, No. 1 (Mar., 1967), pp. 115-135
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2528285 .
C. G. BHATTACHARYA
Central Inland Fisheries Research Institute, Barrackpore, India¹
SUMMARY
An approximate method of solution is given of the problem of resolution of a
distribution into Gaussian components when the component distributions are
adequately separated. Illustrative examples are given.
RÉSUMÉ
Une solution approchée du problème de la résolution d'une distribution en
composantes gaussiennes est établie lorsque les distributions composantes sont
convenablement séparées. La méthode est illustrée par des exemples.
INTRODUCTION
The distribution of a morphometric character in a biological population
is a mixture of components corresponding to different species, broods,
sexes, etc. A problem which frequently arises is to find the relative
frequencies and the frequency distribution of such components by an
analysis of the observed frequency distribution. The frequency distribution
of any such component is usually assumed to be normal: hence
the problem is one of resolution of a distribution into Gaussian
components.
For a population of fish such an analysis has been found to be very
helpful for population studies, particularly when determination of age
of a fish is difficult. The frequency distribution of length obtained
from a sample of fish is usually skew and polymodal: in many cases, the
modes correspond to individual age-groups and are very helpful for
separating them. Buchanan-Wollaston and Hodgeson [1929] dis-
approved of the smoothing out of 'bumpy' distributions, as practiced
by the early fishery biologists, even for small samples. They suggested
that the individual 'humps' indicate meaningful modes around which
normal curves ought to be fitted.
The problem of resolution of a distribution into two Gaussian com-
ponents, and some particular cases of it, have been considered by several
¹ Present address: Institute of Statistics, University of Ghana, Legon, Accra, Ghana.
METHOD
Let y(x) denote the observed frequency in the class with x as its
mid-point and let h denote the class interval. We plot y(x + h)/y(x)
against x on semi-log paper, or $\Delta \log y = \log y(x + h) - \log y(x)$
against x on ordinary graph paper, and look for the regions where the graph
looks like a straight line with negative slope. Subject to certain conditions
(Appendix A), the number of such regions is the number of components.
We now take a translucent paper with a straight line drawn
on it and match the straight lines, noting for each such region the angle
(say $\theta_r$ for the rth line) it makes with the negative direction of the axis
of x, and the x-intercept (say $\lambda_r$ for the rth line). As shown in Appendix
A, the mean and s.d. of the rth component may be estimated by
$$\hat\mu_r = \lambda_r + h/2 \tag{1}$$
$$\hat\sigma_r^2 = (d h \cot\theta_r / b) - (h^2/12) \tag{2}$$
where b and d denote the relative scales for x and $\Delta \log y$ respectively.
While matching the straight line it is better to fit closely to the points
where the frequency is large even if the apparent discrepancy becomes
somewhat large where the frequency is small.
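As a numerical counterpart to the graphical matching described above, the following Python sketch fits a least-squares line to the logarithmic differences within one straight region and converts its slope and x-intercept into estimates of the mean and s.d. It is only an illustration under stated assumptions: natural logarithms are used and the line is fitted in data units, so the factor d cot θ/b of equation (2) is replaced by the reciprocal of the absolute slope; the function names and the synthetic frequencies are hypothetical and do not come from the paper.

```python
import math


def normal_cdf(z):
    """Distribution function of a standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def class_frequencies(total, mu, sigma, mids, h):
    """Expected frequency of one Gaussian component in each class of width h."""
    return [
        total * (normal_cdf((x + h / 2 - mu) / sigma)
                 - normal_cdf((x - h / 2 - mu) / sigma))
        for x in mids
    ]


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_mean_sd(mids, freqs, h):
    """Estimate mean and s.d. from the line through the log differences."""
    # Each difference log y(x + h) - log y(x) is plotted against the lower mid-point x.
    dlog = [math.log(freqs[i + 1]) - math.log(freqs[i])
            for i in range(len(freqs) - 1)]
    slope, intercept = fit_line(mids[:-1], dlog)
    x_intercept = -intercept / slope         # where the fitted line crosses zero
    mu_hat = x_intercept + h / 2             # equation (1)
    var_hat = h / abs(slope) - h * h / 12    # equation (2) expressed in data units
    return mu_hat, math.sqrt(var_hat)


# Synthetic single component (assumed values, not data from the paper).
mids = [12.5 + i for i in range(6)]
freqs = class_frequencies(1000.0, 15.0, 1.2, mids, 1.0)
mu_hat, sd_hat = estimate_mean_sd(mids, freqs, 1.0)
print(round(mu_hat, 2), round(sd_hat, 2))
```

With error-free frequencies the sketch should recover the assumed mean and s.d. closely; with real data, a weighted fit favouring the classes of large frequency would be closer in spirit to the advice given above.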
Several methods may be used for estimation of the proportions of
mixture, after estimation of the $\mu_i$ and $\sigma_i$. Writing
Y(x) = expected frequency in the class with x as its mid-point,
$N_i$ = total frequency of the ith component,
k = number of components,
P(x) = distribution function of a standard normal deviate,
and
$$P_i(x) = P\left(\frac{x + \tfrac{1}{2}h - \mu_i}{\sigma_i}\right) - P\left(\frac{x - \tfrac{1}{2}h - \mu_i}{\sigma_i}\right) \tag{3}$$
we may consider methods based on the following four formulae.
(i) $$Y = \sum_{i=1}^{k} N_i P_i.$$
This relationship may be easily fitted, since an estimate $\hat P_i$ of $P_i$ is
obtainable by substituting $\hat\mu_i$ and $\hat\sigma_i$ for $\mu_i$ and $\sigma_i$ in (3). Instead of
going into the complications of fitting the above relationship considering
both variables subject to error, it seems easier to use ordinary
regression methods, treating $\hat P_i$ as fixed. Then the estimated $N_i$ are
solutions of
(ia) $$\sum_{i=1}^{k} \hat N_i \sum \hat P_i \hat P_j = \sum y \hat P_j, \quad j = 1, \ldots, k \tag{4}$$
or of
(ib) $$\sum_{i=1}^{k} \hat N_i \sum_r \hat P_i = \sum_r y, \quad r = 1, \ldots, k \tag{5}$$
where $\sum_r$ stands for summation over the classes fitted by the rth line.
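The ordinary regression fit of Y on the class probabilities, with the estimated probabilities treated as fixed, can be sketched in Python as follows. The helper names, the synthetic two-component frequencies and the use of Gaussian elimination to solve the least-squares normal equations are assumptions made for illustration only; the paper itself recommends an iterative solution when the components are well separated.

```python
import math


def normal_cdf(z):
    """Distribution function of a standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def class_prob(x, h, mu, sigma):
    """Probability content of the class with mid-point x, as in equation (3)."""
    return (normal_cdf((x + h / 2 - mu) / sigma)
            - normal_cdf((x - h / 2 - mu) / sigma))


def normal_equations(mids, freqs, h, params):
    """Matrix of sums of products of the P_i and right-hand side of sums of y * P_j."""
    probs = [[class_prob(x, h, mu, sd) for x in mids] for mu, sd in params]
    k = len(params)
    a = [[sum(pi * pj for pi, pj in zip(probs[i], probs[j])) for i in range(k)]
         for j in range(k)]
    b = [sum(y * pj for y, pj in zip(freqs, probs[j])) for j in range(k)]
    return a, b


def solve(a, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


# Synthetic two-component mixture (assumed values, not data from the paper).
params = [(11.0, 0.8), (15.0, 1.1)]
true_totals = [600.0, 400.0]
mids = [8.5 + i for i in range(11)]
freqs = [sum(n * class_prob(x, 1.0, mu, sd)
             for n, (mu, sd) in zip(true_totals, params)) for x in mids]

a, b = normal_equations(mids, freqs, 1.0, params)
totals = solve(a, b)
print([round(t, 3) for t in totals])
```

Because the synthetic frequencies are exact mixtures of the two components, the solution should reproduce the assumed totals; with observed frequencies it gives the least-squares estimates instead.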
(ii) $$Y \approx N_r P_r + N_{r+1} P_{r+1}$$
for the classes fitted by the rth or (r + 1)th line or which lie in between
them. Estimates of $N_r$ and $N_{r+1}$ from this, denoted by $\hat N_{r(r+1)}$ and
$\hat N_{(r+1)(r)}$, may be obtained from
$$\sum_{i=1}^{k} \hat p_i = 1. \tag{7}$$
(iii) $$Y \approx N_r P_r$$
for the classes fitted by the rth line; $N_r$ may thus be estimated by either
(iiia) $$\hat N_r = \sum y \hat P_r \Big/ \sum \hat P_r^2 \tag{8}$$
or
(iiib) $$\hat N_r = \sum y \Big/ \sum \hat P_r \tag{9}$$
where summation is restricted to the region under consideration.
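The two single-component estimators can be compared with a short Python sketch. It reads (iiia) as the least-squares ratio of the sum of y times the class probability to the sum of squared class probabilities, and (iiib) as the ratio of the sum of y to the sum of class probabilities; that reading, the function names and the rounded synthetic frequencies are assumptions for illustration.

```python
import math


def normal_cdf(z):
    """Distribution function of a standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def class_prob(x, h, mu, sigma):
    """Probability content of the class with mid-point x, as in equation (3)."""
    return (normal_cdf((x + h / 2 - mu) / sigma)
            - normal_cdf((x - h / 2 - mu) / sigma))


def total_iiia(freqs, probs):
    """Least-squares estimate through the origin: sum(y * P) / sum(P * P)."""
    return sum(y * p for y, p in zip(freqs, probs)) / sum(p * p for p in probs)


def total_iiib(freqs, probs):
    """Ratio-of-sums estimate: sum(y) / sum(P)."""
    return sum(freqs) / sum(probs)


# One isolated component with assumed parameters; frequencies rounded to
# whole numbers to mimic observed counts (synthetic, not from the paper).
h, mu, sigma, total = 1.0, 11.0, 0.8, 5000.0
mids = [9.5, 10.5, 11.5, 12.5]
probs = [class_prob(x, h, mu, sigma) for x in mids]
freqs = [float(round(total * p)) for p in probs]

n_iiia = total_iiia(freqs, probs)
n_iiib = total_iiib(freqs, probs)
print(round(n_iiia, 1), round(n_iiib, 1))
```

Both estimates should lie close to the assumed total; (iiia) gives more weight to the classes with large probability content, whereas (iiib) needs only partial sums.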
(iv) As shown in Appendix A,
$$\log\frac{h\hat N_r}{\hat\sigma_r} = \frac{1}{n}\sum \log y + \frac{1}{2n[\hat\sigma_r^2 + (h^2/12)]}\sum (x - \hat\mu_r)^2 + \frac{h^2}{24\hat\sigma_r^2} + \tfrac{1}{2}\log 2\pi \tag{10}$$
where summation is restricted to the region under consideration, and
n denotes the number of classes in the region. (If common logarithms
are used, the 2nd and 3rd terms on the R.H.S. of (10) must be multiplied
by $\log_{10} e$, as illustrated in Appendix B.)
It may be observed that all the methods described above, with the
exception of (ii), yield a single estimate of the total frequency of each
component; from these the proportions of the mixture may be estimated
by
$$\hat p_i = \hat N_i \Big/ \sum_{i=1}^{k} \hat N_i. \tag{11}$$
The use of equations (1) and (2) for estimation of mean and s.d. is
illustrated in examples 1 and 2, where, for simplicity, the $N_i$ are
estimated by equation (9), but considering only two classes near the centre
of each straight region. Equations (4)-(11) are illustrated in Appendix
B, using the data of example 1.
Method (i) is very laborious when the number of components is
large. If the components are well separated the diagonal terms dominate
in the coefficient matrix of equations (4), so that an iterative procedure
(Bodewig [1956]) is very suitable. Method (ib) is a good substitute
for method (ia) and is less laborious, in that computation of sums of
squares and products is replaced by computation of partial sums, and
that the diagonal terms in the coefficient matrix are more dominant, so
that the iteration process converges more rapidly. Methods (iii) and (iv)
are simple, (iv) being particularly useful when statistical tables are
not available. Method (ii) appears to be a good compromise between
methods (i) and (iii). There seems little point in undertaking a laborious
calculation to estimate the $N_i$ with high apparent precision when the
$\hat\mu_i$ and $\hat\sigma_i$ are themselves subject to error.
FIGURE 1
(i) TWO NORMAL DISTRIBUTIONS: a COMPLETELY OVERLAPPED BY b
(ii) THREE NORMAL DISTRIBUTIONS: b COMPLETELY OVERLAPPED BY a AND c
(iii) THREE NORMAL DISTRIBUTIONS: BOTH a AND c ARE COMPLETELY OVERLAPPED
EXAMPLES
Example 1
The data (Table 1) for this example are taken from Tanaka [1962]
and relate to the frequency distribution of forkal length of Porgy caught
by the pair-trawl fishery of the East China Sea.
The graph (Figure 2) of logarithmic differences of class frequency
against the midpoint of the class (circles) shows four approximately
straight regions with negative slope, indicating four distinct components.
The presence of a further component between the last two of these,
but substantially overlapped by them, is also suggested (the solid
points, and the line through them, are not yet available at this stage
of the argument). The parameters of this component cannot be esti-
mated directly, and the approach indicated in the previous section
must be adopted.
After matching straight lines with each of the approximately straight
regions mentioned above, we get
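h = 1, b = 10, d = 200 × $\log_{10} e$ = 86.858
$\lambda_1$ = 10.53, $\lambda_2$ = 14.78, $\lambda_3$ = 19.36, $\lambda_5$ = 26.12
$\theta_1$ = 85.25°, $\theta_2$ = 81.25°, $\theta_3$ = 73.00°, $\theta_5$ = 75.5°.
Hence, from (1) and (2),
$\hat\mu_1$ = 11.03, $\hat\mu_2$ = 15.28, $\hat\mu_3$ = 19.86, $\hat\mu_5$ = 26.62
$\hat\sigma_1$ = .81, $\hat\sigma_2$ = 1.13, $\hat\sigma_3$ = 1.60, $\hat\sigma_5$ = 1.47.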
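The arithmetic of equations (1) and (2) for this example can be retraced with a few lines of Python. The sketch below takes the scales, x-intercepts and angles quoted above as inputs; the variable names are illustrative. The means reproduce the quoted values exactly, while the standard deviations agree with the quoted figures to within about 0.01, a difference that presumably reflects rounding of the angles read from the graph.

```python
import math

# Scales and read-off values quoted in the text for components 1, 2, 3 and 5.
h = 1.0
b = 10.0
d = 200.0 * math.log10(math.e)   # about 86.858
x_intercepts = {1: 10.53, 2: 14.78, 3: 19.36, 5: 26.12}
angles_deg = {1: 85.25, 2: 81.25, 3: 73.00, 5: 75.5}

# Equation (1): mean = x-intercept + h/2.
means = {r: lam + h / 2 for r, lam in x_intercepts.items()}

# Equation (2): variance = d * h * cot(theta) / b - h^2 / 12.
sds = {
    r: math.sqrt(d * h / (b * math.tan(math.radians(theta))) - h * h / 12)
    for r, theta in angles_deg.items()
}

for r in sorted(means):
    print(r, round(means[r], 2), round(sds[r], 2))
```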
TABLE 1
EXAMPLE 1: FREQUENCY DISTRIBUTION OF FORKAL LENGTH OF PORGIES
Class range | Mid-point | Observed frequency (y) | $\log_{10} y$ | $\Delta \log_{10} y$
FIGURE 2
EXAMPLE 1: GRAPH OF LOGARITHMIC DIFFERENCES OF THE CLASS-FREQUENCIES
AGAINST THE MID-POINTS OF THE CLASSES
TABLE 2
EXAMPLE 1: CALCULATION OF RESIDUAL FREQUENCIES IN REGION BETWEEN
THIRD AND FIFTH COMPONENTS
x | y | $\hat P_3$ | $\hat P_5$ | $y_R = y - \hat N_3 \hat P_3 - \hat N_5 \hat P_5$ | $\log_{10} y_R$ | $\Delta \log_{10} y_R$
$\hat\mu_4$ = 23.62, $\hat\sigma_4$ = 1.07,
$$\hat N_4 = \frac{y_R(23.5) + y_R(24.5)}{\hat P_4(23.5) + \hat P_4(24.5)} = 639.$$
Finally, from (11),
TABLE 3
EXAMPLE 1: COMPARISON OF THE RESULTS OBTAINED BY FOUR METHODS
A. BUCHANAN-WOLLASTON B. CASSIE C. TANAKA D. AUTHOR
1 2 3 4 5
TABLE 4
EXAMPLE 2: ARTIFICIAL MIXTURE OF THREE GAUSSIAN DISTRIBUTIONS
FIGURE 3
EXAMPLE 2: PROBIT PLOT OF THE FREQUENCY DISTRIBUTION
(Axis: cumulative frequency, probit scale)
Here 1 'l .71 02 1.1003 16
h = 1, b = 50, d = 10,
iV1, -
y(14.5) + y(ll.5)
FIGURE 4
EXAMPLE 2: GRAPH OF LOGARITHMIC DIFFERENCES OF THE CLASS FREQUENCIES
AGAINST THE MID-POINTS OF THE CLASSES
(Horizontal axis: mid-point of class)
TABLE 5
EXAMPLE 2: COMPARISON OF THE RESULTS OBTAINED BY AUTHOR'S METHOD
WITH THE VALUES USED TO CONSTRUCT THE 'DATA' (IN PARENTHESES)
Parameters               Component 1      Component 2      Component 3
standard deviation       .80 (.78)        1.12 (1.16)      1.60 (1.60)
proportions of mixture   .4404 (.4311)    .3334 (.3444)    .2262 (.2245)
DISCUSSION
A satisfactory practical solution of the problem under investiga-
tion continues to elude mathematical statisticians. Although the
problem admits a neat theoretical solution, very great difficulties would
be encountered in the practical application of the theoretical results.
The methods so far adopted by fishery workers, as well as that presented
in this paper, are all approximate in the sense that they are applicable
only when the components are adequately separated.
The mathematical basis of the probability paper method is not
very clear. The other methods have a clear mathematical basis, are
effective even with considerable overlap of the components provided
the sample is sufficiently large, and when applicable require no correc-
tion for truncation of the components.
The use of differences, as Tanaka [1962] remarked, involves the
danger of magnifying errors in the frequency distribution. This is the
reason why numerical differentiation is generally viewed with much
concern (Nielson [1956]). It may, however, be recalled that Fisher
[1950] used logarithmic differences in place of relative rates for fitting
a logistic curve, and that a similar approach by Bhattacharya [1964-65]
has been found to be quite satisfactory for fitting a more general class
of growth curves which includes the logistic curve as a particular case.
In the present case, the differences are used directly rather than as
substitutes for differential coefficients. Hence, in view of the simplify-
ing assumptions already made, the use of first differences does not seem
to be objectionable provided due care is taken with small frequencies.
The advantage of a linear transform in any applied research can
hardly be overemphasized. In the present case it has the great ad-
vantage that it reduces subjective elements to a minimum and is
certainly the quickest and simplest of all the existing methods.
The assumption that the class-range should be small is important,
and the sample should be sufficiently large so that the class frequencies
do not become very small in the regions of interest. This point may
be taken into consideration at the stage of collection and compilation
of data. If it is felt that the original class-width is too large for the
method to be applicable, it may sometimes be useful to divide the
original class into an odd number of subclasses and work with the fre-
quency of the central subclass, which may be estimated by some
smoothing formula such as King's formulae (Willers [1948]). This would
require a distinction between the class-width and the class interval.
ACKNOWLEDGEMENTS
The author wishes to express his gratitude to Dr. B. S. Bhimachar,
Director of the Institute, for his constant encouragement during the
course of the study, to Shri V. R. Pantulu for inspiring the study and
his constant help and to Prof. H. K. Nandy of the University of Calcutta
for his helpful criticism and valuable advice during the progress of the
work. Thanks are also due to Shri P. Datta for his useful suggestions
in connection with the study.
REFERENCES
Bhattacharya, C. G. [1966]. Fitting a class of growth curves. Sankhya B 28, 1-10.
Bodewig, E. [1956]. Matrix calculus. 1st Edn. Amsterdam: North Holland Publ. Co.
Buchanan-Wollaston, H. G. and Hodgeson, W. C. [1929]. A new method of treating
frequency curves in fishery statistics, with some results. J. Cons. 4, 207-25.
Cassie, R. M. [1954]. Some uses of probability paper for the graphical analysis of
polymodal frequency distributions. Aust. J. Mar. Freshw. Res. 5, 513-22.
Fisher, R. A. [1950]. Statistical methods for research workers. 11th Edn. Edinburgh:
Oliver and Boyd.
Fisher, R. A. and Yates, F. [1963]. Statistical tables for biological, agricultural and
medical research. 6th Edn. London: Oliver and Boyd.
Gottschalk, V. H. [1948]. Symmetrical bimodal frequency curves. J. Franklin
Inst. 245, 245-52.
Harding, J. F. [1949]. The use of probability paper for the graphical analysis of
polymodal frequency distributions. J. Mar. biol. Ass. U. K. 28, 141-53.
Nielson, K. L. [1956]. Methods in numerical analysis. 1st Edn. New York: The
Macmillan Company.
Oka, M. [1954]. Ecological studies on the kidai by the statistical method II. On the
growth of kidai (Taius tumifrons). Bull. Fac. Fish. Nagasaki 2, 8-25.
APPENDIX A
THE UNDERLYING MATHEMATICAL ASSUMPTIONS
in the region where the effect of all except the rth component is negligible.
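Writing $v = x + \sigma_r u$, this becomes
$$y \approx N_r \int_{-h/2\sigma_r}^{h/2\sigma_r} Z(t_r + u)\,du$$
$$= N_r \int_{-h/2\sigma_r}^{h/2\sigma_r} \sum_{s=0}^{\infty} \frac{Z^{(s)}(t_r)}{s!}\,u^s\,du = N_r \sum_{s=0}^{\infty} \frac{Z^{(s)}(t_r)}{s!} \int_{-h/2\sigma_r}^{h/2\sigma_r} u^s\,du$$
$$\approx N_r \left\{ \frac{h}{\sigma_r} Z(t_r) + \frac{h^3}{24\sigma_r^3} Z''(t_r) \right\}$$
neglecting terms involving $h^5$ and higher powers of h.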
i.e.,
APPENDIX B
EXAMPLE
result that the 4th component cannot be isolated before the total
frequencies of the other components are determined, we restrict the
illustration to the first three components, and consider only the classes
in the range 9-22, in which we may neglect the effect of the 4th and
5th components. As suggested in §2, the systems of linear equations
encountered in Methods (ia) and (ib) are conveniently solved by a
common iteration procedure called iteration II by Bodewig [1956].
This method starts with an initial solution, obtained by ignoring all
except the diagonal terms of the coefficient matrix: the lth approximation
is then obtained by adjusting the right hand sides for the
off-diagonal terms, calculated using the (l - 1)th approximation.
Preliminary calculations are shown in Table 6: values of $\hat\mu_i$ and $\hat\sigma_i$
are from Table 3.
TABLE 6
COMPARISON OF METHODS (i)-(iv) (APPENDIX B): PRELIMINARY CALCULATIONS
Class range (cm.) | Mid-point of class (x) | $\hat P_1$ × 10 | $\hat P_2$ × 10 | $\hat P_3$ × 10 | y | $\log_{10} y$ | $x - \hat\mu_1$ | $x - \hat\mu_2$ | $x - \hat\mu_3$
Method (ia) Summing over classes in the range 9-22, equations (4)
become
$$.330854308\hat N_1 + .003458204\hat N_2 + .000001913\hat N_3 = 1923.38220$$
$$.003458204\hat N_1 + .243821904\hat N_2 + .014349321\hat N_3 = 1102.03553$$
$$.000001913\hat N_1 + .014349321\hat N_2 + .168761528\hat N_3 = 555.26670$$
Method (ib) Summing over classes in the ranges 9-13, 13-17 and
18-22, corresponding to the 1st, 2nd and 3rd line respectively, equations
(5) become
$$.98752\hat N_1 + .02089\hat N_2 + .00001\hat N_3 = 5713$$
$$.00690\hat N_1 + .91680\hat N_2 + .03692\hat N_3 = 4066$$
$$0\hat N_1 + .00758\hat N_2 + .78695\hat N_3 = 2349.$$
The solution is
$\hat N_1$ = 5695; $\hat N_2$ = 4274; $\hat N_3$ = 2944; $\sum \hat N_i$ = 12913
$\hat p_1$ = .4410; $\hat p_2$ = .3310; $\hat p_3$ = .2280.
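The iteration described at the start of this appendix can be sketched in Python on the Method (ib) system above. The code follows the verbal description (start from the diagonal terms alone, then repeatedly adjust the right-hand sides for the off-diagonal terms using the previous approximation), which is a Jacobi-type scheme; that it coincides in every detail with Bodewig's iteration II is an assumption, and the function name and the fixed number of sweeps are illustrative choices.

```python
def diagonal_iteration(a, b, sweeps=50):
    """Jacobi-type iteration for a diagonally dominant linear system a x = b."""
    n = len(b)
    # Initial solution: ignore all except the diagonal terms.
    x = [b[i] / a[i][i] for i in range(n)]
    for _ in range(sweeps):
        # Adjust each right-hand side for the off-diagonal terms,
        # evaluated with the previous approximation.
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
    return x


# Coefficients and right-hand sides of the Method (ib) equations quoted above.
a = [
    [0.98752, 0.02089, 0.00001],
    [0.00690, 0.91680, 0.03692],
    [0.0, 0.00758, 0.78695],
]
b = [5713.0, 4066.0, 2349.0]

totals = diagonal_iteration(a, b)
proportions = [t / sum(totals) for t in totals]   # as in equation (11)
print([round(t) for t in totals])
print([round(p, 4) for p in proportions])
```

Because the diagonal terms dominate so strongly, a handful of sweeps already suffices; the fixed count of 50 is deliberately generous rather than tuned.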
TABLE 7
ESTIMATION OF THE $N_i$ BY METHOD (iii)
Method (ii) We consider the classes in the ranges 9-17 and 13-22 for
estimating $\hat p_2/\hat p_1$ and $\hat p_3/\hat p_2$ respectively. For $\hat p_2/\hat p_1$ equations (5) become
$$.330854308\hat N_{1(2)} + .003458204\hat N_{2(1)} = 1923.38220$$
$$.003458204\hat N_{1(2)} + .240775501\hat N_{2(1)} = 1073.54284.$$
The solution is