A Simple Method of Resolution of a Distribution into Gaussian Components

Author(s): C. G. Bhattacharya
Source: Biometrics, Vol. 23, No. 1 (Mar., 1967), pp. 115-135
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2528285 .
Accessed: 25/06/2014 04:14


International Biometric Society is collaborating with JSTOR to digitize, preserve and extend access to
Biometrics.

http://www.jstor.org

This content downloaded from 91.229.229.205 on Wed, 25 Jun 2014 04:14:36 AM


All use subject to JSTOR Terms and Conditions
A SIMPLE METHOD OF RESOLUTION OF A
DISTRIBUTION INTO GAUSSIAN COMPONENTS

C. G. BHATTACHARYA
Central Inland Fisheries Research Institute, Barrackpore, India¹

SUMMARY
An approximate method of solution is given of the problem of resolution of a
distribution into Gaussian components when the component distributions are
adequately separated. Illustrative examples are given.

RESUME
An approximate solution to the problem of resolving a distribution into
Gaussian components is established for the case where the component
distributions are suitably separated. The method is illustrated by examples.

INTRODUCTION
The distribution of a morphometric character in a biological population
is a mixture of components corresponding to different species, broods,
sexes, etc. A problem which frequently arises is to find the relative
frequencies and the frequency distribution of such components by an
analysis of the observed frequency distribution. The frequency distribution
of any such component is usually assumed to be normal: hence
the problem is one of resolution of a distribution into Gaussian components.
For a population of fish such an analysis has been found to be very
helpful for population studies, particularly when determination of age
of a fish is difficult. The frequency distribution of length obtained
from a sample of fish is usually skew and polymodal: in many cases, the
modes correspond to individual age-groups and are very helpful for
separating them. Buchanan-Wollaston and Hodgeson [1929] dis-
approved of the smoothing out of 'bumpy' distributions, as practiced
by the early fishery biologists, even for small samples. They suggested
that the individual 'humps' indicate meaningful modes around which
normal curves ought to be fitted.
The problem of resolution of a distribution into two Gaussian components,
and some particular cases of it, have been considered by several
authors using the methods of moments (Pearson [1894; 1915], Rao
[1948]), incomplete moments (Pearson and Lee [1908-09]), half moments
(Gottschalk [1948]), and maximum likelihood (Rao [1948]), and a
graphical procedure based on the relationship between skewness and
kurtosis (Preston [1953]). The difficulties encountered with these
methods increase at a tremendous rate as the number of components
increases, and the general problem in which the number of components
is unknown, and may be more than two, does not seem to have been
considered in the statistical literature.

¹ Present address: Institute of Statistics, University of Ghana, Legon, Accra, Ghana.

Various approximate methods have been suggested by fishery workers
for situations in which the components are adequately separated. The
probability paper method (Harding [1949], Cassie [1954]) involves
dissection of the distribution at the point of inflexion of the probit
plot, followed by correction for overlap of the components. The other
methods (Buchanan-Wollaston and Hodgeson [1929], Oka [1954],
Tanaka [1962]) depend on equating the class-frequency to the ordinate
at the midpoint of the class so that the logarithm of class frequency
is a quadratic function of the mid-point of the class in a region where the
effect of all but one component is negligible. The underlying idea,
similar to that of Pearson and Lee [1908-09], is to attempt to determine
a particular component from the region where the effect of all other
components is negligible.
While Buchanan-Wollaston and Hodgeson [1929] and Tanaka [1962]
fitted these parabolas directly to estimate the proportions of mixture
along with mean and s.d., Oka [1954] attempted to estimate only mean
and s.d. by fitting straight lines (representing derivatives of these
parabolas) to the differential coefficients of log class-frequency, estimated
by average divided differences for two consecutive classes. In his
numerical example Oka considered a constant class interval. It may
be observed that, if the class intervals are unequal, the parabolas will
vary from group to group, and approximating differential coefficients
by divided differences may involve large errors.
In this paper I consider a cubic approximation to density within
a class, but approximate the logarithm of class frequency by a quad-
ratic. This introduces some corrections for grouping. Further, the
class interval has been assumed to be constant, and since simple dif-
ferencing reduces a quadratic to a straight line, I have used direct
differences instead of approximate differential coefficients as used by
Oka. When the class intervals are unequal the differences may be
corrected by an iterative procedure. Finally, methods are suggested
and applied for estimation of the proportions in the mixture, which is
an important part of the analysis, unfortunately ignored in Oka's paper.
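
The remark that simple differencing reduces a quadratic to a straight line can be checked in one line. As an illustrative sketch (the coefficients $A$, $B$, $C$ are placeholder symbols introduced here, not notation from the paper), suppose the logarithm of the class frequency is exactly quadratic in the mid-point:

```latex
\log y(x) = A + Bx + Cx^{2}
\quad\Longrightarrow\quad
\Delta \log y(x) = \log y(x+h) - \log y(x) = (Bh + Ch^{2}) + 2Ch\,x .
```

The difference is linear in $x$ with slope $2Ch$, which is negative whenever $C < 0$, as it is for a Gaussian component.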


METHOD
Let $y(x)$ denote the observed frequency in the class with $x$ as its
mid-point and let $h$ denote the class interval. We plot $y(x + h)/y(x)$
against $x$ on semi-log paper, or $\Delta \log y = \log y(x + h) - \log y(x)$ against
$x$ on ordinary graph paper, and look for the regions where the graph
looks like a straight line with negative slope. Subject to certain conditions
(Appendix A), the number of such regions is the number of components.
We now take a translucent paper with a straight line drawn
on it and match the straight lines, noting for each such region the angle
(say $\theta_r$ for the $r$th line) it makes with the negative direction of the axis
of $x$, and the $x$-intercept (say $\hat\lambda_r$ for the $r$th line). As shown in Appendix
A, the mean and s.d. of the $r$th component may be estimated by
$$\hat\mu_r = \hat\lambda_r + h/2 \tag{1}$$
$$\hat\sigma_r^2 = (dh \cot\theta_r / b) - (h^2/12) \tag{2}$$
where $b$ and $d$ denote the relative scales for $x$ and $\Delta \log y$ respectively.
While matching the straight line it is better to fit closely to the points
where the frequency is large even if the apparent discrepancy becomes
somewhat large where the frequency is small.
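
The two steps just described (forming the logarithmic differences, then converting a matched line into a mean and s.d. through equations (1) and (2)) can be sketched in a few lines of Python. This is only an illustration: the function names `log_differences` and `component_from_line` are made up here, natural logarithms are assumed in the first function, and the inputs are the first-line readings quoted later for example 1.

```python
import math

def log_differences(midpoints, freqs):
    """Return (x, delta) pairs with delta = ln y(x + h) - ln y(x)."""
    return [
        (midpoints[i], math.log(freqs[i + 1]) - math.log(freqs[i]))
        for i in range(len(freqs) - 1)
    ]

def component_from_line(x_intercept, theta_deg, h, b, d):
    """Equations (1) and (2): mean and s.d. from the x-intercept and the angle."""
    mean = x_intercept + h / 2.0
    variance = d * h / (b * math.tan(math.radians(theta_deg))) - h * h / 12.0
    return mean, math.sqrt(variance)

# First line of example 1: x-intercept 10.53, angle 85.25 degrees,
# with h = 1, b = 10 and d = 86.858 as quoted in the text.
mean1, sd1 = component_from_line(10.53, 85.25, h=1.0, b=10.0, d=86.858)

# Differences for the first three classes of Table 1 (hypothetical helper usage).
diffs = log_differences([9.5, 10.5, 11.5], [509, 2240, 2341])
```

With these inputs the mean is the intercept plus half a class interval, and the s.d. should come out close to the value of about 0.8 reported for the first component.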
Several methods may be used for estimation of the proportions of
mixture, after estimation of the $\mu_i$ and $\sigma_i$. Writing
$Y(x)$ = expected frequency in the class with $x$ as its mid-point,
$N_i$ = total frequency of the $i$th component,
$k$ = number of components,
$P(x)$ = distribution function of a standard normal deviate,
and
$$P_i(x) = P\!\left(\frac{x + \tfrac12 h - \mu_i}{\sigma_i}\right) - P\!\left(\frac{x - \tfrac12 h - \mu_i}{\sigma_i}\right) \tag{3}$$
we may consider methods based on the following four formulae.
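
Equation (3) is easy to evaluate without statistical tables. The following is a minimal sketch using only the Python standard library; the names `std_normal_cdf` and `class_probability` are illustrative, and the mean 11.03 and s.d. 0.80 are assumed inputs chosen to resemble the first component of example 1.

```python
import math

def std_normal_cdf(z):
    """P(z): distribution function of a standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def class_probability(x, mean, sd, h):
    """Equation (3): probability that a component falls in the class centred at x."""
    upper = std_normal_cdf((x + h / 2.0 - mean) / sd)
    lower = std_normal_cdf((x - h / 2.0 - mean) / sd)
    return upper - lower

# Class 10-11 (mid-point 10.5) for an assumed component with mean 11.03, s.d. 0.80.
p_10_5 = class_probability(10.5, mean=11.03, sd=0.80, h=1.0)
```

Under these assumed inputs the result should lie close to the entry tabulated for $\hat P_1$ at $x = 10.5$ in Table 6 of Appendix B.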
(i) $$Y = \sum_{i=1}^{k} N_i P_i .$$
This relationship may be easily fitted, since an estimate $\hat P_i$ of $P_i$ is
obtainable by substituting $\hat\mu_i$ and $\hat\sigma_i$ for $\mu_i$ and $\sigma_i$ in (3). Instead of
going into the complications of fitting the above relationship considering
both variables subject to error, it seems easier to use ordinary
regression methods, treating $\hat P_i$ as fixed. Then the estimated $N_i$ are
solutions of
(ia) $$\sum_{i=1}^{k} \hat N_i \sum \hat P_i \hat P_j = \sum y \hat P_j , \qquad j = 1, \ldots, k \tag{4}$$
where summation is over all classes in the range of the distribution.


Alternatively, the following simpler formula may be used:
(ib) $$\sum_{i=1}^{k} \hat N_i \sum_r \hat P_i = \sum_r y , \qquad r = 1, \ldots, k \tag{5}$$
where $\sum_r$ stands for summation over the classes fitted by the $r$th line.
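(ii) $$Y \approx N_r P_r + N_{r+1} P_{r+1}$$
for the classes fitted by the $r$th or $(r + 1)$th line or which lie in between
them. Estimates of $N_r$ and $N_{r+1}$ from this, denoted by $\hat N_{r(r+1)}$ and
$\hat N_{r+1(r)}$, may be obtained from
$$\begin{aligned}
\hat N_{r(r+1)} \sum \hat P_r^2 + \hat N_{r+1(r)} \sum \hat P_r \hat P_{r+1} &= \sum y \hat P_r \\
\hat N_{r(r+1)} \sum \hat P_r \hat P_{r+1} + \hat N_{r+1(r)} \sum \hat P_{r+1}^2 &= \sum y \hat P_{r+1}
\end{aligned} \tag{6}$$
where summation is restricted to the region under consideration. The
proportions of the various components in the mixture, $p_i$ say
($i = 1, \ldots, k$), may then be estimated from the relations
$$\hat p_{i+1}/\hat p_i = \hat N_{i+1(i)}/\hat N_{i(i+1)}, \quad i = 1, \ldots, k-1; \qquad \sum_{i=1}^{k} \hat p_i = 1. \tag{7}$$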
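
Relations (7) fix the proportions once the successive ratios are known: set the first proportion to an arbitrary value, chain the ratios, and normalise. A small sketch follows; the function name `proportions_from_ratios` is hypothetical, and the two ratios are the ones quoted in Appendix B for the first three components of example 1.

```python
def proportions_from_ratios(ratios):
    """Relations (7): turn successive ratios p[i+1]/p[i] into proportions summing to 1."""
    relative = [1.0]
    for ratio in ratios:
        relative.append(relative[-1] * ratio)
    total = sum(relative)
    return [value / total for value in relative]

# Ratios p2/p1 and p3/p2 as quoted in Appendix B.
p = proportions_from_ratios([0.7587, 0.6820])
```

The resulting proportions should agree, to about three decimals, with the method (ii) values reported in Appendix B.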

(iii) $$Y \approx N_r P_r$$
for the classes fitted by the $r$th line; $N_r$ may thus be estimated by either
(iiia) $$\hat N_r = \sum y \hat P_r \Big/ \sum \hat P_r^2 \tag{8}$$
or
(iiib) $$\hat N_r = \sum y \Big/ \sum \hat P_r \tag{9}$$
where summation is restricted to the region under consideration.
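
Formulae (8) and (9) are one-line computations once the class probabilities are available. The sketch below uses hypothetical function names and takes as input the frequencies of classes 9-13 of example 1 together with the corresponding $\hat P_1$ values as tabulated in Table 6 (Appendix B).

```python
def total_by_regression(y, p):
    """Equation (8): least-squares slope through the origin of y on P."""
    return sum(yi * pi for yi, pi in zip(y, p)) / sum(pi * pi for pi in p)

def total_by_sums(y, p):
    """Equation (9): ratio of summed frequencies to summed class probabilities."""
    return sum(y) / sum(p)

# Classes 9-13 of example 1 (first line); P values as tabulated in Table 6.
y1 = [509, 2240, 2341, 623]
p1 = [0.09338, 0.38608, 0.40230, 0.10576]
n1_iiia = total_by_regression(y1, p1)
n1_iiib = total_by_sums(y1, p1)
```

Rounded to whole fish, the two results should match the first row of Table 7.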
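(iv) As shown in Appendix A,
$$\log Y \approx \log\frac{hN_r}{\sigma_r\sqrt{2\pi}} - \frac{h^2}{24\sigma_r^2} - \frac{[\sigma_r^2 - (h^2/12)]}{2\sigma_r^4}\,(x - \mu_r)^2$$
for the classes fitted by the $r$th line; an estimate of $N_r$ is given by
$$\log \hat N_r = \frac{1}{n}\sum \log y + \frac{[\hat\sigma_r^2 - (h^2/12)]}{2n\hat\sigma_r^4}\sum (x - \hat\mu_r)^2 + \frac{h^2}{24\hat\sigma_r^2} + \log\frac{\hat\sigma_r}{h} + \tfrac12 \log 2\pi \tag{10}$$
where summation is restricted to the region under consideration, and
$n$ denotes the number of classes in the region. (If common logarithms
are used, the second and third terms on the R.H.S. of (10) must be multiplied
by $\log_{10} e$, as illustrated in Appendix B.)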
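
Method (iv) needs no normal tables, only logarithms. The sketch below evaluates equation (10) with natural logarithms and checks it on a synthetic single-component case; the function names and the test values (total 1000, mean 0, s.d. 2, class interval 1) are made up for illustration and do not come from the paper.

```python
import math

def total_by_log_formula(xs, ys, mean, sd, h):
    """Equation (10) with natural logarithms: estimate N_r from one straight region."""
    n = len(xs)
    var = sd * sd
    mean_log_y = sum(math.log(y) for y in ys) / n
    spread = sum((x - mean) ** 2 for x in xs)
    log_n = (
        mean_log_y
        + (var - h * h / 12.0) * spread / (2.0 * n * var * var)
        + h * h / (24.0 * var)
        + math.log(sd / h)
        + 0.5 * math.log(2.0 * math.pi)
    )
    return math.exp(log_n)

def cdf(z):
    """Distribution function of a standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Synthetic check (made-up values): exact class frequencies of one component.
true_n, mean, sd, h = 1000.0, 0.0, 2.0, 1.0
xs = [-1.0, 0.0, 1.0]
ys = [true_n * (cdf((x + h / 2 - mean) / sd) - cdf((x - h / 2 - mean) / sd)) for x in xs]
estimate = total_by_log_formula(xs, ys, mean, sd, h)
```

Because the class interval is small relative to the s.d. in this synthetic case, the estimate should recover the assumed total to well within one per cent.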


It may be observed that all the methods described above, with the
exception of (ii), yield a single estimate of the total frequency of each
component; from these the proportions of the mixture may be estimated
by
$$\hat p_i = \hat N_i \Big/ \sum_{i=1}^{k} \hat N_i . \tag{11}$$
The use of equations (1) and (2) for estimation of mean and s.d. is
illustrated in examples 1 and 2, where, for simplicity, the $N_i$ are estimated
by equation (9), but considering only two classes near the centre
of each straight region. Equations (4)-(11) are illustrated in Appendix
B, using the data of example 1.
Method (i) is very laborious when the number of components is
large. If the components are well separated the diagonal terms dominate
in the coefficient matrix of equations (4), so that an iterative procedure
(Bodewig [1956]) is very suitable. Method (ib) is a good substitute
for method (ia) and is less laborious, in that computation of sums of
squares and products is replaced by computation of partial sums, and
that the diagonal terms in the coefficient matrix are more dominant, so
that the iteration process converges more rapidly. Methods (iii) and (iv)
are simple, (iv) being particularly useful when statistical tables are
not available. Method (ii) appears to be a good compromise between
methods (i) and (iii). There seems little point in undertaking a laborious
calculation to estimate the $N_i$ with high apparent precision when the
$\hat\mu_i$ and $\hat\sigma_i$ are themselves subject to error.

SOME SPECIAL CASES


The method described above is strictly valid under certain precise
conditions. In some special cases these conditions are not satisfied,
but the method may still prove useful.
Figure 1a illustrates a two-component situation where the component
a is completely overlapped by b. In such a situation the graph of the
logarithmic difference of the class-frequency against the mid-point of
the class will indicate a straight line corresponding to component b.
Component a can then be determined by subtraction of the frequencies
due to component b from the frequency distribution of the mixture.
Figure 1b illustrates a three-component situation where the component
b is completely overlapped by a and c: in the middle region all
three components overlap. In such a situation the graph will reveal
two straight lines from which a and c can be determined: component
b can then be determined by subtraction.
In Figure 1c (three components) the components a and c are completely
overlapped by b. In this case the graph will show a straight
line corresponding to component b: subtracting the frequencies due to
b from the frequency distribution of the mixture then leaves a simple
mixture (without overlap) of a and c.

FIGURE 1
(i) TWO NORMAL DISTRIBUTIONS: a COMPLETELY OVERLAPPED BY b
(ii) THREE NORMAL DISTRIBUTIONS: b COMPLETELY OVERLAPPED BY a AND c
(iii) THREE NORMAL DISTRIBUTIONS: BOTH a AND c ARE COMPLETELY OVERLAPPED

In the general situation, where it is not known whether the assump-
tions underlying the present method are satisfied, the validity of the
method may be roughly examined from the data. One first determines
those components which are revealed in the graph of logarithmic
difference of the class frequency against the mid-point of the class.
The frequencies due to these components may then be subtracted from
the observed frequency distribution: if all the components have been
determined, the residual frequency should be negligible. A non-
negligible residual frequency indicates that the conditions underlying
the present method are not satisfied: however, the process may be
repeated with the residual frequency, and further components possibly
so determined.

EXAMPLES

Example 1
The data (Table 1) for this example are taken from Tanaka [1962]
and relate to the frequency distribution of forkal length of Porgy caught
by the pair-trawl fishery of the East China Sea.
The graph (Figure 2) of logarithmic differences of class frequency
against the midpoint of the class (circles) shows four approximately
straight regions with negative slope, indicating four distinct components.
The presence of a further component between the last two of these,
but substantially overlapped by them, is also suggested (the solid
points, and the line through them, are not yet available at this stage
of the argument). The parameters of this component cannot be esti-
mated directly, and the approach indicated in the previous section
must be adopted.
After matching straight lines with each of the approximately straight
regions mentioned above, we get
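$h = 1$, $b = 10$, $d = 200 \times \log_{10} e = 86.858$
$\hat\lambda_1 = 10.53$, $\hat\lambda_2 = 14.78$, $\hat\lambda_3 = 19.36$, $\hat\lambda_5 = 26.12$
$\theta_1 = 85.25^\circ$, $\theta_2 = 81.25^\circ$, $\theta_3 = 73.0^\circ$, $\theta_5 = 75.5^\circ$.
Hence, from (1) and (2),
$\hat\mu_1 = 11.03$, $\hat\mu_2 = 15.28$, $\hat\mu_3 = 19.86$, $\hat\mu_5 = 26.62$
$\hat\sigma_1 = .81$, $\hat\sigma_2 = 1.13$, $\hat\sigma_3 = 1.60$, $\hat\sigma_5 = 1.47$.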
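
The matching of a line by eye can be imitated numerically. The sketch below swaps in an ordinary least-squares fit for the translucent-paper matching described in the paper; this is a substitute technique, not the author's procedure. It works with natural logarithms and unit scales ($b = d = 1$), and the choice of the three points at mid-points 9.5 to 11.5 as the first straight region is an assumption made for illustration. Frequencies are those of Table 1.

```python
import math

def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, value) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxv = sum((x - mean_x) * (v - mean_v) for x, v in points)
    slope = sxv / sxx
    return slope, mean_v - slope * mean_x

def component_from_fit(slope, intercept, h):
    """Mean and s.d. from a line fitted to natural-log differences."""
    x_intercept = -intercept / slope
    # Equation (2) with b = d = 1, so that cot(theta) = -1 / slope.
    variance = -h / slope - h * h / 12.0
    return x_intercept + h / 2.0, math.sqrt(variance)

# Assumed first straight region: differences at mid-points 9.5, 10.5 and 11.5.
mids = [9.5, 10.5, 11.5]
freq = [509, 2240, 2341, 623]
points = [(mids[i], math.log(freq[i + 1] / freq[i])) for i in range(3)]
slope, intercept = fit_line(points)
mean1, sd1 = component_from_fit(slope, intercept, h=1.0)
```

An unweighted fit treats all points alike, whereas the paper recommends fitting more closely where the frequency is large; the values obtained this way should nevertheless land near the graphical estimates for the first component.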

TABLE 1
EXAMPLE 1: FREQUENCY DISTRIBUTION OF FORKAL LENGTH OF PORGIES

Class range   Mid-point   Observed frequency ($y$)   $\log_{10} y$   $\Delta \log_{10} y$
  9-10           9.5              509                  2.707           .643
 10-11          10.5             2240                  3.350           .019
 11-12          11.5             2341                  3.369          -.575
 12-13          12.5              623                  2.794          -.116
 13-14          13.5              476                  2.678           .412
 14-15          14.5             1230                  3.090           .068
 15-16          15.5             1439                  3.158          -.194
 16-17          16.5              921                  2.964          -.313
 17-18          17.5              448                  2.651           .058
 18-19          18.5              512                  2.709           .148
 19-20          19.5              719                  2.857          -.029
 20-21          20.5              673                  2.828          -.180
 21-22          21.5              445                  2.648          -.115
 22-23          22.5              341                  2.533          -.042
 23-24          23.5              310                  2.491          -.133
 24-25          24.5              228                  2.358          -.133
 25-26          25.5              168                  2.225          -.079
 26-27          26.5              140                  2.146          -.089
 27-28          27.5              114                  2.057          -.251
 28-29          28.5               64                  1.806          -.464
 29-30          29.5               22                  1.342

For simplicity, an estimate of the total frequency of each component
was found by using formula (9), but considering only two classes near
the centre of the straight region. Thus,
$$\hat N_1 = \frac{y(10.5) + y(11.5)}{\hat P_1(10.5) + \hat P_1(11.5)} = 5811;$$
$$\hat N_2 = \frac{y(14.5) + y(15.5)}{\hat P_2(14.5) + \hat P_2(15.5)} = 4381;$$
$$\hat N_3 = \frac{y(19.5) + y(20.5)}{\hat P_3(19.5) + \hat P_3(20.5)} = 2984;$$
$$\hat N_5 = \frac{y(26.5) + y(27.5)}{\hat P_5(26.5) + \hat P_5(27.5)} = 516.$$
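
The two-class shortcut can be reproduced directly from equations (3) and (9). In the sketch below the function names are hypothetical, the frequencies are taken from Table 1, and the s.d. values 0.80 and 1.60 are assumed inputs (the first is the value equation (2) gives before rounding up, to two decimals).

```python
import math

def cdf(z):
    """Distribution function of a standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def class_prob(x, mean, sd, h=1.0):
    """Equation (3) for the class with mid-point x."""
    return cdf((x + h / 2 - mean) / sd) - cdf((x - h / 2 - mean) / sd)

def total_from_central_classes(freqs, mids, mean, sd):
    """Equation (9) restricted to the classes listed in mids."""
    return sum(freqs) / sum(class_prob(x, mean, sd) for x in mids)

# Assumed inputs: means from equation (1); s.d. 0.80 and 1.60 for this sketch.
n1 = total_from_central_classes([2240, 2341], [10.5, 11.5], mean=11.03, sd=0.80)
n3 = total_from_central_classes([719, 673], [19.5, 20.5], mean=19.86, sd=1.60)
```

Under these assumptions the two totals should fall within one unit of the values quoted for the first and third components.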
The contributions of the third and fifth components are now sub-
tracted from the observed frequencies in the intermediate region
(Table 2).

FIGURE 2
EXAMPLE 1: GRAPH OF LOGARITHMIC DIFFERENCES OF THE CLASS-FREQUENCIES
AGAINST THE MID-POINTS OF THE CLASSES
(Horizontal axis: mid-point of class.)

TABLE 2
EXAMPLE 1: CALCULATION OF RESIDUAL FREQUENCIES IN REGION BETWEEN
THIRD AND FIFTH COMPONENTS

$x$     $y$    $\hat P_3$   $\hat P_5$   $y_R = y - \hat N_3 \hat P_3 - \hat N_5 \hat P_5$   $\log_{10} y_R$   $\Delta \log_{10} y_R$
22.5    341    .06568       .00606       142                                                 2.1523            .2169
23.5    310    .02002       .03045       234                                                 2.3692           -.1517
24.5    228    .00417       .09788       165                                                 2.2175           -.4251
25.5    168    .00060       .20139        62                                                 1.7924
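
The residual column of Table 2 is simply the observed frequency minus the expected contributions of the components already fitted. A minimal sketch, with a hypothetical function name and the Table 2 inputs:

```python
def residual_frequencies(y, fitted):
    """Subtract the expected frequencies N_i * P_i of already-fitted components."""
    residual = list(y)
    for total, probs in fitted:
        residual = [r - total * p for r, p in zip(residual, probs)]
    return residual

# Classes 22.5 to 25.5 of example 1, with totals 2984 and 516 for components 3 and 5.
y = [341, 310, 228, 168]
p3 = [0.06568, 0.02002, 0.00417, 0.00060]
p5 = [0.00606, 0.03045, 0.09788, 0.20139]
y_r = residual_frequencies(y, [(2984, p3), (516, p5)])
```

The unrounded residuals should agree with the tabulated $y_R$ to within one unit; the same function applies to the general check described under "Some special cases", where a non-negligible residual signals a further component.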

The new graph (solid points in Figure 2) shows an approximately
straight region with negative slope, clearly pointing out the intermediate
component. For this component,
$\hat\lambda_4 = 23.12$, $\theta_4 = 82.0^\circ$.
Hence
$\hat\mu_4 = 23.62$, $\hat\sigma_4 = 1.07$,
$$\hat N_4 = \frac{y_R(23.5) + y_R(24.5)}{\hat P_4(23.5) + \hat P_4(24.5)} = 639.$$
Finally, from (11),
$\hat p_1 = .4065$, $\hat p_2 = .3067$, $\hat p_3 = .2087$, $\hat p_4 = .0420$, $\hat p_5 = .0361$.


The results obtained by the present method, without using trial
and error, are in close agreement with those obtained by Tanaka [1962]
using other methods involving trial and error to get improved results
(Table 3).
Example 2
In example 1 the method was applied to real data on a fish popula-
tion. Here it is applied to a known mixture of Gaussian distributions
(Table 4) with adequate separation of the components.
The distribution has three distinct modes corresponding to the three
components. The troughs between the modes, the corresponding points
of inflexion of the ogive as well as the points of inflexion of the probit
plot (Figure 3) yield proportions of mixture close to the actual values.
The graph of logarithmic difference of frequency against the mid-
point of the class (Figure 4) shows three approximately straight regions
with negative slope, indicating the presence of the three components.

TABLE 3
EXAMPLE 1: COMPARISON OF THE RESULTS OBTAINED BY FOUR METHODS
A. BUCHANAN-WOLLASTON  B. CASSIE  C. TANAKA  D. AUTHOR

                                     Components
Parameter                  Method    1        2        3        4        5

Mean (cm.)                   A       11.05    15.32    19.85    23.58    26.82
                             B       11.02    15.33    19.85    23.46    26.92
                             C       10.99    15.26    19.84    23.50    26.82
                             D       11.03    15.28    19.86    23.62    26.62

Standard deviation (cm.)     A       .844     1.161    1.412    1.212    1.443
                             B       .76      1.15     1.32     1.29     1.54
                             C       .8       1.2      1.4      1.2      1.4
                             D       .81      1.13     1.60     1.07     1.47

Proportions of mixture       A       .4072    .3110    .1860    .0642    .0316
                             B       .4049    .3164    .1788    .0693    .0307
                             C       .4007    .3194    .1873    .0598    .0328
                             D       .4065    .3067    .2087    .0420    .0361

TABLE 4
EXAMPLE 2: ARTIFICIAL MIXTURE OF THREE GAUSSIAN DISTRIBUTIONS

Class-range (cm.)   Mid-point ($x$)   Frequency ($y$)   $\log_e y$   $\Delta \log_e y$
  8-9                 8.5                 31            3.4340        2.84
  9-10                9.5                532            6.2766        1.42
 10-11               10.5               2198            7.6953         .04
 11-12               11.5               2297            7.7394       -1.21
 12-13               12.5                685            6.5294        -.33
 13-14               13.5                494            6.2025         .88
 14-15               14.5               1188            7.0800         .22
 15-16               15.5               1479            7.2991        -.46
 16-17               16.5                938            6.8438        -.66
 17-18               17.5                486            6.1882         .10
 18-19               18.5                537            6.2860         .27
 19-20               19.5                702            6.5539        -.06
 20-21               20.5                664            6.4983        -.43
 21-22               21.5                431            6.0661        -.81
 22-23               22.5                192            5.2575       -1.18
 23-24               23.5                 59            4.0775       -1.59
 24-25               24.5                 12            2.4849       -1.79
 25-26               25.5                  2             .6932
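
An artificial mixture of this kind can be generated from the formula $Y = \sum N_i P_i$. The sketch below is an illustration under stated assumptions, not a record of how the table was actually built: the total 12927 is simply the sum of the Table 4 frequencies, and the means, s.d.s and proportions are the unbracketed entries of Table 5. The function names are hypothetical.

```python
import math

def cdf(z):
    """Distribution function of a standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_frequency(x, components, h=1.0):
    """Y(x) = sum of N_i * P_i(x) over components given as (N_i, mean_i, sd_i)."""
    return sum(
        n * (cdf((x + h / 2 - m) / s) - cdf((x - h / 2 - m) / s))
        for n, m, s in components
    )

# Assumed inputs (see the lead-in): total frequency and Table 5 parameters.
total = 12927
components = [
    (0.4404 * total, 11.03, 0.80),
    (0.3334 * total, 15.28, 1.12),
    (0.2262 * total, 19.86, 1.60),
]
# Mid-points 8.5, 9.5, ..., 25.5 mapped to their expected frequencies.
table = {x / 2: expected_frequency(x / 2, components) for x in range(17, 52, 2)}
```

With these assumed inputs the expected frequencies near the three modes should agree with the tabulated ones to within a unit or two.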

FIGURE 3
EXAMPLE 2: PROBIT PLOT OF THE FREQUENCY DISTRIBUTION
(Horizontal axis: cumulative frequency, probit scale.)
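Here
$h = 1$, $b = 50$, $d = 10$,
$\hat\lambda_1 = 10.54$, $\hat\lambda_2 = 14.81$, $\hat\lambda_3 = 19.35$,
$\theta_1 = 82.0^\circ$, $\theta_2 = 74.0^\circ$, $\theta_3 = 62.0^\circ$.
Hence,
$\hat\mu_1 = 11.04$, $\hat\mu_2 = 15.31$, $\hat\mu_3 = 19.85$
$\hat\sigma_1 = .78$, $\hat\sigma_2 = 1.16$, $\hat\sigma_3 = 1.60$
$$\hat N_1 = \frac{y(10.5) + y(11.5)}{\hat P_1(10.5) + \hat P_1(11.5)};$$
$$\hat N_2 = \frac{y(14.5) + y(15.5)}{\hat P_2(14.5) + \hat P_2(15.5)} = 4496;$$
$$\hat N_3 = \frac{y(19.5) + y(20.5)}{\hat P_3(19.5) + \hat P_3(20.5)} = 2930;$$
$\hat p_1 = .4311$, $\hat p_2 = .3444$, $\hat p_3 = .2245$.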

FIGURE 4
EXAMPLE 2: GRAPH OF LOGARITHMIC DIFFERENCES OF THE CLASS FREQUENCIES
AGAINST THE MID-POINTS OF THE CLASSES
(Horizontal axis: mid-point of class.)

TABLE 5
EXAMPLE 2: COMPARISON OF THE RESULTS OBTAINED BY AUTHOR'S METHOD
WITH THE VALUES USED TO CONSTRUCT THE 'DATA' (IN PARENTHESES)

                                  Component
Parameters                  1          2          3

Mean                        11.03      15.28      19.86
                           (11.04)    (15.31)    (19.85)

Standard deviation           .80       1.12       1.60
                            (.78)     (1.16)     (1.60)

Proportions of mixture      .4404      .3334      .2262
                           (.4311)    (.3444)    (.2245)

Table 5 compares the values of the parameters with the actual
values (in parentheses).

DISCUSSION
A satisfactory practical solution of the problem under investiga-
tion continues to elude mathematical statisticians. Although the
problem admits a neat theoretical solution, very great difficulties would
be encountered in the practical application of the theoretical results.
The methods so far adopted by fishery workers, as well as that presented
in this paper, are all approximate in the sense that they are applicable
only when the components are adequately separated.
The mathematical basis of the probability paper method is not
very clear. The other methods have a clear mathematical basis, are
effective even with considerable overlap of the components provided
the sample is sufficiently large, and when applicable require no correc-
tion for truncation of the components.
The use of differences, as Tanaka [1962] remarked, involves the
danger of magnifying errors in the frequency distribution. This is the
reason why numerical differentiation is generally viewed with much
concern (Nielson [1956]). It may, however, be recalled that Fisher
[1950] used logarithmic differences in place of relative rates for fitting
a logistic curve, and that a similar approach by Bhattacharya [1964-65]
has been found to be quite satisfactory for fitting a more general class
of growth curves which includes the logistic curve as a particular case.
In the present case, the differences are used directly rather than as
substitutes for differential coefficients. Hence, in view of the simplifying
assumptions already made, the use of first differences does not seem
to be objectionable provided due care is taken with small frequencies.
The advantage of a linear transform in any applied research can
hardly be overemphasized. In the present case it has the great ad-
vantage that it reduces subjective elements to a minimum and is
certainly the quickest and simplest of all the existing methods.
The assumption that the class-range should be small is important,
and the sample should be sufficiently large so that the class frequencies
do not become very small in the regions of interest. This point may
be taken into consideration at the stage of collection and compilation
of data. If it is felt that the original class-width is too large for the
method to be applicable, it may sometimes be useful to divide the
original class into an odd number of subclasses and work with the fre-
quency of the central subclass, which may be estimated by some
smoothing formula such as King's formulae (Willers [1948]). This would
require a distinction between the class-width and the class interval.

ACKNOWLEDGEMENTS
The author wishes to express his gratitude to Dr. B. S. Bhimachar,
Director of the Institute, for his constant encouragement during the
course of the study, to Shri V. R. Pantulu for inspiring the study and
his constant help and to Prof. H. K. Nandy of the University of Calcutta
for his helpful criticism and valuable advice during the progress of the
work. Thanks are also due to Shri P. Datta for his useful suggestions
in connection with the study.

REFERENCES
Bhattacharya, C. G. [1966]. Fitting a class of growth curves. Sankhya B 28, 1-10.
Bodewig, E. [1956]. Matrix calculus. 1st Edn. Amsterdam: North Holland Publ. Co.
Buchanan-Wollaston, H. G. and Hodgeson, W. C. [1929]. A new method of treating frequency curves in fishery statistics, with some results. J. Cons. 4, 207-25.
Cassie, R. M. [1954]. Some uses of probability paper for the graphical analysis of polymodal frequency distributions. Aust. J. Mar. Freshw. Res. 5, 513-22.
Fisher, R. A. [1950]. Statistical methods for research workers. 11th Edn. Edinburgh: Oliver and Boyd.
Fisher, R. A. and Yates, F. [1963]. Statistical tables for biological, agricultural and medical research. 6th Edn. London: Oliver and Boyd.
Gottschalk, V. H. [1948]. Symmetrical bimodal frequency curves. J. Franklin Inst. 245, 245-52.
Harding, J. F. [1949]. The use of probability paper for the graphical analysis of polymodal frequency distributions. J. Mar. biol. Ass. U. K. 28, 141-53.
Nielson, K. L. [1956]. Methods in numerical analysis. 1st Edn. New York: The Macmillan Company.
Oka, M. [1954]. Ecological studies on the kidai by the statistical method II. On the growth of kidai (Taius tumifrons). Bull. Fac. Fish. Nagasaki 2, 8-25.
Pearson, E. S. and Hartley, H. O. [1958]. Biometrika tables for statisticians. Vol. 1. 2nd Edn. London: Cambridge University Press.
Pearson, K. [1894]. Contribution to the mathematical theory of evolution. Phil. Trans. A 185, 71-110.
Pearson, K. and Lee, A. [1908-09]. On the generalized probable error in multiple normal correlation. Biometrika 6, 59-68.
Pearson, K. [1915]. On the problem of sexing osteometric material. Biometrika 10, 479-87.
Preston, E. J. [1953]. A graphical method for analysis of statistical distributions into normal components. Biometrika 40, 460-64.
Rao, C. R. [1948]. The utilisation of multiple measurements in problems of biological classification. J. R. Statist. Soc. B 10, 159-93.
Tanaka, S. [1962]. A method of analysing of polymodal frequency distribution and its application to the length distribution of the Porgy, Taius tumifrons (J. and S.). J. Fish. Res. Bd. Can. 19, 1143-59.
Willers, F. A. [1948]. Practical analysis: Graphical and numerical methods. Tr. by Robert T. Beyer. 1st Edn. New York: Dover Publications.

APPENDIX A
THE UNDERLYING MATHEMATICAL ASSUMPTIONS

Let the frequency function be a mixture of $k$ Gaussian distributions
with parameters $(N_i, \mu_i, \sigma_i)$, $i = 1, \ldots, k$. We assume that the component
distributions are sufficiently separated for there to exist for
each component a sufficiently broad region where the effect of all other
components is comparatively negligible. We further assume that the
class-range is sufficiently small. Let $h$ denote the class interval and
$y$ denote the frequency in the class with $x$ as its mid-point. Then,
$$y = \int_{x-h/2}^{x+h/2} \sum_{i=1}^{k} \frac{N_i}{\sigma_i\sqrt{2\pi}}\, e^{-(v-\mu_i)^2/2\sigma_i^2}\, dv \approx \frac{N_r}{\sigma_r\sqrt{2\pi}} \int_{x-h/2}^{x+h/2} e^{-(v-\mu_r)^2/2\sigma_r^2}\, dv$$
in the region where the effect of all except the $r$th component is negligible.
Writing $v = x + \sigma_r u$, this becomes
$$y \approx N_r \int_{-h/2\sigma_r}^{h/2\sigma_r} Z(t_r + u)\, du = N_r \int_{-h/2\sigma_r}^{h/2\sigma_r} \sum_{s=0}^{\infty} \frac{Z^{(s)}(t_r)}{s!}\, u^s\, du = N_r \sum_{s=0}^{\infty} \frac{Z^{(s)}(t_r)}{s!} \int_{-h/2\sigma_r}^{h/2\sigma_r} u^s\, du$$
where $Z$ stands for the density function of a standard normal deviate
and $Z^{(s)}$ for its $s$th derivative, and $t_r = (x - \mu_r)/\sigma_r$.
Carrying out the integration w.r.t. $u$ and expressing $Z^{(s)}(t)$ as
product of $Z(t)$ and Hermite polynomial of the $s$th degree, we have
$$y \approx \frac{hN_r}{\sigma_r}\, Z(t_r) \left\{ 1 + \frac{h^2}{24\sigma_r^2}\,(t_r^2 - 1) \right\}$$
neglecting terms involving $h^5$ and higher powers of $h$.

Taking logarithms and neglecting terms involving $h^4$ and higher
powers of $h$,
$$\log y \approx \log\frac{hN_r}{\sigma_r\sqrt{2\pi}} - \frac{h^2}{24\sigma_r^2} - \frac{[\sigma_r^2 - (h^2/12)]}{2\sigma_r^2}\, t_r^2 .$$
Now,
$$\Delta t_r^2 = 2h(x - \mu_r + h/2)/\sigma_r^2$$
whence
$$\Delta \log y \approx -h(\sigma_r^2 - h^2/12)(x - \mu_r + h/2)/\sigma_r^4 .$$
This shows that the graph of $\Delta \log y$ against $x$ is a straight line
with negative slope equal to $-h[\sigma_r^2 - (h^2/12)]/\sigma_r^4$.
When plotting it may be necessary to choose different scales for
$x$ and $\Delta \log y$. If $b$ and $d$ denote the scales for $x$ and $\Delta \log y$ respectively,
the slope becomes $-dh(\sigma_r^2 - h^2/12)/b\sigma_r^4$.
Let $\theta_r$ be the acute angle made by the line with the negative direction
of the axis of $x$. Then, if $a = b \tan\theta_r/dh$,
$$a\sigma_r^4 - \sigma_r^2 + \frac{h^2}{12} = 0$$
i.e.,
$$\sigma_r^2 = \frac{1 \pm \sqrt{1 - (ah^2/3)}}{2a} \approx \frac{1 \pm [1 - (ah^2/6)]}{2a}$$
neglecting terms involving $h^4$ and higher powers of $h$: thus
$$\sigma_r^2 \approx h^2/12 \quad\text{or}\quad 1/a - h^2/12 .$$
The solution $\sigma_r^2 \approx h^2/12$ has obviously to be rejected, and hence we find
$$\sigma_r^2 \approx 1/a - h^2/12 = dh \cot\theta_r/b - h^2/12 .$$
Let $\lambda_r$ be the value of $x$ corresponding to which $\Delta \log y$ is zero: then
from the expression for $\Delta \log y$ given earlier,
$$\lambda_r - \mu_r + h/2 = 0, \qquad \mu_r = \lambda_r + h/2 .$$
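
The quality of the straight-line approximation can be checked numerically for a single grouped Gaussian component. The sketch below compares the exact difference of log class probabilities with the linear expression derived above; the function names and the test case (mean 0, s.d. 1, class interval 0.5) are made up for illustration.

```python
import math

def cdf(z):
    """Distribution function of a standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def exact_log_difference(x, mean, sd, h):
    """ln y(x + h) - ln y(x) for a single grouped Gaussian component."""
    def log_class_prob(mid):
        return math.log(cdf((mid + h / 2 - mean) / sd) - cdf((mid - h / 2 - mean) / sd))
    return log_class_prob(x + h) - log_class_prob(x)

def linear_log_difference(x, mean, sd, h):
    """Straight-line approximation derived in this appendix."""
    var = sd * sd
    return -h * (var - h * h / 12.0) * (x - mean + h / 2.0) / (var * var)

# Made-up test case: mean 0, s.d. 1, class interval 0.5.
errors = [
    abs(exact_log_difference(x, 0.0, 1.0, 0.5) - linear_log_difference(x, 0.0, 1.0, 0.5))
    for x in (-0.75, 0.25, 1.25)
]
```

For a class interval of half a standard deviation the discrepancies in this test case stay small compared with the differences themselves, which is in line with the assumption that the class-range is sufficiently small.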

APPENDIX B
EXAMPLE

The various methods suggested in §2 for estimation of the proportions
of mixture will be illustrated on the data of example 1. Since
these data do not conform to the assumptions of Appendix A, with the
result that the 4th component cannot be isolated before the total
frequencies of the other components are determined, we restrict the
illustration to the first three components, and consider only the classes
in the range 9-22, in which we may neglect the effect of the 4th and
5th components. As suggested in §2, the systems of linear equations
encountered in Methods (ia) and (ib) are conveniently solved by a
common iteration procedure called iteration II by Bodewig [1956].
This method starts with an initial solution, obtained by ignoring all
except the diagonal terms of the coefficient matrix: the $l$th approximation
is then obtained by adjusting the right hand sides for the off-diagonal
terms, calculated using the $(l - 1)$th approximation.
Preliminary calculations are shown in Table 6: values of $\hat\mu_i$ and $\hat\sigma_i$
are from Table 3.

TABLE 6
COMPARISON OF METHODS (i)-(iv) (APPENDIX B): PRELIMINARY CALCULATIONS

Class range   Mid-point of   $\hat P_1$      $\hat P_2$      $\hat P_3$
(cm.)         class ($x$)    $\times 10^5$   $\times 10^5$   $\times 10^5$   $y$     $\log_{10} y$   $x - \hat\mu_1$   $x - \hat\mu_2$   $x - \hat\mu_3$

  8-9           8.5             550
  9-10          9.5            9338                                           509     2.707           -1.53
 10-11         10.5           38608              7                           2240     3.350            -.53
 11-12         11.5           40230            163                           2341     3.369             .47
 12-13         12.5           10576           1919              1             623     2.794            1.47
 13-14         13.5             680          10565             11             476     2.678                              -1.78
 14-15         14.5              10          27475            107            1230     3.090                               -.78
 15-16         15.5                          33853            673            1439     3.158                                .22
 16-17         16.5                          19787           2901             921     2.964                               1.22
 17-18         17.5                           5473           8559             448     2.651
 18-19         18.5                            713          17294             512     2.709                                                 -1.36
 19-20         19.5                             44          23940             719     2.857                                                  -.36
 20-21         20.5                              1          22706             673     2.828                                                   .64
 21-22         21.5                                         14755             445     2.648                                                  1.64

Method (ia) Summing over classes in the range 9-22, equations (4)
become
$$\begin{aligned}
.330854308\,\hat N_1 + .003458204\,\hat N_2 + .000001913\,\hat N_3 &= 1923.38220 \\
.003458204\,\hat N_1 + .243821904\,\hat N_2 + .014349321\,\hat N_3 &= 1102.03553 \\
.000001913\,\hat N_1 + .014349321\,\hat N_2 + .168761528\,\hat N_3 &= 555.26670
\end{aligned}$$
The iterative solution, with the results of the successive iterations
arranged columnwise, is

$\hat N_1$    5813.38    5766.12    5769.01    5768.76    5768.78
$\hat N_2$    4519.83    4243.75    4267.04    4265.62    4265.74
$\hat N_3$    3290.24    2905.87    2929.35    2927.36    2927.49

Hence
$\hat N_1 = 5769$; $\hat N_2 = 4266$; $\hat N_3 = 2927$; $\sum \hat N_i = 12962$,
and from equation (11),
$\hat p_1 = .4451$, $\hat p_2 = .3291$, $\hat p_3 = .2258$.
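
The iteration described at the start of this appendix (start from the diagonal-only solution, then repeatedly correct the right-hand sides for the off-diagonal terms using the previous approximation) is the scheme nowadays usually called Jacobi iteration. A short sketch, with a hypothetical function name and the method (ia) coefficients as input:

```python
def jacobi(matrix, rhs, sweeps):
    """Return successive approximations, starting from the diagonal-only solution."""
    k = len(rhs)
    current = [rhs[i] / matrix[i][i] for i in range(k)]
    history = [current]
    for _ in range(sweeps):
        # Every unknown is updated from the previous approximation only.
        current = [
            (rhs[i] - sum(matrix[i][j] * current[j] for j in range(k) if j != i))
            / matrix[i][i]
            for i in range(k)
        ]
        history.append(current)
    return history

# Coefficients and right-hand sides of the method (ia) equations.
A = [
    [0.330854308, 0.003458204, 0.000001913],
    [0.003458204, 0.243821904, 0.014349321],
    [0.000001913, 0.014349321, 0.168761528],
]
b = [1923.38220, 1102.03553, 555.26670]
history = jacobi(A, b, sweeps=4)
totals = [round(v) for v in history[-1]]
proportions = [t / sum(totals) for t in totals]   # equation (11)
```

The five entries of `history` correspond to the five columns of the iterative solution, and the proportions follow from equation (11) applied to the rounded totals.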

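Method (ib) Summing over classes in the ranges 9-13, 13-17 and
18-22, corresponding to the 1st, 2nd and 3rd line respectively, equations
(5) become
$$\begin{aligned}
.98752\,\hat N_1 + .02089\,\hat N_2 + .00001\,\hat N_3 &= 5713 \\
.00690\,\hat N_1 + .91680\,\hat N_2 + .03692\,\hat N_3 &= 4066 \\
0\,\hat N_1 + .00758\,\hat N_2 + .78695\,\hat N_3 &= 2349.
\end{aligned}$$
The solution is
$\hat N_1 = 5695$; $\hat N_2 = 4274$; $\hat N_3 = 2944$; $\sum \hat N_i = 12913$
$\hat p_1 = .4410$; $\hat p_2 = .3310$; $\hat p_3 = .2280$.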

TABLE 7
ESTIMATION OF THE $N_i$ BY METHOD (iii)

                 Method (iiia) using equation (8)                          Method (iiib) using equation (9)
Component ($i$)  $\sum \hat P_i^2$   $\sum y \hat P_i$   $\hat N_i$   $\hat p_i$     $\sum \hat P_i$   $\sum y$   $\hat N_i$   $\hat p_i$
1                .330808058          1920.02240          5804         .4401          .98752            5713       5785         .4381
2                .240404583          1057.61484          4399         .3336          .91680            4066       4435         .3359
3                .160547850           479.14501          2984         .2263          .78695            2349       2985         .2260

Method (ii) We consider the classes in the ranges 9-17 and 13-22 for
estimating $\hat p_2/\hat p_1$ and $\hat p_3/\hat p_2$ respectively. For $\hat p_2/\hat p_1$ equations (6) become
$$\begin{aligned}
.330854308\,\hat N_{1(2)} + .003458204\,\hat N_{2(1)} &= 1923.38220 \\
.003458204\,\hat N_{1(2)} + .240775501\,\hat N_{2(1)} &= 1073.54284.
\end{aligned}$$
The solution is
$\hat N_{1(2)} = 5768$, $\hat N_{2(1)} = 4376$,
whence, from equation (7),
$\hat p_2/\hat p_1 = \hat N_{2(1)}/\hat N_{1(2)} = .7587$.
Similarly $\hat p_3/\hat p_2 = .6820$, and finally,
$\hat p_1 = .4394$, $\hat p_2 = .3333$, $\hat p_3 = .2273$.
Methods (iii) and (iv) The classes used are the same as for method (ib).
Computations using equations (8) and (9) are presented in Table 7,
and those using equation (10) are presented in Table 8: the estimated
$\hat p_i$ obtained are very similar among themselves, and to those obtained
by methods (i) and (ii).

[TABLE 8: computations for method (iv) using equation (10); the tabulated values are illegible in the source.]
