Genetic-Algorithm-Based-Wavelength-Selection-for

This study presents an enhanced genetic algorithm (GA)-based wavelength selection procedure aimed at optimizing near-infrared wavelengths for glucose determination in biological matrixes. The research demonstrates that using a small number of initial wavelengths can significantly reduce the number of wavelengths needed for partial least-squares (PLS) calibration models while maintaining performance. The effects of spectral resolution on wavelength selection and calibration model performance are also explored, revealing that lower resolution can further decrease the number of wavelengths selected without compromising model accuracy.

Anal. Chem. 1998, 70, 4472-4479

Articles

Genetic Algorithm-Based Wavelength Selection for the Near-Infrared Determination of Glucose in Biological Matrixes: Initialization Strategies and Effects of Spectral Resolution
Qing Ding and Gary W. Small*

Center for Intelligent Chemical Instrumentation, Department of Chemistry & Biochemistry, Ohio University,
Athens, Ohio 45701

Mark A. Arnold

Department of Chemistry, Iowa Advanced Technology Laboratories, University of Iowa, Iowa City, Iowa 52242

An improved genetic algorithm (GA)-based wavelength selection procedure is developed to optimize both the near-infrared wavelengths used and the number of latent variables employed in building partial least-squares (PLS) calibration models. This GA-based wavelength selection algorithm is applied to the determination of glucose in two different biological matrixes. With random selection of a small number of initial wavelengths, a dramatic reduction in the number of wavelengths required for building the PLS calibration models is observed. The fitness function used to guide the GA, the method of recombination used, and the effect of spectral resolution on the wavelength selection are also studied. In the resolution study, the original data with a point spacing of 2 cm⁻¹ are deresolved to 4-, 8-, and 16-cm⁻¹ point spacings by truncating the collected interferograms before applying the Fourier processing step. The use of lower resolution spectra is found to reduce further the number of final wavelengths selected by the GA, and the performance of the optimal calibration models obtained with the original spectra is maintained with the lower resolution spectra of both 4- and 8-cm⁻¹ point spacing. Degradation in performance is observed with the spectra computed with a point spacing of 16 cm⁻¹, however.

Multivariate calibration models are widely used in near-infrared (near-IR) spectroscopy to allow analyte spectral information to be extracted from overlapping spectral bands arising from the sample matrix. Wavelength selection methods are feature (variable) selection techniques that allow calibration models to be constructed with a subset of spectral points instead of with full spectra. This allows the wavelengths representing relevant spectral information to be selected, while points dominated by noise or other extraneous sources of variation are not included in the calibration model. The resulting calibration models may exhibit improved performance relative to those based on full spectra (1, 2).

Various wavelength selection algorithms and criteria for use with multivariate calibration have been reported (3-17). Partial least-squares (PLS) regression is widely used for processing full spectra because of its ability to extract analyte information from the many sources of variance within the spectral data matrix. Wavelength selection methods have traditionally not been used with PLS regression models because of this ability to decompose the data matrix in a manner biased toward the isolation of analyte-dependent information. However, recent studies have indicated that the performance of PLS models can be improved through wavelength selection (13-17). A mathematical justification of the theory that wavelength selection can enhance the performance of PLS models was also reported recently (17).

(1) Brown, C. W.; Lynch, P. F.; Obremski, R. J.; Lavery, D. S. Anal. Chem. 1982, 54, 1472-1479.
(2) Rossi, D. T.; Pardue, H. L. Anal. Chim. Acta 1985, 175, 153-161.
(3) Kalivas, J. H.; Roberts, N.; Sutter, J. M. Anal. Chem. 1989, 61, 2024-2030.
(4) Liang, Y.; Xie, Y.; Yu, R. Anal. Chim. Acta 1989, 222, 347-357.
(5) Sasaki, K.; Kawata, S.; Minami, S. Appl. Spectrosc. 1986, 40, 185-190.
(6) Salamin, P. A.; Bartels, H.; Forster, P. Chemom. Intell. Lab. Syst. 1991, 11, 57-62.
(7) Brown, P. J. J. Chemom. 1993, 7, 255-265.
(8) Brown, P. J. J. Chemom. 1992, 6, 151-161.
(9) Lucasius, C. B.; Kateman, G. TrAC, Trends Anal. Chem. 1991, 10, 254-261.
(10) Lucasius, C. B.; Beckers, M. L. M.; Kateman, G. Anal. Chim. Acta 1994, 286, 135-153.
(11) Jouan-Rimbaud, D.; Massart, D.; Leardi, R.; Noord, O. D. Anal. Chem. 1995, 67, 4295-4301.
(12) Hörchner, U.; Kalivas, J. H. Anal. Chim. Acta 1995, 311, 1-13.
(13) Rimbaud, D. J.; Walczak, B.; Massart, D.; Last, I. R.; Prebble, K. A. Anal. Chim. Acta 1995, 304, 285-295.
(14) Navarro-Villoslada, F.; Perez-Arribas, L. V.; Leon-Gonzalez, M. E.; Polo-Diez, L. M. Anal. Chim. Acta 1995, 313, 93-101.
(15) Bangalore, A. S.; Shaffer, R. E.; Small, G. W.; Arnold, M. A. Anal. Chem. 1996, 68, 4200-4212.
(16) McShane, M. J.; Cote, G. L.; Spiegelman, C. H. Appl. Spectrosc. 1997, 51, 1559.
(17) Spiegelman, C. H.; McShane, M. J.; Goetz, M. J.; Motamedi, M.; Yue, Q. L.; Cote, G. L. Anal. Chem. 1998, 70, 35-44.

4472 Analytical Chemistry, Vol. 70, No. 21, November 1, 1998 10.1021/ac980451q CCC: $15.00 © 1998 American Chemical Society
Published on Web 09/19/1998
To search for the optimal set of wavelengths, numerical optimization techniques such as genetic algorithms (GAs) (9-11) and simulated annealing (SA) (12) have been employed. These optimization methods are efficient algorithms for interrogating a large search space in which many combinations of wavelengths are possible.

One of the research interests of our laboratories is to develop techniques based on near-IR spectroscopy for the measurement of glucose in various biological matrixes (15, 18-25). As part of this work, a GA-based wavelength selection procedure for use with PLS regression has been successfully implemented to accomplish simultaneous optimization of the wavelengths selected and the number of latent variables employed in building a calibration model (15). In this paper, an enhanced GA-based wavelength selection procedure is developed through the investigation of strategies for initializing the GA in an optimal manner, further exploration of the configuration of the GA, and evaluation of the impact of spectral resolution on the wavelength optimization.

EXPERIMENTAL SECTION

Instrumentation and Reagents. Two data sets were employed for this research. One focused on the analysis of glucose in human serum samples (serum data set) and was collected at the University of Iowa. The other focused on the analysis of glucose in an aqueous matrix of bovine serum albumin (BSA) and triacetin (GTB data set) and was collected at Ohio University. The BSA and triacetin were used for modeling proteins and triglycerides, respectively, in human blood. These two data sets were used in the previous study (15).

Spectra in the serum data set were collected with a Nicolet 740 Fourier transform spectrometer (Nicolet Instrument Corp., Madison, WI) configured with a 250-W tungsten-halogen source, CaF₂ beam splitter, and liquid nitrogen-cooled InSb detector. The near-IR spectral region of 5000-4000 cm⁻¹ was used. A K-band interference filter (Barr Associates, Westford, MA) was used to isolate this spectral region. The samples were placed in an Infrasil quartz cell with 2.5-mm path length. The temperature of the samples was controlled to 37.0 ± 0.2 °C by use of a water-jacketed cell holder.

The GTB data set was collected with a Digilab FTS-60A Fourier transform spectrometer (Bio-Rad, Cambridge, MA) configured with a 100-W tungsten-halogen source, CaF₂ beam splitter, and InSb detector cooled with liquid nitrogen. The data were also collected over the spectral region of 5000-4000 cm⁻¹, which was isolated by a K-band interference filter (Barr Associates). An Infrasil quartz cell with a path length of 2 mm was used, and the temperature was controlled to 37-38 °C with a water-jacketed cell holder.

Human serum samples were collected from patients at the University of Iowa Hospitals and Clinics. The glucose levels in the samples were determined with a conventional clinical glucose analyzer by the hospital clinical chemistry laboratory. The precision of such measurements is typically in the range of 0.3 mM (26). All serum samples were frozen until just before the collection of the spectra. The procedures used in collecting and handling the serum samples complied with approved ethical and safety standards at the University of Iowa. The 235 samples used for this study spanned a range of 3.2-31.9 mM glucose concentration.

The GTB samples were prepared in a pH 7.4, 0.1 M phosphate buffer solution. 5-Fluorouracil (0.044% w/w) was added as a preservative. Reagent-grade glucose, sodium phosphate salts, 5-fluorouracil, triacetin, and BSA were purchased from common suppliers. The reagent-grade water used for the preparation of the sample solutions was obtained from a Milli-Q Plus water purification system (Millipore, Inc., Bedford, MA). A factorial design was adopted for the concentration levels of glucose, BSA, and triacetin to minimize the correlation among these components. The data set consisted of samples prepared from all combinations of 10 levels of glucose (1, 3, 5, 7, 9, 11, 13, 15, 17, and 19 mM), four levels of BSA (50, 65, 80, and 95 g/L), and four levels of triacetin (1.4, 2.1, 2.8, and 3.5 g/L). A total of 160 (10 × 4 × 4) samples were prepared for the GTB data set.

Procedures. Double-sided interferograms of 16 384 points were collected for the serum data set. Two to four replicate interferograms were collected for each serum sample based on 256 coadded scans. The single-beam spectra were computed from the collected interferograms with software resident on the Nicolet 620 computer controlling the spectrometer. Triangular apodization and Mertz phase correction were used in Fourier processing the interferograms. The resulting spectra had a nominal point spacing of 2 cm⁻¹. The samples were measured in a randomized order with respect to glucose concentration. Spectra of a pH 7.3, 0.1 M phosphate buffer were acquired periodically for use as background spectra in computing spectra in absorbance units.

The procedure for collecting the GTB spectra was similar to that used for the serum data. However, single-sided interferograms of 16 384 points were collected based on 256 coadded scans. Again, single-beam spectra with a nominal point spacing of 2 cm⁻¹ were computed from the collected interferograms, and triangular apodization and Mertz phase correction were employed. The software used for the Fourier processing was resident on the Bio-Rad SPC-3200 computer controlling the spectrometer. Three replicate interferograms were collected for each sample. The data collection was also randomized with respect to glucose concentrations, and interferograms of phosphate buffer were collected periodically for use in computing spectra in absorbance units.

All data analysis was performed with a Silicon Graphics Indigo2 R10000 workstation (Silicon Graphics, Mountain View, CA) operating under Irix (version 6.2). The software used for the data analysis was written in Fortran 77. Subroutines used for multiple linear regression computations were obtained from the IMSL software package (IMSL, Inc., Houston, TX).

(18) Arnold, M. A.; Small, G. W. Anal. Chem. 1990, 62, 1457-1464.
(19) Marquardt, L. A.; Arnold, M. A.; Small, G. W. Anal. Chem. 1993, 65, 3271-3278.
(20) Small, G. W.; Arnold, M. A.; Marquardt, L. A. Anal. Chem. 1993, 65, 3279-3289.
(21) Hazen, K. H.; Arnold, M. A.; Small, G. W. Appl. Spectrosc. 1994, 48, 477-483.
(22) Shaffer, R. E.; Small, G. W.; Arnold, M. A. Anal. Chem. 1996, 68, 2663-2675.
(23) Pan, S.; Chung, H.; Arnold, M. A.; Small, G. W. Anal. Chem. 1996, 68, 1124-1135.
(24) Mattu, M. J.; Small, G. W.; Arnold, M. A. Anal. Chem. 1997, 69, 4695-4702.
(25) Ding, Q.; Small, G. W. In Fourier Transform Spectroscopy: 11th International Conference; de Haseth, J. A., Ed.; American Institute of Physics: Woodbury, NY, 1998; pp 264-267.
(26) Burmeister, J. J.; Arnold, M. A. Anal. Lett. 1995, 28, 581-592.
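The factorial design of the GTB data set can be checked with a few lines of Python. This is only an illustrative sketch: it uses the nominal concentration levels quoted in the text (the prepared samples deviate slightly from these), and the helper `pearson` is a name introduced here, not part of the original work. It shows that a full factorial layout yields 160 combinations and that the nominal component concentrations are mutually uncorrelated.

```python
from itertools import product

# Nominal concentration levels quoted in the text for the GTB data set.
glucose_mM = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
bsa_g_per_L = [50, 65, 80, 95]
triacetin_g_per_L = [1.4, 2.1, 2.8, 3.5]

# Full factorial design: every combination of the three components.
design = list(product(glucose_mM, bsa_g_per_L, triacetin_g_per_L))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


glucose, bsa, triacetin = zip(*design)
print(len(design))  # 10 x 4 x 4 combinations
print(abs(pearson(glucose, bsa)) < 1e-9)
print(abs(pearson(glucose, triacetin)) < 1e-9)
```

Because every glucose level is paired with every BSA and triacetin level, a model cannot exploit an accidental correlation between glucose and a matrix component.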

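Since the resolution study derives the 4-, 8-, and 16-cm⁻¹ spectra from the same interferograms by truncation, the relationship between interferogram length and point spacing is worth making concrete. The sketch below is an assumption-laden illustration rather than the authors' Fourier processing code: it assumes a fixed sampling interval, no zero filling, a centerburst located in the middle of a double-sided record, and a nominal spacing inversely proportional to the number of points retained. The function names are hypothetical.

```python
def truncate_double_sided(interferogram, n_keep):
    """Keep n_keep points centered on the middle of a double-sided record.

    Assumption: the centerburst lies at the midpoint of the interferogram.
    """
    start = (len(interferogram) - n_keep) // 2
    return interferogram[start:start + n_keep]


def nominal_spacing(n_points, ref_points=16384, ref_spacing=2.0):
    """Nominal spectral point spacing in cm^-1.

    Assumption: spacing scales inversely with the number of interferogram
    points retained, relative to the 16 384-point, 2 cm^-1 reference case.
    """
    return ref_spacing * ref_points / n_points


# Placeholder data standing in for a measured 16 384-point interferogram.
full = list(range(16384))

for n_keep in (16384, 8192, 4096, 2048):
    shortened = truncate_double_sided(full, n_keep)
    print(len(shortened), nominal_spacing(len(shortened)))
```

Halving the retained interferogram length doubles the nominal point spacing, which is how 2-cm⁻¹ data map onto the 4-, 8-, and 16-cm⁻¹ cases. Apodization and phase correction would still be applied to the truncated record before the transform.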


Spectra at reduced resolution were obtained by truncating the interferograms appropriately before the Fourier processing step. Through this procedure, spectra with nominal point spacings of 4, 8, and 16 cm⁻¹ were obtained for both data sets. The Fourier processing calculations for the spectra at reduced resolution were performed with original software implemented on the Silicon Graphics system. Triangular apodization and Mertz phase correction were also used in the computation of these spectra.

RESULTS AND DISCUSSION

The characteristics of the data sets employed in this study have been detailed in the previous paper (15). Overall, these two data sets have relatively low signal-to-noise ratios in terms of the analyte (glucose) absorption features, and the glucose spectral information is overwhelmed by the spectral features arising from the other constituents of the sample matrix. A typical near-IR absorbance spectrum of glucose in water at a concentration of 96 mM is shown in Figure 1A over the range of 4800-4200 cm⁻¹. In this region, glucose absorption bands are located near 4700, 4400, and 4300 cm⁻¹. The C-H combination band centered near 4400 cm⁻¹ was found most useful in modeling glucose concentrations in a previous study (18).

Figure 1B shows the spectra of three GTB samples that have the same glucose concentration of 19 mM. The three samples consisted of 79.8 g/L BSA/3.5 g/L triacetin, 95.0 g/L BSA/1.4 g/L triacetin, and 49.4 g/L BSA/2.1 g/L triacetin, respectively. Figure 1C shows the spectra of three serum samples from three different patients who happened to have the same blood glucose concentration of 17.0 mM. The other constituents analyzed in these samples were 59 g/L total protein, 131 mg/dL cholesterol, 0.65 g/L triglyceride, and 24 mg/dL urea; 71 g/L total protein, 187 mg/dL cholesterol, 1.85 g/L triglyceride, and 22 mg/dL urea; and 79 g/L protein, 294 mg/dL cholesterol, 3.33 g/L triglyceride, and 20 mg/dL urea. In Figure 1B and C, no glucose absorption features can be observed in the spectra of the GTB and serum samples. Instead, the other components (especially the proteins (15)) produce the dominant spectral features in the near-IR region. Also, although the samples have the same glucose concentrations, there are significant variations in the spectra because the samples have different concentrations of the other components in the matrix. The complexity of these data sets illustrates the challenge of determining glucose in biological matrixes by near-IR spectroscopy and prompts the need for suitable data analysis methods for use in extracting the glucose information from the spectra.

Figure 1. Near-IR absorbance spectra of (A) glucose at 96 mM, (B) three GTB samples with the glucose concentration at 19 mM, and (C) three human serum samples from three patients. The glucose concentration for each serum sample was 17.0 mM.

A GA-based wavelength selection method has been implemented in our laboratories to allow joint optimization of the wavelengths used and the number of PLS factors employed to build optimal calibration models (15). Three near-IR data sets were used in that work, corresponding to the measurement of glucose in the same serum and GTB data sets used here and a data set focusing on the determination of methyl isobutyl ketone (MIBK) in water. Of the three data sets, the MIBK measurement represented the simplest determination, while the GTB and serum data sets, respectively, were progressively more challenging.

In the previous study, although a significant reduction in the number of wavelengths used to build the calibration models was realized relative to the use of full spectra, several hundred wavelengths were still used in the optimal models built for the two glucose data sets. The purpose of the work reported here was to explore ways to improve the original GA-based wavelength selection procedure in an effort to decrease the number of wavelengths selected in building calibration models for challenging measurements such as the glucose determination.

Overview of GA-Based Wavelength Selection. GAs are efficient numerical optimization methods based on the principles of genetics and natural selection. Since efficient optimization is one of the key requirements in implementing wavelength selection, GA-based methods have become popular for selecting subsets of wavelengths for use in building multivariate calibration models (9-11, 15). Shaffer and Small have described the basic steps of implementing a GA-based optimization (27). These concepts will be summarized briefly here.

In a GA, the collection of variables whose values are to be optimized is termed a chromosome, and the individual variables are called genes. A chromosome represents a candidate solution to the optimization problem. In pursuit of the optimal chromosome, a GA operates simultaneously on a group of chromosomes called a population. The first population is generated by perturbing an initial chromosome that is either generated randomly or supplied by the user.

(27) Shaffer, R. E.; Small, G. W. Anal. Chem. 1997, 69, 236A-242A.
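Generating a first population by perturbing a starting chromosome can be sketched as follows. The chromosome layout (one binary gene per spectral point plus an integer gene for the number of PLS factors, perturbed by a scaled Gaussian deviate) follows the description in the paper; the flip probability, step size, lower bound of one factor, and all function names are assumptions made for illustration, since the text does not report those details.

```python
import random


def perturb(genes, n_factors, flip_prob, step_size, rng):
    """Return a perturbed copy of a chromosome.

    Each binary gene is flipped with probability flip_prob. The integer gene
    is shifted by a Gaussian deviate scaled by step_size and kept >= 1
    (the lower bound is an assumption).
    """
    new_genes = [1 - g if rng.random() < flip_prob else g for g in genes]
    new_factors = max(1, round(n_factors + step_size * rng.gauss(0.0, 1.0)))
    return new_genes, new_factors


def initial_population(genes, n_factors, size,
                       flip_prob=0.05, step_size=2.0, seed=0):
    """Build a population by repeatedly perturbing one starting chromosome."""
    rng = random.Random(seed)
    return [perturb(genes, n_factors, flip_prob, step_size, rng)
            for _ in range(size)]


# Hypothetical starting point: 519 spectral points, none selected, 15 factors.
start_genes = [0] * 519
population = initial_population(start_genes, n_factors=15, size=100)
print(len(population), sum(population[0][0]), population[0][1])
```

Seeding the generator makes a run repeatable, which matters when the same seeds are reused across experiments for comparison.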



After the first population is formed, the fitness of each of the individual chromosomes in the population is evaluated on the basis of a user-defined objective function (fitness function). To implement a GA successfully, the fitness function must be selected to encode the degree to which the settings of the variables in the chromosome are optimal.

The chromosomes with the best fitness values are selected to generate a new set of child chromosomes through the methods of recombination and mutation. The recombination approach that we have typically employed is termed single-point crossover. Given two selected parent chromosomes, a gene location on the chromosome is chosen randomly, and the values of all the genes up to that point are interchanged between the two parents to form two new child chromosomes. Mutation is applied to the child chromosomes and involves altering the gene values on a gene-by-gene basis. Whether or not mutation occurs for a given gene is governed by a user-specified mutation probability. Recombination and mutation introduce diversity into the child chromosomes while preserving the information carried by the parents.

The new population formed with the child chromosomes replaces the original, and the chromosomes with the best fitness values in the new population are again selected to reproduce through recombination and mutation. This procedure is an iterative evolutionary process in search of the chromosome with the highest fitness value. The formation of each new population represents one iteration of the algorithm and is termed a generation. The algorithm terminates after a fixed number of generations or when a chromosome with a user-specified level of fitness is found.

In our implementation of the GA, the values to be optimized were which of the individual spectral points to use as input variables in the PLS calculation and the number of the resulting PLS factors to use in constructing the calibration model. The chromosome consisted of a binary gene for each spectral point and an integer gene to store the number of PLS factors. The binary genes stored values of 1 or 0 indicating whether the corresponding spectral point was included in the PLS calculation. For the data with a nominal spectral point spacing of 2 cm⁻¹, there were 519 resolution elements between 5000 and 4000 cm⁻¹. Thus, in this case, the chromosome consisted of 519 + 1 = 520 genes. The order of the 519 genes was the same as that of the corresponding points in the spectrum (e.g., the first and second genes corresponded to 5000 and 4998 cm⁻¹, respectively).

The initial population was formed by randomly perturbing an initial starting chromosome. The perturbation of each binary gene involved changing the value of the gene from 1 to 0, or vice versa, according to an initial probability set by the user. The perturbation of the number of PLS factors was performed by scaling a Gaussian-distributed random deviate with a step size and adding the scaled value to the previous number of PLS factors used. These same perturbation steps were also used to mutate genes in child chromosomes formed through the recombination process.

To implement the GA-based wavelength selection, the data sets were partitioned randomly into a calibration set and a prediction set, as shown in Table 1. The spectra in the calibration set were used during the GA calculations, while those in the prediction set were withheld entirely from the optimization and used subsequently to assess the performance of the optimized calibration models. The replicate spectra of the samples were allocated together into the corresponding data subsets.

Table 1. Data Set Partitioning

                   no. of samples (spectra)
data set           serum        GTB
calibration set    188 (561)    120 (360)
prediction set      47 (140)     40 (120)
total              235 (701)    160 (480)

During the optimization, for each calculation of the fitness function, the calibration set was further divided randomly three times to produce three calibration subsets (80% of the calibration samples) and three monitoring sets (20% of the calibration samples). As before, the replicate spectra of the samples were allocated together into the data subsets. For the chromosome being evaluated, a calibration model was computed with the spectra in each calibration subset, and each resulting model was used to predict the glucose concentrations for the spectra in the corresponding monitoring set. This allowed the predictive ability of the model to be made part of the fitness evaluation. The use of multiple calibration/monitoring sets and repartitioning the data before each fitness calculation helped to keep individual samples from having undue influence on the fitness value.

The fitness function used in two previous studies (15, 22) was

(MSE + MSME + hw)^{-1}    (1)

where MSE is the mean squared error in concentration of spectra in the calibration subset, MSME is the mean squared error in concentration of spectra in the monitoring set, h is the number of PLS factors employed in the calibration model, and w is a weighting factor that controls the influence of h on the fitness value. The incorporation of h into the fitness function allows a joint optimization of the selected wavelengths and the model size without the requirement for a separate optimization of h at each evaluation of the fitness function. If desired, the final model produced by the optimization can be evaluated further to ensure that the value of h is optimal. Procedures based on randomization tests (28) and the statistical F-test (29) are commonly used for this purpose.

In practice, the fitness value is taken as the mean of eq 1, computed across the three calibration subset/monitoring set combinations. This fitness function was used as the starting point for the work reported here. As detailed below, further investigation of this function was performed in subsequent studies. For use with eq 1, a value of w = 0.45 was optimized for the GTB data set through a procedure described previously to balance the predictive performance provided by the calibration models and the model sizes (22). A value of w = 2.0 was found to be optimal previously for the serum data set and was also used in this research.

For the current research, the GA configuration was similar to that used in the previous work (15). The single-point crossover method of recombination was employed in the initial experiments.

(28) van der Voet, H. Chemom. Intell. Lab. Syst. 1994, 25, 313-323.
(29) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1193-1202.
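The fitness evaluation of eq 1, averaged over three random calibration-subset/monitoring-set repartitions with replicates kept together, might be sketched as below. This is not the authors' Fortran implementation. The `fit_predict` callback is a hypothetical interface, and `mean_model` is a deliberately trivial placeholder (it is NOT PLS and ignores both the spectra and h) used only so the sketch runs without external libraries.

```python
import random


def split_by_sample(sample_ids, monitor_fraction, rng):
    """Assign whole samples (all replicates together) to the two subsets."""
    unique = sorted(set(sample_ids))
    rng.shuffle(unique)
    n_monitor = max(1, round(monitor_fraction * len(unique)))
    monitor = set(unique[:n_monitor])
    cal = [i for i, s in enumerate(sample_ids) if s not in monitor]
    mon = [i for i, s in enumerate(sample_ids) if s in monitor]
    return cal, mon


def mse(pred, actual):
    """Mean squared error in concentration."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


def fitness(spectra, conc, sample_ids, mask, h, w, fit_predict, rng,
            n_splits=3):
    """Mean of eq 1, (MSE + MSME + h*w)^-1, over n_splits repartitions."""
    # Keep only the spectral points whose binary gene is 1.
    selected = [[x for x, keep in zip(row, mask) if keep] for row in spectra]
    total = 0.0
    for _ in range(n_splits):
        cal, mon = split_by_sample(sample_ids, 0.2, rng)
        x_cal = [selected[i] for i in cal]
        y_cal = [conc[i] for i in cal]
        x_mon = [selected[i] for i in mon]
        y_mon = [conc[i] for i in mon]
        pred_cal, pred_mon = fit_predict(x_cal, y_cal, x_mon, h)
        total += 1.0 / (mse(pred_cal, y_cal) + mse(pred_mon, y_mon) + h * w)
    return total / n_splits


def mean_model(x_cal, y_cal, x_mon, h):
    """Placeholder model, NOT PLS: predicts the calibration mean everywhere."""
    mean = sum(y_cal) / len(y_cal)
    return [mean] * len(y_cal), [mean] * len(x_mon)


# Toy data: 10 samples with 3 replicate "spectra" of 3 points each.
sample_ids = [s for s in range(10) for _ in range(3)]
conc = [float(s + 1) for s in sample_ids]
spectra = [[c * 0.01, c * 0.02, 0.5] for c in conc]
value = fitness(spectra, conc, sample_ids, mask=[1, 1, 0], h=2, w=0.45,
                fit_predict=mean_model, rng=random.Random(1))
print(value)
```

In a real application `fit_predict` would wrap a PLS regression with h factors. Because the penalty term h*w sits in the denominator, the fitness can never exceed 1/(h*w), which is how larger models are discouraged.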

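Single-point crossover and gene-by-gene mutation, as used in the initial experiments, can be illustrated on the binary (wavelength) genes alone. The sketch is a minimal one under stated assumptions: the function names are hypothetical, the integer PLS-factor gene is omitted for brevity, and applying crossover to a parent pair with a fixed probability (otherwise copying the parents) is one common reading of a "recombination probability", not a detail confirmed by the text.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Swap all genes up to a random cut point between two parents."""
    cut = rng.randrange(1, len(parent_a))
    child_a = parent_b[:cut] + parent_a[cut:]
    child_b = parent_a[:cut] + parent_b[cut:]
    return child_a, child_b


def recombine(parent_a, parent_b, rng, recombination_prob=0.9):
    """Cross a parent pair with the given probability, else copy them."""
    if rng.random() < recombination_prob:
        return single_point_crossover(parent_a, parent_b, rng)
    return list(parent_a), list(parent_b)


def mutate(genes, mutation_prob, rng):
    """Flip each binary gene independently with probability mutation_prob."""
    return [1 - g if rng.random() < mutation_prob else g for g in genes]


rng = random.Random(0)
parent_a = [1] * 10  # every wavelength selected
parent_b = [0] * 10  # no wavelength selected
child_a, child_b = single_point_crossover(parent_a, parent_b, rng)
child_a = mutate(child_a, 0.001, rng)
print(child_a, child_b)
```

Because the operator exchanges one contiguous section, runs of adjacent selected wavelengths tend to pass intact from parent to child.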


As detailed below, further investigation of this parameter was subsequently performed. The other GA parameters optimized previously were adopted in the current work without further study. These included a mutation probability of 0.001, a recombination probability of 0.9, and the use of 100 generations in the optimization. The population size was 100 and 150 chromosomes for the serum and GTB data sets, respectively.

Investigation of Factors Influencing the Number of Wavelengths Selected. Previous work has suggested that the choice of an initial chromosome has a critical effect on the final wavelengths selected by a GA. In the previous study (15), all wavelengths in a specified spectral range were used in the initial chromosome. The spectral ranges selected were those that included the glucose absorption bands and had proven useful previously for building calibration models. Therefore, in the previous work that employed contiguous sections of the spectrum, the spectral ranges used for the GA initialization were relatively broad. With this approach, a total of 292 wavelengths were selected by the GA in building the optimal calibration model for the serum data set, and 150 wavelengths were needed for the optimal calibration model built with the GTB data set.

In the current research, our goal was to decrease the number of wavelengths required to build an effective calibration model. We hypothesized that one strategy for reducing the number of final wavelengths might be to reduce the number of wavelengths selected in the initial chromosome. In the previous study (15), all the wavelengths over the spectral range of 4850-4250 cm⁻¹ (312 spectral points) were used for the initial chromosome for the serum data set. Interestingly, 286 out of the 312 wavelengths in this range remained in the final 292 wavelengths selected by the GA to build the optimal calibration model. The GA appeared not to be efficient at deleting wavelengths from the initial chromosome.

To investigate this phenomenon, the initial chromosome was set to a narrower spectral range of 4460-4420 cm⁻¹ (22 spectral points). Three GA runs were performed through the use of three different seeds for the random number generator. The same three seeds were used for all the experiments. On the basis of the computed fitness values, the 10 best chromosomes were saved from each GA run. The overall top five chromosomes from the three GA runs were then selected to build the optimal calibration models. Using the full calibration set of spectra, the five calibration models were constructed on the basis of the specified wavelengths and the selected numbers of PLS factors. The resulting models were then applied to predict the glucose concentrations corresponding to the spectra in the independent prediction set. The standard error of prediction (SEP) was computed to characterize the prediction results.

Of the five models tested, the model that produced the lowest SEP was used for comparison with the model produced previously with the initialization range of 4850-4250 cm⁻¹. The number of wavelengths selected by the GA was decreased from 292 to 91 with the initialization of the narrow range. This represents a reduction of more than a factor of 3 in the number of wavelengths selected. Furthermore, the prediction results are better for the calibration model built with fewer wavelengths (SEP = 1.31 mM vs 1.44 mM for the model based on 292 spectral points). This suggests that the initialization of a narrow spectral range helps to decrease the number of wavelengths in the optimal sets selected by the GA.

Figure 2A displays the 91 wavelengths present in the best chromosome produced with the initial range of 4460-4420 cm⁻¹. Similar to the case with the initialization of the broad range of 4850-4250 cm⁻¹, 19 of the 22 spectral points in the range of 4460-4420 cm⁻¹ are also retained in the optimal set of wavelengths. This suggests that the initialization of all wavelengths in a specified range may not be a wise strategy for the GA-based wavelength selection. The high correlation between the spectral information encoded in adjacent spectral points in the specified range might prevent the GA from removing the wavelengths within that range. Since the model typically performs no worse with the additional wavelengths included, there is no driving force to remove them once they have been added. For example, while the fitness function specified in eq 1 applies a penalty to calibration models constructed with a large number of PLS factors, there is no similar penalty applied to models on the basis of the number of wavelengths used. Furthermore, since the single-point crossover recombination procedure swaps entire sections of the parent chromosomes, adjacent wavelengths are naturally carried along into the child chromosomes.

On the basis of these observations, three modifications to the optimization were investigated. First, a procedure was investigated in which a random set of wavelengths was initialized within a specified range. Second, modifications to the fitness function of eq 1 were studied. Third, an alternate recombination method was evaluated.

Random Initialization of Wavelengths. A procedure was evaluated in which approximately 10% of the wavelengths in different ranges were randomly selected as the initial wavelengths. The optimal results obtained with this procedure with the serum data set are reported in Table 2. Over each range listed in Table 2, the random selection of initial wavelengths significantly decreased the numbers of wavelengths required in the optimal calibration models compared to the number of wavelengths selected with the initialization of the broad range of 4850-4250 cm⁻¹. The numbers of wavelengths present in the optimal chromosomes with the initialization ranges of 4800-4200 and 4700-4300 cm⁻¹ are smaller than those produced with the initialization ranges of 5000-4000 and 4900-4100 cm⁻¹. Also, the SEP values in Table 2 demonstrate that improved model performance is obtained when the 4800-4200- and 4700-4300-cm⁻¹ initialization ranges are used. This observation is consistent with the fact that the glucose absorption bands are located within the range of 4800-4200 cm⁻¹ and that the 5000-4800- and 4200-4000-cm⁻¹ regions exhibit increased spectral noise (15). On the basis of these results, random initialization over the 4800-4200-cm⁻¹ range was adopted for all further work with the serum data set.

The procedure of random selection of initial wavelengths was also applied to the GTB data set. Because the optimal calibration models previously built for the GTB data set were based on spectral ranges around 4700-4300 cm⁻¹, this range was used for the random selection of the initial wavelengths for the GTB data set. As expected, with random selection of the initial wavelengths, the number of wavelengths present in the optimal chromosome dramatically decreased. Comparison of the optimal results obtained with the initialization of all wavelengths in the spectral
Figure 2. Spectral points present in the best chromosome selected by the GA. (A) serum data set, 2-cm-1 point spacing, all wavelengths in
the range of 4460-4420 cm-1 selected in the initial chromosome; (B) serum data set, 2-cm-1 point spacing, random selection of initial wavelengths
from the range of 4800-4200 cm-1; (C) GTB data set, 2-cm-1 point spacing, random selection of initial wavelengths from the range of 4700-
4300 cm-1; (D-F) serum data set, random selection of initial wavelengths from 4800 to 4200 cm-1, point spacings of (D) 4, (E) 8, and (F) 16
cm-1; (G-I) GTB data set, random selection of initial wavelengths from 4700 to 4300 cm-1, point spacings of (G) 4, (H) 8, and (I) 16 cm-1.
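
The random selection of initial wavelengths named in the Figure 2 caption can be illustrated with a short sketch. It is not the authors' GA software: the function name, the one-boolean-per-point chromosome encoding, the exact 2-cm-1 grid, and the fixed random seed are assumptions made for illustration. Each spectral point inside the initialization range is switched on independently with the stated probability (10% in this work), and every point outside the range starts switched off.

```python
import random


def random_initial_chromosome(wavenumbers, low, high, probability, rng):
    """Return one boolean gene per spectral point.

    A gene is True only if its wavenumber lies in [low, high] and an
    independent random draw falls below `probability`.
    """
    return [low <= w <= high and rng.random() < probability for w in wavenumbers]


# Hypothetical spectral grid: 5000 down to 4000 cm-1 at exactly 2 cm-1 spacing.
wavenumbers = [5000 - 2 * i for i in range(501)]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
chromosome = random_initial_chromosome(wavenumbers, 4200, 4800, 0.10, rng)

n_candidates = sum(4200 <= w <= 4800 for w in wavenumbers)
print(f"{sum(chromosome)} of {n_candidates} candidate points selected")
```

On this assumed grid there are 301 candidate points between 4800 and 4200 cm-1, so a 10% probability gives about 30 initial wavelengths on average, in line with the 29 listed for that range in Table 2.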

Table 2. Effect of Initial Range with Random Selection for the Serum Data Set

initial range   wavelengths   wavelengths   no. of PLS   SEC(a)   SEP(b)
(cm-1)          initial       selected      factors      (mM)     (mM)
5000-4000       50            97            24           1.35     1.55
4900-4100       42            91            21           1.36     1.54
4800-4200       29            77            19           1.26     1.35
4700-4300       18            72            21           1.35     1.37
(a) Standard error of calibration computed from the residuals of the calibration model.
(b) Standard error of prediction.

range of 4675-4375 cm-1 (156 spectral points initially selected) and random selection of 10%
of the wavelengths in the range of 4700-4300 cm-1 (23 points initially selected) reveals a
reduction in selected wavelengths from 150 to 55. The SEP values produced by the corresponding
models are effectively identical (SEP = 0.63 mM for the model based on 150 wavelengths vs 0.61
mM for the model based on 55 points). A factor of 3 reduction in the number of wavelengths
selected in the optimal chromosome was again obtained with no degradation in model performance.
Parts B and C of Figure 2 display the optimal sets of wavelengths selected by the GA with the
random initialization procedure for the serum and GTB data sets, respectively. Panels A and B
of Figure 3 are correlation plots of predicted vs actual or measured glucose concentrations
obtained from these optimal calibration models for the GTB and serum data sets, respectively.
Even with many fewer wavelengths in the calibration models, good correlations between predicted
and actual glucose concentrations are observed in both plots.
Evaluation of Fitness Function Modifications. Modifications to eq 1 were investigated in an
effort to limit the number of wavelengths used in building the calibration models. By adding
the number of wavelengths, p, used in computing the PLS factors to the denominator of the
equation, models based on fewer wavelengths were given an increased fitness score. Experiments
were performed in which the number of wavelengths was used directly in the equation and in
which this value was weighted in various ways (analogous to the weighting of h in eq 1). The
modified fitness functions were tested both with and without the random initialization
procedure described in the previous section.
The results of these experiments indicated clearly that the random initialization procedure was
the key step in limiting the number of wavelengths used in the optimal calibration model. When
used in conjunction with random initialization, the modified fitness functions worked well.
Without random initialization of the starting wavelengths, however, the modified fitness
functions did not perform significantly better than eq 1. On the basis of these results and
given that the inclusion of an additional term in the fitness function raises additional
concerns about appropriate balancing of the contributions of MSE, MSME, h, and p, it was
decided to retain eq 1 as the fitness function for subsequent work.
Evaluation of Uniform Crossover Recombination. Since the single-point crossover recombination
method interchanges contiguous sections of the parent chromosomes in the creation of the new
child chromosomes, it was hypothesized that this procedure contributes to the carry-along of
adjacent spectral points that contribute potentially redundant information. To address this
issue, optimizations were also performed with the uniform
Table 3. Optimal Results with the Serum Data Sets of Different Resolutions

point spacing   wavelengths   no. of PLS   SEC(a)   SEP(b)
(cm-1)          selected      factors      (mM)     (mM)
2               77            19           1.26     1.35
4               52            19           1.30     1.32
8               48            15           1.50     1.38
16              27            17           1.54     1.54
(a) Standard error of calibration computed from the residuals of the calibration model.
(b) Standard error of prediction.
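
The SEC and SEP columns can be read with the following sketch in mind. The table footnotes do not spell out the formulas, so the definitions below are common chemometric conventions adopted here as assumptions: SEP as the root-mean-square residual over the prediction set, and SEC as the calibration residual scaled by n - h - 1 degrees of freedom, where h is the number of PLS factors. The function names and the toy concentrations are hypothetical.

```python
import math


def standard_error_of_prediction(predicted, reference):
    """Root-mean-square residual over the prediction set (assumed convention)."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def standard_error_of_calibration(predicted, reference, n_factors):
    """Calibration residual with n - h - 1 degrees of freedom (assumed convention)."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    degrees_of_freedom = len(squared) - n_factors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more calibration samples than PLS factors + 1")
    return math.sqrt(sum(squared) / degrees_of_freedom)


# Toy glucose concentrations in mM, for illustration only.
reference = [5.0, 10.0, 15.0, 20.0]
predicted = [5.5, 9.5, 15.5, 19.5]

sep = standard_error_of_prediction(predicted, reference)
sec = standard_error_of_calibration(predicted, reference, n_factors=1)
print(f"SEP = {sep:.2f} mM, SEC = {sec:.2f} mM")
```

Under these conventions SEC penalizes models that use more PLS factors, whereas SEP depends only on the residuals for spectra withheld from the calibration.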

Table 4. Optimal Results with the GTB Data Sets of Different Resolutions

point spacing   wavelengths   no. of PLS   SEC(a)   SEP(b)
(cm-1)          selected      factors      (mM)     (mM)
2               55            13           0.53     0.61
4               48            13           0.57     0.61
8               37            13           0.63     0.68
16              23            13           0.67     0.84
(a) Standard error of calibration computed from the residuals of the calibration model.
(b) Standard error of prediction.
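
The point spacings in Tables 3 and 4 were paired with initialization probabilities of 10%, 20%, 40%, and 80% so that roughly the same number of initial wavelengths is drawn at every resolution. The arithmetic behind that choice can be sketched as follows, assuming an exact grid over the 4800-4200-cm-1 serum initialization range and a probability that scales linearly with point spacing. The helper names are hypothetical.

```python
def points_in_range(low, high, spacing):
    """Count grid points from `high` down to `low` (inclusive) at the given spacing."""
    return (high - low) // spacing + 1


def scaled_probability(spacing, base_spacing=2, base_probability=0.10):
    """Scale the 10% probability used at 2 cm-1 in proportion to the point spacing."""
    return base_probability * spacing / base_spacing


expected_counts = {}
for spacing in (2, 4, 8, 16):
    n_points = points_in_range(4200, 4800, spacing)
    probability = scaled_probability(spacing)
    expected_counts[spacing] = n_points * probability
    print(f"{spacing:>2} cm-1: {n_points:>3} points x {probability:.0%} "
          f"= {expected_counts[spacing]:.1f} expected initial wavelengths")
```

On this assumed grid the expected number of initial wavelengths stays close to 30 at every spacing, which is the stated purpose of raising the probability as the resolution is lowered.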

Figure 3. Glucose concentration correlation plots for (A) GTB data set, 2-cm-1 point spacing,
random selection of the initial wavelengths, and (B) serum data set, 2-cm-1 point spacing,
random selection of the initial wavelengths. The open circles and solid triangles denote the
spectra in the calibration and prediction sets, respectively.

crossover method of recombination. With this method, the interchange is performed on a
gene-by-gene basis. The decision of whether to interchange a given gene between the parents is
random but is governed by a uniform crossover probability (e.g., a 30% probability that genes
will be exchanged).
This recombination method was compared to the single-point crossover approach in a series of
trials. However, as in the experiments with the modified fitness functions, the key to reducing
the number of wavelengths needed in the calibration model was again the random initialization
procedure. Given that the uniform crossover method adds an additional GA configuration
parameter (i.e., the crossover probability), it was decided to retain the single-point
crossover technique in subsequent work.
Effect of Spectral Resolution. The use of lower resolution spectra is analogous to the
selection of wavelengths at equidistant locations in the original spectra. For this reason, we
investigated the effect of spectral resolution on the GA-based wavelength selection procedure.
Also, spectral resolution is an important experimental parameter in near-IR spectroscopy
because the required resolution affects the complexity of a dedicated instrument that might be
used to implement an analysis such as the measurement of glucose in a biological sample.
Preliminary results obtained from the study of the effect of spectral resolution on wavelength
selection with the serum data set have been reported (ref 25).
The same procedure of random selection of initial wavelengths with a limited spectral range
described in the previous section was used for the resolution study. In addition, eq 1 was used
as the fitness function, and the single-point crossover method of recombination was employed.
The initialization probability was set at 10% for the original serum and GTB data sets at the
2-cm-1 spectral point spacing. However, to maintain approximately the same number of initial
wavelengths selected with the different resolutions, the initialization probability was set at
20%, 40%, and 80% for the data sets based on spectra with 4-, 8-, and 16-cm-1 point spacings.
Tables 3 and 4 summarize the optimal results obtained with the serum and GTB data sets of
different resolutions, respectively. Also, the mean SEP values and the corresponding 95%
confidence limits computed from the prediction results with the top five calibration models are
displayed in panels A and B of Figure 4 for the GTB and serum data sets of the different
resolutions, respectively. As before, those five models are based on the top five chromosomes
from the three replicate GA runs.
For both the serum and GTB data sets, as the spectral resolution decreases, the number of
wavelengths in the optimal chromosome also decreases. This is expected because the total number
of wavelengths available to be selected is reduced. Encouragingly, as the resolution decreases
from 2 to 8 cm-1, the overall prediction results (both optimal results and mean SEP values) for
both data sets are maintained, while the number of wavelengths in the optimal chromosome is
reduced from 77 to 48 for the serum data set and from 55 to 37 for the GTB data set. As the
resolution further decreases to 16 cm-1, however, the prediction results become significantly
worse for both data sets.
Parts D-F of Figure 2 display the wavelengths present in the
optimal chromosomes for the different resolutions (4-, 8-, and 16-cm-1 point spacings,
respectively) for the serum data set. Parts G-I of Figure 2 present the corresponding selected
wavelengths for the GTB data sets with point spacings of 4, 8, and 16 cm-1, respectively.
Although these wavelengths were selected in building the optimal calibration models for the
data sets of different resolutions, they do not represent a unique collection of wavelengths
which are required to build a good calibration model. They are displayed to show the rough
pattern of the distribution of the wavelengths selected by the GA-based procedure. It is clear
from an inspection of parts A-I of Figure 2, however, that through all the optimal sets of
wavelengths selected, spectral points relevant to glucose information (e.g., close to the 4400-
and 4300-cm-1 glucose bands) were selected for use in building the glucose calibration models.

Figure 4. The mean SEP values and the corresponding upper 95% confidence limits (error bars)
from the top five models generated for the data sets of different spectral resolutions. (A) GTB
data set. (B) serum data set.

CONCLUSIONS
The choice of initial wavelengths is critical for GA-based wavelength selection. The choice of
a starting chromosome not only affects the efficiency of the GA to search for the optimal set
of wavelengths but also has a crucial effect on the number of wavelengths present in the
optimal chromosome. With random selection of a relatively small number of initial wavelengths,
the number of wavelengths selected by the GA in building the optimal calibration models has
been dramatically reduced. With the use of lower spectral resolution, the GA optimization
efficiency can be further improved and the number of wavelengths in the optimal chromosome can
be further decreased. For the glucose analysis, the prediction results based on the optimal
calibration models with data sets of lower resolution (i.e., 4- and 8-cm-1 point spacing) are
not significantly different from those obtained with data sets of the original 2-cm-1 point
spacing. Reduced performance was obtained with spectra at 16-cm-1 point spacing, however. This
indicates the feasibility of use of lower resolution spectra for wavelength selection and also
for other data analysis methodologies applied to the near-IR measurement of glucose in
biological matrixes.

ACKNOWLEDGMENT
This research was supported entirely by the National Institutes of Health under Grant DK45126.
Mutua Mattu, Ndumiso Cingo, and Kevin Hazen are thanked for their assistance in collecting the
spectral data used in this research. Ronald Feld is thanked for his help in obtaining the
glucose levels in the human serum samples. Ronald Shaffer and Arjun Bangalore are acknowledged
for writing the original version of the GA software.

Received for review April 27, 1998. Accepted August 23, 1998.
AC980451Q
