Genetic-Algorithm-Based-Wavelength-Selection-for
Center for Intelligent Chemical Instrumentation, Department of Chemistry & Biochemistry, Ohio University,
Athens, Ohio 45701
Mark A. Arnold
Department of Chemistry, Iowa Advanced Technology Laboratories, University of Iowa, Iowa City, Iowa 52242
An improved genetic algorithm (GA)-based wavelength selection procedure
is developed to optimize both the near-infrared wavelengths used and the
number of latent variables employed in building partial least-squares
(PLS) calibration models. This GA-based wavelength selection algorithm is
applied to the determination of glucose in two different biological
matrixes. With random selection of a small number of initial wavelengths,
a dramatic reduction in the number of wavelengths required for building
the PLS calibration models is observed. The fitness function used to
guide the GA, the method of recombination used, and the effect of
spectral resolution on the wavelength selection are also studied. In the
resolution study, the original data with a point spacing of 2 cm-1 are
deresolved to 4-, 8-, and 16-cm-1 point spacings by truncating the
collected interferograms before applying the Fourier processing step. The
use of lower resolution spectra is found to reduce further the number of
final wavelengths selected by the GA, and the performance of the optimal
calibration models obtained with the original spectra is maintained with
the lower resolution spectra of both 4- and 8-cm-1 point spacing.
Degradation in performance is observed with the spectra computed with a
point spacing of 16 cm-1, however.

Multivariate calibration models are widely used in near-infrared
(near-IR) spectroscopy to allow analyte spectral information to be
extracted from overlapping spectral bands arising from the sample matrix.
Wavelength selection methods are feature (variable) selection techniques
that allow calibration models to be constructed with a subset of spectral
points instead of with full spectra. This allows the wavelengths
representing relevant spectral information to be selected, while points
dominated by noise or other extraneous sources of variation are not
included in the calibration model. The resulting calibration models may
exhibit improved performance relative to those based on full spectra
(1,2). Various wavelength selection algorithms and criteria for use with
multivariate calibration have been reported (3-17). Partial least-squares
(PLS) regression is widely used for processing full spectra because of
its ability to extract analyte information from the many sources of
variance within the spectral data matrix. Wavelength selection methods
have traditionally not been used with PLS regression models because of
this ability to decompose the data matrix in a manner biased toward the
isolation of analyte-dependent information. However, recent studies have
indicated that the performance of PLS models can be improved through
wavelength selection (13-17). A mathematical justification of the theory
that wavelength selection can enhance the performance of PLS models was
also reported recently (17).

(1) Brown, C. W.; Lynch, P. F.; Obremski, R. J.; Lavery, D. S. Anal. Chem. 1982, 54, 1472-1479.
(2) Rossi, D. T.; Pardue, H. L. Anal. Chim. Acta 1985, 175, 153-161.
(3) Kalivas, J. H.; Roberts, N.; Sutter, J. M. Anal. Chem. 1989, 61, 2024-2030.
(4) Liang, Y.; Xie, Y.; Yu, R. Anal. Chim. Acta 1989, 222, 347-357.
(5) Sasaki, K.; Kawata, S.; Minami, S. Appl. Spectrosc. 1986, 40, 185-190.
(6) Salamin, P. A.; Bartels, H.; Forster, P. Chemom. Intell. Lab. Syst. 1991, 11, 57-62.
(7) Brown, P. J. J. Chemom. 1993, 7, 255-265.
(8) Brown, P. J. J. Chemom. 1992, 6, 151-161.
(9) Lucasius, C. B.; Kateman, G. TrAC, Trends Anal. Chem. 1991, 10, 254-261.
(10) Lucasius, C. B.; Beckers, M. L. M.; Kateman, G. Anal. Chim. Acta 1994, 286, 135-153.
(11) Jouan-Rimbaud, D.; Massart, D.; Leardi, R.; Noord, O. D. Anal. Chem. 1995, 67, 4295-4301.
(12) Hörchner, U.; Kalivas, J. H. Anal. Chim. Acta 1995, 311, 1-13.
(13) Rimbaud, D. J.; Walczak, B.; Massart, D.; Last, I. R.; Prebble, K. A. Anal. Chim. Acta 1995, 304, 285-295.
(14) Navarro-Villoslada, F.; Perez-Arribas, L. V.; Leon-Gonzalez, M. E.; Polo-Diez, L. M. Anal. Chim. Acta 1995, 313, 93-101.
(15) Bangalore, A. S.; Shaffer, R. E.; Small, G. W.; Arnold, M. A. Anal. Chem. 1996, 68, 4200-4212.
(16) McShane, M. J.; Cote, G. L.; Spiegelman, C. H. Appl. Spectrosc. 1997, 51, 1559.
(17) Spiegelman, C. H.; McShane, M. J.; Goetz, M. J.; Motamedi, M.; Yue, Q. L.; Cote, G. L. Anal. Chem. 1998, 70, 35-44.
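The abstract describes a GA whose chromosomes carry both a set of wavelengths and a number of latent variables, started from a small random selection of wavelengths. The Python sketch below illustrates only that general idea. Everything in it is an assumption made for illustration: the function names (`run_ga`, `toy_fitness`, and the helpers), the uniform crossover, the bit-flip mutation, the elitist replacement, the population settings, and the `INFORMATIVE` index range. It is not the authors' Fortran implementation, and `toy_fitness` is a stand-in; in practice the fitness would be derived from the calibration and prediction errors of a PLS model built on the selected wavelengths.

```python
import random

# Hypothetical indices of "informative" spectral points, used only by the toy fitness.
INFORMATIVE = frozenset(range(40, 60))


def toy_fitness(chrom):
    """Stand-in score: reward informative points, penalize extra points and factors."""
    selected, n_factors = chrom
    hits = len(selected & INFORMATIVE)
    return hits - 0.2 * len(selected - INFORMATIVE) - 0.1 * n_factors


def random_chromosome(n_points, n_initial, max_factors, rng):
    """Chromosome = (selected wavelength indices, number of latent variables)."""
    selected = frozenset(rng.sample(range(n_points), n_initial))
    return selected, rng.randint(1, min(max_factors, n_initial))


def repair(selected, n_factors, fallback):
    """Keep at least one wavelength and no more factors than wavelengths."""
    selected = frozenset(selected) or fallback
    return selected, max(1, min(n_factors, len(selected)))


def crossover(a, b, n_points, rng):
    """Uniform crossover: each spectral point is inherited from a random parent."""
    child = [i for i in range(n_points)
             if i in (a[0] if rng.random() < 0.5 else b[0])]
    return repair(child, rng.choice((a[1], b[1])), a[0])


def mutate(chrom, n_points, rate, max_factors, rng):
    """Flip each point with probability `rate`; nudge the factor count by -1, 0, or +1."""
    selected = set(chrom[0])
    for i in range(n_points):
        if rng.random() < rate:
            selected ^= {i}
    n_factors = min(chrom[1] + rng.choice((-1, 0, 1)), max_factors)
    return repair(selected, n_factors, chrom[0])


def run_ga(fitness, n_points, n_initial, max_factors,
           pop_size=30, generations=40, rate=0.01, seed=0):
    """Elitist GA; returns the best chromosome and the best score per generation."""
    rng = random.Random(seed)
    pop = [random_chromosome(n_points, n_initial, max_factors, rng)
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), n_points, rng),
                   n_points, rate, max_factors, rng)
            for _ in range(pop_size - 1)
        ]
        pop = [pop[0]] + children  # the best chromosome always survives
    return max(pop, key=fitness), history


if __name__ == "__main__":
    best, history = run_ga(toy_fitness, n_points=200, n_initial=10, max_factors=20)
    print(sorted(best[0]), best[1], history[-1])
```

Starting each chromosome with only `n_initial` points, rather than a large fraction of the spectrum, mirrors the initialization strategy the abstract credits with reducing the number of wavelengths in the final models.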
4472 Analytical Chemistry, Vol. 70, No. 21, November 1, 1998 10.1021/ac980451q CCC: $15.00 © 1998 American Chemical Society
Published on Web 09/19/1998
To search for the optimal set of wavelengths, numerical optimization
techniques such as genetic algorithms (GAs) (9-11) and simulated
annealing (SA) (12) have been employed. These optimization methods are
efficient algorithms for interrogating a large search space in which many
combinations of wavelengths are possible.

One of the research interests of our laboratories is to develop
techniques based on near-IR spectroscopy for the measurement of glucose
in various biological matrixes (15,18-25). As part of this work, a
GA-based wavelength selection procedure for use with PLS regression has
been successfully implemented to accomplish simultaneous optimization of
the wavelengths selected and the number of latent variables employed in
building a calibration model (15). In this paper, an enhanced GA-based
wavelength selection procedure is developed through the investigation of
strategies for initializing the GA in an optimal manner, further
exploration of the configuration of the GA, and evaluation of the impact
of spectral resolution on the wavelength optimization.

EXPERIMENTAL SECTION
Instrumentation and Reagents. Two data sets were employed for this
research. One focused on the analysis of glucose in human serum samples
(serum data set) and was collected at the University of Iowa. The other
focused on the analysis of glucose in an aqueous matrix of bovine serum
albumin (BSA) and triacetin (GTB data set) and was collected at Ohio
University. The BSA and triacetin were used for modeling proteins and
triglycerides, respectively, in human blood. These two data sets were
used in the previous study (15).

Spectra in the serum data set were collected with a Nicolet 740 Fourier
transform spectrometer (Nicolet Instrument Corp., Madison, WI) configured
with a 250-W tungsten-halogen source, CaF2 beam splitter, and liquid
nitrogen-cooled InSb detector. The near-IR spectral region of 5000-4000
cm-1 was used. A K-band interference filter (Barr Associates, Westford,
MA) was used to isolate this spectral region. The samples were placed in
an Infrasil quartz cell with 2.5-mm path length. The temperature of the
samples was controlled to 37.0 ± 0.2 °C by use of a water-jacketed cell
holder.

The GTB data set was collected with a Digilab FTS-60A Fourier transform
spectrometer (Bio-Rad, Cambridge, MA) configured with a 100-W
tungsten-halogen source, CaF2 beam splitter, and InSb detector cooled
with liquid nitrogen. The data were also collected over the spectral
region of 5000-4000 cm-1, which was isolated by a K-band interference
filter (Barr Associates). An Infrasil quartz cell with a path length of 2
mm was used, and the temperature was controlled to 37-38 °C with a
water-jacketed cell holder.

Human serum samples were collected from patients at the University of
Iowa Hospitals and Clinics. The glucose levels in the samples were
determined with a conventional clinical glucose analyzer by the hospital
clinical chemistry laboratory. The precision of such measurements is
typically in the range of 0.3 mM (26). All serum samples were frozen
until just before the collection of the spectra. The procedures used in
collecting and handling the serum samples complied with approved ethical
and safety standards at the University of Iowa. The 235 samples used for
this study spanned a range of 3.2-31.9 mM glucose concentration.

The GTB samples were prepared in a pH 7.4, 0.1 M phosphate buffer
solution. 5-Fluorouracil (0.044% w/w) was added as a preservative.
Reagent-grade glucose, sodium phosphate salts, 5-fluorouracil, triacetin,
and BSA were purchased from common suppliers. The reagent-grade water
used for the preparation of the sample solutions was obtained from a
Milli-Q Plus water purification system (Millipore, Inc., Bedford, MA). A
factorial design was adopted for the concentration levels of glucose,
BSA, and triacetin to minimize the correlation among these components.
The data set consisted of samples prepared from all combinations of 10
levels of glucose (1, 3, 5, 7, 9, 11, 13, 15, 17, and 19 mM), four levels
of BSA (50, 65, 80, and 95 g/L), and four levels of triacetin (1.4, 2.1,
2.8, and 3.5 g/L). A total of 160 (10 × 4 × 4) samples were prepared for
the GTB data set.

Procedures. Double-sided interferograms of 16 384 points were collected
for the serum data set. Two to four replicate interferograms were
collected for each serum sample based on 256 coadded scans. The
single-beam spectra were computed from the collected interferograms with
software resident on the Nicolet 620 computer controlling the
spectrometer. Triangular apodization and Mertz phase correction were used
in Fourier processing the interferograms. The resulting spectra had a
nominal point spacing of 2 cm-1. The samples were measured in a
randomized order with respect to glucose concentration. Spectra of a pH
7.3, 0.1 M phosphate buffer were acquired periodically for use as
background spectra in computing spectra in absorbance units.

The procedure for collecting the GTB spectra was similar to that used for
the serum data. However, single-sided interferograms of 16 384 points
were collected based on 256 coadded scans. Again, single-beam spectra
with a nominal point spacing of 2 cm-1 were computed from the collected
interferograms, and triangular apodization and Mertz phase correction
were employed. The software used for the Fourier processing was resident
on the Bio-Rad SPC-3200 computer controlling the spectrometer. Three
replicate interferograms were collected for each sample. The data
collection was also randomized with respect to glucose concentrations,
and interferograms of phosphate buffer were collected periodically for
use in computing spectra in absorbance units.

All data analysis was performed with a Silicon Graphics Indigo2 R10000
workstation (Silicon Graphics, Mountain View, CA) operating under Irix
(version 6.2). The software used for the data analysis was written in
Fortran 77. Subroutines used for multiple linear regression computations
were obtained from the IMSL software package (IMSL, Inc., Houston, TX).

(18) Arnold, M. A.; Small, G. W. Anal. Chem. 1990, 62, 1457-1464.
(19) Marquardt, L. A.; Arnold, M. A.; Small, G. W. Anal. Chem. 1993, 65, 3271-3278.
(20) Small, G. W.; Arnold, M. A.; Marquardt, L. A. Anal. Chem. 1993, 65, 3279-3289.
(21) Hazen, K. H.; Arnold, M. A.; Small, G. W. Appl. Spectrosc. 1994, 48, 477-483.
(22) Shaffer, R. E.; Small, G. W.; Arnold, M. A. Anal. Chem. 1996, 68, 2663-2675.
(23) Pan, S.; Chung, H.; Arnold, M. A.; Small, G. W. Anal. Chem. 1996, 68, 1124-1135.
(24) Mattu, M. J.; Small, G. W.; Arnold, M. A. Anal. Chem. 1997, 69, 4695-4702.
(25) Ding, Q.; Small, G. W. In Fourier Transform Spectroscopy: 11th International Conference; de Haseth, J. A., Ed.; American Institute of Physics: Woodbury, NY, 1998; pp 264-267.
(26) Burmeister, J. J.; Arnold, M. A. Anal. Lett. 1995, 28, 581-592.
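The full factorial design described for the GTB samples can be enumerated directly. The following Python sketch builds every combination of the stated glucose, BSA, and triacetin levels and checks that the pairwise correlations between component concentrations vanish, which is the stated purpose of the design. The concentration levels come from the text; the function and variable names (`full_factorial`, `pearson`, `design`) are illustrative assumptions, and the sketch says nothing about how the samples were actually prepared or ordered.

```python
from itertools import product

# Concentration levels as listed in the text.
GLUCOSE_MM = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
BSA_G_PER_L = [50, 65, 80, 95]
TRIACETIN_G_PER_L = [1.4, 2.1, 2.8, 3.5]


def full_factorial():
    """Return every (glucose, BSA, triacetin) combination as a list of tuples."""
    return list(product(GLUCOSE_MM, BSA_G_PER_L, TRIACETIN_G_PER_L))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


design = full_factorial()
glucose, bsa, triacetin = zip(*design)

if __name__ == "__main__":
    print(len(design))  # number of samples in the design
    print(pearson(glucose, bsa), pearson(glucose, triacetin), pearson(bsa, triacetin))
```

Because every level of one component is paired equally often with every level of the others, the correlations are zero up to floating-point rounding, so a calibration model cannot use BSA or triacetin concentration as a proxy for glucose.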
Point spacing   No. of wavelengths   No. of PLS   SEC (a)   SEP (b)
(cm-1)          selected             factors      (mM)      (mM)
 2              77                   19           1.26      1.35
 4              52                   19           1.30      1.32
 8              48                   15           1.50      1.38
16              27                   17           1.54      1.54
(a) Standard error of calibration computed from the residuals of the
calibration model. (b) Standard error of prediction.
Point spacing   No. of wavelengths   No. of PLS   SEC (a)   SEP (b)
(cm-1)          selected             factors      (mM)      (mM)
 2              55                   13           0.53      0.61
 4              48                   13           0.57      0.61
 8              37                   13           0.63      0.68
16              23                   13           0.67      0.84
(a) Standard error of calibration computed from the residuals of the
calibration model. (b) Standard error of prediction.
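The tables above report SEC and SEP without giving formulas. The Python sketch below uses a convention that is common in multivariate calibration: SEC is the root of the residual sum of squares divided by n - h - 1, where n is the number of calibration samples and h the number of PLS factors, and SEP is the root-mean-square error over the prediction samples. Treat these definitions, and the function names `sec` and `sep`, as assumptions; the excerpt does not confirm the exact degrees of freedom the authors used.

```python
import math


def sep(reference, predicted):
    """Standard error of prediction: RMS difference over the prediction set."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def sec(reference, fitted, n_factors):
    """Standard error of calibration with n - h - 1 degrees of freedom (assumed)."""
    dof = len(reference) - n_factors - 1
    if dof <= 0:
        raise ValueError("need more calibration samples than factors + 1")
    squared = [(r - f) ** 2 for r, f in zip(reference, fitted)]
    return math.sqrt(sum(squared) / dof)
```

Under this convention, adding PLS factors lowers the degrees of freedom, so SEC only improves if the residuals shrink faster than the denominator does.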
CONCLUSIONS
The choice of initial wavelengths is critical for GA-based
wavelength selection. The choice of a starting chromosome not
only affects the efficiency of the GA to search for the optimal set
of wavelengths but also has a crucial effect on the number of
wavelengths present in the optimal chromosome. With random
selection of a relatively small number of initial wavelengths, the
number of wavelengths selected by the GA in building the optimal
calibration models has been dramatically reduced. With the use
of lower spectral resolution, the GA optimization efficiency can
be further improved and the number of wavelengths in the optimal
chromosome can be further decreased. For the glucose analysis,
the prediction results based on the optimal calibration models with
data sets of lower resolution (i.e., 4- and 8-cm-1 point spacing)
are not significantly different from those obtained with data sets
of the original 2-cm-1 point spacing. Reduced performance was
obtained with spectra at 16-cm-1 point spacing, however. This
indicates the feasibility of use of lower resolution spectra for
wavelength selection and also for other data analysis methodologies
applied to the near-IR measurement of glucose in biological matrixes.

Figure 4. The mean SEP values and the corresponding upper 95%
confidence limits (error bars) from the top five models generated for
the data sets of different spectral resolutions. (A) GTB data set. (B)
Serum data set.
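The caption of Figure 4 refers to a mean SEP and an upper 95% confidence limit computed from five models. One plausible way to compute such a limit is a Student t interval on the mean, sketched below in Python. The choice of a two-sided t interval, the function name `mean_and_upper_limit`, and the example SEP values are all assumptions for illustration; the excerpt does not state how the authors computed their limits, and the example numbers are invented rather than taken from the paper.

```python
import math
import statistics

# Two-sided 95% Student t critical value for 4 degrees of freedom (five models).
T_975_DF4 = 2.776


def mean_and_upper_limit(values, t_crit):
    """Return (mean, mean + t_crit * s / sqrt(n)) for a small sample.

    `t_crit` must match len(values) - 1 degrees of freedom.
    """
    n = len(values)
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(n)
    return mean, mean + half_width


if __name__ == "__main__":
    # Hypothetical SEP values (mM) for five models; not data from the paper.
    example_seps = [0.60, 0.62, 0.61, 0.63, 0.64]
    print(mean_and_upper_limit(example_seps, T_975_DF4))
```

Comparing upper limits across resolutions gives a rough sense of whether a change in mean SEP exceeds the spread among the best models at each point spacing.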
optimal chromosomes for the different resolutions (4-, 8-, and 16-cm-1
point spacings, respectively) for the serum data set. Parts G-I of
Figure 2 present the corresponding selected wavelengths for the GTB data
sets with point spacings of 4, 8, and 16 cm-1, respectively.

Although these wavelengths were selected in building the optimal
calibration models for the data sets of different resolutions, they do
not represent a unique collection of wavelengths which are required to
build a good calibration model. They are displayed to show the rough
pattern of the distribution of the wavelengths selected by the GA-based
procedure. It is clear from an inspection

ACKNOWLEDGMENT
This research was supported entirely by the National Institutes of
Health under Grant DK45126. Mutua Mattu, Ndumiso Cingo, and Kevin Hazen
are thanked for their assistance in collecting the spectral data used in
this research. Ronald Feld is thanked for his help in obtaining the
glucose levels in the human serum samples. Ronald Shaffer and Arjun
Bangalore are acknowledged for writing the original version of the GA
software.

Received for review April 27, 1998. Accepted August 23, 1998.
AC980451Q