Efficient and effective algorithms for training single-hidden-layer neural networks
Dong Yu, Li Deng
Pattern Recognition Letters 33 (2012) 554-558
Article history: Received 31 March 2011; available online 9 December 2011
Communicated by J. Laaksonen
Keywords: Neural network; Extreme learning machine; Accelerated gradient algorithm; Weighted algorithm; MNIST

Abstract

Recently there has been renewed interest in single-hidden-layer neural networks (SHLNNs). This is due to their powerful modeling ability as well as the existence of some efficient learning algorithms. A prominent example of such algorithms is the extreme learning machine (ELM), which assigns random values to the lower-layer weights. While ELM can be trained efficiently, it requires many more hidden units than are typically needed by conventional neural networks to achieve matched classification accuracy. The use of a large number of hidden units translates to significantly increased test time, which is more valuable than training time in practice. In this paper, we propose a series of new efficient learning algorithms for SHLNNs. Our algorithms exploit both the structure of SHLNNs and the gradient information over all training epochs, and update the weights in the direction along which the overall square error is reduced the most. Experiments on the MNIST handwritten digit recognition task and the MAGIC gamma telescope dataset show that the algorithms proposed in this paper obtain significantly better classification accuracy than ELM when the same number of hidden units is used. For obtaining the same classification accuracy, our best algorithm requires only 1/16 of the model size and thus approximately 1/16 of the test time compared with ELM. This huge advantage is gained at the expense of 5 times or less the training cost incurred by ELM training.
1. Introduction

Recently there has been renewed interest in single-hidden-layer neural networks (SHLNNs) with the least square error (LSE) training criterion, partly due to their modeling ability and partly due to the existence of efficient learning algorithms such as the extreme learning machine (ELM) (Huang et al., 2006).

We are given the set of input vectors X = [x_1, ..., x_i, ..., x_N], in which each vector is denoted by x_i = [x_{1i}, ..., x_{ji}, ..., x_{Di}]^T, where D is the dimension of the input vector and N is the total number of training samples. Denoting by L the number of hidden units and by C the dimension of the output vector, the output of the SHLNN is y_i = U^T h_i, where h_i = σ(W^T x_i) is the hidden-layer output, U is an L × C weight matrix at the upper layer, W is a D × L weight matrix at the lower layer, and σ(·) is the sigmoid function. Note that the bias terms are implicitly represented in the above formulation if x_i and h_i are augmented with 1's.

Given the target vectors T = [t_1, ..., t_i, ..., t_N], where each target t_i = [t_{1i}, ..., t_{ji}, ..., t_{Ci}]^T, the parameters U and W are learned to minimize the square error

E = \|Y - T\|^2 = \mathrm{Tr}[(Y - T)(Y - T)^T],    (1)

where Y = [y_1, ..., y_i, ..., y_N]. Note that once the lower-layer weights W are fixed, the hidden-layer values H = [h_1, ..., h_i, ..., h_N] are also determined uniquely. Subsequently, the upper-layer weights U can be determined by setting the gradient

\frac{\partial E}{\partial U} = \frac{\partial\,\mathrm{Tr}[(U^T H - T)(U^T H - T)^T]}{\partial U} = 2H(U^T H - T)^T    (2)

to zero, leading to the closed-form solution

U = (H H^T)^{-1} H T^T.    (3)

Note that (3) defines an implicit constraint between the two sets of weights, U and W, via the hidden-layer output H, in the SHLNN. This gives rise to a structure that our new algorithms will exploit in optimizing the SHLNN.

Although solution (3) is simple, regularization techniques need to be used in actual implementations to deal with a sometimes ill-conditioned hidden-layer matrix H (i.e., H H^T is singular). A popular technique, which is used in this study, is based on ridge regression theory (Hoerl and Kennard, 1970). More specifically, (3) is converted to

U = \left(\frac{I}{\mu} + H H^T\right)^{-1} H T^T,    (4)

by adding a positive value I/μ to the diagonal of H H^T, where I is the identity matrix and μ is a positive constant that controls the degree of regularization.
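As a concrete illustration (ours, not the authors' code), the following NumPy sketch evaluates the square error (1) and the regularized closed-form solution (4) for a given hidden-layer matrix H; the sizes and the value of the regularization constant mu are arbitrary assumptions.

    import numpy as np

    L, C, N, mu = 8, 3, 100, 1e3              # illustrative sizes and regularization constant
    rng = np.random.default_rng(0)
    H = rng.uniform(size=(L, N))              # hidden-layer outputs, L x N
    T = rng.uniform(size=(C, N))              # targets, C x N

    # Eq. (4): U = (I/mu + H H^T)^{-1} H T^T, computed by solving a linear system
    U = np.linalg.solve(np.eye(L) / mu + H @ H.T, H @ T.T)      # L x C

    # Eq. (1): E = ||Y - T||^2 = Tr[(Y - T)(Y - T)^T]
    Y = U.T @ H
    E = np.sum((Y - T) ** 2)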
The resultant solution (4) actually minimizes \|U^T H - T\|^2 + \frac{1}{\mu}\|U\|^2, where \frac{1}{\mu}\|U\|^2 is an L2 regularization term. Solution (4) is typically more stable and tends to have better generalization performance than (3), and it is used throughout the paper whenever a pseudo-inverse is involved.

It has been shown in Huang et al. (2006) that the lower-layer weights W can be randomly selected and the resulting SHLNN can still approximate any function by setting the upper-layer weights U according to (3). The training process can thus be reduced to a pseudo-inverse problem and hence is extremely efficient. This is the basis of the extreme learning machine (ELM) (Huang et al., 2006).
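To make this recipe concrete, here is a minimal NumPy sketch under our own assumptions (uniform random initialization, a hypothetical regularization constant mu, bias rows assumed already appended to the inputs); it is an illustration, not the authors' reference implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_elm(X, T, L, mu=1e3, seed=0):
        """X: D x N inputs, T: C x N targets, L: number of hidden units."""
        rng = np.random.default_rng(seed)
        W = rng.uniform(-1.0, 1.0, size=(X.shape[0], L))         # random, fixed lower-layer weights
        H = sigmoid(W.T @ X)                                     # L x N hidden-layer outputs
        U = np.linalg.solve(np.eye(L) / mu + H @ H.T, H @ T.T)   # regularized solution (4), L x C
        return W, U

    def predict(W, U, X):
        return U.T @ sigmoid(W.T @ X)                            # C x N network outputs

Solving the linear system directly, rather than forming an explicit matrix inverse, is the usual numerically safer way to evaluate (4).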
However, the drawback of ELM is its inefficiency in using the model parameters. To achieve good classification accuracy, ELM requires a huge number of hidden units. This inevitably increases the model size and the test time. In practice, the test time is much more valuable than the training time for two reasons. First, training needs to be done only once, while testing needs to be done for as long as the service is live. Second, training can be done offline and can tolerate long latency, while testing typically requires a real-time response. To reduce the model size, a number of algorithms, such as the evolutionary ELM (Zhu et al., 2005) and the enhanced random search based incremental ELM (EI-ELM) (Huang and Chen, 2008), have been proposed in the literature. These algorithms randomly generate all or part of the lower-layer weights and select the ones with the least square error. However, these algorithms are not efficient in finding good model parameters, since they use only the value of the objective function in the search process.

In this paper, we propose a series of new efficient algorithms to train SHLNNs. Our algorithms exploit both the structure of SHLNNs, expressed in terms of the constraint of (3), and the gradient information over all training epochs. They also update the weights in the direction that can reduce the overall square error the most. We compare our algorithms with ELM and EI-ELM on the MNIST handwritten digit recognition dataset (LeCun et al., 1998) and the MAGIC gamma telescope dataset. The experiments show that all the algorithms proposed in this paper obtain significantly better classification accuracy than ELM and EI-ELM when the same number of hidden units is used. To obtain the same classification accuracy, our best algorithm requires only 1/16 of the model size, and thus of the test time, needed by ELM, at the cost of 5 times or less the training time of ELM. The 2048-hidden-unit SHLNN trained using our best algorithm achieved 98.9% classification accuracy on the MNIST task. This compares favorably with the three-hidden-layer deep belief network (DBN) (Hinton and Salakhutdinov, 2006).

The rest of the paper is organized as follows. In Section 2 we describe our novel efficient algorithms. In Section 3 we report our experimental results on the MNIST and MAGIC datasets. We conclude the paper in Section 4.

2. New algorithms exploiting structures

In this section, we propose four increasingly more effective and efficient algorithms for learning SHLNNs. Although the algorithms are developed and evaluated based on the sigmoid network, the techniques can be directly extended to SHLNNs with other activation functions such as the radial basis function.

2.1. Upper-layer-solution-unaware algorithm

The idea behind this first algorithm is simple. Since the upper-layer weights can be determined explicitly using the closed-form solution (4) once the lower-layer weights are determined, we can just search for the lower-layer weights along the gradient direction at each epoch.
Given the fixed current U and W, we compute the gradient

\frac{\partial E}{\partial W} = \frac{\partial\,\mathrm{Tr}[(U^T \sigma(W^T X) - T)(U^T \sigma(W^T X) - T)^T]}{\partial W} = 2X[H \circ (1 - H) \circ U(U^T H - T)]^T    (5)

where \circ is the element-wise product. This first algorithm updates W using the gradient defined directly in (5) as

W_{k+1} = W_k - \rho \frac{\partial E}{\partial W}    (6)

where ρ is the learning rate. It then calculates U using the closed-form solution (4). Since it is unaware of the upper-layer solution when calculating the gradient, we name it the "upper-layer-solution-unaware" (USUA) algorithm. The USUA algorithm is simple to implement, and each epoch takes less time than the other algorithms we will introduce in the next several subsections, thanks to the simple form of the gradient (5). However, it is less effective than the other algorithms and typically requires more epochs to converge to a good solution and more hidden units to achieve the same accuracy.
2.2. Upper-layer-solution-aware algorithm

In the USUA algorithm we do not take into consideration the fact that U completely depends on W. As a result, the direction defined by gradient (5) is suboptimal. In the upper-layer-solution-aware (USA) algorithm we derive the gradient ∂E/∂W by considering W's effect on the upper-layer weights U, and thus its effect on the square error as the training objective function. By treating U as a function of W and plugging (3) into criterion (1), we obtain the new gradient

\frac{\partial E}{\partial W} = \frac{\partial\,\mathrm{Tr}[(U^T H - T)(U^T H - T)^T]}{\partial W}
  = \frac{\partial\,\mathrm{Tr}\{([(H H^T)^{-1} H T^T]^T H - T)([(H H^T)^{-1} H T^T]^T H - T)^T\}}{\partial W}
  = \frac{\partial\,\mathrm{Tr}[T T^T - T H^T (H H^T)^{-1} H T^T]}{\partial W}
  = -\frac{\partial\,\mathrm{Tr}[(H H^T)^{-1} H T^T T H^T]}{\partial W}
  = -\frac{\partial\,\mathrm{Tr}\{[\sigma(W^T X)\,\sigma(W^T X)^T]^{-1}\,\sigma(W^T X)\,T^T T\,[\sigma(W^T X)]^T\}}{\partial W}
  = 2X\{[H \circ (1 - H)]^T \circ [H^{\dagger}(H T^T)(T H^{\dagger}) - T^T(T H^{\dagger})]\}    (7)

where

H^{\dagger} = H^T (H H^T)^{-1}    (8)

is the pseudo-inverse of H.

In the derivation of (7) we used the fact that H H^T is symmetric and so is (H H^T)^{-1}. We also used the fact that

\frac{\partial\,\mathrm{Tr}[(H H^T)^{-1} H T^T T H^T]}{\partial H^T} = -2 H^T (H H^T)^{-1} H T^T T H^T (H H^T)^{-1} + 2 T^T T H^T (H H^T)^{-1}.    (9)

Since the USA algorithm knows the effect of W on U, it tends to move W in a direction that finds the optimal points faster. However, due to the more complicated gradient calculation, which involves a pseudo-inverse, each USA epoch takes longer to compute than a USUA epoch. Note that we grouped the products of matrices in (7). This is necessary to reduce the memory usage when the number of samples is very large.
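The following NumPy sketch (our own illustration, not the authors' code) computes the USA gradient (7) using the pseudo-inverse (8), grouping the matrix products as the text suggests so that no N-by-N matrix is ever formed:

    import numpy as np

    def usa_gradient(W, X, T):
        """Eq. (7), with the matrix products grouped to avoid any N x N intermediate."""
        H = 1.0 / (1.0 + np.exp(-(W.T @ X)))               # L x N
        # Eq. (8): H+ = H^T (H H^T)^{-1}; a ridge term as in (4) could be added if H H^T is ill-conditioned
        Hpinv = np.linalg.solve(H @ H.T, H).T              # N x L
        THp = T @ Hpinv                                    # C x L
        inner = Hpinv @ ((H @ T.T) @ THp) - T.T @ THp      # N x L
        return 2.0 * X @ ((H * (1.0 - H)).T * inner)       # D x L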
2.3. Accelerated USA algorithm

The USA algorithm updates the weights based on the current gradient only. However, it has been shown for convex problems that the convergence speed can be improved if the gradient information over the history is used when updating the weights (Nesterov, 2004; Beck and Teboulle, 2010). Although the speedup may not be guaranteed for the non-convex problem considered here, the historical gradient information can still be exploited when updating the weights.
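The specific accelerated update used by A-USA is not reproduced here; purely as an illustration of reusing historical gradient information, the sketch below applies a generic Nesterov-style momentum term (Nesterov, 2004) on top of a supplied gradient function. The momentum schedule m_k, the learning rate rho, and the function names are our assumptions, not the paper's equations.

    import numpy as np

    def accelerated_updates(W0, X, T, grad_fn, rho=1e-3, epochs=50):
        """Generic Nesterov-style momentum loop; NOT the paper's A-USA update equations."""
        W, W_prev = W0.copy(), W0.copy()
        for k in range(1, epochs + 1):
            m_k = (k - 1.0) / (k + 2.0)                    # illustrative momentum schedule
            V = W + m_k * (W - W_prev)                     # look-ahead point built from the weight history
            W_prev, W = W, V - rho * grad_fn(V, X, T)      # grad_fn could be, e.g., the USA gradient (7)
        return W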
2.4. Weighted accelerated USA algorithm

In (7), each sample is weighted the same. It is intuitive, however, that we may improve the convergence speed by focusing on the samples with the most errors, for two reasons. First, it allows the training procedure to slightly change the search direction (since the weighted sum is different) at each epoch and thus has a better chance of jumping out of local optima. Second, since the training procedure focuses on the samples with the most errors, it can reduce the overall error faster.

In this work, we define the weight

\lambda_{ii} = \frac{1}{\alpha + 1}\,\frac{N}{E}\,\|y_i - t_i\|^2 + \frac{\alpha}{\alpha + 1} = \left(\frac{N}{E}\,\|y_i - t_i\|^2 + \alpha\right)\Big/(\alpha + 1)    (13)

for each sample i, where E is the square error over the whole training set, N is the training set size, and α is a smoothing factor. The weighting factors λ_ii are chosen so that they are positively correlated with the error introduced by each sample, while being smoothed to make sure the weight assigned to each sample is at least α/(α + 1). α is typically set to 1 initially and is increased over epochs so that eventually the original criterion E defined in (1) is optimized.

At each step, instead of minimizing E directly, we can minimize the weighted error

\ddot{E} = \mathrm{Tr}[(Y - T)\Lambda(Y - T)^T],    (14)

where \Lambda = \mathrm{diag}[\lambda_{11}, \ldots, \lambda_{ii}, \ldots, \lambda_{NN}] is an N-by-N diagonal weight matrix.

To minimize \ddot{E}, once the lower-layer weights W are fixed, the upper-layer weights U can be determined by setting the gradient

\frac{\partial \ddot{E}}{\partial U} = \frac{\partial\,\mathrm{Tr}[(Y - T)\Lambda(Y - T)^T]}{\partial U} = 2 H \Lambda (U^T H - T)^T    (15)

to zero, which has the closed-form solution

U = (H \Lambda H^T)^{-1} H \Lambda T^T.    (16)
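A short NumPy sketch (ours, with an arbitrary smoothing factor alpha and without the ridge term of (4), which a practical implementation would likely also include) of the per-sample weights (13) and the weighted closed-form solution (16):

    import numpy as np

    def weighted_upper_solution(H, T, Y, alpha=1.0):
        """Per-sample weights (13) and the weighted least-squares solution (16) for U."""
        err = np.sum((Y - T) ** 2, axis=0)                 # ||y_i - t_i||^2 for every sample
        E = err.sum()                                      # overall square error, eq. (1)
        N = T.shape[1]
        lam = ((N / E) * err + alpha) / (alpha + 1.0)      # eq. (13)
        HL = H * lam                                       # H Lambda: scale the i-th column of H by lambda_ii
        return np.linalg.solve(HL @ H.T, HL @ T.T)         # eq. (16), L x C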
By plugging (16) into (14) and using derivation steps similar to those used to derive ∂E/∂W in (7), we obtain the corresponding gradient of the weighted error with respect to W.

3. Experiments

We evaluated and compared the four learning algorithms described in Section 2 against the basic ELM algorithm and the EI-ELM algorithm on the MNIST dataset (LeCun et al., 1998) and the MAGIC gamma telescope dataset (Frank and Asuncion, 2010).

3.1. Dataset Description

The MNIST dataset contains binary images of handwritten digits. The digits have been size-normalized to fit in a 20 × 20 pixel box while preserving their aspect ratio, and centered in a 28 × 28 image by computing and translating the center of mass of the pixels. The task is to classify each 28 × 28 image into one of the 10 digits. The MNIST training set is composed of 60,000 examples from approximately 250 writers, out of which we randomly selected 5,000 samples as the cross-validation set. The test set has 10,000 patterns. The sets of writers of the training set and the test set are disjoint.

The MAGIC gamma telescope dataset was generated using a Monte Carlo procedure to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. A Cherenkov gamma telescope observes high-energy gamma rays, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers initiated by the gammas and developing in the atmosphere. This Cherenkov radiation leaks through the atmosphere and gets recorded in the detector, allowing reconstruction of the shower parameters.

The MAGIC dataset contains 19,020 samples, out of which we randomly selected 10% (1,902 samples) as the cross-validation set, 10% (1,902 samples) as the test set, and the rest as the training set. Each sample in the dataset has 10 real-valued attributes and a class label (signal or background). The task is to classify each observation into either the signal class or the background class based on the attributes. Note that these attributes have some structure. However, in this study we did not exploit this structure, since our goal is not to achieve the best result on this dataset but to compare the different algorithms proposed in this paper.
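As a small how-to sketch (ours, not the authors' code), the random split described above can be reproduced along these lines, given the MAGIC attributes and labels already loaded as arrays:

    import numpy as np

    def split_magic(features, labels, seed=0):
        """Random 10% validation / 10% test / 80% training split of the 19,020 MAGIC samples."""
        N = features.shape[0]
        idx = np.random.default_rng(seed).permutation(N)
        n = N // 10                                        # 1,902 samples each for validation and test
        val, test, train = idx[:n], idx[n:2 * n], idx[2 * n:]
        return (features[train], labels[train]), (features[val], labels[val]), (features[test], labels[test])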
3.2. Experimental results on MNIST dataset

… ELM model and 2,220 seconds for a 1024-hidden-unit EI-ELM model. If 2048 hidden units are used, we can obtain 98.55% average test set accuracy with WA-USA, which is very difficult to obtain using ELM.

Note that we can consistently achieve 100% classification accuracy on the training set when we use WA-USA with 1024 or more hidden units, which is not the case when the other algorithms are used. This prevents further improvement of the classification accuracy on both the training and test sets even though the square error continues to decline. This also explains the smaller gain when the number of hidden units increases from 1024 to 2048 when WA-USA is used.

Our proposed algorithms also compare favorably with other previously proposed SHLNN training algorithms. For example, with random initialization WA-USA can achieve 97.3% test set accuracy using 256 hidden units. This result is better than the 95.3% test set accuracy achieved using an SHLNN with 300 hidden units trained with the conventional back-propagation algorithm and the mean square error criterion (LeCun et al., 1998).

Furthermore, using the WA-USA algorithm and the single 2048-unit hidden layer whose weights are initialized with the restricted Boltzmann machine (RBM), we obtained an average test set accuracy of 98.9%, which is slightly better than the 98.8% obtained using a 3-hidden-layer DBN initialized using RBMs (Hinton and Salakhutdinov, 2006), with significantly less training time.

3.3. Experimental results on MAGIC dataset

Similar comparison experiments have been conducted on the MAGIC dataset. Fig. 2 summarizes and compares the classification accuracy using the ELM, EI-ELM, USUA, USA, A-USA, and WA-USA algorithms as a function of the number of hidden units. Although the relative accuracy improvement is different from that observed on the MNIST dataset, the accuracy curves share the same basic trend as that in Fig. 1. We can see that, especially when the number of hidden units is small, the proposed algorithms significantly outperform ELM and EI-ELM. Although the gap between the accuracies obtained using the proposed approaches and those achieved using ELM and EI-ELM decreases when the number of hidden units increases to 256, the difference is still very large. Actually, ELM obtained its highest test set accuracy of 87.0% when 1024 hidden units are used, and when 2048 hidden units are used it overfits the training data and the test set accuracy becomes lower. However, we can achieve accuracies the same as or higher than the best achievable with the ELM algorithm using only 64, 32, and 32 hidden units with the USA, A-USA, and WA-USA algorithms, respectively. This indicates that at test time we can achieve the same or higher accuracy with 1/16, 1/32, and 1/32 of the computation time of the ELM algorithm. Note that to train a 32-hidden-unit SHLNN using the A-USA or WA-USA algorithm, we only need to spend less than four times the time needed to train a 1024-hidden-unit model using ELM.

Fig. 2. The average test set accuracy as a function of the number of hidden units for the different learning algorithms on the MAGIC dataset.
4. Conclusion

In this paper we presented four efficient algorithms for training SHLNNs. These algorithms exploit information such as the structure of SHLNNs and the gradient values over epochs, and update the weights along the most promising direction. We demonstrated both the efficiency and the effectiveness of these algorithms on the MNIST and MAGIC datasets. Among all the algorithms developed in this work, we recommend using the WA-USA and A-USA algorithms, since they converge fastest and typically to a better model. We believe this line of work can help improve the scalability of neural networks in speech recognition systems (e.g., Dahl et al., 2012; Yu and Deng, 2011), which typically require thousands of hours of training data.

Acknowledgment

We thank Dr. Guang-Bin Huang at Nanyang Technological University, Singapore, for fruitful discussions on ELM.

References

Beck, A., Teboulle, M., 2010. Gradient-based methods with application to signal recovery problems. In: Palomar, D., Eldar, Y. (Eds.), Convex Optimization in Signal Processing and Communications. Cambridge University Press.
Dahl, G.E., Yu, D., Deng, L., Acero, A., 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing.
Frank, A., Asuncion, A., 2010. UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA. Available from: http://www.archive.ics.uci.edu/ml.
Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science 313 (5786), 504-507.
Hoerl, A.E., Kennard, R.W., 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), 55-67.
Huang, G.-B., Zhu, Q.-Y., Siew, C.-K., 2006. Extreme learning machine: theory and applications. Neurocomputing 70, 489-501.
Huang, G.-B., Chen, L., 2008. Enhanced random search based incremental extreme learning machine. Neurocomputing 71, 3460-3468.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), 2278-2324.
Negnevitsky, M., Ringrose, M., 1999. Accelerated learning in multi-layer neural networks. In: Proc. Neural Information Processing (ICONIP), vol. 3, 1167-1171.
Nesterov, Y., 2004. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers.
Yu, D., Deng, L., 2011. Deep learning and its relevance to signal and information processing. IEEE Signal Processing Magazine 28 (1), 145-154.
Zhu, Q.-Y., Qin, A.K., Suganthan, P.N., Huang, G.-B., 2005. Evolutionary extreme learning machine. Pattern Recognition 38, 1759-1763.