
Pattern Recognition Letters 33 (2012) 554–558


Efficient and effective algorithms for training single-hidden-layer neural networks


Dong Yu, Li Deng
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

Article history: Received 31 March 2011; available online 9 December 2011. Communicated by J. Laaksonen.

Keywords: Neural network; Extreme learning machine; Accelerated gradient algorithm; Weighted algorithm; MNIST

Abstract

Recently there has been renewed interest in single-hidden-layer neural networks (SHLNNs). This is due to their powerful modeling ability as well as the existence of some efficient learning algorithms. A prominent example of such algorithms is the extreme learning machine (ELM), which assigns random values to the lower-layer weights. While ELM can be trained efficiently, it requires many more hidden units than are typically needed by conventional neural networks to achieve matched classification accuracy. The use of a large number of hidden units translates to significantly increased test time, which is more valuable than training time in practice. In this paper, we propose a series of new efficient learning algorithms for SHLNNs. Our algorithms exploit both the structure of SHLNNs and the gradient information over all training epochs, and update the weights in the direction along which the overall square error is reduced the most. Experiments on the MNIST handwritten digit recognition task and the MAGIC gamma telescope dataset show that the algorithms proposed in this paper obtain significantly better classification accuracy than ELM when the same number of hidden units is used. For obtaining the same classification accuracy, our best algorithm requires only 1/16 of the model size and thus approximately 1/16 of the test time compared with ELM. This huge advantage is gained at the expense of 5 times or less the training cost incurred by the ELM training.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Recently there has been renewed interest in single-hidden-layer neural networks (SHLNNs) with the least square error (LSE) training criterion, partly due to their modeling ability and partly due to the existence of efficient learning algorithms such as the extreme learning machine (ELM) (Huang et al., 2006).

Let X = [x_1, ..., x_i, ..., x_N] denote the set of input vectors, where each vector x_i = [x_1i, ..., x_ji, ..., x_Di]^T, D is the dimension of the input vector, and N is the total number of training samples. Denote by L the number of hidden units and by C the dimension of the output vector. The output of the SHLNN is y_i = U^T h_i, where h_i = σ(W^T x_i) is the hidden-layer output, U is an L × C weight matrix at the upper layer, W is a D × L weight matrix at the lower layer, and σ(·) is the sigmoid function. Note that the bias terms are implicitly represented in the above formulation if x_i and h_i are augmented with 1's.

Given the target vectors T = [t_1, ..., t_i, ..., t_N], where each target t_i = [t_1i, ..., t_ji, ..., t_Ci]^T, the parameters U and W are learned to minimize the square error

    E = ‖Y − T‖^2 = Tr[(Y − T)(Y − T)^T],    (1)

where Y = [y_1, ..., y_i, ..., y_N]. Note that once the lower-layer weights W are fixed, the hidden-layer values H = [h_1, ..., h_i, ..., h_N] are also determined uniquely. Subsequently, the upper-layer weights U can be determined by setting the gradient

    ∂E/∂U = ∂Tr[(U^T H − T)(U^T H − T)^T]/∂U = 2H(U^T H − T)^T    (2)

to zero, leading to the closed-form solution

    U = (HH^T)^{-1} H T^T.    (3)

Note that (3) defines an implicit constraint between the two sets of weights, U and W, via the hidden-layer output H of the SHLNN. This gives rise to a structure that our new algorithms will exploit in optimizing the SHLNN.

Although solution (3) is simple, regularization techniques need to be used in actual implementations to deal with a sometimes ill-conditioned hidden-layer matrix H (i.e., HH^T is singular). A popular technique, which is used in this study, is based on ridge regression theory (Hoerl and Kennard, 1970). More specifically, (3) is converted to

    U = (I/μ + HH^T)^{-1} H T^T,    (4)

obtained by adding a positive value I/μ to the diagonal of HH^T, where I is the identity matrix and μ is a positive constant that controls the degree of regularization. The resultant solution (4) actually minimizes ‖U^T H − T‖^2 + μ‖U‖^2, where μ‖U‖^2 is an L2 regularization term. Solution (4) is typically more stable and tends to have better generalization performance than (3), and it is used throughout the paper whenever a pseudo-inverse is involved.
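To make the closed-form step concrete, the following NumPy sketch computes the hidden-layer outputs H = σ(W^T X) and the ridge-regularized upper-layer weights of Eq. (4). It is a minimal illustration under the paper's shape conventions (X: D × N, W: D × L, T: C × N); the function name, the value of mu, and the omission of the bias augmentation are our own simplifications, not details taken from the paper.

```python
import numpy as np

def upper_layer_solution(W, X, T, mu=1e3):
    """Eq. (4): U = (I/mu + H H^T)^{-1} H T^T with H = sigmoid(W^T X).

    W: D x L lower-layer weights, X: D x N inputs, T: C x N targets.
    `mu` is an illustrative regularization constant, not a value reported
    in the paper. Returns H (L x N) and U (L x C).
    """
    H = 1.0 / (1.0 + np.exp(-(W.T @ X)))                    # hidden-layer outputs
    L = H.shape[0]
    # Solve (I/mu + H H^T) U = H T^T rather than forming an explicit inverse.
    U = np.linalg.solve(np.eye(L) / mu + H @ H.T, H @ T.T)
    return H, U
```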

It has been shown in Huang et al. (2006) that the lower-layer weights W can be randomly selected and the resulting SHLNN can still approximate any function by setting the upper-layer weights U according to (3). The training process can thus be reduced to a pseudo-inverse problem and hence is extremely efficient. This is the basis of the extreme learning machine (ELM) (Huang et al., 2006).

However, the drawback of ELM is its inefficiency in using the model parameters. To achieve good classification accuracy, ELM requires a huge number of hidden units. This inevitably increases the model size and the test time. In practice, the test time is much more valuable than the training time for two reasons. First, training only needs to be done once, while testing needs to be carried out for as long as the service is live. Second, training can be done offline and can tolerate long latency, while testing typically requires real-time response. To reduce the model size, a number of algorithms, such as evolutionary ELM (Zhu et al., 2005) and enhanced random search based incremental ELM (EI-ELM) (Huang and Chen, 2008), have been proposed in the literature. These algorithms randomly generate all or part of the lower-layer weights and select the ones with the LSE. However, these algorithms are not efficient in finding good model parameters since they only use the value of the objective function in the search process.

In this paper, we propose a series of new efficient algorithms to train SHLNNs. Our algorithms exploit both the structure of SHLNNs, expressed in terms of the constraint of (3), and the gradient information over all training epochs. They also update the weights in the direction that can reduce the overall square error the most. We compare our algorithms with ELM and EI-ELM on the MNIST handwritten digit recognition dataset (LeCun et al., 1998) and the MAGIC gamma telescope dataset. The experiments show that all algorithms proposed in this paper obtain significantly better classification accuracy than ELM and EI-ELM when the same number of hidden units is used. To obtain the same classification accuracy, our best algorithm requires only 1/16 of the model size, and thus 1/16 of the test time, needed by ELM, at the cost of five times or less the training time of ELM. The 2048-hidden-unit SHLNN trained using our best algorithm achieved 98.9% classification accuracy on the MNIST task. This compares favorably with the three-hidden-layer deep belief network (DBN) (Hinton and Salakhutdinov, 2006).

The rest of the paper is organized as follows. In Section 2 we describe our novel efficient algorithms. In Section 3 we report our experimental results on the MNIST and MAGIC datasets. We conclude the paper in Section 4.

2. New algorithms exploiting structures

In this section, we propose four increasingly more effective and efficient algorithms for learning SHLNNs. Although the algorithms are developed and evaluated based on the sigmoid network, the techniques can be directly extended to SHLNNs with other activation functions such as the radial basis function.

2.1. Upper-layer-solution-unaware algorithm

The idea behind this first algorithm is simple. Since the upper-layer weights can be determined explicitly using the closed-form solution (4) once the lower-layer weights are determined, we can simply search for the lower-layer weights along the gradient direction at each epoch. Given the current U and W, we compute the gradient

    ∂E/∂W = ∂Tr[(U^T σ(W^T X) − T)(U^T σ(W^T X) − T)^T]/∂W = 2X[H ⊙ (1 − H) ⊙ (UU^T H − UT)]^T,    (5)

where ⊙ denotes the element-wise product. This first algorithm updates W using the gradient defined directly in (5) as

    W_{k+1} = W_k − ρ ∂E/∂W,    (6)

where ρ is the learning rate. It then calculates U using the closed-form solution (4). Since it is unaware of the upper-layer solution when calculating the gradient, we name it the "upper-layer-solution-unaware" (USUA) algorithm. The USUA algorithm is simple to implement, and each epoch takes less time than the other algorithms we will introduce in the next several subsections thanks to the simple form of the gradient (5). However, it is less effective than the other algorithms and typically requires more epochs to converge to a good solution and more hidden units to achieve the same accuracy.
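As a concrete illustration of one USUA epoch, the sketch below combines the gradient of Eq. (5), the update of Eq. (6), and the closed-form recomputation of U via Eq. (4). The learning rate rho and the ridge constant mu are placeholder values, and the epoch structure is our reading of the description above rather than the authors' code.

```python
import numpy as np

def usua_epoch(W, X, T, rho=1e-3, mu=1e3):
    """One USUA epoch: gradient step on W (Eqs. (5)-(6)), then re-solve U (Eq. (4)).

    X: D x N inputs, T: C x N targets, W: D x L lower-layer weights.
    `rho` and `mu` are illustrative settings, not the paper's.
    """
    H = 1.0 / (1.0 + np.exp(-(W.T @ X)))                          # L x N
    L = H.shape[0]
    U = np.linalg.solve(np.eye(L) / mu + H @ H.T, H @ T.T)        # Eq. (4), L x C
    # Eq. (5): dE/dW = 2 X [H o (1 - H) o (U U^T H - U T)]^T
    grad_W = 2.0 * X @ (H * (1.0 - H) * (U @ U.T @ H - U @ T)).T  # D x L
    W_new = W - rho * grad_W                                      # Eq. (6)
    # Re-derive the upper-layer weights for the updated lower layer.
    H_new = 1.0 / (1.0 + np.exp(-(W_new.T @ X)))
    U_new = np.linalg.solve(np.eye(L) / mu + H_new @ H_new.T, H_new @ T.T)
    return W_new, U_new
```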
2.2. Upper-layer-solution-aware algorithm

In the USUA algorithm we do not take into consideration the fact that U completely depends on W. As a result, the direction defined by gradient (5) is suboptimal. In the upper-layer-solution-aware (USA) algorithm we derive the gradient ∂E/∂W by considering W's effect on the upper-layer weights U and thus its effect on the square error as the training objective function. By treating U as a function of W and plugging (3) into criterion (1), we obtain the new gradient

    ∂E/∂W = ∂Tr[(U^T H − T)(U^T H − T)^T]/∂W
          = ∂Tr[(((HH^T)^{-1} H T^T)^T H − T)(((HH^T)^{-1} H T^T)^T H − T)^T]/∂W
          = ∂Tr[T T^T − T H^T (HH^T)^{-1} H T^T]/∂W
          = −∂Tr[(HH^T)^{-1} H T^T T H^T]/∂W
          = −∂Tr[(σ(W^T X) σ(W^T X)^T)^{-1} σ(W^T X) T^T T σ(W^T X)^T]/∂W
          = 2X[H^T ⊙ (1 − H)^T ⊙ (H†(H T^T)(T H†) − T^T (T H†))],    (7)

where

    H† = H^T (HH^T)^{-1}    (8)

is the pseudo-inverse of H.

In the derivation of (7) we used the fact that HH^T is symmetric and so is (HH^T)^{-1}. We also used the fact that

    ∂Tr[(HH^T)^{-1} H T^T T H^T]/∂H^T = −2H^T (HH^T)^{-1} H T^T T H^T (HH^T)^{-1} + 2T^T T H^T (HH^T)^{-1}.    (9)

Since the USA algorithm knows the effect of W on U, it tends to move W towards a direction that finds the optimal points faster. However, due to the more complicated gradient calculation, which involves a pseudo-inverse, each USA epoch takes longer to compute than a USUA epoch. Note that we grouped the products of the matrices in (7). This is necessary to reduce the memory usage when the number of samples is very large.
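The grouped form of Eq. (7) maps directly onto matrix code. The sketch below evaluates the USA gradient without ever forming an N × N matrix, which is the memory-saving grouping mentioned above; for clarity it uses the unregularized pseudo-inverse of Eq. (8), whereas the paper applies the ridge-regularized form (4) in practice.

```python
import numpy as np

def usa_gradient(W, X, T):
    """USA gradient of Eq. (7). X: D x N, T: C x N, W: D x L."""
    H = 1.0 / (1.0 + np.exp(-(W.T @ X)))                # L x N
    H_dag = np.linalg.solve(H @ H.T, H).T               # Eq. (8): H^T (H H^T)^{-1}, N x L
    T_Hdag = T @ H_dag                                  # C x L
    # Eq. (7): 2 X [H^T o (1 - H)^T o (H^dag (H T^T)(T H^dag) - T^T (T H^dag))].
    # Products are grouped as in the paper so that no N x N matrix is ever formed.
    inner = H_dag @ (H @ T.T) @ T_Hdag - T.T @ T_Hdag   # N x L
    return 2.0 * X @ (H.T * (1.0 - H.T) * inner)        # D x L, same shape as W
```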
2.3. Accelerated upper-layer-solution-aware algorithm

The USA algorithm updates the weights based on the current gradient only. However, it has been shown for convex problems that the convergence speed can be improved if the gradient information over the history is used when updating the weights (Nesterov, 2004; Beck and Teboulle, 2010). Although the speedup may not be guaranteed in theory for our non-convex problems, we have observed in practice that such algorithms do converge faster and to a better place. Actually, similar but less principled techniques such as momentum (Negnevitsky and Ringrose, 1999) have been successfully applied to train non-convex multi-layer perceptrons (MLPs). In this paper, we use the FISTA algorithm (Beck and Teboulle, 2010) to accelerate the learning process. More specifically, we choose W_0 and set W̄_1 = W_0 and m_1 = 1 during initialization. We then update W, W̄ and m according to

    W_k = W̄_k − ρ ∂E/∂W,    (10)
    m_{k+1} = (1 + √(1 + 4m_k^2))/2,  and    (11)
    W̄_{k+1} = W_k + ((m_k − 1)/m_{k+1}) (W_k − W_{k−1}).    (12)

We name this algorithm accelerated USA (A-USA).

Note that since the A-USA algorithm needs to keep track of two sets of weights, W_k and W̄_k, it is slightly slower than the USA algorithm for each epoch. However, since it uses the gradient information from the history to determine the search direction, it can find the optimal solution with fewer epochs than the USA algorithm. Additional information on how FISTA and similar techniques can speed up the gradient descent algorithm can be found in Nesterov (2004) and Beck and Teboulle (2010).
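A compact sketch of the A-USA loop is given below; it reuses the usa_gradient sketch from Section 2.2 and follows Eqs. (10)-(12). The fixed learning rate and epoch count are illustrative placeholders, not the settings used in the experiments.

```python
import numpy as np

def a_usa_train(W0, X, T, rho=1e-3, epochs=30):
    """FISTA-style A-USA updates, Eqs. (10)-(12); `usa_gradient` is the sketch above."""
    W_prev = W0                  # W_{k-1}
    W_bar = W0                   # auxiliary weights, W-bar_1 = W_0
    m = 1.0                      # m_1 = 1
    for _ in range(epochs):
        W = W_bar - rho * usa_gradient(W_bar, X, T)         # Eq. (10)
        m_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * m * m))   # Eq. (11)
        W_bar = W + (m - 1.0) / m_next * (W - W_prev)       # Eq. (12)
        W_prev, m = W, m_next
    return W_prev                # final lower-layer weights; U then follows from Eq. (4)
```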

2.4. Weighted accelerated USA algorithm

In (7), each sample is weighted the same. It is intuitive, however, that we may improve the convergence speed by focusing on the samples with the largest errors, for two reasons. First, it allows the training procedure to slightly change the search direction (since the weighted sum is different) at each epoch and thus has a better chance to jump out of local optima. Second, since the training procedure focuses on the samples with the largest errors, it can reduce the overall error faster.

In this work, we define the weight

    λ_ii = (1/(α + 1)) (N/E) ‖y_i − t_i‖^2 + α/(α + 1) = ((N/E) ‖y_i − t_i‖^2 + α)/(α + 1)    (13)

for each sample i, where E is the square error over the whole training set, N is the training set size, and α is a smoothing factor. The weighting factors λ_ii are chosen so that they are positively correlated with the errors introduced by each sample while being smoothed to make sure the weight assigned to each sample is at least α/(α + 1). α is typically set to 1 initially and increases over epochs so that eventually the original criterion E defined in (1) is optimized.

At each step, instead of minimizing E directly, we can minimize the weighted error

    Ẽ = Tr[(Y − T) Λ (Y − T)^T],    (14)

where Λ = diag[λ_11, ..., λ_ii, ..., λ_NN] is an N by N diagonal weight matrix.

To minimize Ẽ, once the lower-layer weights W are fixed the upper-layer weights U can be determined by setting the gradient

    ∂Ẽ/∂U = ∂Tr[(Y − T) Λ (Y − T)^T]/∂U = 2HΛ(U^T H − T)^T    (15)

to zero, which has the closed-form solution

    U = (HΛH^T)^{-1} H Λ T^T.    (16)

By plugging (16) into (14) and using derivation steps similar to those used to derive ∂E/∂W in (7), we obtain the gradient

    ∂Ẽ/∂W = ∂Tr[(U^T H − T) Λ (U^T H − T)^T]/∂W
          = ∂Tr[(((HΛH^T)^{-1} H Λ T^T)^T H − T) Λ (((HΛH^T)^{-1} H Λ T^T)^T H − T)^T]/∂W
          = ∂Tr[T Λ T^T − T Λ H^T (HΛH^T)^{-1} H Λ T^T]/∂W
          = −∂Tr[(HΛH^T)^{-1} H Λ T^T T Λ H^T]/∂W
          = 2X[H^T ⊙ (1 − H)^T ⊙ (H‡(H Λ T^T)(T H‡) − Λ T^T (T H‡))],    (17)

where

    H‡ = Λ H^T (HΛH^T)^{-1}.    (18)

Note that since we re-estimate the weights after each epoch, the algorithm will try to move the weights with a larger step toward the direction in which the error can be most effectively reduced. Once the error for a sample is reduced, the weight for that sample becomes smaller in the next epoch. This not only speeds up convergence but also makes the training less likely to be trapped in local optima. Because this algorithm uses adaptive weightings, we name it weighted accelerated USA (WA-USA).
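The weighting of Eqs. (13) and (16) is straightforward to express with a weight vector instead of an explicit N × N diagonal matrix, as in the sketch below. The function names and the broadcasting trick are our own; the ridge-regularized variant that the paper uses whenever a pseudo-inverse is involved is omitted for brevity.

```python
import numpy as np

def sample_weights(Y, T, alpha=1.0):
    """Eq. (13): per-sample weights lambda_ii, returned as a length-N vector."""
    err = np.sum((Y - T) ** 2, axis=0)        # ||y_i - t_i||^2 for each sample
    E = err.sum()                             # square error over the whole training set
    N = Y.shape[1]
    return ((N / E) * err + alpha) / (alpha + 1.0)

def weighted_upper_layer_solution(H, T, lam):
    """Eq. (16): U = (H Lambda H^T)^{-1} H Lambda T^T, with Lambda = diag(lam)."""
    HL = H * lam                              # H Lambda via broadcasting, L x N
    return np.linalg.solve(HL @ H.T, HL @ T.T)   # L x C
```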
3. Experiments

We evaluated and compared the four learning algorithms described in Section 2 against the basic ELM algorithm and the EI-ELM algorithm on the MNIST dataset (LeCun et al., 1998) and the MAGIC gamma telescope dataset (Frank and Asuncion, 2010).

3.1. Dataset description

The MNIST dataset contains binary images of handwritten digits. The digits have been size-normalized to fit in a 20 × 20 pixel box while preserving their aspect ratio, and centered in a 28 × 28 image by computing and translating the center of mass of the pixels. The task is to classify each 28 × 28 image into one of the 10 digits. The MNIST training set is composed of 60,000 examples from approximately 250 writers, out of which we randomly selected 5,000 samples as the cross-validation set. The test set has 10,000 patterns. The sets of writers of the training set and test set are disjoint.

The MAGIC gamma telescope dataset was generated using a Monte Carlo procedure to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. A Cherenkov gamma telescope observes high-energy gamma rays, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers initiated by the gammas and developing in the atmosphere. This Cherenkov radiation leaks through the atmosphere and gets recorded in the detector, allowing reconstruction of the shower parameters.

The MAGIC dataset contains 19,020 samples, out of which we randomly selected 10% (1,902 samples) as the cross-validation set, 10% (1,902 samples) as the test set, and the rest as the training set. Each sample in the dataset has 10 real-valued attributes and a class label (signal or background). The task is to classify the observation into either the signal class or the background class based on the attributes. Note that these attributes have some structure. However, in this study we did not exploit this structure since our goal is not to achieve the best result on this dataset but to compare the different algorithms proposed in the paper.

3.2. Experimental results on MNIST

We compared the basic ELM algorithm, the EI-ELM algorithm, and all four algorithms described in Section 2 with the number of hidden units in the set {64, 128, 256, 512, 1024, 2048} on the MNIST dataset. The results are summarized in Table 1. We ran each configuration 10 times and report the mean and standard deviation of the test-set classification accuracy, the training-set classification accuracy, and the training time. The test time depends only on the model size and is summarized in Table 2. Not surprisingly, the test time approximately doubles when the hidden layer size (and model size) doubles.

Table 1
Summary of test set accuracy, training set accuracy, and training time on the MNIST digit classification task.

Algorithm   # hid units   Test acc (%)    Train acc (%)   Training time (s)
ELM         64            67.88 ± 2.01    66.88 ± 2.00    1.05 ± 0.14
ELM         128           78.99 ± 1.20    78.06 ± 1.28    1.89 ± 0.07
ELM         256           85.55 ± 0.44    84.9 ± 0.36     3.46 ± 0.12
ELM         512           89.65 ± 0.28    89.41 ± 0.27    6.96 ± 0.06
ELM         1024          92.65 ± 0.21    92.85 ± 0.13    13.8 ± 0.07
ELM         2048          94.68 ± 0.06    95.31 ± 0.05    28.35 ± 0.17
EI-ELM      64            73.68 ± 0.87    72.84 ± 0.57    147.59 ± 0.88
EI-ELM      128           81.46 ± 0.63    80.61 ± 0.44    282.37 ± 1.27
EI-ELM      256           86.74 ± 0.40    86.2 ± 0.30     550.13 ± 11.17
EI-ELM      512           90.52 ± 0.35    90.24 ± 0.17    1069.73 ± 6.27
EI-ELM      1024          92.92 ± 0.14    93.23 ± 0.10    2220.47 ± 18.41
EI-ELM      2048          94.78 ± 0.15    95.51 ± 0.07    4629.67 ± 91.81
USUA        64            84.78 ± 1.42    84.27 ± 1.49    99.13 ± 3.03
USUA        128           88.42 ± 1.05    88.06 ± 1.10    177.81 ± 5.86
USUA        256           90.73 ± 0.46    90.82 ± 0.5     347.35 ± 14.83
USUA        512           93.24 ± 0.39    93.79 ± 0.47    681.88 ± 20.04
USUA        1024          94.84 ± 0.37    95.82 ± 0.41    1323.04 ± 64.35
USUA        2048          96.27 ± 0.14    97.86 ± 0.13    2643.73 ± 84.18
USA         64            86.4 ± 1.06     85.89 ± 1.25    114.68 ± 2.00
USA         128           89.81 ± 0.76    89.62 ± 0.87    221.1 ± 3.37
USA         256           92.59 ± 0.86    92.86 ± 0.83    463.97 ± 10.43
USA         512           94.87 ± 0.35    95.58 ± 0.46    1029.47 ± 13.45
USA         1024          96.47 ± 0.13    97.63 ± 0.10    2471.36 ± 10.35
USA         2048          97.39 ± 0.07    98.95 ± 0.07    7116.77 ± 19.10
A-USA       64            90.12 ± 1.66    89.98 ± 1.82    88.1 ± 5.81
A-USA       128           94.35 ± 0.16    94.82 ± 0.11    153.77 ± 0.41
A-USA       256           95.87 ± 0.13    96.64 ± 0.11    320.9 ± 0.33
A-USA       512           97.01 ± 0.12    98.04 ± 0.17    717.64 ± 0.60
A-USA       1024          97.64 ± 0.06    99.3 ± 0.03     1727.17 ± 1.89
A-USA       2048          98.02 ± 0.08    99.87 ± 0.01    4916.57 ± 2.64
WA-USA      64            93.64 ± 0.46    94.12 ± 0.46    84.51 ± 0.61
WA-USA      128           96.03 ± 0.25    97.08 ± 0.20    154.9 ± 0.85
WA-USA      256           97.09 ± 0.21    98.72 ± 0.12    322.4 ± 1.91
WA-USA      512           97.59 ± 0.13    99.56 ± 0.09    757.34 ± 0.75
WA-USA      1024          98.45 ± 0.12    100 ± 0         1965.28 ± 7.10
WA-USA      2048          98.55 ± 0.11    100 ± 0         5907.17 ± 10.81

Table 2
Test time as a function of the number of hidden units on the MNIST dataset.

# hidden units   Test time (s)
64               0.19 ± 0.01
128              0.38 ± 0.03
256              0.76 ± 0.06
512              1.48 ± 0.13
1024             2.65 ± 0.10
2048             4.97 ± 0.08

For EI-ELM, we randomly generated 50 new configurations of weights at each step. The one with the least square error (LSE) was selected and survived. We noticed, however, that if we added only one hidden unit at a time, the training process could be very slow. To make the training speed comparable to the other algorithms discussed in this paper, we added 16 hidden units at each step.

For USUA and USA, we set the maximum number of epochs to 30 and used a simple line search that doubles or halves the learning rate so as to improve the training objective function. When the learning rate became smaller than 1e-6, the algorithms stopped even if the maximum number of epochs had not been reached.

We did not use line search for A-USA and WA-USA. The learning rate used in A-USA was fixed for all epochs and was set to 0.001. We used a learning rate of 0.0005 in WA-USA for all settings, which is smaller than that used for A-USA since the update is expected to move with large steps along some directions. The cross-validation set is used to select the best configuration and to determine when to stop training.
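The doubling/halving line search used for USUA and USA can be sketched as follows. This is only our reading of the heuristic described above (double the step while it keeps improving the objective, otherwise halve it, and stop below 1e-6); the authors' exact implementation is not given in the paper.

```python
def line_search_step(objective, W, grad, rho):
    """Take one gradient step with a doubling/halving learning-rate search.

    `objective` maps lower-layer weights W to the training objective value;
    `grad` is the gradient at W; `rho` is the current learning rate.
    """
    base = objective(W)
    # Grow the step while a larger step is still better than the current one.
    while objective(W - 2.0 * rho * grad) < objective(W - rho * grad):
        rho *= 2.0
    # Shrink the step until the objective actually improves (or rho gets tiny).
    while objective(W - rho * grad) >= base and rho > 1e-6:
        rho *= 0.5
    return W - rho * grad, rho
```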
Fig. 1. The average test set accuracy as a function of the number of hidden units and different learning algorithms on the MNIST dataset.

The results summarized in Table 1 can be compared from several perspectives. To make observation easier, we plot the test set accuracy in Fig. 1. If we compare the accuracy across different algorithms for the same number of hidden units, we can clearly see that all the algorithms proposed in this paper significantly outperform ELM and EI-ELM. We also notice that, from the accuracy point of view, WA-USA performs best, followed by A-USA, which in turn performs better than USA and USUA. If we compare the training time for the SHLNNs with the same number of hidden units, we can indeed see that ELM takes considerably less time (about two orders of magnitude) than all the other algorithms. Note that all the algorithms proposed in this paper significantly outperform EI-ELM with similar training time. This is expected since EI-ELM only uses zeroth-order information while all our algorithms use first-order gradient information. Among the algorithms proposed in this paper, WA-USA and A-USA are faster than USA since they are accelerated algorithms.

These results can be examined from a different angle. Instead of comparing results with the same network size, we can compare SHLNNs with the same test-set accuracy. From Fig. 1 and Table 1 we see that the best average accuracy obtained using ELM is 94.68% with 2048 hidden units. EI-ELM is only slightly better than ELM, with an average accuracy of 94.78%. This is because when the number of hidden units increases, random selection becomes less effective. This fact is also indicated by the smaller standard deviations as the number of hidden units increases in ELM. However, using USUA, we obtained an accuracy of 94.84% with only 1024 hidden units. This would cut the test time by half. Further improvement is achieved when we use USA, with an accuracy of 94.78% using only 512 hidden units. For A-USA, only 256 hidden units are needed to achieve 95.87% accuracy. Further, only 128 hidden units are needed to obtain a comparable accuracy of 94.80% using WA-USA. In other words, WA-USA can achieve the same accuracy as ELM using only 1/16 of the network size and test time. This is extremely favorable for practical usage since 1/16 of the test time translates to 16 times more throughput. Also note that it takes only 155 seconds to train a network with 128 hidden units using WA-USA. This is in comparison to 28.35 seconds needed to train a 2048-hidden-unit ELM model and 2,220 seconds for a 1024-hidden-unit EI-ELM model. If 2048 hidden units are used, we can obtain 98.55% average test set accuracy with WA-USA, which is very difficult to obtain using ELM.

Note that we can consistently achieve 100% classification accuracy on the training set when we use WA-USA with 1024 or more hidden units, which is not the case when the other algorithms are used. This prevents further improvement of the classification accuracy on both the training and test sets even though the square error continues to decline. This also explains the smaller gain when the number of hidden units increases from 1024 to 2048 when WA-USA is used.

Our proposed algorithms also compare favorably with other previously proposed SHLNN training algorithms. For example, with random initialization WA-USA can achieve 97.3% test set accuracy using 256 hidden units. This result is better than the 95.3% test set accuracy achieved using an SHLNN with 300 hidden units trained using the conventional back-propagation algorithm with the mean square error criterion (LeCun et al., 1998).

Furthermore, using the WA-USA algorithm with the single 2048-unit hidden layer weights initialized with a restricted Boltzmann machine (RBM), we obtained an average test set accuracy of 98.9%, which is slightly better than the 98.8% obtained using a three-hidden-layer DBN initialized using RBMs (Hinton and Salakhutdinov, 2006), with significantly less training time.

3.3. Experimental results on the MAGIC dataset

Fig. 2. The average test set accuracy as a function of the number of hidden units and different learning algorithms on the MAGIC dataset.

Similar comparison experiments have been conducted on the MAGIC dataset. Fig. 2 summarizes and compares the classification accuracy using the ELM, EI-ELM, USUA, USA, A-USA, and WA-USA algorithms as a function of the number of hidden units. Although the relative accuracy improvement is different from that observed on the MNIST dataset, the accuracy curves share the same basic trend as those in Fig. 1. We can see that, especially when the number of hidden units is small, the proposed algorithms significantly outperform ELM and EI-ELM. Although when the number of hidden units increases to 256 the gap between the accuracies obtained using the proposed approaches and those achieved using ELM and EI-ELM decreases, the difference is still very large. Actually, ELM obtained its highest test set accuracy of 87.0% when 1024 hidden units are used, and when 2048 hidden units are used it overfits the training data and the test set accuracy becomes lower. However, we can achieve the same or higher accuracy than the best achievable using the ELM algorithm with 64, 32, and 32 hidden units, respectively, using the USA, A-USA, and WA-USA algorithms. This indicates that at test time we can achieve the same or higher accuracy with 1/16, 1/32, and 1/32 of the computation time using these algorithms compared to the ELM algorithm. Note that to train a 32-hidden-unit SHLNN using the A-USA or WA-USA algorithm we only need to spend less than four times the time needed to train a 1024-hidden-unit model using ELM.
accuracy achieved using SHLNN with 300 hidden units but trained
using conventional back-propagation algorithm with mean square
error criterion (LeCun et al., 1998). References
Furthermore, using the WA-USA algorithm and the single 2048
Beck, A., Teboulle, M., 2010. Gradient-Based Methods with Application to Signal
hidden layer weights initialized with the restricted Boltzmann ma- Recovery Problems. In: Palomar, D., Eldar, Y. (Eds.), Convex Optimization in
chine (RBM), we obtained average test set accuracy of 98.9% which Signal Processing and Communications. Cambridge University Press, Berlin.
Dahl, G.E., Yu, D., Deng, L., Acero, A. 2012. Context-Dependent Pre-Trained Deep
is slightly better than the 98.8% obtained using a 3-hidden-layer
Neural Networks for Large Vocabulary Speech Recognition. IEEE Transactions
DBN initialized using RBM (Hinton and Salakhutdinov, 2006) with on Audio, Speech, and Language Processing - Special Issue on Deep Learning for
significantly less training time. Speech and Language Processing (special issue).
Frank, A., Asuncion, A. 2010. UCI Machine Learning Repository. Irvine, CA:
University of California, School of Information and Computer Science.
3.3. Experimental results on MAGIC dataset Available from: http://www.archive.ics.uci.edu/ml.
Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with
Similar comparison experiments have been conducted on the neural networks. Science 313 (5786), 504–507.
Hoerl, A.E., Kennard, R.W., 1970. Ridge regression: biased estimation for
MAGIC dataset. Fig. 2 summarizes and compares the classification nonorthogonal problems. Technometrics 12 (1), 55–67.
accuracy using ELM, EI-ELM, USUA, USA, A-USA, and WA-USA algo- Huang, G.-B., Zhu, Q.-Y., Siew, C.-K., 2006. Extreme learning machine: theory and
rithms as a function of the number of hidden units. Although the applications. Neurocomputing 70, 489–501.
Huang, G.-B., Chen, L., 2008. Enhanced random search based incremental extreme
relative accuracy improvement is different from those observed learning machine. Neurocomputing 71, 3460–3468.
in MNIST dataset, the accuracy curves share the same basic trend LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. 1998. Gradient-based learning applied to
as that in Fig. 1. We can see that, esp. when the number of hidden document recognition. In: Proceedings of the IEEE, 86(11), 2278–2324.
Negnevitsky, M., Ringrose, M., 1999. Accelerated learning in multi-layer neural
units is small, the proposed algorithms significantly outperform
networks. Neural Information Processing ICONIP 3, 1167–1171.
ELM and EI-ELM. Although when the number of hidden unit in- Nesterov, Y., 2004. Introductory Lectures on Convex Optimization: A Basic Course.
creases to 256, the gap between the accuracies obtained using pro- Kluwer Academic Publishers.
Yu, D., Deng, L., 2011. Deep learning and its relevance to signal and information
posed approaches and that achieved using ELM and EI-ELM
processing. IEEE Signal Processing Magazine 28 (1), 145–154.
decreases, the difference is still very large. Actually, ELM obtained Zhu, Q.-Y., Qin, A.K., Suganthan, P.N., Huang, G.-B., 2005. Evolutionary extreme
the highest test set accuracy of 87.0% when 1024 hidden units are learning machine. Pattern Recogn. 38, 1759–1763.
