Efficient and effective algorithms for training single-hidden-layer neural networks
Dong Yu, Li Deng
Pattern Recognition Letters 33 (2012) 554-558
Article history: Received 31 March 2011; available online 9 December 2011
Communicated by J. Laaksonen
Keywords: Neural network; Extreme learning machine; Accelerated gradient algorithm; Weighted algorithm; MNIST

Abstract

Recently there has been renewed interest in single-hidden-layer neural networks (SHLNNs). This is due to their powerful modeling ability as well as the existence of some efficient learning algorithms. A prominent example of such algorithms is the extreme learning machine (ELM), which assigns random values to the lower-layer weights. While ELM can be trained efficiently, it requires many more hidden units than are typically needed by conventional neural networks to achieve matched classification accuracy. The use of a large number of hidden units translates to significantly increased test time, which is more valuable than training time in practice. In this paper, we propose a series of new efficient learning algorithms for SHLNNs. Our algorithms exploit both the structure of SHLNNs and the gradient information over all training epochs, and update the weights in the direction along which the overall square error is reduced the most. Experiments on the MNIST handwritten digit recognition task and the MAGIC gamma telescope dataset show that the algorithms proposed in this paper obtain significantly better classification accuracy than ELM when the same number of hidden units is used. For obtaining the same classification accuracy, our best algorithm requires only 1/16 of the model size and thus approximately 1/16 of the test time compared with ELM. This huge advantage is gained at the expense of 5 times or less the training cost incurred by ELM training.
1. Introduction

Recently there has been renewed interest in single-hidden-layer neural networks (SHLNNs) with the least square error (LSE) training criterion, partly due to their modeling ability and partly due to the existence of efficient learning algorithms such as the extreme learning machine (ELM) (Huang et al., 2006).

We are given the set of input vectors X = [x_1, ..., x_i, ..., x_N], in which each vector is denoted by x_i = [x_{1i}, ..., x_{ji}, ..., x_{Di}]^T, where D is the dimension of the input vector and N is the total number of training samples. Denoting by L the number of hidden units and by C the dimension of the output vector, the output of the SHLNN is y_i = U^T h_i, where h_i = σ(W^T x_i) is the hidden-layer output, U is an L × C weight matrix at the upper layer, W is a D × L weight matrix at the lower layer, and σ(·) is the sigmoid function. Note that the bias terms are implicitly represented in the above formulation if x_i and h_i are augmented with 1's.

Given the target vectors T = [t_1, ..., t_i, ..., t_N], where each target t_i = [t_{1i}, ..., t_{ji}, ..., t_{Ci}]^T, the parameters U and W are learned to minimize the square error

E = \|Y - T\|^2 = \mathrm{Tr}[(Y - T)(Y - T)^T],    (1)

where Y = [y_1, ..., y_i, ..., y_N]. Note that once the lower-layer weights W are fixed, the hidden-layer values H = [h_1, ..., h_i, ..., h_N] are also determined uniquely. Subsequently, the upper-layer weights U can be determined by setting the gradient

\frac{\partial E}{\partial U} = \frac{\partial\,\mathrm{Tr}[(U^T H - T)(U^T H - T)^T]}{\partial U} = 2H(U^T H - T)^T    (2)

to zero, leading to the closed-form solution

U = (H H^T)^{-1} H T^T.    (3)

Note that (3) defines an implicit constraint between the two sets of weights, U and W, via the hidden-layer output H, in the SHLNN. This gives rise to a structure that our new algorithms will exploit in optimizing the SHLNN.

Although solution (3) is simple, regularization techniques need to be used in actual implementations to deal with a sometimes ill-conditioned hidden-layer matrix H (i.e., H H^T is singular). A popular technique, which is used in this study, is based on ridge regression theory (Hoerl and Kennard, 1970). More specifically, (3) is converted to

U = \left(\frac{I}{\mu} + H H^T\right)^{-1} H T^T,    (4)

by adding a positive value I/μ to the diagonal of H H^T, where I is the identity matrix and μ is a positive constant that controls the degree of regularization.
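As a concrete illustration (ours, not the authors' code), the following NumPy sketch evaluates the square error (1) and the regularized closed-form solution (4) for a given hidden-layer matrix H; the sizes and the value of the regularization constant mu are arbitrary assumptions.

    import numpy as np

    L, C, N, mu = 8, 3, 100, 1e3              # illustrative sizes and regularization constant
    rng = np.random.default_rng(0)
    H = rng.uniform(size=(L, N))              # hidden-layer outputs, L x N
    T = rng.uniform(size=(C, N))              # targets, C x N

    # Eq. (4): U = (I/mu + H H^T)^{-1} H T^T, computed by solving a linear system
    U = np.linalg.solve(np.eye(L) / mu + H @ H.T, H @ T.T)      # L x C

    # Eq. (1): E = ||Y - T||^2 = Tr[(Y - T)(Y - T)^T]
    Y = U.T @ H
    E = np.sum((Y - T) ** 2)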
The resultant solution (4) actually minimizes \|U^T H - T\|^2 + \frac{1}{\mu}\|U\|^2, where \frac{1}{\mu}\|U\|^2 is an L2 regularization term. Solution (4) is typically more stable and tends to have better generalization performance than (3), and it is used throughout the paper whenever a pseudo-inverse is involved.

It has been shown in Huang et al. (2006) that the lower-layer weights W can be randomly selected and the resulting SHLNN can still approximate any function by setting the upper-layer weights U according to (3). The training process can thus be reduced to a pseudo-inverse problem and hence is extremely efficient. This is the basis of the extreme learning machine (ELM) (Huang et al., 2006).
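To make this recipe concrete, here is a minimal NumPy sketch under our own assumptions (uniform random initialization, a hypothetical regularization constant mu, bias rows assumed already appended to the inputs); it is an illustration, not the authors' reference implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_elm(X, T, L, mu=1e3, seed=0):
        """X: D x N inputs, T: C x N targets, L: number of hidden units."""
        rng = np.random.default_rng(seed)
        W = rng.uniform(-1.0, 1.0, size=(X.shape[0], L))         # random, fixed lower-layer weights
        H = sigmoid(W.T @ X)                                     # L x N hidden-layer outputs
        U = np.linalg.solve(np.eye(L) / mu + H @ H.T, H @ T.T)   # regularized solution (4), L x C
        return W, U

    def predict(W, U, X):
        return U.T @ sigmoid(W.T @ X)                            # C x N network outputs

Solving the linear system directly, rather than forming an explicit matrix inverse, is the usual numerically safer way to evaluate (4).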
However, the drawback of ELM is its inefficiency in using the model parameters. To achieve good classification accuracy, ELM requires a huge number of hidden units. This inevitably increases the model size and the test time. In practice, the test time is much more valuable than the training time for two reasons. First, training needs to be done only once, while testing needs to be done for as long as the service is live. Second, training can be done offline and can tolerate long latency, while testing typically requires a real-time response. To reduce the model size, a number of algorithms, such as the evolutionary ELM (Zhu et al., 2005) and the enhanced random search based incremental ELM (EI-ELM) (Huang and Chen, 2008), have been proposed in the literature. These algorithms randomly generate all or part of the lower-layer weights and select the ones with the least square error. However, these algorithms are not efficient in finding good model parameters, since they use only the value of the objective function in the search process.

In this paper, we propose a series of new efficient algorithms to train SHLNNs. Our algorithms exploit both the structure of SHLNNs, expressed in terms of the constraint of (3), and the gradient information over all training epochs. They also update the weights in the direction that can reduce the overall square error the most. We compare our algorithms with ELM and EI-ELM on the MNIST handwritten digit recognition dataset (LeCun et al., 1998) and the MAGIC gamma telescope dataset. The experiments show that all the algorithms proposed in this paper obtain significantly better classification accuracy than ELM and EI-ELM when the same number of hidden units is used. To obtain the same classification accuracy, our best algorithm requires only 1/16 of the model size, and thus of the test time, needed by ELM, at the cost of 5 times or less the training time of ELM. The 2048-hidden-unit SHLNN trained using our best algorithm achieved 98.9% classification accuracy on the MNIST task. This compares favorably with the three-hidden-layer deep belief network (DBN) (Hinton and Salakhutdinov, 2006).

The rest of the paper is organized as follows. In Section 2 we describe our novel efficient algorithms. In Section 3 we report our experimental results on the MNIST and MAGIC datasets. We conclude the paper in Section 4.

2. New algorithms exploiting structures

In this section, we propose four increasingly more effective and efficient algorithms for learning SHLNNs. Although the algorithms are developed and evaluated based on the sigmoid network, the techniques can be directly extended to SHLNNs with other activation functions such as the radial basis function.

2.1. Upper-layer-solution-unaware algorithm

The idea behind this first algorithm is simple. Since the upper-layer weights can be determined explicitly using the closed-form solution (4) once the lower-layer weights are determined, we can just search for the lower-layer weights along the gradient direction at each epoch.
Given the fixed current U and W, we compute the gradient

\frac{\partial E}{\partial W} = \frac{\partial\,\mathrm{Tr}[(U^T \sigma(W^T X) - T)(U^T \sigma(W^T X) - T)^T]}{\partial W} = 2X[H \circ (1 - H) \circ U(U^T H - T)]^T    (5)

where \circ is the element-wise product. This first algorithm updates W using the gradient defined directly in (5) as

W_{k+1} = W_k - \rho \frac{\partial E}{\partial W}    (6)

where ρ is the learning rate. It then calculates U using the closed-form solution (4). Since it is unaware of the upper-layer solution when calculating the gradient, we name it the "upper-layer-solution-unaware" (USUA) algorithm. The USUA algorithm is simple to implement, and each epoch takes less time than the other algorithms we will introduce in the next several subsections, thanks to the simple form of the gradient (5). However, it is less effective than the other algorithms and typically requires more epochs to converge to a good solution and more hidden units to achieve the same accuracy.
2.2. Upper-layer-solution-aware algorithm

In the USUA algorithm we do not take into consideration the fact that U completely depends on W. As a result, the direction defined by gradient (5) is suboptimal. In the upper-layer-solution-aware (USA) algorithm we derive the gradient ∂E/∂W by considering W's effect on the upper-layer weights U, and thus its effect on the square error as the training objective function. By treating U as a function of W and plugging (3) into criterion (1), we obtain the new gradient

\frac{\partial E}{\partial W} = \frac{\partial\,\mathrm{Tr}[(U^T H - T)(U^T H - T)^T]}{\partial W}
  = \frac{\partial\,\mathrm{Tr}\{([(H H^T)^{-1} H T^T]^T H - T)([(H H^T)^{-1} H T^T]^T H - T)^T\}}{\partial W}
  = \frac{\partial\,\mathrm{Tr}[T T^T - T H^T (H H^T)^{-1} H T^T]}{\partial W}
  = -\frac{\partial\,\mathrm{Tr}[(H H^T)^{-1} H T^T T H^T]}{\partial W}
  = -\frac{\partial\,\mathrm{Tr}\{[\sigma(W^T X)\,\sigma(W^T X)^T]^{-1}\,\sigma(W^T X)\,T^T T\,[\sigma(W^T X)]^T\}}{\partial W}
  = 2X\{[H \circ (1 - H)]^T \circ [H^{\dagger}(H T^T)(T H^{\dagger}) - T^T(T H^{\dagger})]\}    (7)

where

H^{\dagger} = H^T (H H^T)^{-1}    (8)

is the pseudo-inverse of H.

In the derivation of (7) we used the fact that H H^T is symmetric and so is (H H^T)^{-1}. We also used the fact that

\frac{\partial\,\mathrm{Tr}[(H H^T)^{-1} H T^T T H^T]}{\partial H^T} = -2 H^T (H H^T)^{-1} H T^T T H^T (H H^T)^{-1} + 2 T^T T H^T (H H^T)^{-1}.    (9)

Since the USA algorithm knows the effect of W on U, it tends to move W in a direction that finds the optimal points faster. However, due to the more complicated gradient calculation, which involves a pseudo-inverse, each USA epoch takes longer to compute than a USUA epoch. Note that we grouped the products of matrices in (7). This is necessary to reduce the memory usage when the number of samples is very large.
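The following NumPy sketch (our own illustration, not the authors' code) computes the USA gradient (7) using the pseudo-inverse (8), grouping the matrix products as the text suggests so that no N-by-N matrix is ever formed:

    import numpy as np

    def usa_gradient(W, X, T):
        """Eq. (7), with the matrix products grouped to avoid any N x N intermediate."""
        H = 1.0 / (1.0 + np.exp(-(W.T @ X)))               # L x N
        # Eq. (8): H+ = H^T (H H^T)^{-1}; a ridge term as in (4) could be added if H H^T is ill-conditioned
        Hpinv = np.linalg.solve(H @ H.T, H).T              # N x L
        THp = T @ Hpinv                                    # C x L
        inner = Hpinv @ ((H @ T.T) @ THp) - T.T @ THp      # N x L
        return 2.0 * X @ ((H * (1.0 - H)).T * inner)       # D x L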
2.3. Accelerated USA algorithm

The USA algorithm updates the weights based on the current gradient only. However, it has been shown for convex problems that the convergence speed can be improved if the gradient information over the history is used when updating the weights (Nesterov, 2004; Beck and Teboulle, 2010). Although the speedup may not be guaranteed for the non-convex problem considered here, the historical gradient information can still be exploited when updating the weights.
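The specific accelerated update used by A-USA is not reproduced here; purely as an illustration of reusing historical gradient information, the sketch below applies a generic Nesterov-style momentum term (Nesterov, 2004) on top of a supplied gradient function. The momentum schedule m_k, the learning rate rho, and the function names are our assumptions, not the paper's equations.

    import numpy as np

    def accelerated_updates(W0, X, T, grad_fn, rho=1e-3, epochs=50):
        """Generic Nesterov-style momentum loop; NOT the paper's A-USA update equations."""
        W, W_prev = W0.copy(), W0.copy()
        for k in range(1, epochs + 1):
            m_k = (k - 1.0) / (k + 2.0)                    # illustrative momentum schedule
            V = W + m_k * (W - W_prev)                     # look-ahead point built from the weight history
            W_prev, W = W, V - rho * grad_fn(V, X, T)      # grad_fn could be, e.g., the USA gradient (7)
        return W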
2.4. Weighted accelerated USA algorithm

In (7), each sample is weighted the same. It is intuitive, however, that we may improve the convergence speed by focusing on the samples with the most errors, for two reasons. First, it allows the training procedure to slightly change the search direction (since the weighted sum is different) at each epoch and thus has a better chance of jumping out of local optima. Second, since the training procedure focuses on the samples with the most errors, it can reduce the overall error faster.

In this work, we define the weight

\lambda_{ii} = \frac{1}{\alpha + 1}\,\frac{N}{E}\,\|y_i - t_i\|^2 + \frac{\alpha}{\alpha + 1} = \left(\frac{N}{E}\,\|y_i - t_i\|^2 + \alpha\right)\Big/(\alpha + 1)    (13)

for each sample i, where E is the square error over the whole training set, N is the training set size, and α is a smoothing factor. The weighting factors λ_ii are chosen so that they are positively correlated with the error introduced by each sample, while being smoothed to make sure the weight assigned to each sample is at least α/(α + 1). α is typically set to 1 initially and is increased over epochs so that eventually the original criterion E defined in (1) is optimized.

At each step, instead of minimizing E directly, we can minimize the weighted error

\ddot{E} = \mathrm{Tr}[(Y - T)\Lambda(Y - T)^T],    (14)

where \Lambda = \mathrm{diag}[\lambda_{11}, \ldots, \lambda_{ii}, \ldots, \lambda_{NN}] is an N-by-N diagonal weight matrix.

To minimize \ddot{E}, once the lower-layer weights W are fixed, the upper-layer weights U can be determined by setting the gradient

\frac{\partial \ddot{E}}{\partial U} = \frac{\partial\,\mathrm{Tr}[(Y - T)\Lambda(Y - T)^T]}{\partial U} = 2 H \Lambda (U^T H - T)^T    (15)

to zero, which has the closed-form solution

U = (H \Lambda H^T)^{-1} H \Lambda T^T.    (16)
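A short NumPy sketch (ours, with an arbitrary smoothing factor alpha and without the ridge term of (4), which a practical implementation would likely also include) of the per-sample weights (13) and the weighted closed-form solution (16):

    import numpy as np

    def weighted_upper_solution(H, T, Y, alpha=1.0):
        """Per-sample weights (13) and the weighted least-squares solution (16) for U."""
        err = np.sum((Y - T) ** 2, axis=0)                 # ||y_i - t_i||^2 for every sample
        E = err.sum()                                      # overall square error, eq. (1)
        N = T.shape[1]
        lam = ((N / E) * err + alpha) / (alpha + 1.0)      # eq. (13)
        HL = H * lam                                       # H Lambda: scale the i-th column of H by lambda_ii
        return np.linalg.solve(HL @ H.T, HL @ T.T)         # eq. (16), L x C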
By plugging (16) into (14) and using derivation steps similar to those used to derive ∂E/∂W in (7), we obtain the corresponding gradient of the weighted error with respect to W.

3. Experiments

We evaluated and compared the four learning algorithms described in Section 2 against the basic ELM algorithm and the EI-ELM algorithm on the MNIST dataset (LeCun et al., 1998) and the MAGIC gamma telescope dataset (Frank and Asuncion, 2010).

3.1. Dataset Description

The MNIST dataset contains binary images of handwritten digits. The digits have been size-normalized to fit in a 20 × 20 pixel box while preserving their aspect ratio, and centered in a 28 × 28 image by computing and translating the center of mass of the pixels. The task is to classify each 28 × 28 image into one of the 10 digits. The MNIST training set is composed of 60,000 examples from approximately 250 writers, out of which we randomly selected 5,000 samples as the cross-validation set. The test set has 10,000 patterns. The sets of writers of the training set and the test set are disjoint.

The MAGIC gamma telescope dataset was generated using a Monte Carlo procedure to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. A Cherenkov gamma telescope observes high-energy gamma rays, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers initiated by the gammas and developing in the atmosphere. This Cherenkov radiation leaks through the atmosphere and gets recorded in the detector, allowing reconstruction of the shower parameters.

The MAGIC dataset contains 19,020 samples, out of which we randomly selected 10% (1,902 samples) as the cross-validation set, 10% (1,902 samples) as the test set, and the rest as the training set. Each sample in the dataset has 10 real-valued attributes and a class label (signal or background). The task is to classify each observation into either the signal class or the background class based on the attributes. Note that these attributes have some structure. However, in this study we did not exploit this structure, since our goal is not to achieve the best result on this dataset but to compare the different algorithms proposed in this paper.
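As a small how-to sketch (ours, not the authors' code), the random split described above can be reproduced along these lines, given the MAGIC attributes and labels already loaded as arrays:

    import numpy as np

    def split_magic(features, labels, seed=0):
        """Random 10% validation / 10% test / 80% training split of the 19,020 MAGIC samples."""
        N = features.shape[0]
        idx = np.random.default_rng(seed).permutation(N)
        n = N // 10                                        # 1,902 samples each for validation and test
        val, test, train = idx[:n], idx[n:2 * n], idx[2 * n:]
        return (features[train], labels[train]), (features[val], labels[val]), (features[test], labels[test])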
3.2. Experimental results on MNIST dataset

… ELM model and 2,220 seconds for a 1024-hidden-unit EI-ELM model. If 2048 hidden units are used, we can obtain 98.55% average test set accuracy with WA-USA, which is very difficult to obtain using ELM.

Note that we can consistently achieve 100% classification accuracy on the training set when we use WA-USA with 1024 or more hidden units, which is not the case when the other algorithms are used. This prevents further improvement of the classification accuracy on both the training and test sets even though the square error continues to decline. This also explains the smaller gain when the number of hidden units increases from 1024 to 2048 when WA-USA is used.

Our proposed algorithms also compare favorably with other previously proposed SHLNN training algorithms. For example, with random initialization WA-USA can achieve 97.3% test set accuracy using 256 hidden units. This result is better than the 95.3% test set accuracy achieved using an SHLNN with 300 hidden units trained with the conventional back-propagation algorithm and the mean square error criterion (LeCun et al., 1998).

Furthermore, using the WA-USA algorithm and the single 2048-unit hidden layer whose weights are initialized with the restricted Boltzmann machine (RBM), we obtained an average test set accuracy of 98.9%, which is slightly better than the 98.8% obtained using a 3-hidden-layer DBN initialized using RBMs (Hinton and Salakhutdinov, 2006), with significantly less training time.

3.3. Experimental results on MAGIC dataset

Similar comparison experiments have been conducted on the MAGIC dataset. Fig. 2 summarizes and compares the classification accuracy using the ELM, EI-ELM, USUA, USA, A-USA, and WA-USA algorithms as a function of the number of hidden units. Although the relative accuracy improvement is different from that observed on the MNIST dataset, the accuracy curves share the same basic trend as that in Fig. 1. We can see that, especially when the number of hidden units is small, the proposed algorithms significantly outperform ELM and EI-ELM. Although the gap between the accuracies obtained using the proposed approaches and those achieved using ELM and EI-ELM decreases when the number of hidden units increases to 256, the difference is still very large. Actually, ELM obtained its highest test set accuracy of 87.0% when 1024 hidden units are used, and when 2048 hidden units are used it overfits the training data and the test set accuracy becomes lower. However, we can achieve accuracies the same as or higher than the best achievable with the ELM algorithm using only 64, 32, and 32 hidden units with the USA, A-USA, and WA-USA algorithms, respectively. This indicates that at test time we can achieve the same or higher accuracy with 1/16, 1/32, and 1/32 of the computation time of the ELM algorithm. Note that to train a 32-hidden-unit SHLNN using the A-USA or WA-USA algorithm, we only need to spend less than four times the time needed to train a 1024-hidden-unit model using ELM.

Fig. 2. The average test set accuracy as a function of the number of hidden units for the different learning algorithms on the MAGIC dataset.
4. Conclusion

In this paper we presented four efficient algorithms for training SHLNNs. These algorithms exploit information such as the structure of SHLNNs and the gradient values over epochs, and update the weights along the most promising direction. We demonstrated both the efficiency and the effectiveness of these algorithms on the MNIST and MAGIC datasets. Among all the algorithms developed in this work, we recommend using the WA-USA and A-USA algorithms, since they converge fastest and typically to a better model. We believe this line of work can help improve the scalability of neural networks in speech recognition systems (e.g., Dahl et al., 2012; Yu and Deng, 2011), which typically require thousands of hours of training data.

Acknowledgment

We thank Dr. Guang-Bin Huang at Nanyang Technological University, Singapore, for fruitful discussions on ELM.

References

Beck, A., Teboulle, M., 2010. Gradient-based methods with application to signal recovery problems. In: Palomar, D., Eldar, Y. (Eds.), Convex Optimization in Signal Processing and Communications. Cambridge University Press.
Dahl, G.E., Yu, D., Deng, L., Acero, A., 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing.
Frank, A., Asuncion, A., 2010. UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA. Available from: http://www.archive.ics.uci.edu/ml.
Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science 313 (5786), 504-507.
Hoerl, A.E., Kennard, R.W., 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), 55-67.
Huang, G.-B., Zhu, Q.-Y., Siew, C.-K., 2006. Extreme learning machine: theory and applications. Neurocomputing 70, 489-501.
Huang, G.-B., Chen, L., 2008. Enhanced random search based incremental extreme learning machine. Neurocomputing 71, 3460-3468.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), 2278-2324.
Negnevitsky, M., Ringrose, M., 1999. Accelerated learning in multi-layer neural networks. In: Proc. Neural Information Processing (ICONIP), vol. 3, 1167-1171.
Nesterov, Y., 2004. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers.
Yu, D., Deng, L., 2011. Deep learning and its relevance to signal and information processing. IEEE Signal Processing Magazine 28 (1), 145-154.
Zhu, Q.-Y., Qin, A.K., Suganthan, P.N., Huang, G.-B., 2005. Evolutionary extreme learning machine. Pattern Recognition 38, 1759-1763.