
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 4, APRIL 2014

Application of Deep Belief Networks for Natural Language Understanding

Ruhi Sarikaya, Geoffrey E. Hinton, and Anoop Deoras

Abstract—Applications of Deep Belief Nets (DBN) to various problems have been the subject of a number of recent studies ranging from image classification and speech recognition to audio classification. In this study we apply DBNs to a natural language understanding problem. The recent surge of activity in this area was largely spurred by the development of a greedy layer-wise pretraining method that uses an efficient learning algorithm called Contrastive Divergence (CD). CD allows DBNs to learn a multi-layer generative model from unlabeled data, and the features discovered by this model are then used to initialize a feed-forward neural network which is fine-tuned with backpropagation. We compare a DBN-initialized neural network to three widely used text classification algorithms: Support Vector Machines (SVM), boosting, and Maximum Entropy (MaxEnt). The plain DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models. However, using additional unlabeled data for DBN pre-training and combining DBN-based learned features with the original features provides significant gains over SVMs, which, in turn, performed better than both MaxEnt and Boosting.

Index Terms—Call-routing, DBN, deep learning, deep neural nets, natural language understanding, RBM.

Manuscript received November 08, 2012; revised September 02, 2013; accepted January 11, 2014. Date of publication February 11, 2014; date of current version February 19, 2014. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Pascal Fung. R. Sarikaya and A. Deoras are with Microsoft Corporation, Redmond, WA 98052 USA (e-mail: [email protected]; [email protected]). G. Hinton is with the Department of Computer Science, University of Toronto, Toronto, ON M5S 3G4, Canada (e-mail: [email protected]). Digital Object Identifier 10.1109/TASLP.2014.2303296

I. INTRODUCTION

THE goal of spoken language understanding (SLU) systems is to enable communication between a human and a machine. SLU systems automatically identify a user's intent from natural language by extracting the information-bearing words and issuing queries to back-end databases to satisfy the user's requests. Ground-breaking advances in speech recognition technology from the early 1980s to the early 1990s opened the way for spoken language understanding. An early SLU task was the DARPA (Defense Advanced Research Projects Agency) Airline Travel Information System (ATIS) project [1] in 1990. This project focused on building spoken understanding systems in the travel domain. These systems handled spoken queries related to flight information, including flight booking and hotel reservation. An example utterance from this domain is "I want to fly from Seattle to Miami tomorrow morning." Language understanding was reduced to the problem of extracting task-specific slots, such as DestinationLocation, DepartureLocation and DepartureDate, where the intent is FindFlight.

Conditional random fields (CRFs) [4] are one of the most widely used discriminative modeling techniques for slot filling [2], [3] in spoken language understanding. Slot filling is cast as a sequence classification problem to obtain the most probable slot sequence:

\hat{C} = \arg\max_{C} P(C \mid W)

where W = w_1, ..., w_T is the input word sequence and C = c_1, ..., c_T is the sequence of associated class labels.

Motivated by the success of early commercial interactive voice response (IVR) applications used in call centers, a new SLU task evolved: that of determining the user intent. This new SLU task was framed as classifying users' utterances into predefined categories (called intents or call-types) [5]. For example, if the user said something related to a billing statement in an IVR setting, the automatic call routing system should direct the call to the billing department. For intent determination (for call routing or other tasks), early work on discriminative classification algorithms for the AT&T HMIHY system [5] used Boosting [6]. In this paper, we focus on the intent determination task, specifically on call routing applications. We frame the problem in a probabilistic setting. More formally, given the sequence of words W, the most likely user intent (class label) \hat{C} is given by:

\hat{C} = \arg\max_{C} P(C \mid W)

where W is the input word sequence and \hat{C} is the most likely intent among the possible set of intents. We refer interested readers to [9] for a detailed history and overview of SLU.

Today, natural language call routing is one of the most widely adopted NLP technologies in the world, and there are hardly any large companies that do not use it for dealing with customers. The main advantage of call routing is the automation it provides for customer care, largely eliminating customer/agent interaction. As such, every small improvement in call routing accuracy matters, since users whose goal is not identified by the system require a human agent to resolve their problems. A typical call routing system is composed of two statistical components: a speech recognition system and an action classifier. The speech recognition system transcribes the speaker's speech and sends the transcription to the action classifier, which extracts the speaker's intent embodied in different call-types.
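The action classifier just described picks the most likely call-type according to P(C | W). The short Python sketch below illustrates that decision rule on a bag-of-words representation. It is purely illustrative: the vocabulary, intent labels, and the linear scoring model are invented placeholders, not the classifiers or data used in this paper.

```python
import numpy as np

# Toy vocabulary and intent set; both are illustrative, not from the paper.
VOCAB = ["bill", "payment", "flight", "cancel", "agent"]
INTENTS = ["Billing", "FindFlight", "CancelService", "TalkToAgent"]

def bag_of_words(utterance, vocab=VOCAB):
    """Binary word-presence vector (the experiments later clip word counts at 1)."""
    words = set(utterance.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocab])

def intent_posterior(x, weights, bias):
    """A stand-in linear model for P(C|W) via a softmax; any classifier fits here."""
    scores = weights @ x + bias
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

# Hypothetical trained parameters, shape (num_intents, vocab_size).
rng = np.random.default_rng(0)
W_clf = rng.normal(size=(len(INTENTS), len(VOCAB)))
b_clf = np.zeros(len(INTENTS))

x = bag_of_words("I have a question about my bill payment")
posterior = intent_posterior(x, W_clf, b_clf)
c_hat = INTENTS[int(np.argmax(posterior))]   # C_hat = arg max_C P(C | W)
print(c_hat, posterior)
```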

Each call-type triggers a different action in the system back-end. There are numerous machine learning techniques, such as Boosting [6], Maximum Entropy Modeling (MaxEnt) [21], [20] and Support Vector Machines (SVM) [7], [8], which are used as action classifiers. All of these techniques require labeled data to train a model. Quantity and quality of labeled data are the determining factors in building and deploying such systems. The complexity of the call routing task largely determines how much labeled data is needed to achieve a reasonable performance level. As the complexity of the task increases, the amount of training data required for a reasonable performance level can become large. Therefore, there are several key areas for technology improvement: 1) minimizing the amount of labeled data needed to achieve a given performance level, 2) improving the machine learning algorithms to achieve the best performance for a given amount of labeled data, and 3) exploiting unlabeled data, which are typically available in much larger quantities than labeled data, to improve the performance for a given amount of labeled data.

Neural Networks (NNets) are not new to the speech and language processing field. There have been numerous applications of NNets to speech recognition and natural language processing problems during the past two decades. Even though NNets, particularly deep nets with many hidden layers, appeared capable of modeling complex structures and dependencies in the data, they failed to live up to expectations because of the lack of effective algorithms for training such networks. Consequently, until very recently, NNets lost the battle against GMMs/HMMs for speech recognition due to larger computational demands and difficulty in parallelizing the model training compared to the GMM/HMM approach. In the NLP area, where the primary problems can be cast as classification problems, NNets fared better, but they still were not the preferred modeling approach compared to maximum entropy models, support vector machines, and boosting techniques, partly due to the difficulty of training deep networks. Moreover, SVMs and boosting have maximum margin properties with faster training algorithms. Recently, however, there has been increasing interest in Deep Belief Networks (DBNs) because of the invention of an efficient layer-by-layer learning technique. The building block of a DBN is a probabilistic model called a Restricted Boltzmann Machine (RBM), which is used to discover one layer of features at a time. To learn a DBN, RBMs are applied recursively, with the feature activations produced by one RBM acting as the data for training the next RBM in the stack. DBNs have been used as generative models of many different forms of data in such diverse areas as image classification, speech recognition and information retrieval [10], [11], [12]. Deep networks typically have higher modeling capacity than shallow networks with the same number of parameters, but they are harder to train, both as stochastic top-down generative models and as deterministic bottom-up discriminative models. For generative training, it is generally very difficult to infer the posterior distribution over the multiple layers of latent (hidden) variables. For discriminative training using backpropagation, learning can be very slow with multiple hidden layers, and overfitting can also be a serious problem. The recursive training method for DBNs solves the inference problem. The use of features found by the DBN to initialize a multilayer feed-forward neural network significantly decreases both the time taken for discriminative training and the amount of overfitting [13].

RBMs can be trained using unlabeled data, and they can learn stochastic binary features which are good for modeling the higher-order statistical structure of a dataset. Even though these features are discovered without considering the discriminative task for which they will be used, some of them are typically very useful for classification as well as for generation. A subsequent stage of discriminative fine-tuning can then slightly change the feature weights to make the network even more useful for discrimination, with much less overfitting, which otherwise can be a serious problem with purely discriminative training. This is particularly helpful when the number of labeled training examples is relatively small. In this regime, it has been shown that classifiers based on generative models can outperform discriminative classifiers, even without making use of additional unlabeled data [14].

Part of the work in this paper is presented in [15]. In this paper we pursue two lines of research suggested as future work in [15]: a) investigating the effect of using unlabeled data to train RBMs, and b) treating the DBN as a feature generator and using a separate classifier such as an SVM to perform the actual classification task. These techniques lead to clear performance improvements over both the baseline DBN and the SVM, which are largely equivalent in terms of their performance figures.

The rest of the manuscript is organized as follows: Section II provides a brief introduction to RBMs. Section III describes how to train a stack of RBMs recursively and how to use the resulting DBN to initialize a feed-forward neural network that can be discriminatively fine-tuned to optimize classification. Section IV summarizes the other widely used discriminative classifiers. Section V presents the experimental results and discussion, followed by the conclusions in Section VI.

II. RESTRICTED BOLTZMANN MACHINES

A restricted Boltzmann machine (RBM) [16] is a two-layer, undirected, bipartite graphical model where the first layer consists of observed data variables (or visible units), and the second layer consists of latent variables (or hidden units). The visible and hidden layers are fully connected via symmetric undirected weights, and there are no intra-layer connections within either the visible or the hidden layer. A typical RBM model topology is shown in Fig. 1.

Fig. 1. RBM Architecture.
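As a notational aid for what follows, here is a minimal NumPy sketch of the parameters of one such RBM. The layout mirrors the bipartite topology of Fig. 1: a single visible-hidden weight matrix shared by both directions and one bias vector per layer, with no intra-layer weight matrices. The function name and layer sizes are our own illustrative choices.

```python
import numpy as np

def init_rbm(num_visible, num_hidden, seed=0):
    """Parameters of a single binary RBM (visible layer <-> hidden layer).

    Because the graph is bipartite, the only couplings are the symmetric
    visible-hidden weights W; there are no visible-visible or
    hidden-hidden weight matrices.
    """
    rng = np.random.default_rng(seed)
    return {
        "W": rng.normal(0.0, 0.01, size=(num_visible, num_hidden)),  # small random init
        "b_vis": np.zeros(num_visible),  # visible-unit biases
        "b_hid": np.zeros(num_hidden),   # hidden-unit biases
    }

rbm = init_rbm(num_visible=5000, num_hidden=500)  # sizes are illustrative only
```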

The weights and biases of an RBM determine the energy of a joint configuration of the hidden and visible units (v, h),

E(v, h; \theta) = -\sum_{i}\sum_{j} w_{ij} v_i h_j - \sum_{i} b_i v_i - \sum_{j} a_j h_j    (1)

with model parameters \theta = \{W, b, a\}. The w_{ij} are the symmetric weight parameters, with dimensions given by the number of visible units times the number of hidden units; the b_i are the visible unit bias parameters, and the a_j are the hidden unit bias parameters. The network assigns a probability to every possible visible-hidden vector pair via the energy function,

P(v, h) = \frac{1}{Z} e^{-E(v, h)}    (2)

The normalization term or partition function, Z, is obtained by summing over all possible pairs of visible and hidden vectors,

Z = \sum_{v}\sum_{h} e^{-E(v, h)}    (3)

The probability that the model assigns to a visible vector, v, is obtained by marginalizing over the space of hidden vectors,

P(v) = \frac{1}{Z}\sum_{h} e^{-E(v, h)}    (4)

The simplest RBMs use Bernoulli-distributed units (i.e. stochastic binary units), but they can be generalized to any distribution in the exponential family [12]. However, some combinations of distributions for the visible and hidden units are very hard to train (see [17] for more details). In this paper, we restrict ourselves to binary units for all of the experiments.

The derivative of the log probability of a visible vector, v, with respect to the weights is given by

\frac{\partial \log P(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}    (5)

where the angle brackets denote the expectation with respect to the distribution specified in the subscript. Following the gradient of the log likelihood, we obtain the update rule for the weights as

\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right)    (6)

where \epsilon is the learning rate. The lack of hidden-hidden connections makes the first expectation easy to compute. Given a visible vector, v, the hidden units are conditionally independent, and the conditional distribution of hidden unit j is given by

P(h_j = 1 \mid v) = \sigma\left( a_j + \sum_{i} v_i w_{ij} \right)    (7)

where \sigma(x) is the logistic sigmoid function \sigma(x) = 1/(1 + e^{-x}). It is therefore easy to get an unbiased sample of \langle v_i h_j \rangle_{data}. Similarly, because there are no visible-visible connections, we can easily get an unbiased sample of the state of a visible unit, i, given a hidden vector, h:

P(v_i = 1 \mid h) = \sigma\left( b_i + \sum_{j} h_j w_{ij} \right)    (8)

Fig. 2. Stacking RBMs to create a deep network. This architecture is used in our experiments.

Unfortunately, it is exponentially expensive to compute \langle v_i h_j \rangle_{model} exactly, so the contrastive divergence (CD) approximation to the gradient is used, replacing \langle v_i h_j \rangle_{model} with \langle v_i h_j \rangle_{recon}, which is much easier and faster to compute [18]. \langle v_i h_j \rangle_{recon} is computed by setting the visible units to a random training vector. Then the binary states of the hidden units are computed using Eqn. (7), followed by computing the binary states of the visible units using Eqn. (8). The computed visible states are a 'reconstruction' of the original visible vector. Finally, Eqn. (7) is used once more to compute the states of the hidden units from the reconstruction. The new learning rule is a crude approximation to following the gradient of the log probability of the training data, but it works well in practice and is adequate for discovering good features.
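The following NumPy sketch spells out one CD-1 update as just described: an up-pass with Eqn. (7), a reconstruction with Eqn. (8), a second up-pass, and a weight change proportional to the difference of the two correlations, i.e. the CD stand-in for Eqn. (6). It is a bare-bones illustration under our own naming, not the authors' implementation; momentum and weight decay, which the experiments in Section V also use, are omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.1, seed=0):
    """One CD-1 step for a binary RBM on a batch of visible vectors v0 (n x V).

    Follows the procedure in the text: sample hidden states with Eqn. (7),
    reconstruct the visible units with Eqn. (8), apply Eqn. (7) once more to
    the reconstruction, and move the weights along the difference of the two
    correlations (the CD approximation to the gradient in Eqn. (6)).
    """
    rng = np.random.default_rng(seed)

    # Up-pass on the data: hidden probabilities and sampled binary states.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Down-pass: 'reconstruction' of the visible vector.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)

    # Up-pass on the reconstruction (probabilities suffice for the statistics).
    p_h1 = sigmoid(v1 @ W + b_hid)

    n = v0.shape[0]
    grad_W = (v0.T @ p_h0 - v1.T @ p_h1) / n       # <v h>_data - <v h>_recon
    grad_b_vis = (v0 - v1).mean(axis=0)
    grad_b_hid = (p_h0 - p_h1).mean(axis=0)

    return W + lr * grad_W, b_vis + lr * grad_b_vis, b_hid + lr * grad_b_hid
```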

III. LEARNING AND USING DEEP BELIEF NETWORKS

After training the network consisting of the visible layer and the first hidden layer, which we will refer to as RBM1, its learned parameters, \theta_1, define P(v|h; \theta_1), P(h|v; \theta_1), P(v, h; \theta_1), and P(v; \theta_1) via Eqns. (7) and (8). The parameters of RBM1 also define a prior distribution over hidden vectors, P(h; \theta_1), which is obtained by marginalizing over the space of visible vectors. This allows P(v; \theta_1) to be written as

P(v; \theta_1) = \sum_{h} P(h; \theta_1) P(v \mid h; \theta_1)    (9)

The idea behind training a DBN by training a stack of RBMs (as shown in Fig. 2) is to keep the P(v|h; \theta_1) defined by RBM1, but to improve P(v) by replacing P(h; \theta_1) with a better prior over the hidden vectors. To improve P(v), this better prior must have a smaller KL divergence than P(h; \theta_1) from the "aggregated posterior", which is the equally weighted mixture of the posterior distributions over the hidden vectors of RBM1 on all N of the training cases:

\frac{1}{N} \sum_{n=1}^{N} P(h \mid v_n; \theta_1)    (10)

The analogous statement for Gaussian mixture models is that the updated mixing proportion of a component should be closer to the average posterior probability of that component over all training cases.

Now consider training RBM2, which is the network formed by using the samples from the aggregated posterior of RBM1 as training data. It is easy to ensure that the distribution which RBM2 defines over its visible units is identical to P(h; \theta_1): we simply initialize RBM2 to be an upside-down version of RBM1 in which the roles of visible and hidden units have been swapped. So RBM2 has h as its visible vector and h_2 as its hidden vector. Then we train RBM2, which makes P(h; \theta_2) a better model of the aggregated posterior than P(h; \theta_1).

After training RBM2, we can combine the two RBMs to create a hybrid of a directed and an undirected model. P(h; \theta_2) is defined by the undirected RBM2, but P(v|h; \theta_1) is defined by directed connections from the first hidden layer to the visible units. In this hybrid model, which we call a deep belief net, exact inference of P(h|v; \theta_1, \theta_2) is no longer easy because the prior over the hidden vectors is no longer defined by \theta_1. However, it is proved in [19] that if we perform approximate inference for the first hidden layer by using Eqn. (7), there is a variational lower bound on the log probability of the training data that is improved every time we add another layer to the DBN, provided we add it in the appropriate way.

After training a stack of RBMs, the bottom-up recognition weights of the resulting DBN can be used to initialize the weights of a multi-layer feed-forward neural network, which can then be discriminatively fine-tuned by backpropagating error derivatives. The feed-forward network is given a final "softmax" layer that computes a probability distribution over class labels, and the derivative of the log probability of the correct class is backpropagated to train the incoming weights of the final layer and to discriminatively fine-tune the weights in all lower layers.

Deep belief networks (DBNs) have yielded impressive classification performance on several benchmark classification tasks, beating the state of the art in several cases [11]. In principle, adding more layers improves modeling power, unless the DBN already perfectly models the data. In practice, however, little is gained by using more than about 3 hidden layers. We use the architecture shown in Fig. 3. It has three hidden layers that are pre-trained, one at a time, as the hidden layers in a stack of three RBMs, without making any use of the class labels.

Fig. 3. Stacked RBMs (see Fig. 2) are first trained using labeled and unlabeled data, and the learned parameters are then used to obtain higher-level features. These higher-level features, in conjunction with the original input feature vector, are used to train an SVM classifier. This classifier is then used during evaluation.

It is worth mentioning that the softmax output layer of a neural network is the same as a MaxEnt classifier: in other words, a neural network is a MaxEnt classifier in which the feature functions are learned.
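As a schematic summary of this section, the sketch below pre-trains a stack of RBMs greedily (each layer is trained on the hidden activities produced by the layer below, without class labels), then uses the learned weights to initialize a feed-forward network with a softmax output and performs one step of discriminative fine-tuning by backpropagation. It is a simplified illustration with our own function names and toy hyperparameters, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cd1(X, num_hidden, epochs=10, lr=0.05, seed=0):
    """Compact CD-1 pretraining of one binary RBM; returns (W, hidden_bias)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 0.01, size=(d, num_hidden))
    b_vis, b_hid = np.zeros(d), np.zeros(num_hidden)
    for _ in range(epochs):
        p_h0 = sigmoid(X @ W + b_hid)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b_vis)            # reconstruction
        p_h1 = sigmoid(p_v1 @ W + b_hid)
        W += lr * (X.T @ p_h0 - p_v1.T @ p_h1) / n
        b_vis += lr * (X - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_hid

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise pretraining: each RBM is trained on the hidden
    activities produced by the RBM below it, with no use of class labels."""
    stack, data = [], X
    for size in layer_sizes:
        W, b = cd1(data, size)
        stack.append((W, b))
        data = sigmoid(data @ W + b)                # up-pass feeds the next RBM
    return stack

def finetune_step(X, y_onehot, stack, W_out, b_out, lr=0.1):
    """One SGD step of discriminative fine-tuning.

    Forward pass through sigmoid layers initialized from the RBM stack and a
    softmax output layer; the derivative of the log probability of the correct
    class is backpropagated to adjust the output layer and all lower layers.
    """
    acts = [X]
    for W, b in stack:
        acts.append(sigmoid(acts[-1] @ W + b))
    probs = softmax(acts[-1] @ W_out + b_out)

    delta = (probs - y_onehot) / X.shape[0]         # gradient at the softmax input
    new_W_out = W_out - lr * (acts[-1].T @ delta)
    new_b_out = b_out - lr * delta.sum(axis=0)

    # Backpropagate through the pretrained layers (sigmoid derivative a*(1-a)).
    delta_h = (delta @ W_out.T) * acts[-1] * (1.0 - acts[-1])
    new_stack = []
    for i in reversed(range(len(stack))):
        W, b = stack[i]
        gW, gb = acts[i].T @ delta_h, delta_h.sum(axis=0)
        if i > 0:
            delta_h = (delta_h @ W.T) * acts[i] * (1.0 - acts[i])
        new_stack.append((W - lr * gW, b - lr * gb))
    new_stack.reverse()
    return new_stack, new_W_out, new_b_out
```

In the experiments of Section V, three such hidden layers are pre-trained with CD-1 and the resulting network is fine-tuned with SGD, with early stopping on the development set.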

IV. TRADITIONAL CLASSIFIERS

A. Maximum Entropy

The Maximum Entropy (MaxEnt) method is a flexible statistical modeling framework that has been widely used in many areas of natural language processing [20]. MaxEnt-based classifiers do not assume statistical independence of the features that are used as predictors. As such, they allow the combination of multiple overlapping information sources [21], [20]. The information sources are combined as follows:

P(C \mid W) = \frac{\exp\left( \sum_i \lambda_i f_i(C, W) \right)}{\sum_{C'} \exp\left( \sum_i \lambda_i f_i(C', W) \right)}    (11)

which describes the probability of a particular class C (e.g. call-types) given the word sequence W spoken by the caller; the \lambda_i are the learned feature weights. Notice that the denominator includes a sum over all classes C', which is essentially a normalization factor for probabilities to sum to 1. The f_i are indicator functions, or features, which are "activated" based on computable properties of the word sequence, for example if a particular word or word pair appears, or if the parse tree contains a particular tag, etc. The MaxEnt models are trained using the improved iterative scaling algorithm [21] with Gaussian prior smoothing [20], using a single universal variance parameter of 2.0.

B. Boosting

Boosting is a method that can be used in conjunction with many learning algorithms to improve the accuracy of the learning algorithm. The idea of Boosting is to produce an accurate prediction rule by combining many moderately inaccurate (weak) rules into a single classifier. At each iteration, boosting adds a new (weak) prediction rule that focuses on samples that are incorrectly classified by the current combined predictor. Even though Boosting is known to be sensitive to noisy data and outliers, in some problems it is less susceptible to overfitting than most machine learning algorithms. We used a specific implementation of Boosting, AdaBoost using decision stumps, which is described in [6]. Boosting has been applied to a number of natural language processing tasks in the past [9].

C. Support Vector Machines

Support vector machines (SVMs) are supervised learning methods used for classification. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary classifier.

SVMs are derived from the theory of structural risk minimization [7]. SVMs learn the boundaries between samples of the two classes by mapping these sample points into a higher-dimensional space. SVMs construct a hyperplane or a set of hyperplanes in a high-dimensional space, which can be used for classification. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the "functional margin"), since in general the larger the margin, the lower the generalization error of the classifier. The hyperplane separating these regions is found by maximizing the margin between the closest sample points belonging to competing classes. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. Much of the flexibility and classification power of SVMs resides in the choice of kernel. Some of the commonly used kernels are linear, polynomial and radial basis functions. In this work, we chose linear kernels to train the SVM, since they are computationally faster than other kernels, yet there is no significant difference in performance for the current task. This is a fairly standard result for applying SVMs in natural language processing, since we are already using a high-dimensional feature vector.

V. EXPERIMENTAL RESULTS AND DISCUSSION

The call-routing task considered in this paper is from a call-center customer hotline that gives technical assistance for a Fortune-500 company [22]. The call-routing system selects one of 35 call-types. The training data has 27 K automatically transcribed utterances amounting to 178 K words. This data is split into sets containing {1 K, 2 K, 3 K, 4 K, 5 K, 6 K, 7 K, 8 K, 9 K, 10 K} and 27 K utterances, respectively. These sets will be referred to in a similar fashion. The purpose of this split is to investigate various training data sizes and their effects on the learning methods. We also have two separate datasets containing about 3.2 K and 5.6 K sentences that are used as development and test data, respectively. All of these datasets are hand-labeled with call-types. In all the classification methods employed here we used vectors of individual word counts as the inputs to the models. For the DBNs, the counts were clipped at 1 to allow them to be modeled by binary units.

In our experiments with the development data we found that the hidden layer sizes we settled on provided slightly better results than the other hidden layer sizes that we tried. The model architecture is shown in Fig. 3. The individual RBM models were trained in an unsupervised fashion using contrastive divergence learning with 1 step of Gibbs sampling (CD-1). The training phase made 100 passes (epochs) through the training dataset. The weights of each RBM were initialized with small random values sampled from a zero-mean normal distribution with standard deviation 0.01 and updated using a learning rate of 0.01/batch-size, momentum of 0.9, and a weight decay of 0.001.

For the discriminative fine-tuning, we use stochastic gradient descent (SGD), and we set the number of iterations by using early stopping according to the validation set classification error. To reduce computation time, we select the SGD learning rate, momentum parameter and other parameters by maximizing the accuracy on the development set.

TABLE I. PACKAGE SHIPMENT TASK: ACCURACY FOR TRADITIONAL AND DBN BASED CLASSIFIERS

In Table I, we present the results on the test data for SVMs, MaxEnt, Boosting and DBNs. Various classifier parameters (e.g. smoothing priors for MaxEnt learning, and kernel selection for SVMs) are tuned on the development data. Each classifier is trained using the amount of labeled data given in the first column. Looking first at the traditional classifiers, we notice that the SVM classifier obtained 77.8% accuracy using 1 K labeled data. The corresponding figures for the MaxEnt classifier and the Boosting-based classifier are 76.0% and 79.6%, respectively. Not only for 1 K labeled data but also for 2 K and 3 K data, Boosting provides the best performance. However, for larger amounts of training data, the SVM consistently outperformed both MaxEnt and Boosting, which is in agreement with other studies [22]. The DBN (4th column) performed as well as or slightly better than SVMs for all sizes of training set. When trained on all of the training data, they had identical performance, achieving 90.3% accuracy.

In this paper we pursued two of the three future research directions suggested in [15]. The first extension was using additional unlabeled data to train the RBMs, since typically there is a lot more unlabeled data available than labeled data. In our experiments, for the smaller chunks of labeled data, the entire 27 K labeled dataset is treated as unlabeled data to train the DBN. For example, when 1 K labeled data is used to train the DBN, we used 27 K to train the corresponding RBMs. We repeated the same steps with the different amounts of labeled data given in Table I. The second direction of research was to treat the DBN as a feature extractor and use these features as input to a separate classifier. We first trained a DBN and then, for each utterance, we generated the activity at the top layer. This activity, along with the original features, was concatenated and used as input to an SVM classifier. Fig. 3 shows the schematics of the setup.

We provide additional experimental results for three scenarios: a) using additional unlabeled data to train the RBMs (DBN-1), b) using DBN-learned features as additional input features to the SVM classifier (DBN-2), and c) combining the previous two scenarios (DBN-3). Using additional unlabeled data provided large gains when the ratio of unlabeled to labeled data size is large, as shown in the DBN-1 column of Table I. For example, when we have 27 K unlabeled data to train the RBMs but only 2 K labeled data to fine-tune the DBNs, the gain is 1.1%. Likewise, when the labeled data is 3 K, the gain is 0.9%.
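A minimal sketch of the DBN-2 feature-combination setup just described (see also Fig. 3): a deterministic up-pass through the pre-trained stack yields top-layer activations, which are concatenated with the original binary word-count features and fed to a linear SVM. The function names are ours, and scikit-learn's LinearSVC is assumed as a stand-in for the linear-kernel SVM used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumes scikit-learn is available

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_top_features(X_binary, stack):
    """Deterministic up-pass through pre-trained RBM layers.

    `stack` is a list of (W, hidden_bias) pairs from greedy pretraining;
    the activities of the top layer serve as the learned features for
    each utterance.
    """
    h = X_binary
    for W, b in stack:
        h = sigmoid(h @ W + b)
    return h

def train_dbn2_svm(X_binary, labels, stack):
    """DBN-2: concatenate the original binary word-count features with the
    DBN's top-layer activations and train a linear SVM on the result."""
    X_combined = np.hstack([X_binary, dbn_top_features(X_binary, stack)])
    clf = LinearSVC()   # linear decision function, matching the paper's kernel choice
    clf.fit(X_combined, labels)
    return clf
```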

However, as the ratio of labeled to unlabeled data gets larger, we do not observe gains from using additional unlabeled data. We note that the amount of unlabeled data considered here is fairly small. In many applications, however, the amount of unlabeled data can be substantially larger than the labeled data. It is one of our future research directions to investigate using substantially larger amounts of unlabeled data to train RBMs in a separate application.

In the table we also show feature combination results, where DBN-learned features are combined with the original features (DBN-2) as input to an SVM classifier. The results indicate that we get consistent gains when DBN-based features are combined with the original features, across all labeled data sizes. Finally, we combine DBN-based features, where the RBMs are trained with a large (relative to the labeled data) collection of unlabeled data, with the original features using an SVM classifier. This set-up is called DBN-3, and the results are given in the last column of Table I. The results show that DBN-3 improves the call routing performance consistently across all data sizes, with the exception of the 1 K data size, where Boosting performs better. For smaller amounts of labeled data the performance improvements over the SVM are significant. For example, 0.8%, 1.9%, 1.2%, 1.3% and 1.2% absolute improvements are obtained for 1 K through 5 K labeled data amounts. The improvements were smaller but consistent all the way to 27 K labeled data. The performance gains come largely from using the unlabeled data to train the RBMs when the labeled data size is small. The results indicate that the gains for DBN-1 and DBN-2 are approximately additive.

We also investigated whether binarization of the features for DBNs gives them an advantage, by also testing the SVM classifier with binarized word-count features. These n-gram features are formed based on the existence of the features, regardless of the actual counts with which they are observed in the sentence. About 15% of the sentences had n-gram features with a count of two or more. However, classification results across all data sizes show that the feature binarization did not change the SVM performance (the changes were in the second decimal).

VI. CONCLUSION AND FUTURE WORK

This work presented a successful application of Deep Belief Nets (DBNs) to a natural language call-routing task. DBNs use unsupervised learning to discover multiple layers of features that are then used in a feed-forward neural network and fine-tuned to optimize discrimination. When the amount of training data is limited, unsupervised feature discovery makes DBNs less prone to overfitting than feed-forward neural networks initialized with random weights, and it also makes it easier to train neural networks with many hidden layers.

DBNs produce better classification results than several other widely used learning techniques, outperforming Maximum Entropy and Boosting based classifiers. Their performance is almost identical to SVMs, which are the best of the other techniques that we investigated.

We further extended our initial work by treating DBNs as feature generators to capture and model the underlying structure in the input data. The learned features are used in conjunction with the original inputs to do classification using an SVM. We also leveraged additional unlabeled data to improve the modeling performance. Both of these extensions resulted in additional improvements in call-routing classification performance. In the future, we plan to consider DBNs for sequence tagging for slot detection and entity tagging in spoken language understanding.

REFERENCES

[1] P. J. Price, "Evaluation of spoken language systems: The ATIS domain," in Proc. DARPA Workshop Speech Nat. Lang., Hidden Valley, PA, USA, Jun. 1990.
[2] Y.-Y. Wang and A. Acero, "Discriminative models for spoken language understanding," in Proc. ICSLP, Pittsburgh, PA, USA, Sep. 2006.
[3] C. Raymond and G. Riccardi, "Generative and discriminative algorithms for spoken language understanding," in Proc. Interspeech, Antwerp, Belgium, 2007.
[4] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. Int. Conf. Mach. Learn., 2001.
[5] A. L. Gorin, G. Riccardi, and J. H. Wright, "How may I help you?," Speech Commun., vol. 23, pp. 113–127, 1997.
[6] R. E. Schapire and Y. Singer, "BoosTexter: A boosting-based system for text categorization," Mach. Learn., vol. 39, no. 2/3, pp. 135–168, 2000.
[7] V. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag, 1995.
[8] P. Haffner, G. Tur, and J. Wright, "Optimizing SVMs for complex call classification," in Proc. ICASSP, Hong Kong, Apr. 2003, pp. 632–635.
[9] G. Tur and R. De Mori, Eds., Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. New York, NY, USA: Wiley, 2011.
[10] G. E. Hinton, "Learning multiple layers of representation," Trends Cognit. Sci., vol. 11, no. 10, pp. 428–434, 2007.
[11] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, "Phone recognition with the mean-covariance restricted Boltzmann machine," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2010.
[12] M. Welling, M. Rosen-Zvi, and G. E. Hinton, "Exponential family harmoniums with an application to information retrieval," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2005, pp. 1481–1488.
[13] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, and P. Vincent, "Why does unsupervised pre-training help deep learning?," J. Mach. Learn. Res., vol. 11, pp. 625–660, 2010.
[14] A. Y. Ng and M. I. Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2002, vol. 14.
[15] R. Sarikaya, G. Hinton, and B. Ramabhadran, "Deep belief networks for natural language call-routing," in Proc. ICASSP, 2011, pp. 5680–5683.
[16] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Comput., vol. 14, pp. 1771–1800, 2002.
[17] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Univ. of Toronto Mach. Learn. Tech. Rep., UTML TR 2010-003, 2010.
[18] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002.
[19] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[20] S. F. Chen and R. Rosenfeld, "A survey of smoothing techniques for ME models," IEEE Trans. Speech Audio Process., vol. 8, no. 1, pp. 37–50, Jan. 2001.
[21] S. Della Pietra, V. Della Pietra, and J. Lafferty, "Inducing features of random fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 4, pp. 380–393, Apr. 1997.
[22] R. Sarikaya, H.-K. J. Kuo, V. Goel, and Y. Gao, "Exploiting unlabeled data using multiple classifiers for improved natural language call-routing," in Proc. Interspeech, Lisbon, Portugal, Sep. 2005.

Ruhi Sarikaya is a principal scientist and the manager of the language understanding and dialog systems group at Microsoft. He was a research staff member and team lead in the Human Language Technologies Group at IBM T.J. Watson Research Center for ten years. Prior to joining IBM in 2001 he was a researcher at the Center for Spoken Language Research (CSLR) at the University of Colorado at Boulder for two years. He also spent the summer of 1999 at the Panasonic Speech Technology Laboratory, Santa Barbara, CA. He received the B.S. degree from Bilkent University, Turkey in 1995, the M.S. degree from Clemson University, SC in 1997 and the Ph.D. degree from Duke University, NC in 2001, all in electrical and computer engineering. He has published over 70 technical papers in refereed journal and conference proceedings and is the inventor of 25 patents in the area of speech and natural language processing. At IBM he received several prestigious awards for his work, including two Outstanding Technical Achievement Awards (2005 and 2008) and two Research Division Awards (2005 and 2007). Dr. Sarikaya has served as the general co-chair of IEEE SLT 2012, publicity chair of IEEE ASRU 2005 and as an associate editor of IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING and IEEE SIGNAL PROCESSING LETTERS. He also served as the lead guest editor of the special issue on Processing Morphologically-Rich Languages for IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING and gave a tutorial on Processing Morphologically Rich Languages at Interspeech 2007. His past and present research interests span all aspects of speech and language processing, including natural language processing, spoken dialog systems, speech recognition, machine translation, machine learning, speech-to-speech translation, speaker identification/verification, digital signal processing and statistical modeling. Dr. Sarikaya is a senior member of IEEE and a member of ACL and ISCA.

Geoffrey Hinton received his Ph.D. degree in Artificial Intelligence from the University of Edinburgh in 1978. He spent five years as a faculty member at Carnegie Mellon University, Pittsburgh, Pennsylvania, and he is currently a Distinguished Professor at the University of Toronto and a Distinguished Researcher at Google. He is a fellow of the Royal Society and an honorary foreign member of the American Academy of Arts and Sciences. His awards include the David E. Rumelhart Prize, the International Joint Conference on Artificial Intelligence Research Excellence Award, the Killam Prize for Engineering and the Gerhard Herzberg Canada Gold Medal for Science and Engineering. He was one of the researchers who introduced the back-propagation algorithm. His other contributions include Boltzmann machines, distributed representations, time-delay neural nets, mixtures of experts, variational learning, contrastive divergence learning, and Deep Belief Nets.

Anoop Deoras is a research scientist at Microsoft. He received the B.E. degree in Electronics and Telecommunication Engineering from the College of Engineering, Pune, India in 2003, the M.S. degree in Applied Mathematics & Statistics in 2010, and the M.S. and Ph.D. degrees in Electrical & Computer Engineering from Johns Hopkins University in 2011. He is interested in applying machine learning techniques to speech recognition and spoken language understanding. In his Ph.D. thesis, he investigated several decoding techniques for incorporating complex and long-span language models, such as recurrent neural network language models, into an automatic speech recognition setup. He is a member of IEEE, ISCA and ACL.