Recurrent Neural Network for Text Classification with Multi-Task Learning


Pengfei Liu Xipeng Qiu∗ Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{pfliu14,xpqiu,xjhuang}@fudan.edu.cn
arXiv:1605.05101v1 [cs.CL] 17 May 2016

∗ Corresponding author.

Abstract

Neural network based methods have made great progress on a variety of natural language processing tasks. However, in most previous work, the models are learned with single-task supervised objectives, which often suffer from insufficient training data. In this paper, we use the multi-task learning framework to jointly learn across multiple related tasks. Based on recurrent neural networks, we propose three different mechanisms of sharing information to model text with task-specific and shared layers. The entire network is trained jointly on all these tasks. Experiments on four benchmark text classification tasks show that our proposed models can improve the performance of a task with the help of other related tasks.

1 Introduction

Distributed representations of words have been widely used in many natural language processing (NLP) tasks. Following this success, there is rising interest in learning distributed representations of larger spans of continuous text, such as phrases, sentences, paragraphs and documents [Socher et al., 2013; Le and Mikolov, 2014; Kalchbrenner et al., 2014; Liu et al., 2015a]. The primary role of these models is to represent a variable-length sentence or document as a fixed-length vector. A good representation of variable-length text should fully capture the semantics of natural language.

Deep neural network (DNN) based methods usually need a large-scale corpus due to their large number of parameters; it is hard to train a network that generalizes well with limited data. However, it is extremely expensive to build large-scale resources for some NLP tasks. To deal with this problem, these models often involve an unsupervised pre-training phase, and the final model is fine-tuned with respect to a supervised training criterion with gradient-based optimization. Recent studies have demonstrated significant accuracy gains in several NLP tasks [Collobert et al., 2011] with the help of word representations learned from large unannotated corpora. Most pre-training methods are based on unsupervised objectives such as word prediction [Collobert et al., 2011; Turian et al., 2010; Mikolov et al., 2013]. This unsupervised pre-training is effective for improving the final performance, but it does not directly optimize the desired task.

Multi-task learning utilizes the correlation between related tasks to improve classification by learning tasks in parallel. Motivated by the success of multi-task learning [Caruana, 1997], several neural network based NLP models [Collobert and Weston, 2008; Liu et al., 2015b] utilize multi-task learning to jointly learn several tasks with the aim of mutual benefit. The basic multi-task architecture of these models is to share some lower layers that determine common features. After the shared layers, the remaining layers are split into the multiple specific tasks.

In this paper, we propose three different models for sharing information with recurrent neural networks (RNN). All the related tasks are integrated into a single system which is trained jointly. The first model uses just one shared layer for all the tasks. The second model uses different layers for different tasks, but each layer can read information from the other layers. The third model not only assigns one specific layer to each task, but also builds a shared layer for all the tasks. Besides, we introduce a gating mechanism to enable the model to selectively utilize the shared information. The entire network is trained jointly on all these tasks.

Experimental results on four text classification tasks show that jointly learning multiple related tasks can improve the performance of each task relative to learning them separately.

Our contributions are two-fold:

• First, we propose three multi-task architectures for RNN. Although the idea of multi-task learning is not new, our work is novel in integrating RNN into the multi-task learning framework, which learns to map arbitrary text into semantic vector representations with both task-specific and shared layers.

• Second, we demonstrate strong results on several text classification tasks. Our multi-task models outperform most state-of-the-art baselines.
2 Recurrent Neural Network for Specific-Task Text Classification

The primary role of the neural models is to represent the variable-length text as a fixed-length vector. These models generally consist of a projection layer that maps words, sub-word units or n-grams to vector representations (often trained beforehand with unsupervised methods), and then combine them with different neural network architectures.

There are several kinds of models for modeling text, such as the Neural Bag-of-Words (NBOW) model, recurrent neural networks (RNN) [Chung et al., 2014], recursive neural networks (RecNN) [Socher et al., 2012; Socher et al., 2013] and convolutional neural networks (CNN) [Collobert et al., 2011; Kalchbrenner et al., 2014]. These models take as input the embeddings of the words in the text sequence, and summarize its meaning with a fixed-length vectorial representation.

Among them, recurrent neural networks (RNN) are one of the most popular architectures used in NLP problems because their recurrent structure is very suitable for processing variable-length text.

[Figure 1: Recurrent Neural Network for Classification. The RNN unfolds over inputs x_1, ..., x_T; the final hidden state h_T is fed to a softmax layer that outputs y.]

2.1 Recurrent Neural Network

A recurrent neural network (RNN) [Elman, 1990] is able to process a sequence of arbitrary length by recursively applying a transition function to its internal hidden state vector h_t for the input sequence. The activation of the hidden state h_t at time-step t is computed as a function f of the current input symbol x_t and the previous hidden state h_{t−1}:

h_t = 0,                  t = 0
h_t = f(h_{t−1}, x_t),    otherwise        (1)

It is common to use the state-to-state transition function f as the composition of an element-wise nonlinearity with an affine transformation of both x_t and h_{t−1}.

Traditionally, a simple strategy for modeling a sequence is to map the input sequence to a fixed-sized vector using one RNN, and then to feed the vector to a softmax layer for classification or other tasks [Cho et al., 2014].

Unfortunately, a problem with RNNs with transition functions of this form is that, during training, components of the gradient vector can grow or decay exponentially over long sequences [Hochreiter et al., 2001; Hochreiter and Schmidhuber, 1997]. This problem of exploding or vanishing gradients makes it difficult for the RNN model to learn long-distance correlations in a sequence.

The long short-term memory network (LSTM) was proposed by [Hochreiter and Schmidhuber, 1997] to specifically address this issue of learning long-term dependencies. The LSTM maintains a separate memory cell inside it that updates and exposes its content only when deemed necessary. A number of minor modifications to the standard LSTM unit have been made. While there are numerous LSTM variants, here we describe the implementation used by Graves [2013].

We define the LSTM units at each time step t to be a collection of vectors in R^d: an input gate i_t, a forget gate f_t, an output gate o_t, a memory cell c_t and a hidden state h_t, where d is the number of LSTM units. The entries of the gating vectors i_t, f_t and o_t are in [0, 1]. The LSTM transition equations are the following:

i_t = σ(W_i x_t + U_i h_{t−1} + V_i c_{t−1}),    (2)
f_t = σ(W_f x_t + U_f h_{t−1} + V_f c_{t−1}),    (3)
o_t = σ(W_o x_t + U_o h_{t−1} + V_o c_t),        (4)
c̃_t = tanh(W_c x_t + U_c h_{t−1}),              (5)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t,                (6)
h_t = o_t ⊙ tanh(c_t),                          (7)

where x_t is the input at the current time step, σ denotes the logistic sigmoid function and ⊙ denotes element-wise multiplication. Intuitively, the forget gate controls the amount by which each unit of the memory cell is erased, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state.

2.2 Task-Specific Output Layer

In a single specific task, a simple strategy is to map the input sequence to a fixed-sized vector using one RNN, and then to feed the vector to a softmax layer for classification or other tasks.

Given a text sequence x = {x_1, x_2, ..., x_T}, we first use a lookup layer to get the vector representation (embedding) x_i of each word x_i. The output at the last moment, h_T, can be regarded as the representation of the whole sequence; it is followed by a fully connected layer and a softmax non-linear layer that predicts the probability distribution over classes. Figure 1 shows the unfolded RNN structure for text classification.

The parameters of the network are trained to minimise the cross-entropy of the predicted and true distributions:

L(ŷ, y) = − Σ_{i=1}^{N} Σ_{j=1}^{C} y_i^j log(ŷ_i^j),    (8)

where y_i^j is the ground-truth label, ŷ_i^j is the prediction probability, N denotes the number of training samples and C is the number of classes.
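To make the single-task setup concrete, the following sketch implements the Graves-style LSTM transitions of Eqs. (2)-(7) and the softmax output with the cross-entropy loss of Eq. (8) in plain NumPy. It is a forward pass only (no backpropagation), and the class and helper names are our own illustration rather than the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LSTMClassifier:
    """Single-task LSTM text classifier following Eqs. (2)-(8)."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=50, num_classes=2, seed=0):
        rng = np.random.RandomState(seed)
        init = lambda *shape: rng.uniform(-0.1, 0.1, shape)   # U(-0.1, 0.1) init (Sec. 5.2)
        self.E = init(vocab_size, embed_dim)                  # word embedding lookup table
        # W* act on x_t, U* on h_{t-1}; V* are the peephole weights on the memory cell.
        self.W = {g: init(hidden_dim, embed_dim) for g in "ifoc"}
        self.U = {g: init(hidden_dim, hidden_dim) for g in "ifoc"}
        self.V = {g: init(hidden_dim, hidden_dim) for g in "ifo"}
        self.W_out = init(num_classes, hidden_dim)            # softmax output layer
        self.b_out = np.zeros(num_classes)

    def step(self, x_t, h_prev, c_prev):
        i = sigmoid(self.W["i"] @ x_t + self.U["i"] @ h_prev + self.V["i"] @ c_prev)  # Eq. (2)
        f = sigmoid(self.W["f"] @ x_t + self.U["f"] @ h_prev + self.V["f"] @ c_prev)  # Eq. (3)
        c_tilde = np.tanh(self.W["c"] @ x_t + self.U["c"] @ h_prev)                   # Eq. (5)
        c = f * c_prev + i * c_tilde                                                  # Eq. (6)
        o = sigmoid(self.W["o"] @ x_t + self.U["o"] @ h_prev + self.V["o"] @ c)       # Eq. (4)
        h = o * np.tanh(c)                                                            # Eq. (7)
        return h, c

    def forward(self, token_ids):
        h = np.zeros(self.W_out.shape[1])
        c = np.zeros(self.W_out.shape[1])
        for t in token_ids:                      # scan the word sequence left to right
            h, c = self.step(self.E[t], h, c)
        logits = self.W_out @ h + self.b_out     # h_T represents the whole sequence
        e = np.exp(logits - logits.max())
        return e / e.sum()                       # softmax over the C classes

def cross_entropy(y_onehot, y_hat):
    """Eq. (8) for one example; summing over all examples gives the full loss."""
    return -np.sum(y_onehot * np.log(y_hat + 1e-12))
```

Note that the output gate is computed after the cell update because Eq. (4) reads the current cell state c_t rather than c_{t−1}. For example, `LSTMClassifier(vocab_size=18000, num_classes=5).forward([3, 17, 42])` returns a length-5 probability vector.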
3 Three Sharing Models for RNN based Multi-Task Learning

Most existing neural network methods are based on supervised training objectives on a single task [Collobert et al., 2011; Socher et al., 2013; Kalchbrenner et al., 2014]. These methods often suffer from limited amounts of training data. To deal with this problem, these models often involve an unsupervised pre-training phase. This unsupervised pre-training is effective for improving the final performance, but it does not directly optimize the desired task.

Motivated by the success of multi-task learning [Caruana, 1997], we propose three multi-task models to leverage supervised data from many related tasks. Deep neural models are well suited for multi-task learning since the features learned for one task may be useful for other tasks. Figure 2 gives an illustration of our proposed models.

[Figure 2: Three architectures for modelling text with multi-task learning. (a) Model-I: Uniform-Layer Architecture; (b) Model-II: Coupled-Layer Architecture; (c) Model-III: Shared-Layer Architecture.]

Model-I: Uniform-Layer Architecture In Model-I, the different tasks share one LSTM layer and one shared embedding layer, in addition to their own task-specific embedding layers.

For task m, the input x̂_t consists of two parts:

x̂_t^(m) = x_t^(m) ⊕ x_t^(s),    (9)

where x_t^(m) and x_t^(s) denote the task-specific and shared word embeddings respectively, and ⊕ denotes the concatenation operation.

The LSTM layer is shared by all tasks. The final sequence representation for task m is the output of the LSTM at step T:

h_T^(m) = LSTM(x̂^(m)).    (10)
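As an illustration of Model-I, the sketch below shows the forward pass of Eqs. (9) and (10) under assumed interfaces: `E_task` and `E_shared` are embedding lookup tables, `shared_lstm` stands for any encoder implementing the transitions of Eqs. (2)-(7) (e.g. the loop in the earlier sketch) and returning the final state h_T, and `W_out`/`b_out` are the task-specific softmax parameters; all of these names are hypothetical.

```python
import numpy as np

def uniform_layer_forward(token_ids, task, E_task, E_shared, shared_lstm, W_out, b_out):
    """Model-I forward pass: shared + task-specific embeddings, one shared LSTM."""
    # Eq. (9): concatenate the task-specific and shared embedding of each word.
    inputs = [np.concatenate([E_task[task][t], E_shared[t]]) for t in token_ids]
    h_T = shared_lstm(inputs)                    # Eq. (10): final state of the shared LSTM
    logits = W_out[task] @ h_T + b_out[task]     # task-specific softmax output layer
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

With the 64-dimensional embeddings used in the paper, each LSTM input is thus a 128-dimensional concatenated vector, while the output layer remains private to each task.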

Model-II: Coupled-Layer Architecture In Model-II, we assign an LSTM layer to each task, which can use the information from the LSTM layer of the other task.

Given a pair of tasks (m, n), each task has its own LSTM as in the task-specific model. We denote the outputs at step t of the two coupled LSTM layers by h_t^(m) and h_t^(n).

To better control the signals flowing from one task to the other, we use a global gating unit which endows the model with the capability of deciding how much information it should accept. We re-define Eq. (5), and the new memory content of the LSTM for the m-th task is computed by:

c̃_t^(m) = tanh( W_c^(m) x_t + Σ_{i∈{m,n}} g^(i→m) U_c^(i→m) h_{t−1}^(i) ),    (11)

where g^(i→m) = σ(W_g^(m) x_t + U_g^(i) h_{t−1}^(i)). The other settings are the same as in the standard LSTM.

This model can be used to jointly learn every pair of tasks. We obtain two task-specific representations h_T^(m) and h_T^(n) for tasks m and n respectively.

Model-III: Shared-Layer Architecture Model-III also assigns a separate LSTM layer to each task, but introduces a bidirectional LSTM layer to capture the shared information for all the tasks.

We denote the outputs of the forward and backward LSTMs at step t as →h_t^(s) and ←h_t^(s) respectively. The output of the shared layer is h_t^(s) = →h_t^(s) ⊕ ←h_t^(s).

To enhance the interaction between the task-specific layers and the shared layer, we use a gating mechanism to endow the neurons in a task-specific layer with the ability to accept or refuse the information passed by the neurons in the shared layer. Unlike Model-II, we compute the new memory content of the LSTM as follows:

c̃_t^(m) = tanh( W_c^(m) x_t + g^(m) U_c^(m) h_{t−1}^(m) + g^(s→m) U_c^(s) h_t^(s) ),    (12)

where g^(m) = σ(W_g^(m) x_t + U_g^(m) h_{t−1}^(m)) and g^(s→m) = σ(W_g^(m) x_t + U_g^(s→m) h_t^(s)).
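The gated candidate memory of Eqs. (11) and (12) can be sketched as below. Since the extracted equations do not show the product symbol explicitly, we read the gates as element-wise masks on the transformed hidden states, which is one natural interpretation; the parameter names in the dictionary `p` are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_candidate(x_t, h_prev_m, h_other, p):
    """Candidate memory c~_t^(m) with a gated read of an external state (Eq. (12)).

    `h_other` is the shared layer's output h_t^(s) for Model-III; replacing it with
    the other task's previous state h_{t-1}^(n) gives the coupled form of Eq. (11).
    `p` holds the task-m parameters: W_c, U_c, W_g, U_g for the task's own path and
    U_c_o, U_g_o for the external (shared or coupled) path.
    """
    g_m = sigmoid(p["W_g"] @ x_t + p["U_g"] @ h_prev_m)     # gate on the task's own history
    g_o = sigmoid(p["W_g"] @ x_t + p["U_g_o"] @ h_other)    # gate on the external signal
    return np.tanh(p["W_c"] @ x_t
                   + g_m * (p["U_c"] @ h_prev_m)            # g^(m) U_c^(m) h_{t-1}^(m)
                   + g_o * (p["U_c_o"] @ h_other))          # g^(s->m) U_c^(s) h_t^(s)
```

The rest of the LSTM transition (input, forget and output gates, cell and hidden updates) is unchanged from Eqs. (2)-(4), (6) and (7).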
4 Training

The task-specific representations emitted by any of the multi-task architectures above are ultimately fed into different output layers, which are also task-specific:

ŷ^(m) = softmax(W^(m) h^(m) + b^(m)),    (13)

where ŷ^(m) is the vector of prediction probabilities for task m, W^(m) is the weight matrix which needs to be learned, and b^(m) is a bias term.

Our global cost function is the linear combination of the cost functions of all tasks:

φ = Σ_{m=1}^{M} λ_m L(ŷ^(m), y^(m)),    (14)

where λ_m is the weight for each task m.
Dataset  Type      Train Size  Dev. Size  Test Size  Classes  Avg. Length  Vocabulary Size
SST-1    Sentence  8544        1101       2210       5        19           18K
SST-2    Sentence  6920        872        1821       2        18           15K
SUBJ     Sentence  9000        -          1000       2        21           21K
IMDB     Document  25,000      -          25,000     2        294          392K

Table 1: Statistics of the four datasets used in this paper.

It is worth noting that the labeled training data for each task can come from completely different datasets. Following [Collobert and Weston, 2008], the training is achieved in a stochastic manner by looping over the tasks (a sketch of this loop is given at the end of this section):

1. Select a random task.

2. Select a random training example from this task.

3. Update the parameters for this task by taking a gradient step with respect to this example.

4. Go to 1.

Fine Tuning For Model-I and Model-III, there is a shared layer for all the tasks. Thus, after the joint learning phase, we can use a fine-tuning strategy to further optimize the performance of each task.

Pre-training of the shared layer with a neural language model For Model-III, the shared layer can be initialized by an unsupervised pre-training phase. Here, we initialize the shared LSTM layer of Model-III with a language model [Bengio et al., 2007] trained on all four task datasets.
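A minimal sketch of this stochastic multi-task training procedure is given below, under assumed interfaces: `model.loss_and_grads` and `model.apply_adagrad` are hypothetical helpers standing in for whatever single-task forward/backward and optimizer code is available, and scaling each sampled task's gradient by λ_m is one straightforward way to realize the weighted cost of Eq. (14) in expectation.

```python
import random

def train_multitask(tasks, model, num_steps, task_weights, learning_rate=0.1):
    """Stochastic multi-task training loop from Section 4 (steps 1-4).

    `tasks` maps a task name to its list of (token_ids, label) examples;
    `task_weights` are the lambda_m of Eq. (14).
    """
    task_names = list(tasks)
    for step in range(num_steps):
        m = random.choice(task_names)                   # 1. select a random task
        example = random.choice(tasks[m])               # 2. select a random example
        loss, grads = model.loss_and_grads(m, example)  # cross-entropy, Eq. (8)
        grads = {k: task_weights[m] * g for k, g in grads.items()}  # scale by lambda_m
        model.apply_adagrad(grads, learning_rate)       # 3. gradient step (Adagrad, Sec. 5.2)
        # 4. go to 1 (next loop iteration)
```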
5 Experiment

In this section, we investigate the empirical performance of our proposed three models on four related text classification tasks, and then compare them to other state-of-the-art models.

5.1 Datasets

To show the effectiveness of multi-task learning, we choose four different text classification tasks about movie reviews. Each task has its own dataset, which is briefly described as follows.

• SST-1 Movie reviews with five classes (negative, somewhat negative, neutral, somewhat positive, positive) from the Stanford Sentiment Treebank (http://nlp.stanford.edu/sentiment) [Socher et al., 2013].

• SST-2 Movie reviews with binary classes, also from the Stanford Sentiment Treebank.

• SUBJ Subjectivity dataset, where the goal is to classify each instance (snippet) as being subjective or objective [Pang and Lee, 2004].

• IMDB The IMDB dataset (http://ai.stanford.edu/~amaas/data/sentiment/) consists of 100,000 movie reviews with binary classes [Maas et al., 2011]. One key aspect of this dataset is that each movie review has several sentences.

The first three datasets are sentence-level, and the last dataset is document-level. Detailed statistics of the four datasets are listed in Table 1.

5.2 Hyperparameters and Training

The network is trained with backpropagation, and the gradient-based optimization is performed using the Adagrad update rule [Duchi et al., 2011]. In all of our experiments, the word embeddings are trained using word2vec [Mikolov et al., 2013] on the Wikipedia corpus (1B words). The vocabulary size is about 500,000. The word embeddings are fine-tuned during training to improve the performance [Collobert et al., 2011]. The other parameters are initialized by randomly sampling from the uniform distribution in [-0.1, 0.1]. The hyperparameters which achieve the best performance on the development set are chosen for the final evaluation. For datasets without a development set, we use 10-fold cross-validation (CV) instead.

The final hyperparameters are as follows. The embedding sizes for the task-specific and shared layers are both 64; for Model-I, there are therefore two embeddings for each word, each of size 64. The hidden layer size of the LSTM is 50. The initial learning rate is 0.1. The regularization weight of the parameters is 10^-5.
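The paper specifies only that the Adagrad update rule [Duchi et al., 2011] is used; for reference, a minimal sketch of that update is given below. The epsilon constant and the in-place convention are our additions, not details from the paper.

```python
import numpy as np

def adagrad_update(param, grad, cache, lr=0.1, eps=1e-8):
    """One Adagrad step: per-parameter learning-rate decay via accumulated squared gradients.

    `cache` must persist across steps; lr=0.1 matches the initial rate in Section 5.2.
    """
    cache += grad ** 2
    param -= lr * grad / (np.sqrt(cache) + eps)
    return param, cache
```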
5.3 Effect of Multi-task Training

We first compare our proposed models with the standard LSTM for single-task classification. We use the implementation of Graves [2013]; the unfolded illustration is shown in Figure 1.

Model           SST-1  SST-2  SUBJ  IMDB  Avg∆
Single Task     45.9   85.8   91.6  88.5  -
Joint Learning  46.5   86.7   92.0  89.9  +0.8
+ Fine Tuning   48.5   87.1   93.4  90.8  +2.0

Table 2: Results of the uniform-layer architecture.

Model        SST-1  SST-2  SUBJ  IMDB  Avg∆
Single Task  45.9   85.8   91.6  88.5  -
SST1-SST2    48.9   87.4   -     -     +2.3
SST1-SUBJ    46.3   -      92.2  -     +0.5
SST1-IMDB    46.9   -      -     89.5  +1.0
SST2-SUBJ    -      86.5   92.5  -     +0.8
SST2-IMDB    -      86.8   -     89.8  +1.2
SUBJ-IMDB    -      -      92.7  89.3  +0.9

Table 3: Results of the coupled-layer architecture.
Model           SST-1  SST-2  SUBJ  IMDB  Avg∆
Single Task     45.9   85.8   91.6  88.5  -
Joint Learning  47.1   87.0   92.5  90.7  +1.4
+ LM            47.9   86.8   93.6  91.0  +1.9
+ Fine Tuning   49.6   87.9   94.1  91.3  +2.8

Table 4: Results of the shared-layer architecture.

Tables 2-4 show the classification accuracies on the four datasets. The second line ("Single Task") of each table shows the result of the standard LSTM for each individual task.

Uniform-layer Architecture For the first, uniform-layer architecture, we train the model on the four datasets simultaneously. The LSTM layer is shared across all the tasks. The average improvement over the four datasets is 0.8%. With the further fine-tuning phase, the improvement reaches 2.0% on average.

Coupled-layer Architecture For the second, coupled-layer architecture, the information is shared between a pair of tasks, so there are six combinations of the four datasets. We train six models on the different pairs of datasets and find that pair-wise joint learning also improves the performance. The more closely related the tasks are, the more significant the improvements are. Since SST-1 and SST-2 come from the same corpus, their improvements are more significant than those of the other combinations: the improvement is 2.3% on average when learning simultaneously on SST-1 and SST-2.

Shared-layer Architecture The shared-layer architecture is more general than the uniform-layer architecture: besides a shared layer for all the tasks, each task has its own task-specific layer. As shown in Table 4, the average improvement over the four datasets is 1.4%, which is better than the uniform-layer architecture. We also investigate the strategy of unsupervised pre-training of the shared LSTM layer. With the LM pre-training, the performance is improved by an extra 0.5% on average. Besides, the further fine-tuning significantly improves the performance by another 0.9%.

To recap, all our proposed models outperform the single-task learning baseline, and the shared-layer architecture gives the best performance. Moreover, compared with the vanilla LSTM, our proposed three models do not incur much extra computational cost and converge faster. In our experiments, the most complicated model, Model-III, costs 2.5 times as long as the vanilla LSTM.

5.4 Comparisons with State-of-the-art Neural Models

We compare our model with the following models:

• NBOW The NBOW model sums the word vectors and applies a non-linearity followed by a softmax classification layer.

• MV-RNN Matrix-Vector Recursive Neural Network with parse trees [Socher et al., 2012].

• RNTN Recursive Neural Tensor Network with a tensor-based feature function and parse trees [Socher et al., 2013].

• DCNN Dynamic Convolutional Neural Network with dynamic k-max pooling [Kalchbrenner et al., 2014].

• PV Logistic regression on top of paragraph vectors [Le and Mikolov, 2014]. Here, we use the popular open-source implementation of PV in Gensim (https://github.com/piskvorky/gensim/).

• Tree-LSTM A generalization of LSTMs to tree-structured network topologies [Tai et al., 2015].

Model       SST-1  SST-2  SUBJ  IMDB
NBOW        42.4   80.5   91.3  83.62
MV-RNN      44.4   82.9   -     -
RNTN        45.7   85.4   -     -
DCNN        48.5   86.8   -     -
PV          44.6   82.7   90.5  91.7
Tree-LSTM   50.6   86.9   -     -
Multi-Task  49.6   87.9   94.1  91.3

Table 5: Results of the shared-layer multi-task model against state-of-the-art neural models.

Table 5 shows the performance of the shared-layer architecture compared with the competitor models; our model is competitive with the neural-based state-of-the-art models. Although Tree-LSTM outperforms our model on SST-1, it needs an external parser to obtain the sentence's topological structure. It is worth noticing that our models are compatible with the other RNN based models; for example, we can easily extend our models to incorporate the Tree-LSTM model.

5.5 Case Study

To get an intuitive understanding of what happens when we use the single LSTM or the shared-layer LSTM to predict the class of a text, we design an experiment to analyze the output of the single LSTM and the shared-layer LSTM at each time step. We sample two sentences from the SST-2 test dataset; the changes of the predicted sentiment score at different time steps are shown in Figure 3. To get more insight into how the shared structure influences the specific task, we observe the activation of the global gates g^(s), which control the signals flowing from the shared LSTM layer to the task-specific layer, to understand the behaviour of the neurons. We plot the evolving activation of the global gates g^(s) through time and sort the neurons according to their activations at the last time step.

The sentence "A merry movie about merry period people's life." has a positive sentiment, but the standard LSTM gives a wrong prediction. The reason can be inferred from the activation of the global gates g^(s). As shown in Figure 3-(c), the neurons are strongly activated when they take "merry" as input, which indicates that the task-specific layer takes much information from the shared layer for the word "merry", and this ultimately makes the model give the correct prediction.
[Figure 3: (a)(b) The change of the predicted sentiment score at different time steps for the standard LSTM and Model-III. The y-axis represents the sentiment score, while the x-axis represents the input words in chronological order; the red horizontal line gives the border between the positive and negative sentiments. (c)(d) Visualization of the activation of the global gates g^(s).]

Another case, "Not everything works, but the average is higher than in Mary and most other recent comedies.", is positive and has a somewhat complicated semantic composition. As shown in Figure 3-(b,d), the simple LSTM cannot capture the structure of "but ... higher than", while our model is sensitive to it, which indicates that the shared layer can not only enrich the meaning of certain words, but can also pass some structural information to the specific task.

5.6 Error Analysis

We analyze the bad cases induced by our proposed shared-layer model on the SST-2 dataset. Most of the bad cases fall into two categories.

Complicated Sentence Structure Some sentences involving complicated structure cannot be handled properly, such as the double negation "it never fails to engage us." and subjunctive sentences such as "Still, I thought it could have been more.". To solve these cases, some architectural improvements are necessary, such as tree-based LSTMs [Tai et al., 2015].

Sentences Requiring Reasoning The sentiments of some sentences can be misjudged if only the literal meaning is considered. For example, the sentence "I tried to read the time on my watch." expresses a negative attitude towards a movie, which can only be understood correctly by reasoning based on common sense.

6 Related Work

Neural network based multi-task learning has proven effective in many NLP problems [Collobert and Weston, 2008; Liu et al., 2015b].

Collobert and Weston [2008] used a shared representation for input words and solved different traditional NLP tasks, such as part-of-speech tagging and semantic role labeling, within one framework. However, only one lookup table is shared, and the other lookup tables and layers are task-specific. To deal with variable-length text sequences, they used a window-based method to fix the input size.

Liu et al. [2015b] developed a multi-task DNN for learning representations across multiple tasks. Their multi-task DNN approach combines the tasks of query classification and ranking for web search. However, the input of their model is a bag-of-words representation, which loses the information of word order.

Different from the two methods above, our models are based on recurrent neural networks, which are better suited to modeling variable-length text sequences.

More recently, several multi-task encoder-decoder networks were also proposed for neural machine translation [Dong et al., 2015; Firat et al., 2016], which can make use of cross-lingual information. Unlike these works, in this paper we design three architectures which can flexibly control the information flow between the shared layer and the task-specific layers, thus obtaining better sentence representations.

7 Conclusion and Future Work

In this paper, we introduce three RNN based architectures to model text sequences with multi-task learning. The differences among them are the mechanisms for sharing information among the several tasks. Experimental results show that our models can improve the performance of a group of related tasks by exploiting common features.

In future work, we would like to investigate other sharing mechanisms across different tasks.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments. This work was partially funded by the National Natural Science Foundation of China (No. 61532011, 61473092, and 61472088) and the National High Technology Research and Development Program of China (No. 2015AA015408).
References

[Bengio et al., 2007] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153, 2007.

[Caruana, 1997] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, 2014.

[Chung et al., 2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML, 2008.

[Collobert et al., 2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.

[Dong et al., 2015] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of the ACL, 2015.

[Duchi et al., 2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[Elman, 1990] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[Firat et al., 2016] Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073, 2016.

[Graves, 2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Hochreiter et al., 2001] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

[Kalchbrenner et al., 2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of ACL, 2014.

[Le and Mikolov, 2014] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of ICML, 2014.

[Liu et al., 2015a] PengFei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuanjing Huang. Multi-timescale long short-term memory neural network for modelling sentences and documents. In Proceedings of EMNLP, 2015.

[Liu et al., 2015b] Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In NAACL, 2015.

[Maas et al., 2011] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the ACL, pages 142–150, 2011.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[Pang and Lee, 2004] Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of ACL, 2004.

[Socher et al., 2011] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of EMNLP, 2011.

[Socher et al., 2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP, pages 1201–1211, 2012.

[Socher et al., 2013] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, 2013.

[Tai et al., 2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

[Turian et al., 2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL, 2010.
