NN TextClassification
NN TextClassification
x1 x2 x3 xT
Model-III: Shared-Layer Architecture Model-III also as-
signs a separate LSTM layer for each task, but introduces a
(m)
h1
(m)
h2
(m)
h3 ··· (m)
hT softmax1 y (m)
bidirectional LSTM layer to capture the shared information
for all the tasks.
(s) (s) (s) (s)
h1 h2 h3 hT We denote the outputs of the forward and backward
→
− (s) ←
−(s)
LSTMs at step t as h t and h t respectively. The output
(s) →
− (s) ←
−(s)
h1
(n)
h2
(n)
h3
(n)
··· hT
(n)
softmax2 y (n) of shared layer is ht = h t ⊕ h t .
To enhance the interaction between task-specific layers and
x1 x2 x3 xT the shared layer, we use gating mechanism to endow the neu-
rons in task-specific layer with the ability to accept or refuse
(c) Model-III: Shared-Layer Architecture the information passed by the neuron in shared layers. Unlike
Model-II, we compute the new state for LSTM as follows:
Figure 2: Three architectures for modelling text with multi-
task learning. (m) (m) (s)
c̃t = tanh Wc(m) xt + g(m) U(m)
c ht−1 + g(s→m) U(s)
c ht ,
(12)
(m) (m) (m)
Motivated by the success of multi-task learning [Caruana, where g(m) = σ(Wg xt + Ug ht−1 ) and g(s→m) =
1997], we propose three multi-task models to leverage super- (m) (s→m) (s)
σ(Wg xt + Ug ht ).
vised data from many related tasks. Deep neural model is
well suited for multi-task learning since the features learned
from a task may be useful for other tasks. Figure 2 gives an 4 Training
illustration of our proposed models. The task-specific representations, which emittd by the muti-
task architectures of all of the above, are ultimately fed into
Model-I: Uniform-Layer Architecture In Model-I, the different output layers, which are also task-specific.
different tasks share a same LSTM layer and an embedding
layer besides their own embedding layers. ŷ(m) = softmax(W(m) h(m) + b(m) ), (13)
For task m, the input x̂t consists of two parts:
(m) (m) (s)
where ŷ(m) is prediction probabilities for task m, W(m) is
x̂t = xt ⊕ xt , (9) the weight which needs to be learned, and b(m) is a bias term.
(m) (s)
where xt , xt denote the task-specific and shared word Our global cost function is the linear combination of cost
embeddings respectively, ⊕ denotes the concatenation opera- function for all joints.
tion. M
X
The LSTM layer is shared for all tasks. The final sequence φ= λm L(ŷ (m) , y (m) ) (14)
representation for task m is the output of LSMT at step T . m=1
(m)
hT = LST M (x̂(m) ). (10) where λm is the weights for each task m respectively.
Dataset Type Train Size Dev. Size Test Size Class Averaged Length Vocabulary Size
SST-1 Sentence 8544 1101 2210 5 19 18K
SST-2 Sentence 6920 872 1821 2 18 15K
SUBJ Sentence 9000 - 1000 2 21 21K
IMDB Document 25,000 - 25,000 2 294 392K
It is worth noticing that labeled data for training each The first three datasets are sentence-level, and the last
task can come from completely different datasets. Follow- dataset is document-level. The detailed statistics about the
ing [Collobert and Weston, 2008], the training is achieved in four datasets are listed in Table 1.
a stochastic manner by looping over the tasks:
1. Select a random task. 5.2 Hyperparameters and Training
2. Select a random training example from this task. The network is trained with backpropagation and the
gradient-based optimization is performed using the Adagrad
3. Update the parameters for this task by taking a gradient update rule [Duchi et al., 2011]. In all of our experiments,
step with respect to this example. the word embeddings are trained using word2vec [Mikolov
4. Go to 1. et al., 2013] on the Wikipedia corpus (1B words). The vo-
cabulary size is about 500,000. The word embeddings are
Fine Tuning For model-I and model-III, there is a shared fine-tuned during training to improve the performance [Col-
layer for all the tasks. Thus, after the joint learning phase, we lobert et al., 2011]. The other parameters are initialized by
can use a fine tuning strategy to further optimize the perfor- randomly sampling from uniform distribution in [-0.1, 0.1].
mance for each task. The hyperparameters which achieve the best performance on
Pre-training of the shared layer with neural language the development set will be chosen for the final evaluation.
model For model-III, the shared layer can be initialized For datasets without development set, we use 10-fold cross-
by an unsupervised pre-training phase. Here, for the shared validation (CV) instead.
LSTM layer in Model-III, we initialize it by a language model The final hyper-parameters are as follows. The embedding
[Bengio et al., 2007], which is trained on all the four task size for specific task and shared layer are 64. For Model-I,
dataset. there are two embeddings for each word, and both their sizes
are 64. The hidden layer size of LSTM is 50. The initial
5 Experiment learning rate is 0.1. The regularization weight of the parame-
In this section, we investigate the empirical performances of ters is 10−5 .
our proposed three models on four related text classification
tasks and then compare it to other state-of-the-art models. 5.3 Effect of Multi-task Training
Table 2-4 show the classification accuracies on the four Table 5: Results of shared-layer multi-task model against
datasets. The second line (“Single Task”) of each table shows state-of-the-art neural models.
the result of the standard LSTM for each individual task.
• RNTN Recursive Neural Tensor Network with tensor-
Uniform-layer Architecture For the first uniform-layer ar- based feature function and parse trees [Socher et al.,
chitecture, we train the model on four datasets simultane- 2013].
ously. The LSTM layer is shared across all the tasks. The
average improvement of the performances on four datasets is • DCNN Dynamic Convolutional Neural Network with
0.8%. With the further fine-tuning phase, the improvement dynamic k-max pooling [Kalchbrenner et al., 2014].
achieves 2.0% on average. • PV Logistic regression on top of paragraph vectors [Le
and Mikolov, 2014]. Here, we use the popular open
source implementation of PV in Gensim3 .
Coupled-layer Architecture For the second coupled-layer
architecture, the information is shared with a pair of tasks. • Tree-LSTM A generalization of LSTMs to tree-
Therefore, there are six combinations for the four datasets. structured network topologies. [Tai et al., 2015]
We train six models on the different pairs of datasets. We can
Table 5 shows the performance of the shared-layer archi-
find that the pair-wise joint learning also improves the perfor-
tecture compared with the competitor models, which shows
mances. The more relevant the tasks are, the more significant
our model is competitive for the neural-based state-of-the-art
the improvements are. Since SST-1 and SST-2 are from the
models.
same corpus, their improvements are more significant than
Although Tree-LSTM outperforms our model on SST-1, it
the other combinations. The improvement is 2.3% on aver-
needs an external parser to get the sentence topological struc-
age with simultaneously learning on SST-1 and SST-2.
ture. It is worth noticing that our models are compatible with
the other RNN based models. For example, we can easily
Shared-layer Architecture The shared-layer architecture extend our models to incorporate the Tree-LSTM model.
is more general than uniform-layer architecture. Besides a
shared layer for all the tasks, each task has own task-specific 5.5 Case Study
layer. As shown in Table 4, we can see that the average To get an intuitive understanding of what is happening when
improvement of the performances on four datasets is 1.4%, we use the single LSTM or the shared-layer LSTM to pre-
which is better than the uniform-layer architecture. We also dict the class of text, we design an experiment to analyze
investigate the strategy of unsupervised pre-training towards the output of the single LSTM and the shared-layer LSTM
shared LSTM layer. With the LM pre-training, the perfor- at each time step. We sample two sentences from the SST-2
mance is improved by an extra 0.5% on average. Besides, the test dataset, and the changes of the predicted sentiment score
further fine-tuning can significantly improve the performance at different time steps are shown in Figure 3. To get more
by an another 0.9%. insights into how the shared structures influences the specific
To recap, all our proposed models outperform the baseline task. We observe the activation of global gates g (s) , which
of single-task learning. The shared-layer architecture gives controls signals flowing from one shared LSTM layer to task-
the best performances. Moreover, compared with vanilla spcific layer, to understand the behaviour of neurons. We plot
LSTM, our proposed three models don’t cause much extra evolving activation of global gates g (s) through time and sort
computational cost while converge faster. In our experiment, the neurons according to their activations at the last time step.
the most complicated model-III, costs 2.5 times as long as For the sentence “A merry movie about merry
vanilla LSTM. period people’s life.”, which has a positive sen-
timent, while the standard LSTM gives a wrong prediction.
5.4 Comparisons with State-of-the-art Neural
The reason can be inferred from the activation of global gates
Models
g (s) . As shown in Figure 3-(c), we can see clearly the neurons
We compare our model with the following models: are activated much when they take input as “merry”, which
• NBOW The NBOW sums the word vectors and applies a indicates the task-specific layer takes much information from
non-linearity followed by a softmax classification layer. shared layer towards the word “merry”, and this ultimately
makes the model give a correct prediction.
• MV-RNN Matrix-Vector Recursive Neural Network
3
with parse trees [Socher et al., 2012]. https://github.com/piskvorky/gensim/
0.7 1
0.6 0.8
0.5 0.6
0.4
0.4
0.3
0.2
LSTM
0.2 LSTM
Model-III
0 Model-3
<s> A merry movie about merry period people’s life . <s> Not everything works but average higher than mary most other recent comdies
(a) (b)
0.8
0.6 5 0.7 5
10 0.6 10
0.5
0.5
0.4 15 15
0.4
0.3 20 0.3 20
0.2
25 25
0.2 0.1
30 30
<s> A merry movie about merry period people’s life . <s> Not everythingworks but average higher than mary most other recent comdies
(c) (d)
Figure 3: (a)(b) The change of the predicted sentiment score at different time steps. Y-axis represents the sentiment score,
while X-axis represents the input words in chronological order. The red horizontal line gives a border between the positive and
negative sentiments. (c)(d) Visualization of the global gate’s (g (s) ) activation.
Another case “Not everything works, but the within one framework. However, only one lookup table is
average is higher than in Mary and most shared, and the other lookup-tables and layers are task spe-
other recent comedies.” is positive and has a little cific. To deal with the variable-length text sequence, they
complicated semantic composition. As shown in Figure used window-based method to fix the input size.
3-(b,d), simple LSTM cannot capture the structure of “but Liu et al. [2015b] developed a multi-task DNN for learning
... higher than ” while our model is sensitive to representations across multiple tasks. Their multi-task DNN
it, which indicates the shared layer can not only enrich the approach combines tasks of query classification and ranking
meaning of certain words, but can teach some information of for web search. But the input of the model is bag-of-word
structure to specific task. representation, which lose the information of word order.
Different with the two above methods, our models are
5.6 Error Analysis based on recurrent neural network, which is better to model
We analyze the bad cases induced by our proposed shared- the variable-length text sequence.
layer model on SST-2 dataset. Most of the bad cases can be More recently, several multi-task encoder-decoder net-
generalized into two categories works were also proposed for neural machine translation
Complicated Sentence Structure Some sentences in- [Dong et al., 2015; Firat et al., 2016], which can make use
volved complicated structure can not be handled prop- of cross-lingual information. Unlike these works, in this pa-
erly, such as double negation “it never fails to per we design three architectures, which can control the in-
engage us.” and subjunctive sentences “Still, I formation flow between shared layer and task-specific layer
thought it could have been more.”. To solve flexibly, thus obtaining better sentence representations.
these cases, some architectural improvements are necessary,
such as tree-based LSTM [Tai et al., 2015]. 7 Conclusion and Future Work
Sentences Required Reasoning The sentiments of some In this paper, we introduce three RNN based architectures
sentences can be mislead if only considering the literal mean- to model text sequence with multi-task learning. The differ-
ing. For example, the sentence “I tried to read the ences among them are the mechanisms of sharing information
time on my watch.” expresses negative attitude to- among the several tasks. Experimental results show that our
wards a movie, which can be understood correctly by rea- models can improve the performances of a group of related
soning based on common sense. tasks by exploring common features.
In future work, we would like to investigate the other shar-
6 Related Work ing mechanisms of the different tasks.
Neural networks based multi-task learning has proven effec-
tive in many NLP problems [Collobert and Weston, 2008; Acknowledgments
Liu et al., 2015b].
Collobert and Weston [2008] used a shared representa- We would like to thank the anonymous reviewers for their
tion for input words and solve different traditional NLP tasks valuable comments. This work was partially funded by Na-
such as part-of-Speech tagging and semantic role labeling tional Natural Science Foundation of China (No. 61532011,
61473092, and 61472088), the National High Technol- network for modelling sentences. In Proceedings of ACL,
ogy Research and Development Program of China (No. 2014.
2015AA015408). [Le and Mikolov, 2014] Quoc V. Le and Tomas Mikolov.
Distributed representations of sentences and documents.
References In Proceedings of ICML, 2014.
[Bengio et al., 2007] Yoshua Bengio, Pascal Lamblin, Dan [Liu et al., 2015a] PengFei Liu, Xipeng Qiu, Xinchi Chen,
Popovici, Hugo Larochelle, et al. Greedy layer-wise train- Shiyu Wu, and Xuanjing Huang. Multi-timescale long
ing of deep networks. Advances in neural information pro- short-term memory neural network for modelling sen-
cessing systems, 19:153, 2007. tences and documents. In Proceedings of the Conference
[Caruana, 1997] Rich Caruana. Multitask learning. Machine on Empirical Methods in Natural Language Processing,
learning, 28(1):41–75, 1997. 2015.
[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, [Liu et al., 2015b] Xiaodong Liu, Jianfeng Gao, Xiaodong
Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and He, Li Deng, Kevin Duh, and Ye-Yi Wang. Representa-
Yoshua Bengio. Learning phrase representations using tion learning using multi-task deep neural networks for se-
rnn encoder-decoder for statistical machine translation. In mantic classification and information retrieval. In NAACL,
Proceedings of EMNLP, 2014. 2015.
[Chung et al., 2014] Junyoung Chung, Caglar Gulcehre, [Maas et al., 2011] Andrew L Maas, Raymond E Daly, Pe-
KyungHyun Cho, and Yoshua Bengio. Empirical evalua- ter T Pham, Dan Huang, Andrew Y Ng, and Christopher
tion of gated recurrent neural networks on sequence mod- Potts. Learning word vectors for sentiment analysis. In
eling. arXiv preprint arXiv:1412.3555, 2014. Proceedings of the ACL, pages 142–150, 2011.
[Collobert and Weston, 2008] Ronan Collobert and Jason [Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg
Weston. A unified architecture for natural language pro- Corrado, and Jeffrey Dean. Efficient estimation of
cessing: Deep neural networks with multitask learning. In word representations in vector space. arXiv preprint
Proceedings of ICML, 2008. arXiv:1301.3781, 2013.
[Collobert et al., 2011] Ronan Collobert, Jason Weston, [Pang and Lee, 2004] Bo Pang and Lillian Lee. A sentimen-
Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and tal education: Sentiment analysis using subjectivity sum-
Pavel Kuksa. Natural language processing (almost) from marization based on minimum cuts. In Proceedings of
scratch. The Journal of Machine Learning Research, ACL, 2004.
12:2493–2537, 2011. [Socher et al., 2011] Richard Socher, Jeffrey Pennington,
[Dong et al., 2015] Daxiang Dong, Hua Wu, Wei He, Dian- Eric H Huang, Andrew Y Ng, and Christopher D Man-
hai Yu, and Haifeng Wang. Multi-task learning for multi- ning. Semi-supervised recursive autoencoders for predict-
ple language translation. In Proceedings of the ACL, 2015. ing sentiment distributions. In Proceedings of EMNLP,
[Duchi et al., 2011] John Duchi, Elad Hazan, and Yoram 2011.
Singer. Adaptive subgradient methods for online learn- [Socher et al., 2012] Richard Socher, Brody Huval, Christo-
ing and stochastic optimization. The Journal of Machine pher D Manning, and Andrew Y Ng. Semantic composi-
Learning Research, 12:2121–2159, 2011. tionality through recursive matrix-vector spaces. In Pro-
[Elman, 1990] Jeffrey L Elman. Finding structure in time. ceedings of EMNLP, pages 1201–1211, 2012.
Cognitive science, 14(2):179–211, 1990. [Socher et al., 2013] Richard Socher, Alex Perelygin, Jean Y
[Firat et al., 2016] Orhan Firat, Kyunghyun Cho, and Wu, Jason Chuang, Christopher D Manning, Andrew Y
Ng, and Christopher Potts. Recursive deep models for
Yoshua Bengio. Multi-way, multilingual neural machine
semantic compositionality over a sentiment treebank. In
translation with a shared attention mechanism. arXiv
Proceedings of EMNLP, 2013.
preprint arXiv:1601.01073, 2016.
[Tai et al., 2015] Kai Sheng Tai, Richard Socher, and
[Graves, 2013] Alex Graves. Generating sequences with re-
Christopher D Manning. Improved semantic representa-
current neural networks. arXiv preprint arXiv:1308.0850,
tions from tree-structured long short-term memory net-
2013.
works. arXiv preprint arXiv:1503.00075, 2015.
[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and
[Turian et al., 2010] Joseph Turian, Lev Ratinov, and Yoshua
Jürgen Schmidhuber. Long short-term memory. Neural
Bengio. Word representations: a simple and general
computation, 9(8):1735–1780, 1997.
method for semi-supervised learning. In Proceedings of
[Hochreiter et al., 2001] Sepp Hochreiter, Yoshua Bengio, ACL, 2010.
Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow
in recurrent nets: the difficulty of learning long-term de-
pendencies, 2001.
[Kalchbrenner et al., 2014] Nal Kalchbrenner, Edward
Grefenstette, and Phil Blunsom. A convolutional neural