Ensemble Application of Convolutional and Recurrent Neural Networks For Multi-Label Text Categorization
Abstract: Text categorization, or text classification, is one of the key tasks for representing the semantic information of documents. Multi-label text categorization is a finer-grained approach to text categorization which consists of assigning multiple target labels to documents. It is more challenging than multi-class text categorization due to the exponential growth of label combinations. Existing approaches to multi-label text categorization fall short in extracting local semantic information and in modeling label correlations. In this paper, we propose an ensemble application of convolutional and recurrent neural networks to capture both the global and the local textual semantics and to model high-order label correlations while having a tractable computational complexity. Extensive experiments show that our approach achieves state-of-the-art performance when the CNN-RNN model is trained on a large-sized dataset.

I. INTRODUCTION

Multi-label text categorization refers to the task of assigning one or multiple categories (or labels) to a textual document, which can be time-consuming and sometimes intractable. Text categorization and multi-label text categorization have been widely applied to real-world problems, e.g., information retrieval [1], affective computing [2], sentiment analysis [3], email spam detection [4], multimodal content analysis [5], etc.

Efficient ways of tackling this task require advanced semantic representations of texts that go beyond the bag-of-words model, which ignores the local ordering of words and phrases and is hence unable to grasp local semantic information [6]. Moreover, labels often exhibit strong co-occurrence dependencies. For example, the color label green often occurs with the subject labels tree or leaf, whereas the color label blue will never come with the subject label dog. By considering such label correlations instead of treating labels independently, a more efficient prediction model is expected to be obtained.

In this work, we propose an ensemble application of a convolutional neural network (CNN) and a recurrent neural network (RNN) to tackle the problem of multi-label text categorization. We develop a CNN-RNN architecture to model the global and local semantic information of texts, and we then utilize such label correlations for prediction. In particular, we employ state-of-the-art word-vector based CNN feature extraction, and we deal with high-order label correlation while keeping a tractable computational complexity by using an RNN. We perform experiments to investigate the influence of the parameters of our model. Additionally, we compare our method with several baselines using two publicly available datasets, i.e., Reuters-21578 and RCV1-v2. The former reflects the behavior of our model on a relatively small dataset, and the latter is considered a large-scale dataset.

The remainder of the paper is organized as follows. Section II covers the basic concepts of multi-label text categorization. We review related work in Section III. Section IV presents the details of our CNN-RNN method. Experiments exploring the parameters of our model and comparisons with the baselines are presented in Sections V to VII. Finally, Section VIII concludes the paper.

II. BACKGROUND

Multi-label text categorization can generally be divided into two sub-tasks: text feature extraction and multi-label classification. These two sub-tasks are introduced in the following two subsections.

A. Text features

Raw text information cannot be directly used in the subsequent multi-label classifier. Many research works represent the features of texts and words as vectors in a framework termed the vector space model (VSM). Instead of representing the whole text in one step, e.g., tf-idf weighting for a document, most recent works focus on the distributional representation of individual words, which is nevertheless pre-processed and tokenized from the original raw text data. Grouping similar words in the vector space has shown good performance in many natural language processing (NLP) tasks, e.g., multi-task learning [7], sentence classification [8], sentiment analysis [9], semantic equivalence detection [10], etc. Among them, research efforts starting from [11] bring in the word2vec model that can capture both syntactic and semantic information, with demonstrated effectiveness in many NLP tasks. The basic idea
repeatedly to each possible window of $h$ words in the sentence (i.e., $x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}$) to produce an output sequence $o \in \mathbb{R}^{n-h+1}$, i.e., $o = [o_1, o_2, \ldots, o_{n-h+1}]$. We apply a non-linear activation function $f$ to each $o_i$ to produce a feature map $c \in \mathbb{R}^{n-h+1}$, where $c_i = f(o_i)$. The non-linear activation function is chosen as ReLU, because it has more benefits than sigmoid and tanh, as studied in [32]:

$$\mathrm{ReLU}(o_i) = \max(0, o_i) \qquad (3)$$

One may also specify multiple kinds of filters with different window sizes, or use multiple filters for the same window size, to learn complementary features from the same word windows. The dimensionality of the feature map generated by each filter varies as a function of the text length and the filter's window size. Thus, we apply a pooling function to each feature map to induce a fixed-length vector. A common strategy is 1-max pooling, which extracts a scalar (i.e., a feature vector of length 1) with the maximum value of each feature map. As a result, a filter with a certain window size only produces one scalar value, which means that if there are $i$ window sizes and each of them has $j$ filters, there will be $i \times j$ outputs in total. For example, concatenating the outputs of $k$ filters for each of the three window sizes (1, 3, 5) results in a $3k$-dimensional output vector. By projecting this output vector into a lower-dimensional space using a fully-connected layer, we obtain a low-dimensional feature representation of the text. This feature vector can be fed into the next layer of the CNN for further convolution, or be used as the output vector for different NLP tasks. Here, this text feature vector is used as the input for the RNN.
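To make this pipeline concrete, the following is a minimal PyTorch sketch of such an encoder: one set of filters per window size (1, 3, 5), ReLU activation, 1-max pooling, concatenation, and a fully-connected projection. The hyper-parameter values (embed_dim=300, num_filters=100, feature_dim=256) are illustrative assumptions, not settings reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNEncoder(nn.Module):
    """Sketch of the word-vector based CNN text feature extractor described above."""
    def __init__(self, embed_dim=300, window_sizes=(1, 3, 5),
                 num_filters=100, feature_dim=256):
        super().__init__()
        # One convolution per window size h; each yields `num_filters` feature maps.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in window_sizes]
        )
        # Fully-connected projection of the concatenated pooled features.
        self.proj = nn.Linear(num_filters * len(window_sizes), feature_dim)

    def forward(self, word_vectors):
        # word_vectors: (batch, seq_len, embed_dim) -> (batch, embed_dim, seq_len)
        x = word_vectors.transpose(1, 2)
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(x))                  # feature maps of length seq_len - h + 1
            pooled.append(c.max(dim=2).values)   # 1-max pooling over each feature map
        concat = torch.cat(pooled, dim=1)        # 3k-dimensional vector for window sizes (1, 3, 5)
        return self.proj(concat)                 # low-dimensional text feature T

# Example: 8 documents, each a sequence of 50 pre-trained 300-dim word vectors.
encoder = TextCNNEncoder()
T = encoder(torch.randn(8, 50, 300))             # T has shape (8, 256)
```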
B. Label sequence prediction by RNN

The RNN is mainly used for sequential labeling or prediction, such as language modeling, auto-encoding, etc. It can be considered an extension of the hidden Markov model that introduces nonlinear transitions to model long-term nonlinear dependences. The LSTM is considered one of the most successful RNN variants; it introduces three additional gates. In the text mining domain, LSTMs have been applied to tasks such as sentiment analysis [33] and sentence classification [34]. When used on a whole long document, however, the training of LSTMs is not stable, and they underperform traditional linear predictors, as shown in [35]. Moreover, the training and testing of LSTMs on long documents are also time and resource consuming. [35] illustrates that the training of LSTMs can be stabilized by pre-training them as sequence auto-encoders or recurrent language models. However, this problem is avoided when we use LSTMs for label sequence prediction, since a label sequence is typically much shorter than a document. The label sequence here is the assignment of ordered labels to a text. Although several variants of LSTMs exist, the standard LSTM is used. An additional word embedding layer is also applied for the labels.

An LSTM consists of three gates: an input gate $i$, an output gate $o$ and a forget gate $f$. The three gates work collaboratively to control what to read in, what to output, and what should be forgotten, so that complex long-term relationships can be modeled. The standard LSTM is expressed as follows:

$$
\begin{aligned}
i &= \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}), \\
o &= \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}), \\
f &= \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}), \\
u &= \tanh(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}), \\
c_t &= i \odot u + f \odot c_{t-1}, \\
h_t &= o \odot \tanh(c_t)
\end{aligned}
\qquad (4)
$$

where $\odot$ denotes element-wise multiplication and $\sigma(\cdot)$ is the sigmoid function. $x_t \in \mathbb{R}^d$ is the input from the lower layer at time step $t$; $d$ can be the dimension of the label word vectors if the lower layer is the word embedding of the labels, or the hidden-state dimension of the lower layer if the lower layer is an LSTM. If there are $q$ LSTM units, then $h_t \in \mathbb{R}^q$, $W^{(\cdot)} \in \mathbb{R}^{q \times d}$, $U^{(\cdot)} \in \mathbb{R}^{q \times q}$ and $b^{(\cdot)} \in \mathbb{R}^q$ for all types $(i, o, f, u)$. The memory cell $c_t$ is the key component of the LSTM, which keeps the long-term dependencies while avoiding the vanishing/exploding gradient problems. The forget gate $f$ is used to erase parts of the memory cells, while the input gate $i$ and the output gate $o$ control what to read into and write out of the memory cells.

C. Combining CNN and RNN

As shown in Section IV-A, the CNN is used to extract the features of the text so that its semantic information is represented as an output vector of the CNN. After feeding this output vector into the LSTM as the initial state for label prediction, the whole network is able to predict a sequence of labels related to the input document according to the features extracted by the CNN (Figure 1). The feature vector is fed into the LSTM through a linear transformation, as an additional term $W^{(T)} T$ in (4) for each of the types $(i, o, f, u)$, where $T$ is the output text feature of the CNN with a fixed dimension $t$, $W^{(T)} \in \mathbb{R}^{q \times t}$, and $q$ is the hidden dimension of the LSTM. For example, the formula for the input gate is changed to:

$$i = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)} + W^{(T)} T) \qquad (5)$$

The label sequence prediction always starts with the token <START>. At each time step, a softmax layer above the top LSTM layer calculates the probability of each label by first applying a linear transformation to the hidden state of the top LSTM layer. Then, the label with the maximal probability is predicted. The prediction of labels ends with the <END> token. Therefore, for each piece of text, a variable-length label sequence is predicted. In the ideal case, the label sequence of each input text exactly matches the subset of labels that belongs to that text. For example, in Figure 1, the input text is "the cat is sleeping near a small American flag in a bed", and the corresponding ground-truth labels include the subject (Animal), the location (United States) and the time (Midnight).
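As a concrete illustration of Eqs. (4)-(5) and the greedy decoding procedure, the following is a minimal PyTorch sketch of a one-layer label decoder. It conditions every gate on the CNN text feature T through the extra W^(T) T term and decodes greedily from <START> until <END>. The class and argument names (LabelDecoder, num_labels, max_steps) are illustrative, and details such as the state initialization may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

class LabelDecoder(nn.Module):
    """LSTM label-sequence decoder conditioned on the CNN text feature T (sketch of Eqs. (4)-(5))."""
    def __init__(self, num_labels, label_embed_dim=64, hidden_dim=128, feature_dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_labels + 2, label_embed_dim)      # labels plus <START>/<END>
        # Stacked weights for the four gate types (i, o, f, u).
        self.W = nn.Linear(label_embed_dim, 4 * hidden_dim)             # W^{(.)} x_t + b^{(.)}
        self.U = nn.Linear(hidden_dim, 4 * hidden_dim, bias=False)      # U^{(.)} h_{t-1}
        self.W_T = nn.Linear(feature_dim, 4 * hidden_dim, bias=False)   # W^{(T)} T, Eq. (5)
        self.out = nn.Linear(hidden_dim, num_labels + 2)                # softmax layer over labels
        self.hidden_dim = hidden_dim

    def step(self, label_ids, h, c, T):
        x = self.embed(label_ids)                          # (batch, label_embed_dim)
        gates = self.W(x) + self.U(h) + self.W_T(T)        # text feature added to every gate
        i, o, f, u = gates.chunk(4, dim=1)
        i, o, f = torch.sigmoid(i), torch.sigmoid(o), torch.sigmoid(f)
        c = i * torch.tanh(u) + f * c                      # c_t = i (.) u + f (.) c_{t-1}
        h = o * torch.tanh(c)                              # h_t = o (.) tanh(c_t)
        return self.out(h), h, c

    def greedy_decode(self, T, start_id, end_id, max_steps=10):
        batch = T.size(0)
        h = T.new_zeros(batch, self.hidden_dim)
        c = T.new_zeros(batch, self.hidden_dim)
        label_ids = torch.full((batch,), start_id, dtype=torch.long, device=T.device)
        predicted = []
        for _ in range(max_steps):
            logits, h, c = self.step(label_ids, h, c, T)
            label_ids = logits.argmax(dim=1)               # label with maximal probability
            predicted.append(label_ids)
            if (label_ids == end_id).all():                # stop once every sequence emits <END>
                break
        return torch.stack(predicted, dim=1)               # variable-length label id sequences
```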
V. DATASETS

To test the performance of our algorithm, we use two publicly available datasets, i.e., Reuters-21578 and RCV1-v2.

- Reuters-21578: The documents in this dataset were collected from the Reuters newswire in 1987. It has been a popular dataset for text categorization research since 1996 (now superseded by RCV1). Adapted from the previous Reuters-22173 version, it now contains 21,578 documents. In the Modified Apte (ModApte) split, there are 9,603 documents for training and 3,299 documents for testing. We use this split for the empirical studies of the parameters of our neural network. After that, we use all documents for 10-fold cross-validation to compare the various algorithms.
- RCV1-v2: Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters Ltd for research purposes. By correcting the errors of the raw RCV1 data, the RCV1-v2 version used here was produced [37].

The characteristics of the above datasets are summarized in Table I. Since the RCV1-v2 dataset provides both token and vector (tf-idf, cosine-normalized) versions, we can directly use the token version for our CNN-RNN model and the vector version for all the baseline algorithms. However, Reuters-21578 only provides raw documents. Thus, we preprocess these documents and choose the top 1000 tf-idf features according to their document frequency. Then, each document is represented as a 1000-dimensional feature vector, which can be used by all baseline algorithms.
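As an illustration of this preprocessing step, the sketch below selects the 1000 terms with the highest document frequency and builds tf-idf vectors restricted to that vocabulary using scikit-learn. The `documents` variable is a placeholder for the raw Reuters-21578 texts, and the exact tokenization and selection rule used in the paper may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_tfidf_features(documents, n_features=1000):
    """Represent each document as an n_features-dimensional tf-idf vector."""
    # 1) Rank terms by document frequency (number of documents containing the term).
    counter = CountVectorizer(binary=True, stop_words="english")
    presence = counter.fit_transform(documents)                 # (n_docs, n_terms), 0/1 entries
    doc_freq = np.asarray(presence.sum(axis=0)).ravel()
    terms = np.array(counter.get_feature_names_out())           # requires scikit-learn >= 1.0
    top_terms = terms[np.argsort(doc_freq)[::-1][:n_features]]

    # 2) Compute tf-idf weights restricted to the selected vocabulary.
    vectorizer = TfidfVectorizer(vocabulary=top_terms, stop_words="english")
    return vectorizer.fit_transform(documents), vectorizer      # sparse (n_docs, n_features)

# Example with a hypothetical corpus:
# X, vec = build_tfidf_features(["grain prices rose sharply", "oil exports fell"])
```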
VI. EVALUATION METRICS

There are two types of evaluation metrics for multi-label classification: ranking metrics and classification metrics. Since our model does not provide a complete ranking of all the labels, the only ranking metric evaluated here is one-error. For the classification metrics, we consider the hamming loss and the macro/micro-averaged precision, recall, and F1 score.
- The one-error counts the fraction of instances whose top-1 predicted label is not in the set of relevant labels.
- The hamming loss counts the symmetric difference between the predicted labels and the relevant labels, and reports it as a fraction of the label space.
- Precision, recall and F1 score are binary evaluation measures B(tp, tn, fp, fn) of classification performance, calculated from the numbers of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn). There are two ways to calculate these metrics over the whole test data: macro-averaged and micro-averaged. Macro-averaging refers to the average performance (precision, recall and F1 score) over labels, while micro-averaging first counts all true positives, true negatives, false positives and false negatives across all labels and then computes a single binary evaluation from these overall counts.
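For reference, the following is a minimal sketch of these metrics on binary indicator matrices: scikit-learn provides the hamming loss and the averaged precision/recall/F1, while one-error (not provided directly by scikit-learn) is computed by a small custom function. `y_true`, `y_pred` and `y_score` are assumed to be numpy arrays of shape (n_samples, n_labels).

```python
import numpy as np
from sklearn.metrics import hamming_loss, precision_recall_fscore_support

def one_error(y_true, y_score):
    """Fraction of instances whose top-1 scored label is not a relevant label."""
    top1 = np.argmax(y_score, axis=1)
    return np.mean(y_true[np.arange(len(y_true)), top1] == 0)

def evaluate(y_true, y_pred, y_score):
    micro = precision_recall_fscore_support(y_true, y_pred, average="micro", zero_division=0)
    macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
    return {
        "one-error": one_error(y_true, y_score),
        "hamming-loss": hamming_loss(y_true, y_pred),
        "MiP": micro[0], "MiR": micro[1], "MiF": micro[2],
        "MaP": macro[0], "MaR": macro[1], "MaF": macro[2],
    }
```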
TABLE II: Comparison of various algorithms (mean ± deviation)

Reuters-21578
Metric     BR                CC                ML-kNN            ML-HARAM          CNN-RNN
One-error  0.091 ± 0.012     0.084 ± 0.010     0.474 ± 0.031     0.112 ± 0.024     0.083 ± 0.009
H-loss     0.0032 ± 0.0005   0.0031 ± 0.0005   0.0088 ± 0.0009   0.0066 ± 0.0011   0.0038 ± 0.0004
MiP        0.940 ± 0.014     0.937 ± 0.014     0.803 ± 0.041     0.748 ± 0.030     0.902 ± 0.014
MiR        0.823 ± 0.019     0.828 ± 0.019     0.473 ± 0.032     0.775 ± 0.036     0.813 ± 0.026
MiF        0.878 ± 0.016     0.879 ± 0.016     0.595 ± 0.035     0.762 ± 0.032     0.855 ± 0.015
MaP        0.464 ± 0.038     0.470 ± 0.046     0.342 ± 0.032     0.298 ± 0.040     0.369 ± 0.002
MaR        0.350 ± 0.020     0.361 ± 0.023     0.205 ± 0.028     0.242 ± 0.019     0.287 ± 0.026
MaF        0.385 ± 0.024     0.395 ± 0.029     0.239 ± 0.027     0.237 ± 0.023     0.322 ± 0.017

RCV1-v2
Metric     BR                CC                ML-kNN   ML-HARAM   CNN-RNN
One-error  0.012 ± 0.001     0.045 ± 0.002     N/A      N/A        0.010 ± 0.002
H-loss     0.0088 ± 0.0003   0.0093 ± 0.0004   N/A      N/A        0.0086 ± 0.0004
MiP        0.898 ± 0.005     0.886 ± 0.005     N/A      N/A        0.889 ± 0.010
MiR        0.812 ± 0.008     0.811 ± 0.008     N/A      N/A        0.815 ± 0.015
MiF        0.853 ± 0.006     0.847 ± 0.006     N/A      N/A        0.849 ± 0.012
MaP        0.794 ± 0.006     0.768 ± 0.004     N/A      N/A        0.803 ± 0.004
MaR        0.623 ± 0.008     0.644 ± 0.009     N/A      N/A        0.646 ± 0.012
MaF        0.687 ± 0.008     0.693 ± 0.007     N/A      N/A        0.712 ± 0.008
effort and keep a comparable performance, in the experiment with the baselines we set the CNN-RNN parameter ratio to 50% and limit the number of training epochs to 50.

B. Baseline comparison

We test various algorithms on the above datasets, including binary relevance (BR), classifier chains (CC), ML-kNN, ML-HARAM and our CNN-RNN model. BR represents the multi-label algorithms that do not take the correlation between labels into account, CC is an algorithm that takes high-order label correlation into consideration, ML-kNN is a transformation-based algorithm, and ML-HARAM is a state-of-the-art neural network based algorithm.

10-fold cross-validation is performed for these algorithms. Each of the two datasets is randomly divided into 10 equal-sized subsets. We perform 10 iterations of the experiment; each iteration takes 9 subsets for training, and the remaining one is held out for testing. Then, we calculate the mean and deviation over the 10 experiments. The results are shown in Table II. The implementations of all the baseline methods are taken from Scikit-multilearn [38], an open-source library for multi-label classification built on top of the scikit-learn ecosystem [39]. In BR and CC, linear SVMs are used as the base classifiers. We investigate the number of neighbors in ML-kNN from 5 to 20 with a step of 5, and finally choose 10, which is optimal for both the hamming loss and the F1 score. ML-HARAM follows its default settings.
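A minimal sketch of this baseline setup is given below, assuming a recent Scikit-multilearn and scikit-learn install. A synthetic dataset stands in for the preprocessed feature and label matrices described in Section V, and ML-HARAM (run with its defaults in the paper) is omitted here.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold
from sklearn.datasets import make_multilabel_classification
from skmultilearn.problem_transform import BinaryRelevance, ClassifierChain
from skmultilearn.adapt import MLkNN

# Synthetic stand-in for the tf-idf features X and the binary label matrix Y of Section V.
X, Y = make_multilabel_classification(n_samples=200, n_features=50, n_classes=10, random_state=0)

baselines = {
    "BR": BinaryRelevance(classifier=LinearSVC()),   # labels treated independently
    "CC": ClassifierChain(classifier=LinearSVC()),   # models high-order label correlation
    "ML-kNN": MLkNN(k=10),                           # k selected from {5, 10, 15, 20}
}

# 10-fold cross-validation: train on 9 folds, hold the remaining fold out for testing.
for name, model in baselines.items():
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], Y[train_idx])
        Y_pred = model.predict(X[test_idx])          # binary indicator predictions
        # ...accumulate the Section VI metrics for this fold here...
```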
As shown in Table II, on the Reuters-21578 dataset the CNN-RNN model does not show an improvement over the two traditional methods (BR and CC). We investigate the performance on the training data and find that the model tends to overfit the training data of this relatively small dataset. For example, the micro-averaged precision on the training set is 99%. Besides, CC is observed to perform slightly better than BR, except for the micro-averaged precision. The reason could be that the average number of labels per document in Reuters-21578 is about one, so that modeling high-order correlation does not help much. Notice that ML-kNN is inferior to all the other methods. The state-of-the-art neural network based approach ML-HARAM is also inferior to our proposed neural network based approach in all metrics. For the RCV1-v2 dataset, the proposed method outperforms all the other baseline methods in most of the metrics, especially the macro-averaged ones. This verifies that deep learning methods generally benefit from a vast amount of training data. We conjecture that modeling the high-order label correlation helps to predict the minor labels that occur less frequently in the dataset.

Meanwhile, both of the traditional methods, BR and CC, show substantially decreased performance in the micro-averaged metrics and the hamming loss. The increases in the macro-averaged metrics relative to the previous dataset are due to the more balanced distribution of labels in this dataset. Note that the metrics for ML-kNN and ML-HARAM are not available, as the Scikit-multilearn implementations of these algorithms do not scale to this data size and result in memory errors. We expect that our model would also outperform these two methods, as it does on the small dataset; however, further experiments are required to confirm this.

VIII. CONCLUSION

Multi-label text categorization is the task of assigning predefined categories to textual documents. Existing approaches to multi-label text categorization fall short in extracting local semantic information and in modeling label correlations. In this paper, we proposed a convolutional neural network (CNN) and recurrent neural network (RNN) based method that is capable of efficiently representing textual features and modeling high-order label correlation with a reasonable computational complexity. Our evaluations reveal that the power of the proposed method is affected by the size of the training dataset. If the data size is too small, the system may suffer from overfitting; however, when trained on a large-scale dataset, the proposed model achieves state-of-the-art performance.
REFERENCES

[1] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, "A multi-view embedding space for modeling internet images, tags, and their semantics," International Journal of Computer Vision, vol. 106, no. 2, pp. 210-233, 2014.
[2] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, "A review of affective computing: From unimodal analysis to multimodal fusion," Information Fusion, vol. 37, pp. 98-125, 2017.
[3] E. Cambria, "Affective computing and sentiment analysis," IEEE Intelligent Systems, vol. 31, no. 2, pp. 102-107, 2016.
[4] X. Carreras and L. Marquez, "Boosting trees for anti-spam email filtering," in RANLP, 2001, pp. 58-64.
[5] S. Poria, E. Cambria, N. Howard, G.-B. Huang, and A. Hussain, "Fusing audio, visual and textual clues for sentiment analysis from multimodal content," Neurocomputing, vol. 174, pp. 50-59, 2016.
[6] E. Cambria and B. White, "Jumping NLP curves: A review of natural language processing research," IEEE Computational Intelligence Magazine, vol. 9, no. 2, pp. 48-57, 2014.
[7] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 160-167.
[8] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[9] E. Cambria, J. Fu, F. Bisio, and S. Poria, "AffectiveSpace 2: Enabling affective intuition for concept-level sentiment analysis," in AAAI, Austin, 2015, pp. 508-514.
[10] D. Bogdanova, C. dos Santos, L. Barbosa, and B. Zadrozny, "Detecting semantically equivalent questions in online user forums," CoNLL 2015, p. 123, 2015.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[12] G. Salton, A. Wong, and C.-S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975.
[13] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, "Syntactic clustering of the web," Computer Networks and ISDN Systems, vol. 29, no. 8, pp. 1157-1166, 1997.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, no. 3, p. 1, 1988.
[15] Y. Bengio, H. Schwenk, J.-S. Senecal, F. Morin, and J.-L. Gauvain, "Neural probabilistic language models," in Innovations in Machine Learning. Springer, 2006, pp. 137-186.
[16] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," The Journal of Machine Learning Research, vol. 12, pp. 2493-2537, 2011.
[17] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, "Learning multi-label scene classification," Pattern Recognition, vol. 37, no. 9, pp. 1757-1771, 2004.
[18] J. Read, B. Pfahringer, G. Holmes, and E. Frank, "Classifier chains for multi-label classification," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2009, pp. 254-269.
[19] M.-L. Zhang and Z.-H. Zhou, "ML-kNN: A lazy learning approach to multi-label learning," Pattern Recognition, vol. 40, no. 7, pp. 2038-2048, 2007.
[20] A. Clare and R. D. King, "Knowledge discovery in multi-label phenotype data," in European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 2001, pp. 42-53.
[21] A. Elisseeff and J. Weston, "A kernel method for multi-labelled classification," in Advances in Neural Information Processing Systems, 2001, pp. 681-687.
[22] S. Poria, E. Cambria, and A. Gelbukh, "Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis," in EMNLP, 2015, pp. 2539-2544.
[23] B. Xu, D. Ye, Z. Xing, X. Xia, G. Chen, and S. Li, "Predicting semantically linkable knowledge in developer online forums via convolutional neural network," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016). New York, NY, USA: ACM, 2016, pp. 51-62.
[24] I. Chaturvedi, Y.-S. Ong, I. Tsang, R. Welsch, and E. Cambria, "Learning word dependencies in text by means of a deep recurrent belief network," Knowledge-Based Systems, vol. 108, pp. 144-154, 2016.
[25] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[26] S. Poria, E. Cambria, D. Hazarika, and P. Vij, "A deeper look into sarcastic tweets using deep convolutional neural networks," in COLING, 2016, pp. 1601-1612.
[27] S. Poria, E. Cambria, and A. Gelbukh, "Aspect extraction for opinion mining with a deep convolutional neural network," Knowledge-Based Systems, vol. 108, pp. 42-49, 2016.
[28] N. Majumder, S. Poria, A. Gelbukh, and E. Cambria, "Deep learning based document modeling for personality detection from text," IEEE Intelligent Systems, vol. 32, no. 2, 2017.
[29] M.-L. Zhang and Z.-H. Zhou, "Multilabel neural networks with applications to functional genomics and text categorization," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 10, pp. 1338-1351, 2006.
[30] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz, "Large-scale multi-label text classification - revisiting neural networks," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 437-452.
[31] F. Benites and E. Sapozhnikova, "HARAM: A hierarchical ARAM neural network for large-scale text classification," in 2015 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, 2015, pp. 847-854.
[32] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
[33] D. Tang, B. Qin, and T. Liu, "Document modeling with gated recurrent neural network for sentiment classification," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1422-1432.
[34] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," arXiv preprint arXiv:1503.00075, 2015.
[35] A. M. Dai and Q. V. Le, "Semi-supervised sequence learning," in Advances in Neural Information Processing Systems, 2015, pp. 3079-3087.
[36] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[37] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," Journal of Machine Learning Research, vol. 5, no. Apr, pp. 361-397, 2004.
[38] P. Szymanski, "Scikit-multilearn: Enhancing multi-label classification in Python," 2014, manuscript in preparation.
[39] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.