Deep Learning in Natural Language Processing: A State-of-the-Art Survey
JUNYI CHAI 1,*, ANMING LI 2

1 Division of Business and Management, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai, China
2 Department of Management and Marketing, Faculty of Business, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China
*E-MAIL: [email protected]; [email protected]
word, sentences and their combinations. Typical tasks include named entity recognition, sentiment analysis, machine translation, and question answering.
There is another basic task that does not fall into the two categories above: word/character embedding. It aims to represent the words/characters of a vocabulary as vectors, so that machine learning techniques can be applied to the two kinds of tasks above. A typical method is word2vec, which projects words into a vector space such that the similarity of words can be measured by cosine similarity.
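For illustration, the minimal sketch below shows how cosine similarity compares two embeddings; the toy vectors are invented for this example and are not actual word2vec output:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 4-dimensional embeddings; real word2vec vectors
# typically have 100-300 dimensions.
v_king  = np.array([0.8, 0.3, 0.1, 0.5])
v_queen = np.array([0.7, 0.4, 0.2, 0.5])
v_apple = np.array([0.1, 0.9, 0.8, 0.0])

print(cosine_similarity(v_king, v_queen))  # high: related words
print(cosine_similarity(v_king, v_apple))  # low: unrelated words
```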
2.1.1. RNN

Since order matters a great deal in understanding a chunk of text, it is natural to design a network that 'remembers' the order of the input words and keeps key information in memory. The recurrent neural network (RNN) is a network designed to capture this temporal behavior. As shown
in Figure 1, an RNN loops over itself: with each incoming input, it updates the network state using the current input (x_t) and the previous state (h_{t-1}). As the input grows longer, however, an RNN becomes more time-consuming to train. Moreover, the most recent information is not necessarily the most important, which requires an RNN to keep more "long-term" information in memory.
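This recurrence can be sketched as follows; this is a plain Elman-style cell with random, untrained weights, not any specific model from the survey (gated variants such as LSTM and GRU differ in how they update the state):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrence step: the new state mixes the current
    input x_t with the previous state h_prev."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Toy dimensions: 3-dimensional inputs, 4-dimensional hidden state.
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)                        # initial state
for x_t in rng.normal(size=(5, 3)):    # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_x, W_h, b)  # h carries the 'memory'
```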
Jozefowicz et al. [3] explored whether the standard RNN architectures are optimal. They experimented on more than ten thousand RNN architectures and found one that outperformed both LSTM and GRU. Another notable variant is the recursive neural network.

2.1.2. CNN

CNN is another important DL technique for NLP tasks. In vision, it applies filters to sub-regions of an image; CNN-based NLP methods instead apply filters over regions comprising chunks of text. Collobert et al. [4] proposed a multi-layer CNN-based neural network to create a general-purpose model that requires no linguistic knowledge. They state that it can be used in several types of NLP tasks, with words or even characters as inputs. They used two methods for word embedding: a window-based method and a sentence-based method. Compared with RNNs, CNN-based methods require considerably shorter training time.
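The following toy sketch illustrates the idea of sliding a filter over chunks of text rather than image pixels; the embeddings and filter weights are random stand-ins for learned parameters:

```python
import numpy as np

def conv1d_text(embeddings, filt):
    """Slide a filter over every window of consecutive word
    vectors and return one activation per window."""
    width = filt.shape[0]
    return np.array([
        np.sum(embeddings[i:i + width] * filt)
        for i in range(len(embeddings) - width + 1)
    ])

# 6 words embedded in 4 dimensions (hypothetical values),
# one filter spanning windows of 3 consecutive words.
sentence = np.random.default_rng(1).normal(size=(6, 4))
filt = np.random.default_rng(2).normal(size=(3, 4))

feature_map = conv1d_text(sentence, filt)
pooled = feature_map.max()  # max-over-time pooling: one feature per filter
```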
Tang et al. [5] learned sentiment-specific word embeddings (SSWE) for Twitter sentiment classification with three neural networks: SSWEh, SSWEr and SSWEu. SSWEh and SSWEr are used to predict the sentiment distribution of text in constrained and relaxed modes, respectively. SSWEu is a unified model that captures both syntactic and sentiment information. Experiments showed that their method provided favourable results on the task of sentiment classification compared with methods using hand-crafted features.
Kim et al. [6] designed a simple CNN-based network to build a language model from characters. They argued that a character-based language model can capture sub-word information and produce better embeddings for rare words. These features are particularly useful for morphologically rich languages. A merit of their paper is the proposal to attach a highway network to the pooling layer. The output of this network is fed into a multilayer recurrent neural network language model (RNN-LM) to predict the next word.
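The highway layer they attach follows the standard gating equation y = T(x) * H(x) + (1 - T(x)) * x; the sketch below uses toy dimensions and untrained weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(x, W_H, b_H, W_T, b_T):
    """Highway layer: a transform gate T decides, per dimension,
    how much of the transformed signal H(x) to pass through and
    how much of the input x to carry over unchanged."""
    H = np.tanh(W_H @ x + b_H)     # candidate transformation
    T = sigmoid(W_T @ x + b_T)     # transform gate in (0, 1)
    return T * H + (1.0 - T) * x   # gated mixture

# Toy 8-dimensional pooled character features (hypothetical).
rng = np.random.default_rng(3)
x = rng.normal(size=8)
W_H, W_T = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
b_H, b_T = np.zeros(8), -2.0 * np.ones(8)  # negative gate bias favors carrying x
y = highway(x, W_H, b_H, W_T, b_T)
```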
2.3. POS, NER, Chunking, Semantic role labelling (SRL)

Inspired by Collobert et al. [7]'s work, Zheng et al. [8] applied it to Chinese word segmentation and POS tagging. Instead of using sentence-level log-likelihood, they proposed a new perceptron-style training method. Experiments showed that their method produced results comparable to the state-of-the-art, while their model requires less memory and time to run.
Santos and Zadrozny [9] extended Collobert et al. [7]'s work by attaching character vectors to the end of word vectors to solve the problem of part-of-speech tagging. This method is called CharWNN. They argued that the method captures syntactic and semantic information in the word-level embedding, and shape and morphological information in the character-level embedding. Both the character embedding and the word embedding are learned in an unsupervised manner. They conducted experiments on an English corpus and a Portuguese corpus, achieving the best results among previously known works.
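A minimal sketch of this CharWNN-style joint representation follows; the character-level vector here is a placeholder for the output of the character convolution, and all names and values are hypothetical:

```python
import numpy as np

# Hypothetical lookup tables: 300-d word vectors plus 50-d vectors
# produced by a character-level convolution (not shown here).
word_embedding = {"tagging": np.random.default_rng(4).normal(size=300)}
char_embedding = {"tagging": np.random.default_rng(5).normal(size=50)}

def charwnn_features(word):
    """Joint representation in the spirit of CharWNN: a word-level
    vector (syntax/semantics) concatenated with a character-level
    vector (shape/morphology)."""
    return np.concatenate([word_embedding[word], char_embedding[word]])

print(charwnn_features("tagging").shape)  # (350,)
```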
Santos and Guimaraes [10] addressed the problem of named entity recognition (NER). Their work builds on CharWNN, and its goal is to show that CharWNN can also be applied to other sequence classification problems such as NER.

2.4. Sentiment analysis, Text classification, Machine translation
Zhang et al. [11] proposed two types of character-level CNN to perform text classification. They compared their methods with traditional NLP methods (bag of words, n-grams) and deep learning methods (word-vector-based CNN, recurrent neural networks) on several datasets of different languages and sizes. They concluded that traditional methods perform better than CNN on small datasets, while CNN works better on user-generated data. While the choice of alphabet may lead to different performance, semantics does not seem to matter.
Huang et al. [12]'s work aims to solve the problem of matching queries and documents for Web search. They proposed a representative DL model named the Deep Structured Semantic Model (DSSM), which has three hidden layers. The goal is to project both query and document into a common low-dimensional semantic space, where the distance between a query and a document can be computed by cosine similarity. They also introduced a new method called word hashing to make DSSM feasible for Web applications. They conducted an experiment on real-world data and showed that their method outperformed the state-of-the-art methods significantly.
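Word hashing breaks each word into letter n-grams so that the input vocabulary stays small and fixed. A minimal sketch with letter trigrams and a '#' boundary marker, following the usual description of the method (details such as the value of n are assumptions here):

```python
from collections import Counter

def letter_trigrams(word):
    """Word hashing in the spirit of DSSM: pad the word with
    boundary markers and break it into letter trigrams."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(letter_trigrams("good"))
# ['#go', 'goo', 'ood', 'od#']

# A query becomes a sparse bag of trigram counts.
query_vector = Counter(t for w in "good web search".split()
                       for t in letter_trigrams(w))
```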
Socher et al. [13] established a dataset called the Sentiment Treebank, with 11,855 parsed sentences marked with fine-grained labels, to meet the needs of supervised training-based sentiment detection. It aims at capturing the compositional effects of sentiment. The Recursive Neural Tensor Network, a new DL structure, is proposed for sentiment analysis. Their experiments show that this method outperforms all previous methods on several metrics, and it is the only model that captures negation.
Sutskever et al. [14] proposed a multilayered LSTM to map a sequence to another sequence. They first turn a sequence into a vector (i.e., encode), and then infer another sequence from this vector (i.e., decode). This method is useful in tasks like machine translation. Experiments conducted on an English-to-French dataset showed that this model performed well on long sentences. Reversing the order of words in all source sentences also improved the performance of the LSTM model.
Bahdanau et al. [15] pointed out that the encoder-decoder approach to machine translation has difficulty dealing with long sentences because it has to compress each input sequence into a fixed-length vector. To solve this problem, they proposed a method that can align and translate at the same time. For each word to be predicted, it conducts a soft-search that predicts the word based on its context in the source language and the previously generated words.
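This soft-search can be sketched as attention weights over the encoder states. The sketch below uses simple dot-product scoring for brevity; Bahdanau et al. actually compute the alignment score with a small feed-forward network:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Soft-search over the source sentence: score every encoder
    state against the current decoder state, normalize the scores,
    and return a weighted sum (the context for the next word)."""
    scores = encoder_states @ decoder_state    # dot-product scoring
    weights = softmax(scores)                  # one weight per source word
    return weights @ encoder_states, weights   # context vector, alignment

# Toy setup: 7 source positions, 16-dimensional states (hypothetical).
rng = np.random.default_rng(6)
enc = rng.normal(size=(7, 16))
ctx, w = attend(rng.normal(size=16), enc)
```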
2.5. Other raised NLP problems

Kumar et al. [16] proposed a neural-network-based model called the dynamic memory network (DMN) to handle the problem of question answering. The authors claimed that this model could solve question-answering problems, sequence-tagging problems, classification problems, and sequence-to-sequence tasks.
DMN consists of four modules: an input module, a question module, an episodic memory module, and an answer module. The input and question modules compute feature representations for the inputs and questions, respectively. When a question arrives, the DMN searches the inputs for relevant facts. The episodic memory module then conducts a reasoning process over the retrieved facts. A vector that represents all relevant information is generated and fed to the answer module to generate the final answer.
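The four-module pipeline can be summarized as a skeleton; the module implementations are placeholder functions passed in as parameters, not the authors' code:

```python
def dynamic_memory_network(inputs, question,
                           input_module, question_module,
                           episodic_memory_module, answer_module):
    """Control flow of the four DMN modules as described above."""
    facts = input_module(inputs)               # representations of the inputs
    q = question_module(question)              # representation of the question
    memory = episodic_memory_module(facts, q)  # reasoning over retrieved facts
    return answer_module(memory, q)            # generate the final answer
```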
3. Advances in video and speech processing

3.1. Video classification

Xue et al. [17] reviewed recent developments of deep learning in the domain of video classification and video captioning. Video classification mainly depends on CNNs, used to extract spatial features from frames, and LSTMs, used to capture temporal information. Deep learning for visual attention is also discussed. They point out that there are two directions for video captioning: template-based language models and sequence learning models. The former splits a sentence into words, maps each word to a specific segment of an image (frame), and forms a sentence using language constraints, while the latter matches video contents to sentences directly.
Ye et al. [18] built a large-scale structured concept library. They first defined 500 events from WikiHow articles and organized the events and concepts in a hierarchical structure. They also crawled videos and extracted deep video features using CNNs, and 4,490 binary classifiers were trained over these videos.
Wu et al. [19] applied a hybrid method to the video classification problem. First, two CNN-based models are used to extract features and to model short-term motion. Built upon these two models, an LSTM network is used to capture longer-term temporal clues.
Karpathy et al. [20] extensively evaluated different CNN-based methods on the task of video classification. They introduced a new dataset containing 1 million YouTube videos in 487 sports categories (Sports-1M). To increase computational efficiency, they proposed a two-stream architecture to process input frames: a context stream that processes low-resolution frames, and a fovea stream that processes the middle portion of high-resolution frames. Experiments show this architecture runs 2x faster while costing little in classification accuracy. They also examined the generalization ability of the proposed methods on a different small video dataset.
Simonyan and Zisserman [21] proposed two-stream ConvNets to solve the problem of action recognition in videos. Two models are used to handle the spatial and temporal components of videos: one CNN processes still images to capture spatial information, while the other takes multiple optical flows as input to exploit temporal information. One observation is that classification accuracy improves with optical flow as input, even with a lack of training videos. They point out that multi-task learning, when applied to two different classification datasets, can not only extend the training data but also improve classification accuracy on both datasets.
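Late fusion of the two streams can be sketched as averaging the per-class scores, one of the simple fusion schemes; the per-stream scores below are hypothetical:

```python
import numpy as np

def two_stream_predict(spatial_scores, temporal_scores):
    """Late fusion: average the class scores of the appearance
    stream (still frame) and the motion stream (stacked optical
    flow), then pick the best class."""
    fused = (spatial_scores + temporal_scores) / 2.0
    return int(np.argmax(fused))

# Hypothetical per-class softmax outputs from the two ConvNets.
spatial  = np.array([0.10, 0.60, 0.10, 0.20])  # from the still frame
temporal = np.array([0.05, 0.25, 0.10, 0.60])  # from optical flow
print(two_stream_predict(spatial, temporal))   # fused prediction
```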
Wang et al. [22] argued that current CNN-based methods for action recognition were constrained by relatively shallow CNN architectures and a lack of training videos. They attempted to solve these problems with the following methods: a) pre-training the spatial and temporal nets, b) smaller learning rates, and c) high dropout rates. They achieved 91.4% accuracy on the UCF101 dataset.

3.2. Video surveillance

Dong et al. [23] were concerned with the problem of retrieving videos containing the face of a person, given a video of this person. They identified two challenges: (a) the intra-variation of faces, and (b) the constraints of space and time. They advocated using a deep CNN to obtain discriminative and compact representations by unifying a deep CNN with hash functions. They first initialize the network with a pre-trained CNN model; then they learn a hashing function using the proposed low-rank discriminative binary method in a supervised manner; finally, they tune the model to meet the needs of video retrieval.
Xue et al. [24] attempted to solve the problem of tracking multiple persons in RGBD videos. They proposed to train a CNN model to classify pedestrians detected in a video, using a few frames from the beginning of the video. To further reduce recognition failures, they used the classifier in conjunction with a motion model that predicts transitions of location.

3.3. Automatic speech recognition

Hinton et al. [25] reviewed DL for acoustic modelling in speech recognition from four research groups: the University of Toronto, Microsoft Research (MSR), Google, and IBM Research.
Graves et al. [26] used an advanced recurrent neural network named Deep RNN, which stacks recurrent layers into deep networks. Through experiments, deep Long Short-Term Memory RNNs outperformed other existing techniques.
Yu and Deng [27] provided a review of automatic speech recognition with a focus on DL techniques. An analogous review can be found in Ling et al. [28].
Chen and Salman [29] developed a DL framework for learning speaker-specific characteristics from mel-frequency cepstral coefficients. This approach outperforms other methods in speaker verification and segmentation.
Han and Wang [30] proposed a DL approach for pitch determination, in which a feedforward deep NN is trained on static frame-level acoustic features and a recurrent deep NN is trained on sequential frame-level features.

3.4. Image caption/scenes parsing

Socher et al. [31] proposed a method to parse natural scenes using recursive neural networks. Their method is based on the observation that natural scenes consist of nested image segments. The method first maps the features of image segments into a semantic space and assigns each segment a score. The score is used to decide whether a segment should be merged into a larger region; the merge also generates a new feature representation for the larger region and assigns it a new label. The same algorithm is applicable to natural language sentences: instead of image segments, it merges words to form phrases. The authors claimed that their method outperformed other conditional-random-field-based methods.
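A toy sketch of this merge-and-score idea with random, untrained weights, greedily merging the best-scoring neighboring pair:

```python
import numpy as np

def merge(a, b, W, W_score):
    """Score one candidate merge of two neighboring segments:
    a new feature vector for the merged region plus a scalar
    score saying how plausible the merge is."""
    parent = np.tanh(W @ np.concatenate([a, b]))  # new representation
    score = float(W_score @ parent)               # plausibility of merging
    return parent, score

# Greedy parsing loop over 4 toy 8-dimensional segment features.
rng = np.random.default_rng(7)
segments = [rng.normal(size=8) for _ in range(4)]
W, W_score = rng.normal(size=(8, 16)), rng.normal(size=8)

while len(segments) > 1:
    cands = [merge(segments[i], segments[i + 1], W, W_score)
             for i in range(len(segments) - 1)]
    i = int(np.argmax([s for _, s in cands]))
    segments[i:i + 2] = [cands[i][0]]  # replace the pair by its parent
```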
Socher et al. [32] introduced the DT-RNN model, which maps sentences into a vector space. The authors claimed that this is useful for retrieving images described by those sentences, and vice versa. Different from other RNN-based methods, this model is built on dependency trees (DT-RNN) rather than constituency trees (CT-RNN); it is thus more robust to changes in word order and syntactic structure.
Karpathy and Li [33] proposed a method aimed at generating natural language descriptions for images. This method combines a CNN and an RNN through a structured objective. The authors claimed that their model achieved state-of-the-art results on retrieval tasks over several image datasets. They first detect regions of interest (ROI) using an RCNN and then extract CNN features for the top 19 detected regions as well as the whole input image. After that, they compute word representations using a bidirectional RNN (BRNN). They select the best word for a region using a sentence score.

3.5. Video to Text

Zhu et al. [34] proposed a method to align books and their movie releases. The method first trains deep neural sentence embeddings in an unsupervised manner, and then computes the similarity between video clips and sentences by exploiting video-sentence embeddings. A contextual alignment model is also proposed to make better local alignment predictions. The authors created a movie-book dataset that contains 19,985 shots and 85,238 sentences.

4. Conclusion

This paper provides a clear and systematic literature review of articles published in the past three years on the application of DL in NLP. This study provides valuable knowledge accumulation on the current state of research.

Acknowledgments

The first author thanks the financial support of the Beijing Normal University - Hong Kong Baptist University United International College Research Grant under Grant R201917.

References

[1] Kyunghyun Cho, B.V. Merrienboer, et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation", Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724-1734, 2014.
[2] Klaus Greff, R.K. Srivastava, et al., "LSTM: A search space odyssey", IEEE Transactions on Neural Networks and Learning Systems, 28(10), pp. 2222-2232, 2017.
[3] Rafal Jozefowicz, W. Zaremba, and I. Sutskever, "An empirical exploration of recurrent network architectures", Proceedings of the International Conference on Machine Learning (ICML'15), pp. 2342-2350, 2015.
[4] Ronan Collobert, J. Weston, et al., "Natural language processing (almost) from scratch", Journal of Machine Learning Research, 12, pp. 2493-2537, 2011.
[5] Duyu Tang, F. Wei, et al., "Learning sentiment-specific word embedding for twitter sentiment classification", Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 1555-1565, 2014.
[6] Yoon Kim, Y. Jernite, et al., "Character-aware neural language models", Proceedings of the International Conference on Artificial Intelligence, pp. 2741-2749, 2016.
[7] Ronan Collobert, J. Weston, et al., "Natural language processing (almost) from scratch", Journal of Machine Learning Research, 12, pp. 2493-2537, 2011.
[8] Xiaoqing Zheng, H. Chen, and T. Xu, "Deep learning for Chinese word segmentation and POS tagging", Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 647-657, 2013.
[9] Cicero N.D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging", Proceedings of the International Conference on Machine Learning, pp. 1818-1826, 2014.
[10] Cicero N.D. Santos and V. Guimaraes, "Boosting named entity recognition with neural character embeddings", Proceedings of the Fifth Named Entity Workshop, joint with the 53rd ACL and the 7th IJCNLP, pp. 25-33, 2015.
[11] Xiang Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification", Advances in Neural Information Processing Systems, 2015.
[12] Po-Sen Huang, X. He, et al., "Learning deep structured semantic models for web search using clickthrough data", Proceedings of the ACM International Conference on Information and Knowledge Management, pp. 2333-2338, 2013.
[13] Richard Socher, A. Perelygin, et al., "Recursive deep models for semantic compositionality over a sentiment treebank", Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642, 2013.
[14] Ilya Sutskever, O. Vinyals, and Q.V. Le, "Sequence to sequence learning with neural networks", Proceedings of Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.
[15] Dzmitry Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate", ICLR, 2015. arXiv:1409.0473v7.
[16] Ankit Kumar, O. Irsoy, et al., "Ask me anything: Dynamic memory networks for natural language processing", Proceedings of the International Conference on Machine Learning, 2016.
[17] Hongyang Xue, Y. Liu, et al., "Tracking people in RGBD videos using deep learning and motion clues", Neurocomputing, 204, pp. 70-76, 2016.
[18] Guangnan Ye, Y. Li, et al., "EventNet: A large scale structured concept library for complex event detection in video", Proceedings of the ACM Multimedia Conference, pp. 471-480, 2015.
[19] Zuxuan Wu, X. Wang, et al., "Modeling spatial-temporal clues in a hybrid deep learning framework for video classification", Proceedings of the ACM Multimedia Conference, pp. 461-470, 2015.
[20] Andrej Karpathy, G. Toderici, et al., "Large-scale video classification with convolutional neural networks", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[21] Karen Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos", Proceedings of the International Conference on Neural Information Processing Systems, 1, pp. 568-576, 2014.
[22] Li Wang, T. Liu, et al., "Video tracking using learned hierarchical features", IEEE Transactions on Image Processing, 24(4), pp. 1424-1435, 2015.
[23] Zhen Dong, S. Jia, et al., "Face video retrieval via deep learning of binary hash representations", Proceedings of the 30th AAAI Conference on Artificial Intelligence, pp. 3471-3477, 2016.
[24] Hongyang Xue, Y. Liu, et al., "Tracking people in RGBD videos using deep learning and motion clues", Neurocomputing, 204, pp. 70-76, 2016.
[25] Geoffrey Hinton, L. Deng, et al., "Deep neural networks for acoustic modeling in speech recognition", IEEE Signal Processing Magazine, 29(6), pp. 82-97, 2012.
[26] Alex Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645-6649, 2013.
[27] Dong Yu and L. Deng, "Automatic Speech Recognition: A Deep Learning Approach", Springer-Verlag London, 2015.
[28] Zhen-Hua Ling, S.Y. Kang, et al., "Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends", IEEE Signal Processing Magazine, 32(3), pp. 32-52, 2015.
[29] Ke Chen and A. Salman, "Learning speaker-specific characteristics with a deep neural architecture", IEEE Transactions on Neural Networks, 22(11), pp. 1744-1756, 2011.
[30] Kun Han and D.L. Wang, "Neural network-based pitch tracking in very noisy speech", IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12), pp. 2158-2168, 2014.
[31] Richard Socher, C.C. Lin, et al., "Parsing natural scenes and natural language with recursive neural networks", Proceedings of the 28th International Conference on Machine Learning, pp. 129-136, 2011.
[32] Richard Socher, A. Karpathy, et al., "Grounded compositional semantics for finding and describing images with sentences", Transactions of the Association for Computational Linguistics, 2, pp. 207-218, 2014.
[33] Andrej Karpathy and F.F. Li, "Deep visual-semantic alignments for generating image descriptions", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128-3137, 2015.
[34] Yukun Zhu, R. Kiros, et al., "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books", Proceedings of the IEEE International Conference on Computer Vision, pp. 19-27, 2015.