
Deep Learning in Natural Language Processing

Li Deng · Yang Liu
Editors

Li Deng
AI Research at Citadel, Chicago, IL, USA
and AI Research at Citadel, Seattle, WA, USA

Yang Liu
Tsinghua University, Beijing, China

ISBN 978-981-10-5208-8
ISBN 978-981-10-5209-5 (eBook)
https://doi.org/10.1007/978-981-10-5209-5
Library of Congress Control Number: 2018934459

© Springer Nature Singapore Pte Ltd. 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
part of Springer Nature
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Foreword

“Written by a group of the most active researchers in the field, led by Dr. Deng, an
internationally respected expert in both NLP and deep learning, this book provides
a comprehensive introduction to and up-to-date review of the state of the art in applying
deep learning to solve fundamental problems in NLP. Further, the book is highly
timely, as demands for high-quality and up-to-date textbooks and research references
have risen dramatically in response to the tremendous strides in deep learning
applications to NLP. The book offers a unique reference guide for practitioners in
various sectors, especially the Internet and AI start-ups, where NLP technologies
are becoming an essential enabler and a core differentiator.”
Hongjiang Zhang (Founder, Sourcecode Capital; former CEO of KingSoft)
“This book provides a comprehensive introduction to the latest advances in deep
learning applied to NLP. Written by experienced and aspiring deep learning and
NLP researchers, it covers a broad range of major NLP applications, including
spoken language understanding, dialog systems, lexical analysis, parsing, knowledge
graph, machine translation, question answering, sentiment analysis, and social
computing.
The book is clearly structured and moves from major research trends, to the
latest deep learning approaches, to their limitations and promising future work.
Given its self-contained content, sophisticated algorithms, and detailed use cases,
the book offers a valuable guide for all readers who are working on or learning
about deep learning and NLP.”
Haifeng Wang (Vice President and Head of Research, Baidu; former President
of ACL)
“In 2011, at the dawn of deep learning in industry, I estimated that in most speech
recognition applications, computers still made 5 to 10 times more errors than human
subjects, and highlighted the importance of knowledge engineering in future
directions. Within only a handful of years since, deep learning has nearly closed the
gap in the accuracy of conversational speech recognition between human and
computers. Edited and written by Dr. Li Deng, a pioneer in the recent speech
recognition revolution using deep learning, and his colleagues, this book elegantly
describes this part of the fascinating history of speech recognition as an important
subfield of natural language processing (NLP). Further, the book expands this
historical perspective from speech recognition to more general areas of NLP,
offering a truly valuable guide for the future development of NLP.
Importantly, the book puts forward a thesis that the current deep learning trend is
a revolution from the previous data-driven (shallow) machine learning era, although
ostensibly deep learning appears to be merely exploiting more data, more computing
power, and more complex models. Indeed, as the book correctly points out,
the current state of the art of deep learning technology developed for NLP applications,
despite being highly successful in solving individual NLP tasks, has not
taken full advantage of rich world knowledge or human cognitive capabilities.
Therefore, I fully embrace the view expressed by the book’s editors and authors that
more advanced deep learning that seamlessly integrates knowledge engineering will
pave the way for the next revolution in NLP.
I highly recommend speech and NLP researchers, engineers, and students to read
this outstanding and timely book, not only to learn about the state of the art in NLP
and deep learning, but also to gain vital insights into what the future of the NLP
field will hold.”
Sadaoki Furui (President, Toyota Technological Institute at Chicago)
Preface

Natural language processing (NLP), which aims to enable computers to process
human languages intelligently, is an important interdisciplinary field crossing
artificial intelligence, computing science, cognitive science, information processing,
and linguistics. Concerned with interactions between computers and human languages,
NLP applications such as speech recognition, dialog systems, information
retrieval, question answering, and machine translation have started to reshape the
way people identify, obtain, and make use of information.
The development of NLP can be described in terms of three major waves:
rationalism, empiricism, and deep learning. In the first wave, rationalist approaches
advocated the design of handcrafted rules to incorporate knowledge into NLP
systems based on the assumption that knowledge of language in the human mind is
fixed in advance by genetic inheritance. In the second wave, empirical approaches
assumed that rich sensory input and the observable language data in surface form are
required and sufficient to enable the mind to learn the detailed structure of natural
language. As a result, probabilistic models were developed to discover the regularities
of languages from large corpora. In the third wave, deep learning exploits
hierarchical models of nonlinear processing, inspired by biological neural systems,
to learn intrinsic representations from language data, in ways that aim to simulate
human cognitive abilities.
The intersection of deep learning and natural language processing has resulted in
striking successes in practical tasks. Speech recognition is the first industrial NLP
application that deep learning has strongly impacted. With the availability of
large-scale training data, deep neural networks achieved dramatically lower
recognition errors than the traditional empirical approaches. Another prominent
successful application of deep learning in NLP is machine translation. End-to-end
neural machine translation that models the mapping between human languages
using neural networks has proven to improve translation quality substantially.
Therefore, neural machine translation has quickly become the new de facto technology
in major commercial online translation services offered by large technology
companies: Google, Microsoft, Facebook, Baidu, and more. Many other areas of
NLP, including language understanding and dialog, lexical analysis and parsing,
knowledge graph, information retrieval, question answering from text, social
computing, language generation, and text sentiment analysis, have also seen
significant progress using deep learning, riding on the third wave of
NLP. Nowadays, deep learning is a dominant method applied to practically all
NLP tasks.
The main goal of this book is to provide a comprehensive survey of the recent
advances in deep learning applied to NLP. The book presents the state of the art of
NLP-centric deep learning research, and focuses on the role that deep learning plays
in major NLP applications including spoken language understanding, dialog systems,
lexical analysis, parsing, knowledge graph, machine translation, question
answering, sentiment analysis, social computing, and natural language generation
(from images). This book is suitable for readers with a technical background in
computation, including graduate students, post-doctoral researchers, educators,
industrial researchers, and anyone interested in getting up to speed with the latest
techniques of deep learning associated with NLP.
The book is organized into eleven chapters as follows:
• Chapter 1: A Joint Introduction to Natural Language Processing and to Deep
Learning (Li Deng and Yang Liu)
• Chapter 2: Deep Learning in Conversational Language Understanding (Gokhan
Tur, Asli Celikyilmaz, Xiaodong He, Dilek Hakkani-Tür, and Li Deng)
• Chapter 3: Deep Learning in Spoken and Text-Based Dialog Systems
(Asli Celikyilmaz, Li Deng, and Dilek Hakkani-Tür)
• Chapter 4: Deep Learning in Lexical Analysis and Parsing (Wanxiang Che and
Yue Zhang)
• Chapter 5: Deep Learning in Knowledge Graph (Zhiyuan Liu and Xianpei Han)
• Chapter 6: Deep Learning in Machine Translation (Yang Liu and Jiajun Zhang)
• Chapter 7: Deep Learning in Question Answering (Kang Liu and Yansong Feng)
• Chapter 8: Deep Learning in Sentiment Analysis (Duyu Tang and Meishan
Zhang)
• Chapter 9: Deep Learning in Social Computing (Xin Zhao and Chenliang Li)
• Chapter 10: Deep Learning in Natural Language Generation from Images
(Xiaodong He and Li Deng)
• Chapter 11: Epilogue (Li Deng and Yang Liu)
Chapter 1 first reviews the basics of NLP as well as the main scope of NLP
covered in the following chapters of the book, and then goes in some depth into the
historical development of NLP summarized as three waves and future directions.
Subsequently, in Chaps. 2–10, an in-depth survey on the recent advances in deep
learning applied to NLP is organized into nine separate chapters, each covering a
largely independent application area of NLP. The main body of each chapter is
written by leading researchers and experts actively working in the respective field.
The origin of this book was the set of comprehensive tutorials given at the 15th
China National Conference on Computational Linguistics (CCL 2016) held in
October 2016 in Yantai, Shandong, China, where both of us, editors of this book,
were active participants and took leading roles. We thank Springer’s
senior editor, Dr. Celine Lanlan Chang, who kindly invited us to create this book
and who has provided much of the timely assistance needed to complete it.
We are grateful also to Springer’s Assistant Editor, Jane Li, for offering
invaluable help through various stages of manuscript preparation.
We thank all authors of Chaps. 2–10 who devoted their valuable time to carefully
preparing the content of their chapters: Gokhan Tur, Asli Celikyilmaz, Dilek
Hakkani-Tür, Wanxiang Che, Yue Zhang, Xianpei Han, Zhiyuan Liu, Jiajun Zhang,
Kang Liu, Yansong Feng, Duyu Tang, Meishan Zhang, Xin Zhao, Chenliang Li,
and Xiaodong He. The authors of Chaps. 4–9 are CCL 2016 tutorial speakers. They
spent a considerable amount of time updating their tutorial material with the
latest advances in the field since October 2016.
Further, we thank numerous reviewers and readers, Sadaoki Furui, Andrew Ng,
Fred Juang, Ken Church, Haifeng Wang, and Hongjiang Zhang, who not only gave
us much-needed encouragement but also offered many constructive comments
that substantially improved earlier drafts of the book.
Finally, we express our appreciation to our organizations, Microsoft Research and
Citadel (for Li Deng) and Tsinghua University (for Yang Liu), which provided
excellent environments, support, and encouragement that have been instrumental
in enabling us to complete this book. Yang Liu is also supported by the National Natural
Science Foundation of China (No. 61522204, No. 61432013, and No. 61331013).

Li Deng, Seattle, USA
Yang Liu, Beijing, China
October 2017
Contents

1 A Joint Introduction to Natural Language Processing and to Deep Learning
Li Deng and Yang Liu
2 Deep Learning in Conversational Language Understanding
Gokhan Tur, Asli Celikyilmaz, Xiaodong He, Dilek Hakkani-Tür and Li Deng
3 Deep Learning in Spoken and Text-Based Dialog Systems
Asli Celikyilmaz, Li Deng and Dilek Hakkani-Tür
4 Deep Learning in Lexical Analysis and Parsing
Wanxiang Che and Yue Zhang
5 Deep Learning in Knowledge Graph
Zhiyuan Liu and Xianpei Han
6 Deep Learning in Machine Translation
Yang Liu and Jiajun Zhang
7 Deep Learning in Question Answering
Kang Liu and Yansong Feng
8 Deep Learning in Sentiment Analysis
Duyu Tang and Meishan Zhang
9 Deep Learning in Social Computing
Xin Zhao and Chenliang Li
10 Deep Learning in Natural Language Generation from Images
Xiaodong He and Li Deng
11 Epilogue: Frontiers of NLP in the Deep Learning Era
Li Deng and Yang Liu
Glossary
Contributors

Asli Celikyilmaz Microsoft Research, Redmond, WA, USA


Wanxiang Che Harbin Institute of Technology, Harbin, China
Li Deng Citadel, Seattle & Chicago, USA
Yansong Feng Peking University, Beijing, China
Dilek Hakkani-Tür Google, Mountain View, CA, USA
Xianpei Han Institute of Software, Chinese Academy of Sciences, Beijing, China
Xiaodong He Microsoft Research, Redmond, WA, USA
Chenliang Li Wuhan University, Wuhan, China
Kang Liu Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yang Liu Tsinghua University, Beijing, China
Zhiyuan Liu Tsinghua University, Beijing, China
Duyu Tang Microsoft Research Asia, Beijing, China
Gokhan Tur Google, Mountain View, CA, USA
Jiajun Zhang Institute of Automation, Chinese Academy of Sciences, Beijing, China
Meishan Zhang Heilongjiang University, Harbin, China
Yue Zhang Singapore University of Technology and Design, Singapore
Xin Zhao Renmin University of China, Beijing, China

Acronyms

AI Artificial intelligence
AP Averaged perceptron
ASR Automatic speech recognition
ATN Augmented transition network
BiLSTM Bidirectional long short-term memory
BiRNN Bidirectional recurrent neural network
BLEU Bilingual evaluation understudy
BOW Bag-of-words
CBOW Continuous bag-of-words
CCA Canonical correlation analysis
CCG Combinatory categorial grammar
CDL Collaborative deep learning
CFG Context free grammar
CYK Cocke–Younger–Kasami
CLU Conversational language understanding
CNN Convolutional neural network
CNNSM Convolutional neural network based semantic model
cQA Community question answering
CRF Conditional random field
CTR Collaborative topic regression
CVT Compound value typed
DA Denoising autoencoder
DBN Deep belief network
DCN Deep convex net
DNN Deep neural network
DSSM Deep structured semantic model
DST Dialog state tracking
EL Entity linking
EM Expectation maximization
FSM Finite state machine

GAN Generative adversarial network
GRU Gated recurrent unit
HMM Hidden Markov model
IE Information extraction
IRQA Information retrieval-based question answering
IVR Interactive voice response
KBQA Knowledge-based question answering
KG Knowledge graph
L-BFGS Limited-memory Broyden–Fletcher–Goldfarb–Shanno
LSI Latent semantic indexing
LSTM Long short-term memory
MC Machine comprehension
MCCNN Multicolumn convolutional neural network
MDP Markov decision process
MERT Minimum error rate training
METEOR Metric for evaluation of translation with explicit ordering
MIRA Margin infused relaxed algorithm
ML Machine learning
MLE Maximum likelihood estimation
MLP Multiple layer perceptron
MMI Maximum mutual information
M-NMF Modularized nonnegative matrix factorization
MRT Minimum risk training
MST Maximum spanning tree
MT Machine translation
MV-RNN Matrix-vector recursive neural network
NER Named entity recognition
NFM Neural factorization machine
NLG Natural language generation
NMT Neural machine translation
NRE Neural relation extraction
OOV Out-of-vocabulary
PA Passive aggressive
PCA Principal component analysis
PMI Point-wise mutual information
POS Part of speech
PV Paragraph vector
QA Question answering
RAE Recursive autoencoder
RBM Restricted Boltzmann machine
RDF Resource description framework
RE Relation extraction
RecNN Recursive neural network
RL Reinforcement learning
RNN Recurrent neural network
ROUGE Recall-oriented understudy for gisting evaluation
RUBER Referenced metric and unreferenced metric blended evaluation routine
SDS Spoken dialog system
SLU Spoken language understanding
SMT Statistical machine translation
SP Semantic parsing
SRL Semantic role labeling
SRNN Segmental recurrent neural network
STAGG Staged query graph generation
SVM Support vector machine
UAS Unlabeled attachment score
UGC User-generated content
VIME Variational information maximizing exploration
VPA Virtual personal assistant
Chapter 1
A Joint Introduction to Natural
Language Processing and to Deep
Learning

Li Deng and Yang Liu

Abstract In this chapter, we set up the fundamental framework for the book. We
first provide an introduction to the basics of natural language processing (NLP) as an
integral part of artificial intelligence. We then survey the historical development of
NLP, spanning over five decades, in terms of three waves. The first two waves arose
as rationalism and empiricism, paving the way to the current deep learning wave. The
key pillars underlying the deep learning revolution for NLP consist of (1) distributed
representations of linguistic entities via embedding, (2) semantic generalization due
to the embedding, (3) long-span deep sequence modeling of natural language, (4)
hierarchical networks effective for representing linguistic levels from low to high,
and (5) end-to-end deep learning methods to jointly solve many NLP tasks. After
the survey, several key limitations of current deep learning technology for NLP are
analyzed. This analysis leads to five research directions for future advances in NLP.

1.1 Natural Language Processing: The Basics

Natural language processing (NLP) investigates the use of computers to process or to
understand human (i.e., natural) languages for the purpose of performing useful tasks.
NLP is an interdisciplinary field that combines computational linguistics, computing
science, cognitive science, and artificial intelligence. From a scientific perspective,
NLP aims to model the cognitive mechanisms underlying the understanding and production
of human languages. From an engineering perspective, NLP is concerned
with how to develop novel practical applications to facilitate the interactions between
computers and human languages. Typical applications in NLP include speech recognition,
spoken language understanding, dialogue systems, lexical analysis, parsing,
machine translation, knowledge graph, information retrieval, question answering,
sentiment analysis, social computing, natural language generation, and natural language
summarization. These NLP application areas form the core content of this
book.

L. Deng
Citadel, Seattle & Chicago, USA
Y. Liu
Tsinghua University, Beijing, China

© Springer Nature Singapore Pte Ltd. 2018
L. Deng and Y. Liu (eds.), Deep Learning in Natural Language Processing,
https://doi.org/10.1007/978-981-10-5209-5_1

Natural language is a system constructed specifically to convey meaning or semantics,
and is by its fundamental nature a symbolic or discrete system. The surface or
observable “physical” signal of natural language is called text, always in a symbolic
form. The text “signal” has its counterpart, the speech signal; the latter can
be regarded as the continuous correspondence of symbolic text, both entailing the
same latent linguistic hierarchy of natural language. From NLP and signal processing
perspectives, speech can be treated as a “noisy” version of text, imposing additional
difficulties in its need of “de-noising” when performing the task of understanding the
common underlying semantics. Chapters 2 and 3, as well as the current Chap. 1, of this
book cover the speech aspect of NLP in detail, while the remaining chapters start
directly from text in discussing a wide variety of text-oriented tasks that exemplify
the pervasive NLP applications enabled by machine learning techniques, notably
deep learning.
The symbolic nature of natural language is in stark contrast to the continuous
nature of language’s neural substrate in the human brain. We will defer this discussion
to Sect. 1.6 of this chapter when discussing future challenges of deep learning in NLP.
A related contrast is how the symbols of natural language are encoded in several
continuous-valued modalities, such as gesture (as in sign language), handwriting
(as an image), and, of course, speech. On the one hand, the word as a symbol is
used as a “signifier” to refer to a concept or a thing in the real world as a “signified”
object, necessarily a categorical entity. On the other hand, the continuous modalities
that encode symbols of words constitute the external signals sensed by the human
perceptual system and transmitted to the brain, which in turn operates in a continuous
fashion. While of great theoretical interest, the subject of contrasting the symbolic
nature of language versus its continuous rendering and encoding goes beyond the
scope of this book.
In the next few sections, we outline and discuss, from a historical perspective, the
development of general methodology used to study NLP as a rich interdisciplinary
field. Much like several closely related sub- and super-fields such as conversational
systems, speech recognition, and artificial intelligence, the development of NLP can
be described in terms of three major waves (Deng 2017; Pereira 2017), each of which
is elaborated in a separate section next.

1.2 The First Wave: Rationalism

NLP research in its first wave lasted for a long time, dating back to the 1950s. In 1950,
Alan Turing proposed the Turing test to evaluate a computer’s ability to exhibit intelligent
behavior indistinguishable from that of a human (Turing 1950). This test is based
on natural language conversations between a human and a computer designed to generate
human-like responses. In 1954, the Georgetown-IBM experiment demonstrated
the first machine translation system capable of translating more than 60 Russian sentences
into English.
The approaches, based on the belief that knowledge of language in the human
mind is fixed in advance by genetic inheritance, dominated most of NLP research
between about 1960 and the late 1980s. These approaches have been called rationalist
ones (Church 2007). The dominance of rationalist approaches in NLP was mainly
due to the widespread acceptance of arguments of Noam Chomsky for an innate
language structure and his criticism of N-grams (Chomsky 1957). Postulating that
key parts of language are hardwired in the brain at birth as a part of the human
genetic inheritance, rationalist approaches endeavored to design hand-crafted rules
to incorporate knowledge and reasoning mechanisms into intelligent NLP systems.
Up until the 1980s, the most notably successful NLP systems, such as ELIZA for simulating
a Rogerian psychotherapist and MARGIE for structuring real-world information into
concept ontologies, were based on complex sets of handwritten rules.
This period coincided approximately with the early development of artificial
intelligence, characterized by expert knowledge engineering, where domain experts
devised computer programs according to the knowledge they had about the (very narrow)
application domains (Nilsson 1982; Winston 1993). The experts designed
these programs using symbolic logical rules based on careful representations and
engineering of such knowledge. These knowledge-based artificial intelligence systems
tend to be effective in solving narrow-domain problems by examining the
“head” or most important parameters and reaching a solution about the appropriate
action to take in each specific situation. These “head” parameters are identified in
advance by human experts, leaving the “tail” parameters and cases untouched. Since
such systems lack learning capability, they have difficulty in generalizing the solutions to new
situations and domains. The typical approach during this period is exemplified by
the expert system, a computer system that emulates the decision-making ability of a
human expert. Such systems are designed to solve complex problems by reasoning
about knowledge (Nilsson 1982). The first expert system was created in the 1970s, and
expert systems then proliferated in the 1980s. The main “algorithm” used was the inference rules in the
form of “if-then-else” (Jackson 1998). The main strength of these first-generation
artificial intelligence systems is their transparency and interpretability in their (limited)
capability in performing logical reasoning. Like NLP systems such as ELIZA and
MARGIE, the general expert systems in the early days used hand-crafted expert
knowledge which was often effective in narrowly defined problems, although the
reasoning could not handle uncertainty that is ubiquitous in practical applications.
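
To make the "if-then" inference style concrete, the following is a minimal sketch of forward chaining over hand-written rules. The rule base, the fact names, and the function `forward_chain` are hypothetical illustrations and are not taken from any system cited in this chapter.

```python
# Hypothetical toy rule base: each rule is (set of required facts, conclusion).
# The fact names are illustrative assumptions only.
RULES = [
    ({"has_fever", "has_cough"}, "suspect_flu"),
    ({"suspect_flu", "short_of_breath"}, "refer_to_doctor"),
]


def forward_chain(facts, rules=RULES):
    """Fire every rule whose conditions are all known until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires only when all of its conditions are already known.
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


derived = forward_chain({"has_fever", "has_cough", "short_of_breath"})
print(sorted(derived))
```

Every derived fact can be traced back to the rule that produced it, which reflects the transparency noted above. A case the rule author did not anticipate derives nothing new, and there is no way to express that a conclusion is merely likely.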
In specific NLP application areas of dialogue systems and spoken language understanding,
to be described in more detail in Chaps. 2 and 3 of this book, such rationalistic
approaches were represented by the pervasive use of symbolic rules and
templates (Seneff et al. 1991). The designs were centered on grammatical and ontological
constructs, which, while interpretable and easy to debug and update, had
experienced severe difficulties in practical deployment. When such systems worked,
they often worked beautifully; but unfortunately this happened just not very often
and the domains were necessarily limited.
Likewise, speech recognition research and system design, another long-standing
NLP and artificial intelligence challenge, during this rationalist era were based
heavily on the paradigm of expert knowledge engineering, as elegantly analyzed
in (Church and Mercer 1993). During the 1970s and early 1980s, the expert system
approach to speech recognition was quite popular (Reddy 1976; Zue 1985). However,
the lack of abilities to learn from data and to handle uncertainty in reasoning was
acutely recognized by researchers, leading to the second wave of speech recognition,
NLP, and artificial intelligence described next.

1.3 The Second Wave: Empiricism

The second wave of NLP was characterized by the exploitation of data corpora and
of (shallow) machine learning, statistical or otherwise, to make use of such data
(Manning and Schütze 1999). As much of the structure of and theory about natural
language were discounted or discarded in favor of data-driven methods, the main
approaches developed during this era have been called empirical or pragmatic ones
(Church and Mercer 1993; Church 2014). With the increasing availability of machine-readable
data and steady increase of computational power, empirical approaches have
dominated NLP since around 1990. One of the major NLP conferences was even
named “Empirical Methods in Natural Language Processing (EMNLP)” to reflect
most directly the strongly positive sentiment of NLP researchers during that era
toward empirical approaches.
In contrast to rationalist approaches, empirical approaches assume that the human
mind only begins with general operations for association, pattern recognition, and
generalization. Rich sensory input is required to enable the mind to learn the detailed
structure of natural language. Prevalent in linguistics between 1920 and 1960, empiri-
cism has been undergoing a resurgence since 1990. Early empirical approaches to
NLP focused on developing generative models such as the hidden Markov model
(HMM) (Baum and Petrie 1966), the IBM translation models (Brown et al. 1993),
and the head-driven parsing models (Collins 1997) to discover the regularities of
languages from large corpora. Since the late 1990s, discriminative models have become
the de facto approach in a variety of NLP tasks. Representative discriminative mod-
els and methods in NLP include the maximum entropy model (Ratnaparkhi 1997),
support vector machines (Vapnik 1998), conditional random fields (Lafferty et al.
2001), maximum mutual information and minimum classification error (He et al.
2008), and perceptron (Collins 2002).
Again, this era of empiricism in NLP was paralleled by corresponding approaches
in artificial intelligence as well as in speech recognition and computer vision. It came
about after clear evidence that learning and perception capabilities are crucial for
complex artificial intelligence systems but missing in the expert systems popular in
the previous wave. For example, when DARPA opened its first Grand Challenge for
autonomous driving, most vehicles then relied on the knowledge-based artificial intel-
ligence paradigm. Much like speech recognition and NLP, the autonomous driving and
1 A Joint Introduction to Natural Language Processing and to Deep Learning 5

computer vision researchers immediately realized the limitation of the
knowledge-based paradigm due to the necessity for machine learning with uncertainty handling
and generalization capabilities.
The empiricism in NLP and speech recognition in this second wave was based
on data-intensive machine learning, which we now call “shallow” due to the general
lack of abstractions constructed by many-layer or “deep” representations of data
which would come in the third wave to be described in the next section. In machine
learning, researchers need not concern themselves with constructing precise and exact rules
as required for the knowledge-based NLP and speech systems during the first wave.
Rather, they focus on statistical models (Bishop 2006; Murphy 2012) or simple neural
networks (Bishop 1995) as an underlying engine. They then automatically learn or
“tune” the parameters of the engine using ample training data to make them handle
uncertainty, and to attempt to generalize from one condition to another and from one
domain to another. The key algorithms and methods for machine learning include EM
(expectation-maximization), Bayesian networks, support vector machines, decision
trees, and, for neural networks, the backpropagation algorithm.
Generally speaking, the machine learning based NLP, speech, and other artificial
intelligence systems perform much better than the earlier, knowledge-based counter-
parts. Successful examples include almost all artificial intelligence tasks in machine
perception—speech recognition (Jelinek 1998), face recognition (Viola and Jones
2004), visual object recognition (Fei-Fei and Perona 2005), handwriting recognition
(Plamondon and Srihari 2000), and machine translation (Och 2003).
More specifically, in machine translation, a core NLP application area to be
described in detail in Chap. 6 of this book as well as in (Church and Mercer 1993), the
field switched rather abruptly around 1990 from rationalistic methods outlined in
Sect. 1.2 to empirical, largely statistical methods. The availability of sentence-level
alignments in the bilingual training data made it possible to acquire surface-level
translation knowledge not by rules but from data directly, at the expense of discarding
or discounting structured information in natural languages. The most representative
work during this wave is that empowered by various versions of IBM translation
models (Brown et al. 1993). Subsequent developments during this empiricist era of
machine translation further significantly improved the quality of translation systems
(Och and Ney 2002; Och 2003; Chiang 2007; He and Deng 2012), but not at the
level of massive deployment in the real world (which would come after the next, deep
learning wave).
In the dialogue and spoken language understanding areas of NLP, this empiri-
cist era was also marked prominently by data-driven machine learning approaches.
These approaches were well suited to meet the requirement for quantitative evalua-
tion and concrete deliverables. They focused on broader but shallow, surface-level
coverage of text and domains instead of detailed analyses of highly restricted text
and domains. The training data were used not to design rules for language under-
standing and response action from the dialogue systems but to learn parameters of
(shallow) statistical or neural models automatically from data. Such learning helped
reduce the cost of hand-crafting complex dialogue managers, and helped
improve robustness against speech recognition errors in the overall spoken language
understanding and dialogue systems; for a review, see He and Deng (2013). More
specifically, for the dialogue policy component of dialogue systems, powerful rein-
forcement learning based on Markov decision processes had been introduced during
this era; for a review, see Young et al. (2013). And for spoken language understand-
ing, the dominant methods moved from rule- or template-based ones during the first
wave to generative models like hidden Markov models (HMMs) (Wang et al. 2011)
to discriminative models like conditional random fields (Tur and Deng 2011).
Similarly, in speech recognition, over close to 30 years from the early 1980s to around
2010, the field was dominated by the (shallow) machine learning paradigm using the
statistical generative model based on the HMM integrated with Gaussian mixture
models, along with various versions of its generalization (Baker et al. 2009a, b;
Deng and O’Shaughnessy 2003; Rabiner and Juang 1993). Among many versions of
the generalized HMMs were statistical and neural-network-based hidden dynamic
models (Deng 1998; Bridle et al. 1998; Deng and Yu 2007). The former adopted EM
and switching extended Kalman filter algorithms for learning model parameters (Ma
and Deng 2004; Lee et al. 2004), and the latter used backpropagation (Picone et al.
1999). Both of them made extensive use of multiple latent layers of representations for
the generative process of speech waveforms following the long-standing framework
of analysis-by-synthesis in human speech perception. More significantly, inverting
this “deep” generative process to its counterpart of an end-to-end discriminative
process gave rise to the first industrial success of deep learning (Deng et al. 2010,
2013; Hinton et al. 2012), which formed a driving force of the third wave of speech
recognition and NLP that will be elaborated next.

1.4 The Third Wave: Deep Learning

While the NLP systems, including speech recognition, language understanding, and
machine translation, developed during the second wave performed a lot better and
with higher robustness than those during the first wave, they were far from human-level
performance and left much to be desired. With a few exceptions, the (shallow)
machine learning models for NLP often did not have sufficient capacity to
absorb the large amounts of training data. Further, the learning algorithms, methods,
and infrastructures were not powerful enough. All this changed several years ago,
giving rise to the third wave of NLP, propelled by the new paradigm of deep-structured
machine learning or deep learning (Bengio 2009; Deng and Yu 2014; LeCun et al.
2015; Goodfellow et al. 2016).
In traditional machine learning, features are designed by humans and feature
engineering is a bottleneck, requiring significant human expertise. Concurrently,
the associated shallow models lack the representation power and hence the ability
to form levels of decomposable abstractions that would automatically disentangle
complex factors in shaping the observed language data. Deep learning overcomes
the above difficulties by the use of deep, layered model structure, often in the form of
neural networks, and the associated end-to-end learning algorithms. The advances in
deep learning are one major driving force behind the current NLP and more general
artificial intelligence inflection point and are responsible for the resurgence of neural
networks with a wide range of practical, including business, applications (Parloff
2016).
More specifically, despite the success of (shallow) discriminative models in a
number of important NLP tasks developed during the second wave, they suffered from
the difficulty of covering all regularities in languages by designing features manually
with domain expertise. Besides the incompleteness problem, such shallow models
also face the sparsity problem as features usually only occur once in the training
data, especially for highly sparse high-order features. Therefore, feature design had
become one of the major obstacles in statistical NLP before deep learning came
to the rescue. Deep learning brings hope for addressing the human feature engineering
problem, with a view called “NLP from scratch” (Collobert et al. 2011), which was
in early days of deep learning considered highly unconventional. Such deep learning
approaches exploit the powerful neural networks that contain multiple hidden layers
to solve general machine learning tasks dispensing with feature engineering. Unlike
shallow neural networks and related machine learning models, deep neural networks
are capable of learning representations from data using a cascade of multiple layers of
nonlinear processing units for feature extraction. As higher level features are derived
from lower level features, these levels form a hierarchy of concepts.
Deep learning originated from artificial neural networks, which can be viewed as
cascading models of cell types inspired by biological neural systems. With the advent
of the backpropagation algorithm (Rumelhart et al. 1986), training deep neural networks
from scratch attracted intensive attention in the 1990s. In these early days, without large
amounts of training data and without proper design and learning methods, during
neural network training the learning signals vanish exponentially with the number
of layers (or more rigorously the depth of credit assignment) when propagated from
layer to layer, making it difficult to tune connection weights of deep neural networks,
especially the recurrent versions. Hinton et al. (2006) initially overcame this problem
by using unsupervised pretraining to first learn generally useful feature detectors.
Then, the network is further trained by supervised learning to classify labeled data.
As a result, it is possible to learn the distribution of a high-level representation using
low-level representations. This seminal work marks the revival of neural networks. A
variety of network architectures have since been proposed and developed, including
deep belief networks (Hinton et al. 2006), stacked auto-encoders (Vincent et al. 2010),
deep Boltzmann machines (Hinton and Salakhutdinov 2012), deep convolutional
neural networks (Krizhevsky et al. 2012), deep stacking networks (Deng et al. 2012),
and deep Q-networks (Mnih et al. 2015). Capable of discovering intricate structures
in high-dimensional data, deep learning has since 2010 been successfully applied to
real-world tasks in artificial intelligence including notably speech recognition (Yu
et al. 2010; Hinton et al. 2012), image classification (Krizhevsky et al. 2012; He et al.
2016), and NLP (all chapters in this book). Detailed analyses and reviews of deep
learning have been provided in a set of tutorial survey articles (Deng 2014; LeCun
et al. 2015; Juang 2016).
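The exponential decay of learning signals with depth described above can be illustrated with a minimal sketch. It assumes a chain of scalar sigmoid units sharing one fixed weight; the function names, the weight value of 1.0, and the input value are illustrative choices and are not taken from the works cited in this section. Because the derivative of the sigmoid never exceeds 0.25, each additional layer multiplies the backpropagated signal by a factor of at most 0.25 when the weight is 1.0.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def backprop_signal(depth: int, weight: float = 1.0, x: float = 0.5) -> float:
    """Magnitude of d(output)/d(input) through `depth` stacked scalar sigmoid units."""
    # Forward pass: store the pre-activation of every unit.
    pre_activations = []
    h = x
    for _ in range(depth):
        z = weight * h
        pre_activations.append(z)
        h = sigmoid(z)
    # Backward pass: the chain rule multiplies one local derivative per layer.
    grad = 1.0
    for z in reversed(pre_activations):
        s = sigmoid(z)
        grad *= weight * s * (1.0 - s)  # sigmoid'(z) is at most 0.25
    return abs(grad)


for depth in (1, 5, 10, 20):
    print(depth, backprop_signal(depth))
```

Under these assumptions the printed magnitudes shrink geometrically with depth, which is one simple way to see why tuning the lower layers of a deep network by plain backpropagation was difficult in these early days.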

As speech recognition is one of the core tasks in NLP, we briefly discuss it here due to
its importance as the first industrial NLP application in the real world impacted strongly
by deep learning. Industrial applications of deep learning to large-scale speech recog-
nition started to take off around 2010. The endeavor was initiated with a collaboration
between academia and industry, with the original work presented at the 2009 NIPS
Workshop on Deep Learning for Speech Recognition and Related Applications. The
workshop was motivated by the limitations of deep generative models of speech, and
the possibility that the big-compute, big-data era warrants a serious exploration of
deep neural networks. It was believed then that pretraining DNNs using generative
models of deep belief nets based on the contrastive divergence learning algorithm
would overcome the main difficulties of neural nets encountered in the 1990s (Dahl
et al. 2011; Mohamed et al. 2009). However, early into this research at Microsoft, it
was discovered that without contrastive divergence pretraining, but with the use of
large amounts of training data together with the deep neural networks designed with
corresponding large, context-dependent output layers and with careful engineering,
dramatically lower recognition errors could be obtained than then-state-of-the-art
(shallow) machine learning systems (Yu et al. 2010, 2011; Dahl et al. 2012). This
finding was quickly verified by several other major speech recognition research
groups in North America (Hinton et al. 2012; Deng et al. 2013) and subsequently
overseas. Further, the nature of recognition errors produced by the two types of sys-
tems was found to be characteristically different, offering technical insights into how
to integrate deep learning into the existing highly efficient, run-time speech decod-
ing system deployed by major players in speech recognition industry (Yu and Deng
2015; Abdel-Hamid et al. 2014; Xiong et al. 2016; Saon et al. 2017). Nowadays,
the backpropagation algorithm applied to deep neural nets of various forms is uniformly
used in all current state-of-the-art speech recognition systems (Yu and Deng 2015;
Amodei et al. 2016; Saon et al. 2017), and all major commercial speech recognition
systems (Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google
Assistant, Apple Siri, Baidu and iFlyTek voice search, and more) are based on
deep learning methods.
The striking success of speech recognition in 2010–2011 heralded the arrival of
the third wave of NLP and artificial intelligence. Quickly following the success of
deep learning in speech recognition, computer vision (Krizhevsky et al. 2012) and
machine translation (Bahdanau et al. 2015) were taken over by the similar deep
learning paradigm. In particular, while the powerful technique of neural embedding
of words was developed as early as 2001 (Bengio et al. 2001), it was not until more
than 10 years later that it was shown to be practically useful at a large
scale (Mikolov et al. 2013) due to the availability of big data and faster computation.
In addition, a large number of other real-world NLP applications, such as image
captioning (Karpathy and Fei-Fei 2015; Fang et al. 2015; Gan et al. 2017), visual
question answering (Fei-Fei and Perona 2016), speech understanding (Mesnil et al.
2013), web search (Huang et al. 2013b), and recommendation systems, have been
made successful due to deep learning, in addition to many non-NLP tasks including
drug discovery and toxicology, customer relationship management, recommendation
systems, gesture recognition, medical informatics, advertisement, medical image
analysis, robotics, self-driving vehicles, board and eSports games (e.g., Atari, Go,
Poker, and the latest, DOTA2), and so on. For more details, see
https://en.wikipedia.org/wiki/deep_learning.
In more specific text-based NLP application areas, machine translation is perhaps
impacted the most by deep learning. Advancing from the shallow statistical machine
translation developed during the second wave of NLP, the current best machine
translation systems in real-world applications are based on deep neural networks. For
example, Google announced the first stage of its move to neural machine translation
in September 2016 and Microsoft made a similar announcement 2 months later.
Facebook had been working on the conversion to neural machine translation for
about a year, and by August 2017 it was at full deployment. Details of the deep learning
techniques in these state-of-the-art large-scale machine translation systems will be
reviewed in Chap. 6.
In the area of spoken language understanding and dialogue systems, deep learning
is also making a huge impact. The current popular techniques maintain and expand
the statistical methods developed during the second-wave era in several ways. Like the
empirical, (shallow) machine learning methods, deep learning is also based on data-
intensive methods to reduce the cost of hand-crafted complex understanding and
dialogue management, to be robust against speech recognition errors in noisy
environments and against language understanding errors, and to exploit the power
of Markov decision processes and reinforcement learning for designing dialogue
policy, e.g., (Gasic et al. 2017; Dhingra et al. 2017). Compared with the earlier
methods, deep neural network models and representations are much more powerful
and they make end-to-end learning possible. However, deep learning has not yet
solved the problems of interpretability and domain scalability associated with earlier
empirical techniques. Details of the deep learning techniques popular for current
spoken language understanding and dialogue systems as well as their challenges
will be reviewed in Chaps. 2 and 3.
Two important recent technological breakthroughs brought about in applying deep
learning to NLP problems are sequence-to-sequence learning (Sutskever et al. 2014)
and attention modeling (Bahdanau et al. 2015). The sequence-to-sequence learning
introduces a powerful idea of using recurrent nets to carry out both encoding and
decoding in an end-to-end manner. While attention modeling was initially developed
to overcome the difficulty of encoding a long sequence, subsequent developments
significantly extended its power to provide highly flexible alignment of two arbitrary
sequences that can be learned together with neural network parameters. The key
concepts of sequence-to-sequence learning and of attention mechanism boosted the
performance of neural machine translation based on distributed word embedding over
the best system based on statistical learning and local representations of words and
phrases. Soon after this success, these concepts were also applied successfully
to a number of other NLP-related tasks such as image captioning (Karpathy and
Fei-Fei 2015; Devlin et al. 2015), speech recognition (Chorowski et al. 2015), meta-
learning for program execution, one-shot learning, syntactic parsing, lip reading, text
understanding, summarization, question answering, and more.
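The alignment role of attention described above can be sketched for a single decoding step as follows. For brevity, the sketch scores each encoder state by a plain dot product with the decoder state, which is a simplification and not the learned additive scoring of Bahdanau et al. (2015); the function names and the toy vectors are assumptions made only for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(decoder_state: Vector, encoder_states: List[Vector]) -> Tuple[List[float], Vector]:
    """Return (alignment weights, context vector) for one decoding step."""
    # One relevance score per source position (dot-product scoring, an assumption).
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    # The context vector is the weighted average of all encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context


# Toy example: three source positions with two-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
print(weights, context)
```

The weights form a soft alignment over source positions that sums to one, so the decoder no longer depends on a single fixed-length encoding of the whole input sequence. In a trained system the scoring function and the hidden states would be learned jointly with the rest of the network.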

Setting aside their huge empirical successes, models of neural-network-based
deep learning are often simpler and easier to design than the traditional machine
learning models developed in the earlier wave. In many applications, deep learning
is performed simultaneously for all parts of the model, from feature extraction all
the way to prediction, in an end-to-end manner. Another factor contributing to the
simplicity of neural network models is that the same model building blocks (i.e., the
different types of layers) are generally used in many different applications. Using
the same building blocks for a large variety of tasks makes the adaptation of models
used for one task or data to another task or data relatively easy. In addition, software
toolkits have been developed to allow faster and more efficient implementation of
these models. For these reasons, deep neural networks are nowadays a prominent
method of choice for a large variety of machine learning and artificial intelligence
tasks over large datasets including, prominently, NLP tasks.
Although deep learning has proven effective in reshaping the processing of speech,
images, and videos in a revolutionary way, the effectiveness is less clear-cut in inter-
secting deep learning with text-based NLP despite its empirical successes in a number
of practical NLP tasks. In speech, image, and video processing, deep learning effec-
tively addresses the semantic gap problem by learning high-level concepts from raw
perceptual data in a direct manner. However, in NLP, stronger theories and structured
models on morphology, syntax, and semantics have been advanced to distill the under-
lying mechanisms of understanding and generation of natural languages, which have
not been as easily compatible with neural networks. Compared with speech, image,
and video signals, it seems less straightforward to see that the neural representations
learned from textual data can provide equally direct insights onto natural language.
Therefore, applying neural networks, especially those having sophisticated hierar-
chical architectures, to NLP has received increasing attention and has become the
most active area in both NLP and deep learning communities with highly visible
progress made in recent years (Deng 2016; Manning and Socher 2017). Surveying
the advances and analyzing the future directions in deep learning for NLP form the
main motivation for us to write this chapter and to create this book, in the hope
of helping NLP researchers accelerate research further at the current fast pace of
progress.

1.5 Transitions from Now to the Future

Before analyzing the future directions of NLP with more advanced deep learning, here
we first summarize the significance of the transition from the past waves of NLP to
the present one. We then discuss some clear limitations and challenges of the present
deep learning technology for NLP, to pave a way to examining further development
that would overcome these limitations for the next wave of innovations.

1.5.1 From Empiricism to Deep Learning: A Revolution

On the surface, the rising deep learning wave discussed in Sect. 1.4 in this chapter
appears to be a simple push of the second, empiricist wave of NLP (Sect. 1.3) into
an extreme end with bigger data, larger models, and greater computing power. After
all, the fundamental approaches developed during both waves are data-driven and
are based on machine learning and computation, and have dispensed with human-
centric “rationalistic” rules that are often brittle and costly to acquire in practical
NLP applications. However, if we analyze these approaches holistically and at a
deeper level, we can identify aspects of conceptual revolution moving from empiricist
machine learning to deep learning, and can subsequently analyze the future directions
of the field (Sect. 1.6). This revolution, in our opinion, is no less significant than the
revolution from the earlier rationalist wave to empiricist one as analyzed at the
beginning (Church and Mercer 1993) and at the end of the empiricist era (Charniak
2011).
Empiricist machine learning and linguistic data analysis during the second NLP
wave were started in the early 1990s by crypto-analysts and computer scientists working
on natural language sources that are highly limited in vocabulary and application
domains. As we discussed in Sect. 1.3, surface-level text observations, i.e., words
and their sequences, are counted using discrete probabilistic models without relying
on deep structure in natural language. The basic representations were “one-hot” or
localist, where no semantic similarity between words was exploited. With restric-
tions in domains and associated text content, such structure-free representations and
empirical models are often sufficient to cover much of what needs to be covered.
That is, the shallow, count-based statistical models can naturally do well in limited
and specific NLP tasks. But when the domain and content restrictions are lifted for
more realistic NLP applications in the real world, count-based models would necessarily
become ineffective, no matter how many tricks of smoothing have been invented
in an attempt to mitigate the problem of combinatorial counting sparseness. This
is where deep learning for NLP truly shines—distributed representations of words
via embedding, semantic generalization due to the embedding, longer span deep
sequence modeling, and end-to-end learning methods have all contributed to beat-
ing empiricist, count-based methods in a wide range of NLP tasks as discussed in
Sect. 1.4.
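The contrast between localist ("one-hot") and distributed word representations drawn above can be made concrete with a small sketch. The three-word vocabulary and the dense vectors below are hand-picked for illustration only; they are not learned embeddings and do not come from any system discussed in this chapter.

```python
import math
from typing import Dict, List

VOCAB = ["cat", "dog", "car"]


def one_hot(word: str) -> List[float]:
    """Localist code: one dimension per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


# Hand-picked toy vectors standing in for learned embeddings (an assumption).
TOY_EMBEDDING: Dict[str, List[float]] = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Any two distinct one-hot vectors are orthogonal: no similarity is expressed.
localist_sim = cosine(one_hot("cat"), one_hot("dog"))
# Dense vectors can place related words close together and unrelated words apart.
distributed_sim = cosine(TOY_EMBEDDING["cat"], TOY_EMBEDDING["dog"])
unrelated_sim = cosine(TOY_EMBEDDING["cat"], TOY_EMBEDDING["car"])
print(localist_sim, distributed_sim, unrelated_sim)
```

With one-hot codes, evidence observed for one word cannot transfer to any other word, which is the root of the counting sparseness discussed above. With distributed codes, whatever a model learns about one word partially carries over to its neighbors in the embedding space.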

1.5.2 Limitations of Current Deep Learning Technology

Despite the spectacular successes of deep learning in NLP tasks, most notably in
speech recognition/understanding, language modeling, and in machine translation,
there remain huge challenges. The current deep learning methods based on neural
networks as a black box generally lack interpretability and are even further from
explainability, in contrast to the “rationalist” paradigm established during the first
NLP wave where the rules devised by experts were naturally explainable. In practice,
however, it is highly desirable to explain the predictions from a seemingly “black-
box” model, not only for improving the model but for providing the users of the
prediction system with interpretations of the suggested actions to take (Koh and
Liang 2017).
In a number of applications, deep learning methods have proved to give recognition
accuracy close to or exceeding that of humans, but they require considerably more
training data, power consumption, and computing resources than humans. Also,
the accuracy results are statistically impressive but often unreliable on an individual
basis. Further, most of the current deep learning models have no reasoning and
explaining capabilities, making them vulnerable to disastrous failures or attacks with-
out the ability to foresee and thus to prevent them. Moreover, the current NLP models
have not taken into account the need for developing and executing goals and plans
for decision-making via ultimate NLP systems. A more specific limitation of current
NLP methods based on deep learning is their poor ability to understand and
reason about inter-sentential relationships, although great progress has been made
on relationships between words and phrases within sentences.
As discussed earlier, the success of deep learning in NLP has largely come from a
simple strategy thus far—given an NLP task, apply standard sequence models based
on (bidirectional) LSTMs, add attention mechanisms if information required in the
task needs to flow from another source, and then train the full models in an end-to-
end manner. However, while sequence modeling is naturally appropriate for speech,
human understanding of natural language (in text form) requires more complex
structure than sequence. That is, current sequence-based deep learning systems for
NLP can be further advanced by exploiting modularity, structured memories, and
recursive, tree-like representations for sentences and larger text (Manning 2016).
To overcome the challenges outlined above and to achieve the ultimate success
of NLP as a core artificial intelligence field, both fundamental and applied research
are needed. The next new wave of NLP and artificial intelligence will not come until
researchers create new paradigmatic, algorithmic, and computational (including hardware)
breakthroughs. Here, we outline several high-level directions toward potential
breakthroughs.

1.6 Future Directions of NLP

1.6.1 Neural-Symbolic Integration

A potential breakthrough is in developing advanced deep learning models and methods
that are more effective than current methods in building, accessing, and exploiting
memories and knowledge, including, in particular, common-sense knowledge.
It is not clear how to best integrate the current deep learning methods, centered
on distributed representations (of everything), with explicit, easily interpretable, and
localist-represented knowledge about natural language and the world and with related
reasoning mechanisms.
One path to this goal is to seamlessly combine neural networks and symbolic
language systems. These NLP and artificial intelligence systems will aim to discover
by themselves the underlying causes or logical rules that shape their prediction and
decision-making processes, in a way interpretable to human users in symbolic natural language
forms. Recently, very preliminary work in this direction made use of an integrated
neural-symbolic representation called tensor-product neural memory cells, capable
of decoding back to symbolic forms. This structured neural representation is provably
lossless in the coded information after extensive learning within the neural-tensor
domain (Palangi et al. 2017; Smolensky et al. 2016; Lee et al. 2016). Extensions
of such tensor-product representations, when applied to NLP tasks such as machine
reading and question answering, are aimed to learn to process and understand massive
natural language documents. After learning, the systems will be able not only to
answer questions sensibly but also to truly understand what they read, to the extent that
they can convey such understanding to human users in providing clues as to what steps
have been taken to reach the answer. These steps may be in the form of logical reason-
ing expressed in natural language which is thus naturally understood by the human
users of this type of machine reading and comprehension systems. In our view, natu-
ral language understanding is not just to accurately predict an answer from a question
with relevant passages or data graphs as its contextual knowledge in a supervised
way after seeing many examples of matched questions–passages–answers. Rather,
the desired NLP system equipped with real understanding should resemble human
cognitive capabilities. As an example of such capabilities (Nguyen et al. 2017)—
after an understanding system is trained well, say, in a question answering task
(using supervised learning or otherwise), it should master all essential aspects of the
observed text material provided to solve the question answering tasks. What such
mastering entails is that the learned system can subsequently perform well on other
NLP tasks, e.g., translation, summarization, recommendation, etc., without seeing
additional paired data such as raw text data with its summary, or parallel English and
Chinese texts, etc.
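The idea of a neural representation that can be decoded back to symbolic form, mentioned above for tensor-product representations, can be sketched with the classic binding and unbinding operations. This is a minimal illustration under stated assumptions and not the architecture of the cited work: the filler vectors, the orthonormal role vectors, and the function names are all hypothetical choices, and nothing here is learned.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical filler vectors, one per symbol.
FILLERS: Dict[str, Vector] = {
    "the": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "sleeps": [1.0, 1.0],
}
# Hypothetical role vectors, one per position; chosen to be orthonormal.
ROLES: List[Vector] = [
    [0.6, 0.8, 0.0],
    [-0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]


def encode(symbols: List[str]) -> Matrix:
    """Bind each filler to its role by an outer product and sum the bindings."""
    dim_f = len(next(iter(FILLERS.values())))
    dim_r = len(ROLES[0])
    tensor = [[0.0] * dim_r for _ in range(dim_f)]
    for position, symbol in enumerate(symbols):
        filler, role = FILLERS[symbol], ROLES[position]
        for i in range(dim_f):
            for j in range(dim_r):
                tensor[i][j] += filler[i] * role[j]
    return tensor


def decode(tensor: Matrix, length: int) -> List[str]:
    """Unbind each position with its role vector and pick the nearest filler."""
    decoded = []
    for position in range(length):
        role = ROLES[position]
        recovered = [sum(t * r for t, r in zip(row, role)) for row in tensor]
        nearest = min(
            FILLERS,
            key=lambda s: sum((a - b) ** 2 for a, b in zip(FILLERS[s], recovered)),
        )
        decoded.append(nearest)
    return decoded


print(decode(encode(["the", "cat", "sleeps"]), 3))
```

Because the role vectors are orthonormal, multiplying the summed tensor by one role vector cancels every other binding and returns that position's filler, so the symbol sequence is recovered from a single distributed object. The sketch handles at most three positions, one per role vector.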
One way to examine the nature of such powerful neural-symbolic systems is
to regard them as ones incorporating the strength of the “rationalist” approaches
marked by expert reasoning and structure richness popular during the first wave of
NLP discussed in Sect. 1.2. Interestingly, prior to the rise of the deep learning (third)
wave of NLP, Church (2007) argued that the pendulum from rationalist to empiricist
approaches had swung too far at almost the peak of the second NLP wave, and
predicted that a new rationalist wave would arrive. However, rather than swinging
back to a renewed rationalist era of NLP, the deep learning era arrived in full force in just
a short period from the time of writing by Church (2007). Instead of adding the ratio-
nalist flavor, deep learning has been pushing empiricism of NLP to its pinnacle with
big data and big compute, and with conceptually revolutionary ways of representing
a sweeping range of linguistic entities by massive parallelism and distributedness,
thus drastically enhancing the generalization capability of new-generation NLP mod-
els. Only after the sweeping successes of current deep learning methods for NLP
(Sect. 1.4) and subsequent analyses of a series of their limitations, do researchers
look into the next wave of NLP—not swinging back to rationalism while abandon-
ing empiricism but developing more advanced deep learning paradigms that would
organically integrate the missing essence of rationalism into the structured neural
methods that are aimed to approach human cognitive functions for language.

1.6.2 Structure, Memory, and Knowledge

As discussed earlier in this chapter as well as in the current NLP literature (Man-
ning and Socher 2017), NLP researchers at present still have very primitive deep
learning methods for exploiting structure and for building and accessing memories
or knowledge. While LSTM (with attention) has been pervasively applied to NLP
tasks to beat many NLP benchmarks, LSTM is far from a good memory model
for human cognition. In particular, LSTM lacks adequate structure for simulating
episodic memory, and one key component of human cognitive ability is to retrieve
and re-experience aspects of a past novel event or thought. This ability gives rise
to one-shot learning skills and can be crucial in reading comprehension of natural
language text or speech understanding, as well as reasoning over events described by
natural language. Many recent studies have been devoted to better memory model-
ing, including external memory architectures with supervised learning (Vinyals et al.
2016; Kaiser et al. 2017) and augmented memory architectures with reinforcement
learning (Graves et al. 2016; Oh et al. 2016). However, they have not shown general
effectiveness, and have suffered from a number of limitations, notably
scalability (arising from the use of attention, which has to access every stored element
in the memory). Much work remains in the direction of better modeling of memory
and exploitation of knowledge for text understanding and reasoning.
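
To make the scalability limitation concrete, the following is a minimal sketch of a soft attention read over an external memory, assuming simple dot-product scoring. The function name `attention_read` and the toy memory contents are illustrative assumptions and are not taken from the cited architectures. The point it shows is that a single read scores every stored slot, so the cost of each read grows linearly with the memory size.

```python
import math

def attention_read(query, memory):
    """Soft attention read: a weighted average over ALL memory slots."""
    # Score the query against every stored slot (work grows with len(memory)).
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The read vector mixes every slot according to its attention weight.
    dim = len(query)
    read = [sum(w * slot[i] for w, slot in zip(weights, memory))
            for i in range(dim)]
    return read, weights

# Toy memory with three 2-dimensional slots (hypothetical values).
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
read, weights = attention_read([2.0, 0.0], memory)
```

Because no slot is skipped, doubling the number of stored elements doubles the work per read, which is the scalability issue noted above.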

1.6.3 Unsupervised and Generative Deep Learning

Another potential breakthrough in deep learning for NLP is in new algorithms for
unsupervised deep learning, which makes use of ideally no direct teaching signals
paired with inputs (token by token) to guide the learning. Word embedding discussed
in Sect. 1.4 can be viewed as a weak form of unsupervised learning, making use of
adjacent words as “cost-free” surrogate teaching signals, but for real-world NLP pre-
diction tasks, such as translation, understanding, summarization, etc., such embed-
ding obtained in an “unsupervised manner” has to be fed into another supervised
architecture which requires costly teaching signals. In truly unsupervised learning
which requires no expensive teaching signals, new types of objective functions and
new optimization algorithms are needed, e.g., the objective function for unsupervised
learning should not require explicit target label data aligned with the input data as
in cross entropy, which is most popular for supervised learning. Development of
unsupervised deep learning algorithms has lagged significantly behind that of supervised
and reinforcement deep learning, where backpropagation and Q-learning algorithms
have become reasonably mature.
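
As an illustration of adjacent words serving as "cost-free" surrogate teaching signals, the sketch below builds (center, context) training pairs directly from raw text, in the style of skip-gram training, one common way of learning word embeddings. The helper name `skipgram_pairs` and the `window` parameter are hypothetical choices made for this example; no human labels are involved at any point.

```python
def skipgram_pairs(tokens, window=1):
    """Build (center, context) pairs; the context word acts as the 'label'."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Raw, unlabeled text is the only input.
pairs = skipgram_pairs("deep learning for language".split())
```

The resulting pairs can feed a standard cross-entropy objective, which is why this counts only as a weak form of unsupervised learning: the targets exist, but they come for free from the text itself.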
The most recent preliminary development in unsupervised learning takes the
approach of exploiting sequential output structure and advanced optimization meth-
ods to alleviate the need for using labels in training prediction systems (Russell and
Stefano 2017; Liu et al. 2017). Future advances in unsupervised learning look promising
through exploiting new sources of learning signals, including the structure of input data
and the mapping relationships from input to output and vice versa. Exploiting the
relationship from output to input is closely connected to building conditional generative
models. To this end, a recently popular topic in deep learning, generative adversarial
networks (Goodfellow et al. 2014), is a highly promising direction in which the
long-standing concept of analysis-by-synthesis in pattern recognition and machine
learning is likely to return to the spotlight in the near future in solving NLP tasks in new
ways.
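
For readers unfamiliar with generative adversarial networks, the sketch below estimates the two-player value function of Goodfellow et al. (2014), V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], from discriminator outputs on a few real and generated samples. The function name `gan_value` and the probability values are assumptions made for illustration; a real system would obtain them from trained networks.

```python
import math

def gan_value(d_real, d_fake):
    """Sample estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Discriminator cannot tell real from generated: it outputs 0.5 everywhere.
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])

# Discriminator separates the two well (hypothetical outputs).
v_sharp = gan_value([0.9, 0.9], [0.1, 0.1])
```

The discriminator tries to increase this value while the generator tries to decrease it. When the discriminator is reduced to outputting 0.5 for every sample, the value equals -log 4.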
Generative adversarial networks have been formulated as neural nets, with dense
connectivity among nodes and with no probabilistic setting. On the other hand,
probabilistic and Bayesian reasoning, which often takes computational advantage
of sparse connections among “nodes” as random variables, has been one of the
principal theoretical pillars of machine learning and has been responsible for many
NLP methods developed during the empiricist wave of NLP discussed in Sect. 1.3.
What is the right interface between deep learning and probabilistic modeling? Can
probabilistic thinking help understand deep learning techniques better and motivate
new deep learning methods for NLP tasks? How about the other way around? These
issues are wide open for future research.

1.6.4 Multimodal and Multitask Deep Learning

Multimodal and multitask deep learning are related learning paradigms, both con-
cerning the exploitation of latent representations in the deep networks pooled from
different modalities (e.g., audio, speech, video, images, text, source code, etc.) or
from multiple cross-domain tasks (e.g., point and structured prediction, ranking,
recommendation, time-series forecasting, clustering, etc.). Before the deep learning
wave, multimodal and multitask learning had been very difficult to make effective,
due to the lack of intermediate representations that are shared across modalities or tasks.
A most striking example of this contrast for multitask learning is multilingual
speech recognition during the empiricist wave (Lin et al. 2008) versus during the deep
learning wave (Huang et al. 2013a).
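
The idea of an intermediate representation shared across tasks can be sketched as one encoder feeding several task-specific output heads. The code below is a minimal forward pass only, with hypothetical fixed weights and made-up task names (`ranking_head`, `classification_head`); it is not the architecture of any cited system, and in practice all parameters would be learned jointly from the tasks' data.

```python
import math

def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Hypothetical fixed parameters, chosen only for illustration.
SHARED_W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
SHARED_B = [0.0, 0.1, 0.0]
RANK_W = [[1.0, -1.0, 0.5]]                      # task A: one ranking score
RANK_B = [0.0]
CLASS_W = [[0.3, 0.2, -0.1], [-0.3, 0.1, 0.4]]   # task B: two class scores
CLASS_B = [0.0, 0.0]

def shared_representation(x):
    """Intermediate representation used by every task."""
    return [math.tanh(v) for v in dense(x, SHARED_W, SHARED_B)]

def ranking_head(h):
    return dense(h, RANK_W, RANK_B)[0]

def classification_head(h):
    return dense(h, CLASS_W, CLASS_B)

x = [1.0, 2.0]
h = shared_representation(x)      # computed once
score = ranking_head(h)           # reused by task A
logits = classification_head(h)   # reused by task B
```

Since both heads read the same vector `h`, training signal from either task would shape the shared layer, which is the mechanism that lets data from one task or modality benefit another.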
Multimodal information can be exploited as low-cost supervision. For instance,
standard speech recognition, image recognition, and text classification methods make
use of supervision labels within each of the speech, image, and text modalities separately.
This, however, is far from how children learn to recognize speech and images and
to classify text. For example, children often get the distant “supervision” signal for