Raymond S. T. Lee

Natural Language Processing: A Textbook with Python Implementation
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
This book is dedicated to all readers and students taking my undergraduate and postgraduate courses in Natural Language Processing, whose enthusiasm in seeking knowledge inspired me to write this book.
Preface
Natural Language Processing (NLP) and its related applications have become part of daily life with the exponential growth of Artificial Intelligence (AI) in the past decades. NLP applications, including Information Retrieval (IR) systems, Text Summarization systems, and Question-and-Answering (chatbot) systems, have become prevalent topics in both industry and academia; they have changed daily routines and benefited a wide array of day-to-day services immensely.
The objective of this book is to provide NLP concepts and knowledge to readers, with seven step-by-step workshops (14 hours in total) to practice various core Python-based NLP tools: NLTK, spaCy, TensorFlow Keras, Transformer, and BERT technology, to construct NLP applications.
• Chapter 1: Natural Language Processing
This introductory chapter begins with human language and intelligence, constituting six levels of linguistics, followed by a brief history of NLP with its major components and applications. It serves as the cornerstone of the NLP concepts and technology discussed in the following chapters. This chapter also serves as the conceptual basis for Workshop#1: Basics of Natural Language Toolkit (NLTK) in Chap. 10.
• Chapter 2: N-gram Language Model
A language model is the foundation of NLP. This chapter introduces the N-gram language model and Markov Chains, using the classical literature The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (1859–1930) to illustrate how the N-gram model works to form the NLP basics of text analysis, followed by Shannon's model and text generation with evaluation schemes. This chapter also serves as the conceptual basis for Workshop#2 on N-gram modelling with NLTK in Chap. 11.
• Chapter 3: Part-of-Speech Tagging
Part-of-Speech (POS) Tagging is the foundation of text processing in NLP. This chapter describes how it relates to NLP and Natural Language Understanding (NLU), and covers the types of and algorithms for POS Tagging, including Rule-based POS Tagging, Stochastic POS Tagging, and Hybrid POS Tagging with the Brill Tagger, together with evaluation schemes. This chapter also serves as the conceptual basis for Workshop#3: Part-of-Speech Tagging using Natural Language Toolkit in Chap. 12.
• Chapter 4: Syntax and Parsing
As another major component of Natural Language Understanding (NLU), this chapter explores syntax analysis and introduces different types of constituents in the English language, followed by the main concepts of context-free grammar (CFG) and CFG parsing. It also studies major parsing techniques, including lexical and probabilistic parsing, with live examples for illustration.
• Chapter 5: Meaning Representation
Before the study of Semantic Analysis, this chapter explores meaning representation, a vital component in NLP. It studies four major meaning representation techniques: first-order predicate calculus (FOPC), semantic nets, conceptual dependency diagrams (CDD), and frame-based representation. After that, it explores canonical form and introduces Fillmore's theory of universal cases, followed by predicate logic and inference work using FOPC, with live examples.
• Chapter 6: Semantic Analysis
This chapter studies Semantic Analysis, one of the core concepts in learning NLP. First, it studies the two basic schemes of semantic analysis: lexical and compositional semantic analysis. After that, it explores word senses and six commonly used lexical semantic relations, followed by word sense disambiguation (WSD) and various WSD schemes. Further, it studies WordNet and online thesauri for word similarity, and various distributional similarity measurements, including the Point-wise Mutual Information (PMI) and Positive Point-wise Mutual Information (PPMI) models, with live examples for illustration. Chapters 5 and 6 also serve as the conceptual basis for Workshop#4: Semantic Analysis and Word Vectors using spaCy in Chap. 13.
• Chapter 7: Pragmatic Analysis
After the discussion of semantic meaning and analysis, this chapter explores pragmatic analysis in linguistics and discourse phenomena. It studies coherence and coreference as the key components of pragmatics and discourse critical to NLP, followed by discourse segmentation and different algorithms for Coreference Resolution, including the Hobbs Algorithm, the Centering Algorithm, the Log-Linear Model, the latest machine learning methods, and evaluation schemes. This chapter also serves as the conceptual basis for Workshop#5: Sentiment Analysis and Text Classification in Chap. 14.
• Chapter 8: Transfer Learning and Transformer Technology
Transfer learning is a commonly used deep learning technique to minimize computational resources. This chapter explores: (1) Transfer Learning (TL) against traditional Machine Learning (ML); (2) Recurrent Neural Networks (RNN), a significant component of transfer learning, with core technologies such as the Long Short-Term Memory (LSTM) Network and Bidirectional Recurrent Neural Networks (BRNN) in NLP applications; and (3) Transformer technology architecture, the Bidirectional Encoder Representations from Transformers (BERT) Model, and related technologies including Transformer-XL and ALBERT. This chapter also serves as the conceptual basis for Workshop#6: Transformers with spaCy and TensorFlow in Chap. 15.
• Chapter 9: Major Natural Language Processing Applications
This chapter summarizes Part I with three core NLP applications: Information Retrieval (IR) systems, Text Summarization (TS) systems, and Question-and-Answering (Q&A) chatbot systems, covering how they work and related R&D in building NLP applications. It also serves as the conceptual basis for Workshop#7: Building Chatbot with TensorFlow and Transformer Technology in Chap. 16.
These first workshops provide foundation techniques for text analysis, parsing, and semantic analysis used in subsequent workshops. Part II introduces spaCy, the second important NLP Python implementation tool, used not only for teaching and learning (like NLTK) but also widely for NLP applications including text summarization, information extraction, and Q&A chatbots. It provides the critical mass to integrate with Transformer Technology in subsequent workshops.
• Chapter 12: Workshop#3 Part-of-Speech Tagging with Natural Language Toolkit
(Hour 5–6)
In Chap. 3, we studied basic concepts and theories related to Part-of-Speech (POS) tagging and various POS tagging techniques. This workshop explores how to implement POS tagging using NLTK. It starts with a simple recap of tokenization techniques and two fundamental processes in word-level processing: stemming and stop-word removal. It then introduces two types of stemming techniques, the Porter Stemmer and the Snowball Stemmer, which can be integrated with WordCloud, commonly used in data visualization. This is followed by the main theme of this workshop: an introduction to the Penn Treebank tagset and creating your own POS tagger.
• Chapter 13: Workshop#4 Semantic Analysis and Word Vectors using spaCy
(Hour 7–8)
In Chaps. 5 and 6, we studied the basic concepts and theories related to meaning representation and semantic analysis. This workshop explores how to use spaCy to perform semantic analysis, starting with a revisit of the word vector concept and how to implement and pre-train word vectors, followed by the study of similarity methods and other advanced semantic analysis.
• Chapter 14: Workshop#5 Sentiment Analysis and Text Classification (Hour 9–10)
As a coherent continuation of Chap. 7, this workshop explores how to apply NLP implementation techniques to two important NLP applications: text classification and sentiment analysis. TensorFlow and Keras are two vital components used to implement Long Short-Term Memory networks (LSTM networks), a commonly used type of Recurrent Neural Network (RNN) in machine learning, especially in NLP applications.
• Chapter 15: Workshop#6 Transformers with spaCy and TensorFlow (Hour 11–12)
Chap. 8 introduced the basic concepts of Transfer Learning, its motivation, and related background knowledge such as Recurrent Neural Networks (RNN), Transformer Technology, and the BERT model. This workshop explores how to put these concepts and theories into practice and, more importantly, how to implement Transformers and BERT Technology by integrating spaCy's Transformer Pipeline Technology with TensorFlow. First, it gives an overview and summary of Transformer and BERT Technology. Second, it explores Transformer implementation with TensorFlow by revisiting Text Classification using the BERT model as an example. Third, it introduces spaCy's Transformer Pipeline Technology and how to implement a Sentiment Analysis and Text Classification system using Transformer Technology.
This book is both an NLP textbook and NLP Python implementation book tai-
lored for:
• Undergraduates and postgraduates of various disciplines including AI, Computer
Science, IT, Data Science, etc.
• Lecturers and tutors teaching NLP or related AI courses.
• NLP and AI scientists and developers who would like to learn basic NLP concepts and practice and implement them via Python workshops.
• Readers who would like to learn NLP concepts, practice Python-based NLP
workshops using various NLP implementation tools such as NLTK, spaCy,
TensorFlow Keras, BERT, and Transformer technology.
This book can serve as a textbook for undergraduate and postgraduate courses on Natural Language Processing, and as a reference book for general readers who would like to learn key technologies and implement NLP applications with contemporary implementation tools such as NLTK, spaCy, TensorFlow, BERT, and Transformer technology.
Part I (Chaps. 1–9) covers the main course materials of basic concepts and key technologies, which include the N-gram Language Model, Part-of-Speech Tagging, Syntax and Parsing, Meaning Representation, Semantic Analysis, Pragmatic Analysis, Transfer Learning and Transformer Technology, and major NLP applications.
Contents
16 Workshop#7 Building Chatbot with TensorFlow and Transformer Technology (Hour 13–14)
   16.1 Introduction
   16.2 Technical Requirements
   16.3 AI Chatbot in a Nutshell
      16.3.1 What Is a Chatbot?
      16.3.2 What Is a Wake Word in Chatbot?
      16.3.3 NLP Components in a Chatbot
   16.4 Building Movie Chatbot by Using TensorFlow and Transformer Technology
      16.4.1 The Chatbot Dataset
      16.4.2 Movie Dialog Preprocessing
      16.4.3 Tokenization of Movie Conversation
      16.4.4 Filtering and Padding Process
      16.4.5 Creation of TensorFlow Movie Dataset Object (mDS)
      16.4.6 Calculate Attention Learning Weights
      16.4.7 Multi-Head-Attention (MHAttention)
      16.4.8 System Implementation
   16.5 Related Works
References
Index
About the Author
Raymond Lee is the founder of the Quantum Finance Forecast System (QFFC)
(https://qffc.uic.edu.cn) and currently an Associate Professor at United International
College (UIC) with 25+ years’ experience in AI research and consultancy, Chaotic
Neural Networks, NLP, Intelligent Fintech Systems, Quantum Finance, and
Intelligent E-Commerce Systems. He has over 100 publications and has
authored 8 textbooks in the fields of AI, chaotic neural networks, AI-based fintech
systems, intelligent agent technology, chaotic cryptosystems, ontological agents,
neural oscillators, biometrics, and weather simulation and forecasting systems.
Upon completion of the QFFC project, in 2018 he joined United International
College (UIC), China, to pursue further R&D work on AI-Fintech and to share his
expertise in AI-Fintech, chaotic neural networks, and related intelligent systems
with fellow students and the community. His three latest textbooks, Quantum
Finance: Intelligent Forecast and Trading Systems (2019), Artificial Intelligence in
Daily Life (2020), and this NLP book have been adopted as the main textbooks for
various AI courses in UIC.
Abbreviations
AI Artificial intelligence
ASR Automatic speech recognition
BERT Bidirectional encoder representations from transformers
BRNN Bidirectional recurrent neural networks
CDD Conceptual dependency diagram
CFG Context-free grammar
CFL Context-free language
CNN Convolutional neural networks
CR Coreference resolution
DNN Deep neural networks
DT Determiner
FOPC First-order predicate calculus
GRU Gated recurrent unit
HMM Hidden Markov model
IE Information extraction
IR Information retrieval
KAI Knowledge acquisition and inferencing
LSTM Long short-term memory
MEMM Maximum entropy Markov model
MeSH Medical subject headings
ML Machine learning
NER Named entity recognition
NLP Natural language processing
NLTK Natural language toolkit
NLU Natural language understanding
NN Noun
NNP Proper noun
Nom Nominal
NP Noun phrase
PCFG Probabilistic context-free grammar
PMI Pointwise mutual information
POS Part-of-speech
POST Part-of-speech tagging
PPMI Positive pointwise mutual information
Q&A Question-and-answering
RNN Recurrent neural networks
TBL Transformation-based learning
VB Verb
VP Verb phrase
WSD Word sense disambiguation
Part I
Concepts and Technology
Chapter 1
Natural Language Processing
Consider this scenario: Late in the evening, Jack starts a mobile app and talks with
AI Tutor Max.
1.1 Introduction
Nowadays, many chatbots allow humans to communicate with a device in natural language. Figure 1.1 illustrates a dialogue between a student, who has returned to the dormitory after a full day of classes, and a mobile application called AI Tutor 2.0 (Cui et al. 2020), from our latest research on AI tutor chatbots. The objective is to enable the user (Jack) not only to learn from book reading but also to communicate candidly with AI Tutor 2.0 (Max), which provides knowledge responses in natural language. It differs from chatbots that respond to basic commands; it is a human-computer interaction demonstrating how a user wishes to communicate, much as a student converses with a tutor about subject knowledge in the physical world. It is a dynamic process consisting of (1) world knowledge for simple handshaking dialogue such as greetings and general discussion, which is not an easy task as it involves knowledge and common sense to construct a functional chatbot with daily dialogues, and (2) technical knowledge of a particular domain, i.e. a domain expert, which requires the system to first learn from the author's book AI in Daily Life (Lee 2020), covering all basic knowledge on the subject, to form a knowledge tree or ontology graph. Such a system can serve as a new type of publication and as an interactive device between human and computer for learning new knowledge.
Natural language processing (NLP) is related to several disciplines, including human linguistics, computational linguistics, statistical engineering, AI and machine learning, data mining, and human voice processing, recognition, and synthesis. Many ingenious chatbots initiated by NLP and AI scientists have become commercial products in the past decades.
This chapter will introduce this prime technology and its components, followed by pertinent technologies in subsequent chapters.
1.2 Human Language and Intelligence

There is an old saying: The way you behave says more about who you are. Because we never know what people think, the only method is to evaluate and judge their behaviors.
NLP core technologies and methodologies arose from the famous Turing Test (Eisenstein 2019; Bender 2013; Turing 1936, 1950), proposed in 1950 by Alan Turing (1912–1954), the father of AI. Figure 1.2 shows a human judge conversing with two individuals in two rooms: one is a human, the other is either a robot, a chatbot, or an NLP application. During a 20-minute conversation, the judge can ask the human/machine technical and non-technical questions and require a response to every question, so that the judge can decide whether the respondent is a human or a machine. The role of NLP in the Turing Test is to recognize and understand the questions, and to respond in human language. The test remains a popular topic in AI today because we cannot see and judge people's thinking to define intelligence. It is the ultimate challenge of AI.
Human language is a significant component of human behavior and civilization. It can generally be categorized into (1) written and (2) oral aspects. Written language processes, stores, and passes human/natural language knowledge on to the next generations. Oral or spoken language acts as a communication medium among individuals.
NLP touches on basic questions in philosophy, such as meaning and knowledge; psychology, in the meanings of words; linguistics, in the formation of phrases and sentences; and computational linguistics, in language models. Hence, NLP is a cross-disciplinary integration of these disciplines: philosophy in human language ontology models, psychology in the behavior between natural and human language, linguistics in mathematical and language models, and computational linguistics in agent and ontology tree technology, as shown in Fig. 1.3.
Pragmatic ambiguity arises when a statement is not clearly defined and the context of a sentence allows multiple interpretations, such as I like that too. It can mean that the speaker likes that too or that another person likes that too, and what that refers to remains uncertain.
NLP analyzes sentence ambiguity incessantly. If ambiguities can be identified earlier, it is easier to define proper meanings.
There have been several major transformation stages in NLP history (Santilal 2020).
As AI grew popular over time, major NLP development focused on how it could be used in different areas, such as knowledge engineering (so-called agent ontology) to shape meaning representations. The BASEBALL system (Green et al. 1961) was a typical example of a Q&A-based domain expert system for human-computer interaction developed in the 1960s, but its inputs were restrictive and its language processing techniques remained basic.
In 1968, Prof. Marvin Minsky (1927–2016) developed a more powerful NLP system. This advanced system used an AI-based question-answering inference engine between humans and computers to provide knowledge-based interpretations of questions and answers. Further, Prof. William A. Woods proposed the augmented transition network (ATN) to represent natural language input in 1970. During this period, many programmers started to write code in different AI languages to conceptualize natural language ontology knowledge, turning real-world structural information into forms humans can understand. Yet these expert systems were unable to meet expectations, which signified the second winter of AI.
R&D on NLP statistical techniques and rule-based systems evolved with cloud computing, mobile computing, and big data into deep network analysis, e.g. recurrent neural networks using LSTM and related networks. In the 2010s, Google, Amazon, and Facebook contributed to the development of agent technologies and deep neural networks to devise products such as self-driving vehicles and Q&A chatbots, and related development is still under way.
1.6 NLP and AI
[Figure: NLU pipeline. Spoken language passes through speech recognition (supported by a lexicon), syntax analysis (supported by a grammar), semantic analysis (supported by semantic rules), and pragmatic analysis (supported by contextual information) to produce the target meaning representation.]
1.8.1 Speech Recognition
Speech recognition (Li et al. 2015) is the first stage in NLU; it performs phonetic, phonological, and morphological processing to analyze spoken language. The task involves breaking down spoken input, called utterances, into distinct tokens representing paragraphs, sentences, and words. Current speech recognition models apply spectrogram analysis to extract distinct frequencies, e.g. the word uncanny can be split into the two tokens un and canny. Different languages require different spectrogram analyses.
1.8.2 Syntax Analysis
Syntax analysis (Sportiche et al. 2013) is the second stage of NLU, directly following speech recognition, and analyzes the structural meaning of spoken sentences. This task has two purposes: (1) check the syntactic correctness of the sentence/utterance, and (2) break spoken sentences down into syntactic structures that reflect the syntactic relationships between words. For instance, the utterance oranges to the boys will be rejected by a syntax parser because of syntactic errors.
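To make this concrete, the following is a minimal sketch of CFG-based syntax checking with NLTK. The toy grammar below is an illustrative assumption, not the book's workshop grammar: a well-formed utterance yields parse trees, while the ill-formed oranges to the boys yields none.

```python
import nltk

# Toy grammar for illustration only (an assumption, not the book's grammar).
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP PP | V NP
PP -> P NP
Det -> 'the'
N -> 'boy' | 'boys' | 'oranges'
V -> 'gave'
P -> 'to'
""")

parser = nltk.ChartParser(grammar)

def parse_utterance(utterance):
    """Return all parse trees; an empty list means the parser rejects the input."""
    return list(parser.parse(utterance.split()))

print(len(parse_utterance("the boy gave the oranges to the boys")))  # 1 tree: accepted
print(len(parse_utterance("oranges to the boys")))                   # 0 trees: rejected
```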
1.8.3 Semantic Analysis
Semantic analysis (Goddard 1998) is the third stage in NLU, following syntax analysis. This task extracts the precise, dictionary-defined meaning of a sentence/utterance and rejects meaningless combinations, e.g. a semantic analyzer rejects a word phrase like hot snowflakes: the syntax is correct, but the semantic meaning is not.
1.8.4 Pragmatic Analysis
Pragmatic analysis (Ibileye 2018) is the fourth stage in NLU and a challenging part of spoken language analysis, involving high-level or expert knowledge with common sense, e.g. will you crack open the door? I'm getting hot. Understanding this utterance requires extra knowledge in the second clause: crack means to break in its semantic meaning, but it should be interpreted as to open in its pragmatic meaning.
After years of research and development, from machine translation and rule-based systems to data mining and deep networks, NLP technology has a wide range of applications in everyday activities such as machine translation, information retrieval, sentiment analysis, information extraction, and question-answering chatbots, as shown in Fig. 1.8.
Machine translation (Scott 2018) is the earliest NLP application, dating back to the 1950s. Although it is not difficult to translate one language into another, there are two major challenges: (1) naturalness (or fluency), as different languages have different styles and usages, and (2) adequacy (or accuracy), as different languages may present the same ideas in independent ways. Experienced human translators address this trade-off in creative ways. Past systems used statistical methods or case-by-case rule-based systems, but since there are many ambiguity scenarios in language translation, machine translation R&D nowadays strives to apply AI techniques such as recurrent networks or deep-network black-box systems to enhance machine learning capabilities.
1.9.4 Sentiment Analysis
Sentiment analysis (Liu 2012) is a kind of data mining in NLP that analyzes user sentiment towards products, people, and ideas on social media, forums, and online platforms. It is an important application for extracting data from messages, comments, and conversations published on these platforms, and for assigning a labeled sentiment classification, as in Fig. 1.9, to understand natural language and utterances. Deep networks are a way to analyze such large amounts of data. Part II: NLP Implementation Workshops will explore how to implement sentiment analysis in detail using Python, spaCy, and Transformer technology.
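As a preview of those workshops, here is a minimal sketch using the Hugging Face transformers pipeline, one possible Transformer-based toolchain (the choice of library and its default pretrained model are assumptions here, not the book's exact setup):

```python
from transformers import pipeline

# Loads a default pretrained sentiment model on first use; any fine-tuned
# sentiment model can be substituted via the `model` argument.
classifier = pipeline("sentiment-analysis")

comments = [
    "The support team resolved my issue quickly.",
    "The app keeps crashing and nobody replies.",
]
for comment in comments:
    result = classifier(comment)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.99}
    print(f"{result['label']:8s} {result['score']:.2f}  {comment}")
```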
Q&A systems are a key objective of NLP (Raj 2018). A process flow is necessary to implement a Q&A chatbot. It includes voice recognition to convert speech into a list of tokens in sentences/utterances, syntactic grammatical analysis, semantic meaning analysis of whole sentences, and pragmatic analysis for embedded or complex meanings. Once the meaning of the enquirer's utterance is generated, the most appropriate answer or response must be searched from the knowledge base through inferencing, either by a rule-based system, a statistical system, or a deep network, e.g. the Google BERT system. Once a response is available, the reverse process, called voice synthesis, is required to generate a natural voice from the verbal response. Hence, Q&A systems in NLP are an important technology that can apply to daily activities such as human-computer interaction in self-driving vehicles, customer service support, and language skills improvement.
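The knowledge-base search step of this flow can be sketched in plain Python. The tiny knowledge base and keyword-overlap scoring below are invented for illustration, standing in for the rule-based, statistical, or deep-network inferencing described above:

```python
# Hypothetical mini knowledge base (invented for illustration).
KNOWLEDGE_BASE = {
    "what is nlp": "NLP studies how computers process human language.",
    "what is a chatbot": "A chatbot is a program that converses in natural language.",
    "what is voice synthesis": "Voice synthesis generates natural speech from text.",
}

def tokenize(text):
    """Crude stand-in for the recognition/analysis stages: lowercase word tokens."""
    return set(text.lower().replace("?", "").split())

def answer(utterance):
    """Retrieve the stored answer whose question overlaps most with the utterance."""
    scores = {q: len(tokenize(q) & tokenize(utterance)) for q in KNOWLEDGE_BASE}
    best = max(scores, key=scores.get)
    return KNOWLEDGE_BASE[best] if scores[best] > 0 else "Sorry, I don't know."

print(answer("Can you tell me what NLP is?"))
# -> "NLP studies how computers process human language."
```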
The final workshop will discuss how to integrate various Python NLP implementation tools, including NLTK, spaCy, TensorFlow Keras, and Transformer Technology, to implement a Q&A movie chatbot system.
References
Cui, Y., Huang, C., Lee, Raymond (2020). AI Tutor: A Computer Science Domain Knowledge
Graph-Based QA System on JADE platform. World Academy of Science, Engineering and
Technology, Open Science Index 168, International Journal of Industrial and Manufacturing
Engineering, 14(12), 543 - 553.
Eisenstein, J. (2019) Introduction to Natural Language Processing (Adaptive Computation and
Machine Learning series). The MIT Press.
Goddard, C. (1998) Semantic Analysis: A Practical Introduction (Oxford Textbooks in Linguistics).
Oxford University Press.
Green, B., Wolf, A., Chomsky, C. and Laughery, K. (1961). BASEBALL: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference (IRE-AIEE-ACM '61 (Western)). Association for Computing Machinery, New York, NY, USA, 219–224.
Hausser, R. (2014) Foundations of Computational Linguistics: Human-Computer Communication
in Natural Language (3rd edition). Springer.
Hemdev, P. (2011) Information Extraction: A Smart Calendar Application: Using NLP,
Computational Linguistics, Machine Learning and Information Retrieval Techniques. VDM
Verlag Dr. Müller.
Ibileye, G. (2018) Discourse Analysis and Pragmatics: Issues in Theory and Practice.
Malthouse Press.
Lee, R. S. T. (2020). AI in Daily Life. Springer.
Li, J. et al. (2015) Robust Automatic Speech Recognition: A Bridge to Practical Applications.
Academic Press.
Liu, B. (2012) Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
Peters, C. et al. (2012) Multilingual Information Retrieval: From Research To Practice. Springer.
Raj, S. (2018) Building Chatbots with Python: Using Natural Language Processing and Machine
Learning. Apress.
Santilal, U. (2020) Natural Language Processing: NLP & its History (Kindle edition). Amazon.com.
Scott, B. (2018) Translation, Brains and the Computer: A Neurolinguistic Solution to Ambiguity
and Complexity in Machine Translation (Machine Translation: Technologies and Applications
Book 2). Springer.
Sportiche, D. et al. (2013) An Introduction to Syntactic Analysis and Theory. Wiley-Blackwell.
Tuchong (2020a) The Turing Test. https://stock.tuchong.com/image/detail?imageId=921224657742331926. Accessed 14 May 2022.
Tuchong (2020b) NLP and AI. https://stock.tuchong.com/image/detail?imageId=1069700818174345308. Accessed 14 May 2022.
Turing, A. (1936) On computable numbers, with an application to the Entscheidungsproblem. In: Proc. London Mathematical Society, Series 2, 42:230–265.
Turing, A. (1950) Computing Machinery and Intelligence. Mind, LIX (236): 433–460.
Chapter 2
N-Gram Language Model
2.1 Introduction
A text with spelling and grammatical errors highlighted in yellow and blue is shown in Fig. 2.2. This method calculates word occurrence frequency probabilities to propose substitutions of higher frequency probability, but it cannot always present accurate options.
Figure 2.3 illustrates a simple scenario of next-word prediction with the sample utterances I like photography, I like science, and I love mathematics. The probability of like following I is 0.67 (2/3), compared with love at 0.33 (1/3); the probabilities of photography and science following like are each 0.5 (1/2). Assigning probabilities to whole utterances, I like photography and I like science are both 0.67 × 0.5 = 0.335, and I love mathematics is 0.33 × 1 = 0.33.
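These figures can be reproduced with a few lines of plain Python; a minimal sketch, where the corpus is just the three sample utterances above:

```python
from collections import Counter

corpus = ["I like photography", "I like science", "I love mathematics"]
sentences = [utterance.split() for utterance in corpus]

# Count single words and adjacent word pairs across the corpus.
unigram_counts = Counter(word for sent in sentences for word in sent)
bigram_counts = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))

def prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(round(prob("like", "I"), 2))                       # 0.67
print(round(prob("photography", "like"), 2))             # 0.5
print(prob("like", "I") * prob("photography", "like"))   # ~0.33 (0.67 x 0.5)
```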
When applying probability to language models, one must always note: (1) domain-specific variety, as the togetherness of keywords and terminology knowledge varies across domains, e.g. medical science, AI, etc.; (2) syntactic knowledge, which attributes to syntax and lexical knowledge; (3) common sense or world knowledge, which attributes to the collection of habitual behaviors from past experiences; and (4) the significance of language usage in high-level NLP.
When applying probability to word prediction in an utterance, candidate words are often proposed by rank and frequency to provide a sequential optimum estimation. For example:
[2.1] I notice three children standing on the ??? (ground, bench …)
[2.2] I just bought some oranges from the ??? (supermarket, shop …)
[2.3] She stopped the car and then opened the ??? (door, window, …)
The structure of [2.3] is more perplexing: a word-counting method over a sizeable knowledge domain is adequate for the first two examples, but here common sense, world knowledge, or specific domain knowledge are among the required sources. It involves scenario knowledge at a higher level, such as descriptive knowledge of the scene, to help the guesswork. Although studying preceding words and tracking word sequences is plain and mundane, it is one of the most useful techniques for word prediction. Let us begin with a simple word-counting method in NLP: the N-gram language model.
2.2 N-Gram Language Model

As noted, the motivations for word prediction apply to voice recognition, text generation, and Q&A chatbots. The N-gram language model, also called the N-gram model or simply N-gram (Sidorov 2019; Liu et al. 2020), is a fundamental method to formalize word prediction using probability calculation. An N-gram is a statistical model that consists of a word sequence of N words. Commonly used N-grams include:
• Unigram refers to a single word, i.e. N = 1. It is seldom used alone in practice because it contains only one word. However, it is important as the base for normalizing higher-order N-gram probabilities.
• Bigram refers to a sequence of two words, i.e. N = 2. For example: I have, I do, he thinks, she knows. It is used in many applications because its occurrence frequency is high and it is easy to count.
• Trigram refers to a sequence of three words, i.e. N = 3. For example: I noticed that, noticed three children, children standing on, standing on the. It is useful because it carries more meaning without being lengthy; given count knowledge of the preceding words, one can more easily guess the next word in a sequence. However, its occurrence frequency is low in a moderate corpus.
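A minimal, library-free sketch of extracting these three orders of N-grams from a token list (NLTK offers the equivalent nltk.ngrams helper); the sample sentence echoes example [2.1]:

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return list(zip(*[tokens[i:] for i in range(n)]))

tokens = "I noticed three children standing on the ground".split()
print(ngrams(tokens, 1))  # unigrams: ('I',), ('noticed',), ...
print(ngrams(tokens, 2))  # bigrams: ('I', 'noticed'), ('noticed', 'three'), ...
print(ngrams(tokens, 3))  # trigrams: ('I', 'noticed', 'three'), ...
```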
Here is a list of common terminologies in NLP (Jurafsky et al. 1999; Eisenstein 2019):
• Sentence is a unit of written language. It is a basic entity in a conversation or
utterance.
Fig. 2.4 Computerized axial tomography scanner (aka. CAT scan) (Tuchong 2022)
The conditional probability of event A given event B is defined as:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{2.1}$$

Rearranging gives the multiplication rule:

$$P(A \cap B) = P(A \mid B)\, P(B) \tag{2.3}$$
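As a quick worked illustration of Eq. (2.1) with assumed counts (not from the text): if B occurs in 40 of 100 utterances and A and B co-occur in 10 of them, then:

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{10/100}{40/100} = 0.25
```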
For a sequence of events A, B, C, and D, the Chain Rule formulation becomes:

$$P(A \cap B \cap C \cap D) = P(A)\, P(B \mid A)\, P(C \mid A \cap B)\, P(D \mid A \cap B \cap C)$$

In general:

$$P(X_1 \cap X_2 \cap \cdots \cap X_n) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_n \mid X_1 \cap \cdots \cap X_{n-1})$$

If the word sequence from position 1 to n is written as $w_1^n$, the Chain Rule applied to the word sequence becomes:

$$P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$$
Note: Normally, <s> and </s> are used to denote the start and end of sentence/
utterance for better formulation.
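As a worked illustration, applying the Chain Rule to the earlier utterance I like photography, padded with the sentence markers just described:

```latex
P(\langle s\rangle\, I\ like\ photography\, \langle/s\rangle)
  = P(I \mid \langle s\rangle)\;
    P(like \mid \langle s\rangle\, I)\;
    P(photography \mid \langle s\rangle\, I\ like)\;
    P(\langle/s\rangle \mid \langle s\rangle\, I\ like\ photography)
```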
This method seems fair and easy to understand, but it poses two major problems. First, it is unlikely that we can gather the right statistics for long prefixes, since we may never observe the exact beginning of a given sentence in the training data. Second, the calculation of the word sequence probability is mundane: for a long sentence, the conditional probability at the end of this equation is complex to calculate.
Let us explore how the ingenious Markov Chain is applied to solve this problem.
ebookbell.com