Raymond S. T. Lee

Natural Language Processing
A Textbook with Python Implementation

United International College
Beijing Normal University-Hong Kong Baptist University
Zhuhai, China

ISBN 978-981-99-1998-7    ISBN 978-981-99-1999-4 (eBook)
https://doi.org/10.1007/978-981-99-1999-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
This book is dedicated to all readers and
students taking my undergraduate and
postgraduate courses in Natural Language
Processing; your enthusiasm for seeking
knowledge inspired me to write this book.
Preface

Motivation of This Book

Natural Language Processing (NLP) and its related applications have become part of daily life with the exponential growth of Artificial Intelligence (AI) in the past decades. NLP applications, including Information Retrieval (IR) systems, Text Summarization systems, and Question-and-Answering (chatbot) systems, have become prevalent topics in both industry and academia, reshaping daily routines and benefiting a wide array of day-to-day services.
The objective of this book is to provide readers with NLP concepts and knowledge, together with seven step-by-step workshops (14 hours in total) to practice core Python-based NLP tools (NLTK, spaCy, TensorFlow Keras, Transformer, and BERT technology) to construct NLP applications.

Organization and Structure of This Book

This book consists of two parts:


Part I: Concepts and Technology (Chaps. 1–9)
Discusses concepts and technology related to NLP, including: Introduction, N-gram Language Model, Part-of-Speech Tagging, Syntax and Parsing, Meaning Representation, Semantic Analysis, Pragmatic Analysis, Transfer Learning and Transformer Technology, and Major NLP Applications.
Part II: Natural Language Processing Workshops with Python Implementation (Chaps. 10–16)
Seven Python workshops provide step-by-step implementation practice with tools including NLTK, spaCy, TensorFlow Keras, Transformer, and BERT technology.
This book is organized and structured as follows:
Part I: Concepts and Technology
• Chapter 1: Introduction to Natural Language Processing

This introductory chapter begins with human language and intelligence, which constitute the six levels of linguistics, followed by a brief history of NLP with its major components and applications. It serves as the cornerstone of the NLP concepts and technology discussed in the following chapters. This chapter also serves as the conceptual basis for Workshop#1: Basics of Natural Language Toolkit (NLTK) in Chap. 10.
• Chapter 2: N-gram Language Model
The language model is the foundation of NLP. This chapter introduces the N-gram language model and Markov Chains, using the classical literature The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (1859–1930) to illustrate how the N-gram model works as a basis for text analysis in NLP, followed by Shannon's method, text generation, and evaluation schemes (a minimal formulation of the model is sketched at the end of this chapter list). This chapter also serves as the conceptual basis for Workshop#2 on N-gram modelling with NLTK in Chap. 11.
• Chapter 3: Part-of-Speech Tagging
Part-of-Speech (POS) tagging is the foundation of text processing in NLP. This chapter describes how it relates to NLP and Natural Language Understanding (NLU), and covers the main types and algorithms of POS tagging, including rule-based, stochastic, and hybrid POS tagging with the Brill tagger, together with evaluation schemes. This chapter also serves as the conceptual basis for Workshop#3: Part-of-Speech Tagging using Natural Language Toolkit in Chap. 12.
• Chapter 4: Syntax and Parsing
As another major component of Natural Language Understanding (NLU), this chapter explores syntax analysis and introduces the different types of constituents in the English language, followed by the main concepts of context-free grammar (CFG) and CFG parsing. It also studies the major parsing techniques, including lexical and probabilistic parsing, with live examples for illustration.
• Chapter 5: Meaning Representation
Before the study of semantic analysis, this chapter explores meaning representation, a vital component in NLP. It studies four major meaning representation techniques: first-order predicate calculus (FOPC), semantic networks, conceptual dependency diagrams (CDD), and frame-based representation. It then explores canonical forms and introduces Fillmore's theory of universal cases, followed by predicate logic and inference using FOPC, with live examples.
• Chapter 6: Semantic Analysis
This chapter studies semantic analysis, one of the core concepts for learning NLP. First, it covers the two basic schemes of semantic analysis: lexical and compositional semantic analysis. It then explores word senses and six commonly used lexical semantic relations, followed by word sense disambiguation (WSD) and various WSD schemes. Further, it studies WordNet and online thesauri for word similarity as well as various distributed similarity measurements, including the Point-wise Mutual Information (PMI) and Positive Point-wise Mutual Information (PPMI) models, with live examples for illustration. Chapters 5 and 6 also serve as the conceptual basis for Workshop#4: Semantic Analysis and Word Vectors using spaCy in Chap. 13.
• Chapter 7: Pragmatic Analysis
After the discussion of semantic meaning and analysis, this chapter explores pragmatic analysis in linguistics and discourse phenomena. It studies coherence and coreference as the key components of pragmatics and discourse critical to NLP, followed by discourse segmentation and different algorithms for coreference resolution, including the Hobbs algorithm, the Centering algorithm, the log-linear model, the latest machine learning methods, and evaluation schemes. This chapter also serves as the conceptual basis for Workshop#5: Sentiment Analysis and Text Classification in Chap. 14.
• Chapter 8: Transfer Learning and Transformer Technology
Transfer learning is a commonly used deep learning approach that minimizes computational resources. This chapter explores: (1) Transfer Learning (TL) versus traditional Machine Learning (ML); (2) Recurrent Neural Networks (RNNs), a significant component of transfer learning, with core technologies such as the Long Short-Term Memory (LSTM) network and Bidirectional Recurrent Neural Networks (BRNNs) in NLP applications; and (3) the Transformer architecture, the Bidirectional Encoder Representations from Transformers (BERT) model, and related technologies including Transformer-XL and ALBERT. This chapter also serves as the conceptual basis for Workshop#6: Transformers with spaCy and TensorFlow in Chap. 15.
• Chapter 9: Major Natural Language Processing Applications
This chapter summarizes Part I with three core NLP applications: Information Retrieval (IR) systems, Text Summarization (TS) systems, and Question-and-Answering (Q&A) chatbot systems, explaining how they work and the related R&D in building NLP applications. This chapter also serves as the conceptual basis for Workshop#7: Building Chatbot with TensorFlow and Transformer Technology in Chap. 16.
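
As a small taste of the formalism developed in Part I (Chap. 2 in particular), the N-gram model rests on the chain rule of probability together with a Markov approximation. A minimal sketch in standard notation (not necessarily the exact notation used in the book) is:

$$P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}),$$

where each bigram probability is estimated from corpus counts, e.g. $P(w_i \mid w_{i-1}) = C(w_{i-1} w_i)/C(w_{i-1})$, with smoothing techniques such as Laplace (add-one) smoothing used to handle zero counts (Chap. 2).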

Part II: Natural Language Processing Workshops with Python Implementation in 14 Hours
• Chapter 10: Workshop#1 Basics of Natural Language Toolkit (Hour 1–2)
With the basic NLP concepts learnt in Chap. 1, this introductory workshop gives an overview of NLTK and its installation procedures as the foundation of the Python NLP development tools used for text processing, which include simple text analysis, text analysis with lexical dispersion plots, text tokenization, and basic statistical tools in NLP (a brief taste of this style of code is sketched after this workshop list).
• Chapter 11: Workshop#2 N-grams Modelling with Natural Language Toolkit
(Hour 3–4)
This is the companion workshop to Chap. 2, using NLTK technology for N-gram generation and statistics. The workshop consists of two parts. Part I introduces the N-gram language model using NLTK in Python and an N-grams class to generate N-gram statistics on any sentence, text object, or whole document, providing a foundation technique for text analysis, parsing, and semantic analysis in subsequent workshops. Part II introduces spaCy, the second important NLP Python implementation tool, used not only for teaching and learning (like NLTK) but also widely in NLP applications including text summarization, information extraction, and Q&A chatbots; it is critical for integration with Transformer technology in subsequent workshops.
• Chapter 12: Workshop#3 Part-of-Speech Tagging with Natural Language Toolkit
(Hour 5–6)
In Chap. 3, we studied the basic concepts and theories related to Part-of-Speech (POS) and various POS tagging techniques. This workshop explores how to implement POS tagging using NLTK, starting from a brief recap of tokenization techniques and two fundamental word-level processes, stemming and stop-word removal. It introduces two stemming techniques, the Porter Stemmer and the Snowball Stemmer, which can be integrated with WordCloud for data visualization, followed by the main theme of this workshop: the PENN Treebank tagset and creating your own POS tagger.
• Chapter 13: Workshop#4 Semantic Analysis and Word Vectors using spaCy
(Hour 7–8)
In Chaps. 5 and 6, we studied the basic concepts and theories related to meaning representation and semantic analysis. This workshop explores how to use spaCy technology to perform semantic analysis, starting with a revisit of the word vector concept and how to implement and pre-train word vectors, followed by the study of similarity methods and other advanced semantic analysis techniques.
• Chapter 14: Workshop#5 Sentiment Analysis and Text Classification (Hour 9–10)
As the companion workshop to Chap. 7, this workshop explores how to apply NLP implementation techniques to two important NLP applications: text classification and sentiment analysis. TensorFlow and Keras are two vital components used to implement Long Short-Term Memory (LSTM) networks, a commonly used type of Recurrent Neural Network (RNN) in machine learning and especially in NLP applications.
• Chapter 15: Workshop#6 Transformers with spaCy and TensorFlow (Hour 11–12)
Chap. 8 introduced the basic concepts of Transfer Learning, its motivation, and related background knowledge such as Recurrent Neural Networks (RNNs), Transformer technology, and the BERT model. This workshop explores how to put these concepts and theories into practice and, more importantly, how to implement Transformers and BERT technology with the integration of spaCy's Transformer pipeline technology and TensorFlow. First, it gives an overview and summary of Transformer and BERT technology. Second, it explores Transformer implementation with TensorFlow by revisiting text classification using the BERT model as an example. Third, it introduces spaCy's Transformer pipeline technology and how to implement a sentiment analysis and text classification system using Transformer technology.
• Chapter 16: Workshop#7 Building Chatbot with TensorFlow and Transformer Technology (Hour 13–14)
In the previous six NLP workshops, we studied NLP implementation tools and techniques ranging from tokenization and N-gram generation to semantic and sentiment analysis, with various key NLP Python enabling technologies: NLTK, spaCy, TensorFlow, and contemporary Transformer technology. This final workshop explores how to integrate them in the design and implementation of a live domain-based chatbot system in a movie domain. First, it explores the basics of chatbot systems and introduces a knowledge domain, the Cornell Large Movie Conversation Dataset. Second, it conducts a step-by-step implementation of the movie chatbot system, which involves dialog preprocessing, model construction, attention learning implementation, system integration, and performance evaluation, followed by live tests. Finally, it introduces a mini project for this workshop and presents related chatbot datasets and resources in summary.
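
To give a brief taste of the style of code developed across these workshops, the following minimal sketch (not taken from the book's own workshop files; the resource and model names punkt, averaged_perceptron_tagger, and en_core_web_md are standard package identifiers assumed to be installed) tokenizes and POS-tags a sentence with NLTK and compares two short texts with spaCy's similarity method:

```python
import nltk
import spacy

# NLTK resources for tokenization and POS tagging (downloaded once).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Sherlock Holmes plays the violin."
tokens = nltk.word_tokenize(sentence)   # ['Sherlock', 'Holmes', 'plays', 'the', 'violin', '.']
print(nltk.pos_tag(tokens))             # e.g. [('Sherlock', 'NNP'), ('Holmes', 'NNP'), ...]

# spaCy medium English model (includes word vectors) for semantic similarity.
nlp = spacy.load("en_core_web_md")
doc1 = nlp("I enjoyed this movie very much.")
doc2 = nlp("The film was a pleasure to watch.")
print(doc1.similarity(doc2))            # cosine-style similarity of averaged word vectors
```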

Readers of This Book

This book is both an NLP textbook and an NLP Python implementation book tailored for:
• Undergraduates and postgraduates of various disciplines including AI, Computer Science, IT, Data Science, etc.
• Lecturers and tutors teaching NLP or related AI courses.
• NLP and AI scientists and developers who would like to learn basic NLP concepts and practice and implement them via Python workshops.
• Readers who would like to learn NLP concepts and practice Python-based NLP workshops using various NLP implementation tools such as NLTK, spaCy, TensorFlow Keras, BERT, and Transformer technology.

How to Use This Book?

This book can serve as a textbook for undergraduate and postgraduate courses on Natural Language Processing, and as a reference book for general readers who would like to learn key technologies and implement NLP applications with contemporary implementation tools such as NLTK, spaCy, TensorFlow, BERT, and Transformer technology.
Part I (Chaps. 1–9) covers the main course materials of basic concepts and key technologies, which include the N-gram Language Model, Part-of-Speech Tagging, Syntax and Parsing, Meaning Representation, Semantic Analysis, Pragmatic Analysis, Transfer Learning and Transformer Technology, and Major NLP Applications. Part II (Chaps. 10–16) provides materials for a 14-hour, step-by-step Python-based NLP implementation in 7 workshops.
For readers and AI scientists, this book can serve both as a reference for learning NLP and as a Python implementation toolbook for NLP applications using the latest Python-based NLP development tools, platforms, and libraries.
For the seven NLP workshops in Part II (Chaps. 10–16), readers can download all Jupyter Notebook files and data files from my NLP GitHub directory: https://github.com/raymondshtlee/nlp/ (a minimal environment check is sketched below). For any query, please feel free to contact me via email: [email protected].
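
Before starting the workshops, it may help to confirm that the main Python packages used throughout the book are installed. The following is a minimal environment-check sketch, assuming the standard package names nltk, spacy, tensorflow, and transformers (the exact versions used in the book are not listed here):

```python
# Minimal environment check for the workshop toolchain (package names assumed).
import importlib

for package in ("nltk", "spacy", "tensorflow", "transformers"):
    try:
        module = importlib.import_module(package)
        print(f"{package}: {getattr(module, '__version__', 'installed')}")
    except ImportError:
        print(f"{package}: NOT installed - install it before running the workshops")
```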

Zhuhai, China Raymond S. T. Lee


Acknowledgements

I would like to express my gratitude:


To my wife Iris, for her patience, encouragement, and understanding, especially during my time spent on research and writing over the past 30 years.
To Ms. Celine Cheng, executive editor of Springer Nature, and her professional editorial and book production team for their support, valuable comments, and advice.
To Prof. Tang Tao, President of UIC, for the provision of an excellent environment for research, teaching, and writing this book.
To Prof. Weijia Jia, Vice President (Research and Development) of UIC, for the support of R&D in NLP and related AI projects.
To Prof. Jianxin Pan, Dean of the Faculty of Science and Technology of UIC, and Prof. Weifeng Su, Head of the Department of Computer Science of UIC, for their continuous support of AI and NLP courses.
To research assistant Mr. Zihao Huang for his help in preparing the NLP workshops; to research student Ms. Clarissa Shi and student helpers Ms. Siqi Liu, Mr. Mingjie Wang, and Ms. Jie Lie for their help with the literature review on major NLP applications and Transformer technology; and to Mr. Zhuohui Chen for his help in fixing bugs and updating versions of the workshop programs.
To UIC for the prominent support, in part by the Guangdong Provincial Key Laboratory IRADS (2022B1212010006, R0400001-22), the Key Laboratory for Artificial Intelligence and Multi-Model Data Processing of the Department of Education of Guangdong Province, and the Guangdong Province F1 project grant on Curriculum Development and Teaching Enhancement (UICR0400050-21 CTL), for the provision of an excellent environment and computer facilities for the preparation of this book.
Dr. Raymond Lee
December 2022
Beijing Normal University-Hong Kong Baptist University United International College
Zhuhai, China

About the Book

This textbook presents an up-to-date and comprehensive overview of Natural Language Processing (NLP), from basic concepts to core algorithms and key applications. It contains 7 step-by-step workshops (14 hours in total) to practice essential Python tools such as NLTK, spaCy, TensorFlow Keras, Transformer, and BERT.
The objective of this book is to provide readers with fundamental knowledge and core technologies, and to enable them to build their own applications (e.g., chatbot systems) using Python-based NLP tools. It is both a textbook and a toolbook intended for undergraduate students from various disciplines who want to learn the subject, lecturers and tutors who want to teach courses or tutorials for undergraduate/graduate students on the subject and related AI topics, and readers with various backgrounds who want to learn and build practicable applications after completing the 14 hours of Python-based workshops.

Contents

Part I Concepts and Technology


1 Natural Language Processing   3
  1.1 Introduction   3
  1.2 Human Language and Intelligence   4
  1.3 Linguistic Levels of Human Language   6
  1.4 Human Language Ambiguity   7
  1.5 A Brief History of NLP   8
    1.5.1 First Stage: Machine Translation (Before 1960s)   8
    1.5.2 Second Stage: Early AI on NLP from 1960s to 1970s   8
    1.5.3 Third Stage: Grammatical Logic on NLP (1970s–1980s)   9
    1.5.4 Fourth Stage: AI and Machine Learning (1980s–2000s)   9
    1.5.5 Fifth Stage: AI, Big Data, and Deep Networks (2010s–Present)   10
  1.6 NLP and AI   10
  1.7 Main Components of NLP   11
  1.8 Natural Language Understanding (NLU)   12
    1.8.1 Speech Recognition   13
    1.8.2 Syntax Analysis   13
    1.8.3 Semantic Analysis   13
    1.8.4 Pragmatic Analysis   13
  1.9 Potential Applications of NLP   14
    1.9.1 Machine Translation (MT)   14
    1.9.2 Information Extraction (IE)   15
    1.9.3 Information Retrieval (IR)   15
    1.9.4 Sentiment Analysis   15
    1.9.5 Question-Answering (Q&A) Chatbots   16
  References   16

2 N-Gram Language Model   19
  2.1 Introduction   19
  2.2 N-Gram Language Model   21
    2.2.1 Basic NLP Terminology   22
    2.2.2 Language Modeling and Chain Rule   24
  2.3 Markov Chain in N-Gram Model   26
  2.4 Live Example: The Adventures of Sherlock Holmes   27
  2.5 Shannon's Method in N-Gram Model   31
  2.6 Language Model Evaluation and Smoothing Techniques   34
    2.6.1 Perplexity   34
    2.6.2 Extrinsic Evaluation Scheme   35
    2.6.3 Zero Counts Problems   35
    2.6.4 Smoothing Techniques   36
    2.6.5 Laplace (Add-One) Smoothing   36
    2.6.6 Add-k Smoothing   38
    2.6.7 Backoff and Interpolation Smoothing   39
    2.6.8 Good Turing Smoothing   40
  References   41

3 Part-of-Speech (POS) Tagging   43
  3.1 What Is Part-of-Speech (POS)?   43
    3.1.1 Nine Major POS in English Language   43
  3.2 POS Tagging   44
    3.2.1 What Is POS Tagging in Linguistics?   44
    3.2.2 What Is POS Tagging in NLP?   45
    3.2.3 POS Tags Used in the PENN Treebank Project   45
    3.2.4 Why Do We Care About POS in NLP?   46
  3.3 Major Components in NLU   48
    3.3.1 Computational Linguistics and POS   48
    3.3.2 POS and Semantic Meaning   49
    3.3.3 Morphological and Syntactic Definition of POS   49
  3.4 9 Key POS in English   50
    3.4.1 English Word Classes   51
    3.4.2 What Is a Preposition?   51
    3.4.3 What Is a Conjunction?   52
    3.4.4 What Is a Pronoun?   53
    3.4.5 What Is a Verb?   53
  3.5 Different Types of POS Tagset   56
    3.5.1 What Is Tagset?   56
    3.5.2 Ambiguous in POS Tags   57
    3.5.3 POS Tagging Using Knowledge   58
  3.6 Approaches for POS Tagging   58
    3.6.1 Rule-Based Approach POS Tagging   58
    3.6.2 Example of Rule-Based POS Tagging   59
    3.6.3 Example of Stochastic-Based POS Tagging   60
    3.6.4 Hybrid Approach for POS Tagging Using Brill Taggers   61
  3.7 Taggers Evaluations   63
    3.7.1 How Good Is an POS Tagging Algorithm?   64
  References   65

4 Syntax and Parsing   67
  4.1 Introduction and Motivation   67
  4.2 Syntax Analysis   68
    4.2.1 What Is Syntax   68
    4.2.2 Syntactic Rules   68
    4.2.3 Common Syntactic Patterns   69
    4.2.4 Importance of Syntax and Parsing in NLP   70
  4.3 Types of Constituents in Sentences   70
    4.3.1 What Is Constituent?   70
    4.3.2 Kinds of Constituents   72
    4.3.3 Noun-Phrase (NP)   72
    4.3.4 Verb-Phrase (VP)   72
    4.3.5 Complexity on Simple Constituents   73
    4.3.6 Verb Phrase Subcategorization   74
    4.3.7 The Role of Lexicon in Parsing   75
    4.3.8 Recursion in Grammar Rules   76
  4.4 Context-Free Grammar (CFG)   76
    4.4.1 What Is Context-Free Language (CFL)?   76
    4.4.2 What Is Context-Free Grammar (CFG)?   77
    4.4.3 Major Components of CFG   77
    4.4.4 Derivations Using CFG   78
  4.5 CFG Parsing   79
    4.5.1 Morphological Parsing   79
    4.5.2 Phonological Parsing   79
    4.5.3 Syntactic Parsing   79
    4.5.4 Parsing as a Kind of Tree Searching   80
    4.5.5 CFG for Fragment of English   80
    4.5.6 Parse Tree for "Play the Piano" for Prior CFG   80
    4.5.7 Top-Down Parser   81
    4.5.8 Bottom-Up Parser   82
    4.5.9 Control of Parsing   84
    4.5.10 Pros and Cons of Top-Down vs. Bottom-Up Parsing   84
  4.6 Lexical and Probabilistic Parsing   85
    4.6.1 Why Using Probabilities in Parsing?   85
    4.6.2 Semantics with Parsing   86
    4.6.3 What Is PCFG?   87
    4.6.4 A Simple Example of PCFG   87
    4.6.5 Using Probabilities for Language Modeling   90
    4.6.6 Limitations for PCFG   90
    4.6.7 The Fix: Lexicalized Parsing   91
  References   94

5 Meaning Representation   95
  5.1 Introduction   95
  5.2 What Is Meaning?   95
  5.3 Meaning Representations   96
  5.4 Semantic Processing   97
  5.5 Common Meaning Representation   98
    5.5.1 First-Order Predicate Calculus (FOPC)   98
    5.5.2 Semantic Networks   98
    5.5.3 Conceptual Dependency Diagram (CDD)   99
    5.5.4 Frame-Based Representation   99
  5.6 Requirements for Meaning Representation   100
    5.6.1 Verifiability   100
    5.6.2 Ambiguity   100
    5.6.3 Vagueness   101
    5.6.4 Canonical Forms   101
  5.7 Inference   102
    5.7.1 What Is Inference?   102
    5.7.2 Example of Inferencing with FOPC   103
  5.8 Fillmore's Theory of Universal Cases   103
    5.8.1 What Is Fillmore's Theory of Universal Cases?   104
    5.8.2 Major Case Roles in Fillmore's Theory   105
    5.8.3 Complications in Case Roles   106
  5.9 First-Order Predicate Calculus   107
    5.9.1 FOPC Representation Scheme   107
    5.9.2 Major Elements of FOPC   107
    5.9.3 Predicate-Argument Structure of FOPC   108
    5.9.4 Meaning Representation Problems in FOPC   110
    5.9.5 Inferencing Using FOPC   111
  References   113

6 Semantic Analysis   115
  6.1 Introduction   115
    6.1.1 What Is Semantic Analysis?   115
    6.1.2 The Importance of Semantic Analysis in NLP   116
    6.1.3 How Human Is Good in Semantic Analysis?   116
  6.2 Lexical Vs Compositional Semantic Analysis   117
    6.2.1 What Is Lexical Semantic Analysis?   117
    6.2.2 What Is Compositional Semantic Analysis?   117
  6.3 Word Senses and Relations   118
    6.3.1 What Is Word Sense?   118
    6.3.2 Types of Lexical Semantics   119
  6.4 Word Sense Disambiguation   123
    6.4.1 What Is Word Sense Disambiguation (WSD)?   123
    6.4.2 Difficulties in Word Sense Disambiguation   123
    6.4.3 Method for Word Sense Disambiguation   124
  6.5 WordNet and Online Thesauri   126
    6.5.1 What Is WordNet?   126
    6.5.2 What Is Synsets?   126
    6.5.3 Knowledge Structure of WordNet   127
    6.5.4 What Are Major Lexical Relations Captured in WordNet?   129
    6.5.5 Applications of WordNet and Thesauri?   129
  6.6 Other Online Thesauri: MeSH   130
    6.6.1 What Is MeSH?   130
    6.6.2 Uses of the MeSH Ontology   131
  6.7 Word Similarity and Thesaurus Methods   131
  6.8 Introduction   131
    6.8.1 Path-based Similarity   132
    6.8.2 Problems with Path-based Similarity   133
    6.8.3 Information Content Similarity   134
    6.8.4 The Resnik Method   135
    6.8.5 The Dekang Lin Method   135
    6.8.6 The (Extended) Lesk Algorithm   136
  6.9 Distributed Similarity   137
    6.9.1 Distributional Models of Meaning   137
    6.9.2 Word Vectors   137
    6.9.3 Term-Document Matrix   137
    6.9.4 Point-wise Mutual Information (PMI)   139
    6.9.5 Example of Computing PPMI on a Term-Context Matrix   140
    6.9.6 Weighing PMI Techniques   141
    6.9.7 K-Smoothing in PMI Computation   142
    6.9.8 Context and Word Similarity Measurement   144
    6.9.9 Evaluating Similarity   145
  References   146

7 Pragmatic Analysis and Discourse   149
  7.1 Introduction   149
  7.2 Discourse Phenomena   149
    7.2.1 Coreference Resolution   150
    7.2.2 Why Is it Important?   150
    7.2.3 Coherence and Coreference   151
    7.2.4 Importance of Coreference Relations   152
    7.2.5 Entity-Based Coherence   153
  7.3 Discourse Segmentation   154
    7.3.1 What Is Discourse Segmentation?   154
    7.3.2 Unsupervised Discourse Segmentation   154
    7.3.3 Hearst's TextTiling Method   155
    7.3.4 TextTiling Algorithm   157
    7.3.5 Supervised Discourse Segmentation   158
  7.4 Discourse Coherence   158
    7.4.1 What Makes a Text Coherent?   158
    7.4.2 What Is Coherence Relation?   159
    7.4.3 Types of Coherence Relations   159
    7.4.4 Hierarchical Structure of Discourse Coherence   160
    7.4.5 Types of Referring Expressions   161
    7.4.6 Features for Filtering Potential Referents   162
    7.4.7 Preferences in Pronoun Interpretation   162
  7.5 Algorithms for Coreference Resolution   163
    7.5.1 Introduction   163
    7.5.2 Hobbs Algorithm   163
    7.5.3 Centering Algorithm   166
    7.5.4 Machine Learning Method   169
  7.6 Evaluation   171
  References   172

8 Transfer Learning and Transformer Technology   175
  8.1 What Is Transfer Learning?   175
  8.2 Motivation of Transfer Learning   176
    8.2.1 Categories of Transfer Learning   176
  8.3 Solutions of Transfer Learning   178
  8.4 Recurrent Neural Network (RNN)   180
    8.4.1 What Is RNN?   180
    8.4.2 Motivation of RNN   180
    8.4.3 RNN Architecture   181
    8.4.4 Long Short-Term Memory (LSTM) Network   183
    8.4.5 Gate Recurrent Unit (GRU)   185
    8.4.6 Bidirectional Recurrent Neural Networks (BRNNs)   186
  8.5 Transformer Technology   188
    8.5.1 What Is Transformer?   188
    8.5.2 Transformer Architecture   188
    8.5.3 Deep Into Encoder   189
  8.6 BERT   192
    8.6.1 What Is BERT?   192
    8.6.2 Architecture of BERT   192
    8.6.3 Training of BERT   192
  8.7 Other Related Transformer Technology   194
    8.7.1 Transformer-XL   194
    8.7.2 ALBERT   195
  References   196

9 Major NLP Applications   199
  9.1 Introduction   199
  9.2 Information Retrieval Systems   199
    9.2.1 Introduction to IR Systems   199
    9.2.2 Vector Space Model in IR   200
    9.2.3 Term Distribution Models in IR   202
    9.2.4 Latent Semantic Indexing in IR   207
    9.2.5 Discourse Segmentation in IR   208
  9.3 Text Summarization Systems   212
    9.3.1 Introduction to Text Summarization Systems   212
    9.3.2 Text Summarization Datasets   214
    9.3.3 Types of Summarization Systems   214
    9.3.4 Query-Focused Vs Generic Summarization Systems   215
    9.3.5 Single and Multiple Document Summarization   217
    9.3.6 Contemporary Text Summarization Systems   218
  9.4 Question-and-Answering Systems   224
    9.4.1 QA System and AI   224
    9.4.2 Overview of Industrial QA Systems   228
  References   236

Part II Natural Language Processing Workshops with Python Implementation in 14 Hours

10 Workshop#1 Basics of Natural Language Toolkit (Hour 1–2)   243
  10.1 Introduction   243
  10.2 What Is Natural Language Toolkit (NLTK)?   243
  10.3 A Simple Text Tokenization Example Using NLTK   244
  10.4 How to Install NLTK?   245
  10.5 Why Using Python for NLP?   246
  10.6 NLTK with Basic Text Processing in NLP   248
  10.7 Simple Text Analysis with NLTK   249
  10.8 Text Analysis Using Lexical Dispersion Plot   253
    10.8.1 What Is a Lexical Dispersion Plot?   253
    10.8.2 Lexical Dispersion Plot Over Context Using Sense and Sensibility   253
    10.8.3 Lexical Dispersion Plot Over Time Using Inaugural Address Corpus   254
  10.9 Tokenization in NLP with NLTK   255
    10.9.1 What Is Tokenization in NLP?   255
    10.9.2 Different Between Tokenize() vs Split()   256
    10.9.3 Count Distinct Tokens   257
    10.9.4 Lexical Diversity   258
  10.10 Basic Statistical Tools in NLTK   260
    10.10.1 Frequency Distribution: FreqDist()   260
    10.10.2 Rare Words: Hapax   262
    10.10.3 Collocations   263
  References   265

11 Workshop#2 N-grams in NLTK and Tokenization in SpaCy (Hour 3–4)   267
  11.1 Introduction   267
  11.2 What Is N-Gram?   267
  11.3 Applications of N-Grams in NLP   268
  11.4 Generation of N-Grams in NLTK   268
  11.5 Generation of N-Grams Statistics   270
  11.6 spaCy in NLP   276
    11.6.1 What Is spaCy?   276
  11.7 How to Install spaCy?   277
  11.8 Tokenization using spaCy   278
    11.8.1 Step 1: Import spaCy Module   278
    11.8.2 Step 2: Load spaCy Module "en_core_web_sm"   278
    11.8.3 Step 3: Open and Read Text File "Adventures_Holmes.txt" Into file_handler "fholmes"   278
    11.8.4 Step 4: Read Adventures of Sherlock Holmes   278
    11.8.5 Step 5: Replace All Newline Symbols   279
    11.8.6 Step 6: Simple Counting   279
    11.8.7 Step 7: Invoke nlp() Method in spaCy   280
    11.8.8 Step 8: Convert Text Document Into Sentence Object   280
    11.8.9 Step 9: Directly Tokenize Text Document   282
  References   284

12 Workshop#3 POS Tagging Using NLTK (Hour 5–6)   285
  12.1 Introduction   285
  12.2 A Revisit on Tokenization with NLTK   285
  12.3 Stemming Using NLTK   288
    12.3.1 What Is Stemming?   288
    12.3.2 Why Stemming?   289
    12.3.3 How to Perform Stemming?   289
    12.3.4 Porter Stemmer   289
    12.3.5 Snowball Stemmer   291
  12.4 Stop-Words Removal with NLTK   292
    12.4.1 What Are Stop-Words?   292
    12.4.2 NLTK Stop-Words List   292
    12.4.3 Try Some Texts   294
    12.4.4 Create Your Own Stop-Words   295
  12.5 Text Analysis with NLTK   296
  12.6 Integration with WordCloud   299
    12.6.1 What Is WordCloud?   299
  12.7 POS Tagging with NLTK   301
    12.7.1 What Is POS Tagging?   301
    12.7.2 Universal POS Tagset   301
    12.7.3 PENN Treebank Tagset (English and Chinese)   302
    12.7.4 Applications of POS Tagging   303
  12.8 Create Own POS Tagger with NLTK   306
  References   312

13 Workshop#4 Semantic Analysis and Word Vectors Using spaCy (Hour 7–8)   313
  13.1 Introduction   313
  13.2 What Are Word Vectors?   313
  13.3 Understanding Word Vectors   314
    13.3.1 Example: A Simple Word Vector   314
  13.4 A Taste of Word Vectors   316
  13.5 Analogies and Vector Operations   319
  13.6 How to Create Word Vectors?   320
  13.7 spaCy Pre-trained Word Vectors   320
  13.8 Similarity Method in Semantic Analysis   323
  13.9 Advanced Semantic Similarity Methods with spaCy   326
    13.9.1 Understanding Semantic Similarity   326
    13.9.2 Euclidian Distance   326
    13.9.3 Cosine Distance and Cosine Similarity   327
    13.9.4 Categorizing Text with Semantic Similarity   329
    13.9.5 Extracting Key Phrases   330
    13.9.6 Extracting and Comparing Named Entities   331
  References   333

14 Workshop#5 Sentiment Analysis and Text Classification with LSTM Using spaCy (Hour 9–10)   335
  14.1 Introduction   335
  14.2 Text Classification with spaCy and LSTM Technology   335
  14.3 Technical Requirements   336
  14.4 Text Classification in a Nutshell   336
    14.4.1 What Is Text Classification?   336
    14.4.2 Text Classification as AI Applications   337
  14.5 Text Classifier with spaCy NLP Pipeline   338
    14.5.1 TextCategorizer Class   339
    14.5.2 Formatting Training Data for the TextCategorizer   340
    14.5.3 System Training   344
    14.5.4 System Testing   346
    14.5.5 Training TextCategorizer for Multi-Label Classification   347
  14.6 Sentiment Analysis with spaCy   351
    14.6.1 IMDB Large Movie Review Dataset   351
    14.6.2 Explore the Dataset   351
    14.6.3 Training the TextClassfier   355
  14.7 Artificial Neural Network in a Nutshell   357
  14.8 An Overview of TensorFlow and Keras   358
  14.9 Sequential Modeling with LSTM Technology   358
  14.10 Keras Tokenizer in NLP   359
    14.10.1 Embedding Words   363
  14.11 Movie Sentiment Analysis with LTSM Using Keras and spaCy   364
    14.11.1 Step 1: Dataset   365
    14.11.2 Step 2: Data and Vocabulary Preparation   366
    14.11.3 Step 3: Implement the Input Layer   368
    14.11.4 Step 4: Implement the Embedding Layer   368
    14.11.5 Step 5: Implement the LSTM Layer   368
    14.11.6 Step 6: Implement the Output Layer   369
    14.11.7 Step 7: System Compilation   369
    14.11.8 Step 8: Model Fitting and Experiment Evaluation   370
  References   371

15 Workshop#6 Transformers with spaCy and TensorFlow (Hour 11–12)�������������������������������������������������������������������������������� 373
15.1 Introduction������������������������������������������������������������������������������������ 373
15.2 Technical Requirements������������������������������������������������������������������ 373
15.3 Transformers and Transfer Learning in a Nutshell ������������������������ 374
15.4 Why Transformers?������������������������������������������������������������������������ 375
15.5 An Overview of BERT Technology������������������������������������������������ 377
15.5.1 What Is BERT? ���������������������������������������������������������������� 377
15.5.2 BERT Architecture������������������������������������������������������������ 378
15.5.3 BERT Input Format���������������������������������������������������������� 378
15.5.4 How to Train BERT?�������������������������������������������������������� 380
15.6 Transformers with TensorFlow ������������������������������������������������������ 382
15.6.1 HuggingFace Transformers���������������������������������������������� 382
15.6.2 Using the BERT Tokenizer ���������������������������������������������� 383
15.6.3 Word Vectors in BERT������������������������������������������������������ 386
15.7 Revisit Text Classification Using BERT ���������������������������������������� 388
15.7.1 Data Preparation��������������������������������������������������������������� 388
15.7.2 Start the BERT Model Construction �������������������������������� 389
15.8 Transformer Pipeline Technology�������������������������������������������������� 392
15.8.1 Transformer Pipeline for Sentiment Analysis������������������ 393
15.8.2 Transformer Pipeline for QA System ������������������������������ 393
15.9 Transformer and spaCy������������������������������������������������������������������ 394
References�������������������������������������������������������������������������������������������������� 398
16 Workshop#7 Building Chatbot with TensorFlow and Transformer Technology (Hour 13–14)������������������������������������������������������������������ 401
16.1 Introduction������������������������������������������������������������������������������������ 401
16.2 Technical Requirements������������������������������������������������������������������ 401
16.3 AI Chatbot in a Nutshell ���������������������������������������������������������������� 402
16.3.1 What Is a Chatbot?������������������������������������������������������������ 402
16.3.2 What Is a Wake Word in Chatbot?������������������������������������ 403
16.3.3 NLP Components in a Chatbot ���������������������������������������� 404
16.4 Building Movie Chatbot by Using TensorFlow
and Transformer Technology���������������������������������������������������������� 404
16.4.1 The Chatbot Dataset���������������������������������������������������������� 405
16.4.2 Movie Dialog Preprocessing�������������������������������������������� 405
16.4.3 Tokenization of Movie Conversation�������������������������������� 407
16.4.4 Filtering and Padding Process������������������������������������������ 408
16.4.5 Creation of TensorFlow Movie Dataset
Object (mDS)�������������������������������������������������������������������� 409
16.4.6 Calculate Attention Learning Weights������������������������������ 410
16.4.7 Multi-Head-Attention (MHAttention)������������������������������ 411
16.4.8 System Implementation���������������������������������������������������� 412
16.5 Related Works �������������������������������������������������������������������������������� 430
References�������������������������������������������������������������������������������������������������� 431

Index�������������������������������������������������������������������������������������������������������������������� 433
About the Author

Raymond Lee is the founder of the Quantum Finance Forecast System (QFFC) (https://qffc.uic.edu.cn) and currently an Associate Professor at United International College (UIC), with 25+ years' experience in AI research and consultancy covering chaotic neural networks, NLP, intelligent fintech systems, quantum finance, and intelligent e-commerce systems. He has over 100 publications and has authored 8 textbooks in the fields of AI, chaotic neural networks, AI-based fintech systems, intelligent agent technology, chaotic cryptosystems, ontological agents, neural oscillators, biometrics, and weather simulation and forecasting systems.
Upon completion of the QFFC project, in 2018 he joined United International
College (UIC), China, to pursue further R&D work on AI-Fintech and to share his
expertise in AI-Fintech, chaotic neural networks, and related intelligent systems
with fellow students and the community. His three latest textbooks, Quantum
Finance: Intelligent Forecast and Trading Systems (2019), Artificial Intelligence in
Daily Life (2020), and this NLP book have been adopted as the main textbooks for
various AI courses in UIC.

Abbreviations

AI Artificial intelligence
ASR Automatic speech recognition
BERT Bidirectional encoder representations from transformers
BRNN Bidirectional recurrent neural networks
CDD Conceptual dependency diagram
CFG Context-free grammar
CFL Context-free language
CNN Convolutional neural networks
CR Coreference resolution
DNN Deep neural networks
DT Determiner
FOPC First-order predicate calculus
GRU Gate recurrent unit
HMM Hidden Markov model
IE Information extraction
IR Information retrieval
KAI Knowledge acquisition and inferencing
LSTM Long short-term memory
MEMM Maximum entropy Markov model
MeSH Medical Subject Headings
ML Machine learning
NER Named entity recognition
NLP Natural language processing
NLTK Natural language toolkit
NLU Natural language understanding
NN Noun
NNP Proper noun
Nom Nominal
NP Noun phrase
PCFG Probabilistic context-free grammar
PMI Pointwise mutual information


POS Part-of-speech
POST Part-of-speech tagging
PPMI Positive pointwise mutual information
Q&A Question-and-answering
RNN Recurrent neural networks
TBL Transformation-based learning
VB Verb
VP Verb phrase
WSD Word sense disambiguation
Part I
Concepts and Technology
Chapter 1
Natural Language Processing

Consider this scenario: Late in the evening, Jack starts a mobile app and talks with
AI Tutor Max.

1.1 Introduction

There are many chatbots nowadays that allow humans to communicate with a device in natural language. Figure 1.1 illustrates a dialogue between a student, who has returned to the dormitory after a full day of classes, and a mobile application called AI Tutor 2.0 (Cui et al. 2020) from our latest research on AI tutor chatbots. The objective is to enable the user (Jack) not only to learn from book reading but also to converse candidly with AI Tutor 2.0 (Max), which provides knowledge responses in natural language. It differs from chatbots that respond only to basic commands; it is human–computer interaction that demonstrates how a user communicates much as a student converses with a tutor about subject knowledge in the physical world. It is a dynamic process consisting of (1) world knowledge for simple handshaking dialogue such as greetings and general discussion, which is not an easy task because constructing a functional chatbot for daily dialogue involves knowledge and common sense; and (2) technical knowledge of a particular domain, i.e. domain expertise, which in this case requires learning first from the author's book AI in Daily Life (Lee 2020), covering all basic knowledge on the subject, to form a knowledge tree or ontology graph that can serve as a new type of publication and interactive medium between human and computer for learning new knowledge.
Natural language processing (NLP) is related to several disciplines, including human linguistics, computational linguistics, statistical engineering, AI in machine learning, data mining, and human voice recognition and synthesis.

Max: Good evening, Jack. How are you?
Jack: Busy day. Just finished dinner after whole afternoon lecture.
Max: I see. Anything I can help?
Jack: Yes. Today Dr Lee has talked about data mining and search engine. Some concepts are not so clear, might need your help.
Max: Sure thing. Which part are you not clear?
Jack: What is the limitation of search engine?
Max: Search engines today have one intrinsic problem, that is all the search results are searched automatically and ranked according to similar query keywords.
Jack: Why Hidden Markov Model is popular?
Max: The reason why HMM is popular because it is: 1) reliable; 2) easy to implement technically; and 3) the Markov chains can be trained automatically and computationally feasible to use.
Jack: What is Data Mining?
Max: Data mining is the process of discovering patterns in large datasets involving methods at the intersection of machine learning, statistics, and database systems.
Jack: I see. Thank you very much. Have a nice evening.
Max: You too. Good night!

Fig. 1.1 A snapshot of AI Tutor chatbot

There are many ingenious chatbots initiated by NLP and AI scientists that have become commercial products in the past decades.
This chapter will introduce this prime technology and its components, followed by pertinent technologies in subsequent chapters.

1.2 Human Language and Intelligence

There is an old saying: the way you behave says more about who you are. This is because we never know what people think; the only method is to evaluate and judge their behaviors.

Fig. 1.2 Turing test (Tuchong 2020a)

NLP core technologies and methodologies arose from the famous Turing Test (Eisenstein 2019; Bender 2013; Turing 1936, 1950) proposed in the 1950s by Sir Alan Turing (1912–1954), the father of AI. Figure 1.2 shows a human judge conversing with two individuals in two rooms: one is a human, the other is either a robot, a chatbot, or an NLP application. During a 20-minute conversation, the judge can ask the human/machine technical and non-technical questions and require a response to every question, so that the judge can decide whether the respondent is a human or a machine. The role of NLP in the Turing Test is to recognize and understand the questions, and to respond in human language. It remains a popular topic in AI today because we cannot see and judge people's thinking to define intelligence. It is the ultimate challenge in AI.
Human language is a significant component of human behavior and civilization. It can generally be categorized into (1) written and (2) oral aspects. Written language undertakes to process, store, and pass human/natural language knowledge to the next generations. Oral or spoken language acts as a communication medium among individuals.
NLP has examined basic effects on philosophy such as meaning and knowledge, psychology in word meanings, linguistics in phrase and sentence formation, and computational linguistics in language models. Hence, NLP is a cross-disciplinary integration of disciplines: philosophy in human language ontology models, psychology in the behavior between natural and human language, linguistics in mathematical and language models, and computational linguistics in agents and ontology tree technology, as shown in Fig. 1.3.

Fig. 1.3 Various disciplines related to NLP

1.3 Linguistic Levels of Human Language

Linguistic levels (Hausser 2014) are regarded as the functional analysis of human written and spoken languages. There are six levels in linguistic analysis: (1) phonetics, (2) phonology, (3) morphology, (4) syntax, (5) semantics, and (6) pragmatics (discourse), classified from the basic level of sound upwards. The six levels of linguistics are shown in Fig. 1.4.
The basic linguistic structure of spoken language includes phonetics and phonology. Phonetics refers to the physical aspects of sound: the study of the production and perception of sounds, called phones. Phonetics governs the production of human speech, often without preceding knowledge of the spoken language, organizes sounds, and studies the phonemes of languages that can provide various meanings between words and phrases.
Direct language structure is related to the morphological and syntactic levels. Morphology concerns form at the word level, determined generally by grammar and syntax. It refers to the smallest forms in linguistic analysis, consisting of sounds, that combine into words with grammatical or lexical function. Lexicology is the study of vocabulary, from a word form to its derived forms. Syntax represents the primary level of clauses and sentences, organizing the meaning of different word orders, i.e. addition and subtraction in spoken language, and deals with related sentence patterns and ambiguity analysis.
The advanced structure deals with actual language meaning at the semantic and pragmatic levels. The semantic level is the domain of meaning, which builds on morphology and syntax but is seen as a level that requires one's own learning to assign correct meanings promptly from a vocabulary, terminology form, grammar, sentence, and discourse perspective. Pragmatics is the use of language in definitive settings. The meaning of discourse does not have to be the same as its abstract form in actual use. It is largely based on the concept of speech acts and the content of statements, with analysis of the intent and effect of language performance.

Fig. 1.4 Linguistic levels of human languages

1.4 Human Language Ambiguity

In many language models, cultural differences often produce identical utterances with more than a single meaning in conversation. Ambiguity is the capability of a sentence structure to be understood in more than one way. There are (1) lexical, (2) syntactic, (3) semantic, and (4) pragmatic ambiguities in NLP.
Lexical ambiguity arises from words whose meaning depends on the contextual utterance. For instance, the word green is normally a noun for a color, but it can be an adjective or even a verb in different situations.
Syntactic ambiguity arises from sentences that can be parsed differently, e.g. Jack watched Helen with a telescope. It can describe either Jack watched Helen by using a telescope or Jack watched Helen holding a telescope.
Semantic ambiguity arises from word meanings that can be misinterpreted, or from a sentence containing ambiguous words or phrases, e.g. The van hits the boar while it is moving. It can describe either the van hits the boar while the van is moving, or the van hits the boar while the boar is moving. It has more than a simple syntactic meaning and requires working out the correct interpretation.

Pragmatic ambiguity arises from a statement that is not clearly defined when the context of a sentence provides multiple interpretations, such as I like that too. It can mean the speaker likes that too, or that others like that too, but what that refers to is uncertain.
NLP analyzes sentence ambiguity incessantly. If ambiguities can be identified earlier, it is easier to define the proper meanings.
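Lexical ambiguity can be observed directly with standard NLP tools. The minimal sketch below is only an illustration, not the book's own workshop code; it assumes the spaCy model en_core_web_sm is installed and shows how the same word green receives different part-of-speech tags depending on context. The exact tags depend on the pre-trained model.

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

examples = [
    "The wall is painted green.",           # 'green' used as an adjective
    "The green of the forest is calming.",  # 'green' used as a noun
]
for text in examples:
    doc = nlp(text)
    for token in doc:
        if token.text.lower() == "green":
            # token.pos_ is the coarse part-of-speech predicted in context
            print(f"{text!r:45} -> green tagged as {token.pos_}")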

1.5 A Brief History of NLP

There are several major transformation stages in NLP history (Santilal 2020).

1.5.1 First Stage: Machine Translation (Before 1960s)

The concept of NLP was introduced in the seventeenth century by the philosopher and mathematician Gottfried Wilhelm Leibniz (1646–1716) and the polymath René Descartes (1596–1650). Their studies of the relationships between words and languages formed the basis for language translation engine development (Santilal 2020).
The first patent for an invention related to machine translation was filed by the inventor and engineer Georges Artsrouni in 1933, but formal study and research were initiated by Sir Alan Turing with his remarkable article Computing Machinery and Intelligence published in 1950 (Turing 1936, 1950); his famous Turing Test was officially used as an evaluation criterion for machine intelligence, while NLP research and development at that time were mainly focused on language translation.
The first and second International Conferences on Machine Translation, held in 1952 and 1956, used basic rule-based and stochastic techniques. The 1954 Georgetown-IBM experiment engaged in wholly automatic machine translation of more than 60 Russian sentences into English and was over-optimistic that the whole machine translation problem could be solved within a few years. A breakthrough in NLP was achieved by Emeritus Prof. Noam Chomsky with universal grammar for linguistics in 1957, but the ALPAC report published in 1966 revealed deficient progress in AI and machine translation over the previous 10 years and signified the first winter of AI.

1.5.2 Second Stage: Early AI on NLP from 1960s to 1970s

Major NLP development focused on how it could be used in different areas, such as knowledge engineering (called agent ontology) to shape meaning representations, as AI grew popular over time. The BASEBALL system (Green et al. 1961) was a typical example of a Q&A-based domain expert system for human and computer interaction developed in the 1960s, but its inputs were restrictive and language processing techniques remained basic.
In 1968, Prof. Marvin Minsky (1927–2016) developed a more powerful NLP system. This advanced system used an AI-based question-answering inference engine between humans and computers to provide knowledge-based interpretations of questions and answers. Further, Prof. William A. Woods proposed the augmented transition network (ATN) to represent natural language input in 1970. During this period, many programmers started to transcribe codes in different AI languages to conceptualize natural language ontology knowledge of real-world structural information into a human understanding mode. Yet these expert systems were unable to meet expectations, which signified the second winter of AI.

1.5.3 Third Stage: Grammatical Logic on NLP (1970s–1980s)

Research turned to knowledge representation, programming logic, and reasoning in AI. This period was regarded as the grammatical logic phase of NLP, in which powerful sentence processing techniques emerged, such as SRI's Core Language Engine and discourse representation theory, a new approach to pragmatic representation and discourse interpretation, together with practical resources and tools such as parsers and Q&A chatbots. Although R&D was hampered by limited computational power, work on the lexicon in the 1980s aimed to expand NLP.

1.5.4 Fourth Stage: AI and Machine Learning (1980s–2000s)

The revolutionary success of the Hopfield Network in the field of machine learning, proposed by Prof. Emeritus John Hopfield, activated a new era of NLP research using machine learning techniques as an alternative to the complex rule-based and stochastic methods of the past decades.
Upgrades in computational power and memory, complemented by Chomsky's theory of linguistics, enhanced language processing through machine learning methods of corpus linguistics. This development stage was also known as the lexical and corpus phase of NLP, referring to the emergence of grammar through the lexicalization method in the late 1980s, which signified the IBM DeepQA project led by Dr. David Ferrucci and their remarkable question-answering system developed in 2006.

1.5.5 Fifth Stage: AI, Big Data, and Deep Networks (2010s–Present)

NLP statistical techniques and rule-based system R&D have evolved with cloud computing technology on mobile computing and big data in deep network analysis, e.g. recurrent neural networks using LSTM and related networks. Google, Amazon, and Facebook have contributed to agent technologies and deep neural network development since 2010 to devise products such as auto-driving and Q&A chatbots, and such development is still under way.

1.6 NLP and AI

NLP can be regarded as the automatic or semi-automatic processing of human language (Eisenstein 2019). It requires extensive knowledge of linguistics and logical theory in theoretical mathematics, and is also known as computational linguistics. It is a multidisciplinary study of epistemology, philosophy, psychology, cognitive science, and agent ontology.
NLP is an area of AI in which computer machines can analyze and interpret human speech for human–computer interaction (HCI), to generate structural knowledge for information retrieval operations, automatic text summarization, sentiment and speech recognition analysis, data mining, deep learning, and machine translation agent ontologies at different levels of Q&A chatbots (Fig. 1.5).

Fig. 1.5 NLP and AI (Tuchong 2020b)



1.7 Main Components of NLP

NLP consists of (1) Natural Language Understanding (NLU), (2) Knowledge Acquisition and Inferencing (KAI), and (3) Natural Language Generation (NLG) components, as shown in Fig. 1.6.
NLU is a set of techniques and methods devised to understand the meanings of human spoken languages by syntactic, semantic, and pragmatic analyses.
KAI is a system to generate proper responses after spoken language is fully recognized by NLU. Knowledge acquisition and inferencing is an unresolved problem in machine learning and AI when tackled with conventional rule-based systems, i.e. the if-then-else types of query-response used in expert systems, due to the intricacies of natural language and conversation; most KAI systems therefore strive to restrict the knowledge domain to a specific industry for resolution, e.g. customer service knowledge for insurance, medical care, etc. Further, agent ontology has achieved favorable outcomes.
NLG includes answer, response, and feedback generation in human–machine dialogue. It is a multi-facet machine translation process that converts responses into text and sentences and performs text-to-speech synthesis in the target language to produce near-human speech responses.

Fig. 1.6 NLP main components



1.8 Natural Language Understanding (NLU)

Natural Language Understanding (NLU) is a process of recognizing and understanding spoken language in four stages (Allen 1994): (1) speech recognition, (2) syntactic (syntax) analysis, (3) semantic analysis, and (4) pragmatic analysis, as shown in Fig. 1.7.

[Fig. 1.7 depicts the NLU pipeline: Spoken Language → Speech Recognition → Syntax Analysis (with Lexicon and Grammar) → Semantic Analysis (with Semantic Rules) → Pragmatic Analysis (with Contextual Information) → Target Meaning Representation]

Fig. 1.7 NLU systematic diagram



1.8.1 Speech Recognition

Speech recognition (Li et al. 2015) is the first stage in NLU; it performs phonetic, phonological, and morphological processing to analyze spoken language. The task involves breaking down spoken words, called utterances, into distinct tokens representing paragraphs, sentences, and words in different parts. Current speech recognition models apply spectrogram analysis to extract distinct frequencies, e.g. the word uncanny can be split into the two word tokens un and canny. Different languages have different spectrogram analyses.
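Sub-word token splitting of this kind can be illustrated with the HuggingFace BERT tokenizer used in the later Transformer workshops. This is only a hedged sketch: the exact sub-word split depends on the tokenizer's learned vocabulary, so the outputs shown in the comments are indicative rather than guaranteed.

from transformers import AutoTokenizer

# Downloads the pre-trained BERT vocabulary on first use
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Word-piece tokenization splits rare words into smaller known pieces
print(tokenizer.tokenize("uncanny"))      # e.g. ['un', '##can', '##ny']
print(tokenizer.tokenize("photography"))  # e.g. ['photography']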

1.8.2 Syntax Analysis

Syntax analysis (Sportier et al. 2013) is the second stage of NLU, in direct response to speech recognition, and analyzes the structural meaning of spoken sentences. This task has two purposes: (1) to check the syntactic correctness of the sentence/utterance, and (2) to break spoken sentences down into syntactic structures that reflect the syntactic relationships between words. For instance, the utterance oranges to the boys will be rejected by a syntax parser because of syntactic errors.
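The syntactic structure of an utterance can be inspected with spaCy's dependency parser, which the later workshops use. The sketch below is an illustration under the assumption that en_core_web_sm is installed; checking whether the parse contains a verbal ROOT is one simple, hypothetical heuristic for flagging fragments such as oranges to the boys, not the book's prescribed method.

import spacy

nlp = spacy.load("en_core_web_sm")

def describe(text):
    doc = nlp(text)
    for token in doc:
        # dep_ is the syntactic relation, head is the governing word
        print(f"{token.text:10} {token.dep_:10} head={token.head.text}")
    root = [t for t in doc if t.dep_ == "ROOT"][0]
    print("ROOT is a verb:", root.pos_ in ("VERB", "AUX"), "\n")

describe("The boys gave oranges to their teacher.")
describe("oranges to the boys")   # fragment: typically no verbal ROOT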

1.8.3 Semantic Analysis

Semantic analysis (Goddard 1998) is the third stage in NLU, and it corresponds to syntax analysis. This task is to extract the precise meaning of a sentence/utterance, or the dictionary meanings defined by the text, and to reject meaningless combinations, e.g. a semantic analyzer rejects a word phrase like hot snowflakes: despite correct syntax, its semantic meaning is incorrect.

1.8.4 Pragmatic Analysis

Pragmatic analysis (Ibileye 2018) is the fourth stage in NLU and a challenging part of spoken language analysis, involving high-level or expert knowledge with common sense, e.g. will you crack open the door? I'm getting hot. This sentence/utterance requires extra knowledge of the second clause to understand that crack means to break in its semantic meaning, but it should be interpreted as to open in its pragmatic meaning.

1.9 Potential Applications of NLP

After years of research and development from machine translation and rule-based
systems to data mining and deep networks, NLP technology has a wide range of
applications in everyday activities such as machine translation, information retrieval,
sentiment analysis, information extraction, and question-answering chatbots as in
Fig. 1.8.

1.9.1 Machine Translation (MT)

Machine translation (Scott 2018) has been the earliest application in NLP since the 1950s. Although it is not difficult to translate one language to another, there are two major challenges: (1) naturalness (or fluency), meaning different languages have different styles and usages, and (2) adequacy (or accuracy), meaning different languages may present independent ideas in different ways. Experienced human translators address this trade-off in creative ways. Statistical methods or case-by-case rule-based systems were used in the past, but since there have been many ambiguity scenarios in language translation, machine translation R&D nowadays strives to apply several AI techniques, such as recurrent networks or deep network black-box systems, to enhance machine learning capabilities.

Fig. 1.8 Potential NLP applications



1.9.2 Information Extraction (IE)

Information extraction (Hemdev 2011) is an application task to extract key language information from texts or utterances automatically. The sources can be structural or semi-structural machine-readable documents, or users' language in most NLP cases. Recently, information in complex formats such as audio, video, and even interactive dialogue can be extracted from multiple media. Hence, many commercial IE programs have become domain specific, e.g. medical science, law, or, in our case, the AI knowledge specified for the AI Tutor. By doing so, it is easier to set up an ontology graph and ontology knowledge base so that all the retrieved information can be referenced to these domain knowledge graphs to extract useful knowledge.

1.9.3 Information Retrieval (IR)

Information retrieval (Peters et al. 2012) is an application for organizing, retrieving, storing, and evaluating information from documents and source repositories, especially textual information, and from multimedia such as video and audio knowledge bases. It helps users to locate relevant documents without answering any questions explicitly. The user must make a request for the IR system to retrieve the relevant output and respond in document form.

1.9.4 Sentiment Analysis

Sentiment analysis (Liu 2012) is a kind of data mining system in NLP used to analyze user sentiment towards products, people, and ideas from social media, forums, and online platforms. It is an important application for extracting data from messages, comments, and conversations published on these platforms, and assigning a labeled sentiment classification, as in Fig. 1.9, to understand natural language and utterances. Deep networks are a way to analyze large amounts of such data. Part II: NLP Implementation Workshops will explore how to implement sentiment analysis in detail using Python spaCy and Transformer technology.

Fig. 1.9 NLP on sentiment analysis
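As a foretaste of those workshops, the minimal sketch below uses the HuggingFace transformers pipeline for sentiment classification. This is an illustrative shortcut rather than the spaCy/LSTM approach built step by step in Part II, and it downloads the library's default pre-trained English sentiment model on first use.

from transformers import pipeline

# Loads a default pre-trained sentiment model on first use
classifier = pipeline("sentiment-analysis")

reviews = [
    "This movie is a masterpiece, I loved every minute of it.",
    "The plot was dull and the acting was even worse.",
]
for review, result in zip(reviews, classifier(reviews)):
    print(f"{result['label']:8} ({result['score']:.3f})  {review}")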

1.9.5 Question-Answering (Q&A) Chatbots

Q&A systems are a key objective in NLP (Raj 2018). A process flow is necessary to implement a Q&A chatbot. It includes voice recognition to convert speech into a list of tokens in sentences/utterances, syntactic (grammatical) analysis, semantic meaning analysis of whole sentences, and pragmatic analysis for embedded or complex meanings. When the enquirer's utterance meaning is generated, it is necessary to search the knowledge base for the most appropriate answer or response through inferencing, either by a rule-based system, a statistical system, or a deep network, e.g. the Google BERT system. Once a response is available, reverse engineering is required to generate a natural voice from the verbal language, called voice synthesis. Hence, the Q&A system in NLP is an important technology that can apply to daily activities such as human–computer interaction in auto-driving, customer services support, and language skills improvement.
The final workshop will discuss how to integrate various Python NLP implementation tools, including NLTK, spaCy, TensorFlow Keras, and Transformer technology, to implement a Q&A movie chatbot system.
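The answer-finding step of such a system can be sketched with the HuggingFace transformers question-answering pipeline. This is a hedged, stand-alone illustration of extractive QA over a short context passage (the passage here is an arbitrary example), whereas the final workshop builds a full conversational chatbot around TensorFlow and Transformer components.

from transformers import pipeline

# Loads a default pre-trained extractive QA model on first use
qa = pipeline("question-answering")

context = (
    "Data mining is the process of discovering patterns in large datasets "
    "involving methods at the intersection of machine learning, statistics, "
    "and database systems."
)
result = qa(question="What is data mining?", context=context)
print(result["answer"], f"(score={result['score']:.3f})")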

References

Allen, J. (1994) Natural Language Understanding (2nd edition). Pearson


Bender, E. M. (2013) Linguistic Fundamentals for Natural Language Processing: 100 Essentials
from Morphology and Syntax (Synthesis Lectures on Human Language Technologies). Morgan
& Claypool Publishers

Cui, Y., Huang, C., Lee, Raymond (2020). AI Tutor: A Computer Science Domain Knowledge
Graph-Based QA System on JADE platform. World Academy of Science, Engineering and
Technology, Open Science Index 168, International Journal of Industrial and Manufacturing
Engineering, 14(12), 543 - 553.
Eisenstein, J. (2019) Introduction to Natural Language Processing (Adaptive Computation and
Machine Learning series). The MIT Press.
Goddard, C. (1998) Semantic Analysis: A Practical Introduction (Oxford Textbooks in Linguistics).
Oxford University Press.
Green, B., Wolf, A., Chomsky, C. and Laughery, K. (1961). BASEBALL: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference (IRE-AIEE-ACM ’61 (Western)). Association for Computing Machinery, New York, NY, USA, 219–224.
Hausser, R. (2014) Foundations of Computational Linguistics: Human-Computer Communication
in Natural Language (3rd edition). Springer.
Hemdev, P. (2011) Information Extraction: A Smart Calendar Application: Using NLP,
Computational Linguistics, Machine Learning and Information Retrieval Techniques. VDM
Verlag Dr. Müller.
Ibileye, G. (2018) Discourse Analysis and Pragmatics: Issues in Theory and Practice.
Malthouse Press.
Lee, R. S. T. (2020). AI in Daily Life. Springer.
Li, J. et al. (2015) Robust Automatic Speech Recognition: A Bridge to Practical Applications.
Academic Press.
Liu, B. (2012) Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
Peters, C. et al. (2012) Multilingual Information Retrieval: From Research To Practice. Springer.
Raj, S. (2018) Building Chatbots with Python: Using Natural Language Processing and Machine
Learning. Apress.
Santilal, U. (2020) Natural Language Processing: NLP & its History (Kindle edition). Amazon.com.
Scott, B. (2018) Translation, Brains and the Computer: A Neurolinguistic Solution to Ambiguity
and Complexity in Machine Translation (Machine Translation: Technologies and Applications
Book 2). Springer.
Sportier, D. et al. (2013) An Introduction to Syntactic Analysis and Theory. Wiley-Blackwell.
Tuchong (2020a) The Turing Test. https://stock.tuchong.com/image/detail?imageId=921224657742331926. Accessed 14 May 2022.
Tuchong (2020b) NLP and AI. https://stock.tuchong.com/image/detail?imageId=1069700818174345308. Accessed 14 May 2022.
Turing, A. (1936) On computable numbers, with an application to the Entscheidungsproblem. In: Proc. London Mathematical Society, Series 2, 42:230–265.
Turing, A. (1950) Computing Machinery and Intelligence. Mind, LIX (236): 433–460.
Chapter 2
N-Gram Language Model

2.1 Introduction

NLP entities like word-to-word tokenization using NLTK and spaCy technologies in Workshop 1 (Chap. 10) analyze words in isolation, but the relationships between words are important in NLP. This chapter will focus on word sequences, their modeling and analysis.
In many NLP applications, there are noises and disruptions that regularly cause incorrect word pronunciations in applications like speech recognition, text classification, text generation, machine translation, Q&A chatbots, and Q&A conversation machines or agents used in auto-driving.
Humans experience mental confusion about spelling errors, as in Fig. 2.1, often caused by pronunciations, typing speeds, and keystroke locations. They can be corrected by looking up a dictionary, a spell checker, and grammar usage.
Applying word prediction to a word sequence can provide automatic spell-check corrections; its corresponding concept terminology can model word relationships and estimate occurrence frequencies to generate new texts with classification, and it can be applied in machine translation to correct errors.
Probability, or the word counting method, can work on a large databank called a corpus (Pustejovsky and Stubbs 2012), which can be a collection of text documents, literature, public speeches, conversations, and other online comments/opinions.

Fig. 2.1 Common spelling errors
1. It’s “calendar”, not “calender”.
2. It’s “definitely”, not “definately”.
3. It’s “tomorrow”, not “tommorrow”.
4. It’s “noticeable”, not “noticable”.
5. It’s “convenient”, not “convinient”.


A text with spelling and grammatical errors highlighted in yellow and blue is shown in Fig. 2.2. This method can calculate word occurrence frequency probabilities to provide substitutions of higher frequency probability, but it cannot always present accurate options.
Figure 2.3 illustrates a simple scenario of next word prediction for the sample utterances I like photography, I like science, and I love mathematics. The probability of I like is 0.67 (2/3) compared with I love at 0.33 (1/3), while the probabilities of like photography and like science are the same at 0.5 (1/2). Assigning probabilities to scenarios, I like photography and I like science are both 0.67 × 0.5 = 0.335, and I love mathematics is 0.33 × 1 = 0.33.
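These probabilities can be reproduced with a few lines of Python. The sketch below is only an illustration of the counting idea behind Fig. 2.3; it estimates each conditional probability from raw bigram and unigram counts over the three sample utterances.

from collections import Counter

utterances = ["I like photography", "I like science", "I love mathematics"]

unigrams, bigrams = Counter(), Counter()
for sentence in utterances:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p(next_word, previous):
    """Estimate P(next_word | previous) by count ratios."""
    return bigrams[(previous, next_word)] / unigrams[previous]

print(p("like", "I"))                              # 2/3 ≈ 0.67
print(p("love", "I"))                              # 1/3 ≈ 0.33
print(p("photography", "like"))                    # 1/2 = 0.5
print(p("like", "I") * p("photography", "like"))   # ≈ 0.33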

Fig. 2.2 Spelling and grammar checking tools

Fig. 2.3 Next word prediction in simple utterances

When applying probability to language models, it must always be noted that (1) domain-specific aspects, i.e. which keywords occur together and terminology knowledge, vary according to domains, e.g. medical science, AI, etc.; (2) syntactic knowledge attributes to syntax and lexical knowledge; (3) common sense or world knowledge attributes to the collection of habitual behaviors from past experiences; and (4) language usage is significant in high-level NLP.
When applying probability to word prediction in an utterance, candidate words are often proposed by rank and frequency to provide a sequential optimum estimation. For example:
[2.1] I notice three children standing on the ??? (ground, bench …)
[2.2] I just bought some oranges from the ??? (supermarket, shop …)
[2.3] She stopped the car and then opened the ??? (door, window, …)
The structure of [2.3] is perplexing because the word counting method with a sizeable knowledge domain is adequate, but common sense, world knowledge, or specific domain knowledge are among the required sources. It involves scenario-level syntactic knowledge, attributed to a superior level of description of the scene, such as descriptive knowledge, to help the guesswork. Although it is plain and mundane to study preceding words and word tracking, it is one of the most useful techniques for word prediction. Let us begin with a simple word counting method in NLP, the N-gram language model.

2.2 N-Gram Language Model

It was learnt that the motivations for word prediction can apply to voice recognition, text generation, and Q&A chatbots. The N-gram language model, also called the N-gram model or N-gram (Sidorov 2019; Liu et al. 2020), is a fundamental method to formalize word prediction using probability calculation. An N-gram is a statistical model that consists of a word sequence of N words; commonly used N-grams include the following (a short counting example follows this list):
• Unigram refers to a single word, i.e. N = 1. It is seldom used in practice because it contains only one word in the N-gram. However, it is important because it serves as the base for higher order N-gram probability normalization.
• Bigram refers to a collection of two words, i.e. N = 2. For example: I have, I do, he thinks, she knows, etc. It is used in many applications because its occurrence frequency is high and easy to count.
• Trigram refers to a collection of three words, i.e. N = 3. For example: I noticed that, noticed three children, children standing on, standing on the. It is useful because it contains more meaning and is not lengthy. Given the count knowledge of the first three words, one can easily guess the next word in a sequence. However, its occurrence frequency is low in a moderate corpus.

• Quadrigram refers to a collection of four words, i.e. N = 4. For example: I noticed that three, noticed that three children, three children standing on. It is useful with literature or a large corpus like the Brown Corpus because of their extensive word combinations.
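The counting example promised above is sketched below with NLTK's ngrams utility; this is an illustration only, and the tiny corpus and sentences are invented. It extracts bigrams and trigrams from a sentence and counts bigram occurrences over a few lines of text.

from collections import Counter
from nltk.util import ngrams

sentence = "I noticed three children standing on the ground".split()
print(list(ngrams(sentence, 2)))   # bigrams of the sentence
print(list(ngrams(sentence, 3)))   # trigrams of the sentence

tiny_corpus = ["it is not difficult", "it is easy", "it is not bad"]
bigram_counts = Counter()
for line in tiny_corpus:
    bigram_counts.update(ngrams(line.split(), 2))

print(bigram_counts[("it", "is")])    # 3
print(bigram_counts[("is", "not")])   # 2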
A sizeable N-gram can present more central knowledge but can pose a problem. If N is too large, the probability and occurrence of the word sequence are infrequent, and even 0 in terms of probability counts.
Corpus volume and other factors also affect performance. N-gram model training is based on an extensive knowledge base (KB) or databank from specific domains such as public speeches, literature, topic articles like news, finance, medicine, and science, or chat messages from social media platforms. Hence, a moderate N-gram is the balance between frequency and proportions.
The knowledge of counts acquired by an N-gram can be used to assess the conditional probability of candidate words as the next word in a sequence, e.g. It is not difficult. With a bigram, this means counting the occurrence of is given that it has already been mentioned in a large corpus, i.e. the conditional probability of is given it; this can be applied one word at a time to calculate the conditional probability of an entire word sequence. It is like the formation of words and sentences in day-to-day conversations, which is a psychological interpretation of logical thinking. The N-gram progresses in this orderly fashion.
It serves to rank the likelihood of a sequence consisting of various alternative hypotheses in a sentence/utterance for applications like automatic speech recognition (ASR), e.g. [2.4] The cinema staff told me that popcorn/amelcorn sales have doubled. It is understood that this refers to popcorn and not amelcorn because the concept of popcorn is always attributed to conversations about cinema. Since the occurrence of popcorn in a sentence/utterance has a higher rank than amelcorn, it is natural to select popcorn as the best answer.
Another purpose is to assess the likelihood of a sentence/utterance for text generation or machine translation, e.g. [2.5] The doctor recommended a cat scan to the patient. It may be difficult to understand what a cat scan is, or how a scan can be related to a cat, without any domain knowledge. Since the occurrence of doctor is attributed to medical domains, it is natural to search articles, literature, and websites about medical knowledge to learn that CAT scan refers to a computerized axial tomography scanner, as in Fig. 2.4, instead of a cat. This type of word prediction is often domain specific, with the preceding word used as guidance to select an appropriate expression.

2.2.1 Basic NLP Terminology

Here is a list of common terminologies in NLP (Jurafsky et al. 1999; Eisenstein 2019):
• Sentence is a unit of written language. It is a basic entity in a conversation or
utterance.

Fig. 2.4 Computerized axial tomography scanner (aka. CAT scan) (Tuchong 2022)

• Utterance is a unit of spoken language. Different from the concept of a sentence, an utterance is usually domain and culture specific, which means it varies across countries and even within a country.
• Word Form is an inflected form occurs in a corpus. It is another basic entity in
a corpus.
• Types/Word Types are distinct words in a corpus.
• Tokens are generic entities or objects of a passage. It is different from word form
as tokens can be meaningful words or symbols, punctuations, simple and distinct
character(s).
• Stem is a root form of words. Stemming is the process of reducing inflected, or
derived words from their word stem.
• Lemma is an abstract form shared by word forms in the same stem, part of speech,
and word sense. Lemmatization is the process of grouping together the inflected
forms of a word so that they can be analyzed as a single item which can be identi-
fied by the word’s lemma or dictionary form.
An example demonstrating the meaning representations of lemma and stem is shown in Fig. 2.5. Lemmatization is the abstract form used to generate a concept. It indicates that a stem or root word can be a meaningful word, meaningless, or a symbol such as inform or comput, which formulates meaningful words such as information, informative, computer, or computers.
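The contrast in Fig. 2.5 can be observed with NLTK's stemmer and lemmatizer. The sketch below is an illustration, assuming the WordNet data has been downloaded (e.g. via nltk.download("wordnet")); the exact outputs depend on the NLTK version.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["computers", "computing", "information", "informative"]:
    # The stem may be a meaningless root (e.g. "comput"); the lemma is a
    # dictionary form (e.g. "computer").
    print(f"{word:12} stem={stemmer.stem(word):10} "
          f"lemma={lemmatizer.lemmatize(word, pos='n')}")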
There are several corpora frequently used in NLP applications.
Google (2022) is the largest corpus as it contains words and texts from its search engine and the internet. It has over a trillion English tokens with over a million meaningful wordform types, sufficient to generate sentences/utterances for daily use.

Fig. 2.5 Stemming vs. lemmatization

The Brown Corpus (2022) is an important and well-known corpus because it is the first well-organized corpus in human history, founded by Brown University in 1961 with continuous updates. At present, it has over 583 million tokens, 293,181 wordform types, and words in foreign languages. It is one of the most comprehensive corpora for daily use, and a KB used in many N-grams and related NLP models and applications.
Further, there are many domain-specific corpora: the Wall Street Journal corpus is one of the earliest corpora to discover knowledge from financial news, the Associated Press corpus focuses on news and world events, Hansard is a prominent corpus of British Parliament speeches, and there are the Boston University Radio News corpus, the NLTK Corpora Library, and others (Bird et al. 2009; Eisenstein 2019; Pustejovsky and Stubbs 2012).
Let us return to word prediction. A language model is a traditional word counting model that counts and calculates conditional probability to predict the probability of a word based on a word sequence; e.g. when applied to the utterance it is difficult to… with a sizeable corpus like the Brown Corpus, the traditional word counting method may suggest say/tell/guess based on occurrence frequency. This traditional language model terminology is applied to predictions and forecasts in advanced computer systems and in research on specialized deep networks and models in AI. Although there has been a technology shift, the statistical model is always the fundamental model in many cases (Jurafsky et al. 1999; Eisenstein 2019).

2.2.2 Language Modeling and Chain Rule

Conditional probability calculation is to study the definition of conditional probabilities and look for counts, given by

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{2.1}$$

For example, to evaluate the conditional probability of that given the word sequence "The garden is so beautiful", we have

$$P(\text{that} \mid \text{The garden is so beautiful}) = \frac{P(\text{The garden is so beautiful that})}{P(\text{The garden is so beautiful})} = \frac{\text{Count}(\text{The garden is so beautiful that})}{\text{Count}(\text{The garden is so beautiful})} \tag{2.2}$$

Although the calculation is straightforward, if the corpus or text collection is moderate, this conditional probability (count) will probably be zero.
The chain rule of probability, used together with an independence assumption, is useful to rectify this problem.
By rewriting the conditional probability equation (2.1), it becomes

$$P(A \cap B) = P(A \mid B)\,P(B) \tag{2.3}$$

For a sequence of events A, B, C and D, the Chain Rule formulation will become

$$P(A, B, C, D) = P(A)\,P(B \mid A)\,P(C \mid A, B)\,P(D \mid A, B, C) \tag{2.4}$$

In general:

$$P(x_1, x_2, x_3, \ldots, x_n) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_1 \ldots x_{n-1}) \tag{2.5}$$

If the word sequence from position 1 to n is defined as $w_1^n$, the Chain Rule applied to the word sequence will become

$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1}) \tag{2.6}$$

So, the conditional probability for the previous example will be

$$\begin{aligned} P(\text{the garden is so beautiful that}) = {} & P(\text{the}) \times P(\text{garden} \mid \text{the}) \times P(\text{is} \mid \text{the garden}) \\ & \times P(\text{so} \mid \text{the garden is}) \times P(\text{beautiful} \mid \text{the garden is so}) \\ & \times P(\text{that} \mid \text{the garden is so beautiful}) \end{aligned} \tag{2.7}$$

Note: Normally, <s> and </s> are used to denote the start and end of sentence/
utterance for better formulation.
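A small counting sketch makes Eq. (2.6) concrete. The toy corpus and sentences below are invented purely for illustration; each factor P(w_k | w_1 … w_{k-1}) is estimated as Count(w_1 … w_k) / Count(w_1 … w_{k-1}), which also exposes how quickly such long-prefix counts fall to zero, the problem discussed next.

from collections import Counter

toy_corpus = [
    "<s> the garden is so beautiful that it hurts </s>",
    "<s> the garden is so beautiful that we stayed </s>",
    "<s> the garden is so small </s>",
]

# Count every sentence prefix w_1 ... w_j
prefix_counts = Counter()
for line in toy_corpus:
    words = line.split()
    for j in range(1, len(words) + 1):
        prefix_counts[tuple(words[:j])] += 1

def chain_rule_probability(sentence):
    """Product of P(w_k | w_1 ... w_{k-1}) factors, conditioned on <s>."""
    words = sentence.split()
    prob = 1.0
    for k in range(1, len(words)):
        history = tuple(words[:k])          # w_1 ... w_{k-1}
        extended = tuple(words[:k + 1])     # w_1 ... w_k
        if prefix_counts[history] == 0:
            return 0.0                      # unseen prefix: sparsity problem
        prob *= prefix_counts[extended] / prefix_counts[history]
    return prob

print(chain_rule_probability("<s> the garden is so beautiful that"))  # 2/3
print(chain_rule_probability("<s> the garden is so big"))             # 0.0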
This method seems fair and easy to understand but poses two major problems. First, it is unlikely that we can gather the right statistics for such long prefixes, which means we do not know how often each full history actually occurs. Second, the calculation of word sequence probability is mundane: if the sentence is long, the conditional probability at the end of this equation is complex to calculate.
Let us explore how the ingenious Markov chain is applied to solve this problem.
Random documents with unrelated
content Scribd suggests to you:
Imperial (Berlín, MDCCCXC). Larrainzar, Estudios sobre la Historia de
América, etc. (México, 1875-78). H. Strebel, Alt. México (Hamburgo,
1885). Waitz, Amerikaner, vol. II (1864). Ad. Bastian, Culturlander
des alten America (Berlín, 1878). Las obras citadas en las notas del
presente capítulo y en las de los referentes á la "Vida Psíquica" del
Indio Americano (IV-V).
Fuentes.—Bernal Díaz del Castillo, Verdadera Historia de la
Conquista de la Nueva España (Hist. Prim. Ind. II). Icazbalceta, Coll.
de Documentos para la Historia de México (1858-66). Id., Nueva
Colección de Documentos (1886-92). Pacheco y Cárdenas, Coll. de
Documentos. Ternaux-Compans, Voyages, relations et memoires
originaux, etc. Obras Históricas de Don Fernando de Alva Ixtlilxochitl
(Ed. Alfredo Chavero). Diego Muñoz Camargo, Historia de Tlascala
(Ed. Alfredo Chavero). Fr. Bernardo de Lizana, Hist. del Yucatán (Ed.
Mus. Nac. México). Dorantes, Sumaria Relación de las cosas de
Nueva España (Ed. Mus. Nac. México). Gaspar de Villagra, Hist.
Nueva México (Museo Nacional México). Los Anales del Museo
Nacional de México, (1.ª época, vols. I á VII, y 2.ª época, vols. I á
V). Crónica Mexicana, escrita por D. Hdo. Alvarado Tezocomoc hacia
el año MDXCVIII, anotada por Orozco y Berra, etc. (Edición Vigil,
México). Sahagún, Hist. General de las cosas de la Nueva España
(Ed. Jourdanet y Simeón, París, 1880). Boturini, Idea de una Nueva
Hist. Gen. de la Amca. Sepnal. (Ed. Madrid, 1746). Clavijero, Historia
Antigua de México, etc. (Ed. Española, Londres, 1826). Hdo. Cortés,
Cartas de Relación (Hist. Prim. de Indias). Landa, Rel. de las cosas
del Yucatán (Ed. de la Rada y Delgado, Madrid, 1884). Fuentes y
Guzmán, Hist. de Guatemala ó recordación Florida, etc. (Ed. J.
Zaragoza, Madrid, 1882-83). Alonso de Zurita, Rapports sur les
differents classes de chefs, etc. (Ed. Ternaux Compans, París, 1840).
Fray Gerónimo de Mendieta, Hist. Eclesiástica Indiana (Ed.
Icazbalceta, México, MDCCCLXX). Los preciosos Manuscritos de la
Bca. Escurialense, relacionados y descritos críticamente por el P. M.
Gutiérrez (La Ciudad de Dios, vol. LXXXI núms. Abril 5-20, Mayo 5-
20, Junio 5-20-1910). Los Ms. de la Colección Muñoz (Ac. de la
Historia), vols. II, III, IV (Ixtlilxochilt); VII, VIII (Mem. Nueva
España); IX, X, XI, XII, XIV, XVI (Pimas); XVII (P. Kino); XXII, XXIII,
XXIX (Cohahila); XXX, XXXI, XXXIX (Zapolitatlan); XLI (Alonso de
Çorita, Relación, 1633); XLII (Orden sucesión en terrenos y baldíos),
etc. Col. Mata Linares, vol. I, XXXIX, XLI, XCXXIX, XCXXXVI, etc.
Bca. Nacional Madrid, Ms. (I. 43), (I. 89), (I. 116), (I. 28, 29, 31),
etc. Colecciones García Figueroa (Ac. de la Hist., Madrid). Bureau of
Am. Etnology, Report 3 (Thomas, Mtos. Mayas); 1 (Central American
Picture writing, etc.); 16 (Thomas, Maya Códices); 19 (Symbols Maya
Year; Mounds Northern Honduras; Calendario Maya) y Boletín 28-
1904 (Descrip. Colecciones Seller), etc., etc.
Códices indígenas.—Los citados en las notas del presente
capítulo; los llamados de "Porfirio Díaz", "Baranda", "Dehesa",
publicados por la Junta Colombina México (México, 1892); El
Fejervary-Meyer, Museo de Liverpool (Ed. Duc. de Loubat, Berlín,
1901); el Codex Nuttall (Cambridge, Mass., 1912), el Codex Osuna
(Madrid, 1878), etc., etc.
Bibliografías.—Winsor, op. cit., I, pág. 153 y sig. y apéndices I, II,
pág. 397 y sig. Icazbalceta, Bibliog. Mexicana del siglo xvi (México,
1886). Bancroft, Native Races, vol. V-136, etc. Bca. Hisp. Americana
Sepnal. de Beristain y Souza (Ed. Vera-Amecameca, 1883). Leclerc,
Biblioteca Americana, etc. (París, 1878). Las notas de Bandelier (10,
11, 12 Rep. Peabody Museum). Field, Essay towards an Indian
Bibliog. (N. Y., 1873). Fischer, Bca. Mexicana, etc. (Londres, 1869).
Pinart, Catalogue de livres rares et precieux, etc. (1883, París). Los
Catálogos de Hiersemann, Quaritch, etc., y las citadas en los
capítulos anteriores (Títs. I y II).
CAPÍTULO VIII

TRIBUS DE LA AMÉRICA DEL SUR


(DIVISIÓN DEL ATLÁNTICO)

1.—Observaciones generales. 2.—La región Amazónica. 3.—


La familia Tupi-Guarani. 4.—Los Tapuyas. 5.—Arawaks. 6.—
Caribes ó Karinas. 7.—Las tribus del alto Orinoco y alto
Amazonas. 8.—Las de las mesetas Bolivianas. 9.—La Región
Pampeana. 10.—Las tribus del Gran Chaco. 11.—
Pampeanos y Araucanos. 12.—Patagones y Fueguinos. 13.
—Los Calchaquies.

Observaciones generales.

1.—Conformes están los modernos etnólogos en circunscribir las


culturas aborígenes de la América del Sur, á la zona geográfica
llamada Andina, que se extiende desde Chile y las Provincias
Argentinas Mediterráneas, hasta más allá de las mesetas de
Colombia. Las tribus de esta región llegaron antes del
Descubrimiento á los grados superiores del barbarismo; formaron
curiosos organismos sociales y construyeron curiosos edificios.
Fig. 256.—La primera representación gráfica conocida de los
Aborígenes Americanos (Augsburgo 1497 á 1503).

En cambio, las tribus del Este de la referida Zona Andina, vivieron,


salvo raras excepciones, en estado salvaje; construyeron sólo
rudimentarias chozas, su vida social fué nula y su existencia física
abyecta.
Estos evidentes contrastes nos autorizan á dividir en primer lugar las
agrupaciones Sud-Americanas primitivas, en dos grandes Secciones
Geográficas, la del Océano Atlántico y la del Pacífico[404], que
estudiaremos separadamente.
La clasificación lingüística de la multitud de tribus que poblaron estas
dos grandes Secciones ofrece dificultades insuperables. El irritante y
extraordinario número de lenguas irreducibles desconocidas ó no
estudiadas, su irregular distribución en el Continente, la facilidad de
los movimientos emigratorios de las diversas tribus á lo largo de sus
enormes vías fluviales, la natural inestabilidad y despreocupación de
los primitivos colonos Europeos, etcétera, etc., han hecho hasta
ahora infructuosos los admirables esfuerzos científicos de antiguos y
modernos filólogos para establecer una clasificación exacta de las
Sud-Americanas lenguas[405].
Fig. 257.—Niño Indio (Época actual).

Teniendo esto en cuenta, y con el único fin de sistematizar en lo


posible nuestro estudio de la América Aborigen, adoptaremos la
clasificación que de las tribus Sud-Americanas hace Brinton, fijando
como siempre nuestra atención en aquellas agrupaciones tribales,
más cuidadosamente estudiadas y de mayor interés por sus
asociaciones históricas.
Distingue el mencionado filólogo en el Grupo del Atlántico dos
regiones (Amazónica y Pampeana) y otras dos (Colombiana y
Peruana) en el Grupo del Pacífico[406].
Fig. 258.—Danza ceremonial.

La Región Amazónica.

2.—Comprende la Región Etnológica, llamada Amazónica, los


inmensos territorios regados por el Amazonas, el Orinoco y sus
numerosísimos y caudalosos afluentes, incluyendo los Estados de
Santa Cruz y el Beni, en Bolivia, casi todos los del Brasil, los de
Venezuela y las Guayanas y las grandes y pequeñas Antillas. Los
extensísimos bosques y prodigiosos valles tropicales de estos dos
colosales sistemas hidrográficos ofrecían al hombre primitivo
abundantísima caza y pesca, sabrosísimos frutos y abundancia de
naturales recursos. Tales facilidades de vida, unidas al efecto
depresivo de un clima ardiente y húmedo, enervaron, sin duda, las
actividades de los aborígenes, haciéndoles perezosos y nómadas.
Por otra parte, los miles de kilómetros de vías fluviales navegables
que caracterizan esta parte del Continente Sud-Americano,
proporcionaron á las tribus comunicaciones naturales y fáciles, que
aprovecharon para diseminarse en dilatadas regiones geográficas.
No es extraño, pues, que encontremos en esta Sección algunas
familias lingüísticas cuyos miembros llegaron á grandes distancias de
su probable lugar de origen. De entre ellas las más conocidas y
dispersas son la "Tupi-Guarani", la Tapuya, la Arawak y la Caribe,
cuyas peculiaridades etnológicas, etc., indicaremos
sucintamente[407].

Fig. 259.—Danza del Escudo "Warraus" (Guayana Británica).

La familia Tupi-Guarani.

3.—La célebre familia lingüística de los Tupis, Guaranis, Baranis,


Curios, etc., fué una de las más notables, extendidas y numerosas
de toda Sudamérica. Desde las Guayanas al Paraguay y desde las
mesetas del Brasil á las costas de Bolivia, se hallaba, con más ó
menos variantes, la llamada "Lingua geral do Brasil", derivada
esencialmente de la de los Tupis, y una de las más suaves, musicales
y flexibles de las conocidas en América.
Vivían estos indígenas en aldeas provisionales, llamadas "Tabas",
compuestas de miserables y escasos ranchos, que se abandonaban
por conveniencia. Las aldeas abandonadas se denominaban "taperâs
ó taboeiras". Cultivaban el algodón, el maíz y la mandioca y eran
aficionadísimos al tabaco, que fumaban en pipa, mezclado con otras
yerbas. Los Omaguas y Cocamas, de cabezas deformadas "como
mitras", enseñaron á los Europeos los usos del "caout-chout", del
que hacían vestidos, sandalias, etc.; trabajaron hábilmente los
metales y vivieron en aldeas permanentes.
Las demás tribus de la familia Tupi-Guarani, no pasaron de los
grados inferiores del barbarismo. Algunas de sus alfarerías, sin
embargo, (igasanas) pueden competir con las mejores de Sud-
América.

Fig. 260.—Indios Caribes (Akawais).

Their social organization did not differ in essence from that of the other American tribes. The "morubixabá", or war chief, held absolute authority in time of war; in peacetime his authority was limited by the decisions of the Council ("nheemougaba"). Chieftainships were generally hereditary, their holders forming within the tribe a privileged social class distinct from the "mboyás", or common people. They were cannibals and polygamists, without restraint or limit; they lived communally within the tribal enclosures, and knew how to build rough, strong canoes. Some of these tribes also knew how to protect their provisions from the periodic floods of their great rivers by burying them in those deep pits or silos peculiar to the Amazonian tribes. They recognized a superior power (Tupá, "Who art thou?") and a multitude of active and malignant spirits; they preserved the bones of certain famous sorcerers (pagés, piages, or caraibes) in special, isolated huts, attributing oracular powers to them and paying them special reverence. Their mythology was rich and imaginative, and like most American tribes they awaited a redeemer or extraordinary teacher who was to come from distant lands (Sumé).

With the exception of the tribes bordering the Inca dominions (Omaguas, Chirihuanos, etc.), all of them went without clothing, though they were exceedingly fond of ornament, music, and dancing, intoxicating themselves on such occasions with parica snuff (Turas, Río Madeira) or the fermented juices of the "curupá" (Omaguas) and various other plants[408].

Fig. 261.—Ona Indians (Tierra del Fuego).

The Chiriguanos or Chirihuanos, whose military prowess and cannibal ferocity inspired such deep terror in the Quechua warriors, are historically famous for their stubborn resistance to the ten thousand fighting men of the Inca Yupanqui and to the soldiers of Viceroy Toledo[409].

Fig. 262.—Yaghan Hut (Tierra del Fuego).

The Tapuyas.

4.—Rivalling the Tupi or Guarani family in antiquity and extent is that of the Tapuyas ("enemies"), whose numerous bands peopled, and in part still people, the South American continent from 5° to 20° south latitude and from the Atlantic Ocean to the Río Xingú (Pará, Matto-Grosso, Goyaz, etc.). They were also known by the names Crens or Guerens ("the ancients", "the old people"), perhaps because it was supposed that before the arrival of the Tupis the Tapuyas were masters of the Atlantic coast, whose shell mounds (sambaquis) they appear to have built.

The physical appearance of the Tapuyas was not altogether unpleasant, and the shape of their skulls is identical to that of the skulls discovered in the deposits of Lagoa-Santa, held to be pre-glacial[410]. Some tribes of this family, such as the so-called Botocudos, deformed their lower lips so horribly with "botoques", or pieces of stone or wood, that to European eyes they could not help appearing repugnant.

In general these groups did not rise above savagery. They lived naked, without any definite tribal organization or any dwellings other than the natural shelters of the forest. They made neither pottery nor canoes. They were cannibals by custom and nomads by temperament. They were, on the other hand, highly skilled hunters, and among the few native groups that knew how to use torches of vegetable fibre coated with beeswax. Although lacking definite religious ideas, they buried their dead with care and venerated with fear the disembodied souls of their chiefs.
Fig. 263.—Yaghan Indian arranging his harpoon (Hyades and Deniker).

The language of the Tapuyas is phonetically difficult and contrasts with the other American languages in its tendency toward isolating forms and its small proportion of agglutinative words. Some of these savage bands still live in the vicinity of the Madeira, Tapajos, and Dulce rivers, on the southern banks of the Amazon (Mundrucus, Paiguizé), and in the woodlands of the Yapurá and the Putumayo (Miranhas, etc.)[411].
Fig. 264.—Guarani or Cario Indians (Schmidel).

The Arawaks or Maipures.

5.—The linguistic family of the Arawaks or Maipures is likewise one of the most widespread in South America. Its tribes occupied part of the upper Paraguay (Guanas, etc.) and the Bolivian plateaus (Moxos, etc.), and extended, almost without interruption, as far as the Greater and Lesser Antilles and the Lucayas or Bahamas.

They were the first American aborigines whom the European discoverers encountered. The Indian words collected by Columbus and his companions in Haiti, Cuba, etc., belong to the dialectal forms of this linguistic family.

The culture of the Arawaks or Maipures was, in general, superior to that of the Tupis and Tapuyas. They cultivated maize, tobacco, and manioc. They knew how to weave cotton into fine cloth, and their stone weapons were remarkably well polished. They worked gold, made wooden masks, carved idols, and built canoes.
Fig. 265.—Calchaqui topu (Ambrosetti).

Some groups (in the Guianas) were organized tribally, with matriarchy, clans, and a totemic system. Their houses (not communal) were furnished with hammocks, fibre mats, and relatively well-made pottery. They had a rich mythology, well-defined dances and rites, and places set aside as cemeteries. The best known and most notable tribes of this family are the Antis or Campas of the "Gran Pajonal" (Ucayali, Pachitea rivers, etc.), who knew how to tame monkeys, parrots, tapirs, etc., and lived with them in their huts; the Guanas of the upper Paraguay, peaceable and intelligent; the Tarumas (British Guiana), famous for their pottery and their fine hunting dogs; the Maipures proper; and the Moxos of the upper Mamoré, heroically evangelized by the Jesuit missionaries[412].

Fig. 266.—The whip game (Arawaks).


The Caribs.

6.—The Caribs or Karinas, neighbours and implacable enemies of the Maipures, etc., spread from the Guianas to the Antilles and the Lucayas. At the time of Columbus's discovery their dialects were spoken in the above-mentioned islands and on the mainland, from the mouth of the Essequibo river to the Gulf of Maracaibo and inland through the said Guianas. According to the early missionaries, the Cumanagoto dialect (Cumaná or Nueva Andalucía) was current throughout these regions as far as beyond Caracas.

Fig. 267.—Chiriguanos and Matacos.

The culture of most of these tribes, whose ferocity has become legendary ("cannibal" derives from Karina), was very similar, and perhaps superior, to that of their neighbours the Arawaks, etc. Their canoes were large and very seaworthy; they knew how to weave hammocks of cotton or pita fibre, with their cords and fringes, to till the soil, and to make remarkable pottery. The famous petroglyphs of the Essequibo and of the island of St. Vincent are attributed to the Caribs by the majority of archaeologists.

The magico-religious rites of these tribes (Cumanagotos, etc.) were well defined and complex. They offered maize to the sun and the moon; they had their sorcerers (piayes) and their fetishes, and they cremated their dead with ceremony.

The basis of their social organization was the group or groups of kinsfolk (the exogamous clan), who lived together in large round houses with partitions, built of wood and roofed with palm. In some places (the deltas of the Orinoco, etc.) they raised them on posts over the water, like the prehistoric lake dwellers of Europe. The war arrows of the Caribs were tipped with a poison so deadly and active that the merest scratch made the wound incurable. The cannibalism of these tribes was purely ritual, a consequence of victory in war. Their ordinary foods were cassava bread, plantains, fish, and game. They were very fond of music and song; they painted themselves in imitation of animals (their "totems"), pierced their ears and the cartilage of their noses, reckoned the months by the moons, and told the seasons by the stars[413].
Fig. 268.—Macusi Indians (Caribs).

Tribes of the Upper Orinoco and the Upper Amazon.

7.—The vast llanos of Venezuela form part of the enormous basin of the tributaries of the Amazon and the Orinoco. They are covered with very tall grasses and dense forests, which the winter rains turn into swamps and the summer heat into unhealthy marshes. These unexplored regions were, and still are, peopled by scattered savage groups of uncertain philological affinities. In the pages of travellers and in the chronicles of the Missions of this district (the old territory of Caqueta) we find countless names of tribes that have disappeared or been transformed, and whose classification is impossible.

Fig. 269.—Ona Indian woman (Tierra del Fuego).

The same may be said of the confused tribes of the upper Amazon. No regions of the American continent are more disheartening to the historian and the philologist. The data at our disposal are so contradictory, and tribal changes so rapid and continuous, that it is a vain pretension to try to reconcile the accounts of the old chroniclers with the observations of modern ethnologists.

Of these tribes the best known or best studied are the Otomacos of the Río Meta; the gypsy-like Guahibos of the Casanare; the Panos of the Ucayali; the Cashibos of the Aguaitía, repugnant endocannibals; the indomitable Jibaros (Pastaza, Santiago rivers, etc.), whose strange war drums and curiously shrunken heads are admired in museums to this day; and the Maynas or Mayorunas, etc., subdued by Diego de Vaca near the old San Francisco de Borja (1616), evangelized with heroic labours by Franciscans and Jesuits, and commemorated by the glorious martyr Francisco de Figueroa in a precious and truthful historical account[414].

The Plateaus of Bolivia.

8.—The eastern region of the Republic of Bolivia, watered by the Beni, the Mamoré, and the other tributaries of the mighty Madeira, was peopled by a multitude of tribes of different linguistic families. The best known among them are the Chiquitos, who lived chiefly in the region that bears their name, between 16° and 18° south latitude, from the sources of the Paraguay river to the territory of the Incas.
Fig. 270.—Timbu Indians (Schmidel).

Subdued by Nuño de Chaves (1557), these tribes formed the principal nucleus of the Jesuit Reductions of this district, adopting sedentary and agricultural habits with relative ease. Their language, extremely flexible, served as the medium or vehicle for the Christianization of the neighbouring tribes (Yurucarés, Arounas, Morotocos, etc.), which, thanks to the ardent and self-denying zeal of the Jesuits, were gradually gathered into the permanent villages of the Chiquitos, to whose language the missionaries sought to reduce their barbarous dialects.
Fig. 271.—Querandi Indians (Schmidel).

The glorious death of Father Arce and his heroic companions, the soul of these incipient Christian communities, the invasions of the Paulistas and slave traders, the dissolution of the Society of Jesus, and the mournful events that followed did not succeed in wholly extinguishing the villages of the Chiquitos, who, to the number of 20,000 or 30,000, live to this day in part of their tribal territories and preserve the cooperative or communal régime that their missionaries established[415].

The Pampean Region.

9.—South of the highlands that separate the waters of the lower Amazon from those of the tributaries of the Plata, the continent spreads out in immense plains watered by numerous navigable rivers. From north to south this region, called the Pampean region, comprises the territories of the Gran Chaco, the famous Pampas from the Salado river to the Río Negro, and the rocky, barren deserts of Patagonia and the Antarctic solitudes. It is bounded on the east by the Atlantic Ocean and on the west by the Cordillera of the Andes.

Fig. 272.—Arouna Indians.


Its native tribes form a distinct ethnographic section, different from the Peruvian and perhaps remotely related to the Amazonian. To facilitate the study of these tribes, and without any dogmatic pretension, we may classify them into three geographical groups with more or less definite boundaries. The first of these groups comprises the tribes of the Gran Chaco; the second, the Araucanian and Pampean tribes proper; and the third, the Fuegian and Patagonian tribes.

Tribes of the Gran Chaco.

10.—The name Gran Chaco is given to the region that extends from the Salado river northward to about 18° south latitude, bounded on the east by the Paraguay and Paraná rivers and on the west by the Andean Cordillera. It is an undulating country of great plains and dense forests, abundantly watered by three fine rivers: the Pilcomayo, the Salado, and the Vermejo, which divide it from north-west to south-east into three nearly parallel belts (Chaco Boreal, Central, and Austral), though of unequal extent[416]. The mildness of its climate, the abundance of game in its tangled forests, and the excellent fishing of its rivers and lakes made life easy for the native tribes who densely peopled it. Setting aside the tribal groups philologically related to the Tupis or Guaranis, the principal linguistic families of the Gran Chaco are those of the Matacos, Lules, Charruas, and Guaycurus.
Fig. 273.—Staff of command (Ambrosetti).

The Matacos lived in populous settlements spread along the banks of the Vermejo. They were less strong and less tall than the generality of the Gran Chaco Indians. In the words of their missionaries, they were vile, untamable, savage nations, refractory to all Christianization. They live to this day, though much reduced in numbers, in their primitive woodlands, preferring the life of the nomad gypsy to the sedentary life of the farmer.

The once powerful nation of the Lules dwelt chiefly on the banks of the Salado and the Tabiriri. Evangelized first by the famous Father Bárcena, they fled to their forests and only reappear years later in the history of the Missions of the Chaco (College of Córdoba de Tucuman); nor can it be affirmed with certainty that the Lules or Tonicotes studied by the Jesuits of the eighteenth century (Father Machoni) are the same people whom Father Bárcena evangelized.

To the Charrua nation, bloodily famous in the history of the Río de la Plata, belonged the formidable Yaros, Chanes, Bohanes, etc. Their tribes were also very numerous. They used the throwing bolas and the arrow with terrible precision; they were in general ignorant of pottery making and lived in the most wretched huts. They were great hunters, incorrigible vagabonds, bloodthirsty and daring in war, cunning, inconstant, vain in the extreme, and given to gambling and drunkenness. Alone or allied with other tribes, they resisted the advances of the conqueror with indomitable fury.

Fig. 274.—Pottery (Upper Amazon).

To the widespread linguistic family of the Guaycurus belonged, among other tribes, the Abipones, brilliantly studied by one of their missionaries (Dobrizhoffer); the fierce Tobas, who still people part of the Gran Chaco, sheltered in its thickets; the Vilelas, of the Salado river (25° to 26° south latitude); and the famous Querandies[417], whose history was short and mournful.

With the exception of the Payaguás (Paraguay river), tribes essentially of swimmers and boatmen, with a curious mythology and customs, all the Indians of the Chaco were admirable horsemen. The rapid spread of the horse in America favoured their wandering, warlike habits. True centaurs of the wilderness, their steeds and their war lances were for a long time a source of constant anxiety and terror to the European.

For the rest, the Indians of the Gran Chaco did not, in general, rise above the higher grades of savagery. In some of their groups we find traces of totemism and exogamy. They obeyed their caciques, were fetishists, venerated the spirits of their dead, and feared their sorcerers and wizards[418].

Pampeans and Araucanians.

11.—South of the Gran Chaco, at about 35° latitude, begins the region of the Pampas. We shall not pause to describe the grandiose beauty of its sea-like plains, the endless variety of its pastures, and the deep serenity of its boundless deserts. It is useful, however, to recall these physiographic features of the Pampa in order better to understand the peculiarities of its aborigines.

Fig. 275.—Carib Indians.

A single linguistic family (Auca or Aucanian) occupied, at intervals, these vast lands. To it belong not only the "Pampas" proper (Guarpes, Moluches, Pehuenches, Ranqueles, etc.) of the Argentine Republic, but also the famous Araucanians or Mapuches of the south of the Chilean Republic.
Fig. 276.—Map of South America from the Latin edition of Schmidel (1599).

Formaban los "Pampas" hordas nómadas y bárbaras que se


estacionaban en míseras tolderías mientras duraban sus
subsistencias y emprendían despiadados merodeos cuando el
hambre ó la ocasión les incitaba al pillaje y la guerra. Fueron
asombrosos ginetes. Sirvióles el caballo de medio de transporte y
terrible elemento de guerra; aprovecharon su piel para múltiples
usos, y su carne y su sangre para alimento. Fueron siempre
indómitos, errabundos, ladrones, borrachos y abyectos. Refractarios
á toda cultura, vivieron del saqueo y la matanza, temiendo sólo á
sus hechiceros y caciques, creyendo en sus "gualiches" y
repugnantes brujerías, degollando sin piedad y peleando sin
concierto. Salvo los Moluches ó Manzaneros (Río Limay, etc.),
sedentarios y agricultores, las demás tribus "Pampas" sólo supieron
cultivar su astucia de serpientes, su temeridad de leones y su
crueldad de felinos carniceros.

Fig. 277.—Calchaqui bronze plaque (Coll. Lafone Quevedo).

La "Expedición al Desierto", del dictador Rosas (1833), debilitó un


tanto los salvajes bríos de estos indios; pero volvieron bien pronto á
asolar los territorios de la República, hasta que el general, J. A. Roca
y sus esforzados compañeros Villegas, Lavalle, Winter, Lagos, etc.,
merced á habilísimo plan de combate, y después de años de fatigas
abnegadas (1874-1885), consiguieron aniquilar el feroz poderío de
los principales caciques, izar la Bandera Nacional en los últimos
baluartes de su irreducible barbarie, y abrir en consecuencia miles
de leguas de feraces y hermosísimos campos á su actual estado de
civilización y progreso[419].
The indomitable Araucanians, as Ercilla called them, or Mapuches ("men of the land"), as they called themselves, occupied in the sixteenth century the greater part of the territory of the Chilean Republic, from the present province of Coquimbo to that of Chiloé inclusive (29° to 45° south latitude). Divided locally into tribes of the north, of the south, etc. (Picunches, Huiliches, etc.), they all spoke dialects of the same language (Chilidegu), exaggeratedly praised by some, but undoubtedly soft, harmonious, flexible, and well suited to oratory, of which those warriors were so fond. To this day this curious language is spoken by nearly one hundred thousand individuals of pure native race who live in the Chilean district of Arauco.

Fig. 278.—Carib (Guianas).


The Mapuche tribes are famous in American history for their epic struggles, first with the Inca conquerors (Huayna Capac, Tupac-Yupanqui) and later with the Spaniards, and they attained a level of culture unquestionably superior to that of their kindred of the Pampas[420].

The Mapuches lived in huts (rucas) of wood or straw, widely separated from one another and forming hamlets or villages (lov) on the banks of rivers and streams. In each ruca lived one family; patriarchy prevailed, and the condition of the women was inferior and hard. The women tilled the land (maize, potatoes, etc.), wove durable blankets (chamales), made pots and baskets, and in general performed all the heavy tasks of barbarous life, while their husbands, sons, or brothers hunted, fished, or prepared their continual war expeditions.
Fig. 279.—Map of the Gran Chaco by Father José Jolis (1789).

The Araucanians had supreme and secondary chiefs for peace and for war. The authority of these chiefs (toquis), almost always hereditary, was limited by the Council and by tribal usages and customs (ad-mapos). Sorcerers and healers were consulted and feared. Before the ruca of the sorceresses (machis) altars (rehué) were built, where animals and men were sacrificed to the spirits of the dead or other spirits