Natural Language Processing in The Real World Text Processing, Analytics, and Classification
This book covers the basic concepts behind NLP and text processing and discusses the appli-
cations across 15 industry verticals. From data sources and extraction to transformation and
modeling, and classic Machine Learning to Deep Learning and Transformers, several popular
applications of NLP are discussed and implemented.
This book provides a hands-on and holistic guide for anyone looking to build NLP solutions,
from students of Computer/Data Science to those working as Data Science professionals.
CHAPMAN & HALL/CRC DATA SCIENCE SERIES
Reflecting the interdisciplinary nature of the field, this book series brings together researchers,
practitioners, and instructors from statistics, computer science, machine learning, and analyt-
ics. The series will publish cutting-edge research, industry applications, and textbooks in data
science.
The inclusion of concrete examples, applications, and methods is highly encouraged. The scope
of the series includes titles in the areas of machine learning, pattern recognition, predictive ana-
lytics, business analytics, Big Data, visualization, programming, software, learning analytics,
data wrangling, interactive graphics, and reproducible research.
Published Titles
Urban Informatics
Using Big Data to Understand and Serve Communities
Daniel T. O’Brien
Introduction to Environmental Data Science
Jerry Douglas Davis
Hands-On Data Science for Librarians
Sarah Lin and Dorris Scott
Geographic Data Science with R
Visualizing and Analyzing Environmental Change
Michael C. Wimberly
Practitioner’s Guide to Data Science
Hui Lin and Ming Li
Data Science and Analytics Strategy
An Emergent Design Approach
Kailash Awati and Alexander Scriven
Telling Stories with Data
With Applications in R
Rohan Alexander
Data Science for Sensory and Consumer Scientists
Thierry Worch, Julien Delarue, Vanessa Rios De Souza and John Ennis
Big Data Analytics
A Guide to Data Science Practitioners Making the Transition to Big Data
Ulrich Matter
Data Science in Practice
Tom Alby
Natural Language Processing in the Real World
Text Processing, Analytics, and Classification
Jyotika Singh
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2023 Jyotika Singh
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright
holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowl-
edged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
Contents
List of Figures xv
Preface xxiii
Acknowledgments xxix
1.5 SETUP 26
1.6 TOOLS 27
3.2 VISUALIZATION 85
3.3 DATA AUGMENTATION 87
3.4 DATA TRANSFORMATION 89
3.4.1 Encoding 90
3.4.2 Frequency-based vectorizers 92
3.4.3 Co-occurrence matrix 94
3.4.4 Word embeddings 95
Bibliography 337
Index 357
List of Figures
8.1 Books whose descriptions were used to build our LDA model. Source
doc1 [23], doc2 [82], doc3 [90], doc4 [76]. 232
8.2 The book used to test our LDA model. Source [188]. 234
8.3 The book description to test our LDA model. 234
8.4 Confusion matrix for spam vs ham classification model using Multi-
nomial Naive Bayes classifier. 245
8.5 Training and validation accuracy and loss for ham/spam CNN model. 252
8.6 Curating labeled data using clustering experiments. 254
8.7 Where data science modeling fits within a business’s goal and its
driving factors. 262
9.9 nlu.yml intents related to greeting, user agreement, and user disagree-
ment. 278
9.10 nlu.yml intents related to pizza ordering. 279
9.11 RASA pizza-ordering chatbot - sample conversations. 282
9.12 RASA chatbot conversation with typos. 282
9.13 RASA chatbot bad conversation samples. 282
10.1 Performing comment review analysis from a company KPI perspective. 288
10.2 Data science tasks breakdown for customer review analysis project. 289
10.3 Data science tasks breakdown for customer review analysis project
(sentiment analysis). 292
10.4 Word cloud for positive comments. 293
10.5 Word cloud for negative comments. 293
10.6 Word cloud for positive comments (nouns only). 295
10.7 Word cloud for negative comments (nouns only). 295
10.8 Data science tasks breakdown for customer review analysis project
(identification of topics and themes). 296
10.9 Room-word cloud. 299
10.10 Location-word cloud. 299
10.11 Service and staff-word cloud. 300
10.12 Data science tasks breakdown for customer review analysis project
(curating training data). 301
10.13 Confusion matrix for hotel review classification model. 302
10.14 Data science tasks breakdown for customer review analysis project
(training a classification model). 302
10.15 Data science tasks breakdown for customer review analysis project
(model evaluation). 304
10.16 Data science tasks breakdown for customer review analysis project
(pipeline). 305
10.17 Data science tasks breakdown for customer review analysis project
(curating training data). 307
11.7 Next word prediction BiLSTM model accuracy and loss at 20 and 40
epochs. 323
11.8 Next word prediction output from the BiLSTM model with the pre-
dicted words in bold. 324
Preface
In the modern day, data digitization has scaled and there are means to store every
interaction happening across the world. Text data is heavily generated across the
globe. Some common sources of text data include social media data, consumer inter-
action, reviews, articles, documents, emails, and others. More and more businesses
have started leveraging machine learning, and a large majority have some type of text
data available to them. Over the last decade, several businesses have explored and
been successful in getting intelligence out of text data generated by them or publicly
available from the web. While many are on that path, many want to get on that path
and exploit the potential of building data-driven offerings. Thus, knowing about NLP
and how you can use it is prime in today’s time.
Natural language processing (NLP) is a hot topic with a lot of applications and
an increasing amount of research across the globe. NLP refers to a machine’s process
to understand language. With the immense amount of text data generated today,
there is an increase in the scope for leveraging NLP to build intelligent solutions.
Google Trends suggests a 112% increase in searches on the topic of natural language
processing in the past seven years. Many businesses today offer products and ser-
vices powered by NLP. Common examples include Amazon Alexa, Gmail sentence
auto-completion, and Google Translate for language translation. With the increasing
demand for NLP-based products and services, there is a strong need for a workforce
that is able to understand and implement NLP solutions.
I started working in the industry as a Data Scientist after finishing grad school.
At the time, I didn’t have any guidance in my field at the company I was working at.
I was faced with tasks that seemed impossible to solve given my grad school back-
ground. In an educational setting, you are working on defined problems. In the real
world, you need to define these problems yourself given the knowledge of the business
objective. In an educational setting, you have data available. You’re either working on
publicly available datasets or one available at your educational institution. In the real
world, you may not have labeled data, you may not have enough data, and you may
not even have any data at all. Having faced these obstacles, I learned several lessons
that over time helped me to excel at my work. I would often share my learnings
and findings with the Python and Data Science community in the form of talks and
presentations at conferences across the globe. After accumulating close to a decade
of experience in working with language data and building NLP solutions in the real
world, I wrote this book.
Using open-source tools and the Python programming language, readers will gain hands-on
experience and be able to apply the solutions in their work. Readers will be able to
learn the concepts and refer back to the book any time they need to brush up on
their understanding of NLP usage and applications across industry verticals.
Assuming the reader has a basic understanding of machine learning and program-
ming in Python, this book focuses on practical aspects of NLP, covering the basic
concepts from a practical perspective, rather than diving into detailed architectures.
As such, this book is set to be a valuable resource for anyone looking to develop
practical NLP solutions.
The solutions we build involve using classic machine learning approaches, deep
learning models, and transformers, covering everything from the basics to the state-
of-the-art solutions that are used by companies for building real-world applications.
The reader will:
• Gain knowledge about necessary concepts and methods to build NLP solutions.
• Curate, extract, process, transform, and model text data for various use cases.
• Learn about how several industries solve NLP problems and apply the learnings
to new and unseen NLP tasks.
• Get practical tips throughout the book around different scenarios with data,
processing, and modeling.
Author Bio
For nearly a decade, Jyotika has focused her career on Machine Learning (ML) and
Natural Language Processing (NLP) across various industry verticals, using practical
real-world datasets to develop innovative solutions. Her work has resulted in multiple
patents that have been utilized by well-known tech companies for their advancements
in NLP and ML. Jyotika’s expertise in the subject has made her a highly sought-after
public speaker, having presented at more than 20 conferences and events around the
world.
Her work on building proprietary NLP solutions for ICX Media, a previous em-
ployer, resulted in unique business propositions that played a pivotal role in secur-
ing multi-million dollar business and the successful acquisition by Salient Global.
Jyotika currently holds the position of Director of Data Science at Placemakr, a
leading technology-enabled hospitality company in the USA. Moreover, Jyotika is
the creator and maintainer of open-source Python libraries, such as pyAudioProcess-
ing, that have been downloaded over 24,000 times.
Jyotika’s commitment to promoting diversity in STEM is evident through her ac-
tive support of women and underrepresented communities. She provides early-career
mentorship to build a diverse talent pool and volunteers as a mentor at Data Science
Nigeria, where she engages in mentorship sessions with young Nigerians aspiring for
a career in data and technology. Furthermore, Jyotika serves as a mentor at Women
Impact Tech, US, supporting women in technology, product, and engineering.
Jyotika has received numerous awards for her contributions to the field, including
being recognized as one of the top 50 Women of Impact in 2023 and being named
one of the top 100 most Influential people in Data 2022 by DataIQ. Additionally,
Jyotika has been honored with the Data Science Leadership award in 2022, Leadership
Excellence in Technology award in 2021, and other accolades.
Acknowledgments
Writing this book would not have been possible without the plethora of excellent
resources, such as papers, articles, open-source code, conferences, and online tools.
I am thankful to the Python, machine learning, and natural language processing
community for their efforts and contributions toward knowledge sharing. Along my
journey, I have asked a lot of individuals I do not personally know a lot of questions
about this topic and the book publishing process. Thank you all for selflessly taking
the time to answer my questions. Thank you to all the companies and publishers that
have permitted me to use their figures to aid the material of my book. I am grateful
for your contributions to this field and your prompt responses.
I am grateful to everyone who has reviewed sections and chapters of this book.
Thank you Shubham Khandelwal, Manvir Singh Walia, Neeru, Jed Divina, Rebecca
Bilbro, Steven McCord, Neha Tiwari, Sumanik Singh, Joey McCord, Daniel Jolicoeur,
Rekha, and Devesh for taking the time and sharing all your helpful suggestions along
my writing journey. Your feedback helped shape this book into what it is today, and
I could not have completed it without your input and support. It has been a pleasure
knowing each one of you and being able to count on your support.
The team at Taylor and Francis has been incredibly helpful throughout this pro-
cess. Your prompt responses and incredible input into this book are huge contributors.
Thank you, Randi (Cohen) Slack, for being a part of this journey.
I am grateful to my employer, Placemakr, for always encouraging and supporting
my book-writing journey. Thank you for sharing my excitement and supporting me
with everything I needed to be able to write this book.
On a personal note, I want to thank my family, the Walias and the Khandelwals,
for motivating me throughout this process. I wrote this book alongside my full-time
job responsibilities, volunteer mentorship work, and other life struggles. It has in-
volved a lot of late nights and weekends to get this book completed. My husband
and my parents have been tremendously helpful in taking care of everything else so
I got to focus on this book. Thank you Shubham, Mumma, and Papa. Your support
means the world to me. I want to especially acknowledge my late grandparents, Sar-
dar Sardul Singh and Raminder Kaur, and my husband’s grandmother, Radhadevi
Khandelwal. I have received nothing but love, support, and blessings from you all.
Thank you for being a part of my life.
I
NLP Concepts
In this section, we will go over some basic concepts that lead up to natural lan-
guage processing (NLP). Believe it or not, each one of us has at some point interacted
with a technology that uses NLP. Yes, it is that common! We will describe NLP and
share some examples of where you may have seen a product or technology powered
by NLP.
We will dive into where it all starts from and is centered around, i.e., language.
We will follow it with a brief introduction to concepts of linguistics that form the
basis for many NLP tasks. Often when thinking of how to implement a method for a
machine to do a task that humans perform well, it is useful to consider the perspective
– how would I (human) solve this? The answer often inspires mathematical modeling
and computer implementation for the task. Thus, we will spend some time in this
section on how the human-based understanding of language influences NLP tasks.
Language data needs preparation before a machine can find meaning from it. Have
you ever received a text message from a friend with a term you didn’t understand that
you had to look up on the Internet? Have you ever needed to translate a sentence from
one language to another to understand its meaning? Machines can require similar and
various additional types of preprocessing before they can make sense of the language
input. In general, language is not numeric (not represented as numbers), whereas a
machine understands data in only binary numbers – 1’s and 0’s. We’ll introduce the
basic concepts of converting language into numeric features before diving into further
details in the later chapters.
To build successful NLP solutions, it is important to note challenges in NLP and
why they arise. There are many challenges, some that remain challenging, and some
that can be fully or partially solved by using certain techniques. We will introduce
NLP challenges and potential solution options.
Finally, we will list setup requirements and introduce popular tools that we will
use in the rest of the book.
This section entails the following topics:
• Language concepts
• NLP challenges
• Setup
• Tools
CHAPTER 1
NLP Basics
DOI: 10.1201/9781003264774-1
TV turns on
You: Pause TV
You: Play TV
4. Text similarity: Text similarity is a popular NLP application that finds use in
systems that depend on finding documents with close affinities. A popular ex-
ample is content recommendations seen on social media platforms. Ever noticed
that when you search for a particular topic, your next-to-watch recommended
list gets flooded with very similar content? Credit goes to text similarity al-
gorithms, among some other data points that help inform user interest and
ranking.
While natural language is not only text but also other forms of com-
munication, such as speech or gestures, the methods and implementation
in this book are focused primarily on text data. Here are a few reasons for that.
- A lot of popular products using speech as input often first transcribe speech
to text and then process the text data for further analysis. The resultant
text is converted to speech after analysis for applications using a speech output.
- Speech processing is a large field of its own. On the other hand, gesture
detections fall under the realm of image processing and computer vision, which
is also a large field of its own. These fields are different, rich, diverse, and call for
a massive write-up like an entire book pertaining to these individual topics to
do them justice. For reference, a brief introduction, some resources, and open-
source Python tools are listed below that you might find useful if interested in
diving further into language processing for speech or gestures.
Speech
Introduction
Speech is a form of audio that humans use to communicate with one another.
Speaking is the exercise where forced air is passed through the vocal cords, and
depending on the pressure areas and amount, certain sounds are produced. When speech
is read using a Python program, the speech signal is seen as a time-series event in which
the amplitude of one's speech varies at different points. Often in speech processing,
frequency is of massive interest. Any sound contains underlying frequencies of its
component sounds. Frequency can be defined as the number of waves that pass a fixed
place in a given amount of time. These frequencies convey a great deal of information
about speech and the frequency domain representation is called the spectrum. Derived
from the spectrum is another domain of speech, called cepstrum. Common features
used from speech signals for machine learning applications include spectral features,
cepstral features, and temporal (time-domain) features.
Challenges
Common challenges in this field include the quality and diversity of data. Speech
in the presence of different background noises forms challenges for a machine to
interpret the signals and distinguish between the main speech versus the background
sounds. Basic techniques such as spectral subtraction [186], as well as more sophisticated
and actively researched noise removal models, are used. There is scope for speech
recognition to be made available for more languages and cover wider topics [164].
Tools
Some popular tools help extract features from speech and audio and build
machine learning models [154]. Examples of such open-source tools include
pyAudioProcessing3 [156], pyAudioAnalysis,4 pydub,5 and librosa.6
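As a small illustration of the feature types mentioned above, here is a minimal sketch using librosa; the audio file name is an assumption for illustration.
import librosa

# Load an audio file (file name assumed for illustration)
signal, sample_rate = librosa.load("speech.wav")

# Temporal (time-domain) view: raw amplitude samples over time
print(signal.shape, sample_rate)

# Cepstral features: Mel-frequency cepstral coefficients (MFCCs)
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)
print(mfccs.shape)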
Gestures
Introduction
Gestures form an important type of language. Many individuals rely on gestures
as their primary source of communication. Building systems that understand gestures
and smart machines that can interact with gestures is a prime application vertical.
Other applications include programming a system to understand specific gestures and
programming smart devices to optionally take an action based on the gesture, e.g.,
turn off a room light, play music, etc. For gesture analysis, there has been ongoing
research in improving and creating gesture detection and recognition systems [79].
Challenges
Some of the main issues have been around image quality and dataset sizes. Train-
ing a model to recognize images from a clean/fixed dataset may seem simpler. But
in a more realistic setting, the image quality is not always homogeneous or clean,
and training a model to recognize images that it hasn’t seen before in real-time can
be challenging. Data augmentation techniques to artificially add noise to clean sam-
3. https://github.com/jsingh811/pyAudioProcessing
4. https://pypi.org/project/pyAudioAnalysis/
5. https://pypi.org/project/pydub/
6. https://librosa.org/doc/latest/index.html
ples have been popularly implemented in this area to build a model that is able to
circumvent the noise.
Tools
Popular libraries include OpenCV7, scikit-image8, SciPy9, and PIL10. Artificial
neural networks have been popular in image processing. [79] walks through a simple
model to understand gestures. Here’s another guide to developing a gesture recogni-
tion model using convolutional neural networks (CNN) [36].
We have visited several applications and products powered by NLP. How does a
machine make sense of language? A lot of the inspiration comes from how humans
understand language. Before diving further into machine processes, let’s discuss how
humans understand language and some basic linguistic concepts.
Ears
Per Sound Relief Healing Center [43], the human ear is fully developed at birth
and responds to sounds that are very faint as well as very loud sounds. Even
before birth, infants respond to sound. Three parts in the human ear help relay
signals to the brain; the outer ear, middle ear, and inner ear. The outer ear canal
collects sounds and causes the eardrum to vibrate. The eardrum is connected to
three bones called ossicles. These tiny bones are connected to the inner ear at the
other end. Vibrations from the eardrum cause the ossicles to vibrate which, in
turn, creates movement of the fluid in the inner ear. The movement of the fluid
in the inner ear, or cochlea, causes changes in tiny structures called hair cells
that send electrical signals from the inner ear up the auditory nerve to the brain.
The brain then interprets these electrical signals as sound.
7. https://opencv.org/
8. https://scikit-image.org/
9. https://scipy.org/
10. https://pillow.readthedocs.io/en/stable/
Eyes
An article in Scientific American on ‘The Reading Brain in the Digital Age: The
Science of Paper versus Screens’ [88] yields insights on how the eyes help in
reading. Regarding reading text or understanding gestures, the part of the brain
that processes visual information comes into play, the visual cortex. Reading
is essentially object detection done by the brain. Just as we learn that certain
features—roundness, a twiggy stem, smooth skin—characterize an apple, we learn
to recognize each letter by its particular arrangement of lines, curves, and hollow
spaces. Some of the earliest forms of writing, such as Sumerian cuneiform, began
as characters shaped like the objects they represented—a person’s head, an ear of
barley, or a fish. Some researchers see traces of these origins in modern alphabets:
C as a crescent moon, S as a snake. Especially intricate characters—such as
Chinese hanzi and Japanese kanji—activate motor regions in the brain involved
in forming those characters on paper: The brain goes through the motions of
writing when reading, even if the hands are empty.
How we make sense of these signals as a language that conveys meaning comes
from our existing knowledge about language rules and different components of lan-
guage including form, semantics, and pragmatics. Even though some language rules
apply, because of the different ways people can communicate, often there are no
regular patterns or syntax that natural language follows. The brain relies on an in-
dividual’s understanding of language and context that lies outside of linguistic rules.
Whether we are consciously aware of it or not, any external sound, gesture, or
written text is converted to signals that the brain can operate with. To perform
the same tasks using a machine, language needs to be converted to signals that a
computer can interpret and understand. The processing required to do so is referred
to as Natural Language Processing (See Figure 1.4).
1. Form
FIGURE 1.5 Some popular applications of NLP that leverage different language com-
ponents.
Each component described above forms a basis for how we, as humans, interpret
the meaning of speech or text. It also forms the basis for many language features that
are used popularly in NLP to understand language. Figure 1.5 shows popular NLP
applications that make use of the different language components discussed above.
Tokenization refers to breaking down sentences into words, and words into the base
form. Part-of-speech tagging marks words in the text as parts of speech such as nouns,
verbs, etc.
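As a quick illustration of these two concepts, here is a minimal sketch using NLTK (introduced later in Section 1.6); the example sentence is only illustrative.
import nltk

# Download the required NLTK resources (only needed once)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The cats are sleeping peacefully."

# Tokenization: break the sentence into word tokens
tokens = nltk.word_tokenize(sentence)
print(tokens)

# Part-of-speech tagging: mark each token as a noun, verb, etc.
print(nltk.pos_tag(tokens))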
For a machine to make sense of language, it requires having seen such data and its usage
patterns before, and learning the language rules that humans are likely to follow. This
section lists some methods and factors that are important when thinking of using language as data.
1.3.1 Look-up
Consider a task where you need to find all the entries in a dataset of sentences
where the entry contains content about the movie – Ghostbusters. What solutions
come to mind? Curate training data, manually label some samples, and train a model
that predicts – Ghostbusters versus not-Ghostbusters?
Let’s look at a much easier and much faster solution. Why not look up the pres-
ence of the string ‘ghostbusters’ in each data sample? If it is present, mark it as
Ghostbusters, else not-Ghostbusters.
Limitations?
Some samples may mention ‘ecto-1’ which is the vehicle name in the movie and
not the term ‘ghostbusters’. Such a sample would be missed by our approach. Solution
– how about using multiple relevant keywords to search the samples with, including
popular actor names, character names, director names, and popular movie elements
such as the vehicle name? The results may not be all-encompassing but would cer-
tainly return an easy and fast solution and could serve as a great first approach before
a complex solution needs to be scoped out. Furthermore, this method can form a first
step for curating data labels for your dataset that can come in handy for future model
building.
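A minimal sketch of such a keyword look-up, with illustrative keywords and samples, could look like this.
# Illustrative keywords related to the movie (not an exhaustive list)
keywords = ["ghostbusters", "ecto-1", "venkman", "slimer"]

samples = [
    "The Ghostbusters theme song is iconic.",
    "I saw an Ecto-1 replica at a car show.",
    "The weather is nice today.",
]

def label_sample(text):
    # Mark a sample as Ghostbusters if any keyword appears in it
    text = text.lower()
    if any(keyword in text for keyword in keywords):
        return "Ghostbusters"
    return "not-Ghostbusters"

for sample in samples:
    print(label_sample(sample), "->", sample)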
Look-ups and similarly other basic approaches to NLP tasks such as using word
counts work as a great starting point for simpler tasks and result in simple yet effective
solutions.
1.3.2 Linguistics
Let's consider a task where we need to identify location names from a sentence - in this
case, one in which 'Arizona' appears as a person's name and 'New York' refers to the city.
How could we solve this problem? One simple solution might be to have a list
of all location names and search for their presence in the sentence. While it is not
an incorrect solution and would work perfectly for many use cases, there are certain
limitations.
The look-up approach would detect ‘Arizona’ and ‘New York’ as location names.
We, as humans, know that Arizona is a location name, but based on the sentence
above, it refers to a person and not the location.
There are advanced techniques that can distinguish between Arizona and New
York in the above example. The process of being able to recognize such entities
is called named-entity recognition, information extraction, or information retrieval
and leverages the syntax rules of language. How does it work? The process includes
tagging the text, detecting the boundaries of the sentence, and capitalization rules.
You can use a collection of data sets containing terms, and their relationships or use
a deep learning approach using word embeddings to understand the semantic and
syntactic relationship between various words. Don’t worry if this sounds unfamiliar.
We’ll dive further into it in Section III and Section V. The best part is that there
are existing tools that offer models that do a reasonable job for such tasks. Using the
spaCy library's en_core_web_sm trained model, this detection can be accomplished.
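A minimal sketch of running that model is shown below; the example sentence here is an assumption, and the exact labels returned depend on the model version.
import spacy

# Load the small English pipeline
# (requires: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Illustrative sentence where 'Arizona' refers to a person, not the state
doc = nlp("Arizona told me she is moving to New York next month.")

# Print each detected named entity and its label
for ent in doc.ents:
    print(ent.text, ent.label_)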
With the knowledge of linguistics and the relationship between terms, the machine
can accomplish the detection of location names from a challenging sentence.
Many other NLP tasks can be solved using the knowledge of linguistics as seen
previously in Figure 1.5.
Consider guessing what a person named David may be going out with his spouse for.
Humans may guess ‘meal’, ‘game’, ‘date’, ‘movie’, ‘vacation’, or ‘holiday’. Given
enough text samples, a machine could guess the same answers. The guesses can only
be as good as the data it has seen before. If all our training dataset contains is a
few samples of someone going out for a ‘movie’ with their spouse, then that’s the
best prediction we can get. But if the dataset is more representative, we could have
the machine capture many other possible answers, such as ‘date’, ‘game’, ‘meal’,
‘vacation’, ‘grocery run’, and even the less common events that every human may
not be able to guess. Why may that be? We as humans meet several people in our
lives, watch TV and movies, text our friends, read, and perform many such activities
that open our imagination to different degrees. Let’s consider a person named Emma.
Emma is unmarried and has very few married friends. She may not be able to guess where
one may go out with their spouse. This is because Emma hasn’t seen many examples
of such an event. However, a machine has the capacity to learn from a lot more data
than what a human brain can process and remember. Having large enough datasets
can not only represent Emma’s imagination of what David may be going out with his
spouse for, but also represent the imagination of several such individuals, and thus
make guesses that a single individual may not think of.
Now that we know data quantity matters, let's consider something a bit more ambiguous.
Let's say we want to infer which of two sentences is related to art: one about frying
onions in safflower oil, and another about mixing safflower oil with oil paints.
While ‘safflower oil’ is used in both examples, the topic of the first is completely
different from the second. This is known to humans because when we see the word
‘fry’ or ‘onions’ used with ‘oil’, it becomes apparent that it is likely not about art.
Similarly, ‘oil paints’ and ‘safflower oil’ used together seem likely to be about art. We
are able to make that inference because we know what food is and what paints are.
To make a machine understand the same, it is important to feed in relevant
training data so it can make similar inferences based on prior knowledge. If the
machine has never seen food items used in a sentence or has not seen it enough,
it would be an easy mistake to mark the first sentence as art if it has seen enough
samples of ‘safflower oil’ usage in art.
To successfully build an art/not-art classifier, we not only need a representative,
relevant, and good quantity of training dataset, but also preprocessing and cleaning
of data, a machine learning model, and numerical features constructed from the text
that can help the model learn.
1.3.4 Preprocessing
Data preprocessing refers to the process of passing the data through certain cleaning
and modification methods before analyzing or converting it into numerical represen-
tation for modeling. Depending on the source of data, the text can contain certain
noises that may make it hard for the machine to interpret.
For instance, consider a task where you have a list of text documents that were
written by people regarding a private review of a product. The product owners have
permission to display these reviews selectively on their websites. Now let’s talk about
constraints. The program that needs these reviews as input to display on the website
cannot parse language other than English. So as a preprocessing step, you’ll remove
any non-English language content. Furthermore, the product managers desire to not
display any reviews having less than 10 characters of text on the website. Thus, you’ll
further apply a filtering step where you only pass the documents that have a length
of more than 10. But when you further look at the data samples resulting after the
filters are applied, you find some reviews contain meaningless information in the form
of random URLs and non-alphabets. Thus, for a cleaner output, you may pass the
data through further steps, such as removing URLs, checking for the presence of
alphabets, stripping leading and trailing spaces to get the relevant text lengths, etc.
All these steps count as preprocessing and are very tailored towards the goal.
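A minimal sketch of the filtering steps described above might look as follows; the reviews are made up for illustration, and the language check is omitted here (it would need an additional library such as langdetect).
import re

reviews = [
    "Great product! http://example.com/deal",
    "ok",
    "$$$ 12345 !!!",
    "  Works exactly as described, highly recommend.  ",
]

def clean_review(text):
    # Remove URLs
    text = re.sub(r"http\S+|www\.\S+", "", text)
    # Strip leading and trailing spaces
    return text.strip()

cleaned = []
for review in reviews:
    review = clean_review(review)
    # Keep reviews longer than 10 characters that contain alphabets
    if len(review) > 10 and re.search(r"[a-zA-Z]", review):
        cleaned.append(review)

print(cleaned)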
It is often found useful to communicate to the machine that ‘grad’ and ‘graduation’
mean the same, and that words with or without apostrophes can be treated as the same
for such a dataset. Techniques to achieve this kind of normalization include stemming,
lemmatizing, ensuring diverse dataset representation, and creating custom maps for word
normalization. This will be further discussed in detail in Chapter 3 (Section 3.1).
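As a small preview of those techniques, here is a minimal sketch of stemming, lemmatization, and a custom normalization map using NLTK; the outputs noted in the comments are indicative.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops a word down to a crude root form (e.g., "graduat")
print(stemmer.stem("graduation"))

# Lemmatization maps a word to its dictionary base form (e.g., "study")
print(lemmatizer.lemmatize("studies"))

# A custom map for domain-specific normalizations
custom_map = {"grad": "graduation", "i'm": "i am"}
print(custom_map.get("grad", "grad"))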
Additionally, different geographical locations can represent different accents and
usage of words. Not only is the way of saying the same word different for the same
language across the world, but what they mean at times changes with geography. For
instance, biscuit in the USA refers to a quick bread that is typically unsweetened.
In the UK, a biscuit is a hard, flat item. A British biscuit is an American cookie,
an American biscuit is a British scone, and an American scone is something else
entirely [123]. Sometimes challenging for humans to understand, such differences cer-
tainly pose challenges for a machine meant to understand language globally. Having
a well-represented dataset for your use case becomes very important in solving such
challenges.
Semantic ambiguity
Semantic ambiguity results when words that are spelled the same have different
meanings.
I went out in the forest and found a bat.
Was it a bat, the animal? Or a baseball, cricket, or table tennis bat? The word
‘bat’ is a homonym, which means it can have multiple meanings but reads and sounds
the same. This forms a simple example of semantic ambiguity.
Syntactic ambiguity
Syntactic ambiguity is also called structural or grammatical ambiguity and occurs
when the structure of the sentence leads to multiple possible meanings.
The end . . . ?
There’s an ambiguous ending in the American science fiction horror film The
Blob (1958). The film ends with parachutes bearing the monstrous creature on a
pallet down to an Arctic ice field with the superimposed words ‘The End’ morphing
into a question mark [196]. The question mark at the end leaves a possibility that
the monster is not dead or may resurface. A classic ending for a horror film.
Narrative ambiguity
Narrative ambiguity arises when the intent of the text is unclear. If someone aims
a stone at a person and hits the target, it may count as a good shot, but not necessarily
a good deed. At such an event, commenting ‘that was good’ without the context of
what exactly the commenter found good is an example of narrative ambiguity.
Consider the following example:
Sarah gave a bath to her dog wearing a pink t-shirt.
Ambiguity: Is the dog wearing the pink t-shirt or is Sarah wearing a pink t-shirt?
Sometimes it is tricky even for humans to depict the intended meaning of an
ambiguous sentence. The same holds true for a machine.
FIGURE 1.6 Word cloud of top 100 most spoken languages across the world.
Translation has the capacity to lose certain information, as goes with the popular
phrase – ‘lost in translation’. Nonetheless, it works well for many applications.
Consider two sentences, one in which a handyman fixes a fridge and another in which the
fridge fixes the handyman. We can tell that the second sentence is likely erroneous
because the probability of an object like the fridge fixing a human, i.e., the handyman,
is very low. A machine
may not be able to tell when it encounters erroneous sentences. Bad or incorrect data
as such can impact the performance of your NLP model. Identifying and eliminating
incorrect samples and outliers can help with such data problems.
While removing outliers and bad data samples can help many applications, cases
like sarcasm detection remain challenging. The language concept of pragmatics plays
a role in humans detecting sarcasm. For a machine, many times in sentiment or
emotion classification tasks, the use of sarcasm is observed along with non-sarcastic
sentences. If sarcasm is known to occur in conjunction with certain topics, then
building a model to detect that can be reasonably successful. Many practitioners
have developed models to help detect sarcasm. There continues to be research in
the area and this kind of problem is an example of one that remains challenging
today [201].
Every individual can describe a similar emotion in innumerable ways. One would
need to know various expressions of happy emotion to build a model that successfully
infers the happiness state of an individual based on text. If our model was only built
on the happy expressions of Charan, we would not be able to guess when Beatrice or
Arthur is happy.
As another example, let’s say we have a pre-trained category classifier. The clas-
sifier was trained on social media data from 2008 and has historically given an 80%
accuracy. It does not work as well on social media data from 2022 and gives a 65%
accuracy. Why? Because the way people communicate changes over time along with
the topics that people talk about. This is an example of language evolution and data
drift and can present a lower accuracy of classification if training data differs from
the data you want to classify.
When is it that you can’t use a pre-trained model?
Does that mean no model works well if built on a different dataset? No! Transfer
learning refers to the process of learning from one data source and applying it to data
from a different source. This works very well in many cases. If any existing models do
not work for you, they can still form a good baseline that you can refer to. Sometimes,
they may also form as good inputs to your model and might require you to need less
new training data. This is further illustrated in Figure 1.8.
While some of these challenges are difficult to find a way around by both humans
and machines, several NLP techniques help take care of the most commonly seen noise
and challenges. Examples include cleaning techniques to strip off URLs and emojis,
spelling correction, language translation, stemming and lemmatization, data quantity
and quality considerations, and more. We will be discussing data preprocessing and
cleaning techniques in further detail in Chapter 3 (Section 3.1).
We started this section with a few questions. Let’s summarize their answers below.
Thus far, we have discussed NLP examples, language concepts, NLP concepts, and
NLP challenges. Before we wrap up this chapter, we will go over setup notes and some
popular tools that we will be using for the remaining chapters while implementing
NLP using Python.
1.5 SETUP
First, you will need Python >=3.7, pip >=3.0, and Jupyter [99] installed on your
machine. If you don’t already have these, follow footnote 11 to download Python,
footnotes 12 and 13 for installing pip on Mac and Windows respectively, and footnotes
14 and 15 for installing Jupyter on Mac and Windows. Another option is to install
Anaconda (footnote 16), which comes with pre-installed
Jupyter. You can then install many libraries using conda instead of pip. Both pip and
conda are package managers that facilitate installation, upgrade, and uninstallation
of Python packages.
We’ll also be showing some examples using bash in this book. Bash is pre-installed
on Mac machines (known as Terminal). Follow footnote 17 to install bash on Windows.
You can launch a Jupyter notebook by typing the following bash command.
jupyter notebook
To install a library in Python, pip [198] can be used as follows using bash.
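In bash, this takes the following form (mirroring the Jupyter notebook variant shown further below).
pip install <library>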
11. https://www.python.org/downloads/
12. https://www.geeksforgeeks.org/how-to-install-pip-in-macos/
13. https://www.geeksforgeeks.org/how-to-install-pip-on-windows/
14. https://www.geeksforgeeks.org/how-to-install-jupyter-notebook-on-macos/
15. https://www.geeksforgeeks.org/how-to-install-jupyter-notebook-in-windows/
16. https://www.anaconda.com/
17. https://itsfoss.com/install-bash-on-windows/
When working with Jupyter notebooks, you can do this from within the notebook
itself as follows.
! pip install <library>
To install a particular version of a library, you can specify it as follows. The below
example installs NLTK version 3.6.5.
! pip install nltk==3.6.5
For this book, most demonstrated Python code was written in Jupyter notebooks.
You will find Jupyter notebook notation for library installs throughout the book
unless otherwise specified.
Some libraries may require you to install using Homebrew.18 Follow the instructions
in the URL for installing Homebrew. Homebrew is a macOS-only command-line installer
application and does not exist for Windows. The Windows alternative is Chocolatey.19
1.6 TOOLS
There are several open-source libraries in Python to help us leverage existing imple-
mentations of various NLP methods. The below list introduces some of the popular
Python libraries used for text NLP. We’ll find ourselves leveraging these and some
others for many implementations in Sections II, III, V, and VI.
1. NLTK20 : NLTK stands for natural language toolkit and provides easy-to-use
interfaces to over 50 corpora and lexical resources such as WordNet21 (large
lexical database for English), along with a suite of text-processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reason-
ing, wrappers for industrial-strength NLP libraries, and an active discussion
forum [34].
NLTK can be installed as follows.
! pip install nltk
2. spaCy22 : spaCy is a library for advanced NLP in Python and Cython. spaCy
comes with pre-trained pipelines and currently supports tokenization and train-
ing for 60+ languages. It features state-of-the-art speed and neural network
models for tagging, parsing, named entity recognition (NER), text classifica-
tion, and multi-task learning with pre-trained transformers like BERT, as well as a
production-ready training system and easy model packaging and deployment.
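spaCy can be installed as follows; pre-trained pipelines such as en_core_web_sm are downloaded separately with python -m spacy download en_core_web_sm.
! pip install spacy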
18. https://brew.sh/
19. https://chocolatey.org/
20. NLTK https://www.nltk.org/
21. https://wordnet.princeton.edu/
22. spaCy https://spacy.io/
# For CPU-support only, you can install transformers and a deep
# learning library in one line.
# PyTorch
! pip install transformers[torch]
# TensorFlow 2.0
! pip install transformers[tf-cpu]
# Flax (neural network library for JAX)
! pip install transformers[flax]
For code demonstrated in this book, we have used the following versions of
these libraries. We have specified if any different versions were used for any ap-
plication. Other details can be found in the Jupyter notebooks containing the
code using different libraries. All the code used in this book can be found at
https://github.com/jsingh811/NLP-in-the-real-world and can be downloaded from
there.
• NLTK 3.6.5
• spaCy 3.2.3
• Gensim 4.2.0
• scikit-learn 1.1.3
• Tensorflow 2.11.0
• Keras 2.11.0
• Torch 1.13.0
• Torchvision 0.14.0
• Transformers 4.17.0
28. Hugging Face transformers https://huggingface.co/docs/transformers/main/en/index
29. JAX https://jax.readthedocs.io/en/latest/notebooks/quickstart.html
Windup
We discussed language, natural language processing, examples of applications and
products, the challenges associated with building successful NLP solutions, and the
consideration factors and how language can be used as data that machines can un-
derstand. There is a significant amount of preprocessing that is prime, along with
feature-building techniques, machine learning, and neural networks. For many lan-
guage tasks, your end goal may be around the analysis of text rather than building a
model. Preprocessing text followed by data visualization can come in handy for such
applications. Before diving into those stages in Section III, let’s talk about the one
thing NLP is not possible without– the data! Where can you get text data from?
How can you read text from different types of data sources? Where can you store
text? The next section will answer all these questions and further dive into popu-
lar data sources, text extraction with code examples of reading text from common
sources and formats, popular storage considerations and options with code samples,
and data maintenance.
II
Data Curation
In this section, our focus will surround data curation - where from, how, where
to, and other consideration factors. First, we will dive into the various sources which
text is commonly curated from. We will list publicly available data sources, as well
as common sources of data found within organizations. We’ll then dive into data
extraction and how data can be read from commonly used formats for storing text,
such as CSV, PDF, Word documents, images, APIs (Application Programming In-
terface), and other structured and unstructured formats using Python and open-
source libraries. Finally, with all the text data at hand, data storage becomes prime.
Sometimes saving the data on your machine in different files suffices. Other times, a
database management system serves as a more practical solution. We’ll discuss some
popular databases that are used widely for text data. Each database comes with ways
to query and perform operations on text. We’ll implement some of these operations.
Finally, we will introduce the concept of data maintenance and discuss some useful
tips and tricks to prevent the data from corruption.
This section includes the following topics:
• Sources of data
• Data extraction
• Data storage
All the code demonstrated in this section can be found in the section 2 folder of the
GitHub repository (https://github.com/jsingh811/NLP-in-the-real-world).
CHAPTER 2
Data Sources and Extraction
DOI: 10.1201/9781003264774-2
1. Customer reviews/comments
User comments are a very common source of text, especially from social me-
dia, e-commerce, and hospitality businesses that collect product reviews. For
instance, Google and Yelp collect reviews across brands as well as small and
large businesses.
3. Chat data
E-commerce, banking, and many other industries leverage chat data. Chat data
is essentially the chat history of messages exchanged between a business and
its client or customer. This is a common source of text data in industries and is
useful for monitoring user sentiment, improving customer experience, and cre-
ating smart chatbots where computers respond to customer messages, thereby
reducing human labor.
4. Product descriptions
Text descriptions are attached to most products or services being sold to people
or businesses. For instance, when you check any product on Amazon, you will
find a detailed product description section that helps you get more information.
Descriptive and categorical data associated with a product and service is yet
another common text data source in the industry.
5. News
News is used very popularly across finance and real estate industries to predict
stock and property prices. For many other industry verticals, what’s in the
news impacts their businesses, and ingesting news and articles can be beneficial.
Thus, it is common for organizations to have analytical engines built specifically
for curating and classifying news and article headlines on the Internet.
6. Documents
Resumes, legal documents, research publications, contracts, and internal docu-
ments are examples of document-type data sources. Across industries, several
resume filtering and searching algorithms are put in place to sift through the
hundreds of applications received for an open job position. In law and banking,
as well as many other industries, legal and contractual documents are present in
bulk and need to be sifted through for an important compliance term or detail.
In physics, NLP plays an important role in automatically sifting through bulk
volumes of research publications to find relevant material for drawing inspira-
tion, using as a guide, or referencing.
7. Entry-based data
Feedback and survey forms are another source of text data in the enterprise.
As an example, SurveyMonkey and Google Forms allow creation of custom
forms with categorical, select one, select many, or free-form text entry options.
Parsing free-form text fields manually, especially if they are present in large
volumes, can be a time-consuming effort. Building tools to parse the relevant
information for analysis is a common solution.
8. Search-based data
The searches that a customer or a client performs on a website are an example
of search-based data. This type of data consists of free-form text searches along
with categorical selects and timestamps. One popular application for such data
is to understand consumer interest and intent, design customer experience, and
recommend relevant items.
Many text datasets are publicly available and form a great resource for practicing
machine learning and NLP skills, or for use in a real-world problem that you might be
trying to solve as an industry practitioner. Some popular text datasets are listed in Table 2.1.
1. https://archive.ics.uci.edu/ml/datasets.php
2. https://snap.stanford.edu/data/web-Amazon.html
3. https://dumps.wikimedia.org/
4. https://nlp.stanford.edu/sentiment/index.html
5. https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
6. https://paperswithcode.com/dataset/standardized-project-gutenberg-corpus
7. https://www.cs.cmu.edu/~enron/
8. https://www.kaggle.com/rtatman/blog-authorship-corpus
9. https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
10. https://cseweb.ucsd.edu/~jmcauley/datasets.html
11. https://wordnet.princeton.edu/download
12. https://github.com/nproellochs/SentimentDictionaries
13. http://help.sentiment140.com/for-students/
14. https://www.cs.jhu.edu/~mdredze/datasets/sentiment/
15. https://www.yelp.com/dataset
16. http://qwone.com/~jason/20Newsgroups/
17. https://www.microsoft.com/en-us/download/details.aspx?id=52419
18. https://www.statmt.org/europarl/
19. http://kavita-ganesan.com/entity-ranking-data/#.Yw1NsuzMKXj
20. https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports
21. https://rajpurkar.github.io/SQuAD-explorer/
22. https://catalog.ldc.upenn.edu/LDC93s1
23. https://www.imdb.com/interfaces/
24. https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
1. YouTube
YouTube has multiple APIs, some that allow you to access publicly available
data and some that allow you to access data that you own on YouTube or
that you have been given permission to access by the content owners. YouTube
Data API [204] allows you to gather public data from YouTube. This includes
YouTube channel title, channel description, channel statistics such as video
count and subscriber count, video title, video description, video tags, video
statistics (number of likes, dislikes, views, and comments), comments, user pub-
lic subscriptions, and more. The openness of the API is subject to change and
is based on YouTube’s developer terms. To access this API, you will need to
register your account and generate your API tokens. There are daily rate limits
associated with each account, which sets a limit on how many requests you can
make in a 24-hour window [205]. More on accessing data using YouTube data
API is discussed in Section 2.2.7.
2. Twitter
Twitter has multiple APIs. Most of the APIs that offer advanced data metrics
require payment. Further details on Twitter data and APIs are linked here [184].
Common types of data you can fetch for free includes tweets, users, followers,
and friends. Your friends on Twitter refer to the users you follow as per the
API’s term definitions. The free API comes with a rate limit that refreshes
every 15 minutes. More on accessing data using Twitter API is discussed in
Section 2.2.7.
3. Reddit
Reddit offers some freely accessible data. Post data, subreddits, and user data
are available with the Reddit API. Their developer terms are linked here [139].
4. LinkedIn
LinkedIn API offers user, company, and post data [111]. Requests can be made
for a particular user or company to access the posts. Client authentication/per-
missions is a requirement for gathering more details.
6. Facebook
Facebook [119] and Instagram [120] data can be accessed using the Graph API.
The amount of data accessible via the API for these platforms is limited. Certain
Facebook page data can be publicly accessed. For a richer data dimension,
authentication is required so you are only able to gather either your own data
or the data of a client that has authorized you to do so on their behalf.
7. Twitch
Twitch allows you to ingest comments and details of a known and active stream-
ing event, along with the user names of individuals making the comments.
Terms and services, and details of the data available can be found here [183].
Note of caution
Subsets of IMDb data are available for access to customers for personal and
non-commercial use. You can hold local copies of this data, and it is subject
to our terms and conditions. Please refer to the Non-Commercial Licensing
and copyright/license and verify compliance.
Some other data sources do not object to commercial use, and some others
require proper citation of the resource. Following the guidelines will
certainly save you time and effort at a later stage while ensuring good
ethical standing and compliance.
a. https://www.imdb.com/interfaces/
25. https://www.imdb.com/interfaces/
Step 2:
Let’s read a sample PDF file in Python. You can download any PDF from the
web. Name it sample.pdf and place it in your code directory.
# Imports
from PyPDF2 import PdfFileReader
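A minimal sketch of the reading step, assuming the PyPDF2 1.x API that provides PdfFileReader, could look as follows.
# Open the PDF and extract text from each page
with open("sample.pdf", "rb") as file:
    reader = PdfFileReader(file)
    text = ""
    for page_num in range(reader.numPages):
        text += reader.getPage(page_num).extractText()

print(text)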
You can also do the above without using the ‘with’ clause. In that case, remember
to close your file towards the end using file.close() to avoid unintentional and
excess memory usage.
One limitation of this approach is that it does not work for scanned files saved as
PDFs. Next, we’ll look into an approach that works for extracting text from scanned
documents.
Step 2:
Read a sample scanned PNG (source [132]) with Python.
# Imports
import cv2
from pytesseract import image_to_string
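# A minimal sketch (the file name is an assumption): read the scanned image
# with OpenCV and extract its text with pytesseract
image = cv2.imread("scanned_page.png")
text = image_to_string(image)
print(text)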
On macOS, several users have reported errors such as the following.
FileNotFoundError: [Errno 2] No such file or directory: ‘tesseract’:
‘tesseract’
To resolve, run the below install using Homebrew and try the Step 2 code again.
Homebrew is a package manager for macOS. Don't have Homebrew installed? Follow
this installation guide26 .
brew install tesseract
The results can differ depending on the quality of the scanned document. Thus,
passing the image through certain filters can help get better results [207]. Examples
of such filters can be seen below.
import numpy as np
import cv2
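# A sketch of one such filter (the exact values are assumptions): convert the
# image to grayscale and binarize it with Otsu's thresholding before OCR
def apply_threshold(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, thresholded = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    return thresholded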
26
https://docs.brew.sh/Installation
FIGURE 2.1 Image of a page in a book [132] scanned from a smart phone.
For a sample scanned PDF shown in Figure 2.1, the results can be seen in
Figure 2.2. The code used can be found in section2/ocr-book-page-image.ipynb.
You can also build models that perform OCR using open-source libraries27 .
Like all methods, this one has certain drawbacks: false detections can occur
depending on the quality of the scanned file and the limitations of the underlying
algorithm. There is a chance that your output may contain spelling errors and
other data noise issues. We'll talk about data cleaning and preprocessing in Chapter 3
27
https://pyimagesearch.com/2020/08/24/ocr-handwriting-recognition-with-opencv-keras-and-tensorflow/
FIGURE 2.2 Results of OCR on Figure 2.1. On the left, results are produced without
any image filtering. On the right, results are produced with the thresholding filter
applied to the image. The errors are highlighted in grey.
(Section 3.1), which highlights spelling correction techniques and cleanup methods
that can handle some of these data inconsistencies.
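Next, let's write a sample dictionary to a JSON file and read it back in Python;
the file name and contents below are assumptions for a minimal sketch.
import json

sample = {"name": "sample_record", "id": 1}

# Write the sample to a JSON file
with open("sample_json.json", "w") as f:
    json.dump(sample, f)

# Read the sample back from the JSON file
with open("sample_json.json") as f:
    read_sample = json.load(f)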
# Printing
print("Sample written to json file: {}".format(sample))
print("Sample read from json file: {}".format(read_sample))
Using readlines
file = open('sample_csv.csv')
content = file.readlines()
Step 2:
Let’s read a sample CSV file in Python with the file in your working directory.
# Imports
import pandas as pd
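# Read the CSV into a DataFrame (the file name is assumed to match the earlier example)
data = pd.read_csv("sample_csv.csv")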
# print data
print ( data )
Step 2:
# Imports
from bs4 import BeautifulSoup
from urllib . request import urlopen
# URL to questions
myurl = (
    "https://stackoverflow.com/questions/19410018/how-to-count"
    "-the-number-of-words-in-a-sentence-ignoring-numbers-punctuation-an"
)
28
https://stackoverflow.com
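# A minimal sketch of fetching the page and parsing its HTML with BeautifulSoup
html = urlopen(myurl).read()
soup = BeautifulSoup(html, "html.parser")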
answers = soup.find("div", {"class": "answer"})
# print to check the correct class tag .
An example of another library that can help with HTML parsing is Scrapy29 .
Step 2:
Let’s read a sample Word file in Python with the name sample_word.docx in
your working directory.
# Imports
from docx import Document
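# A minimal sketch of reading the paragraphs from the Word file
document = Document("sample_word.docx")
text = "\n".join([para.text for para in document.paragraphs])
print(text)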
Next, let's fetch data from two social media APIs - the YouTube Data API and the Twitter API. Both require you to
register an app and generate tokens before you can start making requests to their
APIs.
YouTube API
Step 1: Registration and token generation
The first two steps include registering a project and enabling it30 . The API key
produced as a result can be used to make requests to the YouTube API.
Step 2: Making requests using Python
YouTube API has great documentation31 and guides on accessing it using Python.
Below is an example of searching for videos using a keyword on YouTube, grabbing
video tags and statistics, reading video comments, and fetching commenter subscrip-
tions.
Note that it is a good practice to not keep any API keys and secrets in your
Python scripts for security reasons. A common practice is to keep these defined as
local environment variables or fetch these from a secure location at runtime.
! pip install google-api-python-client==2.66.0
! pip install google-auth-httplib2==0.1.0
! pip install google-auth-oauthlib==0.7.1
# Imports
from googleapiclient . discovery import build
from googleapiclient . errors import HttpError
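# Globals; the service name and version below are the standard values for the
# YouTube Data API, and DEVELOPER_KEY is a placeholder for your own key
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"
DEVELOPER_KEY = "REPLACE_ME"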
youtube = build (
YOUTUBE_API_SERVICE_NAME ,
YOUTUBE_API_VERSION ,
developerKey = DEVELOPER_KEY
)
30
https://developers.google.com/youtube/v3/getting-started
31
https://developers.google.com/youtube/v3/docs
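A minimal sketch of the search and video-details requests that the code below relies on; the search keyword and maxResults values are assumptions.
# Search for videos matching a keyword
search_response = youtube.search().list(
    q="natural language processing",
    part="id,snippet",
    type="video",
    maxResults=5
).execute()
video_ids = [
    item["id"]["videoId"] for item in search_response.get("items", [])
]

# Get details (including tags and statistics) for the videos found
video_details = youtube.videos().list(
    id=",".join(video_ids),
    part="snippet,statistics"
).execute()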
print(video_details["items"][0]["snippet"]["title"])
print(video_details["items"][0]["statistics"])
Getting comments.
# Get comments for one video
comment_details = youtube.commentThreads().list(
    videoId=video_ids[0],
    part='snippet',
    maxResults=50
).execute()
first_cmnt = comment_details["items"][0]
top_level_data = first_cmnt["snippet"]["topLevelComment"]
print(
    top_level_data["snippet"]["textDisplay"],
    top_level_data["snippet"]["authorDisplayName"]
)
Getting subscriptions.
# Get commenting user IDs
commeters = [
    i['snippet']['topLevelComment']['snippet']['authorChannelId']['value']
    for i in comment_details.get('items', [])
]
# A hypothetical sketch: request public subscriptions for each commenting channel
subs = {}
for com_id in commeters:
    try:
        subs[com_id] = youtube.subscriptions().list(
            part="snippet",
            channelId=com_id,
            maxResults=50
        ).execute()
    except HttpError as err:
        print("""Could not get subscriptions
        for channel ID {}.\n{}""".format(
            com_id, err
        ))
print('Videos: {}'.format(video_details))
print('Comments: {}'.format(comment_details))
print('Subscriptions: {}'.format(subs))
Twitter API
Step 1: Registration and token generation
To use the Twitter API, you will need to create an app33 . The form leads to
a few questions that you will need to answer. Once the app is created, you should
be able to generate API tokens - consumer key, consumer secret, API token, and
API secret. There are standard limits associated with your application and tokens
that determine how many requests you can make to the Twitter API in a given time
frame. The limits are different for different requests and can be found here34 . There
are packages offered by Twitter to businesses that would like higher limits at different
costs. This can be determined by reaching out to a Twitter API contact.
Step 2: Making requests using Python
You can make requests to the Twitter API using the library tweepy. An example
for searching for users, tweets, and fetching followers and friends can be found below.
For more code samples, tweepy’s API guide is a great resource35 .
! pip install tweepy==4.12.1
# Imports
import tweepy
from tweepy import OAuthHandler
# Globals
CONSUMER_KEY = ' REPLACE_ME '
CONSUMER_SECRET = ' REPLACE_ME '
ACCESS_TOKEN = ' REPLACE_ME '
ACCESS_SECRET = ' REPLACE_ME '
32
https://developers.google.com/youtube/v3/determine_quota_cost
33
https://developer.twitter.com/en/apps
34
https://developer.twitter.com/en/docs/twitter-api/v1/rate-limits
35
https://docs.tweepy.org/en/stable/api.html
# Set connection
auth = OAuthHandler ( CONSUMER_KEY , CONSUMER_SECRET )
auth . set_access_token ( ACCESS_TOKEN , ACCESS_SECRET )
query = tweepy . API ( auth )
The below code snippet gets user details when the screen name or Twitter ID of
the desired user is known.
screen_names = ['CNN']
users = query . lookup_users ( screen_name = screen_names )
for user in users :
print ( user . _json )
If the screen name or ID is not known, you can also search for users using free-form
text as seen in the following code snippet.
search_term = " natural language processing "
users = query . search_users ( search_term )
for user in users :
print ( user . _json )
To get followers or friends of a known screen name or Twitter ID, the following
code can be used.
screen_name = "PyConAU"
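# A hypothetical paging setup over followers of the screen name set above
cursor = tweepy.Cursor(
    query.get_followers,
    screen_name=screen_name,
    count=200
).pages()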
# To get detailed data for each follower, page through the cursor as below.
# To get only IDs rather than detailed data for each follower/friend,
# a similar cursor over query.get_follower_ids can be used instead.
followers = []
for _ , page in enumerate ( cursor ) :
followers += [ itm . _json for itm in page ]
The below code snippet gets tweets for a twitter screen name @PyConAU.
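A minimal sketch of that request is shown here; the paging pattern and count value are assumptions consistent with the Cursor usage described next.
all_tweets = []
tweet_cursor = tweepy.Cursor(
    query.user_timeline,
    screen_name="PyConAU",
    count=200
).pages()
for page in tweet_cursor:
    all_tweets += [tweet._json for tweet in page]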
# Print
print ( " Total count " , len ( all_tweets ) )
print ( all_tweets [0])
If the followers are more than a certain amount, the code for getting followers
using Cursor can terminate before getting all the results. Same applies to requesting
tweets.
One important thing to note here is the influence of the limitations of your API
tokens. Twitter API tokens allow you to make a certain number of each type of
request per 15-minute window. Integrating logic to handle these limits can help avoid
premature termination of the code. Below is an example of introducing a 15-minute
wait for allowing a reset of the token before making further requests using library
tweepy.
# Add the query variables as below
query = tweepy.API(
    auth,
    wait_on_rate_limit=True
)
Now you can run the request for getting tweets and followers using Cursor without
errors. The runtime might be long because of the 15-minute sleeps. The output of
the above code can be found in the notebook twitter-api.ipynb on GitHub.
Since you may not only have text data to think about, but also timestamps,
numbers, and other kinds of data, it is important to evaluate your needs accordingly.
The solution you pick should fit all your data needs and formats.
There are several database options in general, including relational, non-relational,
cloud, columnar, wide column, object-oriented, key-value, document, hierarchical,
and graph databases; the resource in the footnote36 contains further information about
each. For our scope, specifically for use cases around text data, the database types that
are popularly used include relational, non-relational, and document databases.
A relational database is a structure that recognizes relationships between stored
items. Most of such databases use Structured Query Language (SQL) as their under-
lying query language. A non-relational database does not rely on known relationships
and uses a storage model that is optimized for the type of data being stored. Such
databases are also referred to as NoSQL, or "not only SQL". A document database is a
type of non-relational database suitable for document-oriented information.
An often preferred and easy solution is to store your data in a relational database
if you can set expectations of which data can be added to a table in the future. The
queries are easy and the data schema can be standardized. However, when you don’t
have set fields and field types that a data table can contain, and want the flexibility for
adding new and unknown field types to your data tables at any time, non-relational
databases are a better choice. Next, let’s assume you have a collection of white papers
or resumes that you want to store. If you know how to extract the relevant pieces of
information within those documents, it can be a good idea to transform the data into
36
https://www.geeksforgeeks.org/types-of-databases
the fields needed first and then store them in a structured format. However, if the
intention is to perform full-text searches of the documents, then choosing a document
database will be suitable.
Popular databases that work well with text data include Elasticsearch and MongoDB
(a good choice if you have larger documents to store). We'll also explore Google Cloud
Platform's (GCP) BigQuery and a simple flat file system. Next, let's look at the
capabilities and some query samples for each.
2.3.2 Elasticsearch
Elasticsearch is a NoSQL, distributed document-oriented database. It serves as a full-
text search engine designed to store, access, and manage structured, semi-structured,
and unstructured data of different types. Elasticsearch uses a data structure called
an inverted index. This data structure lists each word appearing in a document and
is then able to easily return documents where a word has occurred, thus supporting
fast full-text searches. These capabilities offered with Elasticsearch make it a popular
choice for text. Elasticsearch is also a popular choice for numeric and geospatial data
types37 .
With Elasticsearch, you can get documents matching other documents using TF-
IDF (discussed further in Chapter 3 (Section 3.4)), along with other simpler oper-
ations such as finding documents that contain or don’t contain a word/phrase as
an exact field value or within a text field, or a combination thereof. Every record
37
https://www.javatpoint.com/elasticsearch-vs-mongodb
returned has an attached score that represents the closeness to your search criteria.
Elasticsearch is a great choice when you need fast searches from your database and
supports fast filtering and aggregation.
For text fields, Elasticsearch offers two types – text and keyword. Fields of type
keyword are optimized for filtering operations. Fields of type text are better
suited for performing searches within the strings. A field can also be indexed as both
keyword and text if desired.
An index in Elasticsearch can be set up to expect documents containing fields
with certain names and types. Let’s consider the following example. You want to
create a new index in Elasticsearch and add data to it. This can be done as follows
using Python with the library elasticsearch.
! python -m pip install elasticsearch==8.5.0
conn = Elasticsearch(
    [{"host": <host>, "port": <port>}],
    http_auth=(<username>, <password>),
    timeout=60,
    max_retries=5,
    retry_on_timeout=True,
    maxsize=25,
)
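# A hypothetical index name and first document, matching the search results shown below
index_name = "users"
conn.index(
    index=index_name, id=1,
    body={"name": "mandatory payment", "userId": 1}
)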
conn.index(
    index=index_name, doc_type="entity", id=2,
    body={"name": "herman woman", "userId": 2}
)
Let’s assume your host and port is ‘elastic.org.com:9200’.
Now http://elastic.org.com:9200/users?pretty will show you the data that
you just inserted.
{
" took " : 53 ,
" timed_out " : false ,
" _shards " : {
" total " : 5 ,
" successful " : 5 ,
" failed " : 0
},
" hits " : {
" total " : 2 ,
" max_score " : 1.0 ,
" hits " : [
{
" _index " : " users " ,
" _type " : " entity " ,
" _id " : " 1 " ,
" _score " : 1.0 ,
" _source " : {
" name " : " mandatory payment " ,
" userId " : " 1 "
}
},
{
" _index " : " users " ,
" _type " : " entity " ,
" _id " : " 2 " ,
" _score " : 1.0 ,
" _source " : {
" name " : " herman woman " ,
" userId " : " 2 "
}
}
]
}
}
There are multiple ways to query data in Elasticsearch. Kibana38 is a great data
exploration tool that works on top of Elasticsearch.
38
https://www.elastic.co/kibana
Here, we’ll look at a few bash and Python examples of querying Elasticsearch.
This will return your record with _id 1 and 2 as "man" is present in both the
records in the name field and save the results in the specified temp.txt file.
Using Python
from elasticsearch import Elasticsearch

es_master = "elastic.orgname.com"
es_port = "9200"
es = Elasticsearch([{"host": es_master, "port": es_port}])
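# A hypothetical query matching records whose name field contains "man" as a substring
query = {
    "query": {
        "wildcard": {
            "name": "*man*"
        }
    }
}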
res = es.search(
    index="users",
    body=query,
    size=10000,
    request_timeout=30
)
2.3.3 MongoDB
MongoDB is a NoSQL document-oriented database. It is a popular choice for storing
text data. The DB supports query operations for performing a search on text.
Let’s consider an example dataset using the MongoDB Shell, mongosh, which
is a fully functional JavaScript and Node.js 14.x REPL environment for interacting
with MongoDB deployments. You can use the MongoDB Shell to test queries and
operations directly with your database. mongosh is available as a standalone package
in the MongoDB download center.39
db.stores.insert(
  [
    { _id: 1, name: "Java Hut",
      description: "Coffee and cakes" },
    { _id: 2, name: "Burger Buns",
      description: "Gourmet hamburgers" },
    { _id: 3, name: "Coffee Shop",
      description: "Just coffee" },
    { _id: 4, name: "Clothes Clothes Clothes",
      description: "Discount clothing" },
    { _id: 5, name: "Java Shopping",
      description: "Indonesian goods" }
  ]
)
MongoDB uses a text index and $text operator to perform text searches.
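A text index covering both fields of the stores collection can be created in mongosh as below (a minimal sketch consistent with the collection defined above).
db.stores.createIndex( { name: "text", description: "text" } )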
This will allow you to perform text search on fields name and description.
$text operator
The $text query operator can be used for performing text searches on a collection
with a text index. $text tokenizes the text using whitespace and common punctuation
as delimiters. For matching the field with a string, this operator performs a logical
OR with all tokens in the text field.
For instance, the following query can be used to find matches with any of these
terms - ‘coffee’, ‘shop’, and ‘java’.
db.stores.find( { $text: { $search: "java coffee shop" } } )
An exact phrase match can also be searched for by wrapping the string in double
quotes. The following finds all documents containing coffee shop.
db.stores.find( { $text: { $search: "\"coffee shop\"" } } )
39
https://docs.mongodb.com/mongodb-shell/
Furthermore, if you want to search for the presence of certain words, but also the
absence of a word, you can exclude a word by prepending a - character. For instance,
the following finds stores containing ‘java’ or ‘shop’, but not ‘coffee’.
db.stores.find( { $text: { $search: "java shop -coffee" } } )
MongoDB returns the results without any sorting applied as the default. Text
search queries compute a relevance score for every document. This score is the mea-
sure of how well a document matches the query of the user. It is possible to specify
a sorting order within the query. This can be done as follows.
db.stores.find(
    { $text: { $search: "java coffee shop" } },
    { score: { $meta: "textScore" } }
).sort( { score: { $meta: "textScore" } } )
Text search can also be performed in the aggregation pipeline. The following
aggregation searches for the term ‘cake’ in the $match stage and calculates the total
views for the matching documents in the $group stage.40
db.articles.aggregate(
    [
        { $match: { $text: { $search: "cake" } } },
        { $group: { _id: null, views: { $sum: "$views" } } }
    ]
)
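The same operations can be issued from Python with the pymongo library; a minimal connection sketch (the host and port are assumptions) is shown below.
! pip install pymongo
from pymongo import MongoClient

# Connect to a running MongoDB instance
client = MongoClient("localhost", 27017)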
# Database Name
db = client["database"]
# Collection Name
collection = db["your collection name"]
40
https://docs.mongodb.com/manual/text-search/
41
https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb
Language support
In addition to the English language, MongoDB supports text search for various
other languages. These include Danish, Dutch, Finnish, French, German, Hungarian,
Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish.
Other examples of document databases include RethinkDB42 and OrientDB.43
Regex
BigQuery's regular expression functions are listed below44 .
REGEXP_MATCH(‘str’, ‘reg_exp’) returns true if str matches the regular ex-
pression. For string matching without regular expressions, use CONTAINS instead
of REGEXP_MATCH.
REGEXP_EXTRACT(‘str’, ‘reg_exp’) returns the portion of str that matches
the capturing group within the regular expression.
REGEXP_REPLACE(‘orig_str’, ‘reg_exp’, ‘replace_str’) returns a string where
any substring of orig_str that matches reg_exp is replaced with replace_str. For
example, REGEXP_REPLACE (‘Hello’, ‘lo’, ‘p’) returns Help.
String functions
Several string functions exist in BigQuery that operate on string data.45 Table
2.2 contains the functions and their descriptions.
Consider a table sample_table whose title column contains the following values.
Amazing SpiderMan
I am a woman
I am a Man
I am HUMAN
Commander
mandatory
man
Let’s say our task is to find all rows where the title contains the term ‘man’. Let’s
see the solution a few different ways.
Presence of the search term ‘man’ anywhere in the string matching the
case specified.
SELECT title
FROM sample_project . sample_table
WHERE title LIKE '%man%'
This will result in titles containing ‘man’, and would also work for words such
as ‘woman’ or ‘commander’, that contain the search string within. This particular
search will only return rows that match the case with our search term. In this case,
it would be the lowercase word ‘man’. The following titles will be returned.
I am a woman
Commander
mandatory
man
Presence of the search term ‘man’ only at the start of the string field,
matching the case specified.
SELECT title
FROM project . sample_table
WHERE title LIKE 'man%'
mandatory
man
Similarly, to get an exact match, the % sign from the right can be removed. To get
string fields ending with ‘man’, the % sign can be placed on the left and removed
from the right.
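To make the substring match case-insensitive, one option (a sketch following the same pattern) is to lowercase the title before matching; on this sample it returns every row.
SELECT title
FROM project.sample_table
WHERE LOWER(title) LIKE '%man%'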
Amazing SpiderMan
I am a woman
I am a Man
I am HUMAN
Commander
mandatory
man
Finding all rows that contain the search term ‘man’ as a separate
word/phrase with case insensitivity.
This use case differs from the search criteria perspective. Until now, we were
searching for the presence of a term within a string. Now, we want to detect rows
where the search string is present as a word/independently occurring phrase of its
own. We can leverage REGEXP_CONTAINS for a use case such as this.
SELECT title
FROM project . sample_table
WHERE REGEXP_CONTAINS(
    title, "(?i)(?:^|\\W)man(?:$|\\W)"
)
I am a Man
man
SELECT title
FROM project . sample_table
WHERE REGEXP_CONTAINS(
    title, "(?i)(?:^|\\W)a man(?:$|\\W)"
)
I am a Man
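To run these queries from Python, the google-cloud-bigquery client library can be used; a minimal sketch (assuming credentials are already configured in your environment) is below.
! pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()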
query = """
SELECT title
FROM project.sample_table
"""
# Make an API request.
query_job = client.query(query=query)
46
https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-install-python
Data maintenance
Now that you have made a call on how to store your data, an important follow-up
is considerations around maintaining your data. This includes the following.
1. Backup strategy
One of the unfortunate events that happens, at different scales, to most of us
at some point is accidental deletion. It is important to take or
schedule regular data backups. This is also useful when your data is constantly
updated; it helps restore a previous version if recent data updates introduced
unwanted data or other noise.
2. Maintaining data quality
When storing data in a database, it is a popular practice to store ingestion
timestamps along with every record to retain the knowledge of when something
was inserted or changed. This helps in partitioning manual quality checks on
your data, so you do not need to re-check data you have checked before, but
only the new additions/changes.
3. Monitoring
Establishing processes and systems for monitoring data is an important prac-
tice for maintaining data quality and identifying issues in a timely manner.
Set alerts to get notified of errors or data changes. Datadoga is an example
of a tool that offers monitoring and analysis of the health and performance of
your databases.
a
https://docs.datadoghq.com/getting_started/database_monitoring/
Windup
In this section, we discussed several sources of text data, including first-party data
(a business’s own asset), public sources, and conditionally available data sources such
as social media APIs. We shared how you can extract text from different document
formats using Python and open-source tools. We also shared code samples for reading
text from different APIs. Once you have text data, exploring data storage solutions
is the next step. We have investigated data storage options that can help you stably
keep your data around, update it, and query it as per the need. We also shared several
query samples for performing text operations in different databases.
It is vital to note that some APIs change their supported methods with dif-
ferent versions without backwards compatibility. For example, tweepy 3.10 versus
tweepy 4.12 have different method names for getting followers (followers versus
get_followers).
We also discussed the importance of data maintenance and tips and tricks to
ensure the data retains quality. Now you know the basics of NLP and where the
data comes from, how you can extract it, and where you can store your data. In
the next section, we will discuss data preprocessing, modeling, visualization, and
augmentation.
Previously, we discussed several reasons why NLP is challenging. That included
language diversity, language evolution, and context awareness, among others. These
challenges impact how the text data looks. What does that mean? Text data can
be curated from multiple resources. The data’s nature itself can vary in quality and
quantity. Let’s consider an corpus curated from Research Gate that contains research
papers. How would you expect the text would look like from this corpus? Perhaps
long, formal language, field-specific jargon usage. Now consider a text corpus curated
from YouTube that contains comments on a gaming video. How would you expect this
corpus to look in comparison to the Research Gate corpus? Perhaps shorter length
documents, informal language, gaming-specific abbreviations, and term usage. What
if we were fetching comments from unpopular videos? Now another variation may be
the total number of documents in the corpus. We can see how changing the source
completely changes certain properties of the text documents within a corpus.
Below are the most common varieties of text documents that are dealt with.
2. Language style: The style of language usage in the corpus can be formal, infor-
mal, semi-formal, or a combination thereof.
3. Text length: The length of text documents within a corpus can be short, long,
or a combination thereof.
4. Language of communication: Even if you expect to deal with only one language
of content, let’s say English, you can still have documents containing other
languages.
Why do we care about the variety found within text data? It is important to
know your data, and how it may be different from other text data that you are trying
to learn from or may have worked on in the past. The nature of the data is a useful
consideration factor for deciding between storage options and maintenance logic. It
also informs whether certain cleaning algorithms need to be applied to your dataset
prior to any processing. We will look at different data cleaning and preprocessing,
and data augmentation methods in Chapter 3.
III
Data Processing and Modeling
What you do once you have the data depends on the task you are trying to
accomplish. For data scientists, there are two common possibilities.
1. The task at hand requires data analysis and aggregations to draw insights for
a business use case.
2. The task at hand requires you to build models using machine learning (ML).
Figure 2.3 summarizes the chain of events most commonly expected in a Data
Scientist’s work. In this section, we will discuss all the phases - data, modeling, and
evaluation.
To begin with, we will dive into the data phase. We already discussed data sources
and curation in Section II. In this section, we will dive into data cleaning and prepro-
cessing techniques that eliminate unwanted elements from text and prepare the data
for numerical transformations and modeling. We’ll look at Python implementations
for removing commonly observed noise in the different varieties of data, including
lowercasing, stop word removal, spelling corrections, URL removal, punctuation re-
moval, and more. We’ll also discuss stemming, lemmatization, and other standard-
ization techniques. Before further processing, a text document often needs to be seg-
mented into its component sentences, words, or phrases. We’ll implement different
data segmentation techniques.
Once the data is clean, the next step includes data transformations to convert
the text into numerical representations. We’ll look at data transformation techniques
that include text encoding, frequency-based vectorization, co-occurrence matrix, and
several word embedding models (word embedding models convert words into numeric representations).
FIGURE 2.3 Data Science project phases.
• Visualization
• Data augmentation
• Data transformation
• Distance metrics
• Modeling
• Model evaluation
Select what you want to remove from your data based on the ap-
plication. For instance, to create a sentiment classification model, you would
not expect elements such as URLs to convey meaning.
The anticipated noise can also vary with language style and data source. For
instance, the language used on social media can have excessive punctuations,
emojis, and typing errors. For some applications, retaining punctuation
might be necessary, while for others it might not be useful or could also
be detrimental to your model. For instance, if you are creating a classifier
model using the count of words in the text as features, you may want to map
‘great’, ‘Great’, and ‘GREAT!’ to the single word ‘great’. In this example,
you will need to remove punctuation from your data and lowercase before
extracting word frequencies. On the contrary, most named entity recognition
(NER) models rely on punctuation and case to identify entities in the text,
such as a person’s name. For example, below we run two sentences through
spaCy’s NER using the en_web_core_sm pre-trained model.
We’ll look at code implementations for NER using spaCy and some other tools
in Chapter 7 (Section 7.1).
3.1.1 Segmentation
Sentence segmentation
A long document can be split into multiple component sentences using sentence
segmentation. This can be accomplished using many Python libraries. Let’s see an
example below using spaCy.
! pip install spacy
import spacy
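# A minimal sketch (the model name and sample text are assumptions)
nlp = spacy.load("en_core_web_sm")
doc = nlp("I like NLP. It has many interesting applications.")
print([sent.text for sent in doc.sents])
# >> ['I like NLP.', 'It has many interesting applications.']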
You can also perform sentence segmentation using the NLTK library, or write your
own regex function depending on how you want to split the text. We’ll look at an
example of the latter in Chapter 10 (Section 10.1.3).
What is regex?
Regex stands for regular expression. A regular expression is a sequence of
characters that specifies a pattern for searching text.
Word tokenization
Text tokenization refers to the splitting of text into meaningful tokens or units.
You can use text.split() (split() is a built-in Python string function) to break
the text down into smaller units as well; however, that does not treat punctuation
as a separate unit from words. It can still work well for your data if you remove
punctuation before splitting the text, but it fails to differentiate between regular
period usage and something like ‘U.K.’, which should be one token.
Libraries such as TextBlob, NLTK, and spaCy can be used to tokenize text. Here
are a few implementations.
! pip install textblob==0.17.1
import spacy
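from textblob import TextBlob

# A hypothetical example sentence
text = "Can you please buy me an Arizona Ice Tea? It's $0.57."

# TextBlob tokenization (punctuation is not returned as separate tokens)
print(TextBlob(text).words)

# spaCy tokenization (punctuation and symbols become their own tokens)
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp(text)])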
Part-of-speech tagging
Part-of-speech tagging is also called POS tagging. Sometimes, it might be desired
to retain only certain parts of speech, such as nouns. The use cases can be cleaning
data before creating a word-counts (bag-of-words) model or further processing that
depends on parts of speech, such as named entity recognition (where two nouns oc-
curring together are likely first and last names of a person) and keyphrase extraction.
This can be implemented in Python as follows.
from nltk import word_tokenize, pos_tag

tokens = word_tokenize(
    "Can you please buy me an Arizona Ice Tea? It's $0.57."
)
pos = pos_tag(tokens)
print(pos)
# >> [('Can', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('buy', 'VB'),
# ('me', 'PRP'), ('an', 'DT'), ('Arizona', 'NNP'), ('Ice', 'NNP'),
# ('Tea', 'NNP'), ('?', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('$', '$'),
# ('0.57', 'CD'), ('.', '.')]
N-grams
N-grams are a contiguous sequence of N elements. For instance, ‘natural’, ‘lan-
guage’, and ‘processing’ are unigrams, ‘natural language’ and ‘language processing’
are bigrams, and ‘natural language processing’ is the trigram of the string ‘natural
language processing’.
In many NLP feature generation methods, each word in a sentence is used as an
independent unit (token) while encoding data. Instead, getting multi-word pairs from
a sentence can be beneficial for applications that involve multi-word keywords
or sentiment analysis. For example, the bigram ‘not happy’ versus the unigram ‘happy’
can convey different sentiments for the sentence ‘James is not happy.’
! pip install textblob
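from textblob import TextBlob

# A minimal sketch of extracting bigrams from a sample string
blob = TextBlob("natural language processing")
print(blob.ngrams(n=2))
# >> [WordList(['natural', 'language']), WordList(['language', 'processing'])]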
3.1.2 Cleaning
Punctuation removal
For many applications such as category classification and word visualizations, the
words used in the text matter and the punctuation does not have relevance to the
application. Punctuation can be removed using a regex expression.
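A minimal sketch (the sample string is an assumption) is shown below.
import re

text = "Hello!!! NLP, is great, isn't it?"
punct_cleaned = re.sub(r"[^\w\s]", "", text)
print(punct_cleaned)
# >> Hello NLP is great isnt it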
URL removal
In language documents, removing URLs can be beneficial in reducing overall text
length and removing information that does not convey meaning for your application.
In regex, \s matches all whitespace characters and \S matches all non-whitespace
characters. | stands for OR and can be used when you want to match multiple patterns
with OR logic. An example of removing URLs from text can be seen below.
import re

text = """
Check it out on https://google.com or www.google.com for more
information.
Reach out to abc@xyz.com for inquiries.
"""

url_cleaned = re.sub(r"https?://\S+|www\.\S+", "", text)
# >> Check it out on  or  for more information.
# >> Reach out to abc@xyz.com for inquiries.
Emoji removal
Unicode is an international standard that maintains a mapping of individual
characters and a unique number across devices and programs. Each character is
represented as a code point. These code points are encoded to bytes and can be
decoded back to code points. UTF-8 is an encoding system for Unicode. UTF-8 uses
1, 2, 3 or 4 bytes to encode every code point.
In the Unicode standard, each emoji is represented as a code point. For instance,
\U0001F600 is the code point that renders a grinning face across devices and programs
in UTF-8. Thus, regex patterns can be used to remove emojis from the text.
For the sentence ‘What does 😀 emoji mean?’, the following code replaces the
emoji with an empty string.
import re

text = "What does \U0001F600 emoji mean?"
emoji_cleaned = re.sub(
    r"[\U00010000-\U0010ffff]", "", text, flags=re.UNICODE
)
# >> 'What does  emoji mean?'
Spelling corrections
Sometimes the data consists of a lot of typing errors or intentional misspellings
that fail to get recognized as intended by our models, especially if our models have
been trained on cleaner data. In such cases, algorithmically correcting typos can come
in handy. Libraries such as pySpellChecker, TextBlob, and pyEnchant can be used
to accomplish spelling corrections.
For spelling corrections, common underlying approaches use character-based dif-
ferences. We’ll go over some character-based distance metrics later in Chapter 4
(Section 4.1.1).
Let’s look at the library pySpellChecker. The library has some drawbacks in rec-
ognizing typos containing more than 2 consecutively repeated alphabets, e.g., ‘craazy’
-> ‘crazy’ , but ‘craaazy’ x> ‘crazy’. If relevant to your data, consider limiting con-
secutive occurrences of any alphabet to a maximum of 2 times before passing the
text through pySpellChecker for getting more accurate spelling corrections. This
operation can take a long time depending on the length of the text.
! pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()
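# correction() returns the most likely spelling for a misspelled word,
# e.g., the 'craazy' example mentioned above
print(spell.correction("craazy"))
# >> crazy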
The library TextBlob also does not always handle more than two consecutive
repeated characters well. You can also train a model on your own custom corpus using
TextBlob.
! pip install textblob
And lastly, the library pyenchant helps accomplish spelling corrections with sim-
ilar issues as seen in the other tools.
! pip install pyenchant==3.2.2
In many use cases where the terms are expected to be specific to an industry,
custom spelling checker tools can be built using available and relevant datasets.
Stop words removal
Stop words refer to the commonly occurring words that help connect important
key terms in a sentence to make it meaningful. However, for many NLP applications,
they do not represent much meaning by themselves. Examples include ‘this’, ‘it’,
‘are’, etc. This is especially useful in applications using word occurrence-based fea-
tures. There are libraries and data sources containing common stop words that you
can use as a reference look-up list to remove those words from your text. In practice,
it is common to append to an existing stop words list the words specific to your
dataset that are expected to occur very commonly but don’t convey important infor-
mation. For example, if you are dealing with YouTube data, then the word ‘video’
may commonly occur without conveying a unique meaning across text documents
since all of them come from a video source.
Here’s how to use the NLTK library to remove stop words.
! pip install nltk
import nltk
nltk.download("stopwords")
nltk.download("punkt")
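# The pieces the example below relies on; the sample sentence is an assumption
from nltk.corpus import stopwords
from nltk import word_tokenize

sw = stopwords.words("english")
tokens = word_tokenize("Hi I like NLP, do you?")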
stop_cleaned = [
w for w in tokens if w . lower () not in sw
]
# instead , you can also lowercase the text before tokenizing ,
# unless retaining case is required for your application
print ( stop_cleaned )
# >> [ ' Hi ', ' like ', ' NLP ', ',', '? ']
3.1.3 Standardization
Lowercasing
For applications where ‘Natural Language Processing’, ‘natural language process-
ing’, and ‘NATURAL LANGUAGE PROCESSING’ convey the same meaning, you
can lowercase your text as follows.
text = " NATURAL LANGUAGE PROCESSING "
lower_cleaned = text . lower ()
# >> natural language processing
tokens = [
" cars " , " car " , " fabric " , " fabrication " , " computation " , " computer "
]
from nltk.stem import PorterStemmer

st = PorterStemmer()
stemmed = " ".join([st.stem(word) for word in tokens])
print ( stemmed )
# >> car car fabric fabric comput comput
Lemmatization
Lemmatization is the process of extracting the root word by considering the var-
ious words in a vocabulary that convey a similar meaning. Lemmatization involves
morphological (described in Section I) analysis of words that remove inflectional end-
ings only to return a base word called the lemma. For example, lemmatizing the word
‘caring’ would result in ‘care’, whereas stemming the word would result in ‘car’.
There are many tools you can use for lemmatization. NLTK, TextBlob, spaCy,
and Gensim are some popular choices. Let’s look at a few implementation examples
below.
! pip install textblob
from textblob import Word
tokens = [
" fabric " , " fabrication " , " car " , " cars " , " computation " , " computer "
]
lemmatized = " " . join (
[ Word ( word ) . lemmatize () for word in tokens ]
)
print ( lemmatized )
# >> fabric fabrication car car computation computer
import spacy
nlp = spacy.load('en_core_web_sm')
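# A minimal sketch applying spaCy lemmatization to the same tokens
doc = nlp(" ".join(tokens))
print(" ".join([token.lemma_ for token in doc]))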
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
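# A minimal sketch applying the NLTK lemmatizer to the same tokens
lemmatizer = WordNetLemmatizer()
print(" ".join([lemmatizer.lemmatize(word) for word in tokens]))
# >> fabric fabrication car car computation computer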
import re
from nltk . corpus import stopwords
from nltk import word_tokenize
text = """
Hi all! I saw there was a big snake at https://xyz.he.com.
Come check out the big python snake video!!!!
"""
stop_words = stopwords.words("english")
url_cleaned = re.sub(r"https?://\S+|www\.\S+", "", text)
punct_cleaned = re.sub(r"[^\w\s]", "", url_cleaned)
tokens = word_tokenize(punct_cleaned.lower())
stop_removed = [
    word
    for word in tokens
    if word not in stop_words
]
print ( stop_removed )
# >> [ ' hi ', ' saw ', ' big ', ' snake ', ' come ',
# >> ' check ', ' big ', ' python ', ' snake ', ' video ']
You can further remove common words in your dataset associated with greetings
that do not convey meaning for your application. All the code used in this section
can be found in the notebook section3/preprocessing.ipynb in the GitHub location.
Applying data cleaning and preprocessing also reduces the size of the data samples
by retaining only the meaningful components. This in turn reduces the vector size
during numerical transformations of your data, which we will discuss next.
3.2 VISUALIZATION
The most popular library in Python for representing text is wordcloud [127]. Word
cloud allows you to generate visualizations on a body of text, where the frequency of
words/phrases is correlated with the size of the word/phrase on the plot along with
its opacity. Figure 3.1 shows an example of the word cloud visualization. You can
install this library using the following install command in a Jupyter notebook. You
will also need matplotlib for creating word cloud visualizations.
! pip install wordcloud
! pip install matplotlib
from wordcloud import WordCloud

wc = WordCloud(
    mode="RGBA",
    collocations=False,
    background_color=None,
    width=1500, height=1000
)
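import matplotlib.pyplot as plt

# Generate and display the word cloud for a sample text (the text is an assumption)
wordcloud = wc.generate("natural language processing in the real world")
plt.imshow(wordcloud)
plt.axis("off")
plt.show()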
Another helpful visualization library is scattertext, which compares term usage across
two categories of documents in an interactive plot. You can install this library using
the following command.
! pip install scattertext
html = produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    metadata=corpus.get_df()['speaker'],
    transform=Scalers.dense_rank
)
open('./demo_compact.html', 'w').write(html)
Collecting or labeling more data is not always feasible. Another way to increase
the size of your text dataset is using some text manipulation hacks.
Data augmentation refers to artificially synthesizing data samples based on the
samples present.
Data augmentation is a popularly used technique for images. For images, simply
rotating an image, replacing colors, adding blurs/noise, and such simple modifications
help generate new data samples. For text, the problem is a bit more challenging.
Popular techniques include word replacements in the text. However, replacing certain
words can at times completely change the context of a sentence. Furthermore, not
every word is replaceable by another or has a synonym. Nonetheless, it serves as a
popular technique to augment text data and works well for many cases.
A quick note before we dive into data augmentation techniques. The approaches
discussed here have solved problems that many individuals have faced while trying
to augment the text. While a technique may work for someone, it may not apply
to the data you are dealing with. It is recommended to tailor a data augmentation
approach based on the data available and your understanding of it.
The Python libraries pyAugmentText2 , nlpaug3 , and TextAugment4 contain im-
plementations for many data augmentation methods. Below are a few techniques that
have been adopted for data augmentation on text.
Common replacement techniques include swapping words with synonyms or with similar
words, and swapping named entities with other entities of the same type, producing
variants such as ‘... Kohler went to Zurich.’, etc. Code samples on extracting such
entities are discussed in Chapter 7 (Section 7.1).
4. Back translation
Back translation refers to translating text to another language and then trans-
lating it back to the original language. The results produced can give you a
different way of writing the same sentence that can be used as a new sample.
Language translation libraries are discussed in Section V with code samples.
Figure 3.3 shows an example of how the sentence changes using Google Trans-
late.
5. Adding intentional noise
This can be done by replacing target words with close but different spellings.
Introducing changes in spelling based on the keys next to each other on a
QWERTY keyboard are common practices.
Other advanced techniques include active learning [1], snorkel [10], and easy data
augmentation (EDA) [192]. [113] is good further reading material on data augmentation.
You will come across the term vector several times. Let’s quickly summarize what
a vector is before we proceed.
Vectors are a foundational element of linear algebra. Vectors are used throughout
the field of machine learning. A vector can be understood as being a list of numbers.
There are multiple ways to interpret what this list of numbers is. One way to think
of the vector is as being a point in a space (we’ll call this the vector space). Then
this list of numbers is a way of identifying that point in space, where each value in
the vector represents a dimension. For example, in 2-dimensions (or 2-D), a value
on the x-axis and a value on the y-axis gives us a point in the 2-D space. Similarly,
a 300-length vector will have 300 dimensions, which is hard to visualize.
3.4.1 Encoding
Label encoding
Label encoding is a method to represent categorical features as numeric labels.
Each of the categories is assigned a unique label.
Before label encoding:
name grade
..   A
..   B
..   C
After label encoding:
name grade
..   1
..   2
..   3
from sklearn.preprocessing import LabelEncoder

lenc = LabelEncoder()
x = ["A", "B", "C", "B", "A"]
x_enc = lenc.fit_transform(x)
In order to perform one hot encoding in Python, you can use the OneHotEncoder
from sklearn.
from sklearn . preprocessing import OneHotEncoder
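import numpy as np

# Reshape the categories into a 2-D column, as OneHotEncoder expects
x = np.array(["A", "B", "C", "B", "A"]).reshape(-1, 1)
ohe = OneHotEncoder()
x_enc = ohe.fit_transform(x).toarray()
print(x_enc)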
Let’s assume you have a simple linear model (weight * input -> output) where a
weight multiplied by your input is used to select a threshold for the possible outputs.
Let’s consider school grades for two subjects as the input, and the output as ‘pass’ or
‘fail’. For simplicity, let’s assume that the weights are 1 for both subjects. Let’s label
encode the grades to convert data to numeric form.
1 * A + 1 * C = pass
1 * B + 1 * D = pass
1 * E + 1 * E = fail
1 * D + 1 * E = pass
1 * F + 1 * D = fail
1 * F + 1 * F = fail
Representing grades as labels A, B, C, D, E, F = 1, 2, 3, 4, 5, 6 yields equations
as follows.
1 * 1 + 1 * 3 = 4 = pass
1 * 2 + 1 * 4 = 6 = pass
1 * 5 + 1 * 5 = 10 = fail
1 * 4 + 1 * 5 = 9 = pass
1 * 6 + 1 * 4 = 10 = fail
1 * 6 + 1 * 6 = 12 = fail
This helps us determine the threshold of 10: a score >= 10 leads to the output ‘fail’.
Grade A is higher than B, and B is higher than C. If a similar ordering is preserved
with label encoding, label encoding can be a good choice.
Hash vectorizer
In general, a hash function is any function that can be used to map data of
arbitrary size to fixed-size values. A hash vector in NLP is produced when term
frequency counts are passed through a hash function that transforms the collection
of documents into a sparse numerical matrix. This sparse matrix holds information
regarding the term occurrence counts.
One advantage over a count vectorizer is that a count vector can get large if the
corpus is large, whereas a hash vectorizer stores tokens as numerical values rather
than strings. The disadvantage of a hash vectorizer is that the features can't be
retrieved once the vector is formed.
from sklearn.feature_extraction.text import HashingVectorizer
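# A minimal sketch; the n_features value is an assumption
x = ["i like nlp", "nlp is fun", "learn and like nlp"]
vectorizer = HashingVectorizer(n_features=8)
hash_x = vectorizer.fit_transform(x)
print(hash_x.toarray())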
TF-IDF
TF-IDF stands for term frequency–inverse document frequency and is computed as
TF-IDF(t, d) = TF(t, d) * IDF(t)
where,
TF(t, d) = no. of times term t occurs in document d
IDF(t) = ln((1 + n)/(1 + df(d, t))) + 1
n = no. of documents
df(d, t) = no. of documents containing term t
x = [ " i like nlp " , " nlp is fun " , " learn and like nlp " ]
vectorizer = TfidfVectorizer ()
vectorizer . fit ( x )
tfidf_x = vectorizer . transform ( x )
import numpy as np
import pandas as pd
import scipy
from nltk . tokenize import word_tokenize
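# A sketch of a document-level co-occurrence matrix; the corpus below is an
# assumption chosen to be consistent with the matrix printed next
docs = ["blue sky", "blue sky clouds", "green grass", "green grass forest"]
tokens_per_doc = [word_tokenize(doc) for doc in docs]
vocab = sorted(set(w for toks in tokens_per_doc for w in toks))

# Binary document-term matrix
X = np.array(
    [[1.0 if w in toks else 0.0 for w in vocab] for toks in tokens_per_doc]
)

# Word-by-word co-occurrence counts across documents
coocc = X.T @ X
df_coocc = pd.DataFrame(coocc, index=vocab, columns=vocab)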
print ( df_coocc )
# >> blue clouds forest grass green sky
# blue 2.0 1.0 0.0 0.0 0.0 2.0
# clouds 1.0 1.0 0.0 0.0 0.0 1.0
# forest 0.0 0.0 1.0 1.0 1.0 0.0
# grass 0.0 0.0 1.0 2.0 2.0 0.0
# green 0.0 0.0 1.0 2.0 2.0 0.0
# sky 2.0 1.0 0.0 0.0 0.0 2.0
FIGURE 3.4 Relationships between words using distances between word embeddings.
Once you have word embeddings, these can be used in the input to train a ma-
chine learning model. They can also be used to determine the relationship between
words by calculating the distance between their corresponding vectors. Word em-
beddings capture the meanings of words, semantic relationships, and the different
contexts. Using word embeddings, Apple the company and apple the fruit can be
distinguishable. When solving word-pair analogies, e.g., ‘queen’ -> ‘king’, ‘woman’ ->
?, word embeddings can be used to compute the difference between the vectors for ‘king’
and ‘queen’, and then find the word whose vector differs from ‘woman’ in a similar way.
See Figure 3.4.
Models
Word embedding models can be generated using different methods like neural net-
works, co-occurrence matrix, probabilistic algorithms, and so on. Several word em-
bedding models in existence include Word2Vec5 , fastText6 , Doc2Vec7 , GloVe embed-
ding8 , ELMo9 , transformers10 , universal sentence encoder [44], InferSent [52], and
5 https://radimrehurek.com/gensim/models/word2vec.html
6 https://ai.facebook.com/tools/fasttext/
7 https://radimrehurek.com/gensim/models/doc2vec.html
8 https://nlp.stanford.edu/projects/glove/
9 https://paperswithcode.com/method/elmo
10 https://www.sbert.net/docs/pretrained_models.html
Open-AI GPT.11 Let’s look at the implementation of various of them using Python.
We’ll also leverage these in Section V for various tasks.
Word2Vec
Word2Vec is a popular word embedding approach. It consists of models for gener-
ating word embeddings. These models are shallow two-layer neural networks having
one input layer, one hidden layer, and one output layer. Word2Vec utilizes two models
within.
• Continuous bag of words (CBOW)
CBOW predicts a target word from its surrounding context words. A sentence is divided
into groups of n words. The model is trained by sliding the window of n words.
• Skip gram
Skip gram works the other way round. It predicts the surrounding context words
for an input word.
The main disadvantage of Word2Vec is that you will not have a vector representing
a word that does not exist in the corpus. For instance, if you trained the model on
biological articles only, then that model will not be able to return vectors of unseen
words, such as ‘curtain’ or ‘cement’ .
You can produce Word2Vec embeddings using the library Gensim or spaCy.
spaCy offers many built-in pre-trained models which form a convenient way to
get word embeddings quickly. spaCy offers these models in several languages.
The most popularly used models for the English language are en_core_web_sm,
en_core_web_md, en_core_web_lg, en_core_web_trf.
spaCy parses blobs of text and seamlessly assigns word vectors from the loaded models
using the tok2vec component. For any custom corpus that varies vastly from web
documents, you can train your own word embeddings model using spaCy.
! pip install spacy
! python -m spacy download "en_core_web_sm"
11
https://openai.com/api/
import spacy
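# A minimal sketch (the model name and text are assumptions) to obtain word vectors
nlp = spacy.load("en_core_web_sm")
doc = nlp("natural language processing")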
doc . vector
spaCy offers pre-trained models. Gensim does not provide pre-trained models for
word2vec embeddings. There are models available online to download for free that
you can use with Gensim, such as the Google news model12 .
! pip install gensim
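# A minimal sketch of training Word2Vec on a tiny tokenized corpus;
# the corpus and parameters below are assumptions
from gensim.models import Word2Vec

tokens_doc = [
    ["I", "like", "nlp"],
    ["nlp", "and", "machine", "learning"]
]
model = Word2Vec(
    tokens_doc, vector_size=20, window=1, min_count=1, workers=5
)
print(model.wv["nlp"])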
fastText
fastText was developed by Facebook. This architecture considers each character
in a word while learning the word’s representation.
The advantage of fastText over Word2Vec is that you can get a word represen-
tation for words not in the training data/vocabulary with fastText. Since fastText
uses character-level details on a word, it is able to compute vectors for unseen words
containing characters it has seen before. One disadvantage of this method is that
unrelated words containing similar characters may end up close in the vector space
without semantic closeness. For example, words like ‘love’, ‘solve’, and ‘glove’ share
many characters (‘l’, ‘o’, ‘v’, ‘e’) and may all end up close together in the vector
space.
tokens_doc = [
    ['I', 'like', 'nlp'],
    ['nlp', 'and', 'machine', 'learning']
]
from gensim.models import FastText

fast = FastText(
    tokens_doc,
    vector_size=20,
    window=1,
    min_count=1,
    workers=5,
    min_n=1,
    max_n=2
)

12 https://code.google.com/archive/p/word2vec/
Doc2Vec
Doc2Vec is based on Word2Vec, except it is suitable for larger documents.
Word2Vec computes a feature vector for every word in the corpus, whereas Doc2Vec
computes a feature vector for every document in the corpus. The notebook linked in
the footnote13 is an example of training Doc2Vec.
GloVe
GloVe stands for global vectors. The GloVe model trains on co-occurrence counts of
words and produces vectors by minimizing the least squares error.
Each word in the corpus is assigned a random vector. If two words are used
together more often, i.e., they have a high co-occurrence, then the vectors of those
words are moved closer in the vector space. After various rounds of this process, the
vector space representation approximates the information within the co-occurrence
matrix. In mathematical terms, the dot product of two words becomes approximately
equal to the log of the probability of co-occurrence of the words. This is the principle
behind GloVe.
GloVe vectors treat each word as a single entity, without considering that the same
word can have multiple meanings. The word ‘bark’ in ‘a tree bark’ will have the same
representation as in ‘a dog bark’.
Since it is based on co-occurrence, which needs every word in the corpus, glove
vectors can be memory intensive based on corpus size.
Word2Vec, skip-gram, and CBOW are predictive and don’t account for scenarios
where some context words occur more often than others. They capture local context
rather than global context, whereas GloVe vectors capture the global context.
emmbed_dict = {}
with open('/content/glove.6B.200d.txt', 'r') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], 'float32')
        emmbed_dict[word] = vector

13 https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
ELMo
ELMo is a deep contextualized word representation model. It considers the com-
plex characteristics of words and how they vary across different contexts. Each term
is assigned a representation that is dependent on the entire input sentence. These
embeddings are derived from a Bi-LSTM model. We’ll go over Bi-LSTM later in
Chapter 4 (Section 4.2.2.4).
ELMo can handle words with different contexts used in different sentences, which
GloVe is unable to. Thus the same word with multiple meanings can have different
embeddings.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()
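# A hypothetical sketch of the remaining setup: the TF Hub module URL and the
# two sentences are assumptions (in both, "watch" is the token inspected below)
import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/3", trainable=False)
sentences = [
    "I like to watch movies",      # "watch" is at index 3
    "I bought a new wrist watch"   # "watch" is at index 5
]
embeddings = elmo(sentences, signature="default", as_dict=True)["elmo"]

sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())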
print('Word embeddings for the word "watch" in the 1st sentence')
print(sess.run(embeddings[0][3]))
print('Word embeddings for the word "watch" in the 2nd sentence')
print(sess.run(embeddings[1][5]))
Transformers
Since the past few years, there has been heavy research on transformer-based
(neural network architecture) models that are suitable for many tasks, one being
generating word embeddings. We’ll dig into transformers in Chapter 4 (Section 4.2.3).
We will also learn more about the BERT (Bidirectional Encoder Representations from
Transformers) model.
14
https://tfhub.dev/
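A minimal sketch of generating the embeddings printed below with the sentence-transformers library; the model name and sentences are assumptions.
! pip install sentence-transformers
from sentence_transformers import SentenceTransformer

sentences = ["I like NLP", "NLP is fun"]
model = SentenceTransformer("bert-base-nli-mean-tokens")
sentence_embeddings_BERT = model.encode(sentences)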
print(sentence_embeddings_BERT)
There are many models offered with sentence-transformers15 that can be used
to generate embeddings. Different models are suitable for different applications.
You can also use the library transformers to get numerical features from text
as follows.
from transformers import pipeline
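# A hypothetical sketch: a feature-extraction pipeline returns token-level
# embeddings for the input text
feature_extractor = pipeline("feature-extraction")
feature = feature_extractor("I like NLP")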
print ( feature )
All the code for numerical feature generation demonstrated above can be found in
the notebook section3/features.ipynb.
As we have seen, there are several options for extracting word embeddings using
a pre-trained model. It is also possible to train a custom model on your own data.
15
https://www.sbert.net/docs/pretrained_models.html
The models you try should depend on your data and applica-
tion. For instance, if you have strings representing finite cat-
egories, then using one-hot encoding will make sense. If you
have sentences, then using a count vectorizer, hash vector-
izer, TF-IDF, or word embeddings could be good solutions.
Different models are trained on different datasets and have
some advantages and drawbacks as we discussed above. A
model trained on data similar to your data can work well.
It also depends on your end goal. If you want to get words
similar to input words, using word embeddings will make the
job simpler.
Often, there is no one right answer. It is common prac-
tice to try a few different transformations and see which one
works better. For instance, trying TF-IDF and a couple of
word embedding models, then comparing the results of each,
can help with selecting the feature generation method while
creating a text classification model. The comparison can in-
volve evaluating which model yields better results when used
with a fixed classifier model and/or how much computation
time and resources are required. For instance, word embed-
dings are more complex than TF-IDF as they use a model to
generate the numerical representation.
CHAPTER 4
Data Modeling
that can be used out of the box. One such library containing implementations of
multiple distance metrics is pyStringMatching1 .
from pyStringMatching import matcher
1
https://github.com/jsingh811/pyStringMatching
return soundex
You can also use the library fuzzy. Some users have reported errors using this tool
of late, hence we wanted to share the above implementation as well.
!pip install fuzzy

import fuzzy

soundex = fuzzy.Soundex(4)
soundex('fuzzy')
# >> F200
The Euclidean distance between two points is the length of the path connecting
them. This distance is useful when the length of text plays a role in determining
similarity. Note that if the length of a sentence is doubled by repeating the same
sentence twice, the Euclidean distance will increase even though the two versions
have the same meaning. A popular open-source library containing this implementa-
tion is sklearn.
from sklearn.metrics.pairwise import euclidean_distances
[[0.]]
[[3.74165739]]
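The two outputs above come from the book's notebook. A minimal sketch of how euclidean_distances can be applied to count-vectorized sentences (the example sentences here are assumptions):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

sents = [
    'I love ice cream',
    'I love ice cream I love ice cream I love ice cream'
]
vectors = CountVectorizer().fit_transform(sents)

# A sentence compared with itself has distance 0; repeating the sentence
# increases the distance even though the meaning does not change.
print(euclidean_distances(vectors[0], vectors[0]))
print(euclidean_distances(vectors[0], vectors[1]))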
Cosine distance
Cosine distance is the most popularly used metric for measuring distance when
the differences in document lengths (magnitude of the vectors) do not matter. Many
libraries in Python contain the implementation of cosine distance. The cosine distance
between ‘I love ice cream’ and ‘I love ice cream I love ice cream I love ice cream’ will
be 0 because the occurrences of terms within each sample follow the same distribution.
sklearn can be used to compute cosine similarity.
[[1.]]
[[1.]]
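A minimal sketch (sentences assumed) that reproduces this behavior with sklearn:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sents = [
    'I love ice cream',
    'I love ice cream I love ice cream I love ice cream'
]
vectors = CountVectorizer().fit_transform(sents)

# The term distributions match, so cosine similarity is 1 (cosine distance 0)
# regardless of how many times the sentence is repeated.
print(cosine_similarity(vectors[0], vectors[1]))
# >> [[1.]]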
Jaccard index
The Jaccard index is another metric that can be computed by calculating the
number of words common between two sentences divided by the number of words in
both sentences combined. This metric is helpful when word overlap is assumed to
reflect semantic similarity.
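A minimal sketch of computing the Jaccard index between two sentences (the helper function and example sentences are illustrative, not the book's notebook code):
def jaccard_index(sent1, sent2):
    # shared unique words divided by total unique words across both sentences
    words1 = set(sent1.lower().split())
    words2 = set(sent2.lower().split())
    return len(words1 & words2) / len(words1 | words2)

print(jaccard_index('I love ice cream', 'I love chocolate'))
# >> 0.4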
The code used in this section can be found in the notebook section3/distance.ipynb.
4.2 MODELING
In a general sense, a model is a representation of a system. In data science, a model
can be software that consists of logical operations being performed on the input data
resulting in an output. A simple example is checking whether the input is present
in the look-up list and returning its corresponding value as illustrated in Figure 4.1.
A model could also be based on machine-learning algorithms that follow a different
set of steps for processing input data. Below, we’ll describe some common types of
machine learning models for text data.
popularly used in the industry today and provide great solutions for a wide range of
problems.
Did you know that classic ML models are used far more commonly in industry,
while deep learning models find heavier use in research domains?
This is because the resources needed to train a model, along with the speed of get-
ting predictions out, are important considerations in the industry and are usually
constrained. In contrast, research teams are often funded to access extra resources
for their projects. For example, if a 2% loss in accuracy means the model size can
be reduced to half, it may be a preferable solution. If you are an early data science
professional in the industry or currently interviewing, you may observe that the
focus in many data science interview assignments is not on achieving the highest
accuracy, but on your overall approach, considerations, and thought process. Invest-
ment in compute resources to get a small % increase in accuracy is more common in
research-front domains, such as the education sector and large organizations with
dedicated research teams.
Did you know that large language models (LLMs) developed in recent years, such
as GPT-3 [143] (not open-sourced, runs on OpenAI's API) and PaLM [151] (developed
by Google), have taken weeks to months to train, with training costs in the millions
of dollars? The recent model BLOOM [33] took more than three months to complete
training on a supercomputer and consists of 176 billion parameters. It was trained
using $7 million in public funding. However, that is the state-of-the-art for language
models and is not what industries adopt for common use cases.
4.2.1.1 Clustering
Clustering refers to organizing data into groups (also called clusters or cohorts). In
machine learning, clustering is an unsupervised algorithm, which means that which
cluster a data sample belongs to is not known in advance and the algorithm tries to
establish clusters on its own. The number of clusters an algorithm will divide the
data into is up to the user. Often, there is some experimentation involved in trying
a few different numbers of clusters. The library sklearn can be used to train clustering models.
At a high-level, clustering can be of a few different types. As illustrated in Figure
4.3, hard clustering separates the data samples into clusters without any overlap
between the clusters. This means that every data sample can only belong to one
cluster. Soft clustering, on the contrary, assumes that a data sample can be a part of
multiple clusters and there is no perfect separation between the clusters. Hierarchical
clustering forms clusters in a hierarchical manner where the clusters can end up in
one root. Any clustering that is not hierarchical is called flat or partitional clustering.
Any clustering algorithm can be applied to text data’s numerical representations
(also called numerical features). Let’s look at some popular ones below.
• K-means2
K-means is a hard, flat clustering method. It works by assigning k random
points in the vector space as the initial ‘means’ (arithmetic means) of the k
clusters. Then, it assigns each data point to the nearest cluster ‘mean’. The
‘mean’ is then recalculated based on the assignments, followed by reassignment of
2
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
the data points. This process goes on until the cluster ‘means’ stop changing. The
resource in footnote 3 provides more details on k-means.
Other vector space-based methods for clustering include DBSCAN [65] which
favors densely populated clusters. Another method is called expectation maxi-
mization (EM) [138] which assumes an underlying probabilistic distribution for
each cluster.
Here's what an implementation in Python looks like.
from sklearn.cluster import KMeans
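Continuing from the import above, a minimal sketch (the sample documents and number of clusters are assumptions) of fitting a k-means model on TF-IDF features:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'I love ice cream',
    'ice cream is great in summer',
    'the stock market fell today',
    'stocks are volatile'
]
X = TfidfVectorizer().fit_transform(docs)

kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)
print(kmeans.labels_)  # cluster assignment for each document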
lda.fit(X)
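The lda object above belongs to a topic modeling example whose preceding setup is not shown here; a minimal sketch (documents and number of topics assumed) using sklearn's latent Dirichlet allocation:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'I love ice cream',
    'ice cream is great in summer',
    'the stock market fell today',
    'stocks are volatile'
]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)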
Some other topic modeling algorithms include latent semantic analysis (LSA)
[87] and probabilistic latent semantic analysis (PLSA) [94].
• Brown clustering
Brown clustering is a hierarchical clustering method. The underlying method
revolves around the distributional hypothesis. A quality function is used to
describe how well the surrounding context words predict the occurrence of the
words in the current cluster. This is also called mutual information.
More on Brown clustering can be found in [81].
Other approaches include graph-based clustering (also known as spectral clus-
tering). Examples include Markov chain clustering [89], Chinese whispers [32],
and minimal spanning tree-based clustering [135].
4.2.1.2 Classification
Classification is a supervised learning technique. Classification models require labeled
input training data containing data samples and the corresponding label/class. The
model then tries to learn from the known input-output relationships and can be used
to classify new/unseen data samples.
There exist many classification models. The list below contains some of the pop-
ular ones used in NLP. The library sklearn can be used for training a model using
these algorithms.
• Naive Bayes5
A Naive Bayes model is a probabilistic model based on Bayes' theorem [78].
It is scalable, requiring a number of parameters proportional to the number
of features. In classification, this model learns a probability for each text
document to belong to a class and then chooses the class with the maximum
probability as the classification. Such models are also called generative models.
There are three types of Naive Bayes models that are commonly used - Gaussian
Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes. For text
classification, Multinomial Naive Bayes is commonly used and is a popular
choice for establishing a baseline.
5
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X, y)
• Logistic regression6
Logistic Regression is a discriminative classifier that learns weights for individ-
ual features that can be linearly combined to get the classification. In other
words, this model aims to learn a linear separator between different classes.
The model assigns different weights to each feature value such that the sum of
the product of each feature value and weight decides which class the sample
belongs to. Despite the term ‘regression’ in its name, it is a classification model.
This is also a popular model for establishing baselines.
Further explanation can be found here [181].
Here’s how to build this model in Python.
from sklearn.linear_model import LogisticRegression
6
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
clf = LogisticRegression()
clf.fit(X, y)
• Support vector machine7
from sklearn.svm import SVC

clf = SVC()
clf.fit(X, y)
• Random forest8
The random forest algorithm constructs a multitude of decision trees [85]. In a
decision tree structure, leaves of the tree represent class labels and branches
represent features that result in those labels. An example can be seen in
7
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
8
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Figure 4.6. The output of the model is the class that is selected by most trees
for the sample. [134] contains step-by-step details on this algorithm.
Here’s how to build this model in Python.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X, y)
import sklearn_crfsuite
Pooling layer: This layer reduces the sample size of a feature map and, as a result,
reduces the number of parameters the network needs to process. The output is a
pooled feature map. Two common methods of doing this are max pooling and average pooling.
Fully connected layer: This layer flattens the data and allows you to perform
classification.
An activation function is a mathematical function used in neural networks to en-
able the modeling of complex, non-linear relationships between inputs and outputs.
The absence of an activation function would limit the network to a linear combina-
tion of its inputs. Common activation functions include sigmoid, hyperbolic tangent,
ReLU, and its variants, with each having its own set of properties that can be used
in different types of neural networks depending on the problem being solved. Often
CNNs will have a Rectified Linear Unit or ReLU after the convolution layer that acts
as an activation function to ensure non-linearity as data moves through the layers in
the network. ReLU does not activate all the neurons at the same time. Using ReLU
helps prevent exponential growth in the computation required to operate the neural
network.
Figure 4.7 shows a diagram of a basic CNN. There can also be multiple convolu-
tional+ReLU and pooling layers in models.
More details on CNN can be found in this article [178].
Sequential is the easiest way to build a model in Keras. It allows you to build
a model layer by layer. You can then use the add() function to add layers to the
model.
# Imports
from keras.layers import (
    Dense,
    Embedding,
    Conv1D,
    MaxPooling1D,
    Flatten
)
from keras.models import Sequential
We define some parameters we will use to create an embedding layer. The em-
bedding layer helps convert each word into a fixed-length vector of a defined size. The
embedding layer will also be used for other types of deep neural network models, such
as recurrent neural networks (RNNs), that we will build later in this section.
MAX_WORDS_IN_VOCAB = 20000  # Size of the vocabulary
EMBEDDING_DIM = 100  # Dimension of the dense embedding
MAX_SEQUENCE_LENGTH = 300  # Length of input sequences
model = Sequential()
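The embedding layer defined by the parameters above is typically added to the model before the convolutional layers; a minimal sketch, mirroring the embedding layer used in the RNN example later in this section:
# Embedding layer to map word indices to dense vectors
model.add(
    Embedding(
        MAX_WORDS_IN_VOCAB,
        EMBEDDING_DIM,
        input_length=MAX_SEQUENCE_LENGTH
    )
)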
Now, we can build our CNN. First we add a convolutional and pooling layer. We
can add multiple pairs of convolutional and pooling layers.
# Convolution layer
model.add(Conv1D(128, 5, activation="relu"))
# Pooling layer
model.add(MaxPooling1D(5))
Next, we add a Flatten layer to flatten the data to a single dimension for input
into the fully connected layer (dense layer).
# flattens the multi-dimension input tensors into a single dimension
# for input to the fully connected layer
model.add(Flatten())
# Output layer
# The softmax function turns a vector of N real values
# into a vector of N real values that sum to 1
model.add(Dense(2, activation="softmax"))
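Before training, the model also needs to be compiled with a loss function and an optimizer; a minimal sketch (the loss, optimizer, and metrics here are assumptions that mirror the compilation step shown for the RNN model later in this section):
# Model compilation (assumed settings)
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)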
Finally, to fit the model on your data, the following command can be used.
# Training
model.fit(xs, ys, epochs=20, verbose=1)
What is an epoch?
The number of epochs is the number of passes of the entire training dataset
through the training or learning process of the algorithm. Datasets are usually
grouped into batches (especially when the amount of data is very large).
We’ll build a CNN model on a sample dataset for demonstration of a text classi-
fication example in Chapter 8 (Section 8.3.2.2).
model = Sequential()
# embedding layer to map word indices to vectors
model.add(
    Embedding(
        MAX_WORDS_IN_VOCAB,
        EMBEDDING_DIM,
        input_length=MAX_SEQUENCE_LENGTH
    )
)
Next, we add a fully connected layer. In this layer, each neuron is connected to
the neurons of the preceding layer.
# Dense hidden layer
model.add(Dense(32, activation='relu'))
Finally, we add an output layer, compile the model and fit it on the data.
# Output layer
# The softmax function turns a vector of N real values
# into a vector of N real values that sum to 1
model.add(Dense(4, activation='softmax'))
# Model compilation
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam', metrics=['accuracy']
)
print(model.summary())
# Training
model.fit(xs, ys, epochs=20, verbose=1)
from keras.layers import Bidirectional, LSTM

model = Sequential()
# embedding layer to map word indices to vectors
model.add(
    Embedding(
        MAX_WORDS_IN_VOCAB,
        EMBEDDING_DIM,
        input_length=MAX_SEQUENCE_LENGTH
    )
)
# Bidirectional LSTM
model.add(Bidirectional(LSTM(64)))
# An output Dense layer (e.g., softmax over the classes) is typically added
# here before compiling.
# Model compilation
model.compile(
    loss="categorical_crossentropy",
    optimizer='adam', metrics=["accuracy"]
)
# Training
model.fit(xs, ys, epochs=20, verbose=1)
We’ll build a BiLSTM model for next word prediction classification in Chapter
11 (Section 11.2).
4.2.3 Transformers
You might have heard all the buzz around transformers in NLP. Transformer models
have achieved state-of-the-art status for many major NLP tasks in the past few years.
Transformer is a type of neural network architecture. But so is RNN, right? How
is a transformer different?
In NLP, before transformers, RNNs were the most prominent approach for developing
state-of-the-art models. As we saw in the previous section, an RNN takes in the
input data sequentially. If we consider a language translation task, an RNN will
take the input sentence to be translated one word at a time, and translate one word
at a time. Thus, the order of words matters. However, word-by-word translations
don't always yield an accurate sentence in a different language. So while this approach
can work well for next-word prediction models, it does not work well for language
translation. Since an RNN takes in data sequentially, processing large text documents
is hard. It is also difficult to parallelize to speed up training on large datasets. Extra
GPUs (a GPU, or graphics processing unit, is a specialized processing unit with
enhanced mathematical computation capability) don't offer much help in this case.
These are the drawbacks of RNNs.
On the contrary, transformers can be parallelized to train very large models.
Transformers are a form of semi-supervised learning. They are trained on unlabeled
data, and then fine-tuned using supervised learning for better performance. They
were initially designed for language translation.
1. Positional encoding
Let’s use language translation as an example. Each word in a sentence sequence
is assigned an ID. The order of the words in the sentence is stored in the data
rather than the structure of the network. Then, when you train a model on large
amounts of data, the network learns how to interpret the positional encodings.
In this way, the network learns the importance of word order from data.
This makes it easier to train a transformer than an RNN.
2. Attention
Attention is a useful concept. The first transformer model’s paper was titled
‘Attention is all you need’ [187].
As we learned before, transformers were originally built for language trans-
lation. Hence, we will use language translation examples to understand this
concept better as well.
The attention mechanism is a neural network structure that allows the text
model to look at every word in the original sentence while making decisions on
how to translate it.
Consider the sentence ‘The agreement on the European Economic Area was
signed in August 1992’. This sentence's word-by-word translation to French does
not yield a correctly formed French sentence. ‘the European Economic Area’
translates to ‘la européenne économique zone’ using a word-by-word translation,
which is not a correct way of writing that in French. One of the correct French
translations for that phrase is ‘la zone économique européenne’. In French, the
equivalent word for ‘economic’ comes before the equivalent word for ‘european’,
and there is gendered agreement between words: ‘la zone’ needs the translation
of ‘European’ to be in the feminine form.
So for successful translation using the attention concept, ‘european’ and ‘eco-
nomic’ are looked at together. The model learns which words to attend to in
this fashion on its own from the data. Looking at the data, the model can learn
about word order rules, grammar, word genders, plurality, etc.
3. Self attention
Self attention is the concept of running attention on the input sentence itself.
Learning from data, models build internal representation or understanding
of language automatically. The better the representation the neural network
learns, the better it will be at any language task.
For instance, ‘Ask the server to bring me my check’ and ‘I think I just crashed
the server’ both contain the word ‘server’, but the meaning is vastly different in
each sentence. This can be determined by looking at the context in each sentence:
‘server’ and ‘check’ point to one meaning of ‘server’, while ‘server’ and ‘crash’
point to another meaning of ‘server’.
Self attention allows the network to understand a word in the context of other
words around it. It can help disambiguate words and many other language
tasks.
Architecture
Transformers have two parts - encoder and decoder. The encoder works on input
sequences to extract features. The decoder operates on the target output sequence
using the features. The encoder has multiple blocks. The features that are the output
of the last encoder block become the input to the decoder. The decoder consists of
multiple blocks as well.
For instance, in a language translation task, the encoder generates encodings that
determine which parts of the input sequence are relevant to each other and passes this
encoding to the next encoder layer. The decoder takes encodings and uses derived
context to generate the output sequence. Transformers run multiple encoder-decoder
sequences in parallel. Further information can be found at [225].
1. Autoencoding models
Autoencoding models are pre-trained by corrupting the input tokens and then
trying to reconstruct the original sentence as the output. They correspond to the
encoder of the original transformer model and have access to the complete input
without any mask. Those models usually build a bidirectional representation
of the whole sentence. Common applications include sentence classification,
named entity recognition (NER), and extractive question answering.
(a) BERT
BERT [62] stands for Bidirectional Encoder Representations from Trans-
formers. One of the most popular models, BERT is used for many NLP
tasks such as text summarization, question answering systems, text
classification, and more. It is also used in Google Search and many ML
tools offered by Google Cloud. It is a transformer-based machine learning
technique for NLP pre-training and was developed by Google. BERT over-
comes the limitations of RNN and other neural networks around handling
long sequences and capturing dependencies among different combinations
of words in long sentences. BERT is pre-trained on two different, but
related, NLP tasks - Masked Language Modeling and Next Sentence Pre-
diction. Masked Language Modeling training aims to hide a word in a
sentence and has the algorithm predict the masked/hidden word based on
context. Next Sentence Prediction training aims to predict the relation-
ship between two sentences. BERT was trained using 3.3 billion words total
with 2.5 billion from Wikipedia and 0.8 billion from BooksCorpus [97].
(b) DistilBERT
DistilBERT is a distilled version of BERT, smaller, faster, cheaper, and
lighter than BERT. It was built using knowledge distillation during the
pre-training phase that reduced the size of a BERT model by 40% while
retaining 97% of its language understanding capabilities and being 60%
faster [147].
(c) RoBERTa
RoBERTa [112] is a robustly optimized method for pre-training natural
language processing systems that improves on BERT. RoBERTa differs
from BERT in its masking approach. BERT uses static masking, which
means that the same part of the sentence is masked in each epoch. On the
contrary, RoBERTa uses dynamic masking where different parts of the
sentences are masked for different epochs. RoBERTa is trained on over
160GB of uncompressed text instead of the 16GB dataset originally used
to train BERT [152].
2. Autoregressive models
A statistical model is autoregressive if it predicts future values based on past
values. In language, autoregressive models are pre-trained on the classic lan-
guage modeling task of guessing the next token having read all the previous
ones. They correspond to the decoder of the original transformer model, and a
mask is used on top of the full sentence so that the attention heads can only
see what was before in the text, and not what’s after11 . Text generation is
the most common application.
11
https://huggingface.co/docs/transformers/model_summary
(a) XLNet
XLNet [203] is an extension of the Transformer-XL model (a transformer
architecture that introduces the notion of recurrence to the deep self-
attention network [56]) pre-trained using an autoregressive method where
the next token is dependent on all previous tokens. XLNet has an ar-
chitecture similar to BERT. The primary difference is the pre-training
approach. BERT is an autoencoding-based model, whereas XLNet is an
autoregressive-based model. XLNet is known to exhibit higher perfor-
mance than BERT. XLNet is also known for overcoming weaknesses of
BERT on tasks such as question answering, sentiment analysis, document
ranking, and natural language inference.
(b) GPT-2
GPT-2 [136] (generative pre-trained transformer model, 2nd generation)
was created by OpenAI in February 2019 and is pre-trained on a very
large corpus of English data in a self-supervised fashion. It is an autoregressive
model where each token in the sentence has the context of the previous
words. GPT-2 was to be followed by the 175-billion-parameter GPT-3,
revealed to the public in 2020. GPT-2 has been well known for tasks
such as translating text between languages, summarizing long articles, and
answering trivia questions. GPT-2 was trained on a dataset of 8 million
web pages. GPT-2 is open-sourced.
(c) GPT-3
GPT-3 (generative pre-trained transformer model - 3rd generation) [38]
is an autoregressive language model that produces text that looks like it
was written by a human. It can write poetry, draft emails, write jokes, and
perform several other tasks. The GPT-3 model was trained on 45TB of text
data, including Common Crawl, webtexts, books, and Wikipedia. GPT-3
is not open-source. It is available via OpenAI’s API, which is reported
to be expensive as of 2022. The chat-based tool, ChatGPT, is based on
GPT-3.
3. Seq-to-seq models
These models use both the encoder and the decoder of the original trans-
former. Popular applications include translation, summarization, gener-
ative question answering, and classification.
(a) T5
T5 [226] is a text-to-text transfer transformer model which is trained on
unlabeled and labeled data, and further fine-tuned to individual tasks for
language modeling. T5 comes in multiple versions of different sizes; t5-
base, t5-small (smaller version of t5-base), t5-large (larger version of t5-
base), t5-3b, and t5-11b. T5 uses a text-to-text approach, where tasks such as translation, classification, and question answering are cast as feeding the model text as input and training it to generate target text as output.
We’ll be using several of these models in Section V for different popular applica-
tions.
For each of the different types of models above, one may perform better than the
other based on the dataset and task. Table 4.1 contains a summary of applications
of these models.
An example of the pre-training and fine-tuning flow can be seen in Figure 4.9.
The Hugging Face transformers14 library in Python provides thousands of pre-
trained and fine-tuned models to perform different tasks such as classification, infor-
mation extraction, question answering, summarization, translation, and text genera-
tion, in over 100 languages. Thousands of organizations pre-train and fine-tune mod-
els and open-source many models for the community. Some of these organizations
include the Allen Institute for AI, Facebook AI, Microsoft, Google AI, Grammarly,
and Typeform.
You can use the transformers library's pipeline() function to use these models.
Here's an example.
from transformers import pipeline
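A minimal sketch of what a complete pipeline() call might look like; the task and input text here are assumptions for illustration:
# Create a pipeline for a task; a default pre-trained model is downloaded
classifier = pipeline("sentiment-analysis")
print(classifier("I love natural language processing!"))
# e.g. [{'label': 'POSITIVE', 'score': ...}]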
14
https://huggingface.co/docs/transformers/index
FIGURE 4.9 Examples of the pre-training and fine-tuning flow in transformer models.
To select a specific model for these tasks, the pipeline() takes a keyword ar-
gument model. Depending on your application, you can search for domain-specific
models. For example, nlpaueb/legal-bert-base-uncased is a BERT-based legal
domain model fine-tuned for fill-mask tasks. You can find all available models on the
Hugging Face model hub15. Use the tag of the model you want to use as the value of
the keyword argument model.
There are several popular transformer models apt for different applications. We
will demonstrate these for different applications in Section V. You can also fine-tune
models on a custom dataset for several tasks using a pre-trained model16 . We will
demonstrate an implementation for fine-tuning using a custom dataset in Chapter 7
(Section 7.1.1.4) for NER.
Hugging Face transformers requires either PyTorch or TensorFlow to be installed,
since it relies on one of them as the backend, so make sure to have a working
version before installing transformers.
15
https://huggingface.co/models
16
https://huggingface.co/docs/transformers/training
Type of ML Model | Applications
Classic ML models | Well suited for training models for simple tasks in text classification, regression, and clustering.
CNNs | Well suited for training models on large datasets for text classification tasks where word order is not relevant.
LSTMs and BiLSTMs | Well suited for training models on large datasets for tasks where the order of words matters, such as next-word prediction and sentiment analysis.
Transformers | Well suited for simple classification tasks as well as complex tasks such as text generation, language translation, text summarization, and question answering. For practical use in the industry, many pre-trained and fine-tuned models can be leveraged.
This short course by Hugging Face17 is a great resource to learn further about
transformers.
Limitations
The biggest limitation of transformer models is that they are trained on large
datasets scraped from the web, which unfortunately include the good and the bad of
the Internet. There have been reports of gender and racial biases in the results of
these models. Fine-tuning the pre-trained models on your dataset won't make the
intrinsic bias disappear. Hence, be sure to test your models thoroughly before de-
ploying them in your organization.
In this section, we covered classic machine learning models, CNNs, RNNs, and
transformers. Table 4.2 contains popular applications these models are used for in
NLP.
Hyperparameters are configuration values for a model that control the learning algorithm behind it. For example, the k in kNN, the number of clusters in a clustering
algorithm, and the number of hidden units, epochs, and number of nodes in a neural
network model are some examples of hyperparameters. The tool you use to implement
the different models in Python specifies what these hyperparameters are. Tables 4.3
and 4.4 contain a list of common hyperparameters for some popular classic ML clas-
sification models and deep learning models in language tasks and links to resources
with further details.
Hyperparameter tuning is a popular practice and refers to the task of fine-tuning
your model with parameters that yield the best results. The evaluation of results is
discussed in the next section. We’ll discuss more on hyperparameter tuning in the
next section as well and share an example using kNN classifier.
A model's performance is commonly evaluated using metrics such as accuracy. In this
section, we'll discuss model evaluation practices, what these metrics are, and how to
know which metrics make the best sense for your model.
While evaluating the model, learning about how it predicts on the very same data
it was trained on is not sufficient. The training data is what the model has already
seen, thus it may be better at predicting the training data as opposed to unseen
samples. A common practice is to break your data into training and testing samples.
The training samples go into training your model, and the testing samples are saved
for evaluating your model later. That way, you can verify the results of your model
on samples unseen during the training process. Some practitioners split the data into
three groups instead - training, testing, and validation samples. In that case, the
validation sample set is commonly used for hyperparameter tuning of the model.
The code used in this section below can be found in the notebook
section3/evaluation-metrics.ipynb.
Cross-validation
Another recommended practice is performing cross-validation. In this technique,
your data is split between training and testing samples x times (x can be user-defined,
giving x-fold cross-validation). A fold refers to one run. Five folds means that training
and testing will be done five times on different splits of the data. In each fold, the
testing samples can be different from all other folds, or you can turn that off and pick
a random sample each time. Evaluating the x models and their results helps you
understand whether you have enough data and whether your model has any bias or
high-variance issues.
Here’s how to implement it using sklearn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# assumed usage: 5-fold cross-validation on a sample dataset
X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)
FIGURE 4.10 An example of a confusion matrix for a binary classification case and
a multi-class classification case. TP stands for true positives. TN stands for true
negatives. FP stands for false positives. FN stands for false negatives.
Confusion matrix
A confusion matrix can be useful in understanding model performance by each
class in a classification problem. Figure 4.10 shows what a confusion matrix looks like.
Adding up the actual columns will give you the number of samples for the respective
classes in the dataset.
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
print(confusion_matrix(y_true, y_pred))
4.3.1 Metrics
Important evaluation metrics include the following. The below metrics can be calcu-
lated per class or at an overall aggregate level.
• Accuracy/Recall
Accuracy is the measure of how correct your predictions are compared to the
truth/actual labels. The term recall refers to the accuracy of a particular label/class.
For a class, recall = TP / (TP + FN), i.e., correct classifications for the class divided
by the count of samples actually belonging to the class.
• Precision
Precision is a measure of how correct your predictions are compared to other
predictions. For a class, precision = TP / (TP + FP), i.e., correct classifications for
the class divided by the count of all samples classified as that class.
• F1
The F1 score is the harmonic mean of precision and recall. It is computed using
the following formula:
F1 = (2 * precision * recall) / (precision + recall)
The maximum possible value is 1.0, indicating perfect recall and precision. The
F1 score takes into account how the data samples are distributed among
the classes. For example, if the data is imbalanced (e.g., 90% of all players do
not get drafted and 10% do), the F1 score provides a better overall assessment
compared to accuracy.
Let's consider an example to understand precision and recall and why looking
at these metrics per class can be beneficial in understanding your model. Your model
has two classes with the same number of samples, x, each (call the sample sets
samples1 and samples2 for class 1 and class 2, respectively). Let's say your
model predicts class 1 for all samples.
Overall:
Recall = [predicted(class 1) in samples1 + predicted(class 2) in samples2] / [samples1 + samples2] = (x + 0) / (x + x) = 50%
Precision = [predicted(class 1) in samples1 + predicted(class 2) in samples2] / [all predicted(class 1) + all predicted(class 2)] = (x + 0) / (2x + 0) = 50%

Per class:
Recall class 1 = predicted(class 1) in samples1 / samples1 = x / x = 100%
Precision class 1 = predicted(class 1) in samples1 / all predicted(class 1) = x / 2x = 50%
Recall class 2 = predicted(class 2) in samples2 / samples2 = 0 / x = 0%
Precision class 2 = predicted(class 2) in samples2 / all predicted(class 2) = 0%
These indicators help understand model results for the different classes and bring
forward any bias that might exist.
from sklearn.metrics import precision_recall_fscore_support

y_true = ['lion', 'dog', 'tiger', 'lion', 'dog', 'tiger']
y_pred = ['lion', 'dog', 'lion', 'dog', 'dog', 'tiger']
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_true, y_pred, average='macro'  # assumed: macro averaging matches the output below
)
precision=0.7222222222222222
recall=0.6666666666666666
fscore=0.6555555555555556
Metric | Formula
Recall | correct classifications for a class / count of samples actually belonging to the class
Precision | correct classifications for a class / count of all samples classified with the class
F1 | (2 * precision * recall) / (precision + recall)
"""
Improvement in accuracy
"""
You can see how the accuracy changes with different values of n_neighbors.
Practitioners also plot the change in accuracy against the hyperparameter value to
visually understand the impact.
As we saw in Table 4.3 and Table 4.4, there can be many hyperparameters
to tune per model. Fortunately, there are many methods that can be leveraged to
hyperparameter-tune models. These include the following.
• Random search: Rather than an exhaustive search, this method randomly se-
lects combinations of parameters. It is known to do better than grid search for
deep learning models.
In the below example, we use sklearn to perform a grid search to find good
hyperparameters for a kNN classifier.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid_params = {
    "n_neighbors": [3, 5, 7, 10, 15, 20, 25, 35],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"]
}
gs = GridSearchCV(
    KNeighborsClassifier(),
    grid_params,
    cv=10,
    verbose=1
)
gs_results = gs.fit(X, y)

# best param and scores for your model can be obtained as follows
print(
    "Best k: ",
    gs_results.best_estimator_.get_params()["n_neighbors"]
)
print(gs_results.best_score_, gs_results.best_params_)
In the above grid search, there are 8 possibilities for n_neighbors, 2 possibilities
for weights, 2 possibilities for metric, and 10 cross-validations. The model is run 8 *
2 * 2 * 10 = 320 times to find the best hyperparameters.
The following code can be used to perform a random search instead.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid_params = {
    "n_neighbors": [3, 5, 7, 10, 15, 20, 25, 35],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"]
}
rs = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=grid_params,
    n_iter=10
)
rs_results = rs.fit(X, y)

# best param and scores for your model can be obtained as follows
print(
    "Best k: ",
    rs_results.best_estimator_.get_params()["n_neighbors"]
)
print(rs_results.best_score_, rs_results.best_params_)
Another option is to use the KerasTuner24 library. This library allows optimal
hyperparameter searching for machine learning and deep learning models. It helps
find kernel sizes, the learning rate for optimization, and other hyperparameters.
Here is an example.
!pip install keras-tuner==1.1.3

import keras_tuner
from tensorflow import keras

def build_model(hp):
    model = keras.Sequential()
    model.add(
        keras.layers.Dense(
            hp.Choice('units', [8, 16, 32]),  # choice of param values
            activation='relu'
        )
    )
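The build function above continues beyond what is shown; a minimal sketch of how it might be completed and used with a tuner (the output layer, loss, tuner type, and data variables here are all assumptions):
    # assumed completion of build_model
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# run a small random search over the hyperparameter choices (assumed setup)
tuner = keras_tuner.RandomSearch(build_model, objective='val_accuracy', max_trials=5)
tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))
best_model = tuner.get_best_models()[0]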
24
https://keras.io/keras_tuner/
Windup
In this section, we covered several concepts necessary to start building NLP appli-
cations. We discussed practical implementations of cleaning and standardizing text.
For most applications based on text data, cleaning the data is the biggest and most
time-consuming step. We looked at various libraries that can be used in Python to
achieve text preprocessing. Most preprocessing tasks can be solved using regex or
libraries such as NLTK and spaCy. Different text-based datasets can have different
noise elements within. It is a good practice to manually look at some data samples
to gauge the data quality and what cleaning steps it might require. If the task is
just to understand what’s in the text, you can directly resort to data visualization
or printing the most frequently occurring words. This is often the demand for many data
analytics-based tasks. We also discussed that simple yet effective data augmentation
techniques can come in handy when you are lacking data samples.
For building any predictive models, you need to further transform your text to
create numerical features. We discussed the various ways to get these features, includ-
ing logical operations on text, word frequency-based counts, and more advanced word
embedding techniques. Once generated, the features can then be used to find similar
words or for building machine learning models. For applications like finding similar
words, distance metrics can be used to compute similarities and differences between
words in a corpus. We went through the commonly used distance metrics in NLP.
The most popular one for finding context/semantic similarity is cosine distance. For
applications requiring modeling, we discussed classic machine learning models, deep
neural networks such as CNN, LSTM, and BiLSTM, and neural network-based trans-
former models. For practitioners working in the industry, classic machine learning
models are very popular, especially when building the first solution for a problem.
CNN is used when larger datasets are available to create better text classification
models. LSTM and BiLSTM models can be built for tasks where word orders are
important, such as next-word prediction models. Transformer models are trained on
very large datasets, which is not typically done for solving NLP tasks in industry do-
mains other than large research organizations. Many transformers-based pre-trained
and fine-tuned models are accessible with Hugging Face that are pre-trained on large
language datasets and can be used out-of-the-box for several NLP tasks, such as doc-
ument summarization or for extracting word embeddings to use with other models
for custom tasks.
• Social media
• E-commerce
• Insurance
• Finance
• Healthcare
• Law
• Real estate
• Supply chain
• Telecommunication
• Automotive
• Serious gaming
• Other popular applications: writing and email, home assistants, and recruiting
CHAPTER 5
background. Every text signal you leave as comments is also analyzed using NLP to
help content creators understand audience sentiment and demand.
Recommendations
Recommendation systems are not only common in the social media space, but
also in many other industries like retail and real estate that recommend items based
on your previous selections. Notice how often the suggestions on social media are
much like what you previously watched or searched for? What about when you have
FIGURE 5.2 Social media content recommendations on the right based on the currently
viewed content.
a typo in your search term, as in Figure 5.3? It still shows you the most relevant
results, usually including the item you meant to search for.
Image-to-text
Conversion of text within images to text of string-form using optical character
recognition (OCR) is of use to analyze and understand content within social media
posts such as memes.
Research
Social media data is used for several research efforts studying employability mea-
surement [20], suicidal risks [55], travel trend analysis [155]1 , audience analytics and
interest indexing [157], marketing optimization [158], and many more applications.
5.2 FINANCE
5.2.1 What is finance?
The finance industry is a broad range of businesses that surround everything to do
with managing money. Examples include banks, credit unions, credit card companies,
companies managing investment funds, stock brokerages, and so on.
1
https://github.com/jsingh811/pyYouTubeAnalysis/blob/master/samples/report/
travel_vlogs_report.pdf
5.3 E-COMMERCE
5.3.1 What is e-commerce?
E-commerce is the buying of goods and services on the Internet. There are different
types of e-commerce businesses, generally classified into three categories.
FIGURE 5.8 Results on an e-commerce website from search with a spelling error.
Now, when a tumbler with a straw is clicked on (Figure 5.9), you see in the ‘more
to consider’ section that a lot of other tumblers, most with a straw as in Figure 5.10,
show up, because that matches the kind of tumbler you first clicked on.
contained within the comment and product information. These classifications can
also be made available to the end user for sifting through relevant review comments.
Let’s say you’re searching for formal pants on an e-commerce website, and then
you click on a product. There are a lot of reviews, and a platform for customers to
rate and categorize their reviews, such as their product size, how they rate the ‘true
to size’ metric, etc. But then some categories are not filled in by the customer but are
embedded in their comment - such as size, fit, and color. The comment could contain
information about the size, fit, and/or color of the product, but there may not be
a straightforward way to find that out without reading through all the comments.
With NLP, these comments are categorized algorithmically so that the consumer is
able to get the right view of the attributes they might be interested in, insights they
would not otherwise have. This can be seen in Figure 5.11.
Sentiment analysis
Sentiment analysis helps businesses understand their product shortcomings by
analyzing reviews of customers who have purchased a product.
Chatbots
Amazon, Zara, and many such businesses offer their customers the option of
using chat to interact with customer service. When you start the chat, typically
there are a few questions that the chatbot asks you and tries to give you automated
answers before it transfers you to a customer service representative. Something really
basic like order status or return status can be easily communicated without actually
connecting you to a human representative. The chatbot can also guide you to answers
to frequently asked questions (FAQs) as seen in Figure 5.12. It is a popular use case
for creating quicker solutions where some of the answers can be presented to the user
without the need for human-to-human interactions, along with the round-the-clock
availability of such services.
Customer interaction analytics
Calls or chats happening between customers and customer service representatives
are analyzed to better solve consumer complaints and questions and make their ex-
perience better in the future. NLP helps analyze when a user may be frustrated or
happy to optimize the interaction. Furthermore, data generated from the chat is used
to train models that recommend responses that the customer service representatives
can use for chatting with customers.
Another area of analytics is the search queries used by users. This helps identify
trends and popular product types which can help inform stocking decisions.
Marketing
NLP helps inform marketing efforts by analyzing searches and identifying the
keywords that should be on different types of products for increased discoverability.
Language translation
Language translation enhances the quality of a business’s global reach. Language
translation services are built using NLP to translate text on websites in different
geographical regions.
FIGURE 5.11 Customer review section showcasing comment classification into types.
Chatbots are also built across the social media pages of businesses. For instance,
the Kayak chatbot is simple to engage with and easy to interact with. You type a
message to @Kayak within Facebook Messenger and the bot immediately responds
with an option to help you find a flight, hotel, rental car, and more. An example of
such a chatbot can be seen in Figure 5.4.
Personalized tour recommendation
User search preferences regarding their holidays can form an analytical data piece
for personalized recommendations. In the past decade, big data technologies have
allowed businesses to collect such information at scale and build personalized recom-
mendation systems. NLP tools aid in creating custom tour packages that fit the
individual's budget while providing them the desired experience [91].
Marketing
An important aspect of travel and hospitality includes marketing efforts. Under-
standing the consumer segments and their response to select offers and services aids in
structuring efficient marketing strategies. Any interaction with consumers in terms
of surveys and comments helps establish trends and create customer groups (also
called segments). Each customer segment can then be targeted in ways that are most
likely to appeal to the segment. For example, it was reported that after deploying
IBM Watson Ads conversational system, Best Western Hotels and Resorts achieved
a 48% incremental lift in visits [221]. This system delivered unique responses to each
customer.
Document analysis
NLP-based tools find use in document classification and helping technicians find
relevant information from complex databases of manuals. For instance, airline and
aircraft maintenance procedures can be significantly helped by NLP document anal-
ysis and search functionalities [92].
Furthermore, in the age of digitization, any handwritten notes can be converted
to text using NLP techniques.
Mosaic ATM2 is a Virginia, US-based company that provides AI-powered aviation
solutions. They use NLP to gather insights from text, voice, audio, image, and speech
to inform operational and strategic decision-making across any aerospace business
unit. Services provided include document anomaly detection, aircraft maintenance
using information extraction and classification, aviation safety report analysis, and
more. Their customers include United Airlines, Hawaiian Airlines, Delta, the United
States Navy, and NASA.
Comment sentiment analysis and categorization
When users leave negative comments about a hotel or flight, businesses need to
address the concerns of the individuals to maintain trust in their service quality.
Today, consumers make a lot of decisions based on past online reviews of a business.
NLP algorithms can classify comments into different sentiments and further bucket
them into topics. This allows for optimally sifting through user comments to address
concerns and analyze feedback.
Image-to-text
OCR is used in hospitality to convert receipts and invoices into digital records
that can be extracted for accounting and analytical purposes.
2
https://mosaicatm.com/aviation-natural-language-processing/
5.5 MARKETING
5.5.1 What is marketing?
Marketing is a broad industry associated with promoting or selling products and
services. It includes market research and advertising.
The four P's of marketing (product, price, place, and promotion) make up the
essential mix a company needs to market a product or service. The term marketing
mix was coined by Neil Borden, who was a professor of advertising at the Harvard
Graduate School of Business Administration [195].
Marketing is a need that applies to most industry verticals. One of the main
objectives is to identify ideal customers and draw their attention to a product or
service.
The target audience is identified using cookies and IP addresses. Cookies are es-
sentially text files in your browser that track the information you search for. An
IP address is like a house address for your computer that shows where you are lo-
cated. Both cookies and IP addresses together help advertisers reach you. This is how
searching for content in one place leads to related ads in another.
Many companies such as IBM (IBM Watson Advertising) and Salesforce [222]
leverage NLP for marketing-related offerings. They serve several clients in their mar-
keting efforts. For example, Salesforce Einstein’s predictions are leveraged by Marriott
hotels, furniture seller Room & Board, and others [223]. The Marketing domain in-
cludes many components that leverage NLP. Product and service verbiage, audience
targeting, and measurement of campaigns are a few examples.
The following list entails popular NLP-based applications in Marketing.
Topic extraction
Content categorization is a popular implementation of NLP for effective content
creation and targeting. Topics are extracted from the free-form text that your audience
is interested in, including the kinds of keywords they may be drawn towards, and
bucketed for categorical filtering. This not only informs about audience
interests but also aids in analysis reports and recommendations of what brands can
create that will resonate with their audience.
Sentiment analysis
Sentiment analysis on free-form text enables the understanding of consumer in-
teraction and reaction to a product or a service. Using this analysis, marketers are
able to structure their own product or service in a way to best serve the customers.
Audience identification
Audience identification is helpful for target messaging so that delivered content
resonates with consumers and is presented to them in the way that is most likely to
receive engagement. Since the audience interacts with a lot of text on the web, text
data forms a large part of identifying audiences and their affinities.
Creating buyer personas in marketing campaigns is based on the product or ser-
vices and defined by common traits of the people marketers want to reach. This definition
depends on the social, demographic, economic, and topical interests of the target
audience, for instance: males, 55+, living in North Carolina, and interested in baseball.
The interest in baseball is determined by related content searched for and browsed.
Not just the term ‘baseball’, but also other terms like ‘Aaron Judge’ convey interest
in the topic. NLP helps with establishing such relationships and classifications.
Chatbots
Chatbots are used in many companies to support marketing efforts. Answer-
ing basic questions, improving customer service, and analyzing user intent assists
in round-the-clock service, understanding consumers, and selling relevant products.
Customers are also targeted using chats. As we mentioned in the section on travel
and hospitality, Best Western Hotels and Resorts achieved a 48% incremental lift
in visits after deploying IBM Watson Ads [221]. This conversation system delivered
unique responses to each customer based on their questions.
Another use case is analyzing chat data and presenting advertisements based on
the identified topics and interest relevance.
Trend identification
Identification of trends that vary by consumer segment or time is important in
this domain. Trends using search history, product descriptions, articles, and social
media data inform marketing strategies.
AI-based slogan writing
AI technologies are popularly used for the automated identification of trends and
for coming up with slogan recommendations. Catchy slogans are known to help in
marketing content. The online tool - Frase [175] is a slogan generator powered by
NLP. Figure 5.17 shows a slogan recommendation for a ring size adjustment product.
Image-to-text
Understanding purchase history, location, image posts, and comments help with
brand strategy and analytics. OCR is used for such record digitization.
5.6 INSURANCE
5.6.1 What is insurance?
Insurance is a type of risk management commonly used to protect against the risk of
uncertain financial loss. Some popular types of insurance include home or property
insurance, medical insurance, life insurance, disability insurance, automobile insur-
ance, travel insurance, fire insurance, and marine insurance.
Insurance typically consists of a legal agreement between you and the insurer
containing terms of the policy and coverages. To request your insurance to cover a
loss, a formal request needs to be filed, also called a claim. A claim is a request
for payment for loss or damage to something your insurance covers. Claim errors
cause insurance companies to lose money if the risk is not estimated correctly. For
example, there have already been 2,950 pandemic-related employment claims in the
United States including disputes that range from remote work to workplace safety and
discrimination [213]. Underwriters (members of financial organizations that evaluate
risks) benefit from risk mitigation by adding restrictions to new or renewed policies.
Fraud has a large financial impact on this industry. Property-casualty fraud leads
to loss of more than USD 30 billion from businesses each year, while auto insurance
‘premium leakage’ is a USD 29 billion problem [213]. Identification of fraud is key.
included reviewing the claim, cross-referencing it with the customer’s policy, running
18 anti-fraud algorithms, approving the claim, wiring instructions to the bank, up-
dating the customer, and closing the claim [115].
Sprout.AI [177] is another example of a company offering end-to-end claim au-
tomation for insurers. They use image recognition and OCR, audio analysis, and au-
tomatic document analysis using NLP techniques to analyze text data from insurance
claims. They also pair text with external real-time data like weather and geolocation
to enrich their analysis. The startup’s technology reportedly settles claims within
minutes, while also checking for fraud [115].
A few key areas that have benefitted from NLP in Insurance include customer
service and satisfaction, underwriting automation, fraud detection, risk assessment,
and claims management.
The following examples showcase applications of NLP in Insurance.
Chatbots
A survey indicated more than 80% of insurance customers want personalized of-
fers and recommendations from their auto, home, and life insurance providers [115].
Virtual assistants such as chatbots can help with this problem. With chatbots, such
services can be available 24x7. Moreover, virtual agents can be trained to deliver
a more personalized experience to customers, as if the user were talking to one of the
human agents.
Allstate partnered with Earley Information Science to develop a virtual assistant
called ABIe [171] to help human agents learn more about their products. This was
especially helpful when Allstate launched its business insurance division. ABIe can
process 25k inquiries per month, leveraging NLP and helping make corporate agents
more self-sufficient to better sell products to their customers [115].
Amelia is a technology company that was formerly known as IPsoft. They de-
veloped a conversational AI technology [172] for processing claims. Amelia’s conver-
sational agent can pull up a user’s policy information, verify their identity, collect
information relevant to the claim the user wants to file, and walk them through the
process step-by-step.
An example of an insurance chatbot can be seen in Figure 5.18.
Customer service analysis
Analysis of customer interactions helps identify customers who might be at risk of
cancellation of services, or on the other hand interested in further products. Customer
conversation or comment analysis helps improve customer support.
Information classification
Information classification helps agents look up information faster without having
to manually sift through documents. Much of the labor goes into the correct classi-
fication of information so that the text can be routed appropriately or acted upon
based on the business need. NLP-based solutions have proven helpful for this use
case. For example, Accenture developed its own NLP-based solution for document
classification named MALTA [170]. Its job is to automate the analysis and classifi-
cation of text to help with fast and easy information access. Accenture claimed the
solution provided 30% more accurate classification than when the process was done
manually [115].
Fraud detection
Insurance fraud is estimated by the FBI to cost more than USD 40 billion per year [5].
Fraud leads to higher premiums and impacts insurers as well as customers. NLP-based
solutions are able to better analyze fraudulent claims and increase their correct
detection.
The company, Shift Technology, developed an NLP-based technology to help in-
surers detect fraudulent claims before they pay them out. Their service is called
FORCE and it applies a variety of AI technologies, including NLP, to score each
claim according to the likelihood of fraud. The performance of their service has gained
recognition, resulting in the company signing a partnership with Central Insurance
Companies to detect fraudulent claims in the auto and property sectors [16].
Risk assessment
Natural disasters are unavoidable and in the US alone, the cost of billion-dollar
disasters has been on the rise. The average over the past five years is 16 events per
year, costing just over USD 121 billion per year [103]. NLP-based tools have been a
popular component in risk assessment. As per a survey examining the adoption of
AI by risk and compliance professionals, 37% of respondents claimed that NLP was
a core component or extensively used in their organization [46].
The consulting company Cognizant uses NLP to predict flood risks in the United
States to better underwrite policies for their insurance clients [103]. They claimed that
NLP helped with a more accurate definition of risks, provided useful insights to help
with policy refinement, and resulted in a 25% improved policy acceptance rate.
Information extraction
The process of underwriting requires the analysis of policies and documents in bulk
quantities. This part of the insurance process is highly error-prone, as its outcome
depends on how well the analysis is performed. With NLP, relevant information
extraction can help underwriters assess risk levels better. Extraction of entities such
as dates, names, and locations helps underwriters find information that would’ve taken
a much longer time to look up manually.
For example, DigitalOwl [174] and Zelros [208] are two companies that have de-
veloped solutions to analyze, understand and extract relevant information from docu-
ments to help underwriters make their decisions faster, and with more accuracy [115].
This also finds application in claim management. Processing claims can be time-
consuming. With NLP, agents are able to auto-fill details of a claim by using com-
munication from customers in natural language.
Image-to-text
Many invoices are available on paper rather than in a digital format. Automating the
processing of such documents increases processing speed. Converting such records to
text is therefore an important application, as it allows the use of NLP on the extracted
text for processing, analytics, classification, or a combination thereof. This is done
using OCR. An example can be seen in Figure 5.19.
Writing emails has become a part of all work cultures today. NLP has also aided
technologies that help with spam detection (Figure 5.21). We may still get some spam
emails in our regular inbox, but technology plays a major role in separating out true
spam, thus saving users time and preventing clicks on unsolicited URLs.
Other applications include bucketing email into different categories, such as pri-
mary, social, promotions, and more, as seen in Figure 5.22.
FIGURE 5.21 Email filtering leading to division of incoming mail between inbox and
spam.
other smart devices. Several devices get launched each year that integrate with your
existing smart home set-up to make your home a smart home. With home assistants
like Echo and Google Home, a lot is happening at the backend with models in the
cloud, but it all starts with the voice command. The device needs the ability to listen
to you, understand what you are asking for, and then provide you with the best answer.
How does that work? Your voice is translated into text, and further processing for
understanding intent happens on that text. NLP powers this technology, providing
critical abilities without which these virtual home assistants would not exist.
5.7.3 Recruiting
Recruiting is a common process across industry verticals. When an open job position
receives a large volume of applications, NLP techniques can sift through the PDF
resumes to filter down to the ones that most closely match the requirements of the
open position.
CHAPTER 6
NLP Applications -
Developing Usage
6.1 HEALTHCARE
6.1.1 What is healthcare?
The healthcare industry (also called the medical industry or health economy) is a
collection of sectors that provide goods and services to treat people with medical
needs. Drugs, medical equipment, healthcare facilities like hospitals and clinics, and
managed healthcare including medical insurance policy providers belong to this do-
main. Insurance is a large field on its own, which we have looked into as a separate
industry vertical in the previous chapter. Here, we will look into the other aspects of
the healthcare industry and how NLP makes a difference.
FIGURE 6.1 Classification of text into protected health information (PHI) categories.
Source [202].
FIGURE 6.2 Example of clinical record with annotated PHI categories. Source [202].
Clinical support
Analyzing patient documents to cross-link symptoms embedded in unstructured
text and finding case studies surrounding similar symptoms can be aided by NLP
using text classification, similarity measurement, and analysis.
Research
NLP techniques help summarize large chunks of text into key points or summaries.
This helps consolidate large records and documents into a readable summarized form
allowing doctors and researchers to get context without having to dig through every
document where applicable. Figure 6.3 depicts the algorithmic process and target.
Other research applications include algorithmically studying drug side effects,
people’s sentiment for pain medication, and studying a variety of symptoms and
relationships using NLP.
Speech-to-text
Many doctors today use audio recording devices to capture notes from patient
visits. These recordings are then transcribed into text to fill in patient chart notes and
visit summaries. This saves doctors typing time. Abridge1 and Medcorder2 are examples
of audio recording products and transcribing services.
Language interpretation and translation
Translators and interpreters both work to provide meaningful communication
between different languages. Translators work with written words, while interpreters
work with spoken words. Helping patients with communication barriers to get medical
help is greatly advanced by the help of NLP in language translation technologies.
Many businesses offering these services exist today and leverage NLP. Healthcare
facilities like UCSD Health are able to partly adopt such services, but still keep an
on-site human language translator [73]. With time and increased efficiencies with
NLP, there is a large opportunity for growth in such services.
Impaired speech-to-text applications
Speech impairment in different individuals may not follow a standard pattern. It
is unique to each individual. Hence, it has been a technological challenge to develop
1 https://www.abridge.com/
2 https://www.medcorder.com/
intelligent applications that can help individuals with speech impairment communi-
cate. Google has been researching techniques to make this possible. Translation of
impaired speech-to-text finds applications in engaging in complete communication
with other individuals where the technology is able to fill in the gaps to make the
language understandable. The challenge of needing custom models per user remains
an issue. Research is being conducted on building models that are custom trained on
an individual’s voice samples [40].
Drug interactions
Detecting drug-to-drug interactions (DDI) is important because information on
DDIs can help prevent adverse effects from drug combinations. This field is especially
of use to the Pharma industry. There is always an increasing amount of new drug
interaction research getting published in the biomedical domain. Manually extracting
drug interactions from literature has been a laborious task. Drug interaction discovery
using NLP by sifting through millions of records has been a subject of recent research
and has shown good accuracy improvements in DDI detection [110, 149].
Image-to-text
Medical notes, prescriptions, and receipts are passed through digitization tech-
niques for information extraction and record-keeping using OCR.
6.2 LAW
6.2.1 What is law?
Law is a system of rules that are enforceable by governmental institutions. The le-
gal industry is all about sectors that provide legal goods and services. A lawyer is a
person who practices law. People need lawyers in many stages of life, including work,
marriage, and beyond. The size of the legal services market worldwide in 2020 was
reported at USD 713 billion and is expected to exceed USD 900 billion by 2025 [61].
Legal systems became a popular topic starting in the 1970s and 1980s. Richard
Susskind’s Expert Systems in Law, Oxford University Press, 1987 is a popular ex-
ample exploring artificial intelligence and legal reasoning [167]. In recent years, the
field has become increasingly popular with the rise of many start-ups that make use
of deep learning techniques in the context of legal applications.
CaseText [86] and CaseMine [42] were both founded in 2013 and provide interfaces
to help find relevant material by uploading a passage or even an entire brief that
provides context for the search. Another example is Ross Intelligence which was
founded in 2014 and offers the ability to make a query (ask a question) as you would
naturally talk to a lawyer.
Some of the ways NLP is getting used in law are as follows.
Legal research
Legal research is one of the most popular applications of NLP in law. The crux
of any legal process involves good research, creating documents, and sifting through
other relevant documents. This is one of the reasons legal processes are known to take
time. NLP can help shorten this time by streamlining the
process. NLP-powered applications are able to convert a natural language query into
legal terms. It becomes easier to find documents relevant to the search query, thus
enabling a faster research process. NLP can also help find similar case documents to
help lawyers with references.
In a patent dispute case between Apple and Samsung (the years 2011–2016),
Samsung reportedly collected and processed around 3.6TB or 11,108,653 documents
with a processing cost of USD 13 million over 20 months. Today, a heavy focus is on
creating optimized techniques for categorizing the relevancy of documents as quickly
and efficiently as possible [57].
Language translation
Contract review programs can process documents in 20 different languages, help-
ing lawyers to understand and draft documents across geographies. OpenText has
introduced an e-discovery platform called Axcelerate; and SDL, known for its trans-
lation products and services, provides a Multilingual eDiscovery Solution, enabling
access to foreign language case-related content via translation [57].
Chatbots
A survey taken in 2018 highlighted that 59% of clients expect their lawyers to be
available after hours [49]. Chatbots are of great help in answering clients’
straightforward questions. They give customers access to communication beyond regular
working hours without requiring actual humans to work off hours. Such services also
field people’s initial questions to direct them to the services they need faster [70]. One
example is a chatbot based on IBM Watson, created by Norton Rose Fulbright, an
Australian law firm, to answer standard questions about data breaches; it was active
until 2022 [21]. Figure 6.4 shows an example of such a chatbot.
Document drafting, review, and analysis
Word choice is paramount in legal documents. Any errors in a contract can open it up
to unintended consequences. With the help of NLP tools, lawyers can get documents
cross-checked without spending additional manual hours. ContractProbe [53] and
PrivacyPolicyCheck are some examples that let you upload documents for review.
Furthermore, services are available that create templates for contracts based on
a law or a policy using NLP. This helps create basic versions of contracts. Kira
Systems [98], founded in 2015, and Seal Software, founded in 2010 and acquired by
DocuSign in 2020 for USD 188 million, offer pre-built models for hundreds of common
provisions covering a range of contract types.
Other services also help organize and file documents automatically based on con-
tained language. NLP aids in the automation of such processes, thereby saving lawyers
time and enabling them to assist more clients.
3. Commercial: Office buildings, shopping malls, stores, parking lots, medical cen-
ters, and hotels are examples of this type of real estate.
4. Industrial: Factories, mechanical productions, research and development, con-
struction, transportation, logistics, and warehouses are examples of this type
of real estate.
The real estate industry includes many branches, including developers, broker-
ages, sales and marketing, property lending, property management, and professional
services.
FIGURE 6.5 Real estate listing description with information extraction results on the
right to identify key pieces of information.
FIGURE 6.6 Convoboss: Real estate chatbot for 24/7 lead generation. Source [54].
NLP techniques help analyze documents and filter down large corpora for a given use
case. For example, a Brazilian startup, Beaver (beaver.com.br/), leverages NLP to
improve the working efficiency of
real estate agents by speeding up document analysis [224]. They extract information
from documents surrounding property registrations and purchase contracts to create
summarized reports.
For instance, underwriting is the process through which an individual or institu-
tion takes on financial risk (e.g., loans, insurance, or investment) for a certain fee.
The process of underwriting requires the availability of an analysis of policies and
documents in large quantities.
Compliance assurance
General data protection regulation (GDPR) and California consumer privacy act
(CCPA) compliances are in place to protect consumer privacy. GDPR helps ensure a
user’s information is not stored by a business outside of geographical bounds. CCPA-
compliant businesses must ensure a user’s information is erased from their systems
if a California user issues such a request. NLP techniques help identify and remove
consumer information.
Personal identifiable information (PII) identification and removal
PII refers to information that can reveal the identity of an individual, such as
email ID, phone number, government ID number, etc. At times certain personal
information may make its way into property listings or documents unintentionally.
NLP algorithms help with automatically identifying such information from free-form
text with the help of named entity recognition (NER).
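As an illustration only (not the book’s implementation), a minimal sketch of NER-based PII masking with a pre-trained spaCy model might look like the following; the entity types chosen and the example sentence are assumptions, and patterns such as emails or phone numbers would additionally need regex rules.
import spacy

nlp = spacy.load("en_core_web_lg")

def mask_pii(text, labels=("PERSON", "GPE", "ORG")):
    # replace detected entities with a placeholder token
    doc = nlp(text)
    masked = text
    # work backwards so character offsets remain valid after replacement
    for ent in reversed(doc.ents):
        if ent.label_ in labels:
            masked = masked[:ent.start_char] + "[REDACTED]" + masked[ent.end_char:]
    return masked

print(mask_pii("Contact Jane Doe in San Diego for a tour of the listing."))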
reports and documents [26]. Such availability makes NLP applications a large and
increasing possibility in the industry. NLP-powered search systems can map queries
posed in natural language, like ‘Find incidents involving debris falling on lone
employees’, to the right type of classifications and results [200].
Chatbots
A virtual agent that technicians and engineers can interact with aids in making
operations more time efficient. Given the delicate and volatile nature of the job, it is
often critical that problems are resolved quickly. With NLP, engineers and technicians
can engage in a full dialog to get their questions answered. This enables safer
and faster troubleshooting by reducing resource downtime because of unexpected
issues [144].
Other than employees, chatbots are also built for businesses. For example, Shell
launched an AI-based Chatbot for B2B (business-to-business) lubricant customers
and distributors. Such chatbots help find lubricants, alternatives, the right oil for
a machine, and more. An example of such a virtual agent can be seen in Figure 6.7.
Analysis of logs from employees as well as clients helps analyze feedback that in
turn informs process improvements.
Crude oil price forecasting
A paper in 2019 published by Li, Shang, and Wang [109] explored a text-based
crude oil price forecasting approach using deep learning, where they leveraged data
from the crude oil news column from Investing.com. Topic modeling followed by the
analysis of each word within the topics helped them develop the approach.
Chatbots find several different use cases in this industry, as can be seen in Figure 6.8. Tradeshift [13]
is an example of a company using a procurement assistant, and there are several other
advancements and potentials being explored in this domain [114]. Figure 6.9 shows
a conversation with a chatbot for procurement after reported damage. Other exam-
ples include fetching natural disaster information via news sources and identifying
overlap between routes and timelines to inform users of the situation, loss, and/or
alternatives.
6.6 TELECOMMUNICATION
6.6.1 What is telecom?
Telecommunications (also called telecom) is defined as communicating over a distance
[11]. The telecommunications industry is made up of cable companies, internet service
providers, satellite companies, and telephone companies. The global telecom services
market size was estimated at USD 1,657.7 billion in 2020 [75].
One major telecom operator, with hundreds of millions of subscribers [161], leverages
NLP techniques for fraud detection. Its big data-based anti-fraud system, Tiandun, is
able to detect spam and fraudulent activity in calls and texts.
Vodafone Group is a British multinational telecommunications company with
more than 265 million subscribers [27]. Vodafone uses a chatbot named TOBi [100].
The group has a presence across 46 spoken languages and 26 countries, where only
10% of their consumer base speaks English as their native language [100]. They
implemented NLP-based techniques into their virtual assistant TOBi. TOBi is a text
bot that is able to directly answer most customer questions and recommend products
or solutions to their queries.
See Figure 6.10 for an example of a Telecom company’s chatbot.
6.7 AUTOMOTIVE
6.7.1 What is automotive?
Automotive is one of the world’s largest industries by revenue. The automotive indus-
try comprises a wide range of companies involved in different components of building
and selling vehicles including design, development, manufacturing, marketing, and
sales.
Similar dialog-based systems can also be used to engage with the entertainment
systems inside the vehicle. Figure 6.11 shows an example of a smart car assistant.
Language translation
While this technology is not yet a part of widely used vehicles, it has started
finding use in the automotive industry. For vehicles meant for tourists, language
translation devices are attached to the car to help facilitate communication between
the driver and passengers.
Chatbots
Like any other industry, the automotive industry makes use of chatbots, which help
answer customers’ common queries in an automated fashion, thereby making customer
service available round the clock and engaging human agents only for the queries that
need them [8]. Chatbots find use in car dealerships to aid in finding suitable vehicles for
customers. For instance, DealerAI [215] is a car dealership chatbot platform. Figure
6.12 shows an example of a dealership chatbot.
Topic modeling, text classification, and analytics
Large volumes of unstructured documents accumulate around incident
reports. To discover trends at a high level, NLP is leveraged to organize text, extract
topical information, classify as needed, and provide analytics on key causes of incident
reports within a time frame [31].
Eveil-3D [133] is a language-learning game where NLP is used for the development
of speaking and reading skills. In this game, a verbal automatic speech recognizer is
trained and used to accept input from students.
I-fleg (interactive French language learning game) [18,19] aims to present aspects
of French as a second language. NLP plays a role in the test exercises that are
produced in a non-deterministic manner based on the learner’s goal and profile.
Façade [121] is one of the early games allowing a user to type in natural lan-
guage to control the flow and direction of the game. The game is designed to train
students to find arguments in difficult situations. In this game, the user is invited
to a dinner during which a marital conflict occurs. The student’s goal is to reconcile
the couple. With communications from the student, the couple (game) comes back
with a predefined context-appropriate response, depending on the utterance category
of the student such as praise, criticism, provocation, etc. NLP is used to improve
dialogue efficiency between the player and the game (couple). NLP enables prag-
matic dialogues between the player and the game, with no emphasis on the syntax
or the semantics of the input sentences [131]. Figure 6.14 shows a still from this
game.
Text summarization
While dealing with documentation, dissertations, and papers, text summarization
helps reveal the key content captured in long documents, helping readers sift through
information faster and more concisely.
Language translation
Language translation is immensely helpful in the education sector. It not only
helps students learn different languages better but also helps break language barriers
for making educational material available globally. How many of you use Google
Translate whenever you need to find quick translations for a word or sentence? (Figure
6.15)
FIGURE 6.15 Google Translate for quick language translation between English and
Punjabi.
Windup
In this section, we looked at many industry verticals and how NLP is being used
today or envisioned being used in research efforts. We shared NLP instances from
different industry verticals along with technologies that individuals encounter regu-
larly. Figure 6.16 contains a summary of NLP applications and projects for different
industries. Chatbots and text analysis are the most popular applications across do-
mains. Furthermore, language translation (also called machine translation when done
by a machine) is a common application for global businesses. NLP techniques includ-
ing text similarity, text classification, and information extraction power several other
common industry applications seen in this section.
How would you take an application and implement it for your use case? Section
V is all about implementing different applications. We make use of existing tools and
libraries where possible to demonstrate how quickly some of these applications can
be brought to life. We take it a notch further in Section VI and build NLP projects
and discuss them in an enterprise setting. For example, why would you want to build
a chatbot for your company? What are the driving factors for building NLP solu-
tions? And how can you implement that? Section VI contains four industry projects
comprising NLP applications that not only contain step-by-step implementations but
also aim to give the readers a sense of an actual work setting and how AI fits into
broad company goals.
V
Implementing Advanced NLP Applications
We discussed several data sources, data extraction from different formats, and
storage solutions in Section II. Once text data is accessible and available, several NLP
tools, processes, and models come in handy for different applications, as discussed in
Section III. In Section IV, we looked at how several industries are researching and
leveraging NLP for different applications.
This section is a guide for implementing advanced NLP applications using con-
cepts learned in the previous chapters, which will prepare you for solving real-world
use cases. Each application that we will build in this section can be used stand-alone
or in conjunction with other applications to solve real-world problems around text
data. These stand-alone applications are like building blocks, that you can combine
together to build full-fledged NLP projects, which we will focus on in Section VI
along with what it means to build NLP projects in the real-world (in the industry
vertical/in an enterprise setting).
In Section IV, we visited popular NLP applications across industry domains. Here,
we will pick the most popular applications and implement them using Python tools.
The code used in this section can be found at https://github.com/jsingh811/NLP-in-the-real-world under section5. In this section, we will build the following applications.
• Text summarization
• Topic modeling
• Text similarity
• Text classification
• Sentiment analysis
CHAPTER 7
Here, ‘stock price’ is a phrase that conveys meaning. Extraction of phrases from
text is the task of Keyphrase Extraction, or KPE.
Recognizing ‘Apple’ as an organization in the above sentence is an example of
the task of Named Entity Recognition, or NER.
Apple is an organization, but it can also refer to a fruit. In the above sentence,
the implied reference is the organization given the context. Disambiguation of ‘Apple’
here referring to the organization and not the fruit is an example of Named Entity
Disambiguation and Linking.
All of the above are examples of IE tasks. Let’s dive into some popularly applied
IE tasks and how you could implement and use them in Python. We’ll dive further
into Named Entity Recognition and Keyphrase Extraction, as these are the most
popular implementations in the industry. Most other IE tasks we have spoken about
are relatively easy to implement using service providers such as IBM1, Google2, and
AWS3, which are also a popular choice among industry practitioners over writing
custom code.
1 https://www.ibm.com/blogs/research/2016/10/entity-linking/
2 https://cloud.google.com/natural-language/docs/analyzing-entities
3 https://docs.aws.amazon.com/comprehend/latest/dg/how-entities.html
• There are several pre-trained open-source models and tools that you can directly
use for NER. Service providers also offer such services at some cost.
• You can also train models for your own entities using your own data and the
available tool’s training pipelines.
• Furthermore, if you wanted to build a custom model from scratch using custom
machine learning techniques, you could do that as well with the help of existing
libraries in Python.
import re

text = "send to j_2.4-dj3@unknowndomain.co.net for queries."
print(re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text))  # simple pattern; not exhaustive
# >> ['j_2.4-dj3@unknowndomain.co.net']
To download the en_core_web_lg model, run the following in your Jupyter note-
book. To run it in bash, remove the !.
! pip install spacy
! python -m spacy download en_core_web_lg
import spacy
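The spaCy code that produces the entity listing below is not shown in this excerpt; a minimal sketch using the en_core_web_lg model downloaded above might be the following.
nlp = spacy.load("en_core_web_lg")

doc = nlp(raw_text)  # raw_text holds the input passage described below
for ent in doc.ents:
    print(ent.text, ent.label_)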
Input raw_text:
Output:
The Mars Orbiter Mission (MOM) PRODUCT
Mangalyaan PERSON
Earth LOC
5 November 2013 DATE
the Indian Space Research Organisation ORG
ISRO ORG
Mars LOC
24 September 2014 DATE
India GPE
first ORDINAL
Mars LOC
first ORDINAL
$74 million MONEY
NLTK
NLTK offers a pre-trained model that recognizes PERSON, ORG, GPE entities. The
function can be accessed by nltk.ne_chunk() and it returns a nested nltk.tree.Tree
object, so you have to traverse the Tree object to get to the named entities. Addi-
tionally, it accepts a parameter binary. If binary is set to True, then named entities
are just tagged as NE (i.e., if an entity was detected or not); otherwise, the classifier
adds category labels (PERSON, ORGANIZATION, and GPE).
Then, you have the needed models and can get entities as follows.
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk import ne_chunk

NLTK_LABELS = ["PERSON", "ORGANIZATION", "GPE"]

tagged_doc = []
for sent in sent_tokenize(raw_text):
    tagged_doc.append(pos_tag(word_tokenize(sent)))

entities = []
for sent in tagged_doc:
    trees = ne_chunk(sent)
    for tree in trees:
        if (
            hasattr(tree, "label")
            and tree.label() in NLTK_LABELS
        ):
            entities.append((
                " ".join([
                    entity
                    for (entity, label) in tree
                    # filter for non-entities
                    if (
                        # removing noise, if it is a URL or empty
                        "http" not in entity.lower()
                        and entity.strip()
                    )
                ]),
                # the remainder of this block was lost to a page break;
                # the entity label completes the (text, label) tuple
                tree.label()
            ))
Passing in the same input as in the previous example, here is the output.
spaCy transformers
spaCy 3, in particular, has prebuilt models with HuggingFace’s transformers.
The en_core_web_trf model is a RoBERTa-based English language transformer
pipeline. Its various components include a transformer, tagger, parser, ner, at-
tribute_ruler, and lemmatizer. Using this model right out of the box can be done as
follows.
! pip install spacy-transformers
! python -m spacy download en_core_web_trf
import spacy
from spacy import displacy
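The usage code itself is not shown in this excerpt; a minimal sketch, reusing the raw_text input from earlier, might be the following.
nlp = spacy.load("en_core_web_trf")

doc = nlp(raw_text)
for ent in doc.ents:
    print(ent.text, ent.label_)
# displacy can render the detected entities inline in a notebook
displacy.render(doc, style="ent", jupyter=True)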
Transformers
You can also use the transformers library directly to perform NER.
! pip install transformers
from transformers import pipeline

ner = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    grouped_entities=True
)
Many additional models can be used for NER with the transformers library. Refer
to the Hugging Face model hub5 and use the tag of the model that you want to use as
the model= argument.
The code demonstrated in this section can be found in the notebook section5/ner-
pretrained.ipynb.
We can see how each model can have its own drawbacks in terms of quality of
results. The biggest limitation of the pre-trained models is the limited number of
entities they can recognize.
5 https://huggingface.co/models?language=en&pipeline_tag=token-classification&sort=downloads
spaCy pipeline
To begin with, arrange your entity-labeled data in the format below. In this
example, we want to label the entities - main ingredient and spice.
train_data = [
    (
        'Chef added some salt and pepper to the rice.',
        {'entities': [
            (16, 20, 'SPICE'),
            (25, 31, 'SPICE'),
            (39, 43, 'INGREDIENT')
        ]}
    ),
    (
        'The pasta was set to boil with some salt.',
        {'entities': [
            (4, 9, 'INGREDIENT'),
            (36, 40, 'SPICE')
        ]}
    ),
    (
        'Adding egg to the rice dish with some pepper.',
        {'entities': [
            (7, 10, 'INGREDIENT'),
            (18, 22, 'INGREDIENT'),
            (38, 44, 'SPICE')
        ]}
    )
]
We start by creating a blank model, adding an ner pipe, and adding our entities to
the ner pipe.
! pip install spacy

import random
import spacy
from spacy.training import Example

# create a blank English model and add an ner pipe with our entity labels
# (this setup step is described in the text above but was not shown in the code)
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in train_data:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

# begin training
optimizer = nlp.begin_training()
n_iter = 100
pipe_exceptions = ["ner", "trf_wordpiece", "trf_tok2vec"]
other_pipes = [
    pipe
    for pipe in nlp.pipe_names
    if pipe not in pipe_exceptions
]
with nlp.disable_pipes(*other_pipes):
    for _ in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        for batch in spacy.util.minibatch(
            train_data, size=2
        ):
            for text, annots in batch:
                doc = nlp.make_doc(text)
                nlp.update(
                    [Example.from_dict(doc, annots)],
                    drop=0.5,
                    sgd=optimizer,
                    losses=losses
                )
        print("Losses", losses)
This is not a highly accurate model and is built to demonstrate the functionality
alone. Adding more training data will help train a better model. The complete code
can be found at section5/training-ner-spacy.ipynb.
A model can be saved to disk for future use as follows.
nlp . to_disk ( output_dir )
# to load back
NER is a sequence labeling problem. What does that mean? The context of the
sentence is important for tasks like NER.
There are rule-based methods that can work as a sequence labeling model. Ad-
ditionally, NER can be done using a number of sequence labeling methods. Popular
ones include Linear Chain Conditional Random Fields (Linear Chain CRF), Maxi-
mum Entropy Markov Models, and Bi-LSTM.
CRF
CRF stands for conditional random fields and is used heavily in information
extraction. The principal idea is that the context of each word is important in addition
to the word’s meaning. One approach is to use the two words before a given word
and the two words following the given word as features.
We can use sklearn-crfsuite to accomplish training a custom NER model. To
begin with, we need data that is annotated in a given format. The labels follow a
BIO notation where B indicates the beginning of an entity, I indicates the inside of
an entity for multi-word entities, and O for non-entities. An example can be seen
below. PER stands for person and LOC stands for location.
Jessie I-PER
Johnson I-PER
went O
to O
Dubai B-LOC
.O
Once we have annotated data, we can perform feature extraction and classifier
training.
The feature extraction logic used should depend on the task at hand. In a typical
scenario, looking at the POS (part-of-speech) tag of words before and after a word is
helpful. A complete demo code can be found at6 .
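As an illustration of such features for sklearn-crfsuite (the exact feature set shown here is a choice, not necessarily the book’s), each token can be described by its own word and POS tag together with those of its neighbors.
def word2features(sent, i):
    # sent is a list of (word, pos_tag) pairs; i is the token index
    word, pos = sent[i]
    features = {
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
        "postag": pos,
    }
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        features["-1:word.lower()"] = prev_word.lower()
        features["-1:postag"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        next_word, next_pos = sent[i + 1]
        features["+1:word.lower()"] = next_word.lower()
        features["+1:postag"] = next_pos
    else:
        features["EOS"] = True  # end of sentence
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]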
To read the text and tags, the following code can be used.
from pathlib import Path
import re
6 https://medium.com/data-science-in-your-pocket/training-custom-ner-system-using-crfs-146e0e922851
7 https://huggingface.co/models
‘O’ indicates that the token does not correspond to any entity. ‘location’ is an
entity. ‘B-’ indicates the beginning of the entity and ‘I-’ indicates consecutive positions
of the same entity. Thus, ‘Empire’, ‘State’, and ‘Building’ have tokens ‘B-location’,
‘I-location’, and ‘I-location’.
Next, we split the data into training and validation samples and initialize a pre-
trained DistilBert tokenizer using the model distilbert-base-cased. Our data has
split tokens rather than full sentence strings, thus we will set is_split_into_words
to True. We pass padding as True and truncation as True to pad the sequences to
be the same length.
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

train_texts, val_texts, train_tags, val_tags = train_test_split(
    texts, tags, test_size=.2
)
tokenizer = DistilBertTokenizerFast.from_pretrained(
    'distilbert-base-cased'
)
train_encodings = tokenizer(
    train_texts,
    is_split_into_words=True,
    return_offsets_mapping=True,
    padding=True,
    truncation=True
)
val_encodings = tokenizer(
    val_texts,
    is_split_into_words=True,
    return_offsets_mapping=True,
    padding=True,
    truncation=True
)
We can tell the model to return information about the tokens that are split by
the WordPiece tokenization process.
WordPiece tokenization is the process by which single words are split into multiple
tokens such that each token is likely to be in the vocabulary. Some words may not be
in the vocabulary of a model. Thus the model splits the word into sub-words/tokens.
Since we have only one tag per token, if the tokenizer splits a token into multiple
sub-tokens, then we will end up with a mismatch between our tokens and our labels.
To resolve this, we will train on the tag labels for the first subtoken of a split token.
We can do this by setting the labels we wish to ignore to -100.
import numpy as np
print (
f """ There are total { len ( unique_tags ) } entity tags in the data :
{ unique_tags } """
)
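The alignment code itself is not included in this excerpt. A sketch along the lines of the commonly used offset-mapping approach is shown below; the names encode_tags and tag2id are assumptions, while id2tag is referenced later in the text.
import numpy as np

tag2id = {tag: i for i, tag in enumerate(sorted(unique_tags))}
id2tag = {i: tag for tag, i in tag2id.items()}

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # initialize every sub-token position to -100 (ignored by the loss)
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # assign labels only to the first sub-token of each original token
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels

train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)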
Now we load in a token classification model and specify the number of labels.
Then, our model is ready for fine-tuning.
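The loading step is not shown in this excerpt; a minimal sketch consistent with the DistilBERT tokenizer used above (the fine-tuning loop itself is omitted here, as it is in the excerpt) might be the following.
from transformers import DistilBertForTokenClassification

model = DistilBertForTokenClassification.from_pretrained(
    "distilbert-base-cased",
    num_labels=len(unique_tags)
)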
from transformers import pipeline

custom_ner = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)
output = custom_ner("""
Ella Parker purchased a Samsung Galaxy s21 + from Elante mall.
""")
print(output)
The resultant output has entity group labels as ‘LABEL_0’, ‘LABEL_1’, etc.
You can map it to your label names using id2tag or the mapping available in
model.config.
The output is as follows.
[{'entity_group': 'B-person', 'score': 0.97740185, 'word': 'Ella', 'start': 65, 'end': 69},
 {'entity_group': 'I-person', 'score': 0.97186667, 'word': 'Parker', 'start': 70, 'end': 76},
 {'entity_group': 'O', 'score': 0.9917011, 'word': 'purchased a', 'start': 77, 'end': 88},
 {'entity_group': 'B-product', 'score': 0.39736107, 'word': 'Samsung', 'start': 89, 'end': 96},
 {'entity_group': 'I-product', 'score': 0.65990174, 'word': 'Galaxy', 'start': 97, 'end': 103},
 {'entity_group': 'O', 'score': 0.77520126, 'word': 's21 + from', 'start': 104, 'end': 113},
 {'entity_group': 'B-location', 'score': 0.41146958, 'word': 'El', 'start': 114, 'end': 116},
 {'entity_group': 'I-corporation', 'score': 0.23474006, 'word': '##ante', 'start': 116, 'end': 120},
 {'entity_group': 'O', 'score': 0.87043536, 'word': 'mall.', 'start': 121, 'end': 126}]
7.1.2.1 textacy
This library is built on top of spaCy and contains implementations of multiple graph-
based approaches for extracting keyphrases. It includes algorithms such as TextRank,
SGRank, YAKE, and sCAKE.
from textacy import load_spacy_lang, make_spacy_doc
# keyterm functions live under textacy.extract.keyterms in textacy 0.11+;
# older versions expose them under textacy.ke
from textacy.extract.keyterms import textrank, sgrank

en = load_spacy_lang(
    "en_core_web_sm", disable=("parser",)
)
doc = make_spacy_doc(text, lang=en)

# TextRank
tr = textrank(doc, normalize="lemma", topn=5)

# SGRank
sg = sgrank(doc, topn=5)
TextRank keyphrases
[‘natural language processing’, ‘natural language datum’, ‘computer capable’, ‘com-
puter science’, ‘human language’]
SGRank keyphrases
[‘natural language datum’, ‘natural language processing’, ‘artificial intelligence’, ‘hu-
man language’, ‘computer science’]
7.1.2.2 rake-nltk
Rapid Automatic Keyword Extraction, or RAKE, is a domain-independent keyword
extraction algorithm. The logic analyzes the frequency of word appearance and its
co-occurrence with other words in the text. Many open-source contributors have im-
plemented RAKE. Usage of one such RAKE implementation with NLTK is as follows.
! pip install rake-nltk==1.0.6

from rake_nltk import Rake

r = Rake()
r.extract_keywords_from_text(text)
# top 5 keyphrases
print(r.get_ranked_phrases()[:5])
7.1.2.3 KeyBERT
KeyBERT implements a keyword extraction algorithm that leverages sentence-BERT
(SBERT) embeddings to create keywords and keyphrases that are most similar to
the input document.
The logic involves the generation of document embeddings using SBERT model,
followed by the extraction of n-gram phrases from the embeddings. Then, cosine
similarity is used to measure the similarity of each keyphrase to the document. The
most similar words can then be identified as the terms that best describe the entire
document and are considered keywords and keyphrases.
! pip install keybert==0.5.1

from keybert import KeyBERT

kw_model = KeyBERT()
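The extraction call that produces the keyphrases below is not shown in this excerpt; a call along the following lines (the keyphrase_ngram_range value is an assumption chosen to allow phrases of up to three words) returns keyphrases with their similarity scores.
keywords = kw_model.extract_keywords(
    text, keyphrase_ngram_range=(1, 3), top_n=5
)
print(keywords)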
[(‘processing nlp’, 0.7913), (‘language processing nlp’, 0.7629), (‘processing nlp is’,
0.7527), (‘natural language processing’, 0.7435), (‘of natural language’, 0.6745)]
There are many other ways to get keyphrase extraction in Python. Some other
libraries include MultiRake (multilingual rake), summa (TextRank algorithm), Gensim
(summarization.keywords in version 3.8.3), and pke.
In general, preprocessing of the text, what ‘n’ to choose in n-grams, and which
algorithm to use are factors that can change the outcome of the KPE model that you
construct and are worth experimenting with to fine-tune your model.
1. Extractive summarization
Certain phrases or sentences from the original text are identified and extracted.
Together, these extractions form the summary.
2. Abstractive summarization
New sentences are generated to form the summary. In contrast to extractive
summarization, the sentences contained within the generated summary may
not be present at all in the original text.
We’ll use the text from the Wikipedia page on ‘Data Science’ as the document
we want to summarize.
from sumy . parsers . html import HtmlParser
from sumy . nlp . tokenizers import Tokenizer
from sumy . summarizers . text_rank import TextRankSummarizer
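The rest of the sumy code is not included in this excerpt; a minimal sketch that summarizes the Wikipedia page directly from its URL might be the following.
url = "https://en.wikipedia.org/wiki/Data_science"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = TextRankSummarizer()
# print the top 5 sentences ranked by TextRank
for sentence in summarizer(parser.document, 5):
    print(sentence)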
Gensim
Gensim, another open-source library in Python, implements an improved version
of TextRank and can be used to get document summaries as well. This support was
removed from Gensim 4.0 onwards, but you can still use the functionality by installing
Gensim 3.8.
! pip install gensim==3.8.3
Let’s define the variable text as the contents of the Wikipedia webpage on ‘Data
Science’.9
from gensim.summarization import summarize

text = ""  # replace with the text you want to summarize
gensim_summary = summarize(text)
print(gensim_summary)
field focused on extracting knowledge from data sets, which are typically large (see
big data), and applying the knowledge and actionable insights from data to solve
problems in a wide range of application domains.[8] The field encompasses prepar-
ing data for analysis, formulating data science problems, analyzing data, developing
data-driven solutions, and presenting findings to inform high-level decisions in a broad
range of application domains.
7.2.1.2 Transformers
We discussed transformer models in Chapter 4. Library bert-extractive-summarizer
can be used for document summarization using transformer models like BERT, GPT-
2, and XLNet. Each of these models comes in different sizes.
Let’s look at their implementation in code below. We’ll use the same value for
variable text as above, i.e., the Wikipedia page on ‘Data Science’.
! pip install transformers
! pip install bert-extractive-summarizer==0.10.1
# BERT
from summarizer import Summarizer
bert_model = Summarizer ()
bert_summary = ' '. join (
bert_model ( text , min_length =60 , max_length =500)
)
print ( bert_summary )
You can control the min_length and max_length of the summary. BERT sum-
marization results in the following.
Data science is an interdisciplinary field that uses scientific methods, processes, al-
gorithms and systems to extract knowledge and insights from noisy, structured and
unstructured data,[1][2] and apply knowledge and actionable insights from data across
a broad range of application domains. Data science is a ‘concept to unify statistics,
data analysis, informatics, and their related methods’ in order to ‘understand and
analyze actual phenomena’ with data.[3] It uses techniques and theories drawn from
many fields within the context of mathematics, statistics, computer science, infor-
mation science, and domain knowledge.[4] However, data science is different from
computer science and information science. Further information: Statistics § Methods
Linear regression
Logistic regression
Decision trees are used as prediction models for classification and data fitting.
from summarizer import TransformerSummarizer

gpt2_model = TransformerSummarizer(
    transformer_type="GPT2",
    transformer_model_key="gpt2-medium"
)
gpt2_summary = ''.join(
    gpt2_model(text, min_length=60, max_length=500)
)
print(gpt2_summary)
# XLNet
from summarizer import TransformerSummarizer
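The XLNet usage mirrors the GPT-2 example above; a sketch might be the following.
xlnet_model = TransformerSummarizer(
    transformer_type="XLNet",
    transformer_model_key="xlnet-base-cased"
)
xlnet_summary = ''.join(
    xlnet_model(text, min_length=60, max_length=500)
)
print(xlnet_summary)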
The code used in this section can be found in the notebook section5/extractive-
summarization.ipynb.
7.2.2.1 Transformers
T5
We can use a sequence-to-sequence model like T5 [137] 12 for abstractive text
summarization. We’ll pass in the same text we did for the previous section.
! pip install transformers
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="t5-base",
    tokenizer="t5-base",
    framework="tf"
)
summary = summarizer(
    text, min_length=50, max_length=500
)
print(summary)
The above results in the following abstract summary. The bold text represents
the sentences formed by the model that were not present in this exact form in the
input document.
BART
BART models come in different sizes that can be found on Hugging Face’s web-
site13 . The following code sample uses bart-base model for abstractive summariza-
tion.
from transformers import pipeline

bart_summarizer = pipeline(
    "summarization",
    model="facebook/bart-base",
    tokenizer="facebook/bart-base"
)
bart_summary = bart_summarizer(
    text,
    min_length=50,
    max_length=500,
    truncation=True
)
print(bart_summary)
12 https://huggingface.co/docs/transformers/model_doc/t5
13 https://huggingface.co/docs/transformers/model_doc/bart
Below are the first few sentences from the returned abstract summary. The bold
text represents the new sentences formed by the model.
Data science is an interdisciplinary field that uses scientific methods, processes, al-
gorithms and systems to extract knowledge and insights from noisy, structured and
unstructured data,[1][2] and apply knowledge and actionable insights from data across
a broad range of application domains. Data science is related to data mining, machine
learning and big data. The term ‘data science’ has been traced back to 1974, when
Peter Naur proposed it as an alternative name for computer science.[4] However, data
science is different from computer science and information science. Turing Award
winner Jim Gray imagined data science as a ‘concept to unify statistics,
data analysis, informatics, and their related methods’ in order to ‘under-
stand and analyze actual phenomena’ with data.[3] It uses techniques and
theories drawn from many fields within the context of mathematics, statis-
tics, computer science, information science, technology, engineering, and
domain knowledge.[4][5]
PEGASUS
PEGASUS is currently the state-of-the-art for abstractive summarization on
many benchmark datasets. For our demo, we will use the fine-tuned model
google/pegasus-xsum.
from transformers import pipeline

p_summarizer = pipeline(
    'summarization',
    model='google/pegasus-xsum',
    tokenizer='google/pegasus-xsum'
)
p_summary = p_summarizer(
    text,
    min_length=50,
    max_length=500,
    truncation=True
)
print(p_summary)
The bold text represents the new sentences formed by the model.
For this task, it is common to use a solution option that already exists with
the available fine-tuned models. You can also fine-tune your own models on custom
datasets15. This solution can be time-consuming and requires labeled data with
summaries. Due to the lack of labeled data, this option is primarily explored by industry
practitioners focused on research or domain-specific datasets.
Table 7.1 lists a comparison of the machine translation offerings by different industry
leaders20.
20 https://learn.vonage.com/blog/2019/12/10/text-translation-api-comparison-dr/
21 https://mymemory.translated.net/
# assumes the `translate` package: pip install translate
from translate import Translator

en_hi_translator = Translator(
    from_lang="english", to_lang="hindi"
)
translation = en_hi_translator.translate("How are you today?")
print(translation)
7.3.2.3 Transformers
You can also perform language translation using Hugging Face transformers library.
The following example converts English to French. The default model for the
pipeline is t5-base.
from transformers import pipeline
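The English-to-French call that produces the output below is not included in this excerpt; a minimal sketch using the default t5-base pipeline (the input sentence here is an assumption) might be the following.
en_fr_translator = pipeline("translation_en_to_fr")
result = en_fr_translator("Is that true?")
print(result[0]["translation_text"])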
Est-ce vrai?
es_en_translator = pipeline(
    "translation", model="Helsinki-NLP/opus-mt-es-en"
)
print(es_en_translator("Me gusta esto muchisimo"))
In this chapter, we will implement topic modeling, text similarity, and text classifi-
cation including sentiment classification.
FIGURE 8.1 Books whose descriptions were used to build our LDA model. Source
doc1 [23], doc2 [82], doc3 [90], doc4 [76].
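The documents list (the four book descriptions from Figure 8.1) and the clean helper used below are defined outside this excerpt; a minimal sketch of a comparable cleaning step (the stop-word list and lemmatizer choices are assumptions) is the following.
import string

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
punctuation = set(string.punctuation)
lemmatizer = WordNetLemmatizer()

def clean(doc):
    # lowercase, drop stop words and punctuation, and lemmatize
    no_stop = " ".join(
        word for word in doc.lower().split() if word not in stop_words
    )
    no_punct = "".join(ch for ch in no_stop if ch not in punctuation)
    return " ".join(lemmatizer.lemmatize(word) for word in no_punct.split())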
processed_docs = [
clean ( doc ) . split () for doc in documents
]
Next, we use Gensim to index each term in our corpus and create a bag-of-words
matrix.
import gensim
from gensim import corpora
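The indexing step that produces dictionary and doc_term_matrix (both used below) is not shown in this excerpt; a minimal sketch is the following.
# map each unique term to an integer id
dictionary = corpora.Dictionary(processed_docs)
# bag-of-words representation of each document
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_docs]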
Finally, we create the LDA model. Some trial and error on the number of topics
is required.
# Creating the object for LDA model using gensim library
lda = gensim . models . ldamodel . LdaModel
# Running and Training LDA model on the
# document term matrix for 3 topics
lda_model = lda (
doc_term_matrix ,
num_topics =3 ,
id2word = dictionary ,
passes =20
)
# Results
for itm in lda_model . print_topics () :
print ( itm )
print ( " \ n " )
FIGURE 8.2 The book used to test our LDA model. Source [188].
We can see it got bucketed with the highest score on the topic of machine learning
/ deep learning.
8.2.1 Elasticsearch
If you have your data housed in Elasticsearch, you can write a query to find similar
records to a record or any piece of text. Its underlying principle works by computing
TF-IDF followed by cosine distance. To find records similar to some custom text, use
the field like. All the fields to consider for computing similarity are listed under fields.
You can define several other parameters to tune the model. The documentation lists
out the different inputs accepted1 .
An example query is as follows.
{
" query " : {
" more_like_this " : {
" fields " : [ " title " ] ,
" like " : " elasticsearch is fast " ,
" min_term_freq " : 1 ,
" max_query_terms " : 12
}
}
}
1 https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
We discussed many word embedding models in Chapter 3 (Section 3.4.4) with code
samples, advantages, and disadvantages. Any method can be used for the application
of text similarity in conjunction with cosine similarity. Let’s look at code samples for
some of them below.
spaCy
Here’s a code sample of computing text similarity with spaCy using an existing
model. You can choose from any of the available pre-trained models with this library.
! pip install spacy
! python -m spacy download en_core_web_lg
import spacy

nlp = spacy.load("en_core_web_lg")

docs = [
    nlp(u"I like cold soda"),
    nlp(u"hot chocolate is filling"),
    nlp(u"ice cream is cold"),
    nlp(u"burger tastes best when hot")
]
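The pairwise similarity computation is not shown in this excerpt; a short sketch using the docs list above follows.
for i, doc1 in enumerate(docs):
    for doc2 in docs[i + 1:]:
        # similarity() compares the averaged word vectors of the two docs
        print(doc1.text, "<->", doc2.text, ":", round(doc1.similarity(doc2), 3))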
Gensim
Another way of getting cosine similarity is by using the Gensim library as follows.
! pip install gensim
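The setup that produces the corpus object used below is not shown in this excerpt. One plausible sketch loads a pre-trained embedding model through gensim's downloader API and uses its keyed vectors; the specific model chosen here is an assumption, so the exact scores may differ from those printed below.
import gensim.downloader as api

# pre-trained word vectors exposed as KeyedVectors (assumed model choice)
corpus = api.load("glove-wiki-gigaword-100")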
print(
    corpus.n_similarity(
        ['hot', 'meal'],
        ['burger', 'tastes', 'best', 'when', 'hot']
    )
)
# >> 0.674938

print(
    corpus.n_similarity(
        ['hot', 'meal'],
        ['I', 'like', 'cold', 'soda']
    )
)
# >> 0.46843547
Transformers
We will use the sentence-transformers library [142] with the model name spec-
ified as below to get word embeddings. This library uses Hugging Face transformers
behind the scenes. Then, we will use cosine similarity to measure text similarity. The
full list of available models can be found at2 .
! pip install transformers
! pip install sentence-transformers
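The embedding and similarity code is not included in this excerpt; a minimal sketch (the model name all-MiniLM-L6-v2 is an assumption) might look like the following.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "I like cold soda",
    "hot chocolate is filling",
    "ice cream is cold",
    "burger tastes best when hot",
]
embeddings = model.encode(sentences)
# cosine similarity matrix between all sentence pairs
print(util.cos_sim(embeddings, embeddings))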
2 https://www.sbert.net/docs/pretrained_models.html
Zero-shot classification refers to the ability of a model to complete a task without any
training examples. Hugging Face transformers can be used for zero-shot classification
using any one of their models offered for this task. You can choose from the various
models fine-tuned for this task at4. Here's how it looks in Python with the model
distilbert-base-uncased-mnli.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="typeform/distilbert-base-uncased-mnli"
)
classifier(
    "This is a book about Natural Language Processing.",
    candidate_labels=["education", "politics", "business"],
)
classifier(
    "I saw a large python crawing in the jungle behind the house.",
    candidate_labels=["animal", "programming"]
)
{'sequence': 'I saw a large python crawing in the jungle behind the house.',
 'labels': ['animal', 'programming'],
 'scores': [0.6283940076828003, 0.3716059923171997]}
classifier(
    "NLP applications can be implemented using Python.",
    candidate_labels=["animal", "programming"],
)
These classifiers may not work well for all use cases, such as the below example.
classifier(
    "I wanna order a medium pizza for pick up at 6 pm.",
    candidate_labels=["delivery", "pickup"],
)
1. Removing any bad samples (Null, too short, etc.) from your data
8.3.2.1 Classic ML
import pandas as pd

# Read dataset
df = pd.read_csv("spam.csv", encoding='latin1')
86.5% of the data belongs to class ham. There is a high class imbalance.
5 http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/
Exploring the text by manually checking samples, we find some informal language
and excess punctuation. Now, let’s preprocess the text by removing stop words, punc-
tuation, and lemmatizing the words. You can experiment further here by adding and
removing other cleaning steps, such as removing specific words, stemming, etc.
import string
import random
from nltk . corpus import stopwords
from nltk . stem . wordnet import WordNetLemmatizer
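The cleaning code applied to the messages is not included in this excerpt; a sketch consistent with the imports above (the dataframe column name text is an assumption) might be the following.
stop_words = set(stopwords.words("english"))
punctuation = set(string.punctuation)
lemmatizer = WordNetLemmatizer()

def preprocess(message):
    # drop stop words, strip punctuation, and lemmatize each remaining word
    no_stop = " ".join(
        word for word in message.lower().split() if word not in stop_words
    )
    no_punct = "".join(ch for ch in no_stop if ch not in punctuation)
    return " ".join(
        lemmatizer.lemmatize(word) for word in no_punct.split()
    )

df["processed"] = df["text"].apply(preprocess)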
Next, we’ll build features and split the data into training and testing sets.
Here, we will build TF-IDF features from the text data. The min_df, max_df, and
max_features parameters can be experimented with to find the right values for your
dataset.
You can build other features and experiment further with your model. We de-
scribed some of the other feature options in Chapter 3.
from sklearn . model_selection import train_test_split
from sklearn . feature_extraction . text import TfidfVectorizer
vectorizer = TfidfVectorizer (
max_df =0.9 , min_df =0.01 , max_features =5000
)
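The split and vectorization steps that produce x, y, X_train, X_valid, train_y, and valid_y (all used below) are not included in this excerpt; one plausible sketch, with the column names as assumptions, is the following.
# x: cleaned message text, y: spam/ham labels
x = df["processed"]
y = df["label"]

train_x, valid_x, train_y, valid_y = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)
# fit TF-IDF on the training split only, then transform both splits
X_train = vectorizer.fit_transform(train_x)
X_valid = vectorizer.transform(valid_x)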
We write a function to train the model and return evaluation scores for the model.
import numpy
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_fscore_support

def multinomialNB_model(
    X_train, train_y, X_valid, valid_y, alpha=1.0
):
    model = MultinomialNB(alpha=alpha).fit(
        X_train.todense(), train_y
    )
    y_pred = model.predict(X_valid.todense())
    prec, recall, f1, class_size = precision_recall_fscore_support(
        valid_y,
        y_pred,
        average=None,
        labels=model.classes_
    )
    scores = {
        "class_order": model.classes_,
        # per-class metrics and their unweighted means
        # (keys match how `scores` is used later in this section)
        "precision": prec,
        "recall": recall,
        "f1": f1,
        "avg prec": numpy.mean(prec),
        "avg recall": numpy.mean(recall),
        "avg f1": numpy.mean(f1),
    }
    return model, scores, y_pred
Next, we train models for different alpha values and save all scores in the variable models. We also find the maximum F1 score.
We discussed grid-searching methods from tools like sklearn in Chapter 4, and we could replace this implementation with the sklearn one. The reason for demonstrating it this way is the simplicity of the problem (only one hyperparameter with a limited set of values), the small model size, the flexibility to choose the best model based on different evaluation metrics, and having multiple models to experiment with as needed. For instance, is the model with the best accuracy the same as the model with the best F1? This type of implementation helps analyze multiple resultant models.
models = {}
f1_max = 0
for alpha in [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]:
    models[alpha] = multinomialNB_model(
        X_train, train_y, X_valid, valid_y, alpha=alpha
    )
    # track the maximum average F1 across all trained models
    f1_max = max(f1_max, models[alpha][1]["avg f1"])
We find the model corresponding to the maximum F1 score and print the results.
best_alpha, best_model, best_score, y_pred = [
    (
        alpha,
        models[alpha][0],
        models[alpha][1],
        models[alpha][2]
    )
    for alpha in models
    if models[alpha][1]["avg f1"] == f1_max
][0]

print(f"""
Best alpha: {best_alpha}
Avg. Precision: {best_score["avg prec"]}
Avg. Recall: {best_score["avg recall"]}
Avg. F1: {best_score["avg f1"]}"""
)

print(f"""
\nPer class evaluation:
Classes: {best_score["class_order"]}
Precision: {best_score["precision"]}
Recall: {best_score["recall"]}
F1: {best_score["f1"]}"""
)
Output is as follows.
We also compute the confusion matrix to further understand the model results.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
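A minimal sketch of computing and plotting the matrix from the validation predictions of the best model:

cm = confusion_matrix(valid_y, y_pred, labels=best_model.classes_)
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=best_model.classes_
)
disp.plot()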
FIGURE 8.4 Confusion matrix for spam vs ham classification model using Multinomial Naive Bayes classifier.
We observe the recall score for the spam class to be 75.2%, which is a lot lower than that of the ham class at 99.3%. We already know about the class imbalance; having less data for the spam class is one likely large factor. Now let's check the 5-fold cross-validation recall score per class to check for variance.
# Checking cross validation recall scores per class
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate

vect = TfidfVectorizer(
    max_df=0.9, min_df=0.01, max_features=5000
)

scoring = {
    "recall_spam": make_scorer(
        recall_score, average=None, labels=["spam"]
    ),
    "recall_ham": make_scorer(
        recall_score, average=None, labels=["ham"]
    )
}

cross_validate(
    MultinomialNB(alpha=best_alpha),
    vect.fit_transform(x),
    y,
    scoring=scoring,
    cv=5
)
The variance in the ham recall score is low. On the other hand, the spam recall variance is high.
Let's run the created model on some new samples.
new_samples = [
    """You have completed your order. Please check your email for a
    refund receipt for $50.""",
    """Win lottery worth $2 Million! click here to participate for free.""",
    """Please send me the report by tomorrow morning. Thanks.""",
    """You have been selected for a free $500 prepaid card."""
]

sample_vects = vectorizer.transform(
    [clean(doc) for doc in new_samples]
)

# Predict with the best model from the grid search above
print(
    "Predicted class for samples: ",
    best_model.predict(sample_vects.todense())
)
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

MAX_SEQUENCE_LENGTH = 200
MAX_NB_WORDS = 10000
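# The tokenization step that produces `train_sequences` and `test_sequences`
# is assumed here: a Keras Tokenizer is fit on the raw training text and used
# to convert both splits to integer sequences. `x_train` / `x_test` are
# illustrative names for the raw train and test message text.
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(x_train)
train_sequences = tokenizer.texts_to_sequences(x_train)
test_sequences = tokenizer.texts_to_sequences(x_test)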
train_data = pad_sequences(
    train_sequences, maxlen=MAX_SEQUENCE_LENGTH
)
test_data = pad_sequences(
    test_sequences, maxlen=MAX_SEQUENCE_LENGTH
)

print(
    "train and test data shapes",
    train_data.shape, test_data.shape
)
# >> train and test data shapes (4133, 200) (1378, 200)
Then, we start adding different layers to build our model. First, we add an Embedding layer.
# Model training
from keras.models import Sequential
from keras.layers import (
    Dense,
    Embedding,
    Conv1D,
    MaxPooling1D,
    Flatten,
    Dropout
)
EMBEDDING_DIM = 100

model = Sequential()
model.add(Embedding(
    MAX_NB_WORDS,
    EMBEDDING_DIM,
    input_length=MAX_SEQUENCE_LENGTH
))
model.add(Dropout(0.5))
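A minimal sketch of one way to finish and train the model, producing the history and predicted objects used below; the layer sizes and epochs are illustrative, and labels_train/labels_test are assumed to be the one-hot encoded labels.

model.add(Conv1D(64, 5, activation="relu"))
model.add(MaxPooling1D(pool_size=4))
model.add(Flatten())
model.add(Dense(64, activation="relu"))
# two output units, matching the one-hot encoded ham/spam labels
model.add(Dense(2, activation="softmax"))

model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)
history = model.fit(
    train_data,
    labels_train,
    epochs=10,
    batch_size=64,
    validation_data=(test_data, labels_test)
)
predicted = model.predict(test_data)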
Now, we can get the metrics of this model on the test data using the following
code.
from sklearn.metrics import precision_recall_fscore_support as sc
from sklearn.metrics import classification_report

# evaluation
precision, recall, fscore, _ = sc(labels_test, predicted.round())
# >> precision = array([0.97981497, 0.96825397])
# >> recall = array([0.99487617, 0.88405797])
# >> fscore = array([0.98728814, 0.92424242])
We can also visualize the training and testing accuracy and loss per epoch using the following code.
import matplotlib.pyplot as plt

# training data
plt.plot(history.history["accuracy"])
plt.xlabel("Epochs")
plt.ylabel("Training accuracy")
plt.show()

plt.plot(history.history["loss"])
plt.xlabel("Epochs")
plt.ylabel("Training loss")
plt.show()

# validation data
plt.plot(history.history["val_accuracy"])
plt.xlabel("Epochs")
plt.ylabel("Validation accuracy")
plt.show()

plt.plot(history.history["val_loss"])
plt.xlabel("Epochs")
plt.ylabel("Validation loss")
plt.show()
FIGURE 8.5 Training and validation accuracy and loss for ham/spam CNN model.
We see a higher score for the second and fourth sentences in new_samples for the class spam, and a higher score for the first and third sentences for the class ham. The complete script can be found in section5/ham-spam-classifier-CNN.ipynb.
Similarly, you can use the code from Chapter 4 to train an LSTM or BiLSTM by adding the relevant layers to the model for this problem.
This tool does fairly well in terms of accuracy of the predicted sentiment.
Like any probabilistic tool, there are areas of ambiguity where models fail. One
such example is as follows.
VADER
VADER stands for Valence Aware Dictionary and sEntiment Reasoner. It is a lex-
icon and rule-based sentiment analysis tool that is specifically attuned to sentiments
expressed in social media [83].
The model returns a dictionary with scores for pos, neg, neu, and compound.
The compound score is computed by combining the valence/polarity scores of
each word in the lexicon, with the final values normalized between -1 and 1. -1 stands
for most negative, and +1 stands for most positive. This metric is the most used
for scenarios where a single measure of sentiment of a given sentence is desired, i.e.,
positive, negative, or neutral. The thresholds reported by the library are as follows.
Positive sentiment: compound score >= 0.05
Neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
Negative sentiment: compound score <= -0.05
The scores assigned to pos, neg, and neu are the ratios for proportions of text
that fall in each category. These are useful if you want to understand the context
and presentation of how sentiment is conveyed in rhetoric for a given sentence. An
example reported by the library: 'some writing styles may reflect a penchant for strongly flavored rhetoric, whereas other styles may use a great deal of neutral text while still conveying a similar overall (compound) sentiment'.
Here's how to get sentiment using VADER.
# SentimentIntensityAnalyzer comes from the vaderSentiment package
# (it is also available via nltk.sentiment.vader)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sid_obj = SentimentIntensityAnalyzer()
sentiment_dict = sid_obj.polarity_scores(
    "Vader works well!"
)
print(sentiment_dict)
# >> {'neg': 0.0, 'neu': 0.556, 'pos': 0.444, 'compound': 0.3382}
VADER classifies our sample 'Who wouldn't love a headache?' as negative sentiment.
Depending on the nature of the dataset, one tool may do better than the other.
8.4.2 Transformers
Sentiment analysis can also be performed using the transformers library. We have
seen a demo of Hugging Face transformers for some other NLP tasks. Let’s look at
how to use it for sentiment analysis.
!pip install transformers

from transformers import pipeline

# create the sentiment analysis pipeline (uses the default model described below)
sentiment_analysis = pipeline("sentiment-analysis")

result = sentiment_analysis(
    "Transformers are great for many tasks."
)[0]
print(result)
# >> {'label': 'POSITIVE', 'score': 0.9994866847991943}
By default, the transformers library uses a DistilBERT [148] model fine-tuned on the Stanford Sentiment Treebank v2 (SST2) [145] task from the GLUE dataset [190]. A different model or tokenizer can be used by passing it in when instantiating the pipeline. The full list of available models can be found here9.
9 https://huggingface.co/models?other=sentiment-analysis&sort=downloads
Passing the same ambiguous sample sentence through this model, the results are
as follows.
10 https://cloud.google.com/natural-language/docs/analyzing-sentiment
11 https://docs.aws.amazon.com/lex/latest/dg/sentiment-analysis.html
12 https://www.ibm.com/cloud/watson-natural-language-understanding
Windup
In this section, we implemented several advanced NLP applications. We discussed IE (Information Extraction), specifically implementations for NER and keyphrase extraction using open-source tools and pre-trained models. These applications typically form an important part of larger NLP projects, such as personally identifiable information (PII) recognition and chatbots.
We examined topic modeling as an unsupervised approach to group similar documents together. Topic modeling is a popular approach to cluster data where some overlap is expected between clusters. This method is also used to curate data labels and understand data patterns.
Then, we looked at different types of text similarity measures and explored implementations of semantic text similarity. This finds use in building recommendation systems that measure similarity between search terms and content, or between two pieces of content.
Summarization of documents is helpful in reducing the size of the data, enabling faster search, and generating summaries to sift through rather than having to go through full documents manually. Many applications exist in different industry domains, such as summarizing patient visit notes in healthcare, and in legal and research settings. We discussed extractive and abstractive text summarization and shared implementations using a variety of models. The need to implement these solutions from scratch is rare for practitioners working in an enterprise setting; many available tools can be used instead, which we discussed in this section.
We also looked at language detection and translation, along with the options from service providers as well as some open-source tools. This is particularly useful for businesses with users across the globe. It is most common to opt for service providers for this application.
We looked at many ways of approaching a text classification problem and demonstrated them using the ham/spam dataset. Finally, we discussed sentiment analysis and implemented it using open-source pre-trained models. Since this problem is common and pre-built solutions are widely available, the need to build custom sentiment analysis models is rare, unless the dataset is domain-specific and contains a lot of jargon.
Throughout this section, we looked at classic approaches, API services, and the latest transformer-based approaches. Each of these applications is directly valuable for many use cases in the industry. At other times, multiple such applications have to be combined to satisfy an industry application. How does it all fit into projects around text data in a company? We'll build some real-world industrial projects in Section VI using the techniques and concepts discussed thus far. We will explain an NLP project in the context of company goals and build solutions using Python. We will also share some practical tips and solutions to common industrial problems.
VI
Implementing NLP Projects in the Real-World
An enterprise is a firm or a combination of firms that engages in economic activities which can be classified into multiple industries. We'll be using the terms industry and enterprise interchangeably throughout this section, referring to work that happens in a company rather than an educational setting. The primary reason for this focus is that the kind of projects you typically work on during advanced degrees, courses, or boot camps are very different in nature from real-world projects. On the academic front, there is a heavier focus on concepts, models, and features, which are highly relevant skills. But academic projects also assume that data sources are readily available and offer little visibility into the considerations and impacts of building NLP solutions in enterprise settings. In this section, we will look at these projects in a real-world sense to give you a glimpse of how it all works in the enterprise.
In Section V, we implemented the most common advanced applications of NLP, including information extraction, topic modeling, text summarization, language translation, text classification, and sentiment analysis. In this section, we'll build some real-world applications of NLP using the concepts and implementations presented in the previous chapters. We will discuss and build around business goals and data challenges commonly faced in practice. We'll build the following projects.
• Chatbots
FIGURE 8.7 Where data science modeling fits within a business’s goal and its driving
factors.
CHAPTER 9
Chatbots
2. Conversational chatbots
These can be either goal-oriented, such as allowing users to place a pizza order
over chat, or more open-ended such as having a normal human-like conversation
with no end goal. The key components are natural language generation to
generate responses to users and adaptive state memory (does the bot need to
have a history of 10 recent messages or 20?) because the user message-response
pairs may not be independent.
In December 2022, OpenAI (an AI research and deployment company) released
ChatGPT1 , a chatbot that converses with humans to help write code, debug
code, compose essays, recommend places to visit, summarize a long document,
make up stories, and give you ideas for questions like ‘How do I decorate my
room?'. Shortly after, Google announced Bard and Meta announced LLaMA.
These chatbots decipher a user’s requests and then call different models for
different tasks they want to perform. For example, a different model is called
if the request requires document summarization versus paraphrasing. These
are some well-known examples of conversational chatbots that rely on large
language models (LLMs) and integrate different fine-tuned models to perform
a variety of different tasks.
In the enterprise, the most common type of conversational chatbot is the goal-oriented chatbot, where there is typically an end goal associated with the conversation. Examples include chatbots for placing orders or booking movie tickets, and Cortana2 by Microsoft.
1 https://openai.com/blog/chatgpt/
2 https://support.microsoft.com/en-us/topic/chat-with-cortana-54a859b0-19e9-cc13-c044-3b2db29be41e
USER: Hi
BOT: The store is open from 8am to 9pm PST every day.
It is closed on public holidays.
In the example above, we need a system to map a user’s message to either timing,
address, greeting, or something else. We call this the user intent. Identification of
intent helps the chatbot generate an appropriate response.
Let's complicate our example and include the ability to get all the information necessary to order a pizza from a store. Now we not only need to know the user's intent but also entities such as pizza size, toppings, etc. We need all these entities to place a pizza order. Hence, if the user's first message does not contain all the required information, the chatbot will need to maintain a state and ask the user follow-on questions in order to gather it.
Action plan
You are tasked with creating a chatbot for a business that runs a store. The goal
is to build a simple chatbot prototype that can chat with customers and tell them
about store timing and address. The chatbot should be able to direct the users to a
customer service representative for advanced questions.
Concepts
A chatbot implementation need not be complex if your use case is simple.
Rule-based or Q/A chatbots are the easiest to build and understand. Imagine your business sells certain services to its customers, and you want to create a chatbot for your customers.
Solution
This chatbot will present options to the user and respond with an answer.
def get_responses(user_message):
    """
    Returns a list of responses based on user input message.
    """
    output = []
    if "timing" in user_message:
        output.append("""
        The store is open from 8 am to 9 pm PST every day.
        It is closed on public holidays.
        """)
    if "address" in user_message:
        output.append("xyz St, Los Angeles, CA, 91***")
    return output


print("""
Hi! I can help you with store information.
Type 'timing' to know about the store timings.
Type 'address' to learn about the store address.
""")
user_input = input().lower()
print("\n".join(get_responses(user_input)))
USER: timing
BOT: The store is open from 8am to 9pm PST every day.
Example 2
BOT: The store is open from 8am to 9pm PST every day.
It is closed on public holidays.
The following implementation shows one way to better handle free-form text questions by matching them to a fixed set of known questions. This approach uses synonyms.
import re

from nltk.corpus import wordnet
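The matching logic can be sketched as follows: expand hand-picked keywords with WordNet synonyms and check the user message against each expanded set. The keyword lists below are illustrative.

def expand_with_synonyms(keywords):
    """Expand a seed keyword list with WordNet synonyms."""
    expanded = set(keywords)
    for word in keywords:
        for syn in wordnet.synsets(word):
            for lemma in syn.lemmas():
                expanded.add(lemma.name().lower().replace("_", " "))
    return expanded

INTENT_KEYWORDS = {
    "timing": expand_with_synonyms(["timing", "hours", "open"]),
    "address": expand_with_synonyms(["address", "location", "directions"]),
}

def match_intents(user_message):
    """Return the intents whose expanded keywords appear in the message."""
    tokens = set(re.findall(r"[a-z]+", user_message.lower()))
    return [
        intent
        for intent, keywords in INTENT_KEYWORDS.items()
        if tokens & keywords
    ]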
USER: Hi
BOT: The store is open from 8am to 9pm PST every day.
It is closed on public holidays.
The store is located at xyz St, Los Angeles, CA, 91***.
Action plan
You are tasked with building a chatbot for a pizza shop. The goal is to build a
prototype that can assist customers to place orders for pickup.
Concepts
You want your customer to be able to place an order for a pizza by chatting with your chatbot. First, two things need to be identified from an NLP perspective: intent and entities. The intent will help you differentiate between whether the customer wants to place an order, is just inquiring about the available pizza toppings, or is simply asking for store timings. If you have an intent classification model and an entity classification model, you have solved some of the big requirements for implementing a chatbot. The intents and entities are likely different for different industries or use cases of the chatbot.
Once you have some labeled data, you can train an intent classification model
and an entity recognition model using tools and techniques discussed in Section V.
There are also other options for building a chatbot using your labeled data. One such
option is to leverage prebuilt services such as Google’s Dialogflow3 to implement the
chatbot. Most service providers like AWS, IBM, etc. have options you can use to build
and deploy your chatbot. Another option is using the RASA chatbot framework [37]
if you want to build your own components. We’ll look into both these options in this
section.
The reason many industry practitioners do not build complete chatbot pipelines from scratch is the complexity outside of the intent and entity models. In addition to those models, you will need a response generation mechanism, a service that talks to your database, and other systems that can successfully trigger an action, such as placing an order. You'll also need a way to deploy the chatbot on your website as well as on chat platforms such as Facebook Messenger, Slack, etc.
Solution
For this demonstration, we will build a pizza-ordering chatbot using RASA. Before
that, we’ll highlight some other solution options as well.
There are many other options offered by different service providers. Table 9.1
provides some high-level details, differences, and pricing. Since the costs are subject to change, we denote relative expensiveness with dollar signs compared to the other vendors in the table. Even then, a vendor might be cheaper for your company than the others based on any existing contracts, which may include free credits.
FIGURE 9.5 Training data for building a custom NER model with spaCy.
One potential downside is that it could be complex for beginners, and a fair understanding of chatbots and how the tool works would need to be acquired.
Let’s build a small prototype for a pizza-ordering chatbot and demonstrate the
usage.
Running this demo in a virtual environment helps avoid dependency conflicts with what you may already have installed on your machine. We use conda4 to create a Python 3.8 environment using the bash commands below.
conda create -n rasademo python=3.8
Enter y on the prompts. Once done, run the following command to activate. Then,
install rasa and create a project.
conda activate rasademo
pip install rasa
pip versions >= 20.3 and < 21.3 can make the install very slow. This is due to
the dependency resolution backtracking logic that was introduced in pip v20.3;
v21.3 appears to no longer have the same issue. If your install is taking long,
check your pip version. Upgrade your pip as follows to resolve the issue, and
run the rasa install command again.

pip install --upgrade pip==21.3
Then, running the following command will prompt you to create a sample RASA project. We used RASA version 3.1.
4 https://www.anaconda.com/products/distribution
FIGURE 9.6 Test results for our custom NER model built using spaCy for entities
related to pizza attributes.
rasa init
Enter the path you want to create your project in. You’ll notice some files are
created as in Figure 9.7.
To configure a chatbot, you will need to understand what some of the files contain.
At a high-level, Figure 9.8 shows where these files sit within the chatbot system seen
earlier in Figure 9.2. Let’s look at the files one by one.
nlu.yml: This file contains intents. You can add a new intent by following the format you already see there. In our case, we want to add the intent of ordering a pizza. Intent names should not be duplicated. See Figures 9.9 and 9.10 for how we defined our intents in nlu.yml; an illustrative snippet of the format is shown below.
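A minimal sketch of what such entries can look like in RASA 3.x (the example utterances are illustrative):

nlu:
- intent: greet
  examples: |
    - hi
    - hello there
- intent: order_pizza
  examples: |
    - I want to order a pizza
    - can I get a [large](pizzasize) pizza with [mushrooms](topping)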
actions.py: You'll see a class Action and two methods, name and run. To create your own actions, create a child class of Action.
FIGURE 9.9 nlu.yml intents related to greeting, user agreement, and user disagreement.
domain.yml: This file contains the domain knowledge needed by the chatbot in terms of what to respond or do when people ask something. Add your custom actions, responses, entity slots, and intents to this file. More on domain.yml can be found here5. Here's how we structured it.
intents:
  - greet
  - order_pizza
  - order_pizza_topping_size
  - order_pizza_topping
  - order_pizza_size
  - agree
  - disagree

entities:
  - topping
  - pizzasize

slots:
  topping:
    type: list
    mappings:
    - type: from_entity
      entity: topping
  pizzasize:
    type: text
    mappings:
    - type: from_entity
      entity: pizzasize

5 https://rasa.com/docs/rasa/domain/
For defining responses, you can have multiple response variations under each action utterance. The bot will randomly select one of those in chats. See utter_greet below as an example. You can also customize responses with values from entities; see how we defined some of our responses for utter_order_placed as an example. The full file can be found in section6/rasademo/domain.yml.
responses:
  utter_greet:
  - text: "Hey! How can I help you?"
  - text: "Hello! Which pizza would you like to order?"
  utter_order_pizza_topping:
  - text: "Can you tell me what toppings you'd like?"
  utter_order_pizza_size:
  - text: "Can you tell me what pizza size you'd like?
      We have medium and large."
  utter_order_pizza_topping_size:
  - text: "Thank you. We are getting ready to place
      your order for pizza size: {pizzasize} with toppings
      {topping}. Does the order look correct?"
  utter_disagree:
  - text: "Sorry that we were unable to help you on
      chat. Kindly call +1(800)xxx-xxxx and they'll assist
      you right away."
  utter_order_placed:
  - text: "Good news, your order has been placed! Your
      {pizzasize} pizza with {topping} will be ready in
      30 mins!"
'actions' in domain.yml needs to contain a list of every possible action your bot can take. This includes responses and custom actions (such as updating a database, etc.).
actions:
  - utter_greet
  - utter_disagree
  - action_order_pizza
  - utter_order_pizza
  - utter_order_pizza_topping
  - utter_order_pizza_size
  - utter_order_pizza_topping_size
  - utter_order_placed
Next, stories.yml ties all the pieces together. This file contains sample sequences of events in sample conversations. Here are a few examples of how we defined our stories for this demo; an illustrative story is shown below. The full script can be found in section6/rasademo/data/stories.yml.
stories:
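# The story definitions from the book's figures are not reproduced here; the
# story below is an illustrative sketch using the intents and actions defined
# in domain.yml above.
- story: order pizza happy path
  steps:
  - intent: greet
  - action: utter_greet
  - intent: order_pizza
  - action: utter_order_pizza_topping
  - intent: order_pizza_topping
  - action: utter_order_pizza_size
  - intent: order_pizza_size
  - action: utter_order_pizza_topping_size
  - intent: agree
  # - action: action_order_pizza   # commented out for this demo
  - action: utter_order_placed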
For this demo, we will be testing this sample without a defined action. Hence, we
comment out action_order_pizza in the stories section.
Run rasa data validate to ensure there are no errors with your data
changes/additions.
Now, you can train your model.
rasa train
Several options can improve your chatbot besides adding more relevant training data and fine-tuning the rasa rules, domain, nlu, and stories.
The default pipeline selection uses DIETClassifier, the Dual Intent and Entity Transformer (DIET), which is used for both intent classification and entity extraction.
7 https://rasa.com/docs/rasa/components/
When you start generating real chat data for your bot,
manually go through some of it to label and save as test data
for your chatbot. This way, you’ll be able to run tests on real
conversations.
You can also test different components of your chatbot in addition to complete test stories. You can test your NLU (natural language understanding) model using the command rasa data split nlu, which will shuffle and split your data into training and testing samples. rasa test nlu --nlu data/nlu.yml --cross-validation is a sample command you can use to run a full NLU evaluation using cross-validation.
To test the intent and entity models, run
9 https://rasa.com/docs/rasa/testing-your-assistant/
10 https://rasa.com/docs/rasa/connector
Just like RASA models, Dialogflow gets better with more data. However, some complex examples fail with Dialogflow when multiple pieces of information are present in the same sentence, for instance:
'I have a chicken with me, what can I cook with it besides chicken lasagna?'
'Give me a recipe for a chocolate dessert that can be made in just 10 minutes instead of the regular half an hour.'
In a RASA pipeline with custom models, such adversarial examples can be added to the training data so the model learns to identify the correct entities and their values.
CHAPTER 10
Customer Review Analysis
Action plan
Your customer strategy team has a few different requirements.
- They want to understand comment sentiment.
- They want to understand themes in the positive and negative sentiment comments without having to manually read all of them.
- Once they gain more visibility into the popular themes, they want to select a few
themes that make sense from a business standpoint and also from a data standpoint.
- Then, they want you to build a model that can detect the presence of the selected
themes within any comments.
This will eventually allow the business to show the classification on their website
so users can sift through reviews around a particular theme of interest, such as ‘staff
and service’.
Dataset
We will build our analysis and models for comments on hotels from the OpinRank
Review Dataset available at the UCI Machine Learning Repository.1
We’ll select New York based hotels for our analysis.
The total number of reviews is 50656
Shortest review length: 10 characters
Longest review length: 793 characters
Mean review length: 981 characters
1 https://archive.ics.uci.edu/ml/datasets/opinrank+review+dataset
FIGURE 10.1 Performing comment review analysis from a company KPI perspective.
The size of the dataset here is good for producing a data-driven analysis.
Solution
From a data science perspective, the steps that we will go through are shown in
Figure 10.2.
FIGURE 10.2 Data science tasks breakdown for customer review analysis project.
from textblob import TextBlob

sentiment = {}
for rev_id, rev in id_review.items():
    pol = TextBlob(rev).sentiment.polarity
    if pol > 0:
        sent = "pos"
    elif pol < 0:
        sent = "neg"
    else:
        sent = "neu"
    sentiment[rev_id] = {"class": sent, "polarity": pol}
We notice that the majority of the comments are positive. Let's look at a few samples from each sentiment class.
• Positive:
‘We were given an overnight stay at the St. Regis as an anniversary present and were
treated to elegant luxury. The accommodations were plush, clean and first class.The
location to the theater district was convenient as well as many choices of restau-
rants.For an overnight in the City, do not hesitate to enjoy the St. Regis.’
‘I was on a trip looking for sites to hold business meetings in New York City. Every-
one at the St. Regis, from the front desk to security to the housekeeping and butlers
were friendly, helpful and went out of their way to provide anything I requested. The
rooms were spacious (for New York) and quiet and the sheets on the bed were Pratesi.
What more could someone ask for? Oh yes, they also provided butler service.’
‘I’ve stayed at the St. Regis on several occasions and had a wonderful experience each
time. The guest rooms are spacious, quiet, well decorated and functional, with com-
fortable beds and big marble bathrooms. Public areas are elegant. The staff is cheerful
and professional. Room service is prompt and the food is tasty. Ideal location. Overall,
a lovely hotel.’
• Neutral:
‘could not fault this hotel, fab location, staff and service... will definitely stay there
again’
‘If you like animals - cockroaches - this is the hotel for you!Stayed here in june and it
was cockroaches all over the placein the bathroom and under the bed - not nice.....But
if you like animals this is the hotel for you!I don’t recommend this hotel at all!!’
‘STAY HERE! But choose the river view, not the twin towers view.’
• Negative:
‘While this hotel is luxurious, I just spent my second night on the fourth
floor and was woken up at two by garbage trucks outside which loaded and beeped for
an hour. My colleague got bumped up to a suite when she complained about her room.
Avoid any rooms on low floors facing the street and you might get some sleep.’
‘room smelled like mold....you could see mold in the tub...when i checked in on satur-
day agent failed to tell me the following day there would be a festival that would on
shut down all street including the one in front of the hotel making it impossible to get
a taxi to anywhere.The deorative pillows on the bed were so filthy i have to put them
on the floor. I would never stay here again even for half the price.’
‘We must have had the worst room at the hotel compared to the other ratings. Our
windows faced a brick wall, the windows wouldn’t open properly which we wanted be-
cause the airconditioning system wouldn’t regulate properly. The room was small and
because of the windows facing the wall, the room was dark and dreary. Never saw the
sun. It was like staying in a closet. The staff were a bit put off and arrogant. Not
friendly. The only positive about this hotel is the location. There are better choices.
We will not stay here again.’
We notice that the neutral sentiment classification fails when abbreviations are used to describe the stay, such as 'fab'. Furthermore, any sarcasm present in the comments leads to further incorrect classification.
On manually checking 100 random comments per class, the neutral class had the most incorrect classifications, with 60% classified correctly. We noticed 80% accurate results for the positive class and 100% accurate results for the negative class.
For our purpose, we want to understand the negative and positive comments
further. These results overall seem satisfactory.
In the event we wanted to re-purpose this model with a heavy focus on neutral sentiment, we would have needed to assess whether 60% accuracy would be satisfactory by discussing with the team intending to use the outcome.
FIGURE 10.3 Data science tasks breakdown for customer review analysis project (sen-
timent analysis).
We clean the comments to remove stop words that aren’t going to give us any
meaningful insights, words such as to, at, a, etc. We also remove punctuation and
lemmatize the words.
!pip install nltk
!pip install wordcloud

import string

from matplotlib import pyplot as plt
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud

negatives = []
positives = []
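The code that fills these lists, and the plot_wc helper used for the word clouds in the rest of this chapter, can be sketched as follows, reusing the sentiment dictionary computed earlier.

# Split review ids by the sentiment class computed earlier
for rev_id, info in sentiment.items():
    if info["class"] == "pos":
        positives.append(rev_id)
    elif info["class"] == "neg":
        negatives.append(rev_id)

def plot_wc(docs):
    """Plot a word cloud for a list of cleaned documents."""
    wc = WordCloud(
        width=800, height=400, background_color="white"
    ).generate(" ".join(docs))
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()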
The largest common themes in both positive and negative sentiments appear to be around room and hotel. These word clouds contain multiple words that do not give us information about the theme, such as 'one', 'really', etc. Thus, we will pass the data through another cleaning function to retain only nouns and plot the word clouds again. We'll also remove the words 'room' and 'hotel' to study the other themes in the two sentiment classes and compare them.
from nltk import word_tokenize
from nltk import pos_tag

def noun_clean(x):
    """
    Retain only nouns and then pass through cleaning function
    """
    tokens = word_tokenize(x)
    tags = pos_tag(tokens)
    nouns = [
        word
        for word, pos in tags
        if pos in ("NN", "NNP", "NNS", "NNPS")
    ]
    return clean(" ".join(nouns))
# positives
noun_pos = [noun_clean(id_review[rev_id]) for rev_id in positives]
# remove hand-selected words
clean_pos = [
    " ".join(
        [
            j for j in i.split()
            if j not in ["room", "hotel", "quot"] and len(j) >= 2
        ]
    ) for i in noun_pos
]
plot_wc(clean_pos)

# negatives
noun_neg = [noun_clean(id_review[rev_id]) for rev_id in negatives]
# remove hand-selected words
clean_neg = [
    " ".join(
        [
            j for j in i.split()
            if j.lower() not in ["room", "hotel", "quot"] and len(j) >= 2
        ]
    ) for i in noun_neg
]
plot_wc(clean_neg)
Let's first look at the word cloud for the positive reviews. In these plots, we observe
the words time, staff, location, night, and several other room-related words such as
door, bathroom, and bed. Bucketing some top words in general topical areas, we have
1. staff and service with other related words such as time, night, etc.
2. location, with other related words such as area, street, york, park, etc.
Next, let’s explore the negative reviews word cloud. We see words such as night,
staff, time, desk, day, bathroom, and service. Bucketing some top words in general
topical areas, we have
1. staff and service with other related words such as time, night, desk (likely
coming from front desk), manager, etc.
There are fewer location-related and food-related words compared to the positive
word cloud. In comparison, 'service', 'bathroom', and 'night' seem to be mentioned a
lot more in negative reviews.
To understand which reviews contain what topics, we will create a classification
model. The above topics identified from the word clouds can be a good baseline for
classifying our comments. In reality, such decisions are made in conjunction with
important stakeholders to satisfy the business applications.
Here, we choose service & staff (including hotel food services), location, and room
as the topics with which we want to classify our comments. These three categories
are also generally common from a business perspective for hotel review classification
(see Google reviews classification for hotels).
This completes the highlighted task in Figure 10.8.
FIGURE 10.8 Data science tasks breakdown for customer review analysis project (iden-
tification of topics and themes).
subset = set(
    random.sample(
        list(id_review.keys()), int(len(id_review) * 0.75)
    )
)
1. One method is to hand-curate a set of words that identify with a topic. We then plot word clouds for all the documents that contain any of our hand-curated words, removing the hand-curated words themselves from the plot. This surfaces other words
that co-occur with our initial set. We can then increase our initial set and add
more words identified from the word cloud. Next, we assume all documents
containing any of the words in the set belong to our topic. We repeat this for
each topic. This process helps quickly gather some training data for the classifi-
cation problem. The main downside is a potential bias that can get introduced
in our model via the hand-curation process. A further step could be to remove
the search word from the document that we append to our training dataset
to limit the bias of this initial process if the test results are not satisfactory.
Nonetheless, this can form a good baseline model that you can improve upon as
a part of future iterations once you’re able to curate more generalizable training
data.
2. We can run a clustering algorithm such as LDA and inspect the resulting clus-
ters for several runs with different num_topics. Whichever cluster seems rele-
vant to a topic, we can manually check accuracy for a random 100 samples and
add all documents that get assigned to that cluster to our training dataset for
the topic if the accuracy is satisfactory. A similar approach can be used with
any clustering algorithm such as K-means. We can also use zero-shot classifi-
cation to refine data labels, followed by manual verification of the results. This
strategy limits the high inference time to only the initial label curation process.
While approach number 1 may not seem like a conventional data science method
of solving such a problem compared to approach number 2, it can work well and
fast for a problem like this. For this exercise, we’ll go with approach number 1 and
evaluate our model. We’ll also explore approach number 2 at the surface level using
LDA before wrapping up this section for demonstration purposes.
Before we begin, we need to break down each review comment into segments.
This is because the review comments in our data contain information about multiple
topics in the same comment, and sometimes in the same sentence as well. Breaking
down the comment into parts and sentences will help us segregate the data such that
multiple topics are less likely to occur within the same sample.
'The rooms were very clean, service was excellent and useful. Location was outstanding' -> ['The rooms were very clean', 'service was excellent and useful', 'Location was outstanding']
import re

# SENT_SPLIT is the regex used to split a review into segments; a pattern
# splitting on sentence-ending punctuation and commas (as in the example
# above) is assumed here.
SENT_SPLIT = r"[.!?;,]"

new_id_docs = []
for rev_id, rev in id_review.items():
    if rev_id in subset:
        rev_split = re.split(SENT_SPLIT, rev)
        for phrase in rev_split:
            # get each word in the cleaned phrase
            phr_list = clean(phrase).split()
            # only select cleaned phrases
            # that contain more than 2 words
            if len(phr_list) > 2:
                new_id_docs.append((rev_id, phr_list))
Then we plot word clouds to see the top noun words in the documents. We remove the overall top-occurring terms, such as 'hotel', in addition to the initial hand-curated set.
remove_set = {"hotel"}

print("\nFor location\n")
plot_wc(clean_training_samples(
    location, set.union(remove_set, loc_set))
)
print("\nFor room\n")
plot_wc(clean_training_samples(
    room, set.union(remove_set, room_set))
)
print("\nFor staff and service\n")
plot_wc(clean_training_samples(
    serv, set.union(remove_set, serv_set))
)
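clean_training_samples is assumed here to take the per-topic documents (as (rev_id, word_list) tuples) and a set of words to drop, returning word-cloud-ready strings; a minimal sketch:

def clean_training_samples(docs, words_to_remove):
    """Join each document's words into a string, dropping unwanted words."""
    return [
        " ".join(
            w for w in word_list if w not in words_to_remove
        )
        for _, word_list in docs
    ]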
The word clouds produced are shown in Figures 10.9, 10.10, and 10.11. We curate
more words by looking at the word clouds. Then, we select all comment segments
containing any of our curated words per topic to form our training dataset.
For location,
loc_set = {
    "location", "subway", "walking", "street", "block",
    "distance", "walk", "park", "midtown", "manhattan",
    "empire", "avenue", "shop", "attraction"
}
loc_set_dual = {"time", "square"}

location_train = []
for rev_id, word_list in new_id_docs:
    for loc_word in loc_set:
        if loc_word in word_list:
            location_train.append(
                (rev_id, word_list)
            )
    # add the segment if it contains all the words in loc_set_dual
    if len(
        [w for w in loc_set_dual if w in word_list]
    ) == len(loc_set_dual):
        location_train.append(
            (rev_id, word_list)
        )

print(len(location), len(location_train))
# >> 18292 88196
For room,
room_set = {
    "bath", "floor", "shower", "window", "space",
    "room", "bed", "bathroom", "bedroom"
}

room_train = []
for rev_id, word_list in new_id_docs:
    for room_word in room_set:
        if room_word in word_list:
            room_train.append(
                (rev_id, word_list)
            )

print(len(room), len(room_train))
# >> (103584, 141598)
FIGURE 10.12 Data science tasks breakdown for customer review analysis project (cu-
rating training data).
Next, we put together this data, randomly shuffle it, and create a baseline Multinomial Naive Bayes model as in Chapter 8.
cleaned_data = (
    [
        (" ".join(word_list), "staff_serv")
        for _, word_list in serv_train
    ] + [
        (" ".join(word_list), "loc")
        for _, word_list in location_train
    ] + [
        (" ".join(word_list), "room")
        for _, word_list in room_train
    ]
)
cleaned_data = [
    itm for itm in cleaned_data
    if len(itm[0].split()) > 2
]

random.shuffle(cleaned_data)
x = [itm[0] for itm in cleaned_data]
y = [itm[1] for itm in cleaned_data]
We’ll use the same code used in Chapter 8 for training the model. The full script
can be found in section6/comment-analysis-hotel-reviews.ipynb.
302 Natural Language Processing in the Real-World
FIGURE 10.14 Data science tasks breakdown for customer review analysis project
(training a classification model).
We want to identify the correct classifications in each comment segment. We'll use the predict_proba function to get prediction probabilities per class and identify a good cut-off threshold score. This means any segment classified as 'location' with a probability > threshold will be deemed location-related. This is also useful in our case because many comments have segments that do not belong to any of the three topics we want to identify, e.g., 'I was traveling on Monday for a work trip.' from 'I was traveling on Monday for a work trip. The staff was very helpful in accommodating my late check in request.' However, our classifier will force each segment into one of the three defined classes. Thus, by eliminating low-probability detections, we can disregard classifications of comment segments that are irrelevant to our model.
Since we curated the data manually, testing this model is important so we can detect cases of bias and work on future iterations accordingly. Let's inspect some results below.
best_model.classes_
# >> array(["loc", "room", "staff_serv"])

# Room classification
best_model.predict_proba(vectorizer.transform([clean(
    "the bedroom was spacious"
)]))
# >> array([[0.01944015, 0.95802311, 0.02253675]])
FIGURE 10.15 Data science tasks breakdown for customer review analysis project
(model evaluation).
Using the above threshold, we can now pass new comments through our classifier and get sentiment and topics for each comment as follows.

clean_test = {}
for rev_id, rev in id_review.items():
    if rev_id not in subset:
        phrase_list = re.split(SENT_SPLIT, rev)
        clean_test[rev_id] = [
            clean(phr) for phr in phrase_list
        ]

print(f"{len(clean_test)} unseen test samples prepared")
# > A random sample was taken to identify a threshold
# Threshold of 0.62 was identified
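One way to put the pieces together, combining the topic classifier (with the 0.62 probability threshold) and the sentiment computed earlier into the [id, comment, topics, sentiment] format shown next, is sketched below; the helper name and counting logic are illustrative.

THRESHOLD = 0.62

def classify_review(rev_id):
    """Count topic detections across a review's segments and attach sentiment."""
    topic_counts = {}
    for segment in clean_test[rev_id]:
        if len(segment.split()) <= 2:
            continue
        probs = best_model.predict_proba(
            vectorizer.transform([segment])
        )[0]
        best_idx = probs.argmax()
        if probs[best_idx] > THRESHOLD:
            label = best_model.classes_[best_idx]
            topic_counts[label] = topic_counts.get(label, 0) + 1
    return [
        rev_id,
        id_review[rev_id],
        list(topic_counts.items()),
        sentiment[rev_id]["class"],
    ]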
Here is what a few samples look like. Format ([id, comment, topics, sentiment]).
[45522, ‘Stayed here for 4 nights and really enjoyed it - staff were friendly and
the rooms were lovely...had a real charm old school feeling. Don’t be put off by the
negative reviews here....we really did not have anything to complain about, it was
great’, [(‘staff_serv’, 1)], ‘pos’]
[28126, ‘Granted, the hotel has history, but trying to get a good nights sleep is
a whole different story. The windows are single pane - I think they built a building
next to us during the night. It was soo noisy you could hear honking and noises the
entire evening. The room was tiny even for New York standards. The lampshade was
actually torn and stained and looked like it belonged in a shady hotel(looked worse than
a salvation army lampshade). This was evidence of how clean the overall room was
with black hairs in the bathtub...yuk. The beds were small hard and uncomfortable. We
changed hotels for the remainder of nights. Extremely disappointed, especially after
the good reviews here.’, [(‘room’, 4)], ‘neg’]
This completes the final task of putting it all together as seen in Figure 10.16.
FIGURE 10.16 Data science tasks breakdown for customer review analysis project
(pipeline).
306 Natural Language Processing in the Real-World
Why not just look for the initial list of curated keywords to
identify topics in comments? Why create a model?
FIGURE 10.17 Data science tasks breakdown for customer review analysis project (cu-
rating training data).
We pass our data into LDA clustering and experiment with multiple runs using different numbers of clusters, as follows. The complete code can be found in the notebook section6/comment-analysis-hotel-reviews.ipynb on GitHub.
from gensim import corpora, models

# Creating the object for LDA model using gensim library
lda = models.ldamodel.LdaModel
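The dictionary_pos and doc_term_matrix inputs used below are built from the tokenized noun-only segments; a minimal sketch with gensim, assuming the tokenized documents come from new_id_docs, is:

# tokenized, noun-filtered documents
docs_tokens = [word_list for _, word_list in new_id_docs]

dictionary_pos = corpora.Dictionary(docs_tokens)
doc_term_matrix = [
    dictionary_pos.doc2bow(doc) for doc in docs_tokens
]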
Here, cluster 0 appears to be related to the topics room and service & staff.
Cluster 1 has aspects related to the topic location.
lda_3 = lda(
    doc_term_matrix, num_topics=3,
    id2word=dictionary_pos, passes=10
)

# Results
for itm in lda_3.print_topics():
    print(itm, "\n")
Here, cluster 2 looks strongly related to the topic service & staff and cluster 1
looks strongly related to the topic room. Cluster 0 has a few aspects that relate to
location.
lda_4 = lda(
    doc_term_matrix, num_topics=4,
    id2word=dictionary_pos, passes=10
)

# Results
for itm in lda_4.print_topics():
    print(itm, "\n")
Here, cluster 2 looks strongly related to the topic room and cluster 0 looks strongly
related to the topic service & staff. Cluster 3 has aspects that relate to location.
lda_5 = lda(
    doc_term_matrix, num_topics=5,
    id2word=dictionary_pos, passes=10
)

# Results
for itm in lda_5.print_topics():
    print(itm, "\n")
Here, cluster 0 looks strongly related to the topic location and cluster 1 looks
strongly related to the topic room. Clusters 2 and 4 also have aspects that relate to
room. Cluster 3 has some hints of service & staff.
We see that some of the resultant clusters look very relevant to our topics. Since cluster 0 of the lda_5 model is strongly related to location, we further explore the quality of the data in that cluster.
We randomly sample 100 documents that belong to cluster 0 of lda_5 and manually label them. We find that 72% of the documents are relevant to the class location. To further clean the document list, we can take the documents irrelevant to the topic location, find similar documents using text similarity (as discussed in Chapter 8), and remove them. The resulting accuracy will likely be higher and give you ready-to-use training data for the class location. A similar approach can be followed for the other two classes as well.
CHAPTER 11
Recommendations and
Predictions
11.1.1 Approaches
Recommendation systems usually rely on a few different approaches.
1. Collaborative filtering
Collaborative filtering is the process of getting recommendations for a user
based on the interests of other users that have watched the same content.
2. Content-based filtering
Content-based filtering gets recommendations for a user based on the user’s
preferences using content-based features. The recommendations are typically
items similar to ones the user has expressed interest in previously.
3. Knowledge-based systems
Another realm of recommendation systems includes knowledge-based systems,
where contextual knowledge is applied as input by the user. Knowledge-based
recommender systems are well suited to complex domains where items are not
purchased very often. Examples include apartments, cars, financial services,
digital cameras, and tourist destinations.
Action plan
The goal is to build a recommendation system for videos. There is no user watch
history available. The only known input is the title and description of the video that
the user is currently watching. The goal is to recommend 8 videos similar to the one
being watched. This is part of a new product (with a user interface and platform) that is not fully built out yet, so the model needs to be built in parallel for the recommendation system to launch with the product. You need to think of the best way to create the recommendation model prototype without having any data. The only known data detail is that it is in a video format with an attached text description.
The product team has evaluated use cases and would like to test them as follows
for a proof of concept.
From a corpus of videos related to ‘Python’,
- get video recommendations for ‘Python’ the snake.
- get video recommendations for ‘Python’ the programming language.
Another requirement is to evaluate a few different options to accomplish the goal
and choose the one with the best results for the homonym ‘Python’ example.
Dataset
Since the business is video-centric, we’ll curate sample video data to build this
prototype. We’ll get data from YouTube using the YouTube API’s ‘search’ endpoint.
The keyword used to query this data is python.
To get the dataset, we’ll use YouTube API as discussed in Section II. We’ll use
code from Chapter 2 (Section 2.2.7). The full code for this exercise can be found in
the notebook section6/content-based-rec-sys.ipynb. We’ll store YouTube video text
in the yt_text variable which is a list of tuples containing video ID as the first ele-
ment and video title + description as the second element. A total of 522 videos were
grabbed using the API for the query search keyword = ‘python’.
Concepts
Python is a homonym: on one hand it can refer to the snake, and on the other, to the popular programming language.
We may be serving video recommendations, but the underlying data is the text
associated with the video title and description. We will build a recommendation
model to get video recommendations based on an input video text. This will be the
core component of a system that is able to recommend videos based on the video the
user is looking at.
We will use some of the discussed approaches from the text similarity section of
Chapter 8 (Section 8.2). We’ll compare the results from a few different models to
understand which model works well for the described application and data.
11.1.2.1 Evaluating a classic TF-IDF method, spaCy model, and BERT model
Solution
Approach 1: TF-IDF - cosine similarity
TF-IDF or Term Frequency - Inverse Document Frequency is a common and pop-
ular algorithm for transforming text into a numerical representation. It is a numerical
statistic that intends to reflect how important a word is to a document in a corpus.
The sklearn library offers a prebuilt TF-IDF vectorizer. Cosine similarity will be
used to determine the similarity between the two documents.
We can preprocess the text before computing TF-IDF to get rid of noise elements within the text, depending on what our dataset looks like. We can combine text from the title and description of the YouTube content into a single string variable and clean the data based on the observed noise. Removing URLs, stop words, and non-alphanumeric characters can be useful for social media data. However, here we'll proceed without cleaning the data, but we can always revisit this based on the results. Here's what a cleaning method to remove noise and unwanted elements from YouTube data could look like.
import re
from nltk.corpus import stopwords
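A minimal sketch of such a cleaning function, following the steps mentioned above (the function name and regexes are illustrative):

STOP_WORDS = set(stopwords.words("english"))

def clean_yt_text(text):
    """Remove URLs, non-alphanumeric characters, and stop words."""
    text = re.sub(r"http\S+", " ", text.lower())
    words = re.findall(r"[a-z0-9]+", text)
    cleaned_text = " ".join(w for w in words if w not in STOP_WORDS)
    return cleaned_text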
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
# get tfidf of all samples in the corpus
# yt_text is a list of video text - training data
tfidf = vect.fit_transform(yt_text)
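The lookup that produces the top 8 recommendations can be sketched with sklearn's cosine_similarity, assuming query_text holds the text of the video currently being watched:

from sklearn.metrics.pairwise import cosine_similarity

def recommend_tfidf(query_text, n=8):
    """Return indices of the n most similar videos to the query text."""
    query_vec = vect.transform([query_text])
    sims = cosine_similarity(query_vec, tfidf)[0]
    # highest similarity first
    return sims.argsort()[::-1][:n]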
docs_spacy = [nlp("u'" + itm + "'") for itm in yt_text]
# add `if itm in nlp.vocab` to avoid out-of-vocab errors
A potential issue with such a word embedding model is that processing a sentence containing terms that are not in the pre-trained model's vocabulary can throw errors. To ensure a word is present and does not break your code, a presence check can be added as noted in the code comment above.
The top 8 recommendations can be seen in Figure 11.3.
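With the spaCy documents prepared above, the ranking can be sketched by comparing Doc objects directly (nlp is assumed to be a loaded spaCy model with word vectors, e.g., en_core_web_lg):

def recommend_spacy(query_idx, n=8):
    """Rank all videos by spaCy vector similarity to the video at query_idx."""
    query_doc = docs_spacy[query_idx]
    sims = [
        (i, query_doc.similarity(doc))
        for i, doc in enumerate(docs_spacy)
        if i != query_idx
    ]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:n]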
from sentence_transformers import (
    SentenceTransformer, util
)

bert_model = SentenceTransformer(
    "bert-base-nli-mean-tokens"
)
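A minimal sketch of using this model for the same ranking task: encode the corpus once, then compare a query embedding against it with cosine similarity.

# encode the whole corpus once
corpus_emb = bert_model.encode(yt_text, convert_to_tensor=True)

def recommend_bert(query_text, n=8):
    """Return (index, score) pairs for the n most similar videos."""
    query_emb = bert_model.encode(query_text, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    top = scores.argsort(descending=True)[:n]
    return [(int(i), float(scores[i])) for i in top]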
Action plan
The goal for this application is to build a model that predicts the next words based on the current words that the user has written. This tool will be used by a data science team to write documentation.
Concepts
In this demonstration, we'll build a next word prediction model for a corpus constructed from Wikipedia pages on the topic of data science. In a practical scenario, using existing documentation as the training data will be beneficial. Here, we are assuming such a dataset does not already exist. Hence, we leverage public data sources to build this model.
FIGURE 11.5 Building next word prediction models from a company KPI perspective.
We discussed the BiLSTM (Bidirectional Long Short-Term Memory) model in
Chapter 4 (Section 4.2.2.4). BiLSTMs work very well on sequential data and are a
good choice for a task like this. We will use BiLSTM to construct a model that predicts the next n words based on an input word.
Dataset
To curate our corpus, we use the Wikipedia crawler wrapper for Python1 and get
data for a few pages associated with data science.
!pip install wikipedia

import wikipedia
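The page-fetching step can be sketched as follows; the page titles listed here are illustrative and the book's exact list may differ.

# illustrative page titles related to data science
pages = ["Data science", "Machine learning", "Natural language processing"]

corpus = ""
for title in pages:
    try:
        corpus += wikipedia.page(title).content + "\n"
    except (wikipedia.exceptions.DisambiguationError,
            wikipedia.exceptions.PageError):
        # skip pages that fail to resolve
        continue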
Solution
We'll train a BiLSTM model, which is a popular choice for applications where the sequence/ordering of words is important.
Now that we have curated the Wikipedia data, let's clean it and pass it through the Keras Tokenizer. We will remove any lines shorter than five characters, ignoring trailing or leading spaces.

from nltk import tokenize
from tensorflow.keras.preprocessing.text import Tokenizer
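The line-splitting and tokenizer-fitting step that produces the fitted tokenizer and total_words (used below) can be sketched as follows, given the corpus string collected above.

# split the corpus into sentences and drop very short lines
lines = [
    line for line in tokenize.sent_tokenize(corpus)
    if len(line.strip()) >= 5
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
# +1 because Keras reserves index 0 for padding
total_words = len(tokenizer.word_index) + 1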
Next, let's prepare input sequences to pass into the model. From each line in the corpus, we will generate n-gram token sequences. For instance, 'natural language processing generates responses' becomes 'natural language', 'natural language processing', 'natural language processing generates', and 'natural language processing generates responses'. We don't envision the model needing to predict anything more than the next 5-6 words, so while curating sequences for training, we will limit the sequence length to less than 10 to keep some buffer in case the expectation evolves.
import numpy as np

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

1 https://pypi.org/project/wikipedia/
max_seq_len = 10
input_sequences = []
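# The n-gram sequence generation loop is assumed here, following the
# description above: for each line, convert the text to token ids and keep
# every prefix of length 2..max_seq_len as a training sequence
# (`lines` and `tokenizer` as prepared earlier).
for line in lines:
    token_ids = tokenizer.texts_to_sequences([line])[0]
    for i in range(2, min(len(token_ids), max_seq_len) + 1):
        input_sequences.append(token_ids[:i])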
input_sequences = np.array(
    pad_sequences(
        input_sequences, maxlen=max_seq_len, padding="pre"
    )
)
xs = input_sequences[:, :-1]
labels = input_sequences[:, -1]
ys = to_categorical(
    labels, num_classes=total_words
)
We see the number of sequences for training the model is 17729. Next, we set up
the model and begin training.
from tensorflow.keras.layers import (
    Embedding, LSTM, Dense, Bidirectional
)
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(
    Embedding(
        total_words, 100, input_length=max_seq_len - 1
    )
)
model.add(Bidirectional(LSTM(64)))
model.add(Dense(total_words, activation="softmax"))
adam = Adam(learning_rate=0.01)
model.compile(
    loss="categorical_crossentropy",
    optimizer=adam, metrics=["accuracy"]
)
history = model.fit(xs, ys, epochs=10, verbose=1)
FIGURE 11.6 Next word prediction BiLSTM model accuracy and loss at 10 epochs.
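The curves in Figure 11.6 come from the Keras training history; a minimal sketch of plotting them could look as follows (the use of matplotlib here is an assumption, not the book's exact plotting code).

import matplotlib.pyplot as plt

plt.plot(history.history["accuracy"], label="accuracy")
plt.plot(history.history["loss"], label="loss")
plt.xlabel("epoch")
plt.legend()
plt.show()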
We increase the epochs a few times and re-train to find a good number for this
model.

history = model.fit(xs, ys, epochs=20, verbose=1)

We plot accuracy and loss again for each re-train; the results can be seen in
Figure 11.7. Increasing the epochs further is unlikely to make the model's accuracy
higher or its loss lower. We can use KerasTuner, as demonstrated in Chapter 4, for
tuning the number of epochs. Here, we end up with a model accuracy of 79%.

FIGURE 11.7 Next word prediction BiLSTM model accuracy and loss at 20 and 40 epochs.
The full code can be found in section6/next-word-pred-bilstm.ipynb.
Other parameters of the model can also be altered to observe changes in the
model training metrics.
To generate next-word predictions, we write a function as below and call it with
test samples as input.
def predict_nw(text, next_words=2):
    """
    For the input `text`, predict the next n words,
    where n = `next_words`
    """
    words = [text]
    for _ in range(next_words):
        full_text = " ".join(words)
        token_list = tokenizer.texts_to_sequences(
            [full_text]
        )[0]
        token_list = pad_sequences(
            [token_list], maxlen=max_seq_len - 1, padding="pre"
        )
        predicted = np.argmax(model.predict(
            token_list, verbose=0
        ), axis=-1)
        next_word = ""
        # map the predicted class index back to its word
        for word, inx in tokenizer.word_index.items():
            if inx == predicted:
                next_word = word
                break
        words.append(next_word)
    return " ".join(words)
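For example, the function can be called with an arbitrary seed phrase; the exact output depends on the training run.

print(predict_nw("data science is", next_words=3))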
FIGURE 11.8 Next word prediction output from the BiLSTM model with the predicted
words in bold.
One way to measure the impact of this model on the writing efficiency of data
scientists is to calculate the average time taken per page before and after the
model is deployed. Such models can also reduce typing errors, which can likewise
be measured before and after the model is in use.
CHAPTER 12
More Real-World Scenarios and Tips
Throughout this book, we have shared tips and discussed the factors that influence
the path to implementing NLP applications. Here we go over some common scenarios,
key takeaways, final thoughts, and tips for building successful NLP solutions in the
real world.
Let’s divide the text modeling problem into three phases as seen in Figure 12.1
and discuss common scenarios for each phase.
model first, followed by another multiclass or binary classification model under one
or more of the classes. Another approach for such problems is resorting to a keywords-
based model that classifies based on the keywords present in the text. These can be
hand-curated or collected using clustering models. Labeling a few samples followed
by using text similarity can also help kick-start the process.
Class imbalance
Class imbalance is a common problem. Solutions include getting more labeled
data, using data augmentation to increase the size of the smaller class, or reducing
the size of the larger class by discarding some samples. Libraries like imblearn1
provide tools for dealing with classification under imbalanced classes.
What if you have 8 classes, of which 5 are well represented, but 3 don’t have as
much data? If no other technique applies, you can combine the 3 classes into one
before creating the model, and then create a sub-classifier if needed.
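As a small sketch of using imblearn to oversample minority classes (the feature matrix `X` and labels `y` are placeholders):

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)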
and/or precision. These advancements help research efforts in finding better and
best-performing solutions. In contrast, when working in an enterprise, the goal is not
necessarily to have the model with the best accuracy. Factors like model complexity,
compute resource cost/availability, and model explainability are also important. It
can be reasonable to accept a 2% accuracy loss and choose Logistic Regression over
a Random Forest classifier because of its simplicity.
• Give your machine more compute resources - Memory, CPUs, and GPUs if
needed.
2 https://docs.python.org/3/library/multiprocessing.html
3 https://spark.apache.org/docs/latest/api/python/
import joblib
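A minimal sketch of persisting a trained scikit-learn model with joblib before deployment; the classifier object and file name below are placeholders.

# save the trained classifier to disk; for SageMaker, this file would then be
# packaged into a .tar.gz archive and uploaded to S3
joblib.dump(clf, "comment_classifier_v12.joblib")
# ...and load it back on the serving side
clf = joblib.load("comment_classifier_v12.joblib")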
4 https://www.tensorflow.org/tfx/guide/serving
5 https://mlflow.org/
6 https://www.kubeflow.org/
7 https://www.cortex.dev/
8 https://pytorch.org/serve/
9 https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html
10 https://streamlit.io/
11 https://flask.palletsprojects.com/en/2.2.x/
12 https://www.django-rest-framework.org/
13 https://www.tensorflow.org/lite
14 https://airflow.apache.org/
15 https://www.prefect.io/
)
)
return clf
You can then deploy your model by choosing an appropriate instance type16 .
Instance types comprise different combinations of CPU, memory, storage, and net-
working capacity.
from sagemaker.sklearn.model import SKLearnModel
from sagemaker import get_execution_role

model = SKLearnModel(
    model_data="s3://recommendation-models/2022/comment_classifier_v12.joblib.tar.gz",
    role=get_execution_role(),
    entry_point="inference_script.py",
    framework_version="0.23-1"
)
model_ep = model.deploy(
    instance_type="ml.m5.large",  # choose the right instance type
    initial_instance_count=1
)
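Assuming a recent version of the SageMaker SDK, the returned predictor exposes the endpoint name:

print(model_ep.endpoint_name)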
The endpoint name that is printed can now be used by other applications to get
predictions from this model. The Endpoints section under SageMaker shows all
active endpoints. By default, this creates an endpoint that is always active and has
dedicated resources behind it at all times. If you do not want your model running
at all times, you can consider making your endpoint serverless. For serverless
endpoints, resources are allocated dynamically based on calls to the endpoint.
Because of that, you can expect some cost savings, but inference may be slower.
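A sketch of deploying serverlessly instead, assuming the SageMaker SDK's ServerlessInferenceConfig is available in your SDK version; the memory and concurrency values are placeholders.

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048, max_concurrency=5
)
model_ep = model.deploy(
    serverless_inference_config=serverless_config
)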
To get predictions from your model, you can use the library boto317 .
import boto3
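A sketch of calling the endpoint with boto3; the endpoint name and payload format are assumptions that depend on your inference script.

import json

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="comment-classifier-endpoint",  # placeholder name
    ContentType="application/json",
    Body=json.dumps({"text": "great video, loved it"}),
)
prediction = json.loads(response["Body"].read())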
Service providers like Google18 , Microsoft19 , etc. have equivalent options for de-
ploying ML models. These service providers also give you the option of running
16 https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-instance-types.html
17 https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
18 https://cloud.google.com/ai-platform/prediction/docs/deploying-models
19 https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-deploy-and-where?tabs=azcli
training jobs, model testing, model monitoring, and other parts of the model-to-
production pipeline. This practice is called MLOps (Machine Learning Operations),
a core function of machine learning engineering focused on streamlining the process
of taking machine learning models to production, including their maintenance and
monitoring.
It is not uncommon to opt for simpler models at the loss of some accuracy for
better explainability. Often, the teams that consume the results of your models may
not understand ML. While some may trust a data scientist’s work completely, others
need convincing material before they trust and use the results of a model.
Other than open and constant communication, here are some pointers that can
help a team without an ML background to trust, use, and understand your model.
• Share some high-level examples of the chosen ML model finding success in sim-
ilar tasks based on published research. They may not know how great BiLSTM
is but still care about understanding why you used it.
• Share examples of input -> output, including samples of both successes and
failures. Showing only failures can understate the model’s capabilities. Showing
only successes can set false expectations.
• Share visualizations that resonate with the background of the audience. Plots
of training loss of your model may not be easy to understand. Share plots that
help them understand the data or model results in a way that impacts their
consumption. Plots of test data errors, aggregated error/accuracy, and word
clouds (with a little bit of explanation) could resonate with a wide audience.
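For instance, a word cloud of frequent terms takes only a few lines with the wordcloud package; a sketch follows, where the input text is a placeholder.

from wordcloud import WordCloud

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate("text from model inputs or outputs goes here")
wc.to_file("wordcloud.png")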
Your model might be the best there is. However, if its impact
on company KPIs is too small to be of significance, it may
never be deployed or used in production. It does not mean
you didn’t do a good job. It just means that your work didn’t
contribute as much to what the company wanted at that
point. Sometimes a simple word counts aggregation can leave
a larger impact on company goals than a predictive neural
network model. There also might be times when you have
a clear path towards improving model accuracy by 5%, but
it may not be worth four weeks of your time based on the
contribution of that extra 5% to the company’s goal.
As we saw from the KPI diagrams shared for each project in
the previous chapters, how you test and measure the good-
ness of your ML models can be different from how their im-
pact is measured from a KPI perspective.
Windup
In this section, we looked at multiple popular applications of NLP in the industry,
including chatbots, customer review analysis, recommendation systems, and next-
word prediction. For each application, we listed key performance indicators (KPIs),
business objectives, action plans for the demonstrations, and executed implementa-
tions. Most application pipelines require integration with other systems used in the
organization. We discussed model deployment tools, but our aim has not been to
explore all the engineering effort required to fully deploy these applications; we have
focused primarily on the back-end modeling.
We used realistic datasets that help with understanding challenges in a company.
We also assumed cases where datasets are not available and leveraged openly acces-
sible data in those cases (Wikipedia data for next-word prediction). We also went
through scenarios where we don’t have labeled data for creating a supervised classi-
fication model (comment classification). We explored ways to generate labeled data
with minimal time investment. We also made use of a conditionally available data
source, i.e., YouTube API, for building a recommendation system. For chatbots, we
started by manually creating labeled data. These different examples reflect a realistic
state of data processes in the industry.
To begin with, we showed implementations of two types of chatbots - a simpler
rule-based chatbot, and a goal-oriented chatbot. We shared code samples and popular
frameworks that can help you build such chatbot applications. We shared resources
and tips on creating and fine-tuning your chatbot to best serve your purpose. We
also listed out multiple service providers that you can explore to deploy chatbots.
We then explored a popular NLP application surrounding customer reviews. We
analyzed customer reviews by computing sentiment and looked at word clouds to
understand comment themes. We then picked three popular themes to build a classi-
fication model that identifies the presence of each theme in a comment. We discussed
multiple ways of approaching the problem when labeled data is not available. We
built a model using one approach and also implemented alternate options to show-
case different methods of achieving the goal.
Next, we looked at building recommendation systems using YouTube videos (title
and description fields). For this application, we explored three modeling options and
compared the results from each. There are many pre-trained and quick solutions that
you can leverage for building such a model. Which one of them might be good for
your data and application? This remains a popular question that practitioners often
deal with. We demonstrated making a choice between different available tools, which
is a common industry scenario, i.e., choosing the right tool for the job. We looked
at building a model that predicts the next words. We worked under the assumption
that we do not have available any existing documents that help us form a training
data corpus for the task. We collected data from Wikipedia to develop the first model
for this task. We shared the implementation of a BiLSTM model and shared some
outcomes from the model.
Finally, we discussed some common types of modeling and data problems in the
real world, model deployment, explainability, final thoughts, and tips.