
Introduction to Data Science (CCPS521)
Session 10: Natural Language Processing
Natural Language Processing (NLP)
Natural language processing (NLP) is an interdisciplinary subfield of linguistics,
computer science, and artificial intelligence concerned with the interactions
between computers and human language, in particular how to program computers
to process and analyze large amounts of natural language data. The goal is a
computer capable of "understanding" the contents of documents, including
the contextual nuances of the language within them. The technology can then
accurately extract information and insights contained in the documents as well as
categorize and organize the documents themselves.

Challenges in natural language processing frequently involve speech
recognition, natural-language understanding, and natural-language generation.
Source: https://en.wikipedia.org/wiki/Natural_language_processing
Natural Language Processing (NLP)
Natural language processing (NLP)
• Natural language processing (NLP) refers to the branch of computer science—and
more specifically, the branch of artificial intelligence or AI—concerned with giving
computers the ability to understand text and spoken words in much the same
way human beings can.
• NLP combines computational linguistics—rule-based modeling of human
language—with statistical, machine learning, and deep learning models. Together,
these technologies enable computers to process human language in the form of
text or voice data and to ‘understand’ its full meaning, complete with the speaker
or writer’s intent and sentiment.

Source: https://www.ibm.com/topics/natural-language-processing
Natural Language Processing (NLP)
Natural language processing (NLP)
NLP drives computer programs that translate text from one language to another,
respond to spoken commands, and summarize large volumes of text rapidly—even
in real time. There’s a good chance you’ve interacted with NLP in the form of voice-
operated GPS systems, digital assistants, speech-to-text dictation software,
customer service chatbots, and other consumer conveniences. But NLP also plays a
growing role in enterprise solutions that help streamline business operations,
increase employee productivity, and simplify mission-critical business processes.

Source: https://www.ibm.com/topics/natural-language-processing
Natural Language Processing (NLP)
Natural language processing (NLP) History.
Natural language processing has its roots in the 1950s. Already in 1950, Alan
Turing published an article titled "Computing Machinery and Intelligence" which
proposed what is now called the Turing test as a criterion of intelligence, though at
the time that was not articulated as a problem separate from artificial intelligence.
The proposed test includes a task that involves the automated interpretation and
generation of natural language.

Source: https://en.wikipedia.org/wiki/Natural_language_processing
Natural Language Processing (NLP)
Natural language processing (NLP) History.
Up to the 1980s, most natural language processing systems were based on
complex sets of hand-written rules. Starting in the late 1980s, however, there was
a revolution in natural language processing with the introduction of machine
learning algorithms for language processing. This was due to both the steady
increase in computational power (see Moore's law) and the gradual lessening of
the dominance of Chomskyan theories of linguistics (e.g. transformational
grammar), whose theoretical underpinnings discouraged the sort of corpus
linguistics that underlies the machine-learning approach to language processing.
Source: https://en.wikipedia.org/wiki/Natural_language_processing
Natural Language Processing (NLP)
Natural language processing (NLP) History.
In the 2010s, representation learning and deep neural network-style machine
learning methods became widespread in natural language processing. That
popularity was due partly to a flurry of results showing that such
techniques can achieve state-of-the-art results in many natural language tasks,
e.g., in language modeling and parsing. This is increasingly important in
medicine and healthcare, where NLP helps analyze notes and text in electronic
health records that would otherwise be inaccessible for study when seeking to
improve care.
Source: https://en.wikipedia.org/wiki/Natural_language_processing
Natural Language Processing (NLP)
NLP tasks
Human language is filled with ambiguities that make it incredibly difficult to write
software that accurately determines the intended meaning of text or voice data.
Homonyms, homophones, sarcasm, idioms, metaphors, grammar and usage
exceptions, variations in sentence structure—these are just a few of the irregularities
of human language that take humans years to learn, but that programmers must
teach natural language-driven applications to recognize and understand accurately
from the start, if those applications are going to be useful.
Several NLP tasks break down human text and voice data in ways that help the
computer make sense of what it's ingesting.

Source: https://www.ibm.com/topics/natural-language-processing
Natural Language Processing (NLP)
NLP tasks
Some NLP tasks include the following:
Speech recognition, also called speech-to-text, is the task of reliably converting
voice data into text data. Speech recognition is required for any application that
follows voice commands or answers spoken questions. What makes speech
recognition especially challenging is the way people talk—quickly, slurring words
together, with varying emphasis and intonation, in different accents, and often
using incorrect grammar.

Source: https://www.ibm.com/topics/natural-language-processing
Natural Language Processing (NLP)
NLP tasks
Some NLP tasks include the following:
• Part of speech tagging, also called grammatical tagging, is the process of determining
the part of speech of a particular word or piece of text based on its use and context.
Part-of-speech tagging identifies ‘make’ as a verb in ‘I can make a paper plane,’ and as a noun in
‘What make of car do you own?’
• Word sense disambiguation is the selection of the meaning of a word with multiple
meanings through a process of semantic analysis that determines which meaning makes
the most sense in the given context. For example, word sense disambiguation helps
distinguish the meaning of the verb ‘make’ in ‘make the grade’ (achieve) vs. ‘make a
bet’ (place); a short sketch of this task follows below.
• Named entity recognition, or NER, identifies words or phrases as useful entities. NER
identifies ‘Kentucky’ as a location or ‘Fred’ as a man's name.
Source: https://www.ibm.com/topics/natural-language-processing
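To make word sense disambiguation concrete, here is a minimal sketch using NLTK's implementation of the classic Lesk algorithm. The library choice and example sentence are assumptions for illustration; the slides do not prescribe a tool.

```python
# Minimal word-sense-disambiguation sketch with NLTK's Lesk implementation.
# Assumes the 'punkt' and 'wordnet' resources are available.
import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

tokens = word_tokenize("I deposited the cheque at the bank")

# lesk() picks the WordNet sense whose dictionary gloss overlaps most
# with the words in the surrounding context.
sense = lesk(tokens, "bank", pos="n")
print(sense.name(), "-", sense.definition())
```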
Natural Language Processing (NLP)
NLP tasks
Some of NLP tasks include the following:
• Co-reference resolution is the task of identifying if and when two words refer to
the same entity. The most common example is determining the person or object
to which a certain pronoun refers (e.g., ‘she’ = ‘Mary’), but it can also involve
identifying a metaphor or an idiom in the text (e.g., an instance in which 'bear'
isn't an animal but a large hairy person).
• Sentiment analysis attempts to extract subjective qualities—attitudes, emotions,
sarcasm, confusion, suspicion—from text.
• Natural language generation is sometimes described as the opposite of speech
recognition or speech-to-text; it's the task of putting structured information into
human language.

Source: https://www.ibm.com/topics/natural-language-processing
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
The differences between three natural language processing concepts:
NLP: Natural Language Processing
NLU: Natural Language Understanding
NLG: Natural Language Generation

Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
The differences between three natural language processing concepts:
While natural language processing (NLP), natural language understanding (NLU),
and natural language generation (NLG) are all related topics, they are distinct ones.
At a high level, NLU and NLG are just components of NLP. Given how they intersect,
they are commonly confused in conversation. Let us summarize their
differences to clarify any ambiguities.

Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
What is Natural Language Processing (NLP)?
Natural language processing, which evolved from computational linguistics, uses
methods from various disciplines, such as computer science, artificial intelligence,
linguistics, and data science, to enable computers to understand human language
in both written and verbal forms. While computational linguistics has more of a
focus on aspects of language, natural language processing emphasizes its use of
machine learning and deep learning techniques to complete tasks, like language
translation or question answering.

Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
What is Natural Language Processing (NLP)?
Natural language processing works by taking unstructured data and converting it
into a structured data format. It does this through the identification of named
entities (a process called named entity recognition) and identification of word
patterns, using methods like tokenization, stemming, and lemmatization, which
examine the root forms of words.
For example, the suffix -ed on a word, like called, indicates past tense, but it has
the same base infinitive (to call) as the present participle calling.

Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
What is Natural Language Processing (NLP)?
While a number of NLP algorithms exist, different approaches tend to be used for
different types of language tasks. For example, hidden Markov models tend to be used for
part-of-speech tagging.
Recurrent neural networks help to generate the appropriate sequence of text. N-grams, a
simple language model (LM), assign probabilities to sentences or phrases to predict the
accuracy of a response.
These techniques work together to support popular technology such as chatbots, or
speech recognition products like Amazon’s Alexa or Apple’s Siri. However, their application
has been broader than that, affecting other industries such as education and healthcare.
Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
Natural Language Understanding (NLU) is a subset of natural language processing,
which uses syntactic and semantic analysis of text and speech to determine the
meaning of a sentence. Syntax refers to the grammatical structure of a sentence,
while semantics alludes to its intended meaning. NLU also establishes a relevant
ontology: a data structure which specifies the relationships between words and
phrases. While humans naturally do this in conversation, the combination of these
analyses is required for a machine to understand the intended meaning of different
texts. Our ability to distinguish between homonyms and homophones illustrates
the nuances of language well.
Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
Let’s take the following two sentences:
• Alice is swimming against the current.
• The current version of the report is in the folder.
In the first sentence, the word current is a noun. The verb that precedes it,
swimming, provides additional context to the reader, allowing us to conclude that
we are referring to the flow of water in the ocean. The second sentence uses the
word current, but as an adjective. The noun it describes, version, denotes multiple
iterations of a report, enabling us to determine that we are referring to the most
up-to-date status of a file.
Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
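A small sketch (assuming NLTK as the toolkit, which the deck introduces later) shows how a part-of-speech tagger separates these two uses of "current":

```python
# Sketch: a POS tagger distinguishes "current" as noun vs. adjective.
# Assumes NLTK's 'punkt' and 'averaged_perceptron_tagger' resources.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ("Alice is swimming against the current.",
                 "The current version of the report is in the folder."):
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# In the first sentence "current" is typically tagged NN (noun);
# in the second it is typically tagged JJ (adjective).
```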
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
These approaches are also commonly used in data mining to understand consumer
attitudes. In particular, sentiment analysis enables brands to monitor their
customer feedback more closely, allowing them to cluster positive and negative
social media comments and track net promoter scores. By reviewing comments
with negative sentiment, companies are able to identify and address potential
problem areas within their products or services more quickly.
Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
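As an illustration, here is a short sentiment-analysis sketch using NLTK's bundled VADER analyzer. This is one possible tool among many, and the feedback strings are made up:

```python
# Sketch: sorting customer comments by sentiment with NLTK's VADER analyzer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
comments = [
    "Love the new update, it works flawlessly!",
    "The app keeps crashing and support never replies.",
]
for text in comments:
    scores = sia.polarity_scores(text)  # neg/neu/pos plus a compound score
    label = "positive" if scores["compound"] >= 0 else "negative"
    print(f"{label}: {text}")
```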
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
What is Natural Language Generation (NLG)?
Natural language generation is another subset of natural language processing.
While natural language understanding focuses on computer reading
comprehension, natural language generation enables computers to write. NLG is
the process of producing a human language text response based on some data
input. This text can also be converted into a speech format through text-to-speech
services.
Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
What is Natural Language Generation (NLG)?
NLG also encompasses text summarization capabilities that generate summaries
from input documents while maintaining the integrity of the information.
Extractive summarization is the AI innovation powering Key Point Analysis used in
That’s Debatable.
Initially, NLG systems used templates to generate text. Based on some data or
query, an NLG system would fill in the blank, like a game of Mad Libs. But over
time, natural language generation systems have evolved with the application of
hidden Markov models, recurrent neural networks, and transformers, enabling
more dynamic text generation in real time.
Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
Natural Language Processing (NLP)
NLP vs. NLU vs. NLG
As with NLU, NLG applications need to consider language rules based on morphology,
lexicons, syntax and semantics to make choices on how to phrase responses
appropriately. They tackle this in three stages:
• Text planning: During this stage, general content is formulated and ordered in a logical
manner.
• Sentence planning: This stage considers punctuation and text flow, breaking out
content into paragraphs and sentences and incorporating pronouns or conjunctions
where appropriate.
• Realization: This stage accounts for grammatical accuracy, ensuring that rules around
punctuation and conjugations are followed. For example, the past tense of the
verb run is ran, not runned.
Source: https://www.ibm.com/blogs/watson/2020/11/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/
Natural Language Processing (NLP)
NLP use cases: Natural language processing is the driving force behind machine
intelligence in many modern real-world applications. Here are a few examples:

Spam detection: You may not think of spam detection as an NLP solution, but the
best spam detection technologies use NLP's text classification capabilities to scan
emails for language that often indicates spam or phishing. These indicators can
include overuse of financial terms, characteristic bad grammar, threatening
language, inappropriate urgency, misspelled company names, and more. Spam
detection is one of a handful of NLP problems that experts consider 'mostly solved'
(although you may argue that this doesn’t match your email experience).

Source: https://www.ibm.com/topics/natural-language-processing
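A minimal sketch of spam detection as text classification, here with a naive Bayes model from scikit-learn (an illustrative choice, matching the naive Bayes reference at the end of the deck; the toy training corpus is made up):

```python
# Sketch: spam detection as NLP text classification with naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "URGENT! You have won a cash prize, click now",  # spam-like
    "Limited offer: cheap loans, act immediately",   # spam-like
    "Meeting moved to 3pm, agenda attached",         # normal mail
    "Can you review my draft before Friday?",        # normal mail
]
train_labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features feeding a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Claim your cash prize now!"]))  # expected: ['spam']
```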
Natural Language Processing (NLP)
NLP use cases:
Machine translation: Google Translate is an example of widely available NLP
technology at work. Truly useful machine translation involves more than replacing
words in one language with words of another. Effective translation has to capture
accurately the meaning and tone of the input language and translate it to text with
the same meaning and desired impact in the output language. Machine translation
tools are making good progress in terms of accuracy. A great way to test any machine
translation tool is to translate text to one language and then back to the original. An
oft-cited classic example: Not long ago, translating “The spirit is willing but the flesh is
weak” from English to Russian and back yielded “The vodka is good but the meat is
rotten.” Today, the result is “The spirit desires, but the flesh is weak,” which isn’t
perfect, but inspires much more confidence in the English-to-Russian translation.
Source: https://www.ibm.com/topics/natural-language-processing
Natural Language Processing (NLP)
NLP use cases:
Virtual agents and chatbots: Virtual agents such as Apple's Siri and Amazon's Alexa
use speech recognition to recognize patterns in voice commands and natural
language generation to respond with appropriate action or helpful
comments. Chatbots perform the same magic in response to typed text entries.
The best of these also learn to recognize contextual clues about human requests
and use them to provide even better responses or options over time. The next
enhancement for these applications is question answering, the ability to respond
to our questions—anticipated or not—with relevant and helpful answers in their
own words.
Source: https://www.ibm.com/topics/natural-language-processing
Natural Language Processing (NLP)
NLP use cases:
• Social media sentiment analysis: NLP has become an essential business tool for
uncovering hidden data insights from social media channels. Sentiment analysis
can analyze language used in social media posts, responses, reviews, and more to
extract attitudes and emotions in response to products, promotions, and events—
information companies can use in product designs, advertising campaigns, and
more.
• Text summarization: Text summarization uses NLP techniques to digest huge
volumes of digital text and create summaries and synopses for indexes, research
databases, or busy readers who don't have time to read full text. The best text
summarization applications use semantic reasoning and natural language
generation (NLG) to add useful context and conclusions to summaries.
Source: https://www.ibm.com/topics/natural-language-processing
Natural Language Processing (NLP)
NLP tools and approaches
Python and the Natural Language Toolkit (NLTK)
The Python programming language provides a wide range of tools and libraries for
attacking specific NLP tasks. Many of these are found in the Natural Language Toolkit,
or NLTK, an open source collection of libraries, programs, and education resources for
building NLP programs.
The NLTK includes libraries for many of the NLP tasks listed above, plus libraries for
subtasks, such as sentence parsing, word segmentation, stemming and lemmatization
(methods of trimming words down to their roots), and tokenization (for breaking
phrases, sentences, paragraphs and passages into tokens that help the computer
better understand the text). It also includes libraries for implementing capabilities
such as semantic reasoning, the ability to reach logical conclusions based on facts
extracted from text.
Source: https://www.ibm.com/topics/natural-language-processing
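As a taste of the subtasks named above, a minimal sketch contrasting NLTK's stemming and lemmatization (assumes the 'punkt' and 'wordnet' resources have been downloaded):

```python
# Sketch: tokenization, stemming, and lemmatization with NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

tokens = nltk.word_tokenize("The ponies were running quickly")

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # crude suffix trimming, e.g. 'poni'
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary lookup, e.g. 'pony'
```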
Natural Language Processing (NLP)
NLP tools and approaches
Statistical NLP, Machine Learning, and Deep Learning
The earliest NLP applications were hand-coded, rules-based systems that could perform
certain NLP tasks, but couldn't easily scale to accommodate a seemingly endless stream
of exceptions or the increasing volumes of text and voice data.
Enter statistical NLP, which combines computer algorithms with machine learning and
deep learning models to automatically extract, classify, and label elements of text and
voice data and then assign a statistical likelihood to each possible meaning of those
elements. Today, deep learning models and learning techniques based on convolutional
neural networks (CNNs) and recurrent neural networks (RNNs) enable NLP systems that
'learn' as they work and extract ever more accurate meaning from huge volumes of raw,
unstructured, and unlabeled text and voice data sets.
Source: https://www.ibm.com/topics/natural-language-processing
Natural Language Processing (NLP)
NLP tools and approaches
IBM WATSON
IBM Watson® makes complex NLP technologies accessible to employees who are
not data scientists. Our products are built for non-technical users, to help your
business easily streamline business operations, increase employee productivity and
simplify mission-critical business processes.
Watson Discovery is an AI-powered intelligent search and text-analytics platform
that eliminates data silos and retrieves information hidden within enterprise data.
Using IBM's market-leading NLP capabilities, Discovery uncovers meaningful
business insights from documents, webpages and big data, cutting research time
by more than 75%.
Source: https://www.ibm.com/watson/natural-language-processing
Natural Language Processing (NLP)
NLP tools and approaches
IBM WATSON NLP
The Watson Natural Language Processing library provides basic natural language processing
functions for syntax analysis and out-of-the-box pre-trained models for a wide variety of text
processing tasks, such as sentiment analysis, keyword extraction and vectorization. The Watson
Natural Language Processing library is available for Python only.
With Watson Natural Language Processing, you can turn unstructured data into structured data,
making the data easier to understand and transferable, in particular if you are working with a mix
of unstructured and structured data. Examples of such data are call center records, customer
complaints, social media posts, or problem reports. The unstructured data is often part of a
larger data record which includes columns with structured data. Extracting meaning and
structure from the unstructured data, and combining this information with the data in the
columns of structured data, gives you a deeper understanding of the input data and can help you
to make better decisions.

Source: https://www.ibm.com/watson/natural-language-processing
Natural Language Processing (NLP)
NLP tools and approaches
IBM WATSON

Source: https://www.ibm.com/demos/live/discovery-expert-assist/self-service/home
Natural Language Processing (NLP)
The limitations of hand-written rules: the rise of statistical NLP
Natural language’s vast size, unrestricted nature, and ambiguity led to two problems
when using standard parsing approaches that relied purely on symbolic, hand-crafted rules:
• NLP must ultimately extract meaning (‘semantics’) from text: formal grammars that specify
relationships between text units - parts of speech such as nouns, verbs, and adjectives -
address syntax primarily. One can extend grammars to address natural-language semantics
by greatly expanding sub-categorization, with additional rules/constraints (e.g., ‘eat’ applies
only to ingestible-item nouns). Unfortunately, the rules may now become unmanageably
numerous, often interacting unpredictably, with more frequent ambiguous parses (multiple
interpretations of a word sequence are possible).

• Handwritten rules handle ‘ungrammatical’ spoken prose and (in medical contexts) the
highly telegraphic prose of in-hospital progress notes very poorly, although such prose is
human-comprehensible.
Source: https://academic.oup.com/jamia/article/18/5/544/829676?login=false
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Let’s look at a piece of text from Wikipedia:
“London is the capital and most populous city of England and the United Kingdom.
Standing on the River Thames in the south east of the island of Great Britain,
London has been a major settlement for two millennia. It was founded by the
Romans, who named it Londinium.” (Source: https://en.wikipedia.org/wiki/London )
This paragraph contains several useful facts. It would be great if a computer
could read this text and understand that:
- London is a city.
- London is located in England, and
- London was settled by Romans and so on.
Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
To get there, we have to first teach our computer the most basic concepts of
written language and then move up from there.
Step 1: Sentence Segmentation: The first step in the pipeline is to break the text
apart into separate sentences. That gives us this:
• “London is the capital and most populous city of England and the United
Kingdom.”
• “Standing on the River Thames in the south east of the island of Great Britain,
London has been a major settlement for two millennia.”
• “It was founded by the Romans, who named it Londinium.”
Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
We can assume that each sentence in English is a separate thought or idea. It will
be a lot easier to write a program to understand a single sentence than to
understand a whole paragraph.

Coding a Sentence Segmentation model can be as simple as splitting apart
sentences whenever you see a punctuation mark. But modern NLP pipelines often
use more complex techniques that work even when a document isn’t formatted
cleanly.

Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
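A minimal sketch of this step, assuming NLTK's pre-trained sentence tokenizer as the tool:

```python
# Sketch: sentence segmentation of the Wikipedia paragraph with NLTK.
import nltk

nltk.download("punkt", quiet=True)

text = ("London is the capital and most populous city of England and the "
        "United Kingdom. Standing on the River Thames in the south east of "
        "the island of Great Britain, London has been a major settlement for "
        "two millennia. It was founded by the Romans, who named it Londinium.")

for sentence in nltk.sent_tokenize(text):
    print(sentence)  # prints the three sentences listed above
```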
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 2: Word Tokenization: Now that we’ve split our document into sentences, we
can process them one at a time. Let’s start with the first sentence from our
document. “London is the capital and most populous city of England and the
United Kingdom.”
• The next step in our pipeline is to break this sentence into separate words
or tokens. This is called tokenization. This is the result:
“London”, “is”, “the”, “capital”, “and”, “most”, “populous”, “city”, “of”, “England”,
“and”, “the”, “United”, “Kingdom”, “.”
Tokenization is easy to do in English. We’ll just split apart words whenever there’s a
space between them. And we’ll also treat punctuation marks as separate tokens
since punctuation also has meaning.
Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
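The same step as a sketch, again assuming NLTK as the toolkit:

```python
# Sketch: word tokenization; punctuation becomes its own token.
import nltk

nltk.download("punkt", quiet=True)

sentence = ("London is the capital and most populous city of England "
            "and the United Kingdom.")
print(nltk.word_tokenize(sentence))
# ['London', 'is', 'the', 'capital', 'and', 'most', 'populous',
#  'city', 'of', 'England', 'and', 'the', 'United', 'Kingdom', '.']
```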
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 3: Predicting Parts of Speech for Each Token
Next, we’ll look at each token and try to guess its part of speech — whether it is a
noun, a verb, an adjective and so on. Knowing the role of each word in the
sentence will help us start to figure out what the sentence is talking about.
We can do this by feeding each word (and some extra words around it for context)
into a pre-trained part-of-speech classification model:

Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 3: Predicting Parts of Speech for Each Token
The part-of-speech model was originally trained by feeding it millions of English
sentences with each word’s part of speech already tagged and having it learn to
replicate that behavior.
Keep in mind that the model is completely based on statistics — it doesn’t actually
understand what the words mean in the same way that humans do. It just knows
how to guess a part of speech based on similar sentences and words it has seen
before. After processing the whole sentence, we’ll have a tagged result like the
one in the sketch below.
Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
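A sketch of this step with NLTK's pre-trained perceptron tagger (an assumed tool choice):

```python
# Sketch: part-of-speech tagging the first sentence of our document.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize(
    "London is the capital and most populous city of England "
    "and the United Kingdom.")
print(nltk.pos_tag(tokens))
# [('London', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('capital', 'NN'), ...]
```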
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 3: Predicting Parts of Speech for Each Token
With this information, we can already start to glean some very basic meaning. For
example, we can see that the nouns in the sentence include “London” and
“capital”, so the sentence is probably talking about London.
Step 4: Text Lemmatization
In English (and most languages), words appear in different forms. Look at these
two sentences:
I had a pony.
I had two ponies.
Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 4: Text Lemmatization
Both sentences talk about the noun pony, but they are using different inflections. When working
with text in a computer, it is helpful to know the base form of each word so that you know that
both sentences are talking about the same concept. Otherwise the strings “pony” and “ponies”
look like two totally different words to a computer.
In NLP, we call this process lemmatization — figuring out the most basic form or lemma
of each word in the sentence.
The same thing applies to verbs. We can also lemmatize verbs by finding their root, unconjugated
form. So “I had two ponies” becomes “I [have] two [pony].”
Lemmatization is typically done by having a look-up table of the lemma forms of words based on
their part of speech and possibly having some custom rules to handle words that you’ve never
seen before.

Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
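A minimal sketch with NLTK's WordNet lemmatizer (an assumed tool choice); note that supplying the part of speech matters:

```python
# Sketch: lemmatization reduces inflected forms to a common lemma.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("ponies", pos="n"))  # -> 'pony'
print(lemmatizer.lemmatize("had", pos="v"))     # -> 'have'
print(lemmatizer.lemmatize("is", pos="v"))      # -> 'be'
```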
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 4: Text Lemmatization
Here’s what our sentence looks like after lemmatization adds in the root form of
our verb:

The only change we made was turning “is” into “be”.

Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 5: Identifying Stop Words
Next, we want to consider the importance of each word in the sentence. English
has a lot of filler words that appear very frequently like “and”, “the”, and “a”. When
doing statistics on text, these words introduce a lot of noise since they appear way
more frequently than other words. Some NLP pipelines will flag them as stop
words—that is, words that you might want to filter out before doing any statistical
analysis.

Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
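A sketch of stop-word filtering with NLTK's built-in English stop-word list (an assumed tool choice):

```python
# Sketch: removing stop words before statistical analysis.
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
tokens = nltk.word_tokenize(
    "London is the capital and most populous city of England "
    "and the United Kingdom.")

content = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(content)
# ['London', 'capital', 'populous', 'city', 'England', 'United', 'Kingdom']
```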
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 6: Dependency Parsing
The next step is to figure out how all the words in our sentence relate to each
other. This is called dependency parsing.
The goal is to build a tree that assigns a single parent word to each word in the
sentence. The root of the tree will be the main verb in the sentence.
Step 6b: Finding Noun Phrases
So far, we’ve treated every word in our sentence as a separate entity. But
sometimes it makes more sense to group together the words that represent a
single idea or thing. We can use the information from the dependency parse tree
to automatically group together words that are all talking about the same thing.
Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
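NLTK has limited dependency-parsing support out of the box, so this sketch assumes spaCy and its small English model instead (installed with `python -m spacy download en_core_web_sm`):

```python
# Sketch: dependency parsing and noun-phrase grouping with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("London is the capital and most populous city of England "
          "and the United Kingdom.")

for token in doc:
    # Each word's dependency label and its single parent in the tree.
    print(f"{token.text:10} {token.dep_:10} <- {token.head.text}")

# Words grouped into noun phrases that denote a single idea or thing.
print([chunk.text for chunk in doc.noun_chunks])
```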
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 7: Named Entity Recognition (NER)
Now that we’ve done all that hard work, we can finally move beyond grade-school
grammar and start actually extracting ideas.
In our sentence, we have the following nouns: “London”, “capital”, “city”,
“England”, and “United Kingdom”.
Some of these nouns represent real things in the world. For example, “London”,
“England” and “United Kingdom” represent physical places on a map. It would be
nice to be able to detect that! With that information, we could automatically
extract a list of real-world places mentioned in a document using NLP.

Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 7: Named Entity Recognition (NER)
The goal of Named Entity Recognition, or NER, is to detect and label these nouns
with the real-world concepts that they represent. Here’s what our sentence looks
like after running each token through our NER tagging model: “London”,
“England”, and “United Kingdom” each get tagged as geographic entities.
But NER systems aren’t just doing a simple dictionary lookup. Instead, they are
using the context of how a word appears in the sentence and a statistical model to
guess which type of noun a word represents. A good NER system can tell the
difference between “Brooklyn Decker” the person and the place “Brooklyn” using
context clues.
Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 7: Named Entity Recognition (NER)
Here are just some of the kinds of objects that a typical NER system can tag:
• People’s names
• Company names
• Geographic locations (Both physical and political)
• Product names
• Dates and times
• Amounts of money
• Names of events
NER has tons of uses since it makes it so easy to grab structured data out of text.
It’s one of the easiest ways to quickly get value out of an NLP pipeline.
Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
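A sketch of NER with spaCy's pre-trained pipeline (an assumed tool choice):

```python
# Sketch: detecting and labeling named entities.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("London is the capital and most populous city of England "
          "and the United Kingdom.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typically: London -> GPE, England -> GPE, the United Kingdom -> GPE
# (GPE is spaCy's label for geopolitical entities, i.e. places.)
```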
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 8: Co-reference Resolution
At this point, we already have a useful representation of our sentence. We know
the parts of speech for each word, how the words relate to each other and which
words are talking about named entities.
However, we still have one big problem. English is full of pronouns — words
like he, she, and it. These are shortcuts that we use instead of writing out names
over and over in each sentence. Humans can keep track of what these words
represent based on context. But our NLP model doesn’t know what pronouns
mean because it only examines one sentence at a time.

Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 8: Co-reference Resolution
Let’s look at the third sentence in our document:
“It was founded by the Romans, who named it Londinium.”
If we parse this with our NLP pipeline, we’ll know that “it” was founded by
Romans. But it’s a lot more useful to know that “London” was founded by Romans.
As a human reading this sentence, you can easily figure out that “it” means
“London”. The goal of co-reference resolution is to figure out this same mapping by
tracking pronouns across sentences. We want to figure out all the words that are
referring to the same entity.

Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
Natural Language Processing (NLP)
Building an NLP Pipeline, Step-by-Step
Step 8: Co-reference Resolution
Running co-reference resolution on our document for the word “London” links the
pronoun “it” in the third sentence back to “London”.
With co-reference information combined with the parse tree and named entity information, we
should be able to extract a lot of information out of this document!
Co-reference resolution is one of the most difficult steps in our pipeline to implement. It’s even
more difficult than sentence parsing. Recent advances in deep learning have resulted in new
approaches that are more accurate, but it isn’t perfect yet.
Source: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
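Proper co-reference resolution needs a trained model, so the sketch below is a deliberately naive heuristic (link each pronoun to the most recently mentioned entity) that only illustrates the input and output of the task:

```python
# Toy co-reference heuristic: pronoun -> most recent named entity.
# Real resolvers are learned models; this toy breaks whenever another
# entity intervenes between the true antecedent and the pronoun.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("London has been a major settlement for two millennia. "
          "It was founded by the Romans.")

last_entity = None
for token in doc:
    if token.ent_type_ in {"GPE", "PERSON", "ORG"}:  # an entity mention
        last_entity = token.text
    elif token.pos_ == "PRON" and last_entity is not None:
        print(f"'{token.text}' -> '{last_entity}'")  # 'It' -> 'London'
```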
Natural Language Processing (NLP)
A Markov chain or Markov process is a stochastic model describing a sequence of
possible events in which the probability of each event depends only on the state
attained in the previous event. Informally, this may be thought of as, "What
happens next depends only on the state of affairs now." A countably
infinite sequence, in which the chain moves state at discrete time steps, gives
a discrete-time Markov chain (DTMC). A continuous-time process is called
a continuous-time Markov chain (CTMC). It is named after
the Russian mathematician Andrey Markov.
Markov chains have many applications as statistical models of real-world
processes, such as studying cruise control systems in motor vehicles,
queues or lines of customers arriving at an airport, currency exchange rates and
animal population dynamics.
Source: https://en.wikipedia.org/wiki/Markov_chain
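A toy first-order Markov chain over words makes the "next state depends only on the current state" idea concrete (the corpus and output are illustrative only):

```python
# Sketch: a first-order Markov chain for word sequences.
import random
from collections import defaultdict

corpus = ("london is the capital of england . "
          "london has been a major settlement . "
          "london was founded by the romans .").split()

# Empirical transition table: word -> list of observed next words.
transitions = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current].append(nxt)

random.seed(0)
word, output = "london", ["london"]
for _ in range(8):
    word = random.choice(transitions[word])  # depends only on the current word
    output.append(word)
print(" ".join(output))
```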
Natural Language Processing (NLP)
A recurrent neural network (RNN) is a class of artificial neural networks where
connections between nodes can create a cycle, allowing output from some nodes
to affect subsequent input to the same nodes. This allows it to exhibit temporal
dynamic behavior. Derived from feedforward neural networks, RNNs can use their
internal state (memory) to process variable-length sequences of inputs. This
makes them applicable to tasks such as unsegmented, connected handwriting
recognition or speech recognition. Recurrent neural networks are
theoretically Turing complete and can run arbitrary programs to process arbitrary
sequences of inputs.

Source: https://en.wikipedia.org/wiki/Recurrent_neural_network
Natural Language Processing (NLP)
In deep learning, a convolutional neural network (CNN) is a class of artificial
neural network most commonly applied to analyze visual imagery. CNNs are also
known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN),
based on the shared-weight architecture of the convolution kernels or filters that
slide along input features and provide translation-equivariant responses known as
feature maps. Counter-intuitively, most convolutional neural networks are
not invariant to translation, due to the downsampling operation they apply to the
input. They have applications in image and video recognition, recommender
systems, image classification, image segmentation, medical image
analysis, natural language processing, brain–computer interfaces, and
financial time series.
Source: https://en.wikipedia.org/wiki/Convolutional_neural_network
Natural Language Processing (NLP)
N-Gram
In the field of computational linguistics, an n-gram (sometimes also called Q-gram) is a contiguous
sequence of n items from a given sample of text or speech. The items can
be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are
collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or,
less commonly, a "digram"); size 3 is a "trigram". English cardinal numbers are sometimes used, e.g.,
"four-gram", "five-gram", and so on.

Source: https://en.wikipedia.org/wiki/N-gram
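A short sketch of n-gram extraction (NLTK also provides an equivalent nltk.ngrams helper):

```python
# Sketch: contiguous n-item subsequences of a token list.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 1))  # unigrams: ('to',), ('be',), ...
print(ngrams(tokens, 2))  # bigrams:  ('to', 'be'), ('be', 'or'), ...
print(ngrams(tokens, 3))  # trigrams: ('to', 'be', 'or'), ...
```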
Natural Language Processing (NLP)
Key Point Analysis (KPA)
KPA was introduced in Bar-Haim et al. (2020a,b) as a challenging NLP task with tight
relations to Computational Argumentation, Opinion Analysis, and Summarization,
and with many practical applications (Bar-Haim et al., 2021).
Given a potentially large collection of relatively short, opinionated texts focused on
a topic of interest, the goal of KPA is to produce a succinct list of the most
prominent key points (KPs) in the input corpus, along with their relative
prevalence. Thus, the output of KPA is a bullet-like summary with an important
quantitative angle. Successful solutions to KPA can be used to gain better insights
from public opinions as expressed in social media, surveys, and so forth, giving rise
to a new form of communication channel between decision makers and the people
who might be impacted by their decisions.

Source: https://aclanthology.org/2021.argmining-1.16.pdf
Natural Language Processing (NLP)
Mad Libs is a phrasal template word game created by Leonard Stern and Roger
Price. It consists of one player prompting others for a list of words to substitute
for blanks in a story before reading aloud. The game is frequently played as a party
game or as a pastime.

The game was invented in the United States, and more than 110 million copies
of Mad Libs books have been sold since the series was first published in 1958.

Source: https://en.wikipedia.org/wiki/Mad_Libs
Natural Language Processing (NLP)
References:
Prakash M Nadkarni, Lucila Ohno-Machado, Wendy W Chapman, Natural language processing: an introduction,
Journal of the American Medical Informatics Association, Volume 18, Issue 5, September 2011, Pages 544–551,
https://doi.org/10.1136/amiajnl-2011-000464
https://academic.oup.com/jamia/article/18/5/544/829676?login=false

Overview of the 2021 Key Point Analysis Shared Task:
https://research.ibm.com/publications/overview-of-the-2021-key-point-analysis-shared-task
https://aclanthology.org/2021.argmining-1.16.pdf

Chávez, G. (2019, February 28). Implementing a naïve Bayes classifier for text categorization in five steps.
https://towardsdatascience.com/implementing-a-naive-bayes-classifier-for-text-categorization-in-five-steps-f9192cdd54c3
