The document provides an overview of Natural Language Processing (NLP), its components, and the challenges involved in understanding natural language. It discusses the steps in NLP, including lexical and syntactic analysis, semantic analysis, and text classification, highlighting the importance of algorithms like Context-Free Grammar and Top-Down Parser. Additionally, it covers the application of NLP in tasks such as spam detection and sentiment analysis.

Natural Language Processing
T. Muchabaiwa
Lecture Objectives
• Intro to NLP
• Components of NLP
• NLP terminology
• Steps in NLP
• Implementation of Semantic Analysis
• Text classification
• Somewhere around 100,000 years ago, humans learned how
to speak, and about 7,000 years ago they learned to write.
• There are two main reasons why we want our computer
agents to be able to process natural languages: first, to
communicate with humans, and second, to acquire
information from written language.
Intro to NLP
• Natural Language Processing (NLP) refers to the AI method of
communicating with an intelligent system using a natural
language such as English.
• Processing of natural language is required when you want an
intelligent system such as a robot to perform as per your
instructions, when you want to hear a decision from a dialogue-
based clinical expert system, etc.
• The field of NLP involves making computers perform useful
tasks with the natural languages humans use. The input and
output of an NLP system can be −
• Speech
• Written Text
Components of NLP

• There are two components of NLP −
Natural Language Understanding (NLU)
• Understanding involves the following tasks −
• Mapping the given input in natural language into useful representations.
• Analyzing different aspects of the language.
Natural Language Generation (NLG)
• It is the process of producing meaningful phrases and sentences in
natural language from some internal representation.
• It involves −
• Text planning − Retrieving the relevant content from the knowledge
base.
• Sentence planning − Choosing the required words, forming meaningful
phrases, and setting the tone of the sentence.
• Text realization − Mapping the sentence plan into sentence structure.

• NLU is harder than NLG.


Difficulties in NLU

• NL has an extremely rich form and structure.


• It is very ambiguous. There can be different levels of ambiguity −
• Lexical ambiguity − Ambiguity at a very primitive level, such as the
word level.
• For example, should the word “board” be treated as a noun or a verb?
• Syntax-level ambiguity − A sentence can be parsed in different
ways.
• For example, “He lifted the beetle with red cap.” − Did he use the cap
to lift the beetle, or did he lift a beetle that had a red cap?
• Referential ambiguity − Ambiguity in referring to something using
pronouns. For example: Rima went to Gauri. She said, “I am tired.” −
Exactly who is tired?
• One input can have different meanings.
• Many inputs can mean the same thing.
NLP Terminology

• Phonology − It is the study of organizing sounds systematically.

• Morphology − It is the study of the construction of words from primitive
meaningful units.
• Morpheme − It is a primitive unit of meaning in a language.
• Syntax − It refers to arranging words to make a sentence. It also involves
determining the structural role of words in the sentence and in phrases.
• Semantics − It is concerned with the meaning of words and how to
combine words into meaningful phrases and sentences.
• Pragmatics − It deals with using and understanding sentences in
different situations and how the interpretation of the sentence is
affected.
• Discourse − It deals with how the immediately preceding sentence can
affect the interpretation of the next sentence.
• World Knowledge − It includes general knowledge about the world.
Steps in NLP
• Lexical Analysis − It involves
identifying and analyzing the
structure of words. The lexicon of a
language is its collection of
words and phrases. Lexical
analysis divides the whole chunk
of text into paragraphs,
sentences, and words.
• Syntactic Analysis (Parsing) − It
involves analyzing the words in a
sentence for grammar and
arranging them in a manner that
shows the relationships among
them. A sentence such as “The
school goes to boy” is rejected by
an English syntactic analyzer.
Steps in NLP …. contd
• Semantic Analysis − It draws the exact
meaning, or the dictionary meaning,
from the text. The text is checked for
meaningfulness. This is done by mapping
syntactic structures to objects in the
task domain. The semantic analyzer
disregards sentences such as “hot ice-
cream”.
• Discourse Integration − The meaning of
any sentence depends upon the
meaning of the sentence just before it.
In addition, it also influences the
meaning of the immediately succeeding
sentence.
• Pragmatic Analysis − During this step,
what was said is re-interpreted based on
what it actually meant. It involves
deriving those aspects of language which
require real-world knowledge.
Implementation Aspects of Syntactic Analysis

• Researchers have developed a number of algorithms
for syntactic analysis, but we consider only the following
simple methods −
• Context-Free Grammar
• Top-Down Parser
Context-Free Grammar

• It is a grammar that consists of rules with a single symbol on the left-
hand side of the rewrite rules. Let us create a grammar to parse the
sentence −
“The bird pecks the grains”
• Articles (DET) − a | an | the
• Nouns (N) − bird | birds | grain | grains
• Noun Phrase (NP) − Article + Noun | Article + Adjective + Noun
• = DET N | DET ADJ N
• Verbs (V) − pecks | pecking | pecked
• Verb Phrase (VP) − Verb + Noun Phrase = V NP
• Adjectives (ADJ) − beautiful | small | chirping
• The parse tree breaks down the sentence into structured parts so that
the computer can easily understand and process it. In order for the
parsing algorithm to construct this parse tree, a set of rewrite rules,
which describe what tree structures are legal, needs to be constructed.
Context-Free Grammar …. contd
• These rules say that a certain symbol may be expanded in the
tree into a sequence of other symbols. For example, if there are
two strings, a Noun Phrase (NP) and a Verb
Phrase (VP), then the string formed by NP followed by VP is
a sentence. The rewrite rules for the sentence are as follows −
• S → NP VP
• NP → DET N | DET ADJ N
• VP → V NP
Lexicon −
• DET → a | the
• ADJ → beautiful | perching
• N → bird | birds | grain | grains
• V → peck | pecks | pecking
From these rules and the lexicon, a parse tree for the sentence can be constructed.
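Since the original slide's parse-tree figure is not reproduced here, the same tree can be written out as a nested structure. The tuple encoding and the `leaves` helper below are illustrative conventions of this sketch, not a standard representation:

```python
# Parse tree for "The bird pecks the grains", following S -> NP VP,
# NP -> DET N, VP -> V NP, written as nested (label, children...) tuples.
tree = ("S",
        ("NP", ("DET", "the"), ("N", "bird")),
        ("VP", ("V", "pecks"),
               ("NP", ("DET", "the"), ("N", "grains"))))

def leaves(tree):
    """Collect the words at the leaves of a parse tree, left to right."""
    if isinstance(tree, str):          # a bare word is a leaf
        return [tree]
    words = []
    for child in tree[1:]:             # tree[0] is the node label
        words.extend(leaves(child))
    return words

print(" ".join(leaves(tree)))  # the bird pecks the grains
```

Reading the leaves off left to right recovers the original sentence, which is exactly what makes the tree a legal parse of it.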
However……
• Now consider the above rewrite rules. Since V can be replaced
by either "peck" or "pecks", sentences such as "The bird peck the
grains" are wrongly permitted, i.e. a subject-verb
agreement error is accepted as correct.
Merit − It is the simplest style of grammar, and therefore a widely used one.
Demerits −
• They are not highly precise. For example, “The grains peck the
bird” is syntactically correct according to the parser; even though it
makes no sense, the parser takes it as a correct sentence.
• To achieve high precision, multiple sets of grammar need to be
prepared. It may require completely different sets of rules for
parsing singular and plural variations, passive sentences, etc.,
which can lead to the creation of a huge, unmanageable set of
rules.
Top-Down Parser

• Here, the parser starts with the S symbol and attempts to
rewrite it into a sequence of terminal symbols that matches
the classes of the words in the input sentence, until it consists
entirely of terminal symbols.
• These are then checked against the input sentence to see if they
match. If not, the process starts over again with a
different set of rules. This is repeated until a specific rule is
found which describes the structure of the sentence.
Merit − It is simple to implement.
Demerits −
• It is inefficient, as the search process has to be repeated if an
error occurs.
• It is slow.
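The top-down procedure can be sketched as a recursive-descent recognizer over the grammar and lexicon from the previous section. This is a minimal, deliberately inefficient illustration (as the demerits above suggest), and all names here are assumptions of this sketch:

```python
def parse(symbol, words, grammar, lexicon):
    """Try to rewrite `symbol` to cover a prefix of `words`, top-down.

    Returns the list of possible word counts consumed from the front of
    `words`; an empty list means no rewrite rule matched.
    """
    if symbol in lexicon:  # pre-terminal: must match exactly one word
        return [1] if words and words[0] in lexicon[symbol] else []
    spans = []
    for expansion in grammar[symbol]:      # try each rewrite rule in turn
        partial = [0]                      # word counts consumed so far
        for part in expansion:
            partial = [used + extra
                       for used in partial
                       for extra in parse(part, words[used:], grammar, lexicon)]
        spans.extend(partial)
    return spans

def accepts(sentence, grammar, lexicon):
    """A sentence is accepted if S can be rewritten to cover all its words."""
    words = sentence.lower().split()
    return len(words) in parse("S", words, grammar, lexicon)

GRAMMAR = {"S": [["NP", "VP"]],
           "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
           "VP": [["V", "NP"]]}
LEXICON = {"DET": {"a", "the"},
           "ADJ": {"beautiful", "perching"},
           "N": {"bird", "birds", "grain", "grains"},
           "V": {"peck", "pecks", "pecking"}}

print(accepts("The bird pecks the grains", GRAMMAR, LEXICON))  # True
print(accepts("The bird peck the grains", GRAMMAR, LEXICON))   # True: agreement error accepted
print(accepts("The grains peck the bird", GRAMMAR, LEXICON))   # True: nonsense, but syntactically legal
print(accepts("Grains the bird pecks", GRAMMAR, LEXICON))      # False
```

The last three calls reproduce the demerits noted above: the recognizer happily accepts subject-verb agreement errors and nonsensical but well-formed sentences.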
Text Classification
• Also known as categorization: Given a text of some kind, decide which of a predefined set of
classes it belongs to.
• Language identification and genre classification are examples of text classification, as is
sentiment analysis.
• Sentiment analysis classifies a movie or product review as positive or negative; spam
detection classifies an email message as spam or not-spam.
• We can treat spam detection as a problem in supervised learning.

• A training set is readily available: the positive (spam) examples are in the spam folder, the negative
(ham) examples are in the inbox. Here is an excerpt:
• Spam: Wholesale FashionWatches -57% today. Designer watches for cheap ...
• Spam: You can buy ViagraFr$1.85 All Medications at unbeatable prices! ...
• Spam: WE CAN TREAT ANYTHING YOU SUFFER FROM JUST TRUST US ...
• Spam: Sta.rt earn*ing the salary yo,u d-eserve by o’btaining the prope,r crede’ntials!

• Inbox: The practical significance of hypertree width in identifying more ...
• Inbox: Abstract: We will motivate the problem of social identity clustering: ...
• Inbox: Good to see you my friend. Hey Peter, It was good to hear from you. ...
• Inbox: PDS implies convexity of the resulting optimization problem (Kernel Ridge ...
Text Classification …. contd
• From this excerpt we can start to get an idea of what might be
good features to include in the supervised learning model.
• Word combinations such as “for cheap” and “You can buy” seem
to be indicators of spam (although they would have a nonzero
probability in the inbox as well).
• Character-level features also seem important: spam is more likely
to be all uppercase and to have punctuation embedded in words.
• Apparently the spammers thought that the word bigram “you
deserve” would be too indicative of spam, and thus wrote “yo,u
d-eserve” instead.
• A character model should detect this. We could either create a
full character-level model of spam and non-spam, or we could
handcraft features such as “number of punctuation marks
embedded in words.”
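Treating spam detection as supervised learning, as described above, can be sketched with a small naive Bayes classifier over word features plus one handcrafted character-level feature. The tiny training set reuses (lightly cleaned) lines from the excerpt above; the tokenizer, the smoothing, and all names are assumptions of this sketch, not a production design:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Word features plus a crude character-level feature for
    punctuation embedded inside words (e.g. "yo,u d-eserve")."""
    words = re.findall(r"[a-zA-Z$']+", text.lower())
    embedded = len(re.findall(r"[a-zA-Z][.,*'-][a-zA-Z]", text))
    return words + ["<EMBEDDED_PUNCT>"] * embedded

def train(examples):
    """examples: (text, label) pairs. Returns per-class token counts and doc totals."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in examples:
        counts[label].update(tokenize(text))
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Pick the class maximizing log prior + log likelihoods (add-one smoothing)."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in counts:
        score = math.log(totals[label] / sum(totals.values()))
        denom = sum(counts[label].values()) + len(vocab)
        for token in tokenize(text):
            score += math.log((counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

examples = [
    ("Wholesale Fashion Watches -57% today. Designer watches for cheap", "spam"),
    ("You can buy Viagra Fr$1.85 All Medications at unbeatable prices!", "spam"),
    ("Sta.rt earn*ing the salary yo,u d-eserve by o'btaining the prope,r crede'ntials!", "spam"),
    ("The practical significance of hypertree width in identifying more", "ham"),
    ("Abstract: We will motivate the problem of social identity clustering", "ham"),
    ("Good to see you my friend. Hey Peter, It was good to hear from you.", "ham"),
]
counts, totals = train(examples)
print(classify("Designer watches for cheap, buy today", counts, totals))  # spam
print(classify("Good to hear from you my friend", counts, totals))        # ham
```

With only six training examples this is obviously a toy, but it shows the point made above: word features such as "for cheap" and the embedded-punctuation feature both fire strongly on the spam lines.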
