Natural Language Processing
Unit-I
Syed Rameem Zahra
(Assistant Professor)
Department of CSE, NSUT
Introduction to NLP
● Natural Language Processing (NLP) refers to AI methods for communicating
with intelligent systems using a human language (e.g. English), in speech
or in text.
● NLP-powered software helps us in our daily lives in various ways, for
example:
○ Personal assistants: Siri, Cortana, and Google Assistant.
○ Auto-complete: in search engines (e.g. Google).
○ Spell checking: almost everywhere, in your browser, your IDE (e.g. Visual
Studio), and desktop apps (e.g. Microsoft Word).
○ Machine Translation: Google Translate.
NLP: Real World Examples
Source: Wikipedia
Advantages of NLP
Applications of NLP
● Machine Translation (the translation of text or speech by a computer with no
human involvement).
● Information Retrieval (software that deals with the organization, storage,
retrieval and evaluation of information from document repositories, particularly
textual information).
● Question Answering (building systems that automatically answer questions
posed by humans in a natural language).
● Dialogue Systems (computer systems intended to converse with a human).
● Information Extraction (the automatic extraction of structured information
such as entities, relationships between entities, and attributes).
● Summarization (the technique of shortening long pieces of text).
● Sentiment Analysis (identifying and extracting opinions within a given text across
blogs, reviews, social media, forums, news, etc.).
Evolution of NLP
NLP Terminology
● Phonology − It is the study of organizing sounds systematically.
● Morphology − It is the study of the construction of words from primitive meaningful
units.
● Morpheme − It is the primitive unit of meaning in a language.
● Syntax − It refers to arranging words to make a sentence. It also involves
determining the structural role of words in the sentence and in phrases.
● Semantics − It is concerned with the meaning of words and how to combine
words into meaningful phrases and sentences.
● Pragmatics − It deals with using and understanding sentences in different
situations and how the interpretation of the sentence is affected.
● Discourse − It deals with how the immediately preceding sentence can affect
the interpretation of the next sentence.
● World Knowledge − It includes the general knowledge about the world.
Process of NLP
Natural Language Understanding (NLU)
Part-of-Speech (POS) Tagging
● Each word has a part-of-speech tag to describe its category.
● The part-of-speech tag of a word is one of the major word groups (or their
subgroups).
○ open classes -- noun, verb, adjective, adverb
○ closed classes -- prepositions, determiners, conjunctions, pronouns, participles
● POS taggers try to find POS tags for the words.
● Is duck a verb or a noun? (A morphological analyzer cannot make that decision.)
● A POS tagger may make that decision by looking at the surrounding words.
○ Duck! (verb)
○ Duck is delicious for dinner. (noun)
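As a minimal sketch, the same two examples can be run through NLTK's off-the-shelf tagger (this assumes the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded; the exact tags returned depend on the model and on capitalization):

# Sketch: tagging the two "duck" examples with NLTK's default tagger.
from nltk import pos_tag, word_tokenize

for sentence in ["Duck!", "Duck is delicious for dinner."]:
    tokens = word_tokenize(sentence)
    print(tokens, "->", pos_tag(tokens))
# The tagger uses the surrounding words (e.g. the subject position in the
# second sentence) to choose between the verb and noun readings.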
Lexical Processing
● The purpose of lexical processing is to determine the meanings of individual
words.
● The basic method is to look words up in a database of meanings -- a lexicon.
● We should also identify non-words such as punctuation marks.
● Word-level ambiguity -- words may have several meanings, and the
correct one cannot be chosen based solely on the word itself.
○ bank in English (financial institution vs. river bank)
○ yüz in Turkish (face, hundred, or swim)
● Solution -- resolve the ambiguity on the spot by POS tagging (if possible)
or pass the ambiguity on to the other levels.
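A small illustration of the lexicon-lookup idea is to list a few WordNet senses of the ambiguous word bank via NLTK (a sketch, assuming the 'wordnet' corpus has been downloaded):

# Sketch: looking up an ambiguous word in a lexical database (WordNet).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank")[:5]:
    print(synset.name(), "-", synset.definition())
# The lexicon returns many candidate senses; choosing among them
# requires context (POS tagging, or later semantic processing).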
Syntactic Processing
Knowledge Representation for NLP
Stages of NLP
● Lexical Analysis:
➢It involves identifying and analyzing the structure of words.
➢The lexicon of a language is the collection of words and phrases in that language.
➢Lexical analysis divides the whole chunk of text into paragraphs, sentences, and
words (see the tokenization sketch after this list).
● Syntactic Analysis (Parsing):
➢It involves analyzing the words in the sentence for grammar and arranging the
words in a manner that shows the relationships among the words.
➢A sentence such as “The school goes to boy” is rejected by an English syntactic
analyzer.
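A minimal sketch of the lexical-analysis step, splitting text into sentences and words with NLTK (the 'punkt' resource is assumed to be downloaded; the example text is illustrative):

# Sketch: dividing a chunk of text into sentences and words.
from nltk import sent_tokenize, word_tokenize

text = "The boy goes to school. The school is far from his home."
for sentence in sent_tokenize(text):
    print(word_tokenize(sentence))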
Stages of NLP (Contd…)
● Semantic Analysis:
➢It draws the exact meaning or the dictionary meaning from the text.
➢The text is checked for meaningfulness.
➢It is done by mapping syntactic structures onto objects in the task domain.
➢The semantic analyzer disregards sentences such as “hot ice-cream”.
● Discourse Integration:
➢The meaning of any sentence depends upon the meaning of the sentence just
before it.
➢In addition, it also brings about the meaning of the immediately succeeding sentence.
Stages of NLP (Contd…)
● Pragmatic Analysis:
➢During this stage, what was said is re-interpreted in terms of what it actually meant.
➢It involves deriving those aspects of language which require real
world knowledge.
Why NLP is hard: Difficulties in NLU
Classical NLP Problems
● Mostly Solved:
● Text Classification (e.g. spam detection in Gmail).
● Part of Speech (POS) tagging: Given a sentence, determine the POS tag for each word (e.g. NOUN,
VERB, ADV, ADJ).
● Named Entity Recognition (NER): Given a sentence, determine named entities (e.g. person names,
locations, organizations).
● Making Solid Progress:
● Sentiment Analysis: Given a sentence, determine its polarity (e.g. positive, negative, neutral), or
emotions (e.g. happy, sad, surprised, angry, etc.)
● Co-reference Resolution: Given a sentence, determine which words (“mentions”) refer to the same
objects (“entities”). For example: (Manning is a great NLP professor, he worked in the field for over two
decades).
● Word Sense Disambiguation (WSD): Many words have more than one meaning; we have to select the
meaning which makes the most sense based on the context (e.g. I went to the bank to get some money),
here bank means a financial institution, not the land beside a river.
● Machine Translation (e.g. Google Translate).
● Still Challenging:
● Dialogue agents and chat-bots, especially open-domain ones.
● Question Answering.
● Abstractive Summarization.
● NLP for low-resource languages (e.g. African languages)
Morphology
● Morphology comes from a Greek word meaning “Shape” or “Form”
and is used in linguistics to denote the study of words, both with
regard to their internal structure (e.g. washing -> wash + ing) and
their combination or formulation to form new or larger units (e.g.
bat->bats :: rat->rats)
● Morphology tries to formulate rules.
● It helps in different domains such as spell checkers and machine
translation.
● A morphological analyzer and generator is a tool: the analyzer analyzes a given
word, and the generator produces a word given its stem and its
features (like affixes).
● It identifies how a word is produced through the use of morphemes.
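A full morphological analyzer is beyond a single slide, but the analysis side can be roughly sketched with NLTK's stemmer and lemmatizer (the 'wordnet' corpus is assumed to be downloaded; these tools return only the stem or lemma, not the affix features a real analyzer would output):

# Sketch: approximating morphological analysis with stemming/lemmatization.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["washing", "bats", "rats"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))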
Morpheme and its Types
● The morpheme is the smallest element of
a word that has grammatical function and
meaning.
● Types:
○ Free morpheme: A single free
morpheme can become a complete
word on its own. For instance, bus, bicycle,
and so forth.
○ Bound morpheme: It cannot stand
alone and must be joined to a free
morpheme to produce a word. -ing, un-,
and other bound morphemes are
examples.
Basic word classes (parts of speech)
● Content words (open-class):
– Nouns: student, university, knowledge,...
– Verbs: write, learn, teach,...
– Adjectives: difficult, boring, hard, ....
– Adverbs: easily, repeatedly,...
● Function words (closed-class):
– Prepositions: in, with, under,...
– Conjunctions: and, or,...
– Determiners: a, the, every,...
Words aren’t just defined by blanks
● Problem 1: Compounding
○ “ice cream”, “website”, “web site”, “New York-based”
● Problem 2: Other writing systems have no blanks, like
Chinese
● Problem 3: Clitics
○ English: “doesn’t”, “I’m”
○ Italian: “dirglielo” = dir + gli(e) + lo (meaning: tell + him + it)
How many words are there?
“Of course he wants to take the advanced course too. He already
took two beginners’ courses.”
● How many word tokens are there?
○ (16 to 19, depending on how we count punctuation)
● How many word types are there?
○ i.e. How many different words are there?
○ Again, this depends on how you count, but it’s usually much less than the number of
tokens
● The same (underlying) word can take different forms: course/courses, take/took
● We distinguish concrete word forms (take, taking) from abstract lemmas or dictionary forms
(take)
● Different words may be spelled/pronounced the same: of course vs. advanced course; two
vs. too
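The token/type distinction is easy to check programmatically; a minimal sketch (the tokenization and lowercasing choices here are assumptions, and changing them changes the counts):

# Sketch: counting word tokens vs. word types for the example sentence.
import re

text = ("Of course he wants to take the advanced course too. "
        "He already took two beginners' courses.")
tokens = re.findall(r"[\w']+", text.lower())   # crude tokenizer, drops punctuation
types = set(tokens)
print(len(tokens), "tokens,", len(types), "types")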
How many different words are there?
Derivational morphology
● Nominalization:
○ V + -ation: computerization
○ V+ -er: killer
○ Adj + -ness: fuzziness
● Negation:
○ un-: undo, unseen, …
○ mis-: mistake,...
● Adjectivization:
○ V+ -able: doable
○ N + -al: national
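These derivational patterns can be written down as simple (POS, suffix) -> POS rules; the table below is a hypothetical sketch, and real rules must also handle spelling changes (e.g. computerize + -ation -> computerization, fuzzy + -ness -> fuzziness):

# Sketch: a toy rule table for derivational morphology.
rules = {
    ("V", "ation"):  "N",    # nominalization
    ("V", "er"):     "N",
    ("Adj", "ness"): "N",
    ("V", "able"):   "Adj",  # adjectivization
    ("N", "al"):     "Adj",
}

def derive(stem, pos, suffix):
    new_pos = rules.get((pos, suffix))
    return (stem + suffix, new_pos) if new_pos else None

print(derive("do", "V", "able"))       # ('doable', 'Adj')
print(derive("fuzzy", "Adj", "ness"))  # naive concatenation gives 'fuzzyness', not 'fuzziness'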
Morphemes: stems, affixes
dis-grace-ful-ly
prefix-stem-suffix-suffix
● Many word forms consist of a stem plus a number of affixes (prefixes or
suffixes)
○ Infixes are inserted inside the stem.
○ Circumfixes surround the stem (German ge-…-en, as in gesehen)
● Morphemes: the smallest (meaningful/grammatical) parts of words.
○ Stems (grace) are often free morphemes.
○ Free morphemes can occur by themselves as words.
○ Affixes (dis-, -ful, -ly) are usually bound morphemes.
○ Bound morphemes have to combine with others to form words.
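As a toy illustration of splitting a word form into stem and affixes, here is a sketch with a hand-picked affix list (the prefix and suffix lists are assumptions; real systems use a lexicon and the finite-state machinery introduced below):

# Sketch: peeling illustrative prefixes/suffixes off "disgracefully".
prefixes = ["dis"]
suffixes = ["ly", "ful"]

def strip_affixes(word):
    found_prefixes, found_suffixes = [], []
    for p in prefixes:
        if word.startswith(p):
            found_prefixes.append(p)
            word = word[len(p):]
    stripped = True
    while stripped:
        stripped = False
        for s in suffixes:
            if word.endswith(s):
                found_suffixes.insert(0, s)
                word = word[:-len(s)]
                stripped = True
    return found_prefixes + [word] + found_suffixes

print(strip_affixes("disgracefully"))   # ['dis', 'grace', 'ful', 'ly']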
Morphemes and morphs
disgracefully
dis      grace     ful      ly
prefix   stem      suffix   suffix
NEG      grace+N   +ADJ     +ADV
Morphological generation
Finite-State Automata and Regular Languages: review
Automata and languages
● Simple patterns:
○ Standard characters match themselves: ‘a’, ‘1’
○ Character classes: ‘[abc]’, ‘[0-9]’, negation: ‘[^aeiou]’
○ (Predefined: \s (whitespace), \w (alphanumeric), etc.)
○ Any character (except newline) is matched by ‘.’
● Complex patterns: (e.g. ^[A-Z]([a-z])+\s )
○ Group: ‘(...)’
○ Repetition: 0 or more times: ‘*’, 1 or more times: ‘+’
○ Disjunction: ‘...|...’
○ Beginning of line ‘^’ and end of line ‘$’
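The same patterns can be tried directly with Python's re module; a small sketch (the test strings are illustrative):

# Sketch: the regular-expression notation above, in Python.
import re

print(re.findall(r"[0-9]+", "room 42, floor 7"))            # ['42', '7']
print(re.findall(r"[^aeiou\s]", "regex"))                   # non-vowels: ['r', 'g', 'x']
print(bool(re.match(r"^[A-Z]([a-z])+\s", "Hello world")))   # True
print(bool(re.match(r"^[A-Z]([a-z])+\s", "hello world")))   # False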
Finite-state methods for morphology
Stem changes
● Some irregular words require stem changes:
○ Past tense verbs: teach-taught, go-went, write-wrote
○ Plural nouns: mouse-mice, foot-feet, wife-wives
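Because such forms do not follow the regular affix rules, analyzers typically keep them in an exception lexicon; a minimal sketch (the entries are just the examples above, and the feature strings are illustrative):

# Sketch: an exception lexicon for irregular forms.
irregular = {
    "taught": ("teach", "+V +past"),
    "went":   ("go",    "+V +past"),
    "wrote":  ("write", "+V +past"),
    "mice":   ("mouse", "+N +pl"),
    "feet":   ("foot",  "+N +pl"),
    "wives":  ("wife",  "+N +pl"),
}

print(irregular.get("mice"))   # ('mouse', '+N +pl')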
FSAs for derivational morphology
Finite-state transducers
● A finite-state transducer T = ⟨Q, Σ, Δ, q0, F, δ, σ⟩ consists of:
○ A finite set of states Q = {q0, q1, ..., qn}
○ A finite alphabet Σ of input symbols (e.g. Σ = {a, b, c, ...})
○ A finite alphabet Δ of output symbols (e.g. Δ = {+N, +pl, ...})
○ A designated start state q0 ∈ Q
○ A set of final states F ⊆ Q
○ A transition function δ: Q × Σ → 2^Q (the power set of Q)
■ δ(q, w) = Q’ for q ∈ Q, Q’ ⊆ Q, w ∈ Σ
○ An output function σ: Q × Σ → Δ*
■ σ(q, w) = ω for q ∈ Q, w ∈ Σ, ω ∈ Δ*
■ If the current state is q and the current input is w, write ω.
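To make the definition concrete, here is a minimal sketch of an FST in Python that maps the surface form cats to the analysis cat +N +pl (the states, transitions, and output strings are all hypothetical):

# Sketch: a tiny finite-state transducer, following the definition above.
delta = {("q0", "c"): {"q1"}, ("q1", "a"): {"q2"},      # transition function δ
         ("q2", "t"): {"q3"}, ("q3", "s"): {"q4"}}
sigma = {("q0", "c"): "c", ("q1", "a"): "a",            # output function σ
         ("q2", "t"): "t", ("q3", "s"): " +N +pl"}
finals = {"q4"}                                         # final states F

def transduce(word, state="q0"):
    output = ""
    for symbol in word:
        next_states = delta.get((state, symbol))
        if not next_states:
            return None                  # no transition: the input is rejected
        output += sigma.get((state, symbol), "")
        state = next(iter(next_states))  # this toy machine happens to be deterministic
    return output if state in finals else None

print(transduce("cats"))   # cat +N +pl
print(transduce("dogs"))   # None (not covered by this toy machine)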
Finite-state transducers
Simplified Morphological Parsing FST
Some FST operations
● Inversion (T⁻¹):
○ The inversion T⁻¹ of a transducer switches its input and output
labels.
○ This can be used to switch from parsing words to generating
words.
● Composition (T ∘ T’): (Cascade)
○ Two transducers T = L1 ⨉ L2 and T’ = L2 ⨉ L3 can be composed
into a third transducer T’’ = L1 ⨉ L3.
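Inversion is easy to picture if the transducer is stored as a list of arcs: swapping the input and output labels turns a parser into a generator. A sketch with hypothetical arcs:

# Sketch: inverting a transducer by swapping input and output labels.
# Each arc is (state, input_symbol, output_symbol, next_state).
arcs = [("q0", "c", "c", "q1"), ("q1", "a", "a", "q2"),
        ("q2", "t", "t", "q3"), ("q3", "s", "+N +pl", "q4")]

inverted = [(q, out, inp, q_next) for (q, inp, out, q_next) in arcs]
print(inverted[-1])   # ('q3', '+N +pl', 's', 'q4'): reads the analysis, writes the surface form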
Problems in Morphological Analyzer
● Productivity
● False Analysis
● Bound Base Morphemes
Productivity
False analysis
Bound Base Morphemes