NLP Unit I
UNIT – I
Introduction
• The idea of giving computers the ability to process human
language is as old as the idea of computers themselves.
• A vibrant interdisciplinary field with many names
corresponding to its many facets:
• Speech and language processing, human language
technology, natural language processing, computational
linguistics, and speech recognition and synthesis.
• The goal of this new field is to get computers to perform
useful tasks involving human language, tasks like enabling
human-machine communication, improving human-human
communication, or simply doing useful processing of text or
speech.
Example
• One example of such a useful task is a
conversational agent.
• The HAL 9000 computer in Stanley Kubrick’s film
2001: A Space Odyssey is one of the most recognizable
characters in twentieth-century cinema.
• HAL is an artificial agent capable of such advanced
language-processing behavior as speaking and
understanding English, and at a crucial moment in
the plot, even reading lips.
• HAL’s creator Arthur C. Clarke was optimistic in
predicting when an artificial agent such as HAL
would be available.
Questions
• But just how far off was he?
• What would it take to create at least the
language-related parts of HAL?
• We call programs like HAL that converse with humans
via natural language conversational agents or dialogue
systems.
• Modern conversational agents are made up of various
components, including language input (automatic speech
recognition and natural language understanding) and
language output (natural language generation and
speech synthesis).
Machine Translation
• Making available to non-English speaking readers
the vast amount of scientific information on the
Web in English.
• Translating for English speakers the hundreds of
millions of Web pages written in other languages
like Chinese.
• The goal of machine translation is to automatically
translate a document from one language to
another.
Question Answering
• A generalization of simple web search
• Instead of just typing keywords a user might ask
complete questions, ranging from easy to hard, like the
following:
• What does “divergent” mean?
• What year was Abraham Lincoln born?
• How many states were in the United States that year?
• How much Chinese silk was exported to England by the
end of the 18th century?
• What do scientists think about the ethics of human
cloning?
Knowledge in Speech and Language Processing
• Knowledge of Language distinguishes language
processing applications from other data
processing systems
• Example: UNIX wc program – when used to
count bytes and lines, wc is an ordinary data
processing application.
• When it is used to count the words in a file, it
requires knowledge about what it means to be a
word, making it a language processing system.
Pragmatic or Dialogue
• Knowledge about the kinds of actions that speakers
intend by their use of sentences is pragmatic or
dialogue knowledge.
Kinds of Knowledge of Language
• Phonetics and Phonology— knowledge about linguistic
sounds
• Morphology— knowledge of the meaningful
components of words
• Syntax— knowledge of the structural relationships
between words
• Semantics—knowledge of meaning
• Pragmatics— knowledge of the relationship of
meaning to the goals and intentions of the speaker.
• Discourse— knowledge about linguistic units larger
than a single utterance
Ambiguity
• An input is ambiguous if there are multiple
alternative linguistic structures that can be built
for it
• E.g., I made her duck.
1. I cooked waterfowl for her.
2. I cooked waterfowl belonging to her.
3. I created the (plaster?) duck she owns.
4. I caused her to quickly lower her head or body.
5. I waved my magic wand and turned her into
undifferentiated waterfowl.
Contd...
• First, the words duck and her are morphologically or syntactically
ambiguous in their part-of-speech.
• Duck can be a verb or a noun, while her can be a dative pronoun or a
possessive pronoun.
• Second, the word make is semantically ambiguous; it can mean create
or cook.
• Finally, the verb make is syntactically ambiguous in a different way.
• Make can be transitive, that is, taking a single direct object (2), or it can
be ditransitive, that is, taking two objects (5), meaning that the first
object (her) got made into the second object (duck).
• Finally, make can take a direct object and a verb (4), meaning that the
object (her) got caused to perform the verbal action (duck).
• Furthermore, in a spoken sentence, there is an even deeper kind of
ambiguity; the first word could have been eye or the second word maid.
Part-of-Speech Tagging
• For example, deciding whether duck is a verb or a
noun can be solved by part-of-speech tagging.
• Deciding whether make means “create” or
“cook” can be solved by word sense
disambiguation.
• Resolution of part-of-speech and word sense
ambiguities are two important kinds of lexical
disambiguation.
• A wide variety of tasks can be framed as lexical
disambiguation problems.
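• As a rough illustration (a minimal sketch assuming NLTK plus its tokenizer and tagger data are installed; exact tags may vary), an off-the-shelf tagger can resolve the duck ambiguity:

# A minimal sketch of resolving the noun/verb ambiguity of 'duck' with an
# off-the-shelf part-of-speech tagger (assumes NLTK plus its 'punkt' and
# 'averaged_perceptron_tagger' data packages are installed).
import nltk

tokens = nltk.word_tokenize("I made her duck")
print(nltk.pos_tag(tokens))
# the tag chosen for 'duck' (a noun tag such as NN vs. a verb tag such as VB)
# picks out one of the readings listed above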
SYNTACTIC DISAMBIGUATION
• For example, a text-to-speech synthesis system
reading the word lead needs to decide whether
it should be pronounced as in lead pipe or as in
lead me on.
• Deciding whether her and duck are part of the
same entity (as in (1) or (4)) or are different
entities (as in (2)) is an example of syntactic
disambiguation and can be addressed by
probabilistic parsing.
Models and Algorithms
• Drawn from the standard toolkits of computer
science, mathematics, and linguistics
• Important models are state machines, rule
systems, logic, probabilistic models and
vector-space models.
• These models are turned into algorithms such as
state-space search algorithms (e.g., dynamic
programming) and machine learning algorithms such as
Expectation-Maximization (EM) and other
learning algorithms.
State Machines
• State machines are formal models that consist of
states, transitions among states, and an input
representation.
• Some of the variations of this basic model -
deterministic and non-deterministic finite-state
automata and finite-state transducers.
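• A minimal code sketch (illustrative, not from the text) of a deterministic FSA as a transition table; this toy machine recognizes the “sheep language” /baa+!/ used as an example later in this unit:

# A deterministic FSA recognizer for the sheep language /baa+!/
# (strings baa!, baaa!, baaaa!, ...). Any missing transition means rejection.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # loop for additional a's
    (3, "!"): 4,
}
ACCEPTING = {4}

def recognize(tape: str) -> bool:
    state = 0
    for symbol in tape:
        state = TRANSITIONS.get((state, symbol))
        if state is None:          # no legal transition: reject
            return False
    return state in ACCEPTING

print(recognize("baa!"), recognize("baaaa!"), recognize("ba!"))   # True True False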
Regular Grammars, CFGs & Feature-Augmented
Grammars
• This FSA will recognize all the adjectives in the table above
• However, it will also recognize ungrammatical forms like unbig, unfast, oranger,
or smally.
• Need to set up classes of roots and specify their possible suffixes.
• Thus adj-root1 would include adjectives that can occur with un- and -ly (clear,
happy, and real) while adj-root2 will include adjectives that can’t (big, small),
and so on.
An FSA for another fragment of English derivational morphology
• This FSA models a number of derivational facts, such as the well-known
generalization that any verb ending in -ize can be followed
by the nominalizing suffix -ation (Bauer, 1983; Sproat, 1993).
• E.g., from the word fossilize we can predict the word fossilization by
following states q0, q1, and q2.
• Similarly, adjectives ending in -al or -able at q5 (equal, formal,
realizable) can take the suffix -ity, or sometimes the suffix -ness to
state q6 (naturalness, casualness).
Morphological Recognition
• We can now use these FSAs to solve the problem of morphological
recognition
• Determining whether an input string of letters makes up a legitimate
English word or not.
• We do this by taking the morphotactic FSAs and plugging each
“sublexicon” into the FSA.
• We expand each arc (e.g., the reg-noun-stem arc) with all the
morphemes that make up the set of reg-noun-stem.
• The resulting FSA can then be defined at the level of the individual
letter.
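• A toy code sketch of this idea (illustrative only; the stems below are not the book’s lexicon, and the real construction expands the FSA at the letter level):

# Plug a small "sublexicon" of regular noun stems into the noun-inflection
# pattern (stem, optionally followed by the plural morpheme -s).
REG_NOUN_STEMS = {"cat", "dog", "fox", "aardvark"}

def recognize_noun(word: str) -> bool:
    if word in REG_NOUN_STEMS:                 # bare stem (singular)
        return True
    if word.endswith("s"):                     # stem followed by the plural morpheme
        return word[:-1] in REG_NOUN_STEMS
    return False

# Note: pure concatenation accepts 'foxs' and rejects 'foxes'; the spelling
# (orthographic) rules discussed later in this unit are needed to fix that.
print(recognize_noun("cats"), recognize_noun("catz"))   # True False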
Expanded FSA for a few English nouns with their
inflection.
Finite-State Transducers
• A transducer maps between one representation
and another
• FST is a type of finite automaton which maps
between two sets of symbols.
• We can visualize an FST as a two-tape automaton
which recognizes or generates pairs of strings.
• We can do this by labeling each arc in the
finite-state machine with two symbol strings, one
from each tape.
FST Vs FSA
• The FST has a more general function than an FSA
• FSA defines a formal language by defining a set of strings
• FST defines a relation between sets of strings.
• An FST can be thought of as a machine that reads one string and generates another.
• Here’s a summary of this four-fold way of thinking about transducers:
• FST as recognizer: a transducer that takes a pair of strings as input and outputs
accept if the string-pair is in the string-pair language, and reject if it is not.
• FST as generator: a machine that outputs pairs of strings of the language.
Thus the output is a yes or no, and a pair of output strings.
• FST as translator: a machine that reads a string and outputs another string
• FST as set relater: a machine that computes relations between sets.
• All of these have applications in speech and language processing.
• For morphological parsing (and for many other NLP applications), we will apply
the FST as translator metaphor, taking as input a string of letters and producing
as output a string of morphemes.
An FST can be formally defined with 7 parameters:
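• Following the standard textbook definition, the seven parameters are: Q, a finite set of N states q0, q1, ..., qN−1; Σ, a finite set corresponding to the input alphabet; Δ, a finite set corresponding to the output alphabet; q0 ∈ Q, the start state; F ⊆ Q, the set of final states; δ(q, w), the transition function, which given a state q ∈ Q and a string w ∈ Σ* returns a set of new states Q′ ⊆ Q; and σ(q, w), the output function, which given a state q ∈ Q and a string w ∈ Σ* returns a set of output strings, each a string o ∈ Δ*.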
Regular Relations
• Where FSAs are isomorphic to regular languages, FSTs are
isomorphic to regular relations.
• Regular relations are sets of pairs of strings, a natural
extension of the regular languages, which
are sets of strings.
• Like FSAs and regular languages, FSTs and regular relations
are closed under union
• In general they are not closed under difference,
complementation and intersection
• Some useful subclasses of FSTs are closed under these
operations
• In general, FSTs that are not augmented with the ε are more
likely to have such closure properties
Inversion and Composition
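• The inversion of a transducer T (written T⁻¹) simply switches the input and output labels, so a transducer that maps from a set of strings I to a set of strings O becomes one that maps from O to I.
• The composition of two transducers T1 and T2 (written T1 ◦ T2) is a single, more complex transducer equivalent to running T1 and T2 in series.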
Inversion Vs. Composition
• Inversion is useful because it makes it easy to
convert a FST-as-parser into an FST-as-generator.
• Composition is useful because it allows us to take
two transducers that run in series and replace
them with one more complex transducer.
• Composition works as in algebra; applying T1 ◦ T2
to an input sequence S is identical to applying T1
to S and then T2 to the result; thus T1 ◦ T2(S) =
T2(T1(S)).
Fig. 3.9, for example, shows the composition of [a:b]+
with [b:c]+ to produce
[a:c]+.
Projection
• The projection of an FST is the FSA that is
produced by extracting only one side
of the relation.
• We can refer to the projection to the left or
upper side of the relation as the upper or first
projection and the projection to the lower or
right side of the relation as the lower or second
projection.
Sequential Transducers and Determinism
• Transducers as we have described them may be
nondeterministic, in that a given input may translate
to many possible output symbols.
• Thus using general FSTs requires the kinds of search
algorithms needed for non-deterministic automata, making
FSTs quite slow in the general case.
• This suggests that it would be nice to have an algorithm
to convert a nondeterministic FST to a deterministic
one.
• But while every non-deterministic FSA is equivalent
to some deterministic FSA, not all finite-state
transducers can be determinized.
Contd...
• Sequential transducers, by contrast, are a subtype of
transducers that are deterministic on their input.
• At any state of a sequential transducer, each given symbol
of the input alphabet Σ can label at most one transition
out of that state.
• Fig. 3.10 gives an example of a sequential transducer
from Mohri (1997); note that here, unlike the transducer
in Fig. 3.8, the transitions out of each state are
deterministic based on the state and the input symbol.
• Sequential transducers can have epsilon symbols in the
output string, but not on the input.
A Sequential Finite-State
Transducer, from Mohri
Contd...
SUBSEQUENTIAL TRANSDUCER
• The subsequential transducer generates an additional output
string at the final states, concatenating it onto the output
produced so far (Schützenberger, 1977).
• What makes sequential and subsequential transducers important
is their efficiency
• Because they are deterministic on input, they can be processed in
time proportional to the number of symbols in the input (they are
linear in their input length) rather than proportional to some much
larger number which is a function of the number of states.
• Another advantage of subsequential transducers is that there exist
efficient algorithms for their determinization (Mohri, 1997) and
minimization (Mohri, 2000), extending the algorithms for
determinization and minimization of finite-state automata.
Contd...
• While both sequential and subsequential transducers are
deterministic and efficient, neither of them is able to handle
ambiguity, since they transduce each input string to exactly one
possible output string.
• Since ambiguity is a crucial property of natural language, it will be
useful to have an extension of subsequential transducers that can
deal with ambiguity, but still retain the efficiency and other useful
properties of sequential transducers.
• One such generalization of subsequential transducers is the
p-subsequential transducer.
• A p-subsequential transducer allows for p (p ≥ 1) final output
strings to be associated with each final state (Mohri, 1996).
• They can thus handle a finite amount of ambiguity, which is useful
for many NLP tasks.
• Mohri (1996, 1997) shows a number of tasks whose ambiguity
can be limited in this way, including the representation of
dictionaries, the compilation of morphological and phonological
rules, and local syntactic constraints.
• For each of these kinds of problems, he and others have shown
that they are p-subsequentializable, and thus can be
determinized and minimized.
• This class of transducers includes many, although not
necessarily all, morphological rules.
FSTs for Morphological Parsing
• Given the input cats, for instance, we’d like to output cat +N +Pl,
telling us that cat is a plural noun.
• In the finite-state morphology paradigm, we represent a word as a
correspondence between a lexical level and a surface level
• Lexical Level represents a concatenation of morphemes making
up a word
• Surface level represents the concatenation of letters which make
up the actual spelling of the word.
Fig. 3.12 shows these two levels for (English)
cats.
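• A toy code sketch of the FST-as-translator view of morphological parsing (the mini-lexicon, feature names, and bare -s handling below are illustrative, and no spelling rules are applied yet):

# Map a surface form to candidate lexical-level strings of morphemes and features.
LEXICON = {"cat": "+N", "dog": "+N", "fox": "+N"}

def parse(surface: str):
    """Return candidate lexical-level parses such as 'cat +N +Pl'."""
    parses = []
    if surface in LEXICON:
        parses.append(f"{surface} {LEXICON[surface]} +Sg")
    if surface.endswith("s") and surface[:-1] in LEXICON:
        stem = surface[:-1]
        parses.append(f"{stem} {LEXICON[stem]} +Pl")
    return parses

print(parse("cats"))   # ['cat +N +Pl']
print(parse("cat"))    # ['cat +N +Sg']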
Contd...
• For finite-state morphology it’s convenient to view an FST as having two
tapes.
• The upper or lexical tape is composed of characters from one
alphabet Σ.
• The lower or surface tape is composed of characters from another
alphabet Δ.
• In the two-level morphology of Koskenniemi (1983), we allow each arc
to have only a single symbol from each alphabet.
• We can then combine the two symbol alphabets Σ and Δ to create a new
alphabet, Σ′, which makes the relationship to FSAs quite clear.
• Σ′ is a finite alphabet of complex symbols.
• Each complex symbol is composed of an input-output pair i : o; one
symbol i from the input alphabet Σ, and one symbol o from an output
alphabet Δ, thus Σ′ ⊆ Σ × Δ.
• Σ and Δ may each include the epsilon symbol ε.
Sheep Language
• An FSA accepts a language stated over a finite
alphabet of single symbols, such as the
alphabet of our sheep language:
• Σ = {b,a, !} (3.2)
• An FST defined this way accepts a language stated
over pairs of symbols, as in:
• Σ′ = {a : a, b : b, ! : !, a : !, a : ε, ε : !} (3.3)
Feasible Pairs & Default Pairs
• In two-level morphology, the pair of symbols in Σ′ are
also called feasible pairs.
• Each feasible pair symbol a : b in the transducer alphabet
Σ′ expresses how the symbol a from one tape is mapped
to the symbol b on the other tape.
• For example a : ε means that an a on the upper tape will
correspond to nothing on the lower tape.
• Just as for an FSA, we can write regular expressions in
the complex alphabet Σ′.
• Since it’s most common for symbols to map to
themselves, in two-level morphology we call pairs like a : a
default pairs and refer to them by the single letter a.
A Schematic Transducer
A Fleshed-out English Nominal
Inflection
Intermediate tapes
Transducers and Orthographic Rules
• Concatenating the morphemes won’t work for cases
where there is a spelling change
• It would incorrectly reject an input like foxes and accept
an input like foxs.
• English often requires spelling changes at morpheme
boundaries; we handle these by introducing spelling rules
(i.e., orthographic rules)
• A number of notations are available for writing such
rules and to implement the rules as transducers.
• The ability to implement rules as a transducer turns out
to be useful throughout speech and language processing.
Spelling Rules
Chomsky and Halle Rule
Example
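• In the rule notation of Chomsky and Halle (1968), a rule of the form a → b / c __ d means “rewrite a as b when it occurs between c and d.”
• A standard example is the E-insertion rule, roughly ε → e / {x, s, z}ˆ __ s#, which inserts an e on the surface when a morpheme ending in x, s, or z is followed by the suffix -s, so that foxˆs# surfaces as foxes while catˆs# surfaces as cats.
• A minimal code sketch of this rule as a plain string rewrite (illustrative; the book implements such rules as finite-state transducers, not regular-expression substitutions):

# E-insertion applied to an intermediate-level string, with '^' as the
# morpheme boundary and '#' as the word boundary.
import re

def e_insertion(intermediate: str) -> str:
    # insert 'e' between a morpheme ending in x, s, or z and a following -s
    surface = re.sub(r"([xsz])\^s#", r"\1es#", intermediate)
    # strip the remaining boundary symbols to obtain the surface spelling
    return surface.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))   # foxes
print(e_insertion("cat^s#"))   # cats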
Combining FST Lexicon and Rules
• Fig. 3.19 shows the architecture of a two-level morphology
system, used for parsing or generating.
• The lexicon transducer maps between the lexical level, with its
stems and morphological features, and an intermediate level that
represents a simple concatenation of morphemes.
• Then a host of transducers, each representing a single spelling
rule constraint, all run in parallel so as to map between this
intermediate level and the surface level.
• Putting all the spelling rules in parallel is a design choice; we could
also have chosen to run all the spelling rules in series (as a long
cascade), if we slightly changed each rule.
Cascading
• The architecture in Fig. 3.19 is a two-level cascade of transducers.
• Cascading two automata means running them in series with the
output of the first feeding the input to the second.
• Cascades can be of arbitrary depth, and each level might be built
out of many individual transducers.
• The cascade in Fig. 3.19 has two transducers in series: the
transducer mapping from the lexical to the intermediate levels, and
the collection of parallel transducers mapping from the
intermediate to the surface level.
• The cascade can be run top-down to generate a string, or
bottom-up to parse it
• Fig. 3.20 shows a trace of the system accepting the mapping from
fox +N +PL to foxes.
Generating or Parsing with FST lexicon and rules
Ambiguity & Disambiguating
• Parsing can be slightly more complicated than generation, because of the
problem of ambiguity.
• For example, foxes can also be a verb (albeit a rare one, meaning “to baffle or
confuse”)
• The lexical parse for foxes could be fox +V +3Sg as well as fox +N +PL.
• How are we to know which one is the proper parse?
• In fact, for ambiguous cases of this sort, the transducer is not capable of
deciding.
• Disambiguating will require some external evidence such as the surrounding
words.
• Foxes is likely to be a noun in the sequence “I saw two foxes yesterday”, but a
verb in the sequence “That trickster foxes me every time!”
• Barring such external evidence, the best our transducer can do is just
enumerate the possible choices; so we can transduce foxˆs# into both fox +V
+3SG and fox +N +PL.
Automaton Intersection
• Transducers in parallel can be combined by automaton
intersection.
• The automaton intersection algorithm just takes the
Cartesian product of the states, i.e., for each state qi in
machine 1 and state qj in machine 2, we create a new
state qij.
• Then for any input symbol a, if machine 1 would
transition to state qn and machine 2 would transition to
state qm, we transition to state qnm.
• Fig. 3.21 sketches how this intersection (∧) and
composition (◦) process might be carried out.
Intersection and Composition of
Transducers
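• A code sketch of the Cartesian-product construction, shown for simple deterministic FSAs (the two machines below are illustrative; the same idea underlies intersecting the parallel transducers of Fig. 3.21):

def intersect(trans1, start1, accept1, trans2, start2, accept2, alphabet):
    """Build the product machine: state (qi, qj) simulates both machines at once."""
    start = (start1, start2)
    trans, states, frontier = {}, {start}, [start]
    while frontier:
        q1, q2 = frontier.pop()
        for a in alphabet:
            if (q1, a) in trans1 and (q2, a) in trans2:
                nxt = (trans1[(q1, a)], trans2[(q2, a)])
                trans[((q1, q2), a)] = nxt
                if nxt not in states:
                    states.add(nxt)
                    frontier.append(nxt)
    accept = {(p, q) for (p, q) in states if p in accept1 and q in accept2}
    return trans, start, accept

def accepts(trans, start, accept, string):
    state = start
    for symbol in string:
        state = trans.get((state, symbol))
        if state is None:
            return False
    return state in accept

# M1 accepts strings over {a, b} containing at least one 'a';
# M2 accepts strings of even length.
M1 = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1}, 0, {1})
M2 = ({(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 0}, 0, {0})

product = intersect(*M1, *M2, alphabet={"a", "b"})
print(accepts(*product, "ab"), accepts(*product, "bbb"))   # True False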
Lexicon-Free FSTs: The Porter Stemmer
• While building a transducer from a lexicon plus rules is the
standard algorithm for morphological parsing, there are simpler
algorithms that don’t require the large on-line lexicon demanded
by this algorithm.
• These are used especially in Information Retrieval (IR) tasks like
web search, in which a query such as a Boolean combination of
relevant keywords or phrases, e.g., (marsupial OR kangaroo OR
koala) returns documents that have these words in them.
• Since a document with the word marsupials might not match the
keyword marsupial, some IR systems first run a stemmer on the
query and document words.
• Morphological information in IR is thus only used to determine
that two words have the same stem; the suffixes are thrown away.
Stemming
• One of the most widely used such stemming
algorithms is the simple and efficient Porter (1980)
algorithm, which is based
on a series of simple cascaded rewrite rules.
• Cascaded rewrite rules are just the sort of thing
that could be easily implemented as an FST
• The Porter algorithm also can be viewed as a
lexicon-free FST stemmer
The algorithm contains a series of
rules like these:
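• Typical examples (following the standard presentation of the algorithm) include:
ATIONAL → ATE (e.g., relational → relate)
ING → ε, if the stem contains a vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
• A lexicon-free code sketch of a few such cascaded rewrite rules (illustrative only; the real Porter stemmer has many more rules and conditions on the stem):

import re

def toy_stem(word: str) -> str:
    word = re.sub(r"ational$", "ate", word)       # ATIONAL -> ATE
    word = re.sub(r"sses$", "ss", word)           # SSES    -> SS
    if re.search(r"[aeiou].*ing$", word):         # ING -> empty, if stem has a vowel
        word = re.sub(r"ing$", "", word)
    return word

for w in ["relational", "motoring", "grasses", "sing"]:
    print(w, "->", toy_stem(w))
# relational -> relate, motoring -> motor, grasses -> grass, sing -> sing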
Lexicon Based Morphological
Parsers
Word and Sentence Tokenization
• Segmenting running text into words and sentences is called
tokenization.
• Consider the following sentences from a Wall Street
Journal and New York Times article, respectively:
• Mr. Sherwood said reaction to Sea Containers’ proposal
has been "very positive."
• In New York Stock Exchange composite trading
yesterday, Sea Containers closed at $62.625, up 62.5
cents.
• ‘‘I said, ‘what’re you? Crazy?’ ’’ said Sadowsky.
• ‘‘I can’t afford to do that.’’
• Segmenting purely on white-space would produce
words like these: cents. said, positive, Crazy?
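• A rough code sketch of the difference (the regex pattern below is illustrative, not a full tokenizer):

import re

text = 'Mr. Sherwood said reaction to Sea Containers\' proposal has been "very positive."'

whitespace_tokens = text.split()
# produces tokens with attached punctuation, e.g. '"very' and 'positive."'

regex_tokens = re.findall(r"\w+(?:[.']\w+)*|[^\w\s]", text)
# separates punctuation from words (word-internal apostrophes and periods,
# as in contractions or numbers, are kept with the word)

print(whitespace_tokens)
print(regex_tokens)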
SENTENCE SEGMENTATION
• Sentence segmentation is a crucial first step in text
processing.
• Segmenting a text into sentences is generally based on
punctuation.
• Certain kinds of punctuation (periods, question marks, exclamation
points) tend to mark sentence boundaries.
• Question marks and exclamation points are relatively unambiguous
markers of sentence boundaries.
• Periods, on the other hand, are more ambiguous.
• The period character ‘.’ is ambiguous between a sentence boundary
marker and a marker of abbreviations like Mr. or Inc.
• The previous sentence that you just read showed an even more
complex case of this ambiguity, in which the final period of Inc.
marked both an abbreviation and the sentence boundary marker.
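• A minimal code sketch of punctuation-based sentence segmentation (assuming NLTK and its 'punkt' data are installed); the pre-trained Punkt model is designed to distinguish abbreviation periods like Mr. from sentence-final periods:

import nltk

text = ("Mr. Sherwood said reaction to Sea Containers' proposal has been "
        "\"very positive.\" In New York Stock Exchange composite trading "
        "yesterday, Sea Containers closed at $62.625, up 62.5 cents.")

for sentence in nltk.sent_tokenize(text):
    print(sentence)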
Detecting and Correcting Spelling Errors
• The detection and correction of spelling errors is
an integral part of modern word-processors and
search engines.
• It is also important in correcting errors in optical character
recognition (OCR), the automatic recognition
of machine or hand-printed characters, and in
on-line handwriting recognition, the recognition
of human printed or cursive handwriting as the
user is writing.
Following Kukich (1992), we can distinguish three
increasingly broader problems:
1. non-word error detection: detecting spelling errors that result in
non-words (like graffe for giraffe).
2. isolated-word error correction: correcting spelling errors that
result in nonwords, for example correcting graffe to giraffe, but
looking only at the word in isolation.
3. context-dependent error detection and correction: using the
context to help detect and correct spelling errors even if they
accidentally result in an actual word of English (real-word errors).
• Real-word errors can arise from typographical errors
(insertion, deletion, transposition) that accidentally produce a
real word (e.g., there for three), or because the writer substituted
the wrong spelling of a homophone or near-homophone (e.g.,
dessert for desert, or piece for peace).
Minimum Edit Distance
• Deciding which of two words is closer to some third word in
spelling is a special case of the general problem of string distance.
• The distance between two strings is a measure of how alike two
strings are to each other.
• Many important algorithms for finding string distance rely on
some version of the minimum edit distance algorithm, named by
Wagner and Fischer (1974) but independently discovered by many people.
• The minimum edit distance between two strings is the minimum
number of editing operations (insertion, deletion, substitution)
needed to transform one string into another.
• For example the gap between the words intention and execution
is five operations, shown in Fig. 3.23 as an alignment between the
two strings.
EXAMPLE
• Given two sequences, an alignment is a
correspondence between substrings of the
two sequences.
• Thus I aligns with the empty string, N with E, T
with X, and so on.
• Beneath the aligned strings is another
representation: a series of symbols expressing
an operation list for converting the top string
into the bottom string; d for deletion, s for
substitution, i for insertion.
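• For concreteness, the alignment between intention and execution (a reconstruction of Fig. 3.23) can be written out as:

I N T E * N T I O N
* E X E C U T I O N
d s s   i s

• Reading off the columns: delete I, substitute N by E, substitute T by X, leave E, insert C, substitute N by U, and leave T, I, O, N unchanged, giving five operations in total.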
Representing Minimum Edit
Distance
Levenshtein distance
• We can also assign a particular cost or weight to each of these
operations.
• The Levenshtein distance between two sequences uses the simplest
weighting, in which each of the three operations has a cost
of 1 (Levenshtein, 1966).
• Thus the Levenshtein distance between intention and execution
is 5.
• Levenshtein also proposed an alternate version of his metric in
which each insertion or deletion has a cost of one, and
substitutions are not allowed (equivalent to allowing
substitution, but giving each substitution a cost of 2, since any
substitution can be represented by one insertion and one
deletion).
• Using this version, the Levenshtein distance between intention
and execution is 8 (one deletion, one insertion, and three
substitutions: 1 + 1 + 3 × 2 = 8).
Dynamic Programming
• The minimum edit distance is computed by dynamic
programming.
• Dynamic programming is the name for a class of algorithms, first
introduced by Bellman (1957), that apply a table-driven method
to solve problems by combining solutions to subproblems.
• This class of algorithms includes the most commonly used
algorithms in speech and language processing; besides minimum
edit distance, these include the Viterbi and forward algorithms
and the CYK and Earley algorithms.
• The intuition of a dynamic programming problem is that a large
problem can be solved by properly combining the solutions to
various subproblems.
For example, consider the sequence or “path” of transformed words
that comprise the minimum edit distance between the strings
intention and execution shown in Fig. 3.24.
Contd...
• Dynamic programming algorithms for sequence comparison
work by creating a distance matrix with one column for each
symbol in the target sequence and one row for each symbol in
the source sequence (i.e., target along the bottom, source along
the side).
• For minimum edit distance, this matrix is the edit-distance
matrix.
• Each cell edit-distance[i,j] contains the distance between the
first i characters of the target and the first j characters of the
source.
• Each cell can be computed as a simple function of the
surrounding cells; thus starting from the beginning of the matrix
it is possible to fill in every entry.
The value in each cell is computed by taking the
minimum of the three possible paths through the
matrix which arrive there:
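• Following the standard formulation, with distance[i, j] the distance between the first i characters of the target and the first j characters of the source:

distance[i, j] = min( distance[i−1, j]   + ins-cost(target[i]),
                      distance[i−1, j−1] + subst-cost(source[j], target[i]),
                      distance[i, j−1]   + del-cost(source[j]) )

• With Levenshtein costs, ins-cost = del-cost = 1 and subst-cost = 2 (or 0 when the two characters are identical); the base cases are distance[0, 0] = 0, distance[i, 0] = i, and distance[0, j] = j.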
Minimum Edit Distance
Computation of Minimum Edit Distance
Schematic of Back Pointers
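• A minimal code sketch of the computation (using the Levenshtein costs above; for readability the source is indexed along the rows and the target along the columns, and the back pointers needed to recover the alignment are omitted):

def min_edit_distance(source: str, target: str) -> int:
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                                   # delete all source chars
    for j in range(1, m + 1):
        dist[0][j] = j                                   # insert all target chars
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + sub)   # substitution / match
    return dist[n][m]

print(min_edit_distance("intention", "execution"))   # 8 (or 5 with unit substitution cost)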