
Natural Language Processing

UNIT – I
Introduction
• The idea of giving computers the ability to process human
language is as old as the idea of computers themselves.
• A vibrant interdisciplinary field with many names
corresponding to its many facets
• Speech and language processing, human language
technology, natural language processing, computational
linguistics, and speech recognition and synthesis.
• The goal of this new field is to get computers to perform
useful tasks involving human language, tasks like enabling
human-machine communication, improving human-human
communication, or simply doing useful processing of text or
speech.
Example
• One example of such a useful task is a
conversational agent.
• The HAL 9000 computer in Stanley Kubrick’s film
2001: A Space Odyssey is one of the most recognizable
characters in twentieth-century cinema.
• HAL is an artificial agent capable of such advanced
language-processing behavior as speaking and
understanding English, and at a crucial moment in
the plot, even reading lips.
• HAL’s creator Arthur C. Clarke was optimistic in
predicting when an artificial agent such as HAL
would be available.
Questions
• But just how far off was he?
• What would it take to create at least the
language-related parts of HAL?
• We call programs like HAL that converse with humans
via natural language conversational agents or dialogue
systems.
• The various components that make up modern
conversational agents, including language input
(automatic speech recognition and natural language
understanding) and language output (natural language
generation and speech synthesis).
Machine Translation
• Making available to non-English speaking readers
the vast amount of scientific information on the
Web in English.
• Translating for English speakers the hundreds of
millions of Web pages written in other languages
like Chinese.
• Machine translation goal is to automatically
translate a document from one language to
another.
Question Answering
• A generalization of simple web search
• Instead of just typing keywords a user might ask
complete questions, ranging from easy to hard, like the
following:
• What does “divergent” mean?
• What year was Abraham Lincoln born?
• How many states were in the United States that year?
• How much Chinese silk was exported to England by the
end of the 18th century?
• What do scientists think about the ethics of human
cloning?
Knowledge in Speech and Language Processing
• Knowledge of Language distinguishes language
processing applications from other data
processing systems
• Example: UNIX wc program – when used to
count bytes and lines, wc is an ordinary data
processing application.
• When it is used to count the words in a file, it
requires knowledge about what it means to be a
word, making it a language processing system.
Pragmatic or Dialogue
• Knowledge of the kinds of actions that speakers
intend by their use of sentences is pragmatic or
dialogue knowledge.
Kinds of Knowledge of Language
• Phonetics and Phonology— knowledge about linguistic
sounds
• Morphology— knowledge of the meaningful
components of words
• Syntax— knowledge of the structural relationships
between words
• Semantics—knowledge of meaning
• Pragmatics— knowledge of the relationship of
meaning to the goals and intentions of the speaker.
• Discourse— knowledge about linguistic units larger
than a single utterance
Ambiguity
• Some input is ambiguous if there are multiple
alternative linguistic structures that can be built
for it
• Eg: I made her duck.
1. I cooked waterfowl for her.
2. I cooked waterfowl belonging to her.
3. I created the (plaster?) duck she owns.
4. I caused her to quickly lower her head or body.
5. I waved my magic wand and turned her into
undifferentiated waterfowl.
Contd...
• First, the words duck and her are morphologically or syntactically
ambiguous in their part-of-speech.
• Duck can be a verb or a noun, while her can be a dative pronoun or a
possessive pronoun.
• Second, the word make is semantically ambiguous; it can mean create
or cook.
• Finally, the verb make is syntactically ambiguous in a different way.
• Make can be transitive, that is, taking a single direct object (2), or it can
be ditransitive, that is, taking two objects (5), meaning that the first
object (her) got made into the second object (duck).
• Finally, make can take a direct object and a verb (4), meaning that the
object (her) got caused to perform the verbal action (duck).
• Furthermore, in a spoken sentence, there is an even deeper kind of
ambiguity; the first word could have been eye or the second word maid
Part-of Speech Tagging
• For example deciding whether duck is a verb or a
noun can be solved by part-of-speech tagging.
• Deciding whether make means “create” or
“cook” can be solved by word sense
disambiguation.
• Resolution of part-of-speech and word sense
ambiguities are two important kinds of lexical
disambiguation.
• A wide variety of tasks can be framed as lexical
disambiguation problems.
SYNTACTIC DISAMBIGUATION
• For example, a text-to-speech synthesis system
reading the word lead needs to decide whether
it should be pronounced as in lead pipe or as in
lead me on.
• Deciding whether her and duck are part of the
same entity (as in (1) or (4)) or are different
entity (as in (2)) is an example of syntactic
disambiguation and can be addressed by
probabilistic parsing.
Models and Algorithms
• Drawn from the standard toolkits of computer
science, mathematics, and linguistics
• Important models are state machines, rule
systems, logic, probabilistic models and
vector-space models.
• These models are turned into algorithms such as
state space search algorithms (e.g., dynamic
programming) and machine learning algorithms
• Expectation-Maximization (EM) and other
learning algorithms.
State Machines
• State machines are formal models that consist of
states, transitions among states, and an input
representation.
• Some of the variations of this basic model -
deterministic and non-deterministic finite-state
automata and finite-state transducers.
Regular Grammars, CFGs & Feature-Augmented
Grammars
• Closely related to these models are their
declarative counterparts: formal rule systems.
• Regular grammars and regular relations,
context-free grammars, and feature-augmented
grammars.
• State machines and formal rule systems are the
main tools used when dealing with knowledge of
phonology, morphology, and syntax.
Predicate Calculus
• A third class of models that plays a critical role in
capturing knowledge of language are models
based on logic.
• First order logic, also known as the predicate
calculus, as well as such related formalisms as
lambda-calculus, feature structures and semantic
primitives.
• These logical representations have traditionally
been used for modeling semantics and
pragmatics
Words & Transducers
• A fox, a fish, and a goose
• Writing the plurals of these animals takes
more than just tacking on an -s.
• The plural of fox is foxes and of goose, geese.
Orthographic Rules & Morphological Rules
• It takes two kinds of knowledge to correctly
search for singulars and plurals of these forms.
• Orthographic rules tell us that English words
ending in -y are pluralized by changing the -y to
-i- and adding an -es.
• Morphological rules tell us that fish has a null
plural, and that the plural of goose is formed by
changing the vowel.
Morphological Parsing
• Parsing means taking an input and producing
some sort of linguistic structure for it.
• The problem of recognizing that a word (like
foxes) breaks down into component morphemes
(fox and -es) and building a structured
representation of this fact is called
Morphological parsing.
Stemming
• Morphological parsing or stemming applies to
many affixes other than plurals
• For example we might need to take any English
verb form ending in -ing (going, talking,
congratulating) and parse it into its verbal stem
plus the -ing morpheme.
• So given the surface or input form going, we
might want to produce the parsed form
VERB-go + GERUND-ing.
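• A minimal Python sketch of this idea (the rule list, feature labels, and the function name toy_parse are illustrative, not the finite-state parser developed later in this unit):

    # Toy suffix-stripping parser: maps a surface form such as "going"
    # to a parse like "go +VERB +GERUND". Real morphological parsers use
    # finite-state transducers rather than ad-hoc string rules.
    SUFFIX_RULES = [
        ("ing", "+VERB +GERUND"),   # going  -> go +VERB +GERUND
        ("ed",  "+VERB +PAST"),     # talked -> talk +VERB +PAST
        ("s",   "+NOUN +PL"),       # cats   -> cat +NOUN +PL
    ]

    def toy_parse(surface):
        for suffix, features in SUFFIX_RULES:
            if surface.endswith(suffix) and len(surface) > len(suffix) + 1:
                return surface[: -len(suffix)] + " " + features
        return surface + " +STEM"   # no known suffix: treat as a bare stem

    print(toy_parse("going"))   # go +VERB +GERUND
    print(toy_parse("cats"))    # cat +NOUN +PL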
Morphological parsing
• Morphological parsing plays a crucial role in Web
search for morphologically complex languages like
Russian or German
• Morphological parsing also plays a crucial role in
part-of-speech tagging for these morphologically
complex languages
• It is important for producing the large dictionaries
that are necessary for robust spell-checking.
Productive Suffix
• To solve the morphological parsing problem, why
couldn’t we just store all the plural forms of
English nouns and -ing forms of English verbs in a
dictionary and do parsing by lookup?
• Sometimes we can do this, and for example for
English speech recognition this is exactly what we
do.
• But for many NLP applications this isn’t possible
because -ing is a productive suffix; by this we
mean that it applies to every verb.
• Similarly -s applies to almost every noun.
Productive suffixes
• Productive suffixes even apply to new words;
thus the new word fax can automatically be used
in the -ing form: faxing.
• Since new words are created every day, the class
of nouns in English increases constantly, and we
need to be able to add the plural morpheme -s to
each of these.
• Additionally, the plural form of these new nouns
depends on the spelling/pronunciation of the
singular form; for example if the noun ends in –z
then the plural form is -es rather than -s.
Turkish Language
• We certainly cannot list all the morphological
variants of every word in morphologically
complex languages like Turkish, which has words
like:
• uygarlaştıramadıklarımızdanmışsınızcasına
  uygar +laş +tır +ama +dık +lar +ımız +dan +mış +sınız +casına
  civilized +BEC +CAUS +NABL +PART +PL +P1PL +ABL +PAST +2PL +AsIf
• “(behaving) as if you are among those whom we
could not civilize”
Morphemes
• The various pieces of this word have these
meanings
• +BEC “become”
• +CAUS the causative verb marker (‘cause to X’)
• +NABL “not able”
• +PART past participle form
• +P1PL 1st person pl possessive agreement
• +2PL 2nd person pl
• +ABL ablative (from/among) case marker
• +AsIf derivationally forms an adverb from a finite
verb
Finite State Transducers
• Key algorithm for morphological parsing - the
finite state transducers
• For example, in information retrieval and web
search (IR), we might only need to map from
foxes to fox; we might not also need to know
that foxes is plural.
• Stripping off word endings is called stemming
in IR.
• A simple stemming algorithm is called the
Porter stemmer.
Lemmatization
• For other speech and language processing
tasks, we need to know that two words have a
similar root, despite their surface differences.
• For example the words sang, sung, and sings
are all forms of the verb sing.
• The word sing is sometimes called the common
lemma of these words, and mapping from all
of these to sing is called lemmatization
Tokenization
• Tokenization or word segmentation is the task of
separating out (tokenizing) words from running
text.
• In English, words are often separated from each
other by blanks (whitespace), but whitespace is
not always sufficient;
• We’ll need to notice that New York and rock ’n’
roll are individual words despite the fact that
they contain spaces, but for many applications
we’ll need to separate I’m into the two words I
and am.
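• A rough Python sketch of whitespace-plus-punctuation tokenization with a small contraction table (the table, regex, and function name are illustrative; multiword tokens like New York would need a separate multiword lexicon and are not handled here):

    import re

    CONTRACTIONS = {"I'm": ["I", "am"], "can't": ["can", "not"]}

    def tokenize(text):
        tokens = []
        for chunk in text.split():
            if chunk in CONTRACTIONS:
                tokens.extend(CONTRACTIONS[chunk])
            else:
                # keep word-internal apostrophes, split off other punctuation
                tokens.extend(re.findall(r"\w+(?:'\w+)?|[^\w\s]", chunk))
        return tokens

    print(tokenize("I'm in New York."))
    # ['I', 'am', 'in', 'New', 'York', '.']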
Contd...
• Finally, for many applications we need to know
how similar two words are orthographically.
• Morphological parsing is one method for
computing this similarity, but another is to just
compare the strings of letters to see how
similar they are.
• A common way of doing this is with the
minimum edit distance algorithm, which is
important throughout NLP.
Survey of (Mostly) English Morphology
• Morphology is the study of the way words are
built up from smaller meaning-bearing units,
morphemes.
• A morpheme is often defined as the minimal
meaning-bearing unit in a language.
• So for example the word fox consists of a
single morpheme (the morpheme fox) while
the word cats consists of two: the morpheme
cat and the morpheme -s.
Two Broad Classes of Morphemes
• Stems & Affixes
• The exact details of the distinction vary from language to
language
• The stem is the “main” morpheme of the word,
supplying the main meaning
• The affixes add “additional” meanings of various kinds.
• Affixes are further divided into prefixes, suffixes, infixes,
and circumfixes.
• Prefixes precede the stem, suffixes follow the stem,
circumfixes do both, and infixes are inserted inside the
stem.
Examples
• For example, the word eats is composed of a stem eat and the
suffix -s.
• The word unbuckle is composed of a stem buckle and the prefix
un-.
• English doesn’t have any good examples of circumfixes, but
many other languages do.
• In German, the past participle of some verbs is formed by
adding ge- to the beginning of the stem and -t to the end; so the
past participle of the verb sagen (to say) is gesagt (said).
• Infixes, in which a morpheme is inserted in the middle of a word, occur
very commonly, for example, in the Philippine language Tagalog.
• For example the affix um, which marks the agent of an action, is
infixed to the Tagalog stem hingi “borrow” to produce humingi.
Agglutinative languages.
• A word can have more than one affix.
• For example, the word “rewrites” has the prefix
re-, the stem write, and the suffix -s.
• The word “unbelievably” has a stem (believe)
plus three affixes (un-, -able, and -ly).
• While English doesn’t tend to stack more than
four or five affixes, Turkish can have words with
nine or ten affixes.
• Languages that tend to string affixes together in
this way, like Turkish, are called agglutinative
languages.
Four Methods to Combine Morphemes
• There are many ways to combine morphemes
to create words.
• Four of these methods are common and play
important roles in speech and language
processing:
• Inflection
• Derivation
• Compounding
• Cliticization
Inflection
• Inflection is the combination of a word stem with
a grammatical morpheme
• Usually resulting in a word of the same class as
the original stem, and usually filling some
syntactic function like agreement.
• For example, English has the inflectional
morpheme -s for marking the plural on nouns,
and the inflectional morpheme -ed for marking
the past tense on verbs.
Derivation
• Derivation is the combination of a word stem
with a grammatical morpheme
• Usually resulting in a word of a different class,
often with a meaning hard to predict exactly.
• For example the verb computerize can take
the derivational suffix -ation to produce the
noun computerization.
Compounding & Cliticization
• A clitic is a morpheme that acts syntactically like a word,
but is reduced in form and attached (phonologically and
sometimes orthographically) to another word.
• Compounding is the combination of multiple word
stems together.
• For example the noun doghouse is the concatenation of
the morpheme dog with the morpheme house.
• Cliticization is the combination of a word stem
with a clitic.
• For example the English morpheme ’ve in the word I’ve
is a clitic, as is the French definite article l’ in the word
l’opera.
Inflectional Morphology
• English has a relatively simple inflectional system
• Only nouns, verbs, and sometimes adjectives can
be inflected
• Number of possible inflectional affixes is quite
small.
• English nouns have only two kinds of
inflection: an affix that marks plural and an affix
that marks possessive.
• For example, many (but not all) English nouns can
either appear in the bare stem or singular form,
or take a plural suffix.
Examples
             Regular Nouns         Irregular Nouns
  -------------------------------------------------
  Singular   cat      thrush       mouse    ox
  Plural     cats     thrushes     mice     oxen
  -------------------------------------------------
• While the regular plural is spelled -s after most nouns, it is
spelled -es after words ending in -s (ibis/ibises), -z
(waltz/waltzes), -sh (thrush/thrushes), -ch (finch/finches),
and sometimes -x (box/boxes).
• Nouns ending in -y preceded by a consonant change the -y
to -i (butterfly/butterflies).
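• The regular spelling rules above are simple enough to sketch directly in Python (the irregular-noun table and function name are illustrative; a real system stores irregular forms in its lexicon):

    IRREGULAR_PLURALS = {"mouse": "mice", "ox": "oxen", "goose": "geese"}
    VOWELS = "aeiou"

    def pluralize(noun):
        if noun in IRREGULAR_PLURALS:                      # lexicon lookup
            return IRREGULAR_PLURALS[noun]
        if noun.endswith(("s", "z", "sh", "ch", "x")):     # ibis -> ibises, box -> boxes
            return noun + "es"
        if noun.endswith("y") and noun[-2] not in VOWELS:  # butterfly -> butterflies
            return noun[:-1] + "ies"
        return noun + "s"                                  # cat -> cats

    for n in ["cat", "thrush", "butterfly", "box", "mouse"]:
        print(n, "->", pluralize(n))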
Possessive Suffix
• The possessive suffix is realized by apostrophe
+ -s for regular singular nouns (llama’s)
• Plural nouns not ending in -s (children’s)
• A lone apostrophe after regular plural nouns
(llamas’) and some names ending in -s or -z
(Euripides’ comedies).
Three Kinds of Verbs
• English verbal inflection is more complicated
than nominal inflection.
• First, English has three kinds of verbs; main
verbs, (eat, sleep, impeach), modal verbs (can,
will, should), and primary verbs (be, have, do).
• Of these verbs a large class are regular, that is
to say all verbs of this class have the same
endings marking the same functions.
Four Morphological Forms
Regular Verbs
• These verbs are called regular because just by knowing the stem
we can predict the other forms by adding one of three
predictable endings and making some regular spelling changes
• These regular verbs and forms are significant in the morphology
of English
– Because they cover a majority of the verbs
– Because the regular class is productive.
• A productive class is one that automatically includes any new
words that enter the language.
• For example the recently-created verb fax (My mom faxed me
the note from cousin Everett) takes the regular endings -ed,
-ing, -es.
• The –s form is spelled faxes rather than faxs
Irregular Verbs & Preterite
• The irregular verbs are those that have some more or
less idiosyncratic forms of inflection.
• Irregular verbs in English often have five different forms
• While constituting a much smaller class of verbs (Quirk
et al. (1985) estimate there are only about 250 irregular
verbs, not counting auxiliaries)
• This class includes most of the very frequent verbs of the
language
• An irregular verb can inflect in the past form (also called
the preterite) by changing its vowel (eat/ate), or its
vowel and some consonants (catch/caught), or with no
change at all (cut/cut).
Examples
• The -s form is used in the “habitual present” form to distinguish the third-person singular
ending (She jogs every Tuesday) from the other choices of person and number
(I/you/we/they jog every Tuesday).
• The stem form is used in the infinitive form, and also after certain other verbs (I’d rather
walk home, I want to walk home).
• The -ing participle is used in the progressive construction to mark present or ongoing
activity (It is raining), or when the verb is treated as a noun
• This particular kind of nominal use of a verb is called a gerund use
• Eg: Fishing is fine if you live near water.
• The -ed/-en participle is used in the perfect construction (He’s eaten lunch already)
and in the passive construction (The verdict was overturned yesterday).
Derivational Morphology
• English inflection is relatively simple compared to other languages
• Derivation in English is quite complex.
• Derivation is the combination of a word stem with a grammatical morpheme,
usually resulting in a word of a different class, often with a meaning hard to
predict exactly.
• Nominalization – A common kind of derivation in English is the formation of
new nouns, often from verbs or adjectives.
• For example, the suffix -ation produces nouns from verbs ending often in the
suffix -ize (computerize → computerization).
• Here are examples of some particularly productive English nominalizing suffixes.
Contd...
• Adjectives can also be derived from nouns and verbs.
• Examples of a few suffixes deriving adjectives from nouns or verbs:
• Derivation in English is more complex than inflection for a number of reasons.
– It is generally less productive; even a nominalizing suffix like -ation, which
can be added to almost any verb ending in -ize, cannot be added to
absolutely every verb. Thus we can’t say *eatation or *spellation (we use an
asterisk (*) to mark “non-examples” of English).
– There are subtle and complex meaning differences among nominalizing
suffixes. For example sincerity has a subtle difference in meaning from
sincereness.
Cliticization
• A clitic is a unit whose status lies in between that
of an affix and a word.
• Phonological behavior of clitics is like affixes; they
tend to be short and unaccented.
• Syntactic behavior is more like words, often acting
as pronouns, articles, conjunctions, or verbs.
• Clitics preceding a word are called proclitics, while
those following are enclitics.
English clitics include these
auxiliary verbal forms:
Non-concatenative Morphology
• The kind of morphology, in which a word is composed of a string of morphemes
concatenated together is often called Concatenative Morphology
• A number of languages have extensive non-concatenative morphology, in which
morphemes are combined in more complex ways.
• The Tagalog infixation example above is one example of non-concatenative morphology,
since two morphemes (hingi and um) are intermingled.
• Another kind of non-concatenative morphology is called templatic morphology or
root-and-pattern morphology.
• This is very common in Arabic, Hebrew, and other Semitic languages.
• In Hebrew, for example, a verb is constructed using two components: a root, consisting
usually of three consonants (CCC) and carrying the basic meaning, and a template, which
gives the ordering of consonants and vowels and specifies more semantic information
about the resulting verb, such as the semantic voice (e.g., active, passive, middle).
• For example the Hebrew tri-consonantal root lmd, meaning ‘learn’ or ‘study’, can be
combined with the active voice CaCaC template to produce the word lamad, ‘he studied’,
or the intensive CiCeC template to produce the word limed, ‘he taught’, or the intensive
passive template CuCaC to produce the word lumad, ‘he was taught’.
• Arabic and Hebrew combine this templatic morphology with concatenative morphology
Agreement
• As the plural morpheme introduced above shows, plural is marked on both nouns and verbs
in English.
• The subject noun and the main verb in English have to agree in number, meaning that
the two must either be both singular or both plural.
• For example nouns, adjectives and sometimes verbs in many languages are marked for
gender.
• A gender is a kind of equivalence class that is used by the language to categorize the
nouns; each noun falls into one class.
• Many languages (for example Romance languages like French, Spanish, or Italian) have 2
genders, which are referred to as masculine and feminine.
• Germanic and Slavic languages have three genders (masculine, feminine, neuter).
• Some languages, for example the Bantu languages of Africa, have as many as 20 genders.
• When the number of classes is very large, we often refer to them as noun classes instead
of genders.
• Gender is sometimes marked explicitly on a noun;
• For example Spanish masculine words often end in -o and feminine words in -a.
• But in many cases the gender is not marked in the letters or phones of the noun itself.
• Instead, it is a property of the word that must be stored in a lexicon.
Finite-State Morphological Parsing
The goal is to take input forms like those in the first and third columns of
Fig. 3.2 and produce output forms like those in the second and
fourth columns.
Contd...
• The second column contains the stem of each word as well as
assorted morphological features.
• These features specify additional information about the stem.
• For example the feature +N means that the word is a noun; +Sg
means it is singular, +Pl that it is plural.
• Spanish has some features that don’t occur in English; for
example the nouns lugar and pavo are marked +Masc
(masculine).
• Because Spanish nouns agree in gender with adjectives, knowing
the gender of a noun will be important for tagging and parsing.
• Note that some of the input forms (like caught, goose, canto, or
vino) will be ambiguous between different morphological parses.
Building a Morphological Parser
• In order to build a morphological parser, we’ll need at least the
following:
• 1. lexicon: the list of stems and affixes, together with basic
information about them (whether a stem is a Noun stem or a Verb
stem, etc.).
• 2. morphotactics: the model of morpheme ordering that explains
which classes of morphemes can follow other classes of morphemes
inside a word.
• For example, the fact that the English plural morpheme follows the
noun rather than preceding it is a morphotactic fact.
• 3. orthographic rules: these spelling rules are used to model the
changes that occur in a word, usually when two morphemes
combine (e.g., the y→ie spelling rule discussed above that changes
city + -s to cities rather than citys).
Building a Finite-State Lexicon
• A lexicon is a repository for words.
• The simplest possible lexicon would consist of
an explicit list of every word of the language
(every word, i.e., including abbreviations
(“AAA”) and proper names (“Jane” or “Beijing”))
as follows:
a, AAA, AA, Aachen, aardvark, aardwolf, aba,
abaca, aback, ...
• Since it is often inconvenient or impossible to list every word in the language,
computational lexicons are usually structured with a list of each of the stems and
affixes of the language together with a representation of the
morphotactics that tells us how they can fit together.
• There are many ways to model morphotactics; one of the most
common is the finite-state automaton.
• A very simple finite-state model for English nominal inflection
might look like Fig. 3.3.
Contd...
• The FSA in Fig. 3.3 assumes that the lexicon includes
regular nouns (reg-noun) that take the regular -s plural
(e.g., cat, dog, fox, aardvark).
• These are the vast majority of English nouns; for
now we will ignore the fact that the plural of words like
fox has an inserted e: foxes.
• The lexicon also includes irregular noun forms that don’t
take -s, both singular irreg-sg-noun (goose, mouse) and
plural irreg-pl-noun (geese, mice).
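• A word-level Python sketch of the same morphotactics (the sublexicon contents are the examples above; the logic mirrors the arcs of Fig. 3.3 rather than reproducing its exact state layout, and, like the figure, it deliberately ignores e-insertion, so it accepts foxs but not foxes):

    REG_NOUN = {"cat", "dog", "fox", "aardvark"}
    IRREG_SG_NOUN = {"goose", "mouse"}
    IRREG_PL_NOUN = {"geese", "mice"}

    def accepts(word):
        if word in IRREG_SG_NOUN or word in IRREG_PL_NOUN:
            return True                          # irregular forms listed directly
        if word in REG_NOUN:
            return True                          # bare singular regular noun
        return word.endswith("s") and word[:-1] in REG_NOUN   # regular noun + plural -s

    for w in ["cats", "goose", "geese", "gooses", "foxs", "foxes"]:
        print(w, accepts(w))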
A finite-state automaton for English verbal
inflection
English derivational morphology
• English derivational morphology is significantly more complex than English
inflectional morphology
• So automata for modeling English derivation tend to be quite complex.
• Some models of English derivation, in fact, are based on the more complex
context-free grammars of Ch. 12 (Sproat, 1993).
• Consider a relatively simpler case of derivation: the morphotactics of English
adjectives.
• Here are some examples from Antworth (1990):
• big, bigger, biggest
• cool, cooler, coolest, coolly
• happy, happier, happiest, happily
• red, redder, reddest
• unhappy, unhappier, unhappiest, unhappily
• real, unreal, really
• clear, clearer, clearest, clearly, unclear, unclearly
An initial hypothesis might be that adjectives can have an optional prefix (un-), an
obligatory root (big, cool, etc.) and an optional suffix (-er, -est, or -ly). This might
suggest the FSA in Fig. 3.5.

• This FSA will recognize all the adjectives in the table above.
• However, it will also incorrectly recognize ungrammatical forms like unbig, unfast, oranger,
or smally.
• Need to set up classes of roots and specify their possible suffixes.
• Thus adj-root1 would include adjectives that can occur with un- and -ly (clear,
happy, and real) while adj-root2 will include adjectives that can’t (big, small),
and so on.
An FSA for another fragment of English derivational morphology
• This FSA models a number of derivational facts, such as the well
known generalization that any verb ending in -ize can be followed
by the nominalizing suffix-ation (Bauer, 1983; Sproat, 1993).
• For example, from the word fossilize we can predict the word fossilization by
following states q0, q1, and q2.
• Similarly, adjectives ending in -al or -able at q5 (equal, formal,
realizable) can take the suffix -ity, or sometimes the suffix -ness to
state q6 (naturalness, casualness).
Morphological Recognition
• We can now use these FSAs to solve the problem of morphological
recognition
• Determining whether an input string of letters makes up a legitimate
English word or not.
• We do this by taking the morphotactic FSAs, and plugging in each
“sublexicon” into the FSA.
• We expand each arc (e.g., the reg-noun-stem arc) with all the
morphemes that make up the set of reg-noun-stem.
• The resulting FSA can then be defined at the level of the individual
letter.
Expanded FSA for a few English nouns with their
inflection.
Finite-State Transducers
• A transducer maps between one representation
and another
• FST is a type of finite automaton which maps
between two sets of symbols.
• We can visualize an FST as a two-tape automaton
which recognizes or generates pairs of strings.
• We can do this by labeling each arc in the
finite-state machine with two symbol strings, one
from each tape.
FST Vs FSA
• The FST has a more general function than an FSA
• FSA defines a formal language by defining a set of strings
• FST defines a relation between sets of strings.
• FST is as a machine that reads one string and generates another.
• Here’s a summary of this four-fold way of thinking about transducers:
• FST as recognizer: a transducer that takes a pair of strings as input and outputs
accept if the string-pair is in the string-pair language, and reject if it is not.
• FST as generator: a machine that outputs pairs of strings of the language.
Thus the output is a yes or no, and a pair of output strings.
• FST as translator: a machine that reads a string and outputs another string
• FST as set relater: a machine that computes relations between sets.
• All of these have applications in speech and language processing.
• For morphological parsing (and for many other NLP applications), we will apply
the FST as translator metaphor, taking as input a string of letters and producing
as output a string of morphemes.
An FST can be formally defined with 7 parameters:
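• The parameter list itself did not survive the conversion of these slides; following the standard textbook definition, reproduced here as a reconstruction, the seven parameters are:
• Q: a finite set of N states q0, q1, ..., qN−1
• Σ: a finite set corresponding to the input alphabet
• Δ: a finite set corresponding to the output alphabet
• q0 ∈ Q: the start state
• F ⊆ Q: the set of final states
• δ(q, w): the transition function between states; given a state q ∈ Q and a string w ∈ Σ*, δ(q, w) returns a set of new states Q′ ⊆ Q
• σ(q, w): the output function; given a state q ∈ Q and a string w ∈ Σ*, σ(q, w) returns a set of output strings, each a string o ∈ Δ*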
Regular Relations
• Where FSAs are isomorphic to regular languages, FSTs are
isomorphic to regular relations.
• Regular relations are sets of pairs of strings, a natural
extension of the regular languages, which
are sets of strings.
• Like FSAs and regular languages, FSTs and regular relations
are closed under union
• In general they are not closed under difference,
complementation and intersection
• Some useful subclasses of FSTs are closed under these
operations
• In general, FSTs that are not augmented with the ε (epsilon) are more
likely to have such closure properties
Inversion and Composition
Inversion Vs. Composition
• Inversion is useful because it makes it easy to
convert a FST-as-parser into an FST-as-generator.
• Composition is useful because it allows us to take
two transducers that run in series and replace
them with one more complex transducer.
• Composition works as in algebra; applying T1 ◦ T2
to an input sequence S is identical to applying T1
to S and then T2 to the result; thus T1 ◦ T2(S) =
T2(T1(S)).
Fig. 3.9, for example, shows the composition of [a:b]+
with [b:c]+ to produce
[a:c]+.
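• Composition can be illustrated with character-level mappings in Python (the dicts stand in for the single-arc transducers [a:b] and [b:c]; this is a toy illustration, not a general FST composition algorithm):

    T1 = {"a": "b"}   # [a:b]
    T2 = {"b": "c"}   # [b:c]

    def compose(t1, t2):
        # T1 o T2 applies T1 first, then T2 to its output
        return {x: t2[y] for x, y in t1.items() if y in t2}

    T3 = compose(T1, T2)                       # {"a": "c"}, i.e. [a:c]
    print("".join(T3[ch] for ch in "aaa"))     # ccc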
Projection
• The projection of an FST is the FSA that is
produced by extracting only one side
of the relation.
• We can refer to the projection to the left or
upper side of the relation as the upper or first
projection and the projection to the lower or
right side of the relation as the lower or second
projection.
Sequential Transducers and Determinism
• Transducers as we have described them may be
nondeterministic, in that a given input may translate
to many possible output symbols.
• Thus using general FSTs requires the kinds of search
algorithms used for non-deterministic automata, making
FSTs quite slow in the general case.
• This suggests that it would be nice to have an algorithm
to convert a nondeterministic FST to a deterministic
one.
• But while every non-deterministic FSA is equivalent
to some deterministic FSA, not all finite-state
transducers can be determinized.
Contd...
• Sequential transducers, by contrast, are a subtype of
transducers that are deterministic on their input.
• At any state of a sequential transducer, each given symbol
of the input alphabet Σ can label at most one transition
out of that state.
• Fig. 3.10 gives an example of a sequential transducer
from Mohri (1997); note that here, unlike the transducer
in Fig. 3.8, the transitions out of each state are
deterministic based on the state and the input symbol.
• Sequential transducers can have epsilon symbols in the
output string, but not on the input.
A Sequential Finite-State
Transducer, from Mohri
Contd...
SUBSEQUENTIAL TRANSDUCER
• The subsequential transducer generates an additional output
string at the final states, concatenating it onto the output
produced so far (Schützenberger, 1977).
• What makes sequential and subsequential transducers important
is their efficiency
• Because they are deterministic on input, they can be processed in
time proportional to the number of symbols in the input (they are
linear in their input length) rather than proportional to some much
larger number which is a function of the number of states.
• Another advantage of subsequential transducers is that there exist
efficient algorithms for their determinization (Mohri, 1997) and
minimization (Mohri, 2000), extending the algorithms for
determinization and minimization of finite-state automata.
Contd...
• While both sequential and subsequential transducers are
deterministic and efficient, neither of them is able to handle
ambiguity, since they transduce each input string to exactly one
possible output string.
• Since ambiguity is a crucial property of natural language, it will be
useful to have an extension of subsequential transducers that can
deal with ambiguity, but still retain the efficiency and other useful
properties of sequential transducers.
• One such generalization of subsequential transducers is the
p-subsequential transducer.
• A p-subsequential transducer allows for p(p ≥ 1) final output
strings to be associated with each final state (Mohri, 1996).
• They can thus handle a finite amount of ambiguity, which is useful
for many NLP tasks.
> Mohri (1996, 1997) show a number of tasks whose ambiguity
can be limited in this way, including the representation of
dictionaries, the compilation of morphological and phonological
rules, and local syntactic constraints.
> For each of these kinds of problems, he and others have shown
that they are p-subsequentializable, and thus can be
determinized and minimized.
> This class of transducers includes many, although not
necessarily all, morphological rules.
FSTs for Morphological Parsing
• Given the input cats, for instance, we’d like to output cat +N +Pl,
telling us that cat is a plural noun.
• In the finite-state morphology paradigm, we represent a word as a
correspondence between a lexical level and a surface level
• Lexical Level represents a concatenation of morphemes making
up a word
• Surface level represents the concatenation of letters which make
up the actual spelling of the word.
Fig. 3.12 shows these two levels for (English)
cats.
Contd...
• For finite-state morphology it’s convenient to view an FST as having two
tapes.
• The upper or lexical tape, is composed from characters from one
alphabet Σ.
• The lower or surface tape, is composed of characters from another
alphabet Δ.
• In the two-level morphology of Koskenniemi (1983), we allow each arc only
to have a single symbol from each alphabet.
• We can then combine the two symbol alphabets Σ and Δ to create a new
alphabet, Σ′, which makes the relationship to FSAs quite clear.
• Σ′ is a finite alphabet of complex symbols.
• Each complex symbol is composed of an input-output pair i : o; one
symbol i from the input alphabet Σ, and one symbol o from an output
alphabet Δ, thus Σ′ ⊆ Σ × Δ.
• Σ and Δ may each include the epsilon symbol ε.
Sheep Language
• An FSA accepts a language stated over a finite
alphabet of single symbols, such as the
alphabet of our sheep language:
• Σ = {b,a, !} (3.2)
• FST defined this way accepts a language stated
over pairs of symbols, as in:
• Σ′ = {a:a, b:b, !:!, a:!, a:ε, ε:!} (3.3)
Feasible Pairs & Default Pairs
• In two-level morphology, the pair of symbols in Σ′ are
also called feasible pairs.
• Each feasible pair symbol a : b in the transducer alphabet
Σ′ expresses how the symbol a from one tape is mapped
to the symbol b on the other tape.
• For example a:ε means that an a on the upper tape will
correspond to nothing on the lower tape.
• Just as for an FSA, we can write regular expressions in
the complex alphabet Σ′.
• Since it’s most common for symbols to map to
themselves, in two-level morphology we call pairs like
a:a default pairs and refer to them by the single letter a.
A Schematic Transducer
A Fleshed-out English Nominal
Inflection
Intermediate tapes
Transducers and Orthographic Rules
• Concatenating the morphemes won’t work for cases
where there is a spelling change
• It would incorrectly reject an input like foxes and accept
an input like foxs.
• English often requires spelling changes at morpheme
boundaries by introducing Spelling rules (i.e orthographic
rules)
• A number of notations are available for writing such
rules and to implement the rules as transducers.
• The ability to implement rules as a transducer turns out
to be useful throughout speech and language processing.
Spelling Rules
Chomsky and Halle Rule
Example
Combining FST Lexicon and Rules
• Fig. 3.19 shows the architecture of a two-level morphology
system, used for parsing or generating.
• The lexicon transducer maps between the lexical level, with its
stems and morphological features, and an intermediate level that
represents a simple concatenation of morphemes.
• Then a host of transducers, each representing a single spelling
rule constraint, all run in parallel so as to map between this
intermediate level and the surface level.
• Putting all the spelling rules in parallel is a design choice; we could
also have chosen to run all the spelling rules in series (as a long
cascade), if we slightly changed each rule.
Cascading
• The architecture in Fig. 3.19 is a two-level cascade of transducers.
• Cascading two automata means running them in series with the
output of the first feeding the input to the second.
• Cascades can be of arbitrary depth, and each level might be built
out of many individual transducers.
• The cascade in Fig. 3.19 has two transducers in series: the
transducer mapping from the lexical to the intermediate levels, and
the collection of parallel transducers mapping from the
intermediate to the surface level.
• The cascade can be run top-down to generate a string, or
bottom-up to parse it
• Fig. 3.20 shows a trace of the system accepting the mapping from
fox +N +PL to foxes.
Generating or Parsing with FST lexicon and rules
Ambiguity & Disambiguating
• Parsing can be slightly more complicated than generation, because of the
problem of ambiguity.
• For example, foxes can also be a verb (albeit a rare one, meaning “to baffle or
confuse”)
• The lexical parse for foxes could be fox +V +3Sg as well as fox +N +PL.
• How are we to know which one is the proper parse?
• In fact, for ambiguous cases of this sort, the transducer is not capable of
deciding.
• Disambiguating will require some external evidence such as the surrounding
words.
• Foxes is likely to be a noun in the sequence “I saw two foxes yesterday”, but a
verb in the sequence “That trickster foxes me every time!”
• Barring such external evidence, the best our transducer can do is just
enumerate the possible choices; so we can transduce foxˆs# into both fox +V
+3SG and fox +N +PL.
Automaton Intersection
• Transducers in parallel can be combined by automaton
intersection.
• The automaton intersection algorithm just takes the
Cartesian product of the states: for each state qi in
machine 1 and state qj in machine 2, we create a new
state qij.
• Then for any input symbol a, if machine 1 would
transition to state qn and machine 2 would transition to
state qm, we transition to state qnm.
• Fig. 3.21 sketches how this intersection (∧) and
composition (◦) process might be carried out.
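• A Python sketch of the Cartesian-product construction just described, for machines given as transition dicts (state and machine names are illustrative):

    from itertools import product

    def intersect(delta1, finals1, delta2, finals2, states1, states2, alphabet):
        # New states are pairs (q1, q2); on symbol a the pair moves to
        # (delta1[q1, a], delta2[q2, a]) when both machines have that arc.
        delta = {}
        for q1, q2, a in product(states1, states2, alphabet):
            if (q1, a) in delta1 and (q2, a) in delta2:
                delta[((q1, q2), a)] = (delta1[(q1, a)], delta2[(q2, a)])
        finals = {(q1, q2) for q1 in finals1 for q2 in finals2}
        return delta, finals

    # Machine 1 accepts an even number of a's, machine 2 at least two a's;
    # their intersection accepts even-length strings of a's of length >= 2.
    d1 = {("even", "a"): "odd", ("odd", "a"): "even"}
    d2 = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 2}
    delta, finals = intersect(d1, {"even"}, d2, {2}, {"even", "odd"}, {0, 1, 2}, {"a"})
    print(("even", 2) in finals)   # True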
Intersection and Composition of
Transducers
Lexicon-Free FSTs: The Porter Stemmer
• While building a transducer from a lexicon plus rules is the
standard algorithm for morphological parsing, there are simpler
algorithms that don’t require the large on-line lexicon demanded
by this algorithm.
• These are used especially in Information Retrieval (IR) tasks like
web search, in which a query such as a Boolean combination of
relevant keywords or phrases, e.g., (marsupial OR kangaroo OR
koala) returns documents that have these words in them.
• Since a document with the word marsupials might not match the
keyword marsupial, some IR systems first run a stemmer on the
query and document words.
• Morphological information in IR is thus only used to determine
that two words have the same stem; the suffixes are thrown away.
Stemming
• One of the most widely used such stemming
algorithms is the simple and efficient Porter (1980)
algorithm, which is based
on a series of simple cascaded rewrite rules.
• Cascaded rewrite rules are just the sort of thing
that could be easily implemented as an FST
• The Porter algorithm also can be viewed as a
lexicon-free FST stemmer
The algorithm contains a series of
rules like these:
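• The rule list itself did not survive the conversion; the textbook’s examples include rules such as ATIONAL → ATE (relational → relate) and ING → ε when the stem contains a vowel (motoring → motor). A few such rules can be sketched as cascaded regular-expression rewrites in Python (only a fragment, not the full Porter algorithm, which has several ordered rule groups and extra conditions on the stem):

    import re

    def porter_like_stem(word):
        w = word.lower()
        w = re.sub(r"sses$", "ss", w)           # grasses    -> grass
        w = re.sub(r"ational$", "ate", w)       # relational -> relate
        if re.search(r"[aeiou].*ing$", w):      # ING -> empty only if the stem has a vowel
            w = re.sub(r"ing$", "", w)          # motoring   -> motor
        return w

    for w in ["grasses", "relational", "motoring", "king"]:
        print(w, "->", porter_like_stem(w))     # king is unchanged: stem "k" has no vowel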
Lexicon Based Morphological
Parsers
Word and Sentence Tokenization
• Segmenting running text into words and sentences –
tokenization.
• Consider the following sentences from a Wall Street
Journal and New York Times article, respectively:
• Mr. Sherwood said reaction to Sea Containers’ proposal
has been "very positive."
• In New York Stock Exchange composite trading
yesterday, Sea Containers closed at $62.625, up 62.5
cents.
• ‘‘I said, ‘what’re you? Crazy?’ ’’ said Sadowsky.
• ‘‘I can’t afford to do that.’’
• Segmenting purely on white-space would produce
words like these: cents. said, positive, Crazy?
SENTENCE SEGMENTATION
• Sentence segmentation is a crucial first step in text
processing.
• Segmenting a text into sentences is generally based on
punctuation.
• Certain kinds of punctuation (periods, question marks, exclamation
points) tend to mark sentence boundaries.
• Question marks and exclamation points are relatively unambiguous
markers of sentence boundaries.
• Periods, on the other hand, are more ambiguous.
• The period character ‘.’ is ambiguous between a sentence boundary
marker and a marker of abbreviations like Mr. or Inc.
• The previous sentence that you just read showed an even more
complex case of this ambiguity, in which the final period of Inc.
marked both an abbreviation and the sentence boundary marker.
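• A rough rule-based sentence splitter in Python illustrating the period ambiguity (the abbreviation list and splitting heuristic are illustrative; practical systems typically use machine-learned sentence-boundary classifiers):

    ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Inc.", "St."}

    def split_sentences(text):
        sentences, current = [], []
        for token in text.split():
            current.append(token)
            ends_sentence = token.endswith(("?", "!")) or (
                token.endswith(".") and token not in ABBREVIATIONS)
            if ends_sentence:
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(split_sentences("Mr. Sherwood said reaction has been positive. "
                          "Sea Containers closed at $62.625, up 62.5 cents."))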
Detecting and Correcting Spelling Errors
• The detection and correction of spelling errors is
an integral part of modern word-processors and
search engines.
• Spelling correction is also important for optical character
recognition (OCR), the automatic recognition
of machine or hand-printed characters, and for
on-line handwriting recognition, the recognition
of human printed or cursive handwriting as the
user is writing.
Following Kukich (1992), we can distinguish three
increasingly broader problems:
1. non-word error detection: detecting spelling errors that result in
non-words (like graffe for giraffe).
2. isolated-word error correction: correcting spelling errors that
result in nonwords, for example correcting graffe to giraffe, but
looking only at the word in isolation.
3. context-dependent error detection and correction: using the
context to help detect and correct spelling errors even if they
accidentally result in an actual word of English (real-word errors).
• Real-word errors can arise from typographical errors
(insertion, deletion, transposition) which accidentally produce a
real word (e.g., there for three), or because the writer substituted
the wrong spelling of a homophone or near-homophone (e.g.,
dessert for desert, or piece for peace).
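• Non-word error detection (problem 1) reduces to dictionary lookup, which is easy to sketch in Python; isolated-word correction (problem 2) would then rank in-dictionary candidates, for example by the minimum edit distance introduced next (the tiny word list is illustrative):

    DICTIONARY = {"giraffe", "there", "three", "desert", "dessert", "piece", "peace"}

    def nonword_errors(tokens):
        # Flag every token that is not in the dictionary.
        return [t for t in tokens if t.lower() not in DICTIONARY]

    print(nonword_errors(["graffe", "there", "dessert"]))   # ['graffe']

• Note that a real-word error such as there typed for three passes this check, which is exactly why the context-dependent case (problem 3) is harder.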
Minimum Edit Distance
• Deciding which of two words is closer to some third word in
spelling is a special case of the general problem of string distance.
• The distance between two strings is a measure of how alike two
strings are to each other.
• Many important algorithms for finding string distance rely on
some version of the minimum edit distance algorithm, named by
Wagner and Fischer (1974) but independently discovered by many
people.
• The minimum edit distance between two strings is the minimum
number of editing operations (insertion, deletion, substitution)
needed to transform one string into another.
• For example the gap between the words intention and execution
is five operations, shown in Fig. 3.23 as an alignment between the
two strings.
EXAMPLE
• Given two sequences, an alignment is a
correspondence between substrings of the
two sequences.
• Thus I aligns with the empty string, N with E, T
with X, and so on.
• Beneath the aligned strings is another
representation; a series of symbols expressing
an operation list for converting the top string
into the bottom string; d for deletion, s for
substitution, i for insertion.
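• The alignment itself did not survive the conversion of these slides; the standard textbook alignment for intention/execution, reproduced here as a reconstruction, is:

    I N T E * N T I O N
    | | | | | | | | | |
    * E X E C U T I O N
    d s s   i s

• Reading the operation row: delete I, substitute N with E, substitute T with X, leave E unchanged, insert C, substitute N with U, and leave T, I, O, N unchanged, for a total of five operations.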
Representing Minimum Edit
Distance
Levenshtein distance
• We can also assign a particular cost or weight to each of these
operations.
• The Levenshtein distance between two sequences is the simplest
weighting factor in which each of the three operations has a cost
of 1 (Levenshtein, 1966).
• Thus the Levenshtein distance between intention and execution
is 5.
• Levenshtein also proposed an alternate version of his metric in
which each insertion or deletion has a cost of one, and
substitutions are not allowed (equivalent to allowing
substitution, but giving each substitution a cost of 2, since any
substitution can be represented by one insertion and one
deletion).
• Using this version, the Levenshtein distance between intention
and execution is 8.
Dynamic Programming
• The minimum edit distance is computed by dynamic
programming.
• Dynamic programming is the name for a class of algorithms, first
introduced by Bellman (1957), that apply a table-driven method
to solve problems by combining solutions to subproblems.
• This class of algorithms includes the most commonly-used
algorithms in speech and language processing; besides minimum
edit distance, these include the Viterbi and forward algorithms
and the CYK and Earley algorithms.
• The intuition of a dynamic programming problem is that a large
problem can be solved by properly combining the solutions to
various sub problems.
For example, consider the sequence or “path” of transformed words
that comprise the minimum edit distance between the strings
intention and execution shown in Fig. 3.24.
Contd...
• Dynamic programming algorithms for sequence comparison
work by creating a distance matrix with one column for each
symbol in the target sequence and one row for each symbol in
the source sequence (i.e., target along the bottom, source along
the side).
• For minimum edit distance, this matrix is the edit-distance
matrix.
• Each cell edit-distance[i,j] contains the distance between the
first i characters of the target and the first j characters of the
source.
• Each cell can be computed as a simple function of the
surrounding cells; thus starting from the beginning of the matrix
it is possible to fill in every entry.
The value in each cell is computed by taking the
minimum of the three possible paths through the
matrix which arrive there:
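• The recurrence can be implemented directly in Python (this sketch uses costs of 1 for insertion, deletion, and substitution; changing the substitution cost to 2 gives the alternate Levenshtein variant, returning 8 instead of 5 for intention/execution):

    def min_edit_distance(source, target):
        n, m = len(source), len(target)
        # dist[i][j] = distance between source[:i] and target[:j]
        dist = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            dist[i][0] = i                      # delete all of source[:i]
        for j in range(1, m + 1):
            dist[0][j] = j                      # insert all of target[:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if source[i - 1] == target[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + sub)  # substitution or match
        return dist[n][m]

    print(min_edit_distance("intention", "execution"))   # 5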
Minimum Edit Distance
Computation of Minimum Edit Distance
Schematic of Back Pointers
