
UNIT V APPLICATIONS 9

AI applications – Language Models – Information Retrieval – Information Extraction – Natural Language Processing – Machine Translation – Speech Recognition – Robot – Hardware – Perception – Planning – Moving

There are over a trillion pages of information on the Web, almost all of it in natural language. An agent that wants to do knowledge acquisition needs to understand (at least partially) the ambiguous, messy languages that humans use. We examine the problem from the point of view of specific information-seeking tasks: text classification, information retrieval, and information extraction. One common factor in addressing these tasks is the use of language models: models that predict the probability distribution of language expressions.

LANGUAGE MODELS:
Formal languages, such as the programming languages Java or Python, have precisely defined language models. A language can be defined as a set of strings; "print(2 + 2)" is a legal program in the language Python, whereas "2)+(2 print" is not.

Natural languages, such as English or Spanish, cannot be characterized as a definitive set of sentences. Natural languages are also ambiguous: a single string can have more than one meaning. Finally, natural languages are difficult to deal with because they are very large and constantly changing. Thus, our language models are, at best, an approximation.

N-gram character models

A written text is composed of characters: letters, digits, punctuation, and spaces in English. Thus, one of the simplest language models is a probability distribution over sequences of characters. We write P(c1:N) for the probability of a sequence of N characters, c1 through cN. In one Web collection, P("the") = 0.027 and P("zgq") = 0.000000002. A sequence of written symbols of length n is called an n-gram, with special cases "unigram" for 1-gram, "bigram" for 2-gram, and "trigram" for 3-gram.

A model of the probability distribution of n-letter sequences is thus called an n-gram model. An n-gram model is defined as a Markov chain of order n − 1. In a Markov chain, the probability of character ci depends only on the immediately preceding characters, not on any other characters. So in a trigram model

P(ci | c1:i−1) = P(ci | ci−2:i−1).

We can define the probability of a sequence of characters P(c1:N) under the trigram model by first factoring with the chain rule and then using the Markov assumption:

P(c1:N) = ∏ i=1..N P(ci | c1:i−1) = ∏ i=1..N P(ci | ci−2:i−1).

One approach to language identification is to first build a trigram character model of each candidate language, P(ci | ci−2:i−1, L), where the variable L ranges over languages. For each language, the model is built by counting trigrams in a corpus of that language.
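As a minimal sketch, assuming smoothed trigram probability tables have already been built for each candidate language (the model layout, names, and floor value below are illustrative), language identification can then be done by scoring the text under each model and picking the best:

import math

def sequence_log_prob(text, trigram_probs, floor=1e-9):
    # Log-probability of a character sequence under one language's trigram model.
    logp = 0.0
    for i in range(2, len(text)):
        context, char = text[i-2:i], text[i]
        logp += math.log(trigram_probs.get(context, {}).get(char, floor))
    return logp

def identify_language(text, models):
    # Return the language whose trigram model assigns the highest probability.
    return max(models, key=lambda lang: sequence_log_prob(text, models[lang]))

# Usage with hypothetical models:
# models = {"english": english_trigrams, "spanish": spanish_trigrams}
# print(identify_language("hello there general", models))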

Smoothing n-gram models

The major complication of n-gram models is that the training corpus provides only an estimate of the true probability distribution. For common character sequences such as " th", any English corpus will give a good estimate: about 1.5% of all trigrams. On the other hand, " ht" is very uncommon: no dictionary word starts with ht. It is likely that the sequence would have a count of zero in a training corpus of standard English. Does that mean we should assign P(" ht") = 0?

If we did, then the text "The program issues an http request" would have an English probability of zero, which seems wrong. We have a problem in generalization: we want our language models to generalize well to texts they haven't seen yet. Just because we have never seen " http" before does not mean that our model should claim that it is impossible. Thus, we will adjust our language model so that sequences that have a count of zero in the training corpus will be assigned a small nonzero probability (and the other counts will be adjusted downward slightly so that the probabilities still sum to 1). The process of adjusting the probability of low-frequency counts is called smoothing.

Linear interpolation smoothing is a backoff model that combines trigram, bigram, and unigram

models by linear interpolation. It defines the probability estimate as

P*(ci | ci−2:i−1) = λ3 P(ci | ci−2:i−1) + λ2 P(ci | ci−1) + λ1 P(ci), where λ1 + λ2 + λ3 = 1.
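As a minimal sketch, assuming the unigram, bigram, and trigram probability tables already exist (the table layout and λ values below are illustrative defaults, not learned weights), the interpolated estimate can be computed as follows:

def interpolated_prob(c, context, unigram, bigram, trigram,
                      lambdas=(0.1, 0.3, 0.6)):
    # P*(c | context) as a weighted mix of unigram, bigram, and trigram estimates.
    # context is the two preceding characters; the lambdas must sum to 1.
    l1, l2, l3 = lambdas
    p_uni = unigram.get(c, 0.0)
    p_bi = bigram.get(context[-1], {}).get(c, 0.0)
    p_tri = trigram.get(context, {}).get(c, 0.0)
    return l3 * p_tri + l2 * p_bi + l1 * p_uni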

Model evaluation

With so many possible n-gram models (unigram, bigram, trigram, interpolated smoothing with different values of λ, and so on), how do we choose among them? We can evaluate a model with cross-validation. Split the corpus into a training corpus and a validation corpus. Determine the parameters of the model from the training data. Then evaluate the model on the validation corpus. The evaluation can be a task-specific metric, such as measuring accuracy on language identification. Alternatively, we can have a task-independent model of language quality: calculate the probability assigned to the validation corpus by the model; the higher the probability, the better. This metric is inconvenient because the probability of a large corpus will be a very small number, and floating-point underflow becomes an issue.
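A standard workaround, sketched below, is to sum log probabilities rather than multiply raw probabilities, and to report a per-character perplexity (the exponential of the average negative log probability; lower is better). Here char_prob is assumed to be any callable returning P(c | context), for example the interpolated estimate sketched above.

import math

def corpus_log_prob(text, char_prob):
    # Sum of log P(c_i | c_{i-2:i-1}) over the validation text.
    return sum(math.log(char_prob(text[i], text[i-2:i]))
               for i in range(2, len(text)))

def perplexity(text, char_prob):
    # Per-character perplexity of the validation text under the model.
    n = max(len(text) - 2, 1)
    return math.exp(-corpus_log_prob(text, char_prob) / n)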

Information retrieval

Information retrieval is the task of finding documents that are relevant to a user’s need for
information. The best-known examples of information retrieval systems are search engines
on the World Wide Web.

An information retrieval (henceforth IR) system can be characterized by

1. A corpus of documents. Each system must decide what it wants to treat as a document: a
paragraph, a page, or a multipage text.
2. Queries posed in a query language. A query specifies what the user wants to know. The query language can be just a list of words, such as [AI book]; or it can specify a phrase of words that must be adjacent, as in ["AI book"]; it can contain Boolean operators as in [AI AND book]; it can include non-Boolean operators such as [AI NEAR book] or [AI book site:www.aaai.org].
3. A result set. This is the subset of documents that the IR system judges to be relevant to
the query. By relevant, we mean likely to be of use to the person who posed the query,
for the particular information need expressed in the query.
4. A presentation of the result set. This can be as simple as a ranked list of document titles or as complex as a rotating color map of the result set projected onto a three-dimensional space, rendered as a two-dimensional display.

The earliest IR systems worked on a Boolean keyword model. Each word in the document collection is
treated as a Boolean feature that is true of a document if the word occurs in the document and false if it
does not.

IR scoring functions

A scoring function takes a document and a query and returns a numeric score; the most relevant documents have the highest scores. In the BM25 function, the score is a linear weighted combination of scores for each of the words that make up the query. Three factors affect the weight of a query term: first, the frequency with which the term appears in a document (also known as TF for term frequency); second, the inverse document frequency of the term, or IDF (a term that appears in almost every document tells us little about relevance); and third, the length of the document (a very long document may mention the query words without actually being about the query).

The BM25 function takes all three of these into account. We assume we have created
an index of the N documents in the corpus so that we can look up TF(qi, dj), the count of
the number of times word qi appears in document dj. We also assume a table of document
frequency counts, DF(qi), that gives the number of documents that contain the word qi.
Then, given a document dj and a query consisting of the words q1:N, the standard BM25 score is

BM25(dj, q1:N) = Σ i=1..N IDF(qi) · TF(qi, dj) · (k + 1) / (TF(qi, dj) + k · (1 − b + b · |dj| / L)),

where |dj| is the length of document dj in words, L is the average document length in the corpus, and k and b are tuned parameters (typical values are around k = 2.0 and b = 0.75). IDF(qi) is the inverse document frequency of word qi, IDF(qi) = log((N − DF(qi) + 0.5) / (DF(qi) + 0.5)).
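A minimal sketch of this scoring function, assuming the index supplies the TF and DF tables described above (the parameter defaults and data layout are illustrative):

import math

def bm25_score(query_words, doc_id, tf, df, doc_len, avg_len, n_docs,
               k=2.0, b=0.75):
    # Score one document against a list of query words, per the formula above.
    score = 0.0
    for q in query_words:
        f = tf.get((q, doc_id), 0)                 # TF(q, d)
        d_f = df.get(q, 0)                         # DF(q)
        idf = math.log((n_docs - d_f + 0.5) / (d_f + 0.5))
        denom = f + k * (1 - b + b * doc_len[doc_id] / avg_len)
        score += idf * f * (k + 1) / denom
    return score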

IR system evaluation

Imagine that an IR system has returned a result set for a single query, for which we know which documents are and are not relevant, out of a corpus of 100 documents. From the counts of relevant and non-relevant documents inside and outside the result set we can compute two key measures: precision, the proportion of documents in the result set that are actually relevant, and recall, the proportion of all the relevant documents in the collection that appear in the result set.
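For illustration only (the counts below are hypothetical, not taken from the original table), the two measures are computed as follows:

true_positives = 30    # relevant documents in the result set
false_positives = 10   # non-relevant documents in the result set
false_negatives = 20   # relevant documents NOT in the result set

precision = true_positives / (true_positives + false_positives)  # 30/40 = 0.75
recall = true_positives / (true_positives + false_negatives)     # 30/50 = 0.60
print(f"precision={precision:.2f}, recall={recall:.2f}")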

IR refinements

The BM25 scoring function uses a word model that treats all words as completely independent, but we know that some words are correlated: "couch" is closely related to both "couches" and "sofa." Many IR systems attempt to account for these correlations. For example, if the query is [couch], it would be a shame to exclude from the result set those documents that mention "COUCH" or "couches" but not "couch." Most IR systems do case folding of "COUCH" to "couch," and some use a stemming algorithm to reduce "couches" to the stem form "couch," both in the query and the documents. This typically yields a small increase in recall (on the order of 2% for English). However, it can harm precision. For example, stemming "stocking" to "stock" will tend to decrease precision for queries about either foot coverings or financial instruments, although it could improve recall for queries about warehousing. Stemming algorithms based on rules (e.g., remove "-ing") cannot avoid this problem, but algorithms based on dictionaries (don't remove "-ing" if the word is already listed in the dictionary) can.
IR can be improved by considering metadata, that is, data outside of the text of the document. Examples include human-supplied keywords and publication data. On the Web, hypertext links between documents are a crucial source of information.

The PageRank algorithm

The PageRank for a page p is defined as

PR(p) = (1 − d) / N + d · Σ i PR(in_i) / C(in_i),

where N is the total number of pages in the corpus, the in_i are the pages that link in to p, C(in_i) is the count of the total number of out-links on page in_i, and d is a damping factor, typically around 0.85.
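A minimal sketch of the iterative computation of this formula (the link structure, damping factor, and iteration count below are illustrative):

def pagerank(links, d=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum PR(in_i) / C(in_i) over every page in_i that links to p.
            inlink_sum = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / n + d * inlink_sum
        pr = new_pr
    return pr

# Usage on a toy graph:
# pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]})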

The HITS algorithm

The Hyperlink-Induced Topic Search algorithm, also known as "Hubs and Authorities" or HITS, is another influential link-analysis algorithm. HITS differs from PageRank in several ways. First, it is a
query-dependent measure: it rates pages with respect to a query. That means that it must be computed
anew for each query—a computational burden that most search engines have elected not to take on.
Given a query, HITS first finds a set of pages that are relevant to the query. It does that by intersecting hit
lists of query words, and then adding pages in the link neighborhood of these pages—pages that link to or
are linked from one of the pages in the original relevant set. Each page in this set is considered an
authority on the query to the degree that other pages in the relevant set point to it. A page is considered
a hub to the degree that it points to other authoritative pages in the relevant set. Just as with PageRank,
we don’t want to merely count the number of links; we want to give more value to the high-quality hubs
and authorities. Thus, as with PageRank, we iterate a process that updates the authority score of a page to
be the sum of the hub scores of the pages that point to it, and the hub score to be the sum of the authority
scores of the pages it points to. If we then normalize the scores and repeat k times, the process will
converge.
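A minimal sketch of this iteration, assuming the query-specific relevant set and its internal link structure have already been collected (the iteration count is illustrative):

import math

def hits(links, k=20):
    # links maps each page in the relevant set to the pages it points to
    # within that set.
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority score: sum of hub scores of the pages pointing to p.
        new_auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub score: sum of authority scores of the pages p points to.
        new_hub = {p: sum(new_auth[q] for q in links[p] if q in new_auth) for p in pages}
        # Normalize so that the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth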

Question answering

Question answering is a somewhat different task, in which the query really is a question, and the answer
is not a ranked list of documents but rather a short response—a sentence, or even just a phrase.

INFORMATION EXTRACTION
Information extraction is the process of acquiring knowledge by skimming a text and looking for
occurrences of a particular class of object and for relationships among objects. A
typical task is to extract instances of addresses from Web pages, with database fields for
street, city, state, and zip code; or instances of storms from weather reports, with fields for
temperature, wind speed, and precipitation. In a limited domain, this can be done with high
accuracy. As the domain gets more general, more complex linguistic models and more complex learning
techniques are necessary.

Finite-state automata for information extraction


The simplest type of information extraction system is an attribute-based extraction system
that assumes that the entire text refers to a single object and the task is to extract attributes of
that object. For example, we mentioned in Section 12.7 the problem of extracting from the text "IBM ThinkBook 970. Our price: $399.00" the set of attributes {Manufacturer=IBM, Model=ThinkBook970, Price=$399.00}. We can address this problem by defining a template (also
known as a pattern) for each attribute we would like to extract. The template is
defined by a finite state automaton, the simplest example of which is the regular expression,
or regex. Regular expressions are used in Unix commands such as grep, in programming
languages such as Perl, and in word processors such as Microsoft Word. The details vary
slightly from one tool to another and so are best learned from the appropriate manual, but
here we show how to build up a regular expression template for prices in dollars:
[0-9] matches any digit from 0 to 9
[0-9]+ matches one or more digits
[.][0-9][0-9] matches a period followed by two digits
([.][0-9][0-9])? matches a period followed by two digits, or nothing
[$][0-9]+([.][0-9][0-9])? matches $249.99 or $1.23 or $1000000 or . . .
Templates are often defined with three parts: a prefix regex, a target regex, and a postfix regex. For prices, the target regex is as above, the prefix would look for strings such as "price:" and the postfix could be empty. The idea is that some clues about an attribute come from the attribute value itself and some come from the surrounding text.
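A minimal sketch of such a prefix-plus-target template using Python's re module (the prefix pattern and example text are illustrative):

import re

price_target = r"[$][0-9]+([.][0-9][0-9])?"
price_template = re.compile(r"(price:\s*)(" + price_target + ")", re.IGNORECASE)

text = "IBM ThinkBook 970. Our price: $399.00"
match = price_template.search(text)
if match:
    print(match.group(2))   # prints $399.00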

One step up from attribute-based extraction systems are relational extraction systems,
which deal with multiple objects and the relations among them. Thus, when these systems
see the text ―$249.99,‖ they need to determine not just that it is a price, but also which object
has that price. A typical relational-based extraction system is FASTUS, which handles news
stories about corporate mergers and acquisitions.

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.

From this story the system extracts relations describing the joint venture: the companies involved, the product (golf clubs), and the relationships among them.

A relational extraction system can be built as a series of cascaded finite-state transducers.

That is, the system consists of a series of small, efficient finite-state automata (FSAs), where
each automaton receives text as input, transduces the text into a different format, and passes
it along to the next automaton. FASTUS consists of five stages:
1. Tokenization
2. Complex-word handling
3. Basic-group handling
4. Complex-phrase handling
5. Structure merging
FASTUS’s first stage is tokenization, which segments the stream of characters into tokens
(words, numbers, and punctuation). For English, tokenization can be fairly simple; just separating
characters at white space or punctuation does a fairly good job. Some tokenizers also
deal with markup languages such as HTML, SGML, and XML.
The second stage handles complex words, including collocations such as "set up" and "joint venture," as well as proper names such as "Bridgestone Sports Co." These are recognized by a combination of lexical entries and finite-state grammar rules. For example, a company name might be recognized by the rule

CapitalizedWord+ ("Company" | "Co" | "Inc" | "Ltd")
The third stage handles basic groups, meaning noun groups and verb groups. The idea is to chunk these into units that will be managed by the later stages. We will see how to write a complex description of noun and verb phrases in Chapter 23, but here we have simple rules that only approximate the complexity of English but have the advantage of being representable by finite-state automata.

Probabilistic models for information extraction

When information extraction must be attempted from noisy or varied input, simple finite-state
approaches fare poorly. It is too hard to get all the rules and their priorities right; it is better
to use a probabilistic model rather than a rule-based model. The simplest probabilistic model
for sequences with hidden state is the hidden Markov model, or HMM.
An HMM models a progression through a sequence of hidden states, xt, with an observation et at each step. To apply HMMs to information extraction, we can either build one big HMM for all the attributes or build a separate HMM for each attribute. The observations are the words of the text, and the hidden states indicate whether we are in the target, prefix, or postfix part of the attribute template, or in the background (not part of the template at all).

Conditional random fields for information extraction

An HMM is a generative model; it models the full joint


probability of observations and hidden states, and thus can be used to generate samples. That
is, we can use the HMM model not only to parse a text and recover the speaker and date,
but also to generate a random instance of a text containing a speaker and a date. Since we’re
not interested in that task, it is natural to ask whether we might be better off with a model
that doesn’t bother modeling that possibility. All we need in order to understand a text is a
discriminative model, one that models the conditional probability of the hidden attributes
given the observations (the text). Given a text e1:N , the conditional model finds the hidden
state sequence X1:N that maximizes P(X1:N | e1:N).
Modeling this directly gives us some freedom. We don’t need the independence assumptions of the
Markov model—we can have an xt that is dependent on x1. A framework
for this type of model is the conditional random field, or CRF, which models a conditional
probability distribution of a set of target variables given a set of observed variables. Like
Bayesian networks, CRFs can represent many different structures of dependencies among the
variables. One common structure is the linear-chain conditional random field for representing Markov dependencies among variables in a temporal sequence. Thus, HMMs are the
temporal version of naive Bayes models, and linear-chain CRFs are the temporal version of
logistic regression, where the predicted target is an entire state sequence rather than a single
binary variable.
Let e1:N be the observations (e.g., words in a document), and x1:N be the sequence of
hidden states (e.g., the prefix, target, and postfix states). A linear-chain conditional random
field defines a conditional probability distribution:
P(x1:N | e1:N) = α exp( Σ i=1..N F(xi−1, xi, e, i) ),

where α is a normalization factor (to make sure the probabilities sum to 1), and F is a feature function defined as the weighted sum of a collection of k component feature functions:

F(xi−1, xi, e, i) = Σ k λk fk(xi−1, xi, e, i).
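For illustration only (the state names, weights, and features below are hypothetical examples, not a full CRF trainer), component feature functions and their weighted sum might look like this:

def f_capitalized_target(x_prev, x, e, i):
    # 1 if the current hidden state is TARGET and the current word is capitalized.
    return 1.0 if x == "TARGET" and e[i][:1].isupper() else 0.0

def f_prefix_to_target(x_prev, x, e, i):
    # 1 if the state transitions from PREFIX to TARGET at this position.
    return 1.0 if x_prev == "PREFIX" and x == "TARGET" else 0.0

FEATURES = [(1.2, f_capitalized_target), (0.8, f_prefix_to_target)]  # (lambda_k, f_k)

def F(x_prev, x, e, i):
    # Weighted sum of component feature functions, as in the formula above.
    return sum(lam * f(x_prev, x, e, i) for lam, f in FEATURES)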

Machine translation

Machine translation is the automatic translation of text from one natural language (the source)
to another (the target).

A translator (human or machine) often needs to understand the actual situation described in the source,
not just the individual words.

Machine translation systems


All translation systems must model the source and target languages, but systems vary in the
type of models they use. Some systems attempt to analyze the source language text all the way
into an interlingua knowledge representation and then generate sentences in the target language from that
representation. This is difficult because it involves three unsolved problems:
creating a complete knowledge representation of everything; parsing into that representation;
and generating sentences from that representation.

Other systems are based on a transfer model. They keep a database of translation rules
(or examples), and whenever the rule (or example) matches, they translate directly. Transfer
can occur at the lexical, syntactic, or semantic level. For example, a strictly syntactic rule maps English [Adjective Noun] to French [Noun Adjective]. A mixed syntactic and lexical rule maps French [S1 "et puis" S2] to English [S1 "and then" S2]. Figure 23.12 diagrams the various transfer points.

Statistical machine translation

Having seen how complex the translation task can be, it should come as no surprise that the most successful machine translation systems are built by training a probabilistic
model using statistics gathered from a large corpus of text. This approach does not need
a complex ontology of interlingua concepts, nor does it need handcrafted grammars of the
source and target languages, nor a hand-labeled treebank. All it needs is data—sample translations from
which a translation model can be learned. To translate a sentence in, say, English
(e) into French (f), we find the string of words f* that maximizes

f* = argmax_f P(f | e) = argmax_f P(e | f) P(f).
Here the factor P(f) is the target language model for French; it says how probable a given sentence is in
French. P(e|f) is the translation model; it says how probable an English
sentence is as a translation for a given French sentence. Similarly, P(f | e) is a translation
model from English to French.

In diagnostic applications like medicine, it is easier to model the domain in the causal direction:
P(symptoms | disease) rather than P(disease | symptoms). But in translation both
directions are equally easy. The earliest work in statistical machine translation did apply
Bayes’ rule—in part because the researchers had a good language model, P(f), and wanted
to make use of it, and in part because they came from a background in speech recognition,
which is a diagnostic problem. We follow their lead in this chapter, but we note that recent work in
statistical machine translation often optimizes P(f | e) directly, using a more
sophisticated model that takes into account many of the features from the language model.

The translation model is learned from a bilingual corpus—a collection of parallel texts,
each an English/French pair

given a source English sentence, e,


finding a French translation f is a matter of three steps:
1. Break the English sentence into phrases e1, . . . , en.
2. For each phrase ei, choose a corresponding French phrase fi. We use the notation
P(fi | ei) for the phrasal probability that fi is a translation of ei.
3. Choose a permutation of the phrases f1, . . . , fn. We will specify this permutation in a
way that seems a little complicated, but is designed to have a simple probability distribution: For each fi,
we choose a distortion di, which is the number of words that
phrase fi has moved with respect to fi−1; positive for moving to the right, negative for
moving to the left, and zero if fi immediately follows fi−1.

Consider an example of the process. At the top, the sentence "There is a smelly wumpus sleeping in 2 2" is broken into five phrases, e1, . . . , e5. Each of them is translated into a corresponding phrase fi, and then these are permuted into the order f1, f3, f4, f2, f5. We specify the permutation in terms of the distortions di of each French phrase, defined as

di = START(fi) − END(fi−1) − 1,

where START(fi) is the ordinal number of the first word of phrase fi in the French sentence, and END(fi−1) is the ordinal number of the last word of phrase fi−1. In Figure 23.13 we see that f5, "à 2 2," immediately follows f4, "qui dort," and thus d5 = 0. Phrase f2, however, has moved one word to the right of f1, so d2 = 1. As a special case we have d1 = 0, because f1 starts at position 1 and END(f0) is defined to be 0 (even though f0 does not exist).
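A minimal sketch of this distortion computation (the phrase spans are illustrative; positions are 1-based word indices in the French sentence):

def distortions(spans):
    # spans[i] = (start, end) word positions of French phrase f_{i+1}.
    d = []
    prev_end = 0                       # END(f_0) is defined to be 0
    for start, end in spans:
        d.append(start - prev_end - 1)
        prev_end = end
    return d

# Usage: if f1 occupies words 1-3 and f2 starts at word 5, then d2 = 5 - 3 - 1 = 1.
# print(distortions([(1, 3), (5, 6)]))  # prints [0, 1]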
All that remains is to learn the phrasal and distortion probabilities. We sketch the procedure; see the notes
at the end of the chapter for details.
1. Find parallel texts: First, gather a parallel bilingual corpus.

2. Segment into sentences: The unit of translation is a sentence, so we will have to break
the corpus into sentences.
3. Align sentences: For each sentence in the English version, determine what sentence(s)
it corresponds to in the French version.
4. Align phrases: Within a sentence, phrases can be aligned by a process that is similar to
that used for sentence alignment, but requiring iterative improvement.
5. Extract distortions: Once we have an alignment of phrases we can define distortion
probabilities. Simply count how often distortion occurs in the corpus for each distance
d = 0, ±1, ±2, . . ., and apply smoothing.
6. Improve estimates with EM: Use expectation–maximization to improve the estimates
of P(f | e) and P(d) values. We compute the best alignments with the current values
of these parameters in the E step, then update the estimates in the M step and iterate the
process until convergence.

Speech recognition

Speech recognition is the task of identifying a sequence of words uttered by a speaker, given
the acoustic signal.
people interact with speech recognition systems every day to navigate voice mail systems,
search the Web from mobile phones, and other applications. Speech is an attractive option
when hands-free operation is necessary, as when operating machinery.
Speech recognition is difficult because the sounds made by a speaker are ambiguous
and, well, noisy. As a well-known example, the phrase "recognize speech" sounds almost the same as "wreck a nice beach" when spoken quickly. Even this short example shows several of the issues that make speech problematic. First, segmentation: written words in English have spaces between them, but in fast speech there are no pauses in "wreck a nice" that would distinguish it as a multiword phrase as opposed to the single word "recognize." Second, coarticulation: when speaking quickly the "s" sound at the end of "nice" merges with the "b" sound at the beginning of "beach," yielding something that is close to a "sp." Another problem that does not show up in this example is homophones, words like "to," "too," and "two" that sound the same but differ in meaning.
We can view speech recognition as a problem in most-likely-sequence explanation. As
we saw in Section 15.2, this is the problem of computing the most likely sequence of state
variables, x1:t, given a sequence of observations e1:t. In this case the state variables are the
words, and the observations are sounds. More precisely, an observation is a vector of features
extracted from the audio signal. As usual, the most likely sequence can be computed with the
help of Bayes’ rule to be:
argmax word1:t P(word1:t | sound1:t) = argmax word1:t P(sound1:t | word1:t) P(word1:t).

Here P(sound1:t | word1:t) is the acoustic model. It describes the sounds of words, for example that "ceiling" begins with a soft "c" and sounds the same as "sealing." P(word1:t) is known as the language model. It specifies the prior probability of each utterance, for example, that "ceiling fan" is about 500 times more likely as a word sequence than "sealing fan."

Acoustic model

Sound waves are periodic changes in pressure that propagate through the air. When these
waves strike the diaphragm of a microphone, the back-and-forth movement generates an
electric current. An analog-to-digital converter measures the size of the current—which approximates
the amplitude of the sound wave—at discrete intervals called the sampling rate.
Speech sounds, which are mostly in the range of 100 Hz (100 cycles per second) to 1000 Hz,
are typically sampled at a rate of 8 kHz. (CDs and mp3 files are sampled at 44.1 kHz.) The
precision of each measurement is determined by the quantization factor; speech recognizers
typically keep 8 to 12 bits. That means that a low-end system, sampling at 8 kHz with 8-bit
quantization, would require nearly half a megabyte per minute of speech.
Since we only want to know what words were spoken, not exactly what they sounded like, we don't need to keep all that information. We only need to distinguish between different speech sounds. Linguists have identified about 100 speech sounds, or phones, that can be composed to form all the words in all known human languages. Roughly speaking, a phone is the sound that corresponds to a single vowel or consonant, but there are some complications: combinations of letters, such as "th" and "ng", produce single phones, and some letters produce different phones in different contexts (e.g., the "a" in rat and rate).
A phoneme is the smallest unit of sound that has a distinct meaning to speakers of a particular
language.
First, we observe that although the sound frequencies in speech may be several kHz,
the changes in the content of the signal occur much less often, perhaps at no more than 100
Hz. Therefore, speech systems summarize the properties of the signal over time slices called
frames. A frame length of about 10 milliseconds (i.e., 80 samples at 8 kHz) is short enough
to ensure that few short-duration phenomena will be missed. Overlapping frames are used to
make sure that we don’t miss a signal because it happens to fall on a frame boundary.
Each frame is summarized by a vector of features. Picking out features from a speech signal is like listening to an orchestra and saying "here the French horns are playing loudly and the violins are playing softly." We'll give a brief overview of the features in a typical system. First, a Fourier transform is used to determine the amount of acoustic energy at about a dozen frequencies. Then we compute a measure called the mel frequency cepstral coefficient (MFCC) for each frequency. We also compute the total energy in the frame. That gives thirteen features; for each one we compute the difference between this frame and the previous frame, and the difference between differences, for a total of 39 features. These are continuous-valued; the easiest way to fit them into the HMM framework is to discretize the values.
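A minimal sketch of extracting such 39-dimensional feature vectors, assuming the third-party librosa library is available (the file name and 8 kHz sampling rate are illustrative):

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=8000)        # load audio, resampled to 8 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # 13 coefficients per frame
delta = librosa.feature.delta(mfcc)                   # frame-to-frame differences
delta2 = librosa.feature.delta(mfcc, order=2)         # differences of differences

features = np.vstack([mfcc, delta, delta2])           # 39 features per frame
print(features.shape)                                 # (39, number_of_frames)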
