Unit V: Applications Notes
There are over a trillion pages of information on the Web, almost all of it in natural language. An agent that wants to do knowledge acquisition needs to understand (at least partially) the ambiguous, messy languages that humans use. We examine the problem from the point of view of specific information-seeking tasks: text classification, information retrieval, and information extraction. One common factor in addressing these tasks is the use of language models: models that predict the probability distribution of language expressions.
LANGUAGE MODELS:
Formal languages, such as the programming languages Java or Python, have precisely defined language models. A language can be defined as a set of strings; "print(2 + 2)" is a legal program in the language Python, whereas "2)+(2 print" is not. Natural languages are difficult to deal with because they are very large and constantly changing. Thus, our language models are, at best, an approximation.
A model of the probability distribution of n-letter sequences is called an n-gram model. An n-gram model is defined as a Markov chain of order n − 1: in a Markov chain the probability of character ci depends only on the immediately preceding characters, not on any other characters. So in a trigram model (a Markov chain of order 2), P(ci | c1:i−1) = P(ci | ci−2:i−1). We can define the probability of a sequence of characters P(c1:N) under the trigram model by first factoring with the chain rule and then using the Markov assumption:

P(c1:N) = ∏ i=1..N P(ci | c1:i−1) = ∏ i=1..N P(ci | ci−2:i−1)
One approach to language identification is to first build a trigram character model of each candidate language, P(ci | ci−2:i−1, L), where the variable L ranges over languages. For each language, the model is built by counting trigrams in a corpus of that language.
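As a minimal sketch of this approach (not part of the original notes), the following Python code builds add-one-smoothed character trigram models from tiny placeholder corpora and identifies a test string's language by maximizing its log-probability; the corpus strings, vocabulary size, and language labels are illustrative assumptions.

```python
import math
from collections import Counter

def trigram_model(text):
    """Count character trigrams (and their bigram contexts) in a corpus string."""
    padded = "  " + text          # pad so the first characters have context
    trigrams = Counter(padded[i:i+3] for i in range(len(padded) - 2))
    bigrams = Counter(padded[i:i+2] for i in range(len(padded) - 2))
    return trigrams, bigrams

def log_prob(text, model, vocab_size=128):
    """Log P(c1:N | L) under a trigram model with add-one smoothing."""
    trigrams, bigrams = model
    padded = "  " + text
    lp = 0.0
    for i in range(len(padded) - 2):
        tri, bi = padded[i:i+3], padded[i:i+2]
        lp += math.log((trigrams[tri] + 1) / (bigrams[bi] + vocab_size))
    return lp

# Illustrative toy corpora; a real system would use large monolingual corpora.
corpora = {"en": "the quick brown fox jumps over the lazy dog",
           "de": "der schnelle braune fuchs springt ueber den faulen hund"}
models = {lang: trigram_model(text) for lang, text in corpora.items()}

def identify(text):
    """Return the language whose trigram model gives the text the highest probability."""
    return max(models, key=lambda lang: log_prob(text, models[lang]))

print(identify("the dog"))   # expected: 'en'
```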
The major complication of n-gram models is that the training corpus provides only an estimate of the true probability distribution. For common character sequences such as " th", any English corpus will give a good estimate: about 1.5% of all trigrams. On the other hand, " ht" is very uncommon: no dictionary words start with ht. It is likely that the sequence would have a count of zero in a training corpus of standard English. Does that mean we should assign P(" ht") = 0?
If we did, then the text "The program issues an http request" would have an English probability of zero, which seems wrong. We have a problem in generalization: we want our language models to generalize well to texts they haven't seen yet. Just because we have never seen " http" before does not mean that our model should claim that it is impossible. Thus, we will adjust our language model so that sequences that have a count of zero in the training corpus will be assigned a small nonzero probability (and the other counts will be adjusted downward slightly so that the probabilities still sum to 1). The process of adjusting the probability of low-frequency counts is called smoothing.
Linear interpolation smoothing is a backoff model that combines trigram, bigram, and unigram models by linear interpolation. It defines the probability estimate as

P̂(ci | ci−2:i−1) = λ3 P(ci | ci−2:i−1) + λ2 P(ci | ci−1) + λ1 P(ci), where λ1 + λ2 + λ3 = 1.
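A minimal Python sketch of linear interpolation smoothing, assuming precomputed maximum-likelihood unigram, bigram, and trigram tables (dicts keyed by character strings) and illustrative λ weights:

```python
def interp_prob(c, context, unigram, bigram, trigram, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram, and trigram estimates.

    `unigram`, `bigram`, `trigram` are assumed to map character n-grams to
    maximum-likelihood probabilities; the lambda weights are illustrative
    placeholders and must sum to 1.
    """
    l1, l2, l3 = lambdas
    p1 = unigram.get(c, 0.0)
    p2 = bigram.get(context[-1:] + c, 0.0)
    p3 = trigram.get(context[-2:] + c, 0.0)
    return l1 * p1 + l2 * p2 + l3 * p3
```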
Model evaluation
Information retrieval
Information retrieval is the task of finding documents that are relevant to a user’s need for
information. The best-known examples of information retrieval systems are search engines
on the World Wide Web. An information retrieval (IR) system can be characterized by:
1. A corpus of documents. Each system must decide what it wants to treat as a document: a
paragraph, a page, or a multipage text.
2. Queries posed in a query language. A query specifies what the user wants to know. The query language can be just a list of words, such as [AI book]; or it can specify a phrase of words that must be adjacent, as in ["AI book"]; it can contain Boolean operators, as in [AI AND book]; it can include non-Boolean operators such as [AI NEAR book] or [AI book site:www.aaai.org].
3. A result set. This is the subset of documents that the IR system judges to be relevant to
the query. By relevant, we mean likely to be of use to the person who posed the query,
for the particular information need expressed in the query.
4. A presentation of the result set. This can be as simple as a ranked list of document titles or as complex as a rotating color map of the result set projected onto a three-dimensional space, rendered as a two-dimensional display.
The earliest IR systems worked on a Boolean keyword model. Each word in the document collection is
treated as a Boolean feature that is true of a document if the word occurs in the document and false if it
does not.
IR scoring functions
A scoring function takes a document and a query and returns a numeric score; the most
relevant documents have the highest scores. In the BM25 function, the score is a linear
weighted combination of scores for each of the words that make up the query. Three factors
affect the weight of a query term: First, the frequency with which a query term appears in
a document (also known as TF for term frequency).
The BM25 function takes all three of these into account. We assume we have created an index of the N documents in the corpus so that we can look up TF(qi, dj), the count of the number of times word qi appears in document dj. We also assume a table of document frequency counts, DF(qi), that gives the number of documents that contain the word qi. Then, given a document dj and a query consisting of the words q1:N, we have

BM25(dj, q1:N) = Σ i=1..N IDF(qi) · TF(qi, dj) · (k + 1) / (TF(qi, dj) + k · (1 − b + b · |dj| / L)),

where |dj| is the length of document dj in words, L is the average document length in the corpus, and k and b are tuned parameters (typical values are k = 2.0 and b = 0.75). The inverse document frequency is IDF(qi) = log((N − DF(qi) + 0.5) / (DF(qi) + 0.5)).
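A hedged Python sketch of the BM25 score, assuming in-memory TF and DF tables rather than a real inverted index; the parameter defaults (k = 2.0, b = 0.75) follow the formula above:

```python
import math

def bm25(query, doc_id, tf, df, doc_len, avg_len, N, k=2.0, b=0.75):
    """BM25 score of document `doc_id` for the word list `query`.

    tf[(word, doc_id)] = count of word in the document (TF)
    df[word]           = number of documents containing the word (DF)
    doc_len[doc_id]    = document length in words; avg_len = corpus average
    N                  = number of documents; k, b = tuning parameters
    """
    score = 0.0
    for q in query:
        f = tf.get((q, doc_id), 0)
        if f == 0:
            continue
        idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5))
        norm = f + k * (1 - b + b * doc_len[doc_id] / avg_len)
        score += idf * f * (k + 1) / norm
    return score
```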
IR system evaluation
IR refinements
The BM25 scoring function uses a word model that treats all words as completely independent, but we know that some words are correlated: "couch" is closely related to both "couches" and "sofa." Many IR systems attempt to account for these correlations. For example, if the query is [couch], it would be a shame to exclude from the result set those documents that mention "COUCH" or "couches" but not "couch." Most IR systems do case folding of "COUCH" to "couch," and some use a stemming algorithm to reduce "couches" to the stem form "couch," both in the query and the documents. This typically yields a small increase in recall (on the order of 2% for English). However, it can harm precision. For example, stemming "stocking" to "stock" will tend to decrease precision for queries about either foot coverings or financial instruments, although it could improve recall for queries about warehousing. Stemming algorithms based on rules (e.g., remove "-ing") cannot avoid this problem, but algorithms based on dictionaries (don't remove "-ing" if the word is already listed in the dictionary) can.
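As an illustrative sketch (not part of the original notes), the code below contrasts a naive rule-based stemmer with a dictionary-guarded one; the tiny DICTIONARY set and the suffix rules are placeholder assumptions:

```python
DICTIONARY = {"stocking", "thing"}   # illustrative words that should not be stemmed

def rule_stem(word):
    """Naive rule-based stemmer: always strip a trailing '-ing', '-es', or '-s'."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def dict_stem(word):
    """Dictionary-guarded stemmer: keep the word if it is already a dictionary entry."""
    if word in DICTIONARY:
        return word
    return rule_stem(word)

print(rule_stem("stocking"))   # 'stock'    (the rule-based stemmer hurts precision)
print(dict_stem("stocking"))   # 'stocking' (kept, since it is listed in the dictionary)
print(dict_stem("couches"))    # 'couch'    (not in the dictionary, so the rule applies)
```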
IR can be improved by considering metadata (data outside of the text of the document). Examples include human-supplied keywords and publication data. On the Web, hypertext links between documents are a crucial source of information.
The Hyperlink-Induced Topic Search algorithm, also known as "Hubs and Authorities" or HITS, is another influential link-analysis algorithm. HITS differs from PageRank in several ways. First, it is a
query-dependent measure: it rates pages with respect to a query. That means that it must be computed
anew for each query—a computational burden that most search engines have elected not to take on.
Given a query, HITS first finds a set of pages that are relevant to the query. It does that by intersecting hit
lists of query words, and then adding pages in the link neighborhood of these pages—pages that link to or
are linked from one of the pages in the original relevant set. Each page in this set is considered an
authority on the query to the degree that other pages in the relevant set point to it. A page is considered
a hub to the degree that it points to other authoritative pages in the relevant set. Just as with PageRank,
we don’t want to merely count the number of links; we want to give more value to the high-quality hubs
and authorities. Thus, as with PageRank, we iterate a process that updates the authority score of a page to
be the sum of the hub scores of the pages that point to it, and the hub score to be the sum of the authority
scores of the pages it points to. If we then normalize the scores and repeat k times, the process will
converge.
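A minimal sketch of the hub/authority iteration, assuming the relevant page set has already been retrieved and its link structure is stored as a dict from page to the set of pages it links to (the toy graph is illustrative):

```python
import math

def hits(links, k=20):
    """Iteratively update hub and authority scores over a relevant page set.

    links: dict mapping each page to the set of pages it points to
    (restricted to the query's relevant set). Returns (authority, hub) dicts.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(k):
        # authority of a page = sum of hub scores of pages pointing to it
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # hub of a page = sum of authority scores of pages it points to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # normalize so the scores do not grow without bound
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

# Illustrative toy link graph
links = {"a": {"b", "c"}, "b": {"c"}, "d": {"c"}}
authority, hubs = hits(links)
print(max(authority, key=authority.get))   # expected: 'c'
```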
Question answering
Question answering is a somewhat different task, in which the query really is a question, and the answer
is not a ranked list of documents but rather a short response—a sentence, or even just a phrase.
INFORMATION EXTRACTION
Information extraction is the process of acquiring knowledge by skimming a text and looking for
occurrences of a particular class of object and for relationships among objects. A
typical task is to extract instances of addresses from Web pages, with database fields for
street, city, state, and zip code; or instances of storms from weather reports, with fields for
temperature, wind speed, and precipitation. In a limited domain, this can be done with high
accuracy. As the domain gets more general, more complex linguistic models and more complex learning
techniques are necessary.
One step up from attribute-based extraction systems are relational extraction systems,
which deal with multiple objects and the relations among them. Thus, when these systems
see the text "$249.99," they need to determine not just that it is a price, but also which object has that price. A typical relational extraction system is FASTUS, which handles news stories about corporate mergers and acquisitions. It can read a story such as "Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan." and extract the relations it describes: the joint venture itself, its member companies, its product (golf clubs), and the date (Friday).
FASTUS is a cascaded finite-state transducer. That is, the system consists of a series of small, efficient finite-state automata (FSAs), where
each automaton receives text as input, transduces the text into a different format, and passes
it along to the next automaton. FASTUS consists of five stages:
1. Tokenization
2. Complex-word handling
3. Basic-group handling
4. Complex-phrase handling
5. Structure merging
FASTUS’s first stage is tokenization, which segments the stream of characters into tokens
(words, numbers, and punctuation). For English, tokenization can be fairly simple; just separating
characters at white space or punctuation does a fairly good job. Some tokenizers also
deal with markup languages such as HTML, SGML, and XML.
The second stage handles complex words, including collocations such as "set up" and "joint venture," as well as proper names such as "Bridgestone Sports Co." These are recognized by a combination of lexical entries and finite-state grammar rules. For example, a company name might be recognized by the rule

CapitalizedWord+ ("Company" | "Co" | "Inc" | "Ltd")
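As a rough illustration (not FASTUS's actual grammar), the rule above can be approximated with a regular expression in Python; the token patterns are simplifying assumptions:

```python
import re

# A rough regex rendering of the rule
#   CapitalizedWord+ ("Company" | "Co" | "Inc" | "Ltd")
COMPANY = re.compile(r"(?:[A-Z][a-z]+\s+)+(?:Company|Co|Inc|Ltd)\b\.?")

text = "Bridgestone Sports Co. said Friday it has set up a joint venture."
print(COMPANY.findall(text))   # ['Bridgestone Sports Co.']
```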
The third stage handles basic groups, meaning noun groups and verb groups. The idea is
to chunk these into units that will be managed by the later stages. We will see how to write
a complex description of noun and verb phrases in Chapter 23; here we have simple rules that only approximate the complexity of English but have the advantage of being representable by finite-state automata.
When information extraction must be attempted from noisy or varied input, simple finite-state
approaches fare poorly. It is too hard to get all the rules and their priorities right; it is better
to use a probabilistic model rather than a rule-based model. The simplest probabilistic model
for sequences with hidden state is the hidden Markov model, or HMM.
An HMM models a progression through a sequence of hidden states, xt, with an observation et at each step. To apply HMMs to information extraction, we can either build one big HMM for all the attributes or build a separate HMM for each attribute. The observations are the words of the text, and the hidden states indicate whether we are in the target, prefix, or postfix part of the attribute template, or in the background (not part of a template).
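A hedged sketch of the separate-HMM-per-attribute idea: Viterbi decoding over four template states (background, prefix, target, postfix), with hand-set transition and emission probabilities standing in for parameters that a real extractor would learn from labeled text:

```python
import math

STATES = ["BG", "PRE", "TARGET", "POST"]

# Illustrative hand-set parameters; a real extractor would estimate these from data.
TRANS = {"BG":     {"BG": 0.8, "PRE": 0.2},
         "PRE":    {"PRE": 0.3, "TARGET": 0.7},
         "TARGET": {"TARGET": 0.5, "POST": 0.5},
         "POST":   {"POST": 0.3, "BG": 0.7}}

def emit(state, word):
    """Toy emission model: TARGET prefers capitalized words, PRE prefers trigger words."""
    if state == "TARGET":
        return 0.6 if word[:1].isupper() else 0.05
    if state == "PRE":
        return 0.5 if word.lower() in {"speaker", "spoke", "by", "at"} else 0.05
    return 0.2

def viterbi(words):
    """Return the most likely hidden-state sequence for the observed words."""
    # start mostly in the background state
    V = [{s: (math.log(emit(s, words[0])) + (0.0 if s == "BG" else -5.0), [s])
          for s in STATES}]
    for w in words[1:]:
        row = {}
        for s in STATES:
            row[s] = max(((V[-1][p][0]
                           + math.log(TRANS[p].get(s, 1e-6))
                           + math.log(emit(s, w)),
                           V[-1][p][1] + [s])
                          for p in STATES),
                         key=lambda t: t[0])
        V.append(row)
    return max(V[-1].values(), key=lambda t: t[0])[1]

print(viterbi("the speaker is Sarah Connor today".split()))
```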
Machine translation
Machine translation is the automatic translation of text from one natural language (the source)
to another (the target).
A translator (human or machine) often needs to understand the actual situation described in the source,
not just the individual words.
Other systems are based on a transfer model. They keep a database of translation rules
(or examples), and whenever the rule (or example) matches, they translate directly. Transfer
can occur at the lexical, syntactic, or semantic level. For example, a strictly syntactic rule
maps English [Adjective Noun] to French [Noun Adjective]. A mixed syntactic and lexical
rule maps French [S1 ―et puis‖ S2] to English [S1 ―and then‖ S2]. Figure 23.12 diagrams the
various transfer points.
Having seen how complex the translation task can be, it should come as no surprise that the most successful machine translation systems are built by training a probabilistic model using statistics gathered from a large corpus of text. This approach does not need a complex ontology of interlingua concepts, nor does it need handcrafted grammars of the source and target languages, nor a hand-labeled treebank. All it needs is data: sample translations from which a translation model can be learned. To translate a sentence in, say, English (e) into French (f), we find the string of words f* that maximizes

f* = argmax_f P(f | e) = argmax_f P(e | f) P(f).
Here the factor P(f) is the target language model for French; it says how probable a given sentence is in
French. P(e|f) is the translation model; it says how probable an English
sentence is as a translation for a given French sentence. Similarly, P(f | e) is a translation
model from English to French.
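A minimal sketch of noisy-channel scoring over a candidate list; `translation_logprob` and `language_logprob` are placeholders for trained models, and a real decoder would search a far larger space (e.g., with beam search) rather than enumerating candidates:

```python
def noisy_channel_score(f_candidate, e_sentence, translation_logprob, language_logprob):
    """Score a candidate French translation f of English sentence e
    under the noisy-channel decomposition: log P(e | f) + log P(f)."""
    return translation_logprob(e_sentence, f_candidate) + language_logprob(f_candidate)

def best_translation(e_sentence, candidates, translation_logprob, language_logprob):
    """argmax over a (pre-generated) candidate list of French sentences."""
    return max(candidates,
               key=lambda f: noisy_channel_score(f, e_sentence,
                                                 translation_logprob, language_logprob))
```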
In diagnostic applications like medicine, it is easier to model the domain in the causal direction:
P(symptoms | disease) rather than P(disease | symptoms). But in translation both
directions are equally easy. The earliest work in statistical machine translation did apply
Bayes’ rule—in part because the researchers had a good language model, P(f), and wanted
to make use of it, and in part because they came from a background in speech recognition,
which is a diagnostic problem. We follow their lead in this chapter, but we note that recent work in
statistical machine translation often optimizes P(f | e) directly, using a more
sophisticated model that takes into account many of the features from the language model.
The translation model is learned from a bilingual corpus: a collection of parallel texts, each an English/French pair. The model is built in the following steps:
1. Find parallel texts: Gather a parallel bilingual corpus, for example from official documents that are routinely published in both languages.
2. Segment into sentences: The unit of translation is a sentence, so we will have to break
the corpus into sentences.
3. Align sentences: For each sentence in the English version, determine what sentence(s)
it corresponds to in the French version.
4. Align phrases: Within a sentence, phrases can be aligned by a process that is similar to
that used for sentence alignment, but requiring iterative improvement.
5. Extract distortions: Once we have an alignment of phrases, we can define distortion probabilities: simply count how often each distortion distance d = 0, ±1, ±2, . . . occurs in the corpus and apply smoothing (a counting sketch follows this list).
6. Improve estimates with EM: Use expectation–maximization to improve the estimates
of P(f | e) and P(d) values. We compute the best alignments with the current values
of these parameters in the E step, then update the estimates in the M step and iterate the
process until convergence.
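A small counting sketch for step 5, assuming phrase alignments are given as (English position, French position) index pairs and defining distortion simply as the positional shift (a simplification of the usual definition); add-one smoothing keeps unseen distances from getting probability zero:

```python
from collections import Counter

def distortion_probs(aligned_phrase_pairs):
    """Estimate P(d) from phrase alignments by counting positional shifts."""
    counts = Counter(f_pos - e_pos for e_pos, f_pos in aligned_phrase_pairs)
    support = range(min(counts) - 1, max(counts) + 2)   # pad the observed range
    total = sum(counts.values()) + len(support)         # add-one smoothing
    return {d: (counts[d] + 1) / total for d in support}

# Illustrative alignments: most phrases keep their position (d = 0)
pairs = [(0, 0), (1, 1), (2, 3), (3, 2), (4, 4)]
print(distortion_probs(pairs))
```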
Speech recognition
Speech recognition is the task of identifying a sequence of words uttered by a speaker, given
the acoustic signal.
People interact with speech recognition systems every day to navigate voice mail systems, search the Web from mobile phones, and use other applications. Speech is an attractive option when hands-free operation is necessary, as when operating machinery.
Speech recognition is difficult because the sounds made by a speaker are ambiguous and, well, noisy. As a well-known example, the phrase "recognize speech" sounds almost the same as "wreck a nice beach" when spoken quickly. Even this short example shows several of the issues that make speech problematic. First, segmentation: written words in English have spaces between them, but in fast speech there are no pauses in "wreck a nice" that would distinguish it as a multiword phrase as opposed to the single word "recognize." Second, coarticulation: when speaking quickly the "s" sound at the end of "nice" merges with the "b" sound at the beginning of "beach," yielding something that is close to a "sp." Another problem that does not show up in this example is homophones: words like "to," "too," and "two" that sound the same but differ in meaning.
We can view speech recognition as a problem in most-likely-sequence explanation. As
we saw in Section 15.2, this is the problem of computing the most likely sequence of state
variables, x1:t, given a sequence of observations e1:t. In this case the state variables are the
words, and the observations are sounds. More precisely, an observation is a vector of features
extracted from the audio signal. As usual, the most likely sequence can be computed with the
help of Bayes’ rule to be:
argmax_word1:t P(word1:t | sound1:t) = argmax_word1:t P(sound1:t | word1:t) P(word1:t).
Here P(sound1:t | word1:t) is the acoustic model. It describes the sounds of words: that "ceiling" begins with a soft "c" and sounds the same as "sealing." P(word1:t) is known as the language model. It specifies the prior probability of each utterance; for example, that "ceiling fan" is about 500 times more likely as a word sequence than "sealing fan."
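A minimal sketch of choosing among candidate transcriptions by combining acoustic and language-model log probabilities; the candidate list and the toy scoring functions are illustrative assumptions, and real recognizers search the space of word sequences with Viterbi decoding rather than enumerating candidates:

```python
def recognize(candidates, acoustic_logprob, language_logprob):
    """Pick the word sequence maximizing log P(sound | words) + log P(words)."""
    return max(candidates,
               key=lambda words: acoustic_logprob(words) + language_logprob(words))

# Illustrative use with toy scores: the two candidates sound the same,
# so the language model decides.
print(recognize([["ceiling", "fan"], ["sealing", "fan"]],
                acoustic_logprob=lambda w: 0.0,
                language_logprob=lambda w: {"ceiling": -1.0,
                                            "sealing": -7.2}[w[0]]))
```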
Acoustic model
Sound waves are periodic changes in pressure that propagate through the air. When these
waves strike the diaphragm of a microphone, the back-and-forth movement generates an
electric current. An analog-to-digital converter measures the size of the current—which approximates
the amplitude of the sound wave—at discrete intervals called the sampling rate.
Speech sounds, which are mostly in the range of 100 Hz (100 cycles per second) to 1000 Hz,
are typically sampled at a rate of 8 kHz. (CDs and mp3 files are sampled at 44.1 kHz.) The
precision of each measurement is determined by the quantization factor; speech recognizers
typically keep 8 to 12 bits. That means that a low-end system, sampling at 8 kHz with 8-bit
quantization, would require nearly half a megabyte per minute of speech.
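A quick check of that figure, under the stated assumptions of 8 kHz sampling and 8-bit (one-byte) quantization:

```python
samples_per_second = 8_000        # 8 kHz sampling rate
bytes_per_sample = 1              # 8-bit quantization
bytes_per_minute = samples_per_second * bytes_per_sample * 60
print(bytes_per_minute / 1_000_000)   # 0.48 -> roughly half a megabyte per minute
```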
Since we only want to know what words were spoken, not exactly what they sounded like, we don't need to keep all that information. We only need to distinguish between different speech sounds. Linguists have identified about 100 speech sounds, or phones, that can be composed to form all the words in all known human languages. Roughly speaking, a phone is the sound that corresponds to a single vowel or consonant, but there are some complications: combinations of letters, such as "th" and "ng," produce single phones, and some letters produce different phones in different contexts (e.g., the "a" in rat and rate).
A phoneme is the smallest unit of sound that has a distinct meaning to speakers of a particular
language.
First, we observe that although the sound frequencies in speech may be several kHz,
the changes in the content of the signal occur much less often, perhaps at no more than 100
Hz. Therefore, speech systems summarize the properties of the signal over time slices called
frames. A frame length of about 10 milliseconds (i.e., 80 samples at 8 kHz) is short enough
to ensure that few short-duration phenomena will be missed. Overlapping frames are used to
make sure that we don’t miss a signal because it happens to fall on a frame boundary.
Each frame is summarized by a vector of features. Picking out features from a speech signal is like listening to an orchestra and saying "here the French horns are playing loudly and the violins are playing softly." We'll give a brief overview of the features in a typical system. First, a Fourier transform is used to determine the amount of acoustic energy at about a dozen frequencies. Then we compute a measure called the mel frequency cepstral coefficient (MFCC) for each frequency. We also compute the total energy in
the frame. That gives thirteen features; for each one we compute the difference between
this frame and the previous frame, and the difference between differences, for a total of 39
features. These are continuous-valued; the easiest way to fit them into the HMM framework
is to discretize the values.
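A hedged sketch of the delta and delta-delta stacking that turns 13 features per frame into 39, assuming the base features arrive as a NumPy array of shape (frames, 13):

```python
import numpy as np

def add_deltas(features):
    """Stack base features with frame-to-frame differences (deltas) and
    differences of differences (delta-deltas): 13 -> 39 values per frame.

    `features` is assumed to be an array of shape (num_frames, 13),
    e.g. 12 MFCCs plus total frame energy."""
    delta = np.diff(features, axis=0, prepend=features[:1])
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.hstack([features, delta, delta2])

frames = np.random.rand(100, 13)       # placeholder feature matrix
print(add_deltas(frames).shape)        # (100, 39)
```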