
ARTIFICIAL INTELLIGENCE

UNIT-III
REINFORCEMENT LEARNING

SYLLABUS:
Reinforcement Learning: Introduction, Passive Reinforcement Learning, Active
Reinforcement Learning, Generalization in Reinforcement Learning, Policy
Search, applications of RL

Natural Language Processing: Language Models, Text Classification, Information Retrieval, Information Extraction.

*******************************

Reinforcement Learning:

Introduction:

Machine learning enables a machine to automatically learn from data, improve performance from experience, and predict things without being explicitly programmed.

Classification of Machine Learning -

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1. Supervised Machine Learning:
Supervised learning is a type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output.

 The labelled data means some input data is already tagged with the correct output.
 In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data.
 Once the training process is completed, the model is tested on test data (data held out from training), and then it predicts the output.

2. Unsupervised Machine Learning:

Unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the model itself finds hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain when learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.

3. Reinforcement Learning:

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to get the most reward points, and hence it improves its performance.

A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.

-----------------------------------------------------------------------------------------------------------------

What is Reinforcement Learning?


 Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
 In Reinforcement Learning, the agent learns automatically from feedback, without any labeled data, unlike supervised learning.
 RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game-playing, robotics, etc.

 The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by getting the maximum positive reward.
 The agent learns by trial and error, and based on its experience, it learns to perform the task in a better way. Hence, we can say that

"Reinforcement learning is a type of machine learning method where an intelligent


agent (computer program) interacts with the environment and learns to act within
that."

 It is a core part of Artificial Intelligence, and all AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.

Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing some actions, and based on those actions, the state of the agent gets changed, and it also receives a reward or penalty as feedback.
o The agent continues doing these three things (take action, change state/remain in the same state, and get feedback), and by doing these actions, it learns and explores the environment.
o The agent learns which actions lead to positive feedback (rewards) and which actions lead to negative feedback (penalties). As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.

Terms used in Reinforcement Learning
o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
o Action: Actions are the moves taken by an agent within the environment.
o State: A situation returned by the environment after each action taken by the agent.
o Reward: Feedback returned to the agent from the environment to evaluate the action of the agent.
o Policy: A strategy applied by the agent to decide the next action based on the current state.
o Value: The expected long-term return with discounting, as opposed to the short-term reward.
o Q-value: Mostly similar to the value, but it takes one additional parameter, the current action (a).
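
For reference, the value and Q-value above can be written in standard notation (a reconstruction with discount factor \gamma, not the handout's own formulas):

    V^{\pi}(s) = E\Big[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\Big|\; s_0 = s \Big]
    \qquad
    Q^{\pi}(s, a) = E\Big[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\Big|\; s_0 = s,\; a_0 = a \Big]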

Approaches to implement Reinforcement Learning

There are mainly three ways to implement reinforcement learning in ML, which are:

1. Value-based:
The value-based approach is about finding the optimal value function, which gives the maximum value at a state under any policy. The agent expects the long-term return at any state s under policy π.
2. Policy-based:
The policy-based approach is to find the optimal policy for the maximum future rewards without using the value function. In this approach, the agent tries to apply a policy such that the action performed in each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the environment, and the agent explores that environment to learn it. There is no particular solution or algorithm for this approach because the model representation is different for each environment.

Passive learning: the agent's policy is fixed and the task is to learn the utilities of states (or state–action pairs); this could also involve learning a model of the environment.

Active learning: the agent must also learn what to do. The principal issue is exploration: an agent must experience as much as possible of its environment in order to learn how to behave in it.

Passive Reinforcement Learning:

 In passive learning, the agent’s policy π is fixed: in state s, it always executes the action
π(s). Its goal is simply to learn how good the policy is—that is, to learn the utility
function Uπ(s).


The agent executes a set of trials in the environment using its policy π. In each trial, the agent
starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the
terminal states, (4,2) or (4,3). Its percepts supply both the current state and the reward received in
that state. Typical trials might look like this:
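
The example trials themselves did not survive the conversion of this handout. Assuming the standard 4x3 grid world with a reward of -0.04 in every nonterminal state (an assumption; the reward is not stated here), the first trial consistent with the sample values quoted below would be:

    (1,1)_{-0.04} \to (1,2)_{-0.04} \to (1,3)_{-0.04} \to (1,2)_{-0.04} \to (1,3)_{-0.04} \to (2,3)_{-0.04} \to (3,3)_{-0.04} \to (4,3)_{+1}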

The object is to use the information about rewards to learn the expected utility Uπ(s) associated with each nonterminal state s.

There are three main approaches:
i. Direct utility estimation
ii. Adaptive dynamic programming
iii. Temporal-difference learning

i. Direct utility estimation:

The idea is that the utility of a state is the expected total reward from that state onward (called the expected reward-to-go), and each trial provides a sample of this quantity for each state visited.

For example: the first trial in the set of three given earlier provides a sample total reward of 0.72
for state (1,1), two samples of 0.76 and 0.84 for (1,2), two samples of 0.80 and 0.88 for (1,3),
and so on. Thus, at the end of each sequence, the algorithm calculates the observed reward-to-go
for each state and updates the estimated utility for that state accordingly, just by keeping a
running average for each state in a table.
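
A minimal Python sketch of direct utility estimation, assuming each trial is given as a list of (state, reward) pairs; the trial data below is a hypothetical example using the standard -0.04 step reward, chosen to match the sample values quoted above:

from collections import defaultdict

def direct_utility_estimation(trials):
    """Estimate U^pi(s) as the running average of observed reward-to-go.

    Each trial is a list of (state, reward) pairs ending in a terminal state.
    """
    totals = defaultdict(float)   # sum of observed reward-to-go per state
    counts = defaultdict(int)     # number of samples per state

    for trial in trials:
        rewards = [r for (_, r) in trial]
        for i, (state, _) in enumerate(trial):
            reward_to_go = sum(rewards[i:])   # total reward from this state onward
            totals[state] += reward_to_go
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in totals}

# Example: a trial in the 4x3 world with -0.04 per nonterminal state
trial1 = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
          ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
print(direct_utility_estimation([trial1]))   # e.g. U(1,1) = 0.72, U(1,2) = 0.80 on average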

 Direct utility estimation succeeds in reducing the reinforcement learning problem to an


inductive learning problem, about which much is known.
 Unfortunately, it misses a very important source of information, namely, the fact that the
utilities of states are not independent!
 The utility of each state equals its own reward plus the expected utility of its successor
states. That is, the utility values obey the Bellman equations for a fixed policy.
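
In standard notation, with discount factor \gamma, these fixed-policy Bellman equations are (a reconstruction of the formula the text refers to):

    U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, U^{\pi}(s')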

ii) Adaptive dynamic programming:


 An adaptive dynamic programming (or ADP) agent takes advantage of the constraints among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process using a dynamic programming method.

 For a passive learning agent, this means plugging the learned transition model P(s' | s, π(s)) and the observed rewards R(s) into the Bellman equations to calculate the utilities of the states.

 Alternatively, we can adopt the approach of modified policy iteration using a simplified
value iteration process to update the utility estimates after each change to the learned
model.

 Because the model usually changes only slightly with each observation, the value
iteration process can use the previous utility estimates as initial values and should
converge quite quickly.

 We keep track of how often each action outcome occurs and estimate the transition probability P(s' | s, a) from the frequency with which s' is reached when executing a in s.

 For example, in the three trials given earlier, Right is executed three times in (1,3) and two out of three times the resulting state is (2,3), so P((2,3) | (1,3), Right) is estimated to be 2/3.
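
A minimal sketch of how these transition counts can be maintained (the class and method names are illustrative, not from the original):

from collections import defaultdict

class TransitionModelLearner:
    """Maintains counts N[s,a] and N[s,a,s'] and estimates P(s' | s, a)."""

    def __init__(self):
        self.n_sa = defaultdict(int)
        self.n_sas = defaultdict(int)

    def observe(self, s, a, s_next):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1

    def prob(self, s_next, s, a):
        if self.n_sa[(s, a)] == 0:
            return 0.0
        return self.n_sas[(s, a, s_next)] / self.n_sa[(s, a)]

# Example from the text: Right executed 3 times in (1,3), reaching (2,3) twice.
m = TransitionModelLearner()
m.observe((1, 3), "Right", (2, 3))
m.observe((1, 3), "Right", (2, 3))
m.observe((1, 3), "Right", (1, 3))    # hypothetical third outcome
print(m.prob((2, 3), (1, 3), "Right"))   # 2/3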

iii. Temporal-difference learning:

 Solving the underlying MDP as in the preceding section is not the only way to bring the
Bellman equations to bear on the learning problem.
 Another way is to use the observed transitions to adjust the utilities of the observed states
so that they agree with the constraint equations.

Consider, for example, the transition from (1,3) to (2,3) in the second trial. Suppose that, as a result of the first trial, the utility estimates are Uπ(1,3) = 0.84 and Uπ(2,3) = 0.92. Now, if this transition occurred all the time, we would expect the utilities to obey the equation
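
Reconstructing the missing equation: with the usual step reward of -0.04 (an assumption; the reward is not stated in this handout), the constraint would be

    U^{\pi}(1,3) = -0.04 + U^{\pi}(2,3)

i.e. about 0.88, so the current estimate of 0.84 is a little low and should be increased. More generally, the temporal-difference (TD) update applied after a transition from s to s' is

    U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha \big( R(s) + \gamma \, U^{\pi}(s') - U^{\pi}(s) \big)

where \alpha is the learning rate.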

2. Active Reinforcement Learning:
 A passive learning agent has a fixed policy that determines its behavior. An active
agent must decide what actions to take.
 Let us begin with the adaptive dynamic programming agent and consider how it must
be modified to handle this new freedom.
 First, the agent will need to learn a complete model with outcome probabilities for all
actions, rather than just the model for the fixed policy.
 The simple learning mechanism used by the passive ADP agent will do just fine for this.
 Next, we need to take into account the fact that the agent has a choice of actions. The utilities it needs to learn are those defined by the optimal policy; these obey the full Bellman equations (with a maximization over actions).
 An agent that at each step simply executes the action recommended by the optimal policy for its current learned model is called a greedy agent. What the greedy agent overlooks is that actions do more than provide rewards according to the current learned model; they also contribute to learning the true model by affecting the percepts that are received.
 By improving the model, the agent will receive greater rewards in the future. An agent therefore must make a tradeoff between exploitation to maximize its reward, as reflected in its current utility estimates, and exploration to maximize its long-term well-being.
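
A minimal sketch of one simple way to make this tradeoff, epsilon-greedy action selection (a common illustrative scheme, not necessarily the exploration method the handout has in mind; names and the toy Q estimate are ours):

import random

def epsilon_greedy_action(state, actions, q_estimate, epsilon=0.1):
    """With probability epsilon take a random action (exploration); otherwise
    take the action that looks best under the current estimates (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_estimate(state, a))

# Illustrative use with a toy Q estimate:
q = {("s1", "Left"): 0.2, ("s1", "Right"): 0.5}
print(epsilon_greedy_action("s1", ["Left", "Right"], lambda s, a: q[(s, a)]))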

Generalization in Reinforcement Learning:

 So far, we have assumed that the utility functions and Q-functions learned by the agents
are represented in tabular form with one output value for each input tuple.
 Such an approach works reasonably well for small state spaces, but the time to
convergence and (for ADP) the time per iteration increase rapidly as the space gets larger.
 With carefully controlled, approximate ADP methods, it might be possible to handle
10,000 states or more. This suffices for two-dimensional maze-like environments, but more
realistic worlds are out of the question.

One way to handle such problems is to use function approximation, which simply means using any sort of representation for the Q-function other than a lookup table. The representation is viewed as approximate because it might not be the case that the true utility function or Q-function can be represented in the chosen form.

For example, an evaluation function for chess can be represented as a weighted linear function of a set of features (or basis functions) f1, . . . , fn:
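
The weighted linear form referred to here is (a reconstruction):

    \hat{U}_{\theta}(s) = \theta_1 f_1(s) + \theta_2 f_2(s) + \cdots + \theta_n f_n(s)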

A reinforcement learning algorithm can learn values for the parameters θ = θ1, . . . , θn such that the evaluation function Ûθ approximates the true utility function. Instead of, say, 10^40 values in a table, this function approximator is characterized by, say, n = 20 parameters: an enormous compression.

Policy Search:
 The final approach we will consider for reinforcement learning problems is called policy search. In some ways, policy search is the simplest of all the methods: the idea is to keep twiddling the policy as long as its performance improves, then stop.

Let us begin with the policies themselves.

 Remember that a policy π is a function that maps states to actions. We are interested
primarily in parameterized representations of π that have far fewer parameters than there
are states in the state space (just as in the preceding section).
 For example, we could represent π by a collection of parameterized Q-functions, one for
each action, and take the action with the highest predicted value

Each Q-function could be a linear function of the parameters θ, as sketched below.
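
Reconstructing the two formulas this refers to: the policy picks the action with the highest predicted value, and each Q-function is linear in the parameters:

    \pi(s) = \operatorname{argmax}_a \hat{Q}_{\theta}(s, a)
    \qquad
    \hat{Q}_{\theta}(s, a) = \theta_1 f_1(s, a) + \theta_2 f_2(s, a) + \cdots + \theta_n f_n(s, a)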


 Policy search will then adjust the parameters θ to improve the policy.
 Notice that if the policy is represented by Q-functions, then policy search results in a process that learns Q-functions.
 This process is not the same as Q-learning!
 In Q-learning with function approximation, the algorithm finds a value of θ such that Q̂θ is "close" to Q*, the optimal Q-function.
 Policy search, on the other hand, finds a value of θ that results in good performance; the values found by the two methods may differ very substantially.

Policy search methods often use a stochastic policy representation πθ(s, a), which specifies the probability of selecting action a in state s. One popular representation is the softmax function:
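
The softmax representation referred to here is (a reconstruction):

    \pi_{\theta}(s, a) = \frac{e^{\hat{Q}_{\theta}(s, a)}}{\sum_{a'} e^{\hat{Q}_{\theta}(s, a')}}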

Applications of RL:

1. Robotics:
RL is used in Robot navigation, Robo-soccer, walking, juggling, etc.
2. Control:
RL can be used for adaptive control, such as factory processes and admission control in telecommunications; helicopter piloting is another example of reinforcement learning.
3. Game Playing:
RL can be used in game playing, such as tic-tac-toe, chess, etc.
4. Chemistry:
RL can be used for optimizing chemical reactions.

5. Business:
RL is now used for business strategy planning.
6. Manufacturing:
In various automobile manufacturing companies, the robots use deep
reinforcement learning to pick goods and put them in some containers.
7. Finance Sector:
RL is currently used in the finance sector for evaluating trading
strategies.

NATURAL LANGUAGE PROCESSING:


Introduction:

 There are over a trillion pages of information on the Web, almost all of it in natural
language. An agent that wants to do knowledge acquisition needs to understand (at least
partially) the ambiguous, messy languages that humans use.
 We examine the problem from the point of view of specific information-seeking tasks: text
classification, information retrieval, and information extraction.
 One common factor in addressing these tasks is the use of language models: models that
predict the probability distribution of language expressions.

LANGUAGE MODELS:

We will discuss two types of languages:

- Formal Languages
- Natural Languages

Formal languages, such as the programming languages Java or Python, have precisely
defined language models.

 A language can be defined as a set of strings; "print(2 + 2)" is a legal program in the language Python, whereas "2)+(2 print" is not.
 Since there are an infinite number of legal programs, they cannot be enumerated; instead they are specified by a set of rules called a grammar.
 Formal languages also have rules that define the meaning or semantics of a program; for example, the rules say that the "meaning" of "2+2" is 4, and the meaning of "1/0" is that an error is signaled.

Natural languages, such as English or Spanish, cannot be characterized as a definitive set of


sentences.

 Everyone agrees that "Not to be invited is sad" is a sentence of English, but people disagree on the grammaticality of "To be not invited is sad."

Problems:

o Natural languages are also ambiguous.


o Natural languages are difficult to deal with because they are very large, and
constantly changing.

N-gram character models:

A written text is composed of characters—letters, digits, punctuation, and spaces in English.

Thus, one of the simplest language models is a probability distribution over sequences of characters. We write P(c1:N) for the probability of a sequence of N characters, c1 through cN. In one Web collection, P("the") = 0.027 and P("zgq") = 0.000000002.

Definition: A sequence of written symbols of length n is called an n-gram.

A model of the probability distribution of n-letter sequences is thus called an n-gram model.

An n-gram model is defined as a Markov chain of order n − 1.

Recall that in a Markov chain the probability of character ci depends only on the immediately preceding characters, not on any other characters. So in a trigram model (Markov chain of order 2) we have
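
In symbols (reconstructing the missing formula), a trigram character model assumes

    P(c_i \mid c_{1:i-1}) = P(c_i \mid c_{i-2:i-1})

so the probability of a whole sequence factors as

    P(c_{1:N}) = \prod_{i=1}^{N} P(c_i \mid c_{i-2:i-1})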

What can we do with n-gram character models?

One task for which they are well suited is language identification: given a text, determine what natural language it is written in.

o This is a relatively easy task; even with short texts such as "Hello, world" or "Wie geht es dir," it is easy to identify the first as English and the second as German.
o Computer systems identify languages with greater than 99% accuracy; occasionally, closely related languages, such as Swedish and Norwegian, are confused.
o One approach to language identification is to first build a trigram character model of each candidate language, P(c_i | c_{i-2:i-1}, l), where the variable l ranges over languages.
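
A minimal Python sketch of this approach (the smoothing choice, function names, and tiny example corpora are ours; real models need much more training text):

import math
from collections import defaultdict

def train_char_trigram_model(text, k=1.0):
    """Build a smoothed character-trigram model P(c_i | c_{i-2:i-1}) from text.
    Uses add-k (Laplace-style) smoothing over the observed character set."""
    counts = defaultdict(int)
    context_counts = defaultdict(int)
    chars = set(text)
    padded = "  " + text                       # pad so every character has a context
    for i in range(2, len(padded)):
        ctx, c = padded[i-2:i], padded[i]
        counts[(ctx, c)] += 1
        context_counts[ctx] += 1

    def prob(ctx, c):
        return (counts[(ctx, c)] + k) / (context_counts[ctx] + k * len(chars))

    return prob

def identify_language(text, models):
    """Return the language l maximizing sum_i log P(c_i | c_{i-2:i-1}, l)."""
    padded = "  " + text
    scores = {}
    for lang, prob in models.items():
        scores[lang] = sum(math.log(prob(padded[i-2:i], padded[i]))
                           for i in range(2, len(padded)))
    return max(scores, key=scores.get)

# Illustrative usage:
models = {"en": train_char_trigram_model("hello world how are you"),
          "de": train_char_trigram_model("hallo welt wie geht es dir")}
print(identify_language("wie geht", models))   # expected: "de"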

Smoothing n-gram models:


 The major complication of n-gram models is that the training corpus provides only an
estimate of the true probability distribution.

 For common character sequences such as " th" any English corpus will give a good estimate: about 1.5% of all trigrams.
 On the other hand, " ht" is very uncommon; no dictionary words start with ht. It is likely that the sequence would have a count of zero in a training corpus of standard English.

 Does that mean we should assign P(" th") = 0? If we did, then the text "The program issues an http request" would have an English probability of zero, which seems wrong.

 Just because we have never seen " http" before does not mean that our model should claim that it is impossible. Thus, we will adjust our language model so that sequences that have a count of zero in the training corpus will be assigned a small nonzero probability.

 The process of adjusting the probability of low-frequency counts is called smoothing.

Laplace smoothing -- The simplest type of smoothing was suggested by Pierre-Simon


Laplace in the 18th century: he said that, in the absence of further information, if a random Boolean variable X has been false in all n observations so far, then the estimate for P(X = true) should be 1/(n + 2). That is, he assumes that with two more trials, one might be true and one false.
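
In general, for k observed successes out of n trials Laplace's rule gives P(X = true) = (k + 1)/(n + 2); applied to n-gram counts, add-one smoothing can be written as (a standard formulation, not the handout's own):

    P(c_i \mid c_{i-2:i-1}) = \frac{count(c_{i-2:i}) + 1}{count(c_{i-2:i-1}) + |\Sigma|}

where |\Sigma| is the number of distinct characters.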

A better approach is a backoff model, in which we start by estimating n-gram counts, but for any particular sequence that has a low (or zero) count, we back off to (n − 1)-grams. Linear interpolation smoothing is a backoff model that combines trigram, bigram, and unigram models by linear interpolation. It defines the probability estimate as follows.
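
The probability estimate referred to here is (a reconstruction; the λ values are nonnegative weights that sum to 1 and can themselves be tuned on held-out data):

    \hat{P}(c_i \mid c_{i-2:i-1}) = \lambda_3 P(c_i \mid c_{i-2:i-1}) + \lambda_2 P(c_i \mid c_{i-1}) + \lambda_1 P(c_i),
    \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1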

Model evaluation:

With so many possible n-gram models—unigram, bigram, trigram, interpolated smoothing with
different values of λ, etc.—how do we know what model to choose?

……….We can evaluate a model with cross-validation.

Split the corpus into a training corpus and a validation corpus. Determine the parameters
of the model from the training data. Then evaluate the model on the validation corpus.

N-gram word models:

o All the same mechanisms apply equally to word and character models. The main difference is that the vocabulary (the set of symbols that make up the corpus and the model) is larger.
o There are only about 100 characters in most languages, and sometimes we build character models that are even more restrictive, for example by treating "A" and "a" as the same symbol or by treating all punctuation as the same symbol.
o But with word models we have at least tens of thousands of symbols, and sometimes millions.
Word n-gram models need to deal with out-of-vocabulary words. With character models, we didn't have to worry about someone inventing a new letter of the alphabet. But with word models there is always the chance of a new word that was not seen in the training corpus, so we need to model that explicitly in our language model.

Solution: This can be done by adding just one new word to the vocabulary: <UNK>, standing for the unknown word. We can estimate n-gram counts for <UNK> by this trick: go through the training corpus, and the first time any individual word appears it is previously unknown, so replace it with the symbol <UNK>.

For example, any string of digits might be replaced with <NUM>, or any email address with <EMAIL>.
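
A minimal Python sketch of this preprocessing step (the token names follow the text; details such as the email pattern are illustrative):

import re

def replace_rare_tokens(words):
    """Replace the first occurrence of each word with <UNK>, digit strings with
    <NUM>, and email addresses with <EMAIL>, as described in the text."""
    seen = set()
    out = []
    for w in words:
        if re.fullmatch(r"\d+", w):
            out.append("<NUM>")
        elif re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", w):
            out.append("<EMAIL>")
        elif w not in seen:
            seen.add(w)
            out.append("<UNK>")        # first time we see the word it is "unknown"
        else:
            out.append(w)
    return out

print(replace_rare_tokens("the cat sat on the mat in 1999".split()))
# ['<UNK>', '<UNK>', '<UNK>', '<UNK>', 'the', '<UNK>', '<UNK>', '<NUM>']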

TEXT CLASSIFICATION:
 We now consider in depth the task of text classification, also known as categorization:
given a text of some kind, decide which of a predefined set of classes it belongs to.
 Language identification and genre classification are examples of text classification, as
is sentiment analysis (classifying a movie or product review as positive or negative)
and spam detection (classifying an email message as spam or not-spam).
 Since "not-spam" is awkward, researchers have coined the term ham for not-spam. We can treat spam detection as a problem in supervised learning.

 A training set is readily available: the positive (spam) examples are in my spam folder,
the negative (ham) examples are in my inbox. Here is an excerpt

 From this excerpt we can start to get an idea of what might be good features to include
in the supervised learning model.
 Word n-grams such as "for cheap" and "You can buy" seem to be indicators of spam. Apparently the spammers thought that the word bigram "you deserve" would be too indicative of spam, and thus wrote "yo,u d-eserve" instead.
 A character model should detect this. We could either create a full character n-gram model of spam and ham, or we could handcraft features such as "number of punctuation marks embedded in words."
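
As a concrete illustration of treating spam detection as supervised learning over word features, here is a minimal naive Bayes sketch (the classifier choice, training data, and names are ours, not the handout's):

import math
from collections import Counter

def train_naive_bayes(docs, labels, k=1.0):
    """Train a unigram naive Bayes spam/ham classifier with add-k smoothing."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = set(w for c in word_counts.values() for w in c)

    def classify(doc):
        scores = {}
        for label in ("spam", "ham"):
            total = sum(word_counts[label].values())
            score = math.log(class_counts[label] / len(docs))    # class prior
            for w in doc.lower().split():
                score += math.log((word_counts[label][w] + k) /
                                  (total + k * len(vocab)))      # smoothed likelihood
            scores[label] = score
        return max(scores, key=scores.get)

    return classify

# Illustrative training data
classify = train_naive_bayes(
    ["You can buy viagra for cheap", "meeting agenda for tomorrow",
     "cheap loans you deserve", "lunch tomorrow?"],
    ["spam", "ham", "spam", "ham"])
print(classify("buy cheap viagra"))   # expected: "spam"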

Classification by data compression:

 Another way to think about classification is as a problem in data compression.


 A lossless compression algorithm takes a sequence of symbols, detects repeated patterns
in it, and writes a description of the sequence that is more compact than the original.
 For example, the text "0.142857142857142857" might be compressed to "0.[142857]*3." Compression algorithms work by building dictionaries of subsequences of the text, and then referring to entries in the dictionary.
 The example here had only one dictionary entry, "142857."
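
A minimal sketch of classification by compression: append the new message to each class's training text, compress, and choose the class that adds the fewest bytes (zlib is used here as an off-the-shelf LZ-style compressor; the corpora are illustrative):

import zlib

def classify_by_compression(message, class_corpora):
    """Assign message to the class whose training corpus compresses it best,
    i.e. where appending the message adds the fewest compressed bytes."""
    best_class, best_cost = None, float("inf")
    for label, corpus in class_corpora.items():
        base = len(zlib.compress(corpus.encode()))
        combined = len(zlib.compress((corpus + " " + message).encode()))
        cost = combined - base          # extra bytes needed to encode the message
        if cost < best_cost:
            best_class, best_cost = label, cost
    return best_class

corpora = {"spam": "buy cheap viagra now cheap loans you deserve",
           "ham": "meeting agenda for tomorrow lunch project report"}
print(classify_by_compression("cheap viagra", corpora))   # expected: "spam"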

INFORMATION RETRIEVAL:
Definition: Information retrieval is the task of finding documents that are relevant to a user’s
need for information.

Ex: The best-known examples of information retrieval systems are search engines on the World
Wide Web. A Web user can type a query such as [AI book] into a search engine and see a list of
relevant pages.

An information retrieval (IR) system can be characterized by

1. A corpus of documents. Each system must decide what it wants to treat as a document: a
paragraph, a page, or a multipage text.

2. Queries posed in a query language. A query specifies what the user wants to know. The
query language can be just a list of words, such as [AI book]; or it can specify a phrase of words that must be adjacent, as in ["AI book"]; it can contain Boolean operators as in [AI AND book]; it can include non-Boolean operators such as [AI NEAR book] or [AI book site:www.aaai.org].

3. A result set. This is the subset of documents that the IR system judges to be relevant to the query. By relevant, we mean likely to be of use to the person who posed the query, for the particular information need expressed in the query.

4. A presentation of the result set. This can be as simple as a ranked list of document titles or as complex as a rotating color map of the result set projected onto a three-dimensional space, rendered as a two-dimensional display.

Boolean Keyword Model: The earliest IR systems worked on a Boolean keyword model.
Each word in the document collection is treated as a Boolean feature that is true of a document if
the word occurs in the document and false if it does not.

Advantages: This model has the advantage of being simple to explain and implement.

Disadvantages:

1. The degree of relevance of a document is a single bit, so there is no guidance as to how to order the relevant documents for presentation.
2. Boolean expressions are unfamiliar to users who are not programmers or logicians.
3. It can be hard to formulate an appropriate query, even for a skilled user.

IR scoring functions:

Definition: A scoring function takes a document and a query and returns a numeric score; the
most relevant documents have the highest scores.

In the BM25 function, the score is a linear weighted combination of scores for each of the words
that make up the query.

Three factors affect the weight of a query term:

i. TF: The frequency with which a query term appears in a document (also known as TF for term frequency).
Ex: For the query [farming in Kansas], documents that mention "farming" frequently will have higher scores.
ii. IDF: The inverse document frequency of the term, or IDF. The word "in" appears in almost every document, so it has a high document frequency, and thus a low inverse document frequency, and thus it is not as important to the query as "farming" or "Kansas."
iii. Length: The length of the document. A million-word document will probably mention all the query words, but may not actually be about the query. A short document that mentions all the words is a much better candidate.
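
The BM25 formula itself is missing from this copy; in its standard form (with tuning parameters k around 2.0 and b around 0.75, |d_j| the length of document d_j, and L the average document length in the corpus) it is:

    BM25(d_j, q_{1:N}) = \sum_{i=1}^{N} IDF(q_i) \cdot \frac{TF(q_i, d_j) \cdot (k + 1)}{TF(q_i, d_j) + k \cdot \left(1 - b + b \cdot \frac{|d_j|}{L}\right)}

with IDF(q_i) = \log \frac{N_{docs} - DF(q_i) + 0.5}{DF(q_i) + 0.5}, where DF(q_i) is the number of documents that contain the term q_i.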

IR system evaluation:
How do we know whether an IR system is performing well?
Traditionally, there have been two measures used in the scoring: recall and precision. We
explain them with the help of an example. Imagine that an IR system has returned a result
set for a single query, for which we know which documents are and are not relevant, out
of a corpus of 100 documents. The document counts in each category are given in the
following table:

                      In result set    Not in result set
    Relevant                30                 20
    Not relevant            10                 40

Precision measures the proportion of documents in the result set that are actually relevant.
Ex: In our example, the precision is 30/(30 + 10) = .75. The false positive rate is 1 − .75 = .25.

Recall measures the proportion of all the relevant documents in the collection that are in the
result set.
Ex: In our example, recall is 30/(30 + 20) = .60. The false negative rate is 1 − .60 = .40.

In a very large document collection, such as the World Wide Web, recall is difficult to
compute, because there is no easy way to examine every page on the Web for relevance. All
we can do is either estimate recall by sampling or ignore recall completely and just judge
precision. In the case of a Web search engine, there may be thousands of documents in the
result set, so it makes more sense to measure precision for several different sizes, such as
"P@10" (precision in the top 10 results) or "P@50," rather than to estimate precision in the
entire result set.

IR refinements:
One common refinement is a better model of the effect of document length on relevance, such as a pivoted document length normalization scheme; the idea is that the pivot is the document length at which the old-style normalization is correct; documents shorter than that get a boost and longer ones get a penalty.
The BM25 scoring function uses a word model that treats all words as completely independent, but we know that some words are correlated: "couch" is closely related to both "couches" and "sofa." Many IR systems attempt to account for these correlations.

STEMMING :
 For example, if the query is [couch], it would be a shame to exclude from the result set those documents that mention "COUCH" or "couches" but not "couch."
 Most IR systems do case folding of "COUCH" to "couch," and some use a stemming algorithm to reduce "couches" to the stem form "couch," both in the query and the documents.
 This typically yields a small increase in recall (on the order of 2% for English). However, it can harm precision.
 For example, stemming "stocking" to "stock" will tend to decrease precision for queries about either foot coverings or financial instruments, although it could improve recall for

queries about warehousing. Stemming algorithms based on rules (e.g., remove "-ing") cannot avoid this problem, but algorithms based on dictionaries (don't remove "-ing" if the word is already listed in the dictionary) can.
 While stemming has a small effect in English, it is more important in other languages.

SYNONYM:
The next refinement is to recognize synonyms, such as "sofa" for "couch." As with stemming, this has the potential for small gains in recall, but can hurt precision. A user who gives the query [Tim Couch] wants to see results about the football player, not sofas.
It is rare for two words to be exact synonyms: anytime there are two words that mean the same thing, speakers of the language conspire to evolve the meanings to remove the confusion. Related words that are not synonyms also play an important role in ranking: terms like "leather," "wooden," or "modern" can serve to confirm that the document really is about "couch." Synonyms and related words can be found in dictionaries or by looking for correlations in documents or in queries: if we find that many users who ask the query [new sofa] follow it up with the query [new couch], we can in the future alter [new sofa] to be [new sofa OR new couch].

METADATA:
As a final refinement, IR can be improved by considering metadata: data outside of the text of the document. Examples include human-supplied keywords and publication data. On the Web, hypertext links between documents are a crucial source of information.

The PageRank algorithm:


 PageRank was invented to solve the problem of the tyranny of TF scores: if the query is [IBM], how do we make sure that IBM's home page, ibm.com, is the first result, even if another page mentions the term "IBM" more frequently?
 The idea is that ibm.com has many in-links (links to the page), so it should be ranked
higher: each in-link is a vote for the quality of the linked-to page.
 But if we only counted in-links, then it would be possible for a Web spammer to create a
network of pages and have them all point to a page of his choosing, increasing the score
of that page.

 Therefore, the PageRank algorithm is designed to weight links from high-quality sites more heavily.
 What is a high-quality site? One that is linked to by other high-quality sites. The definition is recursive, but we will see that the recursion bottoms out properly. The PageRank for a page p is defined as:
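
The formula itself is missing from this copy; the standard PageRank definition (with a damping factor d, typically around 0.85, which is not described in the explanation below) is:

    PR(p) = \frac{1 - d}{N} + d \sum_{i} \frac{PR(in_i)}{C(in_i)}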

where PR(p) is the PageRank of page p, N is the total number of pages in the corpus, in_i are the pages that link in to p, and C(in_i) is the count of the total number of out-links on page in_i.

The HITS algorithm:


The Hyperlink-Induced Topic Search algorithm, also known as "Hubs and Authorities" or HITS, is another influential link-analysis algorithm.
 HITS differs from PageRank in several ways. First, it is a query-dependent measure:
it rates pages with respect to a query.
 That means that it must be computed anew for each query—a computational burden
that most search engines have elected not to take on.
 Given a query, HITS first finds a set of pages that are relevant to the query.
 It does that by intersecting hit lists of query words, and then adding pages in the link
neighborhood of these pages—pages that link to or are linked from one of the pages
in the original relevant set

Question answering:
Information retrieval is the task of finding documents that are relevant to a query, where the
query may be a question, or just a topic area or concept. Question answering is a somewhat different task, in which the query really is a question, and the answer is not a ranked list of documents but rather a short response: a sentence, or even just a phrase.

The ASKMSR system (Banko et al., 2002) is a typical Web-based question-answering
system. It is based on the intuition that most questions will be answered many times on the
Web, so question answering should be thought of as a problem in precision, not recall. We
don’t have to deal with all the different ways that an answer might be phrased—we only have
to find one of them. For example, consider the query [Who killed Abraham Lincoln?]
Suppose a system had to answer that question with access only to a single encyclopedia,
whose entry on Lincoln said
“ John Wilkes Booth altered history with a bullet. He will forever be known as the man who
ended Abraham Lincoln’s life.”

INFORMATION EXTRACTION:
Definition:
Information extraction is the process of acquiring knowledge by skimming a text and looking
for occurrences of a particular class of object and for relationships among objects.
 A typical task is to extract instances of addresses from Web pages, with database
fields for street, city, state, and zip code; or instances of storms from weather reports,
with fields for temperature, wind speed, and precipitation.
 In a limited domain, this can be done with high accuracy. As the domain gets more
general, more complex linguistic models and more complex learning techniques are
necessary.

Here we have approaches to information extraction, in order of increasing complexity on several


dimensions: deterministic to stochastic, domain-specific to general, hand-crafted to learned, and
small-scale to large-scale.

i. Finite-state automata for information extraction:


 The simplest type of information extraction system is an attribute-based extraction
system that assumes that the entire text refers to a single object and the task is to extract
attributes of that object.

Ex: For example, extracting from the text "IBM ThinkBook 970. Our price: $399.00" the set of attributes {Manufacturer=IBM, Model=ThinkBook970, Price=$399.00}.

We can address this problem by defining a template (also known as a pattern) for each
attribute we would like to extract. The template is defined by a finite state automaton,
the simplest example of which is the regular expression, or regex. Regular expressions
are used in Unix commands such as grep, in programming languages such as Perl, and in
word processors such as Microsoft Word.
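
A minimal sketch of an attribute-based extraction template using regular expressions (the patterns and attribute names below are illustrative, not the book's exact templates):

import re

# Hypothetical templates: each attribute is defined by a regular expression.
templates = {
    "Price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "Manufacturer": re.compile(r"\b(IBM|Dell|HP|Lenovo)\b"),
}

def extract_attributes(text):
    """Return the first match of each attribute template found in the text."""
    attributes = {}
    for name, pattern in templates.items():
        m = pattern.search(text)
        if m:
            attributes[name] = m.group(0)
    return attributes

print(extract_attributes("IBM ThinkBook 970. Our price: $399.00"))
# {'Price': '$399.00', 'Manufacturer': 'IBM'}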

One step up from attribute-based extraction systems are relational extraction systems, which deal with multiple objects and the relations among them. Thus, when these systems see the text "$249.99," they need to determine not just that it is a price, but also which object has that price. A typical relational-based extraction system is FASTUS, which handles news stories about corporate mergers and acquisitions.

FASTUS consists of five stages:

1. Tokenization

2. Complex-word handling

3. Basic-group handling

4. Complex-phrase handling

5. Structure merging

1. Tokenization:

FASTUS's first stage is tokenization, which segments the stream of characters into tokens (words, numbers, and punctuation). For English, tokenization can be fairly simple; just separating characters at white space or punctuation does a fairly good job. Some tokenizers also deal with markup languages such as HTML, SGML, and XML.

2. Complex-word handling:

The second stage handles complex words, including collocations such as "set up" and "joint venture," as well as proper names such as "Bridgestone Sports Co." These are recognized by a combination of lexical entries and finite-state grammar rules. For example, a company name might be recognized by the rule CapitalizedWord+ ("Company" | "Co" | "Inc" | "Ltd").
3. Basic-group handling:

The third stage handles basic groups, meaning noun groups and verb groups. The idea is to chunk these into units that will be managed by the later stages. We will see how to write a complex description of noun and verb phrases in Chapter 23, but here we have simple rules that only approximate the complexity of English, yet have the advantage of being representable by finite-state automata. The example sentence would emerge from this stage as the following sequence of tagged groups:

Here NG means noun group, VG is verb group, PR is preposition, and CJ is conjunction.


4. Complex-phrase handling:

The fourth stage combines the basic groups into complex phrases. Again, the aim is to have rules that are finite-state and thus can be processed quickly, and that result in unambiguous (or nearly unambiguous) output phrases. One type of combination rule deals with domain-specific events. For example, the rule Company+ SetUp JointVenture ("with" Company+)?

5. Structure merging:

The final stage merges structures that were built up in the previous step. If the next sentence says "The joint venture will start production in January," then this step will notice that there are two references to a joint venture, and that they should be merged into one.
ii. Probabilistic models for information extraction:
When information extraction must be attempted from noisy or varied input, simple finite-state
approaches fare poorly. It is too hard to get all the rules and their priorities right; it is better to use
a probabilistic model rather than a rule-based model. The simplest probabilistic model for
sequences with hidden state is the hidden Markov model, or HMM.
To apply HMMs to information extraction, we can either build one big HMM for all the attributes
or build a separate HMM for each attribute.

HMMs have two big advantages over FSAs for extraction.

First, HMMs are probabilistic, and thus tolerant to noise. In a regular expression, if a single
expected character is missing, the regex fails to match; with HMMs there is graceful degradation
with missing characters/words, and we get a probability indicating the degree of match, not just a
Boolean match/fail.
Second, HMMs can be trained from data; they don’t require laborious engineering of templates,
and thus they can more easily be kept up to date as text changes over time.

iii. Conditional random fields for information extraction:

 One issue with HMMs for the information extraction task is that they model a lot of
probabilities that we don’t really need.
 An HMM is a generative model; it models the full joint probability of observations and
hidden states, and thus can be used to generate samples. That is, we can use the HMM model
not only to parse a text and recover the speaker and date, but also to generate a random
instance of a text containing a speaker and a date.
 Since we’re not interested in that task, it is natural to ask whether we might be better off with a
model that doesn’t bother modeling that possibility.
 All we need in order to understand a text is a discriminative model, one that models the
conditional probability of the hidden attributes given the observations (the text).
 A framework for this type of model is the conditional random field, or CRF, which models a
conditional probability distribution of a set of target variables given a set of observed
variables.
 Like Bayesian networks, CRFs can represent many different structures of dependencies
among the variables. One common structure is the linear-chain conditional random field for
representing Markov dependencies among variables in a temporal sequence.
 Thus, HMMs are the temporal version of naive Bayes models, and linear-chain CRFs are the
temporal version of logistic regression, where the predicted target is an entire state sequence
rather than a single binary variable.

********************************

