UNIT-III
REINFORCEMENT LEARNING
SYLLABUS:
Reinforcement Learning: Introduction, Passive Reinforcement Learning, Active
Reinforcement Learning, Generalization in Reinforcement Learning, Policy
Search, applications of RL
*******************************
Reinforcement Learning:
Introduction:
Machine learning enables a machine to automatically learn from data, improve its performance
from experience, and make predictions without being explicitly programmed. There are three
broad types of machine learning:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1. Supervised Machine Learning:
Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output.
Labelled data means that some input data is already tagged with the correct output.
In supervised learning, models are trained using a labelled dataset, where the model learns
about each type of data.
Once the training process is completed, the model is tested on the basis of test data (a
held-out portion of the dataset), and it then predicts the output.
2. Unsupervised Machine Learning:
Unsupervised learning is a machine learning technique in which models are not supervised using a
training dataset. Instead, the model itself finds the hidden patterns and insights in the given data.
It can be compared to the learning that takes place in the human brain when learning new things.
It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using an unlabelled
dataset and are allowed to act on that data without any supervision.
3. Reinforcement Learning:
Reinforcement learning is a feedback-based learning technique in which an agent learns by
interacting with an environment, receiving rewards for good actions and penalties for bad ones.
A robotic dog that automatically learns the movement of its arms is an example of
reinforcement learning.
The agent interacts with the environment and explores it by itself. The primary goal of an
agent in reinforcement learning is to improve the performance by getting the maximum
positive rewards.
The agent learns through trial and error, and based on this experience, it learns to
perform the task in a better way. Hence, we can say that reinforcement learning is a type of
machine learning in which an agent learns from its own experience of interacting with an
environment.
It is a core part of Artificial Intelligence, and many AI agents work on the concept of
reinforcement learning. Here we do not need to pre-program the agent, as it learns from
its own experience without any human intervention.
Example: Suppose there is an AI agent present within a maze environment, and its goal
is to find the diamond. The agent interacts with the environment by performing some
actions, and based on those actions, the state of the agent gets changed, and it also
receives a reward or penalty as feedback.
o The agent continues doing these three things (take action, change state or remain in the
same state, and get feedback), and by doing these actions, it learns and explores the
environment.
o The agent learns which actions lead to positive feedback (rewards) and which actions
lead to negative feedback (penalties). As a positive reward, the agent gets a positive point,
and as a penalty, it gets a negative point.
Terms used in Reinforcement Learning
o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which the agent is present or by which it is surrounded. In RL, we
assume a stochastic environment, which means it is random in nature.
o Action: Actions are the moves taken by an agent within the environment.
o State: The situation returned by the environment after each action taken by the agent.
o Reward: Feedback returned to the agent from the environment to evaluate the agent's action.
o Policy: A strategy applied by the agent to decide the next action based on the current state.
o Value: The expected long-term return with discounting, as opposed to the short-term reward.
o Q-value: Mostly similar to the value, but it takes one additional parameter, the current action (a).
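To make these terms concrete, here is a minimal Python sketch of the agent-environment loop for the maze example above; the MazeEnvironment class, its grid size, and its reward values are illustrative assumptions, not part of any standard library.

import random

class MazeEnvironment:
    """A hypothetical 4x3 grid maze; the diamond (goal) is at (4, 3)."""
    def __init__(self):
        self.state = (1, 1)

    def step(self, action):
        # Apply the action and return (next_state, reward, done).
        x, y = self.state
        moves = {"Up": (x, y + 1), "Down": (x, y - 1),
                 "Left": (x - 1, y), "Right": (x + 1, y)}
        nx, ny = moves[action]
        # Stay in place if the move would leave the 4x3 grid.
        if 1 <= nx <= 4 and 1 <= ny <= 3:
            self.state = (nx, ny)
        if self.state == (4, 3):          # diamond reached: positive reward
            return self.state, +1.0, True
        return self.state, -0.04, False   # small step penalty otherwise

def random_policy(state):
    # A policy maps the current state to an action (random here).
    return random.choice(["Up", "Down", "Left", "Right"])

env = MazeEnvironment()
state, done, total_reward = env.state, False, 0.0
while not done:
    action = random_policy(state)            # the agent acts ...
    state, reward, done = env.step(action)   # ... the environment returns a state and reward
    total_reward += reward
print("Return of this trial:", total_reward)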
Approaches to implement Reinforcement Learning
There are mainly three ways to implement reinforcement learning in ML, which are:
1. Value-based:
The value-based approach is about finding the optimal value function, which gives the
maximum value at a state under any policy. With this approach, the agent expects the
long-term return at any state s under policy π.
2. Policy-based:
The policy-based approach is to find the optimal policy for the maximum future rewards
without using the value function. In this approach, the agent tries to apply such a policy
that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no particular
solution or algorithm for this approach because the model representation is different for
each environment.
Reinforcement learning tasks come in two kinds:
o Passive learning, where the agent's policy is fixed and the task is to learn the utilities of
states (or state–action pairs); this could also involve learning a model of the environment.
o Active learning, where the agent must also learn what to do. The principal issue is
exploration: an agent must experience as much as possible of its environment in order to
learn how to behave in it.
1. Passive Reinforcement Learning:
In passive learning, the agent's policy π is fixed: in state s, it always executes the action
π(s). Its goal is simply to learn how good the policy is—that is, to learn the utility
function Uπ(s).
The agent executes a set of trials in the environment using its policy π. In each trial, the agent
starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the
terminal states, (4,2) or (4,3). Its percepts supply both the current state and the reward received in
that state. Typical trials might look like this:
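The trials themselves are not reproduced in these notes. Assuming the usual step reward of −0.04 for nonterminal states (which is what makes the sample values quoted below come out to 0.72, 0.76, 0.84, and so on), a first trial would look like:
(1,1)−0.04 → (1,2)−0.04 → (1,3)−0.04 → (1,2)−0.04 → (1,3)−0.04 → (2,3)−0.04 → (3,3)−0.04 → (4,3)+1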
The object is to use the information about rewards to learn the expected utility Uπ(s) associated
with each nonterminal state s. There are three approaches to doing this:
i. Direct utility estimation
ii. Adaptive dynamic programming
iii. Temporal-difference learning
i. Direct utility estimation:
The idea is that the utility of a state is the expected total reward from that state onward (called the
expected reward-to-go), and each trial provides a sample of this quantity for each state visited.
For example: the first trial in the set of three given earlier provides a sample total reward of 0.72
for state (1,1), two samples of 0.76 and 0.84 for (1,2), two samples of 0.80 and 0.88 for (1,3),
and so on. Thus, at the end of each sequence, the algorithm calculates the observed reward-to-go
for each state and updates the estimated utility for that state accordingly, just by keeping a
running average for each state in a table.
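A minimal Python sketch of direct utility estimation is given below. It keeps a running average of the observed reward-to-go for each state, exactly as described above; the trial format (a list of (state, reward) pairs) and the example numbers are assumptions for illustration.

from collections import defaultdict

# Running-average tables: sum of observed rewards-to-go and visit counts per state.
returns_sum = defaultdict(float)
returns_count = defaultdict(int)
utility = {}

def update_from_trial(trial):
    """trial is a list of (state, reward) pairs from the start state to a terminal state."""
    reward_to_go = 0.0
    # Walk the trial backwards so each state's reward-to-go is easy to accumulate.
    for state, reward in reversed(trial):
        reward_to_go += reward
        returns_sum[state] += reward_to_go
        returns_count[state] += 1
        utility[state] = returns_sum[state] / returns_count[state]

# Example: the first trial described above.
trial1 = [((1,1), -0.04), ((1,2), -0.04), ((1,3), -0.04), ((1,2), -0.04),
          ((1,3), -0.04), ((2,3), -0.04), ((3,3), -0.04), ((4,3), +1.0)]
update_from_trial(trial1)
print(utility[(1, 1)])   # 0.72, the sample reward-to-go for (1,1)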
ii. Adaptive dynamic programming (ADP):
An ADP agent learns the transition model of the environment and solves the corresponding
Markov decision process using a dynamic programming method.
For a passive learning agent, this means plugging the learned transition model P(s′ | s, π(s))
and the observed rewards R(s) into the Bellman equations to calculate the utilities of the
states.
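For a fixed policy π, the Bellman equations referred to here take the form
Uπ(s) = R(s) + γ Σs′ P(s′ | s, π(s)) Uπ(s′),
where γ is the discount factor; the ADP agent re-solves this system of linear equations (one equation per state) whenever the learned model changes.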
Alternatively, we can adopt the approach of modified policy iteration using a simplified
value iteration process to update the utility estimates after each change to the learned
model.
Because the model usually changes only slightly with each observation, the value
iteration process can use the previous utility estimates as initial values and should
converge quite quickly.
We keep track of how often each action outcome occurs and estimate the transition
probability P(s′ | s, a) from the frequency with which s′ is reached when executing a in s.
For example, in the three trials given earlier, Right is executed three times in (1,3)
and two out of three times the resulting state is (2,3), so P((2,3) | (1,3), Right) is
estimated to be 2/3.
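A minimal Python sketch of this frequency-counting estimate follows; the helper names are illustrative only.

from collections import defaultdict

# counts[(s, a)][s2] = number of times action a taken in s led to s2
counts = defaultdict(lambda: defaultdict(int))
totals = defaultdict(int)

def record_transition(s, a, s2):
    counts[(s, a)][s2] += 1
    totals[(s, a)] += 1

def transition_prob(s, a, s2):
    """Estimated P(s' | s, a) from observed frequencies."""
    if totals[(s, a)] == 0:
        return 0.0
    return counts[(s, a)][s2] / totals[(s, a)]

# The example from the text: Right executed three times in (1,3), reaching (2,3) twice.
record_transition((1, 3), "Right", (2, 3))
record_transition((1, 3), "Right", (2, 3))
record_transition((1, 3), "Right", (1, 3))   # the third outcome is illustrative
print(transition_prob((1, 3), "Right", (2, 3)))   # 0.666...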
iii. Temporal-difference learning:
Solving the underlying MDP as in the preceding section is not the only way to bring the
Bellman equations to bear on the learning problem.
Another way is to use the observed transitions to adjust the utilities of the observed states
so that they agree with the constraint equations.
Consider, for example, the transition from (1,3) to (2,3) in the second trial given earlier.
Suppose that, as a result of the first trial, the utility estimates are Uπ(1,3) = 0.84 and
Uπ(2,3) = 0.92. Now, if this transition occurred all the time, we would expect the utilities to obey
the equation
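With the step reward of −0.04 used in the standard 4×3 world, this constraint equation is
Uπ(1,3) = −0.04 + Uπ(2,3),
so Uπ(1,3) would be 0.88, and the current estimate of 0.84 is a little low and should be increased. More generally, when a transition occurs from s to s′, temporal-difference learning applies the update
Uπ(s) ← Uπ(s) + α ( R(s) + γ Uπ(s′) − Uπ(s) ),
where α is the learning rate. Because the update uses the difference in utilities between successive states, this is called the temporal-difference (TD) equation.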
2. Active Reinforcement Learning:
A passive learning agent has a fixed policy that determines its behavior. An active
agent must decide what actions to take.
Let us begin with the adaptive dynamic programming agent and consider how it must
be modified to handle this new freedom.
First, the agent will need to learn a complete model with outcome probabilities for all
actions, rather than just the model for the fixed policy.
The simple learning mechanism used by the passive ADP agent will do just fine for this.
Next, we need to take into account the fact that the agent has a choice of actions. The
utilities it needs to learn are those defined by the optimal policy; these obey the Bellman
equations.
An agent that always chooses the action its current learned model says is best is called a
greedy agent. What the greedy agent overlooks is that actions do more than provide rewards
according to the current learned model; they also contribute to learning the true model
by affecting the percepts that are received.
By improving the model, the agent will receive greater rewards in the future. An
agent therefore must make a tradeoff between exploitation to maximize its reward—
as reflected in its current utility estimates—and exploration to maximize its long-
term well-being.
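One simple way to implement this tradeoff (a sketch, not the only or best scheme) is ε-greedy action selection: with probability ε the agent explores by choosing a random action, and otherwise it exploits its current estimates. The Q table and action names below are illustrative assumptions.

import random

EPSILON = 0.1   # exploration rate

def epsilon_greedy(Q, state, actions, epsilon=EPSILON):
    """Pick a random action with probability epsilon (exploration),
    otherwise the action with the highest current Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Example usage with a toy Q table:
Q = {((1, 3), "Right"): 0.88, ((1, 3), "Up"): 0.80}
print(epsilon_greedy(Q, (1, 3), ["Up", "Down", "Left", "Right"]))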
Generalization in Reinforcement Learning:
So far, we have assumed that the utility functions and Q-functions learned by the agents
are represented in tabular form with one output value for each input tuple.
Such an approach works reasonably well for small state spaces, but the time to
convergence and (for ADP) the time per iteration increase rapidly as the space gets larger.
With carefully controlled, approximate ADP methods, it might be possible to handle
10,000 states or more. This suffices for two-dimensional maze-like environments, but more
realistic worlds are out of the question.
One way to handle such problems is to use function approximation, which simply means using
any sort of representation for the Q-function other than a lookup table. The representation is
viewed as approximate because it might not be the case that the true utility function or
Q-function can be represented in the chosen form.
For example, an evaluation function for chess can be represented as a weighted linear function of
a set of features (or basis functions) f1, ..., fn:
Ûθ(s) = θ1 f1(s) + θ2 f2(s) + · · · + θn fn(s)
A reinforcement learning algorithm can learn values for the parameters θ = θ1, ..., θn such that the
evaluation function Ûθ approximates the true utility function. Instead of, say, 10^40 values in a table, this
function approximator is characterized by, say, n = 20 parameters—an enormous compression.
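One standard way to adjust the parameters after each trial (a sketch of the usual approach, not the only one) is the Widrow-Hoff or delta rule: if uj(s) is the observed reward-to-go from state s in trial j, each parameter is nudged in the direction that reduces the prediction error,
θi ← θi + α ( uj(s) − Ûθ(s) ) fi(s),
where α is the learning rate. The same idea extends to temporal-difference learning by replacing the observed reward-to-go with R(s) + γ Ûθ(s′).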
Policy Search:
The final approach we will consider for reinforcement learning problems is called policy
search. In some ways, policy search is the simplest of all the methods: the idea is to keep
twiddling the policy as long as its performance improves, then stop.
Remember that a policy π is a function that maps states to actions. We are interested
primarily in parameterized representations of π that have far fewer parameters than there
are states in the state space (just as in the preceding section).
For example, we could represent π by a collection of parameterized Q-functions, one for
each action, and take the action with the highest predicted value: π(s) = argmax_a Q̂θ(s, a).
Policy search methods often use a stochastic policy representation πθ(s, a), which specifies
the probability of selecting action a in state s. One popular representation is the softmax
function:
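A common form of this representation, assuming parameterized Q-functions Q̂θ(s, a), is
πθ(s, a) = e^Q̂θ(s,a) / Σa′ e^Q̂θ(s,a′),
which gives every action a nonzero probability while favoring actions with higher estimated value.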
Applications of RL:
1. Robotics:
RL is used in Robot navigation, Robo-soccer, walking, juggling, etc.
2. Control:
RL can be used for adaptive control, such as factory processes and admission
control in telecommunications; helicopter piloting is another example of
reinforcement learning.
3. Game Playing:
RL can be used in Game playing such as tic-tac-toe, chess, etc.
4. Chemistry:
RL can be used for optimizing chemical reactions.
5. Business:
RL is now used for business strategy planning.
6. Manufacturing:
In various automobile manufacturing companies, the robots use deep
reinforcement learning to pick goods and put them in some containers.
7. Finance Sector:
RL is currently used in the finance sector for evaluating trading
strategies.
NATURAL LANGUAGE PROCESSING:
There are over a trillion pages of information on the Web, almost all of it in natural
language. An agent that wants to do knowledge acquisition needs to understand (at least
partially) the ambiguous, messy languages that humans use.
We examine the problem from the point of view of specific information-seeking tasks: text
classification, information retrieval, and information extraction.
One common factor in addressing these tasks is the use of language models: models that
predict the probability distribution of language expressions.
LANGUAGE MODELS:
- Formal Languages
- Natural Languages
Formal languages, such as the programming languages Java or Python, have precisely
defined language models.
A language can be defined as a set of strings; "print(2 + 2)" is a legal program in the
language Python, whereas "2)+(2 print" is not.
Since there are an infinite number of legal programs, they cannot be enumerated; instead
they are specified by a set of rules called a grammar.
Formal languages also have rules that define the meaning or semantics of a program; for
example, the rules say that the "meaning" of "2 + 2" is 4, and the meaning of "1/0" is that an
error is signaled.
Natural languages, such as English or Spanish, cannot be characterized so precisely.
Everyone agrees that "Not to be invited is sad" is a sentence of English, but people
disagree on the grammaticality of "To be not invited is sad."
N-gram character models:
Thus, one of the simplest language models is a probability distribution over sequences of
characters. We write P(c1:N) for the probability of a sequence of N characters, c1 through cN.
In one Web collection, P("the") = 0.027 and P("zgq") = 0.000000002.
A model of the probability distribution of n-letter sequences is called an n-gram model.
An n-gram model is defined as a Markov chain of order n − 1. Recall
that in a Markov chain the probability of character ci depends only on the immediately preceding
characters, not on any other characters. So in a trigram model (Markov chain of order 2) we have
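P(ci | c1:i−1) = P(ci | ci−2:i−1);
that is, the probability of a character given all the previous characters reduces to the probability given just the two preceding characters.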
One task for which n-gram character models are well suited is language identification: given a
text, determine what natural language it is written in.
o This is a relatively easy task; even with short texts such as "Hello, world" or "Wie geht
es dir," it is easy to identify the first as English and the second as German.
o Computer systems identify languages with greater than 99% accuracy; occasionally,
closely related languages, such as Swedish and Norwegian, are confused.
o One approach to language identification is to first build a trigram character model of each
candidate language, P(ci | ci−2:i−1, l), where the variable l ranges over languages; a small
sketch of this approach follows.
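A minimal sketch of this approach in Python is given below; it uses simple add-one smoothing rather than the backoff models described later, and the tiny training corpora are placeholders.

from collections import Counter
import math

def char_model(text):
    """Trigram and bigram character counts for one language's training corpus."""
    padded = "  " + text.lower()
    tri = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    bi = Counter(padded[i:i + 2] for i in range(len(padded) - 1))
    return tri, bi

def log_prob(text, model, vocab_size=30):
    """Approximate log P(text | language) using add-one smoothed
    conditional trigram probabilities P(ci | ci-2:i-1, l)."""
    tri_counts, bi_counts = model
    padded = "  " + text.lower()
    score = 0.0
    for i in range(len(padded) - 2):
        context, trigram = padded[i:i + 2], padded[i:i + 3]
        p = (tri_counts[trigram] + 1) / (bi_counts[context] + vocab_size)
        score += math.log(p)
    return score

# Placeholder corpora; real models would be trained on large texts of each language.
models = {"English": char_model("hello world how are you the cat sat on the mat"),
          "German": char_model("hallo welt wie geht es dir die katze sitzt auf der matte")}

text = "Wie geht es dir"
print(max(models, key=lambda lang: log_prob(text, models[lang])))   # most probable language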
For common character sequences such as " th" any English corpus will give a good
estimate: about 1.5% of all trigrams.
On the other hand, " ht" is very uncommon—no dictionary words start with ht. It is likely
that the sequence would have a count of zero in a training corpus of standard English.
Does that mean we should assign P(" ht") = 0? If we did, then the text "The program issues
an http request" would have an English probability of zero, which seems wrong.
Just because we have never seen " ht" before does not mean that our model should
claim that it is impossible. Thus, we will adjust our language model so that sequences that
have a count of zero in the training corpus will be assigned a small nonzero probability.
A better approach is a backoff model, in which we start by estimating n-gram counts, but for
any particular sequence that has a low (or zero) count, we back off to (n − 1)-grams. Linear
interpolation smoothing is a backoff model that combines trigram, bigram, and unigram models
by linear interpolation. It defines the probability estimate as
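P̂(ci | ci−2:i−1) = λ3 P(ci | ci−2:i−1) + λ2 P(ci | ci−1) + λ1 P(ci),
where λ1 + λ2 + λ3 = 1 and the λ values weight the trigram, bigram, and unigram estimates.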
Model evaluation:
With so many possible n-gram models—unigram, bigram, trigram, interpolated smoothing with
different values of λ, etc.—how do we know what model to choose?
Split the corpus into a training corpus and a validation corpus. Determine the parameters
of the model from the training data. Then evaluate the model on the validation corpus.
N-gram word models:
o All the same mechanisms apply equally to word and character models.
The main difference is that the vocabulary—the set of symbols that make up the corpus
and the model—is larger.
o There are only about 100 characters in most languages, and sometimes we build character
models that are even more restrictive, for example by treating "A" and "a" as the same
symbol or by treating all punctuation as the same symbol.
o But with word models we have at least tens of thousands of symbols, and sometimes
millions.
Word n-gram models need to deal with out-of-vocabulary words. With character models,
we didn't have to worry about someone inventing a new letter of the alphabet. But with
word models there is always the chance of a new word that was not seen in the training
corpus, so we need to model that explicitly in our language model.
Solution: This can be done by adding just one new word to the vocabulary: <UNK>,
standing for the unknown word. We can estimate n-gram counts for <UNK> with this trick:
go through the training corpus, and the first time any individual word appears it is
previously unknown, so replace it with the symbol <UNK>.
For example, any string of digits might be replaced with <NUM>, or any email address with <EMAIL>.
TEXT CLASSIFICATION:
We now consider in depth the task of text classification, also known as categorization:
given a text of some kind, decide which of a predefined set of classes it belongs to.
Language identification and genre classification are examples of text classification, as
is sentiment analysis (classifying a movie or product review as positive or negative)
and spam detection (classifying an email message as spam or not-spam).
Since "not-spam" is awkward, researchers have coined the term ham for not-spam. We
can treat spam detection as a problem in supervised learning.
A training set is readily available: the positive (spam) examples are in my spam folder,
the negative (ham) examples are in my inbox. Here is an excerpt
From this excerpt we can start to get an idea of what might be good features to include
in the supervised learning model.
Word n-grams such as "for cheap" and "You can buy" seem to be indicators of spam.
Apparently the spammers thought that the word bigram "you deserve" would be too
indicative of spam, and thus wrote "yo,u d-eserve" instead.
A character model should detect this. We could either create a full character n-gram
model of spam and ham, or we could handcraft features such as "number of
punctuation marks embedded in words."
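As a hedged illustration of this idea, the sketch below trains a naive Bayes classifier on character n-gram counts using scikit-learn; the tiny training set is purely a placeholder, not real spam data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder training data; a real system would use the user's spam and ham folders.
texts = ["You can buy cheap watches for cheap", "yo,u d-eserve this prize",
         "Meeting moved to 3pm, see agenda attached", "Here are the lecture notes for Unit 3"]
labels = ["spam", "spam", "ham", "ham"]

# Character n-grams (2-4 characters) can catch obfuscations such as "d-eserve".
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(texts)
classifier = MultinomialNB().fit(X, labels)

test = ["yo,u d-eserve a cheap watch"]
print(classifier.predict(vectorizer.transform(test)))   # likely ['spam'] on this toy data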
INFORMATION RETRIEVAL:
Definition: Information retrieval is the task of finding documents that are relevant to a user’s
need for information.
Ex: The best-known examples of information retrieval systems are search engines on the World
Wide Web. A Web user can type a query such as [AI book] into a search engine and see a list of
relevant pages.
An IR system can be characterized by:
1. A corpus of documents. Each system must decide what it wants to treat as a document: a
paragraph, a page, or a multipage text.
2. Queries posed in a query language. A query specifies what the user wants to know. The
query language can be just a list of words, such as [AI book]; or it can specify a phrase of words
that must be adjacent, as in ["AI book"]; it can contain Boolean operators as in [AI AND book];
it can include non-Boolean operators such as [AI NEAR book] or [AI book site:www.aaai.org].
3. A result set. This is the subset of documents that the IR system judges to be relevant to
the query. By relevant, we mean likely to be of use to the person who posed the
query, for the particular information need expressed in the query.
4. A presentation of the result set. This can be as simple as a ranked list of document titles or
as complex as a rotating color map of the result set projected onto a three-dimensional space,
rendered as a two-dimensional display.
Boolean Keyword Model: The earliest IR systems worked on a Boolean keyword model.
Each word in the document collection is treated as a Boolean feature that is true of a document if
the word occurs in the document and false if it does not.
Advantages: This model has the advantage of being simple to explain and implement.
Disadvantages: Relevance is treated as all-or-nothing, so the model cannot rank documents by
degree of relevance, and it can be hard for users who are not programmers or logicians to
formulate precise Boolean queries.
IR scoring functions:
Definition: A scoring function takes a document and a query and returns a numeric score; the
most relevant documents have the highest scores.
In the BM25 function, the score is a linear weighted combination of scores for each of the words
that make up the query; three factors go into it (the full formula is given after this list):
i. TF: The frequency with which a query term appears in a document (also known as TF for
term frequency).
Ex: For the query [farming in Kansas], documents that mention "farming" frequently will
have higher scores.
ii. IDF: The inverse document frequency of the term, or IDF. The word "in" appears in
almost every document, so it has a high document frequency, and thus a low inverse
document frequency, and thus it is not as important to the query as "farming" or "Kansas."
iii. Length: The length of the document. A million-word document will probably mention
all the query words, but may not actually be about the query. A short document that
mentions all the words is a much better candidate.
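One standard way these three factors are combined (the usual BM25 formulation; exact constants vary between systems) is
BM25(dj, q1:N) = Σi IDF(qi) · TF(qi, dj) · (k + 1) / ( TF(qi, dj) + k · (1 − b + b · |dj| / L) ),
where |dj| is the length of document dj, L is the average document length in the corpus, and k and b are tuning parameters (typical values are k = 2.0 and b = 0.75). The inverse document frequency is
IDF(qi) = log( (N − DF(qi) + 0.5) / (DF(qi) + 0.5) ),
where N is the number of documents in the corpus and DF(qi) is the number of documents that contain qi.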
IR system evaluation:
How do we know whether an IR system is performing well?
Traditionally, there have been two measures used in the scoring: recall and precision. We
explain them with the help of an example. Imagine that an IR system has returned a result
set for a single query, for which we know which documents are and are not relevant, out
of a corpus of 100 documents. The document counts in each category are given in the
following table:

                    In result set    Not in result set
  Relevant               30                 20
  Not relevant           10                 40
Precision measures the proportion of documents in the result set that are actually relevant.
Ex:
In our example, the precision is 30/(30 + 10) = .75. The false positive rate is 1 − .75 = .25.
Recall measures the proportion of all the relevant documents in the collection that are in the
result set.
Ex: In our example, recall is 30/(30 + 20) = .60. The false negative rate is 1 − .60 = .40.
In a very large document collection, such as the World Wide Web, recall is difficult to
compute, because there is no easy way to examine every page on the Web for relevance. All
we can do is either estimate recall by sampling or ignore recall completely and just judge
precision. In the case of a Web search engine, there may be thousands of documents in the
result set, so it makes more sense to measure precision for several different sizes, such as
"P@10" (precision in the top 10 results) or "P@50," rather than to estimate precision in the
entire result set.
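A minimal sketch of these measures, assuming we are given the set of relevant document ids and the ranked result list:

def precision(result_ids, relevant_ids):
    """Fraction of returned documents that are relevant."""
    retrieved = set(result_ids)
    return len(retrieved & relevant_ids) / len(retrieved)

def recall(result_ids, relevant_ids):
    """Fraction of all relevant documents that were returned."""
    return len(set(result_ids) & relevant_ids) / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """P@k: precision computed over only the top k results."""
    top_k = ranked_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

# Toy numbers matching the example above: 40 returned documents, 30 of them relevant,
# 50 relevant documents in the whole collection.
relevant = set(range(50))
returned = list(range(30)) + list(range(100, 110))
print(precision(returned, relevant))          # 0.75
print(recall(returned, relevant))             # 0.6
print(precision_at_k(returned, relevant, 10))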
IR refinements:
One common refinement is a better model of the effect of document length on relevance, such
as the pivoted document length normalization scheme; the idea is that the pivot is the document
length at which the old-style normalization is correct; documents shorter than that get a boost
and longer ones get a penalty.
The BM25 scoring function uses a word model that treats all words as completely
independent, but we know that some words are correlated: "couch" is closely related to both
"couches" and "sofa." Many IR systems attempt to account for these correlations.
STEMMING:
For example, if the query is [couch], it would be a shame to exclude from the result set
those documents that mention "COUCH" or "couches" but not "couch."
Most IR systems do case folding of "COUCH" to "couch," and some
use a stemming algorithm to reduce "couches" to the stem form "couch," both in the
query and the documents.
This typically yields a small increase in recall (on the order of 2% for English). However,
it can harm precision.
For example, stemming "stocking" to "stock" will tend to decrease precision for queries
about either foot coverings or financial instruments, although it could improve recall for
queries about warehousing. Stemming algorithms based on rules (e.g., remove "-ing")
cannot avoid this problem, but algorithms based on dictionaries (don't remove "-ing" if
the word is already listed in the dictionary) can.
While stemming has a small effect in English, it is more important in other languages.
SYNONYM:
The next refinement is to recognize synonyms, such as "sofa" for "couch." As with stemming,
this has the potential for small gains in recall, but can hurt precision. A user who gives the
query [Tim Couch] wants to see results about the football player, not sofas.
Absolute synonyms are rare: anytime there are two words that mean the same thing, speakers
of the language conspire to evolve the meanings to remove the confusion. Related words that
are not synonyms also play an important role in ranking—terms like "leather," "wooden," or
"modern" can serve to confirm that the document really is about "couch." Synonyms and
related words can be found in dictionaries or by looking for correlations in documents or in
queries—if we find that many users who ask the query [new sofa] follow it up with the query
[new couch], we can in the future alter [new sofa] to be [new sofa OR new couch].
METADATA:
As a final refinement, IR can be improved by considering metadata—data outside of the text
of the document. Examples include human-supplied keywords and publication data.
On the Web, hypertext links between documents are a crucial source of information.
The PageRank algorithm is designed to weight links from high-quality sites
more heavily.
What is a high-quality site? One that is linked to by other high-quality sites. The
definition is recursive, but we will see that the recursion bottoms out properly. The
PageRank for a page p is defined as:
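PR(p) = (1 − d) / N + d · Σi ( PR(in_i) / C(in_i) )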
where PR(p) is the PageRank of page p, N is the total number of pages in the corpus, in_i are
the pages that link in to p, C(in_i) is the count of the total number of out-links on page in_i,
and d is a damping factor (typically set to around 0.85).
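A minimal Python sketch of computing PageRank by repeatedly applying this formula (power iteration) over a toy link graph; the graph and iteration count are illustrative assumptions.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum contributions from every page that links in to p.
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / N + d * incoming
        pr = new_pr
    return pr

# Toy corpus: A and C link to B, B links to A and C.
toy_links = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(pagerank(toy_links))   # B ends up with the highest PageRank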
Question answering:
Information retrieval is the task of finding documents that are relevant to a query, where the
query may be a question, or just a topic area or concept. Question answering is a somewhat
different task, in which the query really is a question, and the
answer is not a ranked list of documents but rather a short response—a sentence, or even just
a phrase.
The ASKMSR system (Banko et al., 2002) is a typical Web-based question-answering
system. It is based on the intuition that most questions will be answered many times on the
Web, so question answering should be thought of as a problem in precision, not recall. We
don’t have to deal with all the different ways that an answer might be phrased—we only have
to find one of them. For example, consider the query [Who killed Abraham Lincoln?]
Suppose a system had to answer that question with access only to a single encyclopedia,
whose entry on Lincoln said
“ John Wilkes Booth altered history with a bullet. He will forever be known as the man who
ended Abraham Lincoln’s life.”
INFORMATION EXTRACTION:
Definition:
Information extraction is the process of acquiring knowledge by skimming a text and looking
for occurrences of a particular class of object and for relationships among objects.
A typical task is to extract instances of addresses from Web pages, with database
fields for street, city, state, and zip code; or instances of storms from weather reports,
with fields for temperature, wind speed, and precipitation.
In a limited domain, this can be done with high accuracy. As the domain gets more
general, more complex linguistic models and more complex learning techniques are
necessary.
Ex: For example, extracting from the text "IBM ThinkBook 970. Our price: $399.00" the
set of attributes {Manufacturer=IBM, Model=ThinkBook970, Price=$399.00}.
We can address this problem by defining a template (also known as a pattern) for each
attribute we would like to extract. The template is defined by a finite state automaton,
the simplest example of which is the regular expression, or regex. Regular expressions
are used in Unix commands such as grep, in programming languages such as Perl, and in
word processors such as Microsoft Word.
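For example, regex-based templates for the attributes in the text above might look like the following sketch; the exact patterns are illustrative assumptions, not a standard template set.

import re

text = "IBM ThinkBook 970. Our price: $399.00"

# Hypothetical templates: one regex per attribute we would like to extract.
templates = {
    "Manufacturer": re.compile(r"\b(IBM|Dell|Lenovo|HP)\b"),
    "Model": re.compile(r"\b([A-Z][A-Za-z]+ ?\d{3,4})\b"),
    "Price": re.compile(r"[$][0-9]+(?:[.][0-9][0-9])?"),
}

attributes = {}
for name, pattern in templates.items():
    match = pattern.search(text)
    if match:
        attributes[name] = match.group(0)

print(attributes)   # e.g. {'Manufacturer': 'IBM', 'Model': 'ThinkBook 970', 'Price': '$399.00'}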
One step up from attribute-based extraction systems are relational extraction systems,
which deal with multiple objects and the relations among them.
Thus, when these systems see the text "$249.99," they need to determine not just that it is a price, but
also which object has that price. A typical relational extraction system is FASTUS, which
handles news stories about corporate mergers and acquisitions. It works in five stages:
1. Tokenization
2. Complex-word handling
3. Basic-group handling
4. Complex-phrase handling
5. Structure merging
i. Tokenization:
FASTUS’s first stage is tokenization, which segments the stream of characters into tokens
(words, numbers, and punctuation). For English, tokenization can be fairly simple; just separating
characters at white space or punctuation does a fairly good job. Some tokenizers also deal with markup
languages such as HTML, SGML, and XML.
ii. Complex-word handling:
The second stage handles complex words, including collocations such as "set up" and "joint
venture," as well as proper names such as "Bridgestone Sports Co." These are recognized by a
combination of lexical entries and finite-state grammar rules. For example, a company name might
be recognized by the rule
CapitalizedWord+ ("Company" | "Co" | "Inc" | "Ltd")
iii. Basic-group handling:
The third stage handles basic groups, meaning noun groups and verb groups. The idea is to chunk
these into units that will be managed by the later stages. Noun and verb phrases can be described
in much more detail, but here we have simple rules that only approximate the complexity of
English, yet have the advantage of being representable by finite state automata. The example
sentence would emerge from this stage as a sequence of tagged noun groups, verb groups, and
other particles.
Instead of hand-built regular-expression templates, probabilistic models such as hidden Markov
models (HMMs) can be used for extraction, and they have two advantages.
First, HMMs are probabilistic, and thus tolerant to noise. In a regular expression, if a single
expected character is missing, the regex fails to match; with HMMs there is graceful degradation
with missing characters/words, and we get a probability indicating the degree of match, not just a
Boolean match/fail.
Second, HMMs can be trained from data; they don’t require laborious engineering of templates,
and thus they can more easily be kept up to date as text changes over time.
One issue with HMMs for the information extraction task is that they model a lot of
probabilities that we don’t really need.
An HMM is a generative model; it models the full joint probability of observations and
hidden states, and thus can be used to generate samples. That is, we can use the HMM model
not only to parse a text and recover the speaker and date, but also to generate a random
instance of a text containing a speaker and a date.
Since we’re not interested in that task, it is natural to ask whether we might be better off with a
model that doesn’t bother modeling that possibility.
All we need in order to understand a text is a discriminative model, one that models the
conditional probability of the hidden attributes given the observations (the text).
A framework for this type of model is the conditional random field, or CRF, which models a
conditional probability distribution of a set of target variables given a set of observed
variables.
Like Bayesian networks, CRFs can represent many different structures of dependencies
among the variables. One common structure is the linear-chain conditional random field for
representing Markov dependencies among variables in a temporal sequence.
Thus, HMMs are the temporal version of naive Bayes models, and linear-chain CRFs are the
temporal version of logistic regression, where the predicted target is an entire state sequence
rather than a single binary variable.
********************************