Concept Learning
Dr T V RAJINI KANTH
PROFESSOR & DEAN R&D
DEPARTMENT OF CSE
SNIST, HYDERABAD
ML APPLICATIONS
• Traffic Alerts
• Social Media
• Transportation and Commuting
• Products Recommendations
• Virtual Personal Assistants
• Self Driving Cars
• Dynamic Pricing
• Google Translate
• Online Video Streaming
• Fraud Detection
RESOURCES
TEXT BOOKS:
• 1. Stephen Marsland, "Machine Learning – An Algorithmic Perspective", Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014.
• 2. Tom M Mitchell, "Machine Learning", First Edition, McGraw Hill Education, 2013.
REFERENCES:
• 1. Peter Flach, "Machine Learning: The Art and Science of Algorithms that Make Sense of Data", First Edition, Cambridge University Press, 2012.
• 2. Jason Bell, "Machine Learning – Hands on for Developers and Technical Professionals", First Edition, Wiley, 2014.
• 3. Ethem Alpaydin, "Introduction to Machine Learning" (Adaptive Computation and Machine Learning Series), Third Edition, MIT Press, 2014.
UNIT-I : INTRODUCTION
• Introduction: Learning, Types of Machine Learning, Machine Learning Examples, Decision Tree Learning
• Concept Learning: Introduction, Version Spaces and the Candidate Elimination Algorithm
• Learning with Trees: Decision Tree Learning, the Big Picture
• Linear Discriminants: Learning Linear Separators, the Perceptron Algorithm, Margins
LEARNING
• Learning from data, since data is what we have
• Learning from experience
• Learning is what gives us flexibility in our life, i.e., we can adjust and adapt to the circumstances
• The important parts of animal learning are remembering, adapting, and generalising
• The most fundamental parts of intelligence are learning and adapting
• This was the basis of most early AI, sometimes known as Symbolic Processing because the computer manipulates symbols that reflect the environment
• Machine Learning methods are sometimes called Sub-Symbolic because no symbols or symbolic manipulations are involved
Machine Learning: A Definition
• "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Tom M Mitchell)
Machine Learning
• Machine Learning is about making computers modify or adapt their actions (whether those actions are making predictions or controlling a robot) so that these actions get more accurate, where accuracy is measured by how well the chosen actions reflect the correct ones.
• Ex: a game-playing machine may lose to you during its training period, but after a few games it starts to win and eventually rarely fails.
• Computational complexity is particularly important because we might want to use some of the methods on very large datasets, so algorithms whose complexity is a high-degree polynomial in the size of the dataset (or worse) will be a problem.
• The complexity is often broken into two parts: the complexity of training, and the complexity of applying the trained algorithm.
Traditional Programming
Data + Program → Computer → Output

Machine Learning
Data + Output → Computer → Program
TYPES OF MACHINE LEARNING
Supervised learning: A training set of examples with the correct
responses (targets) is provided and, based on this training set, the
algorithm generalizes to respond correctly to all possible inputs.
This is also called learning from exemplars.
- Classification & - Regression
Unsupervised learning: Correct responses are not provided, but
instead the algorithm tries to identify similarities between the
inputs so that inputs that have something in common are
categorized together. The statistical approach to unsupervised
learning is known as density estimation.
- Data Modelling & - Compression
• Reinforcement learning: This is somewhere between supervised and
unsupervised learning. The algorithm gets told when the answer is wrong, but
does not get told how to correct it. It has to explore and try out different
possibilities until it works out how to get the answer right. Reinforcement
learning is sometimes called learning with a critic because of this monitor that
scores the answer, but does not suggest improvements.
- Behavior Selection & - Planning
• Evolutionary learning: Biological evolution can be seen as a learning process:
biological organisms adapt to improve their survival rates and chance of
having offspring in their environment. We’ll look at how we can model this in
a computer, using an idea of fitness, which corresponds to a score for how
good the current solution is.
- General Purpose optimization
• The most common type of learning is supervised learning
TYPES OF MACHINE LEARNING
• Supervised (inductive) learning
– Training data includes desired outputs
• Unsupervised learning
– Training data does not include desired outputs
• Semi-supervised learning
– Training data includes a few desired outputs
• Reinforcement learning
– Rewards from sequence of actions
WEB PAGE EXAMPLE
Website selling software:
• The website is made more personalized to the user by collecting data about visitors, such as their computer type/operating system, web browser, the country that they live in, and the time of day they visited the website.
• You can get this data for any visitor, and for people who actually buy
something, you know what they bought, and how they paid for it (say PayPal
or a credit card).
• So, for each person who buys something from your website, you have a list of
data that looks like (computer type, web browser, country, time, software
bought, how paid).
• For instance, the first three pieces of data you collect could be:
• Macintosh OS X, Safari, UK, morning, SuperGame1, credit card
• Windows XP, Internet Explorer, USA, afternoon, SuperGame1, PayPal
• Windows Vista, Firefox, NZ, evening, SuperGame2, PayPal
SUPERVISED LEARNING
• The webpage example is a typical problem for supervised learning.
• There is a set of data (the training data) that consists of a set of
input data that has target data, which is the answer that the
algorithm should produce, attached.
• This is usually written as a set of data (xi, ti), where the inputs are
xi, the targets are ti, and the i index suggests that we have lots of
pieces of data, indexed by i running from 1 to some upper limit N.
• The algorithm should be able to cope with noise in the data
• It is hard to specify rigorously what generalization means.
REGRESSION
• Suppose the following data points are given, and we are asked for the value of the output when x = 0.44.
• Top left: A few data points from a sample problem.
• Bottom left: Two possible ways to predict the values between the known data points: connecting the points with straight lines, or using a cubic approximation.
FIG: Figures 1 & 2
• Top and bottom right: Two more complex approximators that pass through
the points, although the lower one is rather better than the top.
• Since the value x = 0.44 isn’t in the examples given, you need to find some
way to predict what value it has.
• You assume that the values come from some sort of function, and try to
find out what the function is.
• Then we are able to give the output value y for any given value of x.
• This is known as a regression problem in statistics: fit a mathematical
function describing a curve, so that the curve passes as close as possible to
all of the data points.
• It is generally a problem of function approximation or interpolation,
working out the value between values that we know.
• The top-left plot shows a plot of the 7 values of x and y in the table, while the other plots show different attempts to fit a curve through the data points.
• The bottom-left plot shows two possible answers found by using straight lines to connect up the points, and also what happens if we try to use a cubic function (something that can be written as y = ax^3 + bx^2 + cx + d).
• The top-right plot shows what happens when we try to match the function using a different polynomial, this time of the form y = ax^10 + bx^9 + ... + jx + k, and finally the bottom-right plot shows the function y = 3 sin(5x).
• In fact, the data were made with the sine function plotted on the
bottom-right, so that is the correct answer in this case, but the
algorithm doesn’t know that, and to it the two solutions on the right
both look equally good.
• The only way we can tell which solution is better is to test how well
they generalize.
• We pick a value that is between our data points, use our curves to
predict its value, and see which is better.
• This will tell us that the bottom-right curve is better in the example.
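A small sketch of this experiment in Python (NumPy assumed; the seven sample points are our own stand-ins, generated from the y = 3 sin(5x) function the text mentions):

```python
import numpy as np

# Hypothetical sample points drawn from y = 3*sin(5x), standing in for
# the seven data points shown in the figures.
x = np.linspace(0, 1, 7)
y = 3 * np.sin(5 * x)

# Fit a cubic and a degree-10 polynomial to the same points.
# (NumPy will warn that the degree-10 fit is poorly conditioned: it can
# pass through every point, but may generalize badly.)
cubic = np.polyfit(x, y, 3)
high = np.polyfit(x, y, 10)

# Test generalization at a point between the training values.
x_test = 0.44
print("true value:       ", 3 * np.sin(5 * x_test))
print("cubic predicts:   ", np.polyval(cubic, x_test))
print("degree-10 predicts:", np.polyval(high, x_test))
```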
Classification
• The classification problem consists of taking input vectors and
deciding which of N classes they belong to, based on training
from exemplars of each class.
• The most important point about the classification problem is
that it is discrete—each example belongs to precisely one class,
and the set of classes covers the whole possible output space.
• These two constraints are not necessarily realistic; sometimes
examples might belong partially to two different classes.
• For ex: consider a vending machine, where we use a
neural network to learn to recognize all the different
coins.
• Train the classifier to recognize all New Zealand coins,
but what if a British coin is put into the machine?
• In that case, the classifier will identify it as the New
Zealand coin that is closest to it in appearance, but
this is not really what is wanted: rather, the classifier
should identify that it is not one of the coins it was
trained on.
COIN CLASSIFIER SET UP:
When a coin is pushed into the slot, the machine takes a few measurements of it.
• Diameter, weight, and shape are the features that will generate our input vector.
• The input vector will have three elements, each of which will be a number representing one of those features (e.g., for shape: 1 = circle, 2 = hexagon, etc.)
• If the vending machine included an atomic absorption spectroscope, we could estimate the density of the material and hence its composition; if it had a camera, we could take a photograph of the coin and feed that image into the classifier.
• The decision of which features to choose is not always an easy one.
• With too many inputs the training time of the classifier will be longer, and as the number of input dimensions grows, the number of data points required increases.
FIG: The New Zealand coins.
FIG: Left: A set of straight-line decision boundaries for a classification problem. Right: a decision boundary that separates the plusses from the lightning strikes better; it is not a straight line.
• Ex: we could not separate the coins based only on colour, because the 20¢ and 50¢ coins are both silver and the $1 and $2 coins are both bronze. However, if we use colour and diameter together, we can do a good job of the coin classification problem for NZ coins.
• There are some features that are entirely useless. For example, knowing that a coin is circular doesn't tell us anything about NZ coins, which are all circular.
THE BRAIN AND THE NEURON
• The brain is a powerful and complicated system, but the basic building blocks that it is made up of are fairly simple and easy to understand.
• It deals with noisy and even inconsistent data, and produces answers that are usually correct from very high dimensional data (such as images) very quickly.
• It weighs about 1.5 kg and is losing parts of itself all the time, but its performance does not degrade appreciably.
• At the most basic level, the processing units of the brain are nerve cells called neurons, and there are lots of them (100 billion = 10^11).
• Transmitter chemicals within the fluid of the brain raise or lower the electrical potential inside the body of the neuron.
• If this membrane potential reaches some threshold, the neuron spikes or fires, and a pulse of fixed strength and duration is sent down the axon.
• The axons divide (arborise) into connections to many other
neurons, connecting to each of these neurons in a synapse.
• Each neuron is typically connected to thousands of other neurons, so that it is estimated that there are about 100 trillion (= 10^14) synapses within the brain.
• After firing, the neuron must wait for some time to recover its
energy before it can fire again.
• Each neuron can be viewed as a separate processor, performing
a very simple computation: deciding whether or not to fire.
• This makes the brain a massively parallel computer made up of 10^11 processing elements. If that is all there is to the brain, then we should be able to model it inside a computer and end up with animal or human intelligence inside a computer.
• This is the view of strong AI.
Learning with Trees
• One of the most common and powerful data structures in the whole of computer
science: the binary tree with computational cost O(logN), where N is the number
of data points.
• Classification by decision trees has grown in popularity over recent years.
• You are very likely to have been subjected to decision trees if you’ve ever phoned
a helpline, for example for computer faults.
• The phone operators are guided through the decision tree by your answers to
their questions.
• The idea of a decision tree is to break classification down into a set of choices about each feature in turn, starting at the root (base) of the tree and progressing down to the leaves, where we receive the classification decision.
• The trees can be turned into a set of if-then rules, suitable for use in a rule induction system.
• In terms of optimization and search, decision trees use a greedy heuristic to perform search, evaluating the possible options at the current stage of learning and taking the one that seems optimal at that point.
• For a set of class probabilities p_i, the entropy is Entropy(p) = −Σ_i p_i log2(p_i); the logarithm is base 2 because we imagine encoding everything using binary digits (bits), and we define 0 log 0 = 0.
• A graph of the entropy is given in Figure 12.2.
• Suppose that we have a set of positive and negative examples of some feature (where the feature can only take 2 values: positive and negative).
• If all of the examples are positive, then we don't get any extra information from knowing the value of the feature for any particular example, since whatever the value of the feature, the example will be positive.
• The entropy of that feature is 0.
• If the feature separates the examples into
50% positive and 50% negative, then the
amount of entropy is at a maximum, and
knowing about that feature is very useful
to us.
• The basic concept is that it tells us how
much extra information we would get
from knowing the value of that feature.
• The best feature to pick is the one that gives you the most information, i.e., the one with the highest entropy.
• After using that feature, re-evaluate the entropy of each remaining feature and again pick the one with the highest entropy.
FIG 12.2: A graph of entropy, detailing how much information is available from finding out another piece of information given what you already know.
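As a minimal sketch in Python (standard library only), the entropy calculation looks like this:

```python
from math import log2

def entropy(probabilities):
    """Entropy in bits: -sum(p * log2(p)), with 0*log2(0) defined as 0."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# All examples positive: knowing the feature tells us nothing new.
print(entropy([1.0]))        # 0.0
# A 50/50 split carries the maximum of one bit of information.
print(entropy([0.5, 0.5]))   # 1.0
```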
ID3
• Information Gain works out how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step.
• It is defined as the entropy of the whole set minus the entropy when a particular feature is chosen.
• This is defined by (where S is the set of examples, F is a possible feature out of the set of all possible ones, and |S_f| is a count of the number of members of S that have value f for feature F):
Gain(S, F) = Entropy(S) − Σ_{f ∈ values(F)} (|S_f| / |S|) Entropy(S_f)   (12.2)
• Ex: data (with outcomes) S = {s1 = true, s2 = false, s3 = false, s4 = false} and one feature F that can have values {f1, f2, f3}.
• Suppose the feature value for s1 is f2, for s2 it is f2, for s3 it is f3, and for s4 it is f1. Then we can calculate the entropy of S as (where ⊕ means true, of which we have one example, and ⊖ means false, of which we have three examples):
Entropy(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖ = −(1/4) log2(1/4) − (3/4) log2(3/4) ≈ 0.8113
• The function Entropy(S_f) is similar, but only computed with the subset of data where feature F has value f.
• If you were trying to follow those calculations on a calculator, you might be wondering how to compute log2(p); use log2(p) = ln(p) / ln(2).
• To compute the information gain of F, we now need to compute each of the values inside the summation in Equation (12.2), i.e., the terms (|S_f| / |S|) Entropy(S_f). (In the evening-activity example used later, the features are 'Deadline', 'Party', and 'Lazy'.)
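A short Python sketch of the gain calculation on the s1..s4 example above (the helper names are ours, not from any library):

```python
from math import log2
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # Gain(S, F) = Entropy(S) - sum over f of |S_f|/|S| * Entropy(S_f)
    n = len(labels)
    remainder = 0.0
    for f in set(feature_values):
        subset = [lab for lab, v in zip(labels, feature_values) if v == f]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

labels = [True, False, False, False]       # outcomes for s1..s4
feature = ["f2", "f2", "f3", "f1"]         # value of F for each example
print(entropy(labels))                     # ~0.8113
print(information_gain(labels, feature))   # ~0.3113
```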
CLASSIFICATION AND REGRESSION TREES (CART)
• There is another well-known tree-based algorithm, CART, whose name indicates
that it can be used for both classification and regression.
• Classification is not wildly different in CART, although it is usually constrained to
construct binary trees.
• This might seem odd at first, but there are sound computer science reasons why binary trees are good, as suggested in the computational cost discussion above, and it is not a real limitation.
• We can always turn questions into binary decisions by splitting the question up a little.
• Ex: a question about when your nearest assignment deadline is (either 'urgent', 'near', or 'none') can be split into two questions: first, 'is the deadline urgent?', and then, if the answer to that is 'no', 'is the deadline near?'
• The only real difference with classification in CART is that a different information measure is commonly used.
• Gini Impurity: The entropy that was used in ID3 as the information measure is not the only way to pick features. Another possibility is something known as the Gini impurity.
• The ‘impurity’ in the name suggests that the aim of the decision tree is to have
each leaf node represent a set of data points that are in the same class, so that
there are no mismatches.
• This is known as purity. If a leaf is pure then all of the training data within it have
just one class.
• If we count the number of data points at the node (or better, the fraction of the
number of data points) that belong to a class i (call it N(i)), then it should be 0 for
all except one value of i.
• So suppose that you want to decide which feature to choose for a split.
• The algorithm loops over the different features and checks how many points belong to each class.
• If the node is pure, then N(i) = 0 for all values of i except one particular one.
• So for any particular feature k you can compute:
G_k = Σ_{i≠j} N(i) N(j) = 1 − Σ_i N(i)^2
• The idea is to consider the cost of misclassifying an instance of class i as class j and add a weight that says how important each data point is.
• It is typically labelled λ_ij and is presented as a matrix, with element λ_ij representing the cost of misclassifying i as j.
• Using it is simple, modifying the Gini impurity (Equation (12.8)) to be:
G_k = Σ_{i≠j} λ_ij N(i) N(j)
• There is another benefit to using these weights, which is to successively
improve the classification ability by putting higher weight on data points that
the algorithm is getting wrong.
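A small Python sketch of both measures (the class fractions N(i) are passed in directly; the cost matrix values are made up for illustration):

```python
def gini(fractions):
    # Plain Gini impurity: 1 - sum_i N(i)^2, zero for a pure node.
    return 1.0 - sum(p * p for p in fractions)

def weighted_gini(fractions, cost):
    # Weighted variant: sum over i != j of lambda_ij * N(i) * N(j).
    k = len(fractions)
    return sum(cost[i][j] * fractions[i] * fractions[j]
               for i in range(k) for j in range(k) if i != j)

print(gini([1.0, 0.0]))   # 0.0: pure node
print(gini([0.5, 0.5]))   # 0.5: maximally impure for two classes
# Hypothetical costs: misclassifying class 0 as class 1 is twice as bad.
print(weighted_gini([0.5, 0.5], [[0, 2], [1, 0]]))
```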
• The really new part about CART is its application to regression, even though it might seem strange to use trees for regression.
• Suppose that the outputs are continuous, so that a regression model is appropriate.
• None of the node impurity measures that we have considered so far will work, so we go back to our old favourite: the sum-of-squares error.
• To evaluate the choice of which feature to use next, we also need to find the value at which to split the dataset according to that feature. Remember that the output is a value at each leaf.
• That output is computed as the mean average of all the data points that are situated in that leaf.
• This is the optimal choice in order to minimize the sum-of-squares error, and it also means that we can choose the split point quickly for a given feature, by choosing it to minimize the sum-of-squares error.
• Pick the feature that has the split point that provides the best sum-of-squares error, and continue to use the algorithm as for classification.
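A sketch of choosing a split point for a single feature by sum-of-squares error, assuming a simple exhaustive search over midpoints between sorted values (the function names and data are ours):

```python
import numpy as np

def sum_of_squares(y):
    # Error of predicting the mean for every point in a leaf.
    return float(np.sum((y - np.mean(y)) ** 2)) if len(y) else 0.0

def best_split(x, y):
    # Try the midpoint between each pair of adjacent sorted feature
    # values; keep the split with the smallest total error.
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (float("inf"), None)
    for i in range(1, len(xs)):
        threshold = (xs[i - 1] + xs[i]) / 2
        err = sum_of_squares(ys[:i]) + sum_of_squares(ys[i:])
        best = min(best, (err, threshold))
    return best  # (error, split value)

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.0, 5.2, 4.8])
print(best_split(x, y))  # splits between 3.0 and 10.0
```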
CLASSIFICATION EXAMPLE
• Consider an example using ID3, when we want to construct the decision
tree to decide what to do in the evening.
• Get a suitable dataset (here, the last ten days):
• To produce a decision tree for this problem, the first thing that we need to do is
work out which feature to use as the root node.
• We start by computing the entropy of S:
and then find which feature has the maximal information gain:
• Therefore, the root node will be the party feature, which has two feature values
(‘yes’ and ‘no’), so it will have two branches coming out of it (see Figure 12.6).
• When we look at the ‘yes’ branch, we see that in all five cases where there was
a party we went to it, so we just put a leaf node there, saying ‘party’.
• For the ‘no’ branch, out of the five cases there are three different outcomes, so
now we need to choose another feature.
• The five cases we are looking at are:
Concept learning
• Introduction, Version Spaces and the Candidate
Elimination Algorithm
CONCEPT LEARNING TASK
• Ex: "days on which my friend Aldo enjoys his favorite water sport." shown in Table
2.1
• The attribute EnjoySport indicates whether or not Aldo enjoys his favorite water
sport on this day.
• The task is to learn to predict the value of EnjoySport for an arbitrary day, based
on the values of its other attributes.
• What hypothesis representation shall we provide to the learner in this case?
• Each hypothesis is a vector of six constraints on the values of the attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
• For each attribute, the hypothesis will either: indicate by a "?" that any value is acceptable for this attribute; specify a single required value (e.g., Warm) for the attribute; or indicate by a "Ø" that no value is acceptable.
• If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive example (h(x) = 1).
• The hypothesis that Aldo enjoys his favorite sport only on cold days with high humidity is represented by the expression (?, Cold, High, ?, ?, ?)
• The most general hypothesis, that every day is a positive example, is represented by (?, ?, ?, ?, ?, ?)
• The most specific possible hypothesis, that no day is a positive example, is represented by (Ø, Ø, Ø, Ø, Ø, Ø)
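A quick Python sketch of this representation (the "?" and "Ø" constraints are encoded here as the strings "?" and "0", an encoding choice of ours):

```python
def satisfies(hypothesis, instance):
    # h(x) = 1 iff every constraint accepts the instance's attribute value.
    # A "0" (null) constraint never matches a real attribute value, so any
    # hypothesis containing "0" classifies every instance as negative.
    return all(h == "?" or h == x
               for h, x in zip(hypothesis, instance))

h = ("?", "Cold", "High", "?", "?", "?")
x = ("Sunny", "Cold", "High", "Strong", "Warm", "Same")
print(satisfies(h, x))  # True: h classifies x as positive
```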
• To summarize, the EnjoySport concept learning task requires learning
the set of days for which EnjoySport = yes
• It is described by the target function, the set of candidate hypotheses
considered by the learner, and the set of available training examples.
• The definition of the EnjoySport concept learning task in this general
form is given in Table 2.2.
• The set of items over which the concept is defined is called the set of
instances, which we denote by X.
• X is the set of all possible days, each represented by the attributes Sky,
AirTemp, Humidity, Wind, Water, and Forecast.
• The concept or function to be learned is called the target concept,
which we denote by c.
• c can be any boolean-valued function defined over the instances X; that is, c : X → {0, 1}.
• In this example, the target concept corresponds to the value of the attribute EnjoySport:
• c(x) = 1 if EnjoySport = Yes, and
• c(x) = 0 if EnjoySport = No
• Instances for which c(x) = 1 are called positive examples, or members of the target concept.
• Instances for which c(x) = 0 are called negative examples, or nonmembers of the target concept.
• We write the ordered pair (x, c(x)) to describe the training example consisting of the instance x and its target concept value c(x).
• We use D to denote the set of available training examples.
• Given a set of training examples of the target concept c, the problem
faced by the learner is to hypothesize, or estimate, c.
• H is used to denote the set of all possible hypotheses
• Usually H is determined by the human designer's choice of hypothesis
representation.
• Each hypothesis h in H represents a boolean-valued function defined over X; that is, h : X → {0, 1}.
• The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.
• The inductive learning hypothesis: Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
CONCEPT LEARNING AS SEARCH
• Concept learning can be viewed as the task of searching through a large
space of hypotheses implicitly defined by the hypothesis representation.
• The goal of this search is to find the hypothesis that best fits the training
examples.
• By selecting a hypothesis representation, the designer of the learning
algorithm implicitly defines the space of all hypotheses that the program can
ever represent and therefore can ever learn.
• Ex: Consider the instances X and hypotheses H in the EnjoySport learning
task.
• The attribute Sky has three possible values, and AirTemp, Humidity, Wind, Water, and Forecast each have two possible values.
• The instance space X contains exactly 3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances.
• A similar calculation shows that there are 5 · 4 · 4 · 4 · 4 · 4 = 5120 syntactically distinct hypotheses within H.
• Every hypothesis containing one or more "Ø" symbols represents the empty set of instances; that is, it classifies every instance as negative.
• The number of semantically distinct hypotheses is therefore only 1 + (4 · 3 · 3 · 3 · 3 · 3) = 973.
General-to-Specific Ordering of Hypotheses
• We can design learning algorithms that exhaustively search even infinite hypothesis spaces without explicitly enumerating every hypothesis.
• To illustrate the general-to-specific ordering, consider the two hypotheses:
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
• Consider the sets of instances that are classified positive by h1 and by h2.
• Because h2 imposes fewer constraints on the instance, it classifies more instances as positive.
• Any instance classified positive by h1 will also be classified positive by h2.
• We therefore say that h2 is more general than h1. For an instance x in X and a hypothesis h in H, we say that x satisfies h if and only if h(x) = 1.
• Given hypotheses hj and hk, hj is more_general_than_or_equal_to hk if and only if any instance that satisfies hk also satisfies hj.
• Definition: Let hj and hk be boolean-valued functions defined over X. Then hj is more_general_than_or_equal_to hk (written hj ≥g hk) if and only if (∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)].
• We also consider cases where one hypothesis is strictly more general than the other: hj is (strictly) more_general_than hk (written hj >g hk) if and only if (hj ≥g hk) Ʌ ¬(hk ≥g hj).
• The ≥g relation is important because it provides a useful structure over the hypothesis space H for any concept learning problem.
FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS
• How can we use the more-general-than partial ordering (the relation is
reflexive, anti-symmetric, and transitive) to organize the search for a hypothesis
consistent with the observed training examples?
• Assume that the learner is given the sequence of training examples from Table 2.1 for the EnjoySport task. The first step of FIND-S is to initialize h to the most specific hypothesis in H:
h ← (Ø, Ø, Ø, Ø, Ø, Ø)
• This hypothesis is too specific to cover the first training example, which is positive.
• None of the "Ø" constraints in h are satisfied by this example, so each is replaced by the next more general constraint that fits the example:
h ← (Sunny, Warm, Normal, Strong, Warm, Same)
• This h is still very specific; it asserts that all instances are negative except for the single positive training example.
• The second training example (also positive in this case) forces the algorithm to further generalize h, this time substituting a "?" in place of any attribute value not satisfied by the new example.
• The refined hypothesis in this case is
h ← (Sunny, Warm, ?, Strong, Warm, Same)
• For the third training example, in this case a negative example, the algorithm makes no change to h.
• The FIND-S algorithm simply ignores every negative example!
• h is the most specific hypothesis in H consistent with the observed positive examples.
• Because the target concept c is also assumed to be in H and to be consistent with the positive training examples, c must be more_general_than_or_equal_to h.
• But the target concept c will never cover a negative example, thus neither will h (by the definition of more_general_than).
• No revision to h will be required in response to any negative example.
• The fourth (positive) example leads to a further generalization of h:
h ← (Sunny, Warm, ?, Strong, ?, ?)
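A compact sketch of FIND-S in Python, traced on the EnjoySport examples of Table 2.1 (the "?"/"0" string encoding is our choice; the algorithm follows the steps above):

```python
def find_s(examples):
    # Start with the most specific hypothesis: all-null constraints.
    n_attrs = len(examples[0][0])
    h = ["0"] * n_attrs
    for instance, label in examples:
        if not label:                 # FIND-S ignores negative examples
            continue
        for i, value in enumerate(instance):
            if h[i] == "0":           # first positive example: copy values
                h[i] = value
            elif h[i] != value:       # mismatch: generalize to "?"
                h[i] = "?"
    return tuple(h)

# The four EnjoySport training examples from Table 2.1.
data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
print(find_s(data))  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```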
• The search moves from hypothesis to hypothesis, searching from the most
specific to progressively more general hypotheses along one chain of the partial
ordering. Figure 2.2 shows search in terms of the instance and hypothesis
spaces.
• FIND-S is guaranteed to output the most specific hypothesis within H that is
consistent with the positive training examples.
• Its final hypothesis will also be consistent with the negative examples provided
the correct target concept is contained in H, and provided the training examples
are correct.
• Several questions are still left unanswered by this learning algorithm:
➢ Has the learner converged to the correct target concept? Although FIND-S will find a hypothesis
consistent with the training data, it has no way to determine whether it has found the only
hypothesis in H consistent with the data.
➢ Why prefer the most specific hypothesis?
➢ Are the training examples consistent?
➢ What if there are several maximally specific consistent hypotheses?
VERSION SPACES AND THE CANDIDATE-ELIMINATION
ALGORITHM
• A second approach to concept learning, the CANDIDATE- ELIMINATION
Algorithm, that addresses several of the limitations of FIND-S.
• The key idea in this algorithm is to output a description of the set of all
hypotheses consistent with the training examples.
• It computes the description of this set without explicitly enumerating all of its members; this is accomplished by again using the more-general-than partial ordering.
• This has been applied to problems such as learning regularities in chemical
mass spectroscopy (Mitchell 1979) and learning control rules for heuristic
search (Mitchell et al. 1983).
• It provides a useful conceptual framework for introducing several fundamental
issues in machine learning.
62
Representation
• The CANDIDATE-ELIMINATION Algorithm finds all describable hypotheses that are consistent with the observed training examples.
• A hypothesis is consistent with the training examples if it correctly classifies these examples.
• Definition: A hypothesis h is consistent with a set of training examples D if and only if h(x) = c(x) for each example <x, c(x)> in D.
• Definition: The version space, denoted VS_{H,D}, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D.
A More Compact Representation for Version Spaces
• The CANDIDATE-ELIMINATION Algorithm works on the same principle as the LIST-THEN-ELIMINATE Algorithm, but employs a more compact representation of the version space.
• Its members form general and specific boundary sets that delimit the version space within the partially ordered hypothesis space.
• Given the four training examples from Table 2.1, FIND-S outputs the hypothesis
h = (Sunny, Warm, ?, Strong, ?, ?)
• This is just one of six different hypotheses from H that are consistent with these training examples, shown in Fig. 2.3; together they constitute the version space relative to this set of data and this hypothesis representation.
• Arrows among these six hypotheses in Fig. 2.3 indicate instances of the more-
general-than relation.
• Represent the version space in terms of its most specific and most general
members.
• Theorem 2.1. Version space representation theorem: Let X be an arbitrary set of instances and let H be a set of boolean-valued hypotheses defined over X. Let c : X → {0, 1} be an arbitrary target concept defined over X, and let D be an arbitrary set of training examples {<x, c(x)>}. For all X, H, c, and D such that S and G are well defined,
VS_{H,D} = {h ∈ H | (∃s ∈ S)(∃g ∈ G) (g ≥g h ≥g s)}
CANDIDATE-ELIMINATION Algorithm
• It computes the version space containing all hypotheses from H that are consistent with the training examples.
• It begins by initializing the version space to the set of all hypotheses in H; that is, by initializing the G boundary set to contain the most general hypothesis in H, G0 ← {(?, ?, ?, ?, ?, ?)}
• and by initializing the S boundary set to contain the most specific (least general) hypothesis, S0 ← {(Ø, Ø, Ø, Ø, Ø, Ø)}
• These two boundary sets delimit the entire hypothesis space, because every other hypothesis in H is both more general than S0 and more specific than G0. As training examples are processed, the algorithm refines the boundaries, removing non-minimal and non-maximal hypotheses.
• Ex: the algorithm was applied to the first two training examples from Table 2.1. The boundary sets are first initialized to G0 and S0, the most general and most specific hypotheses in H.
• In the first two steps, positive training examples may force the S boundary of the version space to become increasingly general. Negative training examples play the complementary role of forcing the G boundary to become increasingly specific.
• Consider the third training example, shown in Fig.2.5. This negative example
reveals that the G boundary of the version space is overly general; that is, the
hypothesis in G incorrectly predicts that this new example is a positive example.
• The fourth training example, as shown in Figure 2.6, further generalizes the S boundary of the version space.
• It also results in removing one member of the G boundary, because this member fails to cover the new positive example.
• After processing these four examples, the boundary sets S4 and G4 delimit the version space of all hypotheses consistent with the set of incrementally observed training examples. The entire version space, including those hypotheses bounded by S4 and G4, is shown in Fig. 2.7. As further training data is encountered, the S and G boundaries will move monotonically closer to each other, delimiting a smaller and smaller version space of candidate hypotheses.
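A simplified sketch of the CANDIDATE-ELIMINATION Algorithm for conjunctive hypotheses, reusing the "?"/"0" encoding from the FIND-S sketch (it assumes noise-free data, keeps S as a single hypothesis, and only specializes G toward values already fixed in S, which is enough to reproduce the EnjoySport trace):

```python
def matches(h, x):
    # A hypothesis matches an instance if every constraint accepts it.
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def generalize(s, x):
    # Minimally generalize s so that it covers positive example x.
    return tuple(xv if sv == "0" else (sv if sv == xv else "?")
                 for sv, xv in zip(s, x))

def specialize(g, s, x):
    # Minimal specializations of g that exclude negative example x
    # while remaining more general than the specific boundary s.
    return [g[:i] + (s[i],) + g[i + 1:]
            for i in range(len(g))
            if g[i] == "?" and s[i] not in ("?", "0", x[i])]

def candidate_elimination(examples):
    n = len(examples[0][0])
    S = ("0",) * n              # single maximally specific hypothesis
    G = [("?",) * n]            # maximally general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if matches(g, x)]   # drop inconsistent g
            S = generalize(S, x)
        else:
            G = [h for g in G
                 for h in (specialize(g, S, x) if matches(g, x) else [g])]
    return S, G

data = [  # EnjoySport examples from Table 2.1
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(data)
print(S)  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print(G)  # [('Sunny', '?', ...), ('?', 'Warm', ...)]
```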
LINEAR DISCRIMINANTS : THE PERCEPTRON
• The Perceptron is nothing more than a collection of McCulloch and Pitts
neurons together with a set of inputs and some weights to fasten the
inputs to the neurons.
• The neurons are shown on the right, and you can see both the additive part (shown as a circle) and the thresholder; together they make up the neuron.
• Neurons in the Perceptron are completely independent of each other: it
doesn’t matter to any neuron what the others are doing, it works out
whether or not to fire by multiplying together its own weights and the
input, adding them together, and comparing the result to its own
threshold, regardless of what the other neurons are doing.
• Even the weights that go into each neuron are separate for each one, so
the only thing they share is the inputs, since every neuron sees all of the
inputs to the network.
• In the figure the number of inputs happens to equal the number of neurons, but this does not have to be the case; in general there will be m inputs and n neurons.
• The number of inputs is determined for us by the data, and so is the number of outputs, since we are doing supervised learning: the Perceptron has to learn to reproduce a particular target, that is, a pattern of firing and non-firing neurons for the given input.
• To keep track of which neuron each weight feeds into, we label the weights as wij, where the j index runs over the number of neurons and the i index runs over the number of inputs.
• So w32 is the weight that connects input node 3 to neuron 2.
• In an implementation of the neural network, we use a two-dimensional array to hold these weights.
• Working out whether or not a neuron should fire is easy: we set the values of the input nodes to match the elements of an input vector, and the outputs of the neurons form a vector of 0s and 1s; so if there are 5 neurons, as in Fig. 3.2, then a typical output pattern could be (0, 1, 0, 0, 1), which means that the second and fifth neurons fired and the others did not.
• There are m weights that are connected to each neuron, one for each of the input nodes.
• If we label the neuron that is wrong as k, then the weights that we are interested in are wik, where i runs from 1 to m.
• We compute yk − tk (the difference between the output yk, which is what the neuron did, and the target for that neuron, tk, which is what the neuron should have done; this is a possible error function).
• If it is negative then the neuron should have fired and didn't, so we make the relevant weights bigger, and vice versa if it is positive, which we can do by subtracting the error value.
• However, if the value of an input is negative, then to make the neuron fire we'd need to make the corresponding weight negative as well; so we multiply the error and the input together to see how we should change the weight: Δwik = −(yk − tk) × xi, and the new value of the weight is the old value plus this value.
• If an input is zero then, even if a neuron is wrong, changing the relevant weight does nothing; that is why we also need to change the threshold (handled by adding a bias input).
• Finally, the update is multiplied by a parameter called the learning rate, usually labelled η. The value of the learning rate decides how fast the network learns.
• Final rule for updating a weight wij:
wij ← wij − η (yj − tj) · xi   ----- (3.3)
• The first time the network might get some of the answers correct and
some wrong; the next time it will hopefully improve, and eventually its
performance will stop improving.
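A minimal Perceptron training sketch in Python (NumPy assumed; the threshold is handled by a bias input fixed at −1, a common convention, and the data is the OR function):

```python
import numpy as np

def train_perceptron(inputs, targets, eta=0.25, epochs=20):
    # Add a bias input fixed at -1 so the threshold is learned as a weight.
    X = np.concatenate((inputs, -np.ones((len(inputs), 1))), axis=1)
    weights = np.random.uniform(-0.05, 0.05, (X.shape[1], targets.shape[1]))
    for _ in range(epochs):
        y = (X @ weights > 0).astype(float)   # fire if activation > 0
        weights -= eta * X.T @ (y - targets)  # rule (3.3) for all weights
    return weights

# The OR function: fires unless both inputs are 0.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([[0], [1], [1], [1]])
w = train_perceptron(inputs, targets)
X = np.concatenate((inputs, -np.ones((4, 1))), axis=1)
print((X @ w > 0).astype(int))  # should reproduce the OR targets
```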
LINEAR SEPARABILITY
• Consider a single output neuron on the OR data: it tries to separate out the cases where the neuron should fire from those where it shouldn't.
• Looking at the graph on the right side of Figure 3.4, you should be able to draw a straight line that separates out the crosses from the circles without difficulty (it is done in Figure 3.6).
• That is what the Perceptron does: it tries to find a straight line (in 2D; a plane in 3D, and a hyperplane in higher dimensions) where the neuron fires on one side of the line, and doesn't on the other.
• This line is called the decision boundary or discriminant function, and an
example of one is given in Figure 3.7.
• Consider just one input vector x. The neuron fires if x · w^T ≥ 0 (where w is the row of W that connects the inputs to one particular neuron).
• The a · b notation describes the inner or scalar product between two vectors.
• The inner product is computed by multiplying each element of the first vector by the matching element of the second and adding them all together.
• a · b = ||a|| ||b|| cos θ, where θ is the angle between a and b and ||a|| is the length of the vector a.
• So the inner product computes a function of the angle between the two vectors, scaled by their lengths.
• The boundary case for a Perceptron is where we find an input vector x1 that has x1 · w^T = 0.
• Suppose we find another input vector x2 that satisfies x2 · w^T = 0.
• Putting these two equations together we get:
(x1 − x2) · w^T = 0
• So the weight vector w is perpendicular to any vector lying along the decision boundary: w determines the orientation of the boundary.
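A two-line numerical check of the inner product identity (NumPy assumed; the vectors are our own example, chosen 45 degrees apart):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])   # 45 degrees from a
dot = a @ b                # multiply matching elements and sum
check = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(np.pi / 4)
print(dot, check)          # both 1.0: a . b == ||a|| ||b|| cos(theta)
```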
• For regression we are making a prediction about an unknown value y by
computing some function of known values xi .
• The output y is going to be a sum of the xi values, each multiplied by a constant parameter:
y = Σ_i βi xi
• The βi define a straight line (plane in 3D, hyperplane in higher dimensions) that goes through (or at least near) the data points. Figure 3.13 shows this in two and three dimensions.
• Define the line (plane or hyperplane in higher dimensions) that best fits the
data.
• The most common solution is to try to minimise the distance between each
datapoint and the line that we fit.
• We can measure the distance between a point and a line by defining another
line that goes through the point and hits the line.
• The second line will be shortest when it hits the first line at right angles, and then we can use Pythagoras' theorem to find the distance.
• Now, we can try to minimise an error function that measures the sum of all
these distances.
• If we ignore the square roots, and just minimise the sum-of-squares of the
errors, then we get the most common minimisation, which is known as least-
squares optimisation.
• What we are doing is choosing the parameters in order to minimise the squared
difference between the prediction and the actual data value, summed over all
of the datapoints.
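A least-squares sketch in NumPy (the data is made up, lying near the line y = 2x + 1; a column of ones is added so one parameter acts as an intercept):

```python
import numpy as np

# Hypothetical noisy data near the line y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = 2 * x + 1 + rng.normal(0, 0.05, size=x.shape)

# Design matrix with a column of ones so beta_0 acts as an intercept.
X = np.column_stack((np.ones_like(x), x))

# Choose beta to minimise the sum-of-squares error ||X beta - t||^2.
beta, *_ = np.linalg.lstsq(X, t, rcond=None)
print(beta)  # approximately [1, 2]
```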