Embedded System Notes
Unit - I
What is AI ?
• Artificial Intelligence (AI) is a branch of science which deals with helping machines
find solutions to complex problems in a more human-like fashion.
• This generally involves borrowing characteristics from human intelligence, and
applying them as algorithms in a computer-friendly way.
• A more or less flexible or efficient approach can be taken depending on the
requirements established, which influences how artificial the intelligent behavior
appears.
• Artificial intelligence can be viewed from a variety of perspectives.
✓ From the perspective of intelligence
artificial intelligence is making machines "intelligent" -- acting as we would
expect people to act.
o The inability to distinguish computer responses from human responses
is called the Turing test.
o Intelligence requires knowledge
o Expert problem solving - restricting domain to allow including
significant relevant knowledge
✓ From a business perspective AI is a set of very powerful tools, and
methodologies for using those tools to solve business problems.
✓ From a programming perspective, AI includes the study of symbolic
programming, problem solving, and search.
o Typically AI programs focus on symbols rather than numeric
processing.
o Problem solving - achieve goals.
o Search - a solution can seldom be accessed directly; search may involve a
variety of techniques.
o AI programming languages include:
– LISP, developed in the 1950s, is the early programming language
strongly associated with AI. LISP is a functional programming language with
procedural extensions. LISP (LISt Processor) was specifically designed for
processing heterogeneous lists -- typically a list of symbols. Features of LISP
are run-time type checking, higher order functions (functions that have other
functions as parameters), automatic memory management (garbage collection)
and an interactive environment.
– The second language strongly associated with AI is PROLOG.
PROLOG was developed in the 1970s. PROLOG is based on first order logic.
PROLOG is declarative in nature and has facilities for explicitly limiting the
search space.
– Object-oriented languages are a class of languages more recently used
for AI programming. Important features of object-oriented languages include:
concepts of objects and messages; objects bundle data and methods for
manipulating the data; the sender specifies what is to be done, while the receiver
decides how to do it; inheritance (an object hierarchy where objects inherit the
attributes of the more general class of objects). Examples of object-oriented languages
are Smalltalk, Objective C, C++. Object oriented extensions to LISP (CLOS -
Common LISP Object System) and PROLOG (L&O - Logic & Objects) are
also used.
• Artificial Intelligence is a new electronic machine that stores large amounts of
information and processes it at very high speed.
• The computer is interrogated by a human via a teletype. It passes if the human cannot
tell whether there is a computer or a human at the other end.
• The ability to solve problems
• It is the science and engineering of making intelligent machines, especially intelligent
computer programs. It is related to the similar task of using computers to understand
human intelligence.
Importance of AI
• Game Playing
You can buy machines that can play master level chess for a few hundred dollars.
There is some AI in them, but they play well against people mainly through brute-
force computation, looking at hundreds of thousands of positions. To beat a world
champion by brute force and known reliable heuristics requires being able to look at
200 million positions per second.
• Speech Recognition
In the 1990s, computer speech recognition reached a practical level for limited
purposes. Thus United Airlines has replaced its keyboard tree for flight information
by a system using speech recognition of flight numbers and city names. It is quite
convenient. On the other hand, while it is possible to instruct some computers using
speech, most users have gone back to the keyboard and the mouse as still more
convenient.
• Computer Vision
The world is composed of three-dimensional objects, but the inputs to the human eye
and computers' TV cameras are two dimensional. Some useful programs can work
solely in two dimensions, but full computer vision requires partial three-dimensional
information that is not just a set of two-dimensional views. At present there are only
limited ways of representing three-dimensional information directly, and they are not
as good as what humans evidently use.
• Expert Systems
A "knowledge engineer" interviews experts in a certain domain and tries to embody
their knowledge in a computer program for carrying out some task. How well this
works depends on whether the intellectual mechanisms required for the task are
within the present state of AI. When this turned out not to be so, there were many
disappointing results. One of the first expert systems was MYCIN in 1974, which
diagnosed bacterial infections of the blood and suggested treatments. It did better than
medical students or practicing doctors, provided its limitations were observed.
Namely, its ontology included bacteria, symptoms, and treatments and did not include
patients, doctors, hospitals, death, recovery, and events occurring in time. Its
interactions depended on a single patient being considered. Since the experts
consulted by the knowledge engineers knew about patients, doctors, death, recovery,
etc., it is clear that the knowledge engineers forced what the experts told them into a
predetermined framework. The usefulness of current expert systems depends on their
users having common sense.
• Heuristic Classification
One of the most feasible kinds of expert system given the present knowledge of AI is
to put some information in one of a fixed set of categories using several sources of
information. An example is advising whether to accept a proposed credit card
purchase. Information is available about the owner of the credit card, his record of
payment and also about the item he is buying and about the establishment from which
he is buying it (e.g., about whether there have been previous credit card frauds at this
establishment).
Early work in AI
• “Artificial Intelligence (AI) is the part of computer science concerned with designing
intelligent computer systems, that is, systems that exhibit characteristics we associate
with intelligence in human behaviour – understanding language, learning, reasoning,
solving problems, and so on.”
• Scientific Goal: To determine which ideas about knowledge representation, learning,
rule systems, search, and so on, explain various sorts of real intelligence.
• Engineering Goal: To solve real world problems using AI techniques such as
knowledge representation, learning, rule systems, search, and so on.
• Traditionally, computer scientists and engineers have been more interested in the
engineering goal, while psychologists, philosophers and cognitive scientists have been
more interested in the scientific goal.
• The Roots - Artificial Intelligence has identifiable roots in a number of older
disciplines, particularly:
✓ Philosophy
✓ Logic/Mathematics
✓ Computation
✓ Psychology/Cognitive Science
✓ Biology/Neuroscience
✓ Evolution
• There is inevitably much overlap, e.g. between philosophy and logic, or between
mathematics and computation. By looking at each of these in turn, we can gain a
better understanding of their role in AI, and how these underlying disciplines have
developed to play that role.
• Philosophy
✓ ~400 BC Socrates asks for an algorithm to distinguish piety from non-piety.
✓ ~350 BC Aristotle formulated different styles of deductive reasoning, which
could mechanically generate conclusions from initial premises, e.g. Modus Ponens:
if A → B and A, then B
(if A implies B, and A is true, then B is true; for example, if "when it's raining you
get wet" and "it is raining" are both true, then "you get wet" must be true).
✓ 1596 – 1650 Rene Descartes idea of mind-body dualism – part of the mind is
exempt from physical laws.
✓ 1646 – 1716 Wilhelm Leibniz was one of the first to take the materialist position
which holds that the mind operates by ordinary physical processes – this has the
implication that mental processes can potentially be carried out by machines.
• Logic/Mathematics
✓ Earl Stanhope’s Logic Demonstrator was a machine that was able to solve
syllogisms, numerical problems in a logical form, and elementary questions of
probability.
✓ 1815 – 1864 George Boole introduced his formal language for making logical
inference in 1847 – Boolean algebra.
✓ 1848 – 1925 Gottlob Frege produced a logic that is essentially the first-order
logic that today forms the most basic knowledge representation system.
✓ 1906 – 1978 Kurt Gödel showed in 1931 that there are limits to what logic can
do. His Incompleteness Theorem showed that in any formal logic powerful
enough to describe the properties of natural numbers, there are true statements
whose truth cannot be established by any algorithm.
✓ 1995 Roger Penrose tries to prove the human mind has non-computable
capabilities.
• Computation
✓ 1869 William Jevons’ Logic Machine could handle Boolean Algebra and Venn
Diagrams, and was able to solve logical problems faster than human beings.
✓ 1912 – 1954 Alan Turing tried to characterise exactly which functions are
capable of being computed. Unfortunately it is difficult to give the notion of
computation a formal definition. However, the Church-Turing thesis, which states
that a Turing machine is capable of computing any computable function, is
generally accepted as providing a sufficient definition. Turing also showed that
there were some functions which no Turing machine can compute (e.g. Halting
Problem).
✓ 1903 – 1957 John von Neumann proposed the von Neumann architecture, which
allows a description of computation that is independent of the particular
realisation of the computer.
✓ 1960s Two important concepts emerged: Intractability (when solution time
grows at least exponentially) and Reduction (to ‘easier’ problems).
• Psychology / Cognitive Science
✓ Modern Psychology / Cognitive Psychology / Cognitive Science is the science
which studies how the mind operates, how we behave, and how our brains process
information.
✓ Language is an important part of human intelligence. Much of the early work on
knowledge representation was tied to language and informed by research into
linguistics.
✓ It is natural for us to try to use our understanding of how human (and other
animal) brains lead to intelligent behavior in our quest to build artificial intelligent
systems. Conversely, it makes sense to explore the properties of artificial systems
(computer models/simulations) to test our hypotheses concerning human systems.
✓ Many sub-fields of AI are simultaneously building models of how the human
system operates, and artificial systems for solving real world problems, and are
allowing useful ideas to transfer between them.
• Biology / Neuroscience
✓ Our brains (which give rise to our intelligence) are made up of tens of billions of
neurons, each connected to hundreds or thousands of other neurons.
✓ Each neuron is a simple processing device (e.g. just firing or not firing depending
on the total amount of activity feeding into it). However, large networks of
neurons are extremely powerful computational devices that can learn how best to
operate.
✓ The field of Connectionism or Neural Networks attempts to build artificial
systems based on simplified networks of simplified artificial neurons.
✓ The aim is to build powerful AI systems, as well as models of various human
abilities.
✓ Neural networks work at a sub-symbolic level, whereas much of conscious human
reasoning appears to operate at a symbolic level.
✓ Artificial neural networks perform well at many simple tasks, and provide good
models of many human abilities. However, there are many tasks that they are not
so good at, and other approaches seem more promising in those areas.
• Evolution
✓ One advantage humans have over current machines/computers is that they have a
long evolutionary history.
✓ Charles Darwin (1809 – 1882) is famous for his work on evolution by natural
selection. The idea is that fitter individuals will naturally tend to live longer and
produce more children, and hence after many generations a population will
automatically emerge with good innate properties.
✓ This has resulted in brains that have much structure, or even knowledge, built in at
birth.
✓ This gives them an advantage over simple artificial neural network systems
that have to learn everything.
✓ Computers are finally becoming powerful enough that we can simulate evolution
and evolve good AI systems.
✓ We can now even evolve systems (e.g. neural networks) so that they are good at
learning.
✓ A related field called genetic programming has had some success in evolving
programs, rather than programming them by hand.
• Sub-fields of Artificial Intelligence
✓ Neural Networks – e.g. brain modelling, time series prediction, classification
✓ Evolutionary Computation – e.g. genetic algorithms, genetic programming
✓ Vision – e.g. object recognition, image understanding
✓ Robotics – e.g. intelligent control, autonomous exploration
✓ Expert Systems – e.g. decision support systems, teaching systems
✓ Speech Processing– e.g. speech recognition and production
✓ Natural Language Processing – e.g. machine translation
✓ Planning – e.g. scheduling, game playing
✓ Machine Learning – e.g. decision tree learning, version space learning
• Speech Processing
✓ As well as trying to understand human systems, there are also numerous real
world applications: speech recognition for dictation systems and voice activated
control; speech production for automated announcements and computer interfaces.
✓ How do we get from sound waves to text streams and vice-versa?
• Logical AI
What a program knows about the world in general, the facts of the specific situation in
which it must act, and its goals are all represented by sentences of some mathematical
logical language. The program decides what to do by inferring that certain actions are
appropriate for achieving its goals.
• Search
AI programs often examine large numbers of possibilities, e.g. moves in a chess game
or inferences by a theorem proving program. Discoveries are continually made about
how to do this more efficiently in various domains.
• Pattern Recognition
When a program makes observations of some kind, it is often programmed to
compare what it sees with a pattern. For example, a vision program may try to match
a pattern of eyes and a nose in a scene in order to find a face. More complex patterns,
e.g. in a natural language text, in a chess position, or in the history of some event are
also studied.
• Representation
Facts about the world have to be represented in some way. Usually languages of
mathematical logic are used.
• Inference
From some facts, others can be inferred. Mathematical logical deduction is adequate
for some purposes, but new methods of non-monotonic inference have been added to
logic since the 1970s. The simplest kind of non-monotonic reasoning is default
reasoning in which a conclusion is to be inferred by default, but the conclusion can be
withdrawn if there is evidence to the contrary. For example, when we hear of a bird,
we may infer that it can fly, but this conclusion can be reversed when we hear that it
is a penguin. It is the possibility that a conclusion may have to be withdrawn that
constitutes the non-monotonic character of the reasoning. Ordinary logical reasoning
is monotonic in that the set of conclusions that can be drawn from a set of premises is
a monotonically increasing function of the premises.
• Planning
Planning programs start with general facts about the world (especially facts about the
effects of actions), facts about the particular situation and a statement of a goal. From
these, they generate a strategy for achieving the goal. In the most common cases, the
strategy is just a sequence of actions.
• Epistemology
This is a study of the kinds of knowledge that are required for solving problems in the
world.
• Ontology
Ontology is the study of the kinds of things that exist. In AI, the programs and
sentences deal with various kinds of objects, and we study what these kinds are and
what their basic properties are. Emphasis on ontology begins in the 1990s.
• Heuristics
A heuristic is a way of trying to discover something, or an idea embedded in a
program. The term is used variously in AI. Heuristic functions are used in some
approaches to search to measure how far a node in a search tree seems to be from a
goal. Heuristic predicates that compare two nodes in a search tree to see if one is
better than the other, i.e. constitutes an advance toward the goal, may be more useful.
• Genetic Programming
Genetic programming is a technique for getting programs to solve a task by mating
random Lisp programs and selecting the fittest over millions of generations.
Unit - II
Search is a method that can be used by computers to examine a problem space like
this in order to find a goal. Often, we want to find the goal as quickly as possible or without
using too many resources. A problem space can also be considered to be a search space
because in order to solve the problem, we will search the space for a goal state. We will
continue to use the term search space to describe this concept. In this chapter, we will look at
a number of methods for examining a search space. These methods are called search
methods.
• State Space Graphs: If the number of possible states of the system is small enough,
we can represent all of them, along with the transitions between them, in a state space
graph.
• Routes through State Space: Our general aim is to search for a route, or sequence of
transitions, through the state space graph from our initial state to a goal state.
• Sometimes there will be more than one possible goal state. We define a goal test to
determine if a goal state has been achieved.
• The solution can be represented as a sequence of link labels (or transitions) on the
state space graph. Note that the labels depend on the direction moved along the link.
• Sometimes there may be more than one path to a goal state, and we may want to find
the optimal (best possible) path. We can define link costs and path costs for
measuring the cost of going along a particular path, e.g. the path cost may just equal
the number of links, or could be the sum of individual link costs.
• For most realistic problems, the state space graph will be too large for us to hold all of
it explicitly in memory at any one time.
• Search Trees: It is helpful to think of the search process as building up a search tree
of routes through the state space graph. The root of the search tree is the search node
corresponding to the initial state.
• The leaf nodes correspond either to states that have not yet been expanded, or to states
that generated no further nodes when expanded.
• At each step, the search algorithm chooses a new unexpanded leaf node to expand.
The different search strategies essentially correspond to the different algorithms one
can use to select which is the next node to be expanded at each stage.
Examples of search problems
• Traveling Salesman Problem: Given n cities with known distances between each
pair, find the shortest tour that passes through all the cities exactly once before
returning to the starting city.
• A lower bound on the length l of any tour can be computed as follows:
✓ For each city i, 1 ≤ i ≤ n, find the sum si of the distances from city i to the two
nearest cities.
✓ Compute the sum s of these n numbers.
✓ Divide the result by 2 and round up to the nearest integer:
lb = ⌈s / 2⌉
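• As a rough illustration, this bound can be computed directly from a distance matrix. The following Python sketch uses illustrative names and assumes dist is a symmetric matrix with dist[i][j] giving the distance between cities i and j:

import math

def tsp_lower_bound(dist):
    # For each city, take its two cheapest incident edges. Any tour uses
    # exactly two edges per city, so s <= 2 * (tour length), and
    # ceil(s / 2) is therefore a valid lower bound on every tour.
    n = len(dist)
    s = 0
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)
        s += nearest[0] + nearest[1]
    return math.ceil(s / 2)

# With per-city nearest-pair sums (1+3), (3+6), (1+2), (3+4) and (2+3),
# as in the example below, s = 28 and the bound is 14.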
• The lower bound for the graph shown in the Fig can be computed as follows:
lb = [(1 + 3) + (3 + 6) + (1 + 2)
+ (3 + 4) + (2 + 3)] / 2 = 14.
• For any subset of tours that must include particular edges of a given graph, the lower
bound can be modified accordingly. E.g.: For all the Hamiltonian circuits of the graph
that must include edge (a, d), the lower bound can be computed as follows:
lb = ⌈[(1 + 5) + (3 + 6) + (1 + 2) + (3 + 5) + (2 + 3)] / 2⌉ = 16.
➢ The root node includes only the starting vertex a, with a lower bound of 14.
➢ Node 2 represents the inclusion of edge (a, c). Since b is not visited before c,
this node is terminated.
➢ Node 3 represents the inclusion of edge (a, d)
lb = ⌈[(1 + 5) + (3 + 6) + (1 + 2) + (3 + 5) + (2 + 3)] / 2⌉ = 16.
➢ Among all the four live nodes of the root, node 1 has a better lower bound.
Hence we branch from node 1.
➢ Node 5 represents the inclusion of edge (b, c)
lb = [(1 + 3) + (3 + 6) + (1 + 6) + (3 + 4) + (2 + 3)] / 2 = 16.
➢ Node 9 represents the inclusion of the edges (c, e), (e, d) and (d, a). Hence,
the length of the tour,
l = 3 + 6 + 2 + 3 + 5 = 19.
➢ Node 10 represents the inclusion of the edges (d, c), (c, e) and (e, a). Hence,
the length of the tour,
l = 3 + 7 + 4 + 2 + 8 = 24.
➢ Node 11 represents the inclusion of the edges (d, e), (e, c) and (c, a). Hence,
the length of the tour,
l = 3 + 7 + 3 + 2 + 1 = 16.
➢ Node 11 represents an optimal tour, since its tour length is better than or equal
to the bounds of the other live nodes, 8, 9, 10, 3 and 4.
➢ The optimal tour is a → b → d → e → c → a with a tour length of 16.
• Breadth First Search (BFS): BFS expands the leaf node with the lowest path cost so
far, and keeps going until a goal node is generated. If the path cost simply equals the
number of links, we can implement this as a simple queue (“first in, first out”).
• This is guaranteed to find an optimal path to a goal state. It is memory intensive if the
state space is large. If the typical branching factor is b, and the depth of the shallowest
goal state is d, the space complexity is O(b^d), and the time complexity is O(b^d).
• BFS is an easy search technique to understand. The algorithm is presented below.
breadth_first_search ()
    add the initial state to the queue Q ;
    while (Q is not exhausted)
        remove the first state from Q ;
        if it is a goal state, return success ;
        else add all its successor states to the end of Q and continue ;
    return failure ;
• The algorithm is illustrated using the bridge components configuration problem. The
initial state is PDFG, which is not a goal state; hence set it as the current state.
Generate another state DPFG (by swapping the 1st and 2nd position values) and add it to
the list. That is not a goal state; hence generate the next successor state, which is FDPG
(by swapping the 1st and 3rd position values). This is also not a goal state; hence add it to
the list and generate the next successor state GDFP.
• Only three states can be generated from the initial state. Now the queue Q will have
three elements in it, viz., DPFG, FDPG and GDFP. Now take DPFG (first state in the
list) as the current state and continue the process, until all the states generated from
this are evaluated. Continue this process, until the goal state DGPF is reached.
• The 14th evaluation gives the goal state. It may be noted that, all the states at one
level in the tree are evaluated before the states in the next level are taken up; i.e., the
evaluations are carried out breadth-wise. Hence, the search strategy is called
breadth-first search.
• Depth First Search (DFS): DFS expands the leaf node with the highest path cost so
far, and keeps going until a goal node is generated. If the path cost simply equals the
number of links, we can implement this as a simple stack (“last in, first out”).
• This is not guaranteed to find any path to a goal state. It is memory efficient even if
the state space is large. If the typical branching factor is b, and the maximum depth of
the tree is m, the space complexity is O(bm), and the time complexity is O(b^m).
• In DFS, instead of generating all the states below the current level, only the first state
below the current level is generated and evaluated recursively. The search continues
till a further successor cannot be generated.
• Then it goes back to the parent and explores the next successor. The algorithm is
given below.
depth_first_search (current_state)
    if current_state is a goal state, return success ;
    if no successor can be generated, return failure ;
    for each successor s of current_state
        if depth_first_search (s) succeeds, return success ;
        else continue ;
    return failure ;
• Since DFS stores only the states in the current path, it uses much less memory during
the search compared to BFS.
• The probability of arriving at a goal state with fewer evaluations is higher
with DFS than with BFS. This is because, in BFS, all the states in a level have to
be evaluated before states in the lower level are considered. DFS is very efficient
when many acceptable solutions exist, so that the search can be terminated once the
first acceptable solution is obtained.
• BFS is advantageous in cases where the tree is very deep.
• An ideal search mechanism is to combine the advantages of BFS and DFS.
Informed search
• Informed search uses some kind of evaluation function to tell us how far each
expanded state is from a goal state, and/or some kind of heuristic function to help us
decide which state is likely to be the best one to expand next.
• The hard part is to come up with good evaluation and/or heuristic functions. Often
there is a natural evaluation function, such as distance in miles or the number of
objects in the wrong position.
• Sometimes we can learn heuristic functions by analyzing what has worked well in
similar previous searches.
• The simplest idea, known as greedy best first search, is to expand the node that is
already closest to the goal, as that is most likely to lead quickly to a solution. This is
like DFS in that it attempts to follow a single route to the goal, only attempting to try
a different route when it reaches a dead end. As with DFS, it is not complete, not
optimal, and has time and space complexity of O(b^m). However, with good
heuristics, the time complexity can be reduced substantially.
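• A minimal sketch of greedy best first search, assuming problem-specific callables goal_test(state), successors(state) and a heuristic h(state) supplied by the caller (all hypothetical names):

import heapq
from itertools import count

def greedy_best_first(start, goal_test, successors, h):
    tie = count()  # tie-breaker so the heap never has to compare states
    frontier = [(h(start), next(tie), start, [start])]
    visited = set()
    while frontier:
        # Always expand the node the heuristic says is closest to a goal.
        _, _, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return path
        if state in visited:
            continue
        visited.add(state)
        for s in successors(state):
            if s not in visited:
                heapq.heappush(frontier, (h(s), next(tie), s, path + [s]))
    return None  # no goal found

As the text notes, this simply commits to whatever looks closest; it gives no optimality guarantee.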
• Branch and Bound: An enhancement of backtracking.
• Applicable to optimization problems.
• For each node (partial solution) of a state-space tree, computes a bound on the value
of the objective function for all descendants of the node (extensions of the partial
solution).
• Uses the bound for:
➢ Ruling out certain nodes as “nonpromising” to prune the tree – if a node’s
bound is not better than the best solution seen so far.
➢ Guiding the search through state-space.
• The search path at the current node in a state-space tree can be terminated for any one
of the following three reasons:
➢ The value of the node’s bound is not better than the value of the best solution
seen so far.
➢ The node represents no feasible solutions because the constraints of the
problem are already violated.
➢ The subset of feasible solutions represented by the node consists of a single
point and hence we compare the value of the objective function for this
feasible solution with that of the best solution seen so far and update the latter
with the former if the new solution is better.
• Best-First branch-and-bound:
➢ A variation of backtracking.
➢ Among all the nonterminated leaves, called the live nodes, in the current
tree, generate all the children of the most promising node, instead of
generating a single child of the last promising node, as is done in
backtracking.
➢ Consider the node with the best bound as the most promising node.
• A* Search: Suppose that, for each node n in a search tree, an evaluation function f(n)
is defined as the sum of the cost g(n) to reach that node from the start state, plus an
estimated cost h(n) to get from that state to the goal state. Then f(n) is the
estimated cost of the cheapest solution through n.
• A* search, which is the most popular form of best-first search, repeatedly picks the
node with the lowest f(n) to expand next. It turns out that if the heuristic function h(n)
satisfies certain conditions, then this strategy is both complete and optimal.
• In particular, if h(n) is an admissible heuristic, i.e. is always optimistic and never
overestimates the cost to reach the goal, then A* is optimal.
• The classic example is finding the route by road between two cities given the straight
line distances from each road intersection to the goal city. In this case, the nodes are
the intersections, and we can simply use the straight-line distances as h(n).
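• A compact A* sketch under similar assumptions, where neighbours(state) yields (next_state, step_cost) pairs and h is an admissible heuristic (for the route-finding example, the straight-line distance to the goal city); the names are illustrative:

import heapq
from itertools import count

def a_star(start, goal, neighbours, h):
    tie = count()
    # Each entry carries f(n) = g(n) + h(n), the cost so far g, and the path.
    frontier = [(h(start), next(tie), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, _, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, g
        for nxt, cost in neighbours(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier,
                               (g2 + h(nxt), next(tie), g2, nxt, path + [nxt]))
    return None, float("inf")

This repeatedly picks the node with the lowest f(n), as described above, and with an admissible h the first path returned is optimal.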
• Hill Climbing / Gradient Descent: The basic idea of hill climbing is simple: at each
current state we select a transition, evaluate the resulting state, and if the resulting
state is an improvement we move there, otherwise we try a new transition from where
we are.
• We repeat this until we reach a goal state, or have no more transitions to try. The
transitions explored can be selected at random, or according to some problem specific
heuristics.
• In some cases, it is possible to define evaluation functions such that we can compute
the gradients with respect to the possible transitions, and thus compute which
transition direction to take to produce the best improvement in the evaluation
function.
• Following the evaluation gradients in this way is known as gradient descent.
• In neural networks, for example, we can define the total error of the output activations
as a function of the connection weights, and compute the gradients of how the error
changes as we change the weights. By changing the weights in small steps against
those gradients, we systematically minimize the network’s output errors.
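• As a toy illustration of the idea (the error function, learning rate and step count here are arbitrary choices, not from the text):

def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    # Step against the gradient to reduce the error at each iteration.
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Example: minimise E(w) = (w - 3)^2, whose gradient is 2(w - 3).
w = gradient_descent(lambda w: 2 * (w - 3), x0=0.0)
print(w)  # converges towards the minimum at w = 3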
Searching And-Or graphs
• The DFS and BFS strategies for OR trees and graphs can be adapted for And-Or trees
• The main difference lies in the way termination conditions are determined, since all
goals following an And node must be realized, whereas a single goal node following
an Or node will do
• A more general optimal strategy is AO* (O for ordered) algorithm
• As in the case of the A* algorithm, we use the open list to hold nodes that have been
generated but not expanded and the closed list to hold nodes that have been expanded
• The algorithm is a variation of the original given by Nilsson
• It requires that nodes traversed in the tree be labeled as solved or unsolved in the
solution process, to account for And node solutions which require solutions to all
successor nodes.
• A solution is found when the start node is labeled as solved
• The AO* algorithm
➢ Step 1: Place the start node s on open
➢ Step 2: Using the search tree constructed thus far, compute the most promising
solution tree T0
➢ Step 3: Select a node n that is both on open and a part of T0. Remove n from
open and place it on closed
➢ Step 4: If n is a terminal goal node, label n as solved. If the solution of n results
in any of n’s ancestors being solved, label all the ancestors as solved. If the
start node s is solved, exit with success where T0 is the solution tree. Remove
from open all nodes with a solved ancestor
➢ Step 5: If n is not a solvable node, label n as unsolvable. If the start node is
labeled as unsolvable, exit with failure. If any of n’s ancestors become
unsolvable because n is, label them unsolvable as well. Remove from open all
nodes with unsolvable ancestors
➢ Step 6: Otherwise, expand node n, generating all of its successors. For each such
successor node that contains more than one subproblem, generate their
successors to give individual subproblems. Attach to each newly generated
node a back pointer to its predecessor. Compute the cost estimate h* for each
newly generated node and place all such nodes that do not yet have
descendants on open. Next, recompute the values of h* at n and at each
ancestor of n
➢ Step 7: Return to step 2
• It can be shown that AO* will always find a minimum-cost solution tree if one exists,
provided only that h*(n) ≤ h(n), and all arc costs are positive. Like A*, the efficiency
depends on how closely h* approximates h
Simulated Annealing
• If the new state means that the system as a whole has a lower energy than it did in the
previous state, then it is accepted.
• If the energy is higher than for the previous state, then a probability is applied to
determine whether the new state is accepted or not. This probability is called a
Boltzmann acceptance criterion and is calculated as e^(−dE/T), where T is the
current temperature of the system, and dE is the increase in energy that has been
produced by moving from the previous state to the new state.
• The temperature in this context refers to the percentage of steps that can be taken that
lead to a rise in energy: At a higher temperature, more steps will be accepted that lead
to a rise in energy than at low temperature.
• To determine whether to move to a higher energy state or not, the probability
e^(−dE/T) is calculated, and a random number is generated between 0 and 1. If this
random number is lower than the probability function, the new state is accepted. In
cases where the increase in energy is very high, or the temperature is very low, this
means that very few states will be accepted that involve an increase in energy, as
e^(−dE/T) approaches zero.
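• The acceptance test just described can be sketched as a single function (dE and T as defined above):

import math
import random

def accept_move(dE, T):
    # Downhill moves (dE <= 0) are always taken; uphill moves are taken
    # with probability e^(-dE/T), which shrinks as dE grows or T falls.
    if dE <= 0:
        return True
    return random.random() < math.exp(-dE / T)

At a high temperature almost any move passes this test; as T approaches zero, only moves that lower the energy survive.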
• The fact that some steps are allowed that increase the energy of the system enables the
process to escape from local minima, which means that simulated annealing often can
be an extremely powerful method for solving complex problems with many local
optima.
• Some systems use e^(−dE/kT) as the probability that the search will progress to a state
with a higher energy, where k is Boltzmann’s constant (approximately
1.3807 × 10^−23 joules per kelvin).
• Simulated annealing uses Monte Carlo simulation to identify the most stable state (the
state with the lowest energy) for a system.
• This is done by running successive iterations of Metropolis Monte Carlo simulation,
using progressively lower temperatures. Hence, in successive iterations, fewer and
fewer steps are allowed that lead to an overall increase in energy for the system.
• A cooling schedule (or annealing schedule) is applied, which determines the manner
in which the temperature will be lowered for successive iterations.
• Two popular cooling schedules are as follows:
Tnew = Told − dT
Tnew = C × Told (where C < 1.0)
• The cooling schedule is extremely important, as is the choice of the number of steps
of Metropolis Monte Carlo simulation that are applied in each iteration.
• These help to determine whether the system will be trapped by local minima (known
as quenching). The number of times the Metropolis Monte Carlo simulation is applied
per iteration is typically increased for later iterations.
• Also important in determining the success of simulated annealing are the choice of the
initial temperature of the system and the amount by which the temperature is
decreased for each iteration.
• These values need to be chosen carefully according to the nature of the problem being
solved. When the temperature, T, has reached zero, the system is frozen, and if the
simulated annealing process has been successful, it will have identified a minimum
for the total energy of the system.
• Simulated annealing has a number of practical applications in solving problems with
large numbers of interdependent variables, such as circuit design.
• It has also been successfully applied to the traveling salesman problem.
• Uses of Simulated Annealing
➢ Simulated annealing was invented in 1983 by Kirkpatrick, Gelatt, and Vecchi.
➢ It was first used for placing VLSI components on a circuit board.
➢ Simulated annealing has also been used to solve the traveling salesman
problem, although this approach has proved to be less efficient than using
heuristic methods that know more about the problem.
➢ It has been used much more successfully in scheduling problems and other
large combinatorial problems where values need to be assigned to a large
number of variables to maximize (or minimize) some function of those
variables.
Real-Time A*
Logic
Logic is concerned with reasoning and the validity of arguments. In general, in logic, we are
not concerned with the truth of statements, but rather with their validity. That is to say,
although the following argument is clearly logical, it is not something that we would consider
to be true:
All lemons are blue.
Mary is a lemon.
Therefore, Mary is blue.
This set of statements is considered to be valid because the conclusion (Mary is blue) follows
logically from the other two statements, which we often call the premises. The reason that
validity and truth can be separated in this way is simple: a piece of reasoning is considered
to be valid if its conclusion is true in cases where its premises are also true. Hence, a valid set
of statements such as the ones above can give a false conclusion, provided one or more of the
premises are also false.
Logic is concerned with truth values. The possible truth values are true and false.
These can be considered to be the fundamental units of logic, and almost all logic is
ultimately concerned with these truth values.
Logic is widely used in computer science, and particularly in Artificial Intelligence. Logic is
widely used as a representational method for Artificial Intelligence. Unlike some other
representations, logic allows us to easily reason about negatives (such as, “this book is not
red”) and disjunctions (“or”—such as, “He’s either a soldier or a sailor”).
Logic is also often used as a representational method for communicating concepts and
theories within the Artificial Intelligence community. In addition, logic is used to represent
language in systems that are able to understand and analyze human language.
As we will see, one of the main weaknesses of traditional logic is its inability to deal
with uncertainty. Logical statements must be expressed in terms of truth or falsehood—it is
not possible to reason, in classical logic, about possibilities. We will see different versions of
logic such as modal logics that provide some ability to reason about possibilities, and also
probabilistic methods and fuzzy logic that provide much more rigorous ways to reason in
uncertain situations.
Logical Operators
• In reasoning about truth values, we need to use a number of operators, which can be
applied to truth values.
• We are familiar with several of these operators from everyday language:
I like apples and oranges.
• Sentences like this one show the most basic logical operators being used in everyday
language. The operators are:
➢ and
➢ or
➢ not
➢ if . . . then . . . (usually called implies)
• One important point to note is that or is slightly different from the way we usually use
it. In the sentence, “You can have an ice cream or a cake,” the mother is usually
suggesting to her child that he can only have one of the items, but not both. This is
referred to as an exclusive-or in logic because the case where both are allowed is
excluded.
• The version of or that is used in logic is called inclusive-or and allows the case with
both options.
• The operators are usually written using the following symbols, although other
symbols are sometimes used, according to the context:
and ∧
or ∨
not ¬
implies →
iff ↔
• Iff is an abbreviation that is commonly used to mean “if and only if.”
• We see later that this is a stronger form of implies that holds true if one thing implies
another, and also the second thing implies the first.
• For example, “you can have an ice cream if and only if you eat your dinner.” It may
not be immediately apparent why this is different from “you can have an ice cream if
you eat your dinner.” This is because most mothers really mean iff when they use if in
this way.
• To use logic, it is first necessary to convert facts and rules about the real world into
logical expressions using the logical operators
• Without a reasonable amount of experience at this translation, it can seem quite a
daunting task in some cases.
• Let us examine some examples. First, we will consider the simple operators, ∧, ∨, and
¬.
• Sentences that use the word and in English to express more than one concept, all of
which are true at once, can be easily translated into logic using the AND operator, ∧.
• For example: “It is raining and it is Tuesday.” might be expressed as R ∧ T, where R
means “it is raining” and T means “it is Tuesday.”
• For example, if it is not necessary to discuss where it is raining, R is probably enough.
• If we need to write expressions such as “it is raining in New York” or “it is raining
heavily” or even “it rained for 30 minutes on Thursday,” then R will probably not
suffice. To express more complex concepts like these, we usually use predicates.
Hence, for example, we might translate “it is raining in New York” as N(R). We
might equally well choose to write it as R(N).
• This depends on whether we consider the rain to be a property of New York, or vice
versa. In other words, when we write N(R), we are saying that a property of the rain is
that it is in New York, whereas with R(N) we are saying that a property of New York
is that it is raining. Which we use depends on the problem we are solving. It is likely
that if we are solving a problem about New York, we would use R(N), whereas if we
are solving a problem about the location of various types of weather, we might use
N(R).
• Let us return now to the logical operators. The expression “it is raining in New York,
and I’m either getting sick or just very tired” can be expressed as follows:
R(N) ∧ (S(I) ∨ T(I))
• Here we have used both the ∧ operator, and the ∨ operator to express a collection of
statements. The statement can be broken down into two sections, which is indicated
by the use of parentheses.
• The section in the parentheses is S(I) ∨ T(I), which means “I’m either getting sick OR
I’m very tired”. This expression is ANDed with the part outside the parentheses,
which is R(N).
• Finally, the ¬ operator is applied exactly as you would expect—to express negation.
• It is important to get the ¬ in the right place. For example, “I’m either not well or
very tired” might be translated as ¬W(I) ∨ T(I).
• The position of the ¬ here indicates that it is bound to W(I) and does not play any
role in the rest of the expression.
• Here we have used the following symbol translations:
S(y) means that y is a sandwich.
E(x, y) means that x (the man) eats y (the sandwich).
P(y) means that y (the sandwich) has pickles in it.
• This kind of notation makes logical expressions easier for humans to read, although it
adds no value when the expressions are manipulated by a computer.
Truth Tables
• We can use variables to represent possible truth values, in much the same way that
variables are used in algebra to represent possible numerical values.
• We can then apply logical operators to these variables and can reason about the way
in which they behave.
• It is usual to represent the behavior of these logical operators using truth tables.
• A truth table shows the possible values that can be generated by applying an operator
to truth values.
• Not
➢ First of all, we will look at the truth table for not, ¬:
A     ¬A
false true
true  false
• And
➢ Now, let us examine the truth table for our first binary operator, one which
acts on two variables:
A     B     A ∧ B
false false false
false true  false
true  false false
true  true  true
• Implies
➢ An implication A → B is false only when A is true and B is false; a false
antecedent makes the whole implication true, e.g.
52 = 25 → 0 = 2 (false → false)
➢ In fact, whenever the antecedent is false, the consequent could be given any
meaning, and the statement would still be true. For example, the following
statement is valid:
52 = 25 → Logic is weird
➢ Notice that when looking at simple logical statements like these, there does
not need to be any real-world relationship between the antecedent and the
consequent.
➢ For logic to be useful, though, we tend to want the relationships being
expressed to be meaningful as well as being logically true.
• iff
➢ The truth table for iff (if and only if, ↔) is as follows:
A     B     A ↔ B
false false true
false true  false
true  false false
true  true  true
➢ It can be seen that A ↔ B is true as long as A and B have the same value.
➢ In other words, if one is true and the other false, then A ↔ B is false.
Otherwise, if A and B have the same value, A ↔ B is true.
• Truth tables are not limited to showing the values for single operators.
• For example, a truth table can be used to display the possible values for A ∧ (B ∨ C).
• Note that for two variables, the truth table has four lines, and for three variables, it has
eight. In general, a truth table for n variables will have 2^n lines.
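• Such a table can be generated mechanically. For instance, this short Python sketch prints the eight lines for A ∧ (B ∨ C):

from itertools import product

# One line per assignment of truth values to A, B and C (2^3 = 8 lines).
for a, b, c in product([False, True], repeat=3):
    print(a, b, c, a and (b or c))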
• The use of brackets in this expression is important. A ∧ (B ∨ C) is not the same as
(A ∧ B) ∨ C.
• ¬ has the highest precedence, meaning that it is most closely tied to its symbols. ∧ has
a greater precedence than ∨, which means that ¬A ∨ ¬B ∧ C can be expressed as
(¬A) ∨ ((¬B) ∧ C)
• Similarly, when we write ¬A ∨ B this is the same as (¬A) ∨ B rather than ¬(A ∨ B)
Tautology
• An expression whose truth table has the value true on every line, regardless of the
values of its variables, is called a tautology. Examples in English are “if you wibble,
then you wibble” and “either you wibble or you do not wibble.”
• In the language of logic, we can replace wibble with the symbol A, in which case
these two statements can be rewritten as
A → A
A ∨ ¬A
• By contrast, A ∧ ¬A is false for every value of A (a contradiction), and therefore so is
(A ∨ ¬A) → (A ∧ ¬A)
Equivalence
• Consider the two expressions
A ∧ B
B ∧ A
• It should be fairly clear that these two expressions will always have the same value
for a given pair of values for A and B.
• In other words, we say that the first expression is logically equivalent to the second
expression.
• We write this as A ∧ B ≡ B ∧ A. This means that the ∧ operator is commutative.
• Note that this is not the same as implication: A ∧ B → B ∧ A, although this second
statement is also true.
• The difference is that if, for two expressions e1 and e2, e1 ≡ e2, then e1 will always
have the same value as e2 for a given set of variables.
• On the other hand, as we have seen, e1 → e2 is true if e1 is false and e2 is true.
• There are a number of logical equivalences that are extremely useful.
• The following is a list of a few of the most common:
A ∨ A ≡ A
A ∧ A ≡ A
A ∧ (B ∧ C) ≡ (A ∧ B) ∧ C (∧ is associative)
A ∨ (B ∨ C) ≡ (A ∨ B) ∨ C (∨ is associative)
A ∧ (B ∨ C) ≡ (A ∧ B) ∨ (A ∧ C) (∧ is distributive over ∨)
A ∧ (A ∨ B) ≡ A
A ∨ (A ∧ B) ≡ A
A ∧ true ≡ A
A ∧ false ≡ false
A ∨ true ≡ true
A ∨ false ≡ A
• All of these equivalences can be proved by drawing up the truth tables for each side of
the equivalence and seeing if the two tables are the same.
• We do not need to use the → symbol at all; we can replace it with a combination of
¬ and ∨:
A → B ≡ ¬A ∨ B
• In fact, any binary logical operator can be expressed using ¬ and ∨, for example:
A ∧ B ≡ ¬(¬A ∨ ¬B)
• This fact is employed in electronic circuits, where nor gates, based on an operator
called nor, are used. Nor is represented by ↓, and is defined as follows:
A ↓ B ≡ ¬(A ∨ B)
A ∧ B ≡ ¬(¬A ∨ ¬B)
A ∨ B ≡ ¬(¬A ∧ ¬B)
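• A small sketch confirming these identities: nor alone suffices to rebuild not, or and and (the trailing underscores only avoid clashing with Python keywords):

def nor(a, b):
    return not (a or b)

def not_(a):
    return nor(a, a)                    # ¬A ≡ A ↓ A

def or_(a, b):
    return nor(nor(a, b), nor(a, b))    # A ∨ B ≡ ¬(A ↓ B)

def and_(a, b):
    return nor(nor(a, a), nor(b, b))    # A ∧ B ≡ ¬A ↓ ¬B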
Propositional Logic
• Syntax
➢ The syntax of propositional calculus is defined over an alphabet of symbols,
Σ = {true, false, ¬, ∧, ∨, →, ↔, (, ), p1, p2, p3, . . .}.
➢ Here we have used set notation to define the possible values that are contained
within the alphabet Σ.
➢ Note that we allow an infinite number of proposition letters, or propositional
symbols, p1, p2, p3, . . . , and so on.
➢ More usually, we will represent these by capital letters P, Q, R, and so on,
➢ If we need to represent a very large number of them, we will use the subscript
notation (e.g., p1).
➢ An expression is referred to as a well-formed formula (often abbreviated as
wff) or a sentence if it is constructed correctly, according to the rules of the
syntax of propositional calculus, which are defined as follows.
➢ In these rules, we use A, B, C to represent sentences. In other words, we
define a sentence recursively, in terms of other sentences.
➢ The following are well-formed sentences:
P, Q, R, . . .
true, false
(A)
¬A
A∧ B
A∨ B
A→B
A↔ B
P ∧ Q ∨ (B ∧ ¬C) → A ∧ B ∨ D ∧ (¬E)
• Semantics
➢ The semantics of the operators of propositional calculus can be defined in
terms of truth tables.
➢ The meaning of P ∧ Q is defined as “true when P is true and Q is also true.”
➢ The meaning of symbols such as P and Q is arbitrary and could be ignored
altogether if we were reasoning about pure logic.
Predicate Calculus
• Syntax
➢ Predicate calculus allows us to reason about properties of objects and
relationships between objects.
➢ In propositional calculus, we could express the English statement “I like
cheese” as a single symbol, and likewise its negation “I do not like cheese,”
but this does not allow us to extract any information about
the cheese, or me, or other things that I like.
➢ In predicate calculus, we use predicates to express properties of objects. So the
sentence “I like cheese” might be expressed as L(me, cheese) where L is a
predicate that represents the idea of “liking.” Note that as well as expressing a
property of me, this statement also expresses a relationship between me and
cheese. This can be useful, as we will see, in describing environments for
robots and other agents.
➢ For example, a simple agent may be concerned with the location of various
blocks, and a statement about the world might be T(A,B), which could mean:
Block A is on top of Block B.
➢ It is also possible to make more general statements using the predicate
calculus.
➢ For example, to express the idea that everyone likes cheese, we might say
(∀x)(P(x) → L(x, C))
➢ The symbol ∀ is read “for all,” so the statement above could be read as “for
every x it is true that if property P holds for x, then the relationship L holds
between x and C,” or in plainer English: “every x that is a person likes
cheese.” (Here we are interpreting P(x) as meaning “x is a person” or, more
precisely, “x has property P.”)
➢ Note that we have used brackets rather carefully in the statement above.
➢ This statement can also be written with fewer brackets: ∀x P(x) → L(x, C).
∀ is called the universal quantifier.
➢ The quantifier ∃ can be used to express the notion that some values do have a
certain property, but not necessarily all of them: (∃x)(L(x,C))
➢ This statement can be read “there exists an x such that x likes cheese.”
➢ This does not make any claims about the possible values of x, so x could be a
person, or a dog, or an item of furniture. When we use the existential
quantifier in this way, we are simply saying that there is at least one value of x
for which L(x,C) holds.
➢ The following is true: (∀x)(L(x,C)) → (∃x)(L(x,C)), but the following is not:
(∃x)(L(x,C)) → (∀x)(L(x,C))
• Relationships between ∀ and ∃
➢ It is also possible to combine the universal and existential quantifiers, such as
in the following statement: (∀x)(∃y)(L(x,y)).
➢ This statement can be read “for all x, there exists a y such that L holds for x
and y,” which we might interpret as “everyone likes something.”
➢ A useful relationship exists between ∀ and ∃. Consider the statement “not
everyone likes cheese.” We could write this as
¬(∀x)(P(x) → L(x,C)) (1)
➢ Now A → B is equivalent to ¬A ∨ B and, by De Morgan’s
laws, we can see that this is equivalent to ¬(A ∧ ¬B). Hence, statement (1)
can be rewritten as
¬(∀x)¬(P(x) ∧ ¬L(x,C)) (2)
➢ This can be read as “It is not true that for all x the following is not true: x is a
person and x does not like cheese.” If you examine this rather convoluted
sentence carefully, you will see that it is in fact the same as “there exists an x
such that x is a person and x does not like cheese.” Hence we can rewrite it as
(∃x)(P(x) ∧ ¬L(x,C)) (3)
➢ In making this transition from statement (2) to statement (3), we have utilized
the equivalence ¬(∀x)¬A ≡ (∃x)A.
• The type of predicate calculus that we have been referring to is also called first-order
predicate logic (FOPL).
• A first-order logic is one in which the quantifiers ∀ and ∃ can be applied to objects or
terms, but not to predicates or functions.
• So we can define the syntax of FOPL as follows. First, we define a term:
• A constant is a term.
• A variable is a term.
• f(x1, x2, x3, . . . , xn) is a term if x1, x2, x3, . . . , xn are all terms.
• Anything that does not meet the above description cannot be a term.
• For example, the following is not a term: ∀x P(x). This kind of construction we call a
sentence or a well-formed formula (wff), which is defined as follows.
• In these definitions, P is a predicate, x1, x2, x3, . . . , xn are terms, and A,B are wff ’s.
The following are the acceptable forms for wff ’s:
P(x1, x2, x3, . . . , xn)
¬A
A∧ B
A∨ B
A→B
A↔ B
(∀x)A
(∃x)A
Soundness
• We have seen that a logical system such as propositional logic consists of a syntax, a
semantics, and a set of rules of deduction.
• A logical system also has a set of fundamental truths, which are known as axioms.
• The axioms are the basic rules that are known to be true and from which all other
theorems within the system can be proved.
• An axiom of propositional logic, for example, is A → (B → A)
• A theorem of a logical system is a statement that can be proved by applying the rules
of deduction to the axioms in the system.
• If A is a theorem, then we write ├ A
• A logical system is described as being sound if every theorem is logically valid, or a
tautology.
• It can be proved by induction that both propositional logic and FOPL are sound.
• Completeness
➢ A logical system is complete if every tautology is a theorem—in other words, if
every valid statement in the logic can be proved by applying the rules of deduction
to the axioms. Both propositional logic and FOPL are complete.
• Decidability
➢ A logical system is decidable if it is possible to produce an algorithm that will
determine whether any wff is a theorem. In other words, if a logical system is
decidable, then a computer can be used to determine whether logical expressions
in that system are valid or not.
➢ We can prove that propositional logic is decidable by using the fact that it is
complete.
➢ We can prove that a wff A is a theorem by showing that it is a tautology. To show
that a wff is a tautology, we simply need to draw up a truth table for that wff and
show that all the lines have true as the result. This can clearly be done
algorithmically because we know that a truth table for n variables has 2^n lines and
is therefore finite, for a finite number of variables (a sketch of this procedure
follows below).
➢ FOPL, on the other hand, is not decidable. This is due to the fact that it is not
possible to develop an algorithm that will determine whether an arbitrary wff in
FOPL is logically valid.
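➢ For the propositional case, this decision procedure is easy to mechanize. A sketch, where the wff is passed in as a boolean function of its variables:

from itertools import product

def is_tautology(wff, n_vars):
    # Exhaustively try all 2^n assignments; a tautology is true on every line.
    return all(wff(*vals) for vals in product([False, True], repeat=n_vars))

print(is_tautology(lambda a: (not a) or a, 1))  # A -> A: True
print(is_tautology(lambda a, b: a and b, 2))    # A AND B: False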
• Monotonicity
➢ A logical system is described as being monotonic if a valid proof in the system
cannot be made invalid by adding additional premises or assumptions.
➢ In other words, if we find that we can prove a conclusion C by applying rules of
deduction to a premise B with assumptions A, then adding additional assumptions
to A will never invalidate that proof.
making the proof in a monotonic system.
➢ In fact, it turns out that adding contradictory assumptions allows us to prove
anything, including invalid conclusions. This makes sense if we recall the line in
the truth table for →, which shows that false → true. By adding a contradictory
assumption, we make our assumptions false and can thus prove any conclusion.
Modal Logics and Possible Worlds
• The forms of logic that we have dealt with so far deal with facts and properties of
objects that are either true or false.
• In these classical logics, we do not consider the possibility that things change or that
things might not always be as they are now.
• Modal logics are an extension of classical logic that allow us to reason about
possibilities and certainties.
• In other words, using a modal logic, we can express ideas such as “although the sky is
usually blue, it isn’t always” (for example, at night). In this way, we can reason about
possible worlds.
• A possible world is a universe or scenario that could logically come about.
• The following statements may not be true in our world, but they are possible, in the
sense that they are not illogical, and could be true in a possible world:
Trees are all blue.
• It is possible that some of these statements will become true in the future, or even that
they were true in the past.
• It is also possible to imagine an alternative universe in which these statements are true
now.
• The following statements, on the other hand, cannot be true in any possible world:
A ∧ ¬A
2 + 2 = 5
• The first of these illustrates the law of the excluded middle, which simply states that a
fact must be either true or false: it cannot be both true and false.
• It also cannot be the case that a fact is neither true nor false. This is a law of classical
logic, it is possible to have a logical system without the law of the excluded middle,
and in which a fact can be both true and false.
• The second statement cannot be true by the laws of mathematics. We are not
interested in possible worlds in which the laws of logic and mathematics do not hold.
• A statement that may be true or false, depending on the situation, is called contingent.
• A statement that must always have the same truth value, regardless of which possible
world we consider, is noncontingent.
• Hence, the following statements are contingent:
A ∧ B
A ∨ B
• The following statements are noncontingent:
A ∨ ¬A
A ∧ ¬A
If you like all ice cream, then you like this ice cream.
• Clearly, a noncontingent statement can be either true or false, but the fact that it is
noncontingent means it will always have that same truth value.
• If a statement A is contingent, then we say that A is possibly true, which is written ◊A
• If A is noncontingent, then it is necessarily true, which is written □A
• Reasoning in Modal Logic
➢ It is not possible to draw up a truth table for the operators ◊ and □
➢ The following rules are examples of the axioms that can be used to reason in
this kind of modal logic:
□A → ◊A
□¬A → ¬◊A
◊A → ¬□¬A
➢ Although truth tables cannot be drawn up to prove these rules, you should be able to reason
about them using your understanding of the meaning of the ◊ and □ operators.
Possible world representations
• This section describes a method proposed by Nilsson which generalizes first-order logic in the
modeling of uncertain beliefs
• The method assigns truth values ranging from 0 to 1 to possible worlds
• Each set of possible worlds corresponds to a different interpretation of sentences contained in a
knowledge base, denoted KB
• Consider the simple case where a KB contains only the single sentence S. S may be either true or
false. We envision S as being true in one set of possible worlds W1 and false in another set W2. The
actual world, the one we are in, must be in one of the two sets, but we are uncertain which one.
Uncertainty is expressed by assigning a probability P to W1 and 1 − P to W2. We can say then that the
probability of S being true is P
• When the KB contains L sentences, S1, …, SL, more sets of possible worlds are required to represent all
consistent truth value assignments. There are 2^L possible truth assignments for L sentences.
• Truth value assignments for the set {P, P→Q, Q}:

  Consistent            Inconsistent
  P  P→Q  Q             P  P→Q  Q
  T   T   T             T   T   F
  T   F   F             T   F   T
  F   T   T             F   F   T
  F   T   F             F   F   F
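The table can be verified mechanically: enumerate all 2^3 = 8 ways of assigning truth values
to the three sentences, and check whether the value assigned to P→Q agrees with the value
computed from the values assigned to P and Q. A short Python sketch:

from itertools import product

# enumerate all 2**3 truth-value assignments to the sentences P, P->Q, Q;
# an assignment is consistent when the value given to P->Q equals the
# truth value of the implication computed from P and Q
for vP, vImp, vQ in product([True, False], repeat=3):
    consistent = (vImp == ((not vP) or vQ))
    print(vP, vImp, vQ, "consistent" if consistent else "inconsistent")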
• Probability theory is used to discuss events, categories, and hypotheses about which
there is not 100% certainty.
• We might write A→B, which means that if A is true, then B is true. If we are unsure
whether A is true, then we cannot make use of this expression.
• In many real-world situations, it is very useful to be able to talk about things that lack
certainty. For example, what will the weather be like tomorrow? We might formulate
a very simple hypothesis based on general observation, such as “it is sunny only 10%
of the time, and rainy 70% of the time”. We can use a notation similar to that used for
predicate calculus to express such statements:
P(S) = 0.1
P(R) = 0.7
The first of these statements says that the probability of S (“it is sunny”) is 0.1. The
second says that the probability of R is 0.7. Probabilities are always expressed as real
numbers between 0 and 1. A probability of 0 means “definitely not” and a probability
of 1 means “definitely so.” Hence, P(S) = 1 means that it is always sunny.
• Many of the operators and notations that are used in propositional logic can also be
used in probabilistic notation. For example, P(¬S) means “the probability that it is
not sunny”; P(S ∧ R) means “the probability that it is both sunny and rainy.” P(A ∨
B), which means “the probability that either A is true or B is true,” is defined by the
following rule: P(A ∨B) = P(A) + P(B) - P(A ∧B)
• The notation P(B|A) can be read as “the probability of B, given A.” This is known as
conditional probability—it is conditional on A. In other words, it states the probability
that B is true, given that we already know that A is true. P(B|A) is defined by the
following rule: P(B|A) = P(A ∧ B) / P(A). Of course, this rule cannot be used in cases
where P(A) = 0.
• For example, let us suppose that the likelihood that it is both sunny and rainy at the
same time is 0.01. Then we can calculate the probability that it is rainy, given that it is
sunny, as follows: P(R|S) = P(R ∧ S) / P(S) = 0.01 / 0.1 = 0.1.
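These rules are easy to check numerically; the short Python sketch below reproduces the
calculations with the numbers used above:

# worked check of the probability rules with the sunny/rainy numbers
P_S = 0.1          # P(sunny)
P_R = 0.7          # P(rainy)
P_S_and_R = 0.01   # P(sunny and rainy), as assumed above

P_S_or_R = P_S + P_R - P_S_and_R   # P(A or B) rule -> 0.79
P_R_given_S = P_S_and_R / P_S      # conditional probability -> 0.1
print(P_S_or_R, P_R_given_S)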
• The basic approach statistical methods adopt to deal with uncertainty is via the
axioms of probability:
➢ Probabilities are (real) numbers in the range 0 to 1.
➢ A probability of P(A) = 0 indicates that A is certainly false, P(A) = 1 that A is
certainly true, and values in between express degrees of (un)certainty.
➢ Probabilities can be calculated in a number of ways.
• Probability = (number of desired outcomes) / (total number of outcomes)
So given a pack of playing cards the probability of being dealt an ace from a full
normal deck is 4 (the number of aces) / 52 (number of cards in deck) which is 1/13.
Similarly the probability of being dealt a spade suit is 13 / 52 = 1/4.
If you have a choice of k items from a set of n items, then the number of possible
selections is given by the binomial coefficient C(n, k) = n! / (k! (n − k)!). Note also
that the probabilities of all possible outcomes of an experiment must sum to 1.
• Conditional probability, P(A|B), indicates the probability of event A given that we
know event B has occurred.
• A Bayesian Network is a directed acyclic graph:
➢ A graph whose directed links indicate dependencies that exist between nodes.
➢ Nodes represent propositions about events or the events themselves.
➢ Conditional probabilities quantify the strength of the dependencies.
• Consider the following example:
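The figure for the original example is not reproduced here, so the following minimal sketch
(with purely illustrative numbers) stands in for it: a two-node network Rain → WetGrass,
where the single link carries the dependency and conditional probabilities quantify its
strength.

# hypothetical two-node Bayesian network: Rain -> WetGrass
P_rain = 0.2                # prior probability of rain
P_wet_given_rain = 0.9      # conditional probabilities on the link
P_wet_given_dry = 0.1

# marginal probability of wet grass: sum over the parent's states
P_wet = P_wet_given_rain * P_rain + P_wet_given_dry * (1 - P_rain)

# invert the link with Bayes' theorem: P(Rain | WetGrass)
P_rain_given_wet = P_wet_given_rain * P_rain / P_wet
print(P_rain_given_wet)     # ~0.69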
• Bayes’ theorem can be used to calculate the probability that a certain event will occur
or that a certain proposition is true
• The theorem is stated as follows: P(B|A) = P(A|B) P(B) / P(A)
➢ P(B) is called the prior probability of B. P(B|A), as well as being called the
conditional probability, is also known as the posterior probability of B.
➢ P(A ∧B) = P(A|B)P(B)
➢ Note that due to the commutativity of ∧, we can also write
➢ P(A ∧B) = P(B|A)P(A)
➢ Hence, we can deduce: P(B|A)P(A) = P(A|B)P(B)
➢ This can then be rearranged to give Bayes’ theorem: P(B|A) = P(A|B) P(B) / P(A)
➢ Read with a hypothesis H and evidence E, this says that the probability that the
hypothesis is true given the evidence is equal to the probability that E will be
observed given that H is true, times the prior probability of H, divided by the
probability of the evidence: P(H|E) = P(E|H) P(H) / P(E)
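A numerical illustration (the figures are hypothetical, chosen only to show the mechanics):
suppose a hypothesis H has prior probability 0.01, and the evidence E is observed with
probability 0.95 when H is true and 0.05 when H is false.

# Bayes' theorem with illustrative numbers
P_H = 0.01             # prior probability of the hypothesis
P_E_given_H = 0.95     # probability of the evidence when H is true
P_E_given_notH = 0.05  # probability of the evidence when H is false

P_E = P_E_given_H * P_H + P_E_given_notH * (1 - P_H)
P_H_given_E = P_E_given_H * P_H / P_E
print(P_H_given_E)     # ~0.16: the evidence raises, but does not settle, H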
Definition of learning
Definition
A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks T, as measured by P, improves
with experience E.
Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training experience E: A sequence of images and steering commands recorded while
observing a human driver
iii) A chess learning problem
• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself
Definition
A computer program which learns from experience is called a machine learning
program or simply a learning program. Such a program is sometimes also referred to as a learner.
Components of Learning
1. Data storage
The first component of the learning process is data storage: facilities for storing
and retrieving the data from which knowledge will be extracted. The later components all
operate on this stored data.
2. Abstraction
The second component of the learning process is known as abstraction.
Abstraction is the process of extracting knowledge about stored data. This involves creating
general concepts about the data as a whole. The creation of knowledge involves application of
known models and creation of new models.
The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original information.
3. Generalization
The third component of the learning process is known as generalization.
The term generalization describes the process of turning the knowledge about stored data into a
form that can be utilized for future action. These actions are to be carried out on tasks that are
similar, but not identical, to those that have been seen before. In generalization, the goal is to
discover those properties of the data that will be most relevant to future tasks.
4. Evaluation
Evaluation is the last component of the learning process.
It is the process of giving feedback to the user to measure the utility of the learned
knowledge. This feedback is then utilised to effect improvements in the whole learning process
Types of Learning
In general, machine learning algorithms can be classified into three types.
• Supervised learning
• Unsupervised learning
• Reinforcement learning
Supervised learning
A training set of examples with the correct responses (targets) is provided and, based on
this training set, the algorithm generalises to respond correctly to all possible inputs. This is also
called learning from exemplars. Supervised learning is the machine learning task of learning a
function that maps an input to an output based on example input-output pairs.
In supervised learning, each example in the training set is a pair consisting of an input
object (typically a vector) and an output value. A supervised learning algorithm analyzes the
training data and produces a function, which can be used for mapping new examples. In the
optimal case, the function will correctly determine the class labels for unseen instances. Both
classification and regression problems are supervised learning problems. A wide range of
supervised learning algorithms are available, each with its strengths and weaknesses. There is no
single learning algorithm that works best on all supervised learning problems.
Remarks
A “supervised learning” is so called because the process of an algorithm learning from
the training dataset can be thought of as a teacher supervising the learning process. We know the
correct answers (that is, the correct outputs), the algorithm iteratively makes predictions on the
training data and is corrected by the teacher. Learning stops when the algorithm achieves an
acceptable level of performance.
Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients and each patient is labeled as “healthy” or “sick”.
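The table of patients is not reproduced here. The sketch below assumes hypothetical labeled
records of that form, (gender, age, label), and fits the simplest of supervised learners, a
1-nearest-neighbour rule (all names and numbers are illustrative):

# hypothetical labeled training data: (gender, age, label)
train = [("M", 48, "sick"), ("M", 67, "sick"), ("F", 53, "healthy"),
         ("M", 49, "healthy"), ("F", 32, "sick"), ("M", 34, "healthy")]

def predict(gender, age):
    # distance: difference in age, plus a large penalty if gender differs
    def dist(row):
        g, a, _ = row
        return abs(a - age) + (0 if g == gender else 100)
    # return the label of the nearest training example
    return min(train, key=dist)[2]

print(predict("M", 50))   # -> "healthy" (nearest example is ("M", 49, "healthy"))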
Unsupervised learning
Correct responses are not provided, but instead the algorithm tries to identify similarities
between the inputs so that inputs that have something in common are categorised together. The
statistical approach to unsupervised learning is known as density estimation.
Unsupervised learning is a type of machine learning algorithm used to draw inferences
from datasets consisting of input data without labeled responses. In unsupervised learning
algorithms, a classification or categorization is not included in the observations. There are no
output values and so there is no estimation of functions. Since the examples given to the learner
are unlabeled, the accuracy of the structure that is output by the algorithm cannot be evaluated.
The most common unsupervised learning method is cluster analysis, which is used for
exploratory data analysis to find hidden patterns or grouping in data.
Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.
Based on this data, can we infer anything regarding the patients entering the clinic?
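Yes: even without labels we can look for groups. The sketch below runs a tiny 1-D k-means
(k = 2) on hypothetical ages and recovers two age groups without being told any “correct”
answers:

# hypothetical unlabeled ages of patients entering the clinic
ages = [22, 25, 27, 48, 52, 55]
c1, c2 = min(ages), max(ages)       # initial cluster centroids
for _ in range(10):
    # assignment step: each age joins its nearest centroid
    g1 = [a for a in ages if abs(a - c1) <= abs(a - c2)]
    g2 = [a for a in ages if abs(a - c1) >  abs(a - c2)]
    # update step: move each centroid to the mean of its group
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
print(g1, g2)   # two age groups emerge from the data alone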
Reinforcement learning
This is somewhere between supervised and unsupervised learning. The algorithm gets
told when the answer is wrong, but does not get told how to correct it. It has to explore and try
out different possibilities until it works out how to get the answer right. Reinforcement learning
is sometimes called learning with a critic because of this monitor that scores the answer, but does
not suggest improvements.
Reinforcement learning is the problem of getting an agent to act in the world so as to
maximize its rewards. A learner (the program) is not told what actions to take as in most forms
of machine learning, but instead must discover which actions yield the most reward by trying
them. In the most interesting and challenging cases, actions may affect not only the immediate
reward but also the next situations and, through that, all subsequent rewards.
Example
Consider teaching a dog a new trick: we cannot tell it what to do, but we can
reward/punish it if it does the right/wrong thing. It has to find out what it did that made it get the
reward/punishment. We can use a similar method to train computers to do many tasks, such as
playing backgammon or chess, scheduling jobs, and controlling robot limbs. Reinforcement
learning is different from supervised learning. Supervised learning is learning from examples
provided by a knowledgeable expert.
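A minimal sketch of this reward-driven setting (a hypothetical two-armed bandit): the learner
is only ever told the reward of the action it tried, never the correct action, yet its value
estimates converge toward the better choice.

import random

p_win = [0.3, 0.7]     # hidden payout probabilities, unknown to the learner
value = [0.0, 0.0]     # learner's estimated value of each action
count = [0, 0]
for t in range(1000):
    # epsilon-greedy: mostly exploit the best estimate, sometimes explore
    a = random.randrange(2) if random.random() < 0.1 else value.index(max(value))
    reward = 1 if random.random() < p_win[a] else 0
    count[a] += 1
    value[a] += (reward - value[a]) / count[a]   # running-average update
print(value)   # the estimate for action 1 approaches 0.7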
These notes will return to this perspective of learning as a search problem in order to
characterize learning methods by their search strategies and by the underlying structure of
the search spaces they explore. This viewpoint is also useful in formally analyzing the
relationship between the size of the hypothesis space to be searched, the number of training
examples available, and the confidence we can have that a hypothesis consistent with the
training data will correctly generalize to unseen examples. Some of the open questions in
machine learning include:
• What algorithms exist for learning general target functions from specific training examples? In
what settings will particular algorithms converge to the desired function, given sufficient
training data? Which algorithms perform best for which types of problems and representations?
• How much training data is sufficient? What general bounds can be found to relate the confidence
in learned hypotheses to the amount of training experience and the character of the learner's
hypothesis space?
• When and how can prior knowledge held by the learner guide the process of generalizing from
examples? Can prior knowledge be helpful even when it is only approximately correct?
• What is the best strategy for choosing a useful next training experience, and how does the choice
of this strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn? Can this
process itself be automated?
• How can the learner automatically alter its representation to improve its ability to represent
and learn the target function?
Decision Tree
Introduction
Decision Trees are a type of Supervised Machine Learning (that is, you explain what the
input is and what the corresponding output is in the training data) where the data is
continuously split according to a certain parameter. The tree can be explained by two
entities, namely decision nodes and leaves. The leaves are the decisions or the final
outcomes, and the decision nodes are where the data is split.
An example of a decision tree can be explained using a binary tree (the figure is not
reproduced here). Let’s say you want to predict whether a person is fit given information
like age, eating habits, and physical activity. The decision nodes here are questions like
‘What’s the age?’, ‘Does he exercise?’, and ‘Does he eat a lot of pizza?’, and the leaves are
outcomes like ‘fit’ or ‘unfit’. In this case this was a binary classification problem (a
yes/no type problem). There are two main types of Decision Trees: classification trees, where
the predicted outcome is a categorical variable, and regression trees, where the predicted
outcome is a continuous value.
What we have seen above is an example of a classification tree, where the outcome was a
variable like ‘fit’ or ‘unfit’; here the decision variable is categorical. A sketch of such a
tree in code is given below.
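A hand-written version of the fit/unfit tree described above (the questions, thresholds, and
outcomes are illustrative assumptions, not a learned tree):

def classify(age, exercises, eats_much_pizza):
    # decision nodes ask questions; leaves return the final outcome
    if age < 30:
        return "fit" if exercises else "unfit"
    else:
        return "unfit" if eats_much_pizza else "fit"

print(classify(25, True, False))   # -> "fit"
print(classify(45, False, True))   # -> "unfit"

A real decision-tree learner chooses the questions and split points automatically from the
training data; the hand-written version only shows the structure such a learner produces.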
The following contrasts summarize the differences between supervised and unsupervised learning:
• Supervised learning algorithms are trained using labeled data; unsupervised learning
algorithms are trained using unlabeled data.
• A supervised learning model takes direct feedback to check whether it is predicting the
correct output; an unsupervised learning model does not take any feedback.
• A supervised learning model predicts the output; an unsupervised learning model finds the
hidden patterns in data.
• In supervised learning, input data is provided to the model along with the output; in
unsupervised learning, only input data is provided to the model.
• The goal of supervised learning is to train the model so that it can predict the output
when it is given new data; the goal of unsupervised learning is to find the hidden patterns
and useful insights in an unknown dataset.
• Supervised learning needs supervision to train the model; unsupervised learning does not
need any supervision.
• Supervised learning can be used for those cases where we know the inputs as well as the
corresponding outputs; unsupervised learning can be used where we have only input data and no
corresponding output data.
• A supervised learning model produces an accurate result; an unsupervised learning model may
give less accurate results by comparison.
• Supervised learning is not close to true artificial intelligence, as we first train the
model on each example and only then can it predict the correct output; unsupervised learning
is closer to true artificial intelligence, as it learns in the way a child learns daily
routine things from experience.
• Supervised learning includes algorithms such as Linear Regression, Logistic Regression,
Support Vector Machine, Multi-class Classification, Decision Tree, and Bayesian Logic;
unsupervised learning includes algorithms such as Clustering, KNN, and the Apriori algorithm.
One criticism that is often made of neural networks—especially the MLP—is that it is not clear
exactly what it is doing: while we can go and have a look at the activations of the neurons and
the weights, they don’t tell us much.
In this topic (probabilistic classifiers) we are going to look at methods that are based on
statistics, and that are therefore more transparent, in that we can always extract and look at
the probabilities and see what they are, rather than having to worry about weights that have no
obvious meaning.
There is, however, something we can do. If we know how many classes there are in the
data, then we can try to estimate the parameters for that many Gaussians, all at once. If we don’t
know, then we can try different numbers and see which one works best. We will talk about this
issue more for a different method (the k-means algorithm) in Unit 2. It is perfectly possible to
use any other probability distribution instead of a Gaussian, but Gaussians are by far the most
common choice. Then the output for any particular datapoint that is input to the algorithm will
be the sum of the values expected by all of the M Gaussians:
f(x) = Σ_{m=1..M} α_m φ(x; μ_m, Σ_m)

where φ(x; μ_m, Σ_m) is a Gaussian function with mean μ_m and covariance matrix Σ_m, and the
α_m are weights with the constraint that Σ_{m=1..M} α_m = 1.
Figure 4 shows two examples, where the data (shown by the histograms) comes from two
different Gaussians, and the model is computed as a sum or mixture of the two Gaussians
together.
FIGURE 4: Histograms of training data from a mixture of two Gaussians and two fitted models,
shown as the line plot. The model shown on the left fits well, but the one on the right produces
two Gaussians right on top of each other that do not fit the data well.
The figure also gives you some idea of how to use the mixture model once it has been created.
The probability that input x_i belongs to class m can be written as (where a hat on a
variable (ˆ) means that we are estimating the value of that variable):

P̂(m | x_i) = α̂_m φ(x_i; μ̂_m, Σ̂_m) / Σ_{k=1..M} α̂_k φ(x_i; μ̂_k, Σ̂_k)
The problem is how to choose the weights αm. The common approach is to aim for the
maximum
likelihood solution (the likelihood is the conditional probability of the data given the model, and
the maximum likelihood solution varies the model to maximise this conditional probability). In
fact, it is common to compute the log likelihood and then to maximise that; it is guaranteed to be
negative, since probabilities are all less than 1, and the logarithm spreads out the values, making
the optimisation more effective. The algorithm that is used is an example of a very general one
known as the expectation-maximisation (or more compactly, EM) algorithm.
3.3 The Expectation-Maximisation (EM) Algorithm
The basic idea of the EM algorithm is that sometimes it is easier to add extra variables
that are not actually known (called hidden or latent variables) and then to maximise the function
over those variables. This might seem to be making a problem much more complicated than it
needs to be, but it turns out for many problems that it makes finding the solution significantly
easier.
In order to see how it works, we will consider the simplest interesting case of the
Gaussian mixture model: a combination of just two Gaussians. The assumption now is that each
datapoint is generated by first picking one of the two Gaussians and then sampling from that
Gaussian. If the probability of picking Gaussian one is p, then the entire model looks like
this (where N(μ, σ²) specifies a Gaussian distribution with mean μ and standard deviation σ):

P(x) = p N(x; μ1, σ1²) + (1 − p) N(x; μ2, σ2²)
Finding the maximum likelihood solution (actually the maximum log likelihood) to this
problem is then a case of computing the sum of the logarithm of Equation over all of the
training data, and differentiating it, which would be rather difficult. Fortunately, there is a way
around it. The key insight that we need is that if we knew which of the two Gaussian
components the datapoint came from, then the computation would be easy. The mean and
standard deviation for each component could be computed from the datapoints that belong to that
component, and there would not be a problem. Although we don’t know which component
each datapoint came from, we can pretend we do, by introducing a new variable f. If f = 0
then the data came from Gaussian one, if f = 1 then it came from Gaussian two.
This is the typical initial step of an EM algorithm: adding latent variables. Now we just
need to work out how to optimise over them. This is the time when the reason for the algorithm
being called expectation-maximisation becomes clear. We don’t know much about the variable f
(hardly surprising, since we invented it), but we can compute its expectation (that is, the value
that we ‘expect’ to see, which is the mean average) from the data:
E[f | x_i, D] = P(f = 1 | x_i, D) = (1 − p) N(x_i; μ2, σ2²) / ( p N(x_i; μ1, σ1²) + (1 − p) N(x_i; μ2, σ2²) )

where D denotes the data. Note that since we have set f = 1, this means that we are choosing
Gaussian two.
Computing the value of this expectation is known as the E-step. Then this estimate of the
expectation is maximised over the model parameters (the parameters of the two Gaussians and
the mixing parameter p), the M-step. This requires differentiating the expectation with respect to
each of the model parameters. These two steps are simply iterated until the algorithm converges.
Note that the estimate never gets any smaller, and it turns out that EM algorithms are guaranteed
to reach a local maximum. To see how this looks for the two-component Gaussian mixture, we’ll
take a closer look at the algorithm; a sketch in code is given below:
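The original algorithm listing is not reproduced here; the following is a minimal Python sketch
of the two-component EM updates for one-dimensional data, using the conventions above (p is the
probability of picking Gaussian one, f = 1 means Gaussian two; the initialisation and iteration
count are illustrative choices):

import math

def gaussian_pdf(x, mu, sigma):
    # density of N(mu, sigma^2) at x
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iterations=100):
    # crude initialisation: one Gaussian at each end of the data
    mu1, mu2 = min(data), max(data)
    s1 = s2 = (mu2 - mu1) / 4 or 1.0
    p = 0.5                              # probability of picking Gaussian one
    for _ in range(iterations):
        # E-step: expectation of the latent variable f for each point,
        # i.e. the probability that the point came from Gaussian two
        gamma = []
        for x in data:
            g1 = p * gaussian_pdf(x, mu1, s1)
            g2 = (1 - p) * gaussian_pdf(x, mu2, s2)
            gamma.append(g2 / (g1 + g2))
        # M-step: re-estimate the parameters from the expectations
        n2 = sum(gamma)
        n1 = len(data) - n2
        mu1 = sum((1 - g) * x for g, x in zip(gamma, data)) / n1
        mu2 = sum(g * x for g, x in zip(gamma, data)) / n2
        s1 = math.sqrt(sum((1 - g) * (x - mu1) ** 2 for g, x in zip(gamma, data)) / n1)
        s2 = math.sqrt(sum(g * (x - mu2) ** 2 for g, x in zip(gamma, data)) / n2)
        p = n1 / len(data)
    return mu1, s1, mu2, s2, p

# example: data drawn from two well-separated groups
print(em_two_gaussians([1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]))

Each pass computes the expectations (E-step) and then maximises over the parameters (M-step);
as the text notes, the likelihood never decreases, so the sketch settles at a local maximum.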
The trick with applying EM algorithms to problems is in identifying the correct latent
variables to include, and then simply working through the steps. They are very powerful methods
for a wide variety of statistical learning problems. We are now going to turn our attention to
something much simpler, which is how we can use information about nearby datapoints to
decide on classification output. For this we don’t use a model of the data at all, but directly use
the data that is available.