Artificial Intelligence - Unit 1 - 5

UNIT-3

Expert Systems
➢ An expert system is a computer program that is designed to
solve complex problems and to provide decision-making
ability like a human expert.
(Or)
➢ Expert systems are computer applications developed
to solve complex problems in a particular domain, at the
level of extraordinary human intelligence and expertise.
➢ It performs this by extracting knowledge from its
knowledge base using the reasoning and inference rules
according to the user queries.
➢ The expert system is a part of AI. The first ES was
developed around 1970 and was one of the first successful
applications of artificial intelligence.
Areas of Artificial Intelligence
Expert Systems
➢ An expert system solves complex problems as an expert would,
by drawing on the knowledge stored in its knowledge base.
The system helps in decision making for complex problems
using both facts and heuristics, like a human expert.
➢ It is called so because it contains the expert knowledge of a
specific domain and can solve any complex problem of that
particular domain. These systems are designed for a
specific domain, such as medicine, science, etc.
➢ The performance of an expert system is based on the
expert's knowledge stored in its knowledge base. The more
knowledge stored in the KB, the more that system improves
its performance.
➢ One of the common examples of an ES is a suggestion of
spelling errors while typing in the Google search box.
Characteristics of Expert Systems
➢ High performance: The expert system provides high
performance for solving any type of complex problem
of a specific domain with high efficiency and accuracy.
➢ Understandable: It responds in a way that is easily
understood by the user. It can take input in human
language and provides the output in the same way.
➢ Reliable: It is highly reliable in generating efficient
and accurate output.
➢ Highly responsive: ES provides the result for any
complex query within a very short period of time.
Capabilities of Expert Systems
The expert systems are capable of −
➢ Advising
➢ Instructing and assisting human in decision making
➢ Demonstrating
➢ Deriving a solution
➢ Diagnosing
➢ Explaining
➢ Interpreting input
➢ Predicting results
➢ Justifying the conclusion
➢ Suggesting alternative options to a problem
Incapabilities of Expert Systems
They are incapable of −
➢ Substituting human decision makers
➢ Possessing human capabilities
➢ Producing accurate output from an inadequate knowledge base
➢ Refining their own knowledge
Components of Expert Systems
The components of ES include −
➢ Knowledge Base
➢ Inference Engine
➢ User Interface
Components of Expert Systems
Knowledge Base:
➢ It contains domain-specific and high-quality knowledge.
➢ Knowledge is required to exhibit intelligence. The success
of any ES majorly depends upon the collection of highly
accurate and precise knowledge.

What is Knowledge?
➢ Data is a collection of facts. Information is data
organized and interpreted in the context of the task domain.
Data, information, and past experience combined together are
termed knowledge.
Components of Expert Systems
Components of Knowledge Base:
The knowledge base of an ES is a store of both factual and
heuristic knowledge.
➢ Factual Knowledge − Information widely accepted by the
knowledge engineers and scholars in the task domain.
➢ Heuristic Knowledge − Knowledge about practice, sound
judgement, the ability to evaluate, and guessing.
Components of Expert Systems
Inference Engine (Rules Engine)
➢ The inference engine is known as the brain of the expert
system as it is the main processing unit of the system. It
applies inference rules to the knowledge base to derive a
conclusion or deduce new information. It helps in deriving
an error-free solution of queries asked by the user.
➢ With the help of an inference engine, the system extracts
the knowledge from the knowledge base.

There are two types of inference engine:


➢ Deterministic Inference engine: The conclusions drawn
from this type of inference engine are assumed to be true. It
is based on facts and rules.
Components of Expert Systems
➢ Probabilistic Inference engine: This type of inference
engine carries uncertainty in its conclusions and is based on
probability.

The inference engine uses the following modes to derive
solutions:
➢ Forward Chaining: It starts from the known facts and
rules, and applies the inference rules to add their conclusions
to the known facts.
➢ Backward Chaining: It is a backward reasoning method
that starts from the goal and works backward through the
rules to find the supporting facts.
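The two chaining modes can be illustrated with a small sketch. The rules and fact names below are hypothetical examples, not from any real expert system; the function implements forward chaining exactly as described above: fire every rule whose premises are all known, add its conclusion, and repeat until nothing new can be derived.

```python
# A minimal forward-chaining sketch (illustrative, not a production engine).
# Each rule is a (premises, conclusion) pair; all names are hypothetical.

def forward_chain(facts, rules):
    """Repeatedly apply rules whose premises are all known facts,
    adding each conclusion to the fact set until nothing new fires."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

rules = [
    (["has_fever", "has_rash"], "suspect_measles"),
    (["suspect_measles"], "recommend_lab_test"),
]
print(sorted(forward_chain(["has_fever", "has_rash"], rules)))
# ['has_fever', 'has_rash', 'recommend_lab_test', 'suspect_measles']
```

Backward chaining would instead start from a goal such as `recommend_lab_test` and search backwards for rules whose conclusion matches it, recursively trying to establish their premises.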
Components of Expert Systems
User Interface:
➢ With the help of a user interface, the expert system interacts
with the user, takes queries as input in a readable format,
and passes them to the inference engine. After getting the
response from the inference engine, it displays the output to
the user.
➢ In other words, it is an interface that helps a non-expert
user to communicate with the expert system to find a
solution.
Popular Examples of Expert System
➢ DENDRAL: It was an artificial intelligence project built as a
chemical-analysis expert system. It was used in organic
chemistry to identify unknown organic molecules from their mass
spectra and a knowledge base of chemistry.
➢ MYCIN: It was one of the earliest backward chaining expert
systems that was designed to find the bacteria causing infections
like bacteraemia and meningitis. It was also used for the
recommendation of antibiotics and the diagnosis of blood clotting
diseases.
➢ PXDES: It is an expert system used to determine the type
and severity of lung cancer. It takes an image of the upper
body, which appears as a shadow; this shadow is used to identify
the type and degree of harm.
➢ CaDeT: The CaDet expert system is a diagnostic support system
that can detect cancer at early stages.
Architecture of Expert System
➢ Knowledge Base: It is a warehouse of domain knowledge: facts
and special heuristics or rules (productions). It holds the
knowledge needed for understanding, formulating, and solving
problems.
➢ Working Memory: It describes the currently running
problem and records intermediate output.
It records intermediate hypotheses and decisions: 1. Plan, 2. Agenda,
3. Solution
➢ Inference Engine: It is the core of the expert system: it
manages the overall structure of the system and applies
different methodologies for reasoning.
➢ Explanation System: It helps trace responsibility and justify
the behavior of the expert system by answering questions
such as Why, How, What, Where, When, and Who.
Architecture of Expert System
➢ User Interface: It allows users to enter their queries using
natural language, menus, or graphics.
➢ Knowledge Engineer: The main objective of the knowledge
engineer is to design a system for a specific problem domain
using an expert system shell.
➢ Users: Non-expert persons who seek direct advice from the
system.
➢ Expert system shell: A special software development
environment containing the basic components of an expert
system, such as a knowledge-base management system,
workplace, explanation facility, reasoning capacity, inference
engine, and user interface.
➢ The shell comes with pre-defined methods for designing
different applications by configuring those components.
Phases in building Expert System
The following points highlight the five main phases to develop
an expert system.
The phases are:
1. Identification
2. Conceptualization
3. Formalization (Designing)
4. Implementation
5. Testing (Validation, Verification and Maintenance).
Phases in building Expert System
➢ A knowledge engineer is an AI specialist, perhaps a
computer scientist or programmer, who is skilled in the ‘Art’
of developing expert systems.
➢ You don’t need a degree in “knowledge engineering” to call
yourself a knowledge engineer; in fact, nearly everyone who
has ever contributed to the technical side of the expert
system development process could be considered a
knowledge engineer.
➢ A domain expert is an individual who has significant
expertise in the domain of the expert system being
developed.
➢ It is not critical that the domain expert understand AI or
expert systems; that is one of the functions of the knowledge
engineer.
Phases in building Expert System

The knowledge engineer and the domain expert usually work very
closely together for long periods of time throughout the several
stages of the development process.
1.Identification Phase
➢ To begin, the knowledge engineer, who may be unfamiliar with
this particular domain, consults manuals and training guides to
gain some familiarity with the subject. Then the domain expert
describes several typical problem states.
➢ The knowledge engineer attempts to extract fundamental concepts
from the similar cases in order to develop a more general idea of
the purpose of the expert system.
➢ After the domain expert describes several cases, the knowledge
engineer develops a ‘first-pass’ problem description.
➢ Typically, the domain expert may feel that the description does
not entirely represent the problem.
➢ The domain expert then suggests changes to the description and
provides the knowledge engineer with additional examples to
illustrate further the problem’s fine points.
1.Identification Phase

Next, the knowledge engineer revises the description, and the domain
expert suggests further changes. This process is repeated until the
domain expert is satisfied that the knowledge engineer understands
the problems and until both are satisfied that the description
adequately portrays the problem which the expert system is expected
to solve.
2.Conceptualisation Phase
 In the conceptualisation stage, the knowledge engineer
frequently creates a diagram of the problem to depict
graphically the relationships between the objects and processes
in the problem domain.
 It is often helpful at this stage to divide the problem into a series
of sub-problems and to diagram both the relationships among
the pieces of each sub-problem and the relationships among the
various sub-problems.
 As in the identification stage, the conceptualisation stage
involves a circular procedure of iteration and reiteration
between the knowledge engineer and the domain expert. When
both agree that the key concepts-and the relationships among
them-have been adequately conceptualised, this stage is
complete.
2.Conceptualisation Phase

Not only is each stage in the expert system development process
circular, the relationships among the stages may be circular as well.
Since each stage of the development process adds a level of detail
to the previous stage, any stage may expose a weakness in a
previous stage.
3.Formalisation (Designing) Phase
➢ The formalisation process is often the most interactive stage of
expert system development, as well as the most time consuming.
➢ The knowledge engineer must develop a set of rules and ask the
domain expert if those rules adequately represent the expert’s
knowledge.
➢ The domain expert reviews the rules proposed by the knowledge
engineer and suggests changes, which are then incorporated into
the knowledge base by the knowledge engineer.
➢ As in the other development stages, this process also is iterative:
the rule review is repeated and the rules are refined continually
until the results are satisfactory. It is not unusual for the
formalisation process of a complex expert system to last for
several years.
4.Implementation Phase
➢ During the implementation stage the formalised concepts are
programmed into the computer which has been chosen for
system development, using the predetermined techniques and
tools to implement a ‘first-pass’ (prototype) of the expert system.
➢ If the prototype works at all, the knowledge engineer may be able
to determine if the techniques chosen to implement the expert
system were the appropriate ones.
➢ On the other hand, the knowledge engineer may discover that the
chosen techniques simply cannot be implemented. It may not be
possible, for example, to integrate the knowledge representation
techniques selected for different sub-problems.
➢ At that point, the concepts may have to be re-formalised, or it
even may be necessary to create new development tools to
implement the system efficiently.
4.Implementation Phase

Once the prototype system has been refined sufficiently to allow
it to be executed, the expert system is ready to be tested
thoroughly to ensure that it exercises its expertise correctly.
5.Testing (Validation, Verification
and Maintenance) Phase
➢ Testing provides an opportunity to identify the weaknesses
in the structure and implementation of the system and to
make the appropriate corrections.
➢ Depending on the types of problems encountered, the
testing procedure may indicate that the system was
implemented incorrectly, or perhaps that the rules were
implemented correctly but were poorly or incompletely
formulated.
➢ Results from the tests are used as ‘feedback’ to return to a
previous stage and adjust the performance of the system.
5.Testing (Validation, Verification
and Maintenance) Phase
➢ Once the system has proven to be capable of correctly
solving straight-forward problems, the domain expert
suggests complex problems which typically would require a
great deal of human expertise.
➢ These more demanding tests should uncover more serious
flaws and provide ample opportunity to ‘fine tune’ the
system even further.
➢ Ultimately, an expert system is judged to be entirely
successful only when it operates at the level of a human
expert.
➢ The testing process is not complete until it indicates that the
solutions suggested by the expert system are consistently as
valid as those provided by a human domain expert.
Applications of Expert System
➢ In designing and manufacturing domain :It can be broadly
used for designing and manufacturing physical devices such as
camera lenses and automobiles.
➢ In the knowledge domain: These systems are primarily used for
publishing relevant knowledge to users. Two popular expert
systems used in this domain are the advisor and the tax advisor.
➢ In the finance domain: In the finance industry, it is used to
detect any type of possible fraud or suspicious activity, and to
advise bankers on whether they should provide loans for a business.
➢ In the diagnosis and troubleshooting of devices: In medical
diagnosis, the ES system is used, and it was the first area where
these systems were used.
➢ Planning and Scheduling: The expert systems can also be used
for planning and scheduling some particular tasks for achieving
the goal of that task.

Uncertainty
➢ In AI, knowledge representation uses techniques such as first-
order logic and propositional logic with certainty, which means
we are sure about the predicates.
➢ With this knowledge representation we might write A→B,
meaning if A is true then B is true. But consider a situation
where we are not sure whether A is true or not: then we cannot
express this statement. This situation is called uncertainty.
➢ So to represent uncertain knowledge, where we are not sure
about the predicates, we need uncertain reasoning or
probabilistic reasoning.
Causes of uncertainty
Following are some leading causes of uncertainty to occur in
the real world.
➢ Information occurred from unreliable sources.
➢ Experimental Errors
➢ Equipment fault
➢ Temperature variation
➢ Climate change.
Probabilistic reasoning
➢ Probabilistic reasoning is a way of knowledge representation
where we apply the concept of probability to indicate the
uncertainty in knowledge.
➢ In probabilistic reasoning, we combine probability theory with
logic to handle the uncertainty.
➢ We use probability in probabilistic reasoning because it provides
a way to handle the uncertainty that is the result of someone's
laziness and ignorance.
➢ In the real world, there are lots of scenarios, where the certainty
of something is not confirmed, such as "It will rain today,"
"behavior of someone for some situations," "A match between
two teams or two players."
➢ These are probable sentences for which we can assume that it
will happen but not sure about it, so here we use probabilistic
reasoning.
Need of Probabilistic reasoning in AI
➢ When there are unpredictable outcomes.
➢ When the specifications or possibilities of predicates become too
large to handle.
➢ When an unknown error occurs during an experiment.
➢ In probabilistic reasoning, there are two ways to solve problems
with uncertain knowledge:
Bayes' rule
Bayesian Statistics
➢ As probabilistic reasoning uses probability and related terms,
so before understanding probabilistic reasoning, let's
understand some common terms:

➢ Probability: Probability can be defined as the chance that an
uncertain event will occur. It is the numerical measure of the
likelihood that an event will occur. The value of a probability
always lies between 0 and 1:
0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
P(A) = 0 indicates total uncertainty in event A (it will not occur).
P(A) = 1 indicates total certainty in event A.
We can find the probability of an uncertain event
by using the formula:

P(A) = Number of favourable outcomes / Total number of outcomes
Examples:
➢In a drawer of ten socks where 8 of them are yellow, there is a
20% chance of choosing a sock that is not yellow.

➢There are 9 red candies in a bag and 1 blue candy in the same
bag. The chance of picking a blue candy is 10%.
P(¬A) = probability of event A not happening.
P(¬A) + P(A) = 1.
➢ Event: Each possible outcome of a variable is called an event.
➢ Sample space: The collection of all possible events is called
sample space.
➢ Random variables: Random variables are used to represent
the events and objects in the real world.
➢ Prior probability: The prior probability of an event is
probability computed before observing new information.
➢ Posterior Probability: The probability that is calculated after
all evidence or information has been taken into account. It is a
combination of the prior probability and the new information.
Conditional Probability
➢Conditional probability is the probability of an event
occurring given that another event has already happened.
➢Suppose we want to calculate the probability of event A when
event B has already occurred, "the probability of A under
the conditions of B". It can be written as:

P(A|B) = P(A⋀B) / P(B)

where P(A⋀B) = joint probability of A and B, and
P(B) = marginal probability of B.
Conditional Probability
If the probability of A is given and we need to find the
probability of B, then it is given as:

P(B|A) = P(A⋀B) / P(A)
Conditional Probability
This can be explained using a Venn diagram: once B has
occurred, the sample space is reduced to the set B, and we can
calculate the probability of event A given that B has occurred
by dividing P(A⋀B) by P(B).
Example:
In a class, 70% of the students like C Language and 40% of the
students like both C and Java. What percentage of the students
who like C Language also like Java?
Solution:
Let A be the event that a student likes Java,
and B the event that a student likes C Language.

P(A|B) = P(A⋀B) / P(B) = 0.4 / 0.7 = 0.57

Hence, 57% of the students who like C also like Java.
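The example above is a single division; as a quick check in code:

```python
# Worked check of the example: P(Java | C) = P(C and Java) / P(C).
p_c = 0.70           # P(student likes C Language)
p_c_and_java = 0.40  # P(student likes both C and Java)

p_java_given_c = p_c_and_java / p_c
print(round(p_java_given_c, 2))  # 0.57
```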
Prior Probability
Prior Probability- Degree of belief in an event, in the
absence of any other information
Example:
➢ P(rain tomorrow) = 0.7
➢ P(no rain tomorrow) = 0.3
Conditional Probability
What is the probability of an event , given knowledge of
another event.
Example:
➢ P(raining | sunny)
➢ P(raining | cloudy)
➢ P(raining | cloudy, cold)
Conditional Probability…
In some cases, given knowledge of one or more random
variables, we can improve our prior belief about another
random variable.
For Example:
➢ P(slept in stadium) = 0.5
➢ P(slept in stadium | liked match) = 0.33
➢ P(didn't sleep in stadium | liked match) = 0.67
Bayes Theorem
➢ Bayes' theorem is also known as Bayes' rule, Bayes' law,
or Bayesian reasoning, which determines the probability
of an event with uncertain knowledge.
➢ In probability theory, it relates the conditional probability
and marginal probabilities of two random events.
➢ Bayes' theorem was named after the British
mathematician Thomas Bayes.
➢ The Bayesian inference is an application of Bayes'
theorem, which is fundamental to Bayesian statistics.
➢ It is a way to calculate the value of P(B|A) with the
knowledge of P(A|B).
Bayes Theorem …
➢ Bayes' theorem allows updating the probability prediction
of an event by observing new information of the real
world.
➢ Example: If cancer corresponds to one's age then by using
Bayes' theorem, we can determine the probability of cancer
more accurately with the help of age.
➢ Bayes' theorem can be derived using product rule and
conditional probability of event A with known event B:
As from product rule we can write:
P(A ∧ B) = P(A|B)P(B) and
Similarly, the probability of event B with known event A:
P(A ∧ B) = P(B|A)P(A)
Bayes Theorem …
Equating the right-hand sides of both equations, we get:

P(A|B) = P(B|A) P(A) / P(B)    ...(a)

➢The above equation (a) is called Bayes' rule or Bayes'
theorem. This equation is the basis of most modern AI systems
for probabilistic inference.
➢It shows the simple relationship between joint and conditional
probabilities. Here,
➢P(A|B) is known as the posterior, which we need to calculate. It
is read as the probability of hypothesis A given that we have
observed evidence B.
➢P(B|A) is called the likelihood: assuming the hypothesis is
true, we calculate the probability of the evidence.
Bayes Theorem …
Multiple Prior Probabilities (A1, A2, A3 partition the sample space):

P(B) = P(A1)·P(B|A1) + P(A2)·P(B|A2) + P(A3)·P(B|A3)
Bayes Theorem …
➢P(A) is called the prior probability, probability of hypothesis
before considering the evidence
➢P(B) is called marginal probability, pure probability of an
evidence.
In equation (a), in general, we can write
P(B) = Σi P(Ai)·P(B|Ai), hence Bayes' rule can be written as:

P(Ai|B) = P(B|Ai)·P(Ai) / Σk P(Ak)·P(B|Ak)

where A1, A2, A3, ..., An is a set of mutually exclusive
and exhaustive events.
Applying Bayes' rule
➢ Bayes' rule allows us to compute the single term P(B|A)
in terms of P(A|B), P(B), and P(A).
➢ This is very useful in cases where we have a good
probability of these three terms and want to determine the
fourth one.
➢Suppose we want to perceive the effect of some unknown
cause and want to compute that cause; then Bayes' rule
becomes:

P(cause | effect) = P(effect | cause)·P(cause) / P(effect)
Example 1:
➢ Suppose a patient exhibits symptoms that make her physician
concerned that she may have a particular disease. The disease is
relatively rare in this population, with a prevalence of 0.2%
(meaning it affects 2 out of every 1,000 persons). The physician
recommends a screening test that costs Rs.10000 and requires a
blood sample. Before agreeing to the screening test, the patient
wants to know what will be learned from the test, specifically she
wants to know the probability of disease, given a positive test
result, i.e., P(Disease | Screen Positive).
➢ The physician reports that the screening test is widely used and
has a reported sensitivity of 85%. In addition, the test comes
back positive 8% of the time and negative 92% of the time.
Example 1:
➢ The information that is available is as follows:
➢ P(Disease)=0.002, i.e., prevalence = 0.002
➢ P(Screen Positive | Disease)=0.85, i.e., the probability of
screening positive, given the presence of disease is 85% (the
sensitivity of the test), and
➢ P(Screen Positive)=0.08, i.e., the probability of screening
positive overall is 8% or 0.08. We can now substitute the values
into the above equation to compute the desired probability,
➢ We know that P(Disease)=0.002, P(Screen Positive |
Disease)=0.85 and P(Screen Positive)=0.08. We can now
substitute the values into the above equation to compute the
desired probability,
➢ P(Disease | Screen Positive) = (0.85)(0.002)/(0.08) = 0.021
➢ The patient undergoes the test and it comes back positive: there is a 2.1%
chance that she has the disease.
➢ Also, note, however, that without the test, there is a 0.2% chance that she has
the disease (the prevalence in the population).
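The screening-test calculation can be written as a few lines of code, using the three quantities given above:

```python
# Bayes' rule applied to the screening-test example:
# P(Disease | Screen Positive) = P(Pos | Disease) * P(Disease) / P(Pos)
p_disease = 0.002           # prevalence
p_pos_given_disease = 0.85  # sensitivity of the test
p_pos = 0.08                # overall positive rate

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.021
```

Even after a positive test, the posterior (2.1%) is only about ten times the prior (0.2%), because the disease is rare and the test often returns positives overall.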
Example 2:
➢ In a recent newspaper article, it was reported that light
trucks, which include SUV’s, pick-up trucks and minivans,
accounted for 40% of all personal vehicles on the road in
2018. Assume the rest are cars. Of every 100,000 car
accidents, 20 involve a fatality; of every 100,000 light truck
accidents, 25 involve a fatality. If a fatal accident is chosen
at random, what is the probability the accident involved a
light truck?
Example 2:
Events
C- Cars
T –Light truck
F –Fatal Accident
N- Not a Fatal Accident
Given, P(F|C) = 20/100,000 and P(F|T) = 25/100,000
P(T) = 0.4
In addition we know C and T are complementary events
P(C)=1-P(T)=0.6
Our goal is to compute the conditional probability of a Light truck
accident given that it is fatal P(T|F).
Example 2:
Consider P(T|F)
Conditional probability of a Light truck accident given that it is
fatal
How do we calculate?
Using the conditional probability formula and the law of total
probability:

P(T|F) = P(T⋀F) / P(F)
       = P(F|T)·P(T) / [P(F|T)·P(T) + P(F|C)·P(C)]
       = (0.00025)(0.4) / [(0.00025)(0.4) + (0.0002)(0.6)]
       = 0.4545
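The same calculation in code, combining Bayes' rule with the law of total probability:

```python
# Light-truck example: P(T | F) via Bayes' rule and total probability.
p_t = 0.4                  # P(light truck)
p_c = 1 - p_t              # P(car); C and T are complementary
p_f_given_t = 25 / 100_000  # fatality rate per light-truck accident
p_f_given_c = 20 / 100_000  # fatality rate per car accident

# Law of total probability: P(F) = P(F|T)P(T) + P(F|C)P(C)
p_f = p_f_given_t * p_t + p_f_given_c * p_c
p_t_given_f = p_f_given_t * p_t / p_f
print(round(p_t_given_f, 4))  # 0.4545
```

Although light trucks are only 40% of vehicles, their higher per-accident fatality rate pushes their share of fatal accidents up to about 45%.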
Bayesian Networks
➢ A Bayesian belief network is a key computer technology for
dealing with probabilistic events and for solving problems
that involve uncertainty. We can define a Bayesian network
as:
➢ "A Bayesian network is a probabilistic graphical model
which represents a set of variables and their conditional
dependencies using a directed acyclic graph."
➢ It is also called a Bayes network, belief network, decision
network, or Bayesian model.
Bayesian Networks
➢ Bayesian networks are probabilistic, because these networks
are built from a probability distribution, and also use
probability theory for prediction and anomaly detection.
➢ Real world applications are probabilistic in nature, and to
represent the relationship between multiple events, we need a
Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics,
automated insight, reasoning, time series prediction,
and decision making under uncertainty.
➢ A Bayesian network can be used for building models from data
and experts' opinions, and it consists of two parts:
Directed Acyclic Graph
Table of conditional probabilities.
A Bayesian network graph (Directed Acyclic Graph) is made up
of nodes and Arcs (directed links), where:
➢ Each node corresponds to the random
variables, and a variable can
be continuous or discrete.
➢ Arc or directed arrows represent the causal
relationship or conditional probabilities between
random variables.
➢ These directed links or arrows connect the
pair of nodes in the graph.
➢ These links represent that one node directly
influences the other node; if there is no
directed link, the nodes are independent
of each other.
➢In the above diagram X1, X2,X3 and X4 are random variables represented
by the nodes of the network graph.
➢If we are considering node X3, which is connected with node X1 by a
directed arrow, then node X1 is called the parent of Node X3.
➢Node X4 is independent of node X1.
Conditional Probability Tables- CPTs
➢ The conditional probability tables in the network give the
probabilities for the value of the random variables
depending on the combination of values for the parent
nodes.
➢ Each row must sum to 1.
➢ All variables are Boolean; therefore, if the probability of
a true value is p, the probability of the false value must be 1 − p.
➢ A table for a Boolean variable with k parents contains 2^k
independently specifiable probabilities.
➢ A variable with no parents has only one row, representing
the prior probabilities of each possible value of the
variable.
Joint Probability Distribution
➢Bayesian network is based on Joint probability distribution and
conditional probability. So let's first understand the joint
probability distribution:

➢If we have variables A, B, C, D, then the probabilities of the
different combinations of their values are known as the joint
probability distribution.

➢Consider a directed graph G with four vertices A, B, C and D
forming the chain A → B → C → D. If P(xA, xB, xC, xD)
factorizes with respect to G, then we must have

P(xA, xB, xC, xD) = P(xA) P(xB|xA) P(xC|xB) P(xD|xC)
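The chain factorization can be checked numerically. The probability values below are made up purely for illustration; the point is that the product of the four local factors defines a valid joint distribution, so all 16 joint probabilities must sum to 1.

```python
# Chain factorization A -> B -> C -> D with hypothetical (made-up)
# probabilities for Boolean variables.
from itertools import product

p_a = {True: 0.3, False: 0.7}          # P(A)
p_b_given_a = {True: 0.8, False: 0.1}  # P(B=True | A)
p_c_given_b = {True: 0.5, False: 0.2}  # P(C=True | B)
p_d_given_c = {True: 0.9, False: 0.4}  # P(D=True | C)

def joint(a, b, c, d):
    """P(a, b, c, d) = P(a) P(b|a) P(c|b) P(d|c)."""
    pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
    pc = p_c_given_b[b] if c else 1 - p_c_given_b[b]
    pd = p_d_given_c[c] if d else 1 - p_d_given_c[c]
    return p_a[a] * pb * pc * pd

total = sum(joint(a, b, c, d)
            for a, b, c, d in product([True, False], repeat=4))
print(round(total, 6))  # 1.0
```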


Bayesian Network- Burglar Alarm
➢ You have installed a new burglar-alarm at home.
➢ It is fairly reliable at detecting a burglary, but also responds
on occasion to minor earthquakes.
➢ You have also have two neighbors, David and Sophia, who
have promised to call you at work when they hear the
alarm.
➢ David always calls when he hears the alarm, but sometimes
confuses the telephone ringing with the alarm and calls
then, too.
➢ Sophia, on the other hand, likes loud music and sometimes
misses the alarm altogether.
➢ Given the evidence of who has or has not called, we would
like to estimate the probability of a burglary.
Problem:
Calculate the probability that the alarm has sounded, but
neither a burglary nor an earthquake has occurred,
and both David and Sophia called Harry.
Solution:
➢ The Bayesian network for the above problem is given
below. The network structure is showing that burglary and
earthquake is the parent node of the alarm and directly
affecting the probability of alarm's going off, but David and
Sophia's calls depend on alarm probability.
➢ The network represents that the neighbors do not
directly perceive the burglary, do not notice the
minor earthquake, and do not confer with each other before calling.
Problem:
List of all events occurring in this network:
➢ Burglary (B)
➢ Earthquake(E)
➢ Alarm(A)
➢ David Calls(D)
➢ Sophia calls(S)
We can write the events of the problem statement in the form of probability: P[D, S,
A, B, E]. We can rewrite this probability using the joint probability
distribution:
P[D, S, A, B, E] = P[D|A] · P[S|A] · P[A|B, E] · P[B] · P[E]
➢ From the formula of joint distribution, we can write the problem statement in
the form of probability distribution:
P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).
= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by
using Joint distribution.
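The query above is a product of five numbers, using the CPT values quoted in the solution (these come from the network's conditional probability tables, which the slide references):

```python
# P(S, D, A, not B, not E) = P(S|A) P(D|A) P(A|~B,~E) P(~B) P(~E)
p_s_given_a = 0.75       # P(Sophia calls | alarm)
p_d_given_a = 0.91       # P(David calls | alarm)
p_a_given_nb_ne = 0.001  # P(alarm | no burglary, no earthquake)
p_nb = 0.998             # P(no burglary)
p_ne = 0.999             # P(no earthquake)

p = p_s_given_a * p_d_given_a * p_a_given_nb_ne * p_nb * p_ne
print(round(p, 8))  # 0.00068045
```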
Inference in Bayesian Networks
Purpose:
➢ The purpose of a probabilistic inference system is to compute
the posterior probability distribution for a set of query
variables, given some observed event.
➢ That is, some assignment of values to a set of evidence
variables.
Inference in Bayesian Networks
Notations:
➢ X - Denotes the query variable.
➢ E - Set of Evidence variables
➢ e - Particular observed event
➢ Y - Non-evidence, non-query variables Y1, ..., Yn (hidden
variables)
➢ The complete set of variables X = {X} U E U Y
➢ A typical query asks for the posterior probability
distribution P(X|e)
➢ In the burglary network, we might observe the event in
which
JohnCalls = true and MaryCalls = true
(JohnCalls and MaryCalls play the roles of David's and Sophia's
calls in the earlier example.)
➢ We could then ask for, say, the probability that a burglary
has occurred:
P(Burglary | JohnCalls = true, MaryCalls = true) = ?
P(B | J=true, M=true)
Types of Inferences:
➢ Inference by Enumeration
(inference by listing or recording all variables)
➢ Inference by Variable Elimination
(inference by variable removal)
Inference by Enumeration
➢ Any conditional probability can be computed by summing
terms from the full joint distribution.
➢ More specifically, a query P(X | e) can be answered using
equation:
P(X | e) = α P(X, e) = α Σy P(X, e, y)
where α is a normalizing constant
X – query variable
e – observed event (the evidence)
y – values of the hidden variables
Inference by Enumeration…
➢ Consider P (Burglary | JohnCalls = true, MaryCalls = true)
➢ Burglary - query variable (X)
➢ JohnCalls - Evidence variable 1 (E1)
➢ MaryCalls - Evidence variable 2 (E2)
➢ The hidden variables of this query are Earthquake and Alarm.
Inference by Enumeration…

P(B | j, m) = P(B, j, m) / P(j, m)
P(B | j, m) = α P(B, j, m)
P(B | j, m) = α ΣE ΣA P(B, E, A, j, m)
P(B | j, m) = α ΣE ΣA P(B) P(E) P(A|B,E) P(j|A) P(m|A)
P(B | j, m) = α P(B) ΣE P(E) ΣA P(A|B,E) P(j|A) P(m|A)
Inference by Enumeration…
Let us consider for simplicity purpose
burglary = true
P(b | j, m) = α P(b)Σe P(e) ΣaP(a|e,b)P(j|a)P(m|a)

This expression can be evaluated by looping through the variables in
order, multiplying CPT entries as we go. For each summation, we also
need to loop over the variable's possible values. Using the numbers from
the figure above, we obtain P(b | j, m) = α × 0.00059224. The corresponding
computation for ¬b yields α × 0.0014919; hence,
P(B | j, m) = α ⟨0.00059224, 0.0014919⟩ ≈ ⟨0.284, 0.716⟩.
That is, the chance of a burglary, given calls from both neighbors, is
about 28%.
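The summation above can be sketched directly in code as two nested loops over the hidden variables E and A; the CPT values are the standard ones for this network, quoted in the figure referenced above:

```python
# Inference by enumeration for P(B | j, m) on the burglary network.
P = {
    'B': 0.001, 'E': 0.002,
    # P(Alarm | B, E), keyed by (b, e)
    'A': {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    'J': {True: 0.90, False: 0.05},   # P(JohnCalls | Alarm)
    'M': {True: 0.70, False: 0.01},   # P(MaryCalls | Alarm)
}

def p(prob, value):
    """Probability that a boolean variable takes `value`."""
    return prob if value else 1 - prob

def joint_b(b):
    """P(B=b, j, m): sum over the hidden variables E and A."""
    total = 0.0
    for e in (True, False):
        for a in (True, False):
            total += (p(P['B'], b) * p(P['E'], e)
                      * p(P['A'][(b, e)], a)
                      * p(P['J'][a], True)     # evidence: JohnCalls = true
                      * p(P['M'][a], True))    # evidence: MaryCalls = true
    return total

alpha = 1 / (joint_b(True) + joint_b(False))
print(round(alpha * joint_b(True), 3))   # 0.284
```

Looping over every hidden variable for every entry is what makes plain enumeration expensive; variable elimination (next) removes the repeated work.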
Inference by Variable Elimination
➢ The enumeration algorithm can be improved substantially by
eliminating repeated calculations.
➢ The idea is simple: do the calculation once and save the results
for later use. This is a form of dynamic programming.
➢ Variable elimination works by evaluating expressions such as the one
derived above for inference by enumeration:
P(b | j, m) = α P(b) Σe P(e) Σa P(a|e,b) P(j|a) P(m|a)
Inference by Variable Elimination…
➢ Intermediate results are stored, and summations over each
variable are done only for those portions of the expression
that depend on the variable.
➢ Let us illustrate this process for the burglary network.
➢ We evaluate the below expression in such a way that we
annotated each part of the expression with the name of the
associated variable; these parts are called factors
P(B | j, m) = α P(B)ΣE P(E) ΣA P(A|E,B)P(j|A)P(m|A)
Inference by Variable Elimination…
α P(B) ΣE P(E) ΣA P(A|E,B) P(j|A) P(m|A)

P(j|A): a: 0.90, ¬a: 0.05        (P(¬j|A): a: 0.10, ¬a: 0.95)
P(m|A): a: 0.70, ¬a: 0.01        (P(¬m|A): a: 0.30, ¬a: 0.99)

Pointwise product P(j|A) P(m|A):
a:  0.90 × 0.70
¬a: 0.05 × 0.01
Inference by Variable Elimination…
α P(B) ΣE P(E) ΣA P(A|E,B) f1(A)

The product of the two evidence factors is collapsed into f1:
f1(a)  = 0.90 × 0.70 = 0.63
f1(¬a) = 0.05 × 0.01 = 0.0005
Inference by Variable Elimination…
α P(B) ΣE P(E) ΣA P(A|E,B) f1(A)

P(A|E,B) (value in parentheses is P(¬A|E,B)):
e, b:   0.95 (0.05)
e, ¬b:  0.29 (0.71)
¬e, b:  0.94 (0.06)
¬e, ¬b: 0.001 (0.999)

f1(A): a: 0.63, ¬a: 0.0005

ΣA P(A|E,B) f1(A):
e, b:   0.95 × 0.63 + 0.05 × 0.0005
e, ¬b:  0.29 × 0.63 + 0.71 × 0.0005
¬e, b:  0.94 × 0.63 + 0.06 × 0.0005
¬e, ¬b: 0.001 × 0.63 + 0.999 × 0.0005
Inference by Variable Elimination…
α P(B) ΣE P(E) f2(E,B)

Summing out A yields the factor f2 (values rounded to two decimals):
f2(e, b)   = 0.60
f2(e, ¬b)  = 0.18
f2(¬e, b)  = 0.59
f2(¬e, ¬b) = 0.001
Inference by Variable Elimination…
α P(B) ΣE P(E) f2(E,B)

Priors: P(E=T) = 0.002, P(E=F) = 0.998; P(B=T) = 0.001, P(B=F) = 0.999

P(B) ΣE P(E) f2(E,B):
b:  0.001 × (0.002 × 0.60 + 0.998 × 0.59)
¬b: 0.999 × (0.002 × 0.18 + 0.998 × 0.001)
Inference by Variable Elimination…
α f3(B)

Summing out E and multiplying by the prior P(B) gives f3:
f3(b)  ≈ 0.0006
f3(¬b) ≈ 0.0013
Inference by Variable Elimination…
α f3(B) → P(B | j, m)

Normalizing: N = 0.0006 + 0.0013 = 0.0019

P(b | j, m)  = 0.0006 / 0.0019 ≈ 0.32
P(¬b | j, m) = 0.0013 / 0.0019 ≈ 0.68

That is, the chance of a burglary, given calls from both
neighbors, comes out here at about 32%. (The exact answer is the
28% obtained by enumeration; the difference comes entirely from
rounding the intermediate factors to two decimal places.)
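The hand computation above can be reproduced with the factors cached explicitly; each fi below corresponds to a factor from the slides, and the CPT values are the same standard ones used in the enumeration section:

```python
# Variable elimination for P(B | j, m): build f1, f2, f3 once each.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(j | A)
P_m = {True: 0.70, False: 0.01}                      # P(m | A)

def prior(p, v):
    return p if v else 1 - p

# f1(A) = P(j|A) * P(m|A): pointwise product of the evidence factors
f1 = {a: P_j[a] * P_m[a] for a in (True, False)}

# f2(B, E) = sum_A P(A|B,E) * f1(A): sum out the hidden variable A
f2 = {(b, e): P_A[(b, e)] * f1[True] + (1 - P_A[(b, e)]) * f1[False]
      for b in (True, False) for e in (True, False)}

# f3(B) = P(B) * sum_E P(E) * f2(B, E): sum out E, multiply by the prior
f3 = {b: prior(P_B, b) * sum(prior(P_E, e) * f2[(b, e)]
                             for e in (True, False))
      for b in (True, False)}

alpha = 1 / (f3[True] + f3[False])
print(round(alpha * f3[True], 3))   # 0.284
```

Each factor is computed once and reused instead of being recomputed inside nested sums; because the factors are kept at full precision, the result matches enumeration's 0.284 rather than the rounded 0.32.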
WHAT IS FUZZY LOGIC?
• Definition of fuzzy
• Fuzzy – “not clear, distinct, or precise; blurred”
• Definition of fuzzy logic
• A form of knowledge representation suitable for notions that
cannot be defined precisely, but which depend upon their
contexts.
• So in real life we often cannot decide whether a given problem
or statement is simply true or false.
• In such cases, fuzzy logic provides many values between true and
false, and gives the flexibility to find the best solution to the
problem.
WHAT IS FUZZY LOGIC?
The inventor of fuzzy logic, Lotfi Zadeh, observed that unlike
computers, the human decision making includes a range of
possibilities between YES and NO, such as −
The fuzzy logic works on the levels of possibilities of input to
achieve the definite output.

CERTAINLY YES
POSSIBLY YES
CANNOT SAY
POSSIBLY NO
CERTAINLY NO
Implementation
 It can be implemented in systems with various sizes and
capabilities ranging from small micro-controllers to large,
networked, workstation-based control systems.
 It can be implemented in hardware, software, or a combination
of both.

Why Fuzzy Logic?


Fuzzy logic is useful for commercial and practical purposes.
 It can control machines and consumer products.
 It may not give accurate reasoning, but acceptable reasoning.
 Fuzzy logic helps to deal with the uncertainty in engineering.
TRADITIONAL REPRESENTATION OF
LOGIC

Slow Fast
Speed = 0 Speed = 1
bool speed;
// read the speed from a sensor
if (speed == false) {
// speed is slow
}
else {
// speed is fast
}
FUZZY LOGIC REPRESENTATION
• Every problem must be represented in terms of fuzzy sets.
• What are fuzzy sets?
Slowest [ 0.00 – 0.25 ]
Slow    [ 0.25 – 0.50 ]
Fast    [ 0.50 – 0.75 ]
Fastest [ 0.75 – 1.00 ]
FUZZY LOGIC REPRESENTATION
CONT.

Slowest Slow Fast Fastest

float speed;
// read the speed from a sensor
if ((speed >= 0.0) && (speed < 0.25)) {
// speed is slowest
}
else if ((speed >= 0.25) && (speed < 0.5)) {
// speed is slow
}
else if ((speed >= 0.5) && (speed < 0.75)) {
// speed is fast
}
else { // speed >= 0.75 && speed <= 1.0
// speed is fastest
}
Fuzzy Set
 Classical (crisp) set theory is a special case of fuzzy set
theory. Fuzzy logic is based on fuzzy set theory, a
generalization of classical set theory introduced by Zadeh in
1965.
 A fuzzy set is a collection of elements whose membership values
lie between 0 and 1.
 Fuzzy sets are denoted or represented by the tilde (~) character.
Fuzzy set theory was introduced in 1965 by Lotfi A. Zadeh and
Dieter Klaua.
 In a fuzzy set, partial membership also exists. The theory was
released as an extension of classical set theory.
Fuzzy Set
This theory is denoted mathematically as A fuzzy set (Ã)
is a pair of U and M, where U is the Universe of
discourse and M is the membership function which takes
on values in the interval [ 0, 1 ]. The universe of
discourse (U) is also denoted by Ω or X.
Operations on Fuzzy Set
Given à and B are the two fuzzy sets, and X be the universe of
discourse with the following respective member functions:

The operations of Fuzzy set are as follows:


1. Union Operation: The union operation of a fuzzy set is
defined by:
μA∪B(x) = max (μA(x), μB(x))
Example:
Let's suppose A is a set which contains following elements:
A = {( X1, 0.6 ), (X2, 0.2), (X3, 1), (X4, 0.4)}
And, B is a set which contains following elements:
B = {( X1, 0.1), (X2, 0.8), (X3, 0), (X4, 0.9)}
then,
AUB = {( X1, 0.6), (X2, 0.8), (X3, 1), (X4, 0.9)}
Operations on Fuzzy Set
2. Intersection Operation: The intersection operation of fuzzy
set is defined by:
μA∩B(x) = min (μA(x), μB(x))

Example:
Let's suppose A is a set which contains following elements:
A = {( X1, 0.3 ), (X2, 0.7), (X3, 0.5), (X4, 0.1)}

and, B is a set which contains following elements:

B = {( X1, 0.8), (X2, 0.2), (X3, 0.4), (X4, 0.9)}

then,

A∩B = {( X1, 0.3), (X2, 0.2), (X3, 0.4), (X4, 0.1)}


Operations on Fuzzy Set
3. Complement Operation: The complement operation of
fuzzy set is defined by:

μĀ(x) = 1-μA(x),

Example:
Let's suppose A is a set which contains following elements:

A = {( X1, 0.3 ), (X2, 0.8), (X3, 0.5), (X4, 0.1)}


then,

Ā= {( X1, 0.7 ), (X2, 0.2), (X3, 0.5), (X4, 0.9)}
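The three operations can be sketched as short functions over membership dictionaries; the sets A and B below are the ones from the union example, so the intersection and complement results differ from the slides' separate examples:

```python
# Union, intersection, and complement of fuzzy sets represented as
# {element: membership} dictionaries.
def f_union(A, B):
    return {x: max(A[x], B[x]) for x in A}       # μA∪B = max(μA, μB)

def f_intersection(A, B):
    return {x: min(A[x], B[x]) for x in A}       # μA∩B = min(μA, μB)

def f_complement(A):
    return {x: round(1 - mu, 10) for x, mu in A.items()}   # μĀ = 1 − μA

A = {'X1': 0.6, 'X2': 0.2, 'X3': 1.0, 'X4': 0.4}
B = {'X1': 0.1, 'X2': 0.8, 'X3': 0.0, 'X4': 0.9}

print(f_union(A, B))          # {'X1': 0.6, 'X2': 0.8, 'X3': 1.0, 'X4': 0.9}
print(f_intersection(A, B))   # {'X1': 0.1, 'X2': 0.2, 'X3': 0.0, 'X4': 0.4}
print(f_complement(A))        # {'X1': 0.4, 'X2': 0.8, 'X3': 0.0, 'X4': 0.6}
```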


Fuzzy Logic Systems Architecture
Fuzzy Logic Systems Architecture
Fuzzy Logic architecture has four main parts as shown in the
diagram:
Fuzzification:
 The fuzzification step converts inputs: it maps crisp numbers
into fuzzy sets. Crisp inputs are measured by sensors and passed
into the control system for further processing. This component
divides the input signal into the following five states in any
Fuzzy Logic system:
 Large Positive (LP)
 Medium Positive (MP)
 Small (S)
 Medium Negative (MN)
 Large negative (LN)
Fuzzy Logic Systems Architecture
Rule Base:
 It contains all the rules and the if-then conditions offered by
the experts to control the decision-making system. Recent updates
in fuzzy theory provide various methods for the design and tuning
of fuzzy controllers. These updates significantly reduce the
number of fuzzy rules.

Inference Engine:
 It determines the degree of match between the fuzzy input and
the rules. Based on the degree of match, it determines which rules
need to be fired for the given input. The fired rules are then
combined to develop the control actions.
Fuzzy Logic Systems Architecture
Defuzzification:
 Finally, the defuzzification process converts the fuzzy sets
into a crisp value. Many techniques are available, so you need to
select the one best suited for use with your expert system.
Fuzzy logic algorithm
1) Initialization process:
▪ Define the linguistic variables.
▪ Construct the fuzzy logic membership functions that
define the meaning or values of the input and output
terms used in the rules.
▪ Construct the rule base (Break down the control problem
into a series of IF X AND Y, THEN Z rules based on the
fuzzy logic rules).
2)Convert crisp input data to fuzzy values using the
membership functions (fuzzification).
3) Evaluate the rules in the rule base (inference).
4) Combine the results of each rule (inference).
5)Convert the output data to non-fuzzy values
(defuzzification).
Example: Air conditioner system
controlled by a FLS
Example: Air conditioner system
controlled by a FLS
 The system adjusts the temperature of the room according
to the current temperature of the room and the target value.
The fuzzy engine periodically compares the room
temperature and the target temperature, and produces a
command to heat or cool the room.

Define linguistic variables and terms


 Linguistic variables are input and output variables in the
form of simple words or sentences. For room temperature,
cold, warm, hot, etc., are linguistic terms.
 Temperature (t) = {too-cold, cold, warm, hot, too hot}
 Every member of this set is a linguistic term and it can
cover some portion of overall temperature values.
Example: Air conditioner system
controlled by a FLS
Construct membership functions for them
The membership functions of temperature variable are as shown −
Example: Air conditioner system
controlled by a FLS
Construct knowledge base rules
Create a matrix of room temperature values versus target
temperature values that an air conditioning system is expected to
provide.

Room Temp. \ Target | Too Cold  | Cold      | Warm      | Hot       | Too Hot
Too Cold            | No_Change | Heat      | Heat      | Heat      | Heat
Cold                | Cool      | No_Change | Heat      | Heat      | Heat
Warm                | Cool      | Cool      | No_Change | Heat      | Heat
Hot                 | Cool      | Cool      | Cool      | No_Change | Heat
Too Hot             | Cool      | Cool      | Cool      | Cool      | No_Change
(rows: current room temperature; columns: target temperature)
Example: Air conditioner system
controlled by a FLS
Build a set of rules into the knowledge base in the form of IF-
THEN-ELSE structures.
For Air Conditioner example, the following rules can be used:
1) IF (temperature is cold OR too-cold) AND (target is warm)
THEN command is heat.
2) IF (temperature is hot OR too-hot) AND (target is warm)
THEN command is cool.
3) IF (temperature is warm) AND (target is warm) THEN
command is nochange.
Example: Air conditioner system
controlled by a FLS
Defuzzification: The result is a fuzzy value and should be
defuzzified to obtain a crisp output. This is the purpose of the
defuzzifier component of a FLS. Defuzzification is performed
according to the membership function of the output variable.
This defuzzification step is not part of 'mathematical fuzzy
logic' and various strategies are possible.
The most commonly used algorithms for defuzzification are listed below.
1) Finding the center of gravity.
2) Finding the center of gravity for singletons.
3) Finding the average mean.
4) Finding the left most maximum.
5) Finding the right most maximum.
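As a sketch of the first strategy, here is a discrete center-of-gravity computation; the sampled output points below are invented for illustration and are not taken from the text:

```python
# Discrete center-of-gravity (centroid) defuzzification.
def centroid(points):
    """points: list of (x, mu) samples of the aggregated output fuzzy set."""
    num = sum(x * mu for x, mu in points)   # weighted sum of positions
    den = sum(mu for _, mu in points)       # total membership mass
    return num / den

# Aggregated "heat command" output, sampled at five power levels
# (illustrative values only):
output = [(0, 0.0), (25, 0.2), (50, 0.8), (75, 0.8), (100, 0.2)]
print(centroid(output))   # 62.5
```

The crisp output is Σ x·μ(x) / Σ μ(x) over the samples; here (0·0 + 25·0.2 + 50·0.8 + 75·0.8 + 100·0.2) / 2.0 = 62.5.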
Membership Function
 A membership function (MF) allows you to quantify a linguistic
term and to represent a fuzzy set graphically. The MF for a fuzzy
set A on the universe of discourse X is defined as μA: X → [0, 1].
 This function maps each element of X to a value between 0 and 1,
called the degree of membership, which quantifies how strongly
that element of X belongs to the fuzzy set A.
x-axis– It represents the universe of discourse.
y-axis – It represents the degrees of membership in the
[0, 1] interval.
Membership Function
We can apply different membership functions to fuzzify a numerical
value. We prefer simple functions over complex ones, since added
complexity does not add precision to the output.
We can define membership functions for all of LP, MP, S, MN, and
LN, as shown below −

Fuzzy Logic System – Membership Function


Triangular membership function

F(x; a, b, c) =
  0,                 if x < a
  (x − a) / (b − a), if a ≤ x ≤ b
  (c − x) / (c − b), if b ≤ x ≤ c
  0,                 if c < x
Cont…
[Figure: Triangular membership function; y-axis shows membership
values from 0 to 1, x-axis shows the universe of discourse from 0
to 100, with vertices at a, b, and c.]
Figure Triangular Function
Trapezoidal membership function

F(x; a, b, c, d) =
  0,                 if x < a
  (x − a) / (b − a), if a ≤ x ≤ b
  1,                 if b < x < c
  (d − x) / (d − c), if c ≤ x ≤ d
  0,                 if d < x
Cont…
Gaussian membership function

μ(x, a, b) = e^(−(x − b)² / (2a²))

The graph given in Fig. 10.6 is for parameters a = 0.22, b = 0.78.
Figure Gaussian Membership Function
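The three membership functions can be coded directly from the piecewise definitions above (the sample points in the prints are illustrative):

```python
import math

def triangular(x, a, b, c):
    """Triangular MF with feet at a and c, peak at b."""
    if x < a or x > c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    """Trapezoidal MF: ramp up on [a,b], plateau on [b,c], ramp down on [c,d]."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

def gaussian(x, a, b):
    """Gaussian MF with width a and center b: exp(-(x-b)^2 / (2a^2))."""
    return math.exp(-((x - b) ** 2) / (2 * a ** 2))

print(triangular(35, 20, 50, 80))       # 0.5, halfway up the left ramp
print(trapezoidal(60, 20, 40, 70, 90))  # 1.0, on the plateau
print(gaussian(0.78, 0.22, 0.78))       # 1.0, at the peak x = b
```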
Applications of Fuzzy Logic
Following are the different application areas where the Fuzzy Logic
concept is widely used:
 It is used in Businesses for decision-making support system.
 It is used in Automative systems for controlling the traffic and
speed, and for improving the efficiency of automatic transmissions.
Automative systems also use the shift scheduling method for
automatic transmissions.
 This concept is also used in the Defence in various areas. Defence
mainly uses the Fuzzy logic systems for underwater target
recognition and the automatic target recognition of thermal infrared
images.
 It is also widely used in the Pattern Recognition and
Classification in the form of Fuzzy logic-based recognition and
handwriting recognition. It is also used in the searching of fuzzy
images.
Applications of Fuzzy Logic
 Fuzzy logic systems also used in Securities.
 It is also used in microwave ovens for setting the power level
and the cooking strategy.
 This technique is also used in the area of modern control
systems such as expert systems.
 Finance is also another application where this concept is used for
predicting the stock market, and for managing the funds.
 It is also used for controlling the brakes.
 It is also used in the chemical industry for controlling the pH
and the chemical distillation process.
 It is also used in the industries of manufacturing for the
optimization of milk and cheese production.
 It is also used in the vacuum cleaners, and the timings of washing
machines.
 It is also used in heaters, air conditioners, and humidifiers.
Utility Theory and utility functions
 Decision theory, in its simplest form, deals with choosing
among actions based on the desirability of their immediate
outcomes
 If the agent may not know the current state, we define
RESULT(a) as a random variable whose values are the
possible outcome states. The probability of outcome s′,
given evidence observations e, is written
P(RESULT(a) = s′ | a, e)
where the a on the right-hand side of the conditioning bar
stands for the event that action a is executed.
 The agent’s preferences are captured by a utility function,
U(s), which assigns a single number to express the
desirability of a state.
 The expected utility of an action given the evidence, EU(a|e),
is just the average utility value of the outcomes, weighted by
the probability that each outcome occurs:
EU(a|e) = Σs′ P(RESULT(a) = s′ | a, e) U(s′)
The principle of maximum expected utility (MEU) says that a
rational agent should choose the action that maximizes the
agent’s expected utility:
action = argmax_a EU(a|e)
In a sense, the MEU principle could be seen as defining all of
AI. All an intelligent agent has to do is calculate the various
quantities, maximize utility over its actions, and away it goes.
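A minimal sketch of MEU action selection; the two actions, their outcome probabilities, and the utilities below are invented for illustration, not taken from the text:

```python
# MEU: pick the action whose probability-weighted utility is highest.
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

# Hypothetical actions with (P(outcome), U(outcome)) pairs:
actions = {
    'take_umbrella':  [(0.3, 60), (0.7, 80)],    # rain / no rain
    'leave_umbrella': [(0.3, 0),  (0.7, 100)],
}

# action = argmax_a EU(a | e)
best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best, expected_utility(actions[best]))   # take_umbrella 74.0
```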
Basis of Utility Theory
 Intuitively, the principle of Maximum Expected Utility
(MEU) seems like a reasonable way to make decisions, but
it is by no means obvious that it is the only rational way.

 Why should maximizing the average utility be so special?


 What’s wrong with an agent that maximizes the weighted
sum of the cubes of the possible utilities, or tries to
minimize the worst possible loss?
 Could an agent act rationally just by expressing preferences
between states, without giving them numeric values?
 Finally, why should a utility function with the required
properties exist at all?
Constraints on rational preferences
 These questions can be answered by writing down some
constraints on the preferences that a rational agent should have
and then showing that the MEU principle can be derived from
the constraints
A ≻ B the agent prefers A over B.
A ∼ B the agent is indifferent between A and B.
A ≿ B the agent prefers A over B or is indifferent between
them.
We can think of the set of outcomes for each action as a lottery—
think of each action as a ticket. A lottery L with possible outcomes
S1,...,Sn that occur with probabilities p1,...,pn is written
L = [p1, S1; p2, S2; ... pn, Sn] .
Constraints on rational preferences
 In general, each outcome Si of a lottery can be either an atomic
state or another lottery. The primary issue for utility theory is to
understand how preferences between complex lotteries are
related to preferences between the underlying states in those
lotteries.
To address this issue we list six constraints that we require any
reasonable preference relation to obey:
 Orderability: Given any two lotteries, a rational agent must
either prefer one to the other or else rate the two as equally
preferable. That is, the agent cannot avoid deciding.
Exactly one of (A ≻ B), (B ≻ A), or (A ∼ B) holds.
Constraints on rational preferences
 Transitivity: Given any three lotteries, if an agent prefers A to B and
prefers B to C, then the agent must prefer A to C.
(A ≻ B) ∧ (B ≻ C) ⇒ (A ≻ C)
 Continuity: If some lottery B is between A and C in preference, then
there is some probability p for which the rational agent will be
indifferent between getting B for sure and the lottery that yields A
with probability p and C with probability 1 − p.
A ≻ B ≻ C ⇒ ∃ p [p, A; 1 − p, C] ∼ B .
 Substitutability: If an agent is indifferent between two lotteries A
and B, then the agent is indifferent between two more complex
lotteries that are the same except that B is substituted for A in one of
them. This holds regardless of the probabilities and the other
outcome(s) in the lotteries.
A ∼ B ⇒ [p, A; 1 − p, C] ∼ [p, B; 1 − p, C] .
This also holds if we substitute ≻ for ∼ in this axiom.
Constraints on rational preferences
 Monotonicity: Suppose two lotteries have the same two possible
outcomes, A and B. If an agent prefers A to B, then the agent must
prefer the lottery that has a higher probability for A (and vice versa).
A ≻ B ⇒ (p > q ⇔ [p, A; 1 − p, B] ≻ [q, A; 1 − q, B])

 Decomposability: Compound lotteries can be reduced to simpler
ones using the laws of probability. This has been called the "no fun
in gambling" rule because it says that two consecutive lotteries can
be compressed into a single equivalent lottery.
[p, A; 1 − p, [q,B; 1 − q, C]] ∼ [p, A; (1 − p)q,B; (1 − p)(1 − q), C] .
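Decomposability can be checked numerically: for any utility assignment, the compound lottery and its flattened form yield the same expected utility. The values of p, q, and the utilities below are arbitrary illustration values:

```python
# "No fun in gambling": a two-stage lottery and its flattened
# one-stage equivalent have identical expected utility.
p, q = 0.4, 0.25
U = {'A': 10.0, 'B': 4.0, 'C': 1.0}   # illustrative utilities

# EU of [p, A; 1-p, [q, B; 1-q, C]] -- evaluate the inner lottery first
eu_compound = p * U['A'] + (1 - p) * (q * U['B'] + (1 - q) * U['C'])

# EU of [p, A; (1-p)q, B; (1-p)(1-q), C] -- the flattened lottery
eu_flat = p * U['A'] + (1 - p) * q * U['B'] + (1 - p) * (1 - q) * U['C']

print(eu_compound, eu_flat)   # both 5.05
```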
Expected Utilities
UNIT-4

Learning
Machine Learning Paradigms
What is learning?
 “Learning denotes changes in a system that ... enable a
system to do the same task more efficiently the next time.”
–Herbert Simon
 “Learning is constructing or modifying representations of
what is being experienced.”
–Ryszard Michalski
 “Learning is making useful changes in our minds.” –
Marvin Minsky

Paradigms in Machine Learning
 A paradigm, as most of us know, is a set of ideas,
assumptions and values held by an entity and they shape
the way that entity interacts with their environment.
 For machine learning, this translates into the set of policies
and assumptions inherited by a machine learning algorithm
which dictate how it interacts with both the data inputs and
the user.
Machine Learning
 Machine Learning is said as a subset of artificial
intelligence that is mainly concerned with the development
of algorithms which allow a computer to learn from the
data and past experiences on their own. The term machine
learning was first introduced by Arthur Samuel in 1959.
We can define it in a summarized way as:
 Machine learning enables a machine to automatically
learn from data, improve performance from
experiences, and predict things without being explicitly
programmed.
Machine Learning
 With the help of sample historical data, which is known
as training data, machine learning algorithms build
a mathematical model that helps in making predictions or
decisions without being explicitly programmed. Machine
learning brings computer science and statistics together for
creating predictive models. Machine learning constructs or
uses the algorithms that learn from historical data. The
more information we provide, the better the performance will
be.
 A machine has the ability to learn if it can improve its
performance by gaining more data.
How does Machine Learning work
A Machine Learning system learns from historical data,
builds the prediction models, and whenever it receives
new data, predicts the output for it. The accuracy of
predicted output depends upon the amount of data, as the
huge amount of data helps to build a better model which
predicts the output more accurately.
Classification of Machine Learning
At a broad level, machine learning can be classified into three
types:
 Supervised learning
 Unsupervised learning
 Reinforcement learning
Supervised Learning
 Supervised learning is a type of machine learning method in
which we provide sample labeled data to the machine learning
system in order to train it, and on that basis, it predicts the
output.
 The system creates a model using labeled data to understand
the datasets and learn about each data, once the training and
processing are done then we test the model by providing a
sample data to check whether it is predicting the exact output
or not.
 The goal of supervised learning is to map input data with the
output data. The supervised learning is based on supervision,
and it is the same as when a student learns things in the
supervision of the teacher. The example of supervised learning
is spam filtering.
Supervised Learning
Supervised learning can be grouped further in two categories of
algorithms:
 Classification
 Regression

Classification:
 Classification is a process of categorizing a given set of data
into classes, It can be performed on both structured or
unstructured data. The process starts with predicting the class
of given data points. The classes are often referred to as target,
label or categories.
Classification
 The classification predictive modeling is the task of
approximating the mapping function from input variables to
discrete output variables. The main goal is to identify which
class/category the new data will fall into.
 Heart disease detection can be identified as a classification
problem, this is a binary classification since there can be
only two classes i.e has heart disease or does not have heart
disease. The classifier, in this case, needs training data to
understand how the given input variables are related to the
class. And once the classifier is trained accurately, it can be
used to detect whether heart disease is there or not for a
particular patient.
Classification
 Since classification is a type of supervised learning, even
the targets are also provided with the input data. Let us get
familiar with the classification in machine learning
terminologies.
Examples of supervised machine learning algorithms for
classification are:
 Decision Tree Classifiers
 Support Vector Machines
 Naive Bayes Classifiers
 K Nearest Neighbor
 Artificial Neural Networks
Regression
 The regression algorithms attempt to estimate the mapping
function (f) from the input variables (x) to numerical or
continuous output variables (y). Now, the output variable could
be a real value, which can be an integer or a floating point
value. Therefore, the regression prediction problems are
usually quantities or sizes.
 For example, if you are provided with a dataset about houses,
and you are asked to predict their prices, that is a regression
task because the price will be a continuous output.
Examples of supervised machine learning algorithms for
regression:
 Linear Regression
 Logistic Regression (despite its name, mostly used for classification)
 Regression Decision Trees
 Artificial Neural Networks
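A regression task in miniature: an ordinary least-squares fit of a line y = m·x + b in closed form, on invented data lying roughly on y = 2x + 1:

```python
# Simple linear regression via the closed-form least-squares solution:
# m = cov(x, y) / var(x),  b = mean(y) - m * mean(x)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.1, 8.9]   # illustrative data, roughly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - m * mean_x
print(round(m, 2), round(b, 2))   # 1.96 1.1
```

For a new house-price-style query, the prediction is just the continuous value m·x + b, which is what distinguishes regression from classification.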
Unsupervised Learning
 As the name suggests, unsupervised learning is a machine
learning technique in which models are not supervised using a
training dataset. Instead, the models themselves find the hidden
patterns and insights in the given data. It can be compared to
the learning which takes place in the human brain while learning
new things. It can be defined as:
 Unsupervised learning is a type of machine learning in which
models are trained using an unlabeled dataset and are allowed
to act on that data without any supervision.
 Unsupervised learning cannot be directly applied to a
regression or classification problem because unlike
supervised learning, we have the input data but no
corresponding output data.
Unsupervised Learning
 The goal of unsupervised learning is to find the
underlying structure of dataset, group that data
according to similarities, and represent that dataset in a
compressed format.
 Example: Suppose the unsupervised learning algorithm is
given an input dataset containing images of different types
of cats and dogs. The algorithm is never trained upon the
given dataset, which means it does not have any idea about
the features of the dataset. The task of the unsupervised
learning algorithm is to identify the image features on their
own. Unsupervised learning algorithm will perform this
task by clustering the image dataset into the groups
according to similarities between images.
Unsupervised Learning
Working of Unsupervised Learning
Unsupervised Learning
 The unsupervised learning algorithm can be further categorized
into two types of problems:
 Clustering: Clustering is a method of grouping the objects into
clusters such that objects with most similarities remains into a
group and has less or no similarities with the objects of another
group. Cluster analysis finds the commonalities between the data
objects and categorizes them as per the presence and absence of
those commonalities.
 Association: An association rule is an unsupervised learning
method which is used for finding relationships between variables
in a large database. It determines the set of items that occur
together in the dataset. Association rules make marketing strategy
more effective: for example, people who buy item X (say, bread)
also tend to purchase item Y (butter/jam). A typical example of an
association rule is Market Basket Analysis.
Unsupervised Learning
Below is the list of some popular unsupervised learning algorithms:
 K-means clustering
 KNN (k-nearest neighbors; usually considered a supervised method)
 Hierarchical clustering
 Anomaly detection
 Neural Networks
 Principal Component Analysis
 Independent Component Analysis
 Apriori algorithm
 Singular value decomposition
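As a sketch of the grouping-by-similarity idea, here is a minimal one-dimensional k-means with k = 2 on invented data:

```python
# Minimal 1-D k-means (k = 2): assign points to the nearest centroid,
# then recompute each centroid as its cluster's mean, and repeat.
data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]   # two obvious groups
centroids = [data[0], data[3]]          # naive initialization

for _ in range(10):                     # fixed number of iterations
    clusters = {0: [], 1: []}
    for x in data:
        nearest = min((0, 1), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    centroids = [sum(c) / len(c) for c in clusters.values()]

print(sorted(round(c, 2) for c in centroids))   # [1.0, 8.0]
```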
Inductive learning
 Inductive Learning Algorithm (ILA) is an iterative and
inductive machine learning algorithm which is used for
generating a set of a classification rule, which produces
rules of the form “IF-THEN”, for a set of examples,
producing rules at each iteration and appending to the set of
rules.
 In this, first an input x (a verified value) is given to a
function f, and the output is f(x).
 Then we can give different sets of inputs (raw inputs) to the
same function f, and verify the outputs f(x).
 By using the outputs, we generate (learn) the rules.
Basic Idea:
 There are basically two methods for knowledge extraction
firstly from domain experts and then with machine
learning.
 For a very large amount of data, domain experts are not very
useful or reliable. So we move towards the machine learning
approach for this work.
 One machine learning method is to replicate the expert's logic
in the form of algorithms, but this work is very tedious,
time-consuming and expensive.
 So we move towards inductive algorithms, which themselves
generate the strategy for performing a task and need not be
instructed separately at each step.
Inductive learning
 Inductive learning also known as discovery learning, is a
process where the learner discovers rules by observing
examples.
 We can often work out rules for ourselves by observing
examples: if there is a pattern, we record it.
 We then apply the rule in different situations to see if it
works.
 With inductive language learning, tasks are designed
specifically to guide the learner and assist them in
discovering a rule.
Inductive learning
 Inductive learning: System tries to make a “general rule”
from a set of observed instances.
 Example:
Mango → f(Mango) -> sweet (e1)
Banana → f(Banana) -> sweet (e2)
…..
Fruits → f(Fruits) → sweet (general rule)
Example
 Suppose an example set having attributes - Place type,
weather, location, decision and seven examples.
 Our task is to generate a set of rules that under what
condition what is the decision.
Subset 1:

s.no  place type  weather  location  decision
1     hilly       winter   Kullu     Yes
2     mountain    windy    Shimla    Yes
3     beach       warm     Goa       Yes
4     beach       warm     Shimla    Yes

Subset 2:

s.no  place type  weather  location  decision
5     mountain    windy    Mumbai    No
6     beach       windy    Mumbai    No
7     beach       windy    Goa       No

 At iteration 1:
rows 3 & 4, column weather is selected and rows 3 & 4 are marked.
The rule added to R: IF weather is warm THEN the decision is yes.
 At iteration 2:
row 1, column place type is selected and row 1 is marked.
The rule added to R: IF place type is hilly THEN the decision is yes.
 At iteration 3:
row 2, column location is selected and row 2 is marked.
The rule added to R: IF location is Shimla THEN the decision is yes.
 At iteration 4:
rows 5 & 6, column location is selected and rows 5 & 6 are marked.
The rule added to R: IF location is Mumbai THEN the decision is no.
 At iteration 5:
row 7, columns place type & weather are selected and row 7 is marked.
The rule added to R: IF place type is beach AND weather is windy THEN the
decision is no.

Finally we get the rule set:

Rule Set
Rule 1: IF the weather is warm THEN the decision is yes.
Rule 2: IF place type is hilly THEN the decision is yes.
Rule 3: IF location is Shimla THEN the decision is yes.
Rule 4: IF location is Mumbai THEN the decision is no.
Rule 5: IF place type is beach AND the weather is windy
THEN the decision is no.
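The rule set above can be represented and applied in a short Python sketch (not part of the original slides; the attribute keys are chosen for illustration):

```python
# The seven training examples from the slide.
examples = [
    {"place": "hilly",    "weather": "winter", "location": "Kullu",  "decision": "yes"},
    {"place": "mountain", "weather": "windy",  "location": "Shimla", "decision": "yes"},
    {"place": "beach",    "weather": "warm",   "location": "Goa",    "decision": "yes"},
    {"place": "beach",    "weather": "warm",   "location": "Shimla", "decision": "yes"},
    {"place": "mountain", "weather": "windy",  "location": "Mumbai", "decision": "no"},
    {"place": "beach",    "weather": "windy",  "location": "Mumbai", "decision": "no"},
    {"place": "beach",    "weather": "windy",  "location": "Goa",    "decision": "no"},
]

# Each rule: (conditions that must all hold, decision).
rules = [
    ({"weather": "warm"},                    "yes"),  # Rule 1
    ({"place": "hilly"},                     "yes"),  # Rule 2
    ({"location": "Shimla"},                 "yes"),  # Rule 3
    ({"location": "Mumbai"},                 "no"),   # Rule 4
    ({"place": "beach", "weather": "windy"}, "no"),   # Rule 5
]

def classify(example):
    """Return the decision of the first rule whose conditions all match."""
    for conditions, decision in rules:
        if all(example.get(k) == v for k, v in conditions.items()):
            return decision
    return None  # no rule fires

correct = sum(classify(e) == e["decision"] for e in examples)
print(correct, "of", len(examples), "examples classified correctly")
```

Every one of the seven examples is covered by the first rule that fires on it, so the induced rule set reproduces the whole training set.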
Decision Trees
 Decision Tree is a Supervised learning technique that can be
used for both classification and Regression problems, but
mostly it is preferred for solving Classification problems. It is
a tree-structured classifier, where internal nodes represent
the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
 In a Decision tree, there are two nodes, which are
the Decision Node and Leaf Node. Decision nodes are used
to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not
contain any further branches.
 The decisions or tests are performed on the basis of the
features of the given dataset.
Decision Trees
An example of a decision tree can be explained using the binary
tree above. Say you want to predict whether a person is fit given
information like age, eating habits, and physical activity. The
decision nodes here are questions like 'What is the age?', 'Does he
exercise?', 'Does he eat a lot of pizza?', and the leaves are
outcomes like 'fit' or 'unfit'. In this case it was a binary
classification problem (a yes/no type problem).
Decision Trees
There are two main types of Decision Trees:
Classification trees (Yes/No types)
 What we've seen above is an example of a classification tree,
where the outcome was a variable like 'fit' or 'unfit'. Here the
decision variable is Categorical.
Regression trees (Continuous data types)
 Here the decision or outcome variable is Continuous, e.g. a
number like 123.
 Working: Now that we know what a Decision Tree is, we'll see
how it works internally.
 There are many algorithms that construct Decision Trees, but
one of the best known is the ID3 algorithm. ID3 stands for
Iterative Dichotomiser 3.
Decision Trees
 The ID3 algorithm is a classification algorithm that follows
a greedy approach of building a decision tree by selecting
the attribute that yields maximum Information Gain
(IG), i.e. minimum Entropy (H).
 Entropy: Entropy is the measure
of impurity, disorder or uncertainty in a bunch of
examples.
 Entropy controls how a Decision Tree decides to split the
data. It directly affects how a Decision Tree draws its
boundaries.
 The Equation of Entropy, where p_i is the proportion of
examples belonging to class i:
H(S) = - Σ_i p_i log2(p_i)
Decision Trees
 Information gain (IG): It measures how much
"information" a feature gives us about the class. It is
closely related to the Kullback-Leibler divergence.
Why does it matter?
 Information gain is the main criterion used by Decision
Tree algorithms to construct a Decision Tree.
 The Decision Tree algorithm always tries to
maximize Information gain.
 The attribute with the highest Information gain is tested/split
first.
 The Equation of Information gain for an attribute A, where
S_v is the subset of S with A = v:
Gain(S, A) = H(S) - Σ_v (|S_v| / |S|) * H(S_v)
Entropy calculations: Entropy of the current state. In the data set
below, there are in total 5 No's and 9 Yes's.

Day  Outlook   Temperature  Humidity  Wind    Play Golf
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

Remember that the Entropy is 0 if all members belong to the same
class, and 1 when half of them belong to one class and the other
half belong to the other class - that is perfect randomness.

For the attribute Wind, out of the 6 Strong examples we have 3
examples where the outcome was 'Yes' for Play Golf and 3 where
we had 'No' for Play Golf, so Entropy(S_strong) = 1.
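The entropy figures quoted above can be checked with a small Python sketch (not from the slides):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:                 # 0 * log2(0) is taken as 0
            p = count / total
            h -= p * log2(p)
    return h

H_S = entropy(9, 5)       # whole data set: 9 Yes, 5 No
H_strong = entropy(3, 3)  # Wind = Strong: 3 Yes, 3 No
print(round(H_S, 2), H_strong)
```

A pure node gives entropy 0 (e.g. `entropy(4, 0)`), and a 50/50 split gives the maximum of 1, matching the note above.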
Information Gain (IG) Calculation for the attribute Wind:

Gain(S, Wind) = Entropy(S) - Σ_x (|S_x| / |S|) * Entropy(S_x)

where 'x' ranges over the possible values of the attribute. Here the
attribute 'Wind' takes two possible values in the sample data, hence
x = {Weak, Strong}.

Amongst all the 14 examples (the Play Golf table above) we have 8
places where the wind is Weak and 6 where the wind is Strong.
Information Gain (IG) Calculation, continued:

Out of the 8 Weak examples, 6 of them were 'Yes' for Play Golf and
2 of them were 'No'. So we have:

Entropy(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811

We already calculated Entropy(S_strong) = 1, so:

Gain(S, Wind) = 0.94 - (8/14)(0.811) - (6/14)(1) = 0.048
Draw a decision tree for the Play Golf data set above using the
ID3 algorithm.
In the given example there are four attributes {outlook,
temperature, humidity, wind} and there is a class which
contains binary values, i.e., yes or no.

We need to calculate the information gain for each attribute so
that we can decide which attribute will be taken as the root
node for drawing the decision tree.

So for calculating Information gain we need to calculate the
Entropy values.
Attribute: Outlook
Values(Outlook) = Sunny, Overcast, Rain

Sunny: 2 Yes, 3 No → Entropy = 0.971
Overcast: 4 Yes, 0 No → Entropy = 0
Rain: 3 Yes, 2 No → Entropy = 0.971

Gain(S, Outlook) = 0.94 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971)
                 = 0.2464
Attribute: Temperature
Values(Temperature) = Hot, Mild, Cool

Hot: 2 Yes, 2 No → Entropy = 1.0
Mild: 4 Yes, 2 No → Entropy = 0.9183
Cool: 3 Yes, 1 No → Entropy = 0.8113

Gain(S, Temperature) = 0.94 - (4/14)(1.0)
                            - (6/14)(0.9183)
                            - (4/14)(0.8113)
                     = 0.0289
Attribute: Humidity
Values(Humidity) = High, Normal

High: 3 Yes, 4 No → Entropy = 0.9852
Normal: 6 Yes, 1 No → Entropy = 0.5917

Gain(S, Humidity) = 0.94 - (7/14)(0.9852) - (7/14)(0.5917)
                  = 0.1516
Attribute: Wind
Values(Wind) = Strong, Weak

Weak: 6 Yes, 2 No → Entropy = 0.8113
Strong: 3 Yes, 3 No → Entropy = 1.0

Gain(S, Wind) = 0.94 - (8/14)(0.8113) - (6/14)(1.0)
              = 0.0478
 Calculating the information gain for all attributes gives:
Gain(S, Outlook) = 0.2464
Gain(S, Temperature) = 0.0289
Gain(S, Humidity) = 0.1516
Gain(S, Wind) = 0.0478
 We can clearly see that Gain(S, Outlook) has the highest
information gain of 0.246, hence we choose the Outlook attribute as
the root node. At this point the decision tree has Outlook at the
root, with branches Sunny, Overcast and Rain.
 Here we observe that whenever the Outlook is Overcast,
Play Golf is always 'Yes'. This is no coincidence: the simple
subtree results from the high information gain of the
attribute Outlook.
 Now how do we proceed from this point? We simply
apply recursion; you might want to look at the algorithm
steps described earlier.
 Now that we've used Outlook, we have three attributes
remaining: Humidity, Temperature, and Wind. And we had
three possible values of Outlook: Sunny, Overcast, Rain.
 The Overcast node already ended up as the leaf
node 'Yes', so we're left with two subtrees to compute:
Sunny and Rain.
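As a check, the four information-gain values can be recomputed with a short Python sketch (illustrative, not from the slides; the slides round Entropy(S) to 0.94, so the last decimal places differ slightly):

```python
from math import log2
from collections import Counter

# Play Golf data set: (Outlook, Temperature, Humidity, Wind, Play Golf)
data = [
    ("Sunny","Hot","High","Weak","No"),     ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),  ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]
attrs = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    """Entropy of the class label (last tuple element)."""
    total = len(rows)
    counts = Counter(r[-1] for r in rows)
    return -sum(c/total * log2(c/total) for c in counts.values())

def gain(rows, idx):
    """Information gain of splitting `rows` on attribute index `idx`."""
    rem = 0.0
    for v in {r[idx] for r in rows}:
        sub = [r for r in rows if r[idx] == v]
        rem += len(sub) / len(rows) * entropy(sub)
    return entropy(rows) - rem

gains = {name: gain(data, i) for name, i in attrs.items()}
best = max(gains, key=gains.get)
print({k: round(v, 4) for k, v in gains.items()}, "->", best)
```

Outlook comes out highest, confirming the choice of root node.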
Sunny subtree (examples where Outlook = Sunny):

Day  Temperature  Humidity  Wind    Play Golf
D1   Hot          High      Weak    No
D2   Hot          High      Strong  No
D8   Mild         High      Weak    No
D9   Cool         Normal    Weak    Yes
D11  Mild         Normal    Strong  Yes

For this subset we evaluate the remaining attributes:
Attribute: Temperature, Values = Hot, Mild, Cool
Attribute: Humidity, Values = High, Normal
Attribute: Wind, Values = Strong, Weak

Gain(Ssunny, Temperature) = 0.570
Gain(Ssunny, Humidity) = 0.97
Gain(Ssunny, Wind) = 0.0192

Humidity has the highest gain, so it becomes the decision node
for the Sunny branch (High → No, Normal → Yes).
Rain subtree (examples where Outlook = Rain):

Day  Temperature  Humidity  Wind    Play Golf
D4   Mild         High      Weak    Yes
D5   Cool         Normal    Weak    Yes
D6   Cool         Normal    Strong  No
D10  Mild         Normal    Weak    Yes
D14  Mild         High      Strong  No

For this subset we evaluate the remaining attributes:
Attribute: Temperature, Values = Mild, Cool
Attribute: Humidity, Values = High, Normal
Attribute: Wind, Values = Strong, Weak

Gain(Srain, Temperature) = 0.0192
Gain(Srain, Humidity) = 0.0192
Gain(Srain, Wind) = 0.97

Wind has the highest gain, so it becomes the decision node for
the Rain branch (Strong → No, Weak → Yes).
Decision Tree
The final decision tree (reconstructed from the calculations above):

Outlook?
  Sunny    → Humidity?
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind?
               Strong → No
               Weak   → Yes
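The whole recursive construction walked through above can be sketched as a small ID3 implementation in Python (an illustrative sketch, not the original author's code):

```python
from math import log2
from collections import Counter

DATA = [
    # (Outlook, Temperature, Humidity, Wind, Play Golf)
    ("Sunny","Hot","High","Weak","No"),     ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),  ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]
NAMES = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(rows):
    total = len(rows)
    return -sum(c/total * log2(c/total)
                for c in Counter(r[-1] for r in rows).values())

def id3(rows, attr_indices):
    labels = {r[-1] for r in rows}
    if len(labels) == 1:                  # pure node -> leaf
        return labels.pop()
    if not attr_indices:                  # no attributes left -> majority leaf
        return Counter(r[-1] for r in rows).most_common(1)[0][0]

    def gain(i):                          # information gain of attribute i
        rem = 0.0
        for v in {r[i] for r in rows}:
            sub = [r for r in rows if r[i] == v]
            rem += len(sub) / len(rows) * entropy(sub)
        return entropy(rows) - rem

    best = max(attr_indices, key=gain)    # greedy choice: max information gain
    rest = [i for i in attr_indices if i != best]
    return (NAMES[best],
            {v: id3([r for r in rows if r[best] == v], rest)
             for v in {r[best] for r in rows}})

tree = id3(DATA, [0, 1, 2, 3])
print(tree)
```

Running this reproduces exactly the tree drawn above: Outlook at the root, Humidity under Sunny, Wind under Rain, and a 'Yes' leaf under Overcast.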
Neural Networks
● An artificial neural network (ANN) is a machine learning
approach that models the human brain and consists of a number
of artificial neurons.
● Neurons in ANNs tend to have fewer connections than
biological neurons.
● Each neuron in an ANN receives a number of inputs.
● An activation function is applied to these inputs, which
results in the activation level of the neuron (the output value of
the neuron).
● Knowledge about the learning task is given in the form of
examples called training examples.
Contd..
● An Artificial Neural Network is specified by:
− neuron model: the information processing unit of the NN,
− an architecture: a set of neurons and links connecting neurons.
Each link has a weight,
− a learning algorithm: used for training the NN by modifying the
weights in order to model a particular learning task correctly on the
training examples.
● The aim is to obtain a NN that is trained and generalizes
well.
● It should behave correctly on new instances of the learning
task.
Neuron
● The neuron is the basic information processing unit of a
NN. It consists of:
1 A set of links, describing the neuron inputs, with weights
W1, W2, …, Wm
2 An adder function (linear combiner) for computing the
weighted sum of the inputs (real numbers):
u = Σ_{j=1..m} wj xj
3 An activation function φ for limiting the amplitude of the
neuron output. Here 'b' denotes bias:
y = φ(u + b)
The Neuron Diagram
(The diagram shows input values x1 … xm entering through
weighted links w1 … wm, a summing function producing the
induced field v together with the bias b, and an activation
function φ(·) producing the output y.)
Bias of a Neuron
● The bias b has the effect of applying a transformation to the
weighted sum u:
v = u + b
● The bias is an external parameter of the neuron. It can be
modeled by adding an extra input.
● v is called the induced field of the neuron:
v = Σ_{j=0..m} wj xj,  with w0 = b (and x0 = 1)
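A minimal sketch of a single neuron's forward computation, assuming a sigmoid activation (the function name is illustrative). It also shows the equivalence of an explicit bias b and an extra input x0 = 1 with weight w0 = b:

```python
from math import exp

def neuron(inputs, weights, bias):
    """Weighted sum plus bias (induced field), passed through a sigmoid."""
    v = sum(w * x for w, x in zip(weights, inputs)) + bias   # induced field v
    return 1.0 / (1.0 + exp(-v))                             # activation

# Two inputs with an explicit bias ...
out_a = neuron([0.5, 0.2], [0.4, -0.7], bias=0.1)
# ... versus the bias folded in as weight w0 on a constant input x0 = 1.
out_b = neuron([1.0, 0.5, 0.2], [0.1, 0.4, -0.7], bias=0.0)
print(out_a, out_b)
```

Both calls compute the same induced field v = 0.16, so the outputs agree, which is exactly the w0 = b trick above.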
Neural Networks Activation
Functions
 Activation functions are mathematical equations that determine
the output of a neural network.
 The function is attached to each neuron in the network, and
determines whether it should be activated ("fired") or not, based
on whether each neuron's input is relevant for the model's
prediction.
 Activation functions also help normalize the output of each
neuron to a range between 0 and 1 or between -1 and 1.
 An additional aspect of activation functions is that they must be
computationally efficient because they are calculated across
thousands or even millions of neurons for each data sample.
Contd..
 Modern neural networks use a technique called
backpropagation to train the model, which places an
increased computational strain on the activation function
and its derivative.
 An activation function is simply the function used to get the
output of a node. It is also known as a Transfer Function.
 It is used to determine the output of a neural network, like yes
or no. It maps the resulting values into the range 0 to 1 or -1 to
1, etc. (depending upon the function).
Step Function
 A step function is a function like that used by the original
Perceptron.
 The output is a certain value, A1, if the input sum is above a
certain threshold and A0 if the input sum is below a certain
threshold.
 The values used by the Perceptron were A1 = 1 and A0 = 0.
Step Function
Linear or Identity Activation
Function
 As you can see, the function is a line, i.e. linear. Therefore,
the output of the function will not be confined to
any range.
 Equation: f(x) = x
 Range: (-infinity to infinity)
 It doesn't help with the complexity or various parameters of
the usual data that is fed to neural networks.
Sigmoid or Logistic Activation
Function
 The sigmoid function is an activation function that squashes
values into the range between 0 and 1:
σ(x) = 1 / (1 + e^(-x))
 When we apply the weighted sum in the place of x, the values
are scaled between 0 and 1.
 The beauty of the exponent is that the value never reaches 0
nor exceeds 1 in the above equation.
 Large negative numbers are scaled towards 0 and large
positive numbers are scaled towards 1.
Sigmoid Function
Tanh or Hyperbolic Tangent
Function
 The Tanh function is an activation function that rescales
values into the range between -1 and 1, with a shape similar
to the sigmoid function.
 Its advantage is that the values of tanh are zero-centered,
which helps the next neuron during propagation.
 When we apply the weighted sum of the inputs to tanh(x),
it rescales the values between -1 and 1.
 Large negative numbers are scaled towards -1 and large
positive numbers are scaled towards 1.
ReLU (Rectified Linear Unit)
 This is one of the most widely used activation functions.
 A benefit of ReLU is sparsity: it passes only positive values
and blocks negative ones, which speeds up the process.
f(x) = max(0, x)
 This function lets only positive values pass during forward
propagation.
 A drawback of ReLU is that when the gradient hits zero for
negative values, it does not converge towards the
minimum, which can result in a dead neuron during back
propagation.
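The activation functions discussed above can be sketched in a few lines of Python (illustrative definitions using the standard math module):

```python
import math

def step(x, threshold=0.0):
    """Step activation (as in the original Perceptron): 1 at/above threshold, else 0."""
    return 1 if x >= threshold else 0

def sigmoid(x):
    """Logistic function: squashes any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes any real x into (-1, 1), zero-centered."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit: passes positive values, blocks negative ones."""
    return max(0.0, x)

# Large negative inputs squash towards the lower bound,
# large positive inputs towards the upper bound.
print(sigmoid(-10), sigmoid(10))
print(tanh(-10), tanh(10))
print(relu(-3.0), relu(3.0))
```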
Network Architectures
● Three different classes of network architectures:
− single-layer feed-forward
− multi-layer feed-forward
− recurrent
● The architecture of a neural network is linked with the learning
algorithm used to train it.
Single Layer Feed-forward
(The figure shows an input layer of source nodes connected
directly to an output layer of neurons.)
Perceptron: Neuron Model
(Special form of single layer feed forward)
− The perceptron, first proposed by Rosenblatt (1958), is a
simple neuron that is used to classify its input into one of two
categories.
− A perceptron uses a step function that returns +1 if the weighted
sum of its input is ≥ 0 and -1 otherwise:
φ(v) = +1 if v ≥ 0
       -1 if v < 0
(The figure shows inputs x1 … xn with weights w1 … wn and
bias b feeding the induced field v, which passes through φ(v)
to give the output y.)
Perceptron for Classification
● The perceptron is used for binary classification.
● First train a perceptron for a classification task.
− Find suitable weights in such a way that the training examples are correctly
classified.
− Geometrically try to find a hyper-plane that separates the examples of the
two classes.
● The perceptron can only model linearly separable classes.
● When the two classes are not linearly separable, it may be
desirable to obtain a linear separator that minimizes the mean
squared error.
● Given training examples of classes C1, C2 train the perceptron in
such a way that :
− If the output of the perceptron is 1 then the input is assigned to class C1
− If the output is 0 then the input is assigned to C2
Learning Process for Perceptron
● Initially assign random weights to inputs between -0.5 and +0.5.
● Training data is presented to the perceptron and its output is
observed.
● If the output is incorrect, the weights are adjusted accordingly
using the following formula:
wi ← wi + (a * xi * e), where 'e' is the error produced
and 'a' (0 ≤ a ≤ 1) is the learning rate
− The error 'e' is 0 if the output is correct, positive if the output is too
low, and negative if the output is too high.
− Once the modification to weights has taken place, the next piece of
training data is used in the same way.
− Once all the training data have been applied, the process starts again, until
all the weights are correct and all errors are zero.
− Each iteration of this process is known as an epoch.
Example: Perceptron to learn OR
function
● Initially consider w1 = -0.2 and w2 = 0.4.
● Training data, say, x1 = 0 and x2 = 0; the target output is 0.
● Compute y = Step(w1*x1 + w2*x2) = Step(0) = 0. Output is correct
so the weights are not changed.
● For training data x1 = 0 and x2 = 1, the target output is 1.
● Compute y = Step(w1*x1 + w2*x2) = Step(0.4) = 1. Output is
correct so the weights are not changed.
● Next training data x1 = 1 and x2 = 0, and the target output is 1.
● Compute y = Step(w1*x1 + w2*x2) = Step(-0.2) = 0. Output is
incorrect, hence the weights are to be changed.
● Assume a = 0.2 and error e = 1:
wi = wi + (a * xi * e) gives w1 = 0 and w2 = 0.4
● With these weights, test the remaining training data.
● Repeat the process till we get a stable result.
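The worked example above can be sketched as a training loop (illustrative Python; the initial weights, learning rate, and the convention Step(v) = 1 only for v > 0 follow the slide):

```python
def step(v):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if v > 0 else 0

# OR truth table: ((x1, x2), target)
training = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w = [-0.2, 0.4]   # initial weights from the slide
a = 0.2           # learning rate

for epoch in range(20):                       # repeat until stable
    errors = 0
    for (x1, x2), target in training:
        y = step(w[0] * x1 + w[1] * x2)
        e = target - y                        # error: positive if output too low
        if e != 0:
            w[0] += a * x1 * e                # wi <- wi + (a * xi * e)
            w[1] += a * x2 * e
            errors += 1
    if errors == 0:                           # a full epoch with no mistakes
        break

print("learned weights:", w)
outputs = [step(w[0] * x1 + w[1] * x2) for (x1, x2), _ in training]
```

The first correction reproduces the slide's update (w1 = 0, w2 = 0.4); after one more epoch the weights stabilise and the perceptron reproduces the full OR truth table.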
Perceptron: Limitations
● The perceptron can only model linearly separable
functions,
− those functions which can be drawn in a 2-dim graph such
that a single straight line separates the values into two parts.
● The Boolean functions given below are linearly separable:
− AND
− OR
− COMPLEMENT
● It cannot model the XOR function, as it is not linearly
separable.
− When the two classes are not linearly separable, it may
be desirable to obtain a linear separator that minimizes
the mean squared error.
Multi layer feed-forward NN (FFNN)
● FFNN is a more general network architecture, where there
are hidden layers between input and output layers.
● Hidden nodes do not directly receive inputs nor send
outputs to the external environment.
● FFNNs overcome the limitation of single-layer NN.
● They can handle non-linearly separable learning tasks.
(The figure shows a 3-4-2 network: an input layer of three
nodes, a hidden layer of four nodes, and an output layer of
two nodes.)
FFNN for XOR
● The ANN for XOR has two hidden nodes that realize this non-linear
separation and use the sign (step) activation function.
● Arrows from the input nodes to the two hidden nodes indicate the directions
of the weight vectors.
● The output node is used to combine the outputs of the two hidden nodes.

Inputs     Output of Hidden Nodes     Output Node
X1  X2     H1 (OR)    H2 (NAND)       AND(H1, H2) = X1 XOR X2
0   0      0          1               0
0   1      1          1               1
1   0      1          1               1
1   1      1          0               0

Since we represent the two input states by 0 (false) and 1 (true), we
take two hidden nodes H1 and H2 performing the OR and NAND
operations, and their results act as inputs for the output node, which
performs an AND operation; the overall result is the XOR operation.
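The table above can be reproduced with step units. The weights and thresholds below are one illustrative choice (not given in the slides) that realizes OR, NAND and AND:

```python
def step(v):
    """Sign/step activation: fires (1) when the weighted sum is positive."""
    return 1 if v > 0 else 0

# Hidden node H1: OR gate (fires if at least one input is 1).
def or_gate(x1, x2):
    return step(1.0 * x1 + 1.0 * x2 - 0.5)

# Hidden node H2: NAND gate (fires unless both inputs are 1).
def nand_gate(x1, x2):
    return step(-1.0 * x1 - 1.0 * x2 + 1.5)

# Output node: AND gate (fires only if both hidden nodes fire).
def and_gate(h1, h2):
    return step(1.0 * h1 + 1.0 * h2 - 1.5)

def xor(x1, x2):
    # hidden layer: OR and NAND; output layer: AND of the two hidden outputs
    return and_gate(or_gate(x1, x2), nand_gate(x1, x2))

table = [(x1, x2, xor(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
print(table)
```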
FFNN NEURON MODEL
● The classical learning algorithm of FFNN is based on the
gradient descent method.
● For this reason the activation functions used in FFNN are
continuous functions of the weights, differentiable
everywhere.
● The activation function for node i may be defined as a
simple form of the sigmoid function in the following
manner:
φ(Vi) = 1 / (1 + e^(-A*Vi))
where A > 0, Vi = Σj Wij * Yj, such that Wij is the weight of the
link from node i to node j and Yj is the output of node j.
Training Algorithm: Backpropagation
● The Backpropagation algorithm learns in the same way as the
single perceptron.
● It searches for weight values that minimize the total error of
the network over the set of training examples (the training set).
● Backpropagation consists of the repeated application of the
following two passes:
− Forward pass: In this step, the network is activated on one
example and the error of (each neuron of) the output layer
is computed.
− Backward pass: In this step the network error is used for
updating the weights. The error is propagated backwards
from the output layer through the network, layer by layer.
This is done by recursively computing the local gradient of
each neuron.
Backpropagation
● Back-propagation training algorithm:
Forward Step: network activation
Backward Step: error propagation
● Backpropagation adjusts the weights of the NN in order to
minimize the network total mean squared error.
Contd..
● Consider a network of three layers.
● Let us use i to represent nodes in the input layer, j to represent
nodes in the hidden layer, and k to represent nodes in the output layer.
● wij refers to the weight of the connection between a node in the input
layer and a node in the hidden layer.
● The following equation is used to derive the output value Yj
of node j:
Yj = 1 / (1 + e^(-Xj))
where Xj = Σi xi * wij - θj, 1 ≤ i ≤ n; n is the number of
inputs to node j, and θj is the threshold for node j.
Total Mean Squared Error
 The error function: for simplicity, we'll use the Mean
Squared Error function. For the first output, the error is the
correct output value minus the actual output of the neural
network:
0.5 - 0.735 = -0.235
 For the second output:
0.5 - 0.455 = 0.045
Now we'll calculate the Mean Squared Error:
MSE(O1) = ½ (-0.235)² = 0.0276
MSE(O2) = ½ (0.045)² = 0.001
The Total Error is the sum of the two errors:
Total Error = 0.0276 + 0.001 = 0.0286
This is the number we need to minimize with backpropagation.
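The arithmetic above can be sketched in a few lines of Python:

```python
# Reproduce the error calculation from the slide.
targets = [0.5, 0.5]
outputs = [0.735, 0.455]

errors = [t - o for t, o in zip(targets, outputs)]  # [-0.235, 0.045]
mse = [0.5 * e ** 2 for e in errors]                # per-output squared error
total_error = sum(mse)                              # quantity to minimize
print(round(total_error, 4))
```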
Weight Update Rule
● The Backprop weight update rule is based on the gradient
descent method:
− It takes a step in the direction yielding the maximum
decrease of the network error E.
− This direction is the opposite of the gradient of E.
● Iteration of the Backpropagation algorithm is usually
terminated when the sum of squares of errors of the output
values for all training data in an epoch is less than some
threshold such as 0.01.

wij = wij + Δwij
Backprop learning algorithm
(incremental-mode)
n = 1;
initialize weights randomly;
while (stopping criterion not satisfied and n < max_iterations)
    for each example (x, d)
        - run the network with input x and compute the output y
        - update the weights in backward order, starting from those of
          the output layer:
              w_ji = w_ji + Δw_ji
          with Δw_ji computed using the (generalized) Delta rule
    end-for
    n = n + 1;
end-while;
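A minimal sketch of one forward and one backward pass on a tiny 2-2-1 sigmoid network (the inputs, weights and target below are illustrative, not from the slides); after a single delta-rule update the error should decrease:

```python
from math import exp

def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))

x = [0.35, 0.9]                 # input example
w_hidden = [[0.1, 0.8],         # weights into hidden node h1
            [0.4, 0.6]]         # weights into hidden node h2
w_out = [0.3, 0.9]              # weights from h1, h2 into the output node
target = 0.5
lr = 1.0                        # learning rate

def forward():
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    y = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))
    return h, y

# Forward pass: activate the network and measure the error.
h, y = forward()
error_before = 0.5 * (target - y) ** 2

# Backward pass: local gradients (delta rule for sigmoid units),
# computed with the old weights before any update.
delta_out = (target - y) * y * (1 - y)
delta_hidden = [hi * (1 - hi) * w_out[i] * delta_out for i, hi in enumerate(h)]

# Weight updates: output layer first, then hidden layer.
w_out = [w + lr * delta_out * hi for w, hi in zip(w_out, h)]
for i in range(2):
    w_hidden[i] = [w + lr * delta_hidden[i] * xi for w, xi in zip(w_hidden[i], x)]

_, y_new = forward()
error_after = 0.5 * (target - y_new) ** 2
print(error_before, "->", error_after)
```

Repeating the two passes over all training examples, epoch after epoch, is exactly the incremental-mode loop in the pseudocode above.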
Stopping criteria
● Total mean squared error change:
− Back-prop is considered to have converged when the absolute rate of
change in the average squared error per epoch is sufficiently small
(in the range [0.01, 0.1]).
● Generalization based criterion:
− After each epoch, the NN is tested for generalization.
− If the generalization performance is adequate then stop.
− If this stopping criterion is used, then the part of the training set used
for testing the network generalization will not be used for updating the
weights.
Reinforcement Learning
 Reinforcement learning is an area of Machine Learning. It is
about taking suitable actions to maximize reward in a particular
situation.
 It is employed by various software and machines to find the best
possible behavior or path to take in a specific situation.
 Reinforcement learning differs from supervised learning: in
supervised learning the training data has the answer key with it,
so the model is trained with the correct answer itself, whereas in
reinforcement learning there is no answer; the reinforcement
agent decides what to do to perform the given task.
 In the absence of a training dataset, it is bound to learn from its
experience.
Learning from rewards
 In Reinforcement Learning (RL), agents are trained on
a reward and punishment mechanism. The agent is rewarded
for correct moves and punished for the wrong ones. In doing so,
the agent tries to minimize wrong moves and maximize the right
ones.
Learning from rewards
Example: The problem is as follows: we have an agent and a reward, with a
hurdle at (2,2) in between. The agent is supposed to find the best possible path to
reach the reward.

The agent learns by trying all the possible paths and then choosing the path which
gives it the reward with the fewest hurdles. Each right step gives the agent a
reward and each wrong step subtracts from the agent's reward. The total reward
is calculated when it reaches the final reward, that is, at the state where the agent
gets a +1 reward.
Steps in Reinforcement Learning
 Input: The input should be an initial state from which the
model will start.
 Output: There are many possible outputs, as there are a variety
of solutions to a particular problem.
 Training: The training is based upon the input. The model
will return a state and the user will decide to reward or
punish the model based on its output.
 The model continues to learn.
 The best solution is decided based on the maximum reward.
 Policy: It is a mapping that assigns an action to
every possible state in the system (a sequence of
state-action choices).
 Optimal Policy: A policy which maximizes
the long-term reward.
Active and Passive Reinforcement
Learning
 Both active and passive reinforcement learning are types of
Reinforcement Learning.
 In case of passive reinforcement learning, the agent’s policy
is fixed which means that it is told what to do.
 In contrast to this, in active reinforcement learning, an
agent needs to decide what to do as there’s no fixed policy
that it can act on.
 Therefore, the goal of a passive reinforcement learning
agent is to execute a fixed policy (sequence of actions) and
evaluate it while that of an active reinforcement learning
agent is to act and learn an optimal policy.
Passive Reinforcement Learning
Techniques
 In this kind of RL, we assume that the agent's policy
π(s) is fixed.
 The agent is therefore bound to do what the policy dictates,
although the outcomes of the agent's actions are probabilistic.
 The agent may watch what is happening, so the agent
knows what states it is reaching and what rewards
it gets there.
Techniques:
1. Direct utility estimation
2. Adaptive dynamic programming
3. Temporal difference learning
Active Reinforcement Learning
Techniques
 In this kind of RL, we assume that the agent's policy
π(s) is not fixed.
 The agent is therefore not bound to an existing policy; it tries to
act and find an optimal policy for calculating and
maximizing the overall reward value.
Techniques:
1. Q-Learning
2. ADP with exploration function
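As a sketch of Q-Learning (illustrative, not from the slides): an agent on a five-state corridor learns, from rewards alone, the optimal policy of always moving right:

```python
import random

# Five states 0..4; the agent starts in state 0 and gets reward +1 on reaching state 4.
N_STATES = 5
ACTIONS = [-1, +1]                  # move left / move right
alpha, gamma, eps = 0.5, 0.9, 0.3   # learning rate, discount, exploration rate
random.seed(0)                      # deterministic run

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(300):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)     # walls at both ends
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Q-learning update: move Q(s,a) towards r + gamma * max_a' Q(s',a')
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# Greedy policy extracted from Q: "move right" in every non-terminal state.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```

No fixed policy is given to the agent: it starts with all Q-values at zero, explores, and the learned Q-values induce the optimal policy, which is exactly the "act and learn an optimal policy" behavior described above.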
Applications of Reinforcement
Learning
 Robotics for industrial automation.
 Business strategy planning
 Machine learning and data processing
 It helps you to create training systems that provide custom
instruction and materials according to the requirement of
students.
 Aircraft control and robot motion control