AIML
STUDY MATERIAL
REGISTER NUMBER:…………………………………………………………………………....
YEAR / SEM:………………………………………………………………………………………...
ACADEMIC YEAR:………………………………………………………………………………
PREPARED BY
1. Mr. R. MOHAN, AP/CSE
2. Mrs. G. SELVAPRIYA, AP/CSE
3. Mr. S. SHANMUGANATHAN, AP/CSE
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 1
SYLLABUS:
Introduction to AI - AI Applications - Problem solving agents – search
algorithms – uninformed search strategies – Heuristic search
strategies – Local search and optimization problems – adversarial
search – constraint satisfaction problems (CSP)
PART A
1. What is Artificial Intelligence? Or Define Artificial Intelligence.
Artificial intelligence is a wide-ranging branch of computer science
concerned with building smart machines capable of performing tasks
that typically require human intelligence.
Artificial Intelligence is the process of building intelligent machines from
vast volumes of data.
Systems learn from past data and experiences and perform human-
like tasks.
It enhances the speed, precision, and effectiveness of human efforts.
Data Science: includes data operations based on user requirements. It is mainly used in fraud detection, healthcare, BI analysis, and more.
Artificial Intelligence: includes predictive modeling to predict events based on previous and current data. Applications of AI include chat bots, voice assistants, and weather prediction.
Machine Learning: ML is a subset of Artificial Intelligence. Online recommendations, facial recognition, and NLP are a few examples of ML.
Types of Agent
Simple reflex agents respond directly to percepts.
Model-based reflex agents maintain internal state to track aspects of
the world that are not evident in the current percept.
Goal-based agents act to achieve their goals, and
Utility-based agents try to maximize their own expected “happiness.”
17. Define Problem Solving Agent. List the steps involved in Problem
Solving Agent.
Problem Solving Agent
Problem-solving agent is a goal-based agent that focuses on goals
using a group of algorithms and techniques to solve a well-defined
problem.
An agent may need to plan a sequence of actions that form a path
to a goal state.
Such an agent is called a problem-solving agent, and the
computational process it undertakes is called search.
Steps performed by Problem-solving agent
Goal Formulation
Problem Formulation
Search
Solution
Execution
19. Define search algorithms and list the types of search algorithm.
A search algorithm takes a search problem as input and returns a
solution.
Types of Search Algorithm
Uninformed Search Algorithms
o Depth First Search
o Depth Limited Search
o Breadth First Search
o Iterative Deepening Search
o Uniform Cost Search
o Bidirectional Search
Informed (Heuristic) Search Strategies
o Greedy Search
o A* Tree Search
20. Define Depth First Search. List the advantages and disadvantages
of DFS.
• Depth-first search (DFS) is an algorithm for traversing or searching
tree or graph data structures.
• The algorithm starts at the root node and explores as far as possible
along each branch before backtracking.
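As a sketch, DFS with an explicit stack can be written in Python. The adjacency-list graph here is a made-up example, not the figure from the questions below.

```python
def dfs(graph, start, goal):
    """Depth-first search using an explicit stack.

    `graph` is an adjacency dict, e.g. {'S': ['A', 'B'], ...}.
    Returns one path from `start` to `goal`, or None if no path exists.
    """
    stack = [(start, [start])]          # (node, path taken so far)
    visited = set()
    while stack:
        node, path = stack.pop()        # LIFO: always expand the deepest node
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        # Push successors in reverse so the leftmost child is explored first.
        for successor in reversed(graph.get(node, [])):
            stack.append((successor, path + [successor]))
    return None

graph = {'S': ['A', 'B'], 'A': ['C', 'G'], 'B': ['C'], 'C': ['G'], 'G': []}
print(dfs(graph, 'S', 'G'))  # ['S', 'A', 'C', 'G']
```

Note how the stack is what makes the search go "as far as possible along each branch before backtracking".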
21. Which solution would DFS find to move from node S to node G if run
on the graph below?
Solution.
22. Define Depth Limited Search. List its advantages and disadvantages.
A depth-limited search algorithm is similar to depth-first search with a
predetermined limit.
Depth-limited search can solve the drawback of the infinite path in the
Depth-first search.
Advantages:
Depth-limited search is Memory efficient.
Disadvantages:
Depth-limited search also has a disadvantage of incompleteness.
It may not be optimal if the problem has more than one solution.
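A minimal Python sketch of depth-limited search, using a hypothetical graph in which the goal G lies at depth 3, so a limit of 2 fails while a limit of 3 succeeds:

```python
def depth_limited_search(graph, node, goal, limit, path=None):
    """Recursive depth-limited search: DFS that never descends below `limit`."""
    if path is None:
        path = [node]
    if node == goal:
        return path
    if limit == 0:
        return None                      # cutoff reached: stop descending
    for successor in graph.get(node, []):
        result = depth_limited_search(graph, successor, goal, limit - 1,
                                      path + [successor])
        if result is not None:
            return result
    return None

graph = {'S': ['A', 'B'], 'A': ['C'], 'B': ['D'], 'C': [], 'D': ['G'], 'G': []}
print(depth_limited_search(graph, 'S', 'G', limit=2))  # None: G is at depth 3
print(depth_limited_search(graph, 'S', 'G', limit=3))  # ['S', 'B', 'D', 'G']
```

The `limit=2` run illustrates the incompleteness noted above: the goal exists but lies beyond the cutoff.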
24. Which solution would BFS find to move from node S to node G if run on
the graph below?
Solution.
26. Define Uniform Cost Search. List its advantages and disadvantages.
Uniform-cost search is a searching algorithm for traversing a weighted
tree or graph.
This algorithm is used when a separate cost is provided for each edge.
The main purpose of uniform-cost search is to discover the path to the
goal node with the lowest cumulative cost.
Cost of a node is defined as:
cost(node) = cumulative cost of all nodes from root
cost(root) = 0
Advantages of Uniform-cost Search:
The path with the lowest cost is chosen at each state.
UCS is complete if the number of states is finite and there is no loop
with zero weight.
UCS is optimal provided there are no negative edge costs.
Disadvantages of Uniform-cost Search:
It is just concerned with the expense of the path.
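The cost definition above can be sketched in Python with a priority queue; the weighted graph below is a hypothetical example, not the one from the question that follows.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """UCS: always expand the frontier node with the lowest cumulative cost.

    `graph` maps node -> list of (successor, edge_cost) pairs.
    Returns (cost, path) of the cheapest path, or None.
    """
    frontier = [(0, start, [start])]     # priority queue keyed on path cost
    explored = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in explored:
            continue
        explored.add(node)
        for successor, edge_cost in graph.get(node, []):
            heapq.heappush(frontier, (cost + edge_cost, successor,
                                      path + [successor]))
    return None

graph = {'S': [('A', 1), ('G', 12)], 'A': [('B', 3), ('C', 1)],
         'B': [('D', 3)], 'C': [('D', 1), ('G', 2)], 'D': [('G', 3)], 'G': []}
print(uniform_cost_search(graph, 'S', 'G'))  # (4, ['S', 'A', 'C', 'G'])
```

Note that the direct edge S->G (cost 12) is ignored in favor of the cheaper path of cumulative cost 4.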
27. Which solution would UCS find to move from node S to node G if run on
the graph below?
Solution.
29. Define informed (heuristic) search strategies and mention its types.
Here, the algorithms have information on the goal state, which helps
in more efficient searching.
This information is obtained by something called a heuristic.
Search Heuristics:
In an informed search, a heuristic is a function h(n) that estimates
how close a state is to the goal state.
h(n) = estimated cost of the cheapest path from the state at node n to
a goal state.
30. Find the path from S to G using greedy search. The heuristic values h
of each node are given below the name of the node.
Solution.
Starting from S, we can traverse to A(h=9) or D(h=5). We choose D,
as it has the lower heuristic cost. Now from D, we can move to
B(h=4) or E(h=3). We choose E with a lower heuristic cost. Finally,
from E, we go to G(h=0). This entire traversal is shown in the search
tree below, in blue.
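The traversal just described can be reproduced with a short Python sketch; the graph and h-values below are reconstructed from the description (they are an assumption, since the original figure is not shown).

```python
import heapq

def greedy_search(graph, h, start, goal):
    """Greedy best-first search: always expand the node with the lowest h(n)."""
    frontier = [(h[start], start, [start])]   # priority queue keyed on h only
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for successor in graph.get(node, []):
            heapq.heappush(frontier, (h[successor], successor,
                                      path + [successor]))
    return None

# Hypothetical graph matching the traversal described above.
graph = {'S': ['A', 'D'], 'A': ['B'], 'D': ['B', 'E'], 'B': ['C'],
         'E': ['G'], 'C': [], 'G': []}
h = {'S': 7, 'A': 9, 'D': 5, 'B': 4, 'E': 3, 'C': 2, 'G': 0}
print(greedy_search(graph, h, 'S', 'G'))  # ['S', 'D', 'E', 'G']
```

The path cost g is never consulted, which is why greedy search is fast but not guaranteed optimal.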
Solution.
Starting from S, the algorithm computes g(x) + h(x) for all nodes in
the fringe at each step, choosing the node with the lowest sum.
The entire work is shown in the table below.
Solution
PART B
1. Define Artificial Intelligence and its types, goals, advantages and limitations
in detail.
ARTIFICIAL INTELLIGENCE:
1.1 Artificial Intelligence Definition
1.2 Types of Artificial Intelligence
1.2.1 Types of Artificial Intelligence-based on capabilities
1.2.2 Types of Artificial Intelligence-based on functionalities
1.3 Four possible goals to pursue in artificial intelligence:
1.3.1 Acting humanly: The Turing test approach
1.3.2 Thinking humanly: The cognitive modeling approach
1.3.3 Thinking rationally: The “laws of thought” approach
1.3.4 Acting rationally: The rational agent approach
1.4 Subfields of Artificial Intelligence
1.5 Advantages of Artificial Intelligence
1.6 Limitations of Artificial Intelligence
Cognitive Science
o Cognitive science is the interdisciplinary study of mind and
intelligence, embracing philosophy, psychology, artificial
intelligence, neuroscience, linguistics, and anthropology.
o Cognitive Computing:
The ultimate goal of cognitive computing is to imitate the human
thought process in a computer model.
Using self-learning algorithms, pattern recognition by neural
networks, and natural language processing, a computer can mimic
the human way of thinking.
Here, computerized models are deployed to simulate the human
cognition process.
o Computer Vision:
Computer vision works by allowing computers to see, recognize, and
process images, the same way as human vision does, and then it
provides an appropriate output.
The computer must understand what it sees, and then analyze it,
accordingly.
o Natural Language Processing:
Natural language processing means developing methods that help us
communicate with machines using natural human languages like
English.
Table 1.2 - Comparison of Data Science, Artificial Intelligence and Machine Learning:
1950:
Alan Turing publishes Computing Machinery and Intelligence. In the
paper, Turing proposes to answer the question 'Can machines
think?' and introduces the Turing Test to determine whether a computer
can demonstrate the same intelligence as a human. The value of the
Turing Test has been debated ever since.
1956:
John McCarthy coins the term 'artificial intelligence' at the first-ever
AI conference at Dartmouth College. Later that year, Allen Newell,
J.C. Shaw, and Herbert Simon create the Logic Theorist, the first-ever
running AI software program.
1958:
In 1958, John McCarthy had defined the high-level language Lisp,
which was to become the dominant AI programming language for the
next 30 years.
1959:
At IBM, Nathaniel Rochester and his colleagues produced some of the
first AI programs.
Herbert Gelernter (1959) constructed the Geometry Theorem Prover,
which was able to prove theorems that many students of mathematics
would find quite tricky.
1967:
Frank Rosenblatt builds the Mark 1 Perceptron, the first computer
based on a neural network that 'learned' through trial and error.
1980s:
Neural networks, which use a backpropagation algorithm to train
themselves, become widely used in AI applications.
1997:
IBM's Deep Blue beats then world chess champion Garry Kasparov, in
a chess match (and rematch).
2011:
IBM Watson beats champions Ken Jennings and Brad Rutter
at Jeopardy!
2015:
Baidu's Minwa supercomputer uses a special kind of deep neural
network called a convolutional neural network to identify and categorize
images with a higher rate of accuracy than the average human.
2016:
DeepMind's AlphaGo program, powered by a deep neural network,
beats Lee Sedol, the world champion Go player, in a five-game match.
APPLICATIONS OF AI:
4.1. AI Application in E-Commerce
4.2. AI Application in Education
4.3. AI Application in Lifestyle
4.4. AI Application in Navigation
4.5. AI Application in Robotics
4.6. AI Application in Human Resource
4.7. AI Application in Healthcare
4.8. AI Application in Agriculture
4.9. AI Application in Gaming
4.10. AI Application in Automobiles
4.11. AI Application in Social Media
4.12. AI Application in Marketing
4.13. AI Application in Chatbots
4.14. AI Application in Finance
AGENTS
5.1 Agents and Environments
5.2 Basic Terminologies
5.3 Steps in designing an agent
5.4 Types of Agent
Environments
The environment could be everything—the entire universe.
Rational Agent
A rational agent acts so as to maximize the expected value of the
performance measure, given the percept sequence it has seen so
far.
Task Environment
A task environment specification includes the performance
measure, the external environment, the actuators, and the
sensors.
Model-based reflex
Goal-based agents
Learning agents:
It allows the agent to operate in initially unknown environments and to become more
competent than its initial knowledge alone might allow. A learning agent can be divided into
four conceptual components, as shown in figure 1.10:
Cells can contain objects, which the agent can pick up, push, or
otherwise act upon; a wall or other impassable obstacle in a cell
prevents an agent from moving into that cell.
The vacuum world can be formulated as a grid world problem as
shown in figure 1.12:
SEARCH ALGORITHMS
7.1 Search Algorithm
7.2 Types of Search Algorithm
7.2.1 UNINFORMED SEARCH ALGORITHMS
7.2.1.1 Depth First Search
7.2.1.2 Depth Limited Search
7.2.1.3 Breadth First Search
7.2.1.4 Iterative Deepening Search
7.2.1.5 Uniform Cost Search
7.2.1.6 Bidirectional Search
7.2.2 INFORMED (HEURISTIC) SEARCH STRATEGIES
7.2.2.1 Greedy Search
7.2.2.2 A* Tree Search
7.2.2.3 A* Graph Search
• These types of search use no additional (heuristic) information about
states beyond the problem definition, which is why uninformed search
is also known as Blind search.
• Types of uninformed search algorithms are shown in Figure 1.13.
7.2.1.1 Depth First Search
7.2.1.2 Depth Limited Search
7.2.1.3 Breadth First Search
7.2.1.4 Iterative Deepening Search
7.2.1.5 Uniform Cost Search
7.2.1.6 Bidirectional Search
Each of these algorithms will have:
A problem graph, containing the start node S and the goal node
G.
A strategy, describing the manner in which the graph will be
traversed to get to G.
A fringe, which is a data structure used to store all the possible
states (nodes) from the current states.
A tree, that results while traversing to the goal node.
A solution plan, which is the sequence of nodes from S to G.
Example:
Solution.
Algorithm
Advantages:
Depth-limited search is Memory efficient.
Disadvantages:
Depth-limited search also has a disadvantage of incompleteness.
It may not be optimal if the problem has more than one solution.
Example:
Solution.
Further, if the depth limit is varied over [0-3], the search again expands
the nodes from level 0 to level 3 and terminates
with the sequence A->B->D->F->E->H, where H is the desired goal node.
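Iterative deepening can be sketched in Python as repeated depth-limited searches with a growing limit. The tree below is a hypothetical example where the goal H sits at depth 3; it is not the figure from the text.

```python
def iterative_deepening_search(graph, start, goal, max_depth=10):
    """IDS: run depth-limited DFS with limits 0, 1, 2, ... until the goal is found."""
    def dls(node, path, limit):
        if node == goal:
            return path
        if limit == 0:
            return None                  # cutoff: retry with a larger limit
        for successor in graph.get(node, []):
            result = dls(successor, path + [successor], limit - 1)
            if result is not None:
                return result
        return None

    for limit in range(max_depth + 1):
        result = dls(start, [start], limit)
        if result is not None:
            return result
    return None

# Hypothetical tree where the goal H is reachable at depth 3.
graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
         'D': ['H'], 'E': [], 'F': [], 'G': [], 'H': []}
print(iterative_deepening_search(graph, 'A', 'H'))  # ['A', 'B', 'D', 'H']
```

Shallow levels are re-expanded on every iteration, but because a tree grows exponentially with depth, the repeated work is a small constant factor overall.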
The worst-case time and space complexity of uniform-cost search is
O(b^(1 + ⌊C*/ε⌋)), where C* is the cost of the optimal solution and ε is
the smallest edge cost.
Example:
Which solution would UCS find to move from node S to node G if run
on the graph below?
Solution.
Search Heuristics:
In an informed search, a heuristic is a function h(n) that estimates how
close a state is to the goal state.
h(n) = estimated cost of the cheapest path from the state at node n to
a goal state.
Example:
Question. Find the path from S to G using greedy search. The heuristic
values h of each node are given below the name of the node.
Solution.
Starting from S, we can traverse to A(h=9) or D(h=5). We choose D,
as it has the lower heuristic cost. Now from D, we can move to
B(h=4) or E(h=3). We choose E with a lower heuristic cost. Finally,
from E, we go to G(h=0). This entire traversal is shown in the search
tree below, in blue.
Heuristic:
The following points should be noted with heuristics in A* search.
f(x) = g(x) + h(x)
h(x) is called the forward cost and is an estimate of the distance of
the current node from the goal node.
g(x) is called the backward cost and is the cumulative cost of a
node from the root node.
A* search is optimal only when for all nodes, the forward cost for a
node h(x) underestimates the actual cost h*(x) to reach the goal.
This property of A* heuristic is called admissibility.
Admissibility: 0 <= h(x) <= h*(x)
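Putting f(x) = g(x) + h(x) and admissibility together, here is a minimal A* sketch in Python. The weighted graph and heuristic values are hypothetical (the original figure is not shown), but the heuristic is admissible, so the returned path is optimal.

```python
import heapq

def a_star(graph, h, start, goal):
    """A* search: expand the frontier node with the lowest f(x) = g(x) + h(x).

    `graph` maps node -> list of (successor, edge_cost); `h` is an admissible
    heuristic (0 <= h(x) <= h*(x)), which guarantees an optimal solution.
    """
    frontier = [(h[start], 0, start, [start])]   # (f, g, node, path)
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in best_g and best_g[node] <= g:
            continue                              # already reached more cheaply
        best_g[node] = g
        for successor, cost in graph.get(node, []):
            g2 = g + cost                         # backward cost so far
            heapq.heappush(frontier, (g2 + h[successor], g2, successor,
                                      path + [successor]))
    return None

# Hypothetical weighted graph with an admissible heuristic.
graph = {'S': [('A', 1), ('B', 4)], 'A': [('B', 2), ('C', 5), ('G', 12)],
         'B': [('C', 2)], 'C': [('G', 3)], 'G': []}
h = {'S': 7, 'A': 6, 'B': 4, 'C': 2, 'G': 0}
print(a_star(graph, h, 'S', 'G'))  # (8, ['S', 'A', 'B', 'C', 'G'])
```

Here g(x) is the cumulative (backward) cost from S and h(x) the estimated (forward) cost to G, exactly as in the definitions above.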
Solution.
Starting from S, the algorithm computes g(x) + h(x) for all nodes in the
fringe at each step, choosing the node with the lowest sum.
The entire work is shown in the table below.
A* algorithm
Example:
Question. Use graph searches to find paths from S to G in the following
graph.
Solution
The local search algorithm explores the above landscape by finding the
following two points:
Global Minimum: If the elevation corresponds to the cost, then the task
is to find the lowest valley, which is known as Global Minimum.
The topographical regions shown in the Figure 1.19 can be defined as:
Global Maximum: It is the highest point on the hill, which is the goal
state.
Local Maximum: It is a peak that is higher than its neighboring states
but lower than the global maximum.
Flat local maximum: It is the flat area over the hill where it has no
uphill or downhill. It is a saturated point of the hill.
Shoulder: It is also a flat area where the summit is possible.
Current state: It is the current position of the person.
ADVERSARIAL SEARCH
9.1 Adversarial Search in Artificial Intelligence
9.2 Elements of Game Playing search
9.3 Example:
9.3.1 Game Tree For Tic Tac Toe
9.4 Types Of Algorithms In Adversarial Search
9.4.1 Minimax Strategy / Minimax Algorithm
9.4.2 Alpha – Beta Pruning
9.3 Example:
The game tic-tac-toe has three possible outcomes: win, lose, or
draw, with values +1, -1, or 0.
Game tree designed for tic-tac-toe refer Figure 1.23.
Here, the node represents the game state and edges represent the moves
taken by the players.
Game Tree
A game tree is where the nodes represent the states of the game and
edges represent the moves made by the players in the game.
There will be two players, namely:
o MIN: tries to decrease MAX's chances of winning the game.
o MAX: tries to increase his own chances of winning the game.
MAX will choose that path which will increase its utility value and MIN
will choose the opposite path which could help it to minimize MAX’s
utility value.
Step 2:
Now, first we find the utility values for the Maximizer; its initial value
is -∞.
o For node D: max(-1, -∞) => max(-1, 4) = 4
o For node E: max(2, -∞) => max(2, 6) = 6
o For node F: max(-3, -∞) => max(-3, -5) = -3
o For node G: max(0, -∞) => max(0, 7) = 7
Step 3:
In the next step, it is the minimizer's turn, so it will compare all node
values with +∞ and find the 3rd-layer node values.
o For node B= min(4,6) = 4
o For node C= min (-3, 7) = -3
Step 4:
Now it's a turn for Maximizer, and it will again choose the maximum of all
nodes value and find the maximum value for the root node.
In this game tree, there are only 4 layers, hence reach immediately to the
root node, but in real games, there will be more than 4 layers.
o For node A max(4, -3)= 4
That was the complete workflow of the minimax two player game.
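The workflow above can be checked with a short Python sketch; the tree encodes the same leaf values (-1, 4, 2, 6, -3, -5, 0, 7) used in the steps, with the leaf names (d1, d2, ...) invented for illustration.

```python
def minimax(node, is_max, tree):
    """Minimax over a game tree given as a dict: node -> children, leaf -> value."""
    children = tree.get(node)
    if not isinstance(children, list):      # leaf: return its utility value
        return children
    values = [minimax(child, not is_max, tree) for child in children]
    return max(values) if is_max else min(values)

# The 4-layer tree from the walkthrough: A is MAX, B/C are MIN, D-G are MAX.
tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
        'D': ['d1', 'd2'], 'E': ['e1', 'e2'],
        'F': ['f1', 'f2'], 'G': ['g1', 'g2'],
        'd1': -1, 'd2': 4, 'e1': 2, 'e2': 6,
        'f1': -3, 'f2': -5, 'g1': 0, 'g2': 7}
print(minimax('A', True, tree))  # 4
```

The result matches step 4: A = max(min(4, 6), min(-3, 7)) = max(4, -3) = 4.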
Minimax algorithm
Step 1:
At the first step, the Max player will make the first move from node A,
where α = -∞ and β = +∞. These values of alpha and beta are passed
down to node B, where again α = -∞ and β = +∞, and node B passes the
same values to its child D.
Step 2:
At node D, the value of α will be calculated, as it is Max's turn. The
value of α is compared first with 2 and then with 3, and max(2, 3) = 3
will be the value of α at node D; the node value will also be 3.
Step 3:
Now the algorithm backtracks to node B, where the value of β will
change, as this is Min's turn. β = +∞ is compared with the available
successor node value, i.e. min(∞, 3) = 3; hence at node B now
α = -∞ and β = 3.
Step 4:
At node E, Max will take its turn, and the value of alpha will change.
The current value of alpha is compared with 5, so max(-∞, 5) = 5;
hence at node E, α = 5 and β = 3. Since α >= β, the right successor of
E will be pruned, and the algorithm will not traverse it; the value at
node E will be 5.
Step 5:
At the next step, the algorithm again backtracks the tree, from node B
to node A. At node A, the value of alpha is changed; the maximum
available value is 3, as max(-∞, 3) = 3, and β = +∞. These two values
are now passed to the right successor of A, which is node C.
At node C, α = 3 and β = +∞, and the same values are passed on to
node F.
Step 6:
At node F, the value of α is again compared, first with the left child,
which is 0: max(3, 0) = 3, and then with the right child, which is 1:
max(3, 1) = 3. α remains 3, but the node value of F becomes 1.
Step 7:
Node F returns the node value 1 to node C; at C, α = 3 and β = +∞.
Here the value of beta is changed: it is compared with 1, so
min(∞, 1) = 1. Now at C, α = 3 and β = 1, and again the condition
α >= β is satisfied, so the next child of C, which is G, will be pruned,
and the algorithm will not compute the entire subtree of G.
Step 8:
C now returns the value of 1 to A here the best value for A is max
(3, 1) = 3.
Following is the final game tree, which shows the nodes that were
computed and the nodes that were never computed.
Hence the optimal value for the maximizer is 3 for this example.
The game will be started from the last level of the game tree,
and the value will be chosen accordingly.
The rule which will be followed is: “Explore nodes if necessary
otherwise prune the unnecessary nodes.”
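The pruning steps above can be checked with a Python sketch. The leaf values 2, 3, 5, 0, 1 come from the example; the values of the pruned leaves (the right child of E and the children of G) are never examined in the text, so 9, 7, and 5 stand in for them here, which does not affect the result.

```python
def alphabeta(node, is_max, tree, alpha=float('-inf'), beta=float('inf')):
    """Minimax with alpha-beta pruning: skip branches that cannot change the result."""
    children = tree.get(node)
    if not isinstance(children, list):       # leaf node: return its value
        return children
    if is_max:
        value = float('-inf')
        for child in children:
            value = max(value, alphabeta(child, False, tree, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                # beta cutoff: MIN avoids this branch
                break
        return value
    else:
        value = float('inf')
        for child in children:
            value = min(value, alphabeta(child, True, tree, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:                # alpha cutoff: MAX avoids this branch
                break
        return value

# The tree from the step-by-step example (pruned-leaf values are stand-ins).
tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
        'D': ['d1', 'd2'], 'E': ['e1', 'e2'],
        'F': ['f1', 'f2'], 'G': ['g1', 'g2'],
        'd1': 2, 'd2': 3, 'e1': 5, 'e2': 9,
        'f1': 0, 'f2': 1, 'g1': 7, 'g2': 5}
print(alphabeta('A', True, tree))  # 3
```

The result agrees with step 8: the optimal value for the maximizer is 3, with the right child of E and the whole subtree of G pruned along the way.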
There are many possible solutions to this problem, such as in figure 1.26
SYLLABUS:
Acting under uncertainty – Bayesian inference – Naïve Bayes
models. Probabilistic reasoning – Bayesian networks – exact
inference in BN – approximate inference in BN – causal networks.
PART A
1. Define uncertainty and list the causes of uncertainty.
Uncertainty:
The knowledge representation A→B means that if A is true then B is
true. But in a situation where we are not sure whether A is true or not,
this statement cannot be expressed; such a situation is called
uncertainty.
So to represent uncertain knowledge, uncertain reasoning or probabilistic
reasoning is used.
Causes of uncertainty:
The following causes of uncertainty occur in the real world:
1. Information from unreliable sources
2. Experimental errors
3. Equipment faults
4. Temperature variation
5. Climate change
PREPARED BY: R.Mohan, G.Selvapriya, S.Shanmuganathan, AP/CSE
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 2
7. In a class, 70% of the students like English and 40% of the students
like both English and Mathematics. What percent of the students
who like English also like Mathematics?
Solution:
Let A be the event that a student likes Mathematics
and B be the event that a student likes English.
P(A|B) = P(A ∩ B) / P(B) = 0.40 / 0.70 = 0.57
Hence, 57% of the students who like English also like Mathematics.
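The arithmetic can be verified with a couple of lines of Python:

```python
# Probabilities from the problem statement.
p_english = 0.70            # P(student likes English)
p_english_and_math = 0.40   # P(student likes English and Mathematics)

# Conditional probability: P(Math | English) = P(Math and English) / P(English)
p_math_given_english = p_english_and_math / p_english
print(f"{p_math_given_english:.0%}")  # 57%
```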
11. Consider two events: A (it will rain tomorrow) and B (the sun will
shine tomorrow).
Use Bayes’ theorem to compute the posterior probability of each event
occurring, given the resulting weather conditions for today:
P(A|sunny) = P(sunny|A) * P(A) / P(sunny)
Local Semantics
PART B
Agents almost never have access to the whole truth about their environment.
Agents must, therefore, act under uncertainty.
The problem is that this rule is wrong. Not all patients with toothaches have
cavities; some of them have gum disease, an abscess, or one of several other
problems:
Unfortunately, in order to make the rule true, we have to add an almost unlimited
list of possible causes. We could try turning the rule into a causal rule:
But this rule is not right either; not all cavities cause pain. The only way to fix the
rule is to make it logically exhaustive: to augment the left-hand side with all the
qualifications required for a cavity to cause a toothache. Even then, for the
purposes of diagnosis, one must also take into account the possibility that the
patient might have a toothache and a cavity that are unconnected. Trying to use
first-order logic to cope with a domain like medical diagnosis thus fails for three
main reasons:
That is, we expect that out of all the situations that are indistinguishable
from the current situation as far as the agent's knowledge goes, the patient will
have a cavity in 80% of them. This belief could be derived from statistical data
(80% of the toothache patients seen so far have had cavities), or from some general
rules, or from a combination of evidence sources.
The 80% summarizes those cases in which all the factors needed for a cavity
to cause a toothache are present and other cases in which the patient has both
toothache and cavity but the two are unconnected. The missing 20% summarizes
all the other possible causes of toothache that we are too lazy or ignorant to confirm
or deny.
1.1 Uncertainty:
The knowledge representation A→B means that if A is true then B is
true. But in a situation where we are not sure whether A is true or
not, this statement cannot be expressed; such a situation is called
uncertainty.
So to represent uncertain knowledge, uncertain reasoning or
probabilistic reasoning is used.
1.1.1 Causes of uncertainty:
The following causes of uncertainty occur in the real world:
1. Information from unreliable sources
2. Experimental errors
3. Equipment faults
4. Temperature variation
5. Climate change
1.2.3 Probability:
Probability can be defined as a chance that an uncertain event
will occur.
The value of probability always lies between 0 and 1, which represent
the two extremes of total uncertainty and total certainty.
o 0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
o P(A) = 0, indicates total uncertainty in an event A.
o P(A) =1, indicates total certainty in an event A.
Formula to find the probability of an uncertain event
1.2.4.1 Example:
In a class, 70% of the students like English and 40% of the students
like both English and Mathematics. What percent of the students
who like English also like Mathematics?
Solution:
Let A be the event that a student likes Mathematics
and B be the event that a student likes English.
P(A|B) = P(A ∩ B) / P(B) = 0.40 / 0.70 = 0.57
Hence, 57% of the students who like English also like Mathematics.
2. Explain in detail about Bayesian inference and Naive Bayes Model or Naive
Bayes Theorem or Bayes Rule.
Global Semantics
Local Semantics
Markov Blanket
Each node is conditionally independent of all others given its
Markov blanket: parents + children + children’s parents
3.4 Example:
Harry installed a new burglar alarm at his home to detect burglary.
The alarm reliably detects a burglary but also responds to minor
earthquakes. Harry has two neighbors, David and Sophia, who have
taken the responsibility of informing Harry at work when they hear
the alarm. David always calls Harry when he hears the alarm, but
sometimes he gets confused with the phone ringing and calls at that
time too. On the other hand, Sophia likes to listen to loud music, so
she sometimes misses the alarm. Here we would like to compute the
probability of the Burglary Alarm.
Problem:
Calculate the probability that the alarm has sounded but neither a
burglary nor an earthquake has occurred, and both David and Sophia
have called Harry.
Solution:
The Bayesian network for the above problem is given in figure 2.2.
The network structure shows that burglary and earthquake are the
parent nodes of the alarm and directly affect the probability of the
alarm going off, whereas David's and Sophia's calls depend on the
alarm probability.
Variables:
Burglary, Earthquake, Alarm, DavidCalls, SophiaCalls
Let's take the observed probability for the Burglary and earthquake
component:
P(B=True) = 0.002, which is the probability of burglary.
P(B=False)= 0.998, which is the probability of no burglary.
P(E=True) = 0.001, which is the probability of a minor earthquake.
P(E=False) = 0.999, which is the probability that an earthquake has
not occurred.
From the formula of joint distribution, the problem statement in the form of
probability distribution:
P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).
= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain
by using Joint distribution.
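The same product can be checked in Python, using the conditional probabilities quoted in the example:

```python
# Conditional probabilities read off the burglary network's CPTs
# (values as given in the example above).
p_not_b = 0.998                 # P(¬B): no burglary
p_not_e = 0.999                 # P(¬E): no earthquake
p_a_given_not_b_not_e = 0.001   # P(A | ¬B, ¬E): alarm with no cause
p_d_given_a = 0.91              # P(D | A): David calls when the alarm sounds
p_s_given_a = 0.75              # P(S | A): Sophia calls when the alarm sounds

# Chain rule for Bayesian networks: multiply each variable's probability
# conditioned only on its parents.
joint = p_s_given_a * p_d_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
print(round(joint, 8))  # 0.00068045
```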
2. Biomonitoring:
a. This involves the use of indicators to quantify the concentration of
chemicals in the human body.
3. Information retrieval:
a. Bayesian networks assist in information retrieval for research,
which is a constant process of extracting information from
databases.
4. Image processing:
a. A form of signal processing, image processing uses mathematical
operations to convert images into digital format.
5. Gene regulatory network:
a. A Bayesian network is an algorithm that can be applied to gene
regulatory networks in order to make predictions about the effects
of genetic variations on cellular phenotypes.
b. Gene regulatory networks are a set of mathematical equations that
describe the interactions between genes, proteins, and metabolites.
c. They are used to study how genetic variations affect the
development of a cell or organism.
6. Turbo code:
a. Turbo codes are a type of error correction code capable of achieving
very high data rates and long distances between error correcting
nodes in a communications system.
b. They have been used in satellites, space probes, deep-space
missions, military communications systems, and civilian wireless
communication systems, including WiFi and 4G LTE cellular
telephone systems.
7. Document classification:
a. The main issue is to assign a document multiple classes. The task
can be achieved manually and algorithmically. Since manual effort
takes too much time, algorithmic documentation is done to
complete it quickly and effectively.
4. Explain in detail about Bayesian Inference and its type Exact Inference
with suitable example.
The basic task for any probabilistic inference system is to compute the
posterior probability distribution for a set of query variables, given some observed
event-that is, some assignment of values to a set of evidence variables. We will use
the following notation: X denotes the query variable; E denotes the set of evidence
variables E1, ..., Em, and e is a particular observed event; Y denotes the nonevidence
variables Y1, ..., Yn (sometimes called the hidden variables). Thus, the complete set
of variables is X = {X} ∪ E ∪ Y. A typical query asks for the posterior probability
distribution P(X|e).
Inference by enumeration
Conditional probability can be computed by summing terms from the full joint
distribution. More specifically, a query P(X|e) can be answered using Equation,
which we repeat here for convenience:
Consider the query P(Burglary | JohnCalls = true, MaryCalls = true). The
hidden variables for this query are Earthquake and Alarm. From Equation (13.6),
using initial letters for the variables to shorten the expressions, we have
The structure of this computation is shown in the figure above. Using the numbers
from the figure, we obtain P(b | j, m) = α × 0.00059224. The corresponding
computation for ¬b yields α × 0.0014919; hence
That is, the chance of a burglary, given calls from both neighbors, is about 28%.
The evaluation process for the expression in Equation is shown as an expression
tree in Figure.
If the number of parents of each node is bounded by a constant, then the complexity
will also be linear in the number of nodes. These results hold for any ordering
consistent with the topological ordering of the network.
Consider a network with 2 nodes (a fire icon and a smoke icon) and 1
edge (an arrow pointing from fire to smoke).
This network can be either a Bayesian network or a causal network.
The key distinction, however, lies in how this network is interpreted.
For a Bayesian network, we view the nodes as variables and the arrow
as a conditional probability, namely the probability of smoke given
information about fire.
When interpreting this as a causal network, we still view nodes as
variables, however, the arrow indicates a causal connection.
In this case, both interpretations are valid. However, if we were to flip the
edge direction, the causal network interpretation would be invalid, since
smoke does not cause fire.
1. The do-operator
The do-operator is a mathematical representation of a physical
intervention.
If the model starts with Z → X → Y, we simulate an intervention on X by
deleting all the incoming arrows to X and manually setting X to some
value x_0. Figure 2.4 shows an example of the do-operator.
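This graph surgery can be sketched as follows. The dictionary representation and names are illustrative, not from any specific causal-inference library:

```python
# A minimal sketch of the do-operator on Z -> X -> Y: delete all incoming
# edges to X and clamp X to some value x_0 externally.
parents = {"Z": [], "X": ["Z"], "Y": ["X"]}

def do(graph, var):
    """Return the mutilated graph in which `var` has no parents."""
    g = {node: list(ps) for node, ps in graph.items()}
    g[var] = []          # sever Z -> X; X is now set by the intervention
    return g

print(do(parents, "X"))  # {'Z': [], 'X': [], 'Y': ['X']}
```

After the surgery, Z no longer influences X, which is exactly what a physical intervention achieves.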
2. Confounding
A simple example of confounding is shown in Figure 2.5 below.
We begin with a random sampling process for a Bayes net that has no evidence
associated with it. The idea is to sample each variable in turn, in topological order.
The probability distribution from which the value is sampled is conditioned on the
values already assigned to the variable’s parents. (Because we sample in topological
order, the parents are guaranteed to have values already.) This algorithm is shown
in Figure. Applying it to the network with the ordering Cloudy, Sprinkler, Rain,
WetGrass, we might produce a random event as follows:
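The sampling loop can be sketched in Python. The conditional probability values below are the standard ones for this sprinkler example and are an assumption here, since the figure is not reproduced:

```python
import random

# Prior sampling for the sprinkler network, in topological order:
# Cloudy, Sprinkler, Rain, WetGrass. Each variable is sampled from a
# distribution conditioned on its already-sampled parents.
def prior_sample(rng=random.random):
    c = rng() < 0.5                         # P(Cloudy)
    s = rng() < (0.1 if c else 0.5)         # P(Sprinkler | Cloudy)
    r = rng() < (0.8 if c else 0.2)         # P(Rain | Cloudy)
    w = rng() < {(True, True): 0.99, (True, False): 0.90,
                 (False, True): 0.90, (False, False): 0.0}[(s, r)]
    return {"Cloudy": c, "Sprinkler": s, "Rain": r, "WetGrass": w}

print(prior_sample())
```

Because the parents are sampled first, every conditional lookup in the body already has concrete values to condition on.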
Let P̂(X | e) be the estimated distribution that the algorithm returns; this
distribution is computed by normalizing N_PS(X, e), the vector of sample counts for
each value of X in which the sample agrees with the evidence e:
Consider the query P(Rain | Sprinkler = true, WetGrass = true) applied to the
network in the figure. The evidence variables Sprinkler and WetGrass are fixed to their
observed values, and the hidden variables Cloudy and Rain are initialized randomly,
let us say to true and false respectively. Thus, the initial state is [true, true, false,
true]. Now the following steps are executed repeatedly:
1. Cloudy is sampled, given the current values of its Markov blanket variables: in
this case, we sample from P(Cloudy | Sprinkler = true, Rain = false). Suppose the
result is Cloudy = false. Then the new current state is [false, true, false, true].
2. Rain is sampled, given the current values of its Markov blanket variables: in this
case, we sample from P(Rain | Cloudy = false, Sprinkler = true, WetGrass = true).
Suppose this yields Rain = true. The new current state is [false, true, true, true].
Each state visited during this process is a sample that contributes to the estimate
for the query variable Rain. If the process visits 20 states where Rain is true and 60
states where Rain is false, then the answer to the query is
NORMALIZE(⟨20, 60⟩) = ⟨0.25, 0.75⟩
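The normalization step above is simply dividing each count by the total:

```python
def normalize(counts):
    """Turn a vector of sample counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

# 20 states with Rain = true and 60 with Rain = false:
print(normalize([20, 60]))  # [0.25, 0.75]
```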
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 3
SYLLABUS:
Introduction to machine learning – Linear Regression Models: Least
squares, single & multiple variables, Bayesian linear regression,
gradient descent, Linear Classification Models: Discriminant function –
Probabilistic discriminative model - Logistic regression, Probabilistic
generative model – Naive Bayes, Maximum margin classifier – Support
vector machine, Decision Tree, Random forests
PART A
1. Define Machine Learning.
Arthur Samuel, an early American leader in the field of computer gaming
and artificial intelligence, coined the term “Machine Learning” in 1959
while at IBM.
He defined machine learning as “the field of study that gives computers
the ability to learn without being explicitly programmed”.
Machine learning is programming computers to optimize a performance
criterion using example data or past experience. The model may be
predictive to make predictions in the future, or descriptive to gain
knowledge from data.
9. Define Classification.
The Classification algorithm is a Supervised Learning technique that is
used to identify the category of new observations on the basis of training
data.
In Classification, a program learns from the given dataset or observations
and then classifies new observations into a number of classes or groups,
such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can
be called targets/labels or categories.
The equation of the regression line is y = mx + c,
where
m: Slope
c: y-intercept
The prediction equation is Y′ = bX + a,
where
Y′ represents the predicted value;
X represents the known value;
b and a represent numbers calculated from the original correlation
analysis.
Bayes' theorem: P(H | E) = P(E | H) P(H) / P(E)
where
o Posterior: the probability that an event H occurs given that another
event E has already occurred, i.e., P(H | E).
o Prior: the probability of the event H before observing E, i.e., P(H).
o Likelihood: the probability of the evidence E given the hypothesis H,
i.e., P(E | H).
o Multi-class Problems:
o If a classification problem has more than two outcomes, then it is
called a Multi-class Classifier.
o Example: Classification of types of crops, classification of types
of music.
The cost is 0 if the predicted value and the actual value have the same
sign. If they do not, the loss value is calculated.
PART B
1.1.3 Examples
1.1.3.1 Handwriting recognition learning problem
Task T : Recognizing and classifying handwritten words
within images
Performance P : Percent of words correctly classified
Training experience E : A dataset of handwritten words with
given classifications
1.1.3.2 A robot driving learning problem
Task T : Driving on highways using vision sensors
Performance P : Average distance traveled before an error
Training experience E : A sequence of images and steering
commands recorded while observing a human driver
1.3.1 Classification:
The Classification algorithm is a Supervised Learning technique
that is used to identify the category of new observations on the
basis of training data.
In Classification, a program learns from the given dataset or
observations and then classifies new observations into a number
of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam,
cat or dog, etc. Classes can be called targets/labels or categories.
1.3.2 Regression:
Regression is a supervised learning technique which helps in
finding the correlation between variables and enables us to
predict the continuous output variable based on the one or more
predictor variables.
1.3.3 Clustering:
Clustering or cluster analysis is a machine learning technique,
which groups the unlabeled dataset.
It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points.
The objects with the possible similarities remain in a group that
has less or no similarities with another group."
The equation of the regression line is y = mx + c,
where
m: Slope
c: y-intercept
The prediction equation is Y′ = bX + a,
where
Y′ represents the predicted value;
X represents the known value;
b and a represent numbers calculated from the original correlation
analysis.
Step 6: Calculate the slope and the y-intercept using the formulas
# X, Y, n, mean_x and mean_y are computed in the earlier steps
# Calculating 'm' and 'c'
num = 0
denom = 0
for i in range(n):
    num += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
m = num / denom
c = mean_y - (m * mean_x)
# Printing coefficients
print("Coefficients")
print(m, c)
Output:
Mathematical Approach:
Residual/Error = Actual value - Predicted value
Sum of Residuals/Errors = Σ(Actual - Predicted)
Sum of Squared Residuals/Errors = Σ(Actual - Predicted)²
where
o Posterior: It is the probability of an event to occur; say, H, given that
another event; say, E has already occurred. i.e., P(H | E).
o Prior: It is the probability of an event H has occurred prior to another
event. i.e., P(H)
o Likelihood: It is a likelihood function in which some parameter
variable is marginalized.
Program
from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import BayesianRidge
# Load the data, fit a Bayesian ridge model, and score it on the test set
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = BayesianRidge().fit(X_train, y_train)
print("Test Set r2 score :", r2_score(y_test, model.predict(X_test)))
Output
Test Set r2 score : 0.7943355984883815
The goal is to minimize the cost as much as possible in order to find the
best fit line.
Learning Rate
The learning rate is a scalar factor that controls how much the coefficients
are updated in the direction that minimizes the error, for each pair of
input and output values.
The process is repeated until a minimum sum of squared errors is achieved or
no further improvement is possible.
Repeat this process until the cost function is very small (ideally 0).
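The update loop described above can be sketched for simple linear regression. The toy data and learning rate below are illustrative, not from the text:

```python
# Gradient descent for y = m*x + c on toy data (a sketch; the data and
# learning rate are illustrative assumptions).
X = [1, 2, 3, 4, 5]
Y = [3, 5, 7, 9, 11]        # generated by y = 2x + 1
m, c, lr, n = 0.0, 0.0, 0.01, len(X)

for _ in range(5000):
    # Gradients of the mean squared error with respect to m and c
    grad_m = (-2 / n) * sum(x * (y - (m * x + c)) for x, y in zip(X, Y))
    grad_c = (-2 / n) * sum(y - (m * x + c) for x, y in zip(X, Y))
    m -= lr * grad_m        # step in the direction that reduces the cost
    c -= lr * grad_c

print(round(m, 2), round(c, 2))  # converges to about 2 and 1
```

Each iteration moves the coefficients a small step against the gradient, so the cost shrinks until the fit is essentially exact on this noiseless data.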
4. Explain in detail about Linear Discriminant Functions and its types. Also
elaborate about logistic regression in detail.
g(x) = wᵀx + w0   (3.1)
where w is the weight vector and w0 the bias or threshold weight.
Decide ω1 if g(x) > 0, and ω2 if g(x) < 0   (3.2)
or equivalently, decide ω1 if wᵀx > -w0, and ω2 otherwise   (3.3)
In figure 3.7, the hyperplane H divides the feature space into two
half-spaces:
o Decision region R1 for class ω1
o Decision region R2 for class ω2.
The discriminant function g(x) gives an algebraic measure of the
distance from x to the hyperplane.
One way to express x is
x = x_p + r (w / ||w||)   (3.4)
where x_p is the normal projection of x onto H, and r is the desired
algebraic distance, which is positive if x is on the positive side and
negative if x is on the negative side. Then, because g(x_p) = 0,
g(x) = wᵀx + w0 = r ||w||   (3.5)
or
r = g(x) / ||w||   (3.6)
In particular, the distance from the origin to H is given by w0 / ||w||.
If w0 = 0, then g(x) has the homogeneous form
g(x) = wᵀx   (3.7)
and the hyperplane passes through the origin.
For the two-class case, the posterior probability of class C1 can be
written as a logistic sigmoid acting on a linear function of x:
p(C1 | x) = σ(wᵀx + w0)   (3.8)
where σ(a) = 1 / (1 + exp(-a))   (3.9)
For the multi-class case, the posterior probability of class Ck is given by a
softmax transformation of a linear function of x:
p(Ck | x) = exp(a_k) / Σ_j exp(a_j), where a_k = w_kᵀx + w_k0
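The softmax transformation can be sketched directly; the activation values below are illustrative:

```python
import math

# Softmax over the linear activations a_k = w_k^T x + w_k0 (a sketch;
# the activation values are illustrative assumptions).
def softmax(a):
    exps = [math.exp(v - max(a)) for v in a]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # the largest activation gets the largest probability
```

Subtracting the maximum activation before exponentiating leaves the result unchanged but avoids floating-point overflow for large activations.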
The cost is 0 if the predicted value and the actual value have the same
sign. If they do not, the loss value is calculated.
The objective of the regularization parameter C is to balance margin
maximization and loss.
After adding the regularization parameter, the cost function looks as
below:
J(w, b) = (1/2) ||w||² + C Σ_i max(0, 1 - y_i (wᵀx_i + b))
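The regularized cost described above can be computed as follows. The value of C and the toy points are illustrative assumptions:

```python
# Regularized hinge cost of a soft-margin SVM (a sketch; C and the toy
# points are illustrative assumptions).
def svm_cost(w, b, data, C=1.0):
    margin_term = 0.5 * sum(wi * wi for wi in w)   # (1/2) ||w||^2
    hinge = sum(max(0.0, 1 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
                for x, y in data)                  # hinge loss per point
    return margin_term + C * hinge

data = [([2.0, 1.0], +1), ([-1.5, -0.5], -1)]
print(svm_cost([0.5, 0.5], 0.0, data))  # 0.25: both points are outside the margin
```

Points classified correctly with margin at least 1 contribute zero loss, so only the ||w||² term remains for this choice of w.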
Different kernel functions can be specified for the decision function, and
it is possible to specify custom kernels.
6.6 Disadvantages
If the number of features is a lot bigger than the number of data points,
choosing kernel functions and regularization term is crucial.
SVMs don't directly provide probability estimates. Those are calculated
using an expensive five-fold cross-validation.
SVMs work best on small sample sets because of their high training time.
6.7 Applications
SVMs can be used to solve various real-world problems:
SVMs are helpful in text and hypertext categorization.
Classification of images can also be performed using SVMs.
Classification of satellite data like SAR data using supervised SVM.
Hand-written characters can be recognized using SVM.
The SVM algorithm has been widely applied in the biological and other
sciences. They have been used to classify proteins with up to 90% of the
compounds classified correctly.
Example:
Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the
decision tree starts with the root node (Salary attribute by ASM). The root
node splits further into the next decision node (distance from the office) and
one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer). Consider the below diagram: Refer fig 3.14
Decision Tree
7.7.1 Entropy:
Entropy is a metric to measure the impurity in a given
attribute.
Entropy is a measure of the randomness in the information
being processed.
The higher the entropy, the harder it is to draw any conclusions
from that information.
Flipping a coin is an example of an action that provides
random information.
Entropy can be calculated as:
Entropy(S) = - Σ_i p_i log2(p_i)
where p_i is the proportion of examples belonging to class i.
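The formula can be sketched directly; the class counts below are illustrative:

```python
import math

# Entropy of a set of examples: -sum(p_i * log2(p_i)) over the class
# proportions (a sketch with illustrative class counts).
def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([5, 5]))            # 1.0: a fair split has maximum impurity
print(round(entropy([8, 2]), 3))  # 0.722: a purer split has lower entropy
```

The `if c > 0` guard skips empty classes, since 0·log2(0) is taken to be 0 by convention.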
7.7.6 Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction
Detector.
It measures the statistical significance of the differences
between the sub-nodes and the parent node.
RANDOM FOREST
8.1 Random Forest
8.2 Steps in the working process of Random Forest
8.3 Need for Random Forest
8.4 Example:
8.5 Important Features of Random Forest
8.6 Applications of Random Forest
8.7 Advantages of Random Forest
8.8 Disadvantages of Random Forest
8.9 Difference between Decision Tree & Random Forest
In the above figure 3.16, the majority of the decision trees predict
apple rather than banana, so the final output of the random forest is
taken as apple.
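The majority-voting step can be sketched as follows; the individual tree outputs are illustrative:

```python
from collections import Counter

# Majority voting over individual tree predictions (a sketch; the tree
# outputs below are illustrative assumptions).
def majority_vote(predictions):
    return Counter(predictions).most_common(1)[0][0]

tree_outputs = ["apple", "apple", "banana", "apple"]
print(majority_vote(tree_outputs))  # apple
```

Each tree votes independently, and the class with the most votes becomes the forest's final prediction.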