AITools Unit 2
Eg. If you want your program to predict, for example, traffic patterns at a
busy intersection (task T), you can run it through a machine learning
algorithm with data about past traffic patterns (experience E) and, if it has
successfully "learned", it will then do better at predicting future traffic
patterns (performance measure P).
Eg. Suppose you are given a basket filled with different kinds of fruit.
The first step is to train the machine on all the different fruits, one by one,
like this:
If the shape of the object is rounded with a depression at the top and its
colour is red, then it is labelled as Apple.
If the shape of the object is a long curving cylinder and its colour is
green-yellow, then it is labelled as Banana.
Now suppose that, after training, you are given a new fruit from the basket,
say a banana, and asked to identify it.
Since the machine has already learned from the previous data, it must now
use that knowledge wisely. It will first classify the fruit by its shape and
colour, confirm the fruit as a BANANA, and put it in the banana
category. Thus the machine learns from the training data (the basket of
fruits) and then applies that knowledge to the test data (the new fruit).
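The training/testing idea above can be sketched in Python; the feature names and rules here are illustrative, not from any particular library:

```python
# A minimal sketch of the fruit example: "training" produces simple rules,
# and "testing" applies them to an unseen fruit.

def classify(fruit):
    """Apply the learned shape/colour rules to one fruit (a dict of features)."""
    if fruit["shape"] == "rounded" and fruit["colour"] == "red":
        return "Apple"
    if fruit["shape"] == "long curving cylinder" and fruit["colour"] == "green-yellow":
        return "Banana"
    return "unknown"

# Test data: a new, unseen fruit from the basket.
new_fruit = {"shape": "long curving cylinder", "colour": "green-yellow"}
print(classify(new_fruit))  # -> Banana
```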
The system does not figure out the right output, but it explores the data and
can draw inferences from datasets to describe hidden structures in
unlabeled data.
Eg.
Suppose the machine is given an image containing both dogs and cats that it
has never seen before. The machine has no idea about the features of dogs
and cats, so it cannot categorize them as dogs and cats on its own; it can
only group the images into clusters according to their similarities.
The systems that use this method are able to considerably improve learning
accuracy.
Eg.
The goal of reinforcement learning in this case is to train the dog (agent) to
complete a task within an environment, which includes the surroundings of
the dog as well as the trainer.
First, the trainer issues a command, which the dog observes (observation).
The dog then responds by taking an action. If the action is close to the
desired behavior, the trainer will likely provide a reward, such as a food
treat; otherwise, no reward or a negative reward will be provided.
At the beginning of training, the dog will likely take more random actions,
like rolling over when the command given is "sit", as it tries to associate
specific observations with actions and rewards. This association, or
mapping, between observations and actions is called the policy.
From the dog's perspective, the ideal case would be one in which it responds
correctly to every command, so that it gets as many treats as
possible.
So, the whole point of reinforcement learning training is to "tune" the
dog's policy so that it learns the desired behaviors that maximize the
reward. After training is complete, the dog should be able to observe the
owner and take the appropriate action, for example sitting when
commanded to "sit", by using the internal policy it has developed.
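The reward-driven "tuning" of a policy can be sketched as a toy Python loop; the actions, reward values, and learning rate below are made-up illustrations of the dog example, not a complete RL algorithm:

```python
import random

random.seed(0)

# Toy sketch of the dog example: one observation (the "sit" command), two
# candidate actions. The "policy" is a table of action values, tuned by
# rewards with a simple bandit-style update.
actions = ["sit", "roll over"]
q = {a: 0.0 for a in actions}   # estimated value of each action
alpha = 0.5                     # learning rate (illustrative value)

def reward(action):
    """A treat (+1) only for the desired behavior."""
    return 1.0 if action == "sit" else 0.0

for episode in range(50):
    # explore occasionally, otherwise exploit the current policy
    if random.random() < 0.1:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda x: q[x])
    q[a] += alpha * (reward(a) - q[a])   # move estimate toward observed reward

policy = max(actions, key=lambda x: q[x])
print(policy)  # the learned policy prefers "sit"
```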
Q) Differentiate ML Vs Classical/Traditional Algorithms.
Traditional Programming
Traditional programming is a manual process—meaning a programmer
creates the program.
Machine Learning
Unlike traditional programming, machine learning is an automated
process: the ML algorithm automatically formulates the rules from the data.
ML vs Classical Algorithms:
Data sets are made up of data objects. A data object represents an entity in
a database table. Data objects are typically described by attributes. Data
objects can also be referred to as samples, examples, instances, data points,
or objects. If the data objects are stored in a database, they are data tuples.
That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
Classification:
Classification is a type of supervised learning. It specifies the class to which
data elements belong and is best used when the output has finite and
discrete values. It predicts a class for an input variable; here the
class label is categorical.
Similarly, the form of the output (dependent variable or response variable) can
in principle be anything, but most methods assume that yi is a categorical or
nominal variable from some finite set, yi ∈ {1, . . . , C}, where C is the number
of classes.
We assume y = f(x) for some unknown function f, and the goal of learning is
to estimate the function f given a labelled training set, and then to make
predictions using ŷ = f̂(x). (We use the hat symbol to denote an estimate.)
Our main goal is to make predictions on new inputs, meaning ones that we
have not seen before; this is called generalization. Predicting the response on
the training set is easy, since we can just look up the answer.
In the above figure, the yellow circle is harder to classify, since some yellow
things are labeled y = 1 and some are labeled y = 0, and some circles are labeled
y = 1 and some y = 0.
Consequently it is not clear what the right label should be for the
yellow circle. Similarly, the correct label for the blue arrow is unclear.
This corresponds to the most probable class label, and is called the mode of
the distribution p(y|x,D); it is also known as a MAP estimate (MAP stands for
maximum a posteriori).
Real-world applications of classification:
Classification is probably the most widely used form of machine learning,
and has been used to solve many interesting and often difficult real-world
problems.
Regression:
In the context of data mining, x and y are numeric database attributes. The
coefficients, w and b (called regression coefficients), specify the slope of the
line and the y-intercept, respectively.
Eg. Multiple linear regression takes the form
y = a + b1x1 + b2x2 + ... + bnxn
where,
y is the response variable.
a, b1, b2, ... bn are the coefficients.
x1, x2, ... xn are the predictor variables.
Eg. Prediction of height based on age and gender.
Features (X): age, gender
Class (y): height
Regression equation will be: height = w1*age + w2*gender + b
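The height-regression equation above can be evaluated in a short sketch; the weights and the gender encoding below are hypothetical values chosen purely for illustration, not fitted to real data:

```python
# A sketch of the height-regression example; the coefficients are made-up
# illustrative values (cm per year, cm, cm), not learned from data.
w1, w2, b = 6.5, 8.0, 55.0

def predict_height(age, gender):
    """gender encoded as 0 = female, 1 = male (an illustrative encoding)."""
    return w1 * age + w2 * gender + b

print(predict_height(10, 1))  # 6.5*10 + 8.0*1 + 55.0 = 128.0
```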
Eg. KNN
k-Nearest-Neighbor Classifiers
Algorithm:
◦ Initialize the value of k.
◦ Calculate the distance between the test data and each row of the training
data. Here we use Euclidean distance as our distance metric, since it's
the most popular method. Other metrics that can be used are
Manhattan, cosine, etc.
◦ Sort the calculated distances in ascending order based on distance
values.
◦ Get the top k rows from the sorted array.
◦ Get the most frequent class of these rows.
◦ Return the predicted class.
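The steps above can be sketched as a minimal Python implementation; the toy training points and class names are made up for illustration:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train_X, train_y, test_point, k=3):
    """Predict the class of test_point from its k nearest labelled neighbours."""
    # Steps 1-2: distance from the test point to every training row
    distances = [(dist(x, test_point), y) for x, y in zip(train_X, train_y)]
    # Step 3: sort ascending by distance; Step 4: keep the top k rows
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Steps 5-6: return the most frequent class among them
    return Counter(y for _, y in nearest).most_common(1)[0][0]

# Illustrative toy data (made-up points, two classes)
train_X = [(1, 1), (1.5, 2), (5, 7), (4.5, 5), (3.5, 4.5)]
train_y = ["A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, (1.2, 1.5), k=3))  # -> A
```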
Eg.
Advantages:
Quick calculation time
Simple algorithm, easy to interpret
Useful for both regression and classification
No assumptions about the data: no need to make additional
assumptions, tune several parameters, or build a model. This
makes it especially useful in the nonlinear data case.
Disadvantages:
Accuracy depends on the quality of the data
With large data, the prediction stage might be slow
Require high memory – need to store all of the training data
Eg. Entering high school students make program choices among general
program, vocational program and academic program. Their choice might be
modeled using their writing score and their social economic status.
Unlike supervised learning, we are not told what the desired output is for
each input. Instead, we will formalize our task as one of density estimation,
that is, we want to build models of the form p(xi|θ).
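As a minimal sketch of density estimation, one can fit a one-dimensional Gaussian p(x|θ) with θ = (μ, σ) by maximum likelihood; the data values below are made up:

```python
from math import sqrt, pi, exp

# Density estimation sketch: fit a 1-d Gaussian p(x | mu, sigma) to unlabeled
# data by maximum likelihood (sample mean and standard deviation).
data = [1.6, 1.7, 1.7, 1.8, 1.9]   # e.g. heights in metres (made-up values)
mu = sum(data) / len(data)
sigma = sqrt(sum((x - mu) ** 2 for x in data) / len(data))

def p(x):
    """Estimated Gaussian density p(x | mu, sigma)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(round(mu, 2), round(sigma, 3))
```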
The above figure plots some 2d data, representing the height and weight of
a group of 210 people. It seems that there might be various clusters, or
subgroups, although it is not clear how many. Let K denote the number of
clusters. Our first goal is to estimate the distribution over the number of
clusters, p(K|D); this tells us if there are subpopulations within the data. For
simplicity, we often approximate the distribution p(K|D) by its mode,
K∗ = arg maxK p(K|D).
In the supervised case, we were told that there are two classes (male and
female), but in the unsupervised case, we are free to choose as many or few
clusters as we like. Picking a model of the “right” complexity is called
model selection.
Our second goal is to estimate which cluster each point belongs to. Let
zi ∈ {1, . . . , K} represent the cluster to which data point i is assigned.
(zi is an example of a hidden or latent variable, since it is never observed
in the training set.)
PCA in particular, has been applied in many different areas. Some examples
include the following:
• In biology, it is common to use PCA to interpret gene microarray data, to
account for the fact that each measurement is usually the result of many
genes which are correlated in their behavior because they belong to the
same biological pathways.
• In natural language processing, it is common to use a variant of PCA called
latent semantic analysis for document retrieval.
• In signal processing (e.g., of acoustic or neural signals), it is common to use
ICA (which is a variant of PCA) to separate signals into their different sources.
• In computer graphics, it is common to project motion capture data to a low
dimensional space, and use it to create animations.
As with unsupervised learning in general, there are two main applications for
learning sparse graphs: to discover new knowledge, and to get better joint
probability density estimators.
4. Matrix Completion
Sometimes we have missing data, that is, variables whose values are
unknown.
For example, we might have conducted a survey, and some people might not
have answered certain questions. Or we might have various sensors, some of
which fail.
The corresponding design matrix will then have "holes" in it; these missing
entries are often represented by NaN, which stands for "not a number". The
goal of imputation is to infer plausible values for the missing entries; this is
also called matrix completion.
1. Image inpainting:
The goal is to “fill in” holes (e.g., due to scratches or occlusions) in an
image with realistic texture: we denoise the image as well as impute
the pixels hidden behind the occlusion. This can be tackled by building a
joint probability model of the pixels, given a set of clean images, and then
inferring the unknown variables (pixels) given the known variables (pixels).
Fig. (a) A noisy image with an occluder. (b) An estimate of the underlying
pixel intensities, based on a pairwise MRF model.
2. Collaborative filtering:
Eg. Predicting which movies people will want to watch based on how they,
and other people, have rated movies which they have already seen.
Fig. movie-rating data. Training data is in red, test data is denoted by ?,
empty cells are unknown.
If the count of an itemset satisfies a minimum support threshold, then the
itemset is said to be a frequent itemset.
Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional or relational data sets.
A typical example of frequent itemset mining is market basket analysis.
This process analyzes customer buying habits by finding associations
between the different items that customers place in their "shopping
baskets".
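Support counting for market basket analysis can be sketched as follows; the transactions and the support threshold are made-up illustrations:

```python
from itertools import combinations
from collections import Counter

# Toy market-basket data: each basket is the set of items one customer bought.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
min_support = 2  # minimum number of baskets an itemset must appear in

# Count every 2-itemset's support (how many baskets contain both items)
counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        counts[pair] += 1

# Keep only the frequent itemsets
frequent = {pair: n for pair, n in counts.items() if n >= min_support}
print(frequent)
```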
Algorithm: The k-means algorithm for partitioning, where each cluster's center is
represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for
each cluster;
(5) until no change;
Eg. Consider the following 2-dimensional data objects:
Sl.No. A B
1 1 1
2 1.5 2
3 3 4
4 5 7
5 3.5 5
6 4.5 5
7 3.5 4.5
Let us start with random k=2 and the centroids be c1= (1,1) and c2= (5,7)
Now we find the Euclidean distance from all the given points to the centroids taken and
is as shown below:
From the above table it is clear that the points (1,1), (1.5,2) and (3,4) fall in
the 1st cluster and the remaining points fall in the 2nd cluster.
From the above table it is clear that the points (1,1) and (1.5,2) fall in the
1st cluster and the remaining points into the 2nd cluster.
Since we get different points into the cluster compared to the previous step we continue
the process and update the centroids as follows:
C1 = [(1+1.5)/2, (1+2)/2] = (1.25, 1.5)
C2 = [(3+5+3.5+4.5+3.5)/5, (4+7+5+5+4.5)/5] = (3.9, 5.1)
Now again we calculate the distance between all the given points and the updated C1 and
C2 as follows.
From the above table it is evident that the points in the 1st cluster and 2nd
cluster are the same as in the previous step.
So we stop the iterations, and the final clusters are cluster1 = {(1,1), (1.5,2)}
and cluster2 = {(3,4), (5,7), (3.5,5), (4.5,5), (3.5,4.5)}.
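The iterations above can be reproduced with a short k-means sketch on the same seven points and the same initial centers; ties are broken toward the first center, matching the worked example:

```python
from math import dist

# k-means on the seven 2-d points above, with initial centers c1=(1,1), c2=(5,7).
points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers = [(1, 1), (5, 7)]

while True:
    # (3) (re)assign each object to the cluster with the nearest center
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    # (4) update the cluster means
    new_centers = [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]
    # (5) until no change
    if new_centers == centers:
        break
    centers = new_centers

print(clusters[0])  # [(1, 1), (1.5, 2)]
print(centers[0])   # (1.25, 1.5)
```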
Strengths
–Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations. Normally, k, t << n.
–Often terminates at a local optimum. The global optimum may be found
using techniques such as simulated annealing and genetic algorithms
Weaknesses
–Applicable only when mean is defined.
–Need to specify k, the number of clusters, in advance
–Trouble with noisy data and outliers
–Not suitable to discover clusters with non-convex shapes.
A learning model that summarizes data with a set of parameters of fixed size
(independent of the number of training examples) is called a parametric
model. No matter how much data you throw at a parametric model, it won't
change its mind about how many parameters it needs.
b0 + b1*x1 + b2*x2 = 0
Where b0, b1 and b2 are the coefficients of the line that control the intercept
and slope, and x1 and x2 are two input variables.
Some more examples of popular parametric machine learning algorithms
are:
Naive Bayes
Logistic Regression
Limitations of nonparametric methods:
More data: Require a lot more training data to estimate the mapping
function.
Slower: A lot slower to train as they often have far more parameters to
train.
Overfitting: More of a risk to overfit the training data and it
is harder to explain why specific predictions are made.
Q) Briefly explain Semi supervised learning.
Semi-supervised machine learning is a combination
of supervised and unsupervised machine learning methods.
With the more common supervised machine learning methods,
you train a machine learning algorithm on a "labeled" dataset in which each
record includes the outcome information.
This allows the algorithm to deduce patterns and identify relationships
between your target variable and the rest of the dataset based on information
it already has.
In contrast, unsupervised machine learning algorithms learn from a dataset
without the outcome variable.
In semi-supervised learning, an algorithm learns from a dataset that
includes both labeled and unlabeled data, usually mostly unlabeled.
Need: When you don't have enough labeled data to produce an accurate
model, and you don't have the ability or resources to get more data, you can
use semi-supervised techniques to increase the size of your training data.
Consider the sample graph given below, where we have 2 label classes (red and green)
and 4 coloured nodes (2 for each class). We want to predict the label of node 4.
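A minimal sketch of graph-based label spreading, assuming a hypothetical edge list (the figure's actual graph is not reproduced here): each unlabelled node takes the majority label of its labelled neighbours.

```python
from collections import Counter

# Made-up graph for illustration: adjacency list and 4 labelled nodes.
edges = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3, 5], 5: [4]}
labels = {1: "red", 2: "red", 3: "green", 4: "green"}
unlabelled = [5]

# Propagate: assign each unlabelled node the majority label of its
# labelled neighbours.
for node in unlabelled:
    neighbour_labels = [labels[n] for n in edges[node] if n in labels]
    labels[node] = Counter(neighbour_labels).most_common(1)[0][0]

print(labels[5])  # node 5's only labelled neighbour (node 4) is green
```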
Fig. Illustration of RL
Agent
Agents are the software programs that make intelligent decisions and they
are basically learners in RL.
Agents take action by interacting with the environment and they receive
rewards based on their actions.
Eg. Mario navigating in a video game.
Eg. 1: A human agent has eyes, ears, and other organs for sensors and
hands, legs, vocal tract, and so on for actuators.
Eg. 2: A robotic agent might have cameras and infrared range finders for
sensors and various motors for actuators.
Eg. 3: A software agent receives keystrokes, file contents, and network
packets as sensory inputs and acts on the environment by displaying on the
screen, writing files, and sending network packets.
Eg. 4: A vacuum cleaner that cleans blocks A & B. A reward of +1 is given
if it sucks up dust, a reward of 0 if it stays in the same block, and -1 if it
moves from one block to another.
The elements of RL are:
Fig. Elements of RL
Policy function
A policy defines the agent's behavior in an environment. The way in which
the agent decides which action to perform depends on the policy.
A policy is often denoted by the symbol 𝛑. A policy can be in the form of a
lookup table or a complex search process.
Eg. If you want to reach your office from home; there will be different routes
to reach your office, and some routes are shortcuts, while some routes are
long. These routes are called policies because they represent the way in
which we choose to perform an action to reach our goal.
Value function
A value function denotes how good it is for an agent to be in a
particular state. It is dependent on the policy and is often denoted by
v(s). It is equal to the total expected reward received by the agent
starting from the initial state.
There can be several value functions. The optimal value function is the
one that has the highest value for all the states compared to other
value functions.
Similarly, an optimal policy is the one that has the optimal value
function.
Model
A model is the agent's representation of the environment. The learning can be
of two types:
In model-based learning, the agent exploits previously learned
information to accomplish a task. Eg. Epsilon-Greedy approach, random
selection approach.
whereas in model-free learning, the agent simply relies on a trial-and-error
experience for performing the right action. Eg. Q-learning or policy gradient.
Eg. Suppose you want to reach your office from home faster. In model-based
learning, you simply use previously learned experience (a map) to reach the
office faster, whereas in model-free learning you do not use previous
experience; you try all the different routes and choose the fastest one.
If the environment itself does not change with the passage of time but the
agent‘s performance score does, then we say the environment is
semidynamic.
Eg. Chess, when played with a clock, is semidynamic.
x 0 1 2 3 4
y 1 1.8 3.3 4.5 6.3
Sol:
Let the required linear equation be y = a1x + a0.
We know the normal equations for a least-squares straight-line fit are:
Σy = m·a0 + a1·Σx
Σxy = a0·Σx + a1·Σx²
Here m = 5 (the number of data points).
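Substituting the tabulated data into the least-squares normal equations (Σy = m·a0 + a1·Σx and Σxy = a0·Σx + a1·Σx²) can be sketched as:

```python
# Least-squares fit of y = a1*x + a0 to the tabulated data.
xs = [0, 1, 2, 3, 4]
ys = [1, 1.8, 3.3, 4.5, 6.3]
m = len(xs)

sx = sum(xs)                               # Σx  = 10
sy = sum(ys)                               # Σy  = 16.9
sxy = sum(x * y for x, y in zip(xs, ys))   # Σxy = 47.1
sxx = sum(x * x for x in xs)               # Σx² = 30

# Solving the two normal equations simultaneously:
a1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
a0 = (sy - a1 * sx) / m
print(round(a1, 2), round(a0, 2))  # 1.33 0.72, i.e. y = 1.33x + 0.72
```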
Q) Fit a curve of the type y = ae^(bx) to the following data
x 0 1 2 3
We know that taking logarithms gives ln y = ln a + bx, which is linear in x,
so the straight-line normal equations apply with Y = ln y and A = ln a.
And m = 4 (the number of data points).
Q) Define Correlation.
Correlation is a measure of the strength of the linear relationship between
two quantitative variables (e.g., height, weight).
Eg. Positive correlation: the more you exercise, the more calories you
burn.
Q) Define Covariance.
The covariance of two variables x and y in a dataset measures how they are
linearly related. A positive covariance indicates a positive linear
relationship between the variables, and a negative covariance indicates the
opposite.
Eg. Find the covariance between x and y where x = (2.1, 2.2, 3.6, 4.0) and y
= (8, 10, 12, 14).
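A sketch of the computation for this example, showing both the population (divide by n) and sample (divide by n-1) conventions, since texts differ on the divisor:

```python
# Covariance of x and y: average product of deviations from the means.
x = [2.1, 2.2, 3.6, 4.0]
y = [8, 10, 12, 14]
n = len(x)

mx = sum(x) / n   # mean of x = 2.975
my = sum(y) / n   # mean of y = 11.0

s = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # Σ(x-x̄)(y-ȳ) = 7.1
cov_population = s / n        # divide by n     -> 1.775
cov_sample = s / (n - 1)      # divide by n - 1 -> ~2.367
print(round(cov_population, 3), round(cov_sample, 3))
```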