ML_Unit-1 _PDF

Uploaded by

NarasimY

Unit-1

What is Machine Learning?, Examples of machine learning applications, Supervised
Learning: learning a class from examples, Vapnik-Chervonenkis dimension, probably
approximately correct learning, noise, learning multiple classes, regression, model selection
and generalization, dimensions of a supervised machine learning algorithm.
Decision Tree Learning: Introduction, Decision Tree representation, Appropriate problems
for decision tree learning, the basic decision tree learning algorithm, Hypothesis space
search in decision tree learning, Inductive bias in decision tree learning, issues in decision
tree learning

Introduction ML:
Machine learning is a rapidly developing field of technology that allows computers to
learn automatically from previous data. It employs a variety of algorithms to build
mathematical models and make predictions based on historical data or information. It is
currently used for a variety of tasks, including speech recognition, email filtering,
auto-tagging on Facebook, recommender systems, and image recognition.
Machine learning is a subset of artificial intelligence that focuses primarily on the
creation of algorithms that enable a computer to learn independently from data and
previous experience.
Machine learning enables a machine to learn automatically from data, improve its
performance with experience, and make predictions without being explicitly programmed.
Machine learning algorithms build a mathematical model from sample historical data,
known as training data, that helps in making predictions or decisions without being
explicitly programmed. To develop predictive models, machine learning brings together
statistics and computer science. In machine learning, algorithms that learn from
historical data are either constructed or applied.

What is Machine Learning


In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which work
on our instructions. But can a machine also learn from experiences or past data like a
human does? So here comes the role of Machine Learning.

How does Machine Learning work


A machine learning system learns from previous data, builds prediction models, and
predicts the output whenever it receives new data. The accuracy of the predicted output
depends on the amount of data: more data generally helps to build a better model, which
predicts the output more accurately.

Let's say we have a complex problem in which we need to make predictions. Instead of
writing code for it by hand, we feed the data to generic algorithms, which build the logic
based on the data and predict the output. Machine learning has changed how we approach
such problems. The Machine Learning algorithm's operation is depicted in the
following block diagram:

S E SURESH, MCA, AP SET ML UNIT 1 Page 1

Features of Machine Learning

o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is closely related to data mining, as both deal with huge
amounts of data.

Need for Machine Learning


o The demand for machine learning is steadily rising.
o Machine learning is needed because it can perform tasks that are too complex for a
person to implement directly.
o Humans cannot manually process vast amounts of data; as a result, we need computer
systems, and this is where machine learning comes in to simplify our lives.
o We can train machine learning algorithms by providing them with a large amount of
data and letting them automatically explore the data, build models, and predict the
required output.
o The performance of a machine learning algorithm depends on the amount of data, and
it can be measured with a cost function.
o We can save both time and money by using machine learning.
Following are some key points which show the importance of Machine Learning:
o Rapid increase in the production of data
o Solving complex problems that are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data.
Examples of Machine Learning Applications

1. Learning Associations
Association rule mining finds interesting associations and relationships among large sets of
data items. An association rule shows how frequently an itemset occurs in a transaction. A typical
example is Market Basket Analysis.

Market Basket Analysis is one of the key techniques used by large retailers to show
associations between items. It allows retailers to identify relationships between the items
that people frequently buy together.

Given a set of transactions, we can find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction.



TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Before we define the rule, let us first see the basic definitions.
Support Count (σ) – Frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2
Frequent Itemset – An itemset whose support is greater than or equal to the minsup threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y are any two
disjoint itemsets.
Example: {Milk, Diaper} -> {Beer}

Rule Evaluation Metrics –

o Support(s) –
The number of transactions that include all items in both the {X} and {Y} parts of the
rule, as a fraction of the total number of transactions. It measures how frequently
the collection of items occurs together across all transactions.
Support(X=>Y) = σ(X ∪ Y) / |T|
where σ(X ∪ Y) is the number of transactions containing every item of X and Y, and |T|
is the total number of transactions. It is interpreted as the fraction of transactions
that contain both X and Y.
o Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y}
to the number of transactions that include all items in {X}.
Conf(X=>Y) = Supp(X ∪ Y) / Supp(X)
It measures how often the items in Y appear in transactions that also contain the items
in X.
o Lift(l) –
The lift of the rule X=>Y is the confidence of the rule divided by the expected
confidence, assuming that the itemsets X and Y are independent of each other. Under
independence, the expected confidence is simply the support (frequency) of {Y}.
Lift(X=>Y) = Conf(X=>Y) / Supp(Y)
A lift value near 1 indicates that X and Y appear together about as often as expected, greater
than 1 means they appear together more than expected, and less than 1 means they
appear together less than expected. Greater lift values indicate stronger association.

Example – From the above table, {Milk, Diaper}=>{Beer}
(σ counts the transactions that contain the itemset, and |T| is the number of transactions.)

s = σ({Milk, Diaper, Beer}) / |T|
= 2/5
= 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper})
= 2/3
= 0.67

l = Supp({Milk, Diaper, Beer}) / (Supp({Milk, Diaper}) * Supp({Beer}))
= 0.4/(0.6*0.6)
= 1.11
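
The three metrics can be checked with a few lines of Python. This is a minimal sketch and not part of the original notes: the function names support, confidence and lift are chosen here for illustration, and the transactions are the five rows of the table above.

```python
# The five transactions from the table above, one set of items per TID.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(x, y):
    """Supp(X U Y) / Supp(X) for the rule X => Y."""
    return support(x | y) / support(x)

def lift(x, y):
    """Conf(X => Y) / Supp(Y)."""
    return confidence(x, y) / support(y)

x, y = {"Milk", "Diaper"}, {"Beer"}
print(round(support(x | y), 2), round(confidence(x, y), 2), round(lift(x, y), 2))
# prints: 0.4 0.67 1.11
```

The printed values agree with the hand calculation for {Milk, Diaper}=>{Beer}.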

Association rules are very useful in analyzing datasets. The data is collected using barcode
scanners in supermarkets. Such databases consist of a large number of transaction
records, each listing all items bought by a customer in a single purchase. So the manager
can know whether certain groups of items are consistently purchased together and use this data
for adjusting store layouts, cross-selling, and promotions based on statistics.

The Apriori algorithm is also used in association rule mining for discovering frequent itemsets
in the transactions database. It was proposed by Agrawal & Srikant in 1994.

Exercise:
A customer makes 4 transactions with you. In the first transaction, she buys 1 apple, 1 beer,
1 rice, and 1 chicken. In the second transaction, she buys 1 apple, 1 beer, and 1 rice. In the
third transaction, she buys 1 apple and 1 beer only. In the fourth transaction, she buys 1 apple
and 1 orange.

Support(Apple) = 4/4

So, Support of {Apple} is 4 out of 4 or 100%

Confidence(Apple -> Beer) = Support(Apple, Beer)/Support(Apple)
= (3/4)/(4/4)
= 3/4

So, Confidence of {Apple -> Beer} is 3 out of 4 or 75%

Lift(Beer -> Rice) = Support(Beer, Rice)/(Support(Beer) * Support(Rice))
= (2/4)/((3/4) * (2/4))
= 1.33

So, the Lift value is greater than 1, which implies that Rice is likely to be bought if Beer is bought.
The Dataset
The Market Basket dataset consists of 15010 observations with four features (columns): Date,
Time, Transaction and Item. The Date column ranges from 30/10/2016 to
09/04/2017. Time is a categorical variable that records the time of the transaction.
Transaction is a quantitative variable that distinguishes one transaction from another.
Item is a categorical variable that identifies the product.

# Loading data (read.transactions is provided by the arules package)
library(arules)
dataset = read.transactions("C:/Users/admin/Documents/Market_Basket_Optimisation.csv",
                            sep = ",", rm.duplicates = TRUE)

# Structure
str(dataset)

Performing Association Rule Mining on Dataset:

Using the Association Rule Mining algorithm on the dataset which includes 15010
observations.

# Installing Packages
install.packages("arules")
install.packages("arulesViz")

# Loading package
library(arules)
library(arulesViz)

# Fitting model
# Training Apriori on the dataset
set.seed(220) # Setting seed (set.seed is a function call, not an assignment)
associa_rules = apriori(data = dataset, parameter = list(support = 0.004, confidence = 0.2))

# Plot
itemFrequencyPlot(dataset, topN = 10)

# Visualising the results
inspect(sort(associa_rules, by = 'lift')[1:10])
plot(associa_rules, method = "graph", measure = "confidence", shading = "lift")

Supervised Learning
Supervised learning is a type of machine learning in which machines are trained using
well-"labeled" training data, and on the basis of that data, machines predict the output.
Labeled data means that the input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as the supervisor
that teaches the machine to predict the output correctly. It applies the same concept as a
student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to
the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the output variable(y).

In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.

How Supervised Learning Works?


In supervised learning, models are trained using a labeled dataset, where the model learns
about each type of data. Once the training process is completed, the model is tested on
test data (a held-out portion of the labeled data that was not used for training), and then
it predicts the output.

The working of Supervised learning can be easily understood by the below example and
diagram:


Steps Involved in Supervised Learning:
o First, determine the type of training dataset.
o Collect/gather the labeled training data.
o Split the labeled dataset into a training dataset, a validation dataset, and a test
dataset.
o Determine the input features of the training dataset, which should carry enough
information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as support vector machine,
decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we also need a validation set,
held out from the training data, to tune the control parameters.
o Evaluate the accuracy of the model on the test set. If the model predicts
the correct outputs for the test set, our model is accurate.
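
The steps above can be sketched in Python. This is only an illustration under stated assumptions: the data is made up, the 60/20/20 split is one common choice and not prescribed by these notes, and a 1-nearest-neighbour rule stands in for "a suitable algorithm". All function names are hypothetical.

```python
import random

def split_dataset(rows, seed=0, train_pct=60, val_pct=20):
    """Shuffle the labeled rows and cut them into train/validation/test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * train_pct // 100
    n_val = len(rows) * val_pct // 100
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def predict_1nn(train_rows, x):
    """Return the label of the training example whose feature is closest to x."""
    nearest = min(train_rows, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train_rows, eval_rows):
    """Fraction of eval_rows whose predicted label matches the true label."""
    correct = sum(1 for x, label in eval_rows if predict_1nn(train_rows, x) == label)
    return correct / len(eval_rows)

# Made-up labeled data: (feature value, label).
labeled = [(x, "low") for x in range(10)] + [(x, "high") for x in range(20, 30)]
train_set, val_set, test_set = split_dataset(labeled)
val_accuracy = accuracy(train_set, val_set)    # would guide the choice of settings
test_accuracy = accuracy(train_set, test_set)  # final check on unseen data
```

The two classes here are far apart on purpose, so the nearest training example always carries the right label; real data is rarely this clean.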

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems:

2. Classification
Classification is a task in data mining that involves assigning a class label to each instance
in a dataset based on its features.

The goal of classification is to build a model that accurately predicts the class labels of new
instances based on their features.

There are two main types of classification: binary classification and multi-class
classification.
Binary classification involves classifying instances into two classes, such as “spam” or “not
spam”, while multi-class classification involves classifying instances into more than two
classes.

The process of building a classification model typically involves the following steps:
i. Data preparation: This step involves cleaning and pre-processing the data, such as
removing missing values and transforming the data into a format that can be used
by the classification algorithm.
ii. Model selection: This step involves choosing an appropriate classification algorithm
based on the characteristics of the data and the desired outcome. Common
algorithms include decision trees, k-nearest neighbors, and support vector
machines.
iii. Model training: This step involves using the training data to train the classification
algorithm and build the model. The model is trained by adjusting its parameters to
minimize the difference between the predicted class labels and the actual class
labels.
iv. Model evaluation: This step involves evaluating the performance of the
classification model on a test dataset that is separate from the training data. This
can be done by calculating metrics such as accuracy, precision, recall, and F1-
score.
v. Model deployment: This step involves deploying the classification model in a
production environment, where it can be used to make predictions on new
instances.
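
The evaluation metrics named in step iv can be computed directly from the actual and predicted labels. The sketch below is illustrative only: the function name classification_metrics and the spam/ham labels are made up for this example.

```python
def classification_metrics(actual, predicted, positive):
    """Accuracy, precision, recall and F1-score for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up test labels and model predictions.
actual    = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
predicted = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]
metrics = classification_metrics(actual, predicted, positive="spam")
```

Here 3 spam messages are caught, 1 is missed and 1 ham message is wrongly flagged, so all four metrics come out as 0.75.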

Example: Credit scoring
■ Differentiating between low-risk and high-risk customers from their income and
savings

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

Example of a training dataset where each circle corresponds to one data instance with
input values in the corresponding axes and its sign indicates the class. For simplicity, only
two customer attributes, income and savings, are taken as input and the two classes are
low-risk (‘+’) and high-risk (‘−’). An example discriminant that separates the two types of
examples is also shown.
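
The discriminant rule can be written as a small function. This is a sketch only: the threshold values below are hypothetical, whereas in practice θ1 and θ2 would be learned from the training data.

```python
def credit_risk(income, savings, theta1, theta2):
    """Apply the rule: low-risk only if both values exceed their thresholds."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"

# Hypothetical thresholds and customers, for illustration only.
THETA1, THETA2 = 50_000, 10_000
customers = [(72_000, 25_000), (72_000, 4_000), (30_000, 25_000)]
labels = [credit_risk(inc, sav, THETA1, THETA2) for inc, sav in customers]
```

Only the first customer exceeds both thresholds, so only that customer is labeled low-risk.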

Classification: Applications
■ Also known as pattern recognition
■ Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style
■ Character recognition: Different handwriting styles.
■ Speech recognition: Temporal dependency.
   o Use of a dictionary or the syntax of the language.
   o Sensor fusion: Combine multiple modalities; e.g., visual (lip image) and acoustic for
   speech
■ Medical diagnosis: From symptoms to illnesses
■ Biometrics

Knowledge Extraction
Learning a rule from data also allows knowledge extraction.
The rule is a simple model that explains the data, and by looking at this model we
have an explanation of the process underlying the data.
For example, once we learn the discriminant separating low-risk and high-risk customers,
we have the knowledge of the properties of low-risk customers.
We can then use this information to target potential low-risk customers more efficiently, for
example, through advertising.

3. Regression analysis
It is a statistical method to model the relationship between a dependent (target) variable
and one or more independent (predictor) variables. More specifically,
Regression analysis helps us to understand how the value of the dependent variable
changes with respect to one independent variable when the other independent variables are
held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which runs various advertisements every
year and earns sales from them. The below list shows the advertisement made by the company in
the last 5 years and the corresponding sales:

Now, the company wants to spend $200 on advertisement in the year 2019 and wants to
know the predicted sales for this year. To solve this type of prediction
problem in machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the correlation
between variables and enables us to predict a continuous output variable based on
one or more predictor variables. It is mainly used for prediction, forecasting, time series
modeling, and determining the cause-effect relationship between variables.

In Regression, we fit a line or curve to the given datapoints;
using this fit, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes as close as possible to the datapoints
on the target-predictor graph, in such a way that the vertical distance between the
datapoints and the regression line is minimum." The distance between the datapoints and the
line tells whether a model has captured a strong relationship or not.
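
One common way to make the vertical distances small is ordinary least squares, which minimizes the sum of squared vertical distances. The sketch below fits a straight line this way. The advertisement and sales figures are made up for illustration (they are not the company's data from the example above), and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: advertisement spend and the sales observed for it.
advertisement = [90, 100, 120, 130, 150]
sales = [1000, 1100, 1300, 1400, 1600]

slope, intercept = fit_line(advertisement, sales)
predicted_sales = slope * 200 + intercept   # prediction for a spend of 200
```

Because the made-up points lie exactly on the line sales = 10 * advertisement + 100, the fit recovers that line and predicts 2100 for a spend of 200. Real data would scatter around the line instead.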

Some examples of regression are:
o Prediction of rain using temperature and other factors
o Determining market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:
o Dependent Variable: The main factor in Regression analysis which we want to
predict or understand is called the dependent variable. It is also called the target
variable.
o Independent Variable: The factors which affect the dependent variable or which
are used to predict its values are called independent
variables, also called predictors.
o Outliers: An outlier is an observation which has either a very low or a very high
value in comparison to other observed values. An outlier may distort the result, so
it should be handled with care.
o Multicollinearity: If the independent variables are more highly correlated with each other
than with other variables, the condition is called Multicollinearity. It should not be
present in the dataset, because it creates problems while ranking the most influential
variable.
o Underfitting and Overfitting: If our algorithm works well with the training dataset
but not with the test dataset, the problem is called Overfitting. If our
algorithm does not perform well even with the training dataset, the problem is
called Underfitting.

Why do we use Regression Analysis?


As mentioned above, Regression analysis helps in the prediction of a continuous variable.
There are various scenarios in the real world where we need future predictions, such
as weather conditions, sales, and marketing trends. For such cases we need a
technique that can make accurate predictions, and Regression analysis, a
statistical method used in machine learning and data science, serves this purpose.
Below are some other reasons for using Regression analysis:
o Regression estimates the relationship between the target and the independent
variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.

o By performing the regression, we can confidently determine the most important
factor, the least important factor, and how each factor is affecting the other
factors.

Types of Regression
There are various types of regressions which are used in data science and machine
learning. Each type has its own importance on different scenarios, but at the core, all the
regression methods analyze the effect of the independent variable on dependent variables.
Here we are discussing some important types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Unsupervised Learning
It is a machine learning technique in which models are not supervised using a training
dataset. Instead, the models themselves find the hidden patterns and insights in the given data. It
can be compared to the learning that takes place in the human brain while learning new
things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using
an unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output
data. The goal of unsupervised learning is to find the underlying structure of the dataset,
group the data according to similarities, and represent the dataset in a compressed
format.

Example: Suppose the unsupervised learning algorithm is given an input dataset
containing images of different types of cats and dogs. The algorithm is never given labels
for the dataset, which means it has no prior idea about the features of the dataset.
The task of the unsupervised learning algorithm is to identify the image features on its
own. The algorithm performs this task by clustering the image dataset
into groups according to the similarities between images.

Why use Unsupervised Learning?
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes
it more widely applicable.
o In the real world, we do not always have input data with the corresponding output, so to
solve such cases, we need unsupervised learning.

Working of Unsupervised Learning


Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabeled input data, which means it is not categorized and
the corresponding outputs are not given. This unlabeled input data is fed to the
machine learning model in order to train it.

First, the model interprets the raw data to find hidden patterns in it, and then
applies a suitable algorithm such as k-means clustering or hierarchical clustering.

Once the suitable algorithm is applied, it divides the data objects into groups
according to the similarities and differences between the objects.
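
As an illustration of grouping by similarity, here is a minimal k-means sketch in plain Python. The points and starting centroids are made up, the number of iterations is fixed for simplicity, and the function name kmeans is chosen for this example; it is not a full implementation.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and moving the centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        centroids = new_centroids
    return labels, centroids

# Made-up unlabeled 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
```

No labels are given to the algorithm; the first three points end up in one group and the last three in the other purely because of their distances.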

Types of Unsupervised Learning Algorithm:


The unsupervised learning algorithm can be further categorized into two types of problems:

o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities
with the objects of another group. Cluster analysis finds the commonalities between
the data objects and categorizes them as per the presence and absence of those
commonalities.

o Association: An association rule is an unsupervised learning method which is used
for finding the relationships between variables in a large database. It determines
the sets of items that occur together in the dataset. Association rules make
marketing strategy more effective. For example, people who buy item X (suppose bread)
also tend to purchase item Y (butter/jam). A typical example of association rules
is Market Basket Analysis.

Unsupervised Learning algorithms:


Below is the list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors), although it is more commonly used as a supervised method
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Density Estimation Clustering: One method for density estimation is clustering, where the
aim is to find clusters or groupings of the input. In the case of a company with data on past
customers, the customer data contains the demographic information as well as the past
transactions with the company, and the company may want to see the distribution of the
profile of its customers, to see what type of customers frequently occur. In such a case, a
clustering model allocates customers similar in their attributes to the same group,
providing the company with natural groupings of its customers; this is called customer
segmentation. Once such groups are found, the company may decide strategies, for
example, services and products, specific to different groups; this is known as customer
relationship management. Such a grouping also allows identifying those who are outliers,
namely, those who are different from other customers, which may imply a niche in the
market that can be further exploited by the company.

An interesting application of clustering is in image compression. In this case, the input
instances are image pixels represented as RGB values. A clustering program groups pixels
with similar colors in the same group, and such groups correspond to the colors occurring
frequently in the image. If in an image, there are only shades of a small number of colors,
and if we code those belonging to the same group with one color, for example, their average,
then the image is quantized. Let us say the pixels are 24 bits to represent 16 million colors,
but if there are shades of only 64 main colors, for each pixel we need 6 bits instead of 24.
For example, if the scene has various shades of blue in different parts of the image, and if
we use the same average blue for all of them, we lose the details in the image but gain
space in storage and transmission. Ideally, we would like to identify higher-level regularities
by analyzing repeated image patterns, for example, texture, objects, and so forth. This
allows a higher-level, simpler, and more useful description of the scene, and for example,
achieves better compression than compressing at the pixel level. If we have scanned
document pages, we do not have random on/off pixels but bitmap images of characters.
There is structure in the data, and we make use of this redundancy by finding a shorter
description of the data: a 16 × 16 bitmap of ‘A’ takes 32 bytes; its ASCII code is only 1 byte.
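
The storage figures quoted above can be verified with a little arithmetic. The sketch below uses a hypothetical helper, bits_per_pixel, and ignores the small cost of storing the color table itself.

```python
def bits_per_pixel(n_colors):
    """Smallest number of bits that can index n_colors distinct colors."""
    return max(1, (n_colors - 1).bit_length())

full_color_bits = bits_per_pixel(2 ** 24)   # about 16 million colors
quantized_bits = bits_per_pixel(64)         # 64 main colors after clustering
saving_factor = full_color_bits / quantized_bits

bitmap_bytes = 16 * 16 // 8                 # one bit per pixel in a 16 x 16 bitmap
```

This gives 24 bits against 6 bits per pixel, a factor of 4, and 32 bytes for the bitmap, matching the numbers in the text.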

In document clustering, the aim is to group similar documents. For example, news reports
can be subdivided as those related to politics, sports, fashion, arts, and so on. Commonly, a
document is represented as a bag of words—that is, we predefine a lexicon of N words, and
each document is an N-dimensional binary vector whose element i is 1 if word i appears in
the document; suffixes “–s” and “–ing” are removed to avoid duplicates and words such as
“of,” “and,” and so forth, which are not informative, are not used. Documents are then
grouped depending on the number of shared words. It is of course critical how the lexicon is
chosen.
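
The bag-of-words representation can be sketched as follows. Everything here is an assumption made for illustration: the five-word lexicon, the stop-word list, the sample sentences, and the very crude suffix stripping, which only handles simple cases such as "plays" or "playing".

```python
STOP_WORDS = {"of", "and", "the", "a", "in", "to", "for"}

def normalize(word):
    """Lower-case a word and strip a trailing "ing" or "s" (crude, illustrative)."""
    word = word.lower().strip(".,!?")
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def bag_of_words(text, lexicon):
    """Binary vector: element i is 1 if lexicon word i appears in the text."""
    words = {normalize(w) for w in text.split()} - STOP_WORDS
    return [1 if term in words else 0 for term in lexicon]

def shared_words(u, v):
    """Number of lexicon words that appear in both documents."""
    return sum(a & b for a, b in zip(u, v))

lexicon = ["election", "party", "team", "play", "goal"]   # N = 5 words
doc_a = bag_of_words("The team plays and scores goals", lexicon)
doc_b = bag_of_words("Teams playing for a goal", lexicon)
doc_c = bag_of_words("The party of the election", lexicon)
```

The two sports sentences share three lexicon words while the politics sentence shares none with them, so a grouping based on shared words would put doc_a and doc_b together.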

Reinforcement Learning

o Reinforcement Learning is a feedback-based Machine learning technique in which
an agent learns to behave in an environment by performing actions and seeing
the results of those actions. For each good action, the agent gets positive feedback, and for
each bad action, the agent gets negative feedback or a penalty.
o In Reinforcement Learning, the agent learns automatically using feedback without
any labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential, and the
goal is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary goal
of an agent in reinforcement learning is to improve the performance by getting the
maximum positive rewards.
o The agent learns by trial and error, and based on the experience, it
learns to perform the task in a better way. Hence, we can say that "Reinforcement
learning is a type of machine learning method where an intelligent agent
(computer program) interacts with the environment and learns to act within
that." How a robotic dog learns the movement of its limbs is an example of
Reinforcement learning.
o It is a core part of Artificial intelligence, and many AI agents work on the concept of
reinforcement learning. Here we do not need to pre-program the agent, as it learns
from its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment, and its
goal is to find the diamond. The agent interacts with the environment by performing
some actions, and based on those actions, the state of the agent changes, and
it also receives a reward or penalty as feedback.
o The agent keeps doing these three things (take an action, change state or remain in
the same state, and get feedback), and by doing so, it learns and
explores the environment.
o The agent learns which actions lead to positive feedback or rewards and which
actions lead to negative feedback or penalties. As a positive reward, the agent gets a
positive point, and as a penalty, it gets a negative point.

Terms used in Reinforcement Learning
o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which an agent is present or by which it is surrounded. In RL,
we assume a stochastic environment, which means it is random in nature.
o Action: Actions are the moves taken by an agent within the environment.
o State: State is the situation returned by the environment after each action taken by
the agent.
o Reward: Feedback returned to the agent from the environment to evaluate the
action of the agent.
o Policy: Policy is the strategy applied by the agent to choose the next action based on the
current state.
o Value: The expected long-term return with the discount factor, as opposed to
the short-term reward.
o Q-value: It is mostly similar to the value, but it takes one additional parameter,
the current action (a).
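
To show how these terms fit together, here is a small sketch that uses tabular Q-learning, one well-known value-based method that these notes do not name. The environment is a made-up corridor of five cells with the "diamond" in the last cell (simpler than a maze), and the discount factor GAMMA and learning rate ALPHA are arbitrary choices.

```python
import random

N_STATES = 5          # corridor cells 0..4; the "diamond" sits in cell 4
ACTIONS = (-1, +1)    # move left or move right
GAMMA, ALPHA = 0.9, 0.5

def step(state, action):
    """Environment: return (next_state, reward) for one move of the agent."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def train(episodes=300, seed=0):
    """Learn a Q-value for every (state, action) pair by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):                      # cap on episode length
            a = rng.randrange(2)                  # explore with random actions
            nxt, reward = step(state, ACTIONS[a])
            target = reward + GAMMA * max(q[nxt]) # reward plus discounted value
            q[state][a] += ALPHA * (target - q[state][a])
            state = nxt
            if state == N_STATES - 1:             # diamond found, episode ends
                break
    return q

def greedy_policy(q):
    """Policy: in each non-terminal state pick the action with the larger Q-value."""
    return [ACTIONS[row.index(max(row))] for row in q[:-1]]

q_table = train()
policy = greedy_policy(q_table)
```

After training, the greedy policy moves right in every cell, and the Q-values shrink by the discount factor with each extra step needed to reach the diamond.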

Key Features of Reinforcement Learning


o In RL, the agent is not instructed about the environment or about which actions need to
be taken.
o It is based on a trial-and-error process.
o The agent takes the next action and changes states according to the feedback of the
previous action.
o The agent may get a delayed reward.
o The environment is stochastic, and the agent needs to explore it to get the
maximum positive rewards.

Approaches to implement Reinforcement Learning


There are mainly three ways to implement reinforcement learning in ML, which are:
1. Value-based:
The value-based approach aims to find the optimal value function, which is the
maximum value achievable at a state under any policy. The agent therefore expects the
long-term return at any state (s) under policy π.
2. Policy-based:
The policy-based approach aims to find the optimal policy for the maximum future rewards
without using the value function. In this approach, the agent tries to apply a
policy such that the action performed in each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: For a given state, the policy (π) always produces the same action.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no
particular solution or algorithm for this approach because the model representation
is different for each environment.

Applications of Machine learning


We use machine learning in our daily life, often without knowing it, in services such as Google
Maps, Google Assistant, and Alexa. Below are some of the most popular real-world applications of
Machine Learning:

1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, and so on in digital images. A popular use case of image
recognition and face detection is automatic friend tagging suggestion:
Facebook provides a feature that suggests friend tags automatically. Whenever we upload a
photo with our Facebook friends, we automatically get a tagging suggestion with their
names; the technology behind this is machine learning's face detection and recognition
algorithm.
2. Speech Recognition
While using Google, we get an option of "Search by voice," it comes under speech
recognition, and it's a popular application of machine learning.
Speech recognition is a process of converting voice instructions into text, and it is also
known as "Speech to text", or "Computer speech recognition."
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the
shortest route and predicts the traffic conditions.
It predicts whether traffic is clear, slow-moving, or heavily
congested in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day

4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for
some product on Amazon, we start getting advertisements for the same product
while surfing the internet in the same browser, and this is because of machine learning.

5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine
learning plays a significant role in self-driving cars. Tesla, a popular car
manufacturer, is working on self-driving cars. It is using an unsupervised learning
method to train the car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:


Whenever we receive a new email, it is filtered automatically as important, normal, and
spam. We always receive an important mail in our inbox with the important symbol and
spam emails in our spam box, and the technology behind this is Machine learning. Below
are some spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

7. Virtual Personal Assistant:


We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri.
As the name suggests, they help us find information using voice instructions.
These assistants can help us in various ways just through voice instructions, such as playing
music, calling someone, opening an email, scheduling an appointment, etc.

8. Online Fraud Detection:


Machine learning makes our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there are various ways in which
a fraudulent transaction can take place, such as fake accounts, fake IDs, and money stolen
in the middle of a transaction. To detect this, a feed-forward neural
network helps us by checking whether it is a genuine transaction or a fraudulent one.

9. Stock Market trading:


Machine learning is widely used in stock market trading. In the stock market, there is
always a risk of ups and downs in share prices, so a long short-term
memory (LSTM) neural network is used for the prediction of stock market trends.

10. Medical Diagnosis:


In medical science, machine learning is used for disease diagnosis. With this, medical
technology is growing very fast and is able to build 3D models that can predict the exact
position of lesions in the brain.
It helps in finding brain tumors and other brain-related diseases easily.
11. Automatic Language Translation:
Nowadays, if we visit a new place and do not know the language, it is not a
problem at all, because machine learning helps us by converting the text into a
language we know. Google's GNMT (Google Neural Machine Translation) provides this feature;
it is a neural machine translation system that translates text into our familiar language, and
this is called automatic translation.

Chapter 2: Supervised Learning
Learning a Class from Examples

Let us say we want to learn the class, C, of a “family car.” We have a set of examples of
cars, and we have a group of people that we survey to whom we show these cars. The
people look at the cars and label them;
the cars that they believe are family cars are positive examples, and the
other cars are negative examples.

Class learning is finding a description that is shared by all the positive examples and none
of the negative examples.
o Class C of a “family car”
o Prediction: Is car x a family car?
o Knowledge extraction: What do people expect from a family car?
o Output: Positive (+) and negative (–) examples
Input representation

Training set for the class of a “family car.” Each data point corresponds to one example car,
and the coordinates of the point indicate the price and engine power of that car. ‘+’ denotes
a positive example of the class (a family car), and ‘−’ denotes a negative example (not a
family car); it is another type of car

Let us denote price as the first input attribute x1 (e.g., in U.S. dollars) and engine power as
the second attribute x2 (e.g., engine volume in cubic centimeters). Thus we represent each
car using two numeric values:

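\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \]

\[ r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is positive} \\ 0 & \text{if } \mathbf{x} \text{ is negative} \end{cases} \]

\[ X = \{ \mathbf{x}^t, r^t \}_{t=1}^{N} \]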

Each car is represented by such an ordered pair (x, r), and the training set contains N such
examples,
where t indexes different examples in the set; it does not represent time or any such order.

Our training data can now be plotted in the two-dimensional (x1, x2) space where each
instance t is a data point at coordinates (x_1^t, x_2^t) and its type, namely, positive versus
negative, is given by r^t.

We may have reason to believe that for a car to be a family car, its price and engine power
should be in a certain range

Eqn. 2.4:

\[ (p_1 \le \text{price} \le p_2) \ \text{AND} \ (e_1 \le \text{engine power} \le e_2) \]

for suitable values of p1, p2, e1, and e2.

Example of a hypothesis class. The class of family car is a rectangle in the price-engine
power space.

Equation 2.4 thus assumes C to be a rectangle in the price-engine power space.

Equation 2.4 fixes H, the hypothesis class from which we believe C is drawn, namely, the
set of rectangles. The learning algorithm then finds the particular hypothesis, h
∈ H, specified by a particular quadruple (p_1^h, p_2^h, e_1^h, e_2^h), to approximate C as closely as
possible.

Though the expert defines this hypothesis class, the values of the parameters are not
known; that is, though we choose H, we do not know which particular h ∈ H is equal, or
closest, to C. But once we restrict our attention to this hypothesis class, learning the class
reduces to the easier problem of finding the four parameters that define h. The aim is to
find h ∈ H that is as similar as possible to C. Let us say the hypothesis h makes a
prediction for an instance x such that

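\[ h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ says } \mathbf{x} \text{ is positive} \\ 0 & \text{if } h \text{ says } \mathbf{x} \text{ is negative} \end{cases} \]

\[ E(h \mid X) = \sum_{t=1}^{N} 1\big( h(\mathbf{x}^t) \ne r^t \big) \]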

In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x).
What we have is the training set X, which is a small subset of the set of all
possible x.

The empirical error is the proportion of training instances where predictions of h do not
match the required values given in X. The error of hypothesis h given the training set X is
E(h|X), where 1(a ≠ b) is 1 if a ≠ b and is 0 if a = b.
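The rectangle hypothesis of Eqn. 2.4 and the empirical error E(h|X) can be sketched in a few lines. This is only an illustration: the car data and the parameter values (p1, p2, e1, e2) are made up, and the names h and empirical_error are chosen here for convenience. The function counts mismatches; dividing by N gives the proportion.

```python
from typing import List, Tuple

Car = Tuple[float, float]      # (price, engine power)
Example = Tuple[Car, int]      # (x, r) with r = 1 for positive, 0 for negative


def h(x: Car, p1: float, p2: float, e1: float, e2: float) -> int:
    """Rectangle hypothesis: 1 if x lies inside the rectangle, else 0."""
    price, power = x
    return 1 if (p1 <= price <= p2 and e1 <= power <= e2) else 0


def empirical_error(X: List[Example], params: Tuple[float, float, float, float]) -> int:
    """Count the training instances where h(x) differs from the label r."""
    return sum(1 for x, r in X if h(x, *params) != r)


# Hypothetical training set: ((price in USD, engine volume in cc), label)
X = [
    ((15000, 1500), 1),
    ((18000, 1800), 1),
    ((60000, 4000), 0),
    ((5000, 800), 0),
    ((17000, 3500), 0),
]

good = (10000, 20000, 1000, 2000)    # (p1, p2, e1, e2)
loose = (10000, 20000, 1000, 4000)   # engine-power range too wide

print(empirical_error(X, good), empirical_error(X, loose))
print(empirical_error(X, loose) / len(X))  # error as a proportion of N
```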

The term ‘generalization’ refers to the model’s capability to adapt and react properly to
previously unseen, new data, which has been drawn from the same distribution as the one
used to build the model.
In other words, generalization examines how well a model can digest new data and make
correct predictions after getting trained on a training set.

Note that if x1 and x2 are real-valued, there are infinitely many such h for which this is
satisfied, namely, for which the error, E, is 0, but given a future example somewhere close
to the boundary between positive and negative examples, different candidate hypotheses
may make different predictions. This is the problem of generalization, that
is, how well our hypothesis will correctly classify future examples that are not part of the
training set.

A version space is a hierarchical representation of knowledge that enables you to keep
track of all the useful information supplied by a sequence of learning examples without
remembering any of the examples.

One possibility is to find the most specific hypothesis, S, that is the tightest
rectangle that includes all the positive examples and none of the negative examples (see
figure 2.4). This gives us one hypothesis, h = S, as our induced class.
Note that the actual class C may be larger than S but is never smaller.
The most general hypothesis, G, is the largest rectangle we can draw that
includes all the positive examples and none of the negative examples (figure 2.4).
Any h ∈ H between S and G is a valid hypothesis with no error, said to be consistent with
the training set, and such h make up the version space.
Given another training set, S, G, the version space, the parameters, and thus the learned
hypothesis, h, can be different.
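As a rough sketch, the most specific hypothesis S can be computed as the bounding box of the positive examples, followed by a check that no negative example falls inside it. The data and the function names most_specific_rectangle and is_consistent are assumptions made for this illustration.

```python
def most_specific_rectangle(X):
    """Tightest axis-aligned rectangle (p1, p2, e1, e2) around the positives."""
    positives = [x for x, r in X if r == 1]
    if not positives:
        raise ValueError("need at least one positive example")
    prices = [price for price, _ in positives]
    powers = [power for _, power in positives]
    return (min(prices), max(prices), min(powers), max(powers))


def is_consistent(X, rect):
    """True if rect contains every positive and no negative example."""
    p1, p2, e1, e2 = rect
    for (price, power), r in X:
        inside = p1 <= price <= p2 and e1 <= power <= e2
        if inside != (r == 1):
            return False
    return True


# Hypothetical training set: ((price, engine power), label)
X = [
    ((15000, 1500), 1),
    ((18000, 1800), 1),
    ((16000, 1700), 1),
    ((60000, 4000), 0),
    ((5000, 800), 0),
]

S = most_specific_rectangle(X)
print(S, is_consistent(X, S))
```

If is_consistent returns False for S, no rectangle in H is consistent with the data, because any rectangle containing all positives also contains S.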

C is the actual class and h is our induced hypothesis. The point where C is 1 but h is 0 is a
false negative, and the point where C is 0 but h is 1 is a false positive. Other points—
namely, true positives and true negatives—are correctly classified.

Actually, depending on X and H, there may be several S_i and G_j which respectively make up
the S-set and the G-set. Every member of the S-set is consistent with all the instances, and
there are no consistent hypotheses that are more specific. Similarly, every member of the
G-set is consistent with all the instances, and there are no consistent hypotheses that are
more general. These two make up the boundary sets and any hypothesis between them is
consistent and is part of the version space. There is an algorithm called candidate
elimination that incrementally updates the S- and G-sets as it sees training instances one
by one.

Given X, we can find S, or G, or any h from the version space and use it as our hypothesis,
h. It seems intuitive to choose h halfway between S and G; this is to increase the
margin, which is the distance between the boundary and the instances closest to it.

We choose the hypothesis with the largest margin, for best separation. The shaded
instances are those that define (or support) the margin; other instances can be removed
without affecting h.

S is the most specific and G is the most general hypothesis

In some applications, a wrong decision may be very costly and in such a case, we can say
that any instance that falls in between S and G is a case of doubt, which we cannot
label with certainty due to lack of data. In such a case, the system rejects the instance and
defers the decision to a human expert.
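The reject option described above can be sketched as a three-way decision. The rectangles S and G below are hypothetical values, and the function names are chosen only for this example.

```python
def inside(x, rect):
    """True if point x = (price, power) lies in rect = (p1, p2, e1, e2)."""
    p1, p2, e1, e2 = rect
    return p1 <= x[0] <= p2 and e1 <= x[1] <= e2


def classify_with_reject(x, S, G):
    """Positive inside S, negative outside G, otherwise a case of doubt."""
    if inside(x, S):
        return "positive"
    if not inside(x, G):
        return "negative"
    return "reject"  # between S and G: defer to a human expert


S = (15000, 18000, 1500, 1800)   # assumed most specific rectangle
G = (6000, 50000, 900, 3400)     # assumed most general rectangle

for car in [(16000, 1600), (30000, 2500), (60000, 4000)]:
    print(car, classify_with_reject(car, S, G))
```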

The version space method is a concept learning process accomplished by managing
multiple models within a version space.

Version Space Characteristics


Tentative heuristics are represented using version spaces.
A version space represents all the alternative plausible descriptions of a heuristic.
A plausible description is one that is applicable to all known positive examples and no
known negative example.

A version space description consists of two complementary trees:


1. One that contains nodes connected to overly general models, and
2. One that contains nodes connected to overly specific models.
Node values/attributes are discrete.

Fundamental Assumptions
1. The data is correct; there are no erroneous instances.
2. A correct description is a conjunction of some of the attributes with values.

Vapnik - Chervonenkis(VC) Dimension

Let us say we have a dataset containing N points. These N points can be labeled in 2^N ways
as positive and negative. Therefore, 2^N different learning problems can be defined by N data
points. If for any of these problems, we can find a hypothesis h ∈ H that separates the
positive examples from the negative, then we say H shatters N points. That is, any learning
problem definable by N examples can be learned with no error by a hypothesis
drawn from H. The maximum number of points that can be shattered by H is called the
Vapnik-Chervonenkis (VC) dimension of H, is denoted as VC(H), and measures the capacity
of H.

In figure 2.6, we see that an axis-aligned rectangle can shatter four points in two
dimensions. Then VC(H), when H is the hypothesis class of axis-aligned rectangles in two
dimensions, is four. In calculating the VC dimension, it is enough that we find four points
that can be shattered; it is not necessary that we be able to shatter any four points in two
dimensions.
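Shattering can be checked by brute force for small point sets. The sketch below tries all 2^N labelings and, for each, tests whether the bounding box of the positive points excludes every negative point; if the bounding box fails, every larger rectangle fails too. The point sets (a diamond, four collinear points, and the diamond plus its center) are examples chosen here, not taken from figure 2.6.

```python
from itertools import product


def rectangle_separates(points, labels):
    """True if some axis-aligned rectangle contains exactly the positive points."""
    positives = [p for p, y in zip(points, labels) if y == 1]
    if not positives:
        return True  # a rectangle placed away from all points labels everything negative
    x_lo = min(p[0] for p in positives)
    x_hi = max(p[0] for p in positives)
    y_lo = min(p[1] for p in positives)
    y_hi = max(p[1] for p in positives)
    # The bounding box is the smallest candidate; check that no negative is inside it.
    return not any(
        x_lo <= q[0] <= x_hi and y_lo <= q[1] <= y_hi
        for q, y in zip(points, labels)
        if y == 0
    )


def is_shattered(points):
    """True if every one of the 2^N labelings can be realised by a rectangle."""
    return all(
        rectangle_separates(points, labels)
        for labels in product([0, 1], repeat=len(points))
    )


diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
collinear = [(0, 0), (1, 0), (2, 0), (3, 0)]
five = diamond + [(0, 0)]

print(is_shattered(diamond), is_shattered(collinear), is_shattered(five))
```

The diamond is shattered while the collinear points are not, which matches the remark that it is enough to find one arrangement of four points that can be shattered. The five-point check covers only this one arrangement; it does not by itself prove that no five points can be shattered.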

VC dimension may seem pessimistic. It tells us that using a rectangle as our hypothesis
class, we can learn only datasets containing four points and not more.

Probably Approximately Correct (PAC) Learning


Using the tightest rectangle, S, as our hypothesis, we would like to find how many examples
we need. We would like our hypothesis to be approximately correct, namely, that the error
probability be bounded by some value. We also would like to be confident in our hypothesis
in that we want to know that our hypothesis will be correct most of the time (if not always);
so we want to be probably correct as well (by a probability we can specify).

In Probably Approximately Correct (PAC) learning, given a class, C, and
examples drawn from some unknown but fixed probability distribution, p(x), we want to find
the number of examples, N, such that with probability at least 1 − δ, the hypothesis h has
error at most ϵ, for arbitrary δ ≤ 1/2 and ϵ > 0:

\[ P\{ C \,\Delta\, h \le \epsilon \} \ge 1 - \delta \]

where CΔh is the region of difference between C and h.


In our case, because S is the tightest possible rectangle, the error region between C and h =
S is the sum of four rectangular strips (see figure 2.7). We would like to make sure that the
probability of a positive example falling in here (and causing an error) is at most ϵ. For any
of these strips, if we can guarantee that the probability is upper bounded by ϵ/4, the error is
at most 4(ϵ/4) = ϵ. Note that we count the overlaps in the corners twice, and the total actual
error in this case is less than 4(ϵ/4). The probability that a randomly drawn example misses
this strip is 1 − ϵ/4. The probability that all N independent draws miss the strip is (1 − ϵ/4)^N,
and the probability that all N independent draws miss any of the four strips is at most 4(1 −
ϵ/4)^N, which we would like to be at most δ. We have the inequality

\[ (1 - x) \le \exp[-x] \]

So if we choose N and δ such that

\[ 4 \exp[-\epsilon N / 4] \le \delta \]

we can also write 4(1 − ϵ/4)^N ≤ δ. Dividing both sides by 4, taking the (natural) log, and
rearranging terms, we have

\[ N \ge (4/\epsilon) \log(4/\delta) \]

Therefore, provided that we take at least (4/ϵ) log(4/δ) independent examples from C and
use the tightest rectangle as our hypothesis h, with confidence probability at least 1 − δ, a
given point will be misclassified with error probability at most ϵ. We can have arbitrarily large
confidence by decreasing δ and arbitrarily small error by decreasing ϵ, and we see in
equation 2.7 that the number of examples is a slowly growing function of 1/ϵ and 1/δ,
linear and logarithmic, respectively.
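The bound N ≥ (4/ϵ) log(4/δ) is easy to evaluate numerically. The sketch below rounds up to the next integer and also checks the intermediate condition 4(1 − ϵ/4)^N ≤ δ; the function name pac_sample_size and the values ϵ = 0.1, δ = 0.05 are illustrative assumptions.

```python
import math


def pac_sample_size(epsilon, delta):
    """Smallest integer N with N >= (4 / epsilon) * ln(4 / delta)."""
    if epsilon <= 0 or not (0 < delta <= 0.5):
        raise ValueError("need epsilon > 0 and 0 < delta <= 1/2")
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))


epsilon, delta = 0.1, 0.05          # assumed error and confidence targets
N = pac_sample_size(epsilon, delta)
print(N)                                          # 176
print(4 * (1 - epsilon / 4) ** N <= delta)        # the tighter condition also holds
```

Halving ϵ roughly doubles N, whereas halving δ adds only about (4/ϵ) log 2 examples, which is the linear versus logarithmic growth noted above.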

Noise
Noise is any unwanted anomaly in the data and due to noise, the class may be more difficult
to learn and zero error may be infeasible with a simple hypothesis class (see figure 2.8).
There are several interpretations of noise:
1. There may be imprecision in recording the input attributes, which may
shift the data points in the input space.
2. There may be errors in labelling the data points, which may relabel positive instances as
negative and vice versa. This is sometimes called teacher noise.
3. There may be additional attributes, which we have not taken into account, that affect the
label of an instance. Such attributes may be hidden or latent in that they may be
unobservable. The effect of these neglected attributes is thus modelled as a random
component and is included in “noise.”

As can be seen in figure 2.8, when there is noise, there is not a simple boundary between
the positive and negative instances and to separate them, one needs a complicated
hypothesis that corresponds to a hypothesis class with larger capacity. A rectangle can be
defined by four numbers, but to define a more complicated shape one needs a more
complex model with a much larger number of parameters. With a complex model,
one can make a perfect fit to the data and attain zero error; see the wiggly shape in figure
2.8. Another possibility is to keep the model simple and allow some error; see the rectangle
in figure 2.8. Using the simple rectangle (unless its training error is much bigger) makes
more sense because of the following.

1. It is a simple model to use. It is easy to check whether a point is inside or outside a
rectangle and we can easily check, for a future data instance, whether it is a positive or a
negative instance.
2. It is a simple model to train and has fewer parameters. It is easier to find the corner
values of a rectangle than the control points of an arbitrary shape. With a small training set
when the training instances differ a little bit, we expect the simpler model to change less
than a complex model: A simple model is thus said to have less variance. On the other hand,
a too simple model assumes more, is more rigid, and may fail if indeed the underlying class
is not that simple: A simpler model has more bias. Finding the optimal model corresponds
to minimizing both the bias and the variance.
3. It is a simple model to explain. A rectangle simply corresponds to defining intervals on
the two attributes. By learning a simple model, we can extract information from the raw
data given in the training set.

4. If indeed there is mislabelling or noise in input and the actual class is really a simple
model like the rectangle, then the simple rectangle, because it has less variance and is less
affected by single instances, will be a better discriminator than the wiggly shape, although
the simple one may make slightly more errors on the training set. Given comparable
empirical error, we say that a simple (but not too simple) model would generalize better
than a complex model. This principle is known as Occam’s razor, which
states that simpler explanations are more plausible and any unnecessary complexity should
be shaved off.

Learning Multiple Classes


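In our example of learning a family car, we have positive examples belonging to the class
family car and the negative examples belonging to all other cars. This is a two-class
problem. In the general case, we have K
classes denoted as C_i, i = 1, . . . , K, and an input instance belongs to one and exactly one of
them. The training set is now of the form

\[ X = \{ \mathbf{x}^t, \mathbf{r}^t \}_{t=1}^{N} \]

where r has K dimensions and

\[ r_i^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \ne i \end{cases} \]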

An example is given in figure 2.9 with instances from three classes: family car, sports car,
and luxury sedan. In machine learning for classification, we would like to learn the
boundary separating the instances of one class from the instances of all other classes. Thus
we view a K-class classification problem as K two-class problems. The training examples
belonging to Ci are the positive instances of hypothesis hi and the examples of all other
classes are the negative instances of h_i. Thus in a K-class problem, we have K hypotheses
to learn such that

\[ h_i(\mathbf{x}^t) = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \ne i \end{cases} \]

For a given x, ideally only one of h_i(x), i = 1, . . . , K is 1 and we can choose a class. But
when no, or two or more, h_i(x) is 1, we cannot choose a class, and this is the case of
doubt and the classifier rejects such cases. In our example of learning a family car, we used
only one hypothesis and only modelled the positive examples. Any negative example outside
is not a family car. Alternatively, sometimes we may prefer to build two hypotheses, one for
the positive and the other for the negative instances. This assumes a structure also for the
negative instances that can be covered by another hypothesis. Separating family cars from
sports cars is such a problem; each class has a structure of its own. The advantage is that
if the input is a luxury sedan, we can have both hypotheses decide negative and reject the
input. If in a dataset, we expect to have all classes with similar distribution—shapes in the
input space—then the same hypothesis class can be used for all classes. For example, in a
handwritten digit recognition dataset, we would expect all digits to have similar
distributions. But in a medical diagnosis dataset, for example, where we have two classes
for sick and healthy people, we may have completely different distributions for the two
classes; there may be multiple ways for a person to be sick, reflected differently in the
inputs: All healthy people are alike; each sick person is sick in his or her own way.
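Treating a K-class problem as K two-class problems, with rejection in the case of doubt, might look like the sketch below. The three rectangles are invented for illustration, and make_rectangle and predict_class are hypothetical helper names.

```python
def make_rectangle(p1, p2, e1, e2):
    """Build a two-class hypothesis h_i that returns 1 inside the rectangle."""
    def h_i(x):
        price, power = x
        return 1 if (p1 <= price <= p2 and e1 <= power <= e2) else 0
    return h_i


# One hypothesis per class (made-up price and engine-power ranges).
hypotheses = {
    "family car": make_rectangle(10000, 25000, 1000, 2000),
    "sports car": make_rectangle(20000, 60000, 1800, 4000),
    "luxury sedan": make_rectangle(50000, 120000, 2500, 5000),
}


def predict_class(x, hypotheses):
    """Return the single class whose h_i(x) is 1, or None in case of doubt."""
    winners = [name for name, h_i in hypotheses.items() if h_i(x) == 1]
    if len(winners) == 1:
        return winners[0]
    return None  # no hypothesis, or more than one, says 1: reject


for car in [(15000, 1500), (22000, 1900), (5000, 600), (80000, 4500)]:
    print(car, predict_class(car, hypotheses))
```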

A Table of Test Outcomes


Let's say there is a condition with a binary outcome ("yes" vs. "no", 1 vs. 0, or whatever you
want to call it). Suppose we conduct a test that is designed to detect this condition; the test
also has a binary outcome. The totality of outcomes can thus be represented with a two-by-
two table, which is also called the Confusion Matrix.
Suppose 10,000 patients get tested for flu; out of them, 9,000 are actually healthy and
1,000 are actually sick. For the sick people, a test was positive for 620 and negative for 380.
For the healthy people, the same test was positive for 180 and negative for 8,820. Let's
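summarize these results in a table:

                    Test positive   Test negative    Total
Actually sick             620             380         1,000
Actually healthy          180           8,820         9,000
Total                     800           9,200        10,000

Now comes our first batch of definitions.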
True positive (TP): Positive test result matches reality — the person is actually sick and
tested positive.
False positive (FP): Positive test result doesn't match reality — the test is positive but
the person is not actually sick.
True negative (TN): Negative test result matches reality — the person is not sick and
tested negative.
False negative (FN): Negative test result doesn't match reality — the test is negative but
the person is actually sick.
A useful heuristic: positive vs. negative reflects the test outcome; true vs. false reflects
whether the test got it right or got it wrong.
Armed with these and N for the total population (10,000, in our case), we are now ready to
tackle the multitude of definitions statisticians have produced over the years to describe the
performance of tests:
Prevalence: How common is the actual disease in the population?
o (FN+TP)/N
o In the example: (380+620)/10000=0.1
Accuracy: How often is the test correct?
o (TP+TN)/N
o In the example: (620+8820)/10000=0.944
Misclassification rate: How often is the test wrong?
o 1 - Accuracy = (FP+FN)/N
o In the example: (180+380)/10000=0.056
Sensitivity or True Positive Rate (TPR) or Recall: When the patient is sick, how often
does the test actually predict it correctly?
o TP/(TP+FN)
o In the example: 620/(620+380)=0.62
Specificity or True Negative Rate (TNR): When the patient is not sick, how often does
the test actually predict it correctly?
o TN/(TN+FP)
o In the example: 8820/(8820+180)=0.98
False Positive Rate (FPR): The probability of false alarm.
o 1 - Specificity = FP/(TN+FP)
o In the example: 180/(8820+180)=0.02
False Negative Rate (FNR): Miss rate; the probability of missing a sickness with a test.
o 1 - Sensitivity = FN/(TP+FN)
o In the example: 380/(620+380)=0.38

Precision or Positive Predictive Value (PPV): When the prediction is positive, how often
is it correct?
o TP/(TP+FP)
o In the example: 620/(620+180)=0.775
Negative Predictive Value (NPV): When the prediction is negative, how often is it
correct?
o TN/(TN+FN)
o In the example: 8820/(8820+380)=0.959
Positive Likelihood Ratio: How much more likely is a positive result for a sick person
than for a healthy person (used with odds formulations of probability)?
o TPR/FPR
o In the example: 0.62/0.02=31
Negative Likelihood Ratio: How much more likely is a negative result for a sick person
than for a healthy person?
o FNR/TNR
o In the example: 0.38/0.98=0.388
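All of the quantities above follow from the four counts in the flu example, so they can be recomputed in a few lines. The variable and dictionary key names are choices made for this sketch.

```python
# Counts from the flu example: 1,000 sick and 9,000 healthy patients.
TP, FN = 620, 380
FP, TN = 180, 8820
N = TP + FN + FP + TN

tpr = TP / (TP + FN)   # sensitivity / recall
tnr = TN / (TN + FP)   # specificity
fpr = FP / (FP + TN)   # false alarm rate
fnr = FN / (FN + TP)   # miss rate

metrics = {
    "prevalence": (TP + FN) / N,
    "accuracy": (TP + TN) / N,
    "misclassification_rate": (FP + FN) / N,
    "sensitivity": tpr,
    "specificity": tnr,
    "false_positive_rate": fpr,
    "false_negative_rate": fnr,
    "precision": TP / (TP + FP),
    "negative_predictive_value": TN / (TN + FN),
    "positive_likelihood_ratio": tpr / fpr,
    "negative_likelihood_ratio": fnr / tnr,
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```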

Rapid influenza diagnostic tests (RIDTs) are said to have a sensitivity of 62.3%; this is just another way of saying that for a
person with flu, the test will be positive 62.3% of the time. For people who do not have the
flu, the test is more accurate since its specificity is 98.2%: only 1.8% of healthy people
will be flagged positive.
The positive likelihood ratio is said to be 34.5; let's see how it was computed:

\[ LR_{+} = \frac{\text{sensitivity}}{1 - \text{specificity}} = \frac{0.623}{0.018} \approx 34.6 \]

which is close to the quoted 34.5. This is to say, a positive result is about 35 times as likely
for a sick person as for a healthy one. And the
negative likelihood ratio is said to be 0.38:

\[ LR_{-} = \frac{1 - \text{sensitivity}}{\text{specificity}} = \frac{0.377}{0.982} \approx 0.38 \]

This is to say, a negative result is about 0.38 times as likely for a sick person as for a healthy one.
In other words, a positive result from these flu tests is strong evidence of flu, but a negative
result does not rule it out, because the test misses more than a third of sick people.
True Positive and True Negative values mean the predicted value matches the actual
value.
A Type I Error happens when the model makes an incorrect prediction, as in, the model
predicted positive for an actual negative value.
A Type II Error happens when the model makes an incorrect prediction of an actual
positive value as negative.

Regression
Regression in machine learning is a type of supervised learning task that involves predicting
a continuous output variable based on one or more input variables, also known as features
or predictors. The goal of regression is to learn a function that maps the input variables to a
continuous output variable, which can be used to make predictions on new, unseen data.

Regression models can take on different forms, depending on the type of function used to
model the relationship between the input variables and the output variable. Some common
types of regression models include linear regression, polynomial regression, and logistic
regression.

Linear regression is a simple and widely used regression technique that models the
relationship between the input variables and the output variable as a linear function. In
other words, the output variable is modeled as a weighted sum of the input variables, plus
an intercept term. The coefficients of the input variables are learned from the training data
using techniques such as ordinary least squares or gradient descent.
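For a single input variable, ordinary least squares has a closed-form solution, sketched below in plain Python. The function name fit_line and the small dataset are assumptions for illustration; with several input variables one would normally use a linear algebra library instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one input variable)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Made-up data lying exactly on the line y = 50 + 0.1 * x.
xs = [500, 1000, 1500, 2000]
ys = [100, 150, 200, 250]

intercept, slope = fit_line(xs, ys)
print(intercept, slope)
print(intercept + slope * 1200)  # prediction for a new input
```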

Polynomial regression is a type of regression that models the relationship between the input
variables and the output variable as a polynomial function. This can be useful when the
relationship between the variables is nonlinear, and a linear model is not sufficient to
capture the underlying pattern in the data.

Logistic regression is a type of regression that is used for classification tasks, where the
output variable is a categorical variable. Logistic regression models the relationship between
the input variables and the probability of the output variable belonging to a certain class,
using a logistic function.
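The logistic function maps a weighted sum of the inputs to a probability between 0 and 1. In the sketch below the weights and bias are arbitrary placeholder values rather than learned ones, and the 0.5 threshold is a common but assumed choice.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real z to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, weights, bias):
    """Estimated probability that x belongs to the positive class."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(z)


def predict_label(x, weights, bias, threshold=0.5):
    """Turn the probability into a class label using a threshold."""
    return 1 if predict_probability(x, weights, bias) >= threshold else 0


weights, bias = [2.0, -1.0], -0.5   # placeholder parameters, not fitted to data

print(predict_probability([1.0, 0.5], weights, bias))
print(predict_label([1.0, 0.5], weights, bias), predict_label([0.0, 1.0], weights, bias))
```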

Regression models can be evaluated using metrics such as mean squared error, mean
absolute error, R-squared, or root mean squared error, depending on the specific problem
and the characteristics of the data. Techniques such as cross-validation or regularization
can also be used to improve the performance of the model and prevent overfitting to the
training data.
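The regression metrics named above can be written directly from their definitions. This is a plain-Python sketch with made-up actual and predicted values; the function names are chosen here.

```python
import math


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(mean_squared_error(y_true, y_pred))


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0]   # made-up actual values
y_pred = [2.0, 5.0, 9.0]   # made-up predictions

print(mean_squared_error(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))
print(root_mean_squared_error(y_true, y_pred))
print(r_squared(y_true, y_pred))
```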

Interpolation and extrapolation


In regression, interpolation refers to the process of estimating a value of the output variable
for an input value that falls within the range of the input values that were used to train the
model. In other words, interpolation is the process of estimating the output variable for an
input value that is not present in the training data, but falls between two input values that
are present in the training data.

For example, consider a regression model that predicts the price of a house based on its
size in square feet. If the model is trained on a dataset that contains houses with sizes
ranging from 500 to 2000 square feet, interpolation would involve estimating the price of a
house with a size of 1500 square feet, which is within the range of input values used to
train the model.

Interpolation in regression is typically straightforward and can be performed using the
learned function or model. The model simply takes the input value and produces a
corresponding output value, based on the learned relationship between the input
and output variables.

In contrast, extrapolation refers to the process of estimating a value of the output variable
for an input value that falls outside the range of the input values that were used to train the
model. Extrapolation can be more challenging and can lead to less reliable predictions, as it
involves making predictions based on assumptions about the behavior of the model outside
the range of the training data.
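A simple guard can flag whether a query is an interpolation or an extrapolation before a prediction is trusted. The sketch below covers a single input variable; the function name query_type and the training sizes are assumptions based on the house-size example.

```python
def query_type(x, training_inputs):
    """Label a query as interpolation or extrapolation for one input variable."""
    lowest, highest = min(training_inputs), max(training_inputs)
    if lowest <= x <= highest:
        return "interpolation"
    return "extrapolation"


# House sizes (square feet) assumed to be in the training set.
training_sizes = [500, 800, 1200, 1600, 2000]

print(query_type(1500, training_sizes))   # inside the 500 to 2000 range
print(query_type(3000, training_sizes))   # outside the range: less reliable
```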
Model selection and Generalization
Model selection is the process of selecting the best model among a set of candidate models
for a given machine learning task. Model selection is an important step in the machine
learning pipeline, as it can have a significant impact on the accuracy and reliability of the
resulting model.
In model selection, a set of candidate models is typically trained and evaluated on
a validation set or through cross-validation. The performance of each model is compared
using one or more evaluation metrics, such as accuracy, precision, recall, F1 score, or
mean squared error. The model with the best performance on the validation set or through
cross-validation is selected as the final model.
There are several techniques that can be used for model selection, including:
1. Grid search: Grid search involves evaluating all possible combinations of
hyperparameters for each model in the candidate set. This can be computationally
expensive but can be effective for small sets of hyperparameters.
2. Random search: Random search involves randomly sampling hyperparameters for
each model in the candidate set. This can be more computationally efficient than
grid search and can be effective for high-dimensional hyperparameter spaces.
3. Bayesian optimization: Bayesian optimization is a more sophisticated approach
that uses a probabilistic model to select the most promising set of hyperparameters
to evaluate. This can be more computationally efficient than grid search or random
search and can be effective for complex hyperparameter spaces.
4. Ensemble methods: Ensemble methods involve combining multiple models to
produce a final prediction, such as bagging, boosting, or stacking. Ensemble
methods can be effective for improving the accuracy and robustness of the final
model, but can be more complex to implement and interpret.
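Grid search and random search differ only in how the candidate hyperparameter settings are generated. The sketch below makes that concrete. The function `validation_error` is a hypothetical stand-in for "train a model with these settings and measure its validation error", and the hyperparameter names `learning_rate` and `depth` are illustrative assumptions.

```python
import itertools
import random

def validation_error(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (learning_rate - 0.1) ** 2 + (depth - 4) ** 2 / 100

GRID = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

def grid_search(grid):
    """Evaluate every combination of the listed values."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        err = validation_error(**params)
        if best is None or err < best[0]:
            best = (err, params)
    return best

def random_search(n_trials, seed=0):
    """Evaluate n_trials randomly sampled settings."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "depth": rng.randint(1, 10),                # integer in [1, 10]
        }
        err = validation_error(**params)
        if best is None or err < best[0]:
            best = (err, params)
    return best

grid_err, grid_params = grid_search(GRID)
rand_err, rand_params = random_search(n_trials=20)
print("grid:  ", grid_params, grid_err)
print("random:", rand_params, rand_err)
```

Grid search costs one evaluation per grid cell (9 here), while random search costs exactly `n_trials` evaluations regardless of how many hyperparameters there are.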
In general, the choice of model selection technique depends on the specific problem and the
characteristics of the data. It is important to carefully evaluate the performance of each
model on a validation set or through cross-validation and to use appropriate techniques
and evaluation metrics to ensure that the final model produces reliable and accurate
predictions.

There are many evaluation metrics that can be used in model selection, and the choice of
metric depends on the specific problem and the characteristics of the data. Here are some
of the most common evaluation metrics used in model selection:
1. Accuracy: Accuracy is the proportion of correctly classified instances, or the
number of true positives and true negatives divided by the total number of
instances. Accuracy is commonly used for classification tasks.
2. Precision: Precision is the proportion of true positives among the instances
predicted as positive, or the number of true positives divided by the number of true
positives plus false positives. Precision is useful when the cost of false positives is
high.
3. Recall: Recall is the proportion of true positives among the instances that are
actually positive, or the number of true positives divided by the number of true
positives plus false negatives. Recall is useful when the cost of false negatives is
high.
4. F1 score: The F1 score is the harmonic mean of precision and recall, or 2 times the
product of precision and recall divided by the sum of precision and recall. The F1
score is a balanced measure that takes both precision and recall into account.
5. Mean squared error (MSE): The MSE is the average of the squared differences
between the predicted and actual values, or the sum of squared errors divided by
the number of instances. MSE is commonly used for regression tasks.
6. Root mean squared error (RMSE): The RMSE is the square root of the MSE, or the
square root of the sum of squared errors divided by the number of
instances. RMSE is useful for measuring the magnitude of the errors in the
predicted values.
7. R-squared (R²): The R² score is a measure of the proportion of variance in the
target variable that is explained by the model, or the ratio of the explained
variance to the total variance. R² is commonly used for regression tasks.
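The definitions in the list above translate directly into code. The following is a minimal sketch in plain Python; the function names and the small label vectors are illustrative assumptions, and R² is computed in the common form 1 - SS_res / SS_tot.

```python
import math

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

def regression_metrics(y_true, y_pred):
    """MSE, RMSE and R-squared from true and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # sum of squared errors
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)             # total variation
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}

clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(clf)
print(reg)
```

In the classification example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so accuracy, precision, recall, and F1 all equal 0.75.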

In general, it is important to carefully choose the appropriate evaluation metric for the
specific problem and to use appropriate techniques and evaluation methods to ensure that
the model selection process produces reliable and accurate predictions.

Generalization
Generalization is the ability of a machine learning model to perform well on new, unseen
data, beyond the data used to train the model. Model selection is closely related to
generalization, as the goal of model selection is to choose the best model that can generalize
well to new data.

In machine learning, the ultimate goal is to develop a model that can accurately and
reliably predict the output variable for new, unseen input data. To achieve this goal, it is
important to choose a model that is not only accurate on the training data, but also
generalizes well to new data.

Model selection helps to achieve this goal by evaluating the performance of different models
on a validation set or through cross-validation, and selecting the model that performs the
best on this set. By choosing the best model based on its performance on a separate
validation set, we can reduce the risk of overfitting to the training data and increase the
likelihood that the model will generalize well to new data.

However, it is important to note that model selection is just one aspect of achieving good
generalization in machine learning. Other important factors include data preprocessing,
feature selection or extraction, hyperparameter tuning, and regularization. By carefully
considering all of these factors, we can develop models that not only perform well on the
training data, but also generalize well to new, unseen data.

Ill-posed problem
An ill-posed problem in model selection refers to a problem where the data and the model
are insufficiently constrained, making it difficult or impossible to determine a unique
solution or to reliably evaluate the performance of the model.
In model selection, an ill-posed problem can arise when there are too many candidate
models or when the data is noisy, incomplete, or ambiguous. In such cases, it can be
difficult to select the best model or to accurately evaluate the performance of the models, as
many models that are complex or flexible enough may fit the available data equally well
while disagreeing on new data.
To address an ill-posed problem in model selection, it is important to carefully consider the
characteristics of the data and the models, and to use appropriate regularization
techniques to constrain the models and reduce their complexity. Regularization techniques
such as L1 regularization or L2 regularization penalize large parameter values (L1 can also
drive some parameters exactly to zero, reducing the number of active parameters). This helps
prevent overfitting to the training data and improves the model's ability to generalize to
new data.
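The effect of the two penalties is easiest to see in the simplest possible setting. The sketch below assumes a one-feature linear model y ≈ w·x with no intercept, an L2 objective ½Σ(y − wx)² + (λ/2)w², and an L1 objective ½Σ(y − wx)² + λ|w|. The data and the value of λ are illustrative assumptions.

```python
import math

def fit_unregularized(xs, ys):
    """Least squares: w = sum(x*y) / sum(x^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_l2(xs, ys, lam):
    """L2 (ridge) solution: the penalty enlarges the denominator and shrinks w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def fit_l1(xs, ys, lam):
    """L1 (lasso) solution: soft-threshold sum(x*y) by lam, then rescale."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    shrunk = math.copysign(max(abs(sxy) - lam, 0.0), sxy)
    return shrunk / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0]
ys = [0.1, 0.2, 0.3]          # exactly y = 0.1 * x

w_plain = fit_unregularized(xs, ys)
w_l2 = fit_l2(xs, ys, lam=2.0)
w_l1 = fit_l1(xs, ys, lam=2.0)
print(w_plain, w_l2, w_l1)
```

With this weak feature, the L2 penalty shrinks the weight but keeps it non-zero, while the L1 penalty sets it exactly to zero, which is why only L1 can be said to reduce the number of active parameters.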

In addition, it may be helpful to use techniques such as cross-validation or Bayesian model
averaging to evaluate the performance of the models and to estimate the uncertainty in the
model selection process. These techniques can help to reduce the impact of noise in the
data and to provide a more reliable estimate of the performance of the models.

In general, it is important to carefully consider the characteristics of the data and the
models, and to use appropriate techniques and evaluation metrics to ensure that the model
selection process produces reliable and accurate predictions, even in the presence of an ill-
posed problem.

Inductive Bias
The inductive bias of a learning algorithm refers to the set of assumptions or biases that the
algorithm makes about the relationship between the input variables and the output variable
in the data. The inductive bias guides the learning process by constraining the space of
possible hypotheses that the algorithm can consider and by prioritizing certain hypotheses
over others.

The inductive bias of a learning algorithm can be explicit or implicit. Explicit biases are
built into the algorithm through the choice of the learning algorithm or through the
selection of specific hyperparameters.

For example, a decision tree algorithm has an explicit bias towards simple decision trees,
while a neural network algorithm has an explicit bias towards smooth, continuous
functions.
Implicit biases, on the other hand, are inherent in the structure of the data and are not
explicitly built into the algorithm. For example, an implicit bias may arise from the
distribution of the input variables or from the structure of the output variable.

The choice of inductive bias can have a significant impact on the performance of the
learning algorithm, as it determines the set of hypotheses that the algorithm considers and
the way in which the algorithm generalizes to new data. A good inductive bias should be
able to capture the underlying patterns in the data while avoiding overfitting to the training
data.

In general, the choice of inductive bias depends on the specific problem and the
characteristics of the data. It is important to carefully consider the trade-offs between
simplicity and expressiveness, and to use appropriate techniques and evaluation metrics to
ensure that the learning algorithm produces reliable and accurate predictions.

Select A Model Based On Bias Value


In machine learning, bias is a measure of the systematic error or the tendency of a model to
consistently over- or under-predict the output variable. In general, a model with high bias is
too simple and may underfit the data, while a model with very low bias is usually more complex
and, if its variance is high, may overfit the data.

When selecting a model based on bias value, the goal is to find a model that has a balance
between bias and variance, which is the tendency of the model to vary significantly
depending on the specific training data used. A model with high variance may be too flexible
and may overfit the data, while a model with low variance may be too rigid and may
underfit the data.
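For squared-error loss, the balance between bias and variance can be written explicitly. As a sketch, assume the data are generated as y = f(x) + ε with zero-mean noise of variance σ² (this notation is introduced here and does not appear elsewhere in these notes). The expected error at a point x of a predictor f̂ learned from a random training set then decomposes as:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is taken over training sets and noise. Model selection can trade the first two terms against each other, but no choice of model removes the third.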

To select a model based on bias value, one approach is to use a validation set or cross-
validation to evaluate the performance of different models and to choose the model that has
the lowest overall error or the best trade-off between bias and variance.

Typically, a model with high bias will have a high error on both the training and
the validation sets, while a model with high variance will have a low error on the training
set but a high error on the validation set. Therefore, the goal is to choose a model that has a
low error on both the training and validation sets.

One useful technique for selecting a model based on bias value is to plot the learning curve,
which shows the error of the model as a function of the size of the training set. A model
with high bias will typically have a high error on both the training and validation sets, and
the two errors will converge to a similarly high value as the size of the training set increases.
A model with high variance, on the other hand, will typically have a low error on the training
set but a high error on the validation set, leaving a large gap between the two errors that
narrows only slowly as the size of the training set increases.
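A learning curve can be computed without any plotting library by recording the two errors at several training-set sizes. The sketch below is illustrative only. The data source (a noisy sine curve), the high-bias model (a constant predictor), and the high-variance model (1-nearest-neighbour) are assumptions chosen to make the two patterns visible.

```python
import math
import random

rng = random.Random(0)

def sample(n):
    """Draw n noisy points from y = sin(x) + noise (a hypothetical data source)."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def fit_mean(xs, ys):
    """High-bias model: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_1nn(xs, ys):
    """High-variance model: copies the label of the nearest training point."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

val_x, val_y = sample(200)      # fixed validation set
train_x, train_y = sample(320)  # pool from which growing training sets are taken

curve = {"mean": [], "1nn": []}
for n in (5, 20, 80, 320):
    sub_x, sub_y = train_x[:n], train_y[:n]
    for name, fit in (("mean", fit_mean), ("1nn", fit_1nn)):
        model = fit(sub_x, sub_y)
        curve[name].append((n, mse(model, sub_x, sub_y), mse(model, val_x, val_y)))

for name, points in curve.items():
    for n, train_err, val_err in points:
        print(f"{name:>4} n={n:<3} train={train_err:.3f} val={val_err:.3f}")
```

The constant predictor shows similar, high training and validation errors at every size. The nearest-neighbour model always has zero training error, and its validation error stays above its training error by a gap that reflects its variance.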

By analyzing the learning curve, it is possible to identify the point at which the error of the
model converges or plateaus and to choose the model that has the lowest error at this point.
This can help to select a model that has a good balance between bias and variance and that
is likely to generalize well to new, unseen data.

Triple Trade-off in ML
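The triple trade-off in machine learning refers to the trade-offs between three important
factors that affect the performance of a machine learning model: bias, variance, and model
complexity. (The same term is also used, following Dietterich, for the trade-off among the
complexity of the hypothesis class, the amount of training data, and the generalization error
on new examples.)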

Bias refers to the tendency of a model to consistently make incorrect assumptions about
the relationship between the input variables and the output variable. Models with high
bias are typically too simple and may underfit the data.

Variance refers to the tendency of a model to vary significantly depending on the specific
training data used. Models with high variance are typically too complex and may overfit the
data.

Model complexity refers to the number of parameters or the degree of flexibility of the
model. Models with high complexity are typically more flexible and may have a higher
capacity to capture complex patterns in the data, but may also be more prone to
overfitting.

The triple trade-off arises because increasing one factor may come at the expense of the
other two factors. For example, increasing the complexity of the model may reduce bias and
improve the ability of the model to capture complex patterns in the data, but may also
increase variance and make the model more prone to overfitting.

Similarly, reducing the complexity of the model may reduce variance and improve the ability
of the model to generalize to new data, but may also increase bias and make the model too
simple to capture the underlying patterns in the data.

To achieve good performance in machine learning, it is important to find a balance between
bias, variance, and model complexity. This can be achieved through techniques such as
regularization, cross-validation, and model selection, which aim to reduce the impact of
overfitting and to select the best model that has the right balance between bias and
variance for the specific problem and the characteristics of the data.

Dimensions of a Supervised Machine Learning Algorithm


There are several dimensions of a supervised machine learning algorithm, each of which
can be adjusted to improve the performance of the algorithm for a specific problem. These
dimensions include:
1. Input representation: This refers to the way in which the input data is represented
and encoded for the learning algorithm. For example, the input data may be represented as
raw text, images, or numerical features.
2. Feature selection/extraction: This refers to the process of selecting or extracting the
most relevant features from the input data to improve the performance of the learning
algorithm. Feature selection can help to reduce the dimensionality of the input data and
improve the efficiency of the learning algorithm, while feature extraction can help to capture
complex patterns in the data.
3. Model selection: This refers to the choice of the learning algorithm or model used to
make predictions based on the input data. There are many different types of models,
including linear regression, decision trees, neural networks, and support vector machines,
each with its own strengths and weaknesses.
4. Hyperparameter tuning: This refers to the process of selecting the optimal values for
the hyperparameters of the learning algorithm, such as the learning rate, regularization
strength, or the number of hidden layers in a neural network. Hyperparameter tuning can
help to improve the performance of the learning algorithm and reduce the risk of overfitting
to the training data.
5. Loss function: This refers to the objective function that the learning algorithm tries
to optimize during training. The choice of loss function depends on the specific problem and
the characteristics of the data, and can have a significant impact on the performance of the
learning algorithm.
6. Evaluation metrics: This refers to the metrics used to evaluate the performance of
the learning algorithm on the validation or test data. Common evaluation metrics include
accuracy, precision, recall, F1 score, mean squared error, and R-squared.
7. Assumptions about the data: Training examples are usually assumed to be independent and
identically distributed (i.i.d.), that is, a set of random variables that are independent of
each other and have the same probability distribution.
Independence means that the value of one variable does not depend on the value of any
other variable in the set.
For example, the outcomes of flipping a coin are independent of each other, as the result of
one flip does not affect the result of any other flip.
Identically distributed means that all the variables in the set have the same probability
distribution. For example, if we flip a fair coin multiple times, each flip has a 50% chance of
resulting in heads and a 50% chance of resulting in tails. The probability distribution of
each flip is the same, even though the specific outcome may be different.
In machine learning, the assumption of i.i.d. data is often used to simplify the analysis and to
justify the expectation that the model will generalize to new, unseen data. Independence means
that each example contributes its own information about the underlying distribution, while
identical distribution means that the training data and new data come from the same
distribution, so patterns learned from one carry over to the other.
However, it is important to note that not all data is independent and identically distributed.
In some cases, the data may be correlated or may have different distributions in different
subsets. In such cases, it is important to carefully consider the characteristics of the data
and to use appropriate techniques and models that can handle the specific types of
dependencies and distributions in the data.

By adjusting these dimensions, it is possible to fine-tune the supervised learning
algorithm to achieve the best possible performance for the specific problem and the
characteristics of the data.

Decision Tree Learning

Decision Tree
A decision tree is a flowchart-like structure in which
– each internal node represents a "test" on an attribute
– each branch represents the outcome of the test
– each leaf node represents a class label (decision taken after computing all attributes).
• The paths from root to leaf represent classification rules.
Introduction:
Decision Tree Learning
• Decision tree learning is a method for approximating discrete-valued target functions, in
which the learned function is represented by a decision tree.
• Learned trees can also be re-represented as sets of if-then rules to improve human
readability.
• These learning methods are among the most popular inductive inference algorithms and
have been successfully applied to a broad range of tasks, from learning to diagnose
medical cases to learning to assess credit risk of loan applicants.
• Widely used algorithms are ID3, ASSISTANT, and C4.5.
• These decision tree learning methods search a completely expressive hypothesis space
and thus avoid the difficulties of restricted hypothesis spaces.
• Their inductive bias is a preference for small trees over large trees.
DECISION TREE REPRESENTATION
• A decision tree classifies instances by sorting them down the tree from the root to some
leaf node, which provides the classification of the instance.
• Each node in the tree specifies a test of some attribute of the instance.
• Each branch descending from that node corresponds to one of the possible values for this
attribute.

• An instance is classified by – starting at the root node of the tree, – testing the attribute
specified by this node, – then moving down the tree branch corresponding to the value of
the attribute in the given example.
• This process is then repeated for the subtree rooted at the new node.

• In general, decision trees represent a disjunction of conjunctions of constraints on the
attribute values of instances.
• Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and
the tree itself to a disjunction of these conjunctions.

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
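The correspondence between a tree and its disjunction of conjunctions can be checked mechanically. The sketch below assumes the tree behind the expression above has Outlook at the root, with Humidity tested under Sunny and Wind tested under Rain. The attribute values "High" and "Strong" for the negative branches are assumptions, as is the nested-dictionary encoding of the tree.

```python
import itertools

# Nested-dict encoding (an assumption): {attribute: {value: subtree or label}}.
TREE = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, instance):
    """Sort an instance down the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                 # attribute tested at this node
        tree = tree[attribute][instance[attribute]]  # follow the matching branch
    return tree

def expression(instance):
    """The disjunction of conjunctions describing the positive leaves."""
    o, h, w = instance["Outlook"], instance["Humidity"], instance["Wind"]
    return ((o == "Sunny" and h == "Normal")
            or o == "Overcast"
            or (o == "Rain" and w == "Weak"))

# Enumerate every combination of attribute values and compare the two forms.
instances = [
    {"Outlook": o, "Humidity": h, "Wind": w}
    for o, h, w in itertools.product(
        ["Sunny", "Overcast", "Rain"], ["High", "Normal"], ["Strong", "Weak"])
]
agree = all((classify(TREE, x) == "Yes") == expression(x) for x in instances)
print(agree)
```

Each conjunction in `expression` is one root-to-leaf path ending in "Yes", so the tree and the Boolean expression agree on all twelve instances.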

Decision tree learning is generally best suited to problems with the following characteristics:
1. Instances are represented by attribute-value pairs.
– Instances are described by a fixed set of attributes and their values.
– Each attribute can take
• a small number of disjoint possible values, or
• real values.
2. The target function has discrete output values.
– The decision tree generally assigns a boolean classification (e.g., yes or no) to each
example.
– The target function can have more than two possible output values.
– Real-valued outputs are also possible (though the application of decision trees in this
setting is less common).
3. Disjunctive descriptions may be required.
– As noted above, decision trees naturally represent disjunctive expressions.
4. The training data may contain errors.
– Decision tree learning methods are robust to errors, both errors in classifications of the
training examples and errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values.
– Decision tree methods can be used even when some training examples have
unknown values
(e.g., if the Humidity of the day is known for only some of the training examples).

• Classification problems:
– Problems in which the task is to classify examples into one of a discrete set of possible
categories are often referred to as classification problems.
• Decision tree learning has therefore been applied to problems such as
– classifying medical patients by their disease
– classifying equipment malfunctions by their cause
– classifying loan applicants by their likelihood of defaulting on payments

THE BASIC DECISION TREE LEARNING ALGORITHM
• ID3 algorithm
– Iterative Dichotomiser 3
– algorithm invented by Ross Quinlan

ENTROPY MEASURES HOMOGENEITY OF EXAMPLES


In order to define information gain precisely, we begin by defining a measure commonly
used in information theory, called entropy, that characterizes the impurity of an arbitrary
collection of examples. Given a collection S, containing positive and negative examples of
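some target concept, the entropy of S relative to this Boolean classification is

$$\mathrm{Entropy}(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$$

where p_+ is the proportion of positive examples in S, p_- is the proportion of negative
examples in S, and 0 log_2 0 is taken to be 0.

If the target attribute can take on c different values, then the entropy of S relative to this c-
wise classification is defined as

$$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

where p_i is the proportion of S belonging to class i.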
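Entropy is simple to compute from class counts. The sketch below uses the c-class definition, Entropy(S) = Σ −p_i log2 p_i, where p_i is the proportion of examples in class i. The function name and the example collections are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum_i -p_i * log2(p_i) over the classes present in S."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

mixed = entropy(["+"] * 9 + ["-"] * 5)   # 9 positive and 5 negative examples
pure = entropy(["+"] * 14)               # all examples in one class
even = entropy(["+"] * 7 + ["-"] * 7)    # classes equally represented
print(round(mixed, 3), pure, even)
```

Entropy is 0 when all examples belong to one class and 1 when a Boolean collection is split evenly. The 9-to-5 collection lies in between, at about 0.940.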

INFORMATION GAIN MEASURES THE EXPECTED REDUCTION IN ENTROPY


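Information gain is the expected reduction in entropy caused by partitioning the examples
according to an attribute. More precisely, the information gain, Gain(S, A) of an attribute A,
relative to a collection of examples S, is defined as

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$$

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S
for which attribute A has value v.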
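Information gain can be computed as the entropy of S minus the weighted entropies of the subsets S_v produced by splitting on A, that is, Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) Entropy(S_v). The sketch below is an illustration: the collection of 14 examples and the attribute and target names "Wind" and "Play" are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum_i -p_i * log2(p_i) over the classes present in S."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    gain = entropy([e[target] for e in examples])
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# Illustrative collection S = [9+, 5-]:
# Wind = Weak covers [6+, 2-] and Wind = Strong covers [3+, 3-].
S = ([{"Wind": "Weak", "Play": "Yes"}] * 6 + [{"Wind": "Weak", "Play": "No"}] * 2
     + [{"Wind": "Strong", "Play": "Yes"}] * 3 + [{"Wind": "Strong", "Play": "No"}] * 3)

gain_wind = information_gain(S, "Wind", "Play")
print(round(gain_wind, 3))
```

Here the gain is about 0.048, so splitting on this attribute alone reduces the entropy of S only slightly.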

HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
ID3 can be characterized as searching a space of hypotheses for one that fits the training
examples.
The hypothesis space searched by ID3 is the set of possible decision trees.
ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space,
beginning with the empty tree, then considering progressively more elaborate hypotheses in
search of a decision tree that correctly classifies the training data.
The evaluation function that guides this hill-climbing search is the information gain
measure.

• ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued
functions, relative to the available attributes.
• ID3 maintains only a single current hypothesis as it searches through the space of
decision trees.
• ID3 in its pure form performs no backtracking in its search.
• ID3 uses all training examples at each step in the search to make statistically based
decisions regarding how to refine its current hypothesis.
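The greedy, no-backtracking search can be sketched compactly. The code below is a minimal illustration of ID3, not a full implementation: it handles only discrete attributes and does no pruning. The training data is the 14-example PlayTennis table from Mitchell's textbook, which is included here as an assumption because the table does not appear in this section, and the nested-dictionary tree encoding is likewise illustrative.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
RAW = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
rows = [dict(zip(COLUMNS, line.split())) for line in RAW.splitlines()]

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    gain = entropy([r[target] for r in rows])
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

def id3(rows, attributes, target):
    """Grow a tree top-down, choosing the highest-gain attribute at each node."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # all examples agree: make a leaf
        return labels[0]
    if not attributes:                   # nothing left to test: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in {r[best] for r in rows}:  # one branch per observed value
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree                           # the choice of `best` is never revisited

tree = id3(rows, COLUMNS[:-1], "PlayTennis")
print(tree)
```

Once an attribute is chosen at a node, it is fixed: the recursion only refines the subtrees below it. This reflects the single-hypothesis, hill-climbing character described in the points above.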

INDUCTIVE BIAS IN DECISION TREE LEARNING

ISSUES IN DECISION TREE LEARNING


Avoiding Overfitting the Data
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data
if there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h' over
the training examples, but h' has a smaller error than h over the entire distribution of
instances.

REDUCED ERROR PRUNING

RULE POST-PRUNING

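In rule post-pruning, one rule is generated for each leaf node in the tree. Each attribute test
along the path from the root to the leaf becomes a rule antecedent (precondition) and the
classification at the leaf node becomes the rule consequent (postcondition). For example,
assuming Humidity takes the values High and Normal, the path through Outlook = Sunny and
Humidity = High in the tree described earlier is translated into the rule
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No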
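The rule-generation step can be illustrated with a short sketch that walks every root-to-leaf path of a tree and emits one rule per leaf. The nested-dictionary tree, the attribute values "High" and "Strong", and the target name PlayTennis are assumptions used for illustration. The subsequent pruning of rule preconditions is not shown.

```python
# Nested-dict encoding (an assumption): {attribute: {value: subtree or label}}.
TREE = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def tree_to_rules(tree, conditions=()):
    """Return one (antecedent, consequent) pair per leaf of the tree."""
    if not isinstance(tree, dict):           # leaf: emit the accumulated path
        return [(list(conditions), tree)]
    attribute = next(iter(tree))             # attribute tested at this node
    rules = []
    for value, subtree in tree[attribute].items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

rules = tree_to_rules(TREE)
for antecedent, label in rules:
    tests = " AND ".join(f"({a} = {v})" for a, v in antecedent)
    print(f"IF {tests} THEN PlayTennis = {label}")
```

Each rule's antecedent lists the attribute tests on one path, and its consequent is the label at the leaf, so a tree with five leaves yields five rules.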

Incorporating Continuous-Valued Attributes
