ML Unit-1
Introduction to Machine Learning:
A rapidly developing field of technology, machine learning allows computers to
automatically learn from previous data. For building mathematical models and making
predictions based on historical data or information, machine learning employs a variety of
algorithms. It is currently being used for a variety of tasks, including speech recognition,
email filtering, auto-tagging on Facebook, recommender systems, and image recognition.
A subset of artificial intelligence known as machine learning focuses primarily on the
creation of algorithms that enable a computer to independently learn from data and
previous experiences.
Without being explicitly programmed, machine learning enables a machine to automatically
learn from data, improve performance from experiences, and predict things.
Machine learning algorithms create a mathematical model that, without being explicitly
programmed, aids in making predictions or decisions with the assistance of sample
historical data, or training data. For the purpose of developing predictive models, machine
learning brings together statistics and computer science. Algorithms that learn from
historical data are either constructed or utilized in machine learning.
Let's say we have a complex problem in which we need to make predictions. Instead of
writing code, we just need to feed the data to generic algorithms, which build the logic
based on the data and predict the output. Our perspective on the issue has changed as a result of machine learning.
1. Learning Associations
Association rule mining finds interesting associations and relationships among large sets of
data items. This rule shows how frequently an itemset occurs in a transaction. A typical
example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show
associations between items. It allows retailers to identify relationships between the items
that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction.
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
Before we start defining the rule, let us first see the basic definitions.
Support Count (σ) – the frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2.
Frequent Itemset – An itemset whose support is greater than or equal to minsup threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y are any 2
itemsets.
Example: {Milk, Diaper}->{Beer}
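For reference, the standard measures attached to a rule X -> Y can be summarized as follows; these formulas are a brief reminder of the usual definitions (they are not stated elsewhere in these notes), with N denoting the total number of transactions:
Support(X -> Y) = σ(X ∪ Y) / N
Confidence(X -> Y) = σ(X ∪ Y) / σ(X)
Lift(X -> Y) = Confidence(X -> Y) / Support(Y)
For the example rule {Milk, Diaper} -> {Beer} in the table above, support = 2/5 = 0.4 and confidence = 2/3 ≈ 0.67.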
Association rules are very useful for analyzing such datasets. The data is collected using
bar-code scanners in supermarkets. Such databases consist of a large number of transaction
records, each listing all items bought by a customer in a single purchase. The manager can
then see whether certain groups of items are consistently purchased together and use this
data for adjusting store layouts, cross-selling, and promotions.
The Apriori algorithm is also used in association rule mining for discovering frequent itemsets
in the transactions database. It was proposed by Agrawal and Srikant in 1994.
Exercise:
A customer does 4 transactions with you. In the first transaction, she buys 1 apple, 1 beer,
1 rice, and 1 chicken. In the second transaction, she buys 1 apple, 1 beer, and 1 rice. In the
third transaction, she buys 1 apple and 1 beer only. In the fourth transaction, she buys
1 apple and 1 orange.
Support(Apple) = 4/4 = 1, Support(Beer) = 3/4, Support(Rice) = 2/4, and Support(Beer, Rice) = 2/4.
Confidence(Beer -> Rice) = Support(Beer, Rice) / Support(Beer) = (2/4) / (3/4) ≈ 0.67.
Lift(Beer -> Rice) = Confidence(Beer -> Rice) / Support(Rice) = 0.67 / 0.5 ≈ 1.33.
Since the Lift value is greater than 1, Rice is likely to be bought if Beer is bought.
The Dataset
The Market Basket dataset consists of 15010 observations with Date, Time, Transaction, and
Item columns. The Date column ranges from 30/10/2016 to 09/04/2017. Time is a categorical
variable that records the time of the transaction. Transaction is a quantitative variable that
identifies individual transactions. Item is a categorical variable naming the purchased product.
# Installing Packages (only needed once)
install.packages("arules")
install.packages("arulesViz")
# Loading packages
library(arules)
library(arulesViz)
# Loading data (read.transactions() comes from arules, so load the package first)
dataset = read.transactions('C:/Users/admin/Documents/Market_Basket_Optimisation.csv',
                            sep = ',', rm.duplicates = TRUE)
# Structure
str(dataset)
# Fitting model
# Training Apriori on the dataset
set.seed(220) # Setting seed
associa_rules = apriori(data = dataset,
                        parameter = list(support = 0.004, confidence = 0.2))
# Plot: frequency of the 10 most common items
itemFrequencyPlot(dataset, topN = 10)
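Once the rules are mined, they can be inspected and visualized with the arules and arulesViz functions loaded above; a minimal sketch (the choice to sort by lift and show at most ten rules is arbitrary):
# Inspect the mined rules, ordered by lift
rules_by_lift = sort(associa_rules, by = 'lift')
inspect(rules_by_lift[1:min(10, length(rules_by_lift))])
# Visualize the rules as a graph (arulesViz)
plot(associa_rules, method = "graph")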
Supervised Learning
Supervised learning is a type of machine learning in which machines are trained using
well-labeled training data, and on the basis of that data, machines predict the output.
Labeled data means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as a supervisor
that teaches the machines to predict the output correctly. It applies the same concept as a
student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to
the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
Supervised learning problems are commonly grouped into classification and regression
tasks, which are discussed below.
2. Classification
Classification is a task in data mining that involves assigning a class label to each instance
in a dataset based on its features.
There are two main types of classification: binary classification and multi-class
classification.
Binary classification involves classifying instances into two classes, such as “spam” or “not
spam”, while multi-class classification involves classifying instances into more than two
classes.
The process of building a classification model typically involves the following steps:
i. Data preparation: This step involves cleaning and pre-processing the data, such as
removing missing values and transforming the data into a format that can be used
by the classification algorithm.
ii. Model selection: This step involves choosing an appropriate classification algorithm
based on the characteristics of the data and the desired outcome. Common
algorithms include decision trees, k-nearest neighbors, and support vector
machines.
iii. Model training: This step involves using the training data to train the classification
algorithm and build the model. The model is trained by adjusting its parameters to
minimize the difference between the predicted class labels and the actual class
labels.
iv. Model evaluation: This step involves evaluating the performance of the
classification model on a test dataset that is separate from the training data. This
can be done by calculating metrics such as accuracy, precision, recall, and F1-
score.
v. Model deployment: This step involves deploying the classification model in a
production environment, where it can be used to make predictions on new
instances.
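As an illustration of steps (i)-(v), here is a minimal sketch in R that trains and evaluates a decision tree classifier on the built-in iris data; the 70/30 split, the rpart package, and the seed are arbitrary choices made here for illustration, not part of the notes above.
# Classification sketch: train a decision tree and evaluate its accuracy
library(rpart)
data(iris)
set.seed(42)
# (i) Data preparation: split into training and test sets (70/30)
idx = sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train = iris[idx, ]
test = iris[-idx, ]
# (ii)-(iii) Model selection and training: a decision tree classifier
model = rpart(Species ~ ., data = train, method = "class")
# (iv) Model evaluation: accuracy on the held-out test set
pred = predict(model, test, type = "class")
mean(pred == test$Species)
# (v) Deployment: the fitted model can now classify new instances
predict(model, test[1, ], type = "class")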
Example:
Classification
■ Example: Credit scoring
■ Differentiating between low-risk and high-risk customers from their income and
savings
Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
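This IF-THEN discriminant can be written directly as a small function; a sketch in R where the threshold values θ1 and θ2 are hypothetical numbers chosen only for illustration:
# Credit-scoring discriminant; theta1 and theta2 are hypothetical thresholds
classify_customer = function(income, savings, theta1 = 30000, theta2 = 10000) {
  if (income > theta1 && savings > theta2) "low-risk" else "high-risk"
}
classify_customer(income = 45000, savings = 15000)  # "low-risk"
classify_customer(income = 25000, savings = 20000)  # "high-risk"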
Classification: Applications
■ Aka Pattern recognition
■ Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style
■ Character recognition: Different handwriting styles.
■ Speech recognition: Temporal dependency; use of a dictionary or the syntax of the language.
■ Sensor fusion: Combine multiple modalities; e.g., visual (lip image) and acoustic for speech.
■ Medical diagnosis: From symptoms to illnesses
■ Biometrics
Knowledge Extraction
Learning a rule from data also allows knowledge extraction.
The extracted rule is a simple model that explains the data, and looking at this model we
have an explanation of the process underlying the data.
For example, once we learn the discriminant separating low-risk and high-risk customers,
we have the knowledge of the properties of low-risk customers.
We can then use this information to target potential low-risk customers more efficiently, for
example, through advertising.
3. Regression analysis
Regression analysis is a statistical method for modeling the relationship between a dependent
(target) variable and one or more independent (predictor) variables. More specifically,
Regression analysis helps us to understand how the value of the dependent variable is
changing corresponding to an independent variable when other independent variables are
held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A that runs various advertising campaigns
every year and records the resulting sales. The company has the advertising spend and the
corresponding sales for the last 5 years.
In regression, we plot a graph between the variables that best fits the given data points;
using this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through all the datapoints on
target-predictor graph in such a way that the vertical distance between the
datapoints and the regression line is minimum." The distance between datapoints and
line tells whether a model has captured a strong relationship or not.
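As a concrete sketch of fitting such a regression line in R, the advertising and sales figures below are made-up numbers used only to illustrate the mechanics; they are not the company data referred to above.
# Simple linear regression on hypothetical advertising/sales data
advertising = c(90, 120, 150, 100, 130)   # hypothetical spend
sales = c(1000, 1300, 1800, 1200, 1380)   # hypothetical sales
fit = lm(sales ~ advertising)
summary(fit)
# Predict sales for a new advertising budget
predict(fit, newdata = data.frame(advertising = 200))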
Types of Regression
There are various types of regressions which are used in data science and machine
learning. Each type has its own importance on different scenarios, but at the core, all the
regression methods analyze the effect of the independent variable on dependent variables.
Here we are discussing some important types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Unsupervised Learning
Unsupervised learning is a machine learning technique in which models are not supervised
using a labeled training dataset. Instead, the model itself finds the hidden patterns and insights from the given data. It
can be compared to learning which takes place in the human brain while learning new
things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.
Here, we take unlabeled input data, which means it is not categorized and the
corresponding outputs are also not given. This unlabeled input data is fed to the
machine learning model in order to train it.
First, the model interprets the raw data to find hidden patterns and then applies a
suitable algorithm, such as k-means clustering or hierarchical clustering.
Once a suitable algorithm is applied, it divides the data objects into groups
according to the similarities and differences between the objects.
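A minimal sketch of this in R using k-means on the built-in iris measurements; the class labels are dropped so the data is effectively unlabeled, and the choice of 3 clusters is an assumption made only for illustration.
# k-means clustering on unlabeled data (iris measurements without the labels)
data(iris)
x = iris[, 1:4]          # numeric features only; the labels are not used
set.seed(42)
km = kmeans(x, centers = 3, nstart = 20)
# Cluster sizes and centers found purely from similarities in the data
km$size
km$centers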
In document clustering, the aim is to group similar documents. For example, news reports
can be subdivided as those related to politics, sports, fashion, arts, and so on. Commonly, a
document is represented as a bag of words—that is, we predefine a lexicon of N words, and
each document is an N-dimensional binary vector whose element i is 1 if word i appears in
the document; suffixes “–s” and “–ing” are removed to avoid duplicates and words such as
“of,” “and,” and so forth, which are not informative, are not used. Documents are then
grouped depending on the number of shared words. It is of course critical how the lexicon is
chosen.
Reinforcement Learning
Reinforcement learning is a feedback-based learning technique in which an agent learns by
interacting with its environment, receiving rewards for good actions and penalties for bad
ones, with the goal of maximizing the total reward.
Applications of Machine Learning
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. The popular use case of image
recognition and face detection is, Automatic friend tagging suggestion:
Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a
photo with our Facebook friends, we automatically get a tagging suggestion with the name,
and the technology behind this is machine learning's face detection and recognition
algorithm.
2. Speech Recognition
While using Google, we get an option of "Search by voice," it comes under speech
recognition, and it's a popular application of machine learning.
Speech recognition is a process of converting voice instructions into text, and it is also
known as "Speech to text", or "Computer speech recognition."
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct
path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for
a product on Amazon, we then start getting advertisements for the same product while
surfing the internet in the same browser, and this is because of machine learning.
Let us say we want to learn the class, C, of a “family car.” We have a set of examples of
cars, and we have a group of people that we survey to whom we show these cars. The
people look at the cars and label them: the cars that they believe are family cars are
positive examples, and the other cars are negative examples.
Class learning is finding a description that is shared by all the positive examples and none
of the negative examples.
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output: Positive (+) and negative (–) examples
Input representation: x1 = price, x2 = engine power
Figure: Training set for the class of a "family car." Each data point corresponds to one example
car, and the coordinates of the point indicate the price and engine power of that car. '+' denotes
a positive example of the class (a family car), and '−' denotes a negative example (not a
family car, but some other type of car).
Let us denote price as the first input attribute x1 (e.g., in U.S. dollars) and engine power as
the second attribute x2 (e.g., engine volume in cubic centimeters). Thus we represent each
car using two numeric values:
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
and its label denotes its type:
$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is a positive example} \\ 0 & \text{if } \mathbf{x} \text{ is a negative example} \end{cases}$$
The training set is then
$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$
Each car is represented by such an ordered pair (x,r) and the training set contains N such
examples.
where t indexes different examples in the set; it does not represent time or any such order.
Our training data can now be plotted in the two-dimensional (x1, x2) space, where each
instance t is a data point at coordinates (x1^t, x2^t) and its type, namely positive versus
negative, is given by r^t.
Equation 2.4 defines this hypothesis class: (p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2).
Figure: Example of a hypothesis class. The class of family car is a rectangle in the
price-engine power space.
Equation 2.4 fixes H, the hypothesis class from which we believe C is drawn, namely, the
set of rectangles. The learning algorithm then finds the particular hypothesis, h ∈ H,
specified by a particular quadruple of (p1^h, p2^h, e1^h, e2^h), to approximate C as closely as
possible.
Though the expert defines this hypothesis class, the values of the parameters are not
known; that is, though we choose H, we do not know which particular h ∈ H is equal, or
closest, to C. But once we restrict our attention to this hypothesis class, learning the class
reduces to the easier problem of finding the four parameters that define h. The aim is to
find h ∈ H that is as similar as possible to C. Let us say the hypothesis h makes a
prediction for an instance x such that
$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ says } \mathbf{x} \text{ is positive} \\ 0 & \text{if } h \text{ says } \mathbf{x} \text{ is negative} \end{cases}$$
In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x).
What we have is the training set X, which is a small subset of the set of all possible x.
The empirical error is the proportion of training instances where the predictions of h do not
match the required values given in X. The error of hypothesis h given the training set X is
$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} 1\big(h(\mathbf{x}^t) \neq r^t\big)$$
where 1(a ≠ b) is 1 if a ≠ b and is 0 if a = b.
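As a small illustration of this computation, the sketch below evaluates a rectangle hypothesis on a toy training set in R; the rectangle corners and the data points are made-up values used only to show how the empirical error is obtained.
# A rectangle hypothesis h: predicts 1 inside the rectangle, 0 outside
h = function(x1, x2, p1, p2, e1, e2) {
  as.integer(x1 >= p1 & x1 <= p2 & x2 >= e1 & x2 <= e2)
}
# Toy training set (price x1, engine power x2) with labels r -- made-up values
x1 = c(12, 15, 18, 25, 9)
x2 = c(1.4, 1.6, 1.8, 3.0, 1.0)
r = c(1, 1, 1, 0, 0)
# A hypothetical rectangle (p1, p2, e1, e2)
pred = h(x1, x2, p1 = 10, p2 = 20, e1 = 1.2, e2 = 2.0)
sum(pred != r)   # number of training instances h gets wrong
mean(pred != r)  # the same count as a proportion of the N training instances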
Note that if x1 and x2 are real-valued, there are infinitely many such h for which this is
satisfied, namely, for which the error, E, is 0, but given a future example somewhere close
to the boundary between positive and negative examples, different candidate hypotheses
may make different generalization predictions. This is the problem of generalization — that
is, how well our hypothesis will correctly classify future examples that are not part of the
training set.
One possibility is to find the most specific hypothesis, S, that is, the tightest rectangle that
includes all the positive examples and none of the negative examples (see figure 2.4). This
gives us one hypothesis, h = S, as our induced class.
Note that the actual class C may be larger than S but is never smaller.
The most general hypothesis, G, is the largest rectangle we can draw that includes all the
positive examples and none of the negative examples (figure 2.4).
Any h ∈ H between S and G is a valid hypothesis with no error, said to be consistent with
the training set, and such h make up the version space.
Given another training set, S, G, the version space, the parameters, and thus the learned
hypothesis h can be different.
C is the actual class and h is our induced hypothesis. The point where C is 1 but h is 0 is a
false negative, and the point where C is 0 but h is 1 is a false positive. Other points—
namely, true positives and true negatives—are correctly classified.
Actually, depending on X and H, there may be several S i and Gj which respectively make up
the S-set and the G-set. Every member of the S-set is consistent with all the instances, and
there are no consistent hypotheses that are more specific. Similarly, every member of the G-
set is consistent with all the instances, and there are no consistent hypotheses that are
more general.
Given X, we can find S, or G, or any h from the version space and use it as our hypothesis,
h. It seems intuitive to choose h halfway between S and G; this is to increase the margin,
which is the distance between the boundary and the instances closest to it.
We choose the hypothesis with the largest margin, for best separation. The shaded
instances are those that define (or support) the margin; other instances can be removed
without affecting h.
In some applications, a wrong decision may be very costly, and in such a case we can say
that any instance that falls in between S and G is a case of doubt, which we cannot label
with certainty due to lack of data. In such a case, the system rejects the instance and
defers the decision to a human expert.
A version space is a hierarchical representation of knowledge that enables you to keep track
of all the useful information supplied by a sequence of learning examples without
remembering any of the examples.
Fundamental Assumptions
1. The data is correct; there are no erroneous instances.
2. A correct description is a conjunction of some of the attributes with values.
Let us say we have a dataset containing N points. These N points can be labeled in 2^N ways
as positive and negative. Therefore, 2^N different learning problems can be defined by N data
points. If for any of these problems, we can find a hypothesis h∈H that separates the
positive examples from the negative, then we say H shatters N points. That is, any learning
problem definable by N examples can be learned with no error by a hypothesis
drawn from H. The maximum number of points that can be shattered by H is called the
Vapnik-Chervonenkis (VC) dimension of H, is denoted as VC(H), and measures the capacity
of H.
In figure 2.6, we see that an axis-aligned rectangle can shatter four points in two
dimensions. Then VC(H), when H is the hypothesis class of axis-aligned rectangles in two
dimensions, is four. In calculating the VC dimension, it is enough that we find four points
that can be shattered; it is not necessary that we be able to shatter any four points in two
dimensions.
VC dimension may seem pessimistic. It tells us that using a rectangle as our hypothesis
class, we can learn only datasets containing four points and not more.
Using the inequality (1 − x) ≤ exp[−x], if we choose N and δ such that
4 exp[−ϵN/4] ≤ δ,
then we also have 4(1 − ϵ/4)^N ≤ δ. Dividing both sides by 4, taking the (natural) log, and
rearranging terms, we have
N ≥ (4/ϵ) log(4/δ)   (Equation 2.7)
Therefore, provided that we take at least (4/ϵ) log(4/δ) independent examples from C and
use the tightest rectangle as our hypothesis h, with confidence probability at least 1 − δ, a
given point will be misclassified with error probability at most ϵ. We can have arbitrarily large
confidence by decreasing δ and arbitrarily small error by decreasing ϵ, and we see in
equation 2.7 that the number of examples is a slowly growing function of 1/ϵ and 1/δ,
linear and logarithmic, respectively.
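As a quick numeric illustration of this bound (the values ϵ = 0.1 and δ = 0.05 are arbitrary choices, not taken from the text above):
N ≥ (4/0.1) log(4/0.05) = 40 log(80) ≈ 175.3
so roughly 176 independent examples suffice to guarantee error at most 0.1 with confidence at least 0.95 under these assumptions.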
There may be additional attributes, which we have not taken into account, that affect the
label of an instance. Such attributes may be hidden or latent in that they may be
unobservable. The effect of these neglected attributes is thus modelled as a random
component and is included in “noise.”
As can be seen in figure 2.8, when there is noise, there is not a simple boundary between
the positive and negative instances and to separate them, one needs a complicated
hypothesis that corresponds to a hypothesis class with larger capacity. A rectangle can be
defined by four numbers, but to define a more complicated shape one needs a more
complex model with a much larger number of parameters. With a complex model,
one can make a perfect fit to the data and attain zero error; see the wiggly shape in figure
2.8. Another possibility is to keep the model simple and allow some error; see the rectangle
in figure 2.8. Using the simple rectangle (unless its training error is much bigger) makes
more sense because a simple model is easier to use, easier to train, easier to explain, and
tends to generalize better (Occam's razor).
In the general case, we have K classes denoted as Ci, i = 1, . . . , K, and an input instance
belongs to one and exactly one of them. The training set is now of the form
X = {x^t, r^t}, t = 1, . . . , N, where r^t is a K-dimensional vector whose element r_i^t is 1 if
x^t belongs to Ci and 0 otherwise.
An example is given in figure 2.9 with instances from three classes: family car, sports car,
and luxury sedan. In machine learning for classification, we would like to learn the
boundary separating the instances of one class from the instances of all other classes. Thus
we view a K-class classification problem as K two-class problems. The training examples
belonging to Ci are the positive instances of hypothesis hi and the examples of all other
classes are the negative instances of hi . Thus in a K-class problem, we have K hypotheses
to learn such that
For a given x, ideally only one of hi(x), i = 1, . . . , K is 1 and we can choose a class. But
when no, or two or more, hi(x) is 1, we cannot choose a class; this is a case of doubt, and
the classifier rejects such cases. In our example of learning a family car, we used
only one hypothesis and only modelled the positive examples. Any negative example outside
is not a family car. Alternatively, sometimes we may prefer to build two hypotheses, one for
the positive and the other for the negative instances. This assumes a structure also for the
negative instances that can be covered by another hypothesis. Separating family cars from
sports cars is one such problem, and separating family cars from luxury sedans is another.
Rapid influenza diagnostic tests (RIDTs) are said to have a sensitivity of 62.3%; this is just a clever way of saying that for a
person with flu, the test will be positive 62.3% of the time. For people who do not have the
flu, the test is more accurate since its specificity is 98.2% — only 1.8% of healthy people
will be flagged positive.
The positive likelihood ratio is said to be 34.5; it is computed as sensitivity / (1 − specificity),
roughly 0.623 / 0.018 ≈ 35.
This is to say, a positive result is about 35 times more likely for a person with the flu than
for a healthy person. The negative likelihood ratio is said to be 0.38; it is computed as
(1 − sensitivity) / specificity = 0.377 / 0.982 ≈ 0.38.
This is to say, a person with the flu is only about 0.38 times as likely to test negative as a
healthy person (roughly 1-to-3).
In other words, a positive result from these flu tests is quite trustworthy, but a negative
result does not reliably rule out the flu, since the test misses a substantial fraction of sick
people.
True Positive and True Negative values mean the predicted value matches the actual
value.
A Type I Error (false positive) happens when the model incorrectly predicts positive for an
actual negative value.
A Type II Error (false negative) happens when the model incorrectly predicts negative for an
actual positive value.
Regression
Regression in machine learning is a type of supervised learning task that involves predicting
a continuous output variable based on one or more input variables, also known as features
or predictors. The goal of regression is to learn a function that maps the input variables to a
continuous output variable, which can be used to make predictions on new, unseen data.
Regression models can take on different forms, depending on the type of function used to
model the relationship between the input variables and the output variable. Some common
forms include the following.
Linear regression is a simple and widely used regression technique that models the
relationship between the input variables and the output variable as a linear function. In
other words, the output variable is modeled as a weighted sum of the input variables, plus
an intercept term. The coefficients of the input variables are learned from the training data
using techniques such as ordinary least squares or gradient descent.
Polynomial regression is a type of regression that models the relationship between the input
variables and the output variable as a polynomial function. This can be useful when the
relationship between the variables is nonlinear, and a linear model is not sufficient to
capture the underlying pattern in the data.
Logistic regression is a type of regression that is used for classification tasks, where the
output variable is a categorical variable. Logistic regression models the relationship between
the input variables and the probability of the output variable belonging to a certain class,
using a logistic function.
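A minimal sketch of logistic regression in R, using the built-in mtcars data to predict a binary outcome; the choice of dataset and predictors is an arbitrary illustration, not an example from these notes.
# Logistic regression: modeling the probability of a binary outcome
data(mtcars)
# Probability that a car has a manual transmission (am = 1), from weight and horsepower
fit = glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit)
# Predicted probabilities for the training cars
head(predict(fit, type = "response"))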
Regression models can be evaluated using metrics such as mean squared error, mean
absolute error, R-squared, or root mean squared error, depending on the specific problem
and the characteristics of the data. Techniques such as cross-validation or regularization
can also be used to improve the performance of the model and prevent over fitting to the
training data.
Interpolation refers to estimating a value of the output variable for an input value that falls
within the range of the input values used to train the model. For example, consider a
regression model that predicts the price of a house based on its
size in square feet. If the model is trained on a dataset that contains houses with sizes
ranging from 500 to 2000 square feet, interpolation would involve estimating the price of a
house with a size of 1500 square feet, which is within the range of input values used to
train the model.
In contrast, extrapolation refers to the process of estimating a value of the output variable
for an input value that falls outside the range of the input values that were used to train the
model. Extrapolation can be more challenging and can lead to less reliable predictions, as it
involves making predictions based on assumptions about the behavior of the model outside
the range of the training data.
Model selection and Generalization
Model selection is the process of selecting the best model among a set of candidate models
for a given machine learning task. Model selection is an important step in the machine learning process.
There are many evaluation metrics that can be used in model selection, and the choice of
metric depends on the specific problem and the characteristics of the data. Here are some
of the most common evaluation metrics used in model selection:
1. Accuracy: Accuracy is the proportion of correctly classified instances, or the
number of true positives and true negatives divided by the total number of
instances. Accuracy is commonly used for classification tasks.
2. Precision: Precision is the proportion of true positives among the instances
predicted as positive, or the number of true positives divided by the number of true
positives plus false positives. Precision is useful when the cost of false positives is
high.
3. Recall: Recall is the proportion of true positives among the instances that are
actually positive, or the number of true positives divided by the number of true
positives plus false negatives. Recall is useful when the cost of false negatives is
high.
4. F1 score: The F1 score is the harmonic mean of precision and recall, or 2 times the
product of precision and recall divided by the sum of precision and recall. The F1
score is a balanced measure that takes both precision and recall into account.
5. Mean squared error (MSE): The MSE is the average of the squared differences
between the predicted and actual values, or the sum of squared errors divided by
the number of instances. MSE is commonly used for regression tasks.
6. Root mean squared error (RMSE): The RMSE is the square root of the MSE, or the
square root of the sum of squared errors divided by the number of instances; it is
expressed in the same units as the output variable. (These metrics are computed in
the short sketch below.)
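A compact sketch computing these metrics in R; the vectors of labels and numeric values below are made-up examples, not data from these notes.
# Classification metrics from hypothetical predicted and actual labels
actual = c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
predicted = c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)
tp = sum(predicted == 1 & actual == 1)
tn = sum(predicted == 0 & actual == 0)
fp = sum(predicted == 1 & actual == 0)
fn = sum(predicted == 0 & actual == 1)
accuracy = (tp + tn) / length(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
# Regression metrics from hypothetical predicted and actual values
y_true = c(3.0, 5.0, 2.5, 7.0)
y_pred = c(2.8, 5.4, 2.0, 7.1)
mse = mean((y_true - y_pred)^2)
rmse = sqrt(mse)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1, mse = mse, rmse = rmse)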
In general, it is important to carefully choose the appropriate evaluation metric for the
specific problem and to use appropriate techniques and evaluation methods to ensure that
the model selection process produces reliable and accurate predictions.
Generalization
Generalization is the ability of a machine learning model to perform well on new, unseen
data, beyond the data used to train the model. Model selection is closely related to
generalization, as the goal of model selection is to choose the best model that can generalize
well to new data.
In machine learning, the ultimate goal is to develop a model that can accurately and
reliably predict the output variable for new, unseen input data. To achieve this goal, it is
important to choose a model that is not only accurate on the training data, but also
generalizes well to new data.
Model selection helps to achieve this goal by evaluating the performance of different models
on a validation set or through cross-validation, and selecting the model that performs the
best on this set. By choosing the best model based on its performance on a separate
validation set, we can reduce the risk of over fitting to the training data and increase the
likelihood that the model will generalize well to new data.
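A small sketch of this idea in R: two candidate regression models are fit on a training split and compared by their error on a held-out validation split; the simulated data and the two model forms are assumptions made only for illustration.
# Model selection via a hold-out validation set (simulated data)
set.seed(1)
x = runif(200, 0, 10)
y = 2 + 1.5 * x + rnorm(200, sd = 2)
d = data.frame(x = x, y = y)
idx = sample(seq_len(nrow(d)), size = 0.7 * nrow(d))
train = d[idx, ]
valid = d[-idx, ]
# Candidate models: a straight line and a degree-5 polynomial
m1 = lm(y ~ x, data = train)
m2 = lm(y ~ poly(x, 5), data = train)
# Validation mean squared error for each candidate; pick the smaller one
mse = function(model) mean((valid$y - predict(model, valid))^2)
c(linear = mse(m1), poly5 = mse(m2))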
However, it is important to note that model selection is just one aspect of achieving good
generalization in machine learning. Other important factors include data preprocessing,
feature selection or extraction, hyper parameter tuning, and regularization. By carefully
considering all of these factors, we can develop models that not only perform well on the
training data, but also generalize well to new, unseen data.
Ill-posed problem
An ill-posed problem in model selection refers to a problem where the data and the model
are insufficiently constrained, making it difficult or impossible to determine a unique
solution or to reliably evaluate the performance of the model.
In model selection, an ill-posed problem can arise when there are too many candidate
models or when the data is noisy, incomplete, or ambiguous. In such cases, it can be
difficult to select the best model or to accurately evaluate the performance of the models, as
the models may be too complex or too flexible to capture the underlying patterns in the
data.
To address an ill-posed problem in model selection, it is important to carefully consider the
characteristics of the data and the models, and to use appropriate regularization
techniques to constrain the models and reduce their complexity. Regularization techniques
such as L1 regularization or L2 regularization can be used to reduce the number of
parameters in the model and prevent overfitting to the training data, improving the model's
ability to generalize to new data.
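A minimal sketch of L1 (lasso) and L2 (ridge) regularization in R using the glmnet package, which is assumed to be installed; the simulated data is only for illustration.
# L1 (lasso, alpha = 1) and L2 (ridge, alpha = 0) regularization with glmnet
library(glmnet)
set.seed(7)
X = matrix(rnorm(100 * 20), nrow = 100, ncol = 20)   # 20 features, most irrelevant
y = 3 * X[, 1] - 2 * X[, 2] + rnorm(100)
# Cross-validated fits; lambda controls the strength of the penalty
lasso = cv.glmnet(X, y, alpha = 1)
ridge = cv.glmnet(X, y, alpha = 0)
# Coefficients at the cross-validated lambda; lasso sets many exactly to zero
coef(lasso, s = "lambda.min")
coef(ridge, s = "lambda.min")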
In general, it is important to carefully consider the characteristics of the data and the
models, and to use appropriate techniques and evaluation metrics to ensure that the model
selection process produces reliable and accurate predictions, even in the presence of an ill-
posed problem.
Inductive Bias
The inductive bias of a learning algorithm refers to the set of assumptions or biases that the
algorithm makes about the relationship between the input variables and the output variable
in the data. The inductive bias guides the learning process by constraining the space of
possible hypotheses that the algorithm can consider and by prioritizing certain hypotheses
over others.
The inductive bias of a learning algorithm can be explicit or implicit. Explicit biases are
built into the algorithm through the choice of the learning algorithm or through the
selection of specific hyper parameters.
For example, a decision tree algorithm has an explicit bias towards simple decision trees,
while a neural network algorithm has an explicit bias towards smooth, continuous
functions.
Implicit biases, on the other hand, are inherent in the structure of the data and are not
explicitly built into the algorithm. For example, an implicit bias may arise from the
distribution of the input variables or from the structure of the output variable.
The choice of inductive bias can have a significant impact on the performance of the
learning algorithm, as it determines the set of hypotheses that the algorithm considers and
the way in which the algorithm generalizes to new data. A good inductive bias should be
able to capture the underlying patterns in the data while avoiding over fitting to the training
data.
In general, the choice of inductive bias depends on the specific problem and the
characteristics of the data. It is important to carefully consider the trade-offs between
simplicity and expressiveness, and to use appropriate techniques and evaluation metrics to
ensure that the learning algorithm produces reliable and accurate predictions.
When selecting a model based on bias value, the goal is to find a model that has a balance
between bias and variance, which is the tendency of the model to vary significantly
depending on the specific training data used. A model with high variance may be too flexible
and may overfit the data, while a model with low variance may be too rigid and may
underfit the data.
Typically, a model with high bias will have a high error on both the training and
the validation sets, while a model with high variance will have a low error on the training
set but a high error on the validation set. Therefore, the goal is to choose a model that has a
low error on both the training and validation sets.
One useful technique for selecting a model based on bias value is to plot the learning curve,
which shows the error of the model as a function of the size of the training set. A model
with high bias will typically have a high error on both the training and validation sets, but
the error will converge to a high value as the size of the training set increases. A model with
high variance, on the other hand, will typically have a low error on the training set but a
high error on the validation set, and the gap between the two errors will not converge as the
size of the training set increases.
By analyzing the learning curve, it is possible to identify the point at which the error of the
model converges or plateaus and to choose the model that has the lowest error at this point.
This can help to select a model that has a good balance between bias and variance and that
is likely to generalize well to new, unseen data.
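A rough sketch of plotting such a learning curve in R; the simulated data, the fixed-capacity model, and the grid of training sizes are all illustrative assumptions.
# Learning curve: training and validation error versus training-set size
set.seed(3)
x = runif(500, 0, 10)
y = sin(x) + rnorm(500, sd = 0.3)
d = data.frame(x = x, y = y)
train = d[1:350, ]
valid = d[351:500, ]
sizes = seq(20, 350, by = 30)
train_err = valid_err = numeric(length(sizes))
for (i in seq_along(sizes)) {
  sub = train[1:sizes[i], ]
  fit = lm(y ~ poly(x, 3), data = sub)            # a fixed-capacity model
  train_err[i] = mean((sub$y - predict(fit, sub))^2)
  valid_err[i] = mean((valid$y - predict(fit, valid))^2)
}
plot(sizes, valid_err, type = "b", ylim = range(c(train_err, valid_err)),
     xlab = "Training set size", ylab = "MSE")
lines(sizes, train_err, type = "b", lty = 2)
legend("topright", legend = c("validation error", "training error"), lty = c(1, 2))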
Triple Trade-off in ML
The triple trade-off in machine learning refers to the trade-offs between three important
factors that affect the performance of a machine learning model: bias, variance, and model
complexity.
Bias refers to the tendency of a model to consistently make incorrect assumptions about
the relationship between the input variables and the output variable. Models with high
bias are typically too simple and may underfit the data.
Variance refers to the tendency of a model to vary significantly depending on the specific
training data used. Models with high variance are typically too complex and may overfit the
data.
Model complexity refers to the number of parameters or the degree of flexibility of the
model. Models with high complexity are typically more flexible and may have a higher
capacity to capture complex patterns in the data, but may also be more prone to over
fitting.
The triple trade-off arises because increasing one factor may come at the expense of the
other two factors. For example, increasing the complexity of the model may reduce bias and
improve the ability of the model to capture complex patterns in the data, but may also
increase variance and make the model more prone to over fitting.
Similarly, reducing the complexity of the model may reduce variance and improve the ability
of the model to generalize to new data, but may also increase bias and make the model too
simple to capture the underlying patterns in the data.
Decision Tree
A decision tree is a flowchart-like structure in which:
– each internal node represents a "test" on an attribute,
– each branch represents the outcome of the test, and
– each leaf node represents a class label (the decision taken after computing all attributes).
• The paths from root to leaf represent classification rules.
Introduction:
Decision Tree Learning
• Decision tree learning is a method for approximating discrete-valued target functions, in
which the learned function is represented by a decision tree.
• Learned trees can also be re-represented as sets of if-then rules to improve human
readability.
• These learning methods are among the most popular of inductive inference algorithms
• have been successfully applied to a broad range of tasks from learning to diagnose
medical cases to learning to assess credit risk of loan applicants.
• widely used algorithms are ID3, ASSISTANT, and C4.5
• These decision tree learning methods search a completely expressive hypothesis space
and thus avoid the difficulties of restricted hypothesis spaces.
• Their inductive bias is a preference for small trees over large trees.
DECISION TREE REPRESENTATION
• classifies instances by sorting them down the tree from the root to some leaf node, which
provides the classification of the instance.
• Each node in the tree specifies a test of some attribute of the instance
• branch descending from that node corresponds to one of the possible values for this
attribute.
decision tree learning is generally best suited to problems with the following characteristics:
1. Instances are represented by attribute-value pairs.
– Instances are described by a fixed set of attributes and their values.
– Each attribute can take on
• a small number of disjoint possible values, or
• real values.
2. The target function has discrete output values.
– The decision tree generally assigns a boolean classification (e.g., yes or no) to each
example.
– Can have more than two possible output values
– Also real-valued outputs ( though the application of decision trees in this setting is less
common).
3. Disjunctive descriptions may be required.
– As noted above, decision trees naturally represent disjunctive expressions.
4. The training data may contain errors.
– Decision tree learning methods are robust to errors, both errors in classifications of the
training examples and errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values
– Decision tree methods can be used even when some training examples have
unknown values.
– (e.g., if the Humidity of the day is known for only some of the training examples)
• classification problems:
– Problems in which the task is to classify examples into one of a discrete set of possible
categories, are often referred to as classification problems.
• Decision tree learning has therefore been applied to problems such as
– classifying medical patients by their disease,
– equipment malfunctions by their cause, and
– loan applicants by their likelihood of defaulting on payments.
If the target attribute can take on c different values, then the entropy of S relative to this
c-wise classification is defined as
Entropy(S) = Σ (i = 1 to c) −p_i log2 p_i
where p_i is the proportion of S belonging to class i.
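A short R sketch of this entropy computation; the class proportions used in the example call are made up for illustration.
# Entropy of a set S given its class proportions p (base-2 logarithm)
entropy = function(p) {
  p = p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}
# Example: a set with 9 positive and 5 negative examples
entropy(c(9 / 14, 5 / 14))   # approximately 0.940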
ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued
functions, relative to the available attributes.
ID3 maintains only a single current hypothesis as it searches through the space of
decision trees.
ID3 in its pure form performs no backtracking in its search.
ID3 uses all training examples at each step in the search to make statistically based
decisions regarding how to refine its current hypothesis.
RULE POST-PRUNING