Unit 1-1
Induction Algorithms. Rule Induction. Decision Trees. Bayesian Methods. The Basic
Naïve Bayes Classifier. Naive Bayes Induction for Numeric Attributes. Correction to the
Probability Estimation. Laplace Correction. No Match. Other Bayesian Methods. Other
Induction Methods. Neural Networks. Genetic Algorithms. Instance based Learning.
Support Vector Machines.
Induction Algorithms.
Rule Induction
What is Rule Induction?
Rule induction is a machine-learning technique that involves the discovery of patterns or
rules in data. It aims to extract explicit if-then rules that can accurately predict or classify
instances based on their features or attributes. Rule induction is the data mining process of deducing if-then rules from a data set. These symbolic decision rules explain
an inherent relationship between the attributes and class labels in the data set. Many real-
life experiences are based on intuitive rule induction. For example, we can proclaim a rule
that states “if it is 8 a.m. on a weekday, then highway traffic will be heavy” and “if it is 8
p.m. on a Sunday, then the traffic will be light.” These rules are not necessarily right all
the time. 8 a.m. weekday traffic may be light during a holiday season. But, in general,
these rules hold true and are deduced from real-life experience based on our everyday
observations. Rule induction provides a powerful classification approach that can be
easily understood by the general audience. Apart from its use in Predictive Analytics by
classification of unknown data, rule induction is also used to describe the patterns in the
data. The description is in the form of simple if-then rules that can be easily understood
by general users.
The easiest way to extract rules from a data set is from a decision tree that is developed
on the same data set. A decision tree splits data on every node and leads to the leaf
where the class is identified. If we trace back from the leaf to the root node, we can
combine all the split conditions to form a distinct rule.
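To make this concrete, here is a minimal sketch of tracing rules out of a tree using scikit-learn's export_text; the encoded weather-style rows and the feature names are made up purely for illustration.

# Sketch: each root-to-leaf path printed by export_text corresponds to one
# if-then rule. The tiny encoded dataset below is illustrative only.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]   # encoded [outlook, humidity]
y = [0, 0, 1, 1, 1, 0]                                  # 1 = play, 0 = don't play

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["outlook", "humidity"]))

Each printed path reads directly as an if-then rule over the split conditions along that path.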
Recently, there has been substantial attention devoted to the use of machine learning
techniques as tools for decision support. These methods have been applied to a wide
variety of problems in engineering because of their ability to discover patterns from data.
The integration of these methods with conventional decision support systems can provide
a means for significantly improving the quality of decision making. A decision support
system can employ machine learning techniques to derive knowledge directly from prior
decision examples and to refine this knowledge continually. Inductive learning is perhaps
the most widely used machine learning technique. Inductive learning algorithms are
simple and fast. Another advantage is that they generate models that are easy to
understand. Finally, inductive learning algorithms compare favourably in accuracy with other machine learning techniques. Inductive learning techniques can be divided into two main
categories, namely, decision tree induction and rule induction. RULES (RULe Extraction
System) is a family of inductive learning algorithms that follow the rule induction approach.
The process of rule induction typically involves the following steps:
Data Preparation: The input data is prepared by organizing it into a structured format,
such as a table or a matrix, where each row represents an instance or observation, and
each column represents a feature or attribute.
Rule Generation: The rule generation process involves finding patterns or associations
in the data that can be expressed as if-then rules. Various algorithms and methods can
be used for rule generation, such as decision tree algorithms (e.g., C4.5, CART),
association rule mining algorithms (e.g., Apriori), and logical reasoning approaches (e.g.,
inductive logic programming).
Rule Evaluation: Once the rules are generated, they need to be evaluated to determine their quality and usefulness. Evaluation metrics can include accuracy, coverage, support, confidence, lift, and other measures depending on the specific application and domain (a brief computation sketch of support and confidence follows this list).
Rule Selection and Pruning: Depending on the complexity of the rule set and the
specific requirements, rule selection and pruning techniques can be applied to refine the
rule set. This process involves removing redundant, irrelevant, or overlapping rules to
improve interpretability and efficiency.
Rule Application: Once a set of high-quality rules is obtained, they can be applied to
new, unseen instances for prediction or classification. Each instance is evaluated against
the rules, and the applicable rule(s) with the highest confidence or support are used to make predictions or decisions.
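Following up on the Rule Evaluation step above, here is a small sketch of computing the support and confidence of a hypothetical rule "if A then B" over a made-up list of transactions.

# Sketch: evaluating one if-then rule ("if A then B") on made-up transaction data.
transactions = [
    {"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}, {"A", "B"},
]

has_a = [t for t in transactions if "A" in t]
has_a_and_b = [t for t in has_a if "B" in t]

support = len(has_a_and_b) / len(transactions)    # fraction of all rows containing A and B
confidence = len(has_a_and_b) / len(has_a)        # of rows containing A, fraction also containing B

print(f"support={support:.2f}, confidence={confidence:.2f}")  # support=0.60, confidence=0.75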
Rule induction has been widely used in various domains, such as data mining, machine
learning, expert systems, and decision support systems. It provides interpretable and
human-readable models, making it useful for generating understandable insights and
explanations from data.
While rule induction can be effective in capturing explicit patterns and associations in the
data, it may struggle with capturing complex or non-linear relationships. Additionally, rule
induction algorithms may face challenges when dealing with large and high-dimensional
datasets, as the search space of possible rules can become exponentially large. The
importance of rule induction lies in its ability to extract interpretable and actionable
knowledge from complex datasets. It provides a way to discover underlying patterns,
dependencies, or rules that humans can easily understand and utilize. Rule induction has
applications in various domains, including data mining, machine learning, expert systems,
decision support systems, and business intelligence.
Decision Trees
Tree induction is a method used in machine learning to derive decision trees from data.
Decision trees are predictive models that use a set of binary rules to calculate a target
value. They are widely used for classification and regression tasks because they are
interpretable, easy to implement, and can handle both numerical and categorical data.
Tree induction algorithms work by recursively partitioning the dataset into subsets based
on the features that provide the best separation between classes or values.
Decision Tree is a supervised learning method used in data mining for classification and regression tasks. It is a tree that helps us for decision-making purposes. The decision tree creates classification or regression models as a tree structure. It separates a data set into smaller subsets, and at the same time, the decision tree is steadily developed. The final tree is a tree with decision nodes and leaf nodes. A decision node has at
least two branches. The leaf nodes show a classification or decision; leaf nodes cannot be split any further. The uppermost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can deal with both categorical and numerical data.
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures
the randomness or impurity in data sets.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split. It is also called entropy reduction. Building a decision tree is all about discovering attributes that return the highest information gain.
In short, a decision tree is just like a flow chart diagram with the terminal nodes showing
decisions. Starting with the dataset, we can measure the entropy to find a way to segment
the set until the data belongs to the same class.
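As a worked sketch of entropy and information gain, the snippet below uses base-2 logarithms and made-up class counts for one candidate split (the counts echo the classic 9-versus-5 play/don't-play example).

# Sketch: entropy and information gain for a candidate split (toy counts, base-2 log).
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

parent = ["yes"] * 9 + ["no"] * 5     # e.g. 9 "play" vs 5 "don't play"
left   = ["yes"] * 6 + ["no"] * 1     # subset where humidity = normal (made up)
right  = ["yes"] * 3 + ["no"] * 4     # subset where humidity = high (made up)

gain = (entropy(parent)
        - (len(left) / len(parent)) * entropy(left)
        - (len(right) / len(parent)) * entropy(right))
print(round(gain, 3))                  # roughly 0.151 for these counts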
In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an extensive collection of records into smaller, more class-homogeneous sets by applying a sequence of simple decision rules. A decision tree model comprises a set of rules for partitioning a large heterogeneous population into smaller, more homogeneous, mutually exclusive classes. The attributes can be any type of variable, from nominal, ordinal, and binary to quantitative values; in contrast, the class must be of a qualitative type, such as categorical, ordinal, or binary. In brief, given the data on attributes together with their class, a decision tree creates a set of rules that can be used to
identify the class. One rule is implemented after another, resulting in a hierarchy of
segments within a segment. The hierarchy is known as the tree, and each segment is
called a node. With each progressive division, the members from the subsequent sets
become more and more similar to each other. Hence, the algorithm used to build a
decision tree is referred to as recursive partitioning. The algorithm is known as CART
(Classification and Regression Trees). Consider the example of a factory where expanding costs $3 million: the probability of a good economy is 0.6 (60%), which leads to $8 million profit, and the probability of a bad economy is 0.4 (40%), which leads to $6 million profit. Not expanding costs $0: the probability of a good economy is 0.6 (60%), which leads to $4 million profit, and the probability of a bad economy is 0.4, which leads to $2 million profit. The management team needs to make a data-driven decision on whether to expand, based on the given data.
Net Expand = (0.6*8 + 0.4*6) - 3 = $4.2M
Net Not Expand = (0.6*4 + 0.4*2) - 0 = $3.2M
Since $4.2M > $3.2M, the factory should be expanded.
ID3 (Iterative Dichotomiser 3): This algorithm uses entropy and information gain to build
a decision tree for classification tasks.
C4.5: An extension of ID3, C4.5 uses the gain ratio to address some of the limitations of
information gain and can handle both continuous and discrete features.
CART (Classification and Regression Trees): CART is a versatile algorithm that can be
used for both classification and regression. It uses Gini impurity for classification and
variance reduction for regression.
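As a hedged sketch of how these criteria appear in practice, scikit-learn's DecisionTreeClassifier can be switched between entropy-based splitting (in the spirit of ID3/C4.5) and Gini-based splitting (as in CART); the encoded rows and labels below are invented.

# Sketch: the same toy data fit with entropy-based and Gini-based splitting criteria.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 85], [0, 90], [1, 78], [2, 96], [2, 80], [2, 70], [1, 65], [0, 95]]
y = [0, 0, 1, 1, 1, 1, 1, 0]   # 1 = play, 0 = don't play

for criterion in ("entropy", "gini"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    print(criterion, clf.predict([[1, 70]]))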
Bayesian Methods
The Basic Naive Bayes Classifier
Given an instance I described by a conjunction of attribute values ∧ Vj, Bayes' rule gives the probability of each class Ck as
p(Ck | ∧ Vj) = p(Ck) · p(∧ Vj | Ck) / Σi p(Ci) · p(∧ Vj | Ci),
where the denominator sums over all classes and where p(∧ Vj | Ck) is the probability of the instance I given the class Ck. After calculating these quantities for each description, the algorithm assigns the instance to the class with the highest probability. In order to make the above expression operational, one must still specify how to compute the term p(∧ Vj | Ck). The naive Bayesian classifier assumes independence of attributes within each class, which lets it use the equality
p(∧ Vj | Ck) = Πj p(Vj | Ck),
where the values p(Vj | Ck) represent the conditional probabilities stored with each class. This approach greatly simplifies the computation of class probabilities for a given observation. The Bayesian framework also lets one specify prior probabilities for both the class and the conditional terms. In the absence of domain-specific knowledge, a common scheme makes use of 'uninformed priors', which assign equal probabilities to each class and to the values of each attribute. However, one must also specify how much weight to give these priors relative to the training data. Learning in the naive Bayesian classifier is an almost trivial matter. The simplest implementation increments a count each time it encounters a new instance, along with a separate count for a class each time it observes an instance of that class. These counts let the classifier estimate p(Ck) for each class Ck. For each nominal value, the algorithm updates a count for that class-value pair; together with the second count, this lets the classifier estimate p(Vj | Ck). For each numeric attribute, the method retains and revises two quantities, the sum and the sum of squares, which let it compute the mean and variance of a normal curve that it uses to find p(Vj | Ck). In domains that can have missing attributes, it must include a fourth count for each class-attribute pair.
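The counting scheme just described might look like the following minimal sketch for nominal attributes only; the tiny (outlook, windy) training set is made up.

# Sketch of the counting scheme above: class counts give p(Ck);
# class-value counts give p(Vj | Ck). Training examples are invented.
from collections import Counter, defaultdict

instances = [
    (("sunny", "false"), "no"), (("sunny", "true"), "no"),
    (("rain", "false"), "yes"), (("overcast", "true"), "yes"),
    (("rain", "false"), "yes"),
]

class_counts = Counter()
value_counts = defaultdict(Counter)      # value_counts[class][(attr_index, value)]

for values, label in instances:          # one incremental pass over the training data
    class_counts[label] += 1
    for j, v in enumerate(values):
        value_counts[label][(j, v)] += 1

n = sum(class_counts.values())
p_class = {c: class_counts[c] / n for c in class_counts}
p_value_given_class = {
    c: {vk: cnt / class_counts[c] for vk, cnt in value_counts[c].items()}
    for c in class_counts
}
print(p_class["yes"], p_value_given_class["yes"][(0, "rain")])  # 0.6 and 2/3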
In contrast to many induction methods, the naive Bayesian classifier does not carry out
an extensive search through a space of possible descriptions. The basic algorithm makes
no choices about how to partition the data, which direction to move in a weight space, or
the like, and the resulting probabilistic summary is completely determined by the training
data and the prior probabilities. Nor does the order of the training instances have any
effect on the output; the basic process produces the same description whether it
operates incrementally or nonincrementally. These features make the learning algorithm
both simple to understand and quite efficient. Bayesian classifiers would appear to have
advantages over many induction algorithms. For example, their collection of class and
conditional probabilities should make them inherently robust with respect to noise. Their
statistical basis should also let them scale well to domains that involve many irrelevant
attributes.
The Naïve Bayes algorithm is used for classification problems and is widely used in text classification. In text classification tasks, the data are high dimensional (each word represents one feature). It is used in spam filtering, sentiment detection, rating classification, etc. The advantage of naïve Bayes is its speed: it is fast, and making predictions is easy even with high-dimensional data. The model predicts the probability that an instance belongs to a class given a set of feature values. It is a probabilistic classifier, and it is called naïve because it assumes that one feature in the model is independent of the existence of any other feature. In other words, each feature contributes to the prediction with no relation to the others. In the real world, this condition is rarely satisfied. The algorithm uses Bayes' theorem for training and prediction.
(Table: the golf dataset, with attribute columns Outlook, Temperature, Humidity, and Windy, and class column Play Golf.)
The dataset is divided into two parts, namely, feature matrix and the response vector.
Feature matrix contains all the vectors(rows) of dataset in which each vector consists of
the value of dependent features. In above dataset, features are ‘Outlook’, ‘Temperature’,
‘Humidity’ and ‘Windy’.
Response vector contains the value of class variable(prediction or output) for each row
of feature matrix. In above dataset, the class variable name is ‘Play golf’.
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. More specifically:
Feature independence: The features of the data are conditionally independent of each
other, given the class label.
Continuous features are normally distributed: If a feature is continuous, then it is assumed
to be normally distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it is assumed
to have a multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to the
prediction of the class label.
No missing data: The data should not contain any missing values.
With relation to our dataset, this concept can be understood as:
We assume that no pair of features are dependent. For example, the temperature being
‘Hot’ has nothing to do with the humidity or the outlook being ‘Rainy’ has no effect on the
winds. Hence, the features are assumed to be independent.
Secondly, each feature is given the same weight(or importance). For example, knowing
only temperature and humidity alone can’t predict the outcome accurately. None of the
attributes is irrelevant and assumed to be contributing equally to the outcome.
The assumptions made by Naive Bayes are not generally correct in real-world situations.
In fact, the independence assumption is never correct but often works well in practice. Now, before moving to the formula for Naive Bayes, it is important to know about Bayes’ theorem.
Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:
P(A|B) = P(B|A) · P(A) / P(B)
Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
P(A) is the priori of A (the prior probability, i.e. the probability of the event before the evidence is seen). The evidence is an attribute value of an unknown instance (here, event B).
P(B) is the marginal probability: the probability of the evidence.
P(A|B) is the posteriori probability of A given B, i.e. the probability of the event after the evidence is seen.
P(B|A) is the likelihood probability, i.e. the likelihood that the hypothesis is true given the evidence.
Now, with regard to our dataset, we can apply Bayes’ theorem in the following way, where y is the class variable and x1, ..., xn are the features:
P(y | x1, ..., xn) = [P(x1 | y) · P(x2 | y) · ... · P(xn | y) · P(y)] / [P(x1) · P(x2) · ... · P(xn)]
Now, as the denominator remains constant for a given input, we can remove that term:
P(y | x1, ..., xn) ∝ P(y) · Π P(xi | y)
Now, we need to create a classifier model. For this, we find the probability of a given set of inputs for all possible values of the class variable y and pick the output with maximum probability. This can be expressed mathematically as:
y = argmax over y of P(y) · Π P(xi | y)
So now, we are done with our pre-computations and the classifier is ready. To classify a new set of feature values (say, today’s outlook, temperature, humidity, and wind), we evaluate this expression for each class and predict the class with the highest posterior probability.
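A minimal sketch of this procedure with scikit-learn's CategoricalNB is shown below; the five encoded rows stand in for the golf table (which is not reproduced here), so the exact numbers are illustrative only.

# Sketch: classifying a new day with a categorical naive Bayes model.
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

X_raw = [["Sunny", "Hot", "High", "False"],
         ["Rainy", "Mild", "Normal", "False"],
         ["Overcast", "Cool", "Normal", "True"],
         ["Sunny", "Mild", "High", "True"],
         ["Rainy", "Cool", "Normal", "False"]]
y = ["No", "Yes", "Yes", "No", "Yes"]

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)                          # encode nominal values as integers

model = CategoricalNB(alpha=1.0).fit(X, y)            # alpha=1.0 is Laplace smoothing
today = enc.transform([["Sunny", "Hot", "Normal", "False"]])
print(model.predict(today), model.predict_proba(today))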
Correction to the Probability Estimation
Naïve Bayes is a probabilistic classifier based on Bayes theorem and is used for
classification tasks. It works well enough in text classification problems such as spam
filtering and the classification of reviews as positive or negative. The algorithm seems
perfect at first, but the fundamental representation of Naïve Bayes can create some
problems in real-world scenarios. Let’s take an example of text classification where the task is to classify whether the review is positive or negative. We build a likelihood table
based on the training data. While querying a review, we use the Likelihood table values,
but what if a word in a review was not present in the training dataset?
Query review = w1 w2 w3 w’
We have four words in our query review, and let’s assume only w1, w2, and w3 are
present in training data. So, we will have a likelihood for those words. To calculate
whether the review is positive or negative, we compare P(positive|review) and
P(negative|review).
In the likelihood table, we have P(w1|positive), P(w2|Positive), P(w3|Positive), and
P(positive)
but where is P(w’|positive)? If the word is absent in the training dataset, then we don’t have its likelihood. What should we do?
Approach 1: Ignore the term P(w’|positive)
Ignoring means that we are assigning it a value of 1, which means the probability of w’
occurring in positive P(w’|positive) and negative review P(w’|negative) is 1. This approach
seems logically incorrect.
Approach 2: In a bag-of-words model, we count the occurrences of words. The number of occurrences of word w’ in training is 0. According to that,
P(w’|positive)=0 and P(w’|negative)=0, but this will make both P(positive|review) and
P(negative|review) equal to 0 since we multiply all the likelihoods. This is the problem of
zero probability. So, how to deal with this problem?
Laplace Smoothing
Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naïve Bayes. Using Laplace smoothing, we can represent P(w’|positive) as
P(w’|positive) = (number of positive reviews containing w’ + alpha) / (N + alpha * K)
Here,
alpha represents the smoothing parameter,
K represents the number of dimensions (features) in the data, and
N represents the number of reviews with y=positive.
If we choose a value of alpha != 0 (not equal to 0), the probability will no longer be zero even if a word is not present in the training dataset.
Interpretation of changing alpha
Let’s say the occurrence of word w is 3 with y=positive in the training data. Assume we have 2 features in our dataset, i.e., K=2, and N=100 (total number of positive reviews).
Case 1: when alpha=1
P(w|positive) = (3+1)/(100+2) = 4/102
Case 2: when alpha=100
P(w|positive) = (3+100)/(100+200) = 103/300
Case 3: when alpha=1000
P(w|positive) = (3+1000)/(100+2000) = 1003/2100
As alpha increases, the likelihood probability moves towards uniform distribution (0.5).
Most of the time, alpha = 1 is being used to remove the problem of zero probability.
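The three cases above can be reproduced with a one-line smoothed estimate, (count + alpha) / (N + alpha * K):

# Sketch: the smoothed estimate for the numbers used above.
def laplace(count, N, K, alpha):
    return (count + alpha) / (N + alpha * K)

for alpha in (1, 100, 1000):
    print(alpha, round(laplace(3, 100, 2, alpha), 4))
# 1 -> 4/102 ≈ 0.0392, 100 -> 103/300 ≈ 0.3433, 1000 -> 1003/2100 ≈ 0.4776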
Numerical stability
In the earliest days of programming, developers often encountered difficulties when it
came to storing decimal or floating-point values in computer memory. While they could
easily represent whole numbers, representing decimal values posed a challenge.
The reason behind this challenge is that computers use binary representation, using only
0s and 1s, to represent any number. Consequently, it becomes challenging to accurately
represent decimal values in binary form. For instance, when representing extremely small
numbers like 0.000001, precision can be lost, and the value may be treated as 0. Let’s
consider an example in the field of biology. Suppose you are measuring the radius of a
cell. In some cases, your measurements might be extremely small, such as 0.00000001.
Now, let’s say you want to compare this radius to another cell’s radius, which is
0.0000003. Due to the limitations of computer representation, the computer will treat both
values as zero, leading to the incorrect conclusion that both cells have equal radii. This
condition is referred to as underflow. Underflow refers to a situation in which values
smaller than the smallest representable value in a computer’s numeric system are
rounded down to zero.
Let’s explore underflow in the context of a simple example. Suppose you are trying to
predict whether a student will get a placement based on their CGPA (Cumulative Grade
Point Average) and IQ (Intelligence Quotient). Let’s say the student has a CGPA of 8.1
and an IQ of 81. To calculate the probability of placement, you need to evaluate the
following:
p(y|8.1, 81) = p(y) * p(8.1|y) * p(81|y)
p(n|8.1, 81) = p(n) * p(8.1|n) * p(81|n)
Since probabilities range from 0 to 1, when you multiply these probabilities together
(especially if you have multiple features), the result tends to move closer to zero. This
leads to the underflow problem, where the computed probability becomes extremely
small, approaching zero, and can cause inaccuracies in the prediction model.
To address the underflow problem, one solution is to work with logarithmic probabilities.
By taking the logarithm of the probabilities, you can avoid the issue of extremely small
values.
The logarithmic property log(A * B) = log(A) + log(B) is useful here. It allows us to rewrite the expression log(p(y) * p(8.1|y) * p(81|y)) as the sum of logarithms:
log(p(y)) + log(p(8.1|y)) + log(p(81|y))
In the context of implementing this solution, you can utilize the predict_log_proba(X)
function available in the scikit-learn library's Naive Bayes implementation. This function
computes the logarithm of the probabilities for each class given input features X. After
calculating the logarithmic probabilities, you can compare them and choose the class with
the highest log probability. For example, if you obtain a log probability of −25 for one class and −53 for another, you would select the class with the higher (less negative) log probability, i.e. −25. By using logarithmic probabilities, you can overcome the underflow problem and make more accurate predictions.
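A small sketch of this with scikit-learn's GaussianNB follows; the CGPA/IQ rows and placement labels are made up for illustration.

# Sketch: comparing classes by log-probability to avoid underflow.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[8.1, 81], [6.2, 65], [9.0, 110], [5.5, 70], [7.8, 95], [6.9, 60]])
y = np.array([1, 0, 1, 0, 1, 0])            # 1 = placed, 0 = not placed (invented)

model = GaussianNB().fit(X, y)
log_probs = model.predict_log_proba([[8.1, 81]])   # log p(class | features)
print(log_probs, model.classes_[np.argmax(log_probs)])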
To convert this data into a binary bag-of-words table, we represent each word with a binary value (0 or 1). Each word corresponds to a column, and its presence in a review is denoted by a 1, while its absence is denoted by a 0; the sentiment is recorded in a "Sentiment" column. Consider an additional query point "r4" containing the words "w1, w1, w1": in the binary representation, w1 is present (1) while w2 and w3 are absent (0). The sentiment for r4 is yet to be predicted.
Now, let’s calculate the probabilities for positive and negative sentiments for r4 from the training reviews:
p(+ve|r4) = p(+ve) * p(w1=1|+ve) * p(w2=0|+ve) * p(w3=0|+ve) = (2/3) * (1/1) * (1/1) * (0/1) = 0
p(-ve|r4) = p(-ve) * p(w1=1|-ve) * p(w2=0|-ve) * p(w3=0|-ve) = (1/3) * (1/2) * (0/2) * (1/2) = 0
As you can see, both probabilities become 0, which is an issue when certain features do
not exist in a particular class, resulting in zero probabilities. This is where Laplace additive
smoothing comes in. Laplace additive smoothing helps avoid zero probabilities by adding
a small constant (alpha) to the numerator and n * alpha to the denominator of each
probability estimate. By applying Laplace additive smoothing, the probabilities will never
be zero. The value of alpha is usually 1 (default), but you can choose a different value
based on your preference. The value of n depends on the type of Naive Bayes algorithm you are using. Let's now understand the bias-variance tradeoff in the case of Naive Bayes. The question arises: why do we add alpha in the numerator and n * alpha in the denominator? Why don't we add a very small constant value like 0.000001 instead?
The reason we add alpha in the numerator and n * alpha in the denominator is to have
flexibility in controlling the bias and variance of the model. By tuning the value of alpha,
we can adjust the bias and variance accordingly.
When a model has high bias, it means it has simplified assumptions or constraints that
may lead to underfitting, resulting in poor performance. In such cases, we can set a lower
value of alpha to reduce bias and allow the model to capture more complex patterns. On
the other hand, when a model has high variance, it means it is too sensitive to the training
data and may overfit, resulting in poor generalization to unseen data. To address high
variance, we can set a higher value of alpha to smoothen the probability estimates and
reduce the impact of individual features, thus reducing variance. Alpha serves as a
hyperparameter that allows us to strike a balance between bias and variance. By
choosing different values of alpha, we can fine-tune the model’s behavior and find the
optimal tradeoff between bias and variance for a specific problem.
There are two reasons why we use Laplace additive smoothing:
1. To ensure that probabilities will not become zero.
2. By tuning the value of alpha and n * alpha, we can reduce overfitting and strike a balance in the bias-variance trade-off.
The Matching Problem
This famous problem has been stated variously in terms of hats and people, letters and
envelopes, tea cups and saucers – indeed, any situation in which you might want to match
two kinds of items seems to have appeared somewhere as a setting for the matching
problem. In the letter-envelope setting there are n letters labeled 1 through n and
also n envelopes labeled 1 through n. The letters are permuted randomly into the
envelopes, one letter per envelope (a mishap usually blamed on an unfortunate
hypothetical secretary), so that all permutations are equally likely. The main questions
are about the number of letters that are placed into their matching envelopes.
"Real life" settings aside, the problem is about the number of fixed points of a random
permutation. A fixed point is an element whose position is unchanged by the shuffle.
If letters falling in the right envelopes are good events, then the worst possible event
is every letter falling in a wrong envelope. That is the event that there are no
matches, and is called a derangement. Let's find the chance of a derangement.
The key is to notice that the complement is a union, and then use the inclusion-exclusion formula.
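Carrying out that inclusion-exclusion computation gives P(no match) = 1 − 1/1! + 1/2! − 1/3! + … ± 1/n!, which rapidly approaches 1/e ≈ 0.3679 as n grows; a short numeric check:

# Sketch: chance of a derangement (no letter in its matching envelope)
# by inclusion-exclusion; the sum tends to 1/e as n grows.
from math import factorial, e

def p_derangement(n):
    return sum((-1) ** k / factorial(k) for k in range(n + 1))

for n in (3, 5, 10):
    print(n, round(p_derangement(n), 6))
print("1/e ≈", round(1 / e, 6))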
Other Bayesian Methods
Other Induction Methods
Induction is pattern recognition -- an inference based on limited observational or
experimental data -- and pattern recognition is an addictively exhilarating acquired skill.
Of the two types of scientific inference, induction is far more pervasive and useful than
deduction (Chapter 4). Induction usually infers some pattern among a set of observations
and then attributes that pattern to an entire population. Almost all hypothesis formation is
based consciously or subconsciously on induction.
Induction is pervasive because people seek order insatiably, yet they lack the opportunity
of basing that search on observation of the entire population. Instead they make a few
observations and generalize.
Induction is not just a description of observations; it is always a leap beyond the data -- a
leap based on circumstantial evidence. The leap may be an inference that other
observations would exhibit the same phenomena already seen in the study sample, or it
may be some type of explanation or conceptual understanding of the observations; often
it is both. Because induction is always a leap beyond the data, it can never be proved. If
further observations are consistent with the induction, then they confirm, or lend
substantiating support to, the induction. But the possibility always remains that as-yet-
unexamined data might disprove the induction.
Types of Explanation
Individual events are complex, but explanation discerns their underlying simplicity of
relationships. In this section we will consider briefly two types of scientific explanation:
comparison (analogy and symmetry) and classification. In subsequent sections we will
examine, in much more detail, two more powerful types of explanation: correlation and
causality.
Explanation can deal with attributes or with variables. An attribute is binary: either present
or absent. Explanation of attributes often involves consideration of associations of the
attribute with certain phenomena or circumstances. A variable, in contrast, is not merely
present or absent; it is a characteristic whose changes can be quantitatively measured.
Explanations of a variable often involve description of a correlation between changes in
that variable and changes in another variable. If a subjective attribute, such as tall or
short, can be transformed into a variable, such as height, explanatory value increases.
The different kinds of explanation contrast in explanatory power and experimental ease.
Easiest to test is the null hypothesis that two variables are completely unrelated.
Statistical rejection of the null hypothesis can demonstrate the likelihood that a
classification or correlation has predictive value. Causality goes deeper, establishing the
origin of that predictive ability, but demonstration of causality can be very challenging.
Beyond causality, the underlying quantitative theoretical mechanism sometimes can be
discerned.
* * *
Analogy is the description of observed behavior in one class of phenomena and the
inference that this description is somehow relevant to a different class of phenomena.
Analogy does not necessarily imply that the two classes obey the same laws or function
in exactly the same way. Analogy often is an apparent order or similarity that serves only
as a visualization aid. That purpose is sufficient justification, and the analogy may inspire
fruitful follow-up research. In other cases, analogy can reflect a more fundamental
physical link between behaviors of the two classes.
Classifications evolve to regain utility when exceptions and anomalous examples are found. Often these exceptions can be explained by a more restrictive and complex class definition. Frequently, the smaller class exhibits greater commonality of other characteristics than was observed within the larger class.
Coincidence
Without attention to statistical evidence and confirmatory power, the scientist falls into the
most common pitfall of non-scientists: hasty generalization. One or a few chance
associations between two attributes or variables are mistakenly inferred to represent a
causal relationship. Hasty generalization is responsible for many popular superstitions,
but even scientists such as Aristotle were not immune to it. Hasty generalizations are
often inspired by coincidence, the unexpected and improbable association between two
or more events. After compiling and analyzing thousands of coincidences, Diaconis and Mosteller [1989] found that coincidences could be grouped into three classes:
• cases where there was an unnoticed causal relationship, so the association actually was
not a coincidence;
• nonrepresentative samples, focusing on one association while ignoring or forgetting
examples of non-matches;
• actual chance events that are much more likely than one might expect.
An example of this third type is that any group of 23 people has a 50% chance of at least
two people having the same birthday.
Correlation
Begin with two variables, which we will call X and Y, for which we have several
measurements. By convention, X is called the independent variable and Y is the
dependent variable. Perhaps X causes Y, so that the value of Y is truly dependent on the
value of X. Such a condition would be convenient, but all we really require is the possibility
that a knowledge of the value of the independent variable X may give us some ability to
predict the value of Y.
Crossplots
Crossplots are the best way to look for a relationship between two variables. They involve
minimal assumptions: just that one’s measurements are reliable and paired (xi, yi). They
permit use of an extremely efficient and robust tool for pattern recognition: the eye. Such
pattern recognition and its associated brainstorming are a joy.
Nonlinear Relationships
The biggest pitfall of linear regression and correlation coefficients is that so many
relationships between variables are nonlinear. As an extreme example, imagine applying
these techniques to the annual temperature variation of Anchorage (Figure 10b). For a
sinusoidal distribution such as this, the correlation coefficient would be virtually zero and
regression would yield the absurd conclusion that knowledge of what month it is (X) gives
no information about expected temperature (Y). In general, any departure from a linear
relationship degrades the correlation coefficient.
The first defense against nonlinear relationships is to transform one or both variables so
that the relation between them is linear. Taking the logarithm of one or both is by far the
most common transformation; taking reciprocals is another. Taking the logarithm of both variables is equivalent to fitting the relationship Y = b·X^m rather than the usual Y = b + m·X.
Our earlier plotting hint to try to obtain a linear relationship had two purposes. First, linear
regression and correlation coefficients assume linearity. Second, linear trends are
somewhat easier for the eye to discern.
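As a brief sketch of the log-log transformation, fitting Y = b·X^m reduces to ordinary linear regression on log X and log Y; the synthetic data below is generated with b = 2 and m = 1.5 plus mild noise.

# Sketch: recovering b and m of Y = b * X**m by regressing log(Y) on log(X).
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 50)
Y = 2.0 * X ** 1.5 * rng.lognormal(0, 0.05, X.size)   # synthetic power-law data

m, log_b = np.polyfit(np.log(X), np.log(Y), 1)        # slope = m, intercept = log(b)
print(round(m, 2), round(np.exp(log_b), 2))            # close to 1.5 and 2.0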
Neural Networks
Forward Propagation
Input Layer: Each feature in the input layer is represented by a node on the
network, which receives input data.
Weights and Connections: The weight of each neuronal connection indicates
how strong the connection is. Throughout training, these weights are changed.
Hidden Layers: Each hidden layer neuron processes inputs by multiplying them
by weights, adding them up, and then passing them through an activation function.
By doing this, non-linearity is introduced, enabling the network to recognize
intricate patterns.
Output: The final result is produced by repeating the process until the output layer
is reached.
Backpropagation
Loss Calculation: The network’s output is evaluated against the real target values, and a loss function is used to compute the difference. For a regression problem, the Mean Squared Error (MSE) is commonly used as the cost function.
Loss Function: MSE = (1/n) * Σ (y_i − ŷ_i)², where y_i is the target value and ŷ_i is the network’s prediction.
Gradient Descent: Gradient descent is then used by the network to reduce the
loss. To lower the inaccuracy, weights are changed based on the derivative of the
loss with respect to each weight.
Adjusting weights: The weights are adjusted at each connection by applying this
iterative process, or backpropagation, backward across the network.
Training: During training with different data samples, the entire process of forward
propagation, loss calculation, and backpropagation is done iteratively, enabling the
network to adapt and learn patterns from the data.
Activation Functions: Non-linearity is introduced into the model by activation functions such as the rectified linear unit (ReLU) or the sigmoid; whether a neuron “fires” is decided from its total weighted input.
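The forward and backward passes described above can be sketched for a tiny network as follows; the architecture (2 inputs, 2 sigmoid hidden units, 1 sigmoid output), the single training example, and the learning rate are arbitrary choices for illustration.

# Sketch: one forward pass and one gradient-descent update for a tiny network.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, 0.8])          # one made-up training example
t = np.array([1.0])               # target

W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input -> hidden
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward propagation
h = sigmoid(W1 @ x + b1)
y = sigmoid(W2 @ h + b2)
loss = 0.5 * np.sum((y - t) ** 2)               # squared-error loss

# Backpropagation (chain rule) and one gradient-descent step
delta_out = (y - t) * y * (1 - y)               # dLoss/d(output pre-activation)
delta_hid = (W2.T @ delta_out) * h * (1 - h)
lr = 0.5
W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid

print(round(loss, 4))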
Learning of a Neural Network
1. Learning with supervised learning
In supervised learning, the neural network is guided by a teacher who has access to the desired input-output pairs. The network creates outputs based on inputs without taking
into account the surroundings. By comparing these outputs to the teacher-known
desired outputs, an error signal is generated. In order to reduce errors, the network’s
parameters are changed iteratively and stop when performance is at an acceptable
level.
2. Learning with Unsupervised learning
Equivalent output variables are absent in unsupervised learning. Its main goal is to
comprehend incoming data’s (X) underlying structure. No instructor is present to offer
advice. Modeling data patterns and relationships is the intended outcome instead.
Words like regression and classification are related to supervised learning, whereas
unsupervised learning is associated with clustering and association.
3. Learning with Reinforcement Learning
Through interaction with the environment and feedback in the form of rewards or
penalties, the network gains knowledge. Finding a policy or strategy that optimizes
cumulative rewards over time is the goal for the network. This kind is frequently
utilized in gaming and decision-making applications.
Types of Neural Networks
Several types of neural networks are commonly used, including the following.
Feedforward Networks: A feedforward neural network is a simple artificial neural
network architecture in which data moves from input to output in a single direction.
It has input, hidden, and output layers; feedback loops are absent. Its
straightforward architecture makes it appropriate for a number of applications, such
as regression and pattern recognition.
Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with
three or more layers, including an input layer, one or more hidden layers, and an
output layer. It uses nonlinear activation functions.
Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is
a specialized artificial neural network designed for image processing. It employs
convolutional layers to automatically learn hierarchical features from input images,
enabling effective image recognition and classification. CNNs have revolutionized
computer vision and are pivotal in tasks like object detection and image analysis.
Recurrent Neural Network (RNN): An artificial neural network type intended for
sequential data processing is called a Recurrent Neural Network (RNN). It is
appropriate for applications where contextual dependencies are critical, such as
time series prediction and natural language processing, since it makes use of
feedback loops, which enable information to survive within the network.
Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to
overcome the vanishing gradient problem in training RNNs. It uses memory cells
and gates to selectively read, write, and erase information.
Introduction to Optimization
Optimization refers to finding the values of inputs in such a way that we get the “best”
output values. The definition of “best” varies from problem to problem, but in
mathematical terms, it refers to maximizing or minimizing one or more objective
functions, by varying the input parameters.
The set of all possible solutions or values which the inputs can take makes up the search space. In this search space lies a point or a set of points which gives the optimal
solution. The aim of optimization is to find that point or set of points in the search space.
Nature has always been a great source of inspiration to all mankind. Genetic Algorithms
(GAs) are search based algorithms based on the concepts of natural selection and
genetics. GAs are a subset of a much larger branch of computation known
as Evolutionary Computation.
GAs were developed by John Holland and his students and colleagues at the University of Michigan, most notably David E. Goldberg, and have since been tried on various optimization problems with a high degree of success.
A GA maintains a population of candidate solutions and repeatedly selects the fitter individuals, recombining and mutating them to produce the next generation. In this way we keep “evolving” better individuals or solutions over generations, till we reach a stopping criterion.
Genetic Algorithms are sufficiently randomized in nature, but they perform much better
than random local search (in which we just try various random solutions, keeping track
of the best so far), as they exploit historical information as well.
Advantages of GAs
GAs have various advantages which have made them immensely popular. These
include −
Does not require any derivative information (which may not be available for many
real-world problems).
Is faster and more efficient as compared to the traditional methods.
Has very good parallel capabilities.
Optimizes both continuous and discrete functions and also multi-objective
problems.
Provides a list of “good” solutions and not just a single solution.
Always gets an answer to the problem, which gets better over time.
Useful when the search space is very large and there are a large number of
parameters involved.
Limitations of GAs
Like any technique, GAs also suffer from a few limitations. These include −
GAs are not suited for all problems, especially problems which are simple and for
which derivative information is available.
Fitness value is calculated repeatedly which might be computationally expensive
for some problems.
Being stochastic, there are no guarantees on the optimality or the quality of the
solution.
If not implemented properly, the GA may not converge to the optimal solution.
GA – Motivation
Traditional calculus-based methods work by starting at a random point and moving in the direction of the gradient, till we reach the top of the hill. This technique is efficient and works very well for single-peaked objective functions like the cost function in linear regression. But in most real-world situations we have very complex objective landscapes, made of many peaks and many valleys, which cause such methods to fail, as they suffer from an inherent tendency to get stuck at a local optimum.
Some difficult problems like the Travelling Salesperson Problem (TSP), have real-world
applications like path finding and VLSI Design. Now imagine that you are using your
GPS Navigation system, and it takes a few minutes (or even a few hours) to compute
the “optimal” path from the source to destination. Delay in such real world applications is
not acceptable and therefore a “good-enough” solution, which is delivered “fast” is what
is required.
This section introduces the basic terminology required to understand GAs. Also, a
generic structure of GAs is presented in both pseudo-code and graphical forms. The
reader is advised to properly understand all the concepts introduced in this section and
keep them in mind when reading other sections of this tutorial as well.
Basic Terminology
Fitness Function − A fitness function simply defined is a function which takes the
solution as input and produces the suitability of the solution as the output. In some
cases, the fitness function and the objective function may be the same, while in
others it might be different based on the problem.
Genetic Operators − These alter the genetic composition of the offspring. These
include crossover, mutation, selection, etc.
Basic Structure
A GA proceeds by initializing a random population, evaluating the fitness of each individual, selecting parents, applying crossover and mutation to produce offspring, and repeating until a stopping criterion is met. Each of these steps is covered as a separate chapter later in this tutorial.
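A minimal, self-contained sketch of this loop is shown below, using the classic "maximize the number of 1s in a bit string" toy fitness function; the population size, mutation rate, and other parameters are arbitrary choices.

# Sketch: a minimal genetic algorithm on the "OneMax" toy problem.
import random

random.seed(0)
LENGTH, POP, GENERATIONS, MUT_RATE = 20, 30, 40, 0.02

def fitness(ind):                      # fitness function: count of 1 bits
    return sum(ind)

def select(pop):                       # tournament selection of size 2
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):                 # single-point crossover
    cut = random.randrange(1, LENGTH)
    return p1[:cut] + p2[cut:]

def mutate(ind):                       # bit-flip mutation
    return [1 - g if random.random() < MUT_RATE else g for g in ind]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print(fitness(best), best)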
Support Vector Machines
For a two-class dataset with two input features x1 and x2, there are multiple lines (our hyperplane here is a line because we are considering only two input features) that segregate the data points, i.e. classify the red and blue circles. So how do we choose the best line, or in general the best hyperplane, that segregates our data points?
We choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, or hard margin. Now consider a different scenario.
Here we have one blue ball in the boundary of the red ball. So how does SVM classify
the data? It’s simple! The blue ball in the boundary of red ones is an outlier of blue
balls. The SVM algorithm has the characteristics to ignore the outlier and finds the
best hyperplane that maximizes the margin. SVM is robust to outliers.
So for this type of data, what SVM does is find the maximum margin as with the previous data sets, and in addition it adds a penalty each time a point crosses the margin. The margins in these cases are called soft margins. With a soft margin, the SVM tries to minimize (1/margin + λ·(∑ penalties)). Hinge loss is a commonly used penalty: if there is no violation there is no hinge loss, and if there is a violation the hinge loss is proportional to the distance of the violation.
Till now, we were talking about linearly separable data (the group of blue balls and red balls are separable by a straight line). What do we do if the data are not linearly separable?
Suppose the data points lie along a line and cannot be separated by any single threshold. SVM solves this by creating a new variable using a kernel: for a point xi on the line, we create a new variable yi as a function of its distance from the origin o. If we plot xi against yi, the two classes become separable.
In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates a new variable is referred to as a kernel.
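A short sketch with scikit-learn's SVC illustrates the effect of a kernel on data that is not linearly separable (two concentric rings from make_circles); the parameter values are illustrative defaults.

# Sketch: linear vs RBF kernel on non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # C controls the soft-margin penalty

print("linear kernel accuracy:", linear.score(X, y))
print("rbf kernel accuracy:   ", rbf.score(X, y))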
Advantages of SVM
Effective in high-dimensional cases.
It is memory efficient, as it uses a subset of the training points in the decision function, called support vectors.
Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.