Unit 1-1
Induction Algorithms. Rule Induction. Decision Trees. Bayesian Methods. The Basic
Naïve Bayes Classifier. Naive Bayes Induction for Numeric Attributes. Correction to the
Probability Estimation. Laplace Correction. No Match. Other Bayesian Methods. Other
Induction Methods. Neural Networks. Genetic Algorithms. Instance based Learning.
Support Vector Machines.
Induction Algorithms.
Rule Induction
What is Rule Induction?
Rule induction is a machine-learning technique that involves the discovery of patterns or
rules in data. It aims to extract explicit if-then rules that can accurately predict or classify
instances based on their features or attributes. Rule induction is the data mining process of deducing if-then rules from a data set. These symbolic decision rules explain
an inherent relationship between the attributes and class labels in the data set. Many real-
life experiences are based on intuitive rule induction. For example, we can proclaim a rule
that states “if it is 8 a.m. on a weekday, then highway traffic will be heavy” and “if it is 8
p.m. on a Sunday, then the traffic will be light.” These rules are not necessarily right all
the time. 8 a.m. weekday traffic may be light during a holiday season. But, in general,
these rules hold true and are deduced from real-life experience based on our everyday
observations. Rule induction provides a powerful classification approach that can be
easily understood by the general audience. Apart from its use in Predictive Analytics by
classification of unknown data, rule induction is also used to describe the patterns in the
data. The description is in the form of simple if-then rules that can be easily understood
by general users.
The easiest way to extract rules from a data set is from a decision tree that is developed
on the same data set. A decision tree splits data on every node and leads to the leaf
where the class is identified. If we trace back from the leaf to the root node, we can
combine all the split conditions to form a distinct rule.
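To make this concrete, here is a minimal sketch of tracing rules out of a tree using scikit-learn's export_text; the encoded weather-style rows and the feature names are made up purely for illustration.

# Sketch: each root-to-leaf path printed by export_text corresponds to one
# if-then rule. The tiny encoded dataset below is illustrative only.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]   # encoded [outlook, humidity]
y = [0, 0, 1, 1, 1, 0]                                  # 1 = play, 0 = don't play

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["outlook", "humidity"]))

Each printed path reads directly as an if-then rule over the split conditions along that path.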
Recently, there has been substantial attention devoted to the use of machine learning
techniques as tools for decision support. These methods have been applied to a wide
variety of problems in engineering because of their ability to discover patterns from data.
The integration of these methods with conventional decision support systems can provide
a means for significantly improving the quality of decision making. A decision support
system can employ machine learning techniques to derive knowledge directly from prior
decision examples and to refine this knowledge continually. Inductive learning is perhaps
the most widely used machine learning technique. Inductive learning algorithms are
simple and fast. Another advantage is that they generate models that are easy to
understand. Finally, inductive learning algorithms compare favourably in accuracy with other machine learning techniques. Inductive learning techniques can be divided into two main
categories, namely, decision tree induction and rule induction. RULES (RULe Extraction
System) is a family of inductive learning algorithms that follow the rule induction approach.
The process of rule induction typically involves the following steps:
Data Preparation: The input data is prepared by organizing it into a structured format,
such as a table or a matrix, where each row represents an instance or observation, and
each column represents a feature or attribute.
Rule Generation: The rule generation process involves finding patterns or associations
in the data that can be expressed as if-then rules. Various algorithms and methods can
be used for rule generation, such as decision tree algorithms (e.g., C4.5, CART),
association rule mining algorithms (e.g., Apriori), and logical reasoning approaches (e.g.,
inductive logic programming).
Rule Evaluation: Once the rules are generated, they need to be evaluated to determine their quality and usefulness. Evaluation metrics can include accuracy, coverage, support, confidence, lift, and other measures depending on the specific application and domain (a brief computation sketch of support and confidence follows this list).
Rule Selection and Pruning: Depending on the complexity of the rule set and the
specific requirements, rule selection and pruning techniques can be applied to refine the
rule set. This process involves removing redundant, irrelevant, or overlapping rules to
improve interpretability and efficiency.
Rule Application: Once a set of high-quality rules is obtained, they can be applied to
new, unseen instances for prediction or classification. Each instance is evaluated against
the rules, and the applicable rule(s) with the highest confidence or support are used to make predictions or decisions.
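Following up on the Rule Evaluation step above, here is a small sketch of computing the support and confidence of a hypothetical rule "if A then B" over a made-up list of transactions.

# Sketch: evaluating one if-then rule ("if A then B") on made-up transaction data.
transactions = [
    {"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}, {"A", "B"},
]

has_a = [t for t in transactions if "A" in t]
has_a_and_b = [t for t in has_a if "B" in t]

support = len(has_a_and_b) / len(transactions)    # fraction of all rows containing A and B
confidence = len(has_a_and_b) / len(has_a)        # of rows containing A, fraction also containing B

print(f"support={support:.2f}, confidence={confidence:.2f}")  # support=0.60, confidence=0.75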
Rule induction has been widely used in various domains, such as data mining, machine
learning, expert systems, and decision support systems. It provides interpretable and
human-readable models, making it useful for generating understandable insights and
explanations from data.
While rule induction can be effective in capturing explicit patterns and associations in the
data, it may struggle with capturing complex or non-linear relationships. Additionally, rule
induction algorithms may face challenges when dealing with large and high-dimensional
datasets, as the search space of possible rules can become exponentially large. The
importance of rule induction lies in its ability to extract interpretable and actionable
knowledge from complex datasets. It provides a way to discover underlying patterns,
dependencies, or rules that humans can easily understand and utilize. Rule induction has
applications in various domains, including data mining, machine learning, expert systems,
decision support systems, and business intelligence.
Decision Trees
Tree induction is a method used in machine learning to derive decision trees from data.
Decision trees are predictive models that use a set of binary rules to calculate a target
value. They are widely used for classification and regression tasks because they are
interpretable, easy to implement, and can handle both numerical and categorical data.
Tree induction algorithms work by recursively partitioning the dataset into subsets based
on the features that provide the best separation between classes or values.
Decision Tree is a supervised learning method used in data mining for classification and regression tasks. It is a tree that helps us for decision-making purposes. The decision tree creates classification or regression models as a tree structure. It separates a data set into smaller subsets, and at the same time, the decision tree is steadily developed. The final tree is a tree with decision nodes and leaf nodes. A decision node has at
least two branches. The leaf nodes show a classification or decision; leaf nodes cannot be split any further. The uppermost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can deal with both categorical and numerical data.
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures
the randomness or impurity in data sets.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split. It is also called entropy reduction. Building a decision tree is all about discovering attributes that return the highest information gain.
In short, a decision tree is just like a flow chart diagram with the terminal nodes showing
decisions. Starting with the dataset, we can measure the entropy to find a way to segment
the set until the data belongs to the same class.
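As a worked sketch of entropy and information gain, the snippet below uses base-2 logarithms and made-up class counts for one candidate split (the counts echo the classic 9-versus-5 play/don't-play example).

# Sketch: entropy and information gain for a candidate split (toy counts, base-2 log).
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

parent = ["yes"] * 9 + ["no"] * 5     # e.g. 9 "play" vs 5 "don't play"
left   = ["yes"] * 6 + ["no"] * 1     # subset where humidity = normal (made up)
right  = ["yes"] * 3 + ["no"] * 4     # subset where humidity = high (made up)

gain = (entropy(parent)
        - (len(left) / len(parent)) * entropy(left)
        - (len(right) / len(parent)) * entropy(right))
print(round(gain, 3))                  # roughly 0.151 for these counts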
In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an extensive collection of records into smaller, more class-homogeneous sets by applying a sequence of simple decision rules. A decision tree model comprises a set of rules for partitioning a large heterogeneous population into smaller, more homogeneous, mutually exclusive classes. The attributes can be any type of variable, from nominal, ordinal, and binary to quantitative values; in contrast, the class must be of a qualitative type, such as categorical, ordinal, or binary. In brief, given the data on attributes together with their class, a decision tree creates a set of rules that can be used to
identify the class. One rule is implemented after another, resulting in a hierarchy of
segments within a segment. The hierarchy is known as the tree, and each segment is
called a node. With each progressive division, the members from the subsequent sets
become more and more similar to each other. Hence, the algorithm used to build a
decision tree is referred to as recursive partitioning. The algorithm is known as CART
(Classification and Regression Trees). Consider the example of a factory where expanding costs $3 million: the probability of a good economy is 0.6 (60%), which leads to $8 million profit, and the probability of a bad economy is 0.4 (40%), which leads to $6 million profit. Not expanding costs $0: the probability of a good economy is 0.6 (60%), which leads to $4 million profit, and the probability of a bad economy is 0.4, which leads to $2 million profit. The management team needs to make a data-driven decision on whether to expand, based on the given data.
Net Expand = (0.6*8 + 0.4*6) - 3 = $4.2M
Net Not Expand = (0.6*4 + 0.4*2) - 0 = $3.2M
Since $4.2M > $3.2M, the factory should be expanded.
ID3 (Iterative Dichotomiser 3): This algorithm uses entropy and information gain to build
a decision tree for classification tasks.
C4.5: An extension of ID3, C4.5 uses the gain ratio to address some of the limitations of
information gain and can handle both continuous and discrete features.
CART (Classification and Regression Trees): CART is a versatile algorithm that can be
used for both classification and regression. It uses Gini impurity for classification and
variance reduction for regression.
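As a hedged sketch of how these criteria appear in practice, scikit-learn's DecisionTreeClassifier can be switched between entropy-based splitting (in the spirit of ID3/C4.5) and Gini-based splitting (as in CART); the encoded rows and labels below are invented.

# Sketch: the same toy data fit with entropy-based and Gini-based splitting criteria.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 85], [0, 90], [1, 78], [2, 96], [2, 80], [2, 70], [1, 65], [0, 95]]
y = [0, 0, 1, 1, 1, 1, 1, 0]   # 1 = play, 0 = don't play

for criterion in ("entropy", "gini"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    print(criterion, clf.predict([[1, 70]]))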
Bayesian Methods
The Basic Naive Bayes Classifier
Given an instance I described by a conjunction of attribute values ∧ Vj, Bayes' rule gives the probability of each class Ck as
p(Ck | ∧ Vj) = p(Ck) · p(∧ Vj | Ck) / Σi p(Ci) · p(∧ Vj | Ci),
where the denominator sums over all classes and where p(∧ Vj | Ck) is the probability of the instance I given the class Ck. After calculating these quantities for each description, the algorithm assigns the instance to the class with the highest probability. In order to make the above expression operational, one must still specify how to compute the term p(∧ Vj | Ck). The naive Bayesian classifier assumes independence of attributes within each class, which lets it use the equality
p(∧ Vj | Ck) = Πj p(Vj | Ck),
where the values p(Vj | Ck) represent the conditional probabilities stored with each class. This approach greatly simplifies the computation of class probabilities for a given observation. The Bayesian framework also lets one specify prior probabilities for both the class and the conditional terms. In the absence of domain-specific knowledge, a common scheme makes use of 'uninformed priors', which assign equal probabilities to each class and to the values of each attribute. However, one must also specify how much weight to give these priors relative to the training data. Learning in the naive Bayesian classifier is an almost trivial matter. The simplest implementation increments a count each time it encounters a new instance, along with a separate count for a class each time it observes an instance of that class. These counts let the classifier estimate p(Ck) for each class Ck. For each nominal value, the algorithm updates a count for that class-value pair; together with the second count, this lets the classifier estimate p(Vj | Ck). For each numeric attribute, the method retains and revises two quantities, the sum and the sum of squares, which let it compute the mean and variance of a normal curve that it uses to find p(Vj | Ck). In domains that can have missing attributes, it must include a fourth count for each class-attribute pair.
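The counting scheme just described might look like the following minimal sketch for nominal attributes only; the tiny (outlook, windy) training set is made up.

# Sketch of the counting scheme above: class counts give p(Ck);
# class-value counts give p(Vj | Ck). Training examples are invented.
from collections import Counter, defaultdict

instances = [
    (("sunny", "false"), "no"), (("sunny", "true"), "no"),
    (("rain", "false"), "yes"), (("overcast", "true"), "yes"),
    (("rain", "false"), "yes"),
]

class_counts = Counter()
value_counts = defaultdict(Counter)      # value_counts[class][(attr_index, value)]

for values, label in instances:          # one incremental pass over the training data
    class_counts[label] += 1
    for j, v in enumerate(values):
        value_counts[label][(j, v)] += 1

n = sum(class_counts.values())
p_class = {c: class_counts[c] / n for c in class_counts}
p_value_given_class = {
    c: {vk: cnt / class_counts[c] for vk, cnt in value_counts[c].items()}
    for c in class_counts
}
print(p_class["yes"], p_value_given_class["yes"][(0, "rain")])  # 0.6 and 2/3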
In contrast to many induction methods, the naive Bayesian classifier does not carry out
an extensive search through a space of possible descriptions. The basic algorithm makes
no choices about how to partition the data, which direction to move in a weight space, or
the like, and the resulting probabilistic summary is completely determined by the training
data and the prior probabilities. Nor does the order of the training instances have any
effect on the output; the basic process produces the same description whether it
operates incrementally or nonincrementally. These features make the learning algorithm
both simple to understand and quite efficient. Bayesian classifiers would appear to have
advantages over many induction algorithms. For example, their collection of class and
conditional probabilities should make them inherently robust with respect to noise. Their
statistical basis should also let them scale well to domains that involve many irrelevant
attributes.
The Naïve Bayes algorithm is used for classification problems and is widely used in text classification. In text classification tasks, the data are high dimensional (each word represents one feature). It is used in spam filtering, sentiment detection, rating classification, etc. The advantage of naïve Bayes is its speed: it is fast, and making predictions is easy even with high-dimensional data. The model predicts the probability that an instance belongs to a class given a set of feature values. It is a probabilistic classifier, and it is called naïve because it assumes that one feature in the model is independent of the existence of any other feature. In other words, each feature contributes to the prediction with no relation to the others. In the real world, this condition is rarely satisfied. The algorithm uses Bayes' theorem for training and prediction.
(Table: the golf dataset, with attribute columns Outlook, Temperature, Humidity, and Windy, and class column Play Golf.)
The dataset is divided into two parts, namely, feature matrix and the response vector.
Feature matrix contains all the vectors(rows) of dataset in which each vector consists of
the value of dependent features. In above dataset, features are ‘Outlook’, ‘Temperature’,
‘Humidity’ and ‘Windy’.
Response vector contains the value of class variable(prediction or output) for each row
of feature matrix. In above dataset, the class variable name is ‘Play golf’.
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. More specifically:
Feature independence: The features of the data are conditionally independent of each
other, given the class label.
Continuous features are normally distributed: If a feature is continuous, then it is assumed
to be normally distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it is assumed
to have a multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to the
prediction of the class label.
No missing data: The data should not contain any missing values.
With relation to our dataset, this concept can be understood as:
We assume that no pair of features are dependent. For example, the temperature being
‘Hot’ has nothing to do with the humidity or the outlook being ‘Rainy’ has no effect on the
winds. Hence, the features are assumed to be independent.
Secondly, each feature is given the same weight(or importance). For example, knowing
only temperature and humidity alone can’t predict the outcome accurately. None of the
attributes is irrelevant and assumed to be contributing equally to the outcome.
The assumptions made by Naive Bayes are not generally correct in real-world situations.
In fact, the independence assumption is never correct but often works well in practice. Now, before moving to the formula for Naive Bayes, it is important to know about Bayes’ theorem.
Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:
P(A|B) = P(B|A) · P(A) / P(B)
Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
P(A) is the priori of A (the prior probability, i.e. the probability of the event before the evidence is seen). The evidence is an attribute value of an unknown instance (here, event B).
P(B) is the marginal probability: the probability of the evidence.
P(A|B) is the posteriori probability of A given B, i.e. the probability of the event after the evidence is seen.
P(B|A) is the likelihood probability, i.e. the likelihood that the hypothesis is true given the evidence.
Now, with regard to our dataset, we can apply Bayes’ theorem in the following way, where y is the class variable and x1, ..., xn are the features:
P(y | x1, ..., xn) = [P(x1 | y) · P(x2 | y) · ... · P(xn | y) · P(y)] / [P(x1) · P(x2) · ... · P(xn)]
Now, as the denominator remains constant for a given input, we can remove that term:
P(y | x1, ..., xn) ∝ P(y) · Π P(xi | y)
Now, we need to create a classifier model. For this, we find the probability of a given set of inputs for all possible values of the class variable y and pick the output with maximum probability. This can be expressed mathematically as:
y = argmax over y of P(y) · Π P(xi | y)
So now, we are done with our pre-computations and the classifier is ready. To classify a new set of feature values (say, today’s outlook, temperature, humidity, and wind), we evaluate this expression for each class and predict the class with the highest posterior probability.
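A minimal sketch of this procedure with scikit-learn's CategoricalNB is shown below; the five encoded rows stand in for the golf table (which is not reproduced here), so the exact numbers are illustrative only.

# Sketch: classifying a new day with a categorical naive Bayes model.
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

X_raw = [["Sunny", "Hot", "High", "False"],
         ["Rainy", "Mild", "Normal", "False"],
         ["Overcast", "Cool", "Normal", "True"],
         ["Sunny", "Mild", "High", "True"],
         ["Rainy", "Cool", "Normal", "False"]]
y = ["No", "Yes", "Yes", "No", "Yes"]

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)                          # encode nominal values as integers

model = CategoricalNB(alpha=1.0).fit(X, y)            # alpha=1.0 is Laplace smoothing
today = enc.transform([["Sunny", "Hot", "Normal", "False"]])
print(model.predict(today), model.predict_proba(today))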
Correction to the Probability Estimation
Naïve Bayes is a probabilistic classifier based on Bayes theorem and is used for
classification tasks. It works well enough in text classification problems such as spam
filtering and the classification of reviews as positive or negative. The algorithm seems
perfect at first, but the fundamental representation of Naïve Bayes can create some
problems in real-world scenarios. Let’s take an example of text classification where the task is to classify whether the review is positive or negative. We build a likelihood table
based on the training data. While querying a review, we use the Likelihood table values,
but what if a word in a review was not present in the training dataset?
Query review = w1 w2 w3 w’
We have four words in our query review, and let’s assume only w1, w2, and w3 are
present in training data. So, we will have a likelihood for those words. To calculate
whether the review is positive or negative, we compare P(positive|review) and
P(negative|review).
In the likelihood table, we have P(w1|positive), P(w2|Positive), P(w3|Positive), and
P(positive)
but where is P(w’|positive)? If the word is absent in the training dataset, then we don’t have its likelihood. What should we do?
Approach 1: Ignore the term P(w’|positive)
Ignoring means that we are assigning it a value of 1, which means the probability of w’
occurring in positive P(w’|positive) and negative review P(w’|negative) is 1. This approach
seems logically incorrect.
Approach 2: In a bag-of-words model, we count the occurrences of words. The number of occurrences of word w’ in training is 0. According to that,
P(w’|positive)=0 and P(w’|negative)=0, but this will make both P(positive|review) and
P(negative|review) equal to 0 since we multiply all the likelihoods. This is the problem of
zero probability. So, how to deal with this problem?
Laplace Smoothing
Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naïve Bayes. Using Laplace smoothing, we can represent P(w’|positive) as
P(w’|positive) = (number of positive reviews containing w’ + alpha) / (N + alpha * K)
Here,
alpha represents the smoothing parameter,
K represents the number of dimensions (features) in the data, and
N represents the number of reviews with y=positive.
If we choose a value of alpha != 0 (not equal to 0), the probability will no longer be zero even if a word is not present in the training dataset.
Interpretation of changing alpha
Let’s say the occurrence of word w is 3 with y=positive in the training data. Assume we have 2 features in our dataset, i.e., K=2, and N=100 (total number of positive reviews).
Case 1: when alpha=1
P(w|positive) = (3+1)/(100+2) = 4/102
Case 2: when alpha=100
P(w|positive) = (3+100)/(100+200) = 103/300
Case 3: when alpha=1000
P(w|positive) = (3+1000)/(100+2000) = 1003/2100
As alpha increases, the likelihood probability moves towards uniform distribution (0.5).
Most of the time, alpha = 1 is being used to remove the problem of zero probability.
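The three cases above can be reproduced with a one-line smoothed estimate, (count + alpha) / (N + alpha * K):

# Sketch: the smoothed estimate for the numbers used above.
def laplace(count, N, K, alpha):
    return (count + alpha) / (N + alpha * K)

for alpha in (1, 100, 1000):
    print(alpha, round(laplace(3, 100, 2, alpha), 4))
# 1 -> 4/102 ≈ 0.0392, 100 -> 103/300 ≈ 0.3433, 1000 -> 1003/2100 ≈ 0.4776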
Numerical stability
In the earliest days of programming, developers often encountered difficulties when it
came to storing decimal or floating-point values in computer memory. While they could
easily represent whole numbers, representing decimal values posed a challenge.
The reason behind this challenge is that computers use binary representation, using only
0s and 1s, to represent any number. Consequently, it becomes challenging to accurately
represent decimal values in binary form. For instance, when representing extremely small
numbers like 0.000001, precision can be lost, and the value may be treated as 0. Let’s
consider an example in the field of biology. Suppose you are measuring the radius of a
cell. In some cases, your measurements might be extremely small, such as 0.00000001.
Now, let’s say you want to compare this radius to another cell’s radius, which is
0.0000003. Due to the limitations of computer representation, the computer will treat both
values as zero, leading to the incorrect conclusion that both cells have equal radii. This
condition is referred to as underflow. Underflow refers to a situation in which values
smaller than the smallest representable value in a computer’s numeric system are
rounded down to zero.
Let’s explore underflow in the context of a simple example. Suppose you are trying to
predict whether a student will get a placement based on their CGPA (Cumulative Grade
Point Average) and IQ (Intelligence Quotient). Let’s say the student has a CGPA of 8.1
and an IQ of 81. To calculate the probability of placement, you need to evaluate the
following:
p(y|8.1, 81) = p(y) * p(8.1|y) * p(81|y)
p(n|8.1, 81) = p(n) * p(8.1|n) * p(81|n)
Since probabilities range from 0 to 1, when you multiply these probabilities together
(especially if you have multiple features), the result tends to move closer to zero. This
leads to the underflow problem, where the computed probability becomes extremely
small, approaching zero, and can cause inaccuracies in the prediction model.
To address the underflow problem, one solution is to work with logarithmic probabilities.
By taking the logarithm of the probabilities, you can avoid the issue of extremely small
values.
The logarithmic property log(A * B) = log(A) + log(B) is useful here. It allows us to rewrite the expression log(p(y) * p(8.1|y) * p(81|y)) as the sum of logarithms:
log(p(y)) + log(p(8.1|y)) + log(p(81|y))
In the context of implementing this solution, you can utilize the predict_log_proba(X)
function available in the scikit-learn library's Naive Bayes implementation. This function
computes the logarithm of the probabilities for each class given input features X. After
calculating the logarithmic probabilities, you can compare them and choose the class with
the highest log probability. For example, if you obtain a log probability of −25 for one class and −53 for another, you would select the class with the higher (less negative) log probability, i.e. −25. By using logarithmic probabilities, you can overcome the underflow problem and make more accurate predictions.
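A small sketch of this with scikit-learn's GaussianNB follows; the CGPA/IQ rows and placement labels are made up for illustration.

# Sketch: comparing classes by log-probability to avoid underflow.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[8.1, 81], [6.2, 65], [9.0, 110], [5.5, 70], [7.8, 95], [6.9, 60]])
y = np.array([1, 0, 1, 0, 1, 0])            # 1 = placed, 0 = not placed (invented)

model = GaussianNB().fit(X, y)
log_probs = model.predict_log_proba([[8.1, 81]])   # log p(class | features)
print(log_probs, model.classes_[np.argmax(log_probs)])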
To convert this data into a binary bag-of-words table, we represent each word with a binary value (0 or 1). Each word corresponds to a column, and its presence in a review is denoted by a 1, while its absence is denoted by a 0; the sentiment is recorded in a "Sentiment" column. Consider an additional query point "r4" containing the words "w1, w1, w1": in the binary representation, w1 is present (1) while w2 and w3 are absent (0). The sentiment for r4 is yet to be predicted.
Now, let’s calculate the probabilities for positive and negative sentiments for r4 from the training reviews:
p(+ve|r4) = p(+ve) * p(w1=1|+ve) * p(w2=0|+ve) * p(w3=0|+ve) = (2/3) * (1/1) * (1/1) * (0/1) = 0
p(-ve|r4) = p(-ve) * p(w1=1|-ve) * p(w2=0|-ve) * p(w3=0|-ve) = (1/3) * (1/2) * (0/2) * (1/2) = 0
As you can see, both probabilities become 0, which is an issue when certain features do
not exist in a particular class, resulting in zero probabilities. This is where Laplace additive
smoothing comes in. Laplace additive smoothing helps avoid zero probabilities by adding
a small constant (alpha) to the numerator and n * alpha to the denominator of each
probability estimate. By applying Laplace additive smoothing, the probabilities will never
be zero. The value of alpha is usually 1 (default), but you can choose a different value
based on your preference. The value of n depends on the type of Naive Bayes algorithm you are using. Let's now understand the bias-variance tradeoff in the case of Naive Bayes. The question arises: why do we add alpha in the numerator and n * alpha in the denominator? Why don't we add a very small constant value like 0.000001 instead?
The reason we add alpha in the numerator and n * alpha in the denominator is to have
flexibility in controlling the bias and variance of the model. By tuning the value of alpha,
we can adjust the bias and variance accordingly.
When a model has high bias, it means it has simplified assumptions or constraints that
may lead to underfitting, resulting in poor performance. In such cases, we can set a lower
value of alpha to reduce bias and allow the model to capture more complex patterns. On
the other hand, when a model has high variance, it means it is too sensitive to the training
data and may overfit, resulting in poor generalization to unseen data. To address high
variance, we can set a higher value of alpha to smoothen the probability estimates and
reduce the impact of individual features, thus reducing variance. Alpha serves as a
hyperparameter that allows us to strike a balance between bias and variance. By
choosing different values of alpha, we can fine-tune the model’s behavior and find the
optimal tradeoff between bias and variance for a specific problem.
There are two reasons why we use Laplace additive smoothing:
1. To ensure that probabilities will not become zero.
2. By tuning the value of alpha and n * alpha, we can reduce overfitting and strike a balance in the bias-variance trade-off.
The Matching Problem
This famous problem has been stated variously in terms of hats and people, letters and
envelopes, tea cups and saucers – indeed, any situation in which you might want to match
two kinds of items seems to have appeared somewhere as a setting for the matching
problem. In the letter-envelope setting there are n letters labeled 1 through n and
also n envelopes labeled 1 through n. The letters are permuted randomly into the
envelopes, one letter per envelope (a mishap usually blamed on an unfortunate
hypothetical secretary), so that all permutations are equally likely. The main questions
are about the number of letters that are placed into their matching envelopes.
"Real life" settings aside, the problem is about the number of fixed points of a random
permutation. A fixed point is an element whose position is unchanged by the shuffle.
If letters falling in the right envelopes are good events, then the worst possible event
is every letter falling in a wrong envelope. That is the event that there are no
matches, and is called a derangement. Let's find the chance of a derangement.
The key is to notice that the complement is a union, and then use the inclusion-exclusion formula.
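Carrying out that inclusion-exclusion computation gives P(no match) = 1 − 1/1! + 1/2! − 1/3! + … ± 1/n!, which rapidly approaches 1/e ≈ 0.3679 as n grows; a short numeric check:

# Sketch: chance of a derangement (no letter in its matching envelope)
# by inclusion-exclusion; the sum tends to 1/e as n grows.
from math import factorial, e

def p_derangement(n):
    return sum((-1) ** k / factorial(k) for k in range(n + 1))

for n in (3, 5, 10):
    print(n, round(p_derangement(n), 6))
print("1/e ≈", round(1 / e, 6))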
Other Bayesian Methods
Other Induction Methods
Induction is pattern recognition -- an inference based on limited observational or
experimental data -- and pattern recognition is an addictively exhilarating acquired skill.
Of the two types of scientific inference, induction is far more pervasive and useful than
deduction (Chapter 4). Induction usually infers some pattern among a set of observations
and then attributes that pattern to an entire population. Almost all hypothesis formation is
based consciously or subconsciously on induction.
Induction is pervasive because people seek order insatiably, yet they lack the opportunity
of basing that search on observation of the entire population. Instead they make a few
observations and generalize.
Induction is not just a description of observations; it is always a leap beyond the data -- a
leap based on circumstantial evidence. The leap may be an inference that other
observations would exhibit the same phenomena already seen in the study sample, or it
may be some type of explanation or conceptual understanding of the observations; often
it is both. Because induction is always a leap beyond the data, it can never be proved. If
further observations are consistent with the induction, then they confirm, or lend
substantiating support to, the induction. But the possibility always remains that as-yet-
unexamined data might disprove the induction.
Types of Explanation
Individual events are complex, but explanation discerns their underlying simplicity of
relationships. In this section we will consider briefly two types of scientific explanation:
comparison (analogy and symmetry) and classification. In subsequent sections we will
examine, in much more detail, two more powerful types of explanation: correlation and
causality.
Explanation can deal with attributes or with variables. An attribute is binary: either present
or absent. Explanation of attributes often involves consideration of associations of the
attribute with certain phenomena or circumstances. A variable, in contrast, is not merely
present or absent; it is a characteristic whose changes can be quantitatively measured.
Explanations of a variable often involve description of a correlation between changes in
that variable and changes in another variable. If a subjective attribute, such as tall or
short, can be transformed into a variable, such as height, explanatory value increases.
The different kinds of explanation contrast in explanatory power and experimental ease.
Easiest to test is the null hypothesis that two variables are completely unrelated.
Statistical rejection of the null hypothesis can demonstrate the likelihood that a
classification or correlation has predictive value. Causality goes deeper, establishing the
origin of that predictive ability, but demonstration of causality can be very challenging.
Beyond causality, the underlying quantitative theoretical mechanism sometimes can be
discerned.
* * *
Analogy is the description of observed behavior in one class of phenomena and the
inference that this description is somehow relevant to a different class of phenomena.
Analogy does not necessarily imply that the two classes obey the same laws or function
in exactly the same way. Analogy often is an apparent order or similarity that serves only
as a visualization aid. That purpose is sufficient justification, and the analogy may inspire
fruitful follow-up research. In other cases, analogy can reflect a more fundamental
physical link between behaviors of the two classes.
Classifications evolve to regain utility when exceptions and anomalous examples are found. Often these exceptions can be explained by a more restrictive and complex class definition. Frequently, the smaller class exhibits greater commonality of other characteristics than was observed within the larger class.
Coincidence
Without attention to statistical evidence and confirmatory power, the scientist falls into the
most common pitfall of non-scientists: hasty generalization. One or a few chance
associations between two attributes or variables are mistakenly inferred to represent a
causal relationship. Hasty generalization is responsible for many popular superstitions,
but even scientists such as Aristotle were not immune to it. Hasty generalizations are
often inspired by coincidence, the unexpected and improbable association between two
or more events. After compiling and analyzing thousands of coincidences, Diaconis and Mosteller [1989] found that coincidences could be grouped into three classes:
• cases where there was an unnoticed causal relationship, so the association actually was
not a coincidence;
• nonrepresentative samples, focusing on one association while ignoring or forgetting
examples of non-matches;
• actual chance events that are much more likely than one might expect.
An example of this third type is that any group of 23 people has a 50% chance of at least
two people having the same birthday.
Correlation
Begin with two variables, which we will call X and Y, for which we have several
measurements. By convention, X is called the independent variable and Y is the
dependent variable. Perhaps X causes Y, so that the value of Y is truly dependent on the
value of X. Such a condition would be convenient, but all we really require is the possibility
that a knowledge of the value of the independent variable X may give us some ability to
predict the value of Y.
Crossplots
Crossplots are the best way to look for a relationship between two variables. They involve
minimal assumptions: just that one’s measurements are reliable and paired (xi, yi). They
permit use of an extremely efficient and robust tool for pattern recognition: the eye. Such
pattern recognition and its associated brainstorming are a joy.
Nonlinear Relationships
The biggest pitfall of linear regression and correlation coefficients is that so many
relationships between variables are nonlinear. As an extreme example, imagine applying
these techniques to the annual temperature variation of Anchorage (Figure 10b). For a
sinusoidal distribution such as this, the correlation coefficient would be virtually zero and
regression would yield the absurd conclusion that knowledge of what month it is (X) gives
no information about expected temperature (Y). In general, any departure from a linear
relationship degrades the correlation coefficient.
The first defense against nonlinear relationships is to transform one or both variables so
that the relation between them is linear. Taking the logarithm of one or both is by far the
most common transformation; taking reciprocals is another. Taking the logarithm of both variables is equivalent to fitting the relationship Y = b·X^m rather than the usual Y = b + m·X.
Our earlier plotting hint to try to obtain a linear relationship had two purposes. First, linear
regression and correlation coefficients assume linearity. Second, linear trends are
somewhat easier for the eye to discern.
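As a brief sketch of the log-log transformation, fitting Y = b·X^m reduces to ordinary linear regression on log X and log Y; the synthetic data below is generated with b = 2 and m = 1.5 plus mild noise.

# Sketch: recovering b and m of Y = b * X**m by regressing log(Y) on log(X).
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 50)
Y = 2.0 * X ** 1.5 * rng.lognormal(0, 0.05, X.size)   # synthetic power-law data

m, log_b = np.polyfit(np.log(X), np.log(Y), 1)        # slope = m, intercept = log(b)
print(round(m, 2), round(np.exp(log_b), 2))            # close to 1.5 and 2.0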
Neural Networks
Forward Propagation
Input Layer: Each feature in the input layer is represented by a node on the
network, which receives input data.
Weights and Connections: The weight of each neuronal connection indicates
how strong the connection is. Throughout training, these weights are changed.
Hidden Layers: Each hidden layer neuron processes inputs by multiplying them
by weights, adding them up, and then passing them through an activation function.
By doing this, non-linearity is introduced, enabling the network to recognize
intricate patterns.
Output: The final result is produced by repeating the process until the output layer
is reached.
Backpropagation
Loss Calculation: The network’s output is evaluated against the real target values, and a loss function is used to compute the difference. For a regression problem, the Mean Squared Error (MSE) is commonly used as the cost function.
Loss Function: MSE = (1/n) * Σ (y_i − ŷ_i)², where y_i is the target value and ŷ_i is the network’s prediction.
Gradient Descent: Gradient descent is then used by the network to reduce the
loss. To lower the inaccuracy, weights are changed based on the derivative of the
loss with respect to each weight.
Adjusting weights: The weights are adjusted at each connection by applying this
iterative process, or backpropagation, backward across the network.
Training: During training with different data samples, the entire process of forward
propagation, loss calculation, and backpropagation is done iteratively, enabling the
network to adapt and learn patterns from the data.
Activation Functions: Non-linearity is introduced into the model by activation functions such as the rectified linear unit (ReLU) or the sigmoid; whether a neuron “fires” is decided from its total weighted input.
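The forward and backward passes described above can be sketched for a tiny network as follows; the architecture (2 inputs, 2 sigmoid hidden units, 1 sigmoid output), the single training example, and the learning rate are arbitrary choices for illustration.

# Sketch: one forward pass and one gradient-descent update for a tiny network.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, 0.8])          # one made-up training example
t = np.array([1.0])               # target

W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input -> hidden
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward propagation
h = sigmoid(W1 @ x + b1)
y = sigmoid(W2 @ h + b2)
loss = 0.5 * np.sum((y - t) ** 2)               # squared-error loss

# Backpropagation (chain rule) and one gradient-descent step
delta_out = (y - t) * y * (1 - y)               # dLoss/d(output pre-activation)
delta_hid = (W2.T @ delta_out) * h * (1 - h)
lr = 0.5
W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid

print(round(loss, 4))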
Learning of a Neural Network
1. Learning with supervised learning
In supervised learning, the neural network is guided by a teacher who has access to the desired input-output pairs. The network creates outputs based on inputs without taking
into account the surroundings. By comparing these outputs to the teacher-known
desired outputs, an error signal is generated. In order to reduce errors, the network’s
parameters are changed iteratively and stop when performance is at an acceptable
level.
2. Learning with Unsupervised learning
Equivalent output variables are absent in unsupervised learning. Its main goal is to
comprehend incoming data’s (X) underlying structure. No instructor is present to offer
advice. Modeling data patterns and relationships is the intended outcome instead.
Words like regression and classification are related to supervised learning, whereas
unsupervised learning is associated with clustering and association.
3. Learning with Reinforcement Learning
Through interaction with the environment and feedback in the form of rewards or
penalties, the network gains knowledge. Finding a policy or strategy that optimizes
cumulative rewards over time is the goal for the network. This kind is frequently
utilized in gaming and decision-making applications.
Types of Neural Networks
Several types of neural networks are commonly used, including the following.
Feedforward Networks: A feedforward neural network is a simple artificial neural
network architecture in which data moves from input to output in a single direction.
It has input, hidden, and output layers; feedback loops are absent. Its
straightforward architecture makes it appropriate for a number of applications, such
as regression and pattern recognition.
Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with
three or more layers, including an input layer, one or more hidden layers, and an
output layer. It uses nonlinear activation functions.
Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is
a specialized artificial neural network designed for image processing. It employs
convolutional layers to automatically learn hierarchical features from input images,
enabling effective image recognition and classification. CNNs have revolutionized
computer vision and are pivotal in tasks like object detection and image analysis.
Recurrent Neural Network (RNN): An artificial neural network type intended for
sequential data processing is called a Recurrent Neural Network (RNN). It is
appropriate for applications where contextual dependencies are critical, such as
time series prediction and natural language processing, since it makes use of
feedback loops, which enable information to survive within the network.
Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to
overcome the vanishing gradient problem in training RNNs. It uses memory cells
and gates to selectively read, write, and erase information.
Introduction to Optimization
Optimization refers to finding the values of inputs in such a way that we get the “best”
output values. The definition of “best” varies from problem to problem, but in
mathematical terms, it refers to maximizing or minimizing one or more objective
functions, by varying the input parameters.
The set of all possible solutions or values which the inputs can take makes up the search space. In this search space lies a point or a set of points which gives the optimal
solution. The aim of optimization is to find that point or set of points in the search space.
Nature has always been a great source of inspiration to all mankind. Genetic Algorithms
(GAs) are search based algorithms based on the concepts of natural selection and
genetics. GAs are a subset of a much larger branch of computation known
as Evolutionary Computation.
GAs were developed by John Holland and his students and colleagues at the University of Michigan, most notably David E. Goldberg, and have since been tried on various optimization problems with a high degree of success.
A GA maintains a population of candidate solutions and repeatedly selects the fitter individuals, recombining and mutating them to produce the next generation. In this way we keep “evolving” better individuals or solutions over generations, till we reach a stopping criterion.
Genetic Algorithms are sufficiently randomized in nature, but they perform much better
than random local search (in which we just try various random solutions, keeping track
of the best so far), as they exploit historical information as well.
Advantages of GAs
GAs have various advantages which have made them immensely popular. These
include −
Does not require any derivative information (which may not be available for many
real-world problems).
Is faster and more efficient as compared to the traditional methods.
Has very good parallel capabilities.
Optimizes both continuous and discrete functions and also multi-objective
problems.
Provides a list of “good” solutions and not just a single solution.
Always gets an answer to the problem, which gets better over time.
Useful when the search space is very large and there are a large number of
parameters involved.
Limitations of GAs
Like any technique, GAs also suffer from a few limitations. These include −
GAs are not suited for all problems, especially problems which are simple and for
which derivative information is available.
Fitness value is calculated repeatedly which might be computationally expensive
for some problems.
Being stochastic, there are no guarantees on the optimality or the quality of the
solution.
If not implemented properly, the GA may not converge to the optimal solution.
GA – Motivation
Traditional calculus-based methods work by starting at a random point and moving in the direction of the gradient, till we reach the top of the hill. This technique is efficient and works very well for single-peaked objective functions like the cost function in linear regression. But in most real-world situations we have very complex objective landscapes, made of many peaks and many valleys, which cause such methods to fail, as they suffer from an inherent tendency to get stuck at a local optimum.
Some difficult problems like the Travelling Salesperson Problem (TSP), have real-world
applications like path finding and VLSI Design. Now imagine that you are using your
GPS Navigation system, and it takes a few minutes (or even a few hours) to compute
the “optimal” path from the source to destination. Delay in such real world applications is
not acceptable and therefore a “good-enough” solution, which is delivered “fast” is what
is required.
This section introduces the basic terminology required to understand GAs. Also, a
generic structure of GAs is presented in both pseudo-code and graphical forms. The
reader is advised to properly understand all the concepts introduced in this section and
keep them in mind when reading other sections of this tutorial as well.
Basic Terminology
Fitness Function − A fitness function simply defined is a function which takes the
solution as input and produces the suitability of the solution as the output. In some
cases, the fitness function and the objective function may be the same, while in
others it might be different based on the problem.
Genetic Operators − These alter the genetic composition of the offspring. These
include crossover, mutation, selection, etc.
Basic Structure
A GA proceeds by initializing a random population, evaluating the fitness of each individual, selecting parents, applying crossover and mutation to produce offspring, and repeating until a stopping criterion is met. Each of these steps is covered as a separate chapter later in this tutorial.
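A minimal, self-contained sketch of this loop is shown below, using the classic "maximize the number of 1s in a bit string" toy fitness function; the population size, mutation rate, and other parameters are arbitrary choices.

# Sketch: a minimal genetic algorithm on the "OneMax" toy problem.
import random

random.seed(0)
LENGTH, POP, GENERATIONS, MUT_RATE = 20, 30, 40, 0.02

def fitness(ind):                      # fitness function: count of 1 bits
    return sum(ind)

def select(pop):                       # tournament selection of size 2
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):                 # single-point crossover
    cut = random.randrange(1, LENGTH)
    return p1[:cut] + p2[cut:]

def mutate(ind):                       # bit-flip mutation
    return [1 - g if random.random() < MUT_RATE else g for g in ind]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print(fitness(best), best)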
Support Vector Machines
For a two-class dataset with two input features x1 and x2, there are multiple lines (our hyperplane here is a line because we are considering only two input features) that segregate the data points, i.e. classify the red and blue circles. So how do we choose the best line, or in general the best hyperplane, that segregates our data points?
We choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, or hard margin. Now consider a different scenario.
Here we have one blue ball in the boundary of the red ball. So how does SVM classify
the data? It’s simple! The blue ball in the boundary of red ones is an outlier of blue
balls. The SVM algorithm has the characteristics to ignore the outlier and finds the
best hyperplane that maximizes the margin. SVM is robust to outliers.
So for this type of data, what SVM does is find the maximum margin as with the previous data sets, and in addition it adds a penalty each time a point crosses the margin. The margins in these cases are called soft margins. With a soft margin, the SVM tries to minimize (1/margin + λ·(∑ penalties)). Hinge loss is a commonly used penalty: if there is no violation there is no hinge loss, and if there is a violation the hinge loss is proportional to the distance of the violation.
Till now, we were talking about linearly separable data (the group of blue balls and red balls are separable by a straight line). What do we do if the data are not linearly separable?
Suppose the data points lie along a line and cannot be separated by any single threshold. SVM solves this by creating a new variable using a kernel: for a point xi on the line, we create a new variable yi as a function of its distance from the origin o. If we plot xi against yi, the two classes become separable.
In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates a new variable is referred to as a kernel.
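A short sketch with scikit-learn's SVC illustrates the effect of a kernel on data that is not linearly separable (two concentric rings from make_circles); the parameter values are illustrative defaults.

# Sketch: linear vs RBF kernel on non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # C controls the soft-margin penalty

print("linear kernel accuracy:", linear.score(X, y))
print("rbf kernel accuracy:   ", rbf.score(X, y))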
Advantages of SVM
Effective in high-dimensional cases.
It is memory efficient, as it uses a subset of the training points in the decision function, called support vectors.
Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.