ML Notes
MODULE I:
Definitions of Machine Learning
Machine Learning is the science (and art) of programming computers so they
can learn from data.
Machine Learning is the field of study that gives computers the ability to learn
without being explicitly programmed.
A computer program is said to learn from experience E with respect to some task
T and some performance measure P, if its performance on T, as measured by P,
improves with experience E.
Machine Learning shines for complex problems for which there is no good solution at all using a
traditional approach, or for which there is no known algorithm: the best Machine Learning
techniques can often find a solution. For example, consider speech recognition:
say you want to start simple and write a program capable of distinguishing the words
“one” and “two.” You might notice that the word “two” starts with a high-pitch sound
(“T”), so you could hardcode an algorithm that measures high-pitch sound intensity
and use that to distinguish ones and twos. Obviously this technique will not scale to
thousands of words spoken by millions of very different people in noisy environments
and in dozens of languages. The best solution (at least today) is to write an algorithm
that learns by itself, given many example recordings for each word.
Supervised/Unsupervised Learning
Machine Learning systems can be classified according to the amount and type of
supervision they get during training. There are four major categories: supervised
learning, unsupervised learning, semisupervised learning, and Reinforcement Learning.
In supervised learning, the training data you feed to the algorithm includes the desired
solutions, called labels. A typical supervised learning task is classification. The spam
filter is a good example of this: it is trained with many example emails along with
their class (spam or ham), and it must learn how to classify new emails.
Another typical task is to predict a target numeric value, such as the price of a car, given
a set of features (mileage, age, brand, etc.) called predictors. This sort of task
is called regression.
Eg:
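A minimal scikit-learn sketch of both supervised tasks (the datasets and feature values below are illustrative, not from the notes): a classifier trained on labeled iris flowers, and a regressor trained on a toy price table.
#Python sketch: classification and regression (supervised learning)
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
import numpy as np
# Classification: predict a class label from labeled examples
iris=load_iris()
clf=KNeighborsClassifier()
clf.fit(iris.data,iris.target)              # features + labels (the supervision)
print(clf.predict(iris.data[:1]))           # predicted class for the first flower
# Regression: predict a numeric target from predictors (toy mileage -> price data)
X=np.array([[20],[40],[60],[80]])           # e.g., mileage in thousands of km
y=np.array([15.0,12.0,9.5,7.0])             # e.g., price in thousands
reg=LinearRegression().fit(X,y)
print(reg.predict([[50]]))                  # predicted price for an unseen mileage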
Unsupervised learning
In unsupervised learning, the training data is unlabeled. The system tries to learn without
a supervisor. For example, say you have a lot of data about your blog’s visitors. You may
want to run a clustering algorithm to try to detect groups of similar visitors. At no point
do you tell the algorithm which group a visitor belongs to: it finds those connections
without your help. For example, it might notice that 40% of your visitors are males who
love comic books and generally read your blog in the evening, while 20% are young sci-fi
lovers who visit during the weekends, and so on. If you use a hierarchical
clustering algorithm, it may also subdivide each group into smaller groups. This may help
you target your posts for each group.
Eg:
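A minimal clustering sketch with scikit-learn; the "visitor" features below are made-up illustrative numbers.
#Python sketch: clustering unlabeled data with KMeans
from sklearn.cluster import KMeans
import numpy as np
# unlabeled visitor data: e.g., [age, pages read per visit]
X=np.array([[25,3],[27,4],[24,2],[55,9],[53,8],[58,10]])
km=KMeans(n_clusters=2,n_init=10)
km.fit(X)                       # no labels are given at any point
print(km.labels_)               # cluster assigned to each visitor
print(km.cluster_centers_)      # the detected group centers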
(ii)Dimensionality Reduction
❖ The number of input features, variables, or columns present in a given dataset is
known as dimensionality,
❖ and the process to reduce these features is called dimensionality reduction
(Feature extraction).
❖ A dataset contains a huge number of input features in various cases, which makes
the predictive modeling task more complicated.
❖ Because it is very difficult to visualize or make predictions for a training dataset
with a high number of features, dimensionality reduction techniques are required
in such cases.
Eg:
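A short sketch of dimensionality reduction (feature extraction), using PCA on the iris dataset as one common example of such a technique.
#Python sketch: reducing 4 input features to 2 extracted features with PCA
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
X=load_iris().data                   # 4 input features per flower
pca=PCA(n_components=2)              # keep 2 extracted features
X2=pca.fit_transform(X)
print(X2.shape)                      # (150, 2)
print(pca.explained_variance_ratio_) # share of the variance kept by each component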
Semi-supervised learning
Some algorithms can deal with partially labeled training data, usually a lot of unlabeled
data and a little bit of labeled data. This is called semisupervised learning. Some photo-
hosting services, such as Google Photos, are good examples of this. Once you upload all
your family photos to the service, it automatically recognizes that the same person A
shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7.
This is the unsupervised part of the algorithm (clustering). Now all the system needs is one label
per person, and it is able to name everyone in every photo, which is useful for searching photos.
Most semisupervised learning algorithms are combinations of unsupervised and
supervised algorithms.
Eg:
For example, a self-supervised learning model might be trained to predict the location
of an object in an image given the surrounding pixels, or to classify a video as depicting a
particular action.
Reinforcement Learning
Reinforcement Learning system uses an agent in this context which can observe the
environment, select and perform actions, and get rewards in return (or penalties in the
form of negative rewards). It must then learn by itself what is the best strategy, called
a policy, to get the most reward over time. A policy defines what action the agent should
choose when it is in a given situation.
Eg:
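A tiny self-contained sketch of the agent/reward/policy idea using tabular Q-learning on a made-up 5-state corridor environment (the environment and all hyperparameters are illustrative assumptions, not from the notes).
#Python sketch: an agent learning a policy from rewards (tabular Q-learning)
import numpy as np
# Toy environment: 5 states in a row; the agent starts at state 0,
# action 0 = move left, action 1 = move right; reaching state 4 gives reward +1.
n_states,n_actions=5,2
Q=np.zeros((n_states,n_actions))          # action-value estimates
alpha,gamma,eps=0.1,0.9,0.2               # learning rate, discount, exploration rate
rng=np.random.default_rng(42)
for episode in range(500):
    s=0
    while s!=4:
        a=rng.integers(n_actions) if rng.random()<eps else int(Q[s].argmax())
        s2=max(s-1,0) if a==0 else min(s+1,4)
        r=1.0 if s2==4 else 0.0
        # Q-learning update: move the estimate toward reward + discounted best future value
        Q[s,a]+=alpha*(r+gamma*Q[s2].max()-Q[s,a])
        s=s2
policy=Q.argmax(axis=1)                   # the learned policy: best action in each state
print(policy)                             # expected to prefer "right" (1) in states 0-3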
Drawbacks of batch learning:
Handling large amounts of data: Batch learning requires loading the entire dataset into
memory for training. This becomes a challenge when dealing with large datasets that
exceed the available memory capacity.
Hardware limitations: Batch learning can be computationally expensive, especially when
dealing with complex models or large datasets. Training a model on a single machine may
take a significant amount of time and may require high-performance hardware, such as
GPUs or specialized processing units.
Availability constraints: In some scenarios, obtaining the entire dataset required for batch
learning may not be feasible or practical.
Online learning
In online learning, data is fed to the model sequentially in small groups called mini-batches. After
each batch of training, the model gets a little better. Since these batches are small chunks of data,
the training can even be performed on the server while the system is in production; that is why it
is called online learning: the model keeps getting trained while it is live on the server.
Online learning is great for systems that receive data as a continuous flow (e.g., stock
prices) and need to adapt to change rapidly or autonomously. It is also a good option if
you have limited computing resources: once an online learning system has learned about
new data instances, it does not need them anymore, so you can discard them.
Using online learning to handle huge datasets
One important parameter of online learning systems is how fast they should adapt to
changing data: this is called the learning rate. If you set a high learning rate, then your
system will rapidly adapt to new data, but it will also tend to quickly forget the old data.
Conversely, if you set a low learning rate, the system will have more inertia; that is, it will
learn more slowly, but it will also be less sensitive to noise in the new data or to sequences
of nonrepresentative data points. A big challenge with online learning is that if bad data
is fed to the system, the system’s performance will gradually decline.
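A minimal sketch of online learning using scikit-learn's SGDRegressor and partial_fit; the streaming mini-batches below are synthetic.
#Python sketch: incremental (online) learning on a stream of mini-batches
import numpy as np
from sklearn.linear_model import SGDRegressor
sgd=SGDRegressor(learning_rate="constant",eta0=0.01)    # eta0 is the learning rate
for _ in range(100):                                    # each iteration simulates a new mini-batch arriving
    Xbatch=np.random.rand(20,3)
    ybatch=Xbatch@np.array([1.0,2.0,3.0])+0.1*np.random.randn(20)
    sgd.partial_fit(Xbatch,ybatch)                      # incremental update; old batches can be discarded
print(sgd.coef_)                                        # parameters keep improving as data streams in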
One more way to categorize Machine Learning systems is by how they generalize. There
are two main approaches to generalization: instance-based learning and model-based
learning.
Model-Based Learning
Model-based learning involves creating a mathematical model that can predict outcomes
based on input data. The model is trained on a large dataset and then used to make
predictions on new data. The model can be thought of as a set of rules that the machine
uses to make predictions. In model-based learning, the training data is used to create a
model that can be generalized to new data. The model is typically created using statistical
algorithms such as linear regression, logistic regression, decision trees, and neural
networks. These algorithms use the training data to create a mathematical model that can
be used to predict outcomes.
Eg:
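A minimal model-based learning sketch: a linear model is fitted to a small synthetic dataset, and the learned parameters (the "model") are then used to predict new data.
#Python sketch: model-based learning with a linear model
from sklearn.linear_model import LinearRegression
import numpy as np
X=np.array([[1],[2],[3],[4],[5]],dtype=float)
y=np.array([2.1,4.2,6.1,8.3,9.9])
model=LinearRegression().fit(X,y)      # the "model" is the fitted line y = a*x + b
print(model.coef_,model.intercept_)    # learned parameters that generalize to new inputs
print(model.predict([[6]]))            # prediction for an unseen input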
Instance-Based Learning
Instance-based learning involves using the entire dataset to make predictions. The
machine learns by storing all instances of data and then using these instances to make
predictions on new data. The machine compares the new data to the instances it has seen
before and uses the closest match to make a prediction. In instance-based learning, no
model is created. Instead, the machine stores all of the training data and uses this data to
make predictions based on new data. Instance-based learning is often used in pattern
recognition, clustering, and anomaly detection.
Eg:
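A minimal instance-based learning sketch with k-nearest neighbors on made-up points: the training instances are simply stored, and new points are classified by comparison with their closest stored matches.
#Python sketch: instance-based learning with k-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
X=np.array([[1,1],[1,2],[2,1],[8,8],[8,9],[9,8]])
y=np.array([0,0,0,1,1,1])
knn=KNeighborsClassifier(n_neighbors=3)   # no explicit model is built: the instances are stored
knn.fit(X,y)
print(knn.predict([[2,2],[9,9]]))         # classified by the closest stored instances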
Nonrepresentative Training Data
Suppose you want to build a system to recognize funk music videos, and you build the training set
by searching "funk music" on YouTube and using the resulting videos. This assumes that the
search engine returns a set of videos that are representative of all the funk music videos on
YouTube. In reality, the search results are likely to be biased toward popular artists (and if you live
in Brazil you will get a lot of "funk carioca" videos, which sound nothing like James Brown). If the
training set does not represent the cases you want to generalize to, the model is unlikely to make
accurate predictions.
Poor-Quality Data
If the training data is full of errors, outliers, and noise (e.g., due to poor-quality
measurements), it will make it harder for the system to detect the underlying patterns,
so your system is less likely to perform well. It is often well worth the effort to spend time
cleaning up your training data. The truth is, most data scientists spend a significant part
of their time doing just that.
Irrelevant Features
As the saying goes: garbage in, garbage out. Your system will only be capable of learning
if the training data contains enough relevant features and not too many irrelevant ones.
A critical part of the success of a Machine Learning project is coming up with a good set
of features to train on. This process, called feature engineering, involves:
• Feature selection: selecting the most useful features to train on among
existing features.
• Feature extraction: combining existing features to produce a more useful
one (as we saw earlier, dimensionality reduction algorithms can help).
• Creating new features by gathering new data.
Overfitting:
it occurs when the model performs well on the training data but does not generalize well to new
data. This happens when the model is too complex relative to the amount and noisiness of the
training data, so it ends up learning patterns in the noise itself.
The main options to fix this problem are:
• Selecting a simpler model, with fewer parameters
• Gathering more training data
• Reducing the noise in the training data (e.g., fixing data errors and removing outliers)
• Constraining the model (e.g., increasing the regularization hyperparameter)
Underfitting
A statistical model or a machine learning algorithm is said to have underfitting when a
model is too simple to capture data complexities. It represents the inability of the model
to learn the training data effectively, resulting in poor performance on both the training
and testing data. In simple terms, an underfit model's predictions are inaccurate, especially when
applied to new, unseen examples. It mainly happens when we use a very simple model
with overly simplified assumptions. To address underfitting problem of the model, we
need to use more complex models, with enhanced feature representation, and less
regularization.
Learning
Learning is the action or process of obtaining information or ability through studying,
practicing, being instructed, or experiencing something. Learning techniques can be split
into five categories:
1. Rote Learning (Memorizing): Memorizing things without understanding the
underlying principles or rationale.
2. Instructions (Passive Learning): Learning from a teacher or expert.
3. Analogy (Experience): We may learn new things by applying what we’ve learned
in the past.
4. Inductive Learning (Experience): Formulating a generalized notion based on prior
experience.
5. Deductive Learning: Getting new information from old information.
Concept Learning
There are three things that an algorithm that enables concept learning must have:
1. Training data (past experiences to train our models)
2. Target concept (a hypothesis to identify data objects)
3. Data objects themselves (for testing the models)
In supervised learning techniques, the main aim is to determine the possible hypothesis
out of hypothesis space that best maps input to the corresponding or correct outputs.
There are some common methods given to find out the possible hypothesis from the
Hypothesis space, where hypothesis space is represented by uppercase-h (H) and
hypothesis by lowercase-h (h).
Training experience
During the design of a checkers learning system, the type of training experience
available for a learning system will have a significant effect on the success or failure of the
learning.
1. Direct or Indirect training experience — In the case of direct training
experience, an individual board states and correct move for each board state
are given. In case of indirect training experience, the move sequences for a
game and the final result (win, loss or draw) are given for a number of games.
2. Supervised — The training experience will be labeled, which means, all the
board states will be labeled with the correct move. So the learning takes place
in the presence of a supervisor.
Unsupervised — The training experience will be unlabeled, which means, all
the board states will not have the moves. So the learner generates random
games and plays against itself with no supervision involvement.
Semi-supervised — Learner generates game states and asks the supervisor for
help in finding the correct move if the board state is confusing.
3. Is the training experience good —
Performance is best when training examples and test examples are from the
same/a similar distribution.
To illustrate, the hypothesis that a person enjoys his favorite sport only on cold days with
high humidity (independent of the values of the other attributes) is represented by the
expression
(?, Cold, High, ?, ?, ?)
Instance Space
Consider, for example, the instances X and hypotheses H in the EnjoySport learning task.
Given that the attribute Sky has three possible values, and that AirTemp, Humidity, Wind,
Water, and Forecast each have two possible values, the instance space X contains
exactly 3 . 2 . 2 . 2 . 2 . 2 = 96 distinct instances.
Example:
Let's assume there are two features F1 and F2, where F1 has A and B as possible values and F2
has X and Y as possible values.
F1 – > A, B
F2 – > X, Y
Instance Space: (A, X), (A, Y), (B, X), (B, Y) – 4 instances
Syntactically distinct Hypothesis Space: (A, X), (A, Y), (A, ø), (A, ?), (B, X), (B, Y), (B, ø), (B, ?),
(ø, X), (ø, Y), (ø, ø), (ø, ?), (?, X), (?, Y), (?, ø), (?, ?) – 16
Semantically distinct Hypothesis Space: (A, X), (A, Y), (A, ?), (B, X), (B, Y), (B, ?), (?, X), (?, Y),
(?, ?), plus the single empty (reject-all) hypothesis – 10
Hypothesis Space
Similarly, for the EnjoySport task there are 5 . 4 . 4 . 4 . 4 . 4 = 5120 syntactically distinct hypotheses within H.
Notice, however, that every hypothesis containing one or more “ø” symbols represents
the empty set of instances; that is, it classifies every instance as negative.
Therefore, the number of semantically distinct hypotheses is only 1 + (4 . 3 . 3 . 3 . 3 . 3) =
973.
FIND-S algorithm
Find-S algorithm, is a machine learning algorithm that seeks to find a maximally specific
hypothesis based on labeled training data. It starts with the most specific hypothesis and
generalizes it by incorporating positive examples. It ignores negative examples during
the learning process. The algorithm's objective is to discover a hypothesis that accurately
represents the target concept by progressively expanding the hypothesis space until it
covers all positive instances.
During the iterative process, the algorithm may introduce "don't care" symbols or
placeholders (often denoted as "?") in the hypothesis for attributes that vary among
positive examples. This allows the algorithm to generalize the concept by accommodating
varying attribute values. The algorithm discovers patterns in the training data and
provides a reliable representation of the concept being learned.
Example:
Step 1: Initialization
H=[<0,0,0,0,0,0>]
Step 2: Consider the first sample, compare the sample values and hypothesis values one by
one and make changes:
H=[<Sunny,Warm,Normal,Strong,Warm,Same>]
Step 3: Consider the second (positive) sample and generalize the attributes that differ:
H=[<Sunny,Warm,?,Strong,Warm,Same>]
Step 4: Skip the third sample as it is negative and then consider the fourth sample:
H=[<Sunny,Warm,?,Strong,?,?>]
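A short Python sketch of FIND-S on the EnjoySport data; the second training row is assumed from the standard textbook version of the dataset, and is consistent with the steps above.
#Python sketch of the FIND-S algorithm
# Each training example is (attribute tuple, label); '?' means "any value", None means "no value yet".
data=[
    (("Sunny","Warm","Normal","Strong","Warm","Same"),"Yes"),
    (("Sunny","Warm","High","Strong","Warm","Same"),"Yes"),
    (("Rainy","Cold","High","Strong","Warm","Change"),"No"),
    (("Sunny","Warm","High","Strong","Cool","Change"),"Yes"),
]
h=[None]*6                                   # start with the most specific hypothesis
for x,label in data:
    if label!="Yes":                         # FIND-S ignores negative examples
        continue
    for i,v in enumerate(x):
        if h[i] is None:
            h[i]=v                           # first positive example: copy its values
        elif h[i]!=v:
            h[i]="?"                         # generalize attributes that differ
print(h)    # ['Sunny', 'Warm', '?', 'Strong', '?', '?']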
The version space, denoted VS_H,D with respect to hypothesis space H and training
examples D, is the subset of hypotheses from H consistent with the training examples in D
In the list of hypotheses, there are two extremes representing the general (h1 and h2) and
specific (h6) hypotheses. Let's define these two extremes as the general boundary G and the
specific boundary S.
Definition — G
The general boundary G, with respect to hypothesis space H and training data D, is the set
of maximally general members of H consistent with D.
Definition — S
The specific boundary S, with respect to hypothesis space H and training data D, is the set
of minimally general (i.e., maximally specific) members of H consistent with D.
Example:
Consider the dataset:
Sky AirTemp Humidity Wind Water Forecast EnjoySport
Sunny Warm Normal Strong Warm Same Yes
Initial Values:
G0=[<?,?,?,?,?,?>]
S0=[<0,0,0,0,0,0>]
Inductive Bias
Every machine learning algorithm with any ability to generalize beyond the training data
that it sees has, by definition, some type of inductive bias. That is, there is
some fundamental assumption or set of assumptions that the learner makes about the
target function that enables it to generalize beyond the training data. The candidate
elimination algorithm converges toward the true target concept provided it is given
accurate training examples and provided its initial hypothesis space contains the target
concept.
• What if the target concept is not contained in the hypothesis space?
• Can we avoid this difficulty by using a hypothesis space that includes every
possible hypothesis ?
• How does the size of this hypothesis space influence the ability of the
algorithm to generalize to unobserved instances ?
• How does the size of the hypothesis space influence the number of training
examples that must be observed ?
The following three learning algorithms are listed from weakest to strongest bias.
1.Rote-learning : storing each observed training example in memory. If the instance is
found in memory, the stored classification is returned.
Inductive bias : nothing — Weakest bias
2.Candidate-Elimination algorithm : new instances are classified only in the case where all
members of the current version space agree in the classification.
Inductive bias : Target concept can be represented in its hypothesis space
3. Find-S : find the most specific hypothesis consistent with the training examples. It then
uses this hypothesis to classify all subsequent instances.
Inductive bias : Target concept can be represented in its hypothesis space + All instances
are negative instances unless the opposite is entailed by its other knowledge — Strongest
bias
MODULE-II
Working with real data
Many of the Machine Learning Crash Course Programming Exercises use the California
housing data set, which contains data drawn from the 1990 U.S. Census. The following
table provides descriptions, data ranges, and data types for each feature in the data set.
Column title: Description
longitude: A measure of how far west a house is; a more negative value is farther west
latitude: A measure of how far north a house is; a higher value is farther north
housingMedianAge: Median age of a house within a block; a lower number is a newer building
totalRooms: Total number of rooms within a block
totalBedrooms: Total number of bedrooms within a block
population: Total number of people residing within a block
households: Total number of households (a group of people residing within a home unit) for a block
medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
medianHouseValue: Median house value for households within a block (measured in US Dollars)
Ocean proximity: The distance from the house to the ocean, expressed as different categories
Google colab is used to operate on this data set and perform machine learning
preprocessing operations and machine learning techniques.
Dataset splitting
Two standard techniques for splitting a data set are random shuffling and stratified
sampling. A simple random sample is used to represent the entire data population and
randomly selects individuals from the population without any other consideration.
A stratified random sample, on the other hand, first divides the population into smaller
groups, or strata, based on shared characteristics. Therefore, a stratified sampling
strategy will ensure that members from each subgroup are included in the data analysis.
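A sketch of a stratified split with scikit-learn, assuming the housing1 DataFrame used in the later snippets and an income_cat column created beforehand (for example, by binning median_income with pd.cut).
#Python sketch: stratified train/test split on an assumed income_cat column
from sklearn.model_selection import train_test_split
train_set,test_set=train_test_split(housing1,test_size=0.2,
                                    stratify=housing1["income_cat"],random_state=42)
print(test_set["income_cat"].value_counts()/len(test_set))   # proportions match the full dataset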
Exploratory Data Analysis (EDA) is the process of examining a dataset to summarize its main
characteristics, often with visual methods, before building models. Key aspects of EDA include:
• Distribution of Data: Examining the distribution of data points to
understand their range, central tendencies (mean, median), and dispersion
(variance, standard deviation).
• Graphical Representations: Utilizing charts such as histograms, box plots,
scatter plots, and bar charts to visualize relationships within the data and
distributions of variables.
• Outlier Detection: Identifying unusual values that deviate from other data
points. Outliers can influence statistical analyses and might indicate data
entry errors or unique cases.
• Correlation Analysis: Checking the relationships between variables to
understand how they might affect each other. This includes computing
correlation coefficients and creating correlation matrices.
• Handling Missing Values: Detecting and deciding how to address missing
data points, whether by imputation or removal, depending on their impact
and the amount of missing data.
• Summary Statistics: Calculating key statistics that provide insight into data
trends and nuances
#Python code to visualize the geographical distribution of the housing data
import matplotlib.pyplot as plt
housing1.plot(kind="scatter",x="longitude",y="latitude",grid=True,alpha=0.2)
plt.show()
#Studying correlation
Correlation is a key statistical concept that researchers employ to analyze connections
within their data. It helps us to understand the relationship between variables. Knowing
the correlation helps uncover important relationships between the elements we are
investigating: it provides insight into how changes in one variable may correlate with changes in another.
Feature Engineering
Feature engineering is the process of transforming raw data into features that are
suitable for machine learning models. In other words, it is the process of selecting,
extracting, and transforming the most relevant features from the available data to build
more accurate and efficient machine learning models.
The success of machine learning models heavily depends on the quality of the features
used to train them. Feature engineering involves a set of techniques that enable us to
create new features by combining or transforming the existing ones. These techniques
help to highlight the most important patterns and relationships in the data, which in
turn helps the machine learning model to learn from the data more effectively.
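A sketch of how the correlation output below might be produced, assuming the housing1 DataFrame used elsewhere in these notes; the ratio features match the names appearing in the output.
#Python sketch: engineered ratio features and correlations with the target
housing1["rooms_per_house"]=housing1["total_rooms"]/housing1["households"]
housing1["bedrooms_ratio"]=housing1["total_bedrooms"]/housing1["total_rooms"]
housing1["people_per_house"]=housing1["population"]/housing1["households"]
corr_matrix=housing1.corr(numeric_only=True)          # numeric_only skips the text column (ocean_proximity)
print(corr_matrix["median_house_value"].sort_values(ascending=False))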
Output:
median_house_value 1.000000
median_income 0.688380
rooms_per_house 0.143663
total_rooms 0.137455
housing_median_age 0.102175
households 0.071426
total_bedrooms 0.054635
population -0.020153
people_per_house -0.038224
longitude -0.050859
latitude -0.139584
bedrooms_ratio -0.256397
Hence some of the new attributes (such as rooms_per_house and bedrooms_ratio) are more strongly
correlated with the target attribute than the original features they were derived from.
New approach
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(strategy="median")                   #replace missing values with the column median
housing_num=housing1.select_dtypes(include=[np.number])    #keep only the numeric columns
imputer.fit(housing_num)
X=imputer.transform(housing_num)
housing_tr=pd.DataFrame(X,
columns=housing_num.columns,index=housing_num.index)
housing_tr.info()
Numerical data, as its name suggests, involves features that are only composed of
numbers, such as integers or floating-point values. Categorical data are variables that
contain label values rather than numeric values. The number of possible values is often
limited to a fixed set. Categorical variables are often called nominal.
Some examples include:
• A “pet” variable with the values: “dog” and “cat“.
• A “color” variable with the values: “red“, “green“, and “blue“.
• A “place” variable with the values: “first“, “second“, and “third“.
A numerical variable can be converted to an ordinal variable by dividing the range of the
numerical variable into bins and assigning values to each bin. For example, a numerical
variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an
ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called discretization.
• Nominal Variable (Categorical). Variable comprises a finite set of discrete values
with no relationship between values.
• Ordinal Variable. Variable comprises a finite set of discrete values with a ranked
ordering between values.
Some algorithms can work with categorical data directly. For example, a decision tree can
be learned directly from categorical data with no data transform required (this depends
on the specific implementation). Many machine learning algorithms cannot operate on
label data directly. They require all input variables and output variables to be numeric.
Ordinal Encoding:
In ordinal encoding, each unique category value is assigned an integer value. This is called
an ordinal encoding or an integer encoding and is easily reversible. Often, integer values
starting at zero are used.
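A minimal OrdinalEncoder sketch with illustrative categories.
#Python sketch: ordinal (integer) encoding of a categorical variable
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
colors=np.array([["red"],["green"],["blue"],["green"]])
enc=OrdinalEncoder()
print(enc.fit_transform(colors))   # each category mapped to an integer (alphabetical by default)
print(enc.categories_)             # the learned category order, so the encoding is reversible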
One-Hot Encoding
For categorical variables where no ordinal relationship exists, the integer encoding may
not be enough, at best, or misleading to the model at worst. Forcing an ordinal
relationship via an ordinal encoding and allowing the model to assume a natural ordering
between categories may result in poor performance or unexpected results (predictions
halfway between categories).
In this case, a one-hot encoding can be applied to the ordinal representation.
This is where the integer encoded variable is removed and one new binary variable is
added for each unique integer value in the variable.
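A minimal OneHotEncoder sketch (sparse_output requires a recent scikit-learn; older versions use sparse=False instead).
#Python sketch: one-hot encoding of a categorical variable
from sklearn.preprocessing import OneHotEncoder
import numpy as np
colors=np.array([["red"],["green"],["blue"],["green"]])
ohe=OneHotEncoder(sparse_output=False)     # one binary column per category
print(ohe.fit_transform(colors))
print(ohe.get_feature_names_out())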
MinMax Scaler
The MinMax scaler is one of the simplest scalers to understand. It just scales all the data
between 0 and 1. The formula for calculating the scaled value is-
x_scaled = (x – x_min)/(x_max – x_min)
Thus, a point to note is that it does so for every feature separately. Though (0, 1) is the
default range, we can define our range of max and min values as well.
Standard Scaler
Just like the MinMax Scaler, the Standard Scaler is another popular scaler that is very easy
to understand and implement.
For each feature, the Standard Scaler scales the values so that the mean is 0 and the
standard deviation is 1 (and hence the variance is also 1).
x_scaled = (x – mean)/std_dev
However, Standard Scaler assumes that the distribution of the variable is normal
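A small sketch comparing both scalers on illustrative numbers.
#Python sketch: MinMax scaling vs standardization
from sklearn.preprocessing import MinMaxScaler,StandardScaler
import numpy as np
X=np.array([[1.0,200.0],[2.0,300.0],[3.0,400.0],[4.0,500.0]])
print(MinMaxScaler().fit_transform(X))     # each column scaled to [0, 1]
print(StandardScaler().fit_transform(X))   # each column: mean 0, standard deviation 1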
Custom Transformer
Consider this situation – Suppose you have your own Python function to transform the
data. Sklearn also provides the ability to apply this transform to our dataset using what
is called a FunctionTransformer. Let us take a simple example: a feature
transformation technique that involves taking the log to the base 2 of the values. In
NumPy, there is a function called log2 which does that for us. Thus, we can now apply the
FunctionTransformer:
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log2, validate = True)
df_scaled[col_names] = transformer.transform(features.values)
df_scaled
Here is the output with log-base 2 applied on Age and Income:
Transformation Pipelines
A machine learning pipeline is used to help automate machine learning workflows. They
operate by enabling a sequence of data to be transformed and correlated together in a
model that can be tested and evaluated to achieve an outcome, whether positive or
negative.
Machine learning (ML) pipelines consist of several steps to train a model.
Machine learning pipelines are iterative as every step is repeated to continuously
improve the accuracy of the model and achieve a successful algorithm. To build better
machine learning models, and get the most value from them, accessible, scalable and
durable storage solutions are imperative, paving the way for on-premises object storage.
Need of ML pipelines:
1. The main objective of having a proper pipeline for any ML model is to exercise
control over it. A well-organised pipeline makes the implementation more flexible.
2. The term ML model refers to the model that is created by the training process.
3. The learning algorithm finds patterns in the training data that map the input data
attributes to the target (the answer to be predicted), and it outputs an ML model
that captures these patterns.
4. A model can have many dependencies and to store all the components to make
sure all features available both offline and online for deployment, all the
information is stored in a central repository.
5. A pipeline consists of a sequence of components which are a compilation of
computations. Data is sent through these components and is manipulated with the
help of computation.
Eg Python code for creating pipelines in ML- Categorical and Numerical data
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
na=["longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income"]
ca=["ocean_proximity"]
#nump was not defined in the original notes; a typical numerical pipeline is assumed here
nump=make_pipeline(
SimpleImputer(strategy="median"),
StandardScaler()
)
#categorical pipeline: impute with the most frequent category, then one-hot encode
catp=make_pipeline(
SimpleImputer(strategy="most_frequent"),
OneHotEncoder(handle_unknown="ignore")
)
prep=ColumnTransformer([
("num",nump,na),
("cat",catp,ca)]
)
hp=prep.fit_transform(housing)
Non-Linear Model:
#RMSE of the linear model fitted earlier (hpred holds its predictions on the training set)
from sklearn.metrics import mean_squared_error
lrmse=mean_squared_error(housing_labels,hpred,squared=False)
print(lrmse)
#A decision tree regressor as a non-linear model, reusing the preprocessing pipeline prep
from sklearn.tree import DecisionTreeRegressor
treg=make_pipeline(prep,DecisionTreeRegressor())
treg.fit(housing,housing_labels)
hpred=treg.predict(housing)
trmse=mean_squared_error(housing_labels,hpred,squared=False)
print(trmse)
# Hyperparameter tuning sketch with randomized search (SomeModel and param_distributions are placeholders)
from sklearn.model_selection import RandomizedSearchCV
model = SomeModel()                   # create a model (any scikit-learn estimator)
random_search = RandomizedSearchCV(model,
    param_distributions)              # param_distributions: dict mapping hyperparameter names to value ranges
random_search.fit(X, y)
MODULE II:
Classification
The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data. In Classification, a program
learns from the given dataset or observations and then classifies new observation into a
number of classes or groups. Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
Classes can be called as targets/labels or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with
the corresponding output.
MNIST dataset
The MNIST database (Modified National Institute of Standards and Technology database)
is a large collection of handwritten digits. It has a training set of 60,000 examples, and a
test set of 10,000 examples. It is a subset of a larger NIST Special Database 3 (digits
written by employees of the United States Census Bureau) and Special Database 1 (digits
written by high school students) which contain monochrome images of handwritten
digits. The digits have been size-normalized and centered in a fixed-size image. The
original black and white (bilevel) images from NIST were size normalized to fit in a 20x20
pixel box while preserving their aspect ratio. The resulting images contain grey levels as
a result of the anti-aliasing technique used by the normalization algorithm. The images
were centered in a 28x28 image by computing the center of mass of the pixels and
translating the image so as to position this point at the center of the 28x28 field.
from sklearn.datasets import fetch_openml
mnist=fetch_openml("mnist_784",as_frame=False)
#as_frame=False as we need to process the images as numpy arrays
x,y=mnist.data,mnist.target
show_digit(x[1])   #show_digit is a helper that reshapes the 784-pixel row into a 28x28 image and plots it
Crossvalidation
Cross validation is a technique used in machine learning to evaluate the performance
of a model on unseen data. It involves dividing the available data into multiple folds or
subsets, using one of these folds as a validation set, and training the model on the
remaining folds. This process is repeated multiple times, each time using a different fold
as the validation set. Finally, the results from each validation step are averaged to
produce a more robust estimate of the model’s performance. The main purpose of cross
validation is to prevent overfitting, which occurs when a model is trained too well on
the training data and performs poorly on new, unseen data. By evaluating the model on
multiple validation sets, cross validation provides a more realistic estimate of the
model’s generalization performance, i.e., its ability to perform well on new, unseen data.
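A minimal cross-validation sketch with scikit-learn's cross_val_score on the iris dataset.
#Python sketch: 5-fold cross validation
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
iris=load_iris()
scores=cross_val_score(DecisionTreeClassifier(),iris.data,iris.target,cv=5)  # 5 folds
print(scores,scores.mean())     # one accuracy per fold, then the averaged estimate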
Confusion Matrix
A confusion matrix is a matrix that summarizes the performance of a machine learning
model on a set of test data. It is a means of displaying the number of accurate and
inaccurate instances based on the model’s predictions. It is often used to measure the
performance of classification models, which aim to predict a categorical label for each
input instance.
The matrix displays the number of instances produced by the model on the test data.
• True positives (TP): occur when the model accurately predicts a positive
data point.
• True negatives (TN): occur when the model accurately predicts a negative
data point.
• False positives (FP): occur when the model predicts a positive class for a data
point that is actually negative.
• False negatives (FN): occur when the model predicts a negative class for a data
point that is actually positive.
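A small sketch with made-up labels:
#Python sketch: computing a confusion matrix
from sklearn.metrics import confusion_matrix
ytrue=[1,0,1,1,0,1,0,0]
ypred=[1,0,0,1,0,1,1,0]
print(confusion_matrix(ytrue,ypred))   # rows = actual class, columns = predicted class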
Precision
Precision is defined as the ratio of correctly classified positive samples (True Positive) to
a total number of classified positive samples (either correctly or incorrectly).
Precision = True Positives / (True Positives + False Positives)
Precision = TP/(TP+FP)
Precision helps us to gauge the reliability of the machine learning model when it classifies
a sample as positive.
Recall
The recall is calculated as the ratio between the numbers of Positive samples correctly
classified as Positive to the total number of Positive samples. The recall measures the
model's ability to detect positive samples. The higher the recall, the more positive
samples detected.
Recall = True Positives / (True Positives + False Negatives)
Recall = TP/(TP+FN)
Unlike Precision, Recall is independent of the number of negative sample classifications.
Further, if the model classifies all positive samples as positive, then Recall will be 1.
F1-score
Precision and recall offer a trade-off, i.e., one metric comes at the cost of another. More
precision involves a harsher critic (classifier) that doubts even the actual positive samples from
the dataset, thus reducing the recall score. On the other hand, more recall entails a lax critic that
allows any sample that resembles a positive class to pass, which makes border-case negative
samples classified as “positive,” thus reducing the precision. Ideally, we want to maximize both
precision and recall metrics to obtain the perfect classifier.
The F1 score combines precision and recall using their harmonic mean, and maximizing the F1
score implies simultaneously maximizing both precision and recall. Thus, the F1 score has
become the choice of researchers for evaluating their models in conjunction with accuracy.
The F1 score is calculated as the harmonic mean of the precision and recall scores:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
It ranges from 0 to 100%, and a higher F1 score denotes a better quality classifier.
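A small sketch computing the three metrics on the same made-up labels as the confusion-matrix example above:
#Python sketch: precision, recall and F1 score
from sklearn.metrics import precision_score,recall_score,f1_score
ytrue=[1,0,1,1,0,1,0,0]
ypred=[1,0,0,1,0,1,1,0]
print(precision_score(ytrue,ypred))   # TP/(TP+FP) = 3/4
print(recall_score(ytrue,ypred))      # TP/(TP+FN) = 3/4
print(f1_score(ytrue,ypred))          # harmonic mean of precision and recall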
Consider a scenario where a model identifies prospective buyers for a sales team to call:
thousands of customers register on your website every week, and the call center
cannot reach all of them. But they can easily reach a couple of hundred.
Every customer that buys the product will make an effort well worth it. In this scenario,
the cost of false positives is low (just a quick call that does not result in a purchase), but
the value of true positives is high (immediate revenue).
In this case, you'd likely optimize for recall. You want to make sure you reach all potential
buyers. Your only limit is the number of people your call center can contact weekly. In
this case, you can set a lower decision threshold. Your model might have low
precision, but this is not a big deal as long as you reach your business goals and make a
certain number of sales.
Precision-recall curve
One approach is the precision-recall curve. It shows the value pairs between precision
and recall at different thresholds.
#Code for precision-recall curve
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve
#sg is assumed to be the SGD classifier trained earlier; ytrain5 is the binary "is it a 5?" target
yscores=cross_val_predict(sg,xtrain,ytrain5,cv=3,
method='decision_function')
p,r,t=precision_recall_curve(ytrain5,yscores)
plt.plot(t,p[:-1],label="T vs P")
plt.plot(t,r[:-1],label="T vs R")
plt.vlines(3000,0,1.0,"k","dotted",label="threshold line")
plt.legend()
plt.show()
Output:
Multiclass classification
Binary classification are those tasks where examples are assigned exactly one of two
classes. Multi-class classification is those tasks where examples are assigned exactly one
of more than two classes.
• Binary Classification: Classification tasks with two classes.
• Multi-class Classification: Classification tasks with more than two classes.
Some algorithms are designed for binary classification problems. Examples include:
• Logistic Regression
• Perceptron
The formula for calculating the number of binary datasets, and in turn, models, is as
follows:
• (NumClasses * (NumClasses – 1)) / 2
Classically, this approach is suggested for support vector machines (SVM) and related
kernel-based algorithms. This is because the training time of kernel methods grows quickly
with the size of the training set, so it is faster to train many classifiers on small subsets of
the training data than to train a few classifiers on the full dataset.
#Python code for One-Vs-One approach
from sklearn.svm import SVC
sc=SVC()
sc.fit(xtrain[:2000],ytrain[:2000])
sc.predict([x[0]])
sds=sc.decision_function([x[0]])
sds.round(2)
classid=sds.argmax()
classid
Error Analysis
Error analysis is the process to isolate, observe and diagnose erroneous ML
predictions thereby helping understand pockets of high and low performance of the
model. When it is said that “the model accuracy is 90%” it might not be uniform across
subgroups of data and there might be some input conditions which the model fails
more.
#Python code to display number of correct and wrong digit classifications
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import ConfusionMatrixDisplay
ypred=cross_val_predict(sc,xtrain,ytrain,cv=3)
ConfusionMatrixDisplay.from_predictions(ytrain,ypred)
Output:
Multilabel classification:
It is used when there are two or more classes and the data we want to classify may
belong to none of the classes or all of them at the same time, e.g. to classify which traffic
signs are contained on an image. In multi-label classification, the training set is composed
of instances each associated with a set of labels, and the task is to predict the label sets of
unseen instances through analyzing training instances with known label sets.
Difference between multi-class classification & multi-label classification is that in multi-
class problems the classes are mutually exclusive, whereas for multi-label problems each
label represents a different classification task, but the tasks are somehow related.
For example, multi-class classification makes the assumption that each sample is assigned
to one and only one label: a fruit can be either an apple or a pear but not both at the same
time. Whereas, an instance of multi-label classification can be that a text might be about
any of religion, politics, finance or education at the same time or none of these.
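A sketch of multilabel classification on MNIST, assuming the xtrain, ytrain and x arrays prepared in the MNIST snippets earlier; each digit is given two labels (large and odd).
#Python sketch: multilabel classification with k-nearest neighbors
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
ylarge=(ytrain.astype(np.int8)>=7)         # label 1: is the digit 7, 8 or 9?
yodd=(ytrain.astype(np.int8)%2==1)         # label 2: is the digit odd?
ymultilabel=np.c_[ylarge,yodd]
knn=KNeighborsClassifier()
knn.fit(xtrain,ymultilabel)                # each instance is associated with a set of labels
print(knn.predict([x[0]]))                 # both labels predicted at once for one digit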
Multioutput Algorithms
Multioutput algorithms are a type of machine learning approach designed for problems
where the output consists of multiple variables, and each variable can belong to a
different class or have a different range of values. In other words, multioutput problems
involve predicting multiple dependent variables simultaneously.
Two main types of Multioutput Problems:
• Multioutput Classification: In multioutput classification, each instance is
associated with a set of labels and the goal is to predict these labels
simultaneously.
• Multioutput Regression: In multioutput regression, the task is to predict
multiple continuous variables simultaneously.
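A sketch of a multioutput task, assuming the xtrain array from the MNIST snippets: the model predicts one value per pixel, i.e., it removes synthetic noise from digit images.
#Python sketch: multioutput prediction (image denoising)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
rng=np.random.default_rng(42)
xsub=xtrain[:5000]                         # a subset keeps the sketch fast
noise=rng.integers(0,100,(len(xsub),784))
xsub_mod=xsub+noise                        # noisy inputs
knn=KNeighborsClassifier()
knn.fit(xsub_mod,xsub)                     # targets are the clean images: 784 output values per instance
clean_digit=knn.predict([xsub_mod[0]])     # multioutput prediction: one pixel intensity per output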
MODULE-III
Linear Regression
Linear regression is a type of supervised machine learning algorithm that computes the
linear relationship between the dependent variable and one or more independent
features by fitting a linear equation to observed data. Linear regression is not merely a
predictive tool; it forms the basis for various advanced models. Techniques like
regularization and support vector machines draw inspiration from linear regression,
expanding its utility. Additionally, linear regression is a cornerstone in assumption
testing, enabling researchers to validate key assumptions about the data.
Normal Equation
The Normal Equation gives a closed-form solution for the parameter vector that minimizes the MSE
cost function of Linear Regression:
theta_best = (X^T X)^(-1) X^T y
where X is the matrix of input features (with a bias column of 1s) and y is the vector of target
values. The predictions are then obtained as ypred = X_new · theta_best.
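A minimal NumPy sketch of the Normal Equation on synthetic linear data:
#Python sketch: fitting a line with the Normal Equation
import numpy as np
m=100
X=2*np.random.rand(m,1)
y=4+3*X+np.random.randn(m,1)
Xb=np.c_[np.ones((m,1)),X]            # add the bias term x0 = 1 to each instance
theta=np.linalg.inv(Xb.T@Xb)@Xb.T@y   # Normal Equation: (X^T X)^(-1) X^T y
ypred=Xb@theta                        # predictions from the learned parameters
print(theta)                          # close to [4, 3]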
Gradient Descent
Gradient Descent is an iterative optimization algorithm that tweaks the model parameters to
minimize the cost function. An important hyperparameter is the learning rate, which determines
the size of each step. If the learning rate is too small, the algorithm will have to go through many
iterations to converge, which will take a long time.
On the other hand, if the learning rate is too high, you might jump across the valley and
end up on the other side, possibly even higher up than you were before. This might make
the algorithm diverge, with larger and larger values, failing to find a good solution
On the left, the learning rate is too low: the algorithm will eventually reach the solution,
but it will take a long time. In the middle, the learning rate looks pretty good: in just a few
iterations, it has already converged to the solution. On the right, the learning rate is too
high: the algorithm diverges.
#Inside the stochastic gradient descent loop: eta is decayed according to a learning schedule
eta=learning_schedule(epoch*m+iteration)
theta=theta-eta*grad   #grad is the gradient computed on a single randomly picked instance
theta
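A self-contained sketch of the stochastic gradient descent loop that the fragment above comes from, on synthetic linear data (variable names chosen to match the fragment; the schedule constants are illustrative).
#Python sketch: stochastic gradient descent with a simple learning schedule
import numpy as np
m=100
X=2*np.random.rand(m,1)
y=4+3*X+np.random.randn(m,1)
Xb=np.c_[np.ones((m,1)),X]
n_epochs=50
t0,t1=5,50                              # learning-schedule hyperparameters
def learning_schedule(t):
    return t0/(t+t1)
theta=np.random.randn(2,1)              # random initialization
for epoch in range(n_epochs):
    for iteration in range(m):
        i=np.random.randint(m)                  # pick one training instance at random
        xi,yi=Xb[i:i+1],y[i:i+1]
        grad=2*xi.T@(xi@theta-yi)               # gradient of the MSE on that single instance
        eta=learning_schedule(epoch*m+iteration)
        theta=theta-eta*grad
theta                                   # ends up close to [4, 3]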
All algorithms end up near the minimum, but Batch GD’s path actually stops at the
minimum, while both Stochastic GD and Mini-batch GD continue to walk around.
However, don’t forget that Batch GD takes a lot of time to take each step, and Stochastic
GD and Mini-batch GD would also reach the minimum if you used a good learning
schedule.
Polynomial Regression
Polynomial regression is a type of regression analysis used in statistics and machine
learning when the relationship between the independent variable (input) and the
dependent variable (output) is not linear. While simple linear regression models the
relationship as a straight line, polynomial regression allows for more flexibility by
fitting a polynomial equation to the data. When the relationship between the variables
is better represented by a curve rather than a straight line, polynomial regression can
capture the non-linear patterns in the data.
#Python code for polynomial Regression
m=100
import numpy as np
x=6*np.random.randn(m,1)-3
y=0.5*x**2+x+2+np.random.randn(m,1)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
pf=PolynomialFeatures(degree=2,include_bias=False)
xpoly=pf.fit_transform(x)
lr=LinearRegression()
lr.fit(xpoly,y)
lr.intercept_,lr.coef_
Learning Curves
Learning curves are plots used to show a model's performance as the training set size
increases. Another way it can be used is to show the model's performance over a defined
period of time. We typically used them to diagnose algorithms that learn incrementally
from data. It works by evaluating a model on the training and validation datasets, then
plotting the measured performance.
Finding the right degree of a polynomial is a challenge and learning curves help in
resolving it. Learning curves are the plots of training and validation error as a function of
the training set size (or the training iteration).
#Python code to illustrate the use of Learning curves with Linear Regression
from sklearn.model_selection import learning_curve
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
tsize,tscore,vscore=learning_curve(LinearRegression(),x,y,
train_sizes=np.linspace(0.01,1,40),
cv=5,scoring="neg_root_mean_squared_error")
trainerr=-tscore.mean(axis=1)
validerr=-vscore.mean(axis=1)
import matplotlib.pyplot as plt
plt.plot(tsize,trainerr)
plt.plot(tsize,validerr)
plt.legend(["train","valid"])
plt.show()
#Python code to illustrate the use of Learning curves with Polynomial Regression
from sklearn.pipeline import make_pipeline
pr=make_pipeline(PolynomialFeatures(degree=2,include_bias=False),
LinearRegression())
tsize,tscore,vscore=learning_curve(pr,x,y,
train_sizes=np.linspace(0.01,1,40),
cv=5,scoring="neg_root_mean_squared_error")
trainerr=-tscore.mean(axis=1)
validerr=-vscore.mean(axis=1)
import matplotlib.pyplot as plt
plt.plot(tsize,trainerr)
plt.plot(tsize,validerr)
plt.legend(["train","valid"])
plt.show()
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount
of bias is introduced so that we can get better long-term predictions.
o It is also called L2 regularization: a penalty term equal to the sum of the squared weights
(multiplied by a regularization hyperparameter) is added to the cost function, which
shrinks the weights toward zero.
Lasso Regression
o Lasso regression is another regularization technique to reduce the complexity of
the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only the
absolute weights instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge
Regression can only shrink it near to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso
regression will be: J(theta) = MSE(theta) + α · Σ |theta_i|
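A small sketch contrasting the two penalties on synthetic data where only the first feature matters (the alpha values are illustrative).
#Python sketch: Ridge (L2) vs Lasso (L1) regularization
from sklearn.linear_model import Ridge,Lasso
import numpy as np
rng=np.random.default_rng(42)
X=rng.random((50,3))
y=3*X[:,0]+rng.normal(0,0.1,50)      # only the first feature carries signal
ridge=Ridge(alpha=1.0).fit(X,y)
lasso=Lasso(alpha=0.1).fit(X,y)
print(ridge.coef_)   # L2: shrinks all weights toward (but not exactly to) zero
print(lasso.coef_)   # L1: can set the irrelevant weights exactly to zero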
Early stopping can be best used to prevent overfitting of the model and to save
resources. It gives the best results when a few things are taken care of, such as tuning its
parameters and ensuring that the model has learned enough from the data before training
is stopped.
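A sketch of early stopping with SGDRegressor on a synthetic train/validation split (the hyperparameters are illustrative): training continues epoch by epoch and the model with the lowest validation error is kept.
#Python sketch: early stopping by tracking the validation error
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
rng=np.random.default_rng(42)
X=6*rng.random((200,1))-3
y=(0.5*X**2+X+2+rng.normal(0,1,(200,1))).ravel()
xtr,xval,ytr,yval=train_test_split(X,y,test_size=0.5,random_state=42)
sgd=SGDRegressor(penalty=None,eta0=0.002,learning_rate="constant",
                 warm_start=True,max_iter=1,tol=None)
best_err,best_model=float("inf"),None
for epoch in range(500):
    sgd.fit(xtr,ytr)                 # warm_start=True: each call continues from the previous weights
    err=mean_squared_error(yval,sgd.predict(xval))
    if err<best_err:                 # keep the model with the lowest validation error so far
        best_err,best_model=err,deepcopy(sgd)
print(best_err)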
Logistic Regression
Logistic regression is a supervised machine learning algorithm used for classification
tasks where the goal is to predict the probability that an instance belongs to a given
class or not. Logistic regression is used for binary classification where we use sigmoid
function, that takes input as independent variables and produces a probability value
between 0 and 1. Logistic regression predicts the output of a categorical dependent
variable. Therefore, the outcome must be a categorical or discrete value. It can be either
Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and 1, it
gives the probabilistic values which lie between 0 and 1. In Logistic regression, instead
of fitting a regression line, we fit an “S” shaped logistic function, which predicts two
maximum values (0 or 1).
Estimating probabilities in Logistic Regression:
The model computes a weighted sum of the input features (plus a bias term) and outputs the
logistic (sigmoid) of this result: p = sigmoid(x^T theta) = 1 / (1 + exp(-x^T theta)). The instance is
classified as positive if p >= 0.5, and negative otherwise.
Decision Boundaries
The fundamental application of logistic regression is to determine a decision boundary for
a binary classification problem. Although the baseline is to identify a binary decision
boundary, the approach can be very well applied for scenarios with multiple classification
classes or multi-class classification.
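A sketch of a logistic-regression decision boundary on the iris petal-width feature:
#Python sketch: finding the decision boundary of a binary logistic regression model
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
iris=load_iris(as_frame=True)
X=iris.data[["petal width (cm)"]].values
y=(iris.target==2)                     # Iris virginica or not
log_reg=LogisticRegression()
log_reg.fit(X,y)
Xnew=np.linspace(0,3,1000).reshape(-1,1)
yproba=log_reg.predict_proba(Xnew)
boundary=Xnew[yproba[:,1]>=0.5][0,0]   # first petal width where P(virginica) >= 0.5
print(boundary)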
Softmax regression
Softmax regression (or multinomial logistic regression) is a generalization of logistic
regression to the case where we want to handle multiple classes in the target column.
Softmax score for class k: s_k(x) = x^T · theta^(k), where each class k has its own parameter
vector theta^(k).
Softmax function: p_k = exp(s_k(x)) / Σ_j exp(s_j(x)); the model predicts the class with the
highest estimated probability.
The argmax operator returns the value of a variable that maximizes a function.
Cross entropy is frequently used to measure how well a set of estimated class
probabilities matches the target classes.
Cross entropy cost function: J(Θ) = −(1/m) Σ_i Σ_k y_k^(i) · log(p_k^(i)), where y_k^(i) is 1 if the
target class of the i-th instance is k, and 0 otherwise.
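A sketch of softmax regression on the iris dataset; with more than two classes and the default solver, scikit-learn's LogisticRegression fits the multinomial (softmax) model.
#Python sketch: softmax regression for a 3-class problem
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
iris=load_iris(as_frame=True)
X=iris.data[["petal length (cm)","petal width (cm)"]].values
y=iris.target
softmax_reg=LogisticRegression(max_iter=1000)
softmax_reg.fit(X,y)
print(softmax_reg.predict([[5,2]]))                  # most likely class
print(softmax_reg.predict_proba([[5,2]]).round(2))   # estimated probability per class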
MODULE IV:
Decision Trees
A decision tree is a flowchart-like structure used to make decisions or predictions. It consists
of nodes representing decisions or tests on attributes, branches representing the outcome of
these decisions, and leaf nodes representing final outcomes or predictions. Each internal
node corresponds to a test on an attribute, each branch corresponds to the result of the test,
and each leaf node corresponds to a class label or a continuous value. Decision trees are
versatile ML algorithms used for classification, regression and multi-output tasks, and they
are capable of fitting complex datasets.
Gini Impurity
The equation for computing Gini Impurity is:
G_i = 1 − Σ_{k=1..n} (p_{i,k})^2
where n is the number of classes and p_{i,k} is the ratio of class-k instances among the training
instances in the i-th node.
A node's gini attribute measures its Gini impurity. A node is pure (gini = 0) if all the training
instances it applies to belong to the same class.
Eg:
For the root node (50 instances of each of the 3 classes out of 150 in total):
Gini impurity = 1 − [(50/150)^2 + (50/150)^2 + (50/150)^2]
             = 1 − [1/9 + 1/9 + 1/9] = 1 − 1/3 = 2/3 ≈ 0.6667
The thick vertical line represents the decision boundary of the root node (depth 0): petal
length = 2.45 cm. Since the left area is pure (only Iris-Setosa), it cannot be split any further.
However, the right area is impure, so the depth-1 right node splits it at petal width = 1.75 cm
(represented by the dashed line). Since max_depth was set to 2, the Decision Tree stops right
there. However, if you set max_depth to 3, then the two depth-2 nodes would each add
another decision boundary (represented by the dotted lines).
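A sketch reproducing a depth-2 iris tree like the one described above:
#Python sketch: training a depth-2 decision tree on the iris petal features
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris=load_iris(as_frame=True)
X=iris.data[["petal length (cm)","petal width (cm)"]].values
y=iris.target
tree_clf=DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X,y)
print(tree_clf.predict_proba([[5,1.5]]))   # class probabilities for one flower
print(tree_clf.predict([[5,1.5]]))         # the predicted class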
Black box machine learning models (Black Boxes), on the other hand, rank higher on
innovation and accuracy, but lower on transparency and interpretability. Black Boxes produce
output based on your input data set, but do not – and cannot – clarify how they came to those
conclusions. So, while a user can observe the input variable and the output variable,
everything in between related to the calculation and the process is not available. Even if it
were, humans would not be able to understand it.
Black Boxes tend to model extremely complex scenarios with deep and non-linear
interactions between the data. Some examples include:
• Deep-learning models
• Boosting models
• Random forest models
The CART algorithm splits the training set into two subsets using a single feature k and a threshold
t_k. It searches for the pair (k, t_k) that produces the purest subsets, weighted by their size. The
CART cost function for classification is:
J(k, t_k) = (m_left/m) · G_left + (m_right/m) · G_right
where G_left and G_right measure the impurity of the left/right subsets and m_left and m_right
are the numbers of instances in each subset.
Entropy
• The word "entropy" hails from physics, where it refers to an indicator of disorder. In machine
learning it measures the expected amount of "information," "surprise," or "uncertainty" associated
with a random variable's possible outcomes. The entropy of the i-th node is
H_i = − Σ_k p_{i,k} · log2(p_{i,k}) (summing over the classes with p_{i,k} ≠ 0), and it is zero when
the node contains instances of only one class.
The few other hyperparameters that would restrict the structure of the decision tree are:
1. min_samples_split – Minimum number of samples a node must possess before
splitting.
2. min_samples_leaf – Minimum number of samples a leaf node must possess.
3. min_weight_fraction_leaf – Minimum fraction of the sum total of weights required to
be at a leaf node.
4. max_leaf_nodes – Maximum number of leaf nodes a decision tree can have.
5. max_features – Maximum number of features that are taken into the account for
splitting each node.
Ensemble learning:
Ensemble learning is a machine learning technique that enhances accuracy and resilience in
forecasting by merging predictions from multiple models. It aims to mitigate errors or biases
that may exist in individual models by leveraging the collective intelligence of the ensemble.
The underlying concept behind ensemble learning is to combine the outputs of diverse
models to create a more precise prediction. By considering multiple perspectives and utilizing
the strengths of different models, ensemble learning improves the overall performance of the
learning system. This approach not only enhances accuracy but also provides resilience
against uncertainties in the data. By effectively merging predictions from multiple models,
ensemble learning has proven to be a powerful tool in various domains, offering more robust
and reliable forecasts.
Eg:
Voting Classifiers
A voting classifier is a machine learning model that gains experience by training on a collection
of several models and forecasts an output (class) based on the class with the highest
likelihood of becoming the output. To forecast the output class based on the largest majority
of votes, it averages the results of each classifier provided into the voting classifier. The
concept is to build a single model that learns from various models and predicts output based
on their aggregate majority of votes for each output class, rather than building separate
specialized models and determining the accuracy for each of them.
There are primarily two different types of voting classifiers:
• Hard Voting: In hard voting, the predicted output class is a class with the highest
majority of votes. For example, let’s say classifiers predicted the output classes as (Cat,
Dog, Dog). As the classifiers predicted class “dog” a maximum number of times, we
will proceed with Dog as our final prediction.
• Soft Voting: In this, the average probabilities of the classes determine which one will
be the final prediction. For example, let’s say the probabilities of the class being a
“dog” is (0.30, 0.47, 0.53) and a “cat” is (0.20, 0.32, 0.40). So, the average for a class
dog is 0.4333, and the cat is 0.3067, from this, we can confirm our final prediction to
be a dog as it has the highest average probability.
Eg:
from sklearn.ensemble import VotingClassifier,RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
vcl=VotingClassifier(estimators=[("lr",LogisticRegression()),("rf",RandomForestClassifier()),("svc",SVC())])
vcl.voting="soft"
vcl.named_estimators["svc"].probability=True   #soft voting needs predicted probabilities from the SVC
vcl.fit(xtrain,ytrain)
vcl.score(xtest,ytest)
Eg:
Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on,
so bagging ends up with a slightly higher bias than pasting. But the extra diversity also means
that the predictors end up being less correlated, so the ensemble’s variance is reduced.
Overall, bagging often results in better models.
Out-of-Bag Evaluation
With bagging, some instances may be sampled several times for any given predictor, while
others may not be sampled at all. By default, BaggingClassifier samples m training instances
with replacement (bootstrap=True), where m is the size of the training set. This means only
about 63% of the training instances are sampled on average for each predictor. The remaining
37% of the training instances that are not sampled are called out-of-bag (oob) instances.
Since a predictor never sees these instances during training, it can be evaluated on these
instances, without the need for a separate validation set. We can evaluate the ensemble itself
by averaging out the out-of-bag evaluations for each predictor.
In Scikit-Learn, we can set oob_score=True when creating a BaggingClassifier to request an
automatic oob evaluation. The resulting evaluation score is available through
the oob_score_ variable.
#Python code for OOB Evaluation
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X, y)
print(bag_clf.oob_score_)
Bias and Variance
• High Bias, Low Variance: A model with high bias and low variance is said to be
underfitting.
• High Variance, Low Bias: A model with high variance and low bias is said to be
overfitting.
• High-Bias, High-Variance: A model has both high bias and high variance, which means
that the model is not able to capture the underlying patterns in the data (high bias)
and is also too sensitive to changes in the training data (high variance). As a result, the
model will produce inconsistent and inaccurate predictions on average.
• Low Bias, Low Variance: A model that has low bias and low variance means that the
model is able to capture the underlying patterns in the data (low bias) and is not too
sensitive to changes in the training data (low variance). This is the ideal scenario for a
machine learning model, as it is able to generalize well to new, unseen data and
produce consistent and accurate predictions. But in practice, it’s not possible.
Random Forests
Random Forest algorithm is a powerful tree learning technique in Machine Learning. It works
by creating a number of Decision Trees during the training phase. Each tree is constructed
using a random subset of the data set to measure a random subset of features in each
partition. This randomness introduces variability among individual trees, reducing the risk
of overfitting and improving overall prediction performance.
In prediction, the algorithm aggregates the results of all trees, either by voting (for
classification tasks) or by averaging (for regression tasks) This collaborative decision-making
process, supported by multiple trees with their insights, provides an example stable and
precise results. Random forests are widely used for classification and regression functions,
which are known for their ability to handle complex data, reduce overfitting, and provide
reliable forecasts in different environments.
Random forest is an ensemble of Decision Trees, generally trained with the bagging method,
typically with max_samples set to 1.0 (the full size of the training set). There is no need for
pipelines in building these classifiers/regressors. When splitting a node, the algorithm searches
for the best feature among a random subset of features and splits the data based on some
mathematical criterion (typically the Gini Index). This random sampling of features leads to the
creation of multiple de-correlated decision trees.
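A minimal Random Forest sketch; xtrain, ytrain and xtest are assumed to be training and test arrays prepared earlier.
#Python sketch: a Random Forest classifier (hyperparameter values are illustrative)
from sklearn.ensemble import RandomForestClassifier
rnd_clf=RandomForestClassifier(n_estimators=500,max_leaf_nodes=16,n_jobs=-1)
rnd_clf.fit(xtrain,ytrain)
ypred_rf=rnd_clf.predict(xtest)    # predictions aggregated over all 500 trees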
Feature Importance
Features in machine learning, also known as variables or attributes, are individual measurable
properties or characteristics of the phenomena being observed. They serve as the input to
the model, and their quality and quantity can greatly influence the accuracy and efficiency of
the model. Several techniques can be employed to calculate feature importance in Random
Forests, each offering unique insights:
• Built-in Feature Importance: This method uses the model’s internal calculations, such as Gini importance (also called mean decrease in impurity), to measure feature importance. Essentially, it measures how much the impurity (or randomness) within a node of a decision tree decreases, on average, when a specific feature is used to split the data.
• Permutation Feature Importance: Permutation importance assesses the significance of each feature independently by shuffling (permuting) its values and measuring how much the model’s predictive performance drops as a result.
• SHAP (SHapley Additive exPlanations) Values: SHAP values delve deeper by
explaining the contribution of each feature to individual predictions. This method
offers a comprehensive understanding of feature importance across various data
points.
# Python code to compute feature importance
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
rcl = RandomForestClassifier(n_estimators=500)
rcl.fit(iris.data, iris.target)
for score, name in zip(rcl.feature_importances_, iris.data.columns):
    print(name, round(score, 2))
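As a hedged follow-up sketch, permutation importance for the same fitted model can be computed with Scikit-Learn’s inspection module (reusing iris and rcl from the snippet above; the n_repeats and random_state values are illustrative choices):

# Sketch: permutation feature importance on the fitted Random Forest
from sklearn.inspection import permutation_importance

result = permutation_importance(rcl, iris.data, iris.target,
                                n_repeats=10, random_state=42)
for name, mean_drop in zip(iris.data.columns, result.importances_mean):
    # mean_drop = average drop in score when this feature is randomly shuffled
    print(name, round(mean_drop, 3))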
Boosting
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. It does this by combining weak models trained in series.
Firstly, a model is built from the training data. Then the second model is built which tries to
correct the errors present in the first model. This procedure is continued and models are
added until either the complete training data set is predicted correctly or the maximum
number of models are added.
Advantages of Boosting
• Improved Accuracy – Boosting can improve the accuracy of the final model by combining the predictions of several weak models, averaging them for regression or taking a (weighted) vote over them for classification.
• Robustness to Overfitting – Boosting can reduce the risk of overfitting by reweighting
the inputs that are classified wrongly.
• Better handling of imbalanced data – Boosting can handle imbalanced data by focusing more on the data points that are misclassified.
Boosting methods
AdaBoost – AdaBoost is a boosting algorithm that works on the principle of stagewise addition, where multiple weak learners are combined to obtain a strong learner. Each weak learner is assigned a weight (the alpha parameter) that is inversely proportional to its error: the smaller a weak learner’s error, the larger its say in the final prediction. Unlike gradient boosting, which fits each new learner to the residual errors of the previous ones, AdaBoost reweights the training instances so that the next weak learner concentrates on the instances that were misclassified.
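A minimal Scikit-Learn sketch of AdaBoost (assuming a training set X, y is already defined; by default the weak learner is a single-split decision stump, and the n_estimators/learning_rate values are illustrative):

# Sketch: AdaBoost with decision stumps as weak learners
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(n_estimators=200, learning_rate=0.5)
ada_clf.fit(X, y)                 # reweights instances after each weak learner
print(ada_clf.predict(X[:5]))     # prediction = weighted vote of the weak learners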
Gradient Boosting – It is a boosting technique that builds a final model from the sum of
several weak learning algorithms that were trained on the same dataset. It operates on the
idea of stagewise addition. The first weak learner in the gradient boosting algorithm will not
be trained on the dataset; instead, it will simply return the mean of the relevant column. The
residual for the first weak learner algorithm’s output will then be calculated and used as the
output column or target column for the next weak learning algorithm that will be trained. The
second weak learner will be trained using the same methodology, and the residuals will be
computed and utilized as an output column once more for the third weak learner, and so on
until the residuals become negligibly small (or a fixed number of learners has been added). The dataset for gradient boosting must be in the form of numerical or categorical data, and the loss function used to generate the residuals must be differentiable.
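The stagewise idea can be sketched by hand with regression trees, each fitted to the residuals left by the model so far. This is only an illustration under assumed data: X, y is a regression training set and X_new the instances to predict; max_depth=2 and the number of stages are arbitrary.

# Sketch: manual gradient boosting on residuals (squared-error loss)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

f0 = np.mean(y)                     # first "weak learner": the mean of the target column
r1 = y - f0                         # residuals of the initial prediction
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, r1)

r2 = r1 - tree1.predict(X)          # residuals after the first tree
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, r2)

# Final prediction = mean + sum of the trees' corrections
y_pred = f0 + tree1.predict(X_new) + tree2.predict(X_new)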
Stacking
Stacking is a way to ensemble multiple classification or regression models. There are many ways to ensemble models; the widely known ones are Bagging and Boosting. Bagging averages multiple similar high-variance models to decrease variance. Boosting builds multiple incremental models to decrease the bias, while keeping variance small.
Stacking (sometimes called Stacked Generalization) is a different paradigm. The point
of stacking is to explore a space of different models for the same problem. The idea is that
you can attack a learning problem with different types of models, each capable of learning some part of the problem but not the whole problem space. So you build multiple different learners and use them to produce intermediate predictions, one prediction for each learned model. Then you add a new model which learns to predict the same target from those intermediate predictions.
This final model is said to be stacked on the top of the others, hence the name. Thus,
you might improve your overall performance, and often you end up with a model which is
better than any individual intermediate model. Notice however, that it does not give you any
guarantee, as is often the case with any machine learning technique.
Eg:
To train the blender, a common approach is to use a hold-out set. First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer. Next, the first-layer predictors are used to make predictions on the second (held-out) subset; these predictions, together with the original target values, form a new training set. The blender is trained on this new training set, so it learns to predict the target value given the first layer’s predictions.
It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression, and so on): we get a whole layer of blenders. The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set for the second layer (using predictions made by the predictors of the first layer), and the third one is used to create the training set for the third layer (using predictions made by the predictors of the second layer).
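A hedged sketch of this layered setup using Scikit-Learn’s built-in StackingClassifier (the choice of first-layer learners and of Logistic Regression as the blender is illustrative; X, y is an assumed training set). With cv=5, the blender is trained on out-of-fold predictions, which plays the role of the hold-out set described above.

# Sketch: stacking two first-layer learners with a blender on top
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

stack_clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("svc", SVC(probability=True))],
    final_estimator=LogisticRegression(),    # the blender / meta-learner
    cv=5)                                     # out-of-fold predictions train the blender
stack_clf.fit(X, y)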
MODULE V:
Bayesian Learning
Bayesian Machine Learning (BML) encompasses a suite of techniques and algorithms that
leverage Bayesian principles to model uncertainty in data. These methods are not just
theoretical constructs; they are practical tools that have transformed the way machines learn
from data.
Bayes theorem: P(h|D) = P(D|h) · P(h) / P(D)
where,
P(h) is the prior probability of hypothesis h
P(D) is the prior probability of the training data D
P(h|D) is the posterior probability of h given D
P(D|h) is the likelihood of D given h
MAP working:
h_MAP = argmax_(h∈H) P(h|D)
      = argmax_(h∈H) P(D|h) · P(h) / P(D)
      = argmax_(h∈H) P(D|h) · P(h)
The second step comes from the application of Bayes theorem, and the third step follows because P(D) does not depend on h.
Problem:
Solution:
As per Bayes theorem:
P(A|B) = P(B|A) · P(A) / P(B) = ((2/4) × (4/7)) / (3/7) = 2/3
P(B|A) = P(A|B) · P(B) / P(A) = ((2/3) × (3/7)) / (4/7) = 2/4
Hence Bayes theorem is verified as correct for this example
Problem:
Given,
we now observe a new patient for whom the lab test returns a positive result. Should we
diagnose the patient as having cancer or not?
Solution:
P(cancer|⨁)= P(⨁|cancer) ∗ P(cancer)
=0.98*0.008=0.0078
P(¬cancer|⨁)= P(⨁|¬cancer) ∗ P(¬cancer)
=0.03*0.992=0.0298
Normalizing so that the probabilities sum to 1:
P(cancer|⊕) = 0.0078 / (0.0078 + 0.0298) = 0.21
P(¬cancer|⊕) = 0.0298 / (0.0078 + 0.0298) = 0.79
Hence, the patient should be diagnosed as not having cancer, despite the positive lab report.
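The arithmetic above can be checked with a few lines of Python, using only the numbers given in the problem (P(cancer)=0.008, P(⊕|cancer)=0.98, P(⊕|¬cancer)=0.03):

# Quick numerical check of the cancer-diagnosis posteriors
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03

num_cancer = p_pos_given_cancer * p_cancer                  # ≈ 0.0078
num_not_cancer = p_pos_given_not_cancer * (1 - p_cancer)    # ≈ 0.0298
total = num_cancer + num_not_cancer

print(round(num_cancer / total, 2))       # 0.21 -> P(cancer | ⊕)
print(round(num_not_cancer / total, 2))   # 0.79 -> P(¬cancer | ⊕)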
The brute-force MAP learning algorithm consumes significant computational resources because every hypothesis in the hypothesis space must be evaluated.
Assumptions of Brute-Force MAP:
1. The training data D is noise free
2. The target concept c is contained in the hypothesis space H
3. We have no a priori reason to believe that any hypothesis is more probable than any other.
Derivations:
Derivation of hML:
Thus, any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis.
Substitution:
Deriving MDL:
Problem:
Apply Bayes optimal classifier and find what is the classification result
Solution:
Computing P(V)
P(⊕) = P(⊕|h1)·P(h1|D) + P(⊕|h2)·P(h2|D) + P(⊕|h3)·P(h3|D) = 0.4 + 0 + 0 = 0.4
Gibbs Algorithm:
1. Choose a hypothesis h from H at random, according to the posterior probability distribution over H.
2. Use h to predict the classification of the next instance.
The Gibbs algorithm is less computationally complex than the Bayes optimal classifier, and its expected misclassification error is at most twice the expected error of the Bayes optimal classifier.
Problem:
Given the dataset:
Need of m-estimate:
Conditional probabilities can be estimated directly as relative frequencies: P = nc / n, where n is the number of training examples of the class and nc is the number of those examples that also have the given attribute value.
However, this gives poor estimates when nc is very small; in particular, if nc = 0 the estimated probability becomes zero and dominates the Naive Bayes product. The m-estimate avoids this by blending in a prior estimate p with an equivalent sample size m: P = (nc + m·p) / (n + m).
Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.
A Bayesian Network can be used for building models from data and experts' opinions, and
it consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an Influence Diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where each node represents a random variable and each arc represents a direct (causal) dependency between the connected variables.
Example:
Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to a burglary, but it also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken the responsibility of informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets confused by the phone ringing and calls then too. On the other hand, Sophia likes to listen to loud music, so sometimes she misses the alarm. Here we would like to compute the probability of the Burglary Alarm event: calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.
Solution:
List of all events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
Using the joint distribution defined by the network, the required probability factorizes as
P(S, D, A, ¬B, ¬E) = P(S|A) · P(D|A) · P(A|¬B, ¬E) · P(¬B) · P(¬E),
and its value is obtained by reading each factor from the corresponding conditional probability table.
o Expectation step (E - step): It involves the estimation (guess) of all missing values
in the dataset so that after completing this step, there should not be any missing
value.
o Maximization step (M - step): This step involves the use of estimated data in the
E-step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of the dataset
to estimate the missing data of the latent variables and then use that data to update the
values of the parameters in the M-step.
Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps: the Initialization Step, the Expectation Step, the Maximization Step, and the Convergence Step. These steps are explained as follows:
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent variables
through observed data in datasets. The EM algorithm or latent variable model has a broad
range of real-life applications in machine learning. These are as follows:
o The EM algorithm is applicable in data clustering in machine learning.
o It is often used in computer vision and NLP (Natural language processing).
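As one concrete, hedged example of EM in practice, Scikit-Learn's GaussianMixture model is fitted with the EM algorithm; the synthetic blobs data below is purely illustrative.

# Sketch: Gaussian Mixture Model fitted by the EM algorithm
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)   # toy data

gm = GaussianMixture(n_components=3, n_init=5, random_state=42)
gm.fit(X)                    # alternates E-steps and M-steps until convergence

print(gm.converged_)         # True if EM converged within max_iter
print(gm.means_)             # cluster means estimated in the M-steps
labels = gm.predict(X)       # assignments based on the E-step posteriors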
Additional Problems:
1. Apply candidate elimination for following dataset: Candidate Elimination
Algorithm):
Solution:
S0: (0, 0, 0, 0, 0) Most Specific Boundary
G0: (?, ?, ?, ?, ?) Most Generic Boundary
The first example is negative. The hypothesis at the specific boundary is consistent, so we retain it. The hypothesis at the generic boundary is inconsistent, so we specialize it, writing all minimal specializations (replacing one “?” at a time with an attribute value) that remain consistent.
S1: (0, 0, 0, 0, 0)
G1: (Many,?,?,?, ?) (?, Big,?,?,?) (?,Medium,?,?,?) (?,?,?,Exp,?) (?,?,?,?,One) (?,?,?,?,Few)
The second example is positive, the hypothesis at the specific boundary is inconsistent,
hence we extend the specific boundary, and the consistent hypothesis at the generic
boundary is retained and inconsistent hypotheses are removed from the generic
boundary.
S2: (Many, Big, No, Exp, Many)
G2: (Many,?,?,?, ?) (?, Big,?,?,?) (?,?,?,Exp,?) (?,?,?,?,Many)
The third example is positive, the hypothesis at the specific boundary is inconsistent,
hence we extend the specific boundary, and the consistent hypothesis at the generic
boundary is retained and inconsistent hypotheses are removed from the generic
boundary.
S3: (Many, ?, No, Exp, ?)
G3: (Many,?,?,?,?) (?,?,?,exp,?)
The fourth example is positive, the hypothesis at the specific boundary is inconsistent,
hence we extend the specific boundary, and the consistent hypothesis at the generic
boundary is retained and inconsistent hypotheses are removed from the generic
boundary.
S4: (Many, ?, No, ?, ?)
G4: (Many,?,?,?,?)
Learned Version Space by Candidate Elimination Algorithm for given data set is:
(Many, ?, No, ?, ?) (Many, ?, ?, ?, ?)
Solution:
Step 1: Positive Example: (Japan, Honda, Blue, 1980, Economy)
G = { (?, ?, ?, ?, ?) }
S = { (Japan, Honda, Blue, 1980, Economy) }
Step 2: Negative Example: (Japan, Toyota, Green, 1970, Sports)
G = { (?, Honda, ?, ?, ?), (?, ?, Blue, ?, ?), (?, ?, ?, 1980, ?), (?, ?, ?, ?, Economy) }
3. In Orange County, 51% of the adults are males. (It doesn't take too much
advanced mathematics to deduce that the other 49% are females.) One adult is
randomly selected for a survey involving credit card usage.
a) Find the prior probability that the selected person is a male.
b) It is later learned that the selected survey subject was smoking a cigar. Also,
9.5% of males smoke cigars, whereas 1.7% of females smoke cigars (based on data
from the Substance Abuse and Mental Health Services Administration). Use this
additional information to find the probability that the selected subject is a male.
Solution:
a) Before any other information is known, the prior probability that the selected person is a male is P(M) = 0.51.
b) Let C denote the event that the person smokes cigars, so P(C|M) = 0.095 and P(C|F) = 0.017. By Bayes theorem,
P(M|C) = P(C|M)·P(M) / (P(C|M)·P(M) + P(C|F)·P(F))
       = (0.095 × 0.51) / (0.095 × 0.51 + 0.017 × 0.49)
       = 0.04845 / 0.05678 ≈ 0.853
Knowing that the subject smokes cigars, the probability that the subject is a male rises from 0.51 to about 0.853.
Using the above data, we have to identify the species of an entity with the following
attributes.
X={Color=Green, Legs=2, Height=Tall, Smelly=No}
To predict the class label for the above attribute set, we will first calculate the probability
of the species being M or H in total.
P(Species=M)=4/8=0.5
P(Species=H)=4/8=0.5
Next, we will calculate the conditional probability of each attribute value for each class
label.
P(Color=White/Species=M)=2/4=0.5
P(Color=White/Species=H)=3/4=0.75
P(Color=Green/Species=M)=2/4=0.5
P(Color=Green/Species=H)=1/4=0.25
P(Legs=2/Species=M)=1/4=0.25
P(Legs=2/Species=H)=4/4=1
P(Legs=3/Species=M)=3/4=0.75
P(Legs=3/Species=H)=0/4=0
P(Height=Tall/Species=M)=3/4=0.75
P(Height=Tall/Species=H)=2/4=0.5
P(Height=Short/Species=M)=1/4=0.25
P(Height=Short/Species=H)=2/4=0.5
P(Smelly=Yes/Species=M)=3/4=0.75
P(Smelly=Yes/Species=H)=1/4=0.25
P(Smelly=No/Species=M)=1/4=0.25
P(Smelly=No/Species=H)=3/4=0.75
We can tabulate the above calculations in the tables for better visualization.
The conditional probability table for the Color attribute is as follows.
Color M H
White 0.5 0.75
Green 0.5 0.25
Conditional Probabilities for Color Attribute
The conditional probability table for the Legs attribute is as follows.
Legs M H
2 0.25 1
3 0.75 0
Conditional Probabilities for Legs Attribute
The tables for the Height and Smelly attributes are built in the same way from the probabilities listed above.
Finally, for X = {Color=Green, Legs=2, Height=Tall, Smelly=No}:
P(X/Species=M) · P(Species=M) = 0.5 × 0.25 × 0.75 × 0.25 × 0.5 ≈ 0.0117
P(X/Species=H) · P(Species=H) = 0.25 × 1 × 0.5 × 0.75 × 0.5 ≈ 0.0469
Since the value for H is larger, Naive Bayes classifies the entity as Species = H.
5. Build a decision tree using ID3 algorithm for the given training data in the table
(Buy Computer data), and predict the class of the following new example: age<=30,
income=medium, student=yes, credit-rating=fair
age income student Credit rating Buys computer
Solution:
The information gain of an attribute is the entropy of the whole set minus the weighted entropy remaining after splitting on that attribute.
First, the entropy of the two classes over the whole set:
Entropy(S) = E(9,5) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94
Now Consider the Age attribute
For Age, we have three values: age<=30 (2 yes and 3 no), age 31..40 (4 yes and 0 no), and age>40 (3 yes and 2 no).
Entropy(age) = 5/14 (-2/5 log2(2/5)-3/5log2(3/5)) + 4/14 (0) + 5/14 (-3/5log2(3/5)-
2/5log2(2/5))
= 5/14(0.9709) + 0 + 5/14(0.9709) = 0.6935
Gain(age) = 0.94 – 0.6935 = 0.2465
Next, consider Income Attribute
For Income, we have three values: income=high (2 yes and 2 no), income=medium (4 yes and 2 no), and income=low (3 yes and 1 no).
Entropy(income) = 4/14(-2/4log2(2/4)-2/4log2(2/4)) + 6/14 (-4/6log2(4/6)-
2/6log2(2/6)) + 4/14 (-3/4log2(3/4)-1/4log2(1/4))
= 4/14 (1) + 6/14 (0.918) + 4/14 (0.811)
= 0.285714 + 0.393428 + 0.231714 = 0.9108
Gain(income) = 0.94 – 0.9108 = 0.0292
Next, consider Student Attribute
For Student, we have two values: student=yes (6 yes and 1 no) and student=no (3 yes and 4 no).
Entropy(student) = 7/14(-6/7log2(6/7)-1/7log2(1/7)) + 7/14(-3/7log2(3/7)-
4/7log2(4/7)
= 7/14(0.5916) + 7/14(0.9852)
= 0.2958 + 0.4926 = 0.7884
Gain (student) = 0.94 – 0.7884 = 0.1516
Finally, consider Credit_Rating Attribute
For Credit_Rating, we have two values: credit_rating=fair (6 yes and 2 no) and credit_rating=excellent (3 yes and 3 no).
Entropy(credit_rating) = 8/14(-6/8log2(6/8)-2/8log2(2/8)) + 6/14(-3/6log2(3/6)-
3/6log2(3/6))
= 8/14(0.8112) + 6/14(1)
= 0.4635 + 0.4285 = 0.8920
Gain(credit_rating) = 0.94 – 0.8920 = 0.048
Since Age has the highest Information Gain we start splitting the dataset using the age
attribute.
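The arithmetic above can be reproduced with a short sketch; the entropy and info_gain helpers are illustrative functions (not from any library), and the counts are the ones listed for the Age attribute.

# Sketch: entropy and information gain computed from class counts
from math import log2

def entropy(counts):
    # Entropy of a node given its class counts, e.g. [9, 5]
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    # Information gain of a split; partitions is a list of class-count lists
    total = sum(parent_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - remainder

print(round(entropy([9, 5]), 2))                              # 0.94
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # ≈ 0.247 (Age)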
Left sub-branch
For branch age<=30 we still have attributes income, student, and credit_rating. Which
one should be used to split the partition?
The entropy of this partition is E(S_age<=30) = E(2,3) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97
For Income, we have three values: income=high (0 yes and 2 no), income=medium (1 yes and 1 no), and income=low (1 yes and 0 no).
Entropy(income) = 2/5(0) + 2/5 (-1/2log2(1/2)-1/2log2(1/2)) + 1/5 (0) = 2/5 (1) = 0.4
Gain(income) = 0.97 – 0.4 = 0.57
For Student, we have two values: student=yes (2 yes and 0 no) and student=no (0 yes and 3 no).
Entropy(student) = 2/5(0) + 3/5(0) = 0, so Gain(student) = 0.97 – 0 = 0.97.
Since Student gives the highest possible gain for this branch, the age<=30 partition is split on the Student attribute.
Right sub-branch
Model Paper:
Q.1 (a) What is Machine Learning? Explain the applications of Machine Learning. 04M
(b) Discuss any four main challenges of machine learning. 08M
(c) Consider the “Japanese Economy Car” concept and instances given in Table 1, and illustrate the hypothesis using the Candidate Elimination learning algorithm. 08M
Origin  Manufacturer  Color  Decade  Type     Example Type
Japan   Honda         Blue   1980    Economy  Positive
Q.2 (a) Explain the Find-S algorithm and show its working by taking the EnjoySport concept and training instances given in Table 2. 10M
Table 2:
Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes
(b) Discuss the features of an unbiased Learner. 06M
(c) State the following problems with respect to Tasks, Performance, and Experience: i) A Checkers learning problem ii) A Robot driving learning problem. 04M
Q.3 (a) In the context of preparing the data for Machine Learning algorithms, write a note on (i) Data Cleaning (ii) Handling Text and Categorical Attributes (iii) Feature Scaling. 10M
(b) With code snippets, show how Grid Search and Randomized Search help in fine-tuning a model. 10M
(b) In Regularized Linear Models, illustrate the three different methods to constrain the weights. 10M
Q.6 (a) With respect to Nonlinear SVM Classification, explain the Polynomial Kernel and the Gaussian RBF Kernel along with code snippets. 10M
(b) Show how SVMs make predictions using Quadratic Programming and the Kernelized SVM. 10M
Q.7 (a) With an example dataset examine how Decision Trees are used in making 10M
predictions.
(b) Explain the CART training algorithm. 06M
(c) Identify the features of Regression and Instability w.r.t decision trees. 04M
Q.9 (a) Write Bayes theorem. Identify the relationship between Bayes theorem and the problem of concept learning. 10M
(b) Show how the Maximum Likelihood Hypothesis is helpful for predicting probabilities. 10M