ML-UNIT-1
R20-MACHINE LEARNING
Unit-1:
Unit I: Introduction- Artificial Intelligence, Machine Learning, Deep learning, Types of Machine Learning
Systems, Main Challenges of Machine Learning.
Statistical Learning: Introduction, Supervised and Unsupervised Learning, Training and Test Loss, Tradeoffs
in Statistical Learning, Estimating Risk Statistics, Sampling distribution of an estimator, Empirical Risk
Minimization.
“Artificial intelligence is the capability of a computer system to mimic human functions such as learning
and problem-solving. Through AI, a computer system uses maths and logic to simulate the reasoning that
people use to learn from new information and make decisions.”
“Artificial intelligence, commonly referred to as AI, is the process of imparting data, information, and
human intelligence to machines. The main goal of Artificial Intelligence is to develop self-reliant machines
that can think and act like humans. These machines can mimic human behavior and perform tasks by
learning and problem-solving. Most of the AI systems simulate natural intelligence to solve complex
problems.”
Reactive Machines - These are systems that only react. These systems don’t form memories, and they don’t
use any past experiences for making new decisions.
Limited Memory - These systems reference the past, and information is added over a period of time. The
referenced information is short-lived.
Theory of Mind - This covers systems that are able to understand human emotions and how they affect
decision making. They are trained to adjust their behavior accordingly.
Self-awareness - These systems are designed and created to be aware of themselves. They understand their
own internal states, predict other people’s feelings, and act appropriately.
Examples
Since the problem is not trivial, your program will likely become a long list of complex rules, which is pretty
hard to maintain.
In contrast, a spam filter based on Machine Learning techniques automatically learns which words and
phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam
examples compared to the ham examples (see figure). The program is much shorter, easier to maintain,
and most likely more accurate.
In contrast, a spam filter based on Machine Learning techniques automatically notices that “For U” has
become unusually frequent in spam flagged by users, and it starts flagging them without your intervention.
A. Supervised learning:
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels.
Figure : A labeled training set for supervised learning (e.g., spam classification)
A typical supervised learning task is classification. The spam filter is a good example of this: it is trained
with many example emails along with their class (spam or ham), and it must learn how to classify new
emails.
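The idea of learning which words predict spam from labeled examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six training emails are made up, and the scoring rule is a crude smoothed word-frequency product rather than a full naive Bayes classifier.

```python
from collections import Counter

# Tiny made-up labeled training set: (email text, label).
TRAINING_SET = [
    ("win a free prize now", "spam"),
    ("free credit card offer", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
    ("project report attached", "ham"),
]

def train(examples):
    """Count how often each word appears in spam and in ham emails."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Label a new email by comparing smoothed word frequencies per class."""
    scores = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        score = 1.0
        for word in text.lower().split():
            # Crude add-one smoothing so unseen words do not zero out the score.
            score *= (word_counts[word] + 1) / (total + 1)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAINING_SET)
print(classify(model, "free prize inside"))
print(classify(model, "monday project meeting"))
```

The labels in `TRAINING_SET` play the role of the desired solutions: the program never contains a hand-written rule such as "flag emails containing free".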
B. Unsupervised learning:
Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of
input data without labeled responses. In unsupervised learning algorithms, classification or categorization is not
included in the observations. Example: Consider the following data regarding patients entering a clinic. The data
consists of the gender and age of the patients.
gender age
M 48
M 67
F 53
M 49
F 34
M 21
As a kind of learning, it resembles the methods humans use to figure out that certain objects or events are from the
same class, such as by observing the degree of similarity between objects. Some recommendation systems that you
find on the web in the form of marketing automation are based on this type of learning.
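As a rough illustration of grouping unlabeled observations by similarity, the sketch below runs a one-dimensional k-means on the patient ages from the table above. Using only the age column, asking for two groups, and the starting centers 20 and 60 are all assumptions made for the example.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around the nearest of the given starting centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

ages = [48, 67, 53, 49, 34, 21]
centers, clusters = k_means_1d(ages, centers=[20, 60])
print(centers)
print(clusters)
```

No labels are supplied anywhere: the algorithm only sees the ages and separates the younger patients from the older ones.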
C. Reinforcement learning:
Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards.
A learner is not told what actions to take, as in most forms of machine learning, but instead must discover
which actions yield the most reward by trying them. For example, consider teaching a dog a new trick: we
cannot tell it what to do or what not to do, but we can reward or punish it if it does the right or wrong thing.
When watching the video, notice how the program is initially clumsy and unskilled but steadily improves
with training until it becomes a champion.
Reinforcement learning is an area of Machine Learning. It is about taking suitable actions to maximize reward in a
particular situation. It is employed by various software and machines to find the best possible behavior or path to
take in a specific situation. Reinforcement learning differs from supervised learning in that supervised learning
trains the model on data that comes with an answer key, whereas in reinforcement learning there is no answer key
and the agent itself decides what to do to perform the given task. In the absence of a training dataset, it is bound
to learn from its experience.
Example: We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best
possible path to reach the reward. The following scenario illustrates the problem.
The above image shows the robot, diamond, and fire. The goal of the robot is to get the reward, which is the diamond,
and avoid the hurdles, which are fire. The robot learns by trying all the possible paths and then choosing the path
which gives it the reward with the least hurdles. Each right step gives the robot a reward and each wrong
step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, that
is, the diamond.
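The robot example can be sketched with tabular Q-learning. Everything concrete here is an assumption for illustration: the 3x3 grid, the positions of the start, diamond, and fire, the reward values (+10, -10, and -1 per step), the discount factor, and the purely random exploration. Because this toy world is deterministic, each new estimate simply replaces the old one.

```python
import random

ROWS, COLS = 3, 3
START, DIAMOND, FIRE = (0, 0), (2, 2), (1, 1)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Move on the grid; return (next_state, reward, episode_finished)."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), ROWS - 1)
    col = min(max(state[1] + dc, 0), COLS - 1)
    nxt = (row, col)
    if nxt == DIAMOND:
        return nxt, 10.0, True
    if nxt == FIRE:
        return nxt, -10.0, True
    return nxt, -1.0, False

def q_learning(episodes=5000, gamma=0.9, seed=0):
    """Learn action values by trial and error from randomly chosen moves."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
    for _ in range(episodes):
        state, done, steps = START, False, 0
        while not done and steps < 50:
            action = rng.choice(list(ACTIONS))  # explore by trying random moves
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Deterministic world, so the new estimate replaces the old one.
            q[(state, action)] = reward + gamma * best_next
            state, steps = nxt, steps + 1
    return q

def greedy_path(q, max_steps=20):
    """Follow the highest-valued action from the start state."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

print(greedy_path(q_learning()))
```

The agent is never told which path is correct; it only receives rewards and penalties, and the greedy path is read off the learned values afterwards.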
D. Semi-supervised learning:
Here an incomplete training signal is given: a training set with some (often many) of the target outputs
missing. A special case of this principle is known as transduction, where the entire set of problem instances
is known at learning time, except that part of the targets are missing. Semi-supervised learning is an approach to
machine learning that combines a small amount of labeled data with a large amount of unlabeled data during
training. It falls between unsupervised learning and supervised learning.
Example:
Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your
family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11,
while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm
(clustering). Now all the system needs is for you to tell it who these people are. Just one label per person, and
it is able to name everyone in every photo, which is useful for searching photos.
DNN and ANN :- Deep Learning is a subset of Machine Learning that is based on artificial neural
networks (ANNs) with multiple layers, also known as deep neural networks (DNNs). These neural networks are
inspired by the structure and function of the human brain, and they are designed to learn from large amounts of
data in an unsupervised or semi-supervised manner.
Deep Learning models are able to automatically learn features from the data, which makes them well-
suited for tasks such as image recognition, speech recognition, and natural language processing.
The most widely used architectures in deep learning are feedforward neural networks (FNNs),
convolutional neural networks (CNNs), and recurrent neural networks (RNNs).
FNN:- Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow of
information through the network. FNNs have been widely used for tasks such as image classification, speech
recognition, and natural language processing.
CNN:- Convolutional Neural Networks (CNNs) are a special type of FNNs designed specifically for
image and video recognition tasks. CNNs are able to automatically learn features from the images, which makes
them well-suited for tasks such as image classification, object detection, and image segmentation.
The major difference between deep learning and machine learning is the way data is presented to the
machine. Machine learning algorithms usually require structured data, whereas deep learning networks work on
multiple layers of artificial neural networks.
The network has an input layer that accepts inputs from the data. The hidden layer is used to find any hidden
features from the data. The output layer then provides the expected output.
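The flow from input layer to hidden layer to output layer can be sketched as a forward pass through a tiny network. The layer sizes, the hand-picked weights and biases, and the choice of a sigmoid activation are all hypothetical; a real network would learn its weights from data.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hypothetical, hand-picked weights: 2 inputs -> 3 hidden neurons -> 1 output.
HIDDEN_W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
HIDDEN_B = [0.0, 0.1, -0.1]
OUTPUT_W = [[0.7, -0.5, 0.2]]
OUTPUT_B = [0.05]

def forward(x):
    hidden = layer(x, HIDDEN_W, HIDDEN_B)     # hidden layer computes intermediate features
    return layer(hidden, OUTPUT_W, OUTPUT_B)  # output layer produces the prediction

print(forward([1.0, 2.0]))
```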
RNN:- Recurrent Neural Networks (RNNs) are a type of neural network able to process sequential data,
such as time series and natural language. RNNs maintain an internal state that captures information
about the previous inputs, which makes them well-suited for tasks such as speech recognition, natural language
processing, and language translation.
The human brain contains roughly 100 billion neurons, and each neuron is connected to thousands of its
neighbours. The question here is how we recreate these neurons in a computer. We create an artificial
structure called an artificial neural net, made of nodes or neurons. Some neurons take the input values, some
produce the output values, and in between there may be many interconnected neurons in the hidden layer.
Here is an example of a neural network that uses large sets of unlabeled data of eye retinas. The network
model is trained on this data to find out whether or not a person has diabetic retinopathy.
Deep learning is applied in many fields, including computer vision, speech recognition, Natural Language
Processing, and many more. Some of the most common applications include:
Image and Video Recognition: Deep learning models are used to automatically classify images and videos,
detect objects, and identify faces. Applications include image and video search engines, self-driving cars, and
surveillance systems.
Speech Recognition: Deep learning models are used to transcribe and translate speech in real time, which is used
in voice-controlled devices, such as virtual assistants, and accessibility technology for people with hearing
impairments.
Natural Language Processing: Deep learning models are used to understand, generate and translate human
languages. Applications include machine translation, text summarization, and sentiment analysis.
Robotics: Deep learning models are used to control robots and drones, and to improve their ability to perceive and
interact with the environment.
Healthcare: Deep learning models are used in medical imaging to detect diseases, in drug discovery to identify
new treatments, and in genomics to understand the underlying causes of diseases.
Finance: Deep learning models are used to detect fraud, predict stock prices, and analyze financial data.
Gaming: Deep learning models are used to create more realistic characters and environments, and to improve the
gameplay experience.
Recommender Systems: Deep learning models are used to make personalized recommendations to users, such as
product recommendations, movie recommendations, and news recommendations.
Social Media: Deep learning models are used to identify fake news, to flag harmful content and to filter out spam.
Autonomous systems: Deep learning models are used in self-driving cars, drones, and other autonomous systems
to make decisions based on sensor data.
In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go
wrong are “bad algorithm” and “bad data.” Let’s start with examples of bad data.
4. Irrelevant Features
As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data
contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine
Learning project is coming up with a good set of features to train on. This process, called feature engineering,
involves:
Feature selection: selecting the most useful features to train on among existing features.
Feature extraction: combining existing features to produce a more useful one (as we saw earlier,
dimensionality reduction algorithms can help).
Creating new features by gathering new data.
5. Overfitting the Training Data
Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi
drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and
unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called
overfitting: it means that the model performs well on the training data, but it does not generalize well.
Figure 1-22 shows an example of a high-degree polynomial life satisfaction model that strongly overfits the
training data. Even though it performs much better on the training data than the simple linear model, would you
really trust its predictions?
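Overfitting can be made concrete by comparing training and test error for a simple and an overly flexible model. This sketch does not reproduce the life satisfaction example: it uses synthetic data from an assumed relationship y = 2x + 1 with noise, and it swaps the high-degree polynomial for a 1-nearest-neighbour "memorizer", which is easier to write without extra libraries and overfits in the same way.

```python
import random

def make_data(n, rng):
    """Noisy samples from the (assumed) true relationship y = 2x + 1."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 2)) for x in xs]

def fit_line(data):
    """Ordinary least squares for y = a*x + b."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in data)
         / sum((x - mean_x) ** 2 for x, _ in data))
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def fit_memorizer(data):
    """1-nearest-neighbour: repeat the response of the closest training point."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mean_squared_error(model, data):
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train_set, test_set = make_data(100, rng), make_data(2000, rng)
results = {}
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_set)
    results[name] = (mean_squared_error(model, train_set),
                     mean_squared_error(model, test_set))
    print(name, results[name])
```

The memorizer reproduces the training data exactly, so its training error is zero, yet on new data it does worse than the straight line because it has also memorized the noise.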
You may be feeling a little lost, so let’s step back and look at the big picture:
• Machine Learning is about making machines get better at some task by learning from data, instead of having to
explicitly code rules.
• There are many different types of ML systems: supervised or not, batch or online, instance-based or model-
based, and so on.
• In an ML project you gather data in a training set, and you feed the training set to a learning algorithm. If the
algorithm is model-based it tunes some parameters to fit the model to the training set (i.e., to make good
predictions on the training set itself), and then hopefully it will be able to make good predictions on new cases as
well. If the algorithm is instance-based, it just learns the examples by heart and generalizes to new instances by
comparing them to the learned instances using a similarity measure.
• The system will not perform well if your training set is too small, or if the data is nonrepresentative, noisy, or
polluted with irrelevant features (garbage in, garbage out). Lastly, your model needs to be neither too simple (in
which case it will underfit) nor too complex (in which case it will overfit).
Feature and response: Given an input or feature vector x, one of the main goals of machine learning is
to predict an output or response variable y.
For example,
x could be a digitized signature and y a binary variable that indicates whether the signature is genuine or
false.
x could represent the weight and smoking habits of an expectant mother and y the birth weight of the baby.
Prediction function: A function g that takes x as input and outputs a guess g(x) for y (denoted by $\hat{y}$, for
example).
Loss function: We can measure the accuracy of a prediction $\hat{y}$ with respect to a given response y by
using some loss function $\mathrm{Loss}(y, \hat{y})$. In a regression setting the usual choice is the squared-error
loss $(y - \hat{y})^2$.
In the case of classification, the zero–one (also written 0–1) loss function
$\mathrm{Loss}(y, \hat{y}) = \mathbb{1}\{y \neq \hat{y}\}$ is often used, which incurs a loss of 1 whenever the
predicted class $\hat{y}$ is not equal to the class y.
We will encounter various other useful loss functions, such as the cross-entropy and hinge loss functions.
Error is often used as a measure of distance between a “true” object y and some approximation $\hat{y}$ thereof. If
y is real-valued, the absolute error $|y - \hat{y}|$ and the squared error $(y - \hat{y})^2$ are both
well-established error concepts, as are the norm $\|y - \hat{y}\|$ and squared norm $\|y - \hat{y}\|^2$ for vectors.
The squared error $(y - \hat{y})^2$ is just one example.
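The loss functions named above are short enough to write out directly. The function names and the sample values are illustrative choices, not part of the original notes.

```python
def squared_error_loss(y, y_hat):
    """Usual choice for regression: (y - y_hat)^2."""
    return (y - y_hat) ** 2

def absolute_error_loss(y, y_hat):
    """Absolute error |y - y_hat| for real-valued responses."""
    return abs(y - y_hat)

def zero_one_loss(y, y_hat):
    """Classification loss: 1 if the predicted class is wrong, else 0."""
    return 1 if y != y_hat else 0

print(squared_error_loss(3.0, 2.5))
print(absolute_error_loss(3.0, 2.5))
print(zero_one_loss("spam", "ham"))
```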
Supervised Learning: One tries to learn the functional relationship between the feature vector x and
response y in the presence of a teacher who provides n examples. It is common to speak of “explaining” or
predicting y on the basis of explanatory x, where x is a vector of explanatory variables.
An example of supervised learning is email spam detection.
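Unsupervised learning: Unsupervised learning makes no distinction between response and explanatory variables,
and the objective is simply to learn the structure of the unknown distribution of the data. In other words, we
need to learn f(x). In this case the guess g(x) is an approximation of f(x) and the risk is of the form
$$\ell(g) = \mathbb{E}\,\mathrm{Loss}(f(X), g(X)).$$
Training loss: In the supervised setting the risk $\ell(g) = \mathbb{E}\,\mathrm{Loss}(Y, g(X))$ is unknown, but
given training data $\tau = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ it can be approximated by the sample average
$$\ell_\tau(g) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}(y_i, g(x_i)),$$
which we call the training loss. The training loss is thus an unbiased estimator of the risk (the expected loss)
for a prediction function g, based on the training data.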
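To approximate the optimal prediction function $g^*$ (the minimizer of the risk), we first select a suitable
collection of approximating functions $\mathcal{G}$ and then take our learner to be the function in $\mathcal{G}$
that minimizes the training loss; that is,
$$g_\tau^{\mathcal{G}} = \operatorname*{argmin}_{g \in \mathcal{G}} \ell_\tau(g),$$
where $\ell_\tau(g)$ denotes the training loss of g on the training set τ.
The prediction accuracy of new pairs of data is measured by the generalization risk of the learner. For a
fixed training set τ it is defined as
$$\ell(g_\tau^{\mathcal{G}}) = \mathbb{E}\,\mathrm{Loss}(Y, g_\tau^{\mathcal{G}}(X)),$$
where (X, Y) is a new pair drawn from the data distribution.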
Figure: The generalization risk for a fixed training set is the weighted-average loss over all possible pairs (x,
y).
Figure: The expected generalization risk is the weighted-average loss over all possible pairs (x, y) and over
all training sets.
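For any outcome τ of the training data, we can estimate the generalization risk without bias by taking the
sample average
$$\ell_{\mathcal{T}'}(g_\tau^{\mathcal{G}}) = \frac{1}{n'}\sum_{i=1}^{n'} \mathrm{Loss}(Y_i', g_\tau^{\mathcal{G}}(X_i')),$$
where $g_\tau^{\mathcal{G}}$ is the learner trained on τ and $\mathcal{T}' = \{(X_1', Y_1'), \ldots, (X_{n'}', Y_{n'}')\}$
is a test sample that is independent of the training data. This estimator is called the test loss.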
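The generalization risk can be decomposed in ways that expose the tradeoffs in statistical learning. We will
consider two such decompositions: the approximation–estimation tradeoff and the bias–variance tradeoff.
We can decompose the generalization risk into the following three components:
$$\ell(g_\tau^{\mathcal{G}}) = \underbrace{\ell^*}_{\text{irreducible risk}} + \underbrace{\ell(g^{\mathcal{G}}) - \ell^*}_{\text{approximation error}} + \underbrace{\ell(g_\tau^{\mathcal{G}}) - \ell(g^{\mathcal{G}})}_{\text{statistical error}},$$
where $\ell^* = \ell(g^*)$ is the risk of the optimal prediction function $g^*$ and
$g^{\mathcal{G}} = \operatorname*{argmin}_{g \in \mathcal{G}} \ell(g)$ is the best prediction function within the class $\mathcal{G}$.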
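Thus, when using a squared-error loss, the generalization risk for a linear class $\mathcal{G}$ can be decomposed as:
$$\ell(g_\tau^{\mathcal{G}}) = \ell^* + \mathbb{E}\big(g^{\mathcal{G}}(X) - g^*(X)\big)^2 + \mathbb{E}\big(g_\tau^{\mathcal{G}}(X) - g^{\mathcal{G}}(X)\big)^2,$$
where $\ell^*$ is the irreducible risk, $g^*$ is the optimal prediction function, $g^{\mathcal{G}}$ is the best
function in $\mathcal{G}$, and $g_\tau^{\mathcal{G}}$ is the learner trained on τ.
Note that in this decomposition the statistical error is the only term that depends on the training set.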
The errors in a machine learning model can be broken down into 2 parts:
1. Reducible Error
2. Irreducible Error
Irreducible errors are errors that cannot be reduced no matter which machine learning model you use.
Reducible errors, on the other hand, are further broken down into the square of the bias and the variance. The
balance between bias and variance determines whether the machine learning model overfits or underfits the
given data.
Bias is the inability of a machine learning model to capture the true relationship between the data
variables. It is caused by the erroneous assumptions that are inherent to the learning algorithm. For example,
in linear regression, the relationship between the X and the Y variable is assumed to be linear, when in
reality the relationship may not be perfectly linear.
In general,
High Bias indicates more assumptions in the learning algorithm about the relationships between the
variables.
Low Bias indicates fewer assumptions in the learning algorithm.
Generally, nonlinear machine learning algorithms like decision trees have a high variance. It is even higher
if the branches are not pruned during training.
Let’s summarize:
If a model uses a simple machine learning algorithm, such as a linear model, it will have high bias
and low variance (underfitting the data).
If a model uses a complex machine learning algorithm, it will have high variance and low bias
(overfitting the data).
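The two statements above can be checked with a small simulation that repeatedly draws training sets and measures how a model's prediction at one point behaves. All specifics are assumptions chosen for illustration: the true function sin(2πx), the noise level 0.3, the evaluation point 0.24, a constant model as the "simple" learner, and a 1-nearest-neighbour model as the "complex" one.

```python
import math
import random
import statistics

def true_f(x):
    return math.sin(2 * math.pi * x)

GRID = [(i + 0.5) / 20 for i in range(20)]  # fixed inputs for every training set

def draw_training_set(rng, noise=0.3):
    return [(x, true_f(x) + rng.gauss(0, noise)) for x in GRID]

def fit_constant(train):
    """Very simple model: always predict the mean training response."""
    mean_y = statistics.fmean([y for _, y in train])
    return lambda x: mean_y

def fit_nearest(train):
    """Very flexible model: copy the response of the nearest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def bias_and_variance(fit, x0, repetitions=500, seed=0):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    predictions = [fit(draw_training_set(rng))(x0) for _ in range(repetitions)]
    bias = statistics.fmean(predictions) - true_f(x0)
    return bias ** 2, statistics.pvariance(predictions)

for name, fit in [("constant", fit_constant), ("nearest", fit_nearest)]:
    print(name, bias_and_variance(fit, x0=0.24))
```

The constant model gives nearly the same (wrong) answer for every training set, while the nearest-neighbour model is right on average but changes noticeably from one training set to the next.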
You need to find a good balance between the bias and the variance of the model. This
tradeoff in complexity is what is referred to as the bias and variance tradeoff. A model with an optimal
balance of bias and variance neither overfits nor underfits the data.
This tradeoff applies to all forms of supervised learning: classification, regression, and structured
output learning.
How to fix bias and variance problems?
Fixing High Bias
Adding more input features helps the model fit the data better.
Add more polynomial features to increase the complexity of the model.
Decrease the regularization term to have a balance between bias and variance.
Fixing High Variance
Reduce the input features, keeping only those with high feature importance, to reduce overfitting.
Get more training data: a high-variance model will not work well on an independent dataset if it was
trained on very little data.
1. In-Sample Risk: Due to the phenomenon of overfitting, the training loss of the learner is not
a good estimate of the generalization risk of the learner.
2. Cross-Validation:
The idea is to make multiple identical copies of the data set, and to partition each copy into different
training and test sets, as illustrated in the figure below. Here, there are four copies of the data set (consisting of
response and explanatory variables). Each copy is divided into a test set (colored blue) and training set
(colored pink). For each of these sets, we estimate the model parameters using only training data and then
predict the responses for the test set. The average loss between the predicted and observed responses is then
a measure for the predictive power of the model.
Figure: An illustration of four-fold cross-validation, representing four copies of the same data set. The data
in each copy is partitioned into a training set (pink) and a test set (blue). The darker columns represent the
response variable and the lighter ones the explanatory variables.
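The cross-validation procedure described above can be sketched as follows. The data set is made up, the folds are formed by taking every k-th point (one of several reasonable ways to partition), and the model that always predicts the mean response is only a stand-in for whatever model is being assessed.

```python
def k_fold_splits(data, k):
    """Yield (training_set, test_set) pairs; each point is tested exactly once."""
    for fold in range(k):
        test = data[fold::k]
        train = [p for i, p in enumerate(data) if i % k != fold]
        yield train, test

def fit_mean(train):
    """A deliberately simple model: always predict the mean training response."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def cross_validation_loss(data, k, fit):
    """Average squared-error test loss over the k folds."""
    fold_losses = []
    for train, test in k_fold_splits(data, k):
        model = fit(train)  # parameters estimated from training data only
        fold_losses.append(sum((y - model(x)) ** 2 for x, y in test) / len(test))
    return sum(fold_losses) / k

data = [(x, 2.0 * x) for x in range(8)]  # made-up (explanatory, response) pairs
print(cross_validation_loss(data, k=4, fit=fit_mean))
```

Swapping `fit_mean` for another fitting function and comparing the resulting losses is how cross-validation is typically used to choose between models.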
The sampling distribution of an estimator depends on the sample size, so the effect of a change in the sample
size has to be determined. An estimate has a single numerical value and hence is called a point estimate.
There are various estimators, such as the sample mean, sample standard deviation, proportion, variance, and range.
Sampling distribution of the mean: This is the distribution of the sample mean over repeated samples drawn from
the population. For all sample sizes, it is normal if the population distribution is normal. The population mean
is equal to the mean of the sampling distribution of the mean. The sampling distribution of the mean has the
standard deviation
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}},$$
where $\sigma_{\bar{x}}$ is the standard deviation of the sampling distribution of the mean, $\sigma$ is the
population standard deviation, and n is the sample size.
As the size of the sample increases, the spread of the sampling distribution of the mean decreases. But the
mean of the distribution remains the same and it is not affected by the sample size.
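A quick simulation illustrates both statements: the spread of the sample mean shrinks roughly like σ/√n while its centre stays at the population mean. The normal population with mean 10 and standard deviation 2, the sample sizes, and the 2000 repetitions are arbitrary choices for the sketch.

```python
import math
import random
import statistics

def sample_means(sample_size, repetitions, rng, mu=10.0, sigma=2.0):
    """Draw many samples from a normal population and return each sample's mean."""
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(sample_size)])
        for _ in range(repetitions)
    ]

rng = random.Random(0)
centre, spread = {}, {}
for n in (4, 25, 100):
    means = sample_means(n, 2000, rng)
    centre[n] = statistics.fmean(means)
    spread[n] = statistics.stdev(means)
    # Compare the simulated spread with the theoretical value sigma / sqrt(n).
    print(n, centre[n], spread[n], 2.0 / math.sqrt(n))
```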
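The standard deviation of the sampling distribution of the sample standard deviation is called the standard error
of the standard deviation. For a normal population it is approximately
$$\sigma_s \approx \frac{\sigma}{\sqrt{2n}}.$$
Here, $\sigma_s$ is the standard error of the standard deviation, $\sigma$ is the population standard deviation, and
n is the sample size. The sampling distribution of the standard deviation is positively skewed for
small n but becomes approximately normal for sample sizes greater than 30.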
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family
of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we
cannot know exactly how well an algorithm will work in practice (the true "risk") because we don't know the
true distribution of data that the algorithm will work on, but we can instead measure its performance on a
known set of training data (the "empirical" risk).
In general, the risk $R(h) = \mathbb{E}[L(h(x), y)]$ cannot be computed because the distribution P(x, y) is unknown
to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an
approximation, called the empirical risk, by averaging the loss function L over the n examples of the training set;
more formally, by computing the expectation with respect to the empirical measure:
$$R_{\text{emp}}(h) = \frac{1}{n}\sum_{i=1}^{n} L(h(x_i), y_i).$$
The empirical risk minimization principle states that the learning algorithm should choose a
hypothesis $\hat{h}$ which minimizes the empirical risk over the hypothesis class $\mathcal{H}$:
$$\hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}} R_{\text{emp}}(h).$$
Thus the learning algorithm defined by the ERM principle consists in solving the
above optimization problem.
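For a small, finite hypothesis class the ERM optimization problem can be solved by brute force. In this sketch the training set, the class of threshold classifiers, and the candidate thresholds are all hypothetical, and the 0-1 loss is assumed as the loss function.

```python
def zero_one_loss(y, y_hat):
    return 1 if y != y_hat else 0

def empirical_risk(h, data):
    """Average loss of hypothesis h over the training set."""
    return sum(zero_one_loss(y, h(x)) for x, y in data) / len(data)

def make_threshold_classifier(t):
    """Hypothesis h_t: predict class 1 when x >= t, else class 0."""
    return lambda x: 1 if x >= t else 0

# Hypothetical training set of (feature, label) pairs; (5.2, 0) acts as a noisy point.
training_set = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1), (5.2, 0)]
thresholds = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
hypotheses = {t: make_threshold_classifier(t) for t in thresholds}

# ERM: pick the hypothesis with the smallest empirical risk.
best_t = min(hypotheses, key=lambda t: empirical_risk(hypotheses[t], training_set))
print(best_t, empirical_risk(hypotheses[best_t], training_set))
```

Because of the noisy point, no hypothesis in this class reaches zero empirical risk; ERM simply returns the one with the fewest training mistakes.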