
Module 1

Introduction to Machine Learning


What is Machine Learning?
Machine Learning can be defined as a branch of Artificial Intelligence in which a system learns on its own, without all the knowledge about a domain being written into the program itself.
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
The process of machine learning is similar to that of data mining: both search through data to look for patterns. However, instead of extracting data for human comprehension, as is the case in data mining applications, machine learning uses that data to improve the program's own understanding. Machine learning programs detect patterns in data and adjust program actions accordingly.
Components of a learning problem
● Task: The behavior or task that is being improved. For example: classification, acting in an environment.
● Data: The experiences that are being used to improve performance in the task.
● Measure of improvement: For example: increasing accuracy in prediction, new skills that were not present initially, improved speed.
Domains and Applications of Machine learning
1. Learning Associations: a method for discovering interesting relations between variables in large databases; it is used to identify regularities in large-scale databases. Learning associations is mainly used in market analysis, which means finding associations between products bought by customers: if people who buy X typically also buy Y, and there is a customer who buys X but does not buy Y, he or she is a potential Y customer. Once such customers are identified, they can be targeted for cross-selling.
2. Self-driving cars
a. One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, a well-known car manufacturer, is working on self-driving cars and uses machine learning methods (including unsupervised learning) to train its models to detect people and objects while driving.
3. Stock Market trading
a. Machine learning is widely used in stock market trading. In the stock market there is always a risk of shares going up and down, so long short-term memory (LSTM) neural networks are used for the prediction of stock market trends.
4. Medical Diagnosis
a. In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
b. It helps in finding brain tumors and other brain-related diseases easily.
5. Classification : Classification is the problem of identifying to which set of categories a
new observation belongs on the basis of a training set of data containing observations
whose category membership is known. Example : Email Spam and Malware Filtering.
6. Pattern recognition : is a branch of machine learning that focuses on the recognition of
similarities and regularities in data.
a. Optical character recognition: is the recognition of character codes from their
images. In this case, there are multiple classes as many as there are characters
that are to be recognized.
b. Face recognition: The input is an image and the classes are the people to be recognized. The learning program should learn to associate face images with identities.
c. Speech recognition: The input is acoustic and the classes are words that can be uttered. The association to be learned is from an acoustic signal to a word in some language.
7. Density Estimation: There is a structure to the input space such that certain patterns occur more often than others, and based on the frequency of occurrence the aim is to model what tends to happen and what does not. One type of density estimation is clustering.
Key terminology
Labels
A label is the thing we're predicting—the y variable in simple linear regression. The label could
be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio
clip, or just about anything.
Features
A feature is an input variable—the x variable in simple linear regression. A simple machine
learning project might use a single feature, while a more sophisticated machine learning project
could use millions of features, specified as:
x1,x2,...xN
In the spam detector example, the features could include the following:
● words in the email text
● sender's address
● time of day the email was sent
● the email contains the phrase "one weird trick."
Examples
An example is a particular instance of data, x. (We put x in boldface to indicate that it is a
vector.) We break examples into two categories:
● labeled examples
● unlabeled examples
A labeled example includes both feature(s) and the label. That is:
labeled examples: {features, label}: (x, y)
An unlabeled example contains features but not the label. That is:
unlabeled examples: {features, ?}: (x, ?)
Models
A model defines the relationship between features and label. For example, a spam detection
model might associate certain features strongly with "spam". Let's highlight two phases of a
model's life:
● Training means creating or learning the model. That is, you show the model labeled
examples and enable the model to gradually learn the relationships between features
and labels.
● Inference means applying the trained model to unlabeled examples. That is, you use the
trained model to make useful predictions (y'). For example, during inference, you can
predict medianHouseValue for new unlabeled examples.
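As a minimal illustrative sketch (not from these notes), the two phases look like this in scikit-learn; the feature values and medianHouseValue targets below are made up for the example:
from sklearn.linear_model import LinearRegression

# Training: labeled examples {features, label}
X_train = [[1200, 3], [1500, 4], [900, 2]]   # e.g. [area_sq_ft, num_rooms]
y_train = [240000, 310000, 180000]           # e.g. medianHouseValue

model = LinearRegression()
model.fit(X_train, y_train)                  # the model learns the feature-label relationships

# Inference: unlabeled examples get predictions y'
X_new = [[1100, 3]]
print(model.predict(X_new))                  # predicted medianHouseValue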
Regression vs. classification
A regression model predicts continuous values. For example, regression models make
predictions that answer questions like the following:
● What is the value of a house in California?
● What is the probability that a user will click on this ad?
A classification model predicts discrete values. For example, classification models make
predictions that answer questions like the following:
● Is a given email message spam or not spam?
● Is this an image of a dog, a cat, or a hamster?
How to choose the right algorithm
The type and kind of data we have plays a key role in deciding which algorithm to use. Some
algorithms can work with smaller sample sets while others require tons and tons of samples.
Certain algorithms work with certain types of data; e.g. Naïve Bayes works well with categorical input and is not very sensitive to missing data.
● Know your data
○ Look at Summary statistics and visualizations
○ Percentiles can help identify the range for most of the data
○ Averages and medians can describe central tendency
○ Correlations can indicate strong relationships
● Visualize the data
○ Box plots can identify outliers
○ Density plots and histograms show the spread of data
○ Scatter plots can describe bivariate relationships
● Clean your data
● Deal with missing values. Missing data affects some models more than others.
● Choose what to do with outliers
○ Outliers can be very common in multidimensional data.
○ Outliers can be the result of bad data collection, or they can be legitimate
extreme values.
● Augment your data
○ Feature engineering is the process of going from raw data to data that is ready
for modeling.
○ Different models may have different feature engineering requirements.
● Categorize the problem
○ The next step is to categorize the problem. This is a two-step process.
■ Categorize by input:
● If you have labelled data, it’s a supervised learning problem.
● If you have unlabelled data and want to find structure, it’s an
unsupervised learning problem.
● If you want to optimize an objective function by interacting with
an environment, it’s a reinforcement learning problem.
■ Categorize by output.
● If the output of your model is a number, it’s a regression problem.
● If the output of your model is a class, it’s a classification problem.
● If the output of your model is a set of input groups, it’s a
clustering problem.
● Do you want to detect an anomaly? That’s anomaly detection
● An important criterion affecting the choice of algorithm is model complexity. Generally speaking, a model is more complex as:
○ It relies on more features to learn and predict (e.g. using two features vs ten
features to predict a target)
○ It has more computational overhead (e.g. a single decision tree vs. a random
forest of 100 trees).
Steps in developing a ML application
1. Collect data. One could collect samples by scraping a website, or by extracting data or information from an RSS feed or an API. One could even have a device collect measurements such as wind speed or blood glucose levels, or anything else that can be measured. One can also use publicly available data to save time and effort.
2. Prepare the input data. Once one has the required data, one needs to make sure it is in a usable format; some algorithm-specific formatting may also be needed here. The benefit of having a standard format is that algorithms and data sources can be mixed and matched. Some algorithms need features in a special format, some algorithms can deal with target variables and features as strings, and some need them to be integers.
3. Analyse the input data. This means looking at the data from the previous step. It could be as simple as looking at the parsed data in a text editor to make sure steps 1 and 2 are actually working and that there is not a bunch of empty values. One can also look at the data to see if any patterns can be recognized, or if there is anything obvious, such as a few data points that are vastly different from the rest of the set. Plotting the data in one or more dimensions can also help, but most of the time there will be more than three features, so more advanced methods are needed.
4. Train the algorithm. This is where machine learning takes place. This step and the next are where the "core" algorithms lie: depending on the algorithm, one feeds it good clean data from the first steps and extracts knowledge or information. This knowledge is often stored in a format that is readily usable by a machine for the next two steps.
5. Test the algorithm. This is where the information learned in the previous step is put to use. When evaluating an algorithm, one needs to test it to see how well it does. In the case of supervised learning, some known values can be used to evaluate the algorithm. In unsupervised learning, other metrics are needed to evaluate success. In either case, if one is not satisfied with the result, one can go back to step 4, make changes and test again, or go back to step 1 if there is a problem with the data.
6. Use it. Here one builds a real program to do some task, and once again can check whether all the previous steps worked as expected. When new data is encountered, one may have to revisit steps 1 to 5.
Types of Machine learning
● Supervised learning
○ It is a type of learning in which the system or model is given a training set
consisting of feature-outcome pairs from which it learns a pattern or a mapping
of feature-outcome which it uses in future predictions.
○ Known Techniques : Linear Regression, SVM, Logistic regression.
○ Application : Marketing, forecast sales or risk evaluation.
○ Examples : Classification and Regression
○ Disadvantages : Overfitting supervised algorithms is easily possible, computation time is high, and unwanted data can reduce efficiency.
○ Advantages : Supervised learning allows you to collect data or produce a data
output from the previous experience. Helps you to optimize performance criteria
using experience. Supervised machine learning helps you to solve various types
of real-world computation problems.
● Unsupervised learning
○ It is a type of learning in which the outcome is not present in the data. Instead,
only the features are present and the model automatically finds patterns in the
data according to which it segregates it.
○ Known techniques : K-means, Apriori
○ Application : Anomaly detection system and recommendation system.
○ Examples : Association and clustering in order to discover underlying patterns
○ Disadvantages : The results are less accurate because the input data is not labeled in advance and the true outcomes are not known. There is also no way to get precise information regarding how the data is sorted or what the output means, since the data in this kind of learning is not labeled and not known.
○ Advantages : Unsupervised learning solves the problem by learning the data and
classifying it without any labels. The labels can be added after the data has been
classified which is much easier. It is very helpful in finding patterns in data, which
are not possible to find using normal methods.
● Semi-supervised
○ It is used when a small amount of the data is labelled and a large amount is not labelled. The labelled data is used to learn an initial hypothesis (decision boundary) between classes, and the unlabelled data is then used to refine it.
○ Known Techniques : Graph based methods and low density separation
techniques.
○ Application : Speech Analysis
○ Advantages : It is simple and easy to understand ,reduces the amount of
annotated data used. It is a stable algorithm with high efficiency.
○ Disadvantages : Iteration results are not stable. It is not applicable to
network-level data. It has low accuracy.
● Reinforcement learning
○ It is a type of learning where the model or system learns through feedback. The
goal is to perform actions to achieve an objective according to a programmed
policy. It is penalized if it performs against policy, otherwise rewarded.
○ Known Techniques : Q-learning and SARSA
○ Application : Robotics for industrial automation, Aircraft control.
○ Examples : Exploration or exploitation in order to learn series of actions
○ Disadvantages : Reinforcement learning is not preferable to use for solving simple
problems. Too much reinforcement learning can lead to an overload of states
which can diminish the results.
○ Advantages : Reinforcement learning doesn’t require large labeled datasets. It’s a
massive advantage because as the amount of data in the world grows it becomes
more and more costly to label it for all required applications.
Difference between Supervised and Unsupervised Learning
● Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
● A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
● A supervised learning model predicts the output; unsupervised learning models find the hidden patterns in data.
● In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
● The goal of supervised learning is to train the model so that it can predict the output when it is given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
● Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
● Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be classified into Clustering and Association problems.
● Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used for cases where we have only input data and no corresponding output data.
● Supervised learning models produce accurate results; unsupervised learning models may give less accurate results in comparison.
● Supervised learning is not close to true Artificial Intelligence, as we first train the model on each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from experience.
● Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree and Bayesian Logic; unsupervised learning includes algorithms such as Clustering, KNN and Apriori.
Difference between Reinforcement learning and Supervised Learning

Module 2
Decision Trees
The Decision Tree algorithm belongs to the family of supervised learning algorithms and is a practical method for inductive inference. Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be represented as sets of if-then rules. Decision trees are a hierarchical data structure implementing the divide-and-conquer strategy. They are an efficient method that can be used for both classification and regression. A decision tree is a hierarchical model for supervised learning in which a local region of the input space is identified through a sequence of recursive splits in a small number of steps.
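A minimal illustrative sketch of decision tree learning with scikit-learn (the dataset and hyperparameters are assumptions, not part of these notes):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion can be "gini" or "entropy"; max_depth limits the recursive splitting
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))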
Key Components of Decision trees:
1. Root Node: It represents the entire population or sample and this further gets
divided into two or more homogeneous sets.

2. Splitting: It is a process of dividing a node into two or more sub-nodes.

3. Decision Node: When a sub-node splits into further sub-nodes, then it is called the
decision node.

4. Leaf / Terminal Node: Nodes that do not split are called Leaf or Terminal nodes. A
leaf node defines a localized region in input space. Each leaf node has an output label
which is a class node in classification and a numeric value in regression.

5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be seen as the opposite of splitting.

6. Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.

7. Parent and Child Node: A node, which is divided into sub-nodes is called a parent
node of sub-nodes whereas sub-nodes are the child of a parent node.

8. Boundaries of the regions are defined by discriminants that are encoded in the internal nodes.

Advantages of Decision Tree :

● Compared to other algorithms, decision trees require less effort for data preparation
during pre-processing.
● A decision tree does not require normalization of data.
● A decision tree does not require scaling of data as well.
● Missing values in the data also do NOT affect the process of building a decision tree to
any considerable extent.
● A Decision tree model is very intuitive and easy to explain to technical teams as well as
stakeholders.
Disadvantages of Decision Tree

● A small change in the data can cause a large change in the structure of the decision tree
causing instability.
● For a decision tree, the calculations can sometimes become far more complex compared to other algorithms.
● Decision trees often take longer to train; training is relatively expensive because the complexity and time taken are higher.
● The Decision Tree algorithm is inadequate for regression, i.e. for predicting continuous values.
Types of problems decision tree learning is good for:

Decision tree representation


Impurity Measures in Decision Tree

Entropy : Entropy characterizes the impurity of an arbitrary collection of examples.
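In the standard formulation, for a collection S in which class i occurs with proportion p_i:
Entropy(S) = - Σ_i p_i log2(p_i)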


Entropy - Information Theory

Entropy - Non-boolean Target Classification

Information Gain
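In the standard formulation, the information gain of an attribute A relative to a collection of examples S is the expected reduction in entropy caused by partitioning S on A:
Gain(S, A) = Entropy(S) - Σ_(v in Values(A)) ( |S_v| / |S| ) Entropy(S_v)
where S_v is the subset of S for which attribute A has value v.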
Constructing Decision Trees

ID3
Advantages and disadvantages of ID3
C4.5

Classification and Regression Tree (CART)


Three Steps in CART
● Tree building
● Pruning
● Optimal tree selection
○ If the attribute is categorical, then a classification tree is used
○ If it is continuous, regression trees are used
Steps of Tree Building
1. For each non-terminal node
a. For a variable
i. At all its split points, splits samples into two binary nodes
ii. Select the best split in the variable in terms of the reduction in impurity
(gini index)
b. Rank all of the best splits and select the variable that achieves the highest purity
at root
c. Assign classes to the nodes according to a rule that minimizes misclassification
costs
d. Grow a very large tree Tmax until all terminal nodes are either small or pure or
contain identical measurement vectors
2. Prune and choose the final tree using cross-validation

Advantages and disadvantages of CART

Issues in Decision tree learning


Brief explanation of disadvantages of Decision Tree

● The most significant disadvantage of Decision Trees is that they are prone to overfitting.
Decision Trees overfit because you can end up with a leaf node for every single target
value in your training data.

● Decision Trees are also locally optimized, or greedy, which just means that they don’t
think ahead when deciding how to split at any given node. Rather, splits are made to
minimize or maximize the chosen splitting (selection) criterion— gini or entropy for
classification, MSE or MAE for regression.

● Because of the greedy nature of splitting, imbalanced classes also pose a major issue for
Decision Trees when dealing with classification. At each split, the tree is deciding how to
best split up classes into the next two nodes. So when one class has very low
representation (the minority class), many of those observations can get lost in the
majority class nodes, and then prediction of the minority class will be even less likely
than it should, if any nodes predict it at all.

Ensemble methods

Ensemble methods combine several decision trees to produce better predictive performance
than utilizing a single decision tree. The main principle behind the ensemble model is that a
group of weak learners come together to form a strong learner.

Techniques are :

Bagging

Bagging is used when the goal is to reduce the variance of a decision tree classifier. The objective is to create several subsets of data from the training samples, chosen randomly with replacement. Each subset of data is used to train its own decision tree. As a result, we get an ensemble of different models. The average of all the predictions from the different trees is used, which is more robust than a single decision tree classifier.

Bagging Steps:

● Suppose there are N observations and M features in the training data set. A sample from the training data set is taken randomly with replacement.
● A subset of the M features is selected randomly, and whichever feature gives the best split is used to split the node iteratively.
● Each tree is grown to its largest size.
● The above steps are repeated n times, and the prediction is given based on the aggregation of the predictions from the n trees.
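A minimal sketch of these steps using scikit-learn's BaggingClassifier, whose default base estimator is a decision tree (the dataset and hyperparameters are illustrative assumptions):
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# n_estimators trees, each trained on a bootstrap sample drawn with replacement
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
print("CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())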
Advantages:

● Reduces overfitting of the model.


● Handles higher dimensionality data very well.
● Maintains accuracy for missing data.
Disadvantages:

● Since the final prediction is based on averaging the predictions from the subset trees, it will not give precise values for the classification and regression models.
Boosting

● Unlike bagging, which is a parallel ensemble technique, boosting works sequentially.


● It aims to convert weak learners to strong learners by sequentially improving the
previous classification, thus minimizing the bias error as we move forward.
● Boosting begins similarly to bagging, by randomly choosing a dataset from the training data. It creates a classification model using this subset and tests the model on the existing training dataset.
● Some of the data points from the training dataset are correctly classified.
● Now, for building the next random dataset, the instances or data points which were
wrongly classified in the previous dataset will be given higher priority which simply
means that these instances or data points will have a higher likelihood of being selected
in the next dataset.
● This way, boosting sequentially builds N random datasets using the data gained from the
previously chosen instances.
Advantages
● It is one of the most successful techniques in solving the two-class classification
problems.
● It is good at handling the missing data.
Disadvantages
● Boosting is hard to implement in real-time due to the increased complexity of the
algorithm.
● The high flexibility of these techniques results in a large number of parameters that directly affect the behaviour of the model.
Types of Boosting

Gradient Boosting

● Gradient Boosting uses the gradient descent method to reduce the loss function of the
entire operation. Gradient descent is a first-order optimization algorithm that finds the
local minimum of a function (differentiable function). Gradient boosting sequentially
trains multiple models, and it can fit novel models to get a better estimate of the
response.
● Once a loss function is defined for a particular model, gradient boosting is used to
minimize the value of this function, thus minimizing the error while constructing another
tree, by modifying the weights associated with the data points.
● GradientBoostingRegressor and GradientBoostingClassifier can be used to implement
this method in Python by using the library sklearn.ensemble.
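A minimal sketch using the classes named above (the dataset and hyperparameters are illustrative assumptions):
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each new tree is fit to the gradient of the loss; learning_rate shrinks each step
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print("test accuracy:", gbm.score(X_test, y_test))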
AdaBoost

● The Adaptive Boosting technique works iteratively to improve the classification happening at a certain stage. It uses decision stumps.
● A decision stump is basically a one-level decision tree that takes a decision based on a single feature.
● Different decision stumps are combined sequentially, each one correcting its predecessors, to create a stronger classifier that fits the data with fewer errors.
● We can use the AdaBoostRegressor and AdaBoostClassifier from the library
sklearn.ensemble to implement the AdaBoost algorithm.
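A minimal sketch using AdaBoostClassifier, which by default boosts decision stumps (one-level trees); the dataset and settings are illustrative assumptions:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# each new stump gives more weight to the examples misclassified by the previous ones
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())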
XGBoost(eXtreme Gradient Boosting)

XGBoost is an implementation of gradient boosted decision trees designed for speed and
performance. Gradient boosting machines are generally very slow in implementation because of
sequential model training. Hence, they are not very scalable. Thus, XGBoost is focused on
computational speed and model performance. XGBoost provides:

● Parallelization of tree construction using all of your CPU cores during training.
● Distributed Computing for training very large models using a cluster of machines.
● Out-of-Core Computing for very large datasets that don’t fit into memory.
● Cache Optimization of data structures and algorithms to make the best use of hardware.
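A minimal sketch assuming the separate xgboost Python package is installed (it is not part of scikit-learn); the dataset and hyperparameters are illustrative:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_jobs=-1 uses all CPU cores for parallel tree construction
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, n_jobs=-1)
xgb.fit(X_train, y_train)
print("test accuracy:", xgb.score(X_test, y_test))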
Loss function

Loss functions play an important role in any statistical model. They define an objective against which the performance of the model is evaluated, and the parameters learned by the model are determined by minimizing the chosen loss function. Loss functions define what a good prediction is and isn't.

Difference between Loss Function and Cost Function


The terms cost function and loss function refer to almost the same thing. However, the loss function applies to a single training example, whereas the cost function deals with the penalty over a number of training examples or the complete batch. The loss function is also sometimes called an error function.

In short, we can say that the loss function is a part of the cost function: the cost function is calculated as an average of the loss functions, while the loss function is a value calculated for every instance.

So, for a single training cycle loss is calculated numerous times, but the cost function is only
calculated once.

Regression Loss Function

1. Squared Error Loss


Squared Error loss for each training example, also known as L2 Loss, is the square of the
difference between the actual and the predicted values.

The corresponding cost function is the Mean of these Squared Errors (MSE). The MSE loss function penalizes the model for making large errors by squaring them, which makes the MSE cost function less robust to outliers. Therefore it should not be used if our data is prone to many outliers.
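In the usual notation, with y the actual value, y' the predicted value and n the number of examples:
Squared error (single example): L = (y - y')^2
MSE (cost): (1/n) Σ_i (y_i - y'_i)^2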
2. Absolute Error Loss
Absolute Error for each training example is the distance between the predicted and the actual
values, irrespective of the sign. Absolute Error is also known as the L1 loss:

The cost is the Mean of these Absolute Errors (MAE).


The MAE cost is more robust to outliers as compared to MSE. However, handling the absolute (modulus) operator in mathematical equations is not easy, which is the disadvantage of MAE.
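In the same notation:
Absolute error (single example): L = |y - y'|
MAE (cost): (1/n) Σ_i |y_i - y'_i|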
3. Huber Loss
The Huber loss combines the best properties of MSE and MAE. It is quadratic for smaller errors
and is linear otherwise (and similarly for its gradient). It is identified by its delta parameter:
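In its standard form, with delta (δ) the threshold parameter:
L_δ(y, y') = 0.5 (y - y')^2 if |y - y'| ≤ δ
L_δ(y, y') = δ |y - y'| - 0.5 δ^2 otherwise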

Huber loss is more robust to outliers than MSE. It is used in Robust Regression, M-estimation
and Additive Modelling. A variant of Huber Loss is also used in classification.
Binary Classification Loss Functions
Binary Classification refers to assigning an object into one of two classes. This classification is
based on a rule applied to the input feature vector. For example, classifying an email as spam or
not spam based on, say its subject line, is binary classification.
Consider a breast cancer dataset where we want to classify a tumor as 'Malignant' or 'Benign' based on features like average radius, area, perimeter, etc. For simplicity, we will use only two input features (X_1 and X_2), namely 'worst area' and 'mean symmetry', for classification. The target value Y can be 0 (Malignant) or 1 (Benign).

1. Binary Cross Entropy Loss


Entropy indicates disorder or uncertainty. It is measured for a random variable X with probability distribution p(X):
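H(X) = - Σ_x p(x) log(p(x))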
The negative sign is used to make the overall quantity positive.
A greater value of entropy for a probability distribution indicates a greater uncertainty in the
distribution. Likewise, a smaller value indicates a more certain distribution. This makes binary
cross-entropy suitable as a loss function – you want to minimize its value. We use binary
cross-entropy loss for classification models which output a probability p.
Probability that the element belongs to class 1 (or positive class) = p
Then, the probability that the element belongs to class 0 (or negative class) = 1 - p
Then, the cross-entropy loss for output label y (can take values 0 and 1) and predicted
probability p is defined as:
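Loss = -( y log(p) + (1 - y) log(1 - p) )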

This is also called Log-Loss. To calculate the probability p, we can use the sigmoid function.
Here, z is a function of our input features:
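p = 1 / (1 + e^(-z))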

The range of the sigmoid function is [0, 1] which makes it suitable for calculating probability.

2. Hinge Loss
Hinge loss is primarily used with Support Vector Machine (SVM) Classifiers with class labels -1
and 1. Hinge Loss not only penalizes the wrong predictions but also the right predictions that
are not confident.
Hinge loss for an input-output pair (x, y) is given as:
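L = max(0, 1 - y · f(x)), where y ∈ {-1, +1} is the true label and f(x) is the model's output.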

Hinge loss simplifies the mathematics for SVM while maximizing the margin (as compared to Log-Loss). It is used when we want to make real-time decisions without a laser-sharp focus on accuracy.
Multi-class Classification Loss Function
Emails are not just classified as spam or not spam (this isn’t the 90s anymore!). They are
classified into various other categories – Work, Home, Social, Promotions, etc. This is a
Multi-Class Classification use case.
1. Multi-Class Cross Entropy Loss
The multi-class cross-entropy loss is a generalization of the Binary Cross Entropy loss. The loss
for input vector X_i and the corresponding one-hot encoded target vector Y_i is:
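Loss_i = - Σ_j y_ij log(p_ij), where j runs over the classes and y_ij is 1 only for the true class of X_i.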

We use the softmax function to find the probabilities p_ij:
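p_ij = e^(z_ij) / Σ_k e^(z_ik), where z_ij is the raw score (logit) of class j for input X_i.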

Softmax is implemented through a neural network layer just before the output layer. The
Softmax layer must have the same number of nodes as the output layer.

Finally, our output is the class with the maximum probability for the given input.
Neural Network Learning as Optimization

● A deep learning neural network learns to map a set of inputs to a set of outputs from
training data.
● We cannot calculate the perfect weights for a neural network; there are too many unknowns. Instead, the problem of learning is cast as a search or optimization problem, and an algorithm is used to navigate the space of possible sets of weights the model may use in order to make good, or good enough, predictions.
● A neural network model is typically trained using the stochastic gradient descent optimization algorithm, and the weights are updated using the backpropagation of error algorithm.
● “Gradient” in gradient descent refers to an error gradient. The model with a given set of
weights is used to make predictions and the error for those predictions is calculated
● The gradient descent algorithm seeks to change the weights so that the next evaluation
reduces the error, meaning the optimization algorithm is navigating down the gradient
(or slope) of error.
● In the context of an optimization algorithm, the function used to evaluate a candidate
solution (i.e. a set of weights) is referred to as the objective function.
● We may seek to maximize or minimize the objective function, meaning that we are
searching for a candidate solution that has the highest or lowest score respectively.
● Typically, with neural networks, we seek to minimize the error. As such, the objective
function is often referred to as a cost function or a loss function and the value calculated
by the loss function is referred to as simply “loss.”
● The cost or loss function has an important job in that it must faithfully distill all aspects
of the model down into a single number in such a way that improvements in that
number are a sign of a better model.
● In calculating the error of the model during the optimization process, a loss function
must be chosen.
● This can be a challenging problem as the function must capture the properties of the
problem and be motivated by concerns that are important to the project and
stakeholders.
Model Selection
Model selection is a technique for selecting the best model after the individual models are
evaluated based on the required criteria.
Techniques :
● Resampling methods are simple techniques of rearranging data samples to inspect if the
model performs well on data samples that it has not been trained on.
● Random Splits are used to randomly sample a percentage of data into training, testing,
and preferably validation sets
● Time-wise split: The training set can contain data for the last three years and 10 months of the present year, while the last two months are reserved for the testing or validation set.
● K-Fold Cross-Validation: The cross-validation technique works by randomly shuffling the
dataset and then splitting it into k groups. Then on iterating over each group, the group
needs to be considered as a test set while all other groups are clubbed together into the
training set.
● Bootstrap: The first step is to select a sample size (usually equal to the size of the original dataset). Then a data point is randomly selected from the original dataset and added to the bootstrap sample; after the addition, the data point is put back into the original dataset (i.e. sampling with replacement). This process is repeated N times, where N is the sample size.
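A minimal sketch of k-fold cross-validation and of drawing a bootstrap sample, using scikit-learn and NumPy (the dataset, model and k are illustrative assumptions):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

X, y = load_iris(return_X_y=True)

# K-Fold: shuffle, split into k groups, each group serves as the test set once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("CV accuracy per fold:", scores)

# Bootstrap: sample N points with replacement from the original data
rng = np.random.default_rng(0)
idx = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[idx], y[idx]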
Model Evaluation
For every classification model prediction, a matrix called the confusion matrix can be
constructed which demonstrates the number of test cases correctly and incorrectly classified.
● Accuracy is the simplest metric and can be defined as the number of test cases correctly
classified divided by the total number of test cases.

● Precision is the metric used to identify the correctness of classification.

● Recall tells us the number of positive cases correctly identified out of the total number
of positive cases.

● F1 score is the harmonic mean of Recall and Precision and therefore, balances out the
strengths of each.

● AUC-ROC: The ROC curve is a plot of the true positive rate (recall) against the false positive rate (FP / (FP + TN)). AUC-ROC stands for Area Under the Receiver Operating Characteristic curve, and the higher the area, the better the model performance.
● If the curve is somewhere near the 50% diagonal line, it suggests that the model randomly predicts the output variable.
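A minimal sketch of the metrics above using sklearn.metrics (the y_true / y_pred values are illustrative):
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # predicted probability of class 1

print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + TN + FP + FN)
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))          # 2 * P * R / (P + R)
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))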
Bias and Variance
Bias occurs when a model is strictly ruled by assumptions – like the linear regression model
assumes that the relationship of the output variable with the independent variables is a straight
line. This leads to underfitting when the actual values are non-linearly related to the
independent variables.
Variance is high when a model focuses on the training set too much and learns the variations
very closely, compromising on generalization. This leads to overfitting.
An optimal model is one that has the lowest bias and variance, and since these two attributes are inversely related, the only way to achieve this is through a tradeoff between the two. Therefore, the model selection should be such that the bias and variance curves intersect.

Linear Regression with single variable


Gradient Descent
Linear regression with multiple variables with Gradient Descent
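A minimal NumPy sketch of batch gradient descent for linear regression with multiple variables (the data, learning rate and iteration count are illustrative assumptions, not from these notes):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # 100 examples, 2 features
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1]       # true relationship (noise-free, for clarity)

Xb = np.c_[np.ones(len(X)), X]                # add a column of 1s for the intercept a0
theta = np.zeros(Xb.shape[1])                 # parameters [a0, a1, a2]
lr = 0.1                                      # learning rate

for _ in range(1000):
    error = Xb @ theta - y                    # prediction error
    grad = (2 / len(y)) * Xb.T @ error        # gradient of MSE with respect to theta
    theta -= lr * grad                        # step down the error gradient

print(theta)   # should approach [4, 3, -2]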
Difference between Linear regression and Logistic regression
● Linear Regression is a supervised regression model; Logistic Regression is a supervised classification model.
● In Linear Regression we predict a continuous numeric value; in Logistic Regression we predict a class, 1 or 0.
● In Linear Regression no activation function is used; in Logistic Regression an activation function is used to convert the linear regression equation into the logistic regression equation.
● Linear Regression needs no threshold value; Logistic Regression requires a threshold value.
● In Linear Regression we calculate the Root Mean Square Error (RMSE) to predict the next weight value; in Logistic Regression we use precision to predict the next weight value.
● In Linear Regression the dependent variable should be numeric and the response variable is continuous in value; in Logistic Regression the dependent variable consists of only two categories, and logistic regression estimates the odds of the outcome of the dependent variable given a set of quantitative or categorical independent variables.
● Linear Regression is based on least squares estimation; Logistic Regression is based on maximum likelihood estimation.
● In Linear Regression, when we plot the training data, a straight line can be drawn that touches the maximum number of points; in Logistic Regression, any change in a coefficient changes both the direction and the steepness of the logistic function (positive slopes give an S-shaped curve and negative slopes a Z-shaped curve).
● Linear regression is used to estimate the dependent variable when the independent variables change, for example to predict the price of houses; logistic regression is used to calculate the probability of an event, for example to classify whether tissue is benign or malignant.
● Linear regression assumes a normal or Gaussian distribution of the dependent variable; logistic regression assumes a binomial distribution of the dependent variable.

Linear regression
The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable. The linear regression model provides a sloped straight line representing the relationship between the variables.
Mathematically, we can represent linear regression as:
y = a0 + a1x + ε
Here,
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model
representation.
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
● Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.
● Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Logistic Regression
Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for predicting the categorical dependent variable
using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False,
etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie
between 0 and 1.
Logistic Regression is quite similar to Linear Regression, except in how it is used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
We know the equation of the straight line can be written as:
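y = b0 + b1x1 + b2x2 + ... + bnxn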

In Logistic Regression y can only be between 0 and 1, so let's divide the above equation by (1-y):
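y / (1 - y) ; 0 for y = 0 and infinity for y = 1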

But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation, it becomes:
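log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn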

The above equation is the final equation for Logistic Regression.


Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
● Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
● Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
● Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
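A minimal sketch of logistic regression with scikit-learn (the dataset and pipeline are illustrative assumptions); predict gives the discrete class, while predict_proba gives the probabilistic values between 0 and 1:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
print(clf.predict(X_test[:5]))         # discrete class labels (0 or 1)
print(clf.predict_proba(X_test[:5]))   # probabilities for each class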

Module 4
Dimensionality Reduction
It is the process of reducing the number of random variables under consideration by obtaining a set of uncorrelated principal variables. Dimensionality reduction refers to the process of converting a set of data with vast dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely.
Methods of Dimensionality Reduction : Principal component Analysis (PCA) and Independent
Component Analysis (ICA)
Advantages of Dimensionality Reduction
Principal component analysis(PCA)
Independent Component Analysis (ICA)

Independent component analysis (ICA) is a method for finding underlying factors or components from multivariate (multi-dimensional) statistical data. What distinguishes ICA from other methods is that it looks for components that are both statistically independent and non-Gaussian.

Independent Component Analysis(ICA) estimation principles


● Principle 1: Minimization of mutual information

● Principle 2: Maximum Likelihood Estimation


Motivation-Cocktail Party Problem

Difference between PCA and ICA


● PCA reduces the dimensions to avoid the problem of overfitting; ICA decomposes a mixed signal into its independent sources' signals.
● PCA deals with principal components; ICA deals with independent components.
● PCA focuses on maximizing the variance; ICA does not focus on the issue of variance among the data points.
● PCA focuses on the mutual orthogonality of the principal components; ICA does not focus on the mutual orthogonality of the components.
● PCA does not focus on the mutual independence of the components; ICA focuses on the mutual independence of the components.
● PCA removes correlations, but not higher-order dependence; ICA removes correlations and higher-order dependence.
● PCA uses up to second-order moments of the data to produce uncorrelated components; ICA strives to generate components that are as independent as possible by minimizing both second-order and higher-order dependencies in the given data.

Definition of ICA
(Note: Eq. 3 refers to the mixing model x = As, where x is the observed signal vector, A the mixing matrix and s the vector of independent source components.)

BSS- Blind Source Separation


● ICA is very closely related to the method called blind source separation (BSS) or blind signal separation.
● A "source" here means an original signal, i.e. an independent component, like a speaker in the cocktail party problem.
● "Blind" means that we know very little, if anything, about the mixing matrix A, and make few assumptions about the source signals.
● ICA is one method, perhaps the most widely used, for performing blind source separation.
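A minimal sketch contrasting PCA and ICA with scikit-learn on a toy cocktail-party-style mixture of two signals (all data here is illustrative, not from these notes):
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                         # source 1
s2 = np.sign(np.sin(3 * t))                # source 2 (non-Gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 2.0]])     # mixing matrix (x = As)
X = S @ A.T                                # observed mixed signals

X_pca = PCA(n_components=2).fit_transform(X)                      # uncorrelated components
S_est = FastICA(n_components=2, random_state=0).fit_transform(X)  # estimated independent sources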
Ambiguities of ICA

What is independence?
Uncorrelated does not mean independent
Why are Gaussian variables forbidden?

Non-Gaussianity Estimation - Measurement of non-Gaussianity


The Central Limit Theorem
– The distribution of a sum of independent random variables tends toward a Gaussian distribution.
– Thus, a sum of two independent random variables usually has a distribution that is closer to Gaussian than either of the two original random variables.
Preprocessing for ICA
1. Centering

2. Whitening
Centering + Whitening = Sphering
● Centering and whitening combined is referred to as sphering, and is necessary to speed up the ICA algorithm.
● Sphering removes the first- and second-order statistics of the data: the mean is set to zero, the correlations (off-diagonal covariances) are removed, and the variances are equalized.
