ML Unit-2
Supervised learning is a type of machine learning in which the machine needs external
supervision to learn. Supervised learning models are trained using a labeled dataset. Once
training and processing are done, the model is tested with sample test data to
check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning
is based on supervision, much like a student learning under a teacher's supervision. An
example of supervised learning is spam filtering.
Supervised learning can be further divided into two types of problems:
o Classification
o Regression
Distance-based models
Like Linear models, distance-based models are based on the geometry of data. As the name
implies, distance-based models work on the concept of distance. In the context of Machine
learning, the concept of distance is not based on merely the physical distance between two
points.
K-Nearest Neighbor (K-NN) Algorithm
Suppose there are two categories, Category A and Category B, and we have a new data
point x1. In which of these categories will this data point lie? To solve this type of problem,
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class
of a particular data point. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:
o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the new point and the existing
data points. The Euclidean distance is the distance between two points, which we have
already studied in geometry. Between points A(x1, y1) and B(x2, y2) it can be calculated as:
Euclidean distance = sqrt((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance, we get the nearest neighbors: three nearest
neighbors in Category A and two nearest neighbors in Category B. Consider the below
image:
o As we can see, the 3 nearest neighbors are from Category A; hence this new data point
must belong to Category A.
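The voting logic described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the sample points and labels are made up for the example, and in practice a library such as scikit-learn's KNeighborsClassifier would normally be used.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Compute the Euclidean distance from the query to every training point.
    distances = [
        (math.dist(query, p), label)  # math.dist handles 2-D points
        for p, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the k closest neighbors.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Majority vote among the k neighbors decides the class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: Category A clustered near (1, 1), Category B near (5, 5).
points = [(1, 1), (1, 2), (2, 1), (2, 2), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, query=(2.5, 2.5), k=5))  # -> "A"
```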
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value of K, such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
o Large values of K reduce the effect of noise, but a value that is too large may cause the
model to miss smaller, local patterns and increases the computational cost.
Decision Tree Classification Algorithm
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so they
are easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further
after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are called the child nodes.
In a decision tree, for predicting the class of a given dataset, the algorithm starts from the
root node of the tree. It compares the values of the root attribute with the corresponding
attribute of the record (the real dataset) and, based on the comparison, follows the branch and
jumps to the next node. At the next node, the algorithm again compares the attribute value with
the sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where you cannot further classify
the nodes; call the final nodes leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or not. To solve this problem, the decision tree starts with the root
node (the Salary attribute, selected by ASM). The root node splits further into the next decision
node (distance from the office) and one leaf node based on the corresponding labels. The next
decision node further splits into one decision node (cab facility) and one leaf node. Finally,
the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the
below diagram:
While implementing a decision tree, the main issue is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique
called the Attribute Selection Measure, or ASM. Using this measure, we can easily select the
best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
Information gain is the measurement of the change in entropy after a dataset is split on an
attribute; it calculates how much information a feature provides about a class. The attribute
with the highest information gain is chosen for the split. It can be calculated as:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
where Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no), S is the set of samples, and
P(yes)/P(no) are the probabilities of the yes/no classes.
2. Gini Index:
o The Gini index is a measure of impurity or purity used while creating a decision tree in
the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o The Gini index can be calculated using the below formula:
Gini Index = 1 − Σj (Pj)²
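As an illustration, the Gini index of a candidate split can be computed directly from the class counts in each branch. The functions below are a minimal sketch of that calculation; the example counts are hypothetical.

```python
def gini(counts):
    """Gini impurity of one node, given class counts, e.g. {"yes": 6, "no": 2}."""
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_of_split(branches):
    """Weighted Gini index of a split, given class counts for each branch."""
    total = sum(sum(b.values()) for b in branches)
    return sum(sum(b.values()) / total * gini(b) for b in branches)

# Hypothetical split: left branch has 6 "yes" / 2 "no", right branch has 1 "yes" / 5 "no".
print(gini_of_split([{"yes": 6, "no": 2}, {"yes": 1, "no": 5}]))  # lower is better
```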
Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal
decision tree.
A tree that is too large increases the risk of overfitting, and a tree that is too small may not
capture all the important features of the dataset. A technique that decreases the size of the
learning tree without reducing accuracy is therefore known as pruning. There are mainly two
types of tree pruning techniques used:
o Cost Complexity Pruning
o Reduced Error Pruning
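For instance, scikit-learn's decision tree implementation exposes cost-complexity pruning through the ccp_alpha parameter. The snippet below is a small sketch of training an unpruned versus a pruned tree on the library's built-in Iris dataset; the specific alpha value is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree (ccp_alpha=0 by default) versus a cost-complexity-pruned tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print("unpruned depth:", full_tree.get_depth(), "accuracy:", full_tree.score(X_test, y_test))
print("pruned depth:  ", pruned_tree.get_depth(), "accuracy:", pruned_tree.score(X_test, y_test))
```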
Naïve Bayes Classifier Algorithm
The Naïve Bayes algorithm is comprised of the two words Naïve and Bayes, which can be
described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying that it is an apple, without
depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
P(A) is Prior probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: Probability of the evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate a Likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the Player play or not?
Solution: Consider the below training dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of the weather conditions:
Weather No Yes
Overcast 0 5 5/14=0.36
Rainy 2 2 4/14=0.29
Sunny 2 3 5/14=0.35
All 4/14=0.29 10/14=0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), on a Sunny day the Player can play the game.
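The same calculation can be reproduced with a few lines of Python. This is only a sketch of the hand computation above (counting frequencies and applying Bayes' theorem), not a general Naïve Bayes implementation.

```python
# Training data from the weather example: (Outlook, Play).
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

def posterior(outlook, play):
    """P(play | outlook) via Bayes' theorem, using frequency counts."""
    n = len(data)
    n_play = sum(1 for _, p in data if p == play)                   # e.g. count of "Yes"
    n_outlook = sum(1 for o, _ in data if o == outlook)             # e.g. count of "Sunny"
    n_both = sum(1 for o, p in data if o == outlook and p == play)  # "Sunny" and "Yes"
    likelihood = n_both / n_play        # P(outlook | play)
    prior = n_play / n                  # P(play)
    evidence = n_outlook / n            # P(outlook)
    return likelihood * prior / evidence

print("P(Yes | Sunny) =", round(posterior("Sunny", "Yes"), 2))  # 0.6
print("P(No  | Sunny) =", round(posterior("Sunny", "No"), 2))   # 0.4
```

The exact values are 0.60 and 0.40; the 0.41 in the hand calculation above comes from rounding the intermediate probabilities.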
Advantages of the Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms for predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions as compared to other algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of the Naïve Bayes Classifier:
o Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
Linear Regression
Linear regression is one of the most popular and simple machine learning algorithms used for
predictive analysis. Here, predictive analysis means predicting something, and linear regression
makes predictions for continuous numbers such as salary, age, etc.
It shows the linear relationship between the dependent and independent variables, and shows
how the dependent variable (y) changes according to the independent variable (x).
It tries to fit a line between the dependent and independent variables, and this best-fit line
is known as the regression line. The equation of the regression line is:
y = a0 + a1*x + ε
Where,
y = dependent (target) variable
x = independent (predictor) variable
a0 = intercept of the line
a1 = linear regression coefficient
ε = random error
The below diagram shows linear regression used to predict weight according to height:
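As a quick illustration, the line's coefficients can be estimated by ordinary least squares. The snippet below is a minimal sketch using NumPy on made-up height/weight values; real data and a library such as scikit-learn would normally be used.

```python
import numpy as np

# Hypothetical training data: heights in cm (x) and weights in kg (y).
heights = np.array([150, 155, 160, 165, 170, 175, 180])
weights = np.array([50, 53, 57, 61, 66, 70, 75])

# Ordinary least squares fit of y = a0 + a1 * x (polyfit returns slope first).
a1, a0 = np.polyfit(heights, weights, deg=1)
print(f"weight ≈ {a0:.1f} + {a1:.2f} * height")

# Predict the weight for a new height.
print("predicted weight at 172 cm:", a0 + a1 * 172)
```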
Logistic Regression in Machine Learning
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression, except in how they are used:
Linear Regression is used for solving regression problems, whereas logistic regression is
used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The below image shows the logistic function:
Note: Logistic regression uses the concept of predictive modeling as regression; therefore, it
is called logistic regression, but it is used to classify samples; therefore, it falls under the
classification algorithms.
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above
equation by (1 − y):
y / (1 − y); 0 for y = 0, and infinity for y = 1
o But we need a range between -[infinity] and +[infinity]; taking the logarithm of the
equation, it becomes:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression.
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".
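A short sketch of fitting a binomial logistic regression with scikit-learn is shown below. The dataset is the library's built-in breast cancer data, chosen only for illustration; predict_proba returns the class probabilities between 0 and 1 described above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary classification data: malignant vs. benign tumours.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)   # larger max_iter so the solver converges
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
print("probabilities for first test sample:", model.predict_proba(X_test[:1]))
```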
Support Vector Machine (SVM) Algorithm
Support Vector Machine, or SVM, is one of the most popular Supervised Learning algorithms,
used for Classification as well as Regression problems. However, it is primarily used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it
can learn the different features of cats and dogs, and then we test it with this strange creature.
Since the support vectors create a decision boundary between these two classes (cat and dog)
using the extreme cases (support vectors), the model will look at the extreme cases of cat and
dog. On the basis of the support vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data. If a dataset
cannot be classified by using a straight line, then such data is termed non-linear data,
and the classifier used is called the Non-linear SVM classifier.
Hyperplane:
The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if
there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum
distance between the data points of the two classes.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position
of the hyperplane are termed support vectors. Since these vectors support the hyperplane,
they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We
want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:
Since it is a 2-D space, we can easily separate these two classes just by using a straight line.
But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from
both classes. These points are called support vectors. The distance between the vectors and
the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If
we convert it back to 2-D space with z = 1, it becomes:
Hence we get a circle of radius 1 in the case of non-linear data.
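The idea of mapping the points into a higher dimension can be sketched in a few lines. The example below generates two rings of points with scikit-learn's make_circles helper, adds the extra feature z = x² + y², and fits a linear SVM in the new space; the data itself is synthetic and used only for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged as concentric circles: not separable by a straight line in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add the third dimension z = x^2 + y^2, as described above.
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3 = np.hstack([X, z])

# In the 3-D space the classes become separable by a plane, so a linear SVM works.
clf = SVC(kernel="linear").fit(X3, y)
print("training accuracy with the extra z feature:", clf.score(X3, y))

# In practice the same effect is achieved implicitly with a non-linear kernel, e.g.:
rbf = SVC(kernel="rbf").fit(X, y)
print("training accuracy with an RBF kernel:", rbf.score(X, y))
```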
MNIST Dataset
The MNIST (Modified National Institute of Standards and Technology) database is a large
database of handwritten digits that is used for training various image processing systems. The
dataset is also widely used for training and testing in the field of machine learning. The set of
images in the MNIST database is a combination of two of NIST's databases: Special
Database 1 and Special Database 3.
The MNIST dataset has 60,000 training images and 10,000 testing images.
The MNIST dataset is available online, and it is essentially a database of various handwritten
digits. It contains a large amount of data and is commonly used to demonstrate the real power
of deep neural networks. Our brain and eyes work together to recognize any numbered image.
Our mind is a potent tool, capable of categorizing any image quickly. There are many shapes
of a number, and our mind can easily recognize these shapes and determine what number it is,
but the same task is not simple for a computer to complete. One effective way to do this is to
use a deep neural network, which allows us to train a computer to classify handwritten digits.
So far, we have only dealt with data containing simple data points on a Cartesian coordinate
system, and we have worked with binary-class datasets. The sigmoid activation function is
quite useful for classifying binary datasets, as it squashes values between 0 and 1, but it is not
effective for multiclass datasets; for that purpose we use the softmax activation function,
which can handle multiple classes.
The MNIST dataset is a multiclass dataset consisting of 10 classes in which we can classify
the digits 0 to 9. The major difference between the datasets we have used before and
the MNIST dataset is the way MNIST data is fed into a neural network.
In the perceptron and linear regression models, each data point was defined by a simple x and
y coordinate, so the input layer needed only two nodes per data point.
In the MNIST dataset, a single data point comes in the form of an image. The images in the
MNIST dataset are 28x28 pixels, i.e., 28 pixels along the horizontal axis and 28 pixels along
the vertical axis. This means that a single image from the MNIST database has a total of 784
pixels that must be analyzed, so the input layer of our neural network has 784 nodes, one per
pixel.
Here, we will see how to create a model that recognizes handwritten digits by looking at each
pixel in the image. We will use TensorFlow to train the model by showing it thousands of
examples that are already labeled, and then check the model's accuracy with a test dataset.
Before we start, it is important to note that every data point has two parts: an image (x) and a
corresponding label (y) describing the actual image. Each image is a 28x28 array, i.e., 784
numbers, and the label of the image is a number between 0 and 9 corresponding to the digit
shown in the image. To download and use the MNIST dataset, use the following commands:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
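Note that the tensorflow.examples.tutorials module is not available in TensorFlow 2.x. A sketch of the equivalent loading step, plus a minimal softmax classifier with 784 input values and 10 output classes as discussed above, might look like this using the current tf.keras API:

```python
import tensorflow as tf

# Load MNIST: 60,000 training and 10,000 test images of 28x28 grayscale pixels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixel values to [0, 1]

# A minimal model: flatten 28x28 -> 784 inputs, softmax over the 10 digit classes.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=3)
print("test accuracy:", model.evaluate(x_test, y_test)[1])
```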
Ranking
About Ranking
Ranking is useful for many applications in information retrieval such as e-commerce, social
networks, recommendation systems, and so on. For example, a user searches for an article or
an item to buy online. To build a recommendation system, it becomes important that similar
articles or items of relevance appear to the user such that the user clicks or purchases the
item. A simple regression model can predict the probability that a user clicks an article or buys
an item. However, it is more practical to use a ranking technique and be able to order or rank
the articles or items to maximize the chances of getting a click or purchase. The prioritization
of the articles or the items influences the decision of the users.
The ranking technique directly ranks items by training a model to predict the ranking of one
item over another. In the training model, items can be ranked one over the other by assigning
a "score" to each item. Higher-ranked items have higher scores and lower-ranked items have
lower scores. Using these scores, a model is built to predict which item ranks higher than
another.
Ranking Methods
Oracle Machine Learning supports pairwise and listwise ranking methods through XGBoost.
The training data consists of a number of sets, where each set consists of objects and labels
representing their ranking. A ranking function is constructed by minimizing a certain loss
function on the training data. The ranking function is then applied to test data to get a ranked
list of objects.
Ranking is enabled for XGBoost using the regression function. OML4SQL supports pairwise
and listwise ranking methods through XGBoost.
Pairwise ranking: This approach regards a pair of objects as the learning instance. The pairs
and lists are defined by supplying the same case_id value. Given a pair of objects, this
approach gives an optimal ordering for that pair. Pairwise losses are defined by the order of
the two objects. In OML4SQL, the algorithm uses LambdaMART to perform pairwise
ranking with the goal of minimizing the average number of inversions in ranking.
Listwise ranking: This approach takes multiple lists of ranked objects as learning instance.
The items in a list must have the same case_id. The algorithm uses LambdaMART to perform
list-wise ranking.
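Outside of OML4SQL, the same pairwise objective is exposed by the open-source XGBoost Python package. The snippet below is a rough sketch with randomly generated features and relevance labels; the group sizes tell the ranker which rows belong to the same query.

```python
import numpy as np
from xgboost import XGBRanker

rng = np.random.default_rng(0)

# Hypothetical data: 3 queries with 4 candidate items each (12 rows of 5 features).
X = rng.normal(size=(12, 5))
y = rng.integers(0, 3, size=12)       # relevance label per item (0 = worst, 2 = best)
group = [4, 4, 4]                     # number of items belonging to each query

ranker = XGBRanker(objective="rank:pairwise", n_estimators=50)
ranker.fit(X, y, group=group)

# Higher predicted scores mean the item should be ranked higher within its query.
print(ranker.predict(X[:4]))
```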
Structured outputs
Structured prediction or structured (output) learning is an umbrella
term for supervised machine learning techniques that involve predicting structured objects,
rather than scalar discrete or real values.[1]
Similar to commonly used supervised learning techniques, structured prediction models are
typically trained by means of observed data in which the true prediction value is used to
adjust model parameters. Due to the complexity of the model and the interrelations of the
predicted variables, the processes of prediction using a trained model and of training itself are
often computationally infeasible, so approximate inference and learning methods are used.
Applications
For example, the problem of translating a natural language sentence into a syntactic
representation such as a parse tree can be seen as a structured prediction problem[2] in which
the structured output domain is the set of all possible parse trees. Structured prediction is also
used in a wide variety of application domains including bioinformatics, natural language
processing, speech recognition, and computer vision.
Example: sequence tagging
Sequence tagging is a class of problems prevalent in natural language processing, where input
data are often sequences (e.g. sentences of text). The sequence tagging problem appears in
several guises, e.g. part-of-speech tagging and named entity recognition. In POS tagging, for
example, each word in a sequence must receive a "tag" (class label) that expresses its "type"
of word:
This DT
is VBZ
a DT
tagged JJ
sentence NN
The main challenge of this problem is to resolve ambiguity: the word "sentence" can also
be a verb in English, and so can "tagged".
While this problem can be solved by simply performing classification of individual
tokens, that approach does not take into account the empirical fact that tags do not occur
independently; instead, each tag displays a strong conditional dependence on the tag of
the previous word. This fact can be exploited in a sequence model such as a hidden
Markov model or conditional random field[2] that predicts the entire tag sequence for a
sentence, rather than just individual tags, by means of the Viterbi algorithm.
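As a small illustration of predicting a whole tag sequence rather than individual tags, the snippet below runs the Viterbi algorithm on a tiny hidden Markov model. All probabilities are made-up toy values for two tags (DT and NN), purely to show the dynamic-programming structure.

```python
# Toy HMM for two tags; all probabilities below are invented for illustration.
tags = ["DT", "NN"]
start_p = {"DT": 0.7, "NN": 0.3}                       # P(first tag)
trans_p = {"DT": {"DT": 0.4, "NN": 0.6},               # P(next tag | current tag)
           "NN": {"DT": 0.4, "NN": 0.6}}
emit_p = {"DT": {"this": 0.6, "a": 0.4, "sentence": 0.0},
          "NN": {"this": 0.1, "a": 0.1, "sentence": 0.8}}

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for w in words[1:]:
        best.append({
            t: max(
                (best[-1][prev][0] * trans_p[prev][t] * emit_p[t].get(w, 0.0), prev)
                for prev in tags
            )
            for t in tags
        })
    # Backtrack from the highest-probability final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for step in reversed(best[1:]):
        tag = step[tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["this", "a", "sentence"]))  # -> ['DT', 'DT', 'NN'] under these toy numbers
```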
Techniques
Probabilistic graphical models form a large class of structured prediction models. In
particular, Bayesian networks and random fields are popular. Other algorithms and
models for structured prediction include inductive logic programming, case-based
reasoning, structured SVMs, Markov logic networks, Probabilistic Soft Logic,
and constrained conditional models.