ML
Chapter-1
Spam e-mail recognition was described in the Prologue. It constitutes a binary classification
task, which is easily the most common task in machine learning and one that figures heavily
throughout the book. One obvious variation is to consider classification problems with more
than two classes. For instance, we may want to distinguish different
than two classes. For instance, we may want to distinguish different
kinds of ham e-mails, e.g., work-related e-mails and private messages. We could approach
this as a combination of two binary classification tasks: the first task is to distinguish between
spam and ham, and the second task is, among ham e-mails, to distinguish between work-
related and private ones.
While some classification algorithms naturally permit the use of more than two classes,
others are by nature binary algorithms; these can, however, be turned into multinomial
classifiers by a variety of strategies. Two other common tasks, regression analysis and
cluster analysis, are described next.
Regression analysis
Regression analysis is a set of statistical processes for estimating the relationships between a
dependent variable (often called the 'outcome variable') and one or more independent
variables (often called 'predictors', 'covariates', or 'features'). The most common form of
regression analysis is linear regression, in which a researcher finds the line (or a more
complex linear function) that most closely fits the data according to a specific mathematical
criterion.
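As a minimal sketch of this idea (not from the text; the data are synthetic and NumPy is assumed), fitting a straight line by least squares:

import numpy as np

# Synthetic data: y depends roughly linearly on x
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=x.shape)

# np.polyfit finds the slope and intercept minimising the squared error
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")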
Cluster analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense) to each other
than to those in other groups (clusters). It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in many fields, including machine
learning, pattern recognition, image analysis, information retrieval, bioinformatics, data
compression, and computer graphics.
Models form the central concept in machine learning as they are what is being learned from
the data, in order to solve a given task. There is a considerable – not to say bewildering –
range of machine learning models to choose from.
1. Geometric models
2. Probabilistic models
3. Logical models
4. Grouping and grading
Geometric models
A geometric model is constructed directly in instance space, using geometric concepts such
as lines, planes and distances. One main advantage of geometric classifiers is that they are
easy to visualise, as long as we keep to two or three dimensions.
Logical models
Logic models are hypothesized descriptions of the chain of causes and effects leading to an
outcome of interest (e.g. prevalence of cardiovascular diseases, annual traffic collisions, etc.).
While they can be in a narrative form, logic models usually take the form of a graphical
depiction of the "if-then" (causal) relationships between the various elements leading to the
outcome. However, the logic model is more than the graphical depiction: it is also the theories,
scientific evidence, assumptions and beliefs that support it and the various processes behind
it.
Grouping and grading
Grouping models break up the instance space into groups or segments, the number of which
is determined at training time. One could say that grouping models have a fixed and finite
‘resolution’ and cannot distinguish between individual instances beyond this resolution.
Features: the workhorses of machine learning
Univariate model
Binary splits
Chapter-2
Coverage plot: data is displayed graphically in a coverage plot. The more sequence reads
you have in a region, the higher the plot is. More RNA sequence reads means more gene
expression.
Degrees of freedom: each of a number of independently variable factors affecting the range
of states in which a system may exist, in particular any of the directions in which independent
motion can occur.
Scoring and ranking
Variable Ranking is the process of ordering the features by the value of some scoring
function, which usually measures feature relevance. The score S(fi) is computed from the
training data, measuring some criterion of feature fi. By convention, a high score is indicative
of a valuable (relevant) feature.
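A minimal sketch of variable ranking (not from the text), using the absolute Pearson correlation with the target as one possible choice of scoring function S(fi); the data and the helper name rank_features are hypothetical:

import numpy as np

def rank_features(X, y):
    # Score each feature by |correlation| with the target, then rank (highest first)
    scores = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])])
    order = np.argsort(scores)[::-1]
    return order, scores

# Hypothetical data: 100 samples, 4 features; feature 2 is the relevant one
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 2] + rng.normal(size=100)
order, scores = rank_features(X, y)
print("features ranked by score:", order)   # feature 2 should come first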
Machine Learning Studio (classic) provides many different scoring modules. You select one
depending on the type of model you are using, or the type of scoring task you are performing:
Use this module if you want to cluster new data based on an existing K-Means
clustering model.
This module replaces the Assign to Clusters (deprecated) module, which has been
deprecated but is still available for use in existing experiments.
Use this module if you want to generate recommendations, find related items or users,
or predict ratings.
Use this module for all other regression and classification models, as well as some
anomaly detection models.
Assessing and visualising ranking performance
A probabilistic classifier assigns a probability to each class, where the probability of a
particular class corresponds to the probability of the instance belonging to that class. This is
called probability estimation.
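As a brief illustrative sketch (assuming scikit-learn and its bundled iris data, neither of which is part of the text), a probabilistic classifier reporting one probability per class:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one probability per class for each instance
print(clf.predict_proba(X[:1]))   # e.g. [[0.98, 0.02, 0.00]] (values illustrative)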
-----XXX-----
UNIT-II
Chapter-3
How to evaluate multi-class performance and how to build multi-class models out of binary
models.
Multi-class classification
Multi-class scores and probabilities
Multi-class classification: multiclass or multinomial classification is the problem of
classifying instances into one of three or more classes. (Classifying instances into one of two
classes is called binary classification.)
The existing multi-class classification techniques can be categorized into
(i) Transformation to binary
(ii) Extension from binary (Multi-class scores and probabilities)
(iii) Hierarchical classification.
1. Transformation to binary
This section discusses strategies for reducing the problem of multiclass classification to
multiple binary classification problems. It can be categorized into One vs Rest and One vs
One. The techniques developed based on reducing the multi-class problem into multiple
binary problems can also be called problem transformation techniques.
One-vs.-rest
One-vs.-rest (or one-vs.-all, OvA or OvR, one-against-all, OAA) strategy involves training a
single classifier per class, with the samples of that class as positive samples and all other
samples as negatives. This strategy requires the base classifiers to produce a real-valued
confidence score for its decision, rather than just a class label; discrete class labels alone can
lead to ambiguities, where multiple classes are predicted for a single sample.
In pseudocode, the training algorithm for an OvA learner constructed from a binary
classification learner L is as follows:
Inputs:
o A learner L (training algorithm for binary classifiers)
o Samples X
o Labels y, where yi ∈ {1, …, K} is the label for the sample Xi
Output:
o A list of classifiers fk for k ∈ {1, …, K}
Procedure:
o For each k in {1, …, K}:
o Construct a new label vector z where zi = 1 if yi = k and zi = 0 otherwise
o Apply L to X, z to obtain fk
Making decisions means applying all classifiers to an unseen sample x and predicting the
label k for which the corresponding classifier reports the highest confidence score:
ŷ = argmax_{k ∈ {1, …, K}} fk(x)
Although this strategy is popular, it is a heuristic that suffers from several problems.
First, the scale of the confidence values may differ between the binary classifiers.
Second, even if the class distribution is balanced in the training set, the binary
classification learners see unbalanced distributions because typically the set of
negatives they see is much larger than the set of positives.
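A minimal Python sketch of the training and decision rules above (not from the text), using scikit-learn's LogisticRegression as the base learner L and the bundled iris data as a stand-in dataset:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Training: one binary classifier f_k per class k (one-vs.-rest relabelling)
classifiers = {}
for k in classes:
    z = (y == k).astype(int)                  # class k positive, everything else negative
    classifiers[k] = LogisticRegression(max_iter=1000).fit(X, z)

# Decision: predict the class whose classifier reports the highest confidence score
def predict_ovr(x):
    scores = {k: clf.decision_function(x.reshape(1, -1))[0] for k, clf in classifiers.items()}
    return max(scores, key=scores.get)

print(predict_ovr(X[0]))   # should recover the label of the first training sample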
One-vs.-one
In the one-vs.-one (OvO) reduction, one trains K (K − 1) / 2 binary classifiers for a K-way
multiclass problem; each receives the samples of a pair of classes from the original training
set, and must learn to distinguish these two classes. At prediction time, a voting scheme is
applied: all K (K − 1) / 2 classifiers are applied to an unseen sample and the class that got the
highest number of "+1" predictions gets predicted by the combined classifier.
Like OvR, OvO suffers from ambiguities in that some regions of its input space may receive
the same number of votes.
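Both reductions are also available as ready-made wrappers in scikit-learn; a brief sketch comparing them on the same stand-in data (the choice of LinearSVC as base learner is an assumption for illustration):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# OvR trains one classifier per class; OvO trains K(K-1)/2 classifiers and votes
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(ovr.predict(X[:3]), ovo.predict(X[:3]))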
2. Extension from binary
This section discusses strategies for extending existing binary classifiers to solve multi-class
classification problems. Several algorithms have been
developed based on neural networks, decision trees, k-nearest neighbors, naive Bayes,
support vector machines and Extreme Learning Machines to address multi-class classification
problems. These types of techniques can also be called algorithm adaptation techniques.
Neural networks
Multiclass perceptrons provide a natural extension to the multi-class problem. Instead of just
having one neuron in the output layer, with binary output, one could have N binary neurons
leading to multi-class classification. In practice, the last layer of a neural network is usually a
softmax function layer, which is the algebraic simplification of N logistic classifiers,
normalized per class by the sum of the N-1 other logistic classifiers.
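A small sketch of the softmax output layer mentioned above (not from the text), converting N real-valued scores into class probabilities using only NumPy:

import numpy as np

def softmax(scores):
    # Turn a vector of N real-valued scores into probabilities that sum to 1
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # approximately [0.66, 0.24, 0.10]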
Extreme Learning Machines (ELM) are a special case of single hidden layer feed-forward
neural networks (SLFNs) wherein the input weights and the hidden node biases can be
chosen at random. Many variants and developments have been made to the ELM for multiclass
classification.
k-nearest neighbours
To classify an unknown sample, the distance from that sample to every training example is
measured. The k smallest distances are identified, and the class most represented among
these k nearest neighbours is taken as the output class label.
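A minimal NumPy sketch of the procedure just described (the tiny training set and the helper name knn_predict are hypothetical):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Classify x by majority vote among its k nearest training examples
    distances = np.linalg.norm(X_train - x, axis=1)   # distance to every training example
    nearest = np.argsort(distances)[:k]               # indices of the k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [1, 1], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))   # expected: 0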
Naive Bayes
Naive Bayes is a successful classifier based upon the principle of maximum a posteriori
(MAP). This approach is naturally extensible to the case of having more than two classes, and
was shown to perform well in spite of the underlying simplifying assumption of conditional
independence.
Decision trees
Decision tree learning is a powerful classification technique. The tree tries to infer a split of
the training data based on the values of the available features to produce a good
generalization. The algorithm can naturally handle binary or multiclass classification
problems. The leaf nodes can refer to either of the K classes concerned.
Support vector machines
Support vector machines are based upon the idea of maximizing the margin, i.e. maximizing
the minimum distance from the separating hyperplane to the nearest example. The basic SVM
supports only binary classification, but extensions have been proposed to handle the
multiclass classification case as well. In these extensions, additional parameters and
constraints are added to the optimization problem to handle the separation of the different
classes.
3. Hierarchical classification
Regression
A function estimator, also called a regressor, is a mapping f̂ : X → R. The regression learning
problem is to learn a function estimator from examples (xi, f(xi)).
Regression models are used to predict a continuous value. Predicting the price of a house
given features of the house, such as its size, location, etc., is one of the common examples of
regression. It is a supervised technique.
Types of Regression
1. Simple Linear Regression
This is one of the most common and interesting types of regression technique. Here we
predict a target variable Y based on the input variable X. A linear relationship should exist
between the target variable and the predictor, hence the name linear regression.
Consider predicting the salary of an employee based on his/her age. We can easily identify
that there seems to be a correlation between an employee's age and salary (the greater the
age, the higher the salary). The hypothesis of linear regression is
Y = a + bX
where Y represents salary, X is the employee's age, and a and b are the coefficients of the
equation. So in order to predict Y (salary) given X (age), we need to know the values of a and
b (the model's coefficients).
While training and building a regression model, it is these coefficients which are learned and
fitted to the training data. The aim of training is to find a best-fit line such that the cost
function is minimized. The cost function helps in measuring the error. During the training
process we try to minimize the error between the actual and predicted values, thus
minimizing the cost function.
In the figure, the red points are the data points and the blue line is the predicted line for the
training data. To get the predicted value, these data points are projected on to the line.
To summarize, our aim is to find such values of coefficients which will minimize the cost
function. The most common cost function is Mean Squared Error (MSE) which is equal to
average squared difference between an observation’s actual and predicted values. The
coefficient values can be calculated using the gradient descent approach, which is discussed
in more detail later. To give a brief understanding, in gradient descent we start
with some random values of coefficients, compute gradient of cost function on these values,
update the coefficients and calculate the cost function again. This process is repeated until we
find a minimum value of cost function.
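A minimal sketch of that training loop for the hypothesis Y = a + bX, using the MSE cost and gradient descent; the data are synthetic stand-ins (not the salary/age figures from the text) and NumPy is assumed:

import numpy as np

# Synthetic data: Y depends linearly on X
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
Y = 3.0 * X + 2.0 + rng.normal(0, 0.1, size=100)

a, b = 0.0, 0.0      # arbitrary starting values for the coefficients
lr = 0.5             # learning rate
for _ in range(2000):
    error = (a + b * X) - Y             # predicted minus actual values
    a -= lr * 2 * error.mean()          # gradient of the MSE cost w.r.t. a
    b -= lr * 2 * (error * X).mean()    # gradient of the MSE cost w.r.t. b

print(f"learned hypothesis: Y = {a:.2f} + {b:.2f} * X")   # should approach Y = 2 + 3X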
2. Polynomial Regression
Polynomial regression is still a linear model in the coefficients, but the fitted curve is now
quadratic (or higher order) rather than a straight line. Scikit-Learn provides the
PolynomialFeatures class to transform the features.
If we increase the degree to a very high value, the curve becomes overfitted as it learns the
noise in the data as well.
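A brief sketch of the transform-then-fit approach using the PolynomialFeatures class mentioned above; the quadratic data are synthetic and the degree choice is only illustrative:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 0.3, 100)

# Degree-2 features turn the linear model into a quadratic curve;
# a very high degree would start fitting the noise (overfitting).
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[0.0]]))   # should be close to 2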
3. Support Vector Regression (SVR)
In SVR, we identify a hyperplane with maximum margin such that the maximum number of
data points lie within that margin. SVR is very similar to the SVM classification algorithm,
which is discussed in detail later.
Instead of minimizing the error as in simple linear regression, we try to fit the error within a
certain threshold. Our objective in SVR is to consider only the points that are within the
margin; our best-fit line is the hyperplane that contains the maximum number of points.
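A small scikit-learn sketch of SVR, where the epsilon parameter plays the role of the error threshold (tube width) described above; the data and parameter values are illustrative assumptions:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

# epsilon is the width of the tube: errors inside it are tolerated
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(model.predict([[1.5]]))   # roughly sin(1.5) ≈ 1.0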
4. Decision Tree Regression
Decision trees can be used for classification as well as regression. In decision trees, at each
level we need to identify the splitting attribute. In the case of regression, the ID3 algorithm
can be used to identify the splitting node by maximising the reduction in standard deviation
(in classification, information gain is used).
A decision tree is built by partitioning the data into subsets containing instances with similar
values (homogenous). Standard deviation is used to calculate the homogeneity of a numerical
sample. If the numerical sample is completely homogeneous, its standard deviation is zero.
1. The standard deviation of the target is calculated for the whole dataset (the standard
deviation before the split).
2. The dataset is split on the different attributes and the standard deviation is calculated for
each branch. This (weighted) value is subtracted from the standard deviation before the
split. The result is the standard deviation reduction.
3. The attribute with the largest standard deviation reduction is chosen as the splitting node.
4. The dataset is divided based on the values of the selected attribute. This process is run
recursively on the non-leaf branches, until all data is processed.
To avoid overfitting, the coefficient of variation (CV) is used to decide when to stop
branching. Finally, the average of each branch is assigned to the related leaf node (in
regression the mean is taken, whereas in classification the mode of the leaf node is taken).
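A minimal sketch of the standard deviation reduction computed in steps 1-2 for one candidate binary split; the target values, the binary attribute, and the helper name std_reduction are hypothetical:

import numpy as np

def std_reduction(y, mask):
    # Standard deviation before the split minus the weighted std after the split
    n = len(y)
    left, right = y[mask], y[~mask]
    after = (len(left) / n) * left.std() + (len(right) / n) * right.std()
    return y.std() - after

y = np.array([25.0, 30.0, 46.0, 45.0, 52.0, 23.0, 43.0, 35.0, 38.0, 46.0])
feature = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 1])     # a binary attribute
print(std_reduction(y, feature == 1))   # a larger reduction means a better split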
5. Random Forest Regression
Random forest is an ensemble approach where we take into account the predictions of
several decision regression trees.
Random Forest prevents overfitting (which is common in decision trees) by creating random
subsets of the features and building smaller trees using these subsets.
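As a short sketch of the same idea with scikit-learn's RandomForestRegressor, which averages many trees built on random subsets of samples and features (the dataset and parameter values are illustrative assumptions):

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# max_features limits how many features each split may consider
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:2]))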
The above explanation is a brief overview of each regression type.
Unsupervised learning problems are further grouped into clustering and association problems.
Clustering
Exclusive (partitioning)
In this clustering method, data are grouped in such a way that each data point can belong to
one cluster only.
Example: K-means
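A short sketch of the K-means example just mentioned, where every point is assigned to exactly one cluster; the two synthetic blobs of points are an assumption for illustration:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic blobs of points
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])   # each point belongs to one cluster only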
Agglomerative
In this clustering technique, every data point starts as its own cluster. Iterative unions
between the two nearest clusters reduce the number of clusters.
Overlapping
In this technique, fuzzy sets are used to cluster data. Each point may belong to two or more
clusters with separate degrees of membership.
Descriptive Learning: Using descriptive analysis you came up with the idea that two
products, A (burger) and B (french fries), are bought together with very high frequency.
Now you want the system to automatically suggest B whenever a user buys A. Looking at
past data and deducing the possible factors influencing this situation is something that can be
achieved using ML.
Predictive Learning: We want to increase our sales. Using descriptive learning, we came to
know the possible factors influencing sales. By tuning these parameters so that sales are
maximized in the next quarter, we can predict what sales we could generate and hence make
investments accordingly. This task can also be handled using ML.
Chapter-4
Concept learning
Concept learning, also known as category learning, is "the search for and listing of attributes
that can be used to distinguish exemplars from non-exemplars of various categories". It is
the acquisition of the definition of a general category from given positive and negative
training examples of the category.
Much of human learning involves acquiring general concepts from past experiences. For
example, humans identify different vehicles among all the vehicles based on specific sets of
features defined over a large set of features. This special set of features differentiates the
subset of cars in a set of vehicles. This set of features that differentiate cars can be called a
concept.
Similarly, machines can learn from concepts to identify whether an object belongs to a
specific category by processing past/training data to find a hypothesis that best fits the
training examples.
Target concept:
The set of items/objects over which the concept is defined is called the set of instances and
denoted by X. The concept or function to be learned is called the target concept and denoted
by c. It can be seen as a boolean valued function defined over X and can be represented as c:
X -> {0, 1}.
If we have a set of training examples with specific features of the target concept c, the
problem faced by the learner is to estimate c from the training data.
H is used to denote the set of all possible hypotheses that the learner may consider regarding
the identity of the target concept. The goal of the learner is to find a hypothesis h in H such
that h(x) = c(x) for all x in X.
An algorithm that supports concept learning requires:
1. Training data (past experiences to train our models)
2. Target concept (hypothesis to identify data objects)
3. Actual data objects (for testing the models)
The hypothesis space
Each of the data objects represents a concept, and hypotheses are descriptions defined over
the same features. A hypothesis such as <true, true, false, false> is very specific because it
covers only one kind of sample. We use the following notations:
A hypothesis containing ∅ will reject all the data samples. The hypothesis <?, ?, ?, ?> will
accept all the data samples. The ? notation indicates that the value of this specific feature
does not affect the result.
General to Specific
Any instance classified as positive by h1 will also be classified as positive by h2, so we say
that h2 is more general than h1. Using this concept, we can find a general hypothesis that can
be defined over the entire dataset X.
To find a single hypothesis defined on X, we can use the more-general-than partial ordering.
One way to do this is to start with the most specific hypothesis in H and generalize it each
time it fails to classify an observed positive training example as positive.
1. The first step in the Find-S algorithm is to start with the most specific hypothesis,
which can be denoted by h <- <∅, ∅, ∅, ∅>.
2. This step involves picking up the next training sample and applying Step 3 to the sample.
3. The next step involves observing the data sample. If the sample is negative, the
hypothesis remains unchanged and we pick the next training sample by processing
Step 2 again. Otherwise, we process Step 4.
4. If the sample is positive and we find that our initial hypothesis is too specific because
it does not cover the current training sample, then we need to update our current
hypothesis. This can be done by the pairwise conjunction (logical and operation) of
the current hypothesis and training sample.
If the next training sample is <true, true, false, false> and the current hypothesis is
<∅, ∅, ∅, ∅>, then we can directly replace our existing hypothesis with the new one.
If the next positive training sample is <true, true, false, true> and the current hypothesis
is <true, true, false, false>, then we can perform a pairwise conjunction. With the
current hypothesis and the next training sample, we can find a new hypothesis by putting
? in the place where the result of the conjunction is false:
<true, true, false, true> ∧ <true, true, false, false> = <true, true, false, ?>
Now, we can replace our existing hypothesis with the new one: h <- <true, true, false,
?>
5. This step involves repeating Step 2 as long as there are more training samples.
6. Once there are no more training samples, the current hypothesis is the one we wanted to
find. We can use the final hypothesis to classify real objects.
Input: conjunctions x, y.
Output: conjunction z.
z ← true;
for each feature f do
    if f = vx is a conjunct in x and f = vy is a conjunct in y then
        add f = Combine-ID(vx, vy) to z; // Combine-ID: see text
    end
end
return z
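A small Python sketch (not from the text) of the Find-S loop described above, built on the pairwise conjunction from the pseudocode; the feature values and examples are hypothetical, and the value-combination step is simplified to generalise differing values straight to '?':

def combine(vx, vy):
    # Keep a value if both hypotheses agree on it, otherwise generalise to '?'
    return vx if vx == vy else "?"

def conjoin(x, y):
    # Pairwise conjunction of two hypotheses (lists of feature values)
    return [combine(vx, vy) for vx, vy in zip(x, y)]

def find_s(examples):
    # examples: list of (feature-vector, label) pairs with boolean labels
    h = None                              # most specific hypothesis (rejects everything)
    for x, positive in examples:
        if not positive:
            continue                      # negative samples leave the hypothesis unchanged
        h = list(x) if h is None else conjoin(h, x)
    return h

examples = [(["true", "true", "false", "false"], True),
            (["true", "true", "false", "true"],  True),
            (["false", "true", "true", "false"], False)]
print(find_s(examples))    # -> ['true', 'true', 'false', '?']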
-----XXX-----