
UNIT-1

Chapter-1

The ingredients of machine learning

1. Tasks: the problems that can be solved with machine learning


2. Models: the output of machine learning
3. Features: the workhorses of machine learning

Tasks: The problems that can be solved with machine learning

Spam e-mail recognition was described in the Prologue. It constitutes a binary classification
task, which is easily the most common task in machine learning and figures heavily
throughout the book. One obvious variation is to consider classification problems with more
than two classes. For instance, we may want to distinguish different kinds of ham e-mails,
e.g., work-related e-mails and private messages. We could approach this as a combination of
two binary classification tasks: the first task is to distinguish between spam and ham, and the
second is, among ham e-mails, to distinguish between work-related and private ones.

a) Multiclass or multinomial classification

b) Regression analysis

c) Cluster analysis

d) Association rule learning

In machine learning, multiclass or multinomial classification is the problem of classifying
instances into one of three or more classes. (Classifying instances into one of two classes is
called binary classification.)

While some classification algorithms naturally permit the use of more than two classes,
others are by nature binary algorithms; these can, however, be turned into multinomial
classifiers by a variety of strategies.

Multiclass classification should not be confused with multi-label classification, where
multiple labels are to be predicted for each instance.

Regression analysis

Regression analysis is a set of statistical processes for estimating the relationships between a
dependent variable (often called the 'outcome variable') and one or more independent
variables (often called 'predictors', 'covariates', or 'features'). The most common form of
regression analysis is linear regression, in which a researcher finds the line (or a more
complex linear function) that most closely fits the data according to a specific mathematical
criterion.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense) to each other
than to those in other groups (clusters). It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in many fields, including machine
learning, pattern recognition, image analysis, information retrieval, bioinformatics, data
compression, and computer graphics.

Association rule learning is a rule-based machine learning method for discovering
interesting relations between variables in large databases. It is intended to identify strong
rules discovered in databases using some measures of interestingness.

Models: The output of machine learning

Models form the central concept in machine learning as they are what is being learned from
the data, in order to solve a given task. There is a considerable – not to say bewildering –
range of machine learning models to choose from.

1. Geometric models
2. Probabilistic models
3. Logical models
4. Grouping and grading

A geometric model is constructed directly in instance space, using geometric concepts such
as lines, planes and distances. One main advantage of geometric classifiers is that they are
easy to visualise, as long as we keep to two or three dimensions.

A probabilistic classifier is a classifier that is able to predict, given an observation of an
input, a probability distribution over a set of classes, rather than only outputting the most
likely class that the observation should belong to. Probabilistic classifiers provide
classification that can be useful in its own right or when combining classifiers into
ensembles.

Logical models
Logic models are hypothesised descriptions of the chain of causes and effects leading to an
outcome of interest (e.g. prevalence of cardiovascular disease, annual traffic collisions, etc.).
While they can be in narrative form, logic models usually take the form of a graphical
depiction of the "if-then" (causal) relationships between the various elements leading to the
outcome. However, the logic model is more than the graphical depiction: it is also the
theories, scientific evidence, assumptions and beliefs that support it and the various processes
behind it.
Grouping and grading
Grouping models break up the instance space into groups or segments, the number of which
is determined at training time. One could say that grouping models have a fixed and finite
‘resolution’ and cannot distinguish between individual instances beyond this resolution.

Features: the workhorses of machine learning

Univariate model

Binary splits

Univariate model: In mathematics, univariate refers to an expression, equation, function
or polynomial of only one variable. Objects of any of these types involving more than one
variable may be called multivariate. In some cases the distinction between the univariate and
multivariate cases is fundamental; for example, the fundamental theorem of algebra and
Euclid's algorithm for polynomials are fundamental properties of univariate polynomials that
cannot be generalized to multivariate polynomials.

Binary splitting is a technique for speeding up the numerical evaluation of many types of
series with rational terms. In particular, it can be used to evaluate hypergeometric series at
rational points.

Chapter-2

Binary classification and related tasks

Classification is the systematic arrangement of objects into groups or categories according to
established criteria, and it is the most common task in machine learning. A classifier is a
mapping ĉ : X → C, where C = {C1, C2, . . . , Ck} is a finite and usually small set of class
labels. We will sometimes also use Ci to indicate the set of examples of that class. We use the
‘hat’ to indicate that ĉ(x) is an estimate of the true but unknown function c(x).
Examples for a classifier take the form (x, c(x)), where x ∈ X is an instance and c(x) is
the true class of the instance. Learning a classifier involves constructing the function ĉ such
that it matches c as closely as possible (and not just on the training set, but ideally on the
entire instance space X).
In the simplest case we have only two classes, which are usually referred to as positive
and negative, ⊕ and ⊖, or +1 and −1. Two-class classification is often called binary
classification (or concept learning, if the positive class can be meaningfully called a
concept). Spam e-mail filtering is a good example of binary classification, in which spam is
conventionally taken as the positive class, and ham as the negative class (clearly, positive
here doesn’t mean ‘good’!). Other examples of binary classification include medical
diagnosis (the positive class here is having a particular disease) and credit card fraud
detection.
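
As a minimal illustration (not taken from the book), a classifier ĉ can be written as an ordinary function from instances to class labels; the keyword list below is purely hypothetical:

SPAM_WORDS = {"lottery", "winner", "free money"}

def c_hat(email_text: str) -> str:
    # Estimate of the true labelling function c: maps an instance to a label in C = {spam, ham}
    return "spam" if any(w in email_text.lower() for w in SPAM_WORDS) else "ham"

print(c_hat("You are a lottery WINNER!"))   # spam
print(c_hat("Meeting moved to 3 pm"))       # ham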

Visualising classification performance

Coverage plot: classification performance can be displayed graphically in a coverage plot,
which plots the number of positive examples a classifier correctly predicts (true positives)
against the number of negative examples it incorrectly predicts as positive (false positives).
The closer a classifier sits to the top left-hand corner of such a plot, the better its performance.

Degrees of freedom: each of a number of independently variable factors affecting the range
of states in which a system may exist, in particular any of the directions in which independent
motion can occur.

Scoring and ranking

Variable ranking is the process of ordering the features by the value of some scoring
function, which usually measures feature relevance. The score S(fi) is computed from the
training data, measuring some criterion of feature fi. By convention, a high score is
indicative of a valuable (relevant) feature.
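
As a rough sketch (an assumption, not a prescribed method), one possible scoring function S(fi) is the absolute Pearson correlation between feature fi and the target; features are then ordered by decreasing score:

import numpy as np

def rank_features(X, y):
    # S(fi): absolute Pearson correlation between feature fi and the target y
    scores = [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])]
    # Order feature indices by decreasing score; a high score indicates a relevant feature
    return sorted(range(X.shape[1]), key=lambda i: scores[i], reverse=True), scores

X = np.random.rand(100, 3)
y = 4 * X[:, 0] + np.random.normal(0, 0.1, 100)   # only feature 0 is truly relevant
order, scores = rank_features(X, y)
print(order)   # feature 0 should come first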

List of scoring modules

Machine Learning Studio (classic) provides many different scoring modules. You select one
depending on the type of model you are using, or the type of scoring task you are performing:

 Apply Transformation: Applies a well-specified data transformation to a dataset.

Use this module to apply a saved process to a set of data.

 Assign Data to Clusters: Assigns data to clusters by using an existing trained clustering model.

Use this module if you want to cluster new data based on an existing K-Means
clustering model.

This module replaces the Assign to Clusters (deprecated) module, which has been
deprecated but is still available for use in existing experiments.

 Score Matchbox Recommender: Scores predictions for a dataset by using the Matchbox recommender.

Use this module if you want to generate recommendations, find related items or users,
or predict ratings.

 Score Model: Scores predictions for a trained classification or regression model.

Use this module for all other regression and classification models, as well as some
anomaly detection models.

Assessing and visualising ranking performance

Class probability estimation

A probabilistic classifier assigns a probability to each class, where the probability of a
particular class corresponds to the probability of the instance belonging to that class. This is
called class probability estimation.

Turning rankers into class probability estimators

 Concavity relates to the rate of change of a function's derivative. A function f is concave
up (or upwards) where the derivative f′ is increasing. This is equivalent to the derivative of
f′, which is f′′, being positive.

-----XXX-----

UNIT- II

Chapter-3

Beyond binary classification

Handling more than two classes

How to evaluate multi-class performance and how to build multi-class models out of binary
models.
Multi-class classification
Multi-class scores and probabilities
Multi-class classification: multiclass or multinomial classification is the problem of
classifying instances into one of three or more classes. (Classifying instances into one of two
classes is called binary classification.)
The existing multi-class classification techniques can be categorized into
(i) Transformation to binary
(ii) Extension from binary (Multi-class scores and probabilities)
(iii) Hierarchical classification.
1. Transformation to binary
This section discusses strategies for reducing the problem of multiclass classification to
multiple binary classification problems. It can be categorized into One vs Rest and One vs
One. The techniques developed based on reducing the multi-class problem into multiple
binary problems can also be called problem transformation techniques.
One-vs.-rest

One-vs.-rest (or one-vs.-all, OvA or OvR, one-against-all, OAA) strategy involves training a
single classifier per class, with the samples of that class as positive samples and all other
samples as negatives. This strategy requires the base classifiers to produce a real-valued
confidence score for its decision, rather than just a class label; discrete class labels alone can
lead to ambiguities, where multiple classes are predicted for a single sample.

In pseudocode, the training algorithm for an OvA learner constructed from a binary
classification learner L is as follows:

Inputs:

 L, a learner (training algorithm for binary classifiers)


 samples X
 labels y where yi ∈ {1, … K} is the label for the sample Xi

Output:

 a list of classifiers fk for k ∈ {1, …, K}

Procedure:

 For each k in {1, …, K}


o Construct a new label vector z where zi = yi if yi = k and zi = 0
otherwise

o Apply L to X, z to obtain fk

Making decisions means applying all classifiers to an unseen sample x and predicting the
label k for which the corresponding classifier reports the highest confidence score:

ŷ = argmax fk(x) over k ∈ {1, …, K}

Although this strategy is popular, it is a heuristic that suffers from several problems.
Firstly, the scale of the confidence values may differ between the binary classifiers.
Second, even if the class distribution is balanced in the training set, the binary
classification learners see unbalanced distributions because typically the set of
negatives they see is much larger than the set of positives.
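
The following is a minimal sketch of the OvA procedure above, assuming scikit-learn's LogisticRegression as the base learner L (any binary learner that produces confidence scores would do; the toy data is an assumption):

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_rest(X, y, classes):
    # One binary classifier per class: that class is positive, all other samples are negative
    return {k: LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int)) for k in classes}

def predict_one_vs_rest(classifiers, X):
    # Predict the class whose classifier reports the highest confidence score
    labels = list(classifiers)
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers.values()])
    return [labels[i] for i in np.argmax(scores, axis=1)]

X = np.random.randn(60, 2)
y = np.repeat([0, 1, 2], 20)
classifiers = train_one_vs_rest(X, y, classes=[0, 1, 2])
print(predict_one_vs_rest(classifiers, X[:5]))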

One-vs.-one

In the one-vs.-one (OvO) reduction, one trains K (K − 1) / 2 binary classifiers for a K-way
multiclass problem; each receives the samples of a pair of classes from the original training
set, and must learn to distinguish these two classes. At prediction time, a voting scheme is
applied: all K (K − 1) / 2 classifiers are applied to an unseen sample and the class that got the
highest number of "+1" predictions gets predicted by the combined classifier.

Like OvR, OvO suffers from ambiguities in that some regions of its input space may receive
the same number of votes.

2. Multi-class scores and probabilities

Extension from binary This section discusses strategies of extending the existing binary
classifiers to solve multi-class classification problems. Several algorithms have been
developed based on neural networks, decision trees, k-nearest neighbors, naive Bayes,
support vector machines and Extreme Learning Machines to address multi-class classification
problems. These types of techniques can also be called algorithm adaptation techniques.

Neural networks

Multiclass perceptrons provide a natural extension to the multi-class problem. Instead of just
having one neuron in the output layer, with binary output, one could have N binary neurons
leading to multi-class classification. In practice, the last layer of a neural network is usually a
softmax function layer, which is the algebraic simplification of N logistic classifiers,
normalized per class by the sum of the N-1 other logistic classifiers.
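
As a small illustration (an assumption, not from the text), a softmax output layer can be written directly in NumPy; it turns a vector of per-class scores (logits) into probabilities that sum to one:

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # roughly [0.66, 0.24, 0.10]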

Extreme learning machines

Extreme Learning Machines (ELM) is a special case of single hidden layer feed-forward
neural networks (SLFNs), wherein the input weights and the hidden node biases can be
chosen at random. Many variants and developments have been made to the ELM for
multiclass classification.

k-nearest neighbours

k-nearest neighbours (kNN) is considered among the oldest non-parametric classification
algorithms. To classify an unknown example, the distance from that example to every
training example is measured. The k smallest distances are identified, and the class most
represented among these k nearest neighbours is considered the output class label.
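
A minimal kNN sketch (illustrative only; names and toy data are assumptions):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Distance from x to every training example
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # The class most represented among the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))   # A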

Naive Bayes

Naive Bayes is a successful classifier based upon the principle of maximum a posteriori
(MAP). This approach is naturally extensible to the case of having more than two classes, and
was shown to perform well in spite of the underlying simplifying assumption of conditional
independence.

Decision trees

Decision tree learning is a powerful classification technique. The tree tries to infer a split of
the training data based on the values of the available features to produce a good
generalization. The algorithm can naturally handle binary or multiclass classification
problems. The leaf nodes can refer to either of the K classes concerned.

Support vector machines

Support vector machines are based upon the idea of maximizing the margin i.e. maximizing
the minimum distance from the separating hyperplane to the nearest example. The basic SVM
supports only binary classification, but extensions have been proposed to handle the
multiclass classification case as well. In these extensions, additional parameters and
constraints are added to the optimization problem to handle the separation of the different
classes.

3.Hierarchical classification

Hierarchical classification tackles the multi-class classification problem by dividing the
output space into a tree. Each parent node is divided into multiple child nodes and the
process is continued until each child node represents only one class. Several methods have
been proposed based on hierarchical classification.

Regression
A function estimator, also called a regressor, is a mapping f̂ : X → R. The regression learning
problem is to learn a function estimator from examples (xi, f(xi)).
Regression models are used to predict a continuous value. Predicting the price of a house
given features of the house like size, locality, etc. is one of the common examples of
regression. It is a supervised technique.

Types of Regression

1. Simple Linear Regression


2. Polynomial Regression
3. Support Vector Regression
4. Decision Tree Regression
5. Random Forest Regression

1.Simple Linear Regression

This is one of the most common and interesting types of regression technique. Here we
predict a target variable Y based on the input variable X. A linear relationship should exist
between the target variable and the predictor, hence the name linear regression.

Consider predicting the salary of an employee based on his/her age. We can easily identify
that there seems to be a correlation between an employee’s age and salary (the greater the
age, the higher the salary). The hypothesis of linear regression is

Y = a + bX

Y represents salary, X is the employee’s age, and a and b are the coefficients of the equation.
So in order to predict Y (salary) given X (age), we need to know the values of a and b (the
model’s coefficients).

While training and building a regression model, it is these coefficients which are learned and
fitted to the training data. The aim of training is to find a best-fit line such that the cost
function is minimized. The cost function helps in measuring the error. During the training
process we try to minimize the error between actual and predicted values, thus minimizing
the cost function.

In the figure, the red points are the data points and the blue line is the predicted line for the
training data. To get the predicted value, these data points are projected on to the line.

To summarize, our aim is to find the values of the coefficients which minimize the cost
function. The most common cost function is Mean Squared Error (MSE), which is equal to
the average squared difference between an observation’s actual and predicted values. The
coefficient values can be calculated using the gradient descent approach. To give a brief
understanding: in gradient descent we start with some random values of the coefficients,
compute the gradient of the cost function at these values, update the coefficients and
calculate the cost function again. This process is repeated until we find a minimum of the
cost function.
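
A minimal gradient descent sketch for simple linear regression (illustrative; the learning rate, iteration count and toy data are assumptions):

import numpy as np

def fit_linear_regression(X, Y, lr=0.1, n_iters=5000):
    # Start with zero coefficients and repeatedly move them along the
    # negative gradient of the MSE cost function
    a, b = 0.0, 0.0
    n = len(X)
    for _ in range(n_iters):
        error = (a + b * X) - Y
        a -= lr * (2.0 / n) * error.sum()          # d(MSE)/da
        b -= lr * (2.0 / n) * (error * X).sum()    # d(MSE)/db
    return a, b

X = np.linspace(0, 1, 50)
Y = 2 + 3 * X + np.random.normal(0, 0.1, 50)   # data generated around Y = 2 + 3X
print(fit_linear_regression(X, Y))             # roughly (2, 3)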

2.Polynomial Regression

In polynomial regression, we transform the original features into polynomial features of a
given degree and then apply linear regression on them. Consider the above linear model
Y = a + bX, which is transformed to something like

Y = a + bX + cX²

It is still a linear model (linear in its coefficients), but the fitted curve is now quadratic rather
than a straight line. Scikit-Learn provides the PolynomialFeatures class to transform the
features.

If we increase the degree to a very high value, the curve becomes overfitted as it learns the
noise in the data as well.
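
A minimal sketch of this transform-then-fit approach using scikit-learn (the toy data and degree are assumptions):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1 + 2 * X.ravel() + 0.5 * X.ravel() ** 2 + np.random.normal(0, 0.3, 100)

# Transform the features to degree-2 polynomial features, then apply linear regression
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))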

3.Support Vector Regression

In SVR, we identify a hyperplane with maximum margin such that the maximum number of
data points lie within that margin. SVR is very similar to the SVM classification algorithm,
which is discussed in detail later.

Instead of minimizing the error rate as in simple linear regression, we try to fit the error
within a certain threshold. Our objective in SVR is to consider only the points that are
within the margin. Our best fit line is the hyperplane that contains the maximum number of
points.
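
A minimal usage sketch with scikit-learn's SVR (toy data and parameter values are assumptions); epsilon sets the threshold within which errors are tolerated:

import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 5, 100).reshape(-1, 1)
y = np.sin(X).ravel()

svr = SVR(kernel="rbf", epsilon=0.1)   # epsilon-tube: errors inside it are ignored
svr.fit(X, y)
print(svr.predict([[2.5]]))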

4.Decision Tree Regression

Decision trees can be used for classification as well as regression. In decision trees, at each
level we need to identify the splitting attribute. In case of regression, the ID3 algorithm can
be used to identify the splitting node by reducing standard deviation (in classification
information gain is used).

A decision tree is built by partitioning the data into subsets containing instances with similar
values (homogenous). Standard deviation is used to calculate the homogeneity of a numerical
sample. If the numerical sample is completely homogeneous, its standard deviation is zero.

The steps for finding the splitting node are briefly described below:

1. Calculate the standard deviation of the target variable: S = sqrt( Σ (x − x̄)² / n ), where x̄ is
the mean of the target values.

2. Split the dataset on different attributes and calculate standard deviation for each branch
(standard deviation for target and predictor). This value is subtracted from the standard
deviation before the split. The result is the standard deviation reduction.

3. The attribute with the largest standard deviation reduction is chosen as the splitting node.

4. The dataset is divided based on the values of the selected attribute. This process is run
recursively on the non-leaf branches, until all data is processed.

To avoid overfitting, the coefficient of variation (CV) is used, which decides when to stop
branching. Finally, the average of each branch is assigned to the related leaf node (in
regression the mean is taken, whereas in classification the mode of the leaf nodes is taken).
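
A small sketch of the standard deviation reduction computation for one candidate attribute (the toy data is illustrative, not from the text):

import numpy as np

def std_reduction(target, feature):
    # Standard deviation of the target before the split
    before = np.std(target)
    # Weighted standard deviation of the target after splitting on the feature
    after = sum(
        len(target[feature == v]) / len(target) * np.std(target[feature == v])
        for v in np.unique(feature)
    )
    return before - after   # larger reduction = better splitting attribute

hours = np.array([25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30])
outlook = np.array(["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5)
print(std_reduction(hours, outlook))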

5.Random Forest Regression

Random forest is an ensemble approach where we take into account the predictions of several
decision regression trees.

1. Select K random data points from the training set.
2. Identify n, where n is the number of decision tree regressors to be created. Repeat
steps 1 and 2 to create several regression trees.
3. The average of each branch is assigned to the leaf node in each decision tree.
4. To predict the output for a new instance, the average of the predictions of all decision
trees is taken into consideration.

Random Forest prevents overfitting (which is common in decision trees) by creating random
subsets of the features and building smaller trees using these subsets.
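
A minimal usage sketch with scikit-learn's RandomForestRegressor (toy data and parameter values are assumptions):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(200, 3)
y = 10 * X[:, 0] + 5 * X[:, 1] + np.random.normal(0, 0.1, 200)

# n_estimators = number of decision tree regressors; max_features="sqrt" gives each
# tree a random subset of the features, which helps prevent overfitting
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt")
forest.fit(X, y)
print(forest.predict(X[:3]))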

The above explanation is a brief overview of each regression type.

Unsupervised and descriptive learning

 Unsupervised machine learning finds all kinds of unknown patterns in data.


 Unsupervised methods help you to find features which can be useful for
categorization.
 Unsupervised learning takes place in real time, so all the input data is analyzed and
labelled in the presence of learners.
 It is easier to get unlabeled data from a computer than labeled data, which needs
manual intervention.

Types of Unsupervised Learning

Unsupervised learning problems are further grouped into clustering and association problems.
Clustering

Clustering is an important concept when it comes to unsupervised learning. It mainly deals
with finding a structure or pattern in a collection of uncategorized data. Clustering algorithms
will process your data and find natural clusters (groups) if they exist in the data. You can also
modify how many clusters your algorithms should identify. It allows you to adjust the
granularity of these groups.

There are different types of clustering you can utilize:

Exclusive (partitioning)

In this clustering method, data are grouped in such a way that each data point can belong to
one cluster only.

Example: K-means
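
A minimal K-means usage sketch with scikit-learn (toy data and cluster count are assumptions):

import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.normal(0, 0.5, (50, 2)),
               np.random.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10)   # n_clusters adjusts the granularity
labels = kmeans.fit_predict(X)             # each point is assigned to exactly one cluster
print(labels[:5])
print(kmeans.cluster_centers_)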

Agglomerative

In this clustering technique, every data point starts as its own cluster. Iterative unions
between the two nearest clusters reduce the number of clusters.

Example: Hierarchical clustering

Overlapping

In this technique, fuzzy sets are used to cluster data. Each point may belong to two or more
clusters with separate degrees of membership.

Descriptive Learning: Using descriptive analysis, you might come up with the finding that
two products A (burgers) and B (french fries) are bought together with very high frequency.
Now you want the machine to automatically suggest B whenever a user buys A. Looking at
past data and deducing the possible factors influencing this situation can be achieved using
ML.

Predictive Learning: We want to increase our sales; using descriptive learning we came to
know the possible factors influencing sales. By tuning the parameters so that sales are
maximized in the next quarter, we predict what sales we could generate and hence make
investments accordingly. This task can also be handled using ML.

Chapter-4
Concept learning

Concept learning, also known as category learning, is "the search for and listing of attributes
that can be used to distinguish exemplars from non-exemplars of various categories". It is the
process of acquiring the definition of a general category from given positive and negative
training examples of the category.

Much of human learning involves acquiring general concepts from past experiences. For
example, humans identify different vehicles among all the vehicles based on specific sets of
features defined over a large set of features. This special set of features differentiates the
subset of cars in a set of vehicles. This set of features that differentiate cars can be called a
concept.
Similarly, machines can learn from concepts to identify whether an object belongs to a
specific category by processing past/training data to find a hypothesis that best fits the
training examples.
Target concept:

The set of items/objects over which the concept is defined is called the set of instances and
denoted by X. The concept or function to be learned is called the target concept and denoted
by c. It can be seen as a boolean valued function defined over X and can be represented as c:
X -> {0, 1}.
If we have a set of training examples with specific features of the target concept c, the
problem faced by the learner is to estimate c from the training data.
H is used to denote the set of all possible hypotheses that the learner may consider regarding
the identity of the target concept. The goal of the learner is to find a hypothesis h in H that
can identify all the objects in X, so that h(x) = c(x) for all x in X.
An algorithm that supports concept learning requires:
1. Training data (past experiences to train our models)
2. Target concept (hypothesis to identify data objects)
3. Actual data objects (for testing the models)

The hypothesis space

Each of the data objects represents a concept and a hypothesis. A hypothesis such as
<true, true, false, false> is very specific because it covers only one sample. Generally, we can
add some notations into this hypothesis. We have the following notations:

1. ∅ (represents a hypothesis that rejects all)


2. < ? , ? , ? , ? > (accepts all)
3. <true, false, ? , ? > (accepts some)

The hypothesis ∅ will reject all the data samples. The hypothesis <?, ?, ?, ?> will accept
all the data samples. The ? notation indicates that the values of this specific feature do not
affect the result.

The total number of possible hypotheses is (3 × 3 × 3 × 3) + 1 = 82, because each feature
can be either true, false, or ?, plus one hypothesis that rejects all (∅).

General to Specific

Many machine learning algorithms rely on the concept of a general-to-specific ordering of
hypotheses.

1. h1 = <true, true, ?, ?>
2. h2 = <true, ?, ?, ?>

Any instance classified by h1 will also be classified by h2. We can say that h2 is more
general than h1. Using this concept, we can find a general hypothesis that can be defined over
the entire dataset X.

To find a single hypothesis defined on X, we can use the more-general-than partial ordering.
One way to do this is to start with the most specific hypothesis in H and generalize it each
time it fails to classify an observed positive training example as positive.

1. The first step in the Find-S algorithm is to start with the most specific hypothesis,
which can be denoted by h <- <∅, ∅, ∅, ∅>.
2. This step involves picking up next training sample and applying Step 3 on the sample.
3. The next step involves observing the data sample. If the sample is negative, the
hypothesis remains unchanged and we pick the next training sample by processing
Step 2 again. Otherwise, we process Step 4.
4. If the sample is positive and we find that our initial hypothesis is too specific because
it does not cover the current training sample, then we need to update our current
hypothesis. This can be done by the pairwise conjunction (logical and operation) of
the current hypothesis and training sample.

If the next training sample is <true, true, false, false> and the current hypothesis is
<∅, ∅, ∅, ∅>, then we can directly replace our existing hypothesis with the new one.

If the next positive training sample is <true, true, false, true> and the current hypothesis
is <true, true, false, false>, then we can perform a pairwise conjunction of the current
hypothesis and the next training sample, and find a new hypothesis by putting ? in each
place where the two values disagree:

<true, true, false, true> ∧ <true, true, false, false> = <true, true, false, ?>

Now, we can replace our existing hypothesis with the new one: h <- <true, true, false, ?>

5. This step involves repeating Step 2 while there are more training samples.
6. Once there are no more training samples, the current hypothesis is the one we wanted to
find. We can use the final hypothesis to classify real objects (a minimal sketch of the whole
procedure is given below).
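
A minimal sketch of the Find-S procedure described above (the feature representation and toy data are illustrative assumptions):

def find_s(training_examples):
    # training_examples: list of (sample, label) pairs; samples are tuples of feature values
    hypothesis = None                       # start from the most specific hypothesis (rejects all)
    for sample, label in training_examples:
        if not label:                       # negative samples leave the hypothesis unchanged
            continue
        if hypothesis is None:              # first positive sample replaces the most specific hypothesis
            hypothesis = list(sample)
        else:                               # generalise: put ? wherever hypothesis and sample disagree
            hypothesis = [h if h == s else "?" for h, s in zip(hypothesis, sample)]
    return hypothesis

data = [((True, True, False, False), True),
        ((True, True, False, True), True),
        ((False, False, True, True), False)]
print(find_s(data))   # [True, True, False, '?']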

Paths through the hypothesis space


As we can clearly see in Figure 4.4, in this example we have not one but two most general
hypotheses. What we can also notice is that every concept between the least general one and
one of the most general ones is also a possible hypothesis, i.e., it covers all the positives and
none of the negatives.

Algorithm 4.3: LGG-Conj-ID(x, y) – find the least general conjunctive generalisation of two
conjunctions, employing internal disjunction.

Input: conjunctions x, y.
Output: conjunction z.
1 z ← true;
2 for each feature f do
3   if f = vx is a conjunct in x and f = vy is a conjunct in y then
4     add f = Combine-ID(vx, vy) to z; // Combine-ID: see text
5   end
6 end
7 return z
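
A simplified sketch of a least general generalisation of two conjunctions, with conjunctions represented as dictionaries (the representation is an assumption; unlike Combine-ID, this version simply drops a feature when the two values disagree instead of forming an internal disjunction):

def lgg_conj(x, y):
    # Keep a conjunct f = v only if both conjunctions agree on the value of feature f
    return {f: x[f] for f in x if f in y and x[f] == y[f]}

p1 = {"gills": "no", "beak": "yes", "teeth": "many"}
p2 = {"gills": "no", "beak": "yes", "teeth": "few"}
print(lgg_conj(p1, p2))   # {'gills': 'no', 'beak': 'yes'}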

-----XXX-----
