The Hundred Page Machine Learning 2019

Supervised learning involves gathering data in the form of input-output pairs, where inputs can be various types of data and outputs are typically labels or real numbers. The Support Vector Machine (SVM) algorithm is used to create a decision boundary that separates different classes by optimizing the margin between them, which helps in predicting labels for new data. The effectiveness of the model on unseen examples is attributed to the statistical likelihood that new examples will be similar to those in the training set, allowing the decision boundary to generalize well.

Uploaded by

keshav pareek


1.3 How Supervised Learning Works

In this section, I briefly explain how supervised learning works so that you have the picture
of the whole process before we go into detail. I decided to use supervised learning as an
example because it’s the type of machine learning most frequently used in practice.
The supervised learning process starts with gathering the data. The data for supervised
learning is a collection of pairs (input, output). Input could be anything, for example, email
messages, pictures, or sensor measurements. Outputs are usually real numbers, or labels (e.g.
“spam”, “not_spam”, “cat”, “dog”, “mouse”, etc.). In some cases, outputs are vectors (e.g.,
four coordinates of the rectangle around a person in the picture), sequences (e.g., [“adjective”,
“adjective”, “noun”] for the input “big beautiful car”), or have some other structure.
Let’s say the problem that you want to solve using supervised learning is spam detection.
You gather the data, for example, 10,000 email messages, each labeled either “spam” or
“not_spam” (you could add those labels manually or pay someone to do that for you). Now,
you have to convert each email message into a feature vector.
The data analyst decides, based on their experience, how to convert a real-world entity, such
as an email message, into a feature vector. One common way to convert a text into a feature
vector, called bag of words, is to take a dictionary of English words (let’s say it contains
20,000 alphabetically sorted words) and stipulate that in our feature vector:
• the first feature is equal to 1 if the email message contains the word “a”; otherwise,
this feature is 0;
• the second feature is equal to 1 if the email message contains the word “aaron”; otherwise,
this feature equals 0;
• ...
• the feature at position 20,000 is equal to 1 if the email message contains the word
“zulu”; otherwise, this feature is equal to 0.
You repeat the above procedure for every email message in your collection, which gives
you 10,000 feature vectors (each vector having the dimensionality of 20,000), each paired
with a label (“spam”/“not_spam”).
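The bag-of-words conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the six-word dictionary, the two example messages, and the helper name `bag_of_words` are all made up for the example, and the tokenization (split on whitespace, strip punctuation, lowercase) is a simplifying assumption.

```python
def bag_of_words(message, dictionary):
    """Return a binary feature vector: 1 if the dictionary word occurs in the message, else 0."""
    # Simplified tokenization (an assumption): split on whitespace,
    # strip surrounding punctuation, and lowercase every token.
    words = {token.strip(".,!?;:\"'()").lower() for token in message.split()}
    return [1 if word in words else 0 for word in dictionary]


# Toy, alphabetically sorted dictionary (hypothetical; the text assumes 20,000 words).
dictionary = ["a", "aaron", "free", "meeting", "prize", "zulu"]

# Two hypothetical labeled messages standing in for the 10,000 emails.
emails = [
    ("Claim a FREE prize now!", "spam"),
    ("Aaron, can we move the meeting?", "not_spam"),
]

features = [bag_of_words(text, dictionary) for text, _ in emails]
labels = [label for _, label in emails]

print(features)  # [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 0]]
print(labels)    # ['spam', 'not_spam']
```

Each position in a vector corresponds to one dictionary word, so every message maps to a vector of the same length regardless of how long the message is.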
Now you have machine-readable input data, but the output labels are still in the form of
human-readable text. Some learning algorithms require transforming labels into numbers.
For example, some algorithms require numbers like 0 (to represent the label “not_spam”)
and 1 (to represent the label “spam”). The algorithm I use to illustrate supervised learning is
called Support Vector Machine (SVM). This algorithm requires that the positive label (in
our case it’s “spam”) has the numeric value of +1 (one), and the negative label (“not_spam”)
has the value of -1 (minus one).
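As a small illustration of this label transformation, the mapping for SVM could be written as follows. The names `LABEL_TO_NUMBER` and `encode_labels` are hypothetical and chosen only for this sketch.

```python
# Mapping required by SVM as described in the text: positive class +1, negative class -1.
LABEL_TO_NUMBER = {"spam": +1, "not_spam": -1}


def encode_labels(labels):
    """Turn human-readable labels into the numeric values +1 and -1."""
    return [LABEL_TO_NUMBER[label] for label in labels]


encoded = encode_labels(["spam", "not_spam", "spam"])
print(encoded)  # [1, -1, 1]
```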
At this point, you have a dataset and a learning algorithm, so you are ready to apply
the learning algorithm to the dataset to get the model.
Andriy Burkov The Hundred-Page Machine Learning Book - Draft 5

SVM sees every feature vector as a point in a high-dimensional space (in our case, space
is 20,000-dimensional). The algorithm puts all feature vectors on an imaginary
20,000-dimensional plot and draws an imaginary 19,999-dimensional line (a hyperplane) that separates
examples with positive labels from examples with negative labels. In machine learning, the
boundary separating the examples of different classes is called the decision boundary.
The equation of the hyperplane is given by two parameters, a real-valued vector w of the
same dimensionality as our input feature vector x, and a real number b like this:

$$\mathbf{w}\mathbf{x} - b = 0,$$

where the expression $\mathbf{w}\mathbf{x}$ means $w^{(1)}x^{(1)} + w^{(2)}x^{(2)} + \ldots + w^{(D)}x^{(D)}$, and $D$ is the number
of dimensions of the feature vector $\mathbf{x}$.
(If some equations aren’t clear to you right now, in Chapter 2 we revisit the math and
statistical concepts necessary to understand them. For the moment, try to get an intuition of
what’s happening here. It all becomes clearer after you read the next chapter.)
Now, the predicted label for some input feature vector x is given like this:

$$y = \operatorname{sign}(\mathbf{w}\mathbf{x} - b),$$

where sign is a mathematical operator that takes any value as input and returns $+1$ if the
input is a positive number or $-1$ if the input is a negative number.
The goal of the learning algorithm (SVM in this case) is to leverage the dataset and find
the optimal values $\mathbf{w}^*$ and $b^*$ for parameters $\mathbf{w}$ and $b$. Once the learning algorithm identifies
these optimal values, the model $f(\mathbf{x})$ is then defined as:

$$f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^*\mathbf{x} - b^*)$$

Therefore, to predict whether an email message is spam or not spam using an SVM model,
you have to take the text of the message, convert it into a feature vector, then multiply this
vector by $\mathbf{w}^*$, subtract $b^*$, and take the sign of the result. This will give you the prediction ($+1$
means “spam”, $-1$ means “not_spam”).
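The prediction step can be sketched as below. The parameter values `w_star` and `b_star` and the three-dimensional feature vectors are invented for illustration (a real model in this example would have 20,000 weights). The text leaves the sign of exactly zero undefined, so mapping zero to -1 here is an arbitrary assumption.

```python
def dot(w, x):
    """Compute w(1)x(1) + w(2)x(2) + ... + w(D)x(D)."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def sign(value):
    """Return +1 for positive input and -1 for negative input.

    Zero (a point exactly on the boundary) is mapped to -1; that choice is
    an assumption of this sketch, not something the text specifies.
    """
    return 1 if value > 0 else -1


def predict(x, w_star, b_star):
    """Apply the model f(x) = sign(w* x - b*)."""
    return sign(dot(w_star, x) - b_star)


# Hypothetical learned parameters for a 3-word dictionary.
w_star = [2.0, -1.0, 0.5]
b_star = 0.5

print(predict([1, 0, 1], w_star, b_star))  # 2.0 + 0.5 - 0.5 = 2.0  -> 1 ("spam")
print(predict([0, 1, 0], w_star, b_star))  # -1.0 - 0.5 = -1.5      -> -1 ("not_spam")
```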
Now, how does the machine find $\mathbf{w}^*$ and $b^*$? It solves an optimization problem. Machines
are good at optimizing functions under constraints.
So what are the constraints we want to satisfy here? First of all, we want the model to predict
the labels of our 10,000 examples correctly. Remember that each example $i = 1, \ldots, 10000$ is
given by a pair $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is the feature vector of example $i$ and $y_i$ is its label that
takes values either $-1$ or $+1$. So the constraints are naturally:
• $\mathbf{w}\mathbf{x}_i - b \geq 1$ if $y_i = +1$, and
• $\mathbf{w}\mathbf{x}_i - b \leq -1$ if $y_i = -1$


[Figure 1: An example of an SVM model for two-dimensional feature vectors. Axes: $x^{(1)}$ and $x^{(2)}$.]

We would also prefer that the hyperplane separates positive examples from negative ones with
the largest margin. The margin is the distance between the closest examples of two classes,
as defined by the decision boundary. A large margin contributes to better generalization,
that is, how well the model will classify new examples in the future. To achieve that, we need
to minimize the Euclidean norm of $\mathbf{w}$, denoted by $\|\mathbf{w}\|$ and given by $\sqrt{\sum_{j=1}^{D} (w^{(j)})^2}$.

So, the optimization problem that we want the machine to solve looks like this:
Minimize $\|\mathbf{w}\|$ subject to $y_i(\mathbf{w}\mathbf{x}_i - b) \geq 1$ for $i = 1, \ldots, N$. The expression $y_i(\mathbf{w}\mathbf{x}_i - b) \geq 1$
is just a compact way to write the above two constraints.
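To make the optimization problem concrete, here is a toy sketch that solves it by brute force on four made-up two-dimensional examples. Exhaustive grid search is used only because it is easy to read; it is not how SVM solvers work in practice, and the dataset, the grid, and the helper name `satisfies_constraints` are all assumptions of this example.

```python
import itertools
import math

# Toy 2D dataset (hypothetical): (feature vector, label in {+1, -1}).
examples = [
    ((2.0, 0.0), +1),
    ((3.0, 1.0), +1),
    ((0.0, 0.0), -1),
    ((0.0, 1.0), -1),
]


def satisfies_constraints(w, b, examples):
    """Check the compact constraint y_i * (w x_i - b) >= 1 for every example."""
    return all(
        y * (w[0] * x[0] + w[1] * x[1] - b) >= 1
        for x, y in examples
    )


# Candidate values -2.0, -1.5, ..., 2.0 for each of w(1), w(2), and b.
grid = [k / 2 for k in range(-4, 5)]

best = None
for w1, w2, b in itertools.product(grid, repeat=3):
    if not satisfies_constraints((w1, w2), b, examples):
        continue  # infeasible: at least one training example violates its constraint
    norm = math.hypot(w1, w2)  # Euclidean norm of w
    if best is None or norm < best[0]:
        best = (norm, (w1, w2), b)

norm, w, b = best
print(norm, w, b)  # 1.0 (1.0, 0.0) 1.0
```

For this dataset the constraints force $w^{(1)} \geq 1$, so the smallest feasible norm is 1, reached at $\mathbf{w} = (1, 0)$ and $b = 1$; the resulting decision boundary is the vertical line $x^{(1)} = 1$.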
The solution of this optimization problem, given by $\mathbf{w}^*$ and $b^*$, is called the statistical
model, or, simply, the model. The process of building the model is called training.
For two-dimensional feature vectors, the problem and the solution can be visualized as shown
in fig. 1. The blue and orange circles represent, respectively, positive and negative examples,
and the line given by $\mathbf{w}\mathbf{x} - b = 0$ is the decision boundary.
Why, by minimizing the norm of $\mathbf{w}$, do we find the highest margin between the two classes?
Geometrically, the equations $\mathbf{w}\mathbf{x} - b = 1$ and $\mathbf{w}\mathbf{x} - b = -1$ define two parallel hyperplanes,
as you see in fig. 1. The distance between these hyperplanes is given by $\frac{2}{\|\mathbf{w}\|}$, so the smaller
the norm $\|\mathbf{w}\|$, the larger the distance between these two hyperplanes.
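A quick numeric check of this relationship, using two made-up weight vectors that point in the same direction (the function name `margin_width` is hypothetical):

```python
import math


def margin_width(w):
    """Distance between the hyperplanes wx - b = 1 and wx - b = -1, i.e. 2 / ||w||."""
    return 2 / math.sqrt(sum(w_j ** 2 for w_j in w))


# Two hypothetical weight vectors; the second is the first one halved.
print(margin_width([3.0, 4.0]))  # norm 5.0 -> distance 0.4
print(margin_width([1.5, 2.0]))  # norm 2.5 -> distance 0.8
```

Halving the norm doubles the distance between the two hyperplanes, which is why minimizing the norm widens the margin.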
That’s how Support Vector Machines work. This particular version of the algorithm builds
the so-called linear model. It’s called linear because the decision boundary is a straight line
(or a plane, or a hyperplane). SVM can also incorporate kernels that can make the decision
boundary arbitrarily non-linear. In some cases, it could be impossible to perfectly separate
the two groups of points because of noise in the data, errors of labeling, or outliers (examples
very different from a “typical” example in the dataset). Another version of SVM can also
incorporate a penalty hyperparameter for misclassification of training examples of specific
classes. We study the SVM algorithm in more detail in Chapter 3.
At this point, you should retain the following: any classification learning algorithm that
builds a model implicitly or explicitly creates a decision boundary. The decision boundary
can be straight, or curved, or it can have a complex form, or it can be a superposition of
some geometrical figures. The form of the decision boundary determines the accuracy of
the model (that is, the fraction of examples whose labels are predicted correctly). The form of
the decision boundary and the way it is algorithmically or mathematically computed based on
the training data differentiate one learning algorithm from another.
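Accuracy as defined here is straightforward to compute. The sketch below uses made-up true and predicted labels; the function name `accuracy` is chosen for this example.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Hypothetical labels: the third prediction is wrong, the other four are right.
y_true = [+1, -1, -1, +1, -1]
y_pred = [+1, -1, +1, +1, -1]

print(accuracy(y_true, y_pred))  # 0.8
```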
In practice, there are two other essential differentiators of learning algorithms to consider:
speed of model building and prediction processing time. In many practical cases, you would
prefer a learning algorithm that builds a less accurate model fast. Additionally, you might
prefer a less accurate model that is much quicker at making predictions.

1.4 Why the Model Works on New Data

Why is a machine-learned model capable of predicting correctly the labels of new, previously
unseen examples? To understand that, look at the plot in fig. 1. If two classes are separable
from one another by a decision boundary, then, obviously, examples that belong to each class
are located in two different subspaces which the decision boundary creates.
If the examples used for training were selected randomly, independently of one another, and
following the same procedure, then, statistically, it is more likely that the new negative
example will be located on the plot somewhere not too far from other negative examples.
The same concerns the new positive example: it will likely come from the surroundings of
other positive examples. In such a case, our decision boundary will still, with high probability,
separate new positive and negative examples well from one another. For other, less likely
situations, our model will make errors, but because such situations are less likely, the number
of errors will likely be smaller than the number of correct predictions.
Intuitively, the larger the set of training examples, the less likely it is that new examples
will be dissimilar to (and lie on the plot far from) the examples used for training. To minimize
the probability of making errors on new examples, the SVM algorithm, by looking for the
largest margin, explicitly tries to draw the decision boundary in such a way that it lies as far
as possible from examples of both classes.
