Fitting A Model To Data
Parameterized Model
• A very common case is where the structure of the
model is a parameterized mathematical function or
equation of a set of numeric attributes.
• The attributes used in the model could be chosen:
• based on domain knowledge regarding which attributes ought to be informative in predicting the target variable, or
• based on other data mining techniques, such as the attribute selection procedures introduced previously.
Decision Boundaries
(figure)
Instance Space
(figure)
Linear Classifier
(figure: a line separating the two classes in the Age vs. Balance instance space)
Linear Discriminant Functions
• Our goal is going to be to fit our model to the data.
• Recall y = mx + b, where m is the slope of the line and b is
the y intercept (the y value when x = 0). The line in previous
figure can be expressed in this form (with Balance in
thousands) as:
Age = ( - 1.5) × Balance + 60
• We would classify an instance x as a + if it is above the line, and as a ● if it is below the line:
class(x) = +  if 1.0 × Age − 1.5 × Balance + 60 > 0
class(x) = ●  if 1.0 × Age − 1.5 × Balance + 60 ≤ 0
• This is called a linear discriminant because it discriminates
between the classes, and the function of the decision
boundary is a linear combination—a weighted sum—of the
attributes.
Linear Discriminant Functions
• In the two dimensions of our example, the linear combination corresponds to a
line.
• In three dimensions, the decision boundary is a plane, and in higher dimensions
it is a hyperplane.
• Thus, this linear model is a different sort of multivariate supervised
segmentation.
• A linear discriminant function is a numeric classification model.
• For example, consider our feature vector x, with the individual component features being xᵢ. A linear model then can be written as:
f(x) = w₀ + w₁x₁ + w₂x₂ + ⋯
• The concrete example above can be written in this form (see the sketch below):
f(x) = 60 + 1.0 × Age − 1.5 × Balance
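As a concrete illustration of the linear discriminant above, here is a minimal Python sketch (not part of the original slides; the function names and example instances are my own) that computes the weighted sum f(x) = w₀ + w₁x₁ + w₂x₂ and classifies by its sign, using the Age/Balance weights from the example.

```python
import numpy as np

def linear_discriminant(x, weights, w0):
    """Weighted sum of the features plus an intercept: f(x) = w0 + w . x."""
    return w0 + np.dot(weights, x)

def classify(x, weights, w0):
    """Classify as '+' if the instance is on the positive side of the boundary, '●' otherwise."""
    return "+" if linear_discriminant(x, weights, w0) > 0 else "●"

# Concrete example from the slide: f(x) = 60 + 1.0 * Age - 1.5 * Balance (Balance in thousands).
weights = np.array([1.0, -1.5])   # weights for [Age, Balance]
w0 = 60.0

print(classify(np.array([45, 30]), weights, w0))   # Age 45, Balance 30k -> '+'
print(classify(np.array([30, 80]), weights, w0))   # Age 30, Balance 80k -> '●'
```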
Linear Discriminant Functions
(figure)
Choosing the “best” line
(figure)
Optimizing an Objective Function
• What should be our goal or objective in choosing
the parameters?
• This would allow us to answer the question: what
weights should we choose?
• Our general procedure will be to define an
objective function that represents our goal, and can
be calculated for a particular set of weights and a
particular set of data.
• We will then find the optimal value for the weights
by maximizing or minimizing the objective function.
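To make the procedure concrete, here is a hedged sketch (my own illustration, with invented toy data) of defining an objective function over a set of weights and letting an optimizer find the weights that minimize it, using scipy.optimize.minimize and a sum-of-squared-errors objective.

```python
import numpy as np
from scipy.optimize import minimize

# Toy training data: two numeric attributes and a numeric target (invented for illustration).
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([3.0, 2.5, 5.0, 8.0])

def objective(w):
    """Sum of squared errors of the linear model f(x) = w0 + w1*x1 + w2*x2 on the training data."""
    predictions = w[0] + X @ w[1:]
    return np.sum((y - predictions) ** 2)

# Find the weight values that minimize the objective, starting the search from all zeros.
result = minimize(objective, x0=np.zeros(3))
print("fitted weights:", result.x)
print("objective value at the optimum:", result.fun)
```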
Objective Functions
• “Best” line depends on the objective (loss) function
• Objective function should represent our goal
• A loss function determines how much penalty should be
assigned to an instance based on the error in the model’s
predicted value
• Examples of objective (or loss) functions:
• λ(y; x) = |y − f(x)|  (absolute error)
• λ(y; x) = (y − f(x))²  [convenient mathematically – linear regression]
• λ(y; x) = I(y ≠ f(x))  (zero-one loss)
• Linear regression, logistic regression, and support vector
machines are all very similar instances of our basic
fundamental technique:
• The key difference is that each uses a different objective function
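The three example loss functions above can be written directly as small Python functions; this is a sketch of my own, with y the true value and f_x the model's prediction for an instance x.

```python
import numpy as np

def absolute_loss(y, f_x):
    """lambda(y; x) = |y - f(x)|"""
    return np.abs(y - f_x)

def squared_loss(y, f_x):
    """lambda(y; x) = (y - f(x))**2 -- mathematically convenient; minimized by linear regression."""
    return (y - f_x) ** 2

def zero_one_loss(y, f_x):
    """lambda(y; x) = I(y != f(x)) -- penalty 1 for a wrong prediction, 0 for a correct one."""
    return float(y != f_x)
```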
Logistic regression is a misnomer
• The distinction between classification and regression is
whether the value for the target variable is categorical
or numeric
• For logistic regression, the model produces a numeric
estimate
• However, the values of the target variable in the data
are categorical
• Logistic regression is estimating the probability of class
membership (a numeric quantity) over a categorical
class
• Logistic regression is a class probability estimation
model and not a regression model
Example 1 – Iris Dataset
• This is an old and fairly simple
dataset representing various types of
iris, a genus of flowering plant.
• The original dataset includes three
species of irises represented with
four attributes, and the data mining
problem is to classify each instance
as belonging to one of the three
species based on the attributes.
• We’ll use just two species of irises,
Iris Setosa and Iris Versicolor.
• The dataset describes a collection of
flowers of these two species, each
described with two measurements:
the Petal width and the Sepal width.
Example 1 – Iris Dataset
• Each instance is one
flower and corresponds
to one dot on the
graph.
• The filled dots are of
the species Iris Setosa
and the circles are
instances of the species
Iris Versicolor.
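A runnable sketch of the setup described above, using scikit-learn (the slides do not prescribe a toolkit, so this is my own illustration): keep only Iris Setosa and Iris Versicolor, keep only the sepal-width and petal-width measurements, and fit a linear classifier in that two-dimensional instance space.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Keep only the first two species: 0 = setosa, 1 = versicolor.
mask = iris.target < 2
# Column 1 is sepal width (cm) and column 3 is petal width (cm).
X = iris.data[mask][:, [1, 3]]
y = iris.target[mask]

# Fit a linear model that separates the two species.
model = LogisticRegression()
model.fit(X, y)
print("weights:", model.coef_, "intercept:", model.intercept_)
print("training accuracy:", model.score(X, y))
```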
Ranking Instances and Class Probability Estimation
The many faces of classification:
Classification / Probability Estimation / Ranking (increasing difficulty)
• Ranking:
• Business context determines the number of actions (“how
far down the list”)
• Probability:
• You can always rank / classify if you have probabilities!
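A tiny illustration (mine, not from the slides, with made-up probability estimates) of the point that probabilities give you both a ranking and a classification: sort instances by their estimated probability to rank them, and compare against a threshold to classify them.

```python
# Hypothetical class-probability estimates for five customers (invented for illustration).
estimates = {"ann": 0.92, "bob": 0.15, "carla": 0.55, "dev": 0.71, "eve": 0.30}

# Ranking: order the instances from most to least likely to belong to the class.
ranking = sorted(estimates, key=estimates.get, reverse=True)
print("ranked list:", ranking)

# Classification: apply a threshold chosen from the business context.
threshold = 0.5
positives = [name for name, p in estimates.items() if p >= threshold]
print("classified as positive:", positives)
```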
Ranking: Examples
• Search engines
• Whether a document is relevant to a topic / query
• In machine translation
• for ranking a set of hypothesized translations
• In computational biology
• for ranking candidate 3-D structures in the protein structure prediction problem
• In recommender systems
• for identifying a ranked list of related news articles to
recommend to a user after he or she has read a current
news article.
Class Probability Estimation:
Examples
• MegaTelCo
• Ranking vs. Class Probability Estimation
• Identify accounts or transactions as likely to have
been defrauded
• The director of the fraud control operation may want
the analysts to focus not simply on the cases most likely
to be fraud, but on accounts where the expected
monetary loss is higher
• We need to estimate the actual probability of fraud
Support Vector Machines
• In short, support vector machines are linear
discriminants.
• As with linear discriminants generally, SVMs classify instances based on a linear function of the features:
f(x) = w₀ + w₁x₁ + w₂x₂ + ⋯
Support Vector Machines
• The crucial question:
• What is the objective function that is used to fit an SVM to data?
• Skip the mathematical details for an intuitive
understanding.
• There are two main ideas.
• Instead of thinking about separating with a line,
first fit the fattest bar between the classes.
• See figure in the next slide.
Support Vector Machines
• First idea: a wider bar is
better.
• Once the widest bar is
found, the linear
discriminant will be the
center line through the
bar (the solid middle line
in Figure 4-8).
• The distance between the
dashed parallel lines is
called the margin around
the linear discriminant,
and thus the objective is
to maximize the margin.
Support Vector Machines
• The second idea:
• How do they handle
points falling on the
wrong side of the
discrimination boundary?
• Typically, a single line
cannot perfectly separate
the data into classes.
• This is true of most data
from complex real-world
applications.
Support Vector Machines
• In the objective function that measures how well a
particular model fits the training points, we will simply
penalize a training point for being on the wrong side of the
decision boundary.
• In the case where the data indeed are linearly separable, we
incur no penalty and simply maximize the margin.
• If the data are not linearly separable, the best fit is some
balance between a fat margin and a low total error penalty.
• The penalty for a misclassified point is proportional to the
distance from the decision boundary, so if possible the SVM
will make only “small” errors. Technically, this error function
is known as hinge loss.
• See next slide.
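As a hedged sketch of this objective in practice (my own, with invented toy data), scikit-learn's SVC with a linear kernel fits a soft-margin linear SVM: the parameter C controls the balance between a fat margin and a low total penalty for points on the wrong side.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data that is not perfectly linearly separable (invented for illustration).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.5, 2.0],
              [2.0, 1.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1])

# Smaller C tolerates more margin violations (fatter margin);
# larger C penalizes errors more heavily (narrower margin, fewer violations).
svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)
print("weights:", svm.coef_, "intercept:", svm.intercept_)
```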
Loss Functions
(figure: loss incurred as a function of distance from the decision boundary)
Loss Functions
• The term “loss” is used across data science for the error penalty.
• A loss function determines how much penalty should be
assigned to an instance based on the error in the model’s
predicted value—in our present context, based on its
distance from the separation boundary.
• Several loss functions are commonly used.
• Support vector machines use hinge loss.
• Hinge loss incurs no penalty for an example that is not on the wrong
side of the margin.
• The hinge loss only becomes positive when an example is on the
wrong side of the boundary and beyond the margin.
• Loss then increases linearly with the example’s distance from the
margin, thereby penalizing points more the farther they are from
the separating boundary.
Loss Functions
• Zero-one loss, as its name implies, assigns a loss of zero for a correct
decision and one for an incorrect decision.
• For contrast, consider a different sort of loss function: squared error.
• Squared error specifies a loss proportional to the square of the distance from
the boundary.
• Squared error loss usually is used for numeric value prediction (regression),
rather than classification.
• The squaring of the error has the effect of greatly penalizing predictions that
are grossly wrong.
• For classification, this would apply large penalties to points far over on
the “wrong side” of the separating boundary.
• Unfortunately, using squared error for classification also penalizes
points far on the correct side of the decision boundary.
• For most business problems, choosing squared-error loss for
classification or class-probability estimation thus would violate our
principle of thinking carefully about whether the loss function is aligned
with the business goal.
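To make the contrast concrete, here is a small sketch of my own comparing the three losses as a function of an example's signed distance from the separating boundary (positive = correct side; for hinge loss the margin is placed at distance 1, and squared error is written relative to that same target so its penalty grows even far on the correct side).

```python
import numpy as np

def hinge_loss(distance):
    """No penalty on the correct side beyond the margin; grows linearly past it."""
    return np.maximum(0.0, 1.0 - distance)

def zero_one_loss(distance):
    """0 for a correct decision (positive side of the boundary), 1 for an incorrect one."""
    return 0.0 if distance > 0 else 1.0

def squared_error(distance):
    """Penalty grows with the square of the distance -- including far on the correct side."""
    return (1.0 - distance) ** 2

for d in [-2.0, -0.5, 0.5, 1.0, 3.0]:
    print(d, hinge_loss(d), zero_one_loss(d), squared_error(d))
```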
Linear Regression
• We need to decide on the objective function we will use to
optimize the model’s fit to the data.
• An intuitive notion of the fit of the model is:
• how far away are the estimated values from the true values on the
training data?
• In other words, how big is the error of the fitted model?
• Presumably we’d like to minimize this error.
• For a particular training dataset, we could compute this
error for each individual data point and sum up the results.
• Then the model that fits the data best would be the model
with the minimum sum of errors on the training data.
• And that is exactly what regression procedures do.
Linear Regression
• The most natural method is simply to subtract the estimated value from the true value (and take the absolute value).
• This is called absolute error, and we could then
minimize the sum of absolute errors or equivalently the
mean of the absolute errors across the training data.
• This makes a lot of sense, but it is not what standard
linear regression procedures do.
• Standard linear regression procedures instead minimize the
sum or mean of the squares of these errors.
• Analysts often claim to prefer squared error because it
strongly penalizes very large errors.
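A minimal sketch (mine, with invented toy data) of what "minimize the sum of squared errors" looks like in practice: NumPy's least-squares solver finds the intercept and slope that make the squared-error objective as small as possible.

```python
import numpy as np

# Toy data: one attribute x and a numeric target y (invented for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Design matrix with a column of ones so the model has an intercept: y ~ b + m * x.
A = np.column_stack([np.ones_like(x), x])

# lstsq returns the (b, m) that minimize the sum of squared errors.
(b, m), _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print("intercept:", b, "slope:", m)

# The objective that was minimized:
print("sum of squared errors:", np.sum((y - (b + m * x)) ** 2))
```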
Linear Regression
• Importantly, any choice for the objective function has
both advantages and drawbacks.
• For least squares regression a serious drawback is that
it is very sensitive to the data:
• erroneous or otherwise outlying data points can severely
skew the resultant linear function.
• An important thing to remember is
• Once we see linear regression simply as an instance of fitting
a (linear) model to data,
• we see that we have to choose the objective function to optimize,
and
• we should do so with the ultimate business application in mind.
Class Probability Estimation and
Logistic Regression
• For many applications we would like to estimate the
probability that a new instance belongs to the class of
interest.
• In many cases, we would like to use the estimated
probability in a decision-making context that includes
other factors such as costs and benefits.
• Fortunately, within this same framework for fitting
linear models to data, by choosing a different objective
function we can produce a model designed to give
accurate estimates of class probability.
• The most common procedure by which we do this is
called logistic regression.
Logistic Regression
• An instance being further from the separating
boundary intuitively ought to lead to a higher
probability of being in one class or the other, and
the output of the linear function, f(x), gives the
distance from the separating boundary.
• However, this also shows the problem: f(x) ranges
from –∞ to ∞, and a probability should range from
zero to one.
• To limit the output to the range [0, 1], we can use the logistic function:
p(x) = 1 / (1 + e^(−f(x)))
Logistic regression (“sigmoid”) curve
(figure: the S-shaped logistic curve)
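A small sketch (my own) of the logistic function that produces this curve: it squashes the linear function's output f(x), which ranges from −∞ to ∞, into a probability between 0 and 1.

```python
import numpy as np

def logistic(f_x):
    """Map the linear function's output (distance from the boundary) to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-f_x))

for f in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f, round(float(logistic(f)), 3))
```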
Application of Logistic Regression
• The Wisconsin Breast Cancer Dataset
• This is another popular dataset from the machine learning dataset repository at the University of California at Irvine.
• Each example describes characteristics of a cell
nuclei image, which has been labeled as either
benign or malignant (cancerous), based on an
expert’s diagnosis of the cells.
Wisconsin Breast Cancer Dataset
(figure/table slides)
Wisconsin Breast Cancer dataset
• The performance of this model is quite good—it makes
only six mistakes on the entire dataset, yielding an
accuracy of about 98.9% (the percentage of the
instances that the model classifies correctly).
• For comparison, a classification tree was learned from
the same dataset (using Weka’s J48 implementation).
• The tree has 25 nodes altogether, with 13 leaf nodes.
Recall that this means that the tree model partitions
the instances into 13 segments.
• The classification tree’s accuracy is 99.1%, slightly
higher than that of logistic regression.
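For a rough, reproducible counterpart to the result described above (my own sketch, not the book's exact setup), scikit-learn ships a closely related version of this data as load_breast_cancer (the "diagnostic" variant: 569 instances, 30 features computed from cell-nuclei images). Fitting logistic regression to it looks like this; the accuracy will be in the same ballpark but will not exactly match the 98.9% on the slide.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target        # target: 0 = malignant, 1 = benign

# Standardizing the features helps the solver converge; then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Accuracy on the full dataset: the fraction of instances the model classifies correctly.
print("accuracy:", model.score(X, y))
```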
Non-linear Functions
• Linear functions can actually represent nonlinear models if we include more complex features in the functions (see the sketch below)
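The sketch below (my own, with invented data) shows the trick: the model is still linear, but in squared and product terms of the original attributes, so its decision boundary is curved in the original feature space. PolynomialFeatures generates those higher-order features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data whose true class boundary is a circle (invented for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5).astype(int)

# Still a linear model -- but over x1, x2, x1^2, x1*x2, x2^2 rather than x1, x2 alone.
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000))
model.fit(X, y)
print("training accuracy with higher-order features:", model.score(X, y))
```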
Non-linear Functions
• Using “higher order” features is just a “trick”
• Common techniques based on fitting the parameters of
complex, nonlinear functions:
• Non-linear support vector machines and neural networks
• A nonlinear support vector machine with a “polynomial kernel” considers “higher-order” combinations of the original features
• Squared features, products of features, etc.
• Think of a neural network as a “stack” of models
• On the bottom of the stack are the original features
• Each layer in the stack applies a simple model to the outputs of the
previous layer
• Might fit data too well (..to be covered)
Simple Neural Network
(figure)
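As a hedged illustration of the "stack of models" idea (my own sketch, not the slide's figure), here is a tiny two-layer network in NumPy: each layer applies a simple linear-plus-logistic model to the outputs of the layer below. The weights are random, just to show the structure, not a trained model.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(inputs, weights, biases):
    """One layer: a simple logistic-regression-like model applied to the previous layer's outputs."""
    return logistic(inputs @ weights + biases)

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 2.0])                         # original features at the bottom of the stack

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)   # hidden layer: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)   # output layer: 4 hidden units -> 1 output

hidden = layer(x, W1, b1)
output = layer(hidden, W2, b2)
print("hidden activations:", hidden)
print("network output (a class-probability-like score):", output)
```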
Linear Models versus Tree
Induction
• What is more comprehensible to the stakeholders?
• Rules or a numeric function?
• How “smooth” is the underlying phenomenon being modeled?
• Trees need a lot of data to approximate curved boundaries
• How “non-linear” is the underlying phenomenon being modeled?
• If it is highly non-linear, a lot of “data engineering” is needed to apply linear models
• How much data do you have?!
• There is a key tradeoff between the complexity that can be modeled
and the amount of training data available
• What are the characteristics of the data: missing values, types
of variables, relationships between them, how many are
irrelevant, etc.
• Trees are fairly robust to these complications