
Machine Learning and Big Data Analytics Section 1

Emily Mower
September 2018

Note that the material in these notes draws on the excellent and more thorough treatment
of these topics in Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor
Hastie and Robert Tibshirani.

1 Important Machine Learning Concepts


1.1 Regression vs Classification
Prediction problems can be defined based on the characteristics of the outcome variable we want
to predict.

Regression problems are those where the outcome is quantitative (e.g., a home sale price).

Classification problems are those where the outcome is qualitative / categorical (e.g., a student's college major).

Sometimes the same methods can be used for regression and classification problems, but many
methods are useful for only one of the two problem types.

1.2 Bias-Variance Trade-off


The variance of a statistical learning method is the amount by which the prediction function would change if it were estimated on a different training set. A model that overfits has high variance, whereas a model that underfits has low variance.
To remember the difference between low-variance and high-variance models, I find it helpful to think of examples. Suppose your model were “use the mean of the training data as the predicted value for all new data points.” The mean shouldn’t change much across training sets, so this model has low variance. On the other hand, a model that picks up super complex patterns is likely to be picking up noise in addition to signal. The noise will vary by training set, so such a method would have high variance.

The bias of a statistical learning method is the error introduced by approximating a real-world problem with a (typically much simpler) statistical model. Very flexible models (which are prone to overfitting) can capture complex patterns and so tend to have low bias. Very simple models (which are prone to underfitting) are limited in their ability to pick up patterns and so may have high bias.
The book uses the example of representing a non-linear function by a linear one to show that
no matter how much data you have, a linear model will not do a great prediction job when the
process generating the data is non-linear. Bias also applies to methods that might not fit your
traditional concept of a statistical function. In the K-Nearest Neighbors section, we will discuss
bias in that setting.

Often, we will talk about the bias-variance trade-off . In an ideal world, we would find a
model that has low variance and low bias, because that would yield a good and consistent model.
In practice, you usually have to allow bias to increase in order to decrease variance and vice versa.
However, there are many models that will decrease one (bias or variance) significantly while only
increasing the other a little.
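One way to make this trade-off concrete (this is the standard decomposition, covered in more depth in ISL, and it assumes the outcome is generated as $y = f(x) + \epsilon$) is to write the expected test error at a point $x_0$ as

$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}(\hat{f}(x_0)) + \big[\mathrm{Bias}(\hat{f}(x_0))\big]^2 + \mathrm{Var}(\epsilon).$

The variance term tends to grow with model flexibility, the squared-bias term tends to shrink with it, and the last term (the irreducible error) is a floor that no model can get under.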

1.3 Supervised v. Unsupervised Learning
Supervised learning refers to problems where there is a known outcome. In these problems, you
can train a model to take features and predict the known outcome.
Unsupervised learning refers to problems where you are interested in uncovering patterns and
do not have a target outcome in mind.

An example of supervised learning would be using students’ high school grades, class enrollments, and demographic variables to predict whether or not they attend college.
An example of unsupervised learning would be using the same grade, enrollment, and demographic features to identify “types” of high school students, that is, groups of students who look similar according to these features. Perhaps you are interested in this because you want to make classes that contain a mix of different types of students. Often, unsupervised learning is useful for creating features for supervised learning problems, but sometimes uncovering patterns is the final objective.

1.4 Measuring Model Performance


There are different functions you can use to measure model performance, and which function you
choose depends on your data and your objective. These functions are called “loss functions,” which
is a somewhat intuitive name when you think about the fact that your machine learning algorithm
is trying to minimize this function and thus minimize your loss.
To understand how and why loss functions depend on your data and objectives, examples can
be helpful.
Consider first that you are trying to predict the future college majors of this year’s incoming
freshmen (a classification problem). In this case, your prediction will either be right (you predict
the major they end up choosing) or it will be wrong. Therefore, you might use accuracy (% correct)
to measure model performance.
What if, though, you cared more about being wrong for some majors than others? For example, imagine that all biology majors are going to need personalized lab equipment in their junior year, and that the lab equipment is really expensive if ordered at the last minute but a lot cheaper if ordered a year or more in advance. Then you might want to give more weight to people who end up being biology majors, so that your model does better at predicting biology majors than other majors.
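As a rough sketch of what this might look like in code (the majors, the predictions, and the 5x weight below are made up for illustration), plain accuracy treats every mistake equally, while a weighted error penalizes mistakes on biology majors more heavily:

import numpy as np

# Hypothetical predicted and true majors for five students (labels made up)
y_true = np.array(["bio", "econ", "bio", "cs", "econ"])
y_pred = np.array(["bio", "econ", "cs", "cs", "cs"])

# Plain accuracy: every mistake counts the same
accuracy = np.mean(y_pred == y_true)            # 3/5 = 0.6

# Weighted error: a mistake on a biology major counts five times as much
weights = np.where(y_true == "bio", 5.0, 1.0)
weighted_error = np.sum(weights * (y_pred != y_true)) / np.sum(weights)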
Now consider that you are trying to predict home prices (a regression problem). You might
measure your performance using mean-squared error (MSE), which is found by taking the difference
between the predicted sale price for each home and the true sale price (the error), squaring it for
each home, and then taking the mean of these squared errors. However, home prices are skewed
(e.g. some homes are extremely expensive compared to most homes on the market). This means
that a 5% error on a $3 million home is a lot bigger than a 5% error on a $100,000 home. When
you square the errors (as you do when calculating MSE), the difference becomes enormous.
But since both errors are 5%, maybe you want to penalize them the same. One option is to
use Mean Percentage Error (MPE), but this has the weird effect that if you over-predict one home
by 5% and under-predict the other by 5%, your MPE is zero. Therefore, a popular option is to
use the Mean Absolute Percentage Error (MAPE), which is the mean of the absolute values of the
percentage errors and thus would be 5% in this example.
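A small sketch, with made-up sale prices, of how these three loss functions compare on the example above:

import numpy as np

# Hypothetical sale prices: a $3 million home and a $100,000 home,
# each mispredicted by 5% (one over, one under)
y_true = np.array([3_000_000.0, 100_000.0])
y_pred = np.array([3_150_000.0, 95_000.0])

errors = y_pred - y_true
pct_errors = errors / y_true

mse = np.mean(errors ** 2)              # dominated by the $3 million home
mpe = np.mean(pct_errors)               # 0.0: the +5% and -5% cancel out
mape = np.mean(np.abs(pct_errors))      # 0.05, i.e. 5% for both homes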
For many prediction problems in the policy sphere, we may not only care about accuracy of
prediction but also about fairness or other objectives. The loss function is a place where we can
explicitly tell the model to optimize for these concerns in addition to predictive performance.

2 k-Nearest Neighbors
2.1 Concept
The idea underlying k-Nearest Neighbors (kNN) is that we expect observations with similar features
to have similar outcomes. kNN makes no other assumptions about functional form, so it is quite
flexible.

2.2 Method
kNN can be used for either regression or classification, though it works slightly differently depending
on what setting we are in. In the classification setting, the prediction is a majority vote of the
observation’s k-nearest neighbors. In the regression setting, the prediction is the average outcome
of the observation’s k-nearest neighbors.
For kNN, bias will be lower when the relationship between features and the outcome is smooth across the feature space. When the relationship is rough, bias will increase quickly as farther-away neighbors are included in the prediction.
The only choice we have to make when implementing kNN is the value of k (i.e., how many neighbors should we use in our prediction?). A good way to find k is through cross-validation, something we will cover a little later, but which broadly involves training the algorithm on one set of data and seeing how well it does on a different set.
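As a minimal sketch of both settings (using scikit-learn as one common implementation, with simulated data standing in for real features), cross-validation can be used to compare candidate values of k:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical data: X is an (n, p) feature matrix, y_class a categorical
# outcome, y_reg a quantitative outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_class = (X[:, 0] + rng.normal(size=200) > 0)
y_reg = X[:, 0] ** 2 + rng.normal(size=200)

for k in [1, 5, 15, 50]:
    # Classification: majority vote among the k nearest neighbors
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y_class, cv=5).mean()
    # Regression: average outcome of the k nearest neighbors
    mse = -cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y_reg, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(k, round(acc, 3), round(mse, 3))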

2.3 Implementation and Considerations


A concern with kNN is whether you have good coverage of your feature space. Imagine that all of
your training points were in one region of the feature space, but some of your test points are far
away from this region. You will still use the k nearest neighbors to predict the outcome for these
far-away test points, but it might not work as well as if the points were close together. Therefore,
when implementing kNN, it’s good to think about how similar the features in your test set will be
to the features in your training set. If they differ systematically, that is a concern (as it would be
for other ML methods as well).
Another important consideration is whether there is an imbalance in the frequency of one
outcome compared to another. For example, suppose we are trying to classify points as “true”
or “false” and most points are “true.” Even if the “false” outcomes are clustered together in the
feature space, if we use a large enough value of k, we will predict “true” for these observations
simply because there are many more “true” observations than “false” observations. Therefore, we
would do better to use a small value for k in this setting.
Another consideration is whether proximity in each variable is equally important or if proximity in one variable is more important than proximity in another variable. Because kNN relies on distances between observations, variables are typically normalized so that they are all on the same scale (e.g., the same mean and variance) before distances are computed, and distance in all normalized variables is then treated the same. If you want to up-weight proximity for some variables and down-weight it for others, you can change the way each variable is scaled to accomplish this, as in the sketch below. Alternatively, you can include only those variables you think are important. When you have this type of uncertainty, there are more principled ways of selecting variables that will be discussed later in the course.
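A minimal sketch of this preprocessing, again using scikit-learn with simulated data: features are standardized before distances are computed, and rescaling a standardized column changes how much proximity in that variable matters.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features measured on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = X[:, 0] + rng.normal(size=200)

# Standardize every feature (mean 0, variance 1) before computing distances,
# so no variable dominates simply because of its units
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10)).fit(X, y)

# To up-weight proximity in the first variable, rescale that column after
# standardization: a larger scale makes distances in that variable count more
X_std = StandardScaler().fit_transform(X)
X_std[:, 0] *= 2.0
knn_weighted = KNeighborsRegressor(n_neighbors=10).fit(X_std, y)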

2.4 Extensions
You might think that neighbors that are really close should be weighted more than neighbors that
are a bit further away. Many people agree, so there are methods to allow you to weight different
observations differently. You might also think that you shouldn’t use just the k nearest neighbors,
but all the neighbors within a certain distance. Or maybe you think there’s information available
in all observations, but there’s more information in closer neighbors. All of these adjustments fall
under the umbrella of kernel regression. In fact, kNN is a special case of kernel regression.
Broadly defined, kernel regression is a class of methods that generate predictions by taking weighted averages of observed outcomes, with weights that depend on how close each observation is to the point being predicted. Because these methods (kNN included) do not specify a functional form, they are called “non-parametric regression” methods.
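A minimal sketch of this idea, a Nadaraya-Watson style kernel regression with a Gaussian kernel on simulated one-dimensional data:

import numpy as np

# Hypothetical one-dimensional training data
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=100)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=100)

def kernel_predict(x0, bandwidth=0.5):
    # Gaussian kernel weight for every training point: closer points get more weight
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    # The prediction is a weighted average of all observed outcomes
    return np.sum(w * y_train) / np.sum(w)

print(kernel_predict(3.0))

Replacing the Gaussian kernel with one that puts equal weight on the k closest training points and zero weight on everything else recovers kNN regression.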
