unit 4 regression
Regression technique
Types of Learning
In general, machine learning algorithms can be classified into
three types.
• Supervised learning
• Unsupervised learning
• Reinforcement learning
Supervised learning
A training set of examples with the correct responses (targets) is
provided and, based on this training set, the algorithm
generalises to respond correctly to all possible inputs. This is
also called learning from exemplars. Supervised learning is the
machine learning task of learning a function that maps an input
to an output based on example input-output pairs.
Example supervised learning
Consider the following data regarding patients entering a clinic.
The data consists of the gender and age of the patients and each
patient is labeled as “healthy” or “sick”.
Unsupervised learning
Unsupervised learning is a type of machine learning algorithm
used to draw inferences from datasets consisting of input data
without labeled responses. In unsupervised learning algorithms,
a classification or categorization is not included in the
observations. There are no output values and so there is no
estimation of functions.
Consider the following data regarding patients entering a clinic.
The data consists of the gender and age of the patients.
Reinforcement learning
This is somewhere between supervised and unsupervised
learning.
Reinforcement learning is the problem of getting an agent to act
in the world so as to maximize its rewards.
The algorithm gets told when the answer is wrong, but does not
get told how to correct it. It has to explore and try out different
possibilities until it works out how to get the answer right.
Reinforcement learning is sometimes called learning with a critic
because of this monitor that scores the answer, but does not
suggest improvements.
Evaluating Models
To train and evaluate models, data are often divided into three
sets: the training set, the test set, and the evaluation set.
Training Set
• is used to build the initial model
• may need to “enrich the data” to get enough of the special cases
Test Set
• is used to adjust the initial model
• models can be tweaked to be less idiosyncratic to the training data and
adapted into a more general model
• the idea is to prevent “over-training” (i.e., finding patterns where none exist)
Evaluation Set
• is used to evaluate the model performance
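The three-way division described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `three_way_split`, the 60/20/20 proportions, and the fixed seed are assumptions made for the example, not values given in these notes.

```python
import random


def three_way_split(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle records and split them into training, test, and evaluation sets.

    The fractions are illustrative defaults (an assumption); whatever is left
    after the training and test portions becomes the evaluation set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]
    return train, test, evaluation


# Hypothetical data: 100 record identifiers.
train, test, evaluation = three_way_split(range(100))
print(len(train), len(test), len(evaluation))
```

The evaluation set is never touched while the model is being built or adjusted, so it gives an estimate of performance on unseen data.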
Test and Evaluation Sets
Reading too much into the training set (overfitting)
• a common problem with most data mining algorithms
• the resulting model works well on the training set but performs poorly on unseen
data
• the test set can be used to “tweak” the initial model and to remove unnecessary
inputs or features
Cross Validation
Cross validation is a heuristic that works as follows:
• randomly divide the data into n folds, each with approximately the same
number of records
• create n models using the same algorithm and training parameters; each model
is trained with n-1 folds of the data and tested on the remaining fold
• can be used to find the best algorithm and its optimal training parameters
Steps in Cross Validation
1. Divide the available data into a training set and an evaluation set
2. Split the training data into n folds
3. Select an algorithm and training parameters
4. Train and test n models using the n train-test splits
5. Repeat steps 2 to 4 using different algorithms / parameters and compare
model accuracies
6. Select the best model
7. Use all the training data to train the model
8. Assess the final model using the evaluation set
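Steps 2 to 6 above can be sketched roughly as follows. The helper names (`make_folds`, `cross_validate`, `fit_mean`, `fit_line`), the synthetic data, and the use of mean squared error as the accuracy measure are all assumptions made for this illustration. Steps 1, 7 and 8 (holding out an evaluation set, retraining on all the training data, and the final assessment) are left out.

```python
import random


def make_folds(records, n_folds, seed=0):
    """Step 2: randomly divide the records into n folds of near-equal size."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cross_validate(records, n_folds, fit, seed=0):
    """Step 4: train n models, each tested on the fold it did not see.

    `fit` takes a list of (x, y) pairs and returns a prediction function.
    Returns the mean squared error averaged over the n held-out folds.
    """
    folds = make_folds(records, n_folds, seed)
    fold_errors = []
    for i, held_out in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        predict = fit(training)
        mse = sum((y - predict(x)) ** 2 for x, y in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / n_folds


def fit_mean(training):
    """Candidate algorithm A (hypothetical): always predict the mean target."""
    y_bar = sum(y for _, y in training) / len(training)
    return lambda x: y_bar


def fit_line(training):
    """Candidate algorithm B (hypothetical): least-squares straight line."""
    n = len(training)
    x_bar = sum(x for x, _ in training) / n
    y_bar = sum(y for _, y in training) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in training)
    s_xx = sum((x - x_bar) ** 2 for x, _ in training)
    w1 = s_xy / s_xx
    w0 = y_bar - w1 * x_bar
    return lambda x: w0 + w1 * x


# Synthetic, exactly linear data used only for illustration.
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]

# Steps 3, 5 and 6: score each candidate with 5-fold CV and keep the best.
scores = {name: cross_validate(data, 5, fit)
          for name, fit in [("mean", fit_mean), ("line", fit_line)]}
best = min(scores, key=scores.get)
print(scores, best)
```

Because every candidate is scored on the same folds (same seed), the comparison between algorithms is not affected by how the data happened to be divided.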
Example – 5 Fold Cross Validation
Linear Regression
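Linear regression: involves a response variable y and a single predictor
variable x
$$y = w_0 + w_1 x$$
w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
$$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
where |D| is the number of training records and $\bar{x}$, $\bar{y}$ are the means of the x and y values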
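As a minimal sketch, the least-squares estimates can be computed directly from the two sums in the formula. The function name `least_squares_fit` and the five data points are made up for illustration.

```python
def least_squares_fit(xs, ys):
    """Return (w0, w1) for the best-fitting line y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w1 = s_xy / s_xx
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical sample: x_bar = 3, y_bar = 4, s_xy = 6, s_xx = 10.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
w0, w1 = least_squares_fit(xs, ys)
print(w0, w1)  # slope is 6/10 = 0.6, intercept is about 4 - 0.6 * 3 = 2.2
```

The fitted line always passes through the point of means, since w0 is defined as the intercept that makes that true.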
[Figure: regression line fitted to data points, labelled with Intercept = β0, Slope = β1, the predicted value of y for xi, and the random error εi for this x value]
Estimated Regression Model
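$$\hat{y}_i = w_0 + w_1 x_i$$
where $x_i$ is the independent variable
$$\sum e^2 = \sum (y - \hat{y})^2 = \sum (y - (w_0 + w_1 x))^2$$
$$w_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$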
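For reference, the coefficient formulas follow from minimising the sum of squared errors. The derivation below is a standard sketch; the symbol E for the error sum is introduced here and is not part of the original notes.

```latex
\begin{align*}
E(w_0, w_1) &= \sum_i \bigl(y_i - (w_0 + w_1 x_i)\bigr)^2 \\
\frac{\partial E}{\partial w_0} &= -2 \sum_i (y_i - w_0 - w_1 x_i) = 0
  \;\Rightarrow\; w_0 = \bar{y} - w_1 \bar{x} \\
\frac{\partial E}{\partial w_1} &= -2 \sum_i x_i (y_i - w_0 - w_1 x_i) = 0 \\
\text{substituting } w_0:\quad
  0 &= \sum_i x_i \bigl((y_i - \bar{y}) - w_1 (x_i - \bar{x})\bigr) \\
w_1 &= \frac{\sum_i x_i (y_i - \bar{y})}{\sum_i x_i (x_i - \bar{x})}
     = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\end{align*}
```

The last equality holds because $\sum_i \bar{x}(y_i - \bar{y}) = 0$ and $\sum_i \bar{x}(x_i - \bar{x}) = 0$, so subtracting $\bar{x}$ inside each sum changes nothing.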
General Form of Linear Functions
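$$w_1 = \frac{\mathrm{cov}[x, y]}{\mathrm{var}[x]}$$
$$w_0 = \bar{y} - w_1 \bar{x}$$
where $\bar{x}$, $\bar{y}$ are the means of training x, y
$$\hat{y}_t = w_0 + w_1 x_t$$
for test sample $x_t$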
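A short sketch of this covariance form, including the prediction for a test sample, might look like the following. The function names and the four training points are hypothetical. Population covariance (dividing by n) is used here as a choice; the n−1 version gives the same slope because the factor cancels in the ratio.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance of xs and ys; covariance(xs, xs) is the variance."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)


def fit(train_x, train_y):
    """Estimate (w0, w1) from the training data only."""
    w1 = covariance(train_x, train_y) / covariance(train_x, train_x)
    w0 = mean(train_y) - w1 * mean(train_x)
    return w0, w1


def predict(w0, w1, x_t):
    """Predicted value for a test sample x_t."""
    return w0 + w1 * x_t


# Hypothetical training data lying exactly on y = 1 + 2x.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit(train_x, train_y)
y_hat = predict(w0, w1, 10.0)
print(w0, w1, y_hat)
```

Only the training means, covariance, and variance enter the fit; the test sample is used solely at prediction time.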
mtcars dataset
plot
Regression model
Training and test set
Regression model
Multivariate regression