Regression Analysis

The document discusses various cost functions used in machine learning, particularly in regression analysis, including Mean Squared Error (MSE) and Cross-Entropy Loss. It explains linear regression, polynomial regression, and the importance of regularization techniques like LASSO to prevent overfitting. Additionally, it covers the use of learning curves for model evaluation and selection based on performance metrics.


Cost Functions

• Definition: A cost function is a mathematical function that quantifies the error or difference between a machine learning model's predictions (outputs) and the actual outputs of the data. It measures how well the predictions match the actual data.
• The goal of a machine learning algorithm is to minimize this cost during the training process. This involves adjusting the model parameters.

1. Mean Squared Error (MSE): for Regression
   J(θ) = (1/m) Σᵢ₌₁ᵐ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²
   Penalizes large errors more heavily.
2. Mean Absolute Error (MAE): for Regression Analysis
   J(θ) = (1/m) Σᵢ₌₁ᵐ |ŷ⁽ⁱ⁾ − y⁽ⁱ⁾|
   Provides a measure that is robust against outliers.
3. Log Loss (Binary Cross-Entropy): for Binary Classification
   J(θ) = −(1/m) Σᵢ₌₁ᵐ [y⁽ⁱ⁾ log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾)]
   Measures how well the predicted probabilities match the actual class labels.
• Hinge Loss: for Support Vector Machines
   max(0, 1 − y·ŷ)
   Encourages a margin of separation between classes.
• Categorical Cross-Entropy: for Multi-class Classification.
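• As a quick illustration (not part of the original slides), the NumPy sketch below computes the MSE, MAE, and binary cross-entropy costs for small, made-up arrays of labels and predictions; all variable names are illustrative.

# Illustrative implementations of the cost functions above (assumed example, NumPy only)
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: penalizes large errors more heavily
    return np.mean((y_pred - y_true) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: more robust against outliers
    return np.mean(np.abs(y_pred - y_true))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Log loss: compares predicted probabilities with the actual 0/1 class labels
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))

labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(labels, probs))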
a
Linear Regression

• A linear statistical technique that models the relationship between a dependent variable Y and one or more independent variables X.
• Two types:
  • Simple Linear Regression: one independent variable.
  • Multiple Linear Regression: multiple independent variables.
• Definition: Computes a weighted sum of the input features to make predictions:

  ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ   (1)

  • ŷ is the predicted value
  • n is the number of features; xⱼ is the jth feature value
  • θⱼ is the jth model parameter; θ₀ is the bias term; θ₁, θ₂, ..., θₙ are the feature weights
• For simple linear regression (n = 1), equation (1) becomes: ŷ = θ₀ + θ₁x₁
• Concise representation of equation (1) in vectorized form:

  ŷ = h_θ(x) = θ · x

  • h_θ is the hypothesis function, using the model parameters θ
  • θ is the model's parameter vector, containing the bias term θ₀ and the feature weights θ₁ to θₙ
  • x is the instance's feature vector, containing x₀ to xₙ, with x₀ = 1
  • θ · x is the dot product of the vectors θ and x, which is equal to θ₀x₀ + θ₁x₁ + ... + θₙxₙ
• In machine learning, vectors are represented as column vectors, i.e. 2D arrays with a single column. If θ and x are column vectors, then the prediction is given as ŷ = θᵀx, where θᵀ is the transpose of θ and θᵀx is the matrix multiplication.
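• A minimal sketch (not from the slides; the values are made up) showing that equation (1) and its vectorized form θᵀx give the same prediction:

# Equation (1) vs. the vectorized form theta^T x (illustrative values)
import numpy as np

theta = np.array([[4.0], [3.0], [2.0]])  # column vector: bias term, then two feature weights
x = np.array([[1.0], [1.5], [0.5]])      # column vector: x0 = 1, then the feature values

y_hat_sum = sum(theta[j, 0] * x[j, 0] for j in range(theta.shape[0]))  # equation (1)
y_hat_vec = (theta.T @ x)[0, 0]                                        # vectorized form
print(y_hat_sum, y_hat_vec)  # both print 9.5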

" The goal is to find the best-fit line (or hyperplane in higher The Normal Equation:
Linear dimensions) that is, the value of 0, that minimizes the error Linear
Regression between predicted and actual values.
ê= (xX)-xy
Regression "@ is the value of 0 that minimizes the cost function
"The cost function for Linear Regression is the MSE:
" yisthe vector of targetvalues containing y) toy(m)
[=1
# Generate Linear like data to test equation (1)
where hg is the linear regression hypothesis on a training set X.
" The Normal Equation: import numpy as np
" The value of that minimizes MSE can be obtained using a
closed formsolution. np.random.seed(1)
" Closed form solution: A mathematical equation that directly m= 100 #number of instances
gives results X= 2*np.random.rand(m, 1) # column vector
y= 4 +3*X+ np, Random.randn(m, 1) # column vector
Linear
Regression 14
Linear Regression: Computing θ̂

from sklearn.preprocessing import add_dummy_feature

X_b = add_dummy_feature(X)  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

• The actual function used to generate the data is y = 4 + 3x₁ + Gaussian noise.
• Computed value of θ̂:

>>> theta_best
array([[4.21509616],
       [2.77011339]])

• Actual values: θ₀ = 4 and θ₁ = 3, instead of the computed θ₀ = 4.215 and θ₁ = 2.77.

Linear " Making predictions using


>>> X_new = np.array([0], (2]) Linear
Regression Regression Predictions
>>> X_new_b= add_dummy_feature( X_new] # add x0=1 to each
instance
10
>>> Y_predict = X_new_b@theta_best
>>y_predict
array([[4.21509616],
[9.75532293|])
0.5 L0 2.0
1/22/2025
Linear Regression with Scikit-Learn:

>>> from sklearn.linear_model import LinearRegression
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([4.21509]), array([[2.77011]]))
>>> lin_reg.predict(X_new)
array([[4.21509], [9.755322]])

Polynomial Regression

• A method of fitting a linear model to non-linear data.
• Powers of each feature are added as new features to the dataset.
• Example: If x is a feature of the dataset, then x², x³, ... are added to the dataset.

Computational complexity
Limitations of Linear Regression

Polynomial Regression: Example

np.random.seed(1)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly_features = PolynomialFeatures(degree=2, include_bias=False)
>>> X_poly = poly_features.fit_transform(X)
>>> X[0]
array([-0.75275929])
>>> X_poly[0]
array([-0.75275929, 0.56664654])

# Fit a LinearRegression model to this extended training data
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X_poly, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([1.78134581]), array([[0.93366893, 0.56456263]]))


" The PolynomialFeatures class: Transforms the training data by
creating the square (second degree polynomial) of each feature in
the training set as a new feature
" X_poly[o] contains the original feature Xand the new squared
feature

2
" PolynomialFeatures adds all combinations of features up to the The
Polynomial given degree.
10,
300
Regression "Example: For two features, say a and b, PolynomialFeatures with
degree=3 would add the features a, a³, b², b, and the
learning_ 1

combinations ab, a²b, and ab. curve()


" When there are multiple features, Polynomial Regression can find function
relationship between multiple features, whereas a Linear
Regression model cannot learn such relationships. of Scikit
" Consider a linear regression model and polynomial regression Learn
models of degree 2(quadratic polynomial) and a 300 degree model
fitted to the above data. -1 2

"300-degree polynomial overfits the data while linear model


underfits the data.
Flgure Learning Curves
" Quadratic model best fits the data.

" How do we decide on a best performing model?


" How do we decide on a best performing model?
2. Using Learning curves: A plot of the mode's performance
1. Using Cross-Validation to evaluate genera lization on training error and validation error as a function of the
Learning capability of a model Learning training iterations
" An overfitted model is tOo complex while an underfitted
Curves model is a simple model Curves " The model errors on train and validation data are
evaluated at regular intervals and then plotted.
for " The dataset is divided into k-folds for "Models are trained incrementally or several times with
" The model is trained on k-1 folds and tested on one
Machine fold such that each fold is used as validation set exactly
Machine gradually larger subsets of the training data.

Learning once.

" The metric(s) of interest (e.g., accuracy, precision, recal,


Learning " (Incremental Learning: Ability of a model to learn and update
Models F1-score, mean squared error, etc.) are computed for
Models its knowledge continuously as new data becomes available,
each fold. without reguiring access to entire dataset or re-training from
" The metric values across all k-folds for each model are
the model from scratch.)
averaged.
" The model with the best metric is selected.
" The learning_curve) function of
The the model using cross-validation Scikit-Learn: Trains and evaluates # Learning curve of a Linear Regression Model
learning " Defaut: Model is trained on growing
subsets of the training data. The
from sklearn.model_selection import learning_curve
train _sízes, train_scores, valid_scores =learning_curve
curve() " If model supports incremental
learning then, the argument
'exploit incremental_learning is set to True. learning_ LinearRegression(), X, y, trainsizes=np.linspace(0.01,1.0,40),
function
of Scikit
" Return value:
a) Size of training sets at which the model was evaluated
curve() cv=5, scoring="neg_root_mean_squared_error")
train_errors = train_scores.mean(axis=1)
b) The training and validation scores for each cross-validation function
Learn fold. of Scikit
valid_errors = valid_scores.mean(axis=1)

Learn plt.plot(train_sizes, train_errors, "t, linewidth=2,label=train")


plt.plot(train_sizes, valid_errors, "b-,linewidth=3,label=valid")
[..]
plt.show()

Figure: Learning curve of the Linear Regression model (training and validation error versus training set size).

• The model is underfitting.
• Training error:
  • The model fits the few points correctly at first.
  • As new instances are added, the model performs poorly because of the noise and the fact that the data is not linear.
  • It then reaches a plateau.
• Validation error:
  • Initially quite large, because the model is trained on very few training samples.
  • Reduces gradually as the model is trained on an increasing amount of training data.
• How to improve an underfitted model? Use a more complex model or better features.

Learning Curve of a 10th-degree Polynomial
Figure: Learning curve of a 10th-degree polynomial model (training and validation error versus training set size).

• The model is overfitted.
• How to improve an overfitted model?
  1. Input more training data until the validation error reaches the training error.
  2. Regularization: a technique used to prevent overfitting by discouraging overly complex models that fit the training data too well but fail to generalize to unseen data.

Regularized Linear Models

• A penalty term is added to the loss function being optimized.
• This penalty term imposes constraints on the model's parameters, encouraging simpler models and reducing the risk of overfitting.
• Types of Regularization:

1. L1 Regularization - Least Absolute Shrinkage and Selection Operator Regression (Lasso): a regularized version of Linear Regression.
   • Adds the sum of the absolute values of the model parameters (‖w‖₁) to the cost function:

     J(θ) = MSE(θ) + 2α Σᵢ₌₁ⁿ |θᵢ|

" Characteristic: Tends to eliminate the weights of th eleast 4.0 4.0


important features by setting them to 0.
Regularized " Automatically performs feature selection and outputs a Regularized
3.5

3.0
a=0
a=0.1
3.5 a=0
a=le-07
sparse model.
Linear Linear 2.5
3.04
a=l
>>> from sklearn.linear_model import Lasso 2.5
Models >>> lasso_reg = Lasso(alpha=0.1) Models- 2.0

LASSO >>> lasso_reg.fit(X, y) LASSO


1.5 1.5
10
1.04,
Regression >> lasso_ref.predict([1.511)
Regression 0.54 0.5

0.0
array([1.53788]) 0.0 0.5 1.0 1.5 2,.0 2.5 3.0 0.0 o.5 1.0 1.5 2.0 5 3.0

X1 X1

Figure. Alinearmodel (left) and apolynomial model (righe), both


using various levels of Laso neguarization
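• To make the sparsity effect concrete, here is a small demo (not from the slides; the synthetic data and alpha values are assumptions) showing how increasing alpha drives the weights of unhelpful features to exactly 0:

# Lasso drives the coefficients of uninformative features to exactly zero
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 5))
# only the first two features actually influence the target
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 0.1 * rng.standard_normal(200)

for alpha in (0.01, 0.1, 1.0):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_demo, y_demo)
    print(alpha, np.round(lasso.coef_, 3))  # larger alpha -> more zero weights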
Regularized Linear Models: LASSO Regression

• Encourages sparsity in the model by driving some parameters to exactly zero, effectively performing feature selection.
• What is "sparsity"?
  • Many of the model's parameters (weights or coefficients) are zero.
  • If a parameter is zero, it means the corresponding feature (input variable) is effectively ignored by the model, thus performing feature selection.
• How does L1 regularization encourage sparsity?
  • The penalty makes the model prefer smaller parameter values, resulting in some parameter values becoming zero.
  • Example: If a feature isn't very useful for making predictions, L1 regularization will penalize its associated parameter so much that the model decides to "drop" that feature by setting its weight to zero.
• How does sparsity lead to better model performance?
  • Feature Selection: If the model automatically sets some parameters to zero, it effectively "selects" only the important features. This simplifies the model and makes it easier to interpret, because it focuses on only the most relevant input variables.
  • Reducing Overfitting: By ignoring unnecessary features, the model becomes less complex, which helps prevent overfitting.

• Example: Consider the following features to predict house price:
  • x₁: number of bedrooms
  • x₂: size of the house
  • x₃: roof color
  • x₄: distance to school
• A complex model might initially use all these features. But after applying L1 regularization:
  • The model might decide that roof color (x₃) doesn't matter for predicting house prices and set its weight to zero.
  • Now the model effectively ignores x₃ and focuses only on x₁, x₂, and x₄.
• L1 regularization automatically performs feature selection.

Regularized Linear Models: Ridge Regression

• Also called Tikhonov regularization.
• A regularized version of Linear Regression: a regularization term equal to (α/m) Σᵢ₌₁ⁿ θᵢ² is added to the cost function:

  J(θ) = MSE(θ) + (α/m) Σᵢ₌₁ⁿ θᵢ²

• The penalty is proportional to the square of the magnitude of the model's coefficients.
• If w denotes the vector of feature weights (θ₁ to θₙ), then the regularization term is equal to (α/m)(‖w‖₂)², where ‖w‖₂ represents the ℓ₂ norm of the weight vector.
• The regularization term keeps the model weights as low as possible when fitting the model to the data.
• The hyperparameter α controls the amount of regularization applied to the model.
  • If α = 0, then Ridge Regression is just Linear Regression.
  • If α is very large, then all weights end up very close to zero and the result is a flat line going through the data's mean.
Regularized Linear Models: Ridge Regression

• Closed-form solution for Ridge Regression:

  θ̂ = (XᵀX + αA)⁻¹ Xᵀ y

  where A is the (n+1) × (n+1) identity matrix, except with a 0 in the top-left cell, corresponding to the bias term.

>>> from sklearn.linear_model import Ridge
>>> ridge_reg = Ridge(alpha=0.1, solver="cholesky")
>>> ridge_reg.fit(X, y)
>>> ridge_reg.predict([[1.5]])
array([[1.5532]])

Figure: A linear model (left) and a polynomial model (right), both with various levels of Ridge regularization.
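• A hedged NumPy sketch of the closed-form Ridge solution above (not from the slides), assuming the X_b and y built earlier with add_dummy_feature; the alpha value is arbitrary.

# Closed-form Ridge solution: theta = (X^T X + alpha * A)^(-1) X^T y
import numpy as np

alpha = 0.1
A = np.identity(X_b.shape[1])   # (n+1) x (n+1) identity matrix
A[0, 0] = 0                     # 0 in the top-left cell: the bias term is not regularized
theta_ridge = np.linalg.inv(X_b.T @ X_b + alpha * A) @ X_b.T @ y
print(theta_ridge)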

Logistic Regression

• Also called Logit Regression.
• A statistical method used for binary classification problems, where the goal is to predict one of two possible outcomes (e.g., yes/no, 0/1, success/failure).
• Despite its name, logistic regression is a classification algorithm, not a regression algorithm.
• It estimates the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?).
• If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled "1"), and otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled "0"). This makes it a binary classifier.

• What is the need for Logistic Regression?
  Linear regression is not suitable for classification tasks because:
  1. It predicts continuous values, which don't align well with the categorical nature of the outputs in classification.
  2. Predictions are unbounded, whereas probabilities (used in classification) are restricted to the range [0, 1].
  Logistic regression overcomes these issues by using the logistic (or sigmoid) function to model probabilities.
Logistic Regression

• How does Logistic Regression work?
  1. Computes a weighted sum of the input features (plus a bias term), similar to a Linear Regression model.
  2. Outputs the logistic of the result:

     p̂ = h_θ(x) = σ(θᵀx)

• The logistic, denoted σ(·), is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1.
• The Logistic Function:

  σ(t) = 1 / (1 + exp(−t))

Figure: The logistic (sigmoid) function σ(t).

• Logistic Regression Model Prediction: the model estimates the probability p̂ = h_θ(x) that an instance x belongs to the positive class and then makes predictions as follows:

  ŷ = 0 if p̂ < 0.5
  ŷ = 1 if p̂ ≥ 0.5

• Logistic Regression Model Training and Cost Function:
  • Objective of the training: estimate the parameter vector θ so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).
  • The cost function for a single training instance, designed to achieve the above, is given as:

    c(θ) = −log(p̂)       if y = 1
    c(θ) = −log(1 − p̂)   if y = 0
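• A minimal sketch (not from the slides; the parameter values are made up) of the prediction rule above: compute the weighted sum, pass it through the sigmoid, and apply the 0.5 threshold.

# Sigmoid plus a 0.5 threshold = binary prediction
import numpy as np

def sigmoid(t):
    # logistic function: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-t))

theta = np.array([-3.0, 2.0])   # illustrative parameters: bias term, one feature weight
x = np.array([1.0, 2.5])        # x0 = 1, then the feature value

p_hat = sigmoid(theta @ x)      # estimated probability of the positive class
y_pred = int(p_hat >= 0.5)      # predict class 1 if p_hat >= 0.5, else class 0
print(p_hat, y_pred)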

Logistic Regression

• Cost function over the entire training set:

  J(θ) = −(1/m) Σᵢ₌₁ᵐ [y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − p̂⁽ⁱ⁾)]

• Training a Logistic Regression model: the objective of training is to obtain the parameter vector θ in such a way that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).
• When t approaches 0, −log(t) grows very large:
  • The cost will be huge if the model estimates a probability close to 0 for a positive instance.
  • The cost will be large if the model estimates a probability close to 1 for a negative instance.
• −log(t) is close to 0 when t is close to 1:
  • The cost will be close to 0 if the estimated probability is close to 0 for a negative instance.
  • The cost will be close to 0 if the estimated probability is close to 1 for a positive instance.
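• The following sketch (not from the slides; the label and probability values are made up) evaluates J(θ) for a few instances and shows how confidently wrong probabilities dominate the cost:

# Log loss: confident wrong predictions (instances 2 and 4) produce very large costs
import numpy as np

y = np.array([1, 1, 0, 0])              # actual labels
p = np.array([0.9, 0.05, 0.1, 0.95])    # estimated probabilities of the positive class

per_instance = -(y * np.log(p) + (1 - y) * np.log(1 - p))
cost = per_instance.mean()              # J(theta) over these m = 4 instances
print(per_instance)
print(cost)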
Multinomial Logistic Regression

• Multinomial Regression, also called Softmax Regression, extends logistic regression to multiclass problems. It uses the softmax function to predict probabilities for each class and minimizes the cross-entropy loss.
• When to use Multinomial Regression?
  • Multinomial regression is used for datasets that involve more than two categorical outcomes, such as predicting whether a flower is one of three species (e.g., Iris Setosa, Iris Versicolor, or Iris Virginica).
• How does Multinomial Regression work?
  • Given an instance x, the Softmax Regression model first computes a score s_k(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.
  • The softmax score for class k is computed as:

    s_k(x) = (θ⁽ᵏ⁾)ᵀ x
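• A short sketch (not from the slides; the parameter matrix is made up) of the two steps described above: compute a score s_k(x) = (θ⁽ᵏ⁾)ᵀx per class, then apply the softmax (normalized exponential) to turn the scores into class probabilities.

# Softmax Regression scoring: per-class scores, then the normalized exponential
import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max())  # shift for numerical stability
    return exps / exps.sum()

Theta = np.array([[0.5, 1.0, -1.0],       # one parameter row theta^(k) per class (3 classes)
                  [0.2, -0.5, 0.8],
                  [-0.3, 0.4, 0.1]])
x = np.array([1.0, 2.0, 0.5])             # x0 = 1, then two feature values

scores = Theta @ x                        # s_k(x) = theta^(k)^T x for each class k
probs = softmax(scores)
print(probs, probs.argmax())              # probabilities sum to 1; predict the argmax class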


Regularized Logistic Regression Models
