
Week 9

KQC7015 Test
Date : 22 December 2024
Time : 1.00 – 2.00 pm (1 hour)
Venue : Block Y, Department of Electrical Engineering
What is C?

Like most statistical models, regression seeks to minimize a cost function.

So let's first start by thinking about what a cost function is.

A cost function tries to measure how wrong you are. So if my prediction was right, then there should be no cost; if I am just a tiny bit wrong, there should be a small cost.
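As a concrete illustration (not part of the original slides), here is a minimal Python sketch of one common cost function, the mean squared error: a perfect prediction incurs zero cost, and the cost grows as the predictions move further from the targets.

```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean squared error: zero when predictions are exactly right,
    small for small errors, large for large errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# The cost grows as the predictions move away from the targets.
print(mse_cost([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0   (no cost when right)
print(mse_cost([1.0, 2.0, 3.0], [1.1, 2.1, 3.1]))  # 0.01  (tiny cost for a tiny error)
print(mse_cost([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # much larger cost for larger errors
```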
Tuning parameters for regression

Lambda (λ) controls the trade-off between allowing the model to increase its complexity as much as it wants and trying to keep it simple.

For example, if λ is very low or 0, the model will have enough power to increase its complexity (overfit) by assigning big values to the weights for each parameter. If we increase the value of λ, the model will tend to underfit, as the model will become too simple.

Parameter C = 1/λ works the other way around. For small values of C, we increase the regularization strength, which will create simple models which underfit the data. For big values of C, we reduce the power of regularization, which implies that the model is allowed to increase its complexity and therefore overfit the data.
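As a hedged illustration of the C parameter described above, the following sketch fits scikit-learn's LogisticRegression (which exposes C = 1/λ) with a small, a medium and a large C; the synthetic dataset and the particular values of C are assumptions for demonstration, not taken from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (purely illustrative).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    # Small C -> strong regularization (simpler model, risk of underfitting).
    # Large C -> weak regularization (more complex model, risk of overfitting).
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    print(f"C={C}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")
```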
Regularization helps us tune and control our model complexity, ensuring that our models are better at making (correct) classifications — or more simply, the ability to generalize.

If we don't apply regularization, our classifiers can easily become too complex and overfit to our training data, in which case we lose the ability to generalize to our testing data (and data points outside the testing set as well).
 Similarly, if we apply too much regularization we run the risk of underfitting.

 In this case, our model performs poorly on the training data — our classifier is not able to model the relationship between the input data and the output class labels.
Types of Logistic Regression

1. Binary Logistic Regression
• The categorical response has only two possible outcomes.
• Example: Spam or Not

2. Multinomial Logistic Regression
• Three or more categories without ordering.
• Example: Predicting which food is preferred more (very healthy, non-healthy, moderately healthy)

3. Ordinal Logistic Regression
• Three or more categories with ordering.
• Example: Movie rating from 1 to 5
EXAMPLES :

Spam Detection : Predicting if an email is Spam or not
Credit Card Fraud : Predicting if a given credit card transaction is fraud or not
Health : Predicting if a given mass of tissue is benign or malignant
Marketing : Predicting if a given user will buy an insurance product or not
Banking : Predicting if a customer will default on a loan.
Principal Component Analysis (PCA)

 A dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the large set.
 Reducing the number of variables comes at the expense of accuracy; the idea is to trade a little accuracy for simplicity.
 Smaller data sets are easier to explore and visualize.
 Data are easier to analyze and faster for machine learning algorithms to process.

HOW DO YOU DO A PCA?

1. Standardize the range of continuous initial variables
2. Compute the covariance matrix to identify correlations
3. Compute the eigenvectors and eigenvalues of the covariance matrix
to identify the principal components
4. Create a feature vector to decide which principal components to keep
5. Recast the data along the principal components axes
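Before walking through each step, here is a minimal end-to-end sketch that performs the same pipeline with scikit-learn (the random data and the choice of two components are illustrative assumptions, not from the slides): standardize, then fit PCA and keep the first two components.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative data: 100 samples, 5 variables (random numbers for demo only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

X_std = StandardScaler().fit_transform(X)   # step 1: standardization
pca = PCA(n_components=2).fit(X_std)        # steps 2-4: covariance, eigendecomposition, component selection
X_reduced = pca.transform(X_std)            # step 5: recast the data along the PC axes

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance carried by each kept PC
```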

1. Standardization
❖ Standardize the range of the continuous initial variables so that each contributes equally to the analysis.
❖ It is critical to perform standardization prior to PCA because PCA is quite sensitive to the variances of the initial variables.
❖ If there are large differences between the ranges of initial variables, those variables with
larger ranges will dominate over those with small ranges
❖ (For example, a variable that ranges between 0 and 100 will dominate over a variable that
ranges between 0 and 1), which will lead to biased results.
❖ Mathematically, this can be done by subtracting the mean and dividing by the standard
deviation for each value of each variable.

❖ Z = (value − mean) / standard deviation

❖ Once the standardization is done, all the variables will be transformed to the same scale.
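A minimal sketch of this step (the data here are hypothetical, nothing from the slides): each column is standardized by subtracting its mean and dividing by its standard deviation.

```python
import numpy as np

# Hypothetical data: 4 samples, 2 variables with very different ranges.
X = np.array([[10.0, 0.1],
              [20.0, 0.4],
              [30.0, 0.2],
              [40.0, 0.3]])

# Z = (value - mean) / standard deviation, applied column by column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~0 for each variable
print(X_std.std(axis=0))   # 1 for each variable
```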
2. COVARIANCE MATRIX COMPUTATION

➢ To understand how the variables of the input data set are varying from the mean with respect to each other.
➢ Sometimes, variables are highly correlated (contain redundant information).
➢ To identify these correlations → compute the covariance matrix.

➢ The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions).

➢ For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:

   Cov(x,x)  Cov(x,y)  Cov(x,z)
   Cov(y,x)  Cov(y,y)  Cov(y,z)
   Cov(z,x)  Cov(z,y)  Cov(z,z)
Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the
main diagonal (Top left to bottom right) we actually have the variances of each initial
variable.
And since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the
covariance matrix are symmetric with respect to the main diagonal, which means that
the upper and the lower triangular portions are equal.

What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?

It's actually the sign of the covariance that matters:

• if positive: the two variables increase or decrease together (correlated)
• if negative: one increases when the other decreases (inversely correlated)
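As an illustration of this step (the data are hypothetical), NumPy can compute the covariance matrix directly; note the variances on the main diagonal and the symmetry about it.

```python
import numpy as np

# Hypothetical standardized data: rows = samples, columns = variables x, y, z.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

# np.cov treats rows as variables by default, so pass the transpose.
cov = np.cov(X.T)

print(cov.shape)                # (3, 3): a p x p matrix
print(np.allclose(cov, cov.T))  # True: symmetric about the main diagonal
print(np.diag(cov))             # the variances of x, y and z
```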

3. COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS

 Eigenvectors and eigenvalues → computed from the covariance matrix → principal components of the data.

 Principal components
▪ New variables: linear combinations or mixtures of the initial variables. They are uncorrelated, and most of the information within the initial variables is squeezed or compressed into the first components.
▪ So, the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on.
Reduce dimensionality without losing much information, by discarding the components with low information and considering the remaining components as your new variables.

Principal components are less interpretable and don't have any real meaning, since they are constructed as linear combinations of the initial variables.

PCs represent the directions of the data that explain a maximal amount of variance: the lines that capture most of the information in the data.

The larger the variance carried by a line, the larger the dispersion of the data points along it; and the larger the dispersion along a line, the more information it has.
 The first principal component accounts for the largest possible variance in the data set.

 For example, let's assume that the scatter plot of our data set is as shown; can we guess the first principal component?

 Yes, it's approximately the line that matches the purple marks, because it's the line in which the projection of the points (red dots) is the most spread out.

 Or, mathematically speaking, it's the line that maximizes the variance (the average of the squared distances from the projected points (red dots) to the origin).

 2nd PC → with the condition that it is uncorrelated with (i.e., perpendicular to) the 1st PC and that it accounts for the next highest variance.

 This continues until a total of p principal components have been calculated, equal to the original number of variables.

 Every eigenvector has an eigenvalue, and their number is equal to the number of dimensions of the data.

 For example, for a 3-dimensional data set, there are 3 variables, therefore there are 3 eigenvectors with 3 corresponding eigenvalues.

 The eigenvectors of the covariance matrix are actually the directions of the axes where there is the most variance (most information), and these are what we call the PCs.

 Eigenvalues are simply the coefficients attached to the eigenvectors, which give the amount of variance carried in each PC.

 By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the PCs in order of significance.
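A small sketch of this step (using the same kind of hypothetical data as before, not from the slides): compute the eigendecomposition of the covariance matrix and sort the eigenvectors by decreasing eigenvalue.

```python
import numpy as np

# Illustrative data and its covariance matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
cov = np.cov(X.T)

# eigh is appropriate for symmetric matrices such as a covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort from highest to lowest eigenvalue: PCs in order of significance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]    # column i is the i-th principal component

print(eigenvalues)                       # variance carried by each PC
print(eigenvalues / eigenvalues.sum())   # proportion of total variance per PC
```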

4: FEATURE VECTOR
i. Choose whether to keep all these components or discard those of lesser significance (those with low eigenvalues).

ii. Form, with the remaining ones, a matrix of vectors that we call the feature vector.

iii. The feature vector is simply a matrix that has as columns the eigenvectors of the components we decide to keep.

iv. This makes it the first step towards dimensionality reduction, because if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.
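A brief sketch of this step (recomputing the illustrative eigendecomposition so the snippet stands alone): slicing out the top-k eigenvector columns gives the feature vector.

```python
import numpy as np

# Illustrative data, covariance matrix and sorted eigenvectors (as in the previous sketch).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X.T))
eigenvectors = eigenvectors[:, np.argsort(eigenvalues)[::-1]]

# Keep only the top k components (k = 2 is an illustrative choice).
k = 2
feature_vector = eigenvectors[:, :k]   # columns = the k most significant eigenvectors
print(feature_vector.shape)            # (3, 2): original dimensions x kept components
```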
5. RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS AXES

❖ In the previous steps, you just select the PCs and form the feature vector, but the input data set always remains in terms of the original axes.

❖ In this step, which is the last one, the aim is to use the feature vector formed using the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the PCs.

❖ This can be done by multiplying the transpose of the original data set by the transpose of the feature vector.

 FinalDataSet = FeatureVector^T * StandardizedOriginalDataSet^T
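A self-contained sketch of this final step (all data and the choice of two components are illustrative assumptions): the standardized data are projected onto the kept principal component axes using the formula above.

```python
import numpy as np

# Illustrative data: 200 samples, 3 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

# Steps 1-4: standardize, covariance matrix, eigendecomposition, feature vector.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std.T))
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order[:2]]     # keep the 2 strongest PCs

# Step 5: FinalDataSet = FeatureVector^T * StandardizedOriginalDataSet^T
final_data = (feature_vector.T @ X_std.T).T     # transpose back to samples x components

print(final_data.shape)                         # (200, 2)
```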


2-D convolution
Convolution in higher dimensions
Pooling and invariance
Max pooling provides invariance to small shifts of the input tensor.
(Figure annotations: one value falls off after the shift; a new value is added after the shift.)
Stride and tiling
Max Pooling Layer

The max pooling layer helps reduce the spatial size of the convolved features and also helps reduce over-fitting by providing an abstracted representation of them.

It is a sample-based discretization process. It is similar to the convolution layer, but instead of taking a dot product between the input and the kernel, we take the max of the region of the input overlapped by the kernel.

Below is an example which shows a maxpool layer's operation with a kernel of size 2 and stride of 1.
(Figure: max pooling operation, step 2.)
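Since the original figure is not reproduced here, the following sketch (with made-up numbers) performs the same operation: a 2×2 max pool with stride 1 over a small input.

```python
import numpy as np

def max_pool2d(x, kernel_size=2, stride=1):
    """Slide a kernel_size x kernel_size window over x and keep the max of each region."""
    h, w = x.shape
    out_h = (h - kernel_size) // stride + 1
    out_w = (w - kernel_size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = x[i * stride:i * stride + kernel_size,
                       j * stride:j * stride + kernel_size]
            out[i, j] = region.max()
    return out

# Made-up 3x3 input; with kernel size 2 and stride 1 the output is 2x2.
x = np.array([[1, 3, 2],
              [4, 6, 5],
              [7, 8, 9]])
print(max_pool2d(x, kernel_size=2, stride=1))
# [[6. 6.]
#  [8. 9.]]
```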
Random forest method

 The random forest is a classification algorithm consisting of many decision trees.

 It uses feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

 A random forest is created by randomly splitting the data.

 Each decision tree is formed using feature selection indicators such as the information gain or gain ratio of each feature.

 Each tree is built on an independent sample of the data.

 For a classification problem, each tree casts a vote and the class with the highest number of votes is chosen.

 For regression, the average of all the trees' outputs is declared as the result.
Assumptions for Random Forest

 There should be some actual values in the feature variables of the dataset, so that the classifier can predict accurate results rather than guessed results.

 The predictions from each tree must have very low correlations.
Random Forest works in two phases:
1) first, create the random forest by combining N decision trees;
2) second, make predictions for each tree created in the first phase.

The working process can be explained in the steps below:

 Step-1: Select K random data points from the training set.
 Step-2: Build the decision trees associated with the selected data points (subsets).
 Step-3: Choose the number N of decision trees that you want to build.
 Step-4: Repeat Steps 1 & 2.
 Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority vote.
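As a hedged illustration of this workflow (the synthetic dataset and parameter choices are assumptions for demonstration), scikit-learn's RandomForestClassifier bundles these steps: N trees (n_estimators), an independent bootstrap sample per tree, feature randomness at each split, and majority voting at prediction time.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = N trees; bootstrap=True draws an independent sample per tree;
# max_features controls the feature randomness used when splitting nodes.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# Prediction by committee: each tree votes and the majority class wins.
print(forest.predict(X_test[:5]))
print("test accuracy:", forest.score(X_test, y_test))
```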
Four sectors where Random Forest is mostly used:

Banking: the banking sector mostly uses this algorithm for the identification of loan risk.
Medicine: with the help of this algorithm, disease trends and the risks of the disease can be identified.
Land Use: we can identify areas of similar land use with this algorithm.
Marketing: marketing trends can be identified using this algorithm.
