Slide 2
LOGISTIC REGRESSION
In statistics, the logistic model is used to model the probability of a certain class or event existing,
such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes
of events, such as determining whether an image contains a cat, dog, lion, etc.
binomial: Target variable can have only 2 possible types: “0” or “1” which may represent “win” vs
“loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
multinomial: Target variable can have 3 or more possible types which are not ordered (i.e. types
have no quantitative significance) like “disease A” vs “disease B” vs “disease C”.
ordinal: It deals with target variables with ordered categories. For example, a
test score can be categorized as:“very poor”, “poor”, “good”, “very good”. Here,
each category can be given a score like 0, 1, 2, 3.
• Start with binary class problems
How do we develop a classification algorithm?
• Tumour size vs malignancy (0 or 1)
• We could use linear regression
• Then threshold the classifier output (i.e. anything over some value is yes, else
no)
• In this example, linear regression with thresholding seems to work
• It does a reasonable job of stratifying the data points into one of two classes
• But what if we had a single Yes with a very small tumour?
• This would lead to classifying all the existing Yeses as Nos
• Another issue with linear regression
• We know y is 0 or 1
• The hypothesis can give values larger than 1 or less than 0
• So, logistic regression generates a value that is always between 0 and 1
• Logistic regression is a classification algorithm - don't be confused by the name
Hypothesis representation
• What function is used to represent our hypothesis in classification?
• We want our classifier to output values between 0 and 1
• When using linear regression we did hθ(x) = θT x
• For classification hypothesis representation we do hθ(x) = g(θT x)
• Where we define g(z) = 1 / (1 + e^(-z))
• z is a real number
• This is the sigmoid function, or the logistic function
• If we combine these equations we can write out the hypothesis as
hθ(x) = 1 / (1 + e^(-θT x))
• What does the sigmoid function look like?
• It is an S-shaped curve that asymptotes at 0 and 1 and crosses 0.5 at z = 0
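As a quick illustration, here is a minimal NumPy sketch of the sigmoid (the z values are just illustrative):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)): maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# g(0) = 0.5; large positive z pushes g(z) toward 1, large negative toward 0
for z in [-10, -1, 0, 1, 10]:
    print(z, sigmoid(z))
```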
When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that
y=1 on input x
• Example
• If x is a feature vector with x0 = 1 (as always) and x1 = tumourSize
• hθ(x) = 0.7
• This tells the patient there is a 70% chance of the tumour being malignant
hθ(x) = P(y=1 | x; θ)
• What does this mean?
• Probability that y = 1, given x, parameterized by θ
• Since this is a binary classification task we know y = 0 or 1
• So the following must be true
• P(y=1 | x; θ) + P(y=0 | x; θ) = 1
• P(y=0 | x; θ) = 1 - P(y=1 | x; θ)
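To make the probability reading concrete, here is a small sketch (the θ values and tumour size are made up for illustration, not fitted values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-5.0, 0.1])  # illustrative parameters, not fitted values
x = np.array([1.0, 70.0])      # x0 = 1 as always, x1 = tumourSize

p_y1 = sigmoid(theta @ x)      # h_theta(x) = P(y = 1 | x; theta)
print(p_y1, 1 - p_y1)          # P(y=1) and P(y=0) sum to 1
```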
Decision boundary
• This gives a better sense of what the hypothesis function is computing
• One way of using the sigmoid function is:
• When the probability of y being 1 is greater than 0.5 then we can predict y = 1
• Else we predict y = 0
• When exactly is hθ(x) greater than 0.5?
• Look at the sigmoid function
• g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
• So if z is positive, g(z) is greater than 0.5
• z = θT x
• So when
• θT x >= 0
• Then hθ(x) >= 0.5
• So what we've shown is that the hypothesis predicts y = 1 when θT x >= 0
• The corollary is that when θT x <= 0 the hypothesis predicts y = 0
• Let's use this to better understand how the hypothesis makes its predictions
Consider,
hθ(x) = g(θ0 + θ1x1 + θ2x2)
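As a worked example with made-up parameters: take θ0 = -3, θ1 = 1, θ2 = 1. The hypothesis predicts y = 1 whenever -3 + x1 + x2 >= 0, i.e. whenever x1 + x2 >= 3. The straight line x1 + x2 = 3 is the decision boundary: points on one side are classified as 1, points on the other side as 0. Note that the boundary is a property of the parameters θ, not of the training set.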
• If we plug the sigmoid hypothesis into linear regression's squared-error cost, J(θ) is non-convex, so gradient descent can get stuck in local optima
• To get around this we need a different, convex Cost() function, which means we can apply gradient descent:
Cost(hθ(x), y) = -log(hθ(x)) if y = 1
Cost(hθ(x), y) = -log(1 - hθ(x)) if y = 0
The above two functions can be compressed into a single function, i.e.
Cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))
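A minimal sketch of this cost on a single training example (the probabilities are illustrative):

```python
import numpy as np

def cost(h_x, y):
    # -log(h_x) when y = 1, -log(1 - h_x) when y = 0,
    # written as the single compressed expression above
    return -y * np.log(h_x) - (1 - y) * np.log(1 - h_x)

print(cost(0.9, 1))  # confident and correct: small penalty
print(cost(0.9, 0))  # confident and wrong: large penalty
```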
Gradient Descent
Now the question arises: how do we reduce the cost value? This can be done by
using gradient descent. The main goal of gradient descent is to minimize the cost
value, i.e. min J(θ).
To minimize the cost function we run the gradient descent update on each parameter, i.e.
θj := θj - α ∂J(θ)/∂θj
which for this cost function works out to
θj := θj - (α/m) Σi (hθ(x^(i)) - y^(i)) xj^(i)
(updating every θj simultaneously, where α is the learning rate and m is the number of training examples).
Gradient descent has an analogy: imagine being stranded, blindfolded, at the top of
a mountain valley, with the objective of reaching the bottom of the hill. Feeling the
slope of the terrain around you is what anyone would do. That action is analogous to
calculating the gradient, and taking a step is analogous to one iteration of the
update to the parameters. A sketch of the full procedure follows.
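A minimal NumPy sketch of batch gradient descent for logistic regression (the function name and hyperparameter values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    # X is an (m, n) matrix whose first column is all ones (x0 = 1);
    # every theta_j is updated simultaneously on each iteration
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        error = sigmoid(X @ theta) - y        # h_theta(x^(i)) - y^(i) for all i
        theta -= alpha * (X.T @ error) / m    # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta
```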
Multiclass classification problems
• Getting logistic regression for multiclass classification using one vs. all
• Multiclass - more than yes or no (1 or 0)
• Classification with multiple classes for assignment
• Given a dataset with three classes, how do we get a learning algorithm to work?
• Use one vs. all classification to make binary classification work for multiclass classification
• One vs. all classification
• Split the training set into three separate binary classification problems
• i.e. create a new fake training set
• Triangles (1) vs crosses and squares (0): hθ(1)(x) = P(y = 1 | x; θ)
• Crosses (1) vs triangles and squares (0): hθ(2)(x) = P(y = 2 | x; θ)
• Squares (1) vs crosses and triangles (0): hθ(3)(x) = P(y = 3 | x; θ)
• To classify a new input x, run all three classifiers and pick the class i that maximizes hθ(i)(x), as in the sketch below
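A minimal sketch using scikit-learn's LogisticRegression as each per-class binary classifier (the helper names are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all(X, y, classes):
    # One binary classifier per class: class k relabelled as 1, everything else as 0
    return {k: LogisticRegression().fit(X, (y == k).astype(int)) for k in classes}

def predict(models, x):
    # Pick the class whose classifier reports the highest P(y = 1 | x)
    return max(models, key=lambda k: models[k].predict_proba([x])[0, 1])
```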
In K-Nearest Neighbors, data points that are near each other are said to be
neighbors.
Similar cases with the same class labels are near each other.
Thus, the distance between two cases is a measure of their dissimilarity.
There are different ways to calculate the similarity or, conversely, the distance or
dissimilarity of two data points. For example, this can be done using Euclidean distance.
The K-Nearest Neighbors algorithm works as follows (see the sketch after this list):
- Calculate the distance from the new (held-out) case to each of the cases in the
dataset.
- Search for the K observations in the training data that are nearest to the
measurements of the unknown data point.
- Predict the response of the unknown data point using the most popular response
value from the K nearest neighbors.
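A minimal NumPy sketch of those three steps (the function name is made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Euclidean distance from the new case to every case in the dataset
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 2. The k observations nearest to the new point
    nearest = np.argsort(distances)[:k]
    # 3. Most popular response value among those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]
```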
There are two parts in this algorithm that might be a bit confusing: how to select the
correct K, and how to compute the distance between cases.
A low value of K (such as K = 1) gives a highly complex model, which might result in
overfitting. It means the prediction process is not generalized enough to be used for
out-of-sample cases. Out-of-sample data is data that is outside of the dataset used to
train the model. In other words, the model cannot be trusted to predict unknown samples.
It's important to remember that overfitting is bad, as we want a general model that works
for any data, not just the data used for training.
Now, on the opposite side of the spectrum, if we choose a very high value of K, such as
K = 20, then the model becomes overly generalized.
So, how can we find the best value for K?
The general solution is to reserve a part of your data for testing the accuracy of the model.
Once you've done so, choose K = 1, use the training part for modeling, and calculate the
accuracy of prediction using all samples in your test set.
Repeat this process, increasing K, and see which K is best for your model, as sketched below.
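A minimal scikit-learn sketch of this procedure (the dataset and the K range are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # any labelled dataset would do
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a model for K = 1, 2, ... and keep the K with the best test accuracy
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, model.score(X_test, y_test))
```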
Advantages of KNN
1. No training period: KNN is called a lazy learner (instance-based learning). It does not learn
anything in the training period and does not derive any discriminative function from the training
data. In other words, there is no training period for it. It stores the training dataset and learns
from it only at the time of making real-time predictions. This makes the KNN algorithm much
faster than algorithms that do require training, e.g. SVM, linear regression, etc.
2. Since the KNN algorithm requires no training before making predictions, new data can be
added seamlessly without impacting the accuracy of the algorithm.
3. KNN is very easy to implement. There are only two parameters required to implement KNN, i.e.
the value of K and the distance function (e.g. Euclidean or Manhattan).
Disadvantages of KNN
1. Does not work well with large datasets: In large datasets, the cost of calculating the
distance between the new point and each existing point is huge, which degrades the
performance of the algorithm.
2. Does not work well with high dimensions: The KNN algorithm doesn't work well
with high-dimensional data because, with a large number of dimensions, it becomes difficult
for the algorithm to calculate meaningful distances in each dimension.
3. Sensitive to noisy data, missing values and outliers: KNN is sensitive to noise in
the dataset. We need to manually impute missing values and remove outliers.
SUPPORT VECTOR MACHINE (SVM)
A Support Vector Machine is a supervised algorithm that can classify cases by finding
a separator.
SVM works by first mapping data to a high-dimensional feature space so that
data points can be categorized, even when the data are not linearly separable.
Then, a separator is estimated for the data. The data should be transformed in such a
way that a separator could be drawn as a hyperplane.
Therefore, the SVM algorithm outputs an optimal hyperplane that categorizes new examples.
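A minimal scikit-learn sketch (the toy dataset and the kernel choice are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data; the RBF kernel implicitly maps the points into a
# higher-dimensional feature space before the separating hyperplane is fitted
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:5]))  # classify a few examples
```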
DATA TRANSFORMATION
For the sake of simplicity, imagine that our dataset is one-dimensional.
This means we have only one feature, x.
Suppose this data is not linearly separable.
Well, we can transform it into a two-dimensional space. For example, you can increase the
dimension of the data by mapping x into a new space using a function that outputs x and x squared.
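A minimal sketch of that mapping (the sample values and the class rule are made up):

```python
import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])  # illustrative 1-D data
phi = np.column_stack([x, x ** 2])                    # map each x to (x, x^2)

# On the original line, a class split like {|x| > 1.5} vs {|x| <= 1.5} has no
# single threshold that separates it; in (x, x^2) space the horizontal line
# x^2 = 2.25 is a linear separator
print(phi)
```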
ADVANTAGES
- Accurate in high-dimensional spaces
- Memory efficient
DISADVANTAGES
- Not suited to large datasets (training is slow)
- Prone to overfitting if the number of features is much greater than the number of samples
APPLICATIONS
- Image Recognition
- Spam detection