Unit II Final
1. Choose the value of k and the k initial guesses for the centroids. In this example,
k = 3, and the initial centroids are indicated by the points shaded in red, green, and blue
in the following figure.
2. Compute the distance from each data point to each centroid, and assign each point to its closest centroid.
3. Compute the mean of each cluster, which becomes the new centroid. Thus, (xc, yc) is the ordered pair of the arithmetic means of the coordinates of the m points in the cluster. In this step, a centroid is computed for each of the k clusters.
4. Repeat Steps 2 and 3 until the algorithm converges to an answer:
a. Assign each point to the closest centroid computed in Step 3.
b. Compute the centroid of newly defined clusters.
c. Repeat until the algorithm reaches the final answer.
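To make the four steps concrete, here is a minimal NumPy sketch of the procedure; the function name kmeans, the random initialization, and the convergence check are illustrative choices, and empty clusters are not handled.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids (here: k random points from the data).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```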
In k-means, k clusters can be identified in a given
dataset, but what value of k should be selected? The
value of k can be chosen based on a reasonable guess
or some predefined requirement.
However, even then, it would be good to know how much better or worse
having k clusters versus k - 1 or k + 1 clusters would be in explaining the structure
of the data.
Next, a heuristic using the Within Sum of Squares (WSS) metric is examined to
determine a reasonably optimal value of k. Using the distance function d, WSS over the
M data points p1, p2, ..., pM is defined as shown below.

WSS = d(p1, q(1))² + d(p2, q(2))² + ... + d(pM, q(M))²

In other words, WSS is the sum of the squares of the distances
between each data point and the closest centroid.
The term q(i) indicates the closest centroid that is associated
with the ith point. If the points are relatively close to their
respective centroids, the WSS is relatively small.
Thus, if k +1 clusters do not greatly reduce the value of WSS
from the case with only k clusters, there may be little benefit
to adding another cluster.
The heuristic using WSS can provide at least several
possible k values to consider.
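As an illustration of this heuristic, the following sketch computes WSS for several candidate values of k using scikit-learn's KMeans, whose inertia_ attribute is the within-cluster sum of squares; the synthetic data is a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data with no particular structure.
X = np.random.default_rng(0).normal(size=(300, 2))

wss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(model.inertia_)   # inertia_ is the within-cluster sum of squares (WSS)

# Look for the "elbow": the k beyond which adding another cluster barely reduces WSS.
for k, w in zip(range(1, 11), wss):
    print(k, round(w, 1))
```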
When the number of attributes is relatively small, a
common approach to further refine the choice of k is
to plot the data to determine how distinct the
identified clusters are from each other.
In general, the following questions should be considered:
• Are the clusters well separated from each other?
• Do any of the clusters have only a few points?
• Do any of the centroids appear to be too close to each
other?
In the first case, ideally the plot would look like the one
shown in the figure below, when n = 2.
Example of distinct clusters
The clusters are well defined, with considerable space between the
four identified clusters.
However, in other cases, such as in the figure below, the clusters may
be close to each other, and the distinction may not be so obvious.
A good model should have a high accuracy score, but a high
accuracy score alone does not guarantee that the model performs well.
The true positive rate (TPR) shows what percent of positive instances the
classifier correctly identified. It is computed as shown in the following equation.

TPR = TP / (TP + FN)
A well-performing model should have a high TPR that is ideally 1
and a low false positive rate (FPR) and false negative rate (FNR) that are ideally 0.
In some cases, a model with a TPR of 0.95 and an FPR of 0.3 is more acceptable than a
model with a TPR of 0.9 and an FPR of 0.1, even if the second
model is more accurate overall.
Precision is the percentage of instances marked positive that
really are positive, as shown in the following equation.

Precision = TP / (TP + FP)
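As a small illustration, the following sketch computes these rates from confusion-matrix counts; the counts themselves are placeholders.

```python
def rates(tp, fp, tn, fn):
    """Compute the rates described above from confusion-matrix counts."""
    tpr = tp / (tp + fn)           # true positive rate (recall)
    fpr = fp / (fp + tn)           # false positive rate
    fnr = fn / (tp + fn)           # false negative rate
    precision = tp / (tp + fp)     # share of predicted positives that are truly positive
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, fnr, precision, accuracy

# Placeholder counts for illustration.
print(rates(tp=95, fp=30, tn=70, fn=5))
```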
The ROC curve is a common tool to evaluate classifiers.
The abbreviation stands for Receiver Operating
Characteristic, a term used in signal detection to characterize
the trade-off between hit rate and false alarm rate over a
noisy channel.
A ROC curve evaluates the performance of a classifier based
on the TPR and FPR, regardless of other factors such as class
distribution and error costs.
Related to the ROC curve is the area under the curve (AUC).
The AUC is calculated by measuring the area under the ROC
curve.
Higher AUC scores mean the classifier performs better.
The score can range from 0.5 (for the diagonal line
TPR=FPR) to 1.0 (with ROC passing through the top-left
corner).
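A sketch of how a ROC curve and its AUC might be computed with scikit-learn is shown below; the synthetic dataset and the logistic regression classifier are placeholders used only to produce scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data and classifier.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]      # predicted probability of the positive class

# roc_curve returns the FPR and TPR at every score threshold.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))  # 0.5 = diagonal line, 1.0 = perfect classifier
```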
Besides the above two classifiers, several other methods are
commonly used for classification, including:
• Bagging
• Boosting
• Random forest
• Support Vector Machines (SVM)
Bagging (or bootstrap aggregating) uses the bootstrap technique that
repeatedly samples with replacement from a dataset according to a uniform
probability distribution.
"With replacement" means that when a sample is selected for a training or
testing set, the sample is still kept in the dataset and may be selected again.
Because the sampling is with replacement, some samples may appear
several times in a training or testing set, whereas others may be absent.
A model or base classifier is trained separately on each bootstrap sample,
and a test sample is assigned to the class that received the highest number
of votes.
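The following sketch illustrates this idea with decision trees as the base classifier: each tree is trained on a bootstrap sample drawn with replacement, and a test sample receives the class with the most votes. Function and variable names are illustrative, class labels are assumed to be non-negative integers, and the dataset is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def bagging_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Train one tree per bootstrap sample and combine predictions by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_models):
        # Sample n rows *with replacement*: some rows repeat, others are left out.
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.array(votes)                       # shape: (n_models, n_test_samples)
    # Assign each test sample to the class that received the most votes.
    return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print("Bagging accuracy:", (bagging_predict(X_tr, y_tr, X_te) == y_te).mean())
```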
Boosting (for example, AdaBoost) also uses votes for classification to combine the output
of individual models.
In addition, it combines models of the same type. However, boosting is an
iterative procedure where a new model is influenced by the performances
of those models built previously.
Furthermore, boosting assigns a weight to each training sample that
reflects its importance, and the weight may adaptively change at the end of
each boosting round.
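A minimal sketch using scikit-learn's AdaBoostClassifier follows; the reweighting of training samples at each round is handled internally by the library, and the dataset is a synthetic placeholder.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each boosting round fits a new weak learner on a reweighted training set,
# with higher weights on the samples the previous models misclassified.
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("Boosting test accuracy:", boost.score(X_te, y_te))
```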
Bagging and boosting have been shown to have better performance than a
single decision tree.
Random forest is a class of ensemble methods using decision tree
classifiers.
It is a combination of tree predictors such that each tree depends on
the values of a random vector sampled independently and with the
same distribution for all trees in the forest.
A special case of random forest uses bagging on decision trees,
where samples are randomly chosen with replacement from the
original training set.
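A brief sketch using scikit-learn's RandomForestClassifier follows; the parameter values and the synthetic data are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree is grown on a bootstrap sample of the training set, and a random
# subset of features (max_features) is considered at each split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("Random forest accuracy:", rf.score(X_te, y_te))
```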
SVM is another common classification method that combines linear
models with instance-based learning techniques.
Support vector machines select a small number of critical boundary
instances, called support vectors, from each class and build a linear
decision function that separates them as widely as possible. By default,
SVM can efficiently perform linear classifications, and it can be
configured to perform nonlinear classifications as well.
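The sketch below shows a linear and a nonlinear (kernel-based) SVM fit with scikit-learn's SVC; the synthetic data is a placeholder.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Linear SVM: finds the separating hyperplane with the widest margin.
svm_linear = SVC(kernel="linear").fit(X_tr, y_tr)
# Nonlinear SVM: the RBF kernel allows a curved decision boundary.
svm_rbf = SVC(kernel="rbf").fit(X_tr, y_tr)

print("Linear kernel accuracy:", svm_linear.score(X_te, y_te))
print("RBF kernel accuracy:", svm_rbf.score(X_te, y_te))
```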
In general, regression analysis attempts to explain the influence that a set
of variables has on the outcome of another variable of interest.
Often, the outcome variable is called a dependent variable because the
outcome depends on the other variables.
These additional variables are sometimes called the input variables or the
independent variables.
Regression analysis is useful for answering the following kinds of
questions:
• What is a person's expected income?
• What is the probability that an applicant will default on a loan?
Linear regression is a useful tool for answering the first
question, and logistic regression is a popular method for
addressing the second.
Regression analysis is a useful explanatory tool that can
identify the input variables that have the greatest statistical
influence on the outcome.
For example, if it is found that the reading level of 10-year-
old students is an excellent predictor of the students' success
in high school and a factor in their attending college, then
additional importance on reading can be considered,
implemented, and evaluated to improve students' reading
levels at a younger age.
Regression analysis is also used for predictive analysis.
Linear regression is an analytical technique used to model the
relationship between several input variables and a continuous
outcome variable.
A key assumption is that the relationship between an input variable
and the outcome variable is linear.
Although this assumption may appear restrictive, it is often possible
to properly transform the input or outcome variables to achieve a
linear relationship between the modified input and outcome variables.
A linear regression model is a probabilistic one that accounts
for the randomness that can affect any particular outcome.
Based on known input values, a linear regression model
provides the expected value of the outcome variable based
on the values of the input variables, but some uncertainty
may remain in predicting any particular outcome.
Linear regression is often used in business, government,
and other scenarios. Some common practical
applications of linear regression in the real world
include the following:
• Real estate
• Demand forecasting
• Medical
A simple linear regression analysis can be used to model
residential home prices as a function of the home's living
area.
Such a model helps set or evaluate the list price of a home on
the market.
The model could be further improved by including other
input variables such as number of bathrooms, number of
bedrooms, plot size, school district rankings, crime statistics,
and property taxes.
Businesses and governments can use linear regression models
to predict demand for goods and services.
For example, restaurant chains can appropriately prepare for the
predicted type and quantity of food that customers will
consume based upon the weather, the day of the week, whether
an item is offered as a special, the time of day, and the
reservation volume.
Similar models can be built to predict retail sales, emergency
room visits, and ambulance dispatches.
A linear regression model can be used to analyze the effect
of a proposed radiation treatment on reducing tumor sizes.
Input variables might include duration of a single radiation
treatment, frequency of radiation treatment, and patient
attributes such as age or weight.
As the name of this technique suggests, the linear
regression model assumes that there is a linear
relationship between the input variables and the
outcome variable.
This relationship can be expressed as shown in the following equation:

y = β0 + β1x1 + β2x2 + ... + βp-1xp-1 + ε

Where:
y is the outcome variable
xj are the input variables, for j = 1, 2, ..., p - 1
β0 is the value of y when each xj equals zero
βj is the change in y based on a unit change in xj, for j = 1, 2, ..., p - 1
ε is a random error term representing the difference between the linear model and a particular observed value of y
For a regression model with just one input variable, the figure below illustrates
the normality assumption on the error terms and the effect on the outcome
variable, y, for a given value of x.
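As a brief illustration, the following sketch fits such a model with scikit-learn; the data, the true coefficients, and the noise level are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: two input variables and a continuous outcome with noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print("beta_0 (intercept):", model.intercept_)
print("beta_1, beta_2:", model.coef_)
# Expected value of y for a new observation x = (1, 2).
print("prediction:", model.predict([[1.0, 2.0]]))
```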
A major assumption in linear regression modelling is that
the relationship between the input variables and the
outcome variable is linear.
The most fundamental way to evaluate such a relationship
is to plot the outcome variable against each input variable.
If the relationship between Age and Income is represented
as illustrated in the following Figure, a linear model would
not apply.
Figure: Income as a quadratic function of Age
In such a case, it is often useful to do any of the following:
• Transform the outcome variable.
• Transform the input variables.
• Add extra input variables or terms to the regression model.
Common transformations include taking square roots or the
logarithm of the variables.
Another option is to create a new input variable such as the age
squared and add it to the linear regression model.
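A small sketch of this idea follows, assuming a quadratic relationship between Age and Income that is invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented quadratic relationship between Age and Income, for illustration only.
rng = np.random.default_rng(0)
age = rng.uniform(18, 70, size=300)
income = 20.0 + 4.0 * age - 0.04 * age**2 + rng.normal(scale=5.0, size=300)

# Add Age^2 as an extra input variable so the linear model can capture the curvature.
X = np.column_stack([age, age**2])
model = LinearRegression().fit(X, income)
print("intercept:", model.intercept_)
print("coefficients for Age and Age^2:", model.coef_)
```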
As stated previously, it is assumed that the error terms in
the linear regression model are normally distributed with a
mean of zero and a constant variance.
If this assumption does not hold, the various inferences
that were made with the hypothesis tests, confidence
intervals, and prediction intervals are suspect.
The residual plots are useful for confirming that the
residuals were centered on zero and have a constant
variance.
However, the normality assumption still has to be
validated.
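One possible way to produce such diagnostic plots is sketched below, using matplotlib and SciPy's probplot; the fitted model and data are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

# Placeholder model and data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals vs. fitted values: should be centered on zero with roughly constant spread.
plt.scatter(model.predict(X), residuals)
plt.axhline(0.0, color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

# Q-Q plot against the normal distribution to check the normality assumption.
stats.probplot(residuals, dist="norm", plot=plt.figure().gca())
plt.show()
```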
N-Fold Cross-Validation
To prevent overfitting a given dataset, a common practice
is to randomly split the entire dataset into a training set
and a testing set.
Once the model is developed on the training set, the model
is evaluated against the testing set.
When there is not enough data to create training and
testing sets, an N-fold cross-validation technique may be
helpful to compare one fitted model against another.
In N-fold cross-validation, the following occurs:
• The entire dataset is randomly split into N datasets of approximately
equal size.
• A model is trained against N - 1 of these datasets and tested against the
remaining dataset. A measure of the model error is obtained.
• This process is repeated a total of N times across the various
combinations of N datasets taken N - 1 at a time.
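A minimal sketch of N-fold cross-validation for a linear regression model follows, using scikit-learn's KFold with N = 5 and mean squared error as the error measure; the data and model choices are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# Placeholder data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

N = 5
errors = []
# Randomly split into N folds; train on N - 1 folds and test on the held-out fold.
for train_idx, test_idx in KFold(n_splits=N, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("Per-fold MSE:", np.round(errors, 4))
print("Average MSE over", N, "folds:", float(np.mean(errors)))
```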
Recall: the term −0.16×Age means that for each year increase in Age, the value of y will
decrease by 0.16.