Unit II Final

The document provides an overview of machine learning, focusing on supervised and unsupervised learning, with an emphasis on clustering techniques like k-means. It explains the k-means algorithm, its applications in various fields, and considerations for selecting the optimal number of clusters. Additionally, it touches on classification methods, particularly decision trees, and their use in predicting outcomes based on input variables.


 Machine learning is a field of computer science that gives computers the ability
to learn without being explicitly programmed. Supervised learning and
unsupervised learning are two main types of machine learning.
 In supervised learning, the machine is trained on a set of labeled data, which
means that the input data is paired with the desired output. The machine then
learns to predict the output for new input data. Supervised learning is often used
for tasks such as classification, regression, and object detection.
 In unsupervised learning, the machine is trained on a set of unlabeled data, which
means that the input data is not paired with the desired output. The machine then
learns to find patterns and relationships in the data. Unsupervised learning is
often used for tasks such as clustering, dimensionality reduction, and anomaly
detection (a statistical technique that identifies unusual data points or events).
 In general, clustering is the use of unsupervised
techniques for grouping similar objects.
 In machine learning, unsupervised refers to the problem
of finding hidden structure within un-labeled data.
 Clustering techniques are unsupervised in the sense that
the data scientist does not determine, in advance, the
labels to apply to the clusters.
 The structure of the data describes the objects of interest
and determines how best to group the objects.
 Clustering is a method often used for exploratory analysis
of the data.
 In clustering, there are no predictions made.
 Rather, clustering methods find the similarities between
objects according to the object attributes and group the
similar objects into clusters.
 Clustering techniques are utilized in marketing,
economics, and various branches of science.
 A popular clustering method is k-means.
 Given a collection of objects each with n measurable attributes,
k-means is an analytical technique that, for a chosen value of k,
identifies k clusters of objects based on the objects' proximity (closeness)
to the center of the k groups.
 The center is determined as the arithmetic average (mean) of each
cluster's n-dimensional vector of attributes.
 Below figure illustrates three clusters of objects with two attributes.
 Each object in the dataset is represented by a small dot color-coded to
the closest large dot, the mean of the cluster.
 Clustering is often used as a lead-in to classification.
 Once the clusters are identified, labels can be applied
to each cluster to classify each group based on its
characteristics.
 Some specific applications of k-means are :
◦ Image processing,
◦ Medical and
◦ Customer segmentation.
 Video is one example of the growing volumes of unstructured data
being collected. Within each frame of a video, k-means analysis
can be used to identify objects in the video. For each frame, the
task is to determine which pixels are most similar to each other.
 The attributes of each pixel can include brightness, color, and
location, the x and y coordinates in the frame.
 With security video images, for example, successive frames are
examined to identify any changes to the clusters.
 These newly identified clusters may indicate unauthorized access
to a facility.
 Patient attributes such as age, height, weight, systolic and
diastolic blood pressures, cholesterol level, and other attributes
can identify naturally occurring clusters.
 These clusters could be used to target individuals for specific
preventive measures or clinical trial participation.
 Clustering, in general, is useful in biology for the
classification of plants and animals as well as in the field of
human genetics.
 Marketing and sales groups use k-means to better identify customers
who have similar behaviors and spending patterns.
 For example, a wireless provider may look at the following
customer attributes: monthly bill, number of text messages, data
volume consumed, minutes used during various daily periods, and
years as a customer.
 The wireless company could then look at the naturally occurring
clusters and consider tactics to increase sales or reduce the customer
churn rate, the proportion of customers who end their relationship
with a particular company.
 To illustrate the method to find k clusters from a collection of
M objects with n attributes, the two-dimensional case (n = 2)
is examined.
 It is much easier to visualize the k-means method in two
dimensions.
 Because each object in this example has two attributes, it is
useful to consider each object corresponding to the point
(xi, yi), where x and y denote the two attributes and i = 1, 2 ...
M. For a given cluster of m points (m ≤ M), the point that
corresponds to the cluster's mean is called a centroid.
 The k-means algorithm to find k clusters can be described in the following four steps.

1. Choose the value of k and the k initial guesses for the centroids. In this example,
k = 3, and the initial centroids are indicated by the points shaded in red, green, and blue
in the following figure.

Initial starting points for the centroids


2. Compute the distance from each data point (xi, yi) to each centroid.
Assign each point to the closest centroid. This association defines the
first k clusters.

In two dimensions, the distance, d, between any two points, (x1, y1) and
(x2, y2), in the Cartesian plane is typically expressed by using the
Euclidean distance measure:

d = √[(x1 − x2)² + (y1 − y2)²]

In the following figure, the points closest to a centroid are shaded in the
corresponding color.
Points are assigned to the closest centroid
3. Compute the centroid, the center of mass, of each newly defined
cluster from Step 2.
In the following figure, the computed centroids in Step 3 are the
lightly shaded points of the corresponding color. In two dimensions,
the centroid (xc, yc) of the m points in a k-means cluster is calculated
as follows:

xc = (x1 + x2 + ... + xm) / m and yc = (y1 + y2 + ... + ym) / m

Thus, (xc, yc) is the ordered pair of the arithmetic means of the
coordinates of the m points in the cluster. In this step, a centroid is
computed for each of the k clusters.
Compute the mean of each cluster
4. Repeat Steps 2 and 3 until the algorithm converges to an answer :
a. Assign each point to the closest centroid computed in Step 3.
b. Compute the centroid of newly defined clusters.
c. Repeat until the algorithm reaches the final answer.
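 A minimal R sketch of these four steps using the built-in kmeans() function on a small simulated two-attribute dataset (the data and values are hypothetical, not from the original text):

# Hypothetical example: k-means on a two-attribute dataset
set.seed(42)                                   # make the random starting centroids reproducible
objects <- data.frame(x = c(rnorm(50, 0), rnorm(50, 5), rnorm(50, 10)),
                      y = c(rnorm(50, 0), rnorm(50, 5), rnorm(50, 0)))

km <- kmeans(objects, centers = 3)             # Steps 1-4: choose k = 3 and iterate to convergence
km$centers                                     # the k centroids (mean x and y of each cluster)
km$cluster                                     # cluster assignment for each object
plot(objects, col = km$cluster, pch = 19)      # color each point by its assigned cluster
points(km$centers, col = 1:3, pch = 8, cex = 2)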
 In k-means, k clusters can be identified in a given
dataset, but what value of k should be selected? The
value of k can be chosen based on a reasonable guess
or some predefined requirement.

 However, even then, it would be good to know how much better or worse
having k clusters versus k-1 or k+1 cluster would be in explaining the structure
of the data.

 Next, a heuristic using the Within Sum of Squares (WSS) metric is examined to
determine a reasonably optimal value of k. Using the distance function d, WSS is
defined as shown below:

WSS = Σi d(pi, q(i))², summed over all M points pi, where q(i) is the centroid
closest to the ith point.
 In other words, WSS is the sum of the squares of the distances
between each data point and the closest centroid.
 The term q(i) indicates the closest centroid that is associated
with the ith point. If the points are relatively close to their
respective centroids, the WSS is relatively small.
 Thus, if k +1 clusters do not greatly reduce the value of WSS
from the case with only k clusters, there may be little benefit
to adding another cluster.
 The heuristic using WSS can provide at least several
possible k values to consider.
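 A short R sketch of this heuristic, reusing the hypothetical objects data frame from the earlier sketch; the tot.withinss value returned by kmeans() is the WSS for a given k, and plotting it against k reveals where adding clusters stops paying off:

# Compute WSS for k = 1..10 and look for the value of k where the curve flattens
wss <- sapply(1:10, function(k) {
  kmeans(objects, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Within Sum of Squares (WSS)")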
 When the number of attributes is relatively small, a
common approach to further refine the choice of k is
to plot the data to determine how distinct the
identified clusters are from each other.
 In general, the following questions should be considered.
• Are the clusters well separated from each other?
• Do any of the clusters have only a few points?
• Do any of the centroids appear to be too close to each
other?
 In the first case, ideally the plot would look like the one
shown in below figure, when n = 2.
Example of distinct clusters
 The clusters are well defined, with considerable space between the
four identified clusters.
 However, in other cases, such as in below figure, the clusters may
be close to each other, and the distinction may not be so obvious.

Example of less obvious cluster


 K-means is a simple and straightforward method for
defining clusters.
 Once clusters and their associated centroids are
identified, it is easy to assign new objects (for
example, new customers) to a cluster based on the
object's distance from the closest centroid.
 Because the method is unsupervised, using k-means
helps to eliminate subjectivity from the analysis.
 Although k-means is considered an unsupervised method,
there are still several decisions that the practitioner must
make:
a. What object attributes should be included in the
analysis?
b. What unit of measure (for example, miles or
kilometers) should be used for each attribute?
c. Do the attributes need to be rescaled so that one
attribute does not have a disproportionate effect on the results?
d. What other considerations might apply?
 Regarding which object attributes (for example, age
and income) to use in the analysis, it is important to
understand what attributes will be known at the time
a new object will be assigned to a cluster.
 For example, information on existing customers'
satisfaction or purchase frequency may be available,
but such information may not be available for
potential customers.
 The Data Scientist may have a choice of a dozen or more attributes
to use in the clustering analysis. Whenever possible and based on
the data, it is best to reduce the number of attributes to the extent
possible.
 Too many attributes can minimize the impact of the most important
variables.
 Also, the use of several similar attributes can place too much
importance on one type of attribute. For example, if five attributes
related to personal wealth are included in a clustering analysis, the
wealth attributes dominate the analysis and possibly mask the
importance of other attributes, such as age.
 When dealing with the problem of too many
attributes, one useful approach is to identify any
highly correlated attributes and use only one or two
of the correlated attributes in the clustering analysis.
 Another option to reduce the number of attributes is
to combine several attributes into one measure.
 For example, instead of using two attribute variables,
one for Debt and one for Assets, a Debt to Asset ratio
could be used.
 From a computational perspective, the k-means algorithm is
somewhat indifferent to the units of measure for a given attribute
(for example, meters or centimeters for a patient's height).
 However, the algorithm will identify different clusters depending on
the choice of the units of measure.
 For example, suppose that k-means is used to cluster patients based
on age in years and height in centimeters.
 For k=2, below figure illustrates the two clusters that would be
determined for a given dataset.
 But if the height was rescaled from centimeters to meters by
dividing by 100, the resulting clusters would be slightly
different, as illustrated in below Figure.

Cluster with height expressed in meters


 Attributes that are expressed in dollars are common in clustering analyses
and can differ in magnitude from the other attributes.
 For example, if personal income is expressed in dollars and age is
expressed in years, the income attribute, often exceeding $10,000, can
easily dominate the distance calculation with ages typically less than 100
years.
 Although some adjustments could be made by expressing the income in
thousands of dollars (for example, 10 for $10,000), a more straightforward
method is to divide each attribute by the attribute's standard deviation.
 The resulting attributes will each have a standard deviation equal to 1 and
will be without units.
 Returning to the age and height example, the standard deviations are 23.1 years and
36.4 cm, respectively. Dividing each attribute value by the appropriate standard
deviation and performing the k-means analysis yields the result shown in below
Figure.

Cluster with rescaled attributes
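 A minimal sketch of this rescaling in R, assuming a hypothetical patients data frame with age (years) and height (centimeters); scale() divides each attribute by its standard deviation (and, by default, also centers it):

# Hypothetical patient data; units differ widely (years vs. centimeters)
patients <- data.frame(age = c(25, 40, 62, 33, 51, 70),
                       height_cm = c(180, 165, 172, 158, 190, 160))

scaled <- scale(patients)              # each column now has standard deviation 1 (unit-free)
km_raw    <- kmeans(patients, centers = 2, nstart = 25)   # clusters driven mostly by the larger-scale attribute
km_scaled <- kmeans(scaled,   centers = 2, nstart = 25)   # clusters treat both attributes equally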


 In many statistical analyses, it is common to
transform typically skewed data, such as income,
with long tails by taking the logarithm of the data.
 Such transformation can also be applied in k-means,
but the Data Scientist needs to be aware of what
effect this transformation will have.
 The k-means algorithm is sensitive to the starting
positions of the initial centroid.
 Thus, it is important to rerun the k-means analysis
several times for a particular value of k to ensure the
cluster results provide the overall minimum WSS.
 This task is accomplished in R by using the nstart
option in the kmeans() function call, as sketched below.
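 For example (reusing the hypothetical objects data from the earlier sketch), nstart asks kmeans() to try several random sets of initial centroids and keep the solution with the lowest within-cluster sum of squares:

km1  <- kmeans(objects, centers = 3)               # single random start; may find a poor local optimum
km25 <- kmeans(objects, centers = 3, nstart = 25)  # best of 25 random starts
km1$tot.withinss
km25$tot.withinss                                  # should be less than or equal to the single-start WSS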
 K-means clustering is applicable to objects that can be
described by attributes that are numerical with a
meaningful distance measure.
 Interval and ratio attribute types can certainly be used.
 However, k-means does not handle categorical variables
well. For example, suppose a clustering analysis is to be
conducted on new car sales.
 Among other attributes, such as the sale price, the color of
the car is considered important.
 Although one could assign numerical values to the
color, such as red = 1, yellow = 2, and green = 3, it is
not useful to consider that yellow is as close to red as
yellow is to green from a clustering perspective.
 In such cases, it may be necessary to use an alternative
clustering methodology.
 Classification is a supervised machine learning
method where the model tries to predict the correct
label of a given input data.
 In classification, the model is fully trained using the
training data, and then it is evaluated on test data
before being used to perform prediction on new
unseen data.
 A decision tree (also called prediction tree) uses a tree
structure to specify sequences of decisions and consequences.
 Given input X = {x1,x2,...xn}, the goal is to predict a
response or output variable Y.
 Each member of the set {x1,x2,...xn} is called an input
variable.
 The prediction can be achieved by constructing a decision tree
with test points and branches.
 At each test point, a decision is made to pick a specific branch
and traverse down the tree. Eventually, a final point is reached,
and a prediction can be made. Due to its flexibility and easy
visualization, decision trees are commonly deployed in data
mining applications for classification purposes.
 The input values of a decision tree can be categorical or
continuous.
 A decision tree employs a structure of test points (called
nodes) and branches, which represent the decision being made.
 A node without further branches is called a leaf node. The leaf
nodes return class labels and, in some implementations, they
return the probability scores.
 A decision tree can be converted into a set of decision rules. In
the following example rule, income and mortgage_amount are
input variables, and the response is the output variable default
with a probability score.
 IF income < $50,000 AND mortgage_amount > $100K THEN
default = True WITH PROBABILITY 75%.
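 As an illustrative sketch (the loans data frame, its column names, and values are hypothetical, not from the original text), a tree that yields rules of this form can be fit in R with the rpart package; printing the fitted tree lists the decision rules with class labels and probability scores at each leaf:

library(rpart)

# Hypothetical loan records with the two input variables and the output variable
set.seed(1)
loans <- data.frame(
  income          = runif(500, 20000, 120000),
  mortgage_amount = runif(500, 50000, 300000)
)
loans$default <- factor(ifelse(loans$income < 50000 & loans$mortgage_amount > 100000,
                               rbinom(500, 1, 0.75),   # roughly 75% defaults in this region
                               rbinom(500, 1, 0.10)))  # few defaults elsewhere

fit <- rpart(default ~ income + mortgage_amount, data = loans, method = "class")
print(fit)   # each line is a rule: split condition, class label, and probability scores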
 Decision trees have two varieties: classification trees and
regression trees.
 Classification trees usually apply to output variables that are
categorical—often binary—in nature, such as yes or no,
purchase or not purchase, and so on.
 Regression trees, on the other hand, can apply to output
variables that are numeric or continuous, such as the predicted
price of a consumer good or the likelihood a subscription will
be purchased.
 The following Figure shows an example of using a decision
tree to predict whether customers will buy a product.
 The term branch refers to the outcome of a decision and is
visualized as a line connecting two nodes.
 If a decision is numerical, the "greater than" branch is usually
placed on the right, and the "less than" branch is placed on the left.
Depending on the nature of the variable, one of the branches may
need to include an "equal to" component.
 Internal nodes are the decision or test points. Each internal node
refers to an input variable or an attribute.
 The top internal node is called the root.
 The decision tree in the above Figure is a binary tree in that each
internal node has no more than two branches.
 The branching of a node is referred to as a split.
 The depth of a node is the minimum number of steps required to
reach the node from the root.
 In the above Figure, for example, nodes Income and Age have a
depth of one, and the four nodes on the bottom of the tree have a
depth of two.
 Leaf nodes are at the end of the last branches on the tree. They
represent class labels—the outcome of all the prior decisions.
 The path from the root to a leaf node contains a series of decisions
made at various internal nodes.
 The decision tree in the above Figure shows that females with
income less than or equal to $45,000 and males 40 years old or
younger are classified as people who would purchase the product.
 In traversing this tree, age does not matter for females, and
income does not matter for males.
 Decision trees are widely used in practice.
1. To classify animals, questions (like cold-blooded or warm-
blooded, mammal or not mammal) are answered to arrive at a
certain classification.
2. A checklist of symptoms during a doctor's evaluation of a
patient.
3. The artificial intelligence engine of a video game commonly
uses decision trees to control the autonomous actions of a
character in response to various scenarios.
4. Retailers can use decision trees to segment customers or
predict response rates to marketing and promotions.
5. Financial institutions can use decision trees to help
decide if a loan application should be approved or denied.
 In general, the objective of a decision tree algorithm is to
construct a tree T from a training set S. If all the records in S
belong to some class C (subscribed=yes, for example), or if S
is sufficiently pure, then that node is considered a leaf node
and assigned the label C.
 The purity of a node is measured by the probability (relative frequency)
of the corresponding class among the records at that node.
 In contrast, if not all the records in S belong to class C or if S is not
sufficiently pure, the algorithm selects the next most informative
attribute A (duration, marital, and so on) and partitions S according
to A's values.
 The algorithm constructs subtrees T1 T2... for the subsets of S
recursively until one of the following criteria is met:
• All the leaf nodes in the tree satisfy the minimum purity threshold.
• The tree cannot be further split with the preset minimum purity
threshold.
• Any other stopping criterion is satisfied (such as the maximum
depth of the tree).
 The first step in constructing a decision tree is to choose the most
informative attribute. A common way to identify the most
informative attribute is to use entropy-based methods.
 The entropy methods select the most informative attribute based on
two basic measures:
 • Entropy, which measures the impurity of an attribute
 • Information gain, which measures the purity of an attribute
 As an example of a binary random variable, consider tossing a coin
with known, not necessarily fair, probabilities of coming up heads
or tails.
 The corresponding entropy graph is shown in the following figure. Let x = 1
represent heads and x = 0 represent tails. The entropy of the
unknown result of the next toss is maximized when the coin is fair.
That is, when heads and tails have equal probability,
 P(x = 1) = P(x = 0) = 0.5, the entropy is
 Hx = −(0.5 × log2 0.5 + 0.5 × log2 0.5) = 1.
 On the other hand, if the coin is not fair, the probabilities of heads
and tails would not be equal and there would be less uncertainty.
 As an extreme case, when the probability of tossing a head is equal
to 0 or 1, the entropy is minimized to 0.
 Therefore, the entropy for a completely pure variable is 0 and is 1
for a set with equal occurrences for both the classes (head and tail,
or yes and no).
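 A small R sketch of this binary entropy calculation (the probability values are chosen only for illustration):

# Entropy of a binary variable with P(x = 1) = p
entropy <- function(p) {
  ifelse(p %in% c(0, 1), 0, -p * log2(p) - (1 - p) * log2(1 - p))
}

entropy(0.5)   # fair coin: maximum impurity, entropy = 1
entropy(0.9)   # biased coin: less uncertainty, entropy ~ 0.47
entropy(1.0)   # completely pure: entropy = 0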
 Information gain compares the degree of purity of the parent node
before a split with the degree of purity of the child node after a split.
 At each split, an attribute with the greatest information gain is
considered the most informative attribute. Information gain
indicates the purity of an attribute.
 Multiple algorithms exist to implement decision trees, and the
methods of tree construction vary with different algorithms. One
popular algorithm, described next, is ID3.
 ID3 (or Iterative Dichotomiser 3) is one of the first decision
tree algorithms, and it was developed by John Ross Quinlan.
 Let A be a set of categorical input variables, P be the output
variable (or the predicted class), and T be the training set. The
ID3 algorithm proceeds roughly as follows: if all records in T belong to
a single class, return a leaf node with that class label; otherwise, choose
the attribute in A with the greatest information gain, partition T by that
attribute's values, and recursively build a subtree for each partition.
 Decision trees use greedy algorithms, in that they always
choose the option that seems the best available at that
moment.
 At each step, the algorithm selects which attribute to use for
splitting the remaining records.
 This characteristic increases the efficiency of decision trees.
 However, once a bad split is taken, it is propagated through the rest
of the tree. To address this problem, an ensemble technique (such as
random forest) may be used.
 In ensemble methods like Random Forests, multiple decision trees
are trained on different subsets of the data or different random
samples of features.
 Each tree then predicts a class label for an input, and the class label
that is predicted by the majority of the trees becomes the final
prediction.
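 A minimal sketch with the randomForest package, using a small simulated dataset (all names and values are hypothetical, not from the original text):

library(randomForest)

# Hypothetical training and test data with a two-class outcome 'label'
set.seed(7)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$label <- factor(ifelse(train$x1 + train$x2 + rnorm(200, sd = 0.5) > 0, "yes", "no"))
test <- data.frame(x1 = rnorm(20), x2 = rnorm(20))

rf <- randomForest(label ~ x1 + x2, data = train, ntree = 500)  # many randomized trees
predict(rf, test)   # each tree votes; the majority class is the final prediction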
 There are a few ways to evaluate a decision tree, First, evaluate
whether the splits of the tree make sense. Conduct sanity checks by
validating the decision rules with domain experts, and determine if
the decision rules are sound.
 If the dataset is small or has many features (high-dimensional), a
tree is more likely to overfit. In overfitting, the model fits the
training set well, but it performs poorly on the new samples in the
testing set.
 For decision tree learning, overfitting can be caused by either the
lack of training data or the biased data in the training set. Two
approaches can help avoid overfitting in decision tree learning.
 • Stop growing the tree early before it reaches the point where all
the training data is perfectly classified.
 • Grow the full tree, and then post-prune the tree with methods such
as reduced-error pruning and rule-based post-pruning.
 Decision trees are computationally inexpensive, and it is easy to
classify the data. The outputs are easy to interpret as a fixed
sequence of simple tests. Decision trees are able to handle both
numerical and categorical attributes and are robust with redundant
or correlated variables.
 Decision trees can handle categorical attributes with many distinct
values, such as country codes for telephone numbers.
 Decision trees can also handle variables that have a nonlinear effect
on the outcome, so they work better than linear models (for
example, linear regression and logistic regression) for highly
nonlinear problems.
 The structure of a decision tree is sensitive to small variations in the
training data. Although the dataset is the same, constructing two
decision trees based on two different subsets may result in very
different trees.
 If a tree is too deep, overfitting may occur, because each split
reduces the training data for subsequent splits.
 Decision trees are not a good choice if the dataset contains many
irrelevant variables. This is different from the notion that they are
robust with redundant variables and correlated variables.
 If the dataset contains redundant variables, the resulting decision
tree ignores all but one of them, because the algorithm detects no
additional information gain from including more redundant variables.
 On the other hand, if the dataset contains irrelevant variables and if
these variables are accidentally chosen as splits in the tree, the tree
may grow too large and may end up with less data at every split,
where overfitting is likely to occur.
 To address this problem, feature selection can be introduced in the
data pre-processing phase to eliminate the irrelevant variables.
 Decision trees are not well suited when most of the variables in the
training set are correlated, since overfitting is likely to occur.
 To overcome the issue of instability and potential overfitting of deep
trees, one can combine the decisions of several randomized shallow
decision trees—the basic idea of another classifier called random
forest or use ensemble methods to combine several weak learners
for better classification.
 For binary decisions, a decision tree works better if the training
dataset consists of records with an even probability of each result. In
other words, the root of the tree has a 50% chance of either
classification.
 This occurs by randomly selecting training records from each
possible classification in equal numbers.
 When using methods such as logistic regression on a
dataset with many variables, decision trees can help
determine which variables are the most useful to select
based on information gain.
 Then these variables can be selected for the logistic
regression. Decision trees can also be used to prune
redundant variables.
 Naive Bayes is a probabilistic classification method based on Bayes'
theorem. Bayes' theorem gives the relationship between the
probabilities of two events and their conditional probabilities.
 A naive Bayes classifier assumes that the presence or absence of a
particular feature of a class does not affect the presence or absence
of other features.
 For example, an object can be classified based on its attributes such
as shape, color, and weight.
 The input variables are generally categorical, but variations of the
algorithm can accept continuous variables. There are also ways to convert
continuous variables into categorical ones. This process is often referred
to as the discretization of continuous variables.
 For an attribute such as income, the attribute can be converted into
categorical values as shown below.
 Low Income: income < $10,000
 Working Class: $10,000 ≤ income < $50,000
 Middle Class: $50,000 ≤ income < $1,000,000
 Upper Class: income ≥ $1,000,000
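 In R, this discretization might be sketched with cut(), using the thresholds listed above (lower bounds treated as inclusive; the income values are hypothetical):

income <- c(8500, 42000, 230000, 1500000)   # hypothetical incomes in dollars
income_class <- cut(income,
                    breaks = c(-Inf, 10000, 50000, 1000000, Inf),
                    labels = c("Low Income", "Working Class", "Middle Class", "Upper Class"),
                    right  = FALSE)          # intervals are [lower, upper)
income_class   # Low Income, Working Class, Middle Class, Upper Class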
 The output typically includes a class label and its
corresponding probability score.
 Naive Bayes classifiers are easy to implement and can
execute efficiently.
 Spam filtering is a classic use case of naive Bayes text
classification. Bayesian spam filtering has become a popular
mechanism to distinguish spam e-mail from legitimate e-mail.
 Naive Bayes classifiers can also be used for fraud detection.
 In the domain of auto insurance, for example, based on a training
set with attributes such as driver's rating, vehicle age, vehicle
price, historical claims by the policy holder, police report status,
and claim genuineness, naive Bayes can provide probability-
based classification of whether a new claim is genuine.
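 A minimal sketch of such a classifier with the naiveBayes() function from the e1071 package; the claims data frame, its attribute names, and values are hypothetical placeholders, not from the original text:

library(e1071)

# Hypothetical auto-insurance training data with categorical attributes
claims <- data.frame(
  driver_rating = factor(c("good", "poor", "good", "poor", "good", "poor")),
  police_report = factor(c("yes",  "no",   "yes",  "no",   "no",   "yes")),
  genuine       = factor(c("yes",  "no",   "yes",  "no",   "yes",  "no"))
)

model <- naiveBayes(genuine ~ driver_rating + police_report, data = claims)
new_claim <- data.frame(driver_rating = factor("poor", levels = levels(claims$driver_rating)),
                        police_report = factor("yes",  levels = levels(claims$police_report)))
predict(model, new_claim, type = "raw")    # class probabilities for the new claim
predict(model, new_claim, type = "class")  # most likely class label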
 The conditional probability of event C occurring, given that event A
has already occurred, is denoted as P(C|A), which can be found using
the following formula:

P(C|A) = P(A ∩ C) / P(A)
 Classification methods can be used to classify instances into distinct
groups according to the similar characteristics they share.
 Each of these classifiers faces the same issue: how to evaluate if
they perform well.
 A few tools have been designed to evaluate the performance of a
classifier.
 A confusion matrix is a specific table layout that allows
visualization of the performance of a classifier.
 The following table shows the confusion matrix for a two-class
classifier.
 True positives (TP) are the number of positive instances the
classifier correctly identified as positive.
 False positives (FP) are the number of instances that the
classifier identified as positive but that are in reality negative.
 True negatives (TN) are the number of negative instances the
classifier correctly identified as negative. False negatives (FN) are
the number of instances classified as negative but that are in reality
positive.
 In a two-class classification, a preset threshold may be used to
separate positives from negatives. TP and TN are the correct
guesses.
 A good classifier should have large TP and TN and small (ideally
zero) numbers for FP and FN.
 The accuracy (or the overall success rate) is a metric defining the rate at
which a model has classified the records correctly. It is defined as the sum
of TP and TN divided by the total number of instances, as shown in the
following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

 A good model should have a high accuracy score, but having a high
accuracy score alone does not guarantee the model is well established.
 The true positive rate (TPR) shows what percent of positive instances the
classifier correctly identified, as shown in the following equation:

TPR = TP / (TP + FN)
 A well-performing model should have a high TPR (ideally 1) and a low
false positive rate, FPR = FP / (FP + TN), and false negative rate,
FNR = FN / (FN + TP), both ideally 0. In some cases, a model
with a TPR of 0.95 and an FPR of 0.3 is more acceptable than a
model with a TPR of 0.9 and an FPR of 0.1, even if the second
model is more accurate overall.
 Precision is the percentage of instances marked positive that
really are positive, as shown in the following equation:

Precision = TP / (TP + FP)
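 These metrics can be computed directly from a confusion matrix in R; the actual and predicted label vectors below are made up for illustration:

# Hypothetical actual and predicted class labels (1 = positive, 0 = negative)
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0), levels = c(0, 1))

cm <- table(Predicted = predicted, Actual = actual)   # 2 x 2 confusion matrix
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

accuracy  <- (TP + TN) / sum(cm)   # overall success rate
TPR       <- TP / (TP + FN)        # true positive rate
FPR       <- FP / (FP + TN)        # false positive rate
precision <- TP / (TP + FP)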
 ROC curve is a common tool to evaluate classifiers.
 The abbreviation stands for Receiver Operating
Characteristic, a term used in signal detection to characterize
the trade-off between hit rate and false alarm rate over a
noisy channel.
 A ROC curve evaluates the performance of a classifier based
on the true positive rate and false positive rate, regardless of other
factors such as class distribution and error costs.
 Related to the ROC curve is the area under the curve (AUC).
 The AUC is calculated by measuring the area under the ROC
curve.
 Higher AUC scores mean the classifier performs better.
 The score can range from 0.5 (for the diagonal line
TPR=FPR) to 1.0 (with ROC passing through the top-left
corner).
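 One way to produce a ROC curve and its AUC in R is the pROC package; this is a sketch with made-up labels and probability scores:

library(pROC)

# Hypothetical true labels and classifier probability scores
labels <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.05)

roc_obj <- roc(labels, scores)   # computes TPR/FPR over all thresholds
auc(roc_obj)                     # area under the ROC curve (0.5 = diagonal, 1.0 = perfect)
plot(roc_obj)                    # draws the ROC curve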
 Besides the above two classifiers, several other methods are
commonly used for classification, including

 Bagging,
 Boosting,
 Random forest, and
 Support Vector Machines (SVM)
 Bagging (or bootstrap aggregating) uses the bootstrap technique that
repeatedly samples with replacement from a dataset according to a uniform
probability distribution.
 "With replacement" means that when a sample is selected for a training or
testing set, the sample is still kept in the dataset and may be selected again.
 Because the sampling is with replacement, some samples may appear
several times in a training or testing set, whereas others may be absent.
 A model or base classifier is trained separately on each bootstrap sample,
and a test sample is assigned to the class that received the highest number
of votes.
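 A bare-bones sketch of bagging in R using rpart trees as the base classifiers, reusing the hypothetical loans data frame from the earlier sketch; it shows bootstrap sampling with replacement and the majority vote:

library(rpart)

B <- 25                                          # number of bootstrap samples / base classifiers
votes <- replicate(B, {
  idx  <- sample(nrow(loans), replace = TRUE)    # bootstrap sample: some rows repeat, some are absent
  tree <- rpart(default ~ income + mortgage_amount,
                data = loans[idx, ], method = "class")
  as.character(predict(tree, loans, type = "class"))   # each base classifier votes on every record
})

# Majority vote across the B classifiers for each record
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))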
 Boosting (or AdaBoost) uses votes for classification to combine the output
of individual models.
 In addition, it combines models of the same type. However, boosting is an
iterative procedure where a new model is influenced by the performances
of those models built previously.
 Furthermore, boosting assigns a weight to each training sample that
reflects its importance, and the weight may adaptively change at the end of
each boosting round.
 Bagging and boosting have been shown to have better performances than a
decision tree.
 Random forest is a class of ensemble methods using decision tree
classifiers.
 It is a combination of tree predictors such that each tree depends on
the values of a random vector sampled independently and with the
same distribution for all trees in the forest.
 A special case of random forest uses bagging on decision trees,
where samples are randomly chosen with replacement from the
original training set.
 SVM is another common classification method that combines linear
models with instance-based learning techniques.
 Support vector machines select a small number of critical boundary
instances called support vectors from each class and build a linear
decision function that separates them as widely as possible. SVM by
default can efficiently perform linear classifications and can be
configured to perform nonlinear classifications as well.
 In general, regression analysis attempts to explain the influence that a set
of variables has on the outcome of another variable of interest.
 Often, the outcome variable is called a dependent variable because the
outcome depends on the other variables.
 These additional variables are sometimes called the input variables or the
independent variables.
 Regression analysis is useful for answering the following kinds of
questions:
• What is a person's expected income?
• What is the probability that an applicant will default on a loan?
 Linear regression is a useful tool for answering the first
question, and logistic regression is a popular method for
addressing the second.
 Regression analysis is a useful explanatory tool that can
identify the input variables that have the greatest statistical
influence on the outcome.
 For example, if it is found that the reading level of 10-year-
old students is an excellent predictor of the students' success
in high school and a factor in their attending college, then
additional importance on reading can be considered,
implemented, and evaluated to improve students' reading
levels at a younger age.
 Used for Predictive analysis.
 Linear regression is an analytical technique used to model the
relationship between several input variables and a continuous
outcome variable.
 A key assumption is that the relationship between an input variable
and the outcome variable is linear.
 Although this assumption may appear restrictive, it is often possible
to properly transform the input or outcome variables to achieve a
linear relationship between the modified input and outcome variables.
 A linear regression model is a probabilistic one that accounts
for the randomness that can affect any particular outcome.
 Based on known input values, a linear regression model
provides the expected value of the outcome variable based
on the values of the input variables, but some uncertainty
may remain in predicting any particular outcome.
 Linear regression is often used in business, government,
and other scenarios. Some common practical
applications of linear regression in the real world
include the following:
 Real estate
 Demand forecasting
 Medical
 A simple linear regression analysis can be used to model
residential home prices as a function of the home's living
area.
 Such a model helps set or evaluate the list price of a home on
the market.
 The model could be further improved by including other
input variables such as number of bathrooms, number of
bedrooms, plot size, school district rankings, crime statistics,
and property taxes.
 Businesses and governments can use linear regression models
to predict demand for goods and services.
 For example, restaurant chains can appropriately prepare for the
predicted type and quantity of food that customers will
consume based upon the weather, the day of the week, whether
an item is offered as a special, the time of day, and the
reservation volume.
 Similar models can be built to predict retail sales, emergency
room visits, and ambulance dispatches.
 A linear regression model can be used to analyze the effect
of a proposed radiation treatment on reducing tumor sizes.
 Input variables might include duration of a single radiation
treatment, frequency of radiation treatment, and patient
attributes such as age or weight.
 As the name of this technique suggests, the linear
regression model assumes that there is a linear
relationship between the input variables and the
outcome variable.
 This relationship can be expressed as shown in the
following equation:

y = 𝛽0 + 𝛽1x1 + 𝛽2x2 + ... + 𝛽p-1xp-1 + ε

 Where :
y is the outcome variable
xj are the input variables, for j = 1, 2, ..., p-1
𝛽0 is the value of y when each xj equals zero
𝛽j is the change in y based on a unit change in xj, for j = 1, 2, ..., p-1
ε is a random error term that represents the difference between the linear
model and a particular observed value for y.

 Suppose it is desired to build a linear regression model that estimates a
person's annual income as a function of two variables — age and
education—both expressed in years.
 In this case, income is the outcome variable, and the input variables are
age and education.
 However, it is also obvious that there is considerable variation in income
levels for a group of people with identical ages and years of education.
This variation is represented by ε in the model.
 So, in this example, the model would be expressed as shown in the
following equation:

Income = 𝛽0 + 𝛽1 Age + 𝛽2 Education + ε

 In the linear model, the 𝛽j represent the p unknown parameters to be estimated.
 The estimates for these unknown parameters are chosen so that, on
average, the model provides a reasonable estimate of a person's
income based on age and education.
 In other words, the fitted model should minimize the overall error
between the linear model and the actual observations.
 Ordinary Least Squares (OLS) is a common technique to estimate the
parameters.
 To illustrate how OLS works, suppose there is only one input variable, x, for
an outcome variable y. Furthermore, n observations of (x, y) are obtained
and plotted in below Figure.
 The goal is to find the line that best approximates the relationship
between the outcome variable and the input variables.
 With OLS, the objective is to find the line through these points that
minimizes the sum of the squares of the difference between each
point and the line in the vertical direction.
 In other words, find the values of 𝛽0 and 𝛽1 such that the following sum of
squared differences over the n observations (xi, yi) is minimized:

Σi [yi − (𝛽0 + 𝛽1 xi)]²
 The n individual distances to be squared and then summed are illustrated in
below figure. The vertical lines represent the distance between each
observed y value and the line
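 As an illustrative sketch, the lm() function in R performs this least squares fitting; the income_df data frame below is simulated (hypothetical values, not from the original text):

# Hypothetical data: annual income (dollars) as a function of age and education (years)
set.seed(3)
n <- 200
age       <- round(runif(n, 20, 65))
education <- round(runif(n, 10, 20))
income    <- 12000 + 800 * age + 2500 * education + rnorm(n, sd = 15000)
income_df <- data.frame(income, age, education)

fit <- lm(income ~ age + education, data = income_df)  # ordinary least squares
summary(fit)   # estimated intercept and coefficients, with standard errors
coef(fit)      # the fitted beta_0, beta_1 (age), beta_2 (education)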
 In the normal model description, there were no assumptions made
about the error term; no additional assumptions were necessary for
OLS to provide estimates of the model parameters.
 However, in most linear regression analyses, it is common to assume
that the error term is a normally distributed random variable with
mean equal to zero and constant variance.
 Thus, the linear regression model is expressed as shown in the following equation:

y = 𝛽0 + 𝛽1x1 + 𝛽2x2 + ... + 𝛽p-1xp-1 + ε

 Where :
y is the outcome variable
xj are the input variables, for j = 1, 2, ..., p-1
𝛽0 is the value of y when each xj equals zero
𝛽j is the change in y based on a unit change in xj, for j = 1, 2, ..., p-1
ε ~ N(0, σ²), and the error terms are independent of each other.
 Thus, for a given (x1, x2, ..., xp-1), y is normally distributed with mean
𝛽0 + 𝛽1x1 + ... + 𝛽p-1xp-1 and variance σ².

 For a regression model with just one input variable, below figure illustrates
the normality assumption on the error terms and the effect on the outcome
variable, y, for a given value of x.

Normal distribution about y for a given value of x


 Following are Some tools and techniques that can be used
to validate a fitted linear regression model.

 Evaluating the Linearity Assumption


 Evaluating the Residuals
 Evaluating the Normality Assumption

 N-Fold Cross-Validation
 A major assumption in linear regression modelling is that
the relationship between the input variables and the
outcome variable is linear.
 The most fundamental way to evaluate such a relationship
is to plot the outcome variable against each input variable.
If the relationship between Age and Income is represented
as illustrated in the following Figure, a linear model would
not apply.
Figure : Income as a quadratic function of Age
 In such a case, it is often useful to do any of the following :
• Transform the outcome variable.
• Transform the input variables.
• Add extra input variables or terms to the regression model.
 Common transformations include taking square roots or the
logarithm of the variables.
 Another option is to create a new input variable such as the age
squared and add it to the linear regression model.
 As stated previously, it is assumed that the error terms in
the linear regression model are normally distributed with a
mean of zero and a constant variance.
 If this assumption does not hold, the various inferences
that were made with the hypothesis tests, confidence
intervals, and prediction intervals are suspect.
 The residual plots are useful for confirming that the
residuals were centered on zero and have a constant
variance.
 However the normality assumption still has to be
validated.
 To prevent overfitting a given dataset, a common practice
is to randomly split the entire dataset into a training set
and a testing set.
 Once the model is developed on the training set, the model
is evaluated against the testing set.
 When there is not enough data to create training and
testing sets, an N-fold cross-validation technique may be
helpful to compare one fitted model against another.
 In N-fold cross-validation, the following occurs:
• The entire dataset is randomly split into N datasets of approximately
equal size.
• A model is trained against N - 1 of these datasets and tested against the
remaining dataset. A measure of the model error is obtained.
• This process is repeated a total of N times across the various
combinations of N datasets taken N - 1 at a time.
• The observed N model errors are averaged over the N folds, as
sketched below.
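 A compact R sketch of this N-fold procedure for the hypothetical income model from the earlier sketch, using mean squared error on the held-out fold as the model error measure:

N <- 5
set.seed(11)
fold <- sample(rep(1:N, length.out = nrow(income_df)))   # randomly assign each record to one of N folds

fold_errors <- sapply(1:N, function(k) {
  train <- income_df[fold != k, ]                 # N - 1 folds for training
  test  <- income_df[fold == k, ]                 # the remaining fold for testing
  m     <- lm(income ~ age + education, data = train)
  mean((test$income - predict(m, test))^2)        # model error (MSE) on the held-out fold
})

mean(fold_errors)   # averaged error over the N folds; compare across candidate models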


 The averaged error from one model is compared against
the averaged error from another model.
 This technique can also help determine whether adding
more variables to an existing model is beneficial or
possibly overfitting the data.
 In linear regression modelling, the outcome variable is a
continuous variable.
 When the outcome variable is categorical in nature, logistic
regression can be used to predict the likelihood of an outcome
based on the input variables.
 Although logistic regression can be applied to an outcome variable
that represents multiple values, we will examine the case in
which the outcome variable represents two values such as
true/false, pass/fail, or yes/no.
 For example, a logistic regression model can be built to determine if a
person will or will not purchase a new automobile in the next 12
months.
 The training set could include input variables for a person's age,
income, and gender as well as the age of an existing automobile.
 The training set would also include the outcome variable on whether
the person purchased a new automobile over a 12-month period.
 The logistic regression model provides the likelihood or probability
of a person making a purchase in the next 12 months.
 The logistic regression model is applied to a variety of
situations in both the public and the private sector.
 Some common applications of the logistic regression model
include the following :
 Medical
 Finance
 Marketing
 Engineering
 Medical : Develop a model to determine the likelihood of a patient's
successful response to a specific medical treatment or procedure.
 Input variables could include age, weight, blood pressure, and
cholesterol levels.
 Finance : Using a loan applicant's credit history and the details on
the loan, determine the probability that an applicant will default on
the loan.
 Based on the prediction, the loan can be approved or denied, or the
terms can be modified.
 Marketing :
◦ Determine a wireless customer's probability of switching
carriers (known as churning) based on age, number of
family members on the plan, months remaining on the
existing contract, and social network contacts.
◦ With such insight, target the customers most likely to churn
with appropriate offers to prevent churn.
 Engineering :
◦ Based on operating conditions and various diagnostic
measurements, determine the probability of a mechanical
part experiencing a malfunction or failure.
◦ With this probability estimate, schedule the appropriate
preventive maintenance activity.
Figure : The Logistic Function
 A wireless telecommunications company wants to estimate
the probability that a customer will churn (switch to a
different company) in the next six months.
 With a reasonably accurate prediction of a person's
likelihood of churning, the sales and marketing groups can
attempt to retain the customer by offering various incentives.
 Data on 8,000 current and prior customers was obtained. The variables
collected for each customer follow:
• Age (years)
• Married(true/false)
• Duration as a customer (years)
• Churned_contacts (count) — Number of the customer's contacts
that have churned
• Churned (true/false) — Whether the customer churned
 After analyzing the data and fitting a logistic regression model, Age and
Churned_contacts were selected as the best predictor variables.
 The following Equation provides the estimated model parameters.
y = 3.50 - 0.16 * Age + 0.38 * Churned _ contacts
The value 3.50 represents the intercept (often denoted as 𝐵0) of the fitted
equation, where 𝑦 is the log-odds that is fed into the logistic function. This
means that, when both Age and Churned_contacts are zero, the value of 𝑦 will be 3.50.

𝐵0 = 3.50 is the starting point or baseline value of 𝑦.

The term −0.16 × Age means that for each year increase in Age, the value of 𝑦 will
decrease by 0.16.
The term +0.38 × Churned_contacts means that for each additional churned
contact, the value of 𝑦 will increase by 0.38.
 If Age is 0 and Churned contacts is 0, then 𝑦 would equal
3.50 (which is the value of the intercept).
 If Age increases or Churned contacts change, the result for 𝑦
will adjust according to the values of these variables, with
the coefficients −0.16 and +0.38 modifying the baseline
value (3.50).
 Based on the fitted model, there is a 93% chance that a 20-year-old
customer whose six contacts have already churned will also churn.
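 The 93% figure can be checked by plugging these values into the fitted equation and then into the logistic function p = 1 / (1 + e^(−y)); the short R sketch below does exactly that, and also indicates how such a model is typically fit with glm() (churn_data is a hypothetical training data frame, not from the original text):

y <- 3.50 - 0.16 * 20 + 0.38 * 6      # log-odds for Age = 20, Churned_contacts = 6
p <- 1 / (1 + exp(-y))                # logistic function converts log-odds to a probability
p                                     # approximately 0.93

# Fitting such a model (churn_data is hypothetical, with a 0/1 Churned column):
# model <- glm(Churned ~ Age + Churned_contacts, data = churn_data, family = binomial)
# predict(model, data.frame(Age = 20, Churned_contacts = 6), type = "response")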
 So far, the log-likelihood ratio test discussion has focused on
comparing a fitted model to the default model of using only
the intercept.
 However, the log-likelihood ratio test can also compare one
fitted model to another.
 Logistic regression is often used as a classifier to assign class
labels to a person, item, or transaction based on the predicted
probability provided by the model.
 In the Churn example, a customer can be classified with the
label called Churn if the logistic model predicts a high
probability that the customer will churn.
 Otherwise, a Remain label is assigned to the customer.
 Commonly, 0.5 is used as the default probability threshold to
distinguish between any two class labels.
 However, any threshold value can be used depending on the
preference to avoid false positives (for example, to predict
Churn when actually the customer will Remain) or false
negatives (for example, to predict Remain when the customer
will actually Churn).
 It can be useful to visualize the observed responses against the
estimated probabilities provided by the logistic regression.
 The following figure provides overlaying histograms for the
customers who churned and for the customers who remained as
customers.
 With a proper fitting logistic model, the customers who
remained tend to have a low probability of churning.
 Conversely, the customers who churned have a high
probability of churning again.
 This histogram plot helps visualize the number of items to be
properly classified or mis-classified.
 In the Churn example, an ideal histogram plot would have
the remaining customers grouped at the left side of the plot,
the customers who churned at the right side of the plot, and
no overlap of these two groups.
 Linear regression is suitable when the input variables are continuous
or discrete, including categorical data types, but the outcome variable
is continuous.
 If the outcome variable is categorical, logistic regression is a better
choice.
 Furthermore, in linear regression, the assumption of normally
distributed error terms with a constant variance is important for many
of the statistical inferences that can be considered.
 Although a collection of input variables may be a good predictor for
the outcome variable, the analyst should not infer that the input
variables directly cause an outcome.
 Use caution when applying an already fitted model to data that
falls outside the dataset used to train the model.
 The linear relationship in a regression model may no longer hold
at values outside the training dataset.
 For example, if income was an input variable and the values of
income ranged from $35,000 to $90,000, applying the model to
incomes well outside that range could result in inaccurate
estimates and predictions.
 If several of the input variables are highly correlated to each other,
the condition is known as multicollinearity.
 Multicollinearity can often lead to coefficient estimates that are
relatively large in absolute magnitude and may be of inappropriate
direction (negative or positive sign).
 When possible, the majority of these correlated variables should
be removed from the model or replaced by a new variable that is a
function of the correlated variables.
 Polynomial Regression - used to represent a non-linear relationship
between dependent and independent variables
 Lasso Regression – As with ridge regression, the lasso (Least
Absolute Shrinkage and Selection Operator) technique penalizes
the absolute magnitude of the regression coefficient.
 Ridge Regression - applied when the data exhibits multicollinearity,
that is, when the independent variables are highly correlated.
 Quantile Regression - a subset of the linear regression technique. It
is employed when the linear regression requirements are not met or
when the data contains outliers.
 Bayesian Linear Regression - used in machine learning that uses
Bayes’ theorem to calculate the regression coefficients’ values.
 Principal Components Regression – Multicollinear regression
data is often evaluated using the principal components regression
approach.
 Partial Least Squares Regression- a fast and efficient covariance-
based regression analysis technique.
 Elastic Net Regression - combines ridge and lasso regression
techniques that are particularly useful when dealing with strongly
correlated data.
