COSC 210 INTRODUCTION TO MACHINE LEARNING Module I-1
set of logical “if . . . then . . . else . . .” rules, or some such thing. It may be noted that this is not
an exhaustive list.
1.1.2 Definition of Learning
Definition
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks T, as measured by P, improves with experience
E.
Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given
classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training experience: A sequence of images and steering commands recorded
while observing a human driver
iii) A chess learning problem
• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself
Machine Learning program
A computer program which learns from experience is called a Machine Learning program or
simply a learning program. Such a program is sometimes also referred to as a learner.
1.2 Machine Learning Process
1.2.1 Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation. Figure 1.1 illustrates the various
components and the steps involved in the learning process.
2. In finance, banks analyze their past data to build models to use in credit applications, fraud
detection, and the stock market.
3. In manufacturing, learning models are used for optimization, control, and troubleshooting.
4. In medicine, learning programs are used for medical diagnosis.
5. In telecommunications, call patterns are analyzed for network optimization and maximizing the
quality of service.
6. In science, large amounts of data in physics, astronomy, and biology can only be analyzed fast
enough by computers. The World Wide Web is huge and constantly growing, and searching it for
relevant information cannot be done manually.
7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.
8. It is used to find solutions to many problems in vision, speech recognition, and robotics.
9. Machine Learning methods are applied in the design of computer-controlled vehicles to steer
correctly when driving on a variety of roads.
10. Machine Learning methods have been used to develop programmes for playing games such
as chess, backgammon and Go.
11. Machine Learning is used in text or document classification, e.g., spam detection
12. It is also used in Natural Language Processing, e.g., morphological analysis, part-of-speech
tagging, statistical parsing, named-entity recognition
The list goes on, but here are more applications of Machine Learning:
• Speech recognition, speech synthesis and speaker verification
• Optical character recognition (OCR)
• Computational biology applications, e.g., protein function or structured prediction
• Computer vision tasks, e.g., image recognition and face detection
• Fraud detection (credit card, telephone) and network intrusion detection
• Unassisted vehicle control (robots, navigation)
• Recommendation systems, search engines, information extraction systems, etc.
1.4 Understanding Data
Since an important component of the Machine Learning process is data storage, we briefly
consider in this section the different types and forms of data that are encountered in the Machine
Learning process.
1.4.1 Unit of observation
The unit of observation is the smallest entity with measured properties of interest for a study.
Examples
• A person, an object or a thing
• A time point
• A geographic region
• A measurement
Sometimes, units of observation are combined to form units such as person-years.
1.4.2 Examples, Features and Labels
Datasets that store the units of observation and their properties can be imagined as collections
of data consisting of Examples and Features.
Examples
An “example” is an instance of the unit of observation for which properties have been recorded.
An “example” is also referred to as an “instance”, or “case” or “record.” (It may be noted that the
word “example” has been used here in a technical sense.) It typically represents a single
observation or unit of data used for training or testing a model.
Features
A “feature” is a recorded property or a characteristic of an example. The set of features of an
example is often represented as a vector. A feature is also referred to as an “attribute” or a
“variable”.
Label: Values or categories assigned to examples. In classification problems, examples are
assigned specific categories, for instance, the spam and non-spam categories in a binary
classification problem. In regression, items are assigned real-valued labels. Label is the output or
target variable that the model is trying to predict or classify. Labels are used in supervised
learning.
Examples of “examples”, “features” and “labels”
Case 1: Cancer detection
Consider the problem of developing a model for detecting cancer. In this study we note the
following.
(a) The units of observation are the patients.
(b) The examples are members of a sample of cancer patients.
(c) The features can be: Gender, Age, Blood pressure, the findings of the pathology report after a
biopsy, etc.
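As an illustration of these terms, the sketch below represents a tiny version of such a data set in Python. The feature names, values and labels are invented for the example and are not taken from any real data set.

```python
# Each inner list is one "example": the recorded features of one patient.
# Assumed feature order: [gender, age, blood_pressure]
examples = [
    ["male",   63, 145],
    ["female", 51, 130],
    ["female", 70, 160],
]

# One label per example: the category a supervised model should learn to predict.
labels = ["malignant", "benign", "malignant"]

# In supervised learning every example must carry a label.
assert len(examples) == len(labels)

for features, label in zip(examples, labels):
    print(features, "->", label)
```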
1.5 Types of Data
Data can broadly be divided into the following two types:
1. Qualitative data
2. Quantitative data
1.5.1 Qualitative Data
Qualitative data, also called categorical data, provides information about the quality of an object,
i.e., information which cannot be measured. For example, if we describe the quality of
performance of students in terms of ‘Good’, ‘Average’, and ‘Poor’, that falls under the category of
qualitative data. Also, the name or roll number of a student is information that cannot be
measured using a scale of measurement, so it too falls under qualitative data.
Qualitative data can be further subdivided into two types as follows:
1. Nominal data
2. Ordinal data
1.5.1.1 Nominal Data
Nominal data is one which has no numeric value, but a named value. It is used for assigning
named values to attributes. Nominal values cannot be quantified. Examples of nominal data are:
1. Blood group: A, B, O, AB, etc.
2. Nationality: Indian, American, British, etc.
3. Gender: Male, Female.
4. Colour: Red, Green, Blue, etc.
1.5.1.2 Ordinal Data
Ordinal data, in addition to possessing the properties of nominal data, can also be naturally
ordered. This means ordinal data also assigns named values to attributes but unlike nominal data,
they can be arranged in a sequence of increasing or decreasing value so that we can say whether
a value is better than or greater than another value. Examples of ordinal data are:
1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
2. Grades: A, B, C, etc.
3. Hardness of Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.
1.5.2 Quantitative Data
Quantitative data, also referred to as numeric data, relates to information about the quantity of
an object and hence can be measured. For example, if we consider the attribute ‘score’, it can be
measured using a scale of measurement. There are two types of quantitative data:
1. Interval data
2. Ratio data
1.5.2.1 Interval Data
Interval data is numeric data for which not only the order is known, but the exact difference
between values is also known. An ideal example of interval data is Celsius temperature: the
difference between each unit remains the same. For example, the difference between 12°C and
18°C is measurable and is 6°C, just as it is between 15.5°C and 21.5°C. Other examples include
date, time, etc.
Interval data does not have a ‘true zero’ value. For example, there is no such thing as ‘0
temperature’ or ‘no temperature’ on the Celsius scale. Hence, only addition and subtraction apply
to interval data; ratios cannot be taken. This means we can say that a temperature of 40°C equals
a temperature of 20°C plus a temperature of 20°C, but we cannot say that 40°C is twice as hot as
20°C.
1.5.2.2 Ratio Data
Ratio data represents numeric data for which the exact value can be measured and for which an
absolute zero exists. These variables can be added, subtracted, multiplied, or divided. The central
tendency can be measured by mean, median, or mode, and dispersion by methods such as
standard deviation. Examples of ratio data include height, weight, age, salary, etc.
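As a small illustration of the difference between nominal and ordinal data, the sketch below (with assumed code assignments) encodes one attribute of each type:

```python
# Nominal: blood group -- no natural order, so any assignment of codes works.
blood_groups = ["A", "B", "O", "AB"]
nominal_codes = {value: i for i, value in enumerate(blood_groups)}

# Ordinal: grades -- the codes must preserve the natural ordering A > B > C.
grade_order = {"C": 0, "B": 1, "A": 2}

# The ordinal encoding supports meaningful comparisons...
assert grade_order["A"] > grade_order["B"] > grade_order["C"]

# ...whereas comparing nominal codes carries no meaning (B > A here is arbitrary).
print(nominal_codes)
print(grade_order)
```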
Figure 1.3 gives a summarized view of different types of data that we may find in a typical Machine
Learning problem.
1. Supervised learning – Also called predictive learning. A machine predicts the class of unknown
objects based on prior class-related information of similar objects.
2. Unsupervised learning – Also called descriptive learning. A machine finds patterns in unknown
objects by grouping similar objects together.
3. Reinforcement learning – A machine learns to act on its own to achieve the given goals.
These categories differ in the types of training data available to the learner, the order and method
by which training data is received and the test data used to evaluate the learning algorithm. Figure
2.1 shows the different categories of Machine Learning.
where
Y : the output variable (target)
X : the input variable (set of features)
Supervised learning concentrates on learning patterns by connecting the relationship between
variables and known outcomes, working with labeled datasets. Supervised learning works by
feeding the machine sample data with various features (represented as “X”) and the correct
output value of the data (represented as “Y”). The fact that the output and feature values
are known qualifies the dataset as “labeled.” The algorithm then deciphers patterns that exist in
the data and creates a model that can reproduce the same underlying rules with new data. The
algorithm learns from a training set and ceases learning once a satisfactory level of performance
is achieved.
Supervised Machine Learning can be categorized into:
i. Classification (where the output variable is a category)
ii. Regression (where the output variable is a numeric value)
Examples of Supervised Machine Learning algorithms include linear regression, random forest,
and Support Vector Machine (SVM).
Figure 2.2 is a simple depiction of the supervised learning process. Labelled training data
containing past information comes as an input. Based on the training data, the machine builds a
predictive model that can be used on test data to assign a label for each example in the test data.
Examples of Unsupervised Machine Learning methods include Apriori (Association) and k-means
(clustering).
Figure 2.3 depicts the unsupervised learning process.
Reinforcement learning can be complicated and is probably best explained through an analogy to
a video game. As a player progresses through the virtual space of a game, they learn the value of
various actions under different conditions and become more familiar with the field of play. Those
learned values then inform and influence a player’s subsequent behavior and their performance
immediately improves based on their learning and past experience. Reinforcement learning is
very similar, where algorithms are set to train the model through continuous learning. A standard
reinforcement learning model has measurable performance criteria where outputs are not
tagged; instead, they are graded. In the case of self-driving vehicles, avoiding a crash earns a
positive score, and in the case of chess, avoiding defeat likewise earns a positive score.
The differences between the three categories of Machine Learning are shown in Table 2.1
It is also common practice to write A = 1 to mean the event A is true, and A = 0 to mean the event
A is false. So, this is a binary event: the event is either true or false and cannot be something
indefinite. The probability of an event A, in a sample of size X, is defined as

p(A) = n / X

where n is the number of times the instance of event A is present in the sample of size X.
2.2.2.1 Probability of a Union of two Events
Two events A and B are called mutually exclusive if they cannot happen together. For any two
events, A and B, the probability of A or B is defined as:

p(A ∪ B) = p(A) + p(B) − p(A ∩ B)

If A and B are mutually exclusive, p(A ∩ B) = 0, so the formula reduces to p(A ∪ B) = p(A) + p(B).

2.2.2.2 Joint Probability
The joint probability of two events A and B, written p(A, B), is the probability of both events
happening together:

p(A, B) = p(A|B) p(B)

where p(A|B) is defined as the conditional probability of event A happening if event B happens.
2.2.2.3 Conditional Probability
We define the conditional probability of event A, given that event B is true, as follows:

p(A|B) = p(A, B) / p(B)

where p(A, B) is the joint probability of A and B, which can also be denoted p(A ∩ B).
Similarly,

p(B|A) = p(A, B) / p(A)
Example
In a toy-making shop, the automated machine produces a few defective pieces. It is observed that
in a lot of 1,000 toy parts, 25 are defective. If two random samples are selected for testing without
replacement (meaning that the first sample is not put back to the lot and thus the second sample
is selected from the lot size of 999) from the lot, calculate the probability that both the samples
are defective.
Solution:
Let A denote the event that the first part is defective and B the event that the second part is
defective. Here, we have to employ the conditional probability of the second part being found
defective given that the first part is already found defective. By the multiplication law of
probability,

P(A ∩ B) = P(A) × P(B|A), with P(A) = 25/1000

As we are selecting the second sample without replacing the first sample into the lot, and the first
one is already found defective, there are now 24 defective pieces out of the 999 pieces left in the
lot. Thus,

P(B|A) = 24/999

P(A ∩ B) = (25/1000) × (24/999) ≈ 0.0006

which is the probability of both parts being found defective.
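The worked example above can be verified in Python using exact fractions:

```python
from fractions import Fraction

# 25 defective parts in a lot of 1,000; two samples drawn without replacement.
p_first_defective = Fraction(25, 1000)    # P(A)
p_second_given_first = Fraction(24, 999)  # P(B|A): one defective already removed

# Multiplication law: P(A and B) = P(A) * P(B|A)
p_both_defective = p_first_defective * p_second_given_first

print(float(p_both_defective))  # ~0.0006, matching the worked example
```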
3.0 CATEGORIES OF SUPERVISED MACHINE LEARNING
As we have discussed earlier, Supervised Machine Learning is categorized into Classification and
Regression. We will now discuss these two categories.
3.1 Classification
Classification is a type of supervised learning where a target feature, which is of categorical type,
is predicted for test data on the basis of the information imparted by the training data. The
responsibility of the classification model is to assign a class label to the target feature based on
the values of the predictor features.
A classification problem is one where the output variable is a category such as ‘red’ or ‘blue’ or
‘malignant tumour’ or ‘benign tumour’, etc. The target categorical feature is known as class.
A critical classification problem in the context of the banking domain is identifying potentially
fraudulent transactions. Because there are millions of transactions which have to be scrutinized
to identify whether a particular transaction might be fraudulent, it is not possible for any
human being to carry out this task. Machine Learning is leveraged efficiently to do this task, and
this is a classic case of classification. On the basis of the past transaction data, especially the ones
labelled as fraudulent, all new incoming transactions are marked or labelled as usual or
suspicious. The suspicious transactions are subsequently segregated for a closer review.
On the basis of the problem identified above, the required data set that precisely represents the
identified problem needs to be identified/evaluated. For example: If the problem is to predict
whether a tumour is malignant or benign, then the corresponding patient data sets related to
malignant tumour and benign tumours are to be identified.
Data Pre-processing:
This involves cleaning and transforming the data set, ensuring that all the unnecessary or
irrelevant data elements are removed. Data pre-processing refers to the transformations applied
to the identified data before feeding it into the algorithm. Because the data is gathered from
different sources, it is usually collected in a raw format and is not ready for immediate analysis.
This step ensures that the data is ready to be fed into the Machine Learning algorithm.
Definition of Training Data Set:
Before starting the analysis, the user should decide what kind of data set is to be used as a training
set. In the case of signature analysis, for example, the training data set might be a single
handwritten alphabet, an entire handwritten word (i.e. a group of the alphabets) or an entire line
of handwriting (i.e. sentences or a group of words). Thus, a set of ‘input meta-objects’ and
corresponding ‘output meta-objects’ are also gathered. The training set needs to be
representative of the real-world use of the given scenario. In this way, a set of data inputs (X) and
corresponding outputs (Y) is gathered either from human experts or from experiments.
Algorithm Selection:
This involves determining the structure of the learning function and the corresponding learning
algorithm. This is the most critical step of the supervised learning model. On the basis of various
parameters, the best algorithm for a given problem is chosen.
Training:
The learning algorithm identified in the previous step is run on the gathered training set for
further fine tuning. Some supervised learning algorithms require the user to determine specific
control parameters (which are given as inputs to the algorithm). These parameters (inputs given
to algorithm) may also be adjusted by optimizing performance on a subset (called a validation
set) of the training set.
Evaluation with the Test Data Set:
The test data set is run through the trained model, and its performance is measured here. If a
suitable result is not obtained, further training or tuning of parameters may be required.
In logistic regression, the output is transformed using the logistic (sigmoid) function:

f(x) = 1 / (1 + e^(−x))

where:
x = the numerical value you wish to transform
e = Euler’s number, approximately 2.718

In a binary case, a value of 0 represents no chance of occurring, and 1 represents a certain chance
of occurring. The degree of probability for values located between 0 and 1 can be judged
according to how close they lie to 0 (impossible) or 1 (certain) on the scatterplot.
Figure 3.4 shows the example of Logistic Regression.
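The logistic function described above can be sketched in a few lines of Python. This is a minimal illustration of the squashing behaviour, not a full logistic-regression implementation:

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Values far below 0 map close to 0 (unlikely), values far above 0 map
# close to 1 (near certain); x = 0 maps to exactly 0.5.
for x in [-6, -2, 0, 2, 6]:
    print(x, round(sigmoid(x), 4))
```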
Working of K-NN
Let us try to understand the algorithm with a simple data set. Consider a very simple Student data
set as depicted in Figure 3.6. It consists of 15 students studying in a class. Each of the students
has been assigned a score on a scale of 10 on two performance parameters – ‘Aptitude’ and
‘Communication’. Also, a class value is assigned to each student based on the following criteria:
1. Students having good communication skills as well as a good level of aptitude have been
classified as ‘Leader’.
2. Students having good communication skills but not so good a level of aptitude have been
classified as ‘Speaker’.
3. Students having not so good communication skills but a good level of aptitude have been
classified as ‘Intel’.
assumed to be the test data. Now that we have the training data and test data identified, we can
start with the modelling.
class label of that data element is directly assigned to the test data element. This is depicted in
Figure 3.9.
• If the value of k is very large (in the extreme case equal to the total number of records in
the training data), the class label of the majority class of the training data set will be
assigned to the test data regardless of the class labels of the neighbours nearest to the
test data.
• If the value of k is very small (in the extreme case equal to 1), the class value of a noisy
data or outlier in the training data set which is the nearest neighbour to the test data will
be assigned to the test data.
The best k value is somewhere between these two extremes.
A few strategies, highlighted below, are adopted by Machine Learning practitioners to arrive at a
value for k.
• One common practice is to set k equal to the square root of the number of training
records.
• An alternative approach is to test several k values on a variety of test data sets and choose
the one that delivers the best performance.
• Another interesting approach is to choose a larger value of k, but apply a weighted voting
process in which the vote of close neighbours is considered more influential than the vote
of distant neighbours.
kNN Algorithm
Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest
neighbours to be considered)
Steps:
Do for all test data points
    Calculate the distance (usually Euclidean distance) of the test data point from the
    different training data points.
    Find the closest ‘k’ training data points, i.e. the training data points whose distances
    from the test data point are least.
    If k = 1
        Then assign the class label of the nearest training data point to the test data point
    Else
        Whichever class label is predominantly present among the ‘k’ nearest training data
        points, assign that class label to the test data point
End do
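The steps above can be sketched in Python. The student scores and class labels below are invented in the spirit of the Student data set, not taken from Figure 3.6:

```python
import math
from collections import Counter

def knn_classify(training_data, test_point, k):
    """Classify test_point by majority vote among its k nearest
    training points, using Euclidean distance.

    training_data: list of (features, class_label) pairs
    test_point:    list of feature values
    """
    # Distance of the test point from every training point.
    distances = [
        (math.dist(features, test_point), label)
        for features, label in training_data
    ]
    # The 'k' training points whose distances are least.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # k = 1 reduces to taking that single label; otherwise majority vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative data: (aptitude, communication) scores out of 10.
students = [
    ([8, 8], "Leader"), ([9, 7], "Leader"),
    ([4, 9], "Speaker"), ([3, 8], "Speaker"),
    ([9, 3], "Intel"),   ([8, 2], "Intel"),
]
print(knn_classify(students, [8.5, 7.5], k=3))  # -> Leader
```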
4.0 REGRESSION
In machine learning, a regression problem is the problem of predicting the value of a numeric
variable based on observed values of other variables. The value of the output variable may be a
number, such as an integer or a floating-point value. These are often quantities, such as amounts
and sizes. The input variables may be discrete or real-valued. Regression analysis is used to
determine the strength of a relationship between variables. Regression is essentially finding a
relationship (or) association between the dependent variable (Y) and the independent variable(s)
(X), i.e. to find the function ‘f ’ for the association Y = f (X).
Regression is used for the development of models which are used for prediction of the numerical
value of the target feature of a data instance.
Consider the data on car prices given in Table 4.1.
Table 4.1: Example of Data for Regression
Suppose we are required to estimate the price of a car aged 25 years with distance 53240 KM and
weight 1200 pounds. This is an example of a regression problem because we have to predict the
value of the numeric variable “Price”.
The most common regression algorithms are:
• Simple linear regression
• Multiple linear regression
• Polynomial regression
• Kernel ridge regression (KRR)
• Support vector regression (SVR)
• Lasso regression
• Maximum likelihood estimation (least squares), etc.
4.1 Linear Regression.
Linear regression comprises a straight line that splits the data points on a scatterplot. The goal of
linear regression is to split the data in a way that minimizes the distance between the regression
line and all data points on the scatterplot. This means that if you were to draw a vertical line from
the regression line to each data point on the graph, the aggregate of these distances would be
the smallest possible.
Table 4.2:
x : 1, 2, 1, 4, 3
y : 3, 4, 2, 7, 5

The slope ‘b’ and intercept ‘a’ of the least-squares regression line y = a + bx are given by:

b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
a = (Σy − bΣx) / n

Where:
Σ = Total sum
Σx = Total sum of all x values
Σy = Total sum of all y values
Σxy = Total sum of each x value multiplied by the corresponding y value
Σx² = Total sum of all squared x values
n = Number of data points

Σx = 1 + 2 + 1 + 4 + 3 = 11
Σy = 3 + 4 + 2 + 7 + 5 = 21
Σxy = 3 + 8 + 2 + 28 + 15 = 56
Σx² = 1 + 4 + 1 + 16 + 9 = 31
n = 5

b = (5 × 56 − 11 × 21) / (5 × 31 − 11²) = (280 − 231) / (155 − 121) = 49/34 ≈ 1.441
a = (21 − 1.441 × 11) / 5 ≈ 1.029

Insert the “a” and “b” values into a linear equation.
y = a + bx
y = 1.029 + 1.441x
The linear equation y = 1.029 + 1.441x dictates how to draw the regression line.
Let’s now test the regression line by looking up the coordinates for x = 2.
y = 1.029 + 1.441(x)
y = 1.029 + 1.441(2)
y = 3.911
In this case, the prediction is very close to the actual result of 4.0.
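The least-squares calculation above can be reproduced in a few lines of Python:

```python
# The five (x, y) data points used in the worked example.
xs = [1, 2, 1, 4, 3]
ys = [3, 4, 2, 7, 5]
n = len(xs)

sum_x = sum(xs)                              # 11
sum_y = sum(ys)                              # 21
sum_xy = sum(x * y for x, y in zip(xs, ys))  # 56
sum_x2 = sum(x * x for x in xs)              # 31

# Slope and intercept from the normal equations.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(round(a, 3), round(b, 3))  # ~1.029 and ~1.441
print(round(a + b * 2, 3))       # prediction at x = 2: ~3.912
```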
This is because the input data is just a limited, specific view, and the new, unknown data in the
test data set may differ quite a bit from the training data.
Fitness of a target function approximated by a learning algorithm determines how correctly it is
able to classify a set of data it has never seen.
5.1 Underfitting
If the target function is kept too simple, it may not be able to capture the essential nuances and
represent the underlying data well. A typical case of underfitting may occur when trying to
represent non-linear data with a linear model, as demonstrated by both cases of underfitting
shown in Figure 3.5.
Many times underfitting happens due to unavailability of sufficient training data. Underfitting
results in both poor performance with training data as well as poor generalization to test data.
Underfitting can be avoided by:
1. using more training data
2. increasing the number of relevant features, or the complexity of the model, so that it can
represent the data better
5.2 Overfitting
Overfitting refers to a situation where the model has been designed in such a way that it emulates
the training data too closely. In such a case, any specific deviation in the training data, like noise
or outliers, gets embedded in the model. It adversely impacts the performance of the model on
the test data. Overfitting, in many cases, occurs as a result of trying to fit an excessively complex
model to closely match the training data. This is represented with a sample data set in Figure 3.5.
The target function, in these cases, tries to make sure all training data points are correctly
partitioned by the decision boundary. However, more often than not, this exact nature is not
replicated in the unknown test data set. Hence, the target function results in wrong classification
in the test data set. Overfitting results in good performance with the training data set, but poor
generalization and hence poor performance with the test data set. Overfitting can be avoided by:
1. using re-sampling techniques like k-fold cross-validation
2. holding back a validation data set
3. removing the nodes which have little or no predictive power for the given Machine Learning
problem.
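Holding back a validation data set, as in point 2 above, can be sketched as follows. The data set here is synthetic and purely illustrative:

```python
import random

# Synthetic (x, y) data: y = 2x plus a little noise.
random.seed(42)
data = [(x, 2 * x + random.gauss(0, 0.1)) for x in range(100)]

# Shuffle, then hold back 20% of the data as a validation set:
# the model is fitted on train_set only, and its performance is
# checked on validation_set, which it has never seen.
random.shuffle(data)
split = int(0.8 * len(data))
train_set = data[:split]
validation_set = data[split:]

print(len(train_set), len(validation_set))  # 80 20
```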
5.3 Bias – variance trade-off
In supervised learning, the class value assigned by the learning model built based on the training
data may differ from the actual class value. This error in learning can be of two types – errors due
to ‘bias’ and error due to ‘variance’.
Let’s try to understand each of them in detail.
5.3.1 Errors due to Bias
Errors due to bias arise from simplifying assumptions made by the model to make the target
function less complex or easier to learn. In short, it is due to underfitting of the model. Parametric
models generally have high bias, making them easier to understand/interpret and faster to learn.
These algorithms have a poor performance on data sets, which are complex in nature and do not
align with the simplifying assumptions made by the algorithm. Underfitting results in high bias.
5.3.2 Errors due to Variance
Errors due to variance arise from differences in the training data sets used to train the model. Different
training data sets (randomly sampled from the input data set) are used to train the model. Ideally
the difference in the data sets should not be significant and the model trained using different
training data sets should not be too different. However, in case of overfitting, since the model
closely matches the training data, even a small difference in training data gets magnified in the
model.
So, the problems in training a model can happen because either
(a) the model is too simple and hence interprets the data too grossly, or
(b) the model is extremely complex and magnifies even small differences in the training data.
As is quite understandable:
when the value of ‘k’ is increased, the model becomes simpler and bias increases. On the other
hand, when the value of ‘k’ is decreased, the model fits the training data more closely and
variance increases.
6.1 Model Evaluation
To evaluate the performance of the model, the number of correct classifications or predictions
made by the model has to be recorded. A classification is said to be correct if, say for example in
the given problem, it has been predicted by the model that the team will win and it has actually
won.
Based on the number of correct and incorrect classifications or predictions made by a model, the
accuracy of the model is calculated. If 99 out of 100 times the model has classified correctly, e.g.
if in 99 out of 100 games what the model has predicted is same as what the outcome has been,
then the model accuracy is said to be 99%. However, it is quite relative to say whether a model
has performed well just by looking at the accuracy value. For example, 99% accuracy in case of a
sports win predictor model may be reasonably good but the same number may not be acceptable
as a good threshold when the learning problem deals with predicting a critical illness. In this case,
even the 1% incorrect prediction may lead to loss of many lives. So the model performance needs
to be evaluated in light of the learning problem in question. Also, in certain cases, erring on the
side of caution may be preferred at the cost of overall accuracy.
There are four possibilities with regard to the cricket match win/loss prediction:
1. The model predicted win and the team won – True Positive (TP)
2. The model predicted win and the team lost – False Positive (FP)
3. The model predicted loss and the team won – False Negative (FN)
4. The model predicted loss and the team lost – True Negative (TN)
                                      Actual Outcome
                               Positive (Cancer)   Negative (No Cancer)
Predicted   Positive (Cancer)         TP                   FP
Outcome     Negative (No Cancer)      FN                   TN
For any classification model, performance of the model can be evaluated using the confusion
matrix. Some of the performance metrics that can be evaluated are as follows:
Model Accuracy: Model accuracy is given by the total number of correct classifications divided by
the total number of classifications done:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Error Rate: The percentage of misclassifications is indicated using the error rate, which is
measured as:

Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 − Accuracy
Precision: Precision gives the proportion of positive predictions which are truly positive, and
indicates the reliability of a model in predicting a class of interest. It is given by:

Precision = TP / (TP + FP)
Recall: Recall indicates the proportion of positives correctly predicted out of the total number of
actual positives. Recall is given by:

Recall = TP / (TP + FN)
Sensitivity: The sensitivity of a model measures the proportion of TP examples or positive cases
which were correctly classified; it is the same measure as recall. It is measured as:

Sensitivity = TP / (TP + FN)
Specificity: Specificity is another good measure to indicate whether a model strikes a good
balance between being excessively conservative and excessively aggressive. The specificity of a
model measures the proportion of negative examples which have been correctly classified:

Specificity = TN / (TN + FP)

A higher value of specificity indicates a better model performance.
Example: Given the following confusion matrix, calculate the following performance measures:
i. Accuracy
ii. Precision
iii. Recall
iv. Sensitivity
v. Specificity
References:
Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2012). Foundations of Machine Learning. The MIT
Press, Cambridge, Massachusetts.
Dutt, S., Chandramouli, S. and Das, A. K. Machine Learning. Pearson.
Theobald, O. (2017). Machine Learning for Absolute Beginners.