
GOMBE STATE UNIVERSITY


DEPARTMENT OF COMPUTER SCIENCE
COSC 210 (2 CU) Introduction to Machine Learning 2023/2024 Session
Module I
INTRODUCTION TO MACHINE LEARNING
In this chapter, we consider different definitions of the term “Machine Learning” and explain what
is meant by “Learning” in the context of Machine Learning. We also discuss the various
components of the Machine Learning process. There are also brief discussions of different
types of learning, such as supervised learning, unsupervised learning and reinforcement learning.
1.1 Introduction
1.1.1 Definition of Machine Learning
Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning” in 1959 while at IBM. He defined Machine
Learning as “the field of study that gives computers the ability to learn without being explicitly
programmed.” However, there is no universally accepted definition for Machine Learning.
Different authors define the term differently. We give below three more definitions.
1. Machine Learning is programming computers to optimize a performance criterion using
example data or past experience. We have a model defined up to some parameters, and
learning is the execution of a computer program to optimize the parameters of the model
using the training data or past experience. The model may be predictive to make
predictions in the future, or descriptive to gain knowledge from data, or both.
2. The field of study known as Machine Learning is concerned with the question of how to
construct computer programs that automatically improve with experience.
3. Machine Learning can be broadly defined as computational methods using experience to
improve performance or to make accurate predictions. Here, experience refers to the past
information available to the learner, which typically takes the form of electronic data
collected and made available for analysis. This data could be in the form of digitized
human-labeled training sets, or other types of information obtained via interaction with
the environment. In all cases, its quality and size are crucial to the success of the
predictions made by the learner.
Remarks
In the above definitions we have used the term “model” and we will be using this term in several
contexts later. It appears that there is no universally accepted one-sentence definition of this
term. Loosely, it may be understood as some mathematical expression or equation, or some
mathematical structures such as graphs and trees, or a division of sets into disjoint subsets, or a
set of logical “if . . . then . . . else . . .” rules, or some such thing. It may be noted that this is not
an exhaustive list.
1.1.2 Definition of Learning
Definition
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks T, as measured by P, improves with experience
E.
Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given
classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training experience: A sequence of images and steering commands recorded
while observing a human driver
iii) A chess learning problem
• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself
Machine Learning program
A computer program which learns from experience is called a Machine Learning program or
simply a learning program. Such a program is sometimes also referred to as a learner.
1.2 Machine Learning Process
1.2.1 Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation. Figure 1.1 illustrates the various
components and the steps involved in the learning process.

Figure 1.1: Components of Learning Process


1. Data Storage
Facilities for storing and retrieving huge amounts of data are an important
component of the learning process. Humans and computers alike utilize data storage as a
foundation for advanced reasoning.
• In a human being, the data is stored in the brain and data is retrieved using
electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar
devices to store data and use cables and other technology to retrieve data.
2. Abstraction
The second component of the learning process is known as abstraction.
Abstraction is the process of extracting knowledge about stored data. This involves
creating general concepts about the data as a whole. The creation of knowledge involves
application of known models and creation of new models. The process of fitting a model
to a dataset is known as training. When the model has been trained, the data is
transformed into an abstract form that summarizes the original information.
3. Generalization
The third component of the learning process is known as generalisation. The term
generalization describes the process of turning the knowledge about stored data into a
form that can be utilized for future action. These actions are to be carried out on tasks
that are similar, but not identical, to those that have been seen before. In generalization,
the goal is to discover those properties of the data that will be most relevant to future
tasks.
4. Evaluation
Evaluation is the last component of the learning process. It is the process of giving
feedback to the user to measure the utility of the learned knowledge. This feedback is
then utilised to effect improvements in the whole learning process.
1.3 Applications of Machine Learning
Application of Machine Learning methods to large databases is called data mining. In data mining,
a large volume of data is processed to construct a simple model with valuable use, for example,
having high predictive accuracy. The following is a list of some of the typical applications of
Machine Learning.
1. In retail business, Machine Learning is used to study consumer behaviour.
2. In finance, banks analyze their past data to build models to use in credit applications, fraud
detection, and the stock market.
3. In manufacturing, learning models are used for optimization, control, and troubleshooting.
4. In medicine, learning programs are used for medical diagnosis.
5. In telecommunications, call patterns are analyzed for network optimization and maximizing the
quality of service.
6. In science, large amounts of data in physics, astronomy, and biology can only be analyzed fast
enough by computers. The World Wide Web is huge; it is constantly growing, and searching for
relevant information cannot be done manually.
7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.
8. It is used to find solutions to many problems in vision, speech recognition, and robotics.
9. Machine Learning methods are applied in the design of computer-controlled vehicles to steer
correctly when driving on a variety of roads.

10. Machine Learning methods have been used to develop programmes for playing games such
as chess, backgammon and Go.

11. Machine Learning is used in text or document classification, e.g., spam detection

12. It is also used in Natural Language Processing, e.g., morphological analysis, part-of-speech
tagging, statistical parsing, named-entity recognition

The list goes on, but here are more applications of Machine Learning: speech recognition, speech
synthesis and speaker verification; optical character recognition (OCR); computational biology
applications, e.g., protein function or structured prediction; computer vision tasks, e.g., image
recognition and face detection; fraud detection (credit card, telephone) and network intrusion
detection; unassisted vehicle control (robots, navigation); recommendation systems, search engines,
information extraction systems, etc.
1.4 Understanding Data
Since an important component of the Machine Learning process is data storage, we briefly
consider in this section the different types and forms of data that are encountered in the Machine
Learning process.
1.4.1 Unit of observation
Unit of observation is the smallest entity with measured properties of interest for a study.

Examples
• A person, an object or a thing
• A time point
• A geographic region
• A measurement
Sometimes, units of observation are combined to form units such as person-years.
1.4.2 Examples, Features and Labels
Datasets that store the units of observation and their properties can be imagined as collections
of data consisting of Examples and Features.
Examples
An “example” is an instance of the unit of observation for which properties have been recorded.
An “example” is also referred to as an “instance”, or “case” or “record.” (It may be noted that the
word “example” has been used here in a technical sense.) It typically represents a single
observation or unit of data used for training or testing a model.
Features
A “feature” is the set of attributes, often represented as a vector, associated with an example. It is
a recorded property or a characteristic of examples. It is also referred to as an “attribute” or
“variable”.
Label: Values or categories assigned to examples. In classification problems, examples are
assigned specific categories, for instance, the spam and non-spam categories in a binary
classification problem. In regression, items are assigned real-valued labels. Label is the output or
target variable that the model is trying to predict or classify. Labels are used in supervised
learning.
Examples for “examples”, “features” and “Label”
Case1: Cancer detection
Consider the problem of developing a model for detecting cancer. In this study we note the
following.
(a) The units of observation are the patients.
(b) The examples are members of a sample of cancer patients.
(c) The features can be: Gender, Age, Blood pressure, the findings of the pathology report after a
biopsy, etc.
(d) Label: Cancer status.


Case 2. House prices prediction
(a) The units of observation are the houses.
(b) The examples are a sample of houses within a particular region.
(c) The features might include: square footage, number of bedrooms, location, building age, etc.
(d) Label: prices of the houses.
Case 3. Pet selection: Suppose we want to predict the type of pet a person will choose.
(a) The units are the persons.
(b) The examples are members of a sample of persons who own pets.
(c) The features might include: age, home region, family income, etc. of the persons who own pets.
(d) Label: Type of pet.
Case 4. Class of Degree Prediction: Suppose we want to predict the class of degree a student is
likely to graduate with in Gombe State University.
(a) The units of observations are students.
(b) The examples are a sample of students who have graduated.
(c) The features might include: age, family background, type of sponsorship, O’level grades, UTME
Score, etc.
(d) Label: Class of degree.
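
To make these notions concrete, the short Python sketch below shows one simple way of holding examples, features and labels in a program. The feature names and values are illustrative assumptions loosely based on Case 4 above, not an actual data set.

# A minimal sketch, using hypothetical student records inspired by Case 4.
# Each example (row) is a list of feature values; labels hold the target class.

feature_names = ["age", "olevel_grade_points", "utme_score"]   # assumed features

examples = [            # one row per example (student)
    [18, 24, 260],
    [22, 18, 210],
    [20, 30, 290],
]

labels = ["Second Class Upper", "Third Class", "First Class"]   # assumed class labels

# A supervised learner would be trained on (examples, labels) pairs.
for x, y in zip(examples, labels):
    print(x, "->", y)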
1.4.3 Dataset
A data set is a collection of related information or records. The information may be on some entity
or some subject area. For example (Fig. 1.2), we may have a data set on students in which each
record consists of information about a specific student. Similarly, we can have a data set on student
performance in which each record provides the marks obtained in the individual subjects.

Figure 1.2: Examples of Data set


1.4.4. Training set: In Machine Learning, data is split into training data and test data. A training set
consists of examples used to train a learning algorithm. In our spam problem, the training sample consists
of a set of email examples along with their associated labels. The training sample varies for
different learning scenarios.
1.4.5 Validation sample: Examples used to tune the parameters of a learning algorithm when
working with labeled data. Learning algorithms typically have one or more free parameters, and
the validation sample is used to select appropriate values for these model parameters.
1.4.6 Test sample: Examples used to evaluate the performance of a learning algorithm. The test
sample is separate from the training and validation data and is not made available in the learning
stage. In the spam problem, the test sample consists of a collection of email examples for which
the learning algorithm must predict labels based on features. These predictions are then
compared with the labels of the test sample to measure the performance of the algorithm.
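
As a hedged illustration of how the training, validation and test samples are typically created in practice, the sketch below uses the train_test_split function from scikit-learn (assuming the library is installed); the feature matrix X and labels y are randomly generated purely for illustration.

import numpy as np
from sklearn.model_selection import train_test_split

# Assumed toy data: 100 examples, 4 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# First hold out 20% as the test sample, then carve a validation sample
# out of the remaining data for tuning the free parameters of the algorithm.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20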

1.5 Different forms of data


In the realm of Machine Learning, understanding data types, also known as measurement scales,
is crucial for effective data analysis. This understanding guides the selection of the appropriate
visualization and Machine Learning methods.

Data can broadly be divided into the following two types:
1. Qualitative data
2. Quantitative data
1.5.1 Qualitative Data
Qualitative data, also called categorical data, provides information about the quality of an object
or information which cannot be measured. For example, if we consider the quality of
performance of students in terms of ‘Good’, ‘Average’, and ‘Poor’, it falls under the category of
qualitative data. Also, the name or roll number of students is information that cannot be measured
using a scale of measurement, so it would also fall under qualitative data.
Qualitative data can be further subdivided into two types as follows:
1. Nominal data
2. Ordinal data
1.5.1.1 Nominal Data
Nominal data is one which has no numeric value, but a named value. It is used for assigning
named values to attributes. Nominal values cannot be quantified. Examples of nominal data are:
1. Blood group: A, B, O, AB, etc.
2. Nationality: Indian, American, British, etc.
3. Gender: Male, Female.
4. Colour: Red, Green, Blue, etc.
1.5.1.2 Ordinal Data
Ordinal data, in addition to possessing the properties of nominal data, can also be naturally
ordered. This means ordinal data also assigns named values to attributes but unlike nominal data,
they can be arranged in a sequence of increasing or decreasing value so that we can say whether
a value is better than or greater than another value. Examples of ordinal data are:
1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
2. Grades: A, B, C, etc.
3. Hardness of Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.
1.5.2 Quantitative Data
Quantitative data, also referred to as numeric data, relates to information about the quantity of an
object – hence it can be measured. For example, if we consider the attribute ‘score’, it can be
measured using a scale of measurement. There
are two types of quantitative data:
1. Interval data
2. Ratio data
1.5.2.1 Interval Data
Interval data is numeric data for which not only the order is known, but the exact difference
between values is also known. An ideal example of interval data is Celsius temperature, where the
difference between successive values remains the same. For example, the difference between 12°C
and 18°C is measurable and is 6°C, just as is the difference between 15.5°C and 21.5°C. Other
examples include date, time, etc.
Interval data do not have a ‘true zero’ value. For example, there is nothing called
‘0 temperature’ or ‘no temperature’. Hence, only addition and subtraction apply to interval
data; ratios cannot be taken. This means we can say a temperature of 40°C is equal to a
temperature of 20°C plus a temperature of 20°C, but we cannot say that a temperature of 40°C
is twice as hot as a temperature of 20°C.
1.5.2.2 Ratio Data
Ratio data represents numeric data for which exact values can be measured and for which an
absolute zero exists. These variables can be added, subtracted, multiplied, or divided. Central
tendency can be measured by mean, median, or mode, and dispersion by measures such as
standard deviation. Examples of ratio data include height, weight, age, salary, etc.
Figure 1.3 gives a summarized view of different types of data that we may find in a typical Machine
Learning problem.

Figure 1.3 Type of Data
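
As a small illustration of how these measurement scales appear when preparing data in Python, the sketch below uses the pandas library (assumed to be installed); the column names and values are invented purely for illustration.

import pandas as pd

# Assumed toy records mixing the four scales discussed above.
df = pd.DataFrame({
    "blood_group": ["A", "O", "AB"],                       # nominal
    "satisfaction": ["Unhappy", "Happy", "Very Happy"],    # ordinal
    "temperature_c": [12.0, 18.0, 21.5],                   # interval
    "salary": [45000, 80000, 60000],                       # ratio
})

# Nominal: unordered categories; ordinal: ordered categories.
df["blood_group"] = df["blood_group"].astype("category")
df["satisfaction"] = pd.Categorical(
    df["satisfaction"],
    categories=["Unhappy", "Happy", "Very Happy"],
    ordered=True,
)

print(df.dtypes)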

2.1 Types of Machine Learning


Machine Learning incorporates several hundred statistical-based algorithms and choosing the
right algorithm or combination of algorithms for the job is a constant challenge for anyone
working in this field. But before we examine specific algorithms, it is important to understand the
three overarching categories of Machine Learning. These three categories are:
1. Supervised learning – Also called predictive learning. A machine predicts the class of unknown
objects based on prior class-related information of similar objects.
2. Unsupervised learning – Also called descriptive learning. A machine finds patterns in unknown
objects by grouping similar objects together.
3. Reinforcement learning – A machine learns to act on its own to achieve the given goals.
These categories differ in the types of training data available to the learner, the order and method
by which training data is received and the test data used to evaluate the learning algorithm. Figure
2.1 shows the different categories of Machine Learning.

Figure 2.1: Types of Machine Learning



2.1.1. Supervised Machine Learning


Supervised Machine Learning involves the objective of understanding the mapping function
'f' that relates the input variable (X) to the output variable (Y), as represented in the equation
(2.1).
Y = f (X ) (2.1)
Where
Y : the output variable (Target)
X : the input variable (Set of features)
Supervised learning concentrates on learning patterns by connecting the relationship between
variables and known outcomes while working with labeled datasets. Supervised learning
works by feeding the machine sample data with various features (represented as “X”) and the
correct output value of the data (represented as “y”). The fact that the output and feature values
are known qualifies the dataset as “labeled.” The algorithm then deciphers patterns that exist in
the data and creates a model that can reproduce the same underlying rules with new data. The
algorithm learns from a training set and ceases learning once a satisfactory level of performance
is achieved.
Supervised Machine Learning can be categorized into:
i. Classification (where the output variable requires categorization)

ii. Regression (where the output is a real value).

Examples of Supervised Machine Learning algorithms include linear regression, random forest,
and Support Vector Machine (SVM).
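
The following is a minimal sketch of this fit-then-predict workflow using scikit-learn, a library that implements the algorithms named above; the labelled data here is randomly generated for illustration and is not drawn from this module.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed labeled training data: X holds feature vectors, y holds class labels.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Learn the mapping Y = f(X) from the labeled examples.
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Predict labels for new, unseen examples (test data).
X_test = rng.normal(size=(5, 3))
print(model.predict(X_test))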

Figure 2.2 is a simple depiction of the supervised learning process. Labelled training data
containing past information comes as an input. Based on the training data, the machine builds a
predictive model that can be used on test data to assign a label for each example in the test data.

Figure 2.2: Supervised Learning



2.1.2 Unsupervised Machine Learning


Unsupervised Machine Learning revolves around the objective of understanding and exploring
unlabelled input data (X) without the aid of historical data. The objective is to take a dataset as
input and try to find natural groupings or patterns within the data elements or examples.
Therefore, unsupervised learning is often termed a descriptive model, and the process of
unsupervised learning is referred to as pattern discovery or knowledge discovery.

Unsupervised Machine Learning is categorized into:


i. Association (aimed at discovering rules to describe the data)

ii. Clustering (focused on identifying inherent groups within the data).

Examples of Unsupervised Machine Learning methods include Apriori (Association) and k-means
(clustering).
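
Below is a hedged sketch of the clustering branch using k-means in scikit-learn; the two-dimensional points are fabricated purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Assumed unlabelled data: two loose blobs of 2-D points.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(4, 4), scale=0.5, size=(50, 2)),
])

# Ask k-means to discover k = 2 natural groups in the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1)
cluster_ids = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # approximate group centres
print(cluster_ids[:10])          # cluster assignment of the first 10 points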
Figure 2.3 depicts the unsupervised learning process.

Figure 2.3: Unsupervised Learning



2.1.3 Reinforcement Learning


Reinforcement learning is the third and most advanced algorithm category in Machine Learning.
It is a Machine Learning process that continuously improves its model by leveraging feedback
from previous iterations. Reinforcement Learning is centered around the goal of mapping actions
to situations in a way that maximizes the obtained rewards. This mapping process involves
considering not only the immediate rewards but also the rewards in subsequent steps.

Reinforcement learning can be complicated and is probably best explained through an analogy to
a video game. As a player progresses through the virtual space of a game, they learn the value of
various actions under different conditions and become more familiar with the field of play. Those
learned values then inform and influence a player’s subsequent behavior and their performance
immediately improves based on their learning and past experience. Reinforcement learning is
very similar, where algorithms are set to train the model through continuous learning. A standard
reinforcement learning model has measurable performance criteria where outputs are not
tagged—instead, they are graded. In the case of self-driving vehicles, avoiding a crash will allocate
a positive score and in the case of chess, avoiding defeat will likewise receive a positive score.

Examples of Reinforcement Learning methods include Monte Carlo methods, Markov decision
processes, Q-learning and Temporal Difference methods. The Reinforcement Learning process is
shown in Figure 2.4.

Figure 2.4: Reinforcement Learning
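
To make the reward-driven learning idea concrete, the sketch below implements the tabular Q-learning update mentioned above on a tiny made-up environment; the states, actions and rewards are assumptions for illustration only.

import numpy as np

# Assumed toy problem: 3 states in a row, move left/right, reward 1 for reaching state 2.
n_states, n_actions = 3, 2           # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))  # action-value table
alpha, gamma, epsilon = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

def step(state, action):
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(200):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy action selection
        action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state, reward = step(state, action)
        # Q-learning update: blend immediate reward with estimated future reward.
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(Q)   # higher values for the "right" action in each state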



Differences Between Supervised, Unsupervised and Reinforcement Machine Learning

The differences between the three categories of Machine Learning are shown in Table 2.1.

Table 2.1: Differences between Supervised, Unsupervised and Reinforcement Learning



2.2 Probability and Statistics Review


In this section we will discuss the tools, equations, and models of probability that are useful in the
Machine Learning domain.
2.2.1 Importance of Statistical Tools in Machine Learning
In Machine Learning, we train the system by using a limited data set called ‘training data’ and
based on the confidence level of the training data we expect the Machine Learning algorithm to
depict the behaviour of the larger set of actual data. If we have observations on only a subset of events,
called a ‘sample’, then there will be some uncertainty in attributing the sample results to the whole
set or population. So, the question is how limited knowledge of a sample set can be used to
predict the behaviour of the real set with some confidence. It was realized by mathematicians that
even if some knowledge is based on a sample, if we know the amount of uncertainty related to
it, then it can be used in an optimal way without loss of knowledge. Probability theory
provides a mathematical foundation for quantifying this uncertainty of the knowledge. This is
depicted in figure 2.5.

Figure 2.5 Knowledge and Uncertainty


2.2.2 Probability Theory Review
The basic concept in Machine Learning is that we have a limited set of ‘training’ data
that we use as a representative of a large set of actual data, and through probability distributions
we try to find out how an event which matches the training data can represent the
outcome with some confidence.
Foundation rules
p(A) denotes the probability that the event A is true.
0 ≤ p(A) ≤ 1 denotes that the probability of this event happening lies between 0 and 1, where
p(A) = 0 means the event will definitely not happen, and
p(A) = 1 means the event will definitely happen.
p(A̅ ) denotes the probability of the event not A,
defined as p(A̅ ) = 1 − p(A).

It is also common practice to write A = 1 to mean the event A is true, and A = 0 to mean the event
A is false. So, this is a binary event where the event is either true or false but can’t be something
indefinite. The probability of selecting an event A from a sample of size X is defined as

p(A) = n / X

where n is the number of times the instance of event A is present in the sample of size X.
2.2.2.1 Probability of a Union of two Events
Two events A and B are called mutually exclusive if they can’t happen together. For any two
events, A and B, the probability of A or B is defined as:

p(A ∪ B) = p(A) + p(B) − p(A ∩ B)

which reduces to p(A ∪ B) = p(A) + p(B) if A and B are mutually exclusive.


2.2.2.2 Joint Probabilities (Product rule)
The probability of the joint event A and B is defined by the product rule:

p(A, B) = p(A|B) p(B)

where p(A|B) is defined as the conditional probability of event A happening given that event B happens.
The quantity p(A, B) is called the joint distribution of the two events.
2.2.2.3 Conditional Probability
We define the conditional probability of event A, given that event B is true, as follows:

p(A|B) = p(A, B) / p(B)

where p(A, B) is the joint probability of A and B and can also be denoted as p(A ∩ B).
Similarly,

p(B|A) = p(A, B) / p(A)

Example
In a toy-making shop, the automated machine produces a few defective pieces. It is observed that
in a lot of 1,000 toy parts, 25 are defective. If two random samples are selected for testing without
replacement (meaning that the first sample is not put back to the lot and thus the second sample
is selected from the lot size of 999) from the lot, calculate the probability that both the samples
are defective.
Solution:
Let A denote the event that the first part is defective and B the event that the second part is
defective. Here, we have to employ the conditional probability of the second part being found
defective when the first part is already found defective. By the product rule of probability,

p(A, B) = p(A) p(B|A)

As we are selecting the second sample without replacing the first sample into the lot and the first
one is already found defective, there are now 24 defective pieces out of the 999 pieces left in the lot.
Thus,

p(A, B) = (25/1000) × (24/999) ≈ 0.0006

which is the probability of both the parts being found defective.
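
As a quick sanity check of this result, the short Python sketch below recomputes the same product of probabilities; it simply mirrors the arithmetic above.

# Probability that both sampled parts are defective, sampling without replacement.
p_first_defective = 25 / 1000          # p(A)
p_second_given_first = 24 / 999        # p(B | A)

p_both = p_first_defective * p_second_given_first
print(round(p_both, 4))                # 0.0006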
3.0 CATEGORIES OF SUPERVISED MACHINE LEARNING.
As discussed earlier, Supervised Machine Learning is categorized into Classification and
Regression. We will now discuss these two categories.
3.1 Classification
Classification is a type of supervised learning where a target feature, which is of categorical type,
is predicted for test data on the basis of the information imparted by the training data. The
responsibility of the classification model is to assign class label to the target feature based on the
value of the predictor features.

A classification problem is one where the output variable is a category such as ‘red’ or ‘blue’ or
‘malignant tumour’ or ‘benign tumour’, etc. The target categorical feature is known as class.

A critical classification problem in the context of the banking domain is identifying potentially
fraudulent transactions. Because there are millions of transactions which have to be scrutinized
to identify whether a particular transaction might be fraudulent, it is not possible for any
human being to carry out this task. Machine Learning is leveraged efficiently to do this task, and
this is a classic case of classification. On the basis of the past transaction data, especially the ones
labelled as fraudulent, all new incoming transactions are marked or labelled as usual or
suspicious. The suspicious transactions are subsequently segregated for a closer review.

Figure 3.1 Classification Model


Some typical classification problems include the following:
• Image classification
• Disease prediction
• Win–loss prediction of games


• Prediction of natural calamity such as earthquake, flood, etc.
• Handwriting recognition
3.2 Classification Learning Steps
The classification learning steps are represented in Figure 3.2.

Figure 3.2 Classification Model Step


Problem Identification:
Identifying the problem is the first step in the supervised learning model. The problem needs to
be a well-formed problem, i.e. a problem with well-defined goals and benefits, which has a long-
term impact.
Identification of Required Data:
On the basis of the problem identified above, the required data set that precisely represents the
identified problem needs to be identified/evaluated. For example: If the problem is to predict
whether a tumour is malignant or benign, then the corresponding patient data sets related to
malignant tumour and benign tumours are to be identified.
Data Pre-processing:
This step relates to cleaning and transforming the data set. It ensures that all the
unnecessary/irrelevant data elements are removed. Data pre-processing refers to the
transformations applied to the identified data before feeding the same into the algorithm.
Because the data is gathered from different sources, it is usually collected in a raw format and is
not ready for immediate analysis. This step ensures that the data is ready to be fed into the
Machine Learning algorithm.
Definition of Training Data Set:
Before starting the analysis, the user should decide what kind of data set is to be used as a training
set. In the case of signature analysis, for example, the training data set might be a single
handwritten character, an entire handwritten word (i.e. a group of characters) or an entire line
of handwriting (i.e. sentences or a group of words). Thus, a set of ‘input meta-objects’ and
corresponding ‘output meta-objects’ are also gathered. The training set needs to be actively
representative of the real-world use of the given scenario. Thus, a set of data input (X) and
corresponding outputs (Y) is gathered either from human experts or experiments.
Algorithm Selection:
This involves determining the structure of the learning function and the corresponding learning
algorithm. This is the most critical step of the supervised learning model. On the basis of various
parameters, the best algorithm for a given problem is chosen.
Training:
The learning algorithm identified in the previous step is run on the gathered training set for
further fine tuning. Some supervised learning algorithms require the user to determine specific
control parameters (which are given as inputs to the algorithm). These parameters (inputs given
to algorithm) may also be adjusted by optimizing performance on a subset (called a validation
set) of the training set.
Evaluation with the Test Data Set:
The test data set is run on the trained model, and its performance is measured here. If a suitable result is
not obtained, further training or parameter tuning may be required.

3.3 Common Classification Algorithms


The following are some common classification algorithms; we will discuss a few of them in
detail.
1. k-Nearest Neighbour (kNN)
2. Logistic Regression
3. Decision tree
4. Random forest
5. Support Vector Machine (SVM)
6. Naïve Bayes classifier
A Sigmoid Function
A sigmoid function produces an S-shaped curve that maps any real number to a value between 0
and 1, but it does so without ever reaching those exact limits.

Figure 3.3: A Sigmoid Function used to Classify Data Points


3.3.1 Logistic Regression
Logistic regression adopts the sigmoid function to analyze data and predict discrete classes that exist
in a dataset. Although logistic regression shares a visual resemblance to linear regression, it is
technically a classification technique. Whereas linear regression addresses numerical equations
and forms numerical predictions to discern relationships between variables, logistic regression
predicts discrete classes. Logistic regression is typically used for binary classification to predict
two discrete classes, e.g. has cancer or not. To do this, the sigmoid function (shown as follows) is
added to compute the result and convert numerical results into an expression of probability
between 0 and 1.
f(x) = 1 / (1 + e^(−x))

where:
x = the numerical value you wish to transform
e = Euler's constant, approximately 2.718
In a binary case, a value of 0 represents no chance of occurring, and 1 represents a certain chance
of occurring. The degree of probability for values located between 0 and 1 can be calculated
according to how close they rest to 0 (impossible) or 1 (certain possibility) on the scatterplot.
Figure 3.4 shows an example of logistic regression.

Figure 3.4 An Example of Logistic Regression


Logistic regression with more than two outcome values is known as multinomial logistic
regression, which can be seen in Figure 3.5.

Figure 3.5 An example of Multinomial Logistic Regression
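
As a hedged sketch of binary classification with logistic regression, the code below uses scikit-learn on fabricated one-feature data; predict_proba applies the sigmoid to a linear score, giving probabilities between 0 and 1.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed data: one feature (e.g. a measurement), binary label 0/1.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)

# Probabilities between 0 and 1, then hard 0/1 class predictions.
new_points = np.array([[-2.0], [0.0], [2.0]])
print(clf.predict_proba(new_points)[:, 1])   # probability of class 1
print(clf.predict(new_points))               # predicted classes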


3.3.2 k -Nearest Neighbour (kNN)
The kNN algorithm is a simple but extremely powerful classification algorithm. The name of the
algorithm originates from the underlying philosophy of kNN – i.e. people having similar
background or mindset tend to stay close to each other. In other words, neighbours in a locality
have a similar background. In the same way, as a part of the kNN algorithm, the unknown and
unlabelled data which comes for a prediction problem is judged on the basis of the training data
set elements which are similar to the unknown element. So, the class label of the unknown
element is assigned on the basis of the class labels of the similar training data set elements
(metaphorically can be considered as neighbours of the unknown element).

Working of kNN
Let us try to understand the algorithm with a simple data set. Consider a very simple Student data
set as depicted in Figure 3.6. It consists of 15 students studying in a class. Each of the students
has been assigned a score on a scale of 10 on two performance parameters – ‘Aptitude’ and
‘Communication’. Also, a class value is assigned to each student based on the following criteria:
1. Students having good communication skills as well as a good level of aptitude have been
classified as ‘Leader’.
2. Students having good communication skills but not so good level of aptitude have been
classified as ‘Speaker’ .
3. Students having not so good communication skill but a good level of aptitude have been
classified as ‘Intel’.

Figure 3.6: Student Data set


While building a classification model, a part of the labelled input data is retained as test data. The
remaining portion of the input data is used to train the model – hence known as training data.
The motivation to retain a part of the data as test data is to evaluate the performance of the
model.
In the context of the Student data set, to keep things simple, we assume one data element of the
input data set as the test data. As depicted in Figure 3.7, the record of the student named Josh is
assumed to be the test data. Now that we have the training data and test data identified, we can
start with the modelling.

Figure 3.7 : Segregated Student Data set


So, as depicted in Figure 3.8, the training data points of the Student data set considering only the
features ‘Aptitude’ and ‘Communication’ can be represented as dots in a two-dimensional feature
space.

Figure 3.8: 2-D Representation of Student Data Set


As shown in the figure, training data points having the same class value lie close to
each other. The reason for considering a two-dimensional data space is that we are considering just
the two features of the Student data set, i.e. ‘Aptitude’ and ‘Communication’, for doing the
classification. The feature ‘Name’ is ignored because, as we can understand, it has no role to play
in deciding the class value. The test data point for student Josh is represented as an asterisk in
the same space. To find the closest or nearest neighbours of the test data point, the Euclidean
distance of each dot from the asterisk needs to be calculated. Then, the class values of the
closest neighbours help in assigning the class value of the test data element.
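
For reference, the Euclidean distance between two points (x1, y1) and (x2, y2) in the Aptitude–Communication plane is sqrt((x1 − x2)² + (y1 − y2)²). The short sketch below computes it in Python; the example scores are hypothetical and are not the actual values from the Student data set.

import math

def euclidean_distance(p, q):
    # Straight-line distance between two feature vectors of equal length.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical (Aptitude, Communication) scores, not taken from Figure 3.6.
test_point = (6.0, 5.0)
training_point = (7.0, 4.5)
print(euclidean_distance(test_point, training_point))   # 1.118...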
Value of k (Number of Neighbours)
Now, let us try to answer the question of how many similar elements should
be considered. The answer lies in the value of ‘k’ which is a user-defined parameter given as an
input to the algorithm. In the kNN algorithm, the value of ‘k’ indicates the number of neighbours
that need to be considered. For example, if the value of k is 3, only three nearest neighbours or
three training data elements closest to the test data element are considered. Out of the three
data elements, the class which is predominant is considered as the class label to be assigned to
the test data. In case the value of k is 1, only the closest training data element is considered. The
class label of that data element is directly assigned to the test data element. This is depicted in
Figure 3.9.

Figure 3.9: Distance calculation between Test and Training Points


Let us now try to find out the outcome of the algorithm for the Student data set we have. In other
words, we want to see what class value kNN will assign for the test data for student Josh. Again,
let us refer back to Figure 3.9. As is evident, when the value of k is taken as 1, only one training
data point needs to be considered. The training record for student Gouri comes as the closest one
to test record of Josh, with a distance value of 1.118. Gouri has class value ‘Intel’. So, the test data
point is also assigned a class label value ‘Intel’. When the value of k is assumed as 3, the closest
neighbours of Josh in the training data set are Gouri, Susant, and Bobby with distances being
1.118, 1.414, and 1.5, respectively. Gouri and Bobby have class value ‘Intel’, while Susant has class
value ‘Leader’. In this case, the class value of Josh is decided by majority voting. Because the class
value of ‘Intel’ is formed by the majority of the neighbours, the class value of Josh is assigned as
‘Intel’. This same process can be extended for any value of k.
Choosing the Value of K
It is often tricky to decide the value of k. The reasons are as follows:
• If the value of k is very large (in the extreme case equal to the total number of records in
the training data), the class label of the majority class of the training data set will be
assigned to the test data regardless of the class labels of the neighbours nearest to the
test data.
• If the value of k is very small (in the extreme case equal to 1), the class value of a noisy
data or outlier in the training data set which is the nearest neighbour to the test data will
be assigned to the test data.
The best k value is somewhere between these two extremes.
A few strategies, highlighted below, are adopted by Machine Learning practitioners to arrive at a
value for k.
• One common practice is to set k equal to the square root of the number of training
records.
• An alternative approach is to test several k values on a variety of test data sets and choose
the one that delivers the best performance.
• Another interesting approach is to choose a larger value of k, but apply a weighted voting
process in which the vote of close neighbours is considered more influential than the vote
of distant neighbours.
kNN Algorithm
Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest
neighbours to be considered)
Steps:
Do for all test data points
    Calculate the distance (usually Euclidean distance) of the test data point from the different
    training data points.
    Find the closest ‘k’ training data points, i.e. the training data points whose distances from the
    test data point are the least.
    If k = 1
        Then assign the class label of the closest training data point to the test data point
    Else
        Assign the class label that is predominant among the ‘k’ closest training data points to the
        test data point
End do
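
The following is a minimal Python sketch of the algorithm above; the tiny training set of (Aptitude, Communication) scores and class labels is invented for illustration and is not the Student data set of Figure 3.6.

import math
from collections import Counter

def knn_predict(training_points, training_labels, test_point, k):
    # Distance of the test point from every training point (Euclidean distance).
    distances = [
        (math.dist(test_point, p), label)
        for p, label in zip(training_points, training_labels)
    ]
    # Keep the k closest training points and take a majority vote on their labels.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (Aptitude, Communication) -> class.
points = [(8, 8), (7, 9), (3, 8), (2, 9), (8, 3), (9, 2)]
labels = ["Leader", "Leader", "Speaker", "Speaker", "Intel", "Intel"]

print(knn_predict(points, labels, test_point=(7, 3), k=3))   # likely 'Intel'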
4.0 REGRESSION
In machine learning, a regression problem is the problem of predicting the value of a numeric
variable based on observed values of the variable. The value of the output variable may be a
number, such as an integer or a floating point value. These are often quantities, such as amounts
and sizes. The input variables may be discrete or real-valued. Regression analysis is used to
determine the strength of a relationship between variables. Regression is essentially finding a
relationship (or) association between the dependent variable (Y) and the independent variable(s)
(X), i.e. to find the function ‘f ’ for the association Y = f (X).
Regression is used for the development of models which are used for prediction of the numerical
value of the target feature of a data instance.
Consider the data on car prices given in Table 4.1.
Table 4.1: Example of Data for Regression

Suppose we are required to estimate the price of a car aged 25 years with distance 53240 KM and
weight 1200 pounds. This is an example of a regression problem because we have to predict the
value of the numeric variable “Price”.
The most common regression algorithms are:
• Simple linear regression
• Multiple linear regression
• Polynomial regression
• Kernel ridge regression (KRR)
• Support vector regression (SVR)
• Lasso
• Maximum likelihood estimation (least squares), etc.
4.1 Linear Regression.
Linear regression comprises a straight line that splits the data points on a scatterplot. The goal of
linear regression is to split the data in a way that minimizes the distance between the regression
line and all data points on the scatterplot. This means that if you were to draw a vertical line from
the regression line to each data point on the graph, the aggregate distance of each point would
equate to the smallest possible distance to the regression line.

Figure 4.1 Linear Regression Line


The regression line is plotted on the scatterplot in Figure 4.1. The technical term for the regression
line is the hyperplane, and you will see this term used throughout your study of Machine
Learning. A hyperplane is practically a trendline. Another important feature of regression is slope,
which can be conveniently calculated by referencing the hyperplane. As one variable increases,
the other variable will increase at the average value denoted by the hyperplane. The slope is
therefore very useful in formulating predictions. For example, if you wish to estimate the value of
Bitcoin at 800 days, you can enter 800 as your x coordinate and reference the slope by finding the
corresponding y value represented on the hyperplane. In this case, the y value is USD $1,850.

Figure 4.2: The Value of Bitcoin at day 800


As shown in Figure 4.2, the hyperplane reveals that you actually stand to lose money on your
investment at day 800 (after buying on day 736)! Based on the slope of the hyperplane, Bitcoin is
expected to depreciate in value between day 736 and day 800—despite no precedent in your
dataset for Bitcoin ever dropping in value. While it’s needless to say that linear regression isn’t a
fail-proof method for picking investment trends, the trendline does offer a basic reference point
to predict the future. If we were to use the trendline as a reference point earlier in time, say at
day 240, then the prediction posted would have been more accurate. At day 240 there is a low
degree of deviation from the hyperplane, while at day 736 there is a high degree of deviation.
Deviation refers to the distance between the hyperplane and the data point.

Figure 4.3: The Distance of the Data Points to the Hyperplane


In general, the closer the data points are to the regression line, the more accurate the final
prediction. If there is a high degree of deviation between the data points and the regression line,
the slope will provide less accurate predictions. Basing your predictions on the data point at day
736, where there is high deviation, results in poor accuracy. In fact, the data point at day 736
constitutes an outlier because it does not follow the same general trend as the previous four data
points. What’s more, as an outlier it exaggerates the trajectory of the hyperplane based on its
high y-axis value. Unless future data points scale in proportion to the y-axis values of the outlier
data point, the model’s predictive accuracy will suffer.
Calculation Example
Although your programming language will take care of this automatically, it’s useful to understand
how linear regression is actually calculated. We will use the following dataset (table 4.2) and
formula to perform linear regression.

Table 4.2:

x    y
1    3
2    4
1    2
4    7
3    5

Linear Regression Formula

a = (Σy · Σx² − Σx · Σxy) / (n · Σx² − (Σx)²)

b = (n · Σxy − Σx · Σy) / (n · Σx² − (Σx)²)


Reminder: In supervised learning, the output is obtained by evaluating a mapping function given
by
Y = f(x)
For linear regression, the mapping function is given by
Y = a + bx
Where:
Y = Dependent variable
X = Independent variable
a = intercept and
b = slope of the straight line, as shown in Figure 4.4.

Figure 4.4 Simple Linear Regression

Where:
Σ = Total sum
Σx = Total sum of all x values
Σy = Total sum of all y values
Σxy = Total sum of x*y for each row


Σx² = Total sum of x*x for each row
n = Total number of rows
Using our example dataset, we expand our table as follows,
Table 4.3:

x    y    xy    x²
1    3    3     1
2    4    8     4
1    2    2     1
4    7    28    16
3    5    15    9

Σx = 1 + 2 + 1 + 4 + 3 = 11
Σy = 3 + 4 + 2 + 7 + 5 = 21
Σxy = 3 + 8 + 2 + 28 + 15 = 56
Σx² = 1 + 4 + 1 + 16 + 9 = 31
n = 5.

a = ((21 x 31) – (11 x 56)) / (5(31) – (11)²)


= (651 – 616) / (155 – 121)
= 35 / 34
= 1.029
b = (5(56) – (11 x 21)) / (5(31) – (11)²)
= (280 – 231) / (155 – 121)
= 49 / 34
= 1.441
Insert the “a” and “b” values into a linear equation.
y = a + bx
y = 1.029 + 1.441x
The linear equation y = 1.029 + 1.441x dictates how to draw the hyperplane.

Let’s now test the regression line by looking up the coordinates for x = 2.
y = 1.029 + 1.441(x)
y = 1.029 + 1.441(2)
y = 3.911
In this case, the prediction is very close to the actual result of 4.0.
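
As a small sketch, the Python snippet below reproduces this least-squares calculation directly from the column sums of Table 4.3; it should print values close to a ≈ 1.029 and b ≈ 1.441.

# Simple linear regression y = a + b*x computed from the column sums.
xs = [1, 2, 1, 4, 3]
ys = [3, 4, 2, 7, 5]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

a = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

print(round(a, 3), round(b, 3))      # intercept and slope
print(round(a + b * 2, 3))           # prediction at x = 2, close to 4.0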

Figure 4.5: The Linear Regression Hyperplane Plotted on the Scatterplot


5.0 MODEL REPRESENTATION AND INTERPRETABILITY
We have already seen that the goal of supervised Machine Learning is to learn or derive a target
function which can best determine the target variable from the set of input variables. A key
consideration in learning the target function from the training data is the extent of generalization.
This is because the input data is just a limited, specific view and the new, unknown data in the
test data set may differ quite a bit from the training data.
Fitness of a target function approximated by a learning algorithm determines how correctly it is
able to classify a set of data it has never seen.
5.1 Underfitting
If the target function is kept too simple, it may not be able to capture the essential nuances and
represent the underlying data well. A typical case of underfitting may occur when trying to
represent non-linear data with a linear model.
Many times underfitting happens due to unavailability of sufficient training data. Underfitting
results in both poor performance with training data as well as poor generalization to test data.
Underfitting can be avoided by:
1. using more training data
2. reducing features by effective feature selection
5.2 Overfitting
Overfitting refers to a situation where the model has been designed in such a way that it emulates
the training data too closely. In such a case, any specific deviation in the training data, like noise
or outliers, gets embedded in the model. It adversely impacts the performance of the model on
the test data. Overfitting, in many cases, occurs as a result of trying to fit an excessively complex
model to closely match the training data. The target function, in these cases, tries to make sure all
training data points are correctly
partitioned by the decision boundary. However, more often than not, this exact nature is not
replicated in the unknown test data set. Hence, the target function results in wrong classification
in the test data set. Overfitting results in good performance with training data set, but poor
generalization and hence poor performance with the test data set. Overfitting can be avoided by:
1. using re-sampling techniques like k-fold cross validation (a sketch of this is given after the list)
2. holding back a validation data set
3. removing the nodes which have little or no predictive power for the given Machine Learning
problem.
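
Below is a hedged sketch of k-fold cross validation with scikit-learn, using a decision tree as a model that can easily overfit; the synthetic data and the choice of 5 folds are assumptions made purely for illustration.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data set.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# A deep, unconstrained tree can overfit; cross validation exposes this by
# repeatedly training on k-1 folds and evaluating on the held-out fold.
model = DecisionTreeClassifier(random_state=3)
scores = cross_val_score(model, X, y, cv=5)

print(scores)            # accuracy on each of the 5 held-out folds
print(scores.mean())     # average estimate of generalization performance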
5.3 Bias – variance trade-off
In supervised learning, the class value assigned by the learning model built based on the training
data may differ from the actual class value. This error in learning can be of two types – errors due
to ‘bias’ and error due to ‘variance’.
Let’s try to understand each of them in detail.
5.3.1 Errors due to Bias
Errors due to bias arise from simplifying assumptions made by the model to make the target
function less complex or easier to learn. In short, it is due to underfitting of the model. Parametric
models generally have high bias making them easier to understand/interpret and faster to learn.
These algorithms have a poor performance on data sets, which are complex in nature and do not
align with the simplifying assumptions made by the algorithm. Underfitting results in high bias.
5.3.2 Errors due to Variance
Errors due to variance occur from difference in training data sets used to train the model. Different
training data sets (randomly sampled from the input data set) are used to train the model. Ideally
the difference in the data sets should not be significant and the model trained using different
training data sets should not be too different. However, in case of overfitting, since the model
closely matches the training data, even a small difference in training data gets magnified in the
model.

So, the problems in training a model can happen because either
(a) the model is too simple and hence interprets the data too grossly, or
(b) the model is extremely complex and magnifies even small differences in the training data.
As is quite understandable:

• Increasing the bias will decrease the variance, and


• Increasing the variance will decrease the bias
On one hand, parametric algorithms are generally seen to demonstrate high bias but low
variance. On the other hand, non-parametric algorithms demonstrate low bias and high variance.
Ideally, the best solution is to have a model with low bias as well as low
variance. However, that may not be possible in reality. Hence, the goal of supervised Machine
Learning is to achieve a balance between bias and variance. The learning algorithm chosen and
the user parameters which can be configured help in striking a trade-off between bias and
variance. For example, in a popular supervised algorithm k-Nearest Neighbors or kNN, the user
configurable parameter ‘k’ can be used to trade off bias and variance. On one hand,
when the value of ‘k’ is increased, the model becomes simpler and bias increases. On the
other hand, when the value of ‘k’ is decreased, the model follows the training data more closely and
variance increases.
6.1 Model Evaluation
To evaluate the performance of the model, the number of correct classifications or predictions
made by the model has to be recorded. A classification is said to be correct if, for example in a
match win/loss prediction problem, the model has predicted that the team will win and it has actually
won.
Based on the number of correct and incorrect classifications or predictions made by a model, the
accuracy of the model is calculated. If 99 out of 100 times the model has classified correctly, e.g.
if in 99 out of 100 games what the model has predicted is same as what the outcome has been,
then the model accuracy is said to be 99%. However, it is quite relative to say whether a model
has performed well just by looking at the accuracy value. For example, 99% accuracy in case of a
sports win predictor model may be reasonably good but the same number may not be acceptable
as a good threshold when the learning problem deals with predicting a critical illness. In this case,
even the 1% incorrect prediction may lead to loss of many lives. So the model performance needs
to be evaluated in light of the learning problem in question. Also, in certain cases, erring on the
side of caution may be preferred at the cost of overall accuracy.
There are four possibilities with regard to a cricket match win/loss prediction:
1. The model predicted win and the team won – True Positive (TP)
2. The model predicted win and the team lost – False Positive (FP)
3. The model predicted loss and the team won – False Negative (FN)
4. The model predicted loss and the team lost – True Negative (TN)

6.2 Confusion Matrix


A matrix containing correct and incorrect predictions in the form of TPs, FPs, FNs and TNs is known
as a confusion matrix. In the problem of predicting whether a patient with a tumor has cancer or
not, the confusion matrix for that problem can be represented as shown in the table below:

                                 Actual Outcome
Predicted Outcome                Positive (Cancer)      Negative (No Cancer)
Positive (Cancer)                TP                     FP
Negative (No Cancer)             FN                     TN

For any classification model, performance of the model can be evaluated using the confusion
matrix. Some of the performance metrics that can be evaluated are as follows:
Model Accuracy: Model accuracy is given by the total number of correct classifications divided by the
total number of classifications done:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Error Rate: The percentage of misclassifications is indicated using the error rate, which is measured
as:

Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 − Accuracy

Precision: Precision gives the proportion of positive predictions which are truly positive, and
indicates the reliability of a model in predicting a class of interest. It is given by:

Precision = TP / (TP + FP)

Recall: Recall indicates the proportion of correct predictions of positives to the total number of
actual positives. Recall is given by:

Recall = TP / (TP + FN)

Sensitivity: The sensitivity of a model measures the proportion of TP examples or positive cases
which were correctly classified. It is measured as:

Sensitivity = TP / (TP + FN)

Specificity: Specificity is another good measure to indicate whether a model strikes a good balance
between being excessively conservative and excessively aggressive. The specificity of a model measures
the proportion of negative examples which have been correctly classified:

Specificity = TN / (TN + FP)

A higher value of specificity indicates better model performance.

Example: Given the following confusion matrix, calculate the following performance measures:
i. Accuracy
ii. Precision
iii. Recall
iv. Sensitivity
v. Specificity
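
Since the confusion matrix for this exercise is not reproduced here, the sketch below works through the same measures in Python with assumed counts (TP = 40, FP = 10, FN = 5, TN = 45); the actual values from the exercise's matrix should be substituted in their place.

# Assumed confusion-matrix counts, for illustration only.
TP, FP, FN, TN = 40, 10, 5, 45
total = TP + FP + FN + TN

accuracy = (TP + TN) / total
error_rate = (FP + FN) / total
precision = TP / (TP + FP)
recall = TP / (TP + FN)           # also the sensitivity
specificity = TN / (TN + FP)

print("Accuracy:   ", round(accuracy, 3))     # 0.85
print("Error rate: ", round(error_rate, 3))   # 0.15
print("Precision:  ", round(precision, 3))    # 0.8
print("Recall:     ", round(recall, 3))       # 0.889
print("Specificity:", round(specificity, 3))  # 0.818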

References:
Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2012). Foundations of Machine Learning. The MIT Press, Cambridge, Massachusetts.
Dutt, S., Chandramouli, S. and Das, A. K. Machine Learning. Pearson.
Theobald, O. (2017). Machine Learning for Absolute Beginners.
