Unit-1-Introduction (Fundamentals of ML & AI) January 29, 2024

Uploaded by

Arin Daniel

Study Material of Unit-1 Syllabus

Introduction: Definition of learning system, goals and applications of
machine learning. Aspects of developing a learning system: training data,
concept representation, function approximation.

What is Machine Learning?

Introduction

Machine learning is an umbrella term for a set of techniques and tools that
help computers learn and adapt on their own. Machine learning algorithms
let a system learn a task without being explicitly programmed to perform
it. After learning a pattern from sample inputs, the algorithm makes
predictions and performs tasks based on that learned pattern rather than on
predefined program instructions. This makes machine learning especially
useful where a fixed, hand-written algorithm is not practical: the system
learns the behaviour from previous examples and applies that knowledge to
new cases.

One of the machine learning applications we are familiar with is the way
our email providers help us deal with spam. Spam filters use an algorithm
to identify and move incoming junk email to your spam folder. Several e-
commerce companies also use machine learning algorithms in conjunction
with other IT security tools to prevent fraud and improve their
recommendation engine performance.

With the help of sample historical data, known as training data, machine
learning algorithms build a mathematical model that makes predictions or
decisions without being explicitly programmed for the task. Machine
learning brings computer science and statistics together to create
predictive models, and it constructs or uses algorithms that learn from
historical data. In general, the more relevant, good-quality data we
provide, the better the model tends to perform.

Artificial Intelligence: AI can be considered the superset, and most of the
time it is used as a generic term. The robot Sophia, the chatbots we
interact with, face recognition on mobile phones and even self-driving
cars all come under this big umbrella called AI. Broadly, the term covers
any system that tries to mimic the behaviour of humans or other living
entities.

Machine Learning: Machine Learning can generally be defined as something
that learns by experience. Here learning means a training phase in which
the model studies the dataset (the experience). E.g.: predictive
analytics, speech/audio recognition, etc.

Deep Learning: Deep Learning uses an architecture loosely modelled on our
nervous system (artificial neural networks with many layers). Deep
Learning models are widely used in the FinTech industry, healthcare,
construction and so on.

Dependent Variables and Independent Variables

Suppose we are given the above dataset and asked to predict whether
football was played. The column Played football(yes/no) is then our target
variable (dependent variable), and the other columns are our independent
variables (features).
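
As a minimal sketch of this split, the rows below are hypothetical stand-ins for the dataset (only the target column name is taken from the text; the `Outlook` and `Windy` columns and all values are assumptions for illustration). The helper `split_features_target` is likewise an illustrative name.

```python
# Hypothetical rows: only the target column name comes from the notes.
rows = [
    {"Outlook": "sunny", "Windy": "no", "Played football(yes/no)": "yes"},
    {"Outlook": "overcast", "Windy": "no", "Played football(yes/no)": "yes"},
    {"Outlook": "rainy", "Windy": "yes", "Played football(yes/no)": "no"},
]

TARGET = "Played football(yes/no)"


def split_features_target(rows, target):
    """Return (X, y): the feature columns of each row, and the target values."""
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    y = [row[target] for row in rows]
    return X, y


X, y = split_features_target(rows, TARGET)
print(X[0])  # independent variables (features) of the first row
print(y)     # dependent variable (target) for every row
```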

Tasks in Machine Learning

Machine Learning : Introduction

What is Machine Learning?

Machine Learning (ML) is a sub-field of Artificial Intelligence (AI) which
is concerned with developing computational theories of learning and
building learning machines. The goal of machine learning, closely coupled
with the goal of AI, is to achieve a thorough understanding of the nature
of the learning process (both human learning and other forms of learning)
and of the computational aspects of learning behaviors, and to implant the
learning capability in computer systems. Machine learning has been
recognized as central to the success of Artificial Intelligence, and it
has applications in various areas of science, engineering and society.

What is Learning?

Before we talk about machine learning, it is useful to discuss the question
"what is learning?". Learning is a phenomenon and a process that manifests
itself in various ways. Roughly speaking, the learning process includes one
or more of the following:

(1) Acquisition of new (symbolic) knowledge. For example, learning
mathematics is this kind of learning. When we say someone has
learned math, we mean that the learner obtained descriptions of the
mathematical concepts and understood their meaning and their
relationship with each other. The effect of learning is that the
learner has acquired knowledge of mathematical systems and their
properties, and that the learner can use this knowledge to solve math
problems. Thus this kind of learning is characterized as obtaining
new symbolic information plus the ability to apply that information
effectively.

(2) Development of motor or cognitive skills through instruction
and practice. Examples of this kind of learning are learning to ride
a bicycle, to swim, to play the piano, etc. This kind of learning is also
called skill refinement. In this case, just acquiring a symbolic
description of the rules to perform the task is not sufficient; repeated
practice is needed for the learner to obtain the skill. Skill refinement
takes place at the subconscious level.

(3) Refinement and organization of knowledge into more effective
representations or more useful form. One example of this kind of
learning can be reorganization of the rules in a knowledge base such
that more important rules are given higher priorities so that they can
be used more easily and conveniently.

(4) Discovery of new facts and theories through observation and
experiment. For example, the discovery of the laws of physics and
chemistry.

The general effect of learning in a system is the improvement of the
system’s capability to solve problems. It is hard to imagine a system that
is capable of learning but cannot improve its problem-solving performance.
A system with learning capability should be able to change itself in
order to perform better in its future problem-solving.

We also note that learning cannot take place in isolation: we
typically learn something (knowledge K) to perform some task (T),
through some experience E, and whether we have learned well or not
is judged by some performance criteria P at the task T. For
example, as Tom Mitchell put it in his ML book, for the "checkers
learning problem" the task T is to play the game of checkers, the
performance criteria P could be the percentage of games won against
opponents, and the experience E could take the form of playing practice
games with a teacher (or against itself). For learning to take place, we
do need a learning algorithm A for self-changing, which allows the learner
to get experience E in the task T and acquire knowledge K (thus changing
the learner’s knowledge set) to improve the learner’s performance at task
T.

Learning = Improving performance P at task T by acquiring
knowledge K using self-changing algorithm A through experience
E in an environment for task T.

Learning involves:

T : Task in an environment

E : Experience in T

P : Performance criteria

K : Knowledge acquired

A : Algorithm for self-changing

There are various forms of improvement of a system’s problem-solving
ability:

1) To solve a wider range of problems than before - perform
generalization.
2) To solve the same problem more effectively - give better quality
solutions.
3) To solve the same problem more efficiently - faster.

There are other viewpoints as to what constitutes the notion of
learning. For example, Minsky gives a more general definition,
"Learning is making useful changes in our minds".

And McCarthy suggests,
"Learning is constructing or modifying representations of what is
being experienced."

From this perspective, the central aspect of learning is the acquisition of
certain forms of representation of some reality, rather than the
improvement of performance. However, since it is in general much easier to
observe a system’s performance behavior than its internal representation
of reality, we usually link the learning behavior with the improvement of
the system’s performance.

The main focus of machine learning is to study the computational aspect of
the learning process and to construct machines that are capable of learning
(via computation). We need not only to build computational models of
learning, but also to design and implement efficient learning algorithms
according to such computational models. The study of the cognitive aspect
of learning is of course important to machine learning; however, the
computational aspect of learning is of central significance.

Machine Learning Definitions

Alan Turing’s seminal paper (Turing, 1950) introduced a benchmark
standard for demonstrating machine intelligence: a machine has to
be intelligent and responsive in a manner that cannot be differentiated
from that of a human being.

Arthur Samuel (1959): “Machine Learning is a field of study that
gives computers the ability to learn without being explicitly
programmed.” Samuel wrote a checkers-playing program which could learn
over time. At first it could be beaten easily. But over time it learnt
which board positions would eventually lead to victory or loss, and thus
became a better checkers player than Samuel himself. This was one of the
earliest attempts at defining Machine Learning and is somewhat less formal.

Tom Mitchell (1997): “A computer program is said to learn from experience
E with respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with experience
E.” This is a more formal and mathematical definition. For the checkers
program above:
• E is the experience of playing many games.

• T is the task of playing checkers.

• P is the fraction of games the program wins.

Machine Learning is an application of artificial intelligence where a
computer/machine learns from past experiences (input data) and
makes future predictions. For many applications, the aim is for such a
system to perform at least as well as a human would on the same task.

How does Machine Learning work?


A Machine Learning system learns from historical data, builds the
prediction models, and whenever it receives new data, predicts the
output for it. The accuracy of the predicted output depends on the amount
and quality of the data: a larger amount of representative data generally
helps to build a better model, which predicts the output more accurately.

Suppose we have a complex problem where we need to make some
predictions. Instead of writing code for it by hand, we feed the data
to generic algorithms; with the help of these algorithms, the machine
builds the logic from the data and predicts the output. Machine learning
has changed our way of thinking about the problem. The block diagram below
explains the working of a Machine Learning algorithm:
Features of Machine Learning:

✓ Machine learning uses data to detect various patterns in a given
dataset.
✓ It can learn from past data and improve automatically.
✓ It is a data-driven technology.
✓ Machine learning is very similar to data mining, as it also deals with
huge amounts of data.

Need for Machine Learning

The need for machine learning is increasing day by day. The reason is that
it is capable of doing tasks that are too complex for a person to
implement directly. As humans, we have limitations: we cannot process huge
amounts of data manually, so we need computer systems for this, and this
is where machine learning comes in to make things easier for us.

We can train machine learning algorithms by providing them with a huge
amount of data and letting them explore the data, construct the models, and
predict the required output automatically. The performance of a machine
learning algorithm depends on the amount of data, and it can be measured
with a cost function. With the help of machine learning, we can save both
time and money.
The importance of machine learning can be easily understood from its use
cases. Currently, machine learning is used in self-driving cars, cyber
fraud detection, face recognition, friend suggestions on Facebook, etc.
Various top companies such as Netflix and Amazon have built machine
learning models that use a vast amount of data to analyze user
interests and recommend products accordingly.

WHY MACHINE LEARNING?

To answer this question, we should look at two issues:

(1) What are the goals of machine learning?
(2) Why are these goals important and desirable?

The Goals of Machine Learning.

The goal of ML, in simple words, is to understand the nature of (human
and other forms of) learning, and to build learning capability in
computers. To be more specific, there are three aspects of the goals of
ML.

(1) To make computers smarter and more intelligent. The more
direct objective in this aspect is to develop systems (programs) for
specific practical learning tasks in application domains.
(2) To develop computational models of the human learning process
and perform computer simulations. The study in this aspect is
also called cognitive modeling.
(3) To explore new learning methods and develop general learning
algorithms independent of applications.

Why the goals of ML are Important and Desirable?

It is self-evident that the goals of ML are important and desirable.
However, we still give some supporting arguments on this issue.
First of all, implanting learning ability in computers is practically
necessary. Present-day computer applications require the representation
of huge amounts of complex knowledge and data in programs and thus
require a tremendous amount of work. Our ability to code the computers
falls short of the demand for applications. If computers are endowed
with the ability to learn, then our burden of coding the machine is eased
(or at least reduced). This is particularly true for developing expert
systems, where the "bottleneck" is to extract the expert’s knowledge
and feed that knowledge to computers. Present-day computer
programs in general (with the exception of some ML programs) cannot
correct their own errors, improve from past mistakes, or learn to
perform a new task by analogy to a previously seen task. In contrast,
human beings are capable of all of the above. ML aims to produce smarter
computers capable of the same intelligent behavior.

Second, the understanding of human learning and its computational
aspect is a worthy scientific goal. We human beings have long been
fascinated by our capacity for intelligent behavior and have been
trying to understand the nature of intelligence. It is clear that central to
our intelligence is our ability to learn. Thus a thorough understanding
of the human learning process is crucial to understanding human
intelligence. ML can give us insight into the underlying principles of
human learning, which may lead to the discovery of more effective education
techniques. It will also contribute to the design of machine learning
systems.

Finally, it is desirable to explore alternative learning mechanisms in the
space of all possible learning methods. There is no reason to believe that
the way human beings learn is the only possible mechanism of learning.
It is worth exploring other methods of learning which may be more
efficient and effective than human learning.

We remark that Machine Learning has become feasible in many
important applications (hence the popularity of the field) partly
because of recent progress in learning algorithms and theory, the
rapid increase in computational power, the wide availability of huge
amounts of data, and interest in commercial ML application
development.

Following are some key points which show the importance of Machine
Learning:

➢ Rapid growth in the production of data
➢ Solving complex problems that are difficult for a human
➢ Decision making in various sectors, including finance
➢ Finding hidden patterns and extracting useful information from data.

Moreover, we note that ML is inherently a multi-disciplinary subject
area.
We compare human learning with machine learning along the
dimensions of speed, ability to transfer, and others. The comparison shows
that machine learning is both an opportunity and a challenge, in the sense
that we can hope to discover ways for machines to learn which are better
than the ways humans learn (the opportunity), and that there are ample
difficulties to be overcome in order to make machines learn (the
challenge).

Dimension            Human Learning       Machine Learning
Speed                Slow                 Slow - hope to find tricks for
                                          machine to learn fast
Ability to transfer  No copy mechanism    Easy to copy
Require repetition   Yes                  Yes/No
Error-prone          Yes                  Yes
Noise-tolerant       Yes                  No

Demystifying Machine Learning :


“Demystifying Machine Learning” is a term used to describe the process of
making machine learning more accessible and understandable to a wider
audience. Machine learning can be a complex and intimidating field, but
with the right approach it can be made easier to approach and understand.
There are several ways to demystify machine learning, including:

1. Breaking down complex concepts: Machine learning concepts can be
complex and difficult to understand. Breaking them down into simpler
components and providing clear explanations makes them more
accessible to a wider audience.
2. Using visualizations: Visualizations such as graphs, diagrams, and
animations can help to explain complex concepts in a more intuitive
way.
3. Providing hands-on examples: Providing hands-on examples of how
machine learning works in practice can help to make it more tangible
and understandable.
4. Simplifying the mathematical notation: Machine learning often uses
complex mathematical notation that can be difficult for non-experts to
understand. Simplifying the notation and providing explanations can
make the concepts more accessible.
5. Using real-world applications: Explaining how machine learning is
used in real-world applications can help to make it more relatable and
understandable.
6. Encouraging experimentation and exploration: Encouraging
experimentation and exploration with machine learning can help to
demystify it by allowing people to see how it works in practice and gain
a deeper understanding of it.

What is Machine Learning Model?


A Machine Learning model can be understood as a program that has been
trained to find patterns in data and make predictions on new data. A
model is represented as a mathematical function that takes a request in
the form of input data, makes a prediction on that input, and then provides
an output in response. First, a learning algorithm is run over a set of
training data so that the model can reason over the data, extract the
patterns from it and learn from them. Once the model is trained, it can be
used to make predictions on unseen data. There are various types of
machine learning models available, based on different business goals and
data sets.

Classification of Machine Learning Models:


Based on different business goals and data sets, there are three major
learning models for algorithms, plus a fourth that combines the first two.
Each machine learning algorithm falls into one of these models:
1) Supervised Learning
2) Unsupervised Learning
3) Reinforcement Learning
4) Semi-Supervised Learning (not considered for the exam)
a) Supervised Learning:

Supervised learning is the type of machine learning in which machines
are trained using well "labelled" training data, and on the basis of that
data, machines predict the output. Labelled data means that the input data
is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as
the supervisor that teaches the machine to predict the output correctly. It
applies the same concept as a student learning under the supervision of a
teacher.
Supervised learning is a process of providing input data as well as correct
output data to the machine learning model. The aim of a supervised
learning algorithm is to find a mapping function that maps the input
variable (x) to the output variable (y).
In the real world, supervised learning can be used for risk assessment,
image classification, fraud detection, spam filtering, etc.
In supervised learning the machine experiences the examples along with the
labels or targets for each example. The labels in the data help the algorithm
to correlate the features with the expected output.

Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. The given
data is labeled. Both classification and regression problems are supervised
learning problems.

Example: Consider the following data regarding patients entering a
clinic. The data consists of the gender and age of the patients, and each
patient is labeled as “healthy” or “sick”.
Gender Age Label
M 48 Sick
M 67 Sick
F 53 Healthy
M 49 Sick
F 32 Healthy
M 34 Healthy
M 21 Healthy
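
To make the idea of learning from labelled pairs concrete, here is a minimal sketch of a 1-nearest-neighbour classifier trained on the table above. The notes do not prescribe this algorithm; it is chosen only because it is short. The distance function, the `GENDER_PENALTY` weight and the function names are assumptions made for illustration.

```python
# Labelled training data copied from the table above: (gender, age) -> label.
training_data = [
    (("M", 48), "Sick"),
    (("M", 67), "Sick"),
    (("F", 53), "Healthy"),
    (("M", 49), "Sick"),
    (("F", 32), "Healthy"),
    (("M", 34), "Healthy"),
    (("M", 21), "Healthy"),
]

# Assumed weight: a gender mismatch counts more than any age gap in this data.
GENDER_PENALTY = 100


def distance(a, b):
    """Age difference, plus a fixed penalty when the genders differ."""
    gender_a, age_a = a
    gender_b, age_b = b
    return abs(age_a - age_b) + (GENDER_PENALTY if gender_a != gender_b else 0)


def predict_label(patient, data=training_data):
    """Return the label of the closest labelled example (1-nearest neighbour)."""
    _, label = min(data, key=lambda pair: distance(patient, pair[0]))
    return label


print(predict_label(("M", 50)))  # closest labelled patient is (M, 49)
print(predict_label(("F", 30)))  # closest labelled patient is (F, 32)
```

The labels are what make this supervised: without the third column there would be nothing for `predict_label` to return.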

How Supervised Learning Works?
In supervised learning, models are trained using a labelled dataset, from
which the model learns about each type of data. Once the training process
is completed, the model is tested on test data (a held-out part of the
labelled dataset that was not used for training) and predicts the output
for it.

The working of supervised learning can be easily understood from the
example and diagram below:

Suppose we have a dataset of different types of shapes, which includes
squares, rectangles, triangles and polygons. The first step is to train
the model on each shape:

a) If the given shape has four sides, and all the sides are equal, then it
will be labelled as a square.
b) If the given shape has three sides, then it will be labelled as a triangle.
c) If the given shape has six equal sides, then it will be labelled
as a hexagon.
Now, after training, we test our model using the test set, and the task of the
model is to identify the shape.

The machine is already trained on all types of shapes, so when it finds a
new shape, it classifies the shape on the basis of its number of sides and
predicts the output.

Steps Involved in Supervised Learning:


• First, determine the type of training dataset.
• Collect/gather the labelled training data.
• Split the labelled dataset into a training dataset, a validation
dataset, and a test dataset.
• Determine the input features of the training dataset, which should
carry enough information for the model to predict the output accurately.
• Determine a suitable algorithm for the model, such as support
vector machine, decision tree, etc.
• Execute the algorithm on the training dataset. Sometimes we need a
validation set, held out from the training data, to tune the control
parameters of the algorithm.
• Evaluate the accuracy of the model on the test set. If the model
predicts the correct output for most test examples, the model is accurate.
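
The split and evaluation steps above can be sketched in plain Python. The 60/20/20 proportions, the fixed seed, the toy even/odd dataset and the function names are all assumptions for illustration, not values given in the notes.

```python
import random


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of the labelled examples and cut it into three disjoint parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


def accuracy(predict, examples):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


# Toy labelled data: the number itself is the feature, even/odd is the label.
examples = [(x, "even" if x % 2 == 0 else "odd") for x in range(10)]
train, validation, test = split_dataset(examples)
print(len(train), len(validation), len(test))  # 6 2 2
```

Keeping the test set disjoint from the training set is the point of the split: accuracy measured on examples the model has already seen says little about how it will behave on new data.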
Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used when there is a relationship between the input
variable and the output variable and the output is a continuous value. They
are used for prediction tasks such as weather forecasting, market trends,
etc. Below are some popular regression algorithms which come under
supervised learning:

a) Linear Regression
b) Decision Tree
c) Random Forest
d) Neural Network

a)Linear Regression

Linear regression is the simplest machine learning model, in which we try to
predict one output variable using one or more input variables. The
representation of linear regression is a linear equation that combines a set
of input values (x) to give the predicted output (y) for those input values.
With a single input it is represented in the form of a line:

y = bx + c

where b is the slope of the line and c is the intercept.

The main aim of the linear regression model is to find the line that best fits
the data points.

Linear regression is extended to multiple linear regression (find a plane of best fit) and
polynomial regression (find the best fit curve).
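
As a minimal sketch of finding the best-fit line with a single input, the function below uses the standard least-squares formulas for the slope b and intercept c. The data points are made up so that they lie exactly on y = 2x + 1, and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope b and intercept c for y = b*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    c = mean_y - b * mean_x
    return b, c


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # made-up points lying exactly on y = 2x + 1
b, c = fit_line(xs, ys)
print(b, c)            # slope and intercept of the fitted line
print(b * 6 + c)       # prediction for a new input x = 6
```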

b) Decision Tree

Decision trees are popular machine learning models that can be used for both
regression and classification problems.

A decision tree uses a tree-like structure of decisions along with their possible
consequences and outcomes. Each internal node represents a test on
an attribute, and each branch represents an outcome of the test. A tree
with more nodes can fit the training data more closely, but a very large tree
tends to overfit and may be less accurate on new data.

The advantage of decision trees is that they are intuitive and easy to
implement, but a single tree is often less accurate than ensemble methods
such as random forests.

Decision trees are widely used in operations research, specifically in decision
analysis and strategic planning, as well as in machine learning.
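
To illustrate the node/branch structure, here is a small hand-written tree (not one produced by a learning algorithm) that happens to agree with the clinic patient table given earlier in this unit. The dictionary layout and the names `tree` and `classify_with_tree` are assumptions for illustration.

```python
# Internal nodes are dicts holding a test; leaves are plain label strings.
tree = {
    "test": lambda patient: patient["gender"] == "M",
    "yes": {
        "test": lambda patient: patient["age"] >= 48,
        "yes": "Sick",
        "no": "Healthy",
    },
    "no": "Healthy",
}


def classify_with_tree(node, patient):
    """Follow the branch chosen by each test until a leaf label is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["test"](patient) else node["no"]
    return node


# Rows of the clinic table: (gender, age, label).
patients = [
    ("M", 48, "Sick"), ("M", 67, "Sick"), ("F", 53, "Healthy"),
    ("M", 49, "Sick"), ("F", 32, "Healthy"), ("M", 34, "Healthy"),
    ("M", 21, "Healthy"),
]
for gender, age, label in patients:
    predicted = classify_with_tree(tree, {"gender": gender, "age": age})
    print(gender, age, predicted, predicted == label)
```

A learning algorithm would choose the tests and thresholds automatically from the data; only the traversal shown in `classify_with_tree` stays the same.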

c) Random Forest

Random Forest is an ensemble learning method which consists of a large number of
decision trees. Each decision tree in a random forest predicts an outcome, and the
prediction with the majority of votes is taken as the outcome.

A random forest model can be used for both regression and classification problems.

For the classification task, the outcome of the random forest is taken from the majority
of votes. Whereas in the regression task, the outcome is taken from the mean or average
of the predictions generated by each tree.
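
The two ways of combining tree outputs described above can be sketched as follows. The individual tree predictions are made-up inputs; training the trees themselves is outside this sketch, and the function names are illustrative.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_predictions):
    """Majority vote over the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Average of the numeric predictions made by the individual trees."""
    return mean(tree_predictions)


print(forest_classify(["spam", "ham", "spam"]))  # spam
print(forest_regress([10.0, 12.0, 14.0]))        # 12.0
```

With an even number of trees a vote can tie; `Counter.most_common` then returns the label that was counted first, so a real implementation may want an explicit tie-breaking rule.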

d) Neural Networks

Neural networks are a subset of machine learning and are also known as artificial
neural networks. They are made up of artificial neurons and designed in a
way that resembles the structure and working of the human brain. Each artificial
neuron connects with many other neurons in the network, and millions of such
connected neurons create a sophisticated cognitive structure.

Neural networks have a multilayer structure, containing one input layer, one or
more hidden layers, and one output layer. As each neuron is connected to
neurons in the next layer, it passes data from one layer on to the next. Finally,
the data reaches the last layer, the output layer, which generates the output.
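
The flow of data from the input layer through a hidden layer to the output layer can be sketched as a forward pass. The sigmoid activation, the layer sizes and the weight values are assumptions chosen for illustration; the weights are untrained.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: each neuron computes sigmoid(w . inputs + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Pass the inputs through every layer in order and return the final output."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Illustrative, untrained weights: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                     # output layer
]
print(network_forward([1.0, 1.0], layers))
```

Training would adjust the numbers in `layers` so that the output moves closer to the desired targets; the forward pass itself does not change.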

Neural networks depend on training data to learn and improve their accuracy.
Once a neural network is well trained and accurate, it can classify and
cluster data quickly and becomes a powerful machine learning and AI tool. One of
the best-known neural networks is Google's search algorithm.

2. Classification

Classification models are the second type of Supervised Learning techniques, which
are used to draw conclusions from observed values in categorical form. For
example, a classification model can identify whether an email is spam or not, or
whether a buyer will purchase a product or not. Classification algorithms are
used to predict two or more classes and categorize the output into different groups.

In classification, a classifier model is designed that classifies the dataset into different
categories, and each category is assigned a label.

There are two types of classifications in machine learning:

o Binary classification: If the problem has only two possible classes, the model
is called a binary classifier. For example, cat or dog, Yes or No.
o Multi-class classification: If the problem has more than two possible classes, it
is a multi-class classifier.

Some popular classification algorithms are as below:

a) Logistic Regression

Logistic Regression is used to solve classification problems in machine learning.
It is similar to linear regression but is used to predict categorical variables. It can
predict the output as either Yes or No, 0 or 1, True or False, etc. However, rather than
giving the exact values, it provides probabilistic values between 0 and 1.
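
A minimal sketch of how a probability between 0 and 1 arises: a linear score is passed through the sigmoid function and then compared with a threshold. The `weight`, `bias` and 0.5 threshold are illustrative values, not fitted parameters, and the function names are assumptions.

```python
import math


def predict_probability(x, weight, bias):
    """Probability of the positive class for one input feature x."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def predict_class(x, weight, bias, threshold=0.5):
    """Turn the probability into a Yes/No style decision (1 or 0)."""
    return 1 if predict_probability(x, weight, bias) >= threshold else 0


# Illustrative parameters; a real model would learn them from labelled data.
weight, bias = 1.5, -3.0
for x in (0.0, 2.0, 4.0):
    print(x, predict_probability(x, weight, bias), predict_class(x, weight, bias))
```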

b) Support Vector Machine

Support vector machine, or SVM, is a popular machine learning algorithm which is
widely used for classification and regression tasks, although it is mostly used to
solve classification problems. The main aim of SVM is to find the best decision
boundary in an N-dimensional space that can segregate data points into classes;
this best decision boundary is known as a hyperplane. SVM selects the extreme
points (vectors) that help define the hyperplane, and these vectors are known as
support vectors.
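
The sketch below shows only how a hyperplane, once found, segregates points: the sign of w . x + b gives the class. The hyperplane x1 + x2 - 4 = 0 is assumed by hand for illustration; an actual SVM would learn w and b from the training data, and the function names are not from any library.

```python
import math


def decision_value(point, w, b):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b


def svm_side(point, w, b):
    """Class +1 on or above the hyperplane, class -1 below it."""
    return 1 if decision_value(point, w, b) >= 0 else -1


def distance_to_hyperplane(point, w, b):
    """Geometric distance from the point to the hyperplane w . x + b = 0."""
    return abs(decision_value(point, w, b)) / math.sqrt(sum(wi * wi for wi in w))


# Assumed hyperplane x1 + x2 - 4 = 0 in two dimensions.
w, b = [1.0, 1.0], -4.0
for point in ([3.0, 3.0], [1.0, 1.0]):
    print(point, svm_side(point, w, b), distance_to_hyperplane(point, w, b))
```

The support vectors are the training points with the smallest distance to the hyperplane; SVM training chooses w and b so that this smallest distance is as large as possible.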

c) Naïve Bayes

Naïve Bayes is another popular classification algorithm used in machine learning. It is
so called because it is based on Bayes' theorem and makes the naïve (independence)
assumption between the features. Bayes' theorem is given as:

P(A|B) = P(B|A) P(A) / P(B)

Each naïve Bayes classifier assumes that the value of a specific variable is independent
of any other variable/feature. For example, suppose a fruit needs to be classified based on
color, shape, and taste. A yellow, oval, and sweet fruit will be recognized as a mango. Here
each feature is treated as independent of the other features.
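
The fruit example can be sketched as a small categorical naïve Bayes classifier. The training rows are made up for illustration, add-one (Laplace) smoothing is an added assumption so that unseen feature values do not force a probability of zero, and the function names are illustrative.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for features, label in examples:
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def nb_predict(features, model):
    """Pick the class maximising P(class) * product of P(feature value | class)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(features):
            # Add-one smoothing: every value gets one extra imaginary observation.
            seen = value_counts[(label, i)][value]
            score *= (seen + 1) / (count + len(feature_values[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training set: (color, shape, taste) -> fruit.
examples = [
    (("yellow", "oval", "sweet"), "mango"),
    (("yellow", "oval", "sweet"), "mango"),
    (("green", "oval", "sour"), "mango"),
    (("yellow", "long", "sweet"), "banana"),
    (("yellow", "long", "sweet"), "banana"),
    (("green", "long", "sweet"), "banana"),
]
model = train_naive_bayes(examples)
print(nb_predict(("yellow", "oval", "sweet"), model))  # mango
```

The multiplication of per-feature probabilities inside `nb_predict` is exactly where the independence assumption is used.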

Advantages of Supervised learning:
➢ With the help of supervised learning, the model can predict the output on the
basis of prior experiences.
➢ In supervised learning, we can have an exact idea about the classes of objects.
➢ Supervised learning model helps us to solve various real-world problems such
as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:


➢ Supervised learning models are not suitable for handling very complex tasks.
➢ Supervised learning cannot predict the correct output if the test data is very
different from the training dataset.
➢ Training requires a lot of computation time.
➢ In supervised learning, we need enough knowledge about the classes of objects.

b)Unsupervised Learning:

As the name suggests, unsupervised learning is a machine learning
technique in which models are not supervised using a labelled training
dataset. Instead, the models themselves find the hidden patterns and
insights in the given data. It can be compared to the learning which takes
place in the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are
trained using an unlabeled dataset and are allowed to act on that data
without any supervision.

Unsupervised learning cannot be directly applied to a regression or
classification problem because, unlike supervised learning, we have the
input data but no corresponding output data. The goal of unsupervised
learning is to find the underlying structure of the dataset, group the data
according to similarities, and represent the dataset in a compressed
format.

Example: Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs. The algorithm
is never told which images are cats and which are dogs, so it has no prior
idea about the features of the dataset. The task of the unsupervised learning
algorithm is to identify the image features on its own. It will perform
this task by clustering the image dataset into groups according to the
similarities between images.

When we have unclassified and unlabeled data, the system attempts to
uncover patterns from the data. There is no label or target given for the
examples. One common task, called clustering, is to group similar examples
together.

Unsupervised learning is a type of machine learning algorithm used to
draw inferences from datasets consisting of input data without labeled
responses. In unsupervised learning algorithms, classification or
categorization is not included in the observations. Example: Consider the
following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.

Gender Age
M 48
M 67
F 53
M 49
F 34
M 21

In the case of unsupervised machine learning, the model tries to find
patterns in the unlabelled dataset using only the independent variables
(there are no targets or dependent variables). A common example is
clustering, in which the model groups the data points after learning the
patterns.
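
As a minimal sketch of clustering without labels, the code below runs a plain k-means procedure on the Age column of the table above. Using two clusters and starting from the youngest and oldest patient are arbitrary assumptions for illustration, and `k_means_1d` is an illustrative name.

```python
def k_means_1d(values, centroids, max_iterations=100):
    """Plain k-means on numbers: assign to nearest centroid, then recompute means."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters


ages = [48, 67, 53, 49, 34, 21]  # the Age column from the table above
centroids, clusters = k_means_1d(ages, centroids=[min(ages), max(ages)])
print(centroids)  # [27.5, 54.25]
print(clusters)   # [[34, 21], [48, 67, 53, 49]]
```

No "healthy" or "sick" label is used anywhere: the algorithm only groups patients whose ages are close together, and interpreting the groups is left to the analyst.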

Why use Unsupervised Learning?


Below are some main reasons which describe the importance of
Unsupervised Learning:

➢ Unsupervised learning is helpful for finding useful insights from the
data.
➢ Unsupervised learning is similar to the way a human learns to think from
their own experiences, which is often said to bring it closer to real AI.
➢ Unsupervised learning works on unlabeled and uncategorized data,
which makes it widely applicable.
➢ In the real world, we do not always have input data with the
corresponding output, so unsupervised learning is needed to handle
such cases.

Working of Unsupervised Learning


Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabeled input data, which means it is not
categorized and the corresponding outputs are not given. This
unlabeled input data is fed to the machine learning model in order to train
it. First, the model interprets the raw data to find hidden patterns, and
then a suitable algorithm such as k-means clustering or hierarchical
clustering is applied.

Once the suitable algorithm is applied, it divides the data objects into
groups according to the similarities and differences between the objects.

Unsupervised Machine learning models


Unsupervised machine learning models take the opposite approach to
supervised learning: the model learns from an unlabeled training dataset.
Instead of predicting a known target, the model discovers hidden patterns
in the dataset by itself, without any supervision.

Unsupervised learning models are mainly used to perform three tasks, which
are as follows:

➢ Clustering: Clustering is an unsupervised learning technique that
groups data points into different clusters based on their similarities
and differences. The objects with the most similarities remain in the
same group, and they have few or no similarities with objects in
other groups.

Clustering algorithms are widely used in tasks such as image
segmentation, statistical data analysis, market segmentation, etc.

Some commonly used clustering algorithms are K-means clustering,
hierarchical clustering, DBSCAN, etc.

➢ Association Rule Learning

Association rule learning is an unsupervised learning technique that
finds interesting relations among variables within a large dataset. The
main aim of this learning algorithm is to find the dependency of one
data item on another and map those variables accordingly, for example
so that a business can maximize profit. It is mainly applied in market
basket analysis, web usage mining, continuous production, etc.

Some popular algorithms of association rule learning are the Apriori
algorithm, Eclat, and the FP-growth algorithm.

➢ Dimensionality Reduction

The number of features/variables present in a dataset is known as the
dimensionality of the dataset, and a technique used to reduce the
dimensionality is known as a dimensionality reduction technique.
Although more features can carry more information, too many of them can
hurt the performance of the model/algorithm, for example through
overfitting. In such cases, dimensionality reduction techniques are used.
"It is a process of converting a higher-dimensional dataset into a
lower-dimensional dataset while ensuring that it provides similar
information."
Common dimensionality reduction methods include PCA (Principal
Component Analysis), Singular Value Decomposition, etc.
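
To illustrate dimensionality reduction, the sketch below applies the idea
behind PCA to reduce two features to one. It is only a sketch under stated
assumptions: the function name pca_2d_to_1d and the four data points are
hypothetical, and the closed-form eigenvalue formula used here works only for
the two-feature case (real projects would normally use a numerical library).

```python
import math

def pca_2d_to_1d(points):
    """Project 2-D points onto their direction of greatest variance."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix (closed form).
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    top_eigenvalue = half_trace + root
    # Matching eigenvector, i.e. the first principal component.
    if sxy != 0:
        vx, vy = top_eigenvalue - syy, sxy
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    # One number per point instead of two.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    explained = top_eigenvalue / (sxx + syy)  # share of variance retained
    return direction, projected, explained


# Hypothetical data: the second feature is roughly twice the first.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, projected, explained = pca_2d_to_1d(data)
print(direction)
print(projected)
print(round(explained, 4))
```

Because the two features are almost perfectly correlated in this made-up
data, a single component retains nearly all of the variance, which is the
"similar information" the definition above refers to.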

Unsupervised Learning algorithms:


Below is the list of some popular unsupervised learning algorithms:

• K-means clustering
• DBSCAN
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
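
Association rule algorithms such as Apriori rely on two standard measures,
support and confidence. The sketch below computes them for a toy market
basket example. The baskets and the helper names support and confidence are
hypothetical; this shows only the measures, not the full Apriori search.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in transactions with the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(rule_support)     # share of baskets with both bread and butter
print(rule_confidence)  # share of bread baskets that also contain butter
```

A rule such as "bread implies butter" is usually kept only if both values
exceed thresholds chosen by the analyst.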

Advantages of Unsupervised Learning


➢ Unsupervised learning can be used for more complex tasks than
supervised learning because it does not require labeled input data.
➢ Unsupervised learning is often preferable because unlabeled data is
easier to obtain than labeled data.

Disadvantages of Unsupervised Learning


➢ Unsupervised learning is intrinsically more difficult than supervised
learning as it does not have corresponding output.
➢ The result of the unsupervised learning algorithm might be less
accurate as input data is not labeled, and algorithms do not know the
exact output in advance.
Supervised v/s Unsupervised Machine Learning

c) Reinforcement Learning:
Reinforcement learning refers to goal-oriented algorithms, which learn how
to attain a complex objective (goal) or maximize along a particular
dimension over many steps. This method allows machines and software
agents to automatically determine the ideal behavior within a specific
context in order to maximize their performance. Simple reward feedback,
known as the reinforcement signal, is required for the agent to learn which
action is best. For example, an agent may maximize the points won in a game
over many moves.

Reinforcement learning is the problem of getting an agent to act in the
world so as to maximize its rewards.
Unlike in most forms of machine learning, the learner is not told which
actions to take but must instead discover which actions yield the most
reward by trying them. For example, consider teaching a dog a new trick: we
cannot tell it what to do or what not to do, but we can reward or punish it
when it does the right or wrong thing.

Reinforcement Learning or Reward-Based Learning: This is a reward-based
technique wherein the ultimate aim of the model is to maximize the
rewards based on the actions taken by an agent in an environment. Examples:
gaming, bidding and advertisements, etc.

In reinforcement learning, the algorithm learns actions for a given set of
states that lead to a goal state. It is a feedback-based learning model that
takes feedback signals after each state or action by interacting with the
environment. This feedback works as a reward (positive for each good
action and negative for each bad action), and the agent's goal is to maximize
the total reward to improve its performance.

The behavior of the model in reinforcement learning is similar to human
learning, as humans also learn from experience by interacting with the
environment and receiving feedback.

Below are some popular algorithms that come under reinforcement learning:

➢ Q-learning: Q-learning is one of the popular model-free algorithms
of reinforcement learning, and it is based on the Bellman equation.

It aims to learn the policy that helps the AI agent take the best
action for maximizing the reward in a given situation. It maintains a
Q-value for each state-action pair, which estimates the cumulative
reward expected from taking that action in that state, and it tries to
maximize the Q-value.

➢ State-Action-Reward-State-Action (SARSA): SARSA is an on-policy
algorithm based on the Markov decision process. It uses the
action performed by the current policy to learn the Q-value. The
name SARSA stands for State Action Reward State Action,
which symbolizes the tuple (s, a, r, s', a').
➢ Deep Q Network: DQN, or Deep Q Network, is Q-learning combined
with a neural network. It is typically employed in environments with a
large state space, where defining a Q-table would be impractical. In
such a case, rather than using a Q-table, the neural network
approximates the Q-values for each action based on the state.
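
The Q-learning update described above can be sketched on a tiny problem. The
environment below (a corridor of five cells with a reward at the right end),
the learning rate, the discount factor, and all names are assumptions chosen
for illustration; this is a minimal sketch, not a reference implementation.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal (terminal)
ACTIONS = (-1, +1)    # move left or move right
ALPHA, GAMMA = 0.5, 0.9   # assumed learning rate and discount factor

def step(state, action):
    """Hypothetical environment: reward 1 only when the goal is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # purely random exploration
            next_state, reward = step(state, action)
            # Q-learning target uses the best next action (off-policy).
            # SARSA would instead use the action the policy takes next.
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state = next_state
    return q


q_table = train()
# Greedy policy: in each non-terminal cell pick the action with the top Q-value.
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the greedy policy moves right in every cell, because the
Q-values for moving right are larger than those for moving left. A Deep Q
Network would replace the q dictionary with a neural network that outputs
the Q-values.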

d) Semi-supervised learning:
Here an incomplete training signal is given: a training set with some
(often many) of the target outputs missing. A special case of this
principle, known as transduction, arises when the entire set of problem
instances is known at learning time but part of the targets are missing.
Semi-supervised learning is an approach to machine learning that combines
a small amount of labeled data with a large amount of unlabeled data
during training. It falls between unsupervised learning and supervised
learning.

Training Machine Learning Models


Once a machine learning model is built, it is trained in order to get
appropriate results. Training a machine learning model requires a large
amount of pre-processed data. Here, pre-processed data means data in a
structured form with few null values, etc. If we do not provide
pre-processed data, there is a high chance that the model will perform
poorly.
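
One simple pre-processing step is filling in null values. The sketch below
replaces missing entries with the column mean. Mean imputation is only one of
several possible strategies, and the function name impute_mean and the sample
column are assumptions made for illustration.

```python
def impute_mean(column):
    """Replace None entries with the mean of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]


# Hypothetical age column with two missing entries.
ages = [48, None, 53, 49, None, 21]
clean_ages = impute_mean(ages)
print(clean_ages)
```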

How to choose the best model?


Above we have discussed different machine learning models and
algorithms. A question that confuses many beginners is "which model
should I choose?". The answer depends mainly on the business or project
requirement. Apart from this, it also depends on the associated attributes,
the volume of the available dataset, the number of features, complexity,
etc. In practice, it is recommended to start with the simplest model that
can be applied to the particular problem and then gradually increase the
complexity, testing the accuracy with the help of parameter tuning and
cross-validation.
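
Cross-validation can be sketched as follows. The example scores the simplest
possible model (always predict the mean of the training targets) with k-fold
cross-validation. The function names, the contiguous unshuffled folds, and
the data are assumptions for illustration; in practice the data is usually
shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def mean_model_cv_error(ys, k):
    """Cross-validated mean squared error of a predict-the-mean baseline."""
    errors = []
    for train, test in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train) / len(train)  # "training"
        errors.extend((ys[i] - prediction) ** 2 for i in test)  # "testing"
    return sum(errors) / len(errors)


ys = [3, 5, 7, 9, 11, 13]   # hypothetical target values
folds = list(k_fold_indices(len(ys), 3))
baseline_error = mean_model_cv_error(ys, 3)
print(folds[0])
print(baseline_error)
```

A more complex model is worth adopting only if its cross-validated error is
clearly lower than that of such a baseline.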

Difference between Machine Learning Model and Algorithms

One of the most confusing questions among beginners is whether machine
learning models and algorithms are the same, because in machine learning
and data science these two terms are often used interchangeably.

The answer is no: a machine learning model is not the same as an
algorithm. Put simply, an ML algorithm is a procedure or method that runs
on data to discover patterns in it and generate the model, whereas a
machine learning model is like a computer program that generates output or
makes predictions. More specifically, when we train an algorithm with
data, the result is a model.

Machine Learning Model = Model Data + Prediction Algorithm
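
The relation above can be illustrated with a small sketch. Here the
algorithm is ordinary least squares for a straight line, the "model data" is
the learned slope and intercept, and the prediction procedure uses them. The
function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Algorithm: ordinary least squares for one input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return {"slope": slope, "intercept": intercept}  # the learned model data

def predict(model, x):
    """Prediction procedure that uses the learned parameters."""
    return model["slope"] * x + model["intercept"]


hours = [1, 2, 3, 4]
scores = [3, 5, 7, 9]   # hypothetical data following score = 2 * hours + 1
model = fit_line(hours, scores)
print(model)
print(predict(model, 5))
```

Running the same algorithm on different data would produce a different
model, which is why the two terms are not interchangeable.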

Are a journey and a tour the same?

The differences between Supervised and Unsupervised learning are given
below:

| Supervised Learning | Unsupervised Learning |
|---|---|
| Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data. |
| Supervised learning model takes direct feedback to check whether it is predicting the correct output. | Unsupervised learning model does not take any feedback. |
| Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data. |
| In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model. |
| The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset. |
| Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model. |
| Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be categorized into Clustering and Association problems. |
| Supervised learning can be used for those cases where we know the inputs as well as the corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data. |
| Supervised learning model generally produces a more accurate result. | Unsupervised learning model may give a less accurate result as compared to supervised learning. |
| Supervised learning is not close to true Artificial Intelligence, as we first train the model on the data and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns in a way similar to how a child learns daily routine things from experience. |
| It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as K-means clustering, hierarchical clustering, and the Apriori algorithm. |

Machine Learning Applications and Examples
(prepare any six applications for the exam)
• Social Media Features : Social media platforms use machine learning
algorithms and approaches to create attractive features. For instance,
Facebook notices and records your activities, chats, likes, and
comments, and the time you spend on specific kinds of posts. Machine
learning learns from this activity and suggests friends and pages for
your profile.

• Product Recommendations : Product recommendation is one of the
most popular and best-known applications of machine learning. It is a
prominent feature of almost every e-commerce website today. Using
machine learning and AI, websites track your behavior based on your
previous purchases, searching patterns, and cart history, and then
make product recommendations.

• Image Recognition : Image recognition, which is an approach for
cataloging and detecting a feature or an object in a digital image, is
one of the most significant and notable machine learning and AI
techniques. This technique is being adopted for further analysis, such
as pattern recognition, face detection, and face recognition.

• Sentiment Analysis : Sentiment analysis is one of the most important
applications of machine learning. It is a real-time machine learning
application that determines the emotion or opinion of the speaker or
the writer. For instance, if someone has written a review or email (or
any form of document), a sentiment analyzer will quickly estimate the
underlying thought and tone of the text. Sentiment analysis can be
used to analyze review-based websites, decision-making applications,
etc.

• Automating Employee Access Control : Organizations are actively
implementing machine learning algorithms to determine the level of
access employees would need in various areas, depending on their job
profiles. This is one of the coolest applications of machine learning.

• Marine Wildlife Preservation : Machine learning algorithms are
used to develop behavior models for endangered cetaceans and other
marine species, helping scientists regulate and monitor their
populations.

• Regulating Healthcare Efficiency and Medical Services :
Significant healthcare sectors are actively looking at using machine
learning algorithms to manage their services better. The models
predict the waiting times of patients in emergency waiting rooms
across various hospital departments, using factors such as details of
staff at various times of day, records of patients, complete logs of
department chats, and the layout of emergency rooms. Machine learning
algorithms also come into play in disease detection, therapy planning,
and prediction of the disease situation. This is one of the most
important machine learning applications.

• Banking Domain : Banks are now using the latest technology that
machine learning has to offer to help prevent fraud and protect
accounts from hackers. The algorithms determine what factors to
consider in creating a filter that keeps harm at bay. Sites that are
not authentic are automatically filtered out and restricted from
initiating transactions.

• Language Translation : One of the most common machine learning
applications is language translation. Machine learning plays a
significant role in the translation of one language to another. We are
amazed at how websites can translate from one language to another
effortlessly and give contextual meaning as well. The technology
behind the translation tool is called 'machine translation.' It has
enabled people to interact with others from all around the world;
without it, life would not be as easy as it is now. It has given
travelers and business associates the confidence to venture into
foreign lands with the conviction that language will no longer be a
barrier.

• Web Search Engine: One of the reasons why search engines like
Google, Bing, etc. work so well is that the system has learned how to
rank pages through a complex learning algorithm.
• Photo tagging Applications: Be it Facebook or any other photo
tagging application, the ability to tag friends makes it more
engaging. This is possible because of a face recognition algorithm
that runs behind the application.
• Spam Detector: Mail services such as Gmail or Hotmail do a lot of
hard work for us in classifying mail and moving spam to the spam
folder. This is achieved by a spam classifier running in the back end
of the mail application.
• Augmentation: Machine learning that assists humans with their
day-to-day tasks, personally or commercially, without having
complete control of the output. Such machine learning is used in
different ways, such as virtual assistants, data analysis, and
software solutions. Its primary use is to reduce errors due to human
bias.
• Automation: Machine learning that works entirely autonomously in
any field without the need for any human intervention. For example,
robots performing the essential process steps in manufacturing
plants.
• Finance Industry: Machine learning is growing in popularity in the
finance industry. Banks mainly use ML to find patterns inside the
data, but also to prevent fraud.
• Government organization: Governments make use of ML to manage
public safety and utilities. Take the example of China, with its
large-scale face recognition. The government uses artificial
intelligence to deter jaywalking.
• Marketing: AI is used broadly in marketing thanks to abundant
access to data. Before the age of mass data, researchers developed
advanced mathematical tools like Bayesian analysis to estimate the
value of a customer. With the boom of data, marketing departments
rely on AI to optimize customer relationships and marketing
campaigns.
• Computer vision: Machine learning algorithms can be used to
recognize objects, people, and other elements in images and videos.
• Natural language processing: Machine learning algorithms can be
used to understand and generate human language, including tasks
such as translation and text classification.
• Recommendation systems: Machine learning algorithms can be
used to recommend products or content to users based on their past
behavior and preferences.
• Fraud detection: Machine learning algorithms can be used to
identify fraudulent activity in areas such as credit card transactions
and insurance claims.

Examples of Unsupervised Learning Applications

Unsupervised learning enables systems to identify patterns within datasets
that are unlabeled or unclassified. There are numerous applications of
unsupervised learning; common examples include recommendation systems,
product segmentation, data set labeling, customer segmentation, and
similarity detection.

Examples of Reinforcement Learning Applications

Reinforcement learning is also frequently used in different types of machine
learning applications. Common examples include industry automation,
self-driving car technology, applications that use Natural Language
Processing, robotics manipulation, and more. Reinforcement learning is used
in AI in a wide range of industries, including finance, healthcare,
engineering, and gaming.

Some examples of machine learning are:


Today, companies are using machine learning to improve business
decisions, increase productivity, detect disease, forecast weather, and do
many more things. With the exponential growth of technology, we not only
need better tools to understand the data we currently have, but we also need
to prepare ourselves for the data we will have. To achieve this goal we need
to build intelligent machines. We can write a program to do simple things,
but most of the time hardwiring intelligence into it is difficult. The best
approach is to give machines a way to learn things themselves: if a machine
can learn from input, then it does the hard work for us. This is where
machine learning comes into action.
• Database Mining for growth of automation: Typical applications
include web-click data for better UX (User eXperience), medical
records for better automation in healthcare, biological data, and many
more.
• Applications that cannot be programmed: There are some tasks that
cannot be programmed by hand, as the computers we use are not modelled
that way. Examples include autonomous driving, recognition tasks from
unordered data (face recognition, handwriting recognition), natural
language processing, computer vision, etc.
• Understanding Human Learning: This is the closest we have come to
understanding and mimicking the human brain. It is the start of a new
revolution, the real AI.

Now, after this brief insight, let us come to a more formal definition of
machine learning.

Prerequisites
Before learning machine learning, you should have basic knowledge of the
following so that you can easily understand the concepts of machine
learning:

o Fundamental knowledge of probability and linear algebra.
o The ability to code in any computer language, especially Python.
o Knowledge of calculus, especially derivatives of single-variable and
multivariate functions.

History of Machine Learning


Some 40-50 years ago, machine learning was science fiction, but today it
is part of our daily life. Machine learning makes our day-to-day life
easier, from self-driving cars to Amazon's virtual assistant "Alexa".
However, the idea behind machine learning is old and has a long history.
Some milestones in the history of machine learning are given below:

The early history of Machine Learning (Pre-1940):

o 1834: In 1834, Charles Babbage, the father of the computer,
conceived a device that could be programmed with punch cards.
The machine was never built, but all modern computers rely on its
logical structure.
o 1936: In 1936, Alan Turing gave a theory of how a machine can
determine and execute a set of instructions.

The era of stored program computers:

o 1945: In 1945, "ENIAC", the first electronic general-purpose
computer, was completed; it had to be programmed manually.
After that, stored-program computers such as EDSAC in 1949 and
EDVAC in 1951 were built.
o 1943: In 1943, a human neural network was modeled with an
electrical circuit. In 1950, scientists started putting this idea to
work and analyzed how human neurons might work.

Computer machinery and intelligence:

o 1950: In 1950, Alan Turing published a seminal paper, "Computing
Machinery and Intelligence," on the topic of artificial
intelligence. In his paper, he asked, "Can machines think?"

Machine intelligence in Games:

o 1952: Arthur Samuel, a pioneer of machine learning, created a
program that helped an IBM computer play checkers. The more it
played, the better it performed.
o 1959: In 1959, the term "Machine Learning" was first coined
by Arthur Samuel.

The first "AI" winter:

o The period from 1974 to 1980 was a tough time for AI and ML
researchers, and this period is known as the AI winter.
o During this period, machine translation failed to deliver, and
people lost interest in AI, which led to reduced government funding
for research.

Machine Learning from theory to reality

o 1959: In 1959, the first neural network was applied to a real-world
problem: removing echoes over phone lines using an adaptive filter.
o 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented the
neural network NETtalk, which was able to teach itself how to
correctly pronounce 20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match against the
reigning world champion, Garry Kasparov, and became the first
computer to beat a human world chess champion in a match.

Machine Learning in the 21st century

o 2006: In 2006, computer scientist Geoffrey Hinton gave neural net
research a new name, "deep learning," and it has since become one of
the most trending technologies.
o 2012: In 2012, Google created a deep neural network which learned
to recognize images of humans and cats in YouTube videos.
o 2014: In 2014, the chatbot "Eugene Goostman" was reported to have
passed the Turing Test by convincing 33% of the human judges that it
was not a machine.
o 2014: DeepFace was a deep neural network created by Facebook, which
claimed that it could recognize a person with the same precision
as a human.
o 2016: AlphaGo beat the world's number two player, Lee Sedol, at the
game of Go. In 2017 it beat the number one player of this game,
Ke Jie.
o 2017: In 2017, Alphabet's Jigsaw team built an intelligent system
that was able to learn about online trolling. It read millions of
comments from different websites in order to learn to stop online
trolling.

Machine Learning at present:

Machine learning research has advanced greatly, and it is now present
everywhere around us, in self-driving cars, Amazon Alexa, chatbots,
recommender systems, and many more. It includes supervised, unsupervised,
and reinforcement learning, with clustering, classification, decision
tree, SVM algorithms, etc.

Modern machine learning models can be used for making various
predictions, including weather prediction, disease prediction, stock
market analysis, etc.

Key Elements of Machine Learning
There are tens of thousands of machine learning algorithms and hundreds of
new algorithms are developed every year. Every machine learning algorithm
has three components:

• Representation

• Evaluation

• Optimization

What is Representation Learning?


Representation learning is a class of machine learning approaches that
allow a system to discover the representations required for feature detection
or classification from raw data. The requirement for manual feature
engineering is reduced by allowing a machine to learn the features and apply
them to a given activity.

Representation is basically the space of allowed models (the hypothesis
space), but also takes into account the fact that we are expressing models in
some formal language that may encode some models more easily than others
(even within that possible set). This is akin to the landscape of possible
models, the playing field allowed by a given representation. For example,
3-layer feedforward neural networks (or computational graphs) form one type
of representation, while support vector machines with RBF kernels form
another.

Representation: how to represent knowledge. Examples include decision
trees, sets of rules, instances, graphical models, neural networks, support
vector machines, model ensembles and others.

In representation learning, data is sent into the machine, and it learns the
representation on its own. It is a way of determining a data representation
of the features, the distance function, and the similarity function that
determines how the predictive model will perform. Representation learning
works by reducing high-dimensional data to low-dimensional data, making
it easier to discover patterns and anomalies while also providing a better
understanding of the data’s overall behaviour.

Machine learning tasks such as classification frequently demand
input that is mathematically and computationally convenient to process,
which motivates representation learning. Real-world data, such as photos,
video, and sensor data, has resisted attempts to define certain qualities
algorithmically. One approach is to examine the data for such traits or
representations rather than depending on explicit techniques.

Need of Representation Learning


Assume you’re developing a machine-learning algorithm to predict dog
breeds based on pictures. Because image data provides all of the answers,
the engineer must rely heavily on it when developing the algorithm. Each
observation or feature in the data describes the qualities of the dogs. The
machine learning system that predicts the outcome must comprehend how
each attribute interacts with other outcomes such as Pug, Golden Retriever,
and so on.

Importance of Representation
Representation learning is a very important aspect of machine
learning which automatically discovers the feature patterns in the data.
When the machine is provided with the data, it learns the representation
itself without any human intervention. The goal of representation learning
is to train machine learning algorithms to learn useful representations, such
as those that are interpretable, incorporate latent features, or can be used for
transfer learning.

EVALUATION
Evaluation: the way to evaluate candidate programs
(hypotheses). Examples include accuracy, precision and recall, squared
error, likelihood, posterior probability, cost, margin, entropy, K-L
divergence and others.

Evaluation is essentially how you judge or prefer one model vs. another; it’s
what you might have seen as a utility function, loss function, scoring
function, or fitness function in other contexts. Think of this as the height of
the landscape for each given model, with lower areas being more
preferable/desirable than higher areas (without loss of generality). Mean
squared error (of a model’s output vs. the data output) or likelihood (the
probability of the observed data given the model) are examples of
different evaluation functions that will imply somewhat different heights at
each point on a single landscape.
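
Some of the evaluation functions named above can be sketched in a few lines.
The helper names and the sample predictions are hypothetical; the formulas
are the standard definitions of mean squared error, precision, and recall.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Regression-style evaluation on made-up numbers.
mse = mean_squared_error([3, 5, 7], [2, 5, 9])

# Classification-style evaluation on made-up labels (1 = positive class).
labels_true = [1, 0, 1, 1, 0, 0]
labels_pred = [1, 1, 1, 0, 0, 0]
precision, recall = precision_recall(labels_true, labels_pred)
print(mse, precision, recall)
```

Two models can rank differently under different evaluation functions, which
is why the choice of evaluation is treated as a separate component.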

OPTIMIZATION
Optimization: the way candidate programs are generated, known as the
search process. Examples include combinatorial optimization, convex
optimization, and constrained optimization. All machine learning algorithms
are combinations of these three components, which together provide a
framework for understanding all algorithms.

Optimization, finally, is how you search the space of represented models to
obtain better evaluations. This is the way you expect to traverse the
landscape to find the promised land of ideal models; the strategy of getting
to where you want to go. Stochastic gradient descent and genetic algorithms
are two (very) different ways of optimizing a model class. Note that once
you provide a trained model, it’s quite possible that you may no longer be
able to recover exactly how it was optimized. Just because I can see you
standing in the pit doesn’t mean I know how you got there.
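
As one concrete way of searching the landscape, the sketch below uses plain
(full-batch) gradient descent to fit a one-parameter model y = w * x by
minimizing mean squared error. The model form, learning rate, step count, and
data are assumptions chosen for illustration; stochastic gradient descent
would use a random subset of the data at each step instead.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=200):
    """Minimize mean squared error of y = w * x over the single weight w."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Derivative of (1/n) * sum((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad  # step downhill on the error landscape
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # hypothetical data generated by y = 2 * x
w = gradient_descent(xs, ys)
print(w)
```

Here the representation is the family of lines through the origin, the
evaluation is mean squared error, and the optimization is gradient descent,
so the three components appear together in one small example.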

What Is Function Approximation?


Function approximation is a technique for estimating an unknown
underlying function using historical or available observations from the
domain.

Artificial neural networks learn to approximate a function.

In supervised learning, a dataset is comprised of inputs and outputs, and
the supervised learning algorithm learns how to best map examples of inputs
to examples of outputs.

We can think of this mapping as being governed by a mathematical function,
called the mapping function, and it is this function that a supervised
learning algorithm seeks to best approximate.

Neural networks are an example of a supervised learning algorithm and
seek to approximate the function represented by your data. This is achieved
by calculating the error between the predicted outputs and the expected
outputs and minimizing this error during the training process.

What is Approximation?

We come across approximation very often. For example, the irrational
number π can be approximated by the number 3.14. A more accurate value
is 3.141593, which remains an approximation. You can similarly
approximate the values of all irrational numbers like sqrt(3), sqrt(7), etc.

Approximation is used whenever a numerical value, a model, a structure or
a function is either unknown or difficult to compute. Here we focus on
function approximation and describe its application to machine learning
problems.

There are two different cases:

1. The function is known but it is difficult or numerically expensive to compute
its exact value. In this case approximation methods are used to find values
that are close to the function’s actual values.
2. The function itself is unknown, and hence a model or learning algorithm is
used to find a function that can produce outputs close to the
unknown function’s outputs.

1)Approximation When Form of Function is Known


If the form of a function is known, then a well known method in calculus
and mathematics is approximation via Taylor series. The Taylor series of a
function is an infinite sum of terms computed from the function’s
derivatives at a single point; keeping only the first few terms gives an
approximation.
Another well known method for approximation in calculus and mathematics
is Newton’s method. It can be used to approximate the roots of polynomials
and other differentiable functions, making it a useful technique for
approximating quantities such as the square root of different values or the
reciprocal of different numbers, etc.
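
Both methods can be sketched in a few lines of Python. The function names, the number of series terms, and the iteration count below are illustrative assumptions, not part of the notes.

```python
import math

def exp_taylor(x, terms=10):
    """Approximate e**x using the first `terms` terms of its Taylor series at 0."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term          # here term equals x**n / n!
        term *= x / (n + 1)    # build the next term from the current one
    return total

def sqrt_newton(a, iterations=20):
    """Approximate sqrt(a) as the positive root of f(x) = x**2 - a."""
    x = a if a > 1 else 1.0    # any positive starting guess works
    for _ in range(iterations):
        x = x - (x * x - a) / (2 * x)   # Newton step: x - f(x) / f'(x)
    return x

print(exp_taylor(1.0), "vs", math.e)
print(sqrt_newton(3.0), "vs", math.sqrt(3.0))
```

Increasing `terms` or `iterations` trades more computation for a closer approximation.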

2)Approximation When Form of Function is Unknown


In data science and machine learning, it is assumed that there is an
underlying function that holds the key to the relationship between the inputs
and outputs. The form of this function is unknown. There are several
machine learning problems that employ approximation.

Data sets in Machine Learning
How to get datasets for Machine Learning?
The key to success in the field of machine learning, or to becoming a great data
scientist, is to practice with different types of datasets. But discovering a
suitable dataset for each kind of machine learning project is a difficult task.
So, in this topic, we will provide details of the sources from which you
can easily get a dataset suited to your project.

Before looking at the sources of machine learning datasets, let's discuss
datasets.

What is a dataset?

A dataset is a collection of data in which the data is arranged in some order. A
dataset can contain anything from a simple array to a database table.
The table below shows an example of a dataset:

Country    Age    Salary    Purchased
India      38     48000     No
France     43     45000     Yes
Germany    30     54000     No
France     48     65000     No
Germany    40               Yes
India      35     58000     Yes

A tabular dataset can be understood as a database table or matrix, where
each column corresponds to a particular variable, and each row
corresponds to one record of the dataset. The most widely supported file type for a
tabular dataset is the comma-separated values (CSV) file.
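
To make the row and column picture concrete, the sketch below writes the example table as CSV text and reads it back with Python's standard csv module. It assumes that the incomplete row (Germany, 40, Yes) is missing its Salary value, which is how the flattened table reads.

```python
import csv
import io

# The example table as CSV text: one header line, then one record per line.
# The empty field in the fifth record is the (assumed) missing Salary.
csv_text = """Country,Age,Salary,Purchased
India,38,48000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

print("columns (variables):", list(rows[0].keys()))
print("number of rows (records):", len(rows))
print("first record:", rows[0])
```

Every value arrives as text here; converting Age and Salary to numbers is left to the program that uses the data.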

Types of data in datasets

➢ Numerical data: Such as house price, temperature, etc.
➢ Categorical data: Such as Yes/No, True/False, Blue/Green, etc.
➢ Ordinal data: These data are similar to categorical data, but the categories
have a natural order, so values can be compared (for example, Low < Medium < High).
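
One possible way to represent the three kinds of data in pandas is sketched below. The column names and values are invented for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [250000.0, 410000.0, 180000.0],      # numerical
    "purchased": ["Yes", "No", "Yes"],            # categorical (no order)
    "size_class": ["Medium", "High", "Low"],      # ordinal (ordered categories)
})

# Plain categorical column: categories carry no order.
df["purchased"] = df["purchased"].astype("category")

# Ordinal column: the order of the categories is stated explicitly.
df["size_class"] = pd.Categorical(
    df["size_class"], categories=["Low", "Medium", "High"], ordered=True
)

print(df.dtypes)
# Comparisons such as min and max are meaningful only for the ordinal column.
print(df["size_class"].min(), df["size_class"].max())
```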

Note: Real-world datasets are often huge, which makes them difficult to manage and
process at the initial level. Therefore, to practice machine learning
algorithms, we can use any dummy dataset.

Need of Dataset

To work with machine learning projects, we need a huge amount of data,
because, without the data, one cannot train ML/AI models. Collecting and
preparing the dataset is one of the most crucial parts of creating an
ML/AI project.

The technology behind any ML project cannot work properly if the
dataset is not well prepared and pre-processed.

During the development of an ML project, the developers rely completely
on the datasets. In building ML applications, datasets are divided into two
parts:

➢ Training dataset
➢ Test dataset

Note: These datasets can be large, so a fast internet connection helps when
downloading them.

Popular sources for Machine Learning datasets (for understanding and
considered as example)

Below is a list of dataset sources that are freely available for the public
to work with:

1. Kaggle Datasets

Kaggle is one of the best sources of datasets for data scientists
and machine learning practitioners. It allows users to find, download, and publish
datasets in an easy way. It also provides the opportunity to work with other
machine learning engineers and solve difficult data science tasks.
Kaggle provides high-quality datasets in different formats that we can
easily find and download.

2. UCI Machine Learning Repository

The UCI Machine Learning Repository is one of the great sources of machine
learning datasets. This repository contains databases, domain theories, and
data generators that are widely used by the machine learning community for
the analysis of ML algorithms.

Since 1987, it has been widely used by students, professors, and
researchers as a primary source of machine learning datasets.

It classifies the datasets as per the problems and tasks of machine learning
such as Regression, Classification, Clustering, etc. It also contains some
of the popular datasets such as the Iris dataset, Car Evaluation dataset,
Poker Hand dataset, etc.

3. Datasets via AWS

We can search, download, access, and share the datasets that are publicly
available via AWS resources. These datasets can be accessed through AWS
resources but are provided and maintained by different government
organizations, researchers, businesses, or individuals.

Anyone can analyze and build various services using shared data via AWS
resources. Sharing datasets on the cloud helps users spend more time on
data analysis rather than on data acquisition.

This source provides various types of datasets with examples and ways
to use each dataset. It also provides a search box with which we can search
for the required dataset. Anyone can add any dataset or example to
the Registry of Open Data on AWS.

4. Google's Dataset Search Engine

Google Dataset Search is a search engine launched
by Google on September 5, 2018. This source helps researchers to find
online datasets that are freely available for use.

The link for the Google dataset search engine
is https://toolbox.google.com/datasetsearch.

5. Microsoft Datasets

Microsoft has launched the "Microsoft Research Open
Data" repository with a collection of free datasets in various areas such
as natural language processing, computer vision, and domain-specific
sciences. Using this resource, we can download the datasets to use on the
current device, or we can use them directly on cloud infrastructure.

6. Awesome Public Dataset Collection

The Awesome Public Dataset collection provides high-quality datasets that are
arranged in a well-organized list according to topics such
as Agriculture, Biology, Climate, Complex networks, etc. Most of the
datasets are available free, but some may not be, so it is better to check the
license before downloading a dataset.

7. Government Datasets

There are different sources of government-related data. Various
countries publish government data, collected from
different departments, for public use.

The goal of providing these datasets is to increase the transparency of
government work among the people and to encourage innovative uses of the
data. Below are some government dataset sources:

o Indian Government dataset
o US Government Dataset
o Northern Ireland Public Sector Datasets
o European Union Open Data Portal

8. Computer Vision Datasets

VisualData provides a large number of great datasets that are specific
to computer vision tasks such as image classification, video classification,
image segmentation, etc. Therefore, if you want to build a project on deep
learning or image processing, then you can refer to this source.

The link for downloading datasets from this source
is https://www.visualdata.io/.

9. Scikit-learn dataset

Scikit-learn is a great source for machine learning enthusiasts. This source
provides both toy and real-world datasets. These datasets can be obtained
from the sklearn.datasets package using its general dataset API.
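
As a small example of that API, the sketch below loads the Iris toy dataset that ships with scikit-learn. It assumes scikit-learn is installed; the variable names are arbitrary.

```python
from sklearn.datasets import load_iris

# load_iris() returns a container with the feature matrix, the labels, and metadata.
iris = load_iris()
X, y = iris.data, iris.target

print("feature matrix shape:", X.shape)   # (number of samples, number of features)
print("label vector shape:", y.shape)
print("feature names:", iris.feature_names)
print("class names:", list(iris.target_names))
```

Other toy datasets in the same package follow the same load_* pattern.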

What is Training Data?
Training data is the initial dataset used to train machine learning algorithms.
Models create and refine their rules using this data. It's a set of data samples
used to fit the parameters of a machine learning model, training it by
example.
Training data is also known as the training dataset, learning set, or training set.
It's an essential component of every machine learning model and helps them
make accurate predictions or perform a desired task.

Simply put, training data builds the machine learning model. It teaches what
the expected output looks like. The model analyzes the dataset repeatedly to
deeply understand its characteristics and adjust itself for better performance.

Need of Training Data

Machine learning models are as good as the data they're trained on. Without
high-quality training data, even the most efficient machine
learning algorithms will fail to perform.

The need for quality, accurate, complete, and relevant data starts early on in
the training process. Only if the algorithm is fed with good training data can
it easily pick up the features and find relationships that it needs to predict
down the line.

Put simply, quality training data matters more to
machine learning (and artificial intelligence) than any other aspect. If you
introduce the machine learning (ML) algorithms to the right data, you are
setting them up for accuracy and success.

In a broader sense, training data can be classified into two
categories: labeled data and unlabeled data.

What is labeled data?


Labeled data is a group of data samples tagged with one or more
meaningful labels. It's also called annotated data, and its labels identify
specific characteristics, properties, classifications, or contained objects.

For example, the images of fruits can be tagged as apples,
bananas, or grapes.

Labeled training data is used in supervised learning. It enables ML models
to learn the characteristics associated with specific labels, which can be used
to classify newer data points. In the example above, this means that a model
can use labeled image data to understand the features of specific fruits and
use this information to group new images.

Data labeling or annotation is a time-consuming process as humans need to
tag or label the data points. Labeled data collection is challenging and
expensive. It isn't easy to store labeled data when compared to unlabeled
data.

What is unlabeled data?


As expected, unlabeled data is the opposite of labeled data. It's raw data or
data that's not tagged with any labels for identifying classifications,
characteristics, or properties. It's used in unsupervised machine learning,
and the ML models have to find patterns or similarities in the data to reach
conclusions.

Going back to the previous example of apples, bananas, and grapes, in
unlabeled training data, the images of those fruits won't be labeled. The
model will have to evaluate each image by looking at its characteristics,
such as color and shape.

After analyzing a considerable number of images, the model will be able to
differentiate new images (new data) into the fruit types of apples, bananas,
or grapes. Of course, the model wouldn't know that the particular fruit is
called an apple. Instead, it knows the characteristics needed to identify it.

There are hybrid models that use a combination of supervised and
unsupervised machine learning.

How is training data used in machine learning?


Unlike machine learning algorithms, traditional programming algorithms
follow a set of instructions to accept input data and provide output. They
don't rely on historical data, and every action they make is rule-based. This
also means that they don't improve over time, which isn't the case with
machine learning.
For machine learning models, historical data is fodder. Just as humans rely
on past experiences to make better decisions, ML models look at their
training dataset with past observations to make predictions.

Predictions could include classifying images as in the case of image
recognition, or understanding the context of a sentence as in natural
language processing (NLP).

Think of a data scientist as a teacher, the machine learning algorithm as the
student, and the training dataset as the collection of all textbooks.

The teacher’s aspiration is that the student must perform well in exams and
also in the real world. In the case of ML algorithms, testing is like exams.
The textbooks (training dataset) contain several examples of the type of
questions that’ll be asked in the exam.

Of course, it won’t contain all the examples of questions that’ll be asked in the exam,
nor will all the examples included in the textbook be asked in the exam. The
textbooks can help prepare the student by teaching them what to expect and how to
respond.

No textbook can ever be fully complete. As time passes, the kind of
questions asked will change, and so, the information included in the
textbooks needs to be changed. In the case of ML algorithms, the training
set should be periodically updated to include new information.
In short, training data is a textbook that helps data scientists give ML
algorithms an idea of what to expect. Although the training dataset doesn't
contain all possible examples, it’ll make algorithms capable of making
predictions.

Training Data vs. Test Data vs. Validation Data


Training data is used in model training, or in other words, it's the data used
to fit the model. In contrast, test data is used to evaluate the
performance or accuracy of the model. It's a sample of data used to make an
unbiased evaluation of the final model fit on the training data.

A training dataset is an initial dataset that teaches the ML models to identify
desired patterns or perform a particular task. A testing dataset is used to
evaluate how effective the training was or how accurate the model is.

If an ML algorithm is trained on a particular dataset and then tested on
the same dataset, it's more likely to have high accuracy because the model
knows what to expect. If the training dataset contains all possible values the
model might encounter in the future, all well and good.

But that's never the case. A training dataset can never be comprehensive and
can't teach everything that a model might encounter in the real world.
Therefore a test dataset, containing unseen data points, is used to evaluate
the model's accuracy.

Then there's validation data. This is a dataset used for frequent evaluation
during the training phase. Although the model sees this dataset occasionally,
it doesn't learn from it. The validation set is also referred to as the
development set or dev set. It helps protect models from overfitting and
underfitting.

Although validation data is separate from training data, data scientists might
reserve a part of the training data for validation. But of course, this
automatically means that the validation data was kept away during the
training.

Tip: If you've got a limited amount of data, a technique called cross-validation
can be used to estimate the model's performance. This method
involves randomly partitioning the training data into multiple subsets and,
in turn, reserving each one for evaluation while training on the rest.

Many use the terms "test data" and "validation data" interchangeably. The
main difference between the two is that validation data is used to validate
the model during the training, while the testing set is used to test the model
after the training is completed.

The validation dataset gives the model the first taste of unseen data.
However, not all data scientists perform an initial check using validation
data. They might skip this part and go directly to testing data.
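
The training, validation, and test split, and the cross-validation idea from the tip above, can be sketched with the standard library alone. The function names, the 60/20/20 proportions, and the choice of five folds are assumptions made for illustration.

```python
import random

def three_way_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into training, validation, and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint subsets
    for i in range(k):
        # Fold i is held out for evaluation; all other folds are used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))

for train_idx, val_idx in k_fold_indices(10, k=5):
    print("held-out indices:", sorted(val_idx))
```

In cross-validation every sample is held out exactly once, so the averaged score uses all of the limited data for evaluation.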

What is human in the loop?


Human in the loop refers to the people involved in the gathering and
preparation of training data. Raw data is gathered from multiple sources,
including IoT devices, social media platforms, websites, and customer
feedback. Once collected, individuals involved in the process would
determine the crucial attributes of the data that are good indicators of the
outcome you want the model to predict.

The data is prepared by cleaning it, accounting for missing values, removing
outliers, tagging data points, and loading it into suitable places for training
ML algorithms. There will also be several rounds of quality checks; as you
know, incorrect labels can significantly affect the model's accuracy.

What makes training data good?


High-quality data translates to accurate machine learning models.

Low-quality data can significantly affect the accuracy of models, which can
lead to severe financial losses. It's almost like giving a student a textbook
containing wrong information and expecting them to excel in the
examination.

The following are the four primary traits of quality training data.
1)Relevant
The data needs to be relevant to the task at hand. For example, if you want
to train a computer vision algorithm for autonomous vehicles, you
probably won't require images of fruits and vegetables. Instead, you would
need a training dataset containing photos of roads, sidewalks, pedestrians,
and vehicles.

2)Representative
The AI training data must have the data points or features that the
application is made to predict or classify. Of course, the dataset can never
be absolute, but it must have at least the attributes the AI application is
meant to recognize.

For example, if the model is meant to recognize faces within images, it must
be fed with diverse data containing people's faces from various ethnicities.
This will reduce the problem of AI bias, and the model won't be prejudiced
against a particular race, gender, or age group.

3)Uniform
All data should have the same attributes and must come from the same
source.

Suppose your machine learning project aims to predict churn rate by looking
at customer information. For that, you'll have a customer information
database that includes customer name, address, number of orders, order
frequency, and other relevant information. This is historical data and can be
used as training data.

Additional information, such as age or gender, can't be present for only one
part of the data. This would make the training data incomplete and the model
inaccurate. In short, uniformity is a critical aspect of quality training data.

4)Comprehensive
Again, the training data can never be absolute. But it should be a large
dataset that represents the majority of the model's use cases. The training
data must have enough examples that’ll allow the model to learn
appropriately. It must contain real-world data samples as it will help train
the model to understand what to expect.

If you're thinking of training data as values placed in large numbers of rows
and columns, sorry, you're wrong. It could be any data type like text, images,
audio, or videos.

What affects training data quality?


Humans are highly social creatures, but there are some prejudices that we
might have picked up as children and that require constant conscious effort to get
rid of. Although unfavorable, such biases may affect our creations, and
machine learning applications are no different.

For ML models, training data is the only book they read. Their performance
or accuracy will depend on how comprehensive, relevant, and representative
the very book is.

That being said, three factors affect the quality of training data:

1. People: The people who train the model have a significant impact on
its accuracy or performance. If they're biased, it’ll naturally affect how
they tag data and, ultimately, how the ML model functions.
2. Processes: The data labeling process must have tight quality control
checks in place. This will significantly increase the quality of training
data.
3. Tools: Incompatible or outdated tools can make data quality suffer.
Using robust data labeling software can reduce the cost and time
associated with the process.

Data Preprocessing in Machine learning


Data preprocessing is the process of preparing raw data and making it
suitable for a machine learning model. It is the first and a crucial step in
creating a machine learning model.

When creating a machine learning project, we do not always
come across clean and formatted data. Before doing any operation
with data, it must be cleaned and put into a consistent format. For this,
we use data preprocessing.

Why do we need Data Preprocessing?


Real-world data generally contains noise and missing values, and may be in
an unusable format that cannot be directly used for machine learning
models. Data preprocessing is the set of tasks required for cleaning the data and
making it suitable for a machine learning model, which also increases the
accuracy and efficiency of the model.

It involves the steps below:

1) Getting the dataset
2) Importing libraries
3) Importing datasets
4) Finding missing data
5) Encoding categorical data
6) Splitting dataset into training and test set
7) Feature scaling

1) Get the Dataset

To create a machine learning model, the first thing we require is a dataset,
as a machine learning model works entirely on data. The data collected
for a particular problem in a proper format is known as the dataset.

Datasets may be of different formats for different purposes. For example, if we
want to create a machine learning model for a business purpose, then the dataset
will be different from the dataset required for a liver patient. So each dataset
is different from another dataset. To use the dataset in our code, we usually
put it into a CSV file. However, sometimes, we may also need to use an
HTML or xlsx file.

What is a CSV File?

CSV stands for "Comma-Separated Values"; it is a file format which
allows us to save tabular data, such as spreadsheets. It is useful for huge
datasets, and we can use these datasets in programs.

2) Importing Libraries

In order to perform data preprocessing using Python, we need to import
some predefined Python libraries. These libraries are used to perform some
specific jobs. There are three specific libraries that we will use for data
preprocessing, which are:

Numpy: The Numpy Python library is used for including any type of
mathematical operation in the code. It is the fundamental package for
scientific calculation in Python. It also adds support for large,
multidimensional arrays and matrices. So, in Python, we can import it as:

import numpy as nm : Here we have used nm, which is a short name for
Numpy, and it will be used in the whole program.

Matplotlib: The second library is matplotlib, which is a Python 2D
plotting library, and with this library, we need to import a sub-library,
pyplot. This library is used to plot any type of chart in Python. It will be
imported as below:

import matplotlib.pyplot as mpt : Here we have used mpt as a short name
for this library.

Pandas: The last library is the Pandas library, which is one of the most
famous Python libraries and is used for importing and managing datasets.
It is an open-source data manipulation and analysis library. It will be
imported as below:

import pandas as pd : Here, we have used pd as a short name for this library.

3) Importing the Datasets

Now we need to import the datasets which we have collected for our
machine learning project. But before importing a dataset, we need to set the
current directory as the working directory. To set a working directory in
Spyder IDE, we need to follow the steps below:

1. Save your Python file in the directory which contains the dataset.
2. Go to the File explorer option in Spyder IDE, and select the required
directory.
3. Press the F5 key or use the Run option to execute the file.

Note: We can set any directory as a working directory, but it must contain
the required dataset.

Once the Python file is saved alongside the required
dataset, the current folder is set as the working directory.

read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas library,
which reads a CSV file into a DataFrame on which we can perform various operations. Using
this function, we can read a CSV file locally as well as through a URL.
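
A minimal sketch of reading a local CSV file with read_csv() follows. To stay self-contained it first writes the example table to a temporary file; the file name Dataset.csv and the variable name data_set are hypothetical, and the empty Salary field is assumed to be the missing value from the example table.

```python
import os
import tempfile

import pandas as pd

csv_text = """Country,Age,Salary,Purchased
India,38,48000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
"""

# Write the text to a temporary folder so there is a real CSV file to read.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "Dataset.csv")   # hypothetical file name
    with open(path, "w") as handle:
        handle.write(csv_text)
    data_set = pd.read_csv(path)                 # returns a DataFrame

print(data_set.head())
print(data_set.isna().sum())   # count of missing values per column
```

In a real project the path would simply be the name of the CSV file in the working directory.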

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features
(independent variables) from the dependent variables in the dataset. In our
dataset, there are three independent variables, which are Country, Age,
and Salary, and one dependent variable, which is Purchased.
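
One common way to separate the two, assuming the dependent variable is the last column, is positional indexing with iloc. The names x and y are conventional choices, not something the notes prescribe.

```python
import io

import pandas as pd

csv_text = """Country,Age,Salary,Purchased
India,38,48000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
"""
data_set = pd.read_csv(io.StringIO(csv_text))

# Matrix of features: every row, every column except the last one.
x = data_set.iloc[:, :-1].to_numpy()
# Dependent variable: every row, only the last column (Purchased).
y = data_set.iloc[:, -1].to_numpy()

print(x)
print(y)
```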

4) Handling Missing data:

The next step of data preprocessing is to handle missing data in the datasets.
If our dataset contains some missing data, then it may create a huge problem
for our machine learning model. Hence it is necessary to handle missing
values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is commonly used to deal
with null values. In this way, we just delete the specific row or column
which contains null values. But this way is not so efficient, and removing
data may lead to a loss of information, which can make the output less accurate.

By calculating the mean: In this way, we calculate the mean of the
column or row which contains a missing value and put it in the place
of the missing value. This strategy is useful for features which have numeric
data such as age, salary, year, etc. Here, we will use this approach.
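
Both ways can be sketched with pandas as below. The small DataFrame reuses the Age and Salary values from the example table, with the missing Salary written as NaN; using pandas here is an assumption, and other libraries offer equivalent tools.

```python
import numpy as np
import pandas as pd

data_set = pd.DataFrame({
    "Age": [38, 43, 30, 48, 40, 35],
    "Salary": [48000, 45000, 54000, 65000, np.nan, 58000],
})

# Way 1: delete every row that contains a missing value (information is lost).
dropped = data_set.dropna()

# Way 2: replace the missing Salary with the mean of the known salaries.
salary_mean = data_set["Salary"].mean()   # NaN values are skipped by default
filled = data_set.fillna({"Salary": salary_mean})

print(len(data_set), "rows before,", len(dropped), "rows after dropping")
print(filled)
```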

5) Encoding Categorical data:

Categorical data is data which has some categories. In our dataset,
there are two categorical variables, Country and Purchased.

A machine learning model works entirely on mathematics and
numbers, so if our dataset has a categorical variable, it may
create trouble while building the model. So it is necessary to encode these
categorical variables into numbers.

Dummy Variables:

Dummy variables are variables which have the value 0 or 1. A value of 1
indicates that the row belongs to the category represented by that column, and
the other dummy columns are 0 for that row. With dummy encoding, we will have
a number of columns equal to the number of categories.
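
A short sketch of both encodings with pandas: dummy columns for Country and a simple 0/1 mapping for Purchased. The choice of get_dummies and of mapping No to 0 and Yes to 1 is an assumption for illustration.

```python
import pandas as pd

data_set = pd.DataFrame({
    "Country": ["India", "France", "Germany", "France", "Germany", "India"],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes"],
})

# Dummy encoding: one 0/1 column per country, exactly one 1 in each row.
country_dummies = pd.get_dummies(data_set["Country"], dtype=int)

# Two-valued variable: map each category directly to a number.
purchased_encoded = data_set["Purchased"].map({"No": 0, "Yes": 1})

print(country_dummies)
print(purchased_encoded.tolist())
```

Three countries produce three dummy columns, matching the rule that the number of columns equals the number of categories.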

6) Splitting the Dataset into the Training set and Test set

In machine learning data preprocessing, we divide our dataset into a training
set and a test set. This is one of the crucial steps of data preprocessing because
it lets us check how well the model performs on data it has not seen.

Suppose we train our machine learning model on one
dataset and then test it on a completely different dataset. The model will then
have difficulty, because the correlations it learned from the training data may
not hold in the new data.

If we train our model very well and its training accuracy is very high,
but we then provide a new dataset to it, its performance may decrease. So
we always try to make a machine learning model which performs well with
the training set and also with the test dataset. Here, we can define these
datasets as:

Training set: A subset of the dataset used to train the machine learning model,
for which we already know the output.

Test set: A subset of the dataset used to test the machine learning model; the
model predicts the output for the test set.
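
The split can be sketched with scikit-learn's train_test_split, assuming scikit-learn is installed. The toy arrays, the 80/20 ratio, and the random_state value are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and one label per sample.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the samples for testing; random_state makes the shuffle repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print("training set:", x_train.shape, y_train.shape)
print("test set:", x_test.shape, y_test.shape)
```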

7) Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning.
It is a technique to standardize the independent variables of the dataset within a
specific range. In feature scaling, we put our variables in the same range and
on the same scale so that no variable dominates the others.

There are two ways to perform feature scaling in machine learning:

Standardization

Normalization
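
The two methods can be sketched with NumPy using their usual textbook definitions, which are assumed here: standardization rescales each column to (x - mean) / standard deviation, and normalization (min-max scaling) rescales each column to (x - min) / (max - min). The Age and Salary values come from the complete rows of the example table.

```python
import numpy as np

# Columns: Age, Salary. The two features live on very different scales.
x = np.array([
    [38.0, 48000.0],
    [43.0, 45000.0],
    [30.0, 54000.0],
    [48.0, 65000.0],
    [35.0, 58000.0],
])

# Standardization: each column ends up with mean 0 and standard deviation 1.
standardized = (x - x.mean(axis=0)) / x.std(axis=0)

# Normalization (min-max): each column is rescaled to the range [0, 1].
normalized = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(standardized)
print(normalized)
```

After either transformation Age and Salary are comparable in magnitude, so neither feature dominates purely because of its units.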
