Unit-1-Introduction (Fundamentals of ML & AI) January 29, 2024
What is Machine Learning?
Introduction
Machine learning is an umbrella term for a set of techniques and tools that
help computers learn and adapt on their own. Machine learning
algorithms let a system learn without being explicitly programmed to perform
the desired action. By learning a pattern from sample inputs, the
algorithm predicts and performs tasks based on the learned
pattern rather than on predefined program instructions. Machine learning is
especially valuable in cases where writing strict, rule-based algorithms is
not feasible: the system learns from previous patterns and applies that
knowledge to new inputs.
One of the machine learning applications we are familiar with is the way
our email providers help us deal with spam. Spam filters use an algorithm
to identify and move incoming junk email to your spam folder. Several e-
commerce companies also use machine learning algorithms in conjunction
with other IT security tools to prevent fraud and improve their
recommendation engine performance.
With the help of sample historical data, which is known as training data,
machine learning algorithms build a mathematical model that helps in
making predictions or decisions without being explicitly programmed.
Machine learning brings computer science and statistics together for
creating predictive models. Machine learning constructs or uses the
algorithms that learn from historical data. In general, the more relevant
data we provide, the better the performance.
Artificial Intelligence: AI can be considered the superset, and the term is
used generically most of the time. The robot Sophia, the chatbots we
interact with, face recognition on mobile phones, and even self-driving
cars all come under this big umbrella called AI. Broadly, it covers
any system that tries to mimic the behavioural patterns of humans or
other living entities.
Deep Learning: Deep learning uses an architecture loosely modelled on our
nervous system: artificial neural networks with many layers. Deep learning
models are widely used in FinTech, healthcare, construction, and so on.
Suppose we are given the above dataset and asked to predict whether football
was played. The column Played football (yes/no) is then our target variable
or dependent variable, and the other columns are our independent variables or features.
Machine Learning : Introduction
What is Learning?
(3) Refinement and organization of knowledge into more effective
representations or more useful form. One example of this kind of
learning can be reorganization of the rules in a knowledge base such
that more important rules are given higher priorities so that they can
be used more easily and conveniently.
Learning = Improving performance P at task T by acquiring
knowledge K using self-changing algorithm A through experience
E in an environment for task T.
Learning involves:
T : Task in an environment
E : Experience in T
P : Performance criteria
K : Knowledge acquired
There are other viewpoints as to what constitutes the notion of
learning. For example, Minsky gives a more general definition,
"Learning is making useful changes in our minds".
• P is win/loss by computer.
The need for machine learning is increasing day by day, because it can
perform tasks that are too complex for a person to implement directly. As
humans, we are limited in how much data we can process manually, so
we need computer systems for this, and machine learning
makes the job easier for us.
Why Are the Goals of ML Important and Desirable?
Following are some key points which show the importance of Machine
Learning:
Dimension            Human Learning      Machine Learning
Speed                Slow                Slow (hope to find tricks for
                                         machines to learn fast)
Ability to transfer  No copy mechanism   Easy to copy
Require repetition   Yes                 Yes/No
Error-prone          Yes                 Yes
Noise-tolerant       Yes                 No
understand. Simplifying the notation and providing explanations can
make the concepts more accessible.
4. Using real-world applications: Explaining how machine learning is
used in real-world applications can help to make it more relatable and
understandable.
5. Encouraging experimentation and exploration: Encouraging
experimentation and exploration with machine learning can help to
demystify it by allowing people to see how it works in practice and gain
a deeper understanding of it.
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. The given
data is labeled. Both classification and regression problems are supervised
learning problems.
How Does Supervised Learning Work?
In supervised learning, models are trained using a labelled dataset, from which the
model learns about each type of data. Once the training process is
completed, the model is evaluated on test data (a held-out portion of the
labelled data that was not used for training), and then it predicts the output.
Suppose we have a dataset of different types of shapes, which includes
squares, rectangles, triangles, and polygons. The first step is to
train the model on labelled examples of each shape:
a) If the given shape has four sides, and all the sides are equal, then it
will be labelled as a square.
b) If the given shape has three sides, then it will be labelled as a triangle.
c) If the given shape has six equal sides, then it will be labelled
as a hexagon.
After training, we test our model using the test set, and the task of the
model is to identify the shape.
The machine has already been trained on all types of shapes, so when it encounters a
new shape, it classifies the shape on the basis of the number of sides and
predicts the output.
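The train-then-test workflow for the shapes example can be sketched in a few lines of Python. This is only an illustration: the feature encoding (number of sides, ratio of shortest to longest side), the sample values, and the choice of a 1-nearest-neighbour classifier are all assumptions made for the sketch, not part of the notes.

```python
# Labelled training examples: (number_of_sides, shortest/longest side ratio) -> label.
# The encoding and the values are hypothetical, chosen only for illustration.
TRAINING_SET = [
    ((4, 1.0), "square"),
    ((4, 0.5), "rectangle"),
    ((3, 1.0), "triangle"),
    ((3, 0.6), "triangle"),
    ((6, 1.0), "hexagon"),
]

# Held-out test examples that were not used for training.
TEST_SET = [
    ((4, 0.95), "square"),
    ((4, 0.55), "rectangle"),
    ((3, 0.8), "triangle"),
    ((6, 0.97), "hexagon"),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(features, training_set=TRAINING_SET):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(training_set, key=lambda ex: squared_distance(ex[0], features))
    return label


# Evaluate on the test set: the fraction of shapes identified correctly.
correct = sum(predict(features) == label for features, label in TEST_SET)
accuracy = correct / len(TEST_SET)
print(accuracy)
```

The important point is the separation of roles: the labelled training set is what the model learns from, and the test set is used only to check its predictions.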
1. Regression
Regression algorithms are used when the output variable is a continuous quantity that
depends on the input variables. They are used for the prediction of continuous variables, such as
in weather forecasting and market trend analysis. Below are some popular regression
algorithms which come under supervised learning:
a) Linear Regression
b) Decision Tree
c) Random Forest
d) Neural Network
a) Linear Regression
Linear regression is the simplest machine learning model, in which we try to predict
one output variable using one or more input variables. The representation of linear
regression is a linear equation that maps a set of input values (x) to the predicted
output (y) for those input values. With a single input, it takes the form of a line:
y = bx + c,
where b is the slope and c is the intercept.
The main aim of the linear regression model is to find the line that best fits the
data points.
Linear regression is extended to multiple linear regression (find a plane of best fit) and
polynomial regression (find the best fit curve).
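As a minimal sketch of how a best-fit line can be found, the following uses the standard closed-form ordinary least squares solution for a single input variable. The function name `fit_line` and the sample data are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = b*x + c for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c = mean_y - b * mean_x
    return b, c


# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
b, c = fit_line(xs, ys)
print(b, c)  # 2.0 1.0
```

Because the sample points lie exactly on a line, the fit recovers the slope and intercept exactly; with noisy data the same formulas return the line that minimizes the sum of squared errors.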
b) Decision Tree
Decision trees are popular machine learning models that can be used for both
regression and classification problems.
A decision tree uses a tree-like structure of decisions along with their possible
consequences and outcomes. Each internal node represents a test on
an attribute, and each branch represents an outcome of that test. Adding nodes lets
a tree fit the training data more closely, but an overly large tree tends to overfit
and generalize poorly to new data.
The advantage of decision trees is that they are intuitive and easy to implement, but
a single tree is often less accurate than ensemble methods such as random forests.
c) Random Forest
Random forest is an ensemble learning method that consists of a large number of
decision trees. Each decision tree in a random forest predicts an outcome, and the
individual predictions are combined to produce the final outcome.
A random forest model can be used for both regression and classification problems.
For the classification task, the outcome of the random forest is taken from the majority
of votes. Whereas in the regression task, the outcome is taken from the mean or average
of the predictions generated by each tree.
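The two ways of combining tree outputs described above can be sketched as follows. The per-tree predictions are hypothetical placeholders; building the trees themselves is outside the scope of this sketch.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_predictions):
    """Majority vote over the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Average of the numeric predictions made by the individual trees."""
    return mean(tree_predictions)


# Hypothetical outputs of five classification trees and four regression trees.
votes = ["spam", "not spam", "spam", "spam", "not spam"]
estimates = [10.0, 12.0, 11.0, 13.0]

print(aggregate_classification(votes))   # spam
print(aggregate_regression(estimates))   # 11.5
```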
d) Neural Networks
Neural networks are a subset of machine learning and are also known as artificial
neural networks. They are made up of artificial neurons and designed in a
way that resembles the structure and working of the human brain. Each artificial neuron
connects with many other neurons, and millions of such connected
neurons can form a sophisticated cognitive structure.
Neural networks have a multilayer structure, containing one input layer, one or
more hidden layers, and one output layer. As each neuron is connected to neurons
in the next layer, it passes data from one layer to the next. Finally,
the data reaches the last layer, or output layer, of the neural network, which generates the output.
Neural networks depend on training data to learn and improve their accuracy.
Once well trained, a neural network can classify and cluster data quickly and
become a powerful machine learning and AI tool. One of the best-known applications of neural
networks is Google's search algorithm.
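The layer-by-layer flow of data described above can be illustrated with a tiny forward pass. This is a sketch only: the network size (2 inputs, 3 hidden neurons, 1 output), the sigmoid activation, and all weight values are assumptions; in practice the weights would be learned from training data.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: each neuron takes a weighted sum of all inputs."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Hypothetical, hand-picked weights (not trained).
HIDDEN_W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]  # 3 hidden neurons, 2 inputs each
HIDDEN_B = [0.0, 0.1, -0.1]
OUTPUT_W = [[0.3, -0.6, 0.9]]                      # 1 output neuron, 3 inputs
OUTPUT_B = [0.05]


def forward(inputs):
    """Pass data from the input layer through the hidden layer to the output layer."""
    hidden = layer_forward(inputs, HIDDEN_W, HIDDEN_B)
    return layer_forward(hidden, OUTPUT_W, OUTPUT_B)


output = forward([1.0, 2.0])
print(output)
```

Training a network means adjusting these weights and biases so that the output moves closer to the desired targets.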
2. Classification
Classification models are the second type of supervised learning technique; they
are used to draw conclusions from observed values in categorical form. For
example, a classification model can identify whether an email is spam or not, or whether a buyer will
purchase a product or not. Classification algorithms predict which of two
or more classes an input belongs to, categorizing the output into different groups.
In classification, a classifier model is designed that classifies the dataset into different
categories, and each category is assigned a label.
o Binary classification: If the problem has only two possible classes, the model is called a
binary classifier. For example, cat or dog, yes or no.
o Multi-class classification: If the problem has more than two possible classes, it
is a multi-class classifier.
a) Logistic Regression
b) Support Vector Machine
Support vector machine, or SVM, is a popular machine learning algorithm that is
widely used for classification and regression tasks, though it is primarily used to
solve classification problems. The main aim of SVM is to find the best decision
boundary in an N-dimensional space that can segregate data points into classes;
this best decision boundary is known as a hyperplane. SVM selects the extreme
points (vectors) that help define the hyperplane, and these vectors are known as support vectors.
c) Naïve Bayes
A naïve Bayes classifier assumes that, given the class, the value of each feature is independent
of every other feature. For example, suppose a fruit needs to be classified based on
color, shape, and taste: a yellow, oval, and sweet fruit will be recognized as a mango. Here
each feature is treated as independent of the other features.
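The independence assumption can be made concrete with a small categorical naïve Bayes sketch for the fruit example. The training examples are invented for illustration, and add-one (Laplace) smoothing is an added assumption so that unseen feature values do not produce zero probabilities.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count classes and per-class feature values from (features, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # (label, position) -> value counts
    values = defaultdict(set)              # position -> distinct values seen
    for features, label in examples:
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values


def predict(model, features):
    """Pick the class with the highest prior times product of feature likelihoods."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            # Laplace-smoothed P(feature_i = value | class); each feature is
            # multiplied in separately, which is the independence assumption.
            score *= (feature_counts[(label, i)][value] + 1) / (count + len(values[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (color, shape, taste) -> fruit.
FRUITS = [
    (("yellow", "oval", "sweet"), "mango"),
    (("yellow", "oval", "sweet"), "mango"),
    (("green", "oval", "sour"), "mango"),
    (("yellow", "long", "sweet"), "banana"),
    (("yellow", "long", "sweet"), "banana"),
    (("red", "round", "sweet"), "apple"),
    (("green", "round", "sour"), "apple"),
]

model = train_naive_bayes(FRUITS)
print(predict(model, ("yellow", "oval", "sweet")))  # mango
```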
Advantages of Supervised learning:
➢ With the help of supervised learning, the model can predict the output on the
basis of prior experiences.
➢ In supervised learning, we can have an exact idea about the classes of objects.
➢ Supervised learning model helps us to solve various real-world problems such
as fraud detection, spam filtering, etc.
b) Unsupervised Learning:
according to similarities, and represent that dataset in a compressed
format.
Gender Age
M 48
M 67
F 53
M 49
F 34
M 21
In unsupervised machine learning, the model tries to find
patterns in unlabelled datasets (without targets or dependent variables)
using only the independent variables. A common example is clustering, in
which the model groups the data points after learning the patterns.
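As a small illustration of clustering, the sketch below runs k-means on the Age column of the table above. The choice of two clusters, the starting centroids, and the function name `k_means_1d` are assumptions made for the example; no labels are used at any point.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on one-dimensional data, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


ages = [48, 67, 53, 49, 34, 21]
centroids, clusters = k_means_1d(ages, centroids=[21.0, 67.0])
print(centroids)  # [27.5, 54.25]
print(clusters)   # [[34, 21], [48, 67, 53, 49]]
```

The algorithm discovers a "younger" and an "older" group purely from the similarity of the values.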
data and then will apply suitable algorithms such as k-means clustering,
hierarchical clustering, etc.
Once the suitable algorithm is applied, it divides the data objects
into groups according to the similarities and differences between the objects.
Unsupervised learning models are mainly used to perform three tasks, which
are as follows:
➢ Clustering
➢ Association Rule Learning
➢ Dimensionality Reduction
• K-means clustering
• KNN (k-nearest neighbors)
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
c) Reinforcement Learning:
Reinforcement learning refers to goal-oriented algorithms, which learn how
to attain a complex objective (goal) or maximize along a particular
dimension over many steps. This method allows machines and software
agents to automatically determine the ideal behavior within a specific
context in order to maximize its performance. Simple reward feedback is
required for the agent to learn which action is best; this is known as the
reinforcement signal. For example, maximize the points won in a game over
many moves.
The behavior of the model in reinforcement learning is similar to human
learning: humans also learn from experience, by interacting
with the environment and receiving feedback.
Below are some popular algorithms that come under reinforcement learning:
Q-learning: It aims to learn a policy that helps the AI agent take the best
action for maximizing the reward under a specific circumstance. It
maintains a Q-value for each state-action pair, which estimates the
cumulative reward expected from taking that action in that state, and
it tries to maximize the Q-value.
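The way Q-values are adjusted from reward feedback can be sketched with the standard tabular Q-learning update rule. The learning rate, discount factor, state names, and actions below are illustrative assumptions, not values taken from the notes.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (assumed value)
GAMMA = 0.9  # discount factor for future rewards (assumed value)


def q_update(q, state, action, reward, next_state, actions):
    """Q(s, a) += alpha * (reward + gamma * max over a' of Q(s', a') - Q(s, a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])


def best_action(q, state, actions):
    """Greedy policy: choose the action with the highest Q-value in this state."""
    return max(actions, key=lambda a: q[(state, a)])


q = defaultdict(float)           # all Q-values start at 0.0
actions = ["left", "right"]

# One hypothetical experience: moving right from "start" reaches "goal", reward 1.
q_update(q, "start", "right", 1.0, "goal", actions)
print(q[("start", "right")])              # 0.5
print(best_action(q, "start", actions))   # right
```

Repeating this update over many interactions is what lets the agent's Q-values, and therefore its policy, improve from the reinforcement signal.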
d) Semi-supervised learning:
In semi-supervised learning, an incomplete training signal is given: a training set with some
(often many) of the target outputs missing. A special case of this
principle, known as transduction, is where the entire set of problem instances
is known at learning time, except that part of the targets are missing.
Semi-supervised learning combines a small amount of
labeled data with a large amount of unlabeled data during training. It
falls between unsupervised learning and supervised
learning.
One of the most confusing questions for beginners is whether machine
learning models and algorithms are the same, because in
machine learning and data science these two terms are often used
interchangeably.
The answer is no: a machine learning model is not
the same as an algorithm. Put simply, an ML algorithm is a
procedure or method that runs on data to discover patterns in it and
generate the model, whereas a machine learning model is like a
computer program that generates output or makes predictions. More
specifically, when we train an algorithm with data, the result is a model.
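The distinction can be shown in code: the algorithm is the procedure that runs on data, and the model is what it returns. The threshold rule and the data below are hypothetical, chosen only to keep the sketch small.

```python
def learning_algorithm(training_data):
    """The algorithm: a procedure that runs on labelled data and returns a model."""
    small = [x for x, label in training_data if label == "small"]
    large = [x for x, label in training_data if label == "large"]
    # The learned "pattern" is a single threshold halfway between the class means.
    threshold = (sum(small) / len(small) + sum(large) / len(large)) / 2

    def model(x):
        """The model: the trained artefact that makes predictions."""
        return "large" if x >= threshold else "small"

    return model


training_data = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
model = learning_algorithm(training_data)
print(model(7.0), model(3.0))  # large small
```

Running the same algorithm on different training data would produce a different model, which is why the two terms are not interchangeable.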
Supervised learning: needs supervision to train the model.
Unsupervised learning: does not need any supervision to train the model.
Machine Learning Applications and Examples
(prepare any six applications for exam)
• Social Media Features : Social media platforms use machine learning
algorithms and approaches to create some attractive and excellent
features. For instance, Facebook notices and records your activities,
chats, likes, and comments, and the time you spend on specific kinds
of posts. Machine learning learns from your own experience and
makes friends and page suggestions for your profile.
• Sentiment Analysis : Sentiment analysis is one of the most necessary
applications of machine learning. Sentiment analysis is a real-time
machine learning application that determines the emotion or opinion
of the speaker or the writer. For instance, if someone has written a
review or email (or any form of a document), a sentiment analyzer will
instantly find out the actual thought and tone of the text. This
sentiment analysis application can be used to analyze a review based
website, decision-making applications, etc.
• Web Search Engine: One of the reasons why search engines like
Google, Bing, etc. work so well is that the system has learnt how to
rank pages through a complex learning algorithm.
• Photo tagging Applications: Be it Facebook or any other photo
tagging application, the ability to tag friends makes it more
engaging. It is all possible because of a face recognition algorithm
that runs behind the application.
• Spam Detector: Our mail agent like Gmail or Hotmail does a lot of
hard work for us in classifying the mails and moving the spam mails
to spam folder. This is again achieved by a spam classifier running in
the back end of mail application.
• Augmentation: Machine learning that assists humans with their
day-to-day tasks, personally or commercially, without having
complete control of the output. Such machine learning is used in
different ways, such as virtual assistants, data analysis, and software
solutions. The primary use is to reduce errors due to human bias.
• Automation: Machine learning, which works entirely
autonomously in any field without the need for any human
intervention. For example, robots performing the essential process
steps in manufacturing plants.
• Finance Industry: Machine learning is growing in popularity in the
finance industry. Banks are mainly using ML to find patterns inside
the data but also to prevent fraud.
• Government organization: The government makes use of ML to
manage public safety and utilities. Take the example of China, with
its large-scale face recognition: the government uses artificial
intelligence to deter jaywalking.
• Marketing: AI is broadly used in marketing thanks to
abundant access to data. Before the age of mass data, researchers
developed advanced mathematical tools like Bayesian analysis to
estimate the value of a customer. With the boom of data, marketing
departments rely on AI to optimize customer relationships and
marketing campaigns.
• Computer vision: Machine learning algorithms can be used to
recognize objects, people, and other elements in images and videos.
• Natural language processing: Machine learning algorithms can be
used to understand and generate human language, including tasks
such as translation and text classification.
• Recommendation systems: Machine learning algorithms can be
used to recommend products or content to users based on their past
behavior and preferences.
• Fraud detection: Machine learning algorithms can be used to
identify fraudulent activity in areas such as credit card transactions
and insurance claims.
do it is to have some way for machines to learn things themselves. A
mechanism for learning – if a machine can learn from input then it does the
hard work for us. This is where Machine Learning comes in action.
• Database Mining for growth of automation: Typical applications
include Web-click data for better UX( User eXperience), Medical
records for better automation in healthcare, biological data and many
more.
• Applications that cannot be programmed: There are some tasks that
cannot be programmed as the computers we use are not modelled that
way. Examples include Autonomous Driving, Recognition tasks from
unordered data (Face Recognition/ Handwriting Recognition), Natural
language Processing, computer Vision etc.
• Understanding Human Learning: This is the closest we have
come to understanding and mimicking the human brain. It is the start of a new
revolution: the real AI. Now, after this brief insight, let's come to a more
formal definition of machine learning.
Prerequisites
Before learning machine learning, you should have basic knowledge of
the following so that you can easily understand the concepts of machine
learning:
o The period from 1974 to 1980 was a tough time for AI and ML
researchers, and it is known as the AI winter.
o During this period, machine translation failed to deliver, and people
lost interest in AI, which led to reduced government funding
for research.
o 1985: Terry Sejnowski and Charles Rosenberg invented the
neural network NETtalk, which was able to teach itself how to
correctly pronounce 20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match
against world champion Garry Kasparov, becoming the first
computer to beat a reigning world chess champion in a match.
o 2006: Computer scientist Geoffrey Hinton gave
neural net research a new name, "deep learning," which has since
become one of the most prominent technologies.
o 2012: Google created a deep neural network that learned
to recognize images of humans and cats in YouTube videos.
o 2014: The chatbot "Eugene Goostman" was reported to have passed the Turing
test. It was the first chatbot to convince 33% of human judges
that it was not a machine.
o 2014: DeepFace was a deep neural network created by Facebook, which
claimed that it could recognize a person with the same precision
as a human.
o 2016: AlphaGo beat Lee Sedol, then ranked the world's number two player,
at the game of Go. In 2017 it beat the number one player,
Ke Jie.
o 2017: Alphabet's Jigsaw team built an intelligent system
that was able to learn to detect online trolling. It read millions of
comments from different websites to learn to stop online trolling.
Machine learning research has now advanced greatly, and it is
present everywhere around us, in self-driving cars, Amazon
Alexa, chatbots, recommender systems, and many more. It
includes supervised, unsupervised, and reinforcement learning, with
clustering, classification, decision tree, SVM algorithms, etc.
Key Elements of Machine Learning
There are tens of thousands of machine learning algorithms and hundreds of
new algorithms are developed every year. Every machine learning algorithm
has three components:
• Representation learning
• Evaluation
• Optimization
In representation learning, data is sent into the machine, and it learns the
representation on its own. It is a way of determining a data representation
of the features, the distance function, and the similarity function that
determines how the predictive model will perform. Representation learning
works by reducing high-dimensional data to low-dimensional data, making
it easier to discover patterns and anomalies while also providing a better
understanding of the data’s overall behaviour.
Importance of Representation
Representation learning is a very important aspect of machine
learning which automatically discovers the feature patterns in the data.
When the machine is provided with the data, it learns the representation
itself without any human intervention. The goal of representation learning
is to train machine learning algorithms to learn useful representations, such
as those that are interpretable, incorporate latent features, or can be used for
transfer learning.
EVALUATION
Evaluation: the way to evaluate candidate programs
(hypotheses). Examples include accuracy, precision and recall, squared
error, likelihood, posterior probability, cost, margin, entropy, K-L divergence,
and others.
Evaluation is essentially how you judge or prefer one model over another; it’s
what you might have seen as a utility function, loss function, scoring
function, or fitness function in other contexts. Think of this as the height of
the landscape for each given model, with lower areas being
preferable to higher areas (without loss of generality). Mean
squared error (of a model’s output vs. the data output) and likelihood (the
probability of the observed data under a given model) are examples of
different evaluation functions that will imply somewhat different heights at
each point on a single landscape.
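A few of the evaluation functions named above can be written out directly. The sketch below uses the standard definitions of accuracy, precision, recall, and mean squared error; the label vectors and numeric values are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Of the examples predicted positive, the fraction that really are positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp)


def recall(y_true, y_pred, positive=1):
    """Of the truly positive examples, the fraction that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between the model output and the data output."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical classifier output: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))    # 4 of 6 correct
print(precision(y_true, y_pred))   # 2 / (2 + 1)
print(recall(y_true, y_pred))      # 2 / (2 + 1)
print(mean_squared_error([2.0, 4.0], [1.0, 6.0]))  # 2.5
```

Different metrics can rank the same pair of models differently, which is why the choice of evaluation function matters.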
OPTIMIZATION
Optimization: The way candidate programs are generated, known as the
search process. Examples include combinatorial optimization, convex
optimization, and constrained optimization. All machine learning algorithms are
combinations of these three components, which together provide a framework for understanding
all algorithms.
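How the three components fit together can be seen in a minimal sketch: the representation is the one-parameter model y = w * x, the evaluation is mean squared error, and the optimization is gradient descent. The model form, learning rate, step count, and data are all assumptions chosen to keep the example small.

```python
def mean_squared_error(w, xs, ys):
    """Evaluation: how badly the model y = w * x fits the data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, w=0.0, learning_rate=0.05, steps=200):
    """Optimization: repeatedly step downhill on the error landscape."""
    for _ in range(steps):
        w -= learning_rate * gradient(w, xs, ys)
    return w


# Hypothetical data generated from y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w = gradient_descent(xs, ys)
print(round(w, 6))  # 2.0
```

Swapping any one component (a different model form, a different error measure, or a different search procedure) yields a different learning algorithm within the same framework.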
What is Approximation?
Approximation is used whenever a numerical value, a model, a structure or
a function is either unknown or difficult to compute. Here we focus on function
approximation and describe its application to machine learning problems.
Data sets in Machine Learning
How to get datasets for Machine Learning ?
The key to success in the field of machine learning or to become a great data
scientist is to practice with different types of datasets. But discovering a
suitable dataset for each kind of machine learning project is a difficult task.
So, in this topic, we will provide the detail of the sources from where you
can easily get the dataset according to your project.
Before knowing the sources of the machine learning dataset, let's discuss
datasets.
What is a dataset?
Types of data in datasets
Need of Dataset
The technology applied behind any ML projects cannot work properly if the
dataset is not well prepared and pre-processed.
➢ Training dataset:
➢ Test Dataset
Note: The datasets are large, so to download them you
need a fast internet connection.
Below is the list of datasets that are freely available for the public to work
with:
1. Kaggle Datasets
Kaggle is one of the best sources for providing datasets for Data Scientists
and Machine Learners. It allows users to find, download, and publish
datasets in an easy way. It also provides the opportunity to work with other
machine learning engineers and solve difficult Data Science related tasks.
Kaggle provides high-quality datasets in different formats that we can
easily find and download.
2. UCI Machine Learning Repository
Since the year 1987, it has been widely used by students, professors, and
researchers as a primary source of machine learning datasets.
It classifies the datasets as per the problems and tasks of machine learning
such as Regression, Classification, Clustering, etc. It also contains some
of the popular datasets such as the Iris dataset, Car Evaluation dataset,
Poker Hand dataset, etc.
3. Datasets via AWS
We can search, download, access, and share the datasets that are publicly
available via AWS resources. These datasets can be accessed through AWS
resources but are provided and maintained by different government
organizations, researchers, businesses, or individuals.
Anyone can analyze and build various services using shared data via AWS
resources. The shared dataset on cloud helps users to spend more time on
data analysis rather than on acquisitions of data.
This source provides the various types of datasets with examples and ways
to use the dataset. It also provides the search box using which we can search
for the required dataset. Anyone can add any dataset or example to
the Registry of Open Data on AWS.
4. Google's Dataset Search Engine
5. Microsoft Datasets
6. Awesome Public Dataset Collection
7. Government Datasets
Visual Data provides a large number of datasets that are specific
to computer vision tasks such as image classification, video classification,
image segmentation, etc. Therefore, if you want to build a project on deep
learning or image processing, you can refer to this source.
9. Scikit-learn dataset
What is Training Data?
Training data is the initial dataset used to train machine learning algorithms.
Models create and refine their rules using this data. It's a set of data samples
used to fit the parameters of a machine learning model, training it by
example.
Training data is also known as training dataset, learning set, and training set.
It's an essential component of every machine learning model and helps them
make accurate predictions or perform a desired task.
Simply put, training data builds the machine learning model. It teaches what
the expected output looks like. The model analyzes the dataset repeatedly to
deeply understand its characteristics and adjust itself for better performance.
Machine learning models are as good as the data they're trained on. Without
high-quality training data, even the most efficient machine
learning algorithms will fail to perform.
The need for quality, accurate, complete, and relevant data starts early on in
the training process. Only if the algorithm is fed with good training data can
it easily pick up the features and find relationships that it needs to predict
down the line.
In a broader sense, training data can be classified into two
categories: labeled data and unlabeled data.
The teacher’s aspiration is that the student must perform well in exams and
also in the real world. In the case of ML algorithms, testing is like exams.
The textbooks (training dataset) contain several examples of the type of
questions that’ll be asked in the exam.
Of course, it won’t contain all the examples of questions that’ll be asked in the exam,
nor will all the examples included in the textbook be asked in the exam. The
textbooks can help prepare the student by teaching them what to expect and how to
respond.
contain all possible examples, it’ll make algorithms capable of making
predictions.
But that's never the case. A training dataset can never be comprehensive and
can't teach everything that a model might encounter in the real world.
Therefore a test dataset, containing unseen data points, is used to evaluate
the model's accuracy.
Then there's validation data. This is a dataset used for frequent evaluation
during the training phase. Although the model sees this dataset occasionally,
it doesn't learn from it. The validation set is also referred to as the
development set or dev set. It helps protect models from overfitting and
underfitting.
Although validation data is separate from training data, data scientists might
reserve a part of the training data for validation. But of course, this
automatically means that the validation data was kept away during the
training.
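The relationship between training, validation, and test data can be sketched as a simple three-way split. The 70/15/15 proportions, the fixed seed, and the function name `split_dataset` are illustrative assumptions; the right proportions depend on the project.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the samples and cut them into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is held out for testing
    return train, validation, test


# 100 hypothetical samples, identified here just by their index.
train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))  # 70 15 15
```

The model is fitted on `train`, checked periodically on `validation` during training, and evaluated once at the end on `test`; no sample appears in more than one of the three sets.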
The validation dataset gives the model the first taste of unseen data.
However, not all data scientists perform an initial check using validation
data. They might skip this part and go directly to testing data.
The data is prepared by cleaning it, accounting for missing values, removing
outliers, tagging data points, and loading it into suitable places for training
ML algorithms. There will also be several rounds of quality checks; as you
know, incorrect labels can significantly affect the model's accuracy.
Low-quality data can significantly affect the accuracy of models, which can
lead to severe financial losses. It's almost like giving a student a textbook
containing wrong information and expecting them to excel in the
examination.
The following are the four primary traits of quality training data.
1)Relevant
The data needs to be relevant to the task at hand. For example, if you want
to train a computer vision algorithm for autonomous vehicles, you
probably won't require images of fruits and vegetables. Instead, you would
need a training dataset containing photos of roads, sidewalks, pedestrians,
and vehicles.
2)Representative
The AI training data must have the data points or features that the
application is made to predict or classify. Of course, the dataset can never
be absolute, but it must have at least the attributes the AI application is
meant to recognize.
For example, if the model is meant to recognize faces within images, it must
be fed with diverse data containing people's faces from various ethnicities.
This will reduce the problem of AI bias, and the model won't be prejudiced
against a particular race, gender, or age group.
3)Uniform
All data should have the same attributes and must come from the same
source.
Suppose your machine learning project aims to predict churn rate by looking
at customer information. For that, you'll have a customer information
database that includes customer name, address, number of orders, order
frequency, and other relevant information. This is historical data and can be
used as training data.
No part of the data should carry additional information, such as age or gender,
that the rest lacks. This would make the training data inconsistent and the model
inaccurate. In short, uniformity is a critical aspect of quality training data.
4)Comprehensive
Again, the training data can never be exhaustive. But it should be a large
dataset that represents the majority of the model's use cases. The training
data must have enough examples to allow the model to learn appropriately.
It must contain real-world data samples, as these help train the model to
understand what to expect.
For ML models, training data is the only book they read. Their performance
or accuracy will depend on how comprehensive, relevant, and representative
that book is.
That being said, three factors affect the quality of training data:
1. People: The people who train the model have a significant impact on
its accuracy or performance. If they're biased, it’ll naturally affect how
they tag data and, ultimately, how the ML model functions.
2. Processes: The data labeling process must have tight quality control
checks in place. This will significantly increase the quality of training
data.
3. Tools: Incompatible or outdated tools can make data quality suffer.
Using robust data labeling software can reduce the cost and time
associated with the process.
1) Get the Dataset
2) Importing Libraries
import numpy as nm : Here we have used nm, which is a short name for
NumPy, and it will be used throughout the program.
import matplotlib.pyplot as mpt : Here we have used mpt as a short name
for this library.
Pandas: The last library is the Pandas library, which is one of the most
widely used Python libraries for importing and managing datasets.
It is an open-source data manipulation and analysis library. It will be
imported as below:
import pandas as pd : Here, we have used pd as a short name for this
library. Consider the below image:
Now we need to import the datasets which we have collected for our
machine learning project. But before importing a dataset, we need to set the
current directory as a working directory. To set a working directory in
Spyder IDE, we need to follow the below steps:
Note: We can set any directory as a working directory, but it must contain
the required dataset.
Here, in the below image, we can see the Python file along with the required
dataset. Now, the current folder is set as the working directory.
read_csv() function:
Now, to import the dataset, we will use the read_csv() function of the pandas
library, which reads a CSV file into a DataFrame on which various operations
can then be performed. Using this function, we can read a CSV file locally
as well as through a URL.
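As a minimal sketch of the read_csv() call described above: the CSV content, the file name "Data.csv", and the column values below are hypothetical. The text is read from an in-memory buffer so that the example runs without a file on disk; with a real file, the path or URL would be passed instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file such as "Data.csv".
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,Yes
"""

# read_csv() accepts a local path, a URL, or a file-like object.
# With a real file this would be: data_set = pd.read_csv("Data.csv")
data_set = pd.read_csv(io.StringIO(csv_text))

# Inspect the first rows and the column names of the loaded DataFrame.
print(data_set.head())
print(list(data_set.columns))
```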
The next step of data preprocessing is to handle missing data in the datasets.
If our dataset contains some missing data, then it may create a huge problem
for our machine learning model. Hence it is necessary to handle missing
values present in the dataset.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal
with null values. Here, we simply delete the specific row or column
that contains null values. But this way is not very effective, because
removing data may lead to a loss of information and therefore to less
accurate output.
By calculating the mean: In this way, we calculate the mean of the
column or row that contains a missing value and put it in the place
of the missing value. This strategy is useful for features that have numeric
data, such as age, salary, year, etc. Here, we will use this approach.
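The two ways above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the small DataFrame and its values are hypothetical, and pandas' dropna() and fillna() are used here as one possible way to implement the two strategies, not necessarily the tool used elsewhere in these notes.

```python
import numpy as nm  # alias used in these notes; np is the more common convention
import pandas as pd

# Hypothetical dataset with one missing Age and one missing Salary.
data_set = pd.DataFrame({
    "Country": ["India", "France", "Germany", "France"],
    "Age": [38.0, nm.nan, 30.0, 48.0],
    "Salary": [68000.0, 45000.0, nm.nan, 65000.0],
})

# Way 1: delete every row that contains a null value.
dropped = data_set.dropna()

# Way 2: replace each missing numeric value with the mean of its column.
numeric_cols = ["Age", "Salary"]
filled = data_set.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(len(data_set), "rows before,", len(dropped), "rows after deletion")
print(filled)
```

Deleting rows leaves only the two complete records, while mean imputation keeps all four rows, which illustrates the loss of information mentioned above.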
Categorical data is data that falls into categories. In our dataset, for
example, there are two categorical variables: Country and Purchased.
Dummy Variables:
Dummy variables are variables that take the value 0 or 1. A value of 1
indicates the presence of a category in a particular column, and the rest of
the variables are 0. With dummy encoding, the number of columns is equal
to the number of categories.
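A minimal sketch of dummy encoding for the Country variable, assuming pandas' get_dummies() as the encoder (the notes may use a different tool); the country values are hypothetical.

```python
import pandas as pd

# Hypothetical Country column with three distinct categories.
data_set = pd.DataFrame({"Country": ["India", "France", "Germany", "France"]})

# One 0/1 column is created per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(data_set["Country"], dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column of its own country, and the number of columns equals the number of distinct categories.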
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training
set and a test set. This is one of the crucial steps of data preprocessing,
because it lets us check how well the model performs on data it has not seen.
A model may be trained very well and reach very high training accuracy, yet
perform worse when we provide a new dataset to it. So we always try to build
a machine learning model that performs well on the training set and also on
the test set. Here, we can define these datasets as:
Training set: A subset of the dataset used to train the machine learning
model; its output is already known.
Test set: A subset of the dataset used to test the machine learning model;
the model predicts the output for the test set.
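The split described above can be sketched by hand with NumPy. The helper split_train_test, the 80/20 ratio, the seed, and the sample arrays are assumptions made for illustration; they are not taken from these notes.

```python
import numpy as nm  # alias used in these notes; np is the more common convention


def split_train_test(x, y, test_fraction=0.2, seed=0):
    """Shuffle the row indices, then cut them into a test part and a training part."""
    rng = nm.random.default_rng(seed)
    indices = rng.permutation(len(x))
    n_test = int(round(len(x) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]


# Hypothetical data: 10 samples with 2 features each, and 10 known outputs.
x = nm.arange(20).reshape(10, 2)
y = nm.arange(10)

x_train, x_test, y_train, y_test = split_train_test(x, y, test_fraction=0.2)
print(len(x_train), "training samples,", len(x_test), "test samples")
```

Shuffling before cutting keeps the two subsets from depending on the original row order. In practice, a library helper such as scikit-learn's train_test_split is commonly used for the same purpose.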
7) Feature Scaling
Standardization
Normalization
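As commonly defined, standardization rescales a feature to zero mean and unit standard deviation, x' = (x - mean) / standard deviation, and normalization (min-max scaling) maps a feature to the range 0 to 1, x' = (x - min) / (max - min). The following is a minimal sketch assuming these standard definitions; the function names and the salary values are hypothetical.

```python
import numpy as nm  # alias used in these notes; np is the more common convention


def standardize(column):
    """Standardization: subtract the mean, divide by the standard deviation."""
    return (column - column.mean()) / column.std()


def normalize(column):
    """Normalization (min-max scaling): map the values to the range [0, 1]."""
    return (column - column.min()) / (column.max() - column.min())


# Hypothetical salary feature with values on a large scale.
salary = nm.array([45000.0, 54000.0, 65000.0, 68000.0])

salary_std = standardize(salary)
salary_norm = normalize(salary)

print(salary_std)
print(salary_norm)
```

Both functions assume the feature is not constant; a constant column would lead to a division by zero.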