machine learning unit 1 ppt

The document provides an overview of machine learning, including its life cycle, types, applications, and historical context. It explains the differences between artificial intelligence, machine learning, deep learning, and data science, and outlines the steps involved in the machine learning life cycle from data gathering to deployment. Additionally, it highlights various real-world applications of machine learning, such as image recognition, speech recognition, and self-driving cars.


CONTENTS of UNIT-I

• Introduction to Machine Learning
• Machine Learning Life Cycle
• Types of Machine Learning Systems (Supervised and Unsupervised Learning, Batch and Online Learning, Instance-based and Model-based Learning)
• Scope and Limitations
• Challenges of Machine Learning
• Data Visualization
• Hypothesis Function and Testing
• Data Pre-processing
• Data Augmentation
• Normalizing Data Sets
• Bias-Variance Trade-off
• Relation between AI (Artificial Intelligence), ML (Machine Learning), DL (Deep Learning) and DS (Data Science)
AI vs ML vs Deep Learning vs DS

Data science (DS) is a discipline that uses various technologies and methods to analyze data. It combines computational science, statistics, mathematics, and business knowledge. The main goal of data science is to find patterns in data: it uses various statistical techniques to analyze and extract information from the data. Data scientists can help companies make smarter business decisions through these valuable insights.

• AI is a technique which enables a machine to mimic human behavior.
• ML is a subset of AI which uses statistical methods to enable machines to improve with experience.
• DL is a subset of ML which makes the computation of multilayer neural networks feasible.
AI vs ML vs DL:
• Origin: the term AI was coined in 1956 by John McCarthy; ML in 1959 by Arthur Samuel; DL around 2000 by Igor Aizenberg.
• Definition: AI is a technique which enables machines to mimic human behavior; ML is a subset of AI which uses statistical methods to enable machines to improve with experience; DL deals with algorithms inspired by the structure and function of the human brain.
• Background: the concept of AI is quite old, but it has gained popularity only recently. Earlier we had very small amounts of data, so accurate predictions were not possible; now there is a tremendous increase in the amount of data, together with more advanced algorithms and high-end computing power and storage that can deal with it. In ML, the algorithms and programs are designed so that they can learn and improve over time when exposed to new data.
• Training time: ML models typically require less training time; DL models require comparatively more.
• Testing time: testing time is more in ML; testing time is less in DL.
• Data dependency: ML gives better results for small amounts of data; DL gives better results for large amounts of data.
• Hardware: ML algorithms can work on low-end machines as well; DL algorithms are heavily dependent on high-end machines because GPUs are required for the large number of matrix multiplication operations.
Example of ML: suppose we want to create a system that tells us the expected weight of a person based on height.
Steps:
1. First we collect the data. Each point on the graph represents one data point. To start with, we can draw a single line to predict the weight based on the height, for example W = H − 100, where W is weight in kg and H is height in centimeters.
2. This line helps us make predictions. Our main goal is to reduce the difference between the estimated value and the actual value.
3. To achieve this, we try to draw a straight line that fits through all these different points and minimizes the error. Decreasing the error, i.e. the difference between the actual and estimated values, improves the performance of the model.
4. Further, the more data points we collect, the better our model becomes. We can also improve the model by adding more variables and creating different prediction lines for them.
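The line-fitting idea above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the heights and weights are invented sample data, and the closed-form least-squares formulas are used to learn the line instead of the fixed rule W = H − 100.

```python
# Fit a straight line W = a*H + b to (height, weight) pairs by least squares.
heights = [150, 160, 170, 180, 190]   # cm (hypothetical data)
weights = [52, 58, 67, 78, 88]        # kg (hypothetical data)

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Slope and intercept that minimize the squared error.
a = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights)) \
    / sum((h - mean_h) ** 2 for h in heights)
b = mean_w - a * mean_h

def predict(h):
    return a * h + b

# Mean squared error of the fitted line vs the simple rule W = H - 100.
mse_fit = sum((predict(h) - w) ** 2 for h, w in zip(heights, weights)) / n
mse_rule = sum(((h - 100) - w) ** 2 for h, w in zip(heights, weights)) / n
print(f"fitted line: W = {a:.2f}*H + {b:.2f}")
print(f"MSE fitted: {mse_fit:.2f}, MSE rule: {mse_rule:.2f}")
```

On this toy data the fitted line achieves a lower error than the hand-picked rule, which is exactly the "minimize the difference" goal described in steps 2 and 3.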
History of Machine Learning
• The term machine learning was coined in 1959 by Arthur Samuel, an IBM employee and a pioneer in the fields of computer gaming and artificial intelligence. Machine learning is a field of study that gives computers the capability to learn without being explicitly programmed.

• By the early 1960s, an experimental "learning machine" with punched-tape memory, called Cybertron, had been developed by the Raytheon Company to analyze sonar signals, electrocardiograms, and speech patterns using rudimentary reinforcement learning.

• Tom Michael Mitchell (born August 9, 1951) is an American computer scientist and a former chair of the Machine Learning Department at CMU. Mitchell is known for his contributions to the advancement of machine learning, artificial intelligence, and cognitive neuroscience, and is the author of the textbook Machine Learning. According to him, "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." More simply, we can define machine learning as "programming computers to optimize a performance criterion using example data or past experience."

• Modern-day machine learning has two objectives: one is to classify data based on models which have been developed; the other is to make predictions for future outcomes based on these models.
What is ML?
• ML is a method, or set of algorithms, that one can use to replicate activities that typically require human cognition.
• ML is a type of AI that enables machines to learn from data and deliver predictive models. ML does not depend on explicit programming but on the data fed into it, which is a complicated process. The output is delivered based on the data we feed into the ML algorithm and the training given to it.
• It is based on the idea that we should be able to give machines access to data and let them learn from it themselves. It deals with the extraction of patterns from a data set; this means that machines can not only find the rules for optimal behavior but can also adapt to changes in the world.
• ML is at the core of many futuristic technological advancements in our world. Today we can see many applications and implementations of ML around us, such as Tesla's self-driving cars, Apple's Siri, the Sophia AI robot, and many more.
ML Application-
Most Trending Real-world Applications Of Machine Learning:
1. Image Recognition: Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend-tagging suggestions: whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names. The technology behind this is machine learning's face detection and recognition algorithm, based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in the picture.
2. Speech Recognition: While using Google, we get an option to "Search by voice"; this comes under speech recognition, a popular application of machine learning. Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic Prediction: If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions. It predicts whether traffic is clear, slow-moving, or heavily congested with the help of two inputs: the real-time location of vehicles from the Google Maps app and sensors, and the average time taken on past days at the same time. Everyone who uses Google Maps is helping to make the app better: it takes information from users and sends it back to its database to improve performance.
4. Product Recommendations: Machine learning is widely used by various e-commerce and entertainment companies, such as Amazon and Netflix, for product recommendations. Whenever we search for a product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning. Google understands user interest using various machine learning algorithms and suggests products per customer interest. Similarly, when we use Netflix, we find recommendations for series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving Cars: One of the most exciting applications of machine learning is self-driving cars, in which machine learning plays a significant role. Tesla, a well-known car manufacturer, is working on self-driving cars, using unsupervised learning methods to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering: Whenever we receive a new email, it is automatically filtered as important, normal, or spam. We always receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Machine learning algorithms such as the Multi-Layer Perceptron, decision trees, and the Naïve Bayes classifier are used for email spam filtering and malware detection.
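To make the spam-filtering idea concrete, here is a minimal word-count Naïve Bayes classifier in pure Python. This is only an illustrative sketch: the four-message training corpus is invented, and a real filter would train a library implementation on a large labelled dataset.

```python
import math
from collections import Counter

# Hypothetical training data: (message, label) pairs.
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

# Count words per class and messages per class.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    """Pick the class maximizing log P(class) + sum of log P(word|class)."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.split():
            # Laplace smoothing so unseen words don't zero out the product.
            p = (word_counts[label][word] + 1) / (total + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("free money"))
print(classify("noon meeting"))
```

Messages built from spam-like words score higher under the spam class, and vice versa; Laplace smoothing keeps a single unseen word from vetoing a class outright.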
7. Virtual Personal Assistant: We have various virtual personal assistants, such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just through voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc. These virtual assistants rely on machine learning algorithms: they record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection: Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent. For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Each genuine transaction follows a specific pattern, which changes for a fraudulent transaction; the network detects this and makes our online transactions more secure.
9. Stock Market Trading: Machine learning is widely used in stock market trading. In the stock market there is always a risk of shares going up and down, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.
10. Medical Diagnosis: In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumors and other brain-related diseases easily.
11. Automatic Language Translation: Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all; for this too, machine learning helps us, by converting text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature: it is a neural machine translation system that translates text into our familiar language, which is called automatic translation. The technology behind automatic translation is a sequence-to-sequence learning algorithm, used together with image recognition to translate text from one language to another.
ML life cycle
Machine learning has given computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning life cycle. The machine learning life cycle is a cyclic process to build an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.
Machine learning life cycle involves seven major steps, which are given below:
• Gathering Data
• Data preparation
• Data Wrangling
• Analyse Data
• Train the model
• Test the model
• Deployment

1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the different data sources and obtain the data, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle: the quantity and quality of the collected data determine the efficiency of the output. The more data we have, the more accurate the prediction will be.
This step includes the tasks below:
• Identify various data sources
• Collect data
• Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in further steps.
2. Data Preparation: After collecting the data, we need to prepare it for further steps. Data preparation is a step where we put our data into a suitable place and prepare it for use in machine learning training. In this step, we first put all the data together and then randomize its ordering. This step can be further divided into two processes:
• Data exploration: used to understand the nature of the data we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to an effective outcome. Here we look for correlations, general trends, and outliers.
• Data pre-processing: the next step is preprocessing the data for its analysis.
3. Data Wrangling: Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process. Cleaning the data is required to address quality issues: the data we collect is not always useful, as some of it may be irrelevant. In real-world applications, collected data may have various issues, including:
• Missing values
• Duplicate data
• Invalid data
• Noise
So we use various filtering techniques to clean the data. It is mandatory to detect and remove these issues because they can negatively affect the quality of the outcome.
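As a small illustration of the wrangling issues listed above, here is a sketch in plain Python (the records are invented; real pipelines would typically use a dataframe library) that removes duplicates, drops invalid rows, and fills missing values with a column mean:

```python
# Hypothetical raw records with a duplicate, a missing value, and invalid data.
raw = [
    {"height": 170, "weight": 68},
    {"height": 170, "weight": 68},      # duplicate row
    {"height": 160, "weight": None},    # missing value
    {"height": -5,  "weight": 40},      # invalid (negative height)
    {"height": 180, "weight": 80},
]

# 1. Remove exact duplicates while preserving order.
seen, deduped = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# 2. Drop rows with invalid values.
valid = [r for r in deduped if r["height"] > 0]

# 3. Fill missing weights with the mean of the observed weights.
observed = [r["weight"] for r in valid if r["weight"] is not None]
mean_w = sum(observed) / len(observed)
cleaned = [dict(r, weight=r["weight"] if r["weight"] is not None else mean_w)
           for r in valid]

print(cleaned)
```

Each step maps onto one of the quality issues above: duplicates, invalid data, and missing values; noise would additionally call for smoothing or outlier filtering.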
4. Data Analysis: The cleaned and prepared data is now passed on to the analysis step. This step involves:
• Selection of analytical techniques
• Building models
• Reviewing the results
The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and review the outcome. It starts with determining the type of problem, where we select a machine learning technique such as classification, regression, cluster analysis, or association; we then build the model using the prepared data and evaluate it.

5. Train Model: The next step is to train the model. In this step we train our model to improve its performance and get a better outcome for the problem. We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features.

6. Test Model: Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of our model by providing a test dataset to it. Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.

7. Deployment: The last step of the machine learning life cycle is deployment, where we deploy the model into a real-world system. If the model produces accurate results as per our requirements, with acceptable speed, then we deploy it in the real system. But before deploying the project, we check whether it keeps improving its performance using the available data. The deployment phase is similar to making the final report for a project.
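The seven steps above can be compressed into a toy end-to-end sketch. Everything here is illustrative: the (height, weight) data are invented, the "model" is a simple least-squares line, and testing is done with mean absolute error on a held-out split.

```python
import random

# 1-2. Gather and prepare: collect (height, weight) pairs and shuffle them.
random.seed(0)
data = [(150, 51), (155, 56), (160, 60), (165, 65), (170, 69),
        (175, 74), (180, 78), (185, 83), (190, 87), (195, 92)]
random.shuffle(data)

# 3. Wrangle: (already clean here) then split into train/test sets.
train, test = data[:8], data[8:]

# 4-5. Analyse and train: fit W = a*H + b by least squares on the train set.
n = len(train)
mh = sum(h for h, _ in train) / n
mw = sum(w for _, w in train) / n
a = (sum((h - mh) * (w - mw) for h, w in train)
     / sum((h - mh) ** 2 for h, _ in train))
b = mw - a * mh

# 6. Test: mean absolute error on the held-out set.
mae = sum(abs((a * h + b) - w) for h, w in test) / len(test)
print(f"model: W = {a:.2f}*H + {b:.2f}, test MAE = {mae:.2f} kg")

# 7. Deploy: if the error is acceptable, use the model for new inputs.
if mae < 2.0:
    print("deploying; predicted weight at 172 cm:", round(a * 172 + b, 1))
```

The shuffle-then-split corresponds to the "randomize the ordering" advice in the data preparation step, and the error threshold before the final print mirrors the "accurate result with acceptable speed" check before deployment.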
Types of ML Systems:
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Batch learning
• Online learning
• Instance-based learning
• Model-based learning
• Supervised learning: Supervised learning is a type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means input data that is already tagged with the correct output. In supervised learning, the training data provided to the machines works as a supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher. Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function from the input variable (x) to the output variable (y).
Example: Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:
• If the given shape has four sides, and all the sides are equal, it will be labelled as a square.
• If the given shape has three sides, it will be labelled as a triangle.
• If the given shape has six equal sides, it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, so when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
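The shape example can be sketched as a tiny classifier. This toy code (not from the slides) hand-codes the labelling rules that a trained model would have to learn from the labelled examples, and then applies them to "test set" shapes it has not seen:

```python
def classify_shape(sides):
    """Classify a shape from its list of side lengths, following the
    labelling rules used during training."""
    n = len(sides)
    if n == 4 and len(set(sides)) == 1:
        return "square"
    if n == 3:
        return "triangle"
    if n == 6 and len(set(sides)) == 1:
        return "hexagon"
    return "polygon"

# "Test set": new shapes the model has not seen before.
print(classify_shape([2, 2, 2, 2]))        # four equal sides
print(classify_shape([3, 4, 5]))           # three sides
print(classify_shape([1, 1, 1, 1, 1, 1]))  # six equal sides
```

The point of supervised learning is that these if-rules are not written by hand but inferred from (input, label) pairs; the sketch only shows what the learned mapping function ends up computing.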
Steps Involved in Supervised Learning:
• First, determine the type of training dataset.
• Collect/gather the labelled training data.
• Split the dataset into a training set, a test set, and a validation set.
• Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
• Choose a suitable algorithm for the model, such as a support vector machine or a decision tree.
• Execute the algorithm on the training dataset. Sometimes we need validation sets as control parameters; these are subsets of the training dataset.
Types of supervised machine learning algorithms: Supervised learning can be further divided into two types of problems:
Regression: Regression algorithms are used when there is a relationship between the input variables and a continuous output variable. They are used for the prediction of continuous values, as in weather forecasting, market trends, etc. Below are some popular regression algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression

Classification: Classification algorithms are used when the output variable is categorical, for example two classes such as Yes-No, Male-Female, or True-False. Popular classification algorithms include:
• Random Forest
• Decision Trees
• Logistic Regression
• Support vector Machines
Unsupervised learning: As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the model itself finds hidden patterns and insights in the given data. It can be compared to the learning which takes place in the human brain while learning new things. It can be defined as: "Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision."
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it has no idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. It will perform this task by clustering the image dataset into groups according to the similarities between images.
Types of Unsupervised Learning Algorithm: The unsupervised learning algorithm can be further categorized into
two types of problems:
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between data objects and categorizes them by the presence or absence of those commonalities.
Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategy more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is market basket analysis.
Unsupervised Learning algorithms:
• K-means clustering
• KNN (k-nearest neighbors)
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
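Clustering can be illustrated with a bare-bones one-dimensional k-means implementation. This is an instructional sketch with invented data points, not a production algorithm (real code would use a library implementation and handle empty clusters and convergence checks more carefully):

```python
def kmeans_1d(points, centers, iters=10):
    """Lloyd's algorithm in one dimension: assign each point to its
    nearest center, then move each center to the mean of its points."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(pts) / len(pts) if pts else c
                   for c, pts in clusters.items()]
    return sorted(centers)

# Two obvious groups, around 2 and around 10.
data = [1, 2, 3, 9, 10, 11]
print(kmeans_1d(data, centers=[0.0, 5.0]))
```

Even from poorly chosen starting centers, the assign-then-average loop pulls the centers toward the two natural groups, which is exactly the "grouping by similarity" idea described above.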

Batch learning: In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This generally takes a lot of time and computing resources, so it is typically done offline. First the system is trained, then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning. Training on the full set of data requires a lot of computing resources (CPU, memory, disk space, disk I/O, network I/O, etc.). If you have a lot of data and you automate your systems to train from scratch every day, it will end up costing you a lot of money. If the amount of data is huge, it may even be impossible to use a batch learning algorithm.
Online learning: In online learning, we train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is cheap and fast, so the system can learn about new data on the fly. Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and "replay" the data). This can save a huge amount of space.
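Online learning can be sketched as incremental gradient descent: the model updates its parameters after each incoming instance and can then discard it. The streaming data below are simulated (samples from the made-up rule y = 2x + 1), so this is only an illustration of the update loop, not a real data feed.

```python
# Learn y = 2*x + 1 from a stream, one instance at a time.
w, b = 0.0, 0.0
lr = 0.05  # learning rate

stream = [(x, 2 * x + 1) for x in range(1, 6)] * 200  # simulated data flow

for x, y in stream:
    pred = w * x + b
    err = pred - y
    # One cheap gradient step per instance; the instance is then discarded.
    w -= lr * err * x
    b -= lr * err

print(f"learned w={w:.2f}, b={b:.2f}")  # should approach w=2, b=1
```

Each step touches a single instance, which is why the method works on a continuous flow with limited memory; a batch learner would instead need the entire stream in hand before fitting once.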
Instance-based learning: Instance-based learning is a supervised classification approach that performs its operation by comparing the current instance with previously seen training instances, which have been stored in memory. Its name derives from the fact that it builds its hypotheses directly from the training instances.
• Learning in these algorithms consists simply of storing the presented training data. When a new query instance is encountered, a set of similar related instances is retrieved from memory and used to classify the new query instance.
• Instance-based approaches can construct a different approximation to the target function for each distinct query instance that must be classified.
• The time complexity of an instance-based learning algorithm depends on the size of the training data. Its worst-case time complexity is O(n), where n is the number of training items used to classify a single new instance.
• Instance-based learning methods such as nearest neighbor and locally weighted regression are conceptually straightforward approaches to approximating real-valued or discrete-valued target functions.
Advantages of instance-based learning:
• Training is very fast.
• It can learn complex target functions.
• It does not lose information.
Disadvantages of instance-based learning:
• The cost of classifying new instances can be high.
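A nearest-neighbor classifier is the canonical instance-based learner: "training" merely stores the examples, and all the work happens at query time with a full O(n) scan. A minimal sketch with invented 2-D points:

```python
def nn_classify(train, query):
    """1-nearest-neighbor: scan all stored instances (O(n)) and return
    the label of the closest one. 'Training' was just storing `train`."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = min(train, key=lambda item: dist2(item[0], query))
    return nearest[1]

# Stored training instances: ((feature1, feature2), label).
train = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(nn_classify(train, (1.5, 1.2)))  # near the "A" cluster
print(nn_classify(train, (8.5, 8.5)))  # near the "B" cluster
```

This makes both bullets above concrete: training is instant (store the list), no information is lost (every example is kept), but every classification pays the cost of scanning the whole memory.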

Model-based learning:
In machine learning there is a multitude of different algorithms for solving a broad range of problems. To tackle a new application, one typically tries to map the problem onto one of these existing methods, often influenced by familiarity with specific algorithms and by the availability of corresponding software implementations. In this traditional machine learning approach, it is sometimes difficult to adapt a standard algorithm to match the particular requirements of a specific application.
Model-based learning, however, is an alternative methodology for applying machine learning in which a bespoke solution is formulated for each new application. Typically, model-based machine learning is implemented using a model-specification language in which the model can be defined in compact code, from which the software implementing that model can be generated automatically.
Challenges of machine learning
1. Inadequate training data: A major issue that arises while using machine learning algorithms is the lack of both quality and quantity of data. Data quality is essential for the algorithms to work well, yet poor data quality is common in machine learning applications. Data quality can be affected by factors such as:
• Noisy data: responsible for inaccurate predictions that affect decisions as well as accuracy in classification tasks.
• Incorrect data: responsible for faulty programming and faulty results obtained from machine learning models; incorrect data can therefore also affect the accuracy of the results.
• Generalizing of output data: sometimes generalizing output data becomes complex, which results in comparatively poor future actions.
2. Poor quality of data: Data plays a significant role in machine learning, and it must be of good quality. Noisy, incomplete, inaccurate, and unclean data lead to lower accuracy in classification and low-quality results. Hence, data quality can be considered a major common problem when applying machine learning algorithms.
3. Non-representative training data: We have to ensure that the sample training data is representative of the new cases to which we need to generalize. The training data must cover the cases that have already occurred as well as those that are occurring. If we use non-representative training data in the model, it results in less accurate predictions. A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model, called a non-representative training set; it will be biased against one class or group and inaccurate in its predictions. Hence, we should use representative data in training to protect against bias and make accurate predictions without drift.
4. Overfitting and underfitting: Overfitting is one of the most common issues faced by machine learning engineers and data scientists. Whenever a machine learning model is trained with a huge amount of data, it can start capturing the noise and inaccuracies in the training data set, which negatively affects the performance of the model. One cause of overfitting is the use of highly flexible non-linear methods, which can build unrealistic data models; overfitting can sometimes be reduced by using linear and parametric algorithms instead.
Methods to reduce overfitting:
• Increase the training data in the dataset.
• Reduce model complexity by choosing a simpler model with fewer parameters.
• Ridge regularization and Lasso regularization.
• Early stopping during the training phase.
• Reduce the noise.
• Reduce the number of attributes in the training data.
• Constrain the model.

• Underfitting is just the opposite of overfitting. Whenever a machine learning model is trained with too little data, it produces incomplete and inaccurate results and destroys the accuracy of the machine learning model. Underfitting occurs when the model is too simple to capture the underlying structure of the data, just like an undersized pair of pants. This generally happens when we have limited data in the dataset and try to build a linear model on non-linear data. In such scenarios, the model cannot represent the complexity of the data, its rules are too simple to apply to the dataset, and it starts making wrong predictions.
Methods to reduce underfitting:
• Increase model complexity.
• Remove noise from the data.
• Train on more and better features.
• Reduce the constraints.
• Increase the number of epochs to get better results.
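To see the two failure modes numerically, here is an illustrative sketch (the data are invented, not from the slides) comparing a model that memorizes the training set with one that ignores the inputs entirely. The memorizer shows the signature of overfitting: zero training error but a clear gap to its test error. The constant model underfits: similar error on both sets, but high.

```python
import random
random.seed(1)

# Noisy samples from the rule y = 2*x.
def sample(n):
    return [(x, 2 * x + random.uniform(-1, 1)) for x in range(n)]

train, test = sample(10), sample(10)

def mse(pairs, predict):
    return sum((predict(x) - y) ** 2 for x, y in pairs) / len(pairs)

# Overfit model: memorize the training set (1-nearest neighbor on x).
def memorizer(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Underfit model: always predict the training mean, ignoring x.
mean_y = sum(y for _, y in train) / len(train)
def constant(x):
    return mean_y

print("memorizer train/test MSE:", mse(train, memorizer), mse(test, memorizer))
print("constant  train/test MSE:", mse(train, constant), mse(test, constant))
```

The train-vs-test gap is the practical diagnostic: a large gap points to overfitting (apply the methods above to simplify or regularize), while uniformly high error points to underfitting (apply the methods below to add capacity or features).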
5. Monitoring and maintenance
6. Lack of skilled resources
7. Customer Segmentation
8. Data bias: Data bias is also a big challenge in machine learning. These errors occur when certain elements of the dataset are heavily weighted or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this error by determining where the data is actually biased in the dataset, and then taking the necessary steps to reduce the bias.
Methods to remove Data Bias:
• Research more for customer segmentation.
• Be aware of your general use cases and potential outliers.
• Combine inputs from multiple sources to ensure data diversity.
• Include bias testing in the development process.
• Analyze data regularly and keep tracking errors to resolve them easily.
• Review the collected and annotated data.
• Use multi-pass annotation such as sentiment analysis, content moderation, and intent recognition.
What is Data Visualization
• Data visualization is the graphical representation of quantitative information and data using visual elements such as graphs, charts, and
maps. It converts large and small data sets into visuals that are easy for humans to understand and process.
• Data visualization tools provide accessible ways to spot outliers, patterns, and trends in the data. In the world of Big Data, such
tools and technologies are required to analyze vast amounts of information. Data visualizations are common in
everyday life, most often in the form of graphs and charts; a combination of multiple visualizations and bits of
information is referred to as an infographic.
• Today's data visualization tools go beyond the charts and graphs used in a Microsoft Excel spreadsheet, displaying data in
more sophisticated ways such as dials and gauges, geographic maps, heat maps, pie charts, and fever charts.
• Data visualization is important because of how the human brain processes information: using graphs and charts to visualize large,
complex data sets is far easier than studying spreadsheets and reports.
• Data visualization is an easy and quick way to convey concepts universally, and you can experiment with different views by making
slight adjustments.

Why Use Data Visualization?


• To make data easier to understand and remember.
• To discover unknown facts, outliers, and trends.
• To visualize relationships and patterns quickly.
• To ask better questions and make better decisions.
• To analyze competitors.
• To improve insights.
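Full-featured charts usually come from a plotting library such as Matplotlib, but the core idea can be sketched with a dependency-free text bar chart (the category labels and counts below are made up):

```python
# A minimal text-based bar chart: one '#' per unit of the value.
# Real projects would use a plotting library (e.g. Matplotlib) instead.

def bar_chart(data):
    """Return a list of strings, one bar per (label, value) pair."""
    width = max(len(label) for label, _ in data)
    return [f"{label.ljust(width)} | {'#' * value}" for label, value in data]

sales = [("Jan", 3), ("Feb", 7), ("Mar", 5)]   # hypothetical monthly counts
for line in bar_chart(sales):
    print(line)
```

This prints:

Jan | ###
Feb | #######
Mar | #####

Even in this crude form, the outlier month is visible at a glance, which is the point of visualization.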
Hypothesis Function & Testing
• Data scientists and ML professionals conduct experiments to solve a problem. These ML professionals and data scientists
make an initial assumption for the solution of the problem. This assumption in Machine learning is known as Hypothesis.
“The hypothesis is defined as the supposition or proposed explanation based on insufficient evidence or assumptions”.
• The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is specifically used in Supervised
Machine learning, where an ML model learns a function that best maps the input to corresponding outputs with the help of an
available dataset.
• Hypothesis space (H): Hypothesis space is defined as the set of all possible legal hypotheses; hence it is also known as the
hypothesis set. It is used by supervised machine learning algorithms to determine the best possible hypothesis to describe the
target function, i.e., the one that best maps inputs to outputs.
• Hypothesis (h): A hypothesis is the approximate function that best describes the target in supervised machine learning algorithms.
It is based on the data as well as the bias and restrictions applied to the data.
Hence hypothesis (h) can be concluded as a single hypothesis that maps input to proper output and can be evaluated as well as
used to make predictions.
The hypothesis (h) can be formulated in machine learning as follows:
Y = mx + b
Where, Y: range (the predicted output)
m: slope of the line, i.e., the change in y divided by the change in x
x: domain (the input), b: intercept (constant)
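For a concrete (made-up) example, the slope m and intercept b of such a linear hypothesis can be estimated from data with ordinary least squares:

```python
# Fit the hypothesis Y = m*x + b to data by ordinary least squares:
#   m = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2),   b = ȳ - m*x̄

def fit_line(xs, ys):
    """Return (m, b) minimizing the squared error of y ≈ m*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - m * mx
    return m, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # exactly y = 2x + 1
m, b = fit_line(xs, ys)
print(m, b)                       # → 2.0 1.0
```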
Data preprocessing in ML
• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning
model. It is the first and crucial step in creating a machine learning model.
• Need for data preprocessing: Real-world data generally contains noise and missing values, and may be in an
unusable format that cannot be fed directly to a machine learning model, so it is mandatory to clean it
and put it in a consistent format. This also increases the accuracy and efficiency of the model.
• It involves following steps:
 Getting the dataset
 Importing libraries
 Importing datasets
 Finding Missing Data
 Encoding Categorical Data
 Splitting dataset into training and test set
 Feature scaling
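The steps above can be sketched end to end on a toy, made-up dataset; real projects would typically use pandas and scikit-learn for this:

```python
import random

# Toy dataset: (age, city, label); None marks a missing value.
rows = [(25, "delhi", 0), (None, "mumbai", 1), (35, "delhi", 0), (45, "mumbai", 1)]

# 1. Handle missing data: replace missing ages with the column mean.
ages = [a for a, _, _ in rows if a is not None]
mean_age = sum(ages) / len(ages)
rows = [(a if a is not None else mean_age, c, y) for a, c, y in rows]

# 2. Encode categorical data: map each city to a one-hot vector.
cities = sorted({c for _, c, _ in rows})
data = [([a] + [1 if c == city else 0 for city in cities], y) for a, c, y in rows]

# 3. Split into training and test sets (75/25), shuffling first.
random.seed(0)
random.shuffle(data)
cut = int(0.75 * len(data))
train, test = data[:cut], data[cut:]

# 4. Feature scaling: min-max scale the age column using training data only.
lo = min(x[0] for x, _ in train)
hi = max(x[0] for x, _ in train)
train = [([(x[0] - lo) / (hi - lo)] + x[1:], y) for x, y in train]
print(train)
```

Note that the scaling parameters are computed on the training split only, so no information from the test set leaks into preprocessing.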


Data augmentation

• Data augmentation is a set of techniques to artificially increase the amount of data by generating new
data points from existing data. This includes making small changes to data or using deep learning models
to generate new data points.
• Data augmentation is useful to improve the performance and outcomes of machine learning models
by forming new and different examples to train datasets. If the dataset in a machine learning model is
rich and sufficient, the model performs better and more accurately.
• For example in image classification and segmentation classic image processing activities for data augmentation are:
padding, random rotating, re-scaling, vertical and horizontal flipping, translation ( image is moved along X, Y direction),
cropping, zooming, darkening & brightening/color modification, gray scaling, changing contrast, adding noise, random
erasing.
• When Should You Use Data Augmentation?
To prevent models from overfitting.
The initial training set is too small.
To improve the model accuracy.
To reduce the operational cost of labeling and cleaning the raw dataset.
• Limitations of Data Augmentation
The biases in the original dataset persist in the augmented data.
Quality assurance for data augmentation is expensive.
Research and development are required to build a system with advanced applications. For example, generating high-resolution
images using GANs can be challenging.
Finding an effective data augmentation approach can be challenging.
Data Augmentation Techniques
In this section, we will learn about audio, text, image, and advanced data augmentation techniques.
Audio Data Augmentation
Noise injection: add gaussian or random noise to the audio dataset to improve the model performance.
Shifting: shift audio left (fast forward) or right with random seconds.
Changing the speed: stretch the time series by a fixed rate.
Changing the pitch: randomly change the pitch of the audio.
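With an audio clip represented as a plain list of samples, the first two techniques can be sketched in a few lines (the waveform and parameters below are made up; real pipelines use libraries such as librosa):

```python
import random

def add_noise(samples, scale=0.05, seed=0):
    """Noise injection: add small random noise to each sample."""
    rng = random.Random(seed)
    return [s + rng.uniform(-scale, scale) for s in samples]

def shift(samples, k):
    """Shifting: move the signal k samples to the right, zero-padding the front."""
    return [0.0] * k + samples[:len(samples) - k]

clip = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]   # a tiny made-up waveform
noisy = add_noise(clip)
shifted = shift(clip, 2)
print(shifted)                            # → [0.0, 0.0, 0.0, 0.5, 1.0, 0.5]
```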
• Text Data Augmentation
Word or sentence shuffling: randomly changing the position of a word or sentence.
Word replacement: replace words with synonyms.
Syntax-tree manipulation: paraphrase the sentence while keeping the same words.
Random word insertion: inserts words at random.
Random word deletion: deletes words at random.
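A couple of these text techniques can be sketched with the standard library alone (the sentence and the synonym table below are made up):

```python
import random

def delete_words(text, p=0.3, seed=1):
    """Random word deletion: drop each word with probability p (keep at least one)."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else words[0]

def replace_synonyms(text, table):
    """Word replacement: swap words for synonyms from a lookup table."""
    return " ".join(table.get(w, w) for w in text.split())

sentence = "the quick brown fox jumps over the lazy dog"
print(delete_words(sentence))
print(replace_synonyms(sentence, {"quick": "fast", "lazy": "idle"}))
```

Production systems usually draw synonyms from a lexical resource such as WordNet rather than a hand-written table.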
• Image Augmentation
Geometric transformations: randomly flip, crop, rotate, stretch, and zoom images. You need to be careful about applying multiple transformations on
the same images, as this can reduce model performance.
Color space transformations: randomly change RGB color channels, contrast, and brightness.
Kernel filters: randomly change the sharpness or blurring of the image.
Random erasing: delete some part of the initial image.
Mixing images: blending and mixing multiple images.
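Treating a grayscale image as a nested list of pixel values, the geometric transformations and random erasing can be sketched without any imaging library (the "image" below is made up):

```python
# Tiny grayscale "image" as rows of pixel values.
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

def hflip(image):
    """Horizontal flip: reverse each row."""
    return [row[::-1] for row in image]

def crop(image, top, left, h, w):
    """Crop an h-by-w window starting at (top, left)."""
    return [row[left:left + w] for row in image[top:top + h]]

def erase(image, top, left, h, w, fill=0):
    """Random erasing: overwrite a region with a fill value."""
    out = [row[:] for row in image]
    for r in range(top, top + h):
        for c in range(left, left + w):
            out[r][c] = fill
    return out

print(hflip(img))               # → [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
print(crop(img, 0, 1, 2, 2))    # → [[2, 3], [5, 6]]
```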
Data Augmentation Applications
Automatic Speech Recognition
In sound classification and speech recognition, data augmentation works wonders. It improves the model performance even on low-resource
languages. The random noise injection, shifting, and changing the pitch can help you produce state-of-the-art speech-to-text models. You can also
use GANs to generate realistic sounds for a particular application.

Healthcare
Acquiring and labeling medical imaging datasets is time-consuming and expensive. You also need a subject matter expert to validate the dataset
before performing data analysis. Using geometric and other transformations can help you train robust and accurate machine-learning models.
Natural Language Processing: Text data augmentation is generally used in situations with limited quality data, where improving
the performance metric takes priority. You can apply synonym augmentation, word embedding, character swap, and random
insertion and deletion. These techniques are also valuable for low-resource languages.
Normalizing Data Sets
• Normalization is one of the most frequently used data preparation techniques. It is a scaling technique in Machine
Learning applied during data preparation to bring the values of numeric columns in a dataset onto a common scale.
• Normalization is not mandatory for every dataset; it is needed only when the features of the machine learning model
have different ranges. Applying it helps enhance the performance and reliability of the model.
• Normalization techniques in Machine Learning
Min-Max Scaling: This technique is also simply referred to as scaling. As discussed above, Min-Max scaling shifts and
rescales the values of the attributes so that they end up ranging between 0 and 1.
Standardization Scaling: Standardization scaling is also known as Z-score normalization. Values are centered around the
mean with a unit standard deviation: the mean of the attribute becomes zero and the resulting distribution has a unit
standard deviation. Mathematically, we standardize a value by subtracting the mean from the feature value and dividing
by the standard deviation.
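A minimal pure-Python sketch of z-score standardization, X' = (X − μ) / σ, on a made-up feature column:

```python
# Z-score standardization: subtract the mean, divide by the standard deviation.

def standardize(values):
    """Return values rescaled to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

heights = [150.0, 160.0, 170.0, 180.0, 190.0]   # hypothetical feature column
z = standardize(heights)
print(z)
```

After standardization the column has mean 0 and unit standard deviation, regardless of its original range.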

• Mathematically, we can calculate normalization with the below formula:


Xn = (X - Xminimum) / ( Xmaximum - Xminimum)
Where, Xn = Value of Normalization
Xmaximum = Maximum value of a feature
Xminimum = Minimum value of a feature
Case 1 - If X is the minimum value, the numerator is 0, so the normalized value is also 0:
Xn = (X - Xminimum) / (Xmaximum - Xminimum)
Put X = Xminimum in the above formula:
Xn = (Xminimum - Xminimum) / (Xmaximum - Xminimum)
Xn = 0
Case 2 - If X is the maximum value, the numerator equals the denominator, so the normalized value is 1:
Xn = (X - Xminimum) / (Xmaximum - Xminimum)
Put X = Xmaximum in the above formula:
Xn = (Xmaximum - Xminimum) / (Xmaximum - Xminimum)
Xn = 1
Case 3 - If X is neither the maximum nor the minimum, the normalized value lies between 0 and 1.
Hence, Normalization can be defined as a scaling method in which values are shifted and rescaled so that they range
between 0 and 1; in other words, it is the Min-Max scaling technique.
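The three cases can be checked directly with a small min-max scaler on made-up numbers:

```python
# Min-Max normalization: Xn = (X - Xmin) / (Xmax - Xmin), mapping values into [0, 1].

def min_max(values):
    """Rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

marks = [40.0, 55.0, 70.0, 100.0]    # hypothetical feature column
scaled = min_max(marks)
print(scaled)                         # → [0.0, 0.25, 0.5, 1.0]
```

The minimum (40) maps to 0 (Case 1), the maximum (100) maps to 1 (Case 2), and the intermediate values fall strictly between 0 and 1 (Case 3).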

Bias & Variance
• Machine learning is a branch of Artificial Intelligence which allows machines to perform data analysis and make
predictions. However, if the machine learning model is not accurate, it can make prediction errors, and these
prediction errors are usually known as bias and variance. In machine learning, these errors will always be present,
as there is always a slight difference between the model's predictions and the actual values. The main aim of ML and
data science analysts is to reduce these errors in order to get more accurate results.
• Errors in Machine Learning - There are mainly two types of errors in machine learning:
1. Reducible errors: These errors can be reduced to improve the model accuracy. They can further be
classified into bias and variance.
2. Irreducible errors: These errors will always be present in the model.
What is Bias: In general, a machine learning model analyses the data, finds patterns in it, and makes
predictions. While training, the model learns these patterns in the dataset and applies them to test data for
prediction. When predicting, a difference occurs between the values predicted by the model and the
actual/expected values; this difference is known as bias error, or error due to bias. It can be defined
as the inability of a machine learning algorithm such as Linear Regression to capture the true relationship between
the data points.
• Low Bias: A low-bias model makes fewer assumptions about the form of the target function.
• High Bias: A high-bias model makes more assumptions and becomes unable to capture the
important features of our dataset. A high-bias model also cannot perform well on new data.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and
Support Vector Machines. Algorithms with high bias include Linear Regression, Linear Discriminant
Analysis and Logistic Regression.

Ways to reduce High Bias:


• Increase the input features as the model is under fitted.
• Use more complex models, such as including some polynomial features.
Variance Error
• Variance specifies the amount by which the prediction would change if different training data were used. In
simple words, variance tells how much a random variable differs from its expected value. Ideally, a
model should not vary too much from one training dataset to another, which means the algorithm should be
good at capturing the hidden mapping between input and output variables. Variance errors are either
low variance or high variance.
• Low variance means there is only a small variation in the prediction of the target function with changes in the training
data set, while high variance shows a large variation in the prediction of the target function with
changes in the training dataset.
• A model with high variance learns a lot and performs well on the training dataset, but does not
generalize well to unseen data. Because a high-variance model learns too much from the dataset, it leads
to overfitting. A model with high variance has the following problems:
 A high variance model leads to overfitting.
 Increase model complexities.
Ways to Reduce High Variance:
 Reduce the input features or number of parameters as a model is overfitted.
 Do not use an overly complex model.
 Increase the training data.
 Increase the Regularization term.
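The third remedy, increasing the training data, can be illustrated by refitting a slope on many resampled noisy datasets: with more points per dataset, the fitted slope varies less from sample to sample (the true slope of 2.0 and all parameters below are made up):

```python
import random

def fit_slope(pairs):
    """Least-squares slope through the origin for y ≈ w*x."""
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)

def slope_variance(n_points, n_trials=200, seed=0):
    """Variance of the fitted slope across many datasets of size n_points."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_trials):
        pairs = [(x, 2.0 * x + rng.gauss(0, 1))
                 for x in [rng.uniform(1, 10) for _ in range(n_points)]]
        slopes.append(fit_slope(pairs))
    mean = sum(slopes) / len(slopes)
    return sum((s - mean) ** 2 for s in slopes) / len(slopes)

print(slope_variance(5), slope_variance(100))   # variance shrinks with more data
```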
• There are four possible combinations of bias and variance: low bias/low variance (the ideal model), low bias/high variance (overfitting), high bias/low variance (underfitting), and high bias/high variance (the worst case).
Bias-Variance Trade-Off
• While building the machine learning model, it is really important to take care of bias and variance in order to
avoid overfitting and underfitting in the model. If the model is very simple with fewer parameters, it may have
low variance and high bias. Whereas, if the model has a large number of parameters, it will have high variance
and low bias. So, it is required to strike a balance between bias and variance errors, and this balance between
the bias error and variance error is known as the Bias-Variance trade-off.
• For an accurate prediction of the model, algorithms need a low variance and low bias. But this is not possible
because bias and variance are related to each other:
If we decrease the variance, it will increase the bias.
If we decrease the bias, it will increase the variance.
• Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and
variance errors.
Thank you
