Machine Learning Unit 1: A Machine Can Learn If It Can Gain More Data To Improve Its Performance
Introduction: Machine learning is a field of artificial intelligence that allows systems to learn and improve from experience without
being explicitly programmed. It has become an increasingly popular topic in recent years due to the many practical applications it
has in a variety of industries.
Machine learning is an application of artificial intelligence that uses statistical techniques to enable computers to learn and make
decisions without being explicitly programmed. It is predicated on the notion that computers can learn from data, spot patterns, and
make judgments with little assistance from humans.
In traditional programming, we would feed the input data and a well-written and tested program into a machine to generate output.
When it comes to machine learning, input data, along with the output, is fed into the machine during the learning phase, and it
works out a program for itself.
Machine learning allows computers to automatically learn from previous data. For building mathematical models and
making predictions based on historical data or information, machine learning employs a variety of algorithms. It is
currently being used for a variety of tasks, including speech recognition, email filtering, auto-tagging on Facebook,
recommender systems, and image recognition.
A subset of artificial intelligence known as machine learning focuses primarily on the creation of algorithms that enable
a computer to independently learn from data and previous experiences. Without being explicitly programmed, machine
learning enables a machine to automatically learn from data, improve performance from experiences, and predict
things.
Machine learning algorithms create a mathematical model that, without being explicitly programmed, aids in making
predictions or decisions with the assistance of sample historical data, or training data. For the purpose of developing
predictive models, machine learning brings together statistics and computer science. Algorithms that learn from
historical data are either constructed or utilized in machine learning. The performance will rise in proportion to the
quantity of information we provide.
A machine can learn if it can gain more data to improve its performance.
A machine learning system builds prediction models, learns from previous data, and predicts the output for new data
whenever it receives it. The more data the system has, the better the model it can build, and the more accurate the
predicted output becomes.
Let's say we have a complex problem in which we need to make predictions. Instead of writing code, we just need to
feed the data to generic algorithms, which build the logic based on the data and predict the output. Machine learning
has changed our perspective on such problems: the algorithm works out the mapping from inputs to outputs on its own.
There are many reasons why learning machine learning is important:
Machine learning is widely used in many industries, including healthcare, finance, and e-commerce. By learning machine
learning, you can open up a wide range of career opportunities in these fields.
Machine learning can be used to build intelligent systems that can make decisions and predictions based on data. This
can help organizations make better decisions, improve their operations, and create new products and services.
Machine learning is an important tool for data analysis and visualization. It allows you to extract insights and patterns
from large datasets, which can be used to understand complex systems and make informed decisions.
Machine learning is a rapidly growing field with many exciting developments and research opportunities. By learning
machine learning, you can stay up-to-date with the latest research and developments in the field.
The demand for machine learning is steadily rising. Because it is able to perform tasks that are too complex for a person
to directly implement, machine learning is required. Humans are constrained by our inability to manually access vast
amounts of data; as a result, we require computer systems, which is where machine learning comes in to simplify our
lives.
We can train machine learning algorithms by providing them with a large amount of data and allowing them to
automatically explore the data, build models, and predict the required output. A cost function is used to measure how
well the model's predictions match the data, and performance generally improves as more data is provided. Using
machine learning can save both time and money.
Terminology:
Model: Also known as “hypothesis”, a machine learning model is the mathematical representation of a real-world
process. A machine learning algorithm along with the training data builds a machine learning model.
Feature: A feature is a measurable property or parameter of the data-set.
Feature Vector: It is a set of multiple numeric features. We use it as an input to the machine learning model for training
and prediction purposes.
Training: An algorithm takes a set of data known as “training data” as input. The learning algorithm finds patterns in the
input data and trains the model for expected results (target). The output of the training process is the machine learning
model.
Prediction: Once the machine learning model is ready, it can be fed with input data to provide a predicted output.
Target (Label): The value that the machine learning model has to predict is called the target or label.
Overfitting: When a model learns the training data too closely, it also picks up the noise and inaccurate data entries.
Such a model fits the training data very well but fails to generalize to new data.
Underfitting: It is the scenario when the model fails to decipher the underlying trend in the input data. It destroys the
accuracy of the machine learning model. In simple terms, the model or the algorithm does not fit the data well enough.
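To make the overfitting and underfitting terms concrete, here is a minimal sketch using NumPy only; the noisy quadratic data and the polynomial degrees are invented for illustration and are not taken from the text.

```python
# Minimal sketch: underfitting vs. overfitting on invented noisy data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=1.0, size=x.shape)   # true trend is quadratic plus noise

for degree in (1, 2, 9):                          # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x, y, degree)             # fit a polynomial model of this degree
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree={degree}  training MSE={mse:.3f}")

# Degree 1 underfits (it misses the curve entirely), while degree 9 drives the
# training error down by chasing noise -- the hallmark of overfitting.
```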
1) Supervised Learning
In supervised learning, sample labeled data are provided to the machine learning system for training, and the system
then predicts the output based on the training data.
The system uses labeled data to build a model that understands the datasets and learns about each one. After the
training and processing are done, we test the model with sample data to see if it can accurately predict the output.
The objective of supervised learning is to map the input data to the output data. Supervised learning depends on
supervision: it is comparable to a student learning under the guidance of a teacher. Spam filtering is an example of
supervised learning.
Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data,
and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged
with the correct output.
In supervised learning, the training data provided to the machines work as the supervisor that teaches the machines to
predict the output correctly. It applies the same concept as a student learns in the supervision of the teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine learning model.
The aim of a supervised learning algorithm is to find a mapping function to map the input variable(x) with the
output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud Detection, spam
filtering, etc.
Supervised machine learning is based on supervision. It means that in the supervised learning technique, we train the
machines using the "labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled
data specifies that some of the inputs are already mapped to the output. More precisely, we first train the machine
with the input and corresponding output, and then we ask the machine to predict the output using the test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First,
we train the machine to understand the images, using features such as the shape and size of the tail, the shape of the
eyes, colour, and height (dogs are taller, cats are smaller), etc. After completion of training, we input the picture of a
cat and ask the machine to identify the object and predict the output. Now that the machine is well trained, it checks
all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., and finds that it is a cat. So, it puts it
in the Cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable(x) with the output
variable(y). Some real-world applications of supervised learning are Risk Assessment, Fraud Detection, Spam
filtering, etc.
Supervised learning can be divided into two types of problems:
o Regression
o Classification
The working of Supervised learning can be easily understood by the below example and diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle, and Polygon. Now
the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis
of the number of sides and predicts the output.
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output variable. They are
used for the prediction of continuous variables, such as weather forecasting, market trends, etc. Below are some popular
regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
Regression algorithms are used to solve regression problems in which there is a relationship between input and
output variables. They are used to predict continuous output variables, such as market trends, weather prediction, etc.
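As a concrete illustration of regression, the following minimal sketch fits a linear regression model with scikit-learn (an assumed dependency); the house-size and price values are invented purely for illustration.

```python
# Minimal supervised regression sketch: predict a continuous output (price)
# from an input variable (house size). Data values are invented.
from sklearn.linear_model import LinearRegression

X_train = [[1400], [1600], [1700], [1875], [2350]]   # input variable x (size in sq. ft)
y_train = [245000, 312000, 279000, 308000, 405000]   # continuous output variable y (price)

reg = LinearRegression().fit(X_train, y_train)       # learn the mapping from x to y
print(reg.predict([[2000]]))                         # predicted price for a new house
```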
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two or more classes
such as Yes-No, Male-Female, True-False, Red-Blue, etc. Classification algorithms are used to solve classification
problems in which the output variable is categorical, and they predict the categories present in the dataset. Some
real-world examples of classification are spam detection, email filtering, and fraud detection. Below are some popular
classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
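A minimal classification sketch using logistic regression from scikit-learn (an assumed dependency) is shown below; the two "email" features (number of links and number of capitalised words) and their labels are invented purely for illustration.

```python
# Minimal supervised classification sketch: predict a categorical output
# ("spam" / "not spam") from two invented numeric features per email.
from sklearn.linear_model import LogisticRegression

X_train = [[0, 1], [1, 0], [8, 12], [6, 9], [0, 2], [7, 15]]   # [links, capitalised words]
y_train = ["not spam", "not spam", "spam", "spam", "not spam", "spam"]

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[5, 10], [0, 1]]))   # categorical predictions for two new emails
```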
Advantages of supervised learning:
o With the help of supervised learning, the model can predict the output on the basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud detection, spam
filtering, etc.
Disadvantages of supervised learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
Applications of Supervised Learning:
o Image Segmentation:
Supervised learning algorithms are used in image segmentation. In this process, image classification is
performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. This is done by using medical
images and past data labelled with disease conditions. With such a process, the machine can identify a disease
for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for identifying fraud transactions,
fraud customers, etc. It is done by using historic data to identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used. These algorithms classify
an email as spam or not spam. The spam emails are sent to the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech recognition. The algorithm is
trained with voice data, which enables features such as voice-activated passwords, voice commands, etc.
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any supervision.
The training is provided to the machine with the set of data that has not been labeled, classified, or categorized, and
the algorithm needs to act on that data without any supervision. The goal of unsupervised learning is to restructure the
input data into new features or a group of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge
amount of data.
Unsupervised learning is a machine learning technique in which models are not supervised using a training dataset.
Instead, the models themselves find the hidden patterns and insights from the given data. It can be compared to the
learning which takes place in the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to
act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because unlike supervised
learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the
underlying structure of dataset, group that data according to similarities, and represent that dataset in a
compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types
of cats and dogs. The algorithm is never trained upon the given dataset, which means it does not have any idea about
the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on their
own. Unsupervised learning algorithm will perform this task by clustering the image dataset into the groups according
to similarities between images.
Unsupervised learning is different from the Supervised learning technique; as its name suggests, there is no need for
supervision. It means, in unsupervised machine learning, the machine is trained using the unlabeled dataset, and the
machine predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor labelled, and the model acts
on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according
to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input
dataset.
Let's take an example to understand it more precisely; suppose there is a basket of fruit images, and we input it into
the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the
patterns and categories of the objects.
So, the machine will discover the patterns and differences, such as colour difference and shape difference, and predict
the output when it is tested with the test dataset.
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like the way a human learns to think through their own experiences, which
makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning more
important.
o In the real world, we do not always have input data with the corresponding output, so to solve such cases,
we need unsupervised learning.
Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and
differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
Clustering: Clustering is a method of grouping the objects into clusters such that objects with most similarities remains
into a group and has less or no similarities with the objects of another group. Cluster analysis finds the commonalities
between the data objects and categorizes them as per the presence and absence of those commonalities.
The clustering technique is used when we want to find the inherent groups from the data. It is a way to group the
objects into a cluster such that the objects with the most similarities remain in one group and have fewer or no
similarities with the objects of other groups. An example of the clustering algorithm is grouping the customers by their
purchasing behaviour.
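The customer-grouping idea above can be sketched with k-means clustering from scikit-learn (an assumed dependency); the income/spending-score values are invented for illustration.

```python
# Minimal clustering sketch: group customers by purchasing behaviour
# without any labels. The [annual income, spending score] data is invented.
from sklearn.cluster import KMeans

customers = [[15, 39], [16, 81], [17, 6], [18, 77],
             [90, 10], [88, 86], [86, 82], [87, 13]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # cluster assigned to each customer
print(kmeans.cluster_centers_)   # centre of each discovered group
```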
Association: An association rule is an unsupervised learning method which is used for finding the relationships between
variables in the large database. It determines the set of items that occurs together in the dataset. Association rule makes
marketing strategy more effective. Such as people who buy X item (suppose a bread) are also tend to purchase Y
(Butter/Jam) item. A typical example of Association rule is Market Basket Analysis.
Association rule learning is an unsupervised learning technique, which finds interesting relations among variables within
a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another data
item and map those variables accordingly so that it can generate maximum profit. This algorithm is mainly applied
in Market Basket analysis, Web usage mining, continuous production, etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth algorithm.
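The support/confidence idea behind association rules can be shown without any library; the transactions below are invented, and real algorithms such as Apriori or FP-growth essentially search for all rules whose support and confidence exceed chosen thresholds.

```python
# Minimal, library-free sketch of the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n
support_both = sum({"bread", "butter"} <= t for t in transactions) / n   # subset test
confidence = support_both / support_bread                                # P(butter | bread)

print(f"support(bread, butter) = {support_both:.2f}")
print(f"confidence(bread -> butter) = {confidence:.2f}")
```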
Advantages of Unsupervised Learning:
o Unsupervised learning is used for more complex tasks as compared to supervised learning because, in
unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easier to get unlabeled data in comparison to labeled data.
Disadvantages of Unsupervised Learning:
o Unsupervised learning is intrinsically more difficult than supervised learning as it does not have corresponding
output.
o The result of the unsupervised learning algorithm might be less accurate as the input data is not labeled, and
the algorithms do not know the exact output in advance.
Singular Value Decomposition: Singular Value Decomposition, or SVD, is used to extract particular information from
the database, for example, extracting information about each user located at a particular location.
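A minimal SVD sketch with NumPy is shown below; the small user-by-item ratings matrix is invented, and keeping only the largest singular values gives the kind of compressed summary described above.

```python
# Minimal SVD sketch on an invented user-by-item ratings matrix.
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

k = 2                                              # keep the 2 largest singular values
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # compressed, low-rank reconstruction
print(np.round(approx, 2))
```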
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between Supervised and
Unsupervised machine learning. It represents the intermediate ground between Supervised (With Labelled training
data) and Unsupervised learning (with no labelled training data) algorithms and uses the combination of labelled and
unlabeled datasets during the training period.
Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates
on data that contains a few labels, that data mostly consists of unlabeled examples. Labels are costly to obtain, so for
corporate purposes a dataset may have only a few of them. This setting differs from supervised and unsupervised
learning, which are based on the presence or absence of labels.
To overcome the drawbacks of supervised and unsupervised learning algorithms, the concept of semi-supervised
learning was introduced. The main aim of semi-supervised learning is to effectively use all the available data, rather
than only the labelled data as in supervised learning. Initially, similar data is clustered using an unsupervised learning
algorithm, and this clustering then helps to label the unlabeled data. This is because labelled data is comparatively
more expensive to acquire than unlabeled data.
We can illustrate these algorithms with an example. Supervised learning is where a student is under the supervision of
an instructor at home and college. If that student analyses the same concept on their own, without any help from the
instructor, it comes under unsupervised learning. Under semi-supervised learning, the student revises the concept on
their own after first studying it under the guidance of an instructor at college.
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically
explores its surroundings by hit and trial, taking actions, learning from experience, and improving its
performance. The agent gets rewarded for each good action and punished for each bad action; hence the goal of a
reinforcement learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data like supervised learning, and agents learn from their experiences
only.
The reinforcement learning process is similar to how a human learns; for example, a child learns various things through
experience in day-to-day life. An example of reinforcement learning is playing a game, where the game is the
environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The
agent receives feedback in terms of punishments and rewards.
Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations
research, information theory, and multi-agent systems.
A reinforcement learning problem can be formalized using Markov Decision Process(MDP). In MDP, the agent
constantly interacts with the environment and performs actions; at each action, the environment responds and
generates a new state.
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right
action and gets a penalty for each wrong action. The agent learns automatically with these feedbacks and improves its
performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of an agent
is to get the most reward points, and hence, it improves its performance.
A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
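The reward/penalty loop described above can be sketched with tabular Q-learning; the tiny five-state "corridor" environment and all parameter values below are invented for illustration and are not part of the text.

```python
# Minimal tabular Q-learning sketch: the agent learns, via rewards and
# penalties, to walk right along a 5-state corridor to reach the goal.
import random

n_states, actions = 5, [-1, +1]          # actions: step left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration rate

for episode in range(200):
    s = 0
    while s != n_states - 1:             # the rightmost state is the goal
        if random.random() < epsilon:
            a = random.choice(actions)                       # explore
        else:
            a = max(actions, key=lambda act: Q[(s, act)])    # exploit
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else -0.01         # reward or small penalty
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # Q-learning update
        s = s_next

# After training, the greedy policy prefers moving right (+1) toward the reward.
print({s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)})
```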
Categories of Reinforcement Learning
o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the tendency that the
required behaviour would occur again by adding something. It enhances the strength of the behaviour of the
agent and positively impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to the positive
RL. It increases the tendency that the specific behaviour would occur again by avoiding the negative condition.
Advantages
o It helps in solving complex real-world problems which are difficult to solve with conventional techniques.
o The learning model of RL is similar to how human beings learn; hence it can produce very accurate results.
o It helps in achieving long-term results.
Disadvantage
The curse of dimensionality limits reinforcement learning for real physical systems.
APPLICATIONS OF MACHINE LEARNING
Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We are using machine
learning in our daily life even without knowing it, such as Google Maps, Google Assistant, Alexa, etc. Below are some
of the most trending real-world applications of machine learning.
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons,
places, digital images, etc. The popular use case of image recognition and face detection is, Automatic friend tagging
suggestion:
Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our Facebook
friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's
face detection and recognition algorithm.
It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person
identification in the picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice"; this comes under speech recognition, and it is a popular
application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or
"computer speech recognition." At present, machine learning algorithms are widely used in various speech recognition
applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest
route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps helps to make the app better. It takes information from the user and sends it back to
its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc.,
for product recommendations to the user. Whenever we search for some product on Amazon, we start getting
advertisements for the same product while surfing the internet on the same browser, and this is because of machine
learning.
Google understands the user's interests using various machine learning algorithms and suggests products as per
customer interest.
Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done
with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role
in self-driving cars. Tesla, a popular car manufacturing company, is working on self-driving cars and uses machine
learning methods to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive
important mail in our inbox, marked with the important symbol, and spam emails in our spam box, and the technology
behind this is machine learning. Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes classifier are
used for email spam filtering and malware detection.
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests,
they help us find information using our voice instructions. These assistants can help us in various ways just through
our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.
These assistants record our voice instructions, send them to a server on the cloud, decode them using ML algorithms,
and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we
perform an online transaction, there are various ways that a fraudulent transaction can take place, such as fake
accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural
network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into some hash values, and these values become the input for
the next round. Each genuine transaction follows a specific pattern, which changes for a fraudulent transaction; hence
the system detects it and makes our online transactions more secure.
9. Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs
in shares, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock
market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast
and is able to build 3D models that can predict the exact position of lesions in the brain.
11. Automatic Language Translation:
If we visit a new place and are not aware of the language, it is not a problem at all, as machine learning also helps us
here by converting text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this
feature; it is a neural machine translation system that translates text into our familiar language, and this is called
automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is also used with
image recognition to translate text from one language to another.
MACHINE LEARNING LIFE CYCLE
Machine learning has given computer systems the ability to learn automatically without being explicitly programmed.
But how does a machine learning system work? It can be described using the machine learning life cycle. The machine
learning life cycle is a cyclic process for building an efficient machine learning project. The main purpose of the life
cycle is to find a solution to the problem or project.
Machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
The most important thing in the complete process is to understand the problem and to know its purpose. Therefore,
before starting the life cycle, we need to understand the problem, because a good result depends on a good
understanding of the problem.
In the complete life cycle process, to solve a problem, we create a machine learning system called a "model", and this
model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting
data.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all the
data related to the problem.
In this step, we need to identify the different data sources, as data can be collected from various sources such as files,
databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and
quality of the collected data will determine the efficiency of the output. The more data we have, the more accurate the
prediction will be.
By performing the above task, we get a coherent set of data, also called a dataset, which will be used in further steps.
2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step where we put our data into
a suitable place and prepare it to use in our machine learning training.
In this step, first, we put all data together, and then randomize the ordering of data.
o Data exploration:
This is used to understand the nature of the data that we have to work with. We need to understand the
characteristics, format, and quality of the data. A better understanding of the data leads to an effective
outcome. In this step, we find correlations, general trends, and outliers.
o Data pre-processing:
The next step is pre-processing of the data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning
the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for
analysis in the next step. It is one of the most important steps of the complete process. Cleaning the data is required
to address quality issues.
The data we have collected is not always useful, as some of it may be irrelevant. In real-world applications, collected
data may have various issues, including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step.
The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and
review the outcome. It starts with determining the type of problem, where we select machine learning techniques such
as classification, regression, cluster analysis, association, etc.; we then build the model using the prepared data and
evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model
The next step is to train the model. In this step, we train our model to improve its performance and obtain a better
outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it
can understand the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, then we test the model. In this step, we check
for the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirement of project or problem.
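A minimal sketch of the "train the model" and "test the model" steps is shown below, using scikit-learn (an assumed dependency) and its built-in iris dataset.

```python
# Minimal train/test sketch: hold back part of the data, train on the rest,
# then measure accuracy on the unseen test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # step 5: train
predictions = model.predict(X_test)                                    # step 6: test
print(f"test accuracy: {accuracy_score(y_test, predictions):.2%}")
```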
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the real-world system.
If the above-prepared model is producing an accurate result as per our requirement with acceptable speed, then we
deploy the model in the real system. But before deploying the project, we will check whether it is improving its
performance using available data or not. The deployment phase is similar to making the final report for a project.
BATCH LEARNING AND ONLINE LEARNING
Batch learning and online learning are two different types of machine learning techniques that are used to train
models on data.
Batch learning, also known as offline learning, involves training a model on a fixed dataset, or a batch of data, all at
once. The model is trained on the entire dataset, and then used to make predictions on new data. This means that
batch learning requires a complete dataset before training can begin, and the model cannot be updated once it has
been trained without retraining the entire model. Batch learning is commonly used in situations where the dataset is
relatively small and can be processed quickly.
On the other hand, online learning, also known as incremental learning or streaming learning, involves training a
model on new data as it arrives, one observation at a time. The model is updated each time a new observation is
received, allowing it to adapt to changes in the data over time. Online learning is commonly used in situations where
the data is too large to be processed all at once, or where the data is constantly changing, such as in stock market
data or social media data.
The key difference between batch learning and online learning is that batch learning requires a fixed dataset, while
online learning can adapt to new data as it arrives. Batch learning is typically faster and requires less computational
resources than online learning, but may not be as flexible in handling changing or large datasets. Online learning, on
the other hand, can be more flexible and adaptable, but may require more resources and be slower to process data.
Both techniques have their advantages and disadvantages, and the choice between them depends on the specific
problem being solved and the characteristics of the data.
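The contrast between batch and online training can be sketched with scikit-learn's SGDClassifier (an assumed dependency): fit trains on the whole dataset at once, while partial_fit updates the model incrementally as chunks of data "arrive".

```python
# Minimal batch vs. online learning sketch on scikit-learn's digits dataset.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)
classes = np.unique(y)

# Batch (offline) learning: train on the full, fixed dataset in one go.
batch_model = SGDClassifier(random_state=0).fit(X, y)

# Online (incremental) learning: update the model chunk by chunk;
# partial_fit needs the full list of classes up front.
online_model = SGDClassifier(random_state=0)
for start in range(0, len(X), 100):
    online_model.partial_fit(X[start:start + 100], y[start:start + 100], classes=classes)

print(batch_model.score(X, y), online_model.score(X, y))
```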
INSTANCE-BASED LEARNING
Machine learning systems categorized as instance-based learning are systems that learn the training examples by heart
and then generalize to new instances based on some similarity measure. It is called instance-based because it builds its
hypotheses from the training instances. It is also known as memory-based learning or lazy learning (because processing
is delayed until a new instance must be classified). The time complexity of this approach depends upon the size of the
training data. Each time a new query is encountered, the previously stored data is examined and a target function value
is assigned to the new instance.
The worst-case time complexity of this approach is O(n), where n is the number of training instances. For example, if
we were to create a spam filter with an instance-based learning algorithm, instead of just flagging emails that are
already marked as spam, our spam filter would also flag emails that are very similar to them. This requires a measure
of resemblance between two emails. A similarity measure between two emails could be the same sender, the repetitive
use of the same keywords, or something else.
Advantages:
1. Instead of estimating the target function for the entire instance set, local approximations can be made.
2. This approach can easily adapt to new data, which is collected as we go.
Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query involves building a local model from
scratch.
Some of the instance-based learning algorithms are:
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning
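A minimal k-nearest-neighbours sketch with scikit-learn (an assumed dependency) illustrates the lazy, memory-based behaviour described above; the two "email" features and labels are invented for illustration.

```python
# Minimal instance-based (lazy) learning sketch: KNN stores the training
# instances and classifies a new query by similarity to its neighbours.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 1], [1, 0], [9, 12], [7, 9], [0, 2], [8, 15]]   # invented email features
y_train = ["ham", "ham", "spam", "spam", "ham", "spam"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # "training" just stores the data
print(knn.predict([[6, 11]]))   # classified by its 3 most similar stored emails
```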
MODEL BASED LEARNING
Model-based learning in machine learning refers to an approach where the algorithm builds an internal model of the
underlying structure or patterns in the data. This model is then used to make predictions, classifications, or decisions.
In contrast to model-free approaches, where the algorithm directly learns from data without explicitly building a
model, model-based learning involves creating a representation of the system being modeled.
1. Model Representation:
Selecting a Model: Choose a suitable mathematical or algorithmic representation for the model based on
the nature of the problem. This could be a linear model, decision tree, neural network, or any other
applicable model.
2. Model Training:
Parameter Estimation: If the chosen model has parameters, these need to be estimated or learned from the
training data. This is typically done by minimizing a loss function that quantifies the difference between the
model's predictions and the actual outcomes.
3. Model Evaluation:
Assessing Model Performance: Evaluate how well the model performs on a separate validation or test
dataset. This step helps ensure that the model generalizes well to unseen data and is not overfitting the
training data.
4. Model Utilization:
Making Predictions or Decisions: Once the model is trained and validated, it can be used to make
predictions or decisions on new, unseen data. The model leverages the patterns learned during training to
generalize its understanding to novel instances.
5. Refinement and Iteration:
Model Tuning: Adjust model parameters or hyperparameters to improve performance based on evaluation
results.
Iteration: Iterate through the training, evaluation, and tuning steps as needed to refine the model.
6. Applications of Model-Based Learning:
Reinforcement Learning: In reinforcement learning, an agent builds a model of the environment and uses it
to make decisions to maximize a reward signal. This approach is often contrasted with model-free
reinforcement learning.
System Identification: Model-based learning is used in system identification to estimate the parameters
and dynamics of a system based on observed input-output data.
Control Systems: In control systems, model-based methods are employed to design controllers that
regulate a system's behavior.
7. Advantages and Challenges:
Advantages: Model-based learning can provide interpretability, insights into the underlying data
distribution, and the ability to make predictions in situations with limited data.
Challenges: The effectiveness of model-based learning heavily depends on the accuracy of the chosen
model. If the model is too simple, it may fail to capture complex patterns, while an overly complex model
may overfit the training data.
Model-based learning is applicable in various machine learning paradigms, including supervised learning,
unsupervised learning, and reinforcement learning. The choice of model and its complexity depends on the
characteristics of the data and the specific requirements of the task at hand; the sketch that follows walks through the
train-evaluate-predict workflow described above.
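The sketch below, using NumPy only, illustrates the model-based workflow: a line y = m*x + b is chosen as the model representation, its parameters are estimated by gradient descent on a mean-squared-error loss, the fit is evaluated on held-out data, and the model is then used for prediction. All data values are invented.

```python
# Minimal model-based learning sketch: represent, train, evaluate, predict.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 80)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.shape)   # unknown "true" process

x_train, y_train = x[:60], y[:60]
x_test, y_test = x[60:], y[60:]

# Model representation: y_hat = m*x + b; parameters learned by minimising MSE.
m, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    error = (m * x_train + b) - y_train
    m -= lr * 2 * np.mean(error * x_train)   # gradient of MSE w.r.t. m
    b -= lr * 2 * np.mean(error)             # gradient of MSE w.r.t. b

test_mse = np.mean(((m * x_test + b) - y_test) ** 2)       # model evaluation
print(f"learned m={m:.2f}, b={b:.2f}, test MSE={test_mse:.2f}")
print("prediction for x=7:", m * 7 + b)                     # model utilisation
```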
1. Applications in Industry:
Healthcare: Machine learning is used for disease diagnosis, personalized medicine, drug discovery,
and predicting patient outcomes.
Finance: ML algorithms are employed for fraud detection, risk assessment, algorithmic trading, and
credit scoring.
Marketing: ML powers recommendation systems, customer segmentation, targeted advertising,
and sentiment analysis.
Manufacturing: ML is applied for predictive maintenance, quality control, and supply chain
optimization.
2. Natural Language Processing (NLP):
ML plays a crucial role in NLP, enabling machines to understand, interpret, and generate human-
like text. Applications include language translation, chatbots, and sentiment analysis.
3. Computer Vision:
ML techniques are used in computer vision for image and video analysis. This includes image
recognition, object detection, facial recognition, and autonomous vehicles.
4. Speech Recognition:
ML is integral to the development of speech recognition systems, enabling machines to transcribe
spoken words and respond to voice commands.
5. Reinforcement Learning:
In areas such as robotics and gaming, reinforcement learning is employed to train agents to make
decisions by interacting with an environment and receiving feedback in the form of rewards or
penalties.
6. Predictive Analytics:
ML is extensively used for predictive modeling, forecasting, and trend analysis across industries. It
helps organizations make data-driven decisions based on historical and real-time data.
7. Personalization:
ML algorithms power personalization features in various applications, including online platforms,
content recommendations, and targeted advertising.
8. Biomedical and Genomic Research:
ML is applied to analyze large-scale biological and genomic data, leading to advancements in
understanding diseases, drug discovery, and personalized medicine.
9. Autonomous Systems:
ML is crucial for the development of autonomous systems, including self-driving cars, drones, and
robotic systems that can perform tasks without human intervention.
10. Environmental Monitoring:
ML is utilized in environmental science for tasks such as climate modeling, weather prediction, and
the analysis of satellite imagery to monitor changes in ecosystems.
11. Education and E-Learning:
ML is applied to personalize educational content, provide adaptive learning experiences, and assess
student performance.
12. Research and Innovation:
ML is a driving force in scientific research, aiding in data analysis, pattern recognition, and
hypothesis generation across various scientific disciplines.
LIMITATIONS OF MACHINE LEARNING
1. Data Dependence:
ML models heavily rely on the quality and quantity of training data. Biases and inaccuracies in the
data can lead to biased models or models that struggle to generalize to new, unseen data.
2. Bias and Fairness:
ML models can perpetuate and even amplify biases present in the training data. This can result in
unfair or discriminatory outcomes, especially in sensitive areas like hiring, lending, and criminal
justice.
3. Interpretability:
Many ML models, especially complex ones like deep neural networks, lack interpretability.
Understanding how and why a model makes a specific prediction or decision can be challenging,
limiting trust and transparency.
4. Overfitting and Underfitting:
Models can be prone to overfitting (capturing noise in the training data) or underfitting
(oversimplifying the underlying patterns). Striking the right balance requires careful model selection
and tuning.
5. Computational Resources:
Training sophisticated ML models, particularly deep neural networks, demands significant
computational resources. This can be a barrier for smaller organizations or researchers with limited
access to high-performance computing.
6. Lack of Causality:
ML models often infer associations from data but may not necessarily uncover causal relationships.
Drawing conclusions about cause and effect from correlations can lead to misleading
interpretations.
7. Adversarial Attacks:
ML models can be vulnerable to adversarial attacks, where subtle, intentional modifications to input
data lead to misclassifications. This poses security risks, especially in applications like image
recognition or autonomous vehicles.
8. Domain Adaptation:
Models trained on data from one domain may struggle to perform well in a different domain.
Adapting models to new and diverse environments is a challenge, particularly when the training
and deployment contexts differ.
9. Limited Generalization:
ML models might not generalize well to situations that significantly deviate from the training data
distribution. This limitation can be problematic in dynamic environments or when facing novel
scenarios.
10. Ethical and Privacy Concerns:
ML systems may inadvertently infringe on privacy rights, especially when dealing with sensitive
personal data. Ensuring ethical use of ML and safeguarding against privacy breaches are ongoing
challenges.
11. Scalability:
Scaling ML models to handle large datasets or deploy them across massive user bases can be
complex. Achieving scalability often requires optimizing algorithms and infrastructure.
12. Explainability and Regulatory Compliance:
Explainability is crucial for gaining user trust and meeting regulatory requirements. ML models must
comply with regulations that mandate transparency, fairness, and accountability in decision-making
processes.
13. Concept Drift:
ML models may degrade in performance over time due to changes in the underlying data
distribution (concept drift). Continual monitoring and adaptation are required to address this
challenge.
14. Human-Machine Collaboration:
Integrating ML systems into human workflows requires careful consideration of how humans and
machines collaborate. Ensuring effective communication and understanding between humans and
automated systems is an ongoing area of research.
15. Robustness to Uncertainty:
ML models may struggle to handle uncertainty inherent in real-world scenarios. Uncertainty in input
data, noise, or unexpected events can impact model reliability and decision-making.
Acknowledging these limitations is essential for responsible development and deployment of machine learning
systems. Researchers and practitioners are actively working on addressing these challenges through advancements in
algorithmic development, research on fairness and bias, and the establishment of ethical guidelines for the
responsible use of machine learning. As the field continues to evolve, ongoing efforts to mitigate these limitations will
contribute to the development of more robust, transparent, and trustworthy ML systems.
CHALLENGES OF MACHINE LEARNING
In machine learning, there is a process of analyzing data for building or training models. It is just about everywhere; from
Amazon product recommendations to self-driving cars, it holds great value throughout. As per recent research, the global
machine learning market is expected to grow by 43% by 2024. This revolution has greatly increased the demand for machine
learning professionals. AI and machine learning jobs have observed a significant growth rate of 75% in the past four years,
and the industry is growing continuously. A career in the machine learning domain offers job satisfaction, excellent growth,
and a very high salary, but it is a complex and challenging process.
There are a lot of challenges that machine learning professionals face in building ML skills and creating an application from
scratch. What are these challenges? Below, we discuss seven major challenges faced by machine learning professionals.
1. Poor Quality of Data
Data plays a significant role in the machine learning process. One of the significant issues that machine learning professionals
face is the absence of good quality data. Unclean and noisy data can make the whole process extremely exhausting. We don't
want our algorithm to make inaccurate or faulty predictions, so the quality of the data is essential to enhance the output.
Therefore, we need to ensure that the process of data preprocessing, which includes removing outliers, filtering missing
values, and removing unwanted features, is done with the utmost level of perfection.
2. Underfitting of Training Data
This occurs when the model is unable to establish an accurate relationship between input and output variables. It is like
trying to fit into undersized jeans: the model is too simple to capture a precise relationship. To overcome this issue, we can
increase the model's complexity, add more relevant features, or increase the training time.
3. Overfitting of Training Data
Overfitting occurs when a machine learning model is trained in such a way that it captures the noise and bias in the data,
which negatively affects its performance on new data. It is like trying to fit into oversized jeans. Unfortunately, this is one of
the significant issues faced by machine learning professionals. Let's understand this with the help of an example. Consider a
model trained to differentiate between a cat, a rabbit, a dog, and a tiger. The training data contains 1000 cats, 1000 dogs,
1000 tigers, and 4000 rabbits. Then there is a considerable probability that it will identify a cat as a rabbit. In this example,
we had a vast amount of data, but it was biased; hence the prediction was negatively affected.
We can tackle this issue by:
o Analyzing the data with the utmost level of perfection
o Using data augmentation techniques
o Removing outliers in the training set
o Selecting a model with fewer features
4. The Complexity of the Process
The machine learning industry is young and is continuously changing. Rapid hit-and-trial experiments are being carried out.
The process is constantly transforming, and hence there are high chances of error, which makes learning complex. It includes
analyzing the data, removing data bias, training the data, applying complex mathematical calculations, and a lot more. Hence
it is a really complicated process, which is another big challenge for machine learning professionals.
5. Lack of Training Data
The most important task in the machine learning process is to train the model on enough data to achieve an accurate output.
Too little training data will produce inaccurate or overly biased predictions. Let us understand this with the help of an example.
Consider training a machine learning algorithm as similar to teaching a child. One day you decide to explain to a child how to
distinguish between an apple and a watermelon. You take an apple and a watermelon and show him the difference between
both based on their colour, shape, and taste. In this way, he soon attains perfection in differentiating between the two. A
machine learning algorithm, on the other hand, needs a lot of data to make this distinction; for complex problems, it may
even require millions of examples. Therefore, we need to ensure that machine learning algorithms are trained with sufficient
amounts of data.
6. Slow Implementation
This is one of the common issues faced by machine learning professionals. The machine learning models are highly efficient
in providing accurate results, but it takes a tremendous amount of time. Slow programs, data overload, and excessive
requirements usually take a lot of time to provide accurate results. Further, it requires constant monitoring and maintenance to
deliver the best output.
7. Imperfections as Data Grows
So you have found quality data, trained the model amazingly, and the predictions are really concise and accurate. You have
learned how to create a machine learning algorithm! But wait, there is a twist: the model may become useless in the future
as the data grows. The best model of the present may become inaccurate in the future and require further rearrangement.
So you need regular monitoring and maintenance to keep the algorithm working. This is one of the most exhausting issues
faced by machine learning professionals.
Conclusion: Machine learning is all set to bring a big transformation in technology. It is one of the most rapidly growing
technologies, used in medical diagnosis, speech recognition, robotic training, product recommendations, video surveillance,
and much more. This continuously evolving domain offers immense job satisfaction, excellent opportunities, global exposure,
and a high salary. It is a high-risk, high-return technology. Before starting your machine learning journey, ensure that you
carefully examine the challenges mentioned above. To learn this fantastic technology, you need to plan carefully, stay patient,
and maximize your efforts. Once you win this battle, you can conquer the future of work and land your dream job!
Data Visualization
Data Science is the science of analyzing raw data using statistics and machine learning techniques with the purpose of
drawing conclusions about that information.
Let us now understand the term Data Visualization.
Data visualization is the graphical representation of information and data in a pictorial or graphical format (for example:
charts, graphs, and maps). Data visualization tools provide an accessible way to see and understand trends, patterns in data,
and outliers. Data visualization tools and technologies are essential for analyzing massive amounts of information and making
data-driven decisions. The concept of using pictures to understand data has been used for centuries. General types of data
visualization are charts, tables, graphs, maps, and dashboards.
Categories of Data Visualization
Data visualization is very important in market research, where both numerical and categorical data can be visualized, which
helps increase the impact of insights and also helps reduce the risk of analysis paralysis. Accordingly, data visualization is
broadly categorized into the visualization of numerical data and the visualization of categorical data.
Data visualization also has some drawbacks in a machine learning workflow:
1. Can be time-consuming: Creating visualizations can be a time-consuming process, especially when dealing
with large and complex datasets. This can slow down the machine learning workflow and reduce productivity.
2. Can be misleading: While data visualization can help identify patterns and relationships in data, it can also be
misleading if not done correctly. Visualizations can create the impression of patterns or trends that may not
actually exist, leading to incorrect conclusions and poor decision-making.
3. Can be difficult to interpret: Some types of visualizations, such as those that involve 3D or interactive elements,
can be difficult to interpret and understand. This can lead to confusion and misinterpretation of the data.
4. May not be suitable for all types of data: Certain types of data, such as text or audio data, may not lend
themselves well to visualization. In these cases, alternative methods of analysis may be more appropriate.
5. May not be accessible to all users: Some users may have visual impairments or other disabilities that make it
difficult or impossible for them to interpret visualizations. In these cases, alternative methods of presenting data
may be necessary to ensure accessibility.
Why is Data Visualization Important?
Let’s take an example. Suppose you compile a data visualization of the company’s profits from 2010 to 2020 and create a line
chart. It would be very easy to see the line going constantly up with a drop in just 2018. So you can observe in a second that the
company has had continuous profits in all the years except a loss in 2018. It would not be that easy to get this information so
fast from a data table. This is just one demonstration of the usefulness of data visualization. Let’s see some more reasons why
data visualization is so important.
4. Saves Time
It is definitely faster to gather insights from the data using data visualization than by studying a raw data table. For
example, in a Tableau heat map of state-level profits, it is very easy to identify the states that have suffered a net loss rather
than a profit: all the cells with a loss are colored red, so the loss-making states stand out immediately. Compare this to a
normal table, where you would need to check each cell for a negative value to determine a loss. Obviously, data
visualization saves a lot of time in this situation!
A hypothesis is a common term in Machine Learning and data science projects. As we know, machine learning is one
of the most powerful technologies across the world, which helps us to predict results based on past experiences.
Moreover, data scientists and ML professionals conduct experiments that aim to solve a problem, and they make an
initial assumption about the solution of the problem.
This assumption in Machine Learning is known as a Hypothesis. In Machine Learning, Hypothesis and Model are
sometimes used interchangeably. However, a hypothesis is an assumption made by scientists, whereas a model is a
mathematical representation that is used to test the hypothesis.
What is Hypothesis?
A hypothesis is defined as a supposition or proposed explanation based on insufficient evidence or
assumptions. It is just a guess based on some known facts but has not yet been proven. A good hypothesis is testable,
so that it can turn out to be either true or false.
Example: Let's understand the hypothesis with a common example. A scientist claims that ultraviolet (UV) light can
damage the eyes; we further assume that it may also cause blindness.
In this example, the scientist only claims that UV rays are harmful to the eyes, while we assume they may cause blindness.
This may or may not turn out to be true. Such assumptions are called hypotheses.
The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is specifically used in
Supervised Machine learning, where an ML model learns a function that best maps the input to corresponding outputs
with the help of an available dataset.
In supervised learning techniques, the main aim is to determine the possible hypothesis out of hypothesis space that
best maps input to the corresponding or correct outputs.
There are some common terms used when finding a possible hypothesis from the hypothesis space, where the
hypothesis space is represented by uppercase H and a hypothesis by lowercase h. These are defined as follows:
Hypothesis space (H):
Hypothesis space is defined as the set of all possible legal hypotheses; hence it is also known as a hypothesis set. It is used
by supervised machine learning algorithms to determine the best possible hypothesis to describe the target function
or best map inputs to outputs.
It is often constrained by the framing of the problem, the choice of model, and the choice of model configuration.
Hypothesis (h):
It is defined as the approximate function that best describes the target in supervised machine learning algorithms. It is
primarily based on data as well as bias and restrictions applied to data.
Hence hypothesis (h) can be concluded as a single hypothesis that maps input to proper output and can be evaluated
as well as used to make predictions.
y = m*x + b
Where,
y: the range (predicted output)
m: the slope of the line that divides the data, i.e., the change in y divided by the change in x
x: the domain (input)
b: the intercept (a constant)
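To make this concrete, here is a minimal sketch of evaluating one such linear hypothesis in Python; the slope and intercept values are hypothetical, chosen only for illustration:

# A single hypothesis h drawn from the hypothesis space H of all lines y = m*x + b.
# The values of m and b below are hypothetical, chosen only for this illustration.
def h(x, m=2.0, b=1.0):
    return m * x + b

# Evaluate the hypothesis on a few sample inputs (the "domain" x).
for x in [0, 1, 2, 3]:
    print(x, h(x))   # prints the predicted output (the "range" y) for each x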
Example: Let's understand the hypothesis (h) and hypothesis space (H) with a two-dimensional coordinate plane
showing the distribution of data as follows:
Now, assume we have some test data by which ML algorithms predict the outputs for input as follows:
If we divide this coordinate plane in such a way that it can help us to predict the output or result as follows:
Based on the given test data, the output result will be as follows:
However, based on data, algorithm, and constraints, this coordinate plane can also be divided in the following ways as
follows:
Hypothesis space (H) is the composition of all legal best possible ways to divide the coordinate plane so that it best
maps input to proper output.
Further, each individual best possible way is called a hypothesis (h). Hence, the hypothesis and hypothesis space would
be like this:
Hypothesis in Statistics
Similar to the hypothesis in machine learning, the statistical hypothesis is also an assumption about the outcome. However,
it is falsifiable, which means it can be rejected in the presence of sufficient evidence.
Unlike machine learning, we cannot simply accept any hypothesis in statistics, because it is a provisional claim based
on probability. Before starting work on an experiment, we must be aware of two important types of hypotheses, as
follows:
o Null Hypothesis: A null hypothesis is a type of statistical hypothesis which states that no statistically
significant effect exists in the given set of observations. It is also known as a conjecture and is used in quantitative
analysis to test theories about markets, investment, and finance to decide whether an idea is true or false.
o Alternative Hypothesis: An alternative hypothesis is a direct contradiction of the null hypothesis, which
means that if one of the two hypotheses is true, the other must be false. In other words, an alternative
hypothesis is a type of statistical hypothesis which states that some significant effect exists in the
given set of observations.
Significance level
The significance level is the primary thing that must be set before starting an experiment. It defines the tolerance for
error, i.e., the level at which an effect can be considered statistically significant. In a typical experiment, a 5% significance
level (95% confidence level) is used: results are accepted at 95% confidence, and the remaining 5% is treated as the
allowable risk of error. The significance level also determines the critical or threshold value. For example, if the confidence
level is set to 98%, the significance level (and hence the critical p-value threshold) is 0.02.
P-value
The p-value in statistics quantifies the evidence against a null hypothesis. In other words, the p-value is the probability
of obtaining data as extreme as (or more extreme than) the observed data purely by random chance, under the null
hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis, and the more justified we are
in rejecting it. It is always represented in decimal form, such as 0.035.
Whenever a statistical test is carried out on a population or sample to find the p-value, it is compared against the
significance level. If the p-value is less than the significance level, the effect is considered significant and the null hypothesis
can be rejected. If it is higher, there is no significant effect and we fail to reject the null hypothesis.
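As a hedged illustration of how a p-value is compared against a significance level, here is a minimal sketch assuming SciPy is installed; the sample values and the hypothesized population mean are made up for illustration:

from scipy import stats

# Hypothetical sample measurements (made up for illustration).
sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4]

# Null hypothesis: the population mean equals 5.5.
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.5)

alpha = 0.05  # significance level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")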
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the
first and most crucial step when creating a machine learning model. When creating a machine learning project, it is not always
the case that we come across clean and formatted data. And before doing any operation with data, it is mandatory to
clean it and put it in a formatted way. For this, we use the data preprocessing task.
Real-world data generally contains noise and missing values, and may be in an unusable format which cannot be
directly used by machine learning models. Data preprocessing is the required task of cleaning the data and making it
suitable for a machine learning model, which also increases the accuracy and efficiency of the model.
DATA PREPROCESSING
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling
missing data, noisy data, etc. (a minimal cleaning sketch follows the methods below). Noisy data can be handled, for example, by:
o Regression: Here data can be smoothed by fitting it to a regression function. The regression used may be
linear (having one independent variable) or multiple (having multiple independent variables).
o Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they will fall
outside the clusters.
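The following is a minimal data-cleaning sketch, assuming pandas and NumPy are available; the small table and the chosen strategies (mean/median imputation, percentile capping of an outlier) are illustrative assumptions, not the only way to clean data:

import pandas as pd
import numpy as np

# Hypothetical raw data with missing and noisy values (made up for illustration).
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan, 38],
    "salary": [50000, 60000, None, 52000, 58000, 1_000_000],  # last value is an outlier
})

# Handle missing data: fill numeric gaps with the column mean / median (one common strategy).
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].median())

# Handle noisy data: cap extreme salaries at the 95th percentile (a simple smoothing choice).
cap = df["salary"].quantile(0.95)
df["salary"] = df["salary"].clip(upper=cap)

print(df)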
2. Data Integration: It is challenging to guarantee quality with low-quality data, and data integration is crucial in
resolving this issue. Data integration is the process of merging information from various data sets into one.
Before transferring data to its final location, this step makes sure that the combined data set is standardized and
formatted using data cleansing tools.
3. Data Transformation:
Data transformation in data mining refers to the process of converting raw data into a format that is suitable for analysis and
modeling. The goal of data transformation is to prepare the data for data mining so that it can be used to extract useful
insights and knowledge. Data transformation can also help to improve the performance of data mining algorithms, by
reducing the dimensionality of the data, and by scaling the data to a common range of values.
The data are transformed in ways that are ideal for mining the data. The data transformation involves steps that are:
1. Smoothing: It is a process that is used to remove noise from the dataset using some algorithms It allows for highlighting
important features present in the dataset. It helps in predicting the patterns. When collecti ng data, it can be manipulated to
eliminate or reduce any variance or any other noise form. The concept behind data smoothing is that it will be able to identi fy
simple changes to help predict different trends and patterns. This serves as a help to analyst s or traders who need to look at a
lot of data which can often be difficult to digest for finding patterns that they wouldn’t see otherwise.
2. Aggregation: Data aggregation is the method of storing and presenting data in a summary format. The data
may be obtained from multiple data sources and integrated into a single data analysis description. This is a crucial
step, since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used. Gathering
accurate data of high quality and in large enough quantity is necessary to produce relevant results. Aggregated data is
useful for everything from decisions concerning financing or the business strategy of a product to pricing, operations, and
marketing strategies. For example, sales data may be aggregated to compute monthly and annual totals.
3. Discretization: It is the process of transforming continuous data into a set of small intervals. Many real-world data
mining activities involve continuous attributes, yet many of the existing data mining frameworks are unable to handle such
attributes. Also, even if a data mining task can manage a continuous attribute, its efficiency can be significantly improved by
replacing the continuous attribute with its discrete values. For example, age values (1-10, 11-20, ...) can be replaced by labels
such as young, middle-aged, and senior.
4. Attribute Construction: New attributes are created from the given set of attributes and applied to assist the mining
process. This simplifies the original data and makes mining more efficient.
5. Generalization: It converts low-level data attributes to high-level data attributes using a concept hierarchy. For example,
age initially in numerical form (22, 25) is converted into a categorical value (young, old). Similarly, categorical attributes
such as house addresses may be generalized to higher-level definitions, such as town or country.
6. Normalization: Data normalization involves converting all data variables into a given range. Techniques that are used for
normalization are:
Min-Max Normalization:
This transforms the original data linearly.
Suppose min_A is the minimum and max_A is the maximum value of an attribute A. A value v is mapped to the new
range [new_min_A, new_max_A] by
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
where v is the value you want to place in the new range and v' is the new value you get after normalizing the old value.
Z-Score Normalization:
In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the
mean (mean_A) and standard deviation (std_A) of A.
A value v of attribute A is normalized to v' by computing
v' = (v - mean_A) / std_A
Decimal Scaling:
It normalizes the values of an attribute by shifting the position of their decimal point.
The number of places the decimal point is moved is determined by the maximum absolute value of attribute A.
A value v of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1.
Suppose the values of an attribute A vary from -99 to 99. The maximum absolute value of A is 99.
To normalize the values, we divide the numbers by 100 (i.e., j = 2, the number of digits in the largest absolute value),
so that the values come out as 0.99, 0.98, 0.97, and so on.
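To make the three techniques concrete, here is a minimal NumPy sketch with made-up attribute values; the formulas simply follow the definitions above:

import numpy as np

# Hypothetical attribute values (made up for illustration).
v = np.array([-99.0, -45.0, 0.0, 37.0, 99.0])

# Min-Max normalization to the range [0, 1]: v' = (v - min) / (max - min)
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: v' = (v - mean) / standard deviation
z_score = (v - v.mean()) / v.std()

# Decimal scaling: v' = v / 10**j, with j chosen so that max(|v'|) < 1 for these values
j = int(np.ceil(np.log10(np.abs(v).max())))
decimal_scaled = v / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")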
ADVANTAGES AND DISADVANTAGES OF DATA TRANSFORMATION:
Advantages:
1. Improves Data Quality: Data transformation helps to improve the quality of data by removing errors,
inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration of data from multiple sources, which
can improve the accuracy and completeness of the data.
3. Improves Data Analysis: Data transformation helps to prepare the data for analysis and modeling by
normalizing, reducing dimensionality, and discretizing the data.
4. Increases Data Security: Data transformation can be used to mask sensitive data, or to remove sensitive
information from the data, which can help to increase data security.
5. Enhances Data Mining Algorithm Performance: Data transformation can improve the performance of data
mining algorithms by reducing the dimensionality of the data and scaling the data to a common range of values.
Disadvantages:
1. Time-consuming: Data transformation can be a time-consuming process, especially when dealing with large
datasets.
2. Complexity: Data transformation can be a complex process, requiring specialized skills and knowledge to
implement and interpret the results.
3. Data Loss: Data transformation can result in data loss, such as when discretizing continuous data, or when
removing attributes or features from the data.
4. Biased transformation: Data transformation can result in bias, if the data is not properly understood or used.
5. High cost: Data transformation can be an expensive process, requiring significant investments in hardware,
software, and personnel.
4. Data Reduction:
Data reduction techniques ensure the integrity of data while reducing its volume. Data reduction is a process that reduces
the volume of the original data and represents it in a much smaller form. Data reduction techniques are used to obtain
a reduced representation of the dataset that is much smaller in volume while maintaining the integrity of the original data.
By reducing the data, the efficiency of the data mining process is improved while producing the same analytical results.
Data reduction does not materially affect the result obtained from data mining: the result obtained before and after
data reduction is the same or almost the same.
Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated
and computationally expensive algorithms. The reduction may be in terms of the number of rows (records) or the number of
columns (dimensions).
The following are the main techniques or methods of data reduction in data mining:
1. Dimensionality Reduction
Whenever we encounter weakly relevant attributes, we keep only the attributes required for our analysis. Dimensionality
reduction eliminates such attributes from the data set under consideration, thereby reducing the volume of the original data.
It reduces data size as it eliminates outdated or redundant features. Three common methods of dimensionality reduction are
listed below (a PCA sketch follows the list):
i. Wavelet Transform:
ii. Principal Component Analysis:
iii. Attribute Subset Selection:
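As an illustration of dimensionality reduction, here is a minimal Principal Component Analysis sketch, assuming scikit-learn is installed and using its bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load a small example dataset: 150 samples with 4 numeric attributes each.
X = load_iris().data

# Keep only 2 principal components, reducing 4 columns (dimensions) to 2.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)                 # (150, 4) -> (150, 2)
print("variance retained:", pca.explained_variance_ratio_.sum())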
2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it in a much smaller form. This technique
includes two types: parametric and non-parametric numerosity reduction.
i. Parametric: Parametric numerosity reduction stores only the data parameters instead of the original
data. One method of parametric numerosity reduction is the regression and log-linear method.
o Regression and Log-Linear:
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. The non-
parametric technique results in a more uniform reduction, irrespective of data size, but it may not achieve as
high a volume of data reduction as the parametric technique. Common non-parametric data
reduction techniques include Histogram, Clustering, Sampling, Data Cube Aggregation, and Data Compression.
o Histogram:
o Clustering:
o Sampling: One of the methods used for data reduction is sampling, as it can reduce the large data
set into a much smaller data sample. Below we will discuss the different methods in which we can
sample a large data set D containing N tuples:
1. Simple random sample without replacement (SRSWOR) of size s: s tuples are drawn from D, and
once a tuple is drawn it cannot be drawn again.
2. Simple random sample with replacement (SRSWR) of size s: s tuples are drawn from D, and a drawn
tuple is placed back into D so that it may be drawn again.
3. Cluster sample: The tuples in data set D are clustered into M mutually disjoint subsets. Data
reduction can then be applied by implementing SRSWOR on these clusters: a simple random
sample of s clusters is drawn, where s < M.
4. Stratified sample: The large data set D is partitioned into mutually disjoint sets called
'strata'. A simple random sample is taken from each stratum to obtain stratified data. This
method is effective for skewed data. (A sampling sketch follows below.)
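Here is a minimal sketch of SRSWOR, SRSWR, and stratified sampling, assuming pandas and NumPy; the data set D is randomly generated purely for illustration:

import pandas as pd
import numpy as np

# Hypothetical dataset D with N = 1000 tuples and a skewed 'segment' column.
rng = np.random.default_rng(0)
D = pd.DataFrame({
    "value": rng.normal(size=1000),
    "segment": rng.choice(["A", "B", "C"], size=1000, p=[0.7, 0.2, 0.1]),
})

# SRSWOR: simple random sample of size s = 100 without replacement.
srswor = D.sample(n=100, replace=False, random_state=0)

# SRSWR: simple random sample of size s = 100 with replacement (a tuple may be drawn again).
srswr = D.sample(n=100, replace=True, random_state=0)

# Stratified sample: take 10% from each stratum ('segment'), preserving the skewed proportions.
stratified = D.groupby("segment").sample(frac=0.1, random_state=0)

print(len(srswor), len(srswr), stratified["segment"].value_counts().to_dict())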
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data cube aggregation is a multidimensional aggregation
that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to the year 2022. If you
want to get the annual sale per year, you just have to aggregate the sales per quarter for each year. In this way,
aggregation provides you with the required data, which is much smaller in size, and thereby we achieve data reduction
even without losing any data.
Data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The data cube
presents precomputed and summarized data, which gives data mining fast access to the results.
4. Data Compression
Data compression employs modification, encoding, or converting the structure of data in a way that consumes less
space. Data compression involves building a compact representation of information by removing redundancy and
representing data in binary form. Compression from which the original data can be restored exactly is called lossless
compression. In contrast, compression from which the original form cannot be fully restored is lossy compression.
Dimensionality and numerosity reduction methods are also used for data compression.
This technique reduces the size of files using different encoding mechanisms, such as Huffman Encoding and Run-
Length Encoding. We can divide it into two types based on the compression technique:
i. Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple and minimal data size
reduction. Lossless data compression uses algorithms to restore the precise original data from the compressed
data.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from the original data but
is still useful enough to retrieve information from it. For example, the JPEG image format uses lossy
compression, but we can still recover meaning equivalent to the original image. Methods such as the Discrete
Wavelet Transform and PCA (Principal Component Analysis) are examples of this kind of compression.
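To make the lossless case concrete, here is a minimal Run-Length Encoding sketch in plain Python (RLE is one of the encoding mechanisms named above); the input string is made up for illustration:

from itertools import groupby

# Run-Length Encoding: a simple lossless compression scheme.
def rle_encode(text):
    # Collapse each run of identical characters into a (character, run length) pair.
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    # Expand each (character, run length) pair back to the original run.
    return "".join(ch * count for ch, count in pairs)

data = "aaaabbbccd"
encoded = rle_encode(data)
print(encoded)                      # [('a', 4), ('b', 3), ('c', 2), ('d', 1)]
print(rle_decode(encoded) == data)  # True: the original data is restored exactly (lossless)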
5. Discretization Operation
The data discretization technique is used to divide continuous attributes into data with intervals. We
replace many continuous values of the attributes with labels of small intervals. This means that mining results are presented
in a concise and easily understandable way.
i. Top-down discretization: If you first consider one or a couple of points (so-called breakpoints or split points)
to divide the whole set of attributes and repeat this method up to the end, then the process is known as top-
down discretization, also known as splitting.
ii. Bottom-up discretization: If you first consider all the continuous values as split points and then discard
some of them by merging neighbouring values into intervals, the process is called bottom-up
discretization, also known as merging.
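Here is a minimal discretization sketch, assuming pandas; the age values, interval boundaries, and labels are illustrative assumptions:

import pandas as pd

# Hypothetical ages (made up for illustration).
ages = pd.Series([7, 15, 22, 25, 38, 47, 51, 64, 70])

# Equal-width intervals: replace continuous values with interval labels.
intervals = pd.cut(ages, bins=[0, 10, 20, 30, 40, 50, 60, 80])

# Domain-defined labels such as young / middle-aged / senior.
labels = pd.cut(ages, bins=[0, 30, 55, 100], labels=["young", "middle-aged", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "label": labels}))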
The main benefit of data reduction is simple: the more data you can fit into a terabyte of disk space, the less capacity
you will need to purchase. A further benefit is the following:
Data reduction greatly increases the efficiency of a storage system and directly impacts your total spending on capacity.
Data Augmentation
Data augmentation is a set of techniques to artificially increase the size of a dataset by modifying copies of existing data or
synthetically generating new copies from the existing dataset. While training machine learning models, it acts as a regularizer
and reduces overfitting.
It improves model accuracy: it reduces overfitting and increases accuracy on unseen data.
Companies can leverage data augmentation to reduce reliance on training data collection and preparation and to build more
accurate machine learning models faster.
Data augmentation is a set of techniques to artificially increase the amount of data by generating new data points from
existing data. This includes making small changes to data or using deep learning models to generate new data points.
Data augmentation is useful to improve the performance and outcomes of machine learning models by forming new and different
examples to train datasets. If the dataset in a machine learning model is rich and sufficient, the model performs better and more
accurately.
For machine learning models, collecting and labeling data can be exhausting and costly processes. Transformations in datasets by
using data augmentation techniques allow companies to reduce these operational costs.
One of the steps in a data model is cleaning data which is necessary for high-accuracy models. However, if cleaning reduces the
representability of data, then the model cannot provide good predictions for real-world inputs. Data augmentation techniques can
enable machine learning models to be more robust by creating variations that the model may see in the real world.
For data augmentation, making simple alterations on visual data is popular. In addition, generative adversarial networks
(GANs) are used to create new synthetic data. Classic image processing activities for data augmentation are:
padding
random rotation
re-scaling
vertical and horizontal flipping
translation (the image is moved along the X or Y direction)
cropping
zooming
darkening & brightening / color modification
grayscaling
changing contrast
adding noise
random erasing
Adversarial training / adversarial machine learning: It generates adversarial examples that disrupt a machine
learning model and injects them into the training dataset.
Generative adversarial networks (GANs): GAN algorithms can learn patterns from input datasets and automatically
create new examples which resemble training data.
Neural style transfer: Neural style transfer models can blend content image and style image and separate style from
content.
Reinforcement learning: Reinforcement learning models train software agents to attain their goals and make decisions
in a virtual environment.
Popular open-source Python packages for data augmentation in computer vision are Keras ImageDataGenerator, scikit-image
(skimage), and OpenCV.
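As a hedged sketch of the classic image augmentations listed above, assuming TensorFlow/Keras (and SciPy, which its image transforms rely on) is installed; the random image batch and the chosen transformation ranges are made up for illustration:

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# A hypothetical batch of 8 RGB images, 64x64 pixels, with random pixel values.
images = np.random.rand(8, 64, 64, 3)

# Configure classic augmentations: rotation, shifts, horizontal flipping, and zoom.
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.2,
)

# Each call to next() yields a newly transformed batch derived from the originals.
augmented_batch = next(datagen.flow(images, batch_size=8, shuffle=False))
print(augmented_batch.shape)  # (8, 64, 64, 3)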
Data augmentation is not as popular in the NLP domain as in the computer vision domain. Augmenting text data is difficult
due to the complexity of language. Common methods for data augmentation in NLP are:
Easy Data Augmentation (EDA) operations: synonym replacement, word insertion, word swap and word deletion
Back translation: re-translating text from the target language back to its original language
Contextualized word embeddings
Generating synthetic data is one way to augment data. There are other approaches (e.g. making minimal changes to existing data
to create new data) for data augmentation as outlined above.
Companies need to build evaluation systems for the quality of augmented datasets. As the use of data augmentation
methods increases, assessing the quality of their output will be required.
The data augmentation domain needs new research and studies to create new/synthetic data for advanced
applications. For example, generating high-resolution images using GANs can be challenging.
If a real dataset contains biases, data augmented from it will contain biases too. So, identifying an optimal data
augmentation strategy is important.
Image recognition and NLP models generally use data augmentation methods. The medical imaging domain also utilizes data
augmentation to apply transformations on images and create diversity in its datasets.
Normalization is one of the most frequently used data preparation techniques; it helps us change the values of numeric
columns in a dataset to a common scale.
Normalization is not mandatory for every dataset in machine learning; it is used whenever the attributes of a
dataset have different ranges. It helps enhance the performance and reliability of a machine learning model.
Normalization is a scaling technique in Machine Learning applied during data preparation to change the values of numeric columns
in the dataset to use a common scale. It is not necessary for all datasets in a model. It is required only when features of machine
learning models have different ranges.
Min-Max normalization computes
Xn = (X - Xminimum) / (Xmaximum - Xminimum)
Case 1 - If the value of X is the minimum, the numerator is 0; hence the normalized value is also 0: Xn = 0.
Case 2 - If the value of X is the maximum, the numerator equals the denominator; hence the normalized value is 1: Xn = 1.
Case 3 - If the value of X is neither the maximum nor the minimum, the normalized value lies between 0 and 1.
Hence, normalization can be defined as a scaling method where values are shifted and rescaled so that they range between 0
and 1; in other words, it is referred to as the Min-Max scaling technique.
Although there are many feature normalization techniques in Machine Learning, a few of them are most frequently used. These
are as follows:
o Min-Max Scaling: This technique is also referred to as scaling. As we have already discussed above, the Min-Max
scaling method helps the dataset to shift and rescale the values of their attributes, so they end up ranging between 0 and
1.
o Standardization scaling: Standardization scaling is also known as Z-score normalization, in which values are centered
around the mean with a unit standard deviation; this means the mean of the attribute becomes zero and the resulting
distribution has a unit standard deviation. Mathematically, we calculate the standardized value by subtracting the mean
from the feature value and dividing by the standard deviation.
Here, µ represents the mean of feature value, and σ represents the standard deviation of feature values.
However, unlike the Min-Max scaling technique, feature values are not restricted to a specific range in the standardization
technique. This technique is helpful for machine learning algorithms that use distance measures, such as KNN, K-means
clustering, and Principal Component Analysis. Further, standardization implicitly assumes that the data is approximately
normally distributed.
Normalization vs. Standardization:
o Normalization uses the minimum and maximum values for scaling; Standardization uses the mean and standard deviation.
o Normalization is helpful when features are on different scales; Standardization is helpful when a variable should have a
mean of 0 and a standard deviation of 1.
o Normalization scales values into the range [0, 1] or [-1, 1]; standardized values are not restricted to a specific range.
o Normalization is affected by outliers; Standardization is comparatively less affected by outliers.
o Scikit-Learn provides the MinMaxScaler transformer for Normalization and the StandardScaler transformer for
Standardization.
o Normalization is also called scaling normalization; Standardization is known as Z-score normalization.
o Normalization is useful when the feature distribution is unknown; Standardization is useful when the feature distribution
is normal (Gaussian).
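Here is a minimal sketch of both scalers, assuming scikit-learn is installed; the age/salary matrix is made up for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: age in years and salary in dollars (very different ranges).
X = np.array([[25, 30000],
              [40, 52000],
              [58, 75000],
              [33, 41000]], dtype=float)

# Normalization: rescale every column into [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: centre every column at mean 0 with unit standard deviation.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)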
Which is more suitable for our machine learning model, Normalization or Standardization? This is a common source of
confusion among data scientists and engineers. Although both terms have almost the same meaning, the choice between
normalization and standardization depends on your data, the machine learning problem, and the algorithm you are using in
your model.
1. Normalization is a transformation technique that helps to improve the performance as well as the accuracy of your model.
Normalization is useful when you don't know the feature distribution exactly, in other words, when the feature
distribution of the data does not follow a Gaussian (bell curve) distribution. Normalization forces values into a bounded
range, so if you have outliers in the data, they will be affected by normalization.
Further, it is also useful for scale-sensitive techniques such as KNN and artificial neural networks, since you can't make
assumptions about the distribution of the data.
2. Standardization is useful when you know the feature distribution of the data, in other words, when your data follows a
Gaussian distribution (although this does not have to be strictly true). Unlike normalization, standardization does not
impose a bounding range, so outliers in your data are not distorted as strongly by standardization.
Further, it is also useful for techniques that assume normally distributed inputs, such as linear regression, logistic
regression, and linear discriminant analysis.
Example: Let's understand an experiment where we have a dataset having two attributes, i.e., age and salary. Where the age ranges
from 0 to 80 years old, and the income varies from 0 to 75,000 dollars or more. Income is assumed to be 1,000 times that of age.
As a result, the ranges of these two attributes are much different from one another.
Because of its larger values, the income attribute will naturally influence the result more when we undertake further
analysis, such as multivariate linear regression. However, this does not necessarily imply that it is a better predictor. As a
result, we normalize the data so that all of the variables are on the same scale.
Further, it is also helpful for the prediction of credit risk scores where normalization is applied to all numeric data except the class
column. It uses the tanh transformation technique, which converts all numeric features into values of range between 0 to 1.
Bias and Variance in Machine Learning
Machine learning is a branch of Artificial Intelligence which allows machines to perform data analysis and make
predictions. However, if the machine learning model is not accurate, it can make prediction errors, and these prediction
errors are usually known as bias and variance. In machine learning, these errors will always be present, as there is always
a slight difference between the model's predictions and the actual values. The main aim of ML/data science analysts is
to reduce these errors in order to get more accurate results.
In machine learning, an error is a measure of how accurately an algorithm can make predictions for the previously
unknown dataset. On the basis of these errors, the machine learning model is selected that can perform best on the
particular dataset. There are mainly two types of errors in machine learning, which are:
o Reducible errors: These errors can be reduced to improve the model accuracy. Such errors can further be
classified into bias and variance.
o Irreducible errors: These errors will always be present in the model regardless of which algorithm has been used.
They are caused by unknown variables whose influence can't be reduced.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in it, and makes predictions. While training, the
model learns these patterns in the dataset and applies them to test data for prediction. While making predictions, a
difference occurs between the values predicted by the model and the actual/expected values, and this
difference is known as bias error, or error due to bias. Bias can be defined as the inability of machine learning
algorithms such as Linear Regression to capture the true relationship between the data points. Each algorithm begins
with some amount of bias, because bias arises from assumptions in the model that make the target function simpler
to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about the form of the target function.
o High Bias: A model with a high bias makes more assumptions, and the model becomes unable to capture the
important features of our dataset. A high bias model also cannot perform well on new data.
Generally, a linear algorithm has high bias, as this is what makes it learn fast. The simpler the algorithm, the more bias
it is likely to introduce. A nonlinear algorithm, in contrast, often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support
Vector Machines. At the same time, an algorithm with high bias is Linear Regression, Linear Discriminant Analysis
and Logistic Regression.
High bias mainly occurs due to an overly simple model. It can typically be reduced by using a more complex model,
increasing the number of input features, or decreasing regularization.
What is Variance?
Variance specifies how much the prediction would vary if different training data were used. In simple
words, variance tells how much a random variable differs from its expected value. Ideally, a model should
not vary too much from one training dataset to another, which means the algorithm should be good at capturing
the hidden mapping between input and output variables. Variance errors are either low variance or high variance.
Low variance means there is a small variation in the prediction of the target function with changes in the training data
set. At the same time, High variance shows a large variation in the prediction of the target function with changes in
the training dataset.
A model that shows high variance learns a lot from the training dataset and performs well on it, but does not generalize well
to unseen data. As a result, such a model gives good results with the training dataset but shows high error
rates on the test dataset.
Since, with high variance, the model learns too much from the dataset, it leads to overfitting of the model.
Usually, nonlinear algorithms, which have a lot of flexibility to fit the model, have high variance.
Some examples of machine learning algorithms with low variance are, Linear Regression, Logistic Regression, and
Linear discriminant analysis. At the same time, algorithms with high variance are decision tree, Support Vector
Machine, and K-nearest neighbours.
There are four possible combinations of bias and variances, which are represented by the below diagram:
1. Low-Bias, Low-Variance:
The combination of low bias and low variance shows an ideal machine learning model. However, it is not
possible practically.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate
on average. This case occurs when the model learns with a large number of parameters and hence leads to
overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on
average. This case occurs when a model does not learn well from the training dataset or uses a small number of
parameters. It leads to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also inaccurate on average. Such a model
typically shows high training error, with test error similar to the training error.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and variance in order to avoid
overfitting and underfitting in the model. If the model is very simple with fewer parameters, it may have low variance
and high bias. Whereas, if the model has a large number of parameters, it will have high variance and low bias. So, it is
required to make a balance between bias and variance errors, and this balance between the bias error and variance
error is known as the Bias-Variance trade-off.
For an accurate prediction of the model, algorithms need low variance and low bias. But this is not entirely possible, because
bias and variance are related to each other: decreasing variance tends to increase bias, and decreasing bias tends to
increase variance.
The bias-variance trade-off is a central issue in supervised learning. Ideally, we need a model that accurately captures the
regularities in the training data and simultaneously generalizes well to unseen data. Unfortunately, it is not possible to do
both perfectly at the same time: a high-variance algorithm may perform well on training data, but it may
overfit to noisy data, whereas a high-bias algorithm produces a much simpler model that may not even capture
important regularities in the data. So, we need to find a sweet spot between bias and variance to make an optimal
model.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and variance
errors.
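To illustrate the trade-off, here is a minimal sketch that fits polynomial regressions of increasing degree, assuming scikit-learn and NumPy are installed; the synthetic sine data and the chosen degrees are illustrative assumptions (degree 1 tends to underfit / high bias, degree 15 tends to overfit / high variance):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic, slightly noisy sine data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# degree 1: high bias (underfits); degree 15: high variance (overfits); degree 4: a balance.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")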
AI describes machines that can perform tasks resembling those of humans, so AI implies machines that artificially model human
intelligence. AI systems help us manage, model, and analyze complex systems. AI is the superset, with ML and DL as subsets.
Artificial intelligence, which encompasses machine learning, neural networks and deep learning, aims to replicate human
decision and thought processes. Basically, AI is a collection of mathematical algorithms that make computers understand
complex relationships, make actionable decisions, and plan for the future.
AI enables computers to interpret the environment around them and make decisions based on what they observe. With a machine
learning component, AI can enable machines to adjust their “knowledge” based on new input.
AI can be used for manufacturing process improvement, processing biomedical and clinical data, creating “smart” assistants or
chat bots, social media monitoring, financial planning or investing, and many other areas.
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think,
learn, and problem-solve. The goal of AI is to create systems that can perform tasks that typically require human
intelligence. These tasks include reasoning, problem-solving, perception, learning, language understanding, and even
the ability to manipulate objects.
1. Narrow or Weak AI (ANI): This type of AI is designed and trained for a particular task. It excels in
performing specific tasks but lacks the broad cognitive abilities of a human. Examples include voice
assistants, image recognition systems, and recommendation algorithms.
2. General or Strong AI (AGI): This type of AI would possess the ability to understand, learn, and apply
knowledge across a wide range of tasks—similar to human intelligence.
Machine Learning (ML): A subset of AI that involves the use of algorithms and statistical models to enable
systems to improve their performance on a task over time without being explicitly programmed. Types of
machine learning include supervised learning, unsupervised learning, and reinforcement learning.
Natural Language Processing (NLP): NLP involves the interaction between computers and humans using
natural language. It enables machines to understand, interpret, and generate human-like text.
Computer Vision: This field focuses on enabling machines to interpret and make decisions based on visual
data, such as images or videos. Applications include facial recognition, image classification, and object
detection.
Robotics: AI is used in robotics to control and guide autonomous systems, allowing them to perform tasks
in the physical world.
Expert Systems: These are computer systems that mimic the decision-making ability of a human expert.
They use knowledge bases and inference engines to solve specific problems within a particular domain.
AI has the potential to revolutionize various industries, from healthcare and finance to transportation and
entertainment. As technology advances, AI continues to evolve, and researchers and engineers work towards creating
more sophisticated and capable systems. It's important to consider ethical implications and responsible AI
development to ensure that AI technologies are used for the benefit of society.
What is ML
ML uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned.
Machine learning (ML) is considered a sub-set of AI and is often used to implement AI. Instead of explicitly writing algorithms
to dictate a computer’s actions, machine learning is used to "train" the computer to find the right way of solving a task given
many examples of the correct solution to a given problem. Once the model is mature enough to give reliable and high accuracy
results, it can be deployed to a production setup where it can be used to solve new problems such as predictions or classification.
A number of different regression and clustering algorithms are used in ML, such as simple linear regression, polynomial
regression, partial least squares regression (including OPLS and PLS), support vector regression, decision tree regression,
random forest regression, K-nearest neighbors, and others. ML is often used for pattern discovery (to find hidden patterns
in a dataset) and to make meaningful predictions.
Machine learning can be used to analyze business trends or make financial predictions, create simulations and safety models,
review CT scans and support diagnosis, and solve engineering problems in auto manufacturing
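As a minimal sketch of this train-then-predict workflow, assuming scikit-learn is installed and using its bundled iris dataset together with the K-nearest neighbours algorithm mentioned above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# "Train" the computer from labelled examples instead of hand-writing rules.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)   # one of the algorithms listed above
model.fit(X_train, y_train)                   # learn from the training examples

# Once the model is reliable enough, it can be applied to new, unseen data (here the held-out test set).
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))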
What is DL
DL structures algorithms in layers to create an artificial neural network (ANN) that can learn and make intelligent decisions
on its own. DL is a subfield of machine learning. While both fall under the broad category of AI, DL is what powers the most
human-like AI.
Deep learning (DL) is an advancement of ML using a specific type of algorithm called deep neural network models. DL excels at
learning how to represent unstructured data, such as images, text, protein structures, genome sequences, etc, in a way that is
useful for prediction. It can be used when the training datasets are very large, and the relationships to learn are very complex,
such as in medical science or with self-driving cars. Most virtual assistants, such as Alexa and Siri, use deep learning to
understand requests (using Natural Language Processing, NLP), and social networks use DL to analyze the contents of all images
you upload (using Computer Vision, CV).
Deep Learning (DL) is a subset of machine learning that involves artificial neural networks with multiple layers (deep
neural networks). These networks, also known as deep neural networks or simply deep networks, attempt to simulate
the human brain's architecture to "learn" from large amounts of data. The term "deep" refers to the depth of the
network, which comes from having multiple layers, each comprising interconnected nodes (neurons).
1. Neural Networks: Deep learning is based on artificial neural networks, which are inspired by the structure
and functioning of the human brain. Neural networks consist of layers
of interconnected nodes, with each connection having an associated weight.
2. Deep Neural Networks: Unlike traditional machine learning models with a small number of layers (shallow
architectures), deep learning models have multiple layers, allowing them to learn hierarchical features and
representations from the data.
3. Feature Learning: Deep learning models automatically learn hierarchical representations or features from
the raw data. This eliminates the need for manual feature engineering, as the model learns to extract relevant
features during the training process.
4. Representation Learning: DL focuses on learning data representations as opposed to task-specific
algorithms. The idea is that better representations lead to improved performance on a wide range of tasks.
5. End-to-End Learning: Deep learning models aim to learn directly from raw data to outputs without relying
on handcrafted intermediate representations or features. This is known as end-to-end learning.
6. Complex Architectures: Various deep learning architectures have been developed, including Convolutional
Neural Networks (CNNs) for image-related tasks, Recurrent Neural Networks (RNNs) for sequential data, and
Transformer architectures for natural language processing and other tasks.
7. Deep Learning Applications: DL has been successful in a wide range of applications, including computer
vision (image and video analysis), natural language processing (language translation, sentiment analysis),
speech recognition, recommendation systems, and autonomous vehicles.
Popular deep learning frameworks, such as TensorFlow and PyTorch, provide tools and libraries to build, train, and
deploy deep learning models. The availability of powerful hardware, such as Graphics Processing Units (GPUs) and
TPUs (Tensor Processing Units), has also contributed to the success of deep learning by enabling faster training times
for complex models.
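Here is a minimal deep-network sketch, assuming TensorFlow/Keras is installed; the random data and the layer sizes are made up purely for illustration:

import numpy as np
from tensorflow import keras

# Hypothetical data: 200 samples with 10 features and a binary label (made up for illustration).
X = np.random.rand(200, 10)
y = (X.sum(axis=1) > 5).astype("float32")

# A small deep network: several stacked (hidden) layers of interconnected neurons.
model = keras.Sequential([
    keras.layers.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# End-to-end learning: weights are adjusted directly from raw inputs to outputs.
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy] on the (illustrative) data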
It's important to note that while deep learning has achieved remarkable success in various domains, it requires large
amounts of labeled data and computational resources, and the interpretability of deep models can be a challenge in
certain applications.
What is DS?
DS is a broad field that spans the collection, management, analysis, and interpretation of large amounts of data, with a wide
range of applications. It integrates all the terms above: it summarizes or extracts insights from data (exploratory data
analysis, EDA) and makes predictions from large datasets (predictive analysis).
The field involves many different disciplines and tools, including statistical inference, domain knowledge (expertise), data
visualization, experiment design, and communication. Data science helps answer "what if" questions, and it plays a crucial
role in building ML and AI systems, and vice versa.
Data science is a broad field of study pertaining to data systems and processes, aimed at maintaining data sets and deriving
meaning out of them. It incorporates techniques from statistics and mathematics, such as data mining, multivariate data
analysis, and visualization, along with computer science and machine learning, to draw knowledge from data and provide
both insights and decision paths. It is an area that is being used successfully by many businesses to improve production
processes, enable strategic planning, and innovate product design.
Data scientists are analytical data experts who have the technical skills to uncover data trends, as well as specific domain
knowledge for their industry that helps them solve complex business problems. Many data scientists start out as
mathematicians, statisticians, or data analysts, but may evolve into roles that incorporate big data, artificial intelligence, or
process technologies. A good data scientist understands his or her problem domain very well: what questions to answer
and the peculiarities of the data associated with it.
Data analytics is the discipline of analyzing raw data in order to make conclusions about that information. Data analytics is a
broad term that encompasses a number of diverse techniques to get insights that can be used to optimize processes or to increase
the overall efficiency of a business or system.
Data analytics techniques can reveal trends and metrics that would otherwise be lost in a mass of information.
One way data science is different from AI and ML is that a human is involved. A person is using data analytics to gain insight and
understanding and forming conclusions to make decisions.
Data scientists deal with huge chunks of data to analyze the patterns, trends and more. Using data analytics, data scientists can
generate reports that are used to draw inferences. Data analytics tools and software are useful in this process and can be used to
make predictions based on patterns.
Two areas that fall under this field and are being implemented in many industries ranging from pharma to chemicals to energy to
food and beverage are predictive analytics and real-time analytics.
Predictive analytics: These are models that predict the possibilities of a particular event happening in the future.
Real-time analytics: Data analytics can also be used to detect deviations in a process by modeling against historical
parameters in real time. This is also a form of machine learning.
Data Science is an interdisciplinary field that involves the extraction of knowledge and insights from structured and
unstructured data. It combines techniques from statistics, mathematics, computer science, and domain-specific
knowledge to analyze and interpret complex datasets. The primary goal of data science is to uncover hidden patterns,
trends, and valuable information that can support decision-making processes, predictions, and the development of
data-driven applications.
1. Data Collection: Gathering relevant data from various sources, which can include databases, files, APIs,
sensors, and more.
2. Data Cleaning and Preprocessing: Ensuring data quality by handling missing values, dealing with outliers,
and transforming data into a suitable format for analysis.
3. Exploratory Data Analysis (EDA): Investigating and summarizing the main characteristics of the data using
statistical and visual methods to gain insights.
4. Feature Engineering: Creating new features or transforming existing ones to improve the performance of
machine learning models.
5. Model Building and Machine Learning: Developing and training models that can make predictions or
classifications based on the data. This involves selecting appropriate algorithms, tuning parameters, and
evaluating model performance.
6. Data Visualization: Presenting data in a visual format to facilitate understanding and communication of
findings.
7. Statistical Analysis: Applying statistical methods to test hypotheses, validate assumptions, and derive
meaningful conclusions from the data.
8. Big Data Technologies: Handling and analyzing large volumes of data using distributed computing
frameworks like Apache Hadoop and Apache Spark.
9. Domain Knowledge Integration: Incorporating subject matter expertise into the analysis to ensure that
insights align with the specific context or industry.
10. Communication of Results: Effectively communicating findings and insights to stakeholders using reports,
visualizations, or other means.
Data science is widely applied in various industries, including finance, healthcare, marketing, technology, and more.
Professionals in the field, known as data scientists, use a combination of programming skills, statistical knowledge,
and domain expertise to extract valuable information from data and contribute to informed decision-making within
organizations.
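As a toy illustration of the early steps of this workflow (a stand-in for collection, then cleaning and exploratory data analysis), assuming pandas is installed; the tiny table is made up for illustration:

import pandas as pd

# A tiny, made-up dataset standing in for collected raw data.
df = pd.DataFrame({
    "region": ["north", "south", "south", "east", "north", "east"],
    "sales":  [120, 85, None, 140, 95, 160],
    "visits": [30, 22, 25, 41, 28, 45],
})

# Cleaning / preprocessing: handle the missing sales value.
df["sales"] = df["sales"].fillna(df["sales"].median())

# Exploratory data analysis: summary statistics and a simple aggregation.
print(df.describe())
print(df.groupby("region")["sales"].mean())

# A first look at relationships in the data (correlation between visits and sales).
print(df["sales"].corr(df["visits"]))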
UNIT : 2 CLUSTERING
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset. It can be defined
as "A way of grouping the data points into different clusters, consisting of similar data points. The objects with
the possible similarities remain in a group that has less or no similarities with another group."
It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color, behavior, etc., and divides
them as per the presence and absence of those similar patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with the unlabeled
dataset.
After applying this clustering technique, each cluster or group is given a cluster ID. The ML system can use this ID
to simplify the processing of large and complex datasets.
Note: Clustering is somewhere similar to the classification algorithm, but the difference is the type of dataset
that we are using. In classification, we work with the labeled data set, whereas in clustering, we work with the
unlabelled dataset.
Example: Let's understand the clustering technique with a real-world example of a mall. When we visit a shopping
mall, we can observe that things with similar usage are grouped together: t-shirts are grouped in one
section and trousers in another; similarly, in the fruit and vegetable sections, apples, bananas, mangoes, etc., are grouped
separately so that we can easily find things. The clustering technique works in the same way.
Other examples of clustering include grouping documents according to topic.
The clustering technique can be widely used in various tasks. Some most common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to provide
recommendations based on users' past product searches. Netflix also uses this technique to recommend movies and
web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see the different fruits are divided into
several groups with similar properties.
Clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft
clustering (a data point can belong to more than one group). But various other approaches to clustering
also exist. Below are the main clustering methods used in Machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based
method. The most common example of partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define the number of pre-defined groups.
The cluster center is created in such a way that the distance between the data points of one cluster is minimum as
compared to another cluster centroid.
Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the arbitrarily shaped
distributions are formed as long as the dense region can be connected. This algorithm does it by identifying different
clusters in the dataset and connects the areas of high densities into clusters. The dense areas in data space are divided
from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point
belongs to a particular distribution. The grouping is done by assuming some distribution, commonly the Gaussian
distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that uses Gaussian Mixture Models
(GMM).
Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is no requirement of pre-
specifying the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-
like structure, which is also called a dendrogram. The observations or any number of clusters can be selected by cutting
the tree at the correct level. The most common example of this method is the Agglomerative Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each
dataset has a set of membership coefficients, which depend on the degree of membership to be in a cluster. Fuzzy C-
means algorithm is the example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
Clustering algorithms can be grouped according to the models explained above. Many different clustering algorithms have been published, but only a few are commonly used. The choice of algorithm depends on the kind of data we are using: some algorithms require the number of clusters in the dataset to be specified in advance, whereas others work from the distances between the observations in the dataset.
Here we discuss the most popular clustering algorithms that are widely used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It classifies the dataset by dividing the samples into groups of equal variance, minimizing the within-cluster sum of squares. The number of clusters must be specified in advance. It is fast, requires relatively few computations, and has complexity linear in the number of samples, O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smoothed density of data points. It is an example of a centroid-based model that works by updating candidate centroids to be the mean of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It is an
example of a density-based model similar to the mean-shift, but with some remarkable advantages. In this
algorithm, the areas of high density are separated by the areas of low density. Because of this, the clusters can
be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where k-means fails. In GMM, the data points are assumed to be Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs the bottom-up
hierarchical clustering. In this, each data point is treated as a single cluster at the outset and then successively
merged. The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It differs from the other clustering algorithms in that it does not require the number of clusters to be specified. Instead, pairs of data points exchange messages until convergence. Its O(N²T) time complexity (for N points and T iterations) is the main drawback of this algorithm.
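To complement the k-means, GMM, DBSCAN and hierarchical sketches given earlier, the snippet below runs the two remaining algorithms from this list, mean-shift and affinity propagation, on toy data; note that neither takes the number of clusters as input, and the data itself is an illustrative assumption.

from sklearn.cluster import MeanShift, AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=3)

# Mean-shift: candidate centroids are shifted to the mean of the points in their region
ms_labels = MeanShift().fit_predict(X)

# Affinity propagation: points exchange messages until exemplars emerge
ap_labels = AffinityPropagation(random_state=0).fit_predict(X)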
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the identification of cancerous
cells. It divides the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result appears based on
the closest object to the search query. It does it by grouping similar data objects in one group that is far from
the other dissimilar objects. The accurate result of a query depends on the quality of the clustering algorithm
used.
o Customer Segmentation: It is used in market research to segment the customers based on their choice and
preferences.
o In Biology: It is used in biology to classify different species of plants and animals using image recognition techniques.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular piece of land is most suitable.
K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science. It groups an unlabeled dataset into different clusters, where K defines the number of pre-defined clusters that need to be created in the process: if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and provides a convenient way to discover the categories in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between each data point and its corresponding cluster centroid.
The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means algorithm mainly performs two tasks:
o Determines the best value for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points near a particular k-center form a cluster.
Hence each cluster has data points with some commonalities and is away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they may be points other than those in the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat step 3, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go back to step 4; otherwise the model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
o Let's take the number of clusters to be K=2, so we will try to group the dataset into two different clusters.
o We need to choose K random points or centroids to form the clusters. These points can be points from the dataset or any other points. Here we select the two points below as our K centroids, which are not part of our dataset. Consider the below image:
o Now we assign each data point of the scatter plot to its closest centroid by computing the distance between the point and each centroid. To visualize the assignment, we draw a median line between the two centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clearer visualization.
o To find better clusters, we repeat the process with new centroids. Each new centroid is chosen as the center of gravity of the points currently assigned to that cluster, as shown below:
Next, we reassign each data point to its new closest centroid. For this, we repeat the process of drawing a median line. The median line will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line and two blue points are on the right of the line. So these three points will be reassigned to the other centroid.
Since reassignment has taken place, we again go back to step 4, which is finding new centroids or K-points.
o We repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:
o With the new centroids, we again draw the median line and reassign the data points. The result is shown in the below image:
o We can see in the above image that there are no data points on the wrong side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
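The worked example above can be condensed into a short from-scratch sketch; the random initialization from the data points and the simple stopping rule mirror the steps listed earlier, and the code assumes no cluster ever becomes empty.

import numpy as np

def kmeans(X, k=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick k random points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step-3: assign each point to its closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids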
BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised algorithm that performs
hierarchical clustering over large data sets. With modifications, it can also be used to accelerate k-means clustering and
Gaussian mixture modeling with the expectation-maximization algorithm. An advantage of BIRCH is its ability to
incrementally and dynamically cluster incoming, multi-dimensional metric data points to produce the best quality
clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan
of the database.
Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle 'noise' (data
points that are not part of the underlying pattern) effectively", beating DBSCAN by two months. The BIRCH algorithm
received the SIGMOD 10 year test of time award in 2006.
Basic clustering algorithms like k-means and agglomerative clustering are the most commonly used clustering algorithms. But when clustering very large datasets, BIRCH and DBSCAN are advanced algorithms that can perform precise clustering at scale. BIRCH is also attractive because it is easy to implement. BIRCH first clusters the dataset into small summaries and then clusters those summaries; it does not cluster the raw dataset directly. This is why BIRCH is often used together with other clustering algorithms: after the summary is made, the summary can be clustered by another clustering algorithm.
It is provided as an alternative to MiniBatchKMeans. It converts the data into a tree data structure from which the centroids are read off the leaves; these centroids can be used as the final cluster centroids or as input to another clustering algorithm such as agglomerative clustering.
Stages of BIRCH
BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that another clustering algorithm can then use. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is an attribute whose values can be represented in Euclidean space, i.e., no categorical attributes should be present. The BIRCH clustering algorithm consists of two stages:
1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions called Clustering Feature
(CF) entries. Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS) where 'N' is the
number of data points in the cluster, 'LS' is the linear sum of the data points, and 'SS' is the squared sum of
the data points in the cluster. A CF entry can be composed of other CF entries. Optionally, we can condense this initial CF tree into a smaller CF tree.
2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree. A CF tree is a tree
where each leaf node contains a sub-cluster. Every entry in a CF tree contains a pointer to a child node, and a
CF entry made up of the sum of CF entries in the child nodes. Optionally, we can refine these clusters.
The BIRCH algorithm builds a tree structure over the given data called the clustering feature tree (CF tree). The algorithm is based on CF (clustering feature) entries and uses this tree-structured summary to create clusters.
In the context of the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes that hold several sub-clusters are called CF sub-clusters, and these CF sub-clusters sit in the non-terminal (internal) CF nodes.
The CF tree is a height-balanced tree that gathers and manages clustering features and holds the information about the given data needed for further hierarchical clustering. This avoids the need to work with the whole input data. In the tree, each cluster of data points is represented as a CF, i.e., by the three numbers (N, LS, SS).
The BIRCH algorithm follows four main phases: scanning the data, condensing, global clustering, and refining.
Two of these phases (condensing, which resizes the data, and refining the clusters) are optional; they come into play when more accuracy is required. Scanning the data simply loads the data into the model, and as the data is loaded the algorithm fits it into the CF tree.
In condensing, the CF tree is rebuilt so that the data fits into it better. In global clustering, the CF tree's sub-clusters are passed to an existing clustering algorithm. Finally, refining fixes the problem that points with the same value may have been assigned to different leaf nodes.
Cluster Features
BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics to represent a larger set
of data points. These summary statistics constitute a CF and represent a sufficient substitute for the actual data for
clustering purposes.
A CF is a set of three summary statistics representing a set of data points in a single cluster. These statistics are as follows:
o Count (N): the number of data points in the cluster.
o Linear Sum (LS): the sum of the individual data points, which measures the location of the cluster.
o Squared Sum (SS): the sum of the squared data points, which measures the spread of the cluster.
NOTE: Together with the count, the linear sum and the squared sum are equivalent to knowing the mean and variance of the data points.
CF Tree
The building process of the CF Tree can be summarized in the following steps, such as:
Step 1: For each given record, BIRCH compares the location of that record with the location of each CF in the root node,
using either the linear sum or the mean of the CF. BIRCH passes the incoming record to the root node CF closest to the
incoming record.
Step 2: The record then descends down to the non-leaf child nodes of the root node CF selected in step 1. BIRCH
compares the location of the record with the location of each non-leaf CF. BIRCH passes the incoming record to the
non-leaf node CF closest to the incoming record.
Step 3: The record then descends down to the leaf child nodes of the non-leaf node CF selected in step 2. BIRCH
compares the location of the record with the location of each leaf. BIRCH tentatively passes the incoming record to the
leaf closest to the incoming record.
Step 4: BIRCH then checks whether the chosen leaf can absorb the new record:
1. If the radius of the chosen leaf, including the new record, does not exceed the threshold T, the incoming record is assigned to that leaf. The leaf and its parent CFs are updated to account for the new data point.
2. If the radius of the chosen leaf, including the new record, exceeds the threshold T, a new leaf is formed, consisting of the incoming record only. The parent CFs are updated to account for the new data point.
If case 2 of step 4 is executed and the leaf node already holds the maximum of L leaves, the leaf node is split into two leaf nodes. If the parent node is full, the parent node is split as well, and so on. The most distant leaf-node CFs are used as seeds for the new leaf nodes, with the remaining CFs assigned to whichever leaf node is closer. Note that the radius of a cluster can be calculated without knowing the individual data points, as long as we have the count N, the linear sum LS, and the squared sum SS. This allows BIRCH to evaluate whether a given data point belongs to a particular sub-cluster without scanning the original dataset.
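To make the (N, LS, SS) bookkeeping concrete, here is a minimal sketch of a clustering-feature entry; the class name and its methods are hypothetical helpers written for illustration, not part of any library.

import numpy as np

class CF:
    """Clustering Feature: the ordered triple (N, LS, SS) for one sub-cluster."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1              # N: number of points
        self.ls = p.copy()      # LS: linear sum of the points
        self.ss = float(p @ p)  # SS: squared sum (sum of squared norms)

    def add(self, point):
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += float(p @ p)

    def merge(self, other):
        # CF entries are additive, so a parent CF is just the sum of its children
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of the points from the centroid, computed from
        # (N, LS, SS) alone, without revisiting the raw data
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

Because merging is just component-wise addition, every non-leaf CF in the tree can be maintained as the sum of the CFs in its child nodes.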
Once the CF tree is built, any existing clustering algorithm may be applied to the sub-clusters (the CF leaf nodes) to
combine these sub-clusters into clusters. The task of clustering becomes much easier as the number of sub-clusters is
much less than the number of data points. When a new data value is added, these statistics may be easily updated, thus
making the computation more efficient.
Parameters of BIRCH
There are three parameters in this algorithm that need to be tuned. Unlike k-means, the final number of clusters does not have to be fixed by the user; if it is not supplied, the algorithm simply returns the sub-clusters it has found.
o Threshold: the maximum radius that a sub-cluster in a leaf node of the CF tree may have; a new point that would push a sub-cluster's radius past this value starts a new sub-cluster.
o branching_factor: this parameter specifies the maximum number of CF sub-clusters in each (internal) node.
o n_clusters: the number of clusters to be returned after the entire BIRCH algorithm is complete, i.e., the number of clusters after the final clustering step. If it is set to None, the final clustering step is not performed and the intermediate sub-clusters are returned.
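A minimal usage sketch with scikit-learn's Birch estimator, which exposes exactly these three parameters, is shown below; the synthetic data and the particular parameter values are illustrative assumptions.

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=4, random_state=7)

# threshold: maximum radius of a leaf sub-cluster
# branching_factor: maximum number of CF sub-clusters per node
# n_clusters: final global clustering step (None returns the raw sub-clusters)
model = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = model.fit_predict(X)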
Advantages of BIRCH
It is local in that each clustering decision is made without scanning all data points and existing clusters. It exploits the
observation that the data space is not usually uniformly occupied, and not every data point is equally important.
It uses available memory to derive the finest possible sub-clusters while minimizing I/O costs. It is also an incremental
method that does not require the whole data set in advance.
CURE Architecture
Idea: A random sample of size 's' is drawn from the given data. This sample is split into 'p' partitions, each of size s/p. Each partition is partially clustered into s/pq clusters. Outliers are discarded from these partial clusters, the partially clustered partitions are then clustered again, and finally the data residing on disk is labeled with the resulting clusters.
Representation of partitioning and clustering
Procedure:
1. Select the target number of representative points, say 'gfg'.
2. Choose 'gfg' well-scattered points within a cluster.
3. Shrink these scattered points toward the centroid (a minimal sketch of this shrinking step is given after this list).
4. Use these points as representatives of the clusters in the 'Dmin' cluster-merging approach. In the Dmin (minimum distance) approach, the minimum distance between the scattered points inside the sample 'gfg' and the points outside the 'gfg' sample is calculated; the point with the least distance to a scattered point inside the sample, compared to the other points, is merged into the cluster.
5. After every such merge, new representative points are selected for the new cluster.
6. Cluster merging stops when the target number of clusters, say 'k', is reached.
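Below is a minimal sketch of the representative-point selection and shrinking step referenced in the procedure; the number of representatives, the shrink factor alpha, and the farthest-point selection heuristic are illustrative assumptions.

import numpy as np

def shrink_representatives(points, n_rep=4, alpha=0.3):
    """Pick well-scattered points of one cluster and shrink them toward its centroid.
    n_rep and alpha are illustrative choices, not values fixed by CURE."""
    centroid = points.mean(axis=0)
    # Start from the point farthest from the centroid, then greedily pick
    # the point farthest from the representatives chosen so far
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(n_rep, len(points)):
        d = np.min(np.linalg.norm(points[:, None, :] - np.array(reps)[None, :, :], axis=2), axis=1)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    # Shrink each representative toward the centroid by the factor alpha
    return reps + alpha * (centroid - reps)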
Suppose there are a set of data points that need to be grouped into several parts or clusters based on their similarity. In
Machine Learning, this is known as Clustering. There are several methods available for clustering:
K Means Clustering
Hierarchical Clustering
Gaussian Mixture Models
In real life, many datasets can be modeled by a Gaussian distribution (univariate or multivariate). So it is quite natural and intuitive to assume that the clusters come from different Gaussian distributions. In other words, the model tries to describe the dataset as a mixture of several Gaussian distributions. This is the core idea of this model.
In one dimension the probability density function of a Gaussian distribution is given by

$$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

where μ and σ² are respectively the mean and variance of the distribution. For a multivariate (say d-variate) Gaussian distribution, the probability density function is given by

$$ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right) $$

Here μ is a d-dimensional vector denoting the mean of the distribution and Σ is the d × d covariance matrix.
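As a quick numerical check of these densities, SciPy provides both the univariate and the multivariate Gaussian pdf; the particular values of x, μ, σ and Σ below are arbitrary examples.

import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate density at x for mean mu and variance sigma^2
x, mu, sigma = 1.0, 0.0, 2.0
print(norm.pdf(x, loc=mu, scale=sigma))          # scale is the standard deviation

# d-variate density for mean vector mu and d x d covariance matrix Sigma
mu_vec = np.zeros(2)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(multivariate_normal.pdf([1.0, -0.5], mean=mu_vec, cov=Sigma))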
Gaussian Mixture Model
Suppose there are K clusters (for the sake of simplicity, it is assumed here that the number of clusters K is known). The parameters μk and Σk then have to be estimated for each cluster k. Had there been only one distribution, they would have been estimated by the maximum-likelihood method. But since there are K such clusters, the probability density is defined as a linear combination of the densities of all K distributions, i.e.

$$ p(x) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(x \mid \mu_k, \Sigma_k) $$

where πk is the mixing coefficient for the k-th distribution. To estimate the parameters by the maximum log-likelihood method, compute p(X | μ, Σ, π) and take its logarithm:

$$ \ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \,\mathcal{N}(x_n \mid \mu_k, \Sigma_k) $$

For the log-likelihood function to be maximum, its derivatives with respect to μk, Σk and πk should be zero. Setting the derivative with respect to μk to zero and rearranging the terms gives

$$ \mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})} $$

where γ(z_nk) is the responsibility of cluster k for the point x_n. Similarly, taking the derivatives with respect to Σk and πk respectively, one obtains

$$ \Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\,(x_n-\mu_k)(x_n-\mu_k)^{\top}}{\sum_{n=1}^{N} \gamma(z_{nk})} $$

and

$$ \pi_k = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}) $$

Note: N_k denotes the effective number of sample points in the k-th cluster. Here it is assumed that there is a total of N samples and that each sample, containing d features, is denoted by x_n.
So it can be clearly seen that the parameters cannot be estimated in closed form. This is where the Expectation-
Maximization algorithm is beneficial.
The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood estimates for model parameters when the data is incomplete, has missing data points, or involves hidden (latent) variables. EM starts from some initial values for the missing quantities, uses them to estimate the hidden variables, and then uses those estimates to produce better parameter values; the two steps are repeated until the values stop changing.
In the Expectation-Maximization (EM) algorithm, the estimation step (E-step) and maximization step (M-step) are the two
most important steps that are iteratively performed to update the model parameters until the model convergence.
In the estimation step, we compute the latent variables γk, i.e., the responsibility of each component for each data point, using the current parameter values.
In the maximization step, we update the parameter values (i.e., μk, Σk and πk) using the estimated latent variables γk:
We update the mean of each cluster (μk) by taking the weighted average of the data points, using the corresponding latent variable probabilities.
We update the covariance matrix (Σk) by taking the weighted average of the squared differences between the data points and the mean, using the corresponding latent variable probabilities.
We update the mixing coefficients (πk) by taking the average of the latent variable probabilities for each component.
Repeat the E-step and M-step until convergence
We iterate between the estimation step and maximization step until the change in the log-likelihood or the
parameters falls below a predefined threshold or until a maximum number of iterations is reached.
Basically, in the estimation step we update the latent variables based on the current parameter values, and in the maximization step we update the parameter values using the estimated latent variables. This process is repeated iteratively until our model converges.
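The E-step and M-step described above can be written out directly. The sketch below is a compact, unoptimized implementation under simplifying assumptions: random initialization from the data, a fixed number of iterations instead of a convergence test, and a small ridge on the covariances for numerical stability.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialise pi_k, mu_k, Sigma_k
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma_nk proportional to pi_k * N(x_n | mu_k, Sigma_k)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the zero-derivative conditions above
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / n
    return pi, mu, Sigma, gamma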
The Expectation-Maximization (EM) algorithm is a general framework and can be applied to various models, including
Gaussian Mixture Models (GMMs). The steps described above are specifically for GMMs, but the overall concept of the
estimation step and maximization step remains the same for other models that use the EM algorithm.
Clustering is a technique used in machine learning and data analysis to group similar data points together based on
certain features or characteristics. This approach has various applications across different domains. Here are some
common applications of clustering:
1. Customer Segmentation:
Retail: Clustering can help retailers identify groups of customers with similar purchasing behavior.
This information can be used to target specific customer segments with personalized marketing
strategies.
E-commerce: E-commerce platforms can use clustering to categorize products and recommend
items to users based on their preferences and previous purchasing patterns.
2. Image Segmentation:
Medical Imaging: Clustering is used in medical image analysis to segment images into different
regions, which can aid in the detection and diagnosis of diseases.
Computer Vision: In computer vision applications, clustering helps identify and group similar
objects or features within images.
3. Anomaly Detection:
Clustering can be used to identify unusual patterns or outliers in data. This is particularly useful in
fraud detection, network security, and quality control in manufacturing.
4. Document Clustering:
In natural language processing, clustering is employed to group similar documents together. This is
useful for organizing large document collections, topic modeling, and information retrieval.
5. Recommendation Systems:
Clustering is applied in recommendation systems to group users or items with similar preferences.
This helps in providing personalized recommendations to users based on the behavior of similar
users.
6. Genomic Data Analysis:
In bioinformatics, clustering is used to identify groups of genes with similar expression patterns
across different conditions or samples. This can provide insights into genetic relationships and
functions.
7. Social Network Analysis:
Clustering helps identify communities or groups of users with similar interests or connections in
social networks. This information is valuable for targeted marketing and understanding social
structures.
8. Network Traffic Analysis:
Clustering is used to analyze network traffic patterns, helping in the detection of network intrusions
and identifying potential security threats.
9. Speech Recognition:
Clustering can be applied to group similar speech patterns, aiding in the development of more
accurate and efficient speech recognition systems.
10. Climate Pattern Analysis:
Clustering is used in environmental science to identify patterns and trends in climate data, helping
researchers understand climate variability and change.
11. Manufacturing and Quality Control:
Clustering is applied in manufacturing to group similar products or components. It can also help in
quality control by identifying clusters of products with similar characteristics.
Maximum Likelihood Estimation
Maximum likelihood is an approach commonly used for density estimation problems, in which a likelihood function is defined to express how probable the observed data is under a given set of parameters. It is important to study and understand the concept of maximum likelihood, as it is one of the primary and core concepts needed for learning other advanced machine learning and deep learning techniques and algorithms.
In machine learning, the likelihood measures how well the observed data supports particular parameter values, and hence how well we can predict the target variable for given data points. In simple words, as the name suggests, the likelihood is a function that tells us how likely a specific data point is under the assumed data distribution.
For example, suppose there are two data points in the dataset and the likelihood of the first data point is greater than that of the second. In that case, the first data point is considered to fit the assumed distribution better and therefore to be more informative for the final model.
After this discussion, a gentle question may appear in your mind: if the likelihood function works in the same way as the probability function, then what is the difference?
Although the working and intuition of probability and likelihood appear similar, there is a slight difference. The likelihood is a function that tells us how well a particular data point fits the assumed data distribution, and therefore how much it contributes to the model learned by the machine learning algorithm.
Probability, in simple words, describes the chance of some event happening given other circumstances or conditions, often as a conditional probability. Also, the probabilities of all the outcomes of a particular problem sum to one and cannot exceed it, whereas likelihood values can be greater than one.
After discussing the intuition of the likelihood function, it is clear that a higher likelihood is desired, since it indicates a model that fits the data accurately and gives accurate results. The term maximum likelihood therefore means that we maximize the likelihood function; this is called maximization of the likelihood function.
Let us suppose that we have a classification dataset in which the independent column holds the marks the students achieved in a particular exam, and the target or dependent column is categorical, with Yes and No values representing whether or not a student was placed through campus placements.
Now, if we solve this problem with maximum likelihood estimation, the method first calculates the probability of every data point under each possible value of the target variable. It then considers the data points in a two-dimensional plot and searches for the line that best fits the dataset, dividing it into two parts. The best-fit line is reached after some iterations (epochs), and once it is found, it is used to classify a new data point by simply plotting it on the graph.
Maximum likelihood estimation is the basis of several machine learning and deep learning approaches used for classification problems. One example is logistic regression, where the algorithm classifies data points using the best-fit line on the graph; in the context of deep learning, essentially the same approach is known as the perceptron trick.
As shown in the above image, all the data observations are plotted in a two-dimensional diagram in which the x-axis represents the independent column (the training data) and the y-axis represents the target variable. A line is drawn to separate the two kinds of observations, positive and negative. According to the algorithm, the observations that fall above the line are considered positive, and the data points below the line are regarded as negative.
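To make the marks-and-placement example concrete, here is a minimal sketch using scikit-learn's LogisticRegression, whose parameters are fitted by maximizing the (log-)likelihood of the labels; the marks and placement outcomes below are made-up illustrative data.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: exam marks (independent column) and placement outcome (1 = placed)
marks = np.array([[35], [42], [50], [58], [61], [67], [72], [80], [85], [90]])
placed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# LogisticRegression fits its weights by maximising the likelihood of the observed labels
clf = LogisticRegression().fit(marks, placed)
print(clf.predict([[65]]))          # predicted class for a new student
print(clf.predict_proba([[65]]))    # estimated probabilities for "not placed" / "placed"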