IML MATERIAL
TOPICS
1 Introduction
2 Towards Intelligent Machines: Well-posed Problems, Examples of
Applications in Diverse Fields
3 Data Representation
4 Domain Knowledge for Productive use of Machine Learning
5 Diversity of Data: Structured and Unstructured Data
6 Forms of Machine Learning
7 Machine Learning and Data Mining
8 Basic Linear Algebra in Machine Learning Techniques
Definition 1:
Machine learning (ML) is a type of artificial intelligence (AI) that allows computers to
learn without being explicitly programmed.
ML is one of the most exciting technologies that one would have ever come across. As is
evident from the name, it gives the computer the capability that makes it more similar to humans: the
ability to learn. Machine learning is actively being used today, perhaps in many more places
than one would expect.
Definition 2:
Machine learning is a subfield of artificial intelligence (AI) that uses algorithms trained on
data sets to create self-learning models that are capable of predicting outcomes and classifying
information without human intervention.
Machine learning is used today for a wide range of commercial purposes, including
suggesting products to consumers based on their past purchases, predicting stock market
fluctuations, and translating text from one language to another.
In common usage, the terms “machine learning” and “artificial intelligence” are often used
interchangeably with one another due to the prevalence of machine learning for AI purposes in
the world today. But, the two terms are meaningfully distinct.
While AI refers to the general attempt to create machines capable of human-like cognitive
abilities, machine learning specifically refers to the use of algorithms and data sets to do so.
Definition 3:
Machine learning is a subfield of artificial intelligence that uses algorithms trained on data sets
to create models that enable machines to perform tasks that would otherwise only be possible
for humans, such as categorizing images, analyzing data, or predicting price fluctuations.
Today, machine learning is one of the most common forms of artificial intelligence and often
powers many of the digital goods and services we use every day.
Artificial Intelligence (AI) has become increasingly integrated into various aspects of our
lives, revolutionizing industries and impacting daily routines. Here are some examples
illustrating the diverse applications of AI:
1.Virtual Personal Assistants: Popular examples like Siri, Google Assistant, and Amazon
Alexa utilize AI to understand and respond to user commands. These assistants employ
natural language processing (NLP) and machine learning algorithms to improve their
accuracy and provide more personalized responses over time.
2.Autonomous Vehicles: AI powers the development of self-driving cars, trucks, and drones.
Companies like Tesla, Waymo, and Uber are at the forefront of this technology, using AI
algorithms to analyse sensory data from cameras, radar, and lidar to make real-time driving
decisions.
3.Healthcare Diagnosis and Treatment: AI algorithms are used to analyse medical data,
including patient records, imaging scans, and genetic information, to assist healthcare
professionals in diagnosing diseases and planning treatments. IBM’s Watson for Health and
Google’s DeepMind are examples of AI platforms employed in healthcare.
4.Recommendation Systems: Online platforms like Netflix, Amazon, and Spotify utilize AI
to analyse user behaviour and preferences, providing personalized recommendations for
movies, products, and music. These systems employ collaborative filtering and content-based
filtering techniques to enhance user experience and increase engagement.
AI has the potential to revolutionize many industries and fields, such as healthcare, finance,
transportation, and education. However, it also raises important ethical and societal questions,
such as the impact on employment and privacy, and the responsible development and use of
AI technology.
Deep Learning:
Deep learning is the branch of machine learning that is based
on artificial neural network architecture. An artificial neural network or ANN uses layers of
interconnected nodes called neurons that work together to process and learn from the input
data.
In a fully connected Deep neural network, there is an input layer and one or more hidden
layers connected one after the other. Each neuron receives input from the previous layer
neurons or the input layer. The output of one neuron becomes the input to other neurons in
the next layer of the network, and this process continues until the final layer produces the
output of the network. The layers of the neural network transform the input data through a
series of nonlinear transformations, allowing the network to learn complex representations of
the input data.
Machine learning is a data-driven technology. Organizations generate large amounts of data
on a daily basis, and by identifying notable relationships in that data they can make better decisions.
Machines can learn from past data and improve automatically.
For big organizations, branding is important, and machine learning makes it easier to target a
relatable customer base.
It is similar to data mining because it also deals with huge amounts of data.
Well Posed Learning Problem – A computer program is said to learn from experience E with
respect to some task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E.
Any problem can be segregated as well-posed learning problem if it has three traits –
Task
Performance Measure
Experience
Certain examples that efficiently define the well-posed learning problem are –
Spam filtering: T = classifying emails as spam or not spam, P = the fraction of emails correctly classified, E = a set of emails labelled as spam or not spam.
Handwriting recognition: T = recognizing handwritten characters, P = the percentage of characters correctly identified, E = a database of handwritten characters with their labels.
3. Data Representation
Data representation methods in machine learning refer to the techniques used to transform
and present input data in a format that is suitable for training and evaluating machine learning
models. Effective data representation is crucial for ensuring that models can learn
meaningful patterns and relationships from the input features. Different types of data, such
as numerical, categorical, and text, may require specific representation methods.
Feature Extraction
Data representation methods help extract meaningful features from the raw data, which are
the characteristics or attributes that the machine learning algorithm will learn from to make
predictions.
Dimensionality Reduction
These methods can reduce the dimensionality of the data, which is the number of features, by
identifying and eliminating redundant or irrelevant features. This can improve the efficiency
of machine learning algorithms and reduce computational complexity.
Normalization and Scaling
Data representation methods can normalize and scale numerical features to ensure that they
are all on a similar scale. This helps to prevent features with larger magnitudes from
dominating the learning process.
Feature Engineering
These methods can be used to create new features from existing data or transform existing
features to improve the performance of machine learning models.
Numerical Data
Scaling and Normalization: Numerical features often have different scales, and models
might be sensitive to these variations. Scaling methods, such as Min-Max scaling or Z-score
normalization, ensure that numerical features are on a similar scale, preventing certain
features from dominating the model training process.
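A minimal sketch of the two scaling methods mentioned above, using scikit-learn; the two-column age/income array is made-up illustration data:
```python
# Minimal sketch: Min-Max scaling and Z-score normalization with scikit-learn.
# The age/income values below are made-up illustration data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25, 30000],
              [40, 60000],
              [55, 120000]], dtype=float)  # columns: age, income

minmax = MinMaxScaler()           # rescales each feature to the [0, 1] range
X_minmax = minmax.fit_transform(X)

zscore = StandardScaler()         # rescales each feature to mean 0, std 1
X_zscore = zscore.fit_transform(X)

print(X_minmax)
print(X_zscore)
```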
Categorical Data
One-Hot Encoding: Categorical variables, which represent discrete categories, need to be
encoded numerically for machine learning models. One-hot encoding is a common method
where each category is transformed into a binary vector, with a 1 indicating the presence of
the category and 0 otherwise.
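A minimal sketch of one-hot encoding with pandas; the 'colour' column is a made-up example:
```python
# Minimal sketch: one-hot encoding a categorical column with pandas.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["colour"])
print(encoded)
# Each category becomes its own binary column, e.g. colour_red is 1 when present, 0 otherwise.
```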
Text Data
Vectorization: Text data needs to be converted into a numerical format for machine learning
models. Techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word
embeddings, such as Word2Vec or GloVe, are used to represent words or documents as
numerical vectors.
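A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer; the three example sentences are illustrative only:
```python
# Minimal sketch: turning raw text into TF-IDF vectors with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["machine learning is fun",
        "deep learning is a branch of machine learning",
        "statistics supports machine learning"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out()) # the learned vocabulary
print(X.toarray().round(2))               # TF-IDF weights per document
```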
Time Series Data
Temporal Features: For time series data, relevant temporal features may be extracted, such
as day of the week, month, or time of day. Additionally, lag features can be created to capture
historical patterns in the data.
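A minimal sketch of temporal and lag features with pandas; the daily 'sales' series is made-up illustration data:
```python
# Minimal sketch: extracting temporal and lag features from a time series with pandas.
import pandas as pd

ts = pd.DataFrame(
    {"sales": [10, 12, 9, 15, 14, 11, 13]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

ts["day_of_week"] = ts.index.dayofweek    # temporal feature: 0 = Monday
ts["month"] = ts.index.month
ts["sales_lag_1"] = ts["sales"].shift(1)  # lag feature: yesterday's value
ts["sales_lag_7"] = ts["sales"].shift(7)  # same day last week (NaN here, the series is short)
print(ts)
```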
Image Data
Pixel Values: Images are typically represented as grids of pixel values. Deep learning
models, particularly convolutional neural networks (CNNs), directly operate on these pixel
values to extract hierarchical features.
Composite Data
Combining Representations: In some cases, datasets may consist of a combination of
numerical, categorical, and text features. Representing such composite data involves using a
combination of the methods mentioned above, creating a comprehensive and effective input
format for the model.
Embeddings
Learned Representations: In certain cases, embeddings are learned during the model
training process. This is common in deep learning models, where the model learns a low-
dimensional representation of the input data that captures meaningful patterns.
Sparse Data
Sparse Matrix: In cases where data is sparse, such as in natural language processing with a
large vocabulary, a sparse matrix representation may be used. This is an efficient way to
represent data with a significant number of zero values.
Feature Engineering
Creating Informative Features: Feature engineering is an overarching concept that involves
creating new features or transforming existing ones to provide the model with more
informative input. This process is critical for enhancing the model's ability to capture relevant
patterns.
The choice of data representation method depends on the specific type of data, the machine
learning task, and the desired model performance. Careful consideration of these factors can
significantly impact the effectiveness of the machine learning model.
4. Domain Knowledge for Productive use of Machine Learning
Domain Knowledge in machine learning refers to expertise and understanding of the specific
field or subject matter to which the machine learning model is applied. While machine
learning algorithms are powerful tools for analyzing data and making predictions, they often
require domain experts to ensure that the models interpret the data correctly and make
meaningful predictions.
Objective Definition: Domain expertise is integral throughout the machine learning process,
from defining objectives to deploying models.
Data Collection: Collecting relevant datasets from diverse sources is important, aligned with
domain intricacies and data availability.
Data Preprocessing: Cleaning, transforming, and encoding data to ensure quality and
compatibility with the chosen machine learning algorithms.
Model Selection & Tuning: Selecting appropriate algorithms and fine-tuning model
parameters, guided by domain knowledge to optimize performance and interpretability.
Deployment: Deploying the trained model into production environments, considering domain
constraints and scalability requirements for real-world applications.
Domain knowledge directs the whole machine learning process, from data preparation to
model deployment, in addition to helping to comprehend the underlying data. Domain
specialists make contributions to ML by:
Feature engineering: They find pertinent features and connections within the data, which
helps to provide more insightful features for machine learning models.
Model Interpretation: By comprehending the domain context, model predictions may be
interpreted more effectively, increasing the actionability and reliability of machine learning
outputs.
Data preprocessing and cleaning: By spotting noise and abnormalities unique to a certain
domain, domain specialists improve the quality of the data, which in turn improves model
performance.
Key aspects of domain knowledge include:
1. Business understanding: Knowing the organization, its goals, and the problem to be solved.
2. Data understanding: Familiarity with the data, its structure, and its limitations.
3. Industry knowledge: Understanding the industry, its trends, and its challenges.
4. Problem formulation: Defining the problem and identifying the key issues.
5. Diversity of Data: Structured and Unstructured Data
Structured Data:
Structured data is well-organized, easy to quantify, well defined, and simple to search and analyze
with data-analytics software. Structured data is usually located in a specific field within
files or records. It is easy to place structured data into a standard pattern of set rows, tables,
and columns.
A good example of handling structured data is accessing a hotel database where all the
relevant details of the guests, like name, contact number, address, etc., can be accessed with
ease. Such types of data are structured.
Structured data is encased in RDBMS (relational databases). Any information stored in the
database can be updated by person or machines and accessed with ease by algorithms or
manual search. Structured Query Language (SQL) is the standard tool used to handle
structured data, be it locating, adding & deleting, or updating.
Let us now take a look at the pros and cons of structured data.
1. Easy to use and search: The well-organized and quantitative nature of structured data makes it very easy to
update, modify, and search.
2. Accessible to business users: Anyone with basic knowledge of data and its related applications can use structured data.
Structured data facilitates a self-service mode of data access, so it is not
necessary to have in-depth knowledge of data types and their relationships.
3. More tool options: As structured data has been in use for a long time, most tools have been
tested for their efficiency in data analysis. Data managers have a lot of tools to choose from
when tackling structured data.
4. Seamless integrations: Simple and streamlined programs like Excel can be used to store and organize structured
data. Furthermore, several other analytical tools can be linked to Excel for further data
analysis as required.
1. Limited use: Structured data lacks versatility. It can be used only with a set vision and cannot deviate from
that as it has a predefined structure.
2. Rigid storage: Structured data is stored in data warehouses with a rigid data storage method. Any change in
data requirements means the existing data must be updated, which is expensive and time-consuming.
3. Limited insight: Structured data can offer limited insight as it works on pre-set parameters. It does not provide
the details of how and why the data analytics is carried out.
Unstructured Data:
Unstructured data refers to information that is not organized and cannot be accommodated in
a set or defined framework. It can be stored only in its original form until put to use. This
feature is known as schema-on-read.
The majority of the data we come across is unstructured. Nearly 80% of the enterprise data is
unstructured; this percentage appears to be constantly growing. Unstructured data comes in
various formats like emails, posts on social media platforms, chats, presentations, images,
satellite feeds, and data from IoT sensors.
Naturally, companies that invest time and money in deciphering unstructured data get access
to vital and valuable business intelligence to increase their profits. It can also help them
connect to their customers more efficiently and in a personalized fashion, thereby
contributing to increased profits.
Unstructured data is rather tricky to decipher; extracting valuable insights from unstructured
data requires cutting-edge tools and complex algorithms by skilled data professionals who
can leverage top-class programming skills and data analytics.
However, the results are highly rewarding as the crucial qualitative insights (customer
feedback, decision-making) help businesses streamline customer queries and improve
organizational efficiency.
Pros of unstructured data:
As unstructured data is accumulated in its original (native) form, it is not defined until
used. This results in a larger reserve pool, as unstructured data can adapt to any data
requirement. It also allows data analysts and data scientists to process and analyze only
the required information.
Unstructured data has an impressive accumulation rate. As it does not require pre-set
parameters, it can be gathered easily and quickly.
Cloud data lakes store unstructured data due to their impressive storage capacity. Cloud data
lakes charge on a pay-for-what-you-use basis and are highly cost-effective, flexible, and
scalable.
Cons of unstructured data:
As mentioned before, data science expertise is required to leverage unstructured data for
useful processing and analysis. So, a regular business person or user cannot extract
any meaningful information from unstructured data in its crude native form. Processing
unstructured data requires knowledge of the topic related to the data and knowledge of how to
link the data to make it resourceful. A further disadvantage is the shortage
of data science professionals despite the continually growing demand across industries.
Unstructured data requires specialized tools for manipulation besides data science expertise.
Standard data analytics tools are useful and compatible with structured data, and data
engineers only have a limited choice of tools to analyze unstructured data. However, new
tools and technologies are being developed in the market as we speak.
Semi-Structured Data
The third category of data features both structured and unstructured data, known as semi-
structured data. Semi-structured data does not fit into any pre-set parameters or organized
structures in a relational database resembling unstructured data. Yet, they have markers or
metadata that carry processed, analyzed, and structured information just like structured data.
The best example of semi-structured data is the pictures in smartphones. Every image or
photo in a smartphone has unstructured data and structured details like time, location, and
other related information. Semi-structured data can be seen in the form of JSON, CSV, and
XML file formats.
6. Forms of Machine Learning
Machine learning is the branch of Artificial Intelligence that focuses on developing models
and algorithms that let computers learn from data and improve from previous experience
without being explicitly programmed for every task. In simple words, ML teaches the
systems to think and understand like humans by learning from the data.
There are several types of machine learning, each with special characteristics and
applications. Some of the main types of machine learning algorithms are as follows:
1.Supervised Machine Learning
2.Unsupervised Machine Learning
3.Semi-Supervised Learning
4.Reinforcement Learning
2.Unsupervised Machine Learning
Unsupervised learning works with unlabeled data to find hidden patterns. A common
application is anomaly detection: identifying data points that deviate from the expected
patterns, often signaling anomalies or outliers.
One of the main challenges of unsupervised learning is the lack of labeled data.
This can make it difficult to evaluate the performance of unsupervised learning
algorithms, as there are no predefined labels or categories against which to
compare results. Additionally, unsupervised learning algorithms can be
sensitive to the quality of the data, and may perform poorly on noisy or
incomplete data.
3.Semi-Supervised Learning in ML
Text classification: In text classification, the goal is to classify a given text into
one or more predefined categories. Semi-supervised learning can be used to
train a text classification model using a small amount of labeled data and a large
amount of unlabeled text data.
Continuity Assumption: The algorithm assumes that the points which are
closer to each other are more likely to have the same output label.
Cluster Assumption: The data can be divided into discrete clusters and points
in the same cluster are more likely to share an output label.
Manifold Assumption: The data lie approximately on a manifold of a much
lower dimension than the input space. This assumption allows the use of
distances and densities which are defined on a manifold.
Speech Analysis: Since labeling audio files is a very intensive task, Semi-
Supervised learning is a very natural approach to solve this problem.
Protein Sequence Classification: Since DNA strands are typically very large
in size, the rise of Semi-Supervised learning has been imminent in this field.
The most basic disadvantage of any Supervised Learning algorithm is that the
dataset has to be hand-labeled either by a Machine Learning Engineer or a Data
Scientist. This is a very costly process, especially when dealing with large
volumes of data. The most basic disadvantage of any Unsupervised Learning is
that its application spectrum is limited.
4.Reinforcement learning
Reinforcement learning uses algorithms that learn from outcomes and decide
which action to take next. After each action, the algorithm receives feedback
that helps it determine whether the choice it made was correct, neutral or
incorrect. It is a good technique to use for automated systems that have to make
a lot of small decisions without human guidance.
Example:
The problem is as follows: we have an agent and a reward, with many hurdles
in between. The agent is supposed to find the best possible path to reach the
reward.
Consider a robot, a diamond, and fire. The goal of the robot is to get the
reward, which is the diamond, while avoiding the hurdles, which are fire.
The robot learns by trying all the possible paths and then choosing the path
which gives it the reward with the fewest hurdles. Each right step gives the
robot a reward and each wrong step subtracts from the robot's reward. The
total reward is calculated when it reaches the final reward, the diamond.
Input: The input should be an initial state from which the model will start
Output: There are many possible outputs as there are a variety of solutions
to a particular problem
Training: The training is based upon the input. The model will return a
state, and the user will decide whether to reward or punish the model based on its
output.
8. Basic Linear Algebra in Machine Learning Techniques
Machine learning has a strong connection with mathematics. Each machine learning algorithm is
based on the concepts of mathematics, and with the help of mathematics one can choose the
correct algorithm by considering training time, complexity, number of features, etc. Linear
Algebra is an essential field of mathematics, which defines the study of vectors, matrices,
planes, mapping, and lines required for linear transformation.
Linear algebra is the branch of mathematics that deals with vector spaces and
linear mappings between these spaces. It encompasses the study of vectors,
matrices, linear equations, and their properties.
B. Fundamental Concepts
1. Vectors
Vectors are quantities that have both magnitude and direction, often represented
as arrows in space.
Linear algebra is a fundamental tool for machine learning, and many
techniques rely heavily on it. Key linear algebra operations used in machine
learning include vectors and vector arithmetic, matrix multiplication, matrix
decompositions (such as eigendecomposition and singular value decomposition),
solving systems of linear equations, and norms and distances.
These linear algebra techniques are essential for many machine learning
algorithms and applications. Popular libraries that provide efficient
implementations of these operations include:
- NumPy
- SciPy
- TensorFlow
- PyTorch
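A minimal sketch of the basic vector and matrix operations described above, using NumPy; the vector and matrix values are arbitrary examples:
```python
# Minimal sketch: basic vector and matrix operations with NumPy,
# the kind of linear algebra that underlies most ML algorithms.
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # a vector in R^3
W = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])      # a 2x3 matrix (e.g. weights of a linear map)

y = W @ x                            # matrix-vector product: a linear transformation
dot = np.dot(x, x)                   # inner product / squared length of x
norm = np.linalg.norm(x)             # Euclidean norm

print(y, dot, norm)
```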
A machine is said to be learning from past Experiences (data feed-in) with respect to some
class of tasks if its Performance in a given Task improves with the Experience.
For example, assume that a machine has to predict whether a customer will buy a specific
product let’s say “Antivirus” this year or not. The machine will do it by looking at the
previous knowledge/past experiences i.e. the data of products that the customer had
bought every year and if he buys an Antivirus every year, then there is a high probability
that the customer is going to buy an antivirus this year as well. This is how machine learning
works at the basic conceptual level.
Supervised learning
Supervised learning is a machine learning technique that is widely used in various fields such
as finance, healthcare, marketing, and more. It is a form of machine learning in which the
algorithm is trained on labeled data to make predictions or decisions based on the data
inputs. In supervised learning, the algorithm learns a mapping between the input and output
data. This mapping is learned from a labeled dataset, which consists of pairs of input and
output data. The algorithm tries to learn the relationship between the input and output
data so that it can make accurate predictions on new, unseen data.
Supervised learning is where the model is trained on a labelled dataset. A labelled dataset is
one that has both input and output parameters. In this type of learning, both the training and
validation datasets are labelled, as illustrated by the example datasets described below.
The labeled dataset used in supervised learning consists of input features and corresponding
output labels. The input features are the attributes or characteristics of the data that are
used to make predictions, while the output labels are the desired outcomes or targets that
the algorithm tries to predict.
Figure A: a customer dataset where the output is Purchased, i.e. 0 or 1; 1 means the customer
will purchase and 0 means the customer won't purchase.
Figure B: a meteorological dataset that serves the purpose of predicting wind speed
based on different parameters.
Training the system: While training the model, data is usually split in the ratio of 80:20, i.e.
80% as training data and the rest as testing data. For the training data, we feed both the input
and the output. The model learns from the training data only. We use different machine
learning algorithms to build our model.
Learning means that the model will build some logic of its own.
Once the model is ready then it is good to be tested. At the time of testing, the input is fed
from the remaining 20% of data that the model has never seen before, the model will
predict some value and we will compare it with the actual output and calculate the
accuracy.
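A minimal sketch of the 80:20 split and accuracy check described above, using scikit-learn on a randomly generated toy dataset; the dataset and the choice of logistic regression are illustrative assumptions:
```python
# Minimal sketch: 80:20 train/test split, training on the 80%, and measuring
# accuracy on the unseen 20%.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 80% of the data for training, the remaining 20% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)   # the model learns from training data only
y_pred = model.predict(X_test)                       # predictions on data it has never seen
print("accuracy:", accuracy_score(y_test, y_pred))
```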
Supervised learning is typically divided into two main categories: regression and
classification. In regression, the algorithm learns to predict a continuous output value, such
as the price of a house or the temperature of a city. In classification, the algorithm learns to
predict a categorical output variable or class label, such as whether a customer is likely to
purchase a product or not.
One of the primary advantages of supervised learning is that it allows for the creation of
complex models that can make accurate predictions on new data. However, supervised
learning requires large amounts of labeled training data to be effective. Additionally, the
quality and representativeness of the training data can have a significant impact on the
accuracy of the model.
The goal is to minimize the difference between predicted and actual values using algorithms
like Linear Regression, Decision Trees, or Neural Networks, ensuring the model captures
underlying patterns in the data.
Classification
Classification is a type of supervised learning that categorizes input data into predefined
labels. It involves training a model on labeled examples to learn patterns between input
features and output classes. In classification, the target variable is a categorical value. For
example, classifying emails as spam or not.
The model’s goal is to generalize this learning to make accurate predictions on new, unseen
data. Algorithms like Decision Trees, Support Vector Machines, and Neural Networks are
commonly used for classification tasks.
NOTE: There are common supervised machine learning algorithms that can be used for both
regression and classification tasks.
Supervised Machine Learning Algorithm
UNIT-3
STATISTICAL LEARNING
Topics:
1.Machine Learning
2.Inferential Statistical Analysis
3.Descriptive Statistics in Machine Learning Techniques
4.Bayesian Reasoning: A Probabilistic Approach to Inference
5.K-Nearest Neighbour Classifier
6.Discrimination Functions
7.Regression Functions
8.Linear Regression with Least Square Error Criterion
9.Logistic Regression for Classification Tasks
10.Fisher's Linear Discriminant
11.Thresholding for Classification
12.Minimum Description Length Principle
• The results obtained from statistical learning help us determine trends and
predict a possible outcome for the future.
Why do we Need Statistical Learning?
Statistical Learning helps us understand why a system
behaves the way it does. It reduces ambiguity and
produces results that matter in the real world.
Types of Statistics
Statistics is mainly divided into the following two categories.
1.Descriptive Statistics
• In the descriptive Statistics, the Data is described in a summarized way. The
summarization is done from the sample of the population using different
parameters like Mean or standard deviation.
• Descriptive Statistics are a way of using charts, graphs, and summary measures to
organize, represent, and explain a set of Data.
2.Inferential Statistics
• In the Inferential Statistics, we try to interpret the Meaning of descriptive Statistics.
After the Data has been collected, analyzed, and summarized we use Inferential
Statistics to describe the Meaning of the collected Data.
3.Descriptive Statistics in Machine Learning Techniques
Advantages:
Easily identifies patterns and trends. With the identified trends, targeting
specific customers for specific products becomes more accessible.
Saves time. Hundreds and thousands of epochs for achieving the optimized
result are possible within a few minutes.
5.K-Nearest Neighbour Classifier
KNN is one of the most basic yet essential classification algorithms in machine learning. It belongs to
the supervised learning domain and finds intense application in pattern recognition, data mining, and
intrusion detection.
Euclidean Distance
This is nothing but the Cartesian distance between two points in the plane/hyperplane:
d(p, q) = √( Σ (pᵢ − qᵢ)² )
where p is the observed (query) point and q is an actual stored data point.
Example for Lazy Learning
Given data and query: x = (Maths = 6, CS = 8), K = 3
Classification = Fail (F) / Pass (P)
#   Maths   CS   Result
1     4      3     F
2     6      7     P
3     7      8     P
4     5      5     F
5     8      8     P
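A minimal sketch that reproduces the lazy-learning example above in plain Python: it computes Euclidean distances from the query to each stored row, keeps the K = 3 nearest, and takes a majority vote:
```python
# Minimal sketch of the KNN example above: classify the query x = (Maths=6, CS=8)
# with K=3 using Euclidean distances to the five stored rows.
import math

data = [((4, 3), "F"), ((6, 7), "P"), ((7, 8), "P"), ((5, 5), "F"), ((8, 8), "P")]
query, k = (6, 8), 3

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Sort the stored examples by distance to the query and keep the K nearest.
neighbours = sorted(data, key=lambda item: euclidean(item[0], query))[:k]
labels = [label for _, label in neighbours]

# Majority vote among the K nearest neighbours decides the class.
prediction = max(set(labels), key=labels.count)
print(neighbours, "->", prediction)   # the three nearest rows are all 'P', so the query is classified Pass
```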
6.Discrimination Functions
A discrimination function is a mathematical function used to
classify objects or individuals into different categories or groups
based on their characteristics or features. It's a crucial concept
in:
1. Statistics
2. Machine Learning
3. Data Mining
4. Pattern Recognition
7. Regression Functions
1.Simple Regression
Used to predict a continuous dependent variable based on a single independent variable.
Simple linear regression should be used when there is only a single independent variable.
2.Multiple Regression
Used to predict a continuous dependent variable based on multiple independent variables.
Multiple linear regression should be used when there are multiple independent variables.
Linear Regression
Linear regression is one of the simplest and most widely used statistical models. This assumes that there is a linear
relationship between the independent and dependent variables. This means that the change in the dependent
variable is proportional to the change in the independent variables.
Polynomial Regression
Polynomial regression is used to model nonlinear relationships between the dependent variable and the
independent variables. It adds polynomial terms to the linear regression model to capture more complex
relationships.
Support Vector Regression (SVR)
Support vector regression (SVR) is a type of regression algorithm that is based on the support vector
machine (SVM) algorithm. SVM is a type of algorithm that is used for classification tasks, but it can
also be used for regression tasks. SVR works by finding a hyperplane that minimizes the sum of the
squared residuals between the predicted and actual values.
People use it for investigating and modelling the relationship between variables
(i.e. dependent variable and one or more independent variables).
1. Model the relationship between two variables, such as the relationship between income
and expenditure, or experience and salary.
8. Linear Regression with Least Square Error Criterion
Least squares is one of the most widely used methods of fitting curves; it works by
making the sum of squared errors as small as possible. It helps you draw a
line of best fit depending on your data points.
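A minimal least-squares sketch using NumPy's lstsq solver; the data points are made up and roughly follow y = 2x:
```python
# Minimal sketch: fitting a line y = m*x + c by minimizing the sum of squared
# errors, using NumPy's least-squares solver on made-up data points.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])          # roughly y = 2x

A = np.column_stack([x, np.ones_like(x)])         # design matrix [x, 1]
(m, c), residuals, *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"best-fit line: y = {m:.2f}x + {c:.2f}")
print("sum of squared errors:", np.sum((A @ np.array([m, c]) - y) ** 2))
```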
9. Logistic Regression for Classification Tasks
Logistic regression is a supervised machine learning algorithm used for classification tasks where
the goal is to predict the probability that an instance belongs to a given class or not. Logistic
regression is a statistical algorithm which analyzes the relationship between two data factors. This
section explores the fundamentals of logistic regression, its types and implementations.
What is Logistic Regression?
Logistic regression is used for binary classification where we
use sigmoid function, that takes input as independent variables and
produces a probability value between 0 and 1.
For example, we have two classes Class 0 and Class 1 if the value of
the logistic function for an input is greater than 0.5 (threshold
value) then it belongs to Class 1 otherwise it belongs to Class 0. It’s
referred to as regression because it is the extension of linear
regression but is mainly used for classification problems.
Key Points:
Logistic regression predicts the output of a categorical dependent
variable. Therefore, the outcome must be a categorical or discrete
value.
It can be either Yes or No, 0 or 1, true or False, etc. but instead of
giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
In Logistic regression, instead of fitting a regression line, we fit an
“S” shaped logistic function, which predicts two maximum values (0
or 1).
Logistic Function – Sigmoid Function
The sigmoid function is a mathematical function used to map the predicted values to
probabilities: σ(z) = 1 / (1 + e^(−z)).
It maps any real value into another value within a range of 0 and 1. The value of the logistic
regression must be between 0 and 1, which cannot go beyond this limit, so it forms a curve
like the “S” form.
The S-form curve is called the Sigmoid function or the logistic function.
In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Such as values above the threshold value tends to 1, and a
value below the threshold values tends to 0.
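A minimal sketch of the sigmoid function and the 0.5 threshold rule; the linear scores are hypothetical values of w·x + b:
```python
# Minimal sketch: the sigmoid (logistic) function and the 0.5 threshold rule.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real value into (0, 1)

scores = np.array([-2.0, -0.3, 0.0, 0.8, 3.0])   # hypothetical values of w.x + b
probs = sigmoid(scores)
classes = (probs > 0.5).astype(int)   # above the threshold -> Class 1, else Class 0

for s, p, c in zip(scores, probs, classes):
    print(f"score={s:+.1f}  P(class 1)={p:.2f}  predicted class={c}")
```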
10. Fisher's Linear Discriminant
Introduction
To deal with classification problems with 2 or more classes, most Machine Learning (ML) algorithms
work the same way.
Usually, they apply some kind of transformation to the input data with the effect of reducing the
original input dimensions to a new (smaller) one. The goal is to project the data to a new space.
Then, once projected, they try to classify the data points by finding a linear separation.
For problems with small input dimensions, the task
is somewhat easier. Take the following dataset as an
example.
Suppose we want to classify the red and blue circles correctly. It is clear that with a
simple linear model we will not get a good result. There is no linear combination of
the inputs and weights that maps the inputs to their correct classes. But what if we
could transform the data so that we could draw a line that separates the 2 classes?
That is what happens if we square the two input
feature-vectors. Now, a linear model will easily classify
the blue and red points.
11.Thresholding for Classification
What Is the Classification Threshold in Machine Learning?
Classification is the set of algorithms that, together with regression, comprises supervised
machine learning (ML). Supervised ML provides predictions on data. These predictions can take
the form of a discrete class or a continuous value. Discrete use cases are the remit of
classification (e.g., yes/no or true/false predictions), and continuous use cases fall under
regression (such as propensity scores or price predictions).
Correct predictions can be true positives (TP) or true negatives (TN), and
misclassifications can be false positives (FP) or false negatives (FN). The positives are
associated with the positive class, i.e., the class that we are interested in, and the
negatives with the negative class, which can be seen as representing standard, less
relevant behavior.
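A minimal sketch showing how the choice of classification threshold changes the TP/FP/TN/FN counts; the probabilities and true labels are made up:
```python
# Minimal sketch: turning predicted probabilities into classes at different
# thresholds and counting TP/FP/TN/FN for each choice.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.4, 0.6, 0.3, 0.8, 0.2, 0.7, 0.1])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    print(f"threshold={threshold}: TP={tp} FP={fp} TN={tn} FN={fn}")
```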
12.Minimum Description Length Principle
SUPPORT VECTOR MACHINE (SVM)
TOPICS:
1.Introduction,
2. Linear Discriminant Functions for Binary Classification,
3.Perceptron Algorithm,
4.Large Margin Classifier for linearly separable data,
5.Linear Soft Margin Classifier for Overlapping Classes,
6.Kernel Induced Feature Spaces, Nonlinear Classifier,
7.Regression by Support vector Machines.
8.Learning with Neural Networks: Towards Cognitive Machine,
9.Neuron Models, Network Architectures,
10.Perceptrons,
11.Linear neuron and the Widrow-Hoff Learning Rule, The error correction delta
rule.
1.Introduction
A support vector machine (SVM) is a supervised machine learning algorithm that classifies data and solves
regression tasks:
How it works
SVMs find a hyperplane or line that maximizes the distance between classes in an N-dimensional
space. They transform input data into a higher-dimensional feature space to make it easier to classify the
data.
• Advantages
SVMs are effective in high dimensional spaces, even when the number of dimensions is greater than the
number of samples. They are also memory efficient because they use a subset of training points in the
decision function.
• History
SVMs were developed in the 1990s by Vladimir N. Vapnik and his colleagues at AT&T Bell
Laboratories.
• Related algorithms
The support vector clustering algorithm uses the statistics of support vectors to categorize unlabeled
data.
Support Vector Machines or SVMs are supervised machine learning models i.e. they use labeled datasets
to train the algorithms. SVM can work out for both linear and nonlinear problems, and by the notion of
margin, it classifies between various classes. However, in essence, it is used for Classification problems in
Machine Learning. The objective of the algorithm is to find the finest line or decision boundary that can
separate n-dimensional space into classes such that one can put the new data points in the right class in
the future.
This decision boundary is called a hyperplane. In most cases, SVMs have better precision
than Decision Trees, KNNs, Naive Bayes Classifiers, logistic regressions, etc. In addition to this, SVMs have
been known to outmatch neural networks on a few occasions.
1. Linear SVM – Data points can be easily separated with a linear line.
2. Non-Linear SVM – Data points cannot be easily separated with a linear line.
Now, in the next section, we will look forward to the working of SVMs.
Before delving deep into the working of SVM, let us quickly understand the following terms.
• Margin – Margin is the gap between the hyperplane and the support vectors.
• Hyperplane – Hyperplanes are decision boundaries that aid in classifying the data points.
• Support Vectors – Support Vectors are the data points that are on or nearest to the hyperplane
and influence the position of the hyperplane.
• Kernel function – These are the functions used to determine the shape of the hyperplane and
decision boundary.
Linear SVM
We will try to understand the working of SVM by an example where we have two classes that are shown
below.
Class A: Circle
Class B: Triangle
Now, the SVM algorithm is applied and it finds out the best hyperplane that divides both the classes.
SVM takes all the data points in deliberation and produces a line that is known as ‘Hyperplane’ which
segregates both the classes. This line is also called ‘Decision boundary’. Anything that will fall in the circle
would belong to class A and vice-versa.
There can be more than one hyperplane but we can find that the hyperplane for which the margin is
maximum is the optimal hyperplane. The main objective of SVM is to find such hyperplanes that can
classify the data points with high precision.
Non-Linear SVM
A Kernel function is invariably used by Support Vector Machines, whether the data is linear or not, but its
real robustness is leveraged only when the data is not separable in its present form.
In the instance of nonlinear data, SVM makes use of the Kernel-trick. The intention is to map the non-
linearly separable data from a lower dimension into a higher dimensional space to find a hyperplane.
For example, the mapping function transforms the 2D nonlinear input space into a 3D output space using
kernel functions. The complexity of finding the mapping function in SVM reduces significantly by using
Kernel Functions.
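A minimal sketch contrasting a linear SVM with an RBF-kernel SVM on a toy dataset (two concentric circles) that is not linearly separable, using scikit-learn; the dataset and parameters are illustrative assumptions:
```python
# Minimal sketch: a linear SVM vs. an RBF-kernel SVM on data that is not
# linearly separable, showing why the kernel trick helps.
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)   # kernel trick: implicit mapping to a higher-dimensional space

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))   # poor: no separating line exists
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))         # much better on this data
```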
Applications of SVM
SVM has got employment across numerous regression and classification real-life challenges. Some of the
fundamental applications of SVM are given below.
• Classification of text/hypertext
• Image classification
Pros of SVM:
1. It is efficacious for problems where the number of dimensions is greater than the number of
samples.
2. It works well in cases with a clear margin of separation.
3. It uses a subset of training points in the decision function (support vectors), which makes it
memory efficient.
Cons of SVM :
1. The performance is not so good when the data set is large because the time required for training
is more.
2. Linear Discriminant Functions for Binary Classification
Definition:
A Linear Discriminant Function is a decision boundary that
separates two classes in a binary classification problem.
Mathematical Formulation:
g(x) = w^T x + b
where:
w ∈ ℝ^d is the weight vector, b ∈ ℝ is the bias term, and x ∈ ℝ^d is the input vector.
Decision Rule:
Classify x as class +1 if g(x) ≥ 0, and as class −1 if g(x) < 0, where y_i ∈ {+1, -1} are the class labels.
Training (large-margin formulation):
Optimization Problem:
Minimize: (1/2) ||w||^2
subject to: y_i (w^T x_i + b) ≥ 1 for all training points i, with y_i ∈ {+1, -1} (class labels).
Solution: only the training points closest to the boundary (the support vectors) determine w and b.
Properties:
1. Linear decision boundary
2. Fast computation
3. Easy interpretation
Assumptions: the two classes are (approximately) linearly separable.
7. Regression by Support Vector Machines
SVR can use both linear and non-linear kernels. A linear kernel is a simple dot product between two input
vectors, while a non-linear kernel is a more complex function that can capture more intricate patterns in
the data. The choice of kernel depends on the data's characteristics and the task's complexity.
There are several concepts related to support vector regression (SVR) that you may want to understand in
order to use it effectively. Here are a few of the most important ones:
• Support vector machines (SVMs): SVR is a type of support vector machine (SVM), a supervised
learning algorithm that can be used for classification or regression tasks. SVMs try to find the
hyperplane in a high-dimensional space that maximally separates different classes or output
values.
• Kernels: SVR can use different types of kernels, which are functions that determine the similarity
between input vectors. A linear kernel is a simple dot product between two input vectors, while a
non-linear kernel is a more complex function that can capture more intricate patterns in the data.
The choice of kernel depends on the data’s characteristics and the task’s complexity.
• Hyperparameters: SVR has several hyperparameters that you can adjust to control the behavior
of the model. For example, the 'C' parameter controls the trade-off between keeping the model
simple and fitting the training data closely. A larger value of 'C' means that the model will try
harder to minimize training errors, while a smaller value of C means that the model will be more
lenient in allowing larger errors.
• Model evaluation: Like any machine learning model, it’s important to evaluate the performance
of an SVR model. One common way to do this is to split the data into a training set and a test set,
and use the training set to fit the model and the test set to evaluate it. You can then use metrics
like mean squared error (MSE) or mean absolute error (MAE) to measure the error between the
predicted and true output values.
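A minimal SVR sketch using scikit-learn, comparing a linear and an RBF kernel and evaluating with mean squared error; the noisy sine data and the chosen C and epsilon values are illustrative assumptions:
```python
# Minimal sketch: support vector regression with two different kernels,
# evaluated with mean squared error on a held-out test set.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # noisy sine curve

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    model = SVR(kernel=kernel, C=1.0, epsilon=0.1).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{kernel} kernel MSE: {mse:.3f}")   # the RBF kernel fits the nonlinear curve better
```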
3.Perceptron Algorithm
Perceptron is a Machine Learning algorithm for supervised learning of various binary classification tasks.
Further, Perceptron is also understood as an Artificial Neuron or neural network unit that helps to
detect certain input data computations in business intelligence.
Perceptron model is also treated as one of the best and simplest types of Artificial Neural networks.
However, it is a supervised learning algorithm of binary classifiers. Hence, we can consider it as a single-
layer neural network with four main parameters, i.e., input values, weights and Bias, net sum, and an
activation function.
In Machine Learning, binary classifiers are defined as the function that helps in deciding whether input
data can be represented as vectors of numbers and belongs to some specific class.
Binary classifiers can be considered as linear classifiers. In simple words, we can understand it as
a classification algorithm that can predict linear predictor function in terms of weight and feature
vectors.
o Input Nodes or Input Layer: This is the primary component of Perceptron which accepts the initial data into the system for
further processing. Each input node contains a real numerical value.
o Weight and Bias: The weight parameter represents the strength of the connection between units. This is another
important parameter of the Perceptron components. Weight is directly proportional to the strength
of the associated input neuron in deciding the output. Further, bias can be considered as the
intercept in a linear equation.
o Activation Function:
o These are the final and important components that help to determine whether the neuron will fire
or not. Activation Function can be considered primarily as a step function.
o Sign function
This step function or Activation function plays a vital role in ensuring that output is mapped
between required values (0,1) or (-1,1). It is important to note that the weight of input is indicative
of the strength of a node. Similarly, an input's bias value gives the ability to shift the activation
function curve up or down.
Step-1
In the first step, multiply all input values with their corresponding weight values and then add
them to determine the weighted sum. Mathematically, we can calculate the weighted sum as
follows:
Add a special term called bias 'b' to this weighted sum to improve the model's performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
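A minimal sketch of Step-1 and Step-2 above in plain Python/NumPy; the inputs, weights and bias are made-up values and the step function is one possible activation choice:
```python
# Minimal sketch: weighted sum plus bias (Step-1), followed by a step
# activation function (Step-2), i.e. Y = f(sum(wi*xi) + b).
import numpy as np

def step(z):
    return 1 if z > 0 else 0          # hard-limit activation: output is 0 or 1

x = np.array([1.0, 0.0, 1.0])         # input values (made up)
w = np.array([0.6, -0.4, 0.3])        # weights, one per input (made up)
b = -0.5                              # bias term (made up)

weighted_sum = np.dot(w, x) + b       # Step-1: sum(wi * xi) + b
y = step(weighted_sum)                # Step-2: apply the activation function
print(weighted_sum, "->", y)          # 0.4 -> 1
```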
Based on the layers, Perceptron models are divided into two types. These are as follows:
Single Layer Perceptron Model
In a single layer perceptron model, the algorithm does not contain any previously recorded data, so it begins with
randomly allocated weight parameters. It then sums up all the weighted inputs. After
adding all inputs, if the total sum is more than a pre-determined value (the threshold), the model
gets activated and shows the output value as +1.
If the outcome matches the pre-determined or threshold value, the performance of this model
is considered satisfactory, and the weights are not changed. However, this model produces some
discrepancies when multiple weighted input values are fed into the model. Hence, to obtain the
desired output and minimize errors, some changes to the weights are necessary.
o The perceptron is a single processing unit of any neural network. Frank Rosenblatt first proposed
it in 1958 as a simple neuron which is used to classify its input into one of two categories.
Perceptron is a linear classifier, and is used in supervised learning. It helps to organize the given
input data.
o A perceptron is a neural network unit that does a precise computation to detect features in the
input data. Perceptron is mainly used to classify the data into two parts. Therefore, it is also
known as Linear Binary Classifier.
o Perceptron uses a step function that returns +1 if the weighted sum of its input is greater than or equal to 0, and -1 otherwise.
o The activation function is used to map the input between the required value like (0, 1) or (-1, 1).
o Input value or One input layer: The input layer of the perceptron is made of artificial input
neurons and takes the initial data into the system for further processing.
o The perceptron works on these simple steps, which are given below:
o a. In the first step, all the inputs x are multiplied with their weights w.
o b. In this step, add all the multiplied values and call the result the weighted sum.
o c. In the last step, apply the weighted sum to a suitable activation function.
o Multi-Layer Perceptron
o The single-layer perceptron was the first neural network model, proposed in 1958 by Frank
Rosenblatt. It is one of the earliest models for learning. Our goal is to find a linear decision
function measured by the weight vector w and the bias parameter b.
o The artificial neural network (ANN) is an information processing system, whose mechanism is
inspired by the functionality of biological neural circuits. An artificial neural network consists of
several processing units that are interconnected.
o This is the first proposal when the neural model is built. The content of the neuron's local memory
contains a vector of weight.
o A single-layer perceptron is computed by taking the sum of the input vector, with each input
component multiplied by the corresponding element of the weight vector. The resulting value is
the input of an activation function, whose output is displayed as the output of the perceptron.
o Let us focus on the implementation of a single-layer perceptron for an image classification
problem using TensorFlow. The best example of drawing a single-layer perceptron is through the
representation of "logistic regression."
o The weights are initialized with the random values at the origination of each training.
o For each element of the training set, the error is calculated with the difference between the
desired output and the actual output. The calculated error is used to adjust the weight.
o The process is repeated until the error made on the entire training set is less than the specified
limit, or until the maximum number of iterations has been reached.
The multi-layer perceptron model is also known as the Backpropagation algorithm, which
executes in two stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward stage and terminate
on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's
requirement. In this stage, the error between actual output and demanded originated backward
on the output layer and ended on the input layer.
Hence, a multi-layered perceptron model is considered as multiple artificial neural networks
having various layers in which the activation function does not remain linear, unlike in a single layer
perceptron model. Instead of linear, the activation function can be sigmoid, TanH, ReLU,
etc., for deployment.
A multi-layer perceptron model has greater processing power and can process linear and non-
linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT,
XNOR, NOR.
o It helps to obtain the same accuracy ratio with large as well as small data.
o In multi-layer Perceptron, it is difficult to predict how much the dependent variable affects each
independent variable.
Perceptron Function
Perceptron function ''f(x)'' can be achieved as output by multiplying the input 'x' with the learned
weight coefficient 'w'.
f(x)=1; if w.x+b>0
otherwise, f(x)=0
Characteristics of Perceptron
1. Initially, weights are multiplied with input features, and the decision is made whether the neuron
is fired or not.
2. The activation function applies a step rule to check whether the weighted sum is greater than
zero.
3. The linear decision boundary is drawn, enabling the distinction between the two linearly
separable classes +1 and -1.
4. If the added sum of all input values is more than the threshold value, it must have an output
signal; otherwise, no output will be shown.
o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit transfer
function.
o Perceptron can only be used to classify the linearly separable sets of input vectors. If input vectors
are non-linear, it is not easy to classify them properly.
o Neural networks or artificial neural networks are fundamental tools in machine learning, powering
many state-of-the-art algorithms and applications across various domains, including computer
vision, natural language processing, robotics, and more.
o Figure: Biological Neuron vs. Artificial Neural Network (Source: ResearchGate)
o The ANN depicted on the right of the image is a simple neural network called ‘perceptron’. It
consists of a single layer, which is the input layer, with multiple neurons with their own weights;
there are no hidden layers. The perceptron algorithm learns the weights for the input signals in
order to draw a linear decision boundary.
o There are several types of ANN, each designed for specific tasks and architectural requirements.
Let's briefly discuss some of the most common types before diving deeper into MLPs next.
o Feedforward Neural Networks (FNN): These are the simplest form of ANNs, where information flows in one direction, from input to
output. There are no cycles or loops in the network architecture. Multilayer perceptrons (MLP) are
a type of feedforward neural network.
o Recurrent Neural Networks (RNN): In RNNs, connections between nodes form directed cycles, allowing information to persist over
time. This makes them suitable for tasks involving sequential data, such as time series prediction,
natural language processing, and speech recognition.
o Convolutional Neural Networks (CNN): CNNs are designed to effectively process grid-like data, such as images. They consist of layers of
convolutional filters that learn hierarchical representations of features within the input data. CNNs
are widely used in tasks like image classification, object detection, and image segmentation.
o Long Short-Term Memory Networks (LSTM) and Gated Recurrent Units (GRU)
o These are specialized types of recurrent neural networks designed to address the vanishing
gradient problem in traditional RNN. LSTMs and GRUs incorporate gated mechanisms to better
capture long-range dependencies in sequential data, making them particularly effective for tasks
like speech recognition, machine translation, and sentiment analysis.
o Autoencoder
o It is designed for unsupervised learning and consists of an encoder network that compresses the
input data into a lower-dimensional latent space, and a decoder network that reconstructs the
original input from the latent representation. Autoencoders are often used for dimensionality
reduction, data denoising, and generative modeling.
o Generative Adversarial Networks (GAN): GANs consist of two neural networks, a generator and a discriminator, trained simultaneously in a
competitive setting. The generator learns to generate synthetic data samples that are
indistinguishable from real data, while the discriminator learns to distinguish between real and
fake samples. GANs have been widely used for generating realistic images, videos, and other
types of data.
Multilayer Perceptrons
o A multilayer perceptron is a type of feedforward neural network consisting of fully connected
neurons with a nonlinear kind of activation function. It is widely used to distinguish data that is
not linearly separable.
o Input layer
o The input layer consists of nodes or neurons that receive the initial input data. Each neuron
represents a feature or dimension of the input data. The number of neurons in the input layer is
determined by the dimensionality of the input data.
o Hidden layer
o Between the input and output layers, there can be one or more layers of neurons. Each neuron in
a hidden layer receives inputs from all neurons in the previous layer (either the input layer or
another hidden layer) and produces an output that is passed to the next layer. The number of
hidden layers and the number of neurons in each hidden layer are hyperparameters that need to
be determined during the model design phase.
o Output layer
o This layer consists of neurons that produce the final output of the network. The number of
neurons in the output layer depends on the nature of the task. In binary classification, there may
be either one or two neurons depending on the activation function and representing the
probability of belonging to one class; while in multi-class classification tasks, there can be multiple
neurons in the output layer.
o Weights
o Neurons in adjacent layers are fully connected to each other. Each connection has an associated
weight, which determines the strength of the connection. These weights are learned during the
training process.
o Bias neurons
o In addition to the input and hidden neurons, each layer (except the input layer) usually includes a
bias neuron that provides a constant input to the neurons in the next layer. Bias neurons have
their own weight associated with each connection, which is also learned during training.
o The bias neuron effectively shifts the activation function of the neurons in the subsequent layer,
allowing the network to learn an offset or bias in the decision boundary. By adjusting the weights
connected to the bias neuron, the MLP can learn to control the threshold for activation and better
fit the training data.
o Activation function
o Typically, each neuron in the hidden layers and the output layer applies an activation function to
its weighted sum of inputs. Common activation functions include sigmoid, tanh, ReLU (Rectified
Linear Unit), and softmax. These functions introduce nonlinearity into the network, allowing it to
learn complex patterns in the data.
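o As a minimal illustration of these activation functions (assuming NumPy is available; the sample values are hypothetical), they can be written as follows:

import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes values into (-1, 1), centered at zero
    return np.tanh(z)

def relu(z):
    # Passes positive values through, zeroes out negatives
    return np.maximum(0.0, z)

def softmax(z):
    # Converts a vector of scores into probabilities that sum to 1
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z))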
o MLPs are trained using the backpropagation algorithm, which computes gradients of a loss
function with respect to the model's parameters and updates the parameters iteratively to
minimize the loss.
o [Figure: example of an MLP with two hidden layers.]
o The input layer of an MLP receives input data, which could be features extracted from the input
samples in a dataset. Each neuron in the input layer represents one feature.
o Neurons in the input layer do not perform any computations; they simply pass the input values to
the neurons in the first hidden layer.
o Hidden layers
o The hidden layers of an MLP consist of interconnected neurons that perform computations on the
input data.
o Each neuron in a hidden layer receives input from all neurons in the previous layer. The inputs are
multiplied by corresponding weights, denoted as w. The weights determine how much influence
the input from one neuron has on the output of another.
o Each neuron therefore computes the weighted sum Σ (i = 1 to n) wi · xi,
where n is the total number of input connections, wi is the weight for the i-th input, and xi is the
i-th input value.
o The weighted sum is then passed through an activation function, denoted as f. The activation
function introduces nonlinearity into the network, allowing it to learn and represent complex
relationships in the data. The activation function determines the output range of the neuron and
its behavior in response to different input values. The choice of activation function depends on
the nature of the task and the desired properties of the network.
o Output layer
o The output layer of an MLP produces the final predictions or outputs of the network. The number
of neurons in the output layer depends on the task being performed (e.g., binary classification,
multi-class classification, regression).
o Iterative Optimization: The aim of this step is to find the minimum of a loss function, by
iteratively moving in the direction of the steepest decrease in the function's value.
o Compute the gradient of the loss function with respect to the model parameters using only the
data points in the mini-batch. This gradient estimation is a stochastic approximation of the true
gradient.
o Update the model parameters by taking a step in the opposite direction of the gradient, scaled by
a learning rate:
θ(t+1) = θ(t) − η · ∇J(θ(t))
where θ(t) represents the model parameters (for example, the weights) at iteration t, ∇J(θ(t)) is the
gradient of the loss function J with respect to the parameters θ(t), and η is the learning rate, which
controls the size of the steps taken during optimization.
o Direction of Descent: The gradient of the loss function indicates the direction of the steepest
ascent. To minimize the loss function, gradient descent moves in the opposite direction, towards
the steepest descent.
o Learning Rate: The step size taken in each iteration of gradient descent is determined by a
parameter called the learning rate, denoted above as η. This parameter controls the size of the
steps taken towards the minimum. If the learning rate is too small, convergence may be slow; if it
is too large, the algorithm may oscillate or diverge.
o Convergence: Repeat the process for a fixed number of iterations or until a convergence criterion
is met (e.g., the change in loss function is below a certain threshold).
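o As a minimal sketch of this update rule (using a hypothetical one-parameter loss J(θ) = (θ − 3)^2 whose gradient is known in closed form), gradient descent can be written as:

# Illustrative loss J(theta) = (theta - 3)^2 with gradient 2*(theta - 3)
def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0        # initial parameter value
eta = 0.1          # learning rate
for t in range(100):
    g = grad_J(theta)          # gradient at the current parameter value
    theta = theta - eta * g    # step in the opposite direction of the gradient
    if abs(g) < 1e-6:          # convergence criterion
        break
print(theta)       # approaches the minimizer theta = 3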
o Backpropagation
o Forward Pass: During the forward pass, input data is fed into the neural network, and the
network's output is computed layer by layer. Each neuron computes a weighted sum of its inputs,
applies an activation function to the result, and passes the output to the neurons in the next layer.
o Loss Computation: After the forward pass, the network's output is compared to the true target
values, and a loss function is computed to measure the discrepancy between the predicted output
and the actual output.
o Backward Pass (Gradient Calculation): In the backward pass, the gradients of the loss function
with respect to the network's parameters (weights and biases) are computed using the chain rule
of calculus. The gradients represent the rate of change of the loss function with respect to each
parameter and provide information about how to adjust the parameters to decrease the loss.
o Parameter update: Once the gradients have been computed, the network's parameters are
updated in the opposite direction of the gradients in order to minimize the loss function. This
update is typically performed using an optimization algorithm such as stochastic gradient descent
(SGD), that we discussed earlier.
o Iterative Process: Steps 1-4 are repeated iteratively for a fixed number of epochs or until
convergence criteria are met. During each iteration, the network's parameters are adjusted based
on the gradients computed in the backward pass, gradually reducing the loss and improving the
model's performance.
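o The forward pass, loss computation, backward pass, and parameter update can be sketched end to end for a tiny one-hidden-layer MLP. This is an illustrative NumPy sketch only (toy XOR-style data, sigmoid activations, a mean-squared-error loss, and gradients scaled up to a constant factor absorbed into the learning rate), not a production implementation:

import numpy as np

rng = np.random.default_rng(0)

# Toy, non-linearly separable data (hypothetical values for illustration)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))   # hidden layer
W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))   # output layer
eta = 1.0

for epoch in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)         # hidden-layer activations
    y_hat = sigmoid(h @ W2 + b2)     # network output
    loss = np.mean((y_hat - y) ** 2) # mean squared error

    # Backward pass (chain rule); constant factors folded into the learning rate
    d_out = (y_hat - y) * y_hat * (1 - y_hat)   # error signal at the output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)        # error signal at the hidden layer

    # Parameter update (gradient descent step)
    W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ d_hid;  b1 -= eta * d_hid.sum(axis=0, keepdims=True)

print("final loss:", loss)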
o Preparing data for training an MLP involves cleaning, preprocessing, scaling, splitting, formatting,
and maybe even augmenting the data. Based on the activation functions used and the scale of
the input features, the data might need to be standardized or normalized. Experimenting with
different preprocessing techniques and evaluating their impact on model performance is often
necessary to determine the most suitable approach for a particular dataset and task.
o Encode categorical variables: Convert categorical variables into numerical representations, such as
one-hot encoding.
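o A minimal one-hot encoding sketch, assuming pandas is available and using a hypothetical data frame with one categorical column:

import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"],
                   "size_cm": [10, 12, 9, 11]})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(df, columns=["colour"])
print(encoded)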
o Feature Scaling
o Standardization or normalization: Rescale the features to a similar scale to ensure that the
optimization process converges efficiently.
o Standardization (Z-score normalization): Subtract the mean and divide by the standard deviation
of each feature. It centers the data around zero and scales it to have unit variance.
o Normalization (Min-Max scaling): Scale the features to a fixed range, typically between 0 and 1, by
subtracting the minimum value and dividing by the range (max-min).
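o One way to apply both rescalings is via scikit-learn's StandardScaler and MinMaxScaler; the feature matrix below is a hypothetical example:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Standardization: zero mean and unit variance per feature
X_std = StandardScaler().fit_transform(X)

# Normalization: rescale each feature to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)
print(X_std)
print(X_norm)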
o Train-Validation-Test Split
o Split the dataset into training, validation, and test sets. The training set is used to train the model,
the validation set is used to tune hyperparameters and monitor model performance, and the test
set is used to evaluate the final model's performance on unseen data.
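o A minimal sketch of such a split using scikit-learn; the synthetic dataset and the 60/20/20 proportions are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First hold out 20% as the test set, then split the remainder into train and validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# Resulting proportions: roughly 60% train, 20% validation, 20% test
print(len(X_train), len(X_val), len(X_test))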
o Data Formatting
o Ensure that the data is in the appropriate format for training. This may involve reshaping the data
or converting it to the required data type (e.g., converting categorical variables to numeric).
o For tasks such as image classification, data augmentation techniques such as rotation, flipping,
and scaling may be applied to increase the diversity of the training data and improve model
generalization.
o The choice between standardization and normalization may depend on the activation functions
used in the MLP. Activation functions like sigmoid and tanh are sensitive to the scale of the input
data and may benefit from standardization. On the other hand, activation functions like ReLU are
less sensitive to the scale and may not require standardization.
o 1. Model architecture
o Begin with a simple architecture and gradually increase complexity as needed. Start with a single
hidden layer and a small number of neurons, and then experiment with adding more layers and
neurons if necessary.
o 2. Task complexity
o For simple tasks with relatively low complexity, such as binary classification or regression on small
datasets, a shallow architecture with fewer layers and neurons may suffice.
o For more complex tasks, such as multi-class classification or regression on high-dimensional data,
deeper architectures with more layers and neurons may be necessary to capture intricate patterns
in the data.
o 3. Data preprocessing
o Clean and preprocess your data, including handling missing values, encoding categorical
variables, and scaling numerical features.
o Split your data into training, validation, and test sets to evaluate the model's performance.
o 4. Initialization
o Initialize the weights and biases of your MLP appropriately. Common initialization techniques
include random initialization with small weights or using techniques like Xavier or He initialization.
o 5. Experimentation
o Ultimately, the best approach is to experiment with different architectures, varying the number of
layers and neurons, and evaluate their performance empirically.
o 6. Training
o Train your MLP using the training data and monitor its performance on the validation set.
o Experiment with different batch sizes, number of epochs, and other hyperparameters to find the
optimal training settings.
o Visualize training progress using metrics such as loss and accuracy to diagnose issues and track
convergence.
o 7. Optimization algorithm
o Experiment with different learning rates and consider using techniques like learning rate
schedules or adaptive learning rates.
o 8. Avoid overfitting
o Be cautious not to overfit the model to the training data by introducing unnecessary complexity.
o Use techniques such as regularization (e.g., L1, L2 regularization), dropout, and early stopping to
prevent overfitting and improve generalization performance.
o Tune the regularization strength based on the model's performance on the validation set.
o 9. Model evaluation
o Monitor the model's performance on a separate validation set during training to assess how
changes in architecture affect performance.
o Evaluate the trained model on the test set to assess its generalization performance.
o Use metrics such as accuracy, loss, and validation error to evaluate the model's performance and
guide architectural decisions.
o Iterate on your implementation based on insights gained from training and evaluation results.
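o The steps above can be sketched with scikit-learn's MLPClassifier; this is only one possible setup, and the synthetic dataset, layer sizes, and hyperparameter values are illustrative assumptions rather than recommendations:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize features, fitting the scaler on the training data only
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Start simple: one hidden layer, ReLU activation, L2 regularization (alpha), early stopping
clf = MLPClassifier(hidden_layer_sizes=(32,), activation="relu", alpha=1e-3,
                    learning_rate_init=0.01, early_stopping=True,
                    max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))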
Widrow-Hoff Algorithm
The Widrow-Hoff algorithm was developed by Bernard Widrow and his student Ted Hoff in the 1960s
for minimizing the mean squared error between a desired output and the output produced by a
linear predictor.
Widrow-Hoff Learning Algorithm, known as the Least Mean Squares (LMS) algorithm, is used
in machine learning, deep learning, and adaptive signal processing. The algorithm is primarily
used for supervised learning where the system iteratively adjusts the parameters to approximate a
desired target function. It operates by updating the weights of the linear predictor so that the
predicted output converges to the actual output over time.
The update rule guides how the weights of Adaptive Linear Neuron (ADALINE) are adjusted based
on the error between expected output and observed output. The weight update rule in the
Widrow-Hoff algorithm is given by:
w(t+1)=w(t)+η(d(t)−y(t))x(t)
Here,
• w(t) and w(t+1) are the weight vectors before and after the update, respectively.
• x(t) is the input vector at time t, d(t) is the desired (target) output, and y(t) is the output produced by the linear predictor.
• η is the learning rate, a small positive constant that determines the step size of the weight update.
Interpretation
• Error Signal: d(t)−y(t) calculates the error. Positive error means actual output needs to increase,
negative means decrease.
• Learning Rate: η scales the error for weight update. A larger error will result in a bigger
adjustment to the weights.
• Direction of Update: x(t) dictates weight update direction. Positive error adjusts weights in input
direction, negative adjusts in opposite direction.
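A minimal NumPy sketch of this update rule on a hypothetical linear target (the true weights [2, −3] and the noise level are made-up values for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear target: d = 2*x1 - 3*x2 + small noise
X = rng.normal(size=(200, 2))
d = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.01 * rng.normal(size=200)

w = np.zeros(2)     # initial weight vector
eta = 0.05          # learning rate

for t in range(len(X)):
    x_t = X[t]
    y_t = w @ x_t                 # output of the linear predictor
    error = d[t] - y_t            # error signal d(t) - y(t)
    w = w + eta * error * x_t     # Widrow-Hoff (LMS) weight update
print(w)   # should approach [2, -3]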
Applications of Widrow-Hoff Algorithm
1. Adaptive Filtering: Widrow-Hoff is used in systems for reducing unwanted noise or interference
and enhancing the desired signal.
2. Pattern Recognition: Widrow-Hoff assists in categorizing input data into defined groups based
on their features.
3. Signal Processing: Widrow-Hoff is helpful where signals are distorted or have different
conditions.
The delta rule, also known as the Widrow–Hoff rule or the least mean squares (LMS) rule, is an
error-correcting rule for neural networks that uses gradient descent to minimize the error in the
network's output. It can be applied to units with linear, binary-threshold, or non-linear activation
functions and to binary or continuous input patterns.
The delta rule algorithm works by selecting a data point, calculating its inner product with the
weight vector, comparing the resulting output with the desired target, and updating the weights in
proportion to the error and the input, as in the LMS update rule above.
The delta rule can be modified to speed up network training by adding a momentum term,
which is called the generalized delta rule.
Frank Rosenblatt first introduced the term perceptron with his perceptron model. The perceptron is a basic
unit of an artificial neural network that defines the artificial neuron. It is a supervised learning
algorithm that combines inputs, weights, node values, and an activation function to calculate the
output.
The Multilayer Perceptron (MLP) Neural Network works only in the forward direction. All nodes are fully
connected to the network. Each node passes its value to the coming node only in the forward direction.
The MLP neural network uses a Backpropagation algorithm to increase the accuracy of the training
model.
Structure of MultiLayer Perceptron Neural Network
This network has three main layers that combine to form a complete Artificial Neural Network. These
layers are as follows:
Input Layer
It is the initial or starting layer of the Multilayer perceptron. It takes input from the training data set and
forwards it to the hidden layer. There are n input nodes in the input layer. The number of input nodes
depends on the number of dataset features. Each input vector variable is distributed to each of the nodes
of the hidden layer.
Hidden Layer
It is the heart of all Artificial neural networks. This layer comprises all computations of the neural network.
The edges of the hidden layer have weights multiplied by the node values. This layer uses the activation
function.
Output Layer
This layer gives the estimated output of the neural network. The number of nodes in the output layer
depends on the type of problem: for a single target variable, one output node is used, while for an
N-class classification problem, N output nodes are used.
Each input node passes the vector input value to the hidden layer.
In the hidden layer, each edge weight is multiplied by the corresponding input value, and the
weighted values arriving at each hidden node are summed to generate that node's output.
The activation function is applied in the hidden layer to identify the active nodes.
Calculate the difference between predicted and actual output at the output layer.
The model uses backpropagation after calculating the predicted output.
Calculate the error after calculating the output from the Multilayer perceptron neural network.
This error is the difference between the output generated by the neural network and the actual
output. The calculated error is fed back to the network, from the output layer to the hidden layer.
The model reduces error by adjusting the weights in the hidden layer.
The predicted output is then recalculated with the adjusted weights and the error is checked again.
This process is repeated until the error is minimal or zero.
What is backpropagation?
In machine learning, backpropagation is an effective algorithm used to train artificial neural
networks, especially in feed-forward neural networks.
Backpropagation is an iterative algorithm that helps to minimize the cost function by determining
which weights and biases should be adjusted. During every epoch, the model learns by adapting
the weights and biases to minimize the loss, moving down along the gradient of the error.
It is typically combined with an optimization algorithm such as gradient
descent or stochastic gradient descent.
Computing the gradient in the backpropagation algorithm helps to minimize the cost
function and it can be implemented by using the mathematical rule called chain rule from calculus
to navigate through complex layers of the neural network.
Ease of Implementation: Backpropagation does not require prior knowledge of neural networks,
making it accessible to beginners. Its straightforward nature simplifies the programming process,
as it primarily involves adjusting weights based on error derivatives.
Simplicity and Flexibility: The algorithm’s simplicity allows it to be applied to a wide range of
problems and network architectures. Its flexibility makes it suitable for various scenarios, from
simple feedforward networks to complex recurrent or convolutional neural networks.
Efficiency: Backpropagation accelerates the learning process by directly updating weights based
on the calculated error derivatives. This efficiency is particularly advantageous in training deep
neural networks, where learning features of a function can be time-consuming.
Scalability: Backpropagation scales well with the size of the dataset and the complexity of the
network. This scalability makes it suitable for large-scale machine learning tasks, where training
data and network size are significant factors.
In conclusion, the backpropagation algorithm offers several advantages that contribute to its
widespread use in training neural networks. Its ease of implementation, simplicity, efficiency,
generalization ability, and scalability make it a valuable tool for developing and training neural
network models for various machine learning applications.
Working of Backpropagation Algorithm
Forward pass
Backward pass
Comparison of the MLP with other networks:
Type of input: An MLP takes vector inputs, whereas a CNN can take both vectors and matrices (e.g., images) as input.
Network type: An MLP is a fully connected neural network, whereas a CNN is a spatially (locally) connected neural network.
Focus problem: An MLP can deal with non-linear problems, whereas a single-layer perceptron can only deal with linearly separable problems.
MultiLayer Perceptron Neural Networks can easily work with non-linear problems.
Developers use this model to deal with the fitting problems of neural networks.
It has a higher accuracy rate and reduces prediction error by using backpropagation.
After training, the Multilayer Perceptron Neural Network quickly predicts the output.
This neural network involves a large amount of computation, which sometimes increases the overall cost of
the model.
Due to the model's dense connections, the number of parameters and node redundancy
increases.
3. Radial Basis Function Networks
Radial Basis Function (RBF)
Neural Networks are a specialized type of Artificial Neural Network (ANN) used primarily
for function approximation tasks. Known for their distinct three-layer architecture and
universal approximation capabilities, RBF Networks offer faster learning speeds and
efficient performance in classification and regression problems.
1. Input Layer: Receives input data and passes it to the hidden layer.
2. Hidden Layer: The core computational layer, where RBF neurons process the data.
3. Output Layer: Produces the final output as a weighted sum of the hidden-layer activations.
1. Input Vector: The network receives an n-dimensional input vector that needs
classification or regression.
2. RBF Neurons: Each neuron in the hidden layer represents a prototype vector from the
training set. The network computes the Euclidean distance between the input vector and
each neuron’s center.
3. Activation Function: The Euclidean distance is transformed using a Radial Basis
Function (typically a Gaussian function) to compute the neuron’s activation value. This
value decreases exponentially as the distance increases.
4. Output Nodes: Each output node calculates a score based on a weighted sum of the
activation values from all RBF neurons. For classification, the category with the highest
score is chosen.
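A minimal NumPy sketch of this working, for a hypothetical 1-D function-approximation task. Choosing the centers as a random subset of the training points, fixing a single spread value, and fitting the output weights with linear least squares are simplifying assumptions, not the only way to train an RBF network:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical regression problem: approximate y = sin(x)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel()

# RBF centers (here: a random subset of training points) and spread sigma
centers = X[rng.choice(len(X), size=10, replace=False)]
sigma = 1.0

def rbf_activations(X, centers, sigma):
    # Gaussian activation exp(-||x - c||^2 / (2*sigma^2)) for every (sample, center) pair
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Hidden-layer outputs plus a bias column for the linear output layer
Phi = np.hstack([rbf_activations(X, centers, sigma), np.ones((len(X), 1))])

# Train the output weights with linear least squares
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Predict on new points
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
Phi_new = np.hstack([rbf_activations(X_new, centers, sigma), np.ones((5, 1))])
print(Phi_new @ w)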
Radial Basis Functions: These are real-valued functions dependent solely on the
distance from a central point. The Gaussian function is the most commonly used type.
Center and Radius: Each RBF neuron has a center and a radius (spread). The radius
affects how broadly each neuron influences the input space.
Input Layer
Function: After receiving the input features, the input layer sends them straight to the
hidden layer.
Hidden Layer
Function: This layer uses radial basis functions (RBFs) to conduct the non-linear
transformation of the input data.
Components: Neurons in the hidden layer apply the RBF to the incoming data. The
Gaussian function is the RBF that is most frequently used.
RBF Neurons: Every neuron in the hidden layer has a center (also referred to as a prototype
vector) and a spread parameter (σ). The distance between the input vector and the neuron's
center, scaled by the spread parameter, determines the neuron's output.
Output Layer
Function: The output layer uses weighted sums to integrate the hidden layer neurons’
outputs to create the network’s final output.
Components: It is made up of neurons that combine the outputs of the hidden layer in a
linear fashion. To reduce the error between the network’s predictions and the actual
target values, the weights of these combinations are changed during training.
1. Universal Approximation: RBF Networks can approximate any continuous function with
arbitrary accuracy given enough neurons.
2. Faster Learning: The training process is generally faster compared to other neural
network architectures.
Classification: RBF Networks are used in pattern recognition and classification tasks,
such as speech recognition and image classification.
Regression: These networks can model complex relationships in data for prediction
tasks.
Function Approximation: RBF Networks are effective in approximating non-linear
functions.
There are specialized terms associated with decision trees that denote various
components and facets of the tree structure and the decision-making procedure:
Root Node: A decision tree’s root node, which represents the original choice or feature
from which the tree branches, is the highest node.
Internal Nodes (Decision Nodes): Nodes in the tree whose choices are determined by
the values of particular attributes. There are branches on these nodes that go to other
nodes.
Leaf Nodes (Terminal Nodes): The ends of the branches, where the final decisions or
predictions are made. Leaf nodes have no further branches.
Branches (Edges): Links between nodes that show how decisions are made in response
to particular circumstances.
Splitting: The process of dividing a node into two or more sub-nodes based on a
decision criterion. It involves selecting a feature and a threshold to create subsets of
data.
Parent Node: A node that is split into child nodes. The original node from which a split
originates.
Decision Criterion: The rule or condition used to determine how the data should be
split at a decision node. It involves comparing feature values against a threshold.
Pruning: The process of removing branches or nodes from a decision tree to improve its
generalisation and prevent overfitting.
Understanding these terminologies is crucial for interpreting and working with decision
trees in machine learning applications.
Decision trees are widely used in machine learning for a number of reasons:
Their hierarchical structure makes it possible to represent complex decision scenarios that
take into account a variety of factors and outcomes.
Because they provide comprehensible insights into the decision logic, decision trees are
especially helpful for classification and regression tasks.
They handle both numerical and categorical data, and they adapt easily to a variety of
datasets thanks to their built-in feature selection capability.
Decision trees also lend themselves to simple visualization, which helps in understanding
and explaining the underlying decision process of a model.
Decision tree uses the tree representation to solve the problem in which each leaf node
corresponds to a class label and attributes are represented on the internal node of the
tree. We can represent any boolean function on discrete attributes using the decision
tree.
Below are some assumptions that we make while using a decision tree:
Feature values are preferred to be categorical. If the values are continuous, they are
discretized prior to building the model.
We use statistical measures to decide the ordering of attributes as the root or internal nodes.
The decision tree works on the Sum of Products form, which is also known as Disjunctive
Normal Form (for example, a tree that predicts whether a person uses a computer in daily
life). In a decision tree, the major challenge is the identification of the attribute for the root
node at each level. This process is known as attribute selection. We have two popular
attribute selection measures:
1. Information Gain
2. Gini Index
1. Information Gain:
When we use a node in a decision tree to partition the training instances into smaller
subsets, the entropy changes. Information gain is a measure of this change in entropy.
If S is the set of training instances, A is an attribute, Values(A) is the set of all possible
values of A, and Sv is the subset of S for which attribute A has value v, then:
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
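A small Python sketch of these quantities (entropy, Gini index, and information gain) on hypothetical toy data; the labels and attribute values below are made up purely to illustrate the formulas:

import numpy as np
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the classes present in S
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    # Gini(S) = 1 - sum(p_i^2)
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def information_gain(labels, attribute_values):
    # Gain(S, A) = Entropy(S) - sum_v (|Sv| / |S|) * Entropy(Sv)
    total = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: does a person use a computer daily, split by age group?
labels = ["yes", "yes", "no", "yes", "no", "no"]
age = ["young", "young", "old", "young", "old", "old"]
print(entropy(labels), gini(labels), information_gain(labels, age))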
Introduction:
A decision tree is a type of supervised learning algorithm that is commonly used in machine
learning to model and predict outcomes based on input data. It is a tree-like structure where each
internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node
represents the final decision or prediction. The decision tree algorithm falls under the category
of supervised learning and can be used to solve both regression and classification problems.
An example of a decision tree can be illustrated with a binary tree. Let's say you
want to predict whether a person is fit given information such as their age, eating habits, and
physical activity. The decision nodes are questions like 'What is the age?', 'Does
he exercise?', or 'Does he eat a lot of pizza?', and the leaves are outcomes such as
'fit' or 'unfit'. In this case it is a binary classification problem (a yes/no type of problem).
There are two main types of Decision Trees:
1. Classification trees (categorical data types): what we have seen above is an example of a
classification tree, where the outcome is a variable like 'fit' or 'unfit'. Here the decision
variable is categorical.
2. Regression trees (continuous data types): used when the decision or outcome variable is
continuous, such as a price or a temperature.
Purity and impurity of a node are the primary focus of the Entropy and
Information Gain framework.
The Gini Index, also known as impurity, calculates the likelihood that a
randomly picked instance would be incorrectly classified.
Difference between Gini Index and Entropy
Gini Index: It is the probability of misclassifying a randomly chosen element in a set. Its
formula is Gini(P) = 1 − ∑(px)^2, where px is the proportion of the instances of class x in the
set. The Gini index is typically used in CART (Classification and Regression Trees) algorithms.
Entropy: It measures the amount of uncertainty or randomness in a set. Its formula is
Entropy(P) = −∑ px log(px), where px is the proportion of the instances of class x in the set.
Entropy is typically used in the ID3 and C4.5 algorithms.
Entropy / Information gain
Calculate the entropy of each child node, then calculate the entropy of each split using the
weighted average entropy of the child nodes. Choose the split with the lowest entropy or the
greatest gain in information.
Chi-square
Calculate the Chi-square value for each split by summing the Chi-square values for all the child
nodes. The split with the highest Chi-square value is chosen.
Variance reduction
Calculate the variance of the parent node, then calculate the weighted average of the child
nodes' variance. The difference between the two is the variance reduction.
Hellinger distance
Calculate the Hellinger distance by minimizing the distance between the probability distribution
of the parent node and the probability distributions of the child nodes.
ID3 stands for Iterative Dichotomiser 3. Before discussing the ID3 algorithm, we will go
through a few definitions.
This type of tree uses a value called Information Gain, based on Entropy, to determine which
features to use. Entropy multiplies the probability of each class by the log (base 2) of that class
probability and sums the results (with a negative sign); information gain measures how much a
split reduces this entropy.
Everything mentioned above might seem a little complicated, but there are Python
libraries that make using this algorithm super easy! All you need to do is write three lines
of code, and everything I’ve explained so far happens behind the scenes.
ID3 metrices
The ID3 algorithm utilizes metrics related to information theory, particularly entropy and
information gain, to make decisions during the tree-building process.
1. Determine the entropy of the overall dataset using the class distribution:
H(S) = −Σ (pi · log2(pi))
where pi is the proportion of instances of class i in S, and i ranges over the set of classes in S.
2. For each feature, assess the information gain over its unique categorical values:
IG(S, A) = H(S) − Σ_v (|Sv| / |S|) × H(Sv)
where |S| is the total number of instances in the dataset and |Sv| is the number of instances
for which attribute A has value v.
3. Select the feature with the highest information gain as the decision node.
4. Iteratively apply all of the above steps to build the decision tree structure.
7. C4.5
The C4.5 algorithm is used in Data Mining as a Decision Tree Classifier which can be
employed to generate a decision, based on a certain sample of data (univariate or
multivariate predictors).
When dealing with continuous attributes, C4.5 first sorts the attribute's values and then
chooses the midpoint between each pair of adjacent values as a potential split point. Next, it
calculates the information gain or gain ratio for each candidate split point and selects the one
with the largest value.
C4.5 introduces a number of extensions of the original ID3 algorithm. When building a
decision tree, it can deal with training sets that have records with unknown attribute values
by evaluating the gain, or the gain ratio, for an attribute using only the records where that
attribute is defined.
The algorithm also employs a single-pass pruning process to mitigate overfitting.
C4.5 is not the best algorithm in every situation, but it certainly proves useful in many cases.
CART was first produced by Leo Breiman, Jerome Friedman, Richard Olshen, and
Charles Stone in 1984
The term CART serves as a generic term for the following categories of decision trees:
Classification Trees: The tree is used to determine which "class" the target variable is
most likely to fall into when the target is categorical.
Regression Trees: The tree is used to predict the value of a continuous target variable.
Tree structure: CART builds a tree-like structure consisting of nodes and branches. The
nodes represent different decision points, and the branches represent the possible
outcomes of those decisions. The leaf nodes in the tree contain a predicted class label or
value for the target variable.
Splitting criteria: CART uses a greedy approach to split the data at each node. It
evaluates all possible splits and selects the one that best reduces the impurity of the
resulting subsets. For classification tasks, CART uses Gini impurity as the splitting
criterion; the lower the Gini impurity, the purer the subset. For regression tasks,
CART uses residual (variance) reduction as the splitting criterion; the greater the residual
reduction, the better the fit of the model to the data.
Pruning: To prevent overfitting, pruning is used to remove nodes that contribute little to the
model's accuracy. Cost-complexity pruning and information-gain pruning are two popular
techniques. Cost-complexity pruning weighs each subtree's error against its complexity
(number of leaves) and removes subtrees whose removal increases the error the least.
Information-gain pruning removes nodes that provide low information gain.
At each node, the best split point is first found for each input feature, and the overall "best"
split is then chosen from among these per-feature best splits.
CART for classification works by recursively splitting the training data into smaller and
smaller subsets based on certain criteria. The goal is to split the data in a way that
minimizes the impurity within each subset. Impurity is a measure of how mixed up the
data is in a particular subset. For classification tasks, CART uses Gini impurity
Splitting Criteria- The CART algorithm evaluates all potential splits at every node and
chooses the one that best decreases the Gini impurity of the resultant subsets. This
process continues until a stopping criterion is reached, like a maximum tree depth or a
minimum number of instances in a leaf node.
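A minimal sketch of such a classification tree using scikit-learn's DecisionTreeClassifier, which builds a CART-style tree with Gini impurity; the dataset and the stopping-criterion values (max_depth, min_samples_leaf) are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Gini-impurity splits with simple stopping criteria to limit tree growth
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))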
A Regression tree is an algorithm where the target variable is continuous and the tree is
used to predict its value. Regression trees are used when the response variable is
continuous. For example, if the response variable is the temperature of the day.
Regression CART works by splitting the training data recursively into smaller subsets
based on specific criteria. The objective is to split the data in a way that minimizes the
residual error within each subset.
Residual Reduction: Residual reduction is a measure of how much the average squared
difference between the predicted values and the actual values of the target variable is
reduced by splitting the subset. The greater the residual reduction, the better the model
fits the data.
Splitting Criteria- CART evaluates every possible split at each node and selects the one
that results in the greatest reduction of residual error in the resulting subsets. This
process is repeated until a stopping criterion is met, such as reaching the maximum tree
depth or having too few instances in a leaf node.
Decision tree pruning is a technique used to prevent decision trees from overfitting the
training data. Pruning aims to simplify the decision tree by removing parts of it that do
not provide significant predictive power, thus improving its ability to generalize to new
data.
Decision tree pruning removes unwanted nodes from an overfitted decision tree
to make it smaller, which results in faster, more accurate, and more effective
predictions.
There are two main types of decision tree pruning: Pre-Pruning and Post-Pruning.
Sometimes the growth of the decision tree is stopped before it gets too complex; this is
called pre-pruning. It helps prevent overfitting of the training data, which would otherwise
result in poor performance on new data. Common pre-pruning criteria include:
Minimum Samples per Leaf: Set a minimum threshold for the number of samples in
each leaf node.
Minimum Samples per Split: Specify the minimum number of samples needed to split
a node.
By pruning early, we end up with a simpler tree that is less likely to overfit the
training data.
After the tree is fully grown, post-pruning involves removing branches or nodes to
improve the model’s ability to generalize. Some common post-pruning techniques
include:
Cost-Complexity Pruning (CCP): This method assigns a cost to each subtree based
on its accuracy and complexity, then selects the subtree with the lowest cost.
Reduced Error Pruning: Removes branches that do not significantly affect the overall
accuracy.
Minimum Impurity Decrease: Prunes nodes if the decrease in impurity (Gini impurity or
entropy) is below a certain threshold.
Minimum Leaf Size: Removes leaf nodes with fewer samples than a specified threshold.
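A sketch of cost-complexity post-pruning using scikit-learn's ccp_alpha parameter; the dataset and the strategy of choosing alpha by validation accuracy are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate alpha values along the cost-complexity pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)        # validation accuracy of the pruned tree
    if score >= best_score:
        best_alpha, best_score = alpha, score
print("chosen ccp_alpha:", best_alpha, "validation accuracy:", best_score)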
10. Advantages and Disadvantages of Decision Tree