CBM342 BCI unit IV
2. Dimensionality Reduction:
LDA involves transforming the original features into a new set of features, known as discriminant
functions or variables, which are linear combinations of the original features. These new features are
chosen in such a way that the separation between classes is maximized.
3. Assumptions:
LDA assumes that the data follows a normal distribution and that the classes have identical
covariance matrices. If these assumptions are not met, the performance of LDA may be compromised.
5. Eigenvalue Decomposition:
The next step is to find the eigenvectors and eigenvalues of the matrix resulting from the inverse of the
within-class scatter matrix multiplied by the between-class scatter matrix.
7. Decision Rule:
In the context of classification, a decision rule is established to assign a given observation to a
specific class based on its position in the new feature space.
8. Comparison with PCA (Principal Component Analysis):
LDA is often compared with PCA, another dimensionality reduction technique. While PCA focuses on
capturing overall variance in the data, LDA specifically aims to maximize the separation between
classes.
9. Use Cases:
LDA is widely used in various applications, including face recognition, image processing, and biomedical
data analysis.
In summary, Linear Discriminant Analysis is a powerful technique for dimensionality reduction and
classification, especially when there is a need to maximize the separation between different classes in a
dataset.
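The following is a minimal NumPy sketch of the procedure described above (scatter matrices and eigenvalue decomposition), not taken from the source; the toy two-class data and variable names are illustrative assumptions.

```python
# Minimal two-class LDA sketch: build within-/between-class scatter matrices,
# eigendecompose inv(S_w) @ S_b, and project onto the top discriminant direction.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])  # two 3-D classes
y = np.array([0] * 50 + [1] * 50)

means = {c: X[y == c].mean(axis=0) for c in (0, 1)}
overall_mean = X.mean(axis=0)

# Within-class scatter: sum of (unnormalized) class covariances.
S_w = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in (0, 1))
# Between-class scatter: spread of class means around the overall mean.
S_b = sum((y == c).sum() * np.outer(m - overall_mean, m - overall_mean)
          for c, m in means.items())

# Eigen-decomposition of inv(S_w) @ S_b gives the discriminant directions.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w = eigvecs[:, np.argmax(eigvals.real)].real   # direction that maximizes class separation
X_proj = X @ w                                 # 1-D discriminant scores
```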
It is commonly employed when the goal is to enhance class discrimination. A discriminant function is a linear combination of the components of x, which can be written as
g(x) =w'x + ωo (1)
where w is the weight vector and ωo the bias or threshold weight.
For a discriminant function of the form of Eq. 1, a two-category classifier implements the following
decision rule:
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0. Thus, x is assigned to ω1 if the inner product w'x exceeds the threshold -ωo and to ω2 otherwise. If g(x) = 0, x can ordinarily be assigned to either class.
Above figure shows a typical implementation, a clear example of the general structure of a pattern
recognition system.
The equation g (x)=0 defines the decision surface that separates points assigned to ω1 from points
assigned to ω2. When g(x) is linear, this decision surface is a hyperplane.
If x1 and x2 are both on the decision surface, then w'x1 + ωo = w'x2 + ωo, i.e. w'(x1 - x2) = 0, which shows that w is normal to any vector lying in the hyperplane.
In general, the hyperplane H divides the feature space into two half-spaces: decision region R1 for ω1 and region R2 for ω2. Because g(x) > 0 when x is in R1, it follows that the normal vector w points into R1. It is sometimes said that any x in R1 is on the positive side of H, and any x in R2 is on the negative side.
The discriminant function g(x) gives an algebraic measure of the distance from x to the hyperplane.
x = xp + r (w / ||w||)
Where xp is the normal projection of x onto H, and r is the desired algebraic distance- positive if x is on
the positive side and negative if x is on the negative side. Then, because g(xp) = 0,
g(x) = w'x + w0 = r ||w||
or, equivalently,
r = g(x) / ||w||
In particular, the distance from the origin to H is given by wo/||w ||.
If w0>0, the origin is on the positive side of H,
If wo < 0, it is on the negative side.
If wo =0. then g(x) has the homogeneous form w'x, and the hyperplane passes through the origin.
To summarize, a linear discriminant function divides the feature space by a hyperplane decision surface.
The orientation of the surface is determined by the normal vector w, and the location of the surface is
determined by the bias wo. The discriminant function g(x) is proportional to the signed distance from x to
the hyperplane with g (x) > 0 when x is on the positive side, and g (x) < 0 when x is on the negative side.
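As a small illustration of the decision rule and the signed distance above, here is a NumPy sketch; the weight vector, bias and test point are made-up values.

```python
# Evaluate g(x) = w'x + w0, apply the two-category decision rule, and compute
# the signed distance r = g(x) / ||w|| from x to the hyperplane H.
import numpy as np

w = np.array([2.0, -1.0])     # weight vector (normal to H)
w0 = 0.5                      # bias / threshold weight
x = np.array([1.0, 3.0])      # a sample point

g = w @ x + w0
label = "omega_1" if g > 0 else "omega_2"      # g(x) = 0: either class
r = g / np.linalg.norm(w)                      # signed distance from x to H
dist_origin = w0 / np.linalg.norm(w)           # signed distance from the origin to H
print(label, g, r, dist_origin)
```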
II. Support Vector Machine (SVM):
● SVMs can be used for a variety of tasks, such as text classification, image classification, spam detection, handwriting identification, gene expression analysis, face detection, and anomaly detection.
● SVMs are adaptable and efficient in a variety of applications because they can manage high-
dimensional data and nonlinear relationships.
● SVM algorithms are very effective at finding the maximum separating hyperplane between the different classes available in the target feature.
How Does Support Vector Machine Work?
● SVM is defined in terms of the support vectors only, using the points which are closest to
the hyperplane (support vectors), whereas in logistic regression the classifier is defined over
all the points.
● Hence SVM has some natural speed-ups.
● The task, as illustrated in the figure, is to classify a new data point as either a circle or a diamond.
● Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression.
● The main objective of the SVM algorithm is to find the optimal hyperplane in an N-
dimensional space that can separate the data points in different classes in the feature
space.
● The hyperplane is chosen so that the margin between the closest points of different classes is as large as possible.
● The dimension of the hyperplane depends upon the number of features. If the number of
input features is two, then the hyperplane is just a line. If the number of input features is
three, then the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the
number of features exceeds three.
● Let’s consider two independent variables x1, x2, and one dependent variable which is either a
blue circle or a red circle.
From the figure above it is clear that there are multiple lines (we consider only two input features, x1 and x2) that segregate the data points, i.e. separate the red circles from the blue circles.
1. Hyperplane: The hyperplane is the decision boundary that separates the data points of different classes in the feature space.
2. Support Vectors: Support vectors are the closest data points to the hyperplane, which play a critical role in deciding the hyperplane and margin.
3. Margin: Margin is the distance between the support vector and hyperplane. The main objective of
the support vector machine algorithm is to maximize the margin. The wider margin indicates better
classification performance.
4. Kernel: The kernel is a mathematical function used in SVM to map the original input data points into a high-dimensional feature space, so that a separating hyperplane can be found even if the data points are not linearly separable in the original input space. Some of the common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid (see the code sketch after this list).
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane
that properly separates the data points of different categories without any misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft
margin technique. Each data point has a slack variable introduced by the soft-margin SVM
formulation, which softens the strict margin requirement and permits certain misclassifications or
violations. It discovers a compromise between increasing the margin and reducing violations.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect classifications or
margin violations. The objective function in SVM is frequently formed by combining it with the
regularization term.
9. Dual Problem: SVM can be solved through the dual of its optimization problem, which requires locating the Lagrange multipliers associated with the support vectors. The dual formulation enables the use of kernel tricks and more efficient computation.
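To make the kernel and soft-margin ideas in the list above concrete, here is a hedged scikit-learn sketch; the synthetic dataset and hyperparameter values are assumptions for illustration only.

```python
# Train an RBF-kernel SVM with a soft margin; C trades margin width against
# misclassifications (small C = wider margin, more violations tolerated).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```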
Mathematical intuition of Support Vector Machine
Consider a binary classification problem with two classes, labelled +1 and -1, and a training dataset consisting of input feature vectors X and their corresponding class labels Y.
The equation for the linear hyperplane can be written as:
wTx +b=0
The vector w represents the normal vector to the hyperplane, i.e. the direction perpendicular to the hyperplane. The parameter b in the equation represents the offset or distance of the hyperplane from the origin along the normal vector w. The distance between a data point xi and the decision boundary can be computed as
di = (wTxi + b) / ||w||
where ||w|| represents the Euclidean norm of the normal (weight) vector w.
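The distance formula can be checked directly in NumPy; the weight vector w, bias b and points below are assumed toy values.

```python
# Signed distances d_i = (w^T x_i + b) / ||w|| from points to the hyperplane.
import numpy as np

w = np.array([1.0, 2.0])
b = -3.0
X = np.array([[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]])

d = (X @ w + b) / np.linalg.norm(w)
print(d)   # positive: one side of the hyperplane, negative: the other
```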
Advantages of SVM:
● Effective in high-dimensional cases.
● It is memory efficient, as it uses only a subset of the training points (the support vectors) in the decision function.
● Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.
III. Support Vector Regression (SVR):
1. Linear SVR:
Consider a set of training data where xi is a multivariate set of N observations with corresponding label values. The goal is to find the linear function
f(x) = x'w + b
2. NonLinear SVR:
Some regression problems cannot be described properly using a linear model, so the previously described technique needs to be extended to nonlinear functions. To obtain a nonlinear SVR model, replace the dot product x1′x2 with a nonlinear kernel function G(x1,x2) = < ψ(x1), ψ(x2) >, where ψ(x) is a transformation that maps x to a high-dimensional space.
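As a hedged illustration of nonlinear SVR with an RBF kernel, here is a scikit-learn sketch; the noisy sine data and hyperparameters are illustrative assumptions.

```python
# Fit a nonlinear SVR model; the RBF kernel plays the role of
# G(x1, x2) = <psi(x1), psi(x2)> described above.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, (100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```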
IV. Vector Quantization:
● Vector quantization (VQ) is a data compression technique similar to k-means algorithm
which can model any data distribution.
● Vector quantization is used in a wide range of applications for speech, image, and video data, such as image generation, speech and audio coding, voice conversion, music generation and text-to-speech synthesis. The figure below shows how vector quantization (VQ) works: the input vector x is mapped to its nearest codeword in the codebook (here, Xquantized = C2).
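A minimal NumPy sketch of plain VQ is shown below: each input vector is mapped to its nearest codeword in a single codebook. The codebook and input values are made up for illustration.

```python
import numpy as np

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])   # codewords C1, C2, C3
x = np.array([0.9, 1.2])                                    # input vector

idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))  # index of the nearest codeword
x_quantized = codebook[idx]                                 # here this selects C2
print(idx, x_quantized)
```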
● To overcome the limitations of plain VQ and apply it to higher bitrates and higher-dimensional data, some variants of VQ are used, such as Residual VQ, Additive VQ, and Product VQ.
● These methods use more than one codebook to apply VQ to the data.
1. Residual Vector Quantization (RVQ):
● Residual VQ quantizes the input vector x by applying M consecutive VQ modules on it.
● According to the above figure, suppose M=3. We apply the first VQ module on input vector x
using the first codebook (CB¹).
● Then, after finding the closest codeword from the first codebook, calculate the remainder (R1). Afterwards, pass R1 as input to the next VQ module, which uses the second codebook (CB²).
● This process continues for M stages, so that M closest codewords are found, each coming from a separate codebook. At the end, the input vector x is quantized as the summation of the M chosen codewords (a small sketch follows this list).
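Below is a hedged NumPy sketch of Residual VQ with M = 3 stages, matching the description above; the random codebooks CB1, CB2, CB3 and the input vector are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]     # CB1, CB2, CB3 (8 codewords each)
x = rng.normal(size=4)

residual = x.copy()
quantized = np.zeros_like(x)
for cb in codebooks:
    idx = np.argmin(np.linalg.norm(cb - residual, axis=1))  # nearest codeword to the residual
    quantized += cb[idx]                                    # accumulate the chosen codeword
    residual = residual - cb[idx]                           # remainder goes to the next stage

print("quantization error:", np.linalg.norm(x - quantized))
```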
2. Additive Vector Quantization (AVQ):
● In a similar way as Residual VQ, Additive VQ quantizes the input vector x by applying M
consecutive VQ modules.
● However, Additive VQ adopts the complex beam searching algorithm to find the closest
codewords for the quantization process.
● According to the above figure, suppose M=3. In Additive VQ, we first search for the closest codeword in the union of all three codebooks (here CB¹, CB², CB³); suppose the best codeword is found in CB².
● After that, calculate the residual (R1) and pass it as input to the next VQ module. Since the
first codeword is selected from CB², now we search for the closest codeword from the
union of CB¹ and CB³.
● After calculating the residual R2, pass it as input to the last VQ module, where the search is done using the last codebook (in this case CB¹), which has not yet contributed to the quantization process.
● At the end, we quantize the input vector x as a summation of M closest codewords.
3. Product Vector Quantization (PVQ):
● Product VQ splits the input vector x into M sub-vectors and quantizes each sub-vector with its own codebook.
● At the end, Product VQ quantizes the input vector x as a concatenation of the M closest codewords.
The figure below shows the Product VQ when M=3.
Codebooks Optimization:
Vector quantization (VQ) training means optimizing the codebook(s) so that they model the data distribution in a way that minimizes the quantization error (such as the mean squared error) between data points and codebook elements.
To optimize the codebooks for the three above-mentioned variants of VQ (Residual VQ, Additive VQ, and Product VQ), there are different approaches, which we mention in the following.
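One common approach for optimizing a single codebook is k-means clustering, which minimizes the mean squared error between data points and codewords; the scikit-learn sketch below, with its made-up data and sizes, is only one possible illustration, not the specific method referred to above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8))              # training vectors

kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(data)
codebook = kmeans.cluster_centers_             # 64 optimized codewords
codes = kmeans.predict(data)                   # nearest-codeword index for each vector
```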
V. Gaussian Mixture Modeling:
A Gaussian Mixture is a function composed of several Gaussians, each identified by k ∈ {1, …, K}, where K is the number of clusters in our dataset. Each Gaussian k in the mixture is described by the following parameters:
● A mean μ that defines its center.
● A covariance Σ that defines its width. This would be equivalent to the dimensions of an
ellipsoid in a multivariate scenario.
● A mixing probability π that defines how big or small the Gaussian function will be. Let us now illustrate these parameters graphically:
● We now need to determine the optimal values for these parameters. To achieve this, we ensure that each Gaussian fits the data points belonging to its cluster. This is exactly what maximum likelihood does.
● In general, the Gaussian density function is given by:
N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp( -(1/2) (x - μ)ᵀ Σ⁻¹ (x - μ) )
● Where x represents the data points and D is the number of dimensions of each data point. μ and Σ are the mean and covariance, respectively. If the dataset comprises N = 1000 three-dimensional points (D = 3), then x will be a 1000 × 3 matrix, μ will be a 1 × 3 vector, and Σ will be a 3 × 3 matrix. For later purposes, we will also find it useful to take the log of this equation, which is given by:
ln N(x | μ, Σ) = -(D/2) ln(2π) - (1/2) ln|Σ| - (1/2) (x - μ)ᵀ Σ⁻¹ (x - μ)
● If we differentiate this equation with respect to the mean and covariance and then equate it to zero, we are able to find the optimal values for these parameters, and the solutions will correspond to the Maximum Likelihood Estimates (MLE).
● However, because we are dealing with not just one but many Gaussians, things get a bit more complicated when the time comes to find the parameters for the whole mixture.
● A Gaussian Mixture Model (GMM) is a parametric representation of a probability density function, based on a weighted sum of multivariate Gaussian distributions.
GMMs are commonly used as a parametric model of the probability distribution of continuous
measurements or features in a biometric system.
● GMM parameters are estimated from training data using the iterative Expectation-
Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a well-
trained prior model.
● GMM is computationally inexpensive, does not require phonetically labeled training speech,
and is well suited for text-independent tasks, where there is no strong prior knowledge of the
spoken text.
● GMM is used in Signal Processing, Speaker Recognition, Language identification, Classification
etc.
● An Expectation-Maximization (EM) algorithm is an iterative method for finding Maximum
Likelihood or Maximum A Posteriori (MAP) estimates of parameters in statistical models, where
the model depends on unobserved latent variables. EM Algorithm consists of two major steps:
1. E (Expectation) step
2. M (Maximization) step
EM ALGORITHM:
E (Expectation) step:
In the E-step, the expected value of the log-likelihood function is calculated given the observed data
and current estimate of the model parameters.
M (Maximization) step:
The M-step computes the parameters which maximize the expected log-likelihood found on the E-
step. These parameters are then used to determine the distribution of the latent variables in the next E-
step until the algorithm has converged.
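As a hedged illustration, the scikit-learn sketch below fits a GMM with the EM algorithm described above; the synthetic two-cluster data and settings are assumptions.

```python
# GaussianMixture runs E-steps and M-steps internally until convergence,
# estimating the mixing weights, means and covariances of the components.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 3)), rng.normal(4, 1, (300, 3))])  # two clusters, D = 3

gmm = GaussianMixture(n_components=2, covariance_type="full", max_iter=100)
gmm.fit(X)

print(gmm.weights_)        # mixing probabilities
print(gmm.means_)          # component means
print(gmm.predict(X[:5]))  # most likely component for a few points
```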
MAXIMUM A POSTERIORI PROBABILITY (MAP):
i. Estimates of the sufficient statistics of the training data are computed for each mixture in the prior
model.
ii. For adaptation, these 'new' sufficient statistic estimates are then combined with the 'old' sufficient statistics from the prior mixture parameters using a data-dependent mixing coefficient.
MAXIMUM LIKELIHOOD:
In maximum likelihood estimation, the model parameters are chosen so as to maximize the likelihood of the training data.
VI. Hidden Markov Model (HMM):
An HMM consists of a set of states, each of which has a limited number of transitions and emissions,
Each transition between states has an assigned probability,
Each model starts from a start state and ends in an end state.
1. HMM Architecture:
The following diagram shows a generalized automaton architecture of an operating HMM, λi, with the two integrated stochastic processes.
Figure. Generalized Architecture of an operating Hidden Markov Model
● Each shape represents a random variable that can adopt any of a number of values. The random
variable s(t) is the hidden state at time t. The random variable o(t) is the observation at the time t.
The conditional probability law of the hidden variable s(t) at time t, given the values of the hidden variables at all earlier times, depends only on the value of the hidden variable s(t-1) at time t-1.
● All earlier values are no longer necessary, so the Markov property as defined before is satisfied. By the second stochastic process, the value of the observed variable o(t) depends on the value of the hidden variable s(t), also at time t.
Three fundamental problems:
i. Evaluation problem:
Compute the probability that a particular output sequence was produced by that model (solved by
the forward algorithm).
Here αj(t) is the probability that the HMM is in state ωj at time t, having generated the first t observations, and βi(t) is the probability that the HMM, given state ωi at time t, will generate the remaining observations from t+1 to T.
The computations of both αj(t) and βi(t) have complexity O(N²T).
For classification, compute the posterior probabilities
P(θ | νT) ∝ P(νT | θ) P(θ)
where P(θ) is the prior for a particular class and P(νT | θ) is computed using the forward algorithm with the HMM for that class. Then select the class with the highest posterior.
One possible solution is to enumerate every possible hidden state sequence and calculate the probability of the observed sequence, with O(N^T · T) complexity.
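In contrast, the forward algorithm solves the evaluation problem in O(N²T); below is a minimal NumPy sketch with a made-up transition matrix A, emission matrix B, initial distribution pi and observation sequence.

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])    # a_ij: state-transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # b_jk: emission probabilities
pi = np.array([0.5, 0.5])                 # initial state distribution
obs = [0, 1, 1, 0]                        # observed symbol indices

alpha = pi * B[:, obs[0]]                 # alpha_j(1)
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]         # alpha_j(t) = sum_i alpha_i(t-1) a_ij b_j(o_t)

print("P(observations | model) =", alpha.sum())
```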
Also define the problem of finding the optimal state sequence as finding the one that includes the
states that are individually most likely.
This also corresponds to maximizing the expected number of correct individual states.
Define γi(t) as the probability that the HMM is in state wi at time t given the observation
sequence νT.
● One problem is that the resulting sequence may not be consistent with the underlying model
because it may include transitions with zero probability (aij = 0 for some i and j).
● One possible solution is the Viterbi algorithm, which finds the single best state sequence WT by maximizing P(WT | νT, θ) (or equivalently P(WT, νT | θ)).
● This algorithm recursively computes the state sequence with the highest probability at time t and keeps track of the states that form the sequence with the highest probability at time T.
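A hedged NumPy sketch of the Viterbi algorithm is given below, reusing the same toy A, B, pi and observation sequence as in the forward-algorithm sketch above.

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1, 0]

delta = pi * B[:, obs[0]]                        # best score ending in each state at t = 1
backpointers = []
for o in obs[1:]:
    scores = delta[:, None] * A                  # scores[i, j]: best path ending with i -> j
    backpointers.append(scores.argmax(axis=0))   # best predecessor of each state j
    delta = scores.max(axis=0) * B[:, o]

# Backtrack from the best final state to recover the single best state sequence.
state = int(delta.argmax())
path = [state]
for bp in reversed(backpointers):
    state = int(bp[state])
    path.append(state)
path.reverse()
print("best state sequence:", path)
```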
● The goal is to determine the model parameters {aij}, {bjk} and {πi} from a collection of
training samples.
● Define ξij(t) as the probability that the HMM is in state wi at time t-1 and state wj at time
t given the observation sequence νT
Then aij, the estimate of the probability of a transition from wi at t-1 to wj at t, can be computed as
aij = (expected total number of transitions from wi to wj) / (expected total number of transitions away from wi)
Similarly bjk, the estimate of the probability of observing the symbol νk while in state wj, is computed as
bjk = (expected number of times observing symbol νk in state wj) / (expected total number of times in wj)
Where δν(t),νk is the Kronecker delta, which is 1 only when ν(t) = νk.
● Finally, the estimate of the initial state distribution πi can be computed as γi(1), the expected number of times in state wi at time t = 1 (a code sketch of these re-estimation updates appears after this list).
● These are called the Baum-Welch equations (also called the EM estimates for HMMs or the forward-
backward algorithm) that can be computed iteratively until some convergence criterion is met (e.g.,
sufficiently small changes in the estimated values in subsequent iterations).
● The estimates become bj(x) when the observations are continuous and their distributions are modeled using Gaussian mixtures.
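The re-estimation formulas above can be written compactly once the posteriors γ and ξ are available; in the sketch below they are filled with placeholder values, since computing them requires the full forward-backward pass.

```python
import numpy as np

T, N, K = 4, 2, 2                       # sequence length, number of states, number of symbols
gamma = np.full((T, N), 0.5)            # placeholder P(state i at time t | observations)
xi = np.full((T - 1, N, N), 0.25)       # placeholder P(state i at t, state j at t+1 | observations)
obs = np.array([0, 1, 1, 0])            # observed symbol indices

# a_ij: expected transitions i -> j divided by expected transitions away from i.
a_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
# b_jk: expected emissions of symbol v_k in state j divided by expected time in j.
b_hat = np.stack([gamma[obs == k].sum(axis=0) for k in range(K)], axis=1) / gamma.sum(axis=0)[:, None]
# pi_i: expected occupancy of each state at t = 1.
pi_hat = gamma[0]
```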
Applications:
On-line handwriting recognition
Speech recognition
Gesture recognition
Language modeling
Motion video analysis and tracking
VII. Artificial Neural Networks (ANN):
Figure: Biological neuron and artificial neuron (neural network neurons)
Input - It is the set of features that are fed into the model for the learning process. For example,
the input in object detection can be an array of pixel values pertaining to an image.
Weight - Its main function is to give importance to those features that contribute more towards
the learning. It does so by introducing scalar multiplication between the input value and the
weight matrix. For example, a negative word would impact the decision of the sentiment
analysis model more than a pair of neutral words.
Transfer function - The job of the transfer function is to combine multiple inputs into one output
value so that the activation function can be applied. It is done by a simple summation of all the
inputs to the transfer function.
18
Activation Function - It introduces non-linearity into the working of the perceptron so that it can respond to non-linear variations in the inputs. Without this, the output would just be a linear combination of the input values, and the network would not be able to model non-linear relationships.
Bias - The role of bias is to shift the value produced by the activation function. Its role is similar
to the role of a constant in a linear function. When multiple neurons are stacked together in a
row, they constitute a layer, and multiple layers piled next to each other are called a multi-layer
neural network.
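The components described above (inputs, weights, transfer function, activation, bias) can be put together in a few lines of NumPy; the values below are toy numbers chosen for illustration.

```python
# Forward pass of a single artificial neuron.
import numpy as np

x = np.array([0.5, -1.2, 3.0])       # input features
w = np.array([0.8, 0.1, -0.4])       # weights give importance to each feature
b = 0.2                              # bias shifts the result

z = w @ x + b                        # transfer function: weighted sum of inputs plus bias
output = 1.0 / (1.0 + np.exp(-z))    # activation function (sigmoid) adds non-linearity
print(output)
```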
Input Layer:
The data that we feed to the model is loaded into the input layer from external sources like a CSV file or
a web service. It is the only visible layer in the complete Neural Network architecture that passes the
complete information from the outside world without any computation.
Hidden Layers:
Hidden Layers are intermediate layers that do all the computations and extract the features from the data.
There can be multiple interconnected hidden layers that account for searching different hidden features in
the data. For example, in image processing, the earlier hidden layers are responsible for lower-level features like edges, shapes, or boundaries. On the other hand, the later hidden layers perform more complicated tasks like identifying complete objects (a car, a building, a person).
Output Layer:
The output layer takes input from preceding hidden layers and comes to a final prediction based on the
model’s learning. It is the most important layer where we get the final result. In the case of
classification/regression models, the output layer
Neural networks can also be described by the number of hidden nodes the model has, or in terms of how many inputs and outputs each node has. Variations on the classic neural network design enable various forms of forward and backward propagation of information among tiers.
i. Feed-forward neural networks:
One of the simplest variants of neural networks, these pass information in one direction, through various input nodes, until it reaches the output node. The network might or might not have hidden node layers, which makes its functioning more interpretable. It can also process data that contains a lot of noise.
This type of ANN computational model is used in technologies such as facial recognition and
computer vision.
ii. Recurrent neural networks (RNNs):
● More complex in nature, RNNs save the output of processing nodes and feed the result back into the model. This is how the model learns to predict the outcome of a layer. Each node in the RNN model acts as a memory cell, continuing the computation and execution of operations.
● This neural network starts with the same front propagation as a feed-forward network but then
goes on to remember all processed information to reuse it in the future. If the network's prediction
is incorrect, then the system self-learns and continues working toward the correct prediction
during back propagation. This type of ANN is frequently used in text-to-speech conversions.
iii. Convolutional neural networks (CNNs):
● CNNs are one of the most popular models used today. This computational model uses a variation of multilayer perceptrons and contains one or more convolutional layers that can be either entirely connected or pooled.
● These convolutional layers create feature maps that record a region of the image that's
ultimately broken into rectangles and sent out for nonlinear processing.
● The CNN model is particularly popular in the realm of image recognition. It has been used in
many of the most advanced applications of AI, including facial recognition, text digitization
and NLP. Other use cases include paraphrase detection, signal processing and image
classification.
iv. Deconvolutional neural networks:
Deconvolutional neural networks use a reversed CNN model process. They try to find lost features or
signals that might have originally been considered unimportant to the CNN system's task. This network
model can be used in image synthesis and analysis.
Advantages of artificial neural networks:
● Parallel processing abilities. ANNs have parallel processing abilities, which means the network can perform more than one job at a time.
● Information storage. ANNs store information on the entire network, not just in a
database. This ensures that even if a small amount of data disappears from one location,
the entire network continues to operate.
● Non-linearity. The ability to learn and model nonlinear, complex relationships helps model
the real-world relationships between input and output.
● Fault tolerance. ANNs come with fault tolerance, which means the corruption or fault of one
or more cells of the ANN won't stop the generation of output.
● Gradual corruption. This means the network slowly degrades over time instead of
degrading instantly when a problem occurs.
● Unrestricted input variables. No restrictions are placed on the input variables, such as how they
should be distributed.
● Observation-based decisions. Machine learning means the ANN can learn from events and
make decisions based on the observations.
● Ability to learn hidden relationships. ANNs can learn the hidden relationships in data without
commanding any fixed relationship. This means ANNs can better model highly volatile data
and non-constant variance.
● Ability to generalize data. The ability to generalize and infer unseen relationships on unseen
data means ANNs can predict the output of unseen data.
Disadvantages of artificial neural networks:
● Lack of rules. The lack of rules for determining the proper network structure means the
appropriate artificial neural network architecture can only be found through trial, error and
experience.
● Hardware dependency. The requirement of processors with parallel processing abilities
makes neural networks dependent on hardware.
● Numerical translation. The network works with numerical information, meaning all
problems must be translated into numerical values before they can be presented to the ANN.
● Lack of trust. The lack of explanation behind probing solutions is one of the biggest
disadvantages of ANNs. The inability to explain the why or how behind the solution generates
a lack of trust in the network.
● Inaccurate results. If not trained properly, ANNs can often produce incomplete or inaccurate
results.
● Black box nature. Because of their black box AI model, it can be challenging to grasp how
neural networks make their predictions or categorize data.