
UNIT IV FEATURE TRANSLATION METHODS

Linear Discriminant Analysis – Support Vector Machines – Regression – Vector Quantization – Gaussian Mixture Modeling – Hidden Markov Modeling – Neural Networks.

I. Linear Discriminant Analysis:


Linear Discriminant Analysis (LDA) is a statistical method used for dimensionality reduction and
classification. It is particularly employed in the field of pattern recognition and machine learning to
find the linear combinations of features that best separate two or more classes in a dataset. Here are the
key concepts associated with Linear Discriminant Analysis:
1. Objective:
The primary goal of LDA is to maximize the separation between different classes while minimizing the
variance within each class.

2. Dimensionality Reduction:
LDA involves transforming the original features into a new set of features, known as discriminant
functions or variables, which are linear combinations of the original features. These new features are
chosen in such a way that the separation between classes is maximized.

3. Assumptions:
LDA assumes that the data follows a normal distribution and that the classes have identical
covariance matrices. If these assumptions are not met, the performance of LDA may be compromised.

4. Between-Class Scatter and Within-Class Scatter:


LDA involves calculating two scatter matrices: the between-class scatter matrix (measuring the spread between different classes) and the within-class scatter matrix (measuring the spread within each class).

5. Eigenvalue Decomposition:
The next step is to find the eigenvectors and eigenvalues of the matrix resulting from the inverse of the
within-class scatter matrix multiplied by the between-class scatter matrix.

6. Selection of Discriminant Functions:


The discriminant functions are then chosen based on the eigenvectors corresponding to the largest
eigenvalues. These functions form the basis for the new feature space.

7. Decision Rule:
In the context of classification, a decision rule is established to assign a given observation to a
specific class based on its position in the new feature space.
8. Comparison with PCA (Principal Component Analysis):
LDA is often compared with PCA, another dimensionality reduction technique. While PCA focuses on
capturing overall variance in the data, LDA specifically aims to maximize the separation between
classes.

9. Use Cases:
LDA is widely used in various applications, including face recognition, image processing, and biomedical
data analysis.

In summary, Linear Discriminant Analysis is a powerful technique for dimensionality reduction and
classification, especially when there is a need to maximize the separation between different classes in a
dataset.
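A minimal NumPy sketch of the procedure described above (scatter matrices, then eigendecomposition of Sw⁻¹Sb); the function and variable names are illustrative and not taken from a specific library:

import numpy as np

def lda_directions(X, y):
    # Fit LDA by eigendecomposition of inv(Sw) @ Sb, as described above.
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        Sb += Xc.shape[0] * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]          # largest eigenvalues first
    return eigvecs.real[:, order]                   # columns = discriminant directions

# Usage: project the data onto the leading discriminant directions, e.g.
# W = lda_directions(X, y); X_projected = X @ W[:, :1]   (two-class case)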
LDA is commonly employed when the goal is to enhance class discrimination. A discriminant function is a linear combination of the components of x, which can be written as
g(x) = w'x + w0 (1)
where w is the weight vector and w0 the bias or threshold weight.
For a discriminant function of the form of Eq. 1, a two-category classifier implements the following decision rule:
 Decide ω1 if g(x) > 0 and ω2 if g(x) < 0. Thus, x is assigned to ω1 if the inner product w'x exceeds the threshold −w0, and to ω2 otherwise.
 If g(x) = 0, x can ordinarily be assigned to either class.

The figure above shows a typical implementation, a clear example of the general structure of a pattern recognition system.
The equation g(x) = 0 defines the decision surface that separates points assigned to ω1 from points assigned to ω2. When g(x) is linear, this decision surface is a hyperplane.
If x1 and x2 are both on the decision surface, then w'x1 + w0 = w'x2 + w0, i.e. w'(x1 − x2) = 0, which shows that w is normal to any vector lying in the hyperplane.
In general, the hyperplane H divides the feature space into two half-spaces: decision region R1 for ω1 and region R2 for ω2. Because g(x) > 0 when x is in R1, it follows that the normal vector w points into R1. It is sometimes said that any x in R1 is on the positive side of H, and any x in R2 is on the negative side.
The discriminant function g(x) gives an algebraic measure of the distance from x to the hyperplane.
x = xp + r (w / ||w||)

where xp is the normal projection of x onto H, and r is the desired algebraic distance: positive if x is on the positive side and negative if x is on the negative side. Then, because g(xp) = 0,

g(x) = w'x + w0 = r ||w||,  and therefore  r = g(x) / ||w||
In particular, the distance from the origin to H is given by w0 / ||w||.
 If w0 > 0, the origin is on the positive side of H;
 if w0 < 0, it is on the negative side.
 If w0 = 0, then g(x) has the homogeneous form w'x, and the hyperplane passes through the origin.

To summarize, a linear discriminant function divides the feature space by a hyperplane decision surface. The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias w0. The discriminant function g(x) is proportional to the signed distance from x to the hyperplane, with g(x) > 0 when x is on the positive side and g(x) < 0 when x is on the negative side.
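A short sketch of the discriminant function and the signed distance, using hypothetical values for w and w0:

import numpy as np

def g(x, w, w0):
    return w @ x + w0                       # linear discriminant g(x) = w'x + w0

def signed_distance(x, w, w0):
    return g(x, w, w0) / np.linalg.norm(w)  # r = g(x) / ||w||

w, w0 = np.array([2.0, 1.0]), -4.0          # hypothetical weight vector and bias
x = np.array([3.0, 1.0])
print(g(x, w, w0))                          # 3.0 > 0, so decide class omega_1
print(signed_distance(x, w, w0))            # 3 / sqrt(5) ≈ 1.34: positive side of H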

II. Support Vector Machines:


● Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or
nonlinear classification, regression, and even outlier detection tasks.

● SVMs can be used for a variety of tasks, such as text classification, image classification, spam
detection, handwriting identification, gene expression analysis, face detection, and anomaly
detection.
● SVMs are adaptable and efficient in a variety of applications because they can manage high-
dimensional data and nonlinear relationships.
● SVM algorithms are very effective at finding the maximum-margin separating hyperplane between the different classes present in the target feature.

Support Vector Machine Algorithm:


SVM is a supervised machine learning method whose goal is to find a hyperplane that best separates the two classes.
Logistic Regression vs Support Vector Machine (SVM):
The choice between Logistic Regression and SVM depends on the number of features. SVM works best when the dataset is small and complex. It is usually advisable to try logistic regression first and see how it performs; if it fails to give good accuracy, move to SVM without any kernel.
Logistic regression and SVM without any kernel have similar performance, but depending on the features, one may be more efficient than the other.

1. Types of Support Vector Machine (SVM) Algorithms


● Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of
different classes. When the data can be precisely linearly separated, linear SVMs are very
suitable. This means that a single straight line (in 2D) or a hyperplane (in higher dimensions) can
entirely divide the data points into their respective classes. A hyperplane that maximizes the
margin between the classes is the decision boundary.
● Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be separated
into two classes by a straight line (in the case of 2D). By using kernel functions, nonlinear SVMs
can handle nonlinearly separable data.
● The original input data is transformed by these kernel functions into a higher-dimensional feature space, where the data points can be linearly separated. A linear SVM in this transformed space then corresponds to a nonlinear decision boundary in the original input space, as illustrated in the sketch below.
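For illustration, a brief sketch (using scikit-learn on synthetic data that is not linearly separable) contrasting a linear SVM with a kernelized one; the dataset and parameter values are assumptions:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # not linearly separable

linear_svm = SVC(kernel="linear").fit(X, y)           # straight-line boundary
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)  # kernel trick

print("linear SVM accuracy:", linear_svm.score(X, y)) # limited on this data
print("RBF SVM accuracy:", rbf_svm.score(X, y))       # close to 1.0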
Important Terms
● Support Vectors: These are the points that are closest to the hyperplane. A separating line will
be defined with the help of these data points.
● Margin: It is the distance between the hyperplane and the observations closest to the hyperplane (the support vectors). In SVM, a large margin is considered a good margin. There are two types of margins: hard margin and soft margin.

How Does Support Vector Machine Work?
● SVM is defined in terms of the support vectors only, using the points which are closest to the hyperplane (support vectors), whereas in logistic regression the classifier is defined over all the points.
● Hence SVM enjoys some natural speed-ups.
● The task is to classify a new data point as, for example, either a circle or a diamond shape.

● Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression.
● The main objective of the SVM algorithm is to find the optimal hyperplane in an N-
dimensional space that can separate the data points in different classes in the feature
space.
● The hyperplane is chosen so that the margin between the closest points of different classes is as large as possible.

● The dimension of the hyperplane depends upon the number of features. If the number of
input features is two, then the hyperplane is just a line. If the number of input features is
three, then the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the
number of features exceeds three.
● Let’s consider two independent variables x1, x2, and one dependent variable which is either a
blue circle or a red circle.

From the figure above (considering only the two input features x1 and x2), it is clear that there are multiple possible lines that segregate the data points, i.e., that separate the red circles from the blue circles.

2. Support Vector Machine Terminology:


1. Hyperplane: The hyperplane is the decision boundary that is used to separate the data points of different classes in a feature space. In the case of linear classification, it is a linear equation, i.e., wTx + b = 0.

2. Support Vectors: Support vectors are the closest data points to the hyperplane, and they play a critical role in deciding the hyperplane and margin.

3. Margin: Margin is the distance between the support vector and hyperplane. The main objective of
the support vector machine algorithm is to maximize the margin. The wider margin indicates better
classification performance.

4. Kernel: Kernel is the mathematical function, which is used in SVM to map the original input
data points into high-dimensional feature spaces, so, that the hyperplane can be easily found out
even if the data points are not linearly separable in the original input space. Some of the common
kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid.

5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane
that properly separates the data points of different categories without any misclassifications.

6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft
margin technique. Each data point has a slack variable introduced by the soft-margin SVM
formulation, which softens the strict margin requirement and permits certain misclassifications or
violations. It discovers a compromise between increasing the margin and reducing violations.

7. C: The regularization parameter C balances margin maximization against the penalty for misclassification. It decides the penalty for going over the margin or misclassifying data items. A greater value of C imposes a stricter penalty, which results in a smaller margin and perhaps fewer misclassifications (see the sketch after this list).

8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect classifications or
margin violations. The objective function in SVM is frequently formed by combining it with the
regularization term.

9. Dual Problem: A dual Problem of the optimization problem that requires locating the
Lagrange multipliers related to the support vectors can be used to solve SVM. The dual
formulation enables the use of kernel tricks and more effective computing.
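A sketch tying together several of the terms above (soft margin, the parameter C, and the hinge loss) on hypothetical data; the helper function and values are illustrative only:

import numpy as np
from sklearn.svm import SVC

def hinge_loss(w, b, X, y_pm):
    # Average hinge loss max(0, 1 - y (w.x + b)); labels y_pm are in {-1, +1}.
    margins = y_pm * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Small C: wider margin, more violations tolerated; large C: stricter penalty.
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    w, b = model.coef_[0], model.intercept_[0]
    print(C, "margin width:", 2.0 / np.linalg.norm(w),
          "hinge loss:", hinge_loss(w, b, X, 2 * y - 1))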

Mathematical intuition of Support Vector Machine
Consider a binary classification problem with two classes, labelled +1 and −1, and a training dataset consisting of input feature vectors X and their corresponding class labels Y.
The equation of the linear hyperplane can be written as:
wTx + b = 0
The vector w is the normal vector to the hyperplane, i.e. the direction perpendicular to the hyperplane. The parameter b represents the offset or distance of the hyperplane from the origin along the normal vector w. The distance between a data point xi and the decision boundary is

di = (wTxi + b) / ||w||

where ||w|| represents the Euclidean norm of the normal vector w.

For Linear SVM classifier


y = 1 if wTx + b ≥ 0, and y = 0 if wTx + b < 0.

If Ns denotes the total number of support vectors, then for n training patterns the expected value of the generalization error rate is bounded according to

En[error] ≤ En[Ns] / n
where the expectation is over all training sets of size n drawn from the (stationary) distributions describing the categories.
● This bound is independent of the dimensionality of the space of transformed vectors, determined by ψ(·). We can understand this informally by means of the leave-one-out bound.
● Suppose there are n points in the training set; train an SVM on n − 1 of them and test on the single remaining point.
● If that remaining point happens to be a support vector for the full n-sample case, then there may be an error; otherwise, there will not be.
● Note that if we find a transformation ψ that separates the data well, so that the expected number of support vectors is small, then the above equation shows that the expected error rate will be lower.

Popular kernel functions in SVM:


The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts non-separable problems into separable problems.
It is mostly useful in non-linear separation problems. Simply put, the kernel performs some extremely complex data transformations and then finds the procedure to separate the data based on the labels or outputs defined.
Linear: K(w, x) = wTx + b
Polynomial: K(w, x) = (γ wTx + b)^N
Gaussian RBF: K(xi, xj) = exp(−γ ||xi − xj||²)
Sigmoid: K(xi, xj) = tanh(α xiTxj + b)
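The kernels above can be sketched directly in NumPy; the parameter values (γ, α, b, N) below are placeholders chosen only for illustration:

import numpy as np

def linear_kernel(xi, xj, b=0.0):
    return xi @ xj + b

def polynomial_kernel(xi, xj, gamma=1.0, b=1.0, degree=3):
    return (gamma * (xi @ xj) + b) ** degree

def rbf_kernel(xi, xj, gamma=1.0):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid_kernel(xi, xj, alpha=0.01, b=0.0):
    return np.tanh(alpha * (xi @ xj) + b)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(rbf_kernel(xi, xj, gamma=0.5))   # similarity in the implicit feature space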

Advantages of SVM:
● Effective in high-dimensional cases.
● Memory efficient, as it uses only a subset of the training points (the support vectors) in the decision function.
● Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.

III. Regression:

● SVM regression is considered a nonparametric technique because it relies on kernel functions. Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences.
● First, because the output is a real number, it becomes very difficult to predict the information at hand, which has infinite possibilities.
● In the case of regression, a margin of tolerance (epsilon) is set around the regression function, analogous to the margin requested in the classification case.
● Beyond that, the algorithm itself is more complicated and must be taken into consideration.
● The main goal is to minimize error, individualizing the hyperplane which maximizes the margin, keeping in mind that part of the error is tolerated.
● In other words, the goal is to find a function f(x) that deviates from yn by a value no greater
than ε for each training point x, and at the same time is as flat as possible.

1. Linear SVR:
Let the training data be a set of multivariate observations xi (N observations in total) with corresponding label values yi. The goal is to find the linear function

f(x) = xTw + b

2. NonLinear SVR:
Some regression problems cannot be described properly using a linear model, so we need to extend the previously described technique to nonlinear functions. To achieve a nonlinear SVR model, replace the dot product x1Tx2 with a nonlinear kernel function G(x1, x2) = <ψ(x1), ψ(x2)>, where ψ(x) is a transformation that maps x to a high-dimensional space.
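A brief SVR sketch on synthetic 1-D data using scikit-learn; the epsilon value and the data are assumptions made for illustration:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

linear_svr = SVR(kernel="linear", epsilon=0.1)            # f(x) = x'w + b
rbf_svr = SVR(kernel="rbf", gamma="scale", epsilon=0.1)   # nonlinear via kernel G

print(linear_svr.fit(X, y).score(X, y))   # underfits the sine-shaped data
print(rbf_svr.fit(X, y).score(X, y))      # fits the nonlinear shape much better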

IV. Vector Quantization:
● Vector quantization (VQ) is a data compression technique, similar to the k-means algorithm, which can model any data distribution.
● Vector quantization is used in a wide range of applications for speech, image, and video data, such as image generation, speech and audio coding, voice conversion, music generation and text-to-speech synthesis. The figure below shows how vector quantization (VQ) works.

● The VQ process requires a codebook, which contains a number of codewords. Applying VQ to a data point (gray dots) means mapping it to the closest codeword (blue dots), i.e., replacing the value of the data point with the closest codeword value.
● Each Voronoi cell (black lines) contains one codeword, such that all data points located in that cell are mapped to that codeword, since it is the closest codeword to the data points located in that Voronoi cell.

Vector Quantization Operation


In other words, vector quantization maps the input vector x to the closest codeword within the
codebook (CB) using the following formula:

x_quantized = c_k,  where k = argmin_i ||x − c_i||
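A minimal NumPy sketch of this nearest-codeword mapping; the codebook values are hypothetical:

import numpy as np

def vector_quantize(x, codebook):
    # Map x to its closest codeword: k = argmin_i ||x - c_i||.
    distances = np.linalg.norm(codebook - x, axis=1)
    k = int(np.argmin(distances))
    return codebook[k], k

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])   # hypothetical codewords
x = np.array([0.9, 0.8])
x_quantized, index = vector_quantize(x, codebook)
print(index, x_quantized)   # 1 [1. 1.]  (x is replaced by the closest codeword)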

● The computational complexity of VQ increases exponentially with the VQ bitrate, since the codebook size grows exponentially with the bitrate.

● To solve this challenge and apply VQ for higher bitrates and higher dimensional data, use
some variants of VQ such as Residual VQ, Additive VQ, and Product VQ.
● These methods consider more than one codebook to apply VQ on the data.

1. Residual Vector Quantization (RVQ):
● Residual VQ quantizes the input vector x by applying M consecutive VQ modules on it.

● According to the above figure, suppose M = 3. We apply the first VQ module to the input vector x using the first codebook (CB¹).
● Then, after finding the closest codeword from the first codebook, we calculate the remainder (R1). Afterwards, we pass R1 as input to the next VQ module, which uses the second codebook (CB²).
● This process continues for M stages, so that we find M closest codewords, each coming from a separate codebook. At the end, the input vector x is quantized as the summation of the M closest codewords.
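A short sketch of Residual VQ under the description above (M consecutive stages, one codebook per stage); the codebook sizes and scales are illustrative:

import numpy as np

def residual_vq(x, codebooks):
    # Quantize x with M consecutive VQ stages; codebooks is a list of M arrays.
    residual = x.copy()
    chosen = []
    for cb in codebooks:                     # stage m uses codebook CB^m
        k = np.argmin(np.linalg.norm(cb - residual, axis=1))
        chosen.append(cb[k])
        residual = residual - cb[k]          # pass the remainder to the next stage
    return np.sum(chosen, axis=0)            # x_quantized = sum of M codewords

rng = np.random.default_rng(0)
codebooks = [rng.normal(scale=s, size=(4, 2)) for s in (1.0, 0.3, 0.1)]   # M = 3
x = rng.normal(size=2)
print(np.linalg.norm(x - residual_vq(x, codebooks)))   # remaining quantization error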
2. Additive Vector Quantization (AVQ):
● In a similar way as Residual VQ, Additive VQ quantizes the input vector x by applying M
consecutive VQ modules.
● However, Additive VQ adopts the complex beam searching algorithm to find the closest
codewords for the quantization process.

● According to the above figure, suppose M = 3. In Additive VQ, we first search for the closest codeword over the union of all three codebooks (here CB¹, CB², CB³). Suppose the best codeword is found in CB².
● After that, we calculate the residual (R1) and pass it as input to the next VQ module. Since the first codeword was selected from CB², we now search for the closest codeword over the union of CB¹ and CB³.
● After calculating the residual R2, we pass it as input to the last VQ module, where the search uses the last codebook (in this case CB¹) that has not yet contributed to the quantization process.
● At the end, we quantize the input vector x as a summation of M closest codewords.

3. Product Vector Quantization (PVQ):


● Product VQ splits the input vector x of dimension D to M independent subspaces of
dimension D/M. Then it applies M independent VQ modules to the existing subspaces.

● At the end, Product VQ quantizes the input vector x as a concatenation of M closest
codewords.
The figure below shows the Product VQ when M=3.
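A minimal Product VQ sketch, assuming D is divisible by M; all sizes are illustrative:

import numpy as np

def product_vq(x, codebooks):
    # Split x into M subvectors of size D/M and quantize each one independently.
    subvectors = np.split(x, len(codebooks))   # assumes D is divisible by M
    parts = []
    for sub, cb in zip(subvectors, codebooks):
        k = np.argmin(np.linalg.norm(cb - sub, axis=1))
        parts.append(cb[k])
    return np.concatenate(parts)               # concatenation of M closest codewords

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 2)) for _ in range(3)]   # M = 3, D = 6
x = rng.normal(size=6)
print(product_vq(x, codebooks))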

Codebooks Optimization:
Vector quantization (VQ) training means optimizing the codebook(s) to model the data distribution in such a way that the quantization error (such as the mean squared error) between data points and codebook elements is minimized.
To optimize the codebooks for the three above-mentioned variants of VQ (Residual VQ, Additive VQ, and Product VQ), there are different approaches, which we mention in the following.

1. K-means Algorithm (traditional approach):


Traditionally, the codebooks for these three VQ methods are optimized by the k-means algorithm.

2. Stochastic Optimization (machine learning algorithms):


Machine learning optimization algorithms are based on gradient calculation. Therefore, it is impossible to optimize vector quantization methods with such optimizers directly, since the argmin in the vector quantization function (first equation above) is not differentiable. In other words, we cannot pass gradients through the vector quantization function during backpropagation. Two solutions to this problem are mentioned here.

i. Straight Through Estimator (STE):


STE solves the problem by simply copying the gradients unchanged across the VQ module during backpropagation. Hence, it does not take the influence of vector quantization into account, which leads to a mismatch between the gradients and the true behavior of the VQ function.
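A minimal PyTorch-style sketch of the straight-through estimator; the codebook and tensor shapes are hypothetical:

import torch

def vq_with_ste(x, codebook):
    # Forward pass: pick the closest codeword for each row of x.
    distances = torch.cdist(x, codebook)       # (batch, num_codewords)
    quantized = codebook[distances.argmin(dim=1)]
    # Straight-through trick: the forward value is `quantized`, but the gradient
    # is copied to x as if quantization were the identity function.
    return x + (quantized - x).detach()

codebook = torch.randn(16, 8)                  # hypothetical codebook
x = torch.randn(4, 8, requires_grad=True)
out = vq_with_ste(x, codebook)
out.sum().backward()
print(x.grad)   # all ones: the gradient was copied intact across the VQ step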

ii. Noise Substitution in Vector Quantization (NSVQ):


The NSVQ technique is a recently proposed method in which the vector quantization error is simulated by adding noise to the input vector, such that the simulated noise follows the distribution of the original VQ error.

NSVQ technique has some advantages over STE method:


 NSVQ yields more accurate gradients for VQ function.
 NSVQ achieves faster convergence for VQ training (codebook optimization).
 NSVQ does not need any additional hyper-parameter tuning for VQ training (does not
require additional loss term for VQ training to be added to the global optimization loss
function).

V. Gaussian Mixture Modeling:
A Gaussian Mixture is a function that is composed of several Gaussians, each identified by k ∈ {1, …, K}, where K is the number of clusters in our dataset. Each Gaussian k in the mixture is described by the following parameters:
● A mean μ that defines its center.
● A covariance Σ that defines its width. This would be equivalent to the dimensions of an ellipsoid in a multivariate scenario.
● A mixing probability π that defines how big or small the Gaussian function will be. Let us now illustrate these parameters graphically:

Figure: Gaussian mixture


Here there are three Gaussian functions, hence K = 3. Each Gaussian explains the data contained in one of the three clusters. The mixing coefficients are themselves probabilities and must satisfy the condition π1 + π2 + ... + πK = 1.
● We now determine the optimal values for these parameters. To achieve this, we ensure that each Gaussian fits the data points belonging to its cluster. This is exactly what maximum likelihood does.
● In general, the Gaussian density function is given by:

N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))

● Here x represents the data points and D is the number of dimensions of each data point; μ and Σ are the mean and covariance, respectively. If the dataset comprises N = 1000 three-dimensional points (D = 3), then x will be a 1000 × 3 matrix, μ will be a 1 × 3 vector, and Σ will be a 3 × 3 matrix. For later purposes, we will also find it useful to take the log of this equation:

ln N(x | μ, Σ) = −(D/2) ln(2π) − (1/2) ln|Σ| − (1/2) (x − μ)ᵀ Σ⁻¹ (x − μ)

● If we differentiate this equation with respect to the mean and covariance and then equate it to zero, we are able to find the optimal values for these parameters, and the solutions correspond to the Maximum Likelihood Estimates (MLE).
● However, because we are dealing with not just one but many Gaussians, things get a bit more complicated when the time comes to find the parameters of the whole mixture.
● A Gaussian Mixture Model (GMM) is a parametric representation of a probability density function, based on a weighted sum of multi-variate Gaussian distributions.
 GMMs are commonly used as a parametric model of the probability distribution of continuous
measurements or features in a biometric system.

● GMM parameters are estimated from training data using the iterative Expectation-
Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a well-
trained prior model.
● GMM is computationally inexpensive, does not require phonetically labeled training speech,
and is well suited for text-independent tasks, where there is no strong prior knowledge of the
spoken text.
● GMM is used in Signal Processing, Speaker Recognition, Language identification, Classification
etc.
● An Expectation-Maximization (EM) algorithm is an iterative method for finding Maximum
Likelihood or Maximum A Posteriori (MAP) estimates of parameters in statistical models, where
the model depends on unobserved latent variables. EM Algorithm consists of two major steps:
1. E (Expectation) step
2. M (Maximization) step

EM ALGORITHM:

E (Expectation) step:
In the E-step, the expected value of the log-likelihood function is calculated given the observed data
and current estimate of the model parameters.

M (Maximization) step:
The M-step computes the parameters which maximize the expected log-likelihood found on the E-
step. These parameters are then used to determine the distribution of the latent variables in the next E-
step until the algorithm has converged.
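A short sketch, assuming synthetic 2-D data, using scikit-learn's GaussianMixture, which fits the mixture parameters (means, covariances, mixing weights) with the EM algorithm:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[3, 3], scale=0.7, size=(300, 2)),
    rng.normal(loc=[0, 4], scale=0.6, size=(300, 2)),
])

gmm = GaussianMixture(n_components=3, covariance_type="full", max_iter=100)
gmm.fit(X)                    # alternates E-steps and M-steps until convergence

print(gmm.weights_)           # mixing probabilities pi_k (sum to 1)
print(gmm.means_)             # one mean mu_k per Gaussian
print(gmm.predict(X[:5]))     # most likely component for each point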

MAXIMUM A POSTERIORI PROBABILITY (MAP):

● A maximum a posteriori probability (MAP) estimate is a mode of the posterior distribution.
● MAP estimation is a two-step estimation process:

i. Estimates of the sufficient statistics of the training data are computed for each mixture component in the prior model.
ii. For adaptation, these 'new' sufficient statistic estimates are then combined with the 'old' sufficient statistics from the prior mixture parameters using a data-dependent mixing coefficient.

MAXIMUM LIKELIHOOD:

In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a


statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation
provides estimates for the model's parameters.

VI. Hidden Markov Modeling:


Hidden Markov models (HMMs) have been introduced for the offline classification of single-trial EEG data in a brain-computer interface (BCI). The HMMs are used to classify Hjorth parameters calculated from bipolar EEG data recorded during the imagination of a left- or right-hand movement.

Hidden Markov models (HMM):


A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process (a memoryless process: given the present state, its future is independent of the past) with hidden states.

 A set of states, each of which has a limited number of transitions and emissions,
 Each transition between states has an assigned probability,
 Each model starts from a start state and ends in an end state.

1. HMM Architecture:

The following diagram shows a generalized automaton architecture of an operating HMM λ, with its two integrated stochastic processes.

Figure. Generalized Architecture of an operating Hidden Markov Model

● Each shape represents a random variable that can adopt any of a number of values. The random variable s(t) is the hidden state at time t, and the random variable o(t) is the observation at time t. The conditional probability law of the hidden variable s(t) at time t, given the values of the hidden variables at all earlier times, depends only on the value of the hidden variable s(t−1) at time t−1.
● All earlier values are no longer needed, so the Markov property as defined above is satisfied. Through the second stochastic process, the value of the observed variable o(t) depends only on the value of the hidden variable s(t) at the same time t.
Three fundamental Problem:
i. Evaluation problem:

Compute the probability that a particular output sequence was produced by that model (solved by
the forward algorithm).

Similarly, we define a backward algorithm where

βi(t) = P(v(t+1), v(t+2), ..., v(T) | w(t) = wi, θ)

is the probability that the HMM will generate the observations from t+1 to T in V^T, given that it is in state wi at time t. The quantities βi(t), i = 1, ..., N, can be computed by a backward recursion analogous to the forward algorithm.
The computations of both αj(t) and βi(t) have complexity O(N²T).
For classification, we compute the posterior probabilities P(θ | V^T) ∝ P(V^T | θ) P(θ), where P(θ) is the prior for a particular class and P(V^T | θ) is computed using the forward algorithm with the HMM for that class. We then select the class with the highest posterior.
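A compact NumPy sketch of the forward algorithm for a hypothetical 2-state, 2-symbol HMM; all probability values are made up for illustration:

import numpy as np

def forward(A, B, pi, obs):
    # alpha[t, j] = P(v(1..t), w(t) = w_j); returns P(V^T | model) and alpha.
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                        # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]    # O(N^2) work per step
    return alpha[-1].sum(), alpha

A = np.array([[0.7, 0.3], [0.4, 0.6]])    # transition probabilities a_ij
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # emission probabilities b_jk
pi = np.array([0.6, 0.4])                 # initial state distribution
likelihood, _ = forward(A, B, pi, obs=[0, 1, 0])
print(likelihood)   # probability that this model produced the observed sequence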

ii. Decoding problem:


Find the most likely sequence of hidden states which could have generated a given output sequence
(solved by the Viterbi algorithm).
 Given a sequence of observations V, find the most probable sequence of hidden states.
 One possible solution is to enumerate every possible hidden state sequence and calculate the probability of the observed sequence for each, which has O(N^T T) complexity.
 We can also define the problem of finding the optimal state sequence as finding the one that includes the states that are individually most likely.
 This also corresponds to maximizing the expected number of correct individual states.
 Define γi(t) as the probability that the HMM is in state wi at time t given the observation sequence V^T.

Then, the individually most likely state w(t) at time t becomes w(t) = argmax_i γi(t).

● One problem is that the resulting sequence may not be consistent with the underlying model, because it may include transitions with zero probability (aij = 0 for some i and j).
● One possible solution is the Viterbi algorithm, which finds the single best state sequence W^T by maximizing P(W^T | V^T, θ) (or equivalently P(W^T, V^T | θ)).
● This algorithm recursively computes the state sequence with the highest probability at time t and keeps track of the states that form the sequence with the highest probability at time T, as sketched below.
● The goal of learning is to determine the model parameters {aij}, {bjk} and {πi} from a collection of training samples.
● Define ξij(t) as the probability that the HMM is in state wi at time t−1 and state wj at time t, given the observation sequence V^T.
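As referenced above, a sketch of Viterbi decoding for the same hypothetical 2-state model; the values are illustrative:

import numpy as np

def viterbi(A, B, pi, obs):
    # Most probable hidden-state sequence for the observations (decoding problem).
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))             # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)    # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                  # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, obs=[0, 0, 1, 1, 0]))   # [0, 0, 1, 1, 0]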

iii. Learning problem:


A set of output sequences, find the most likely set of state transition and output probabilities (solved by
the Baum-Welch algorithm).
ϒi(t) defined in the decoding problem and ξij (t) defined here can be related as

Then aij, the estimate of the probability of a transition from wi at t − 1 to wj at t, can be computed as

aij = (expected number of transitions from wi to wj) / (expected total number of transitions away from wi)

Similarly, bjk, the estimate of the probability of observing the symbol vk while in state wj, is computed as

bjk = (expected number of times observing symbol vk in state wj) / (expected total number of times in wj)

Here δ(v(t), vk) is the Kronecker delta, which is 1 only when v(t) = vk.
● Finally, the estimate of the initial state distribution, πi, can be computed as γi(1), the expected number of times the model is in state wi at time t = 1.
● These are called the Baum-Welch equations (also called the EM estimates for HMMs, or the forward-backward algorithm); they are computed iteratively until some convergence criterion is met (e.g., sufficiently small changes in the estimated values in subsequent iterations).
● When the observations are continuous, the estimates bj(x) are obtained by modeling the observation distributions with Gaussian mixtures.

Applications:
 On-line handwriting recognition
 Speech recognition
 Gesture recognition
 Language modeling
 Motion video analysis and tracking

VII. Neural Networks:


A neural network (NN) is an assembly of artificial neurons.
● Neural Networks are the functional unit of Deep Learning and are known to mimic the behavior of
the human brain to solve complex data-driven problems.
● The input data is processed through different layers of artificial neurons stacked together to produce
the desired output.
● From speech recognition and person recognition to healthcare and marketing, Neural Networks have
been used in a varied set of domains.
NNs can be clustered under two categories:
1. Multilayer Perceptron (MLP)
2. Other Neural Network architectures

1. Multilayer Perceptron (MLP):


An MLP is composed of several layers of neurons:
● an input layer
● several hidden layers
● an output layer
When composed of enough neurons, an MLP can approximate any continuous function.
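A brief sketch of an MLP classifier using scikit-learn on synthetic data; the layer sizes and the data are assumptions chosen for illustration:

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                            # input layer: 4 features
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 > 1).astype(int)     # nonlinear target

mlp = MLPClassifier(hidden_layer_sizes=(32, 16),         # two hidden layers
                    activation="relu",
                    max_iter=1000)
mlp.fit(X, y)                                            # output layer: class label
print(mlp.score(X, y))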

2. Other Neural Network Architectures:


Among all NN architectures, there is one that has been specially created for BCI, the Gaussian classifier:
• This classifier has been applied with success to motor imagery and mental task classification.
• The BCI team at EPFL states that this NN outperforms the MLP on BCI data.

Figure: Neural network neurons (biological neuron vs. artificial neuron)

Neural Network Architecture:


The Neural Network architecture is made of individual units called neurons that mimic the biological
behavior of the brain. Here are the various components of a neuron.

Input - It is the set of features that are fed into the model for the learning process. For example,
the input in object detection can be an array of pixel values pertaining to an image.
Weight - Its main function is to give importance to those features that contribute more towards
the learning. It does so by introducing scalar multiplication between the input value and the
weight matrix. For example, a negative word would impact the decision of the sentiment
analysis model more than a pair of neutral words.

Transfer function - The job of the transfer function is to combine multiple inputs into one output
value so that the activation function can be applied. It is done by a simple summation of all the
inputs to the transfer function.
Activation Function - It introduces non-linearity into the working of perceptrons so that the network can model non-linear relationships between inputs and outputs. Without it, the output would just be a linear combination of input values, and the network would not be able to capture non-linear patterns.
Bias - The role of bias is to shift the value produced by the activation function. Its role is similar
to the role of a constant in a linear function. When multiple neurons are stacked together in a
row, they constitute a layer, and multiple layers piled next to each other are called a multi-layer
neural network.
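A minimal sketch of a single artificial neuron combining the components above (summation transfer function, bias, sigmoid activation); the numbers are illustrative:

import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(weights, inputs) + bias     # transfer function: weighted sum plus bias
    return 1.0 / (1.0 + np.exp(-z))        # activation function: sigmoid non-linearity

x = np.array([0.5, -1.2, 3.0])             # input features
w = np.array([0.8, 0.1, -0.4])             # weights
print(neuron(x, w, bias=0.2))              # output passed on to the next layer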

Main components of this type of structure:

Input Layer:
The data that we feed to the model is loaded into the input layer from external sources like a CSV file or
a web service. It is the only visible layer in the complete Neural Network architecture that passes the
complete information from the outside world without any computation.
Hidden Layers:
Hidden Layers are intermediate layers that do all the computations and extract the features from the data.
There can be multiple interconnected hidden layers that account for searching different hidden features in
the data. For example, in image processing, the first hidden layers are responsible for higher-level features
like edges, shapes, or boundaries. On the other hand, the later hidden layers perform more complicated
tasks like identifying complete objects (a car, a building, a person).
Output Layer:
The output layer takes input from the preceding hidden layers and produces a final prediction based on the model's learning. It is the most important layer, since this is where we get the final result. In the case of classification models it outputs the predicted class, and in the case of regression models it outputs the predicted continuous value.

Types of neural networks:


Neural networks are sometimes described in terms of their depth, including how many layers they have
between input and output, or the model's so-called hidden layers. This is why the term neural network is
used almost synonymously with deep learning.

They can also be described by the number of hidden nodes the model has or in terms of how many input
layers and output layers each node has. Variations on the classic neural network design enable various
forms of forward and backward propagation of information among tiers.

i. Feed-forward neural networks:
One of the simplest variants of neural networks, these pass information in one direction, through various input nodes, until it reaches the output node. The network might or might not have hidden node layers, which makes their functioning more interpretable, and it can process large amounts of noisy data.
This type of ANN computational model is used in technologies such as facial recognition and computer vision.

ii. Recurrent neural networks (RNNs):

● More complex in nature, RNNs save the output of processing nodes and feed the result back into
the model. This is how the model learns to predict the outcome of a layer. Each node in the RNN
model acts as a memory cell, continuing the computation and execution of operations.
● This neural network starts with the same forward propagation as a feed-forward network, but then goes on to remember all processed information so it can be reused in the future. If the network's prediction is incorrect, the system self-learns and continues working toward the correct prediction during backpropagation. This type of ANN is frequently used in text-to-speech conversion.

iii. Convolutional neural networks (CNNs):

● CNNs are one of the most popular models used today. This computational model uses a variation of
multilayer perceptrons and contains one or more convolutional layers that can be either entirely connected
or pooled.
● These convolutional layers create feature maps that record a region of the image that's
ultimately broken into rectangles and sent out for nonlinear processing.
● The CNN model is particularly popular in the realm of image recognition. It has been used in
many of the most advanced applications of AI, including facial recognition, text digitization
and NLP. Other use cases include paraphrase detection, signal processing and image
classification.

iv. Deconvolutional neural networks:
Deconvolutional neural networks use a reversed CNN model process. They try to find lost features or
signals that might have originally been considered unimportant to the CNN system's task. This network
model can be used in image synthesis and analysis.

v. Modular neural networks:


These contain multiple neural networks working separately from one another. The networks don't
communicate or interfere with each other's activities during the computation process. Consequently,
complex or big computational processes can be performed more efficiently.

Advantages of artificial neural networks:

● Parallel processing abilities. ANNs have parallel processing abilities, which means the
network can perform more than one job at a time.

● Information storage. ANNs store information on the entire network, not just in a
database. This ensures that even if a small amount of data disappears from one location,
the entire network continues to operate.

● Non-linearity. The ability to learn and model nonlinear, complex relationships helps model
the real-world relationships between input and output.

● Fault tolerance. ANNs come with fault tolerance, which means the corruption or fault of one
or more cells of the ANN won't stop the generation of output.

● Gradual corruption. This means the network slowly degrades over time instead of
degrading instantly when a problem occurs.
● Unrestricted input variables. No restrictions are placed on the input variables, such as how they
should be distributed.

● Observation-based decisions. Machine learning means the ANN can learn from events and
make decisions based on the observations.

● Unorganized data processing. Artificial neural networks are exceptionally good at


organizing large amounts of data by processing, sorting and categorizing it.

● Ability to learn hidden relationships. ANNs can learn the hidden relationships in data without
commanding any fixed relationship. This means ANNs can better model highly volatile data
and non-constant variance.

● Ability to generalize data. The ability to generalize and infer unseen relationships on unseen
data means ANNs can predict the output of unseen data.

Disadvantages of artificial neural networks:

● Lack of rules. The lack of rules for determining the proper network structure means the
appropriate artificial neural network architecture can only be found through trial, error and
experience.
● Hardware dependency. The requirement of processors with parallel processing abilities
makes neural networks dependent on hardware.

● Numerical translation. The network works with numerical information, meaning all
problems must be translated into numerical values before they can be presented to the ANN.

● Lack of trust. The lack of explanation behind probing solutions is one of the biggest
disadvantages of ANNs. The inability to explain the why or how behind the solution generates
a lack of trust in the network.

● Inaccurate results. If not trained properly, ANNs can often produce incomplete or inaccurate
results.

● Black box nature. Because of their black box AI model, it can be challenging to grasp how
neural networks make their predictions or categorize data.

