UNIT 2: Deep Learning

Deep Feedforward Networks

Example: Learning XOR, Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and Other Differentiation Algorithms

Regularization for Deep Learning

Parameter Norm Penalties, Norm Penalties as Constrained Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise Robustness, Semi-Supervised Learning, Multitask Learning.
Deep Feedforward Networks

Example: Learning XOR

Deep feedforward networks, also often called feedforward neural networks or multilayer perceptrons (MLPs), are the quintessential deep learning models.

The XOR function ("exclusive or") is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f∗(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning algorithm will adapt the parameters θ to make f as similar as possible to f∗.

In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points X = {[0, 0], [0, 1], [1, 0], [1, 1]}. We will train the network on all four of these points. The only challenge is to fit the training set. We can treat this problem as a regression problem and use a mean squared error loss function. We choose this loss function to simplify the math for this example as much as possible. In practical applications, MSE is usually not an appropriate cost function for modeling binary data; more appropriate approaches are discussed later.

Evaluated on our whole training set, the MSE loss function is

J(θ) = (1/4) Σ_{x∈X} (f∗(x) − f(x; θ))²

Suppose we choose a linear model, with θ consisting of w and b. Our model is defined to be

f(x; w, b) = xᵀw + b.

We can minimize J(θ) in closed form with respect to w and


b using the normal equations.
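As a quick check (a minimal NumPy sketch, not part of the original notes), solving the normal equations for this linear model on the four XOR points yields w = 0 and b = 0.5, so the model outputs 0.5 everywhere and cannot fit the training set:

```python
import numpy as np

# The four XOR input points and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias b is learned together with w.
X_aug = np.hstack([X, np.ones((4, 1))])

# Solve the normal equations (least squares) in closed form.
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:2], theta[2]

print(w, b)           # w is approximately [0, 0], b is approximately 0.5
print(X_aug @ theta)  # every prediction is 0.5 -> the MSE cannot reach 0
```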
Figure: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In
the transformed space represented by the features
extracted by a neural network, a linear model can now
solve the problem. In our example solution, the two points
that must have output 1 have been collapsed into a single
point in feature space. In other words, the nonlinear
features have mapped both x = [1,0] and x = [0,1] to a
single point in feature space, h = [1,0]. The linear model
can now describe the function as increasing in h1 and
decreasing in h2. In this example, the motivation for
learning the feature space is only to make the model
capacity greater so that it can fit the training set. In more
realistic applications, learned representations can also
help the model to generalize.
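To make this concrete, the following NumPy sketch (added for illustration, not part of the original notes) evaluates the well-known hand-picked solution from the Deep Learning book: one hidden layer of two ReLU units followed by a linear output reproduces XOR exactly on the four training points.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked parameters (Goodfellow et al., section 6.1):
W = np.array([[1., 1.],
              [1., 1.]])          # hidden-layer weights
c = np.array([0., -1.])           # hidden-layer biases
w = np.array([1., -2.])           # output weights
b = 0.0                           # output bias

h = relu(X @ W + c)               # the learned feature space
y_hat = h @ w + b

print(h)       # [0,1] and [1,0] are both mapped to h = [1, 0]
print(y_hat)   # [0. 1. 1. 0.] -> exactly the XOR targets
```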
Gradient-Based Learning
Gradient descent was first proposed by Augustin-Louis Cauchy in 1847. Gradient descent is one of the most commonly used iterative optimization algorithms in machine learning, used to train machine learning and deep learning models. It helps in finding a local minimum of a differentiable function.

o If we move in the direction of the negative gradient, i.e., away from the gradient of the function at the current point, we move towards a local minimum of that function.
o If we move in the direction of the positive gradient, i.e., towards the gradient of the function at the current point, we move towards a local maximum of that function.

The first procedure is known as gradient descent (also called steepest descent); the second is gradient ascent. The main objective of the gradient descent algorithm is to minimize the cost function by iteration. To achieve this goal, it performs two steps iteratively:

o Calculate the first-order derivative of the function to compute the gradient or slope of that function at the current point.
o Move a step in the direction opposite to the gradient, scaled by alpha, where alpha is the learning rate: a tuning parameter in the optimization process that decides the length of the steps.

What is a Cost Function?
The cost function measures the difference, or error, between the actual values and the predicted values at the current parameter setting, and it is expressed as a single real number.

How does Gradient Descent work?

Consider a simple linear model, Y = mX + c.

The starting point is just an arbitrary point used to evaluate the initial performance. At this starting point, we compute the first derivative (slope) of the cost function; the tangent line at that point tells us how steep the slope is. This slope then informs the updates to the parameters (weights and bias).

The slope is steepest at the starting point; as new parameters are generated, the steepness gradually decreases until, at the lowest point of the cost curve, the algorithm reaches the point of convergence.

The main objective of gradient descent is to minimize the cost function, i.e., the error between the predicted and actual values. To minimize the cost function, two things are required: the direction of the step and the learning rate.
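The update loop below is a minimal NumPy sketch of these ideas for the linear model y = mX + c with an MSE cost; the data and hyperparameters are made up for illustration only.

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (illustrative values only).
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.0, 4.9, 7.2, 9.1])

m, c = 0.0, 0.0          # arbitrary starting point
alpha = 0.01             # learning rate (step-size tuning parameter)

for step in range(1000):
    y_hat = m * X + c
    error = y_hat - y
    # First-order derivatives (slope) of the MSE cost w.r.t. m and c.
    grad_m = 2 * np.mean(error * X)
    grad_c = 2 * np.mean(error)
    # Step in the direction opposite to the gradient.
    m -= alpha * grad_m
    c -= alpha * grad_c

print(m, c)   # approaches the point of convergence, near m = 2 and c = 1
```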

Types of Gradient Descent

Based on how much of the training data is used to compute the error for each update, gradient descent can be divided into batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's understand these different types of gradient descent:

1. Batch Gradient Descent:

Batch gradient descent (BGD) computes the error for every example in the training set and updates the model only after all training examples have been evaluated. One such pass over the data is known as a training epoch. In simple words, we sum over all examples for each update.

Advantages of Batch gradient descent:

o It produces less noise in comparison to the other gradient descent variants.
o It produces stable gradient descent convergence.
o It is computationally efficient, since all resources are used to process all training samples together.

2. Stochastic Gradient Descent

Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration: it updates the parameters after each individual example. Because it requires only one training example at a time, it is easy to fit in memory. However, it loses some computational efficiency compared to batch gradient descent because of the frequent updates, and these frequent updates also produce a noisy gradient. That noise can nonetheless be helpful, since it sometimes lets the algorithm escape local minima and find the global minimum.

Advantages of Stochastic gradient descent:

In stochastic gradient descent (SGD), learning happens on every example, and it has a few advantages over the other gradient descent variants.

o It is easier to fit in the available memory.
o Each update is relatively fast to compute compared to batch gradient descent.
o It is more efficient for large datasets.

3. Mini-Batch Gradient Descent:

Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and then performs an update on each of those batches separately. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we achieve a form of gradient descent with high computational efficiency and a less noisy gradient.

Advantages of Mini Batch gradient descent:

o It is easier to fit in allocated memory.
o It is computationally efficient.
o It produces stable gradient descent convergence.
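The only difference between the three variants is how many examples are used per parameter update. The NumPy sketch below (toy data and illustrative hyperparameters, not from the original notes) trains the same linear-regression cost with the batch size as the only knob:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def gradient(w, Xb, yb):
    # Gradient of the MSE cost on the given (mini-)batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, alpha=0.05, epochs=20):
    w = np.zeros(3)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= alpha * gradient(w, X[batch], y[batch])
    return w

w_batch = train(batch_size=len(X))   # batch GD: one update per epoch
w_sgd   = train(batch_size=1)        # stochastic GD: one example per update
w_mini  = train(batch_size=32)       # mini-batch GD: a compromise
print(w_batch, w_sgd, w_mini)
```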

Challenges with Gradient Descent

Although gradient descent is one of the most popular methods for optimization problems, it still has some challenges. A few of them are as follows:

1. Local Minima and Saddle Points:

For convex problems, gradient descent can find the global minimum easily, but for non-convex problems it may get stuck in local minima or saddle points and fail to reach the global minimum, which is where the model achieves the best results.

2. Vanishing and Exploding Gradients

In a deep neural network trained with gradient descent and back-propagation, two further issues can occur besides local minima and saddle points.

Vanishing Gradients:

A vanishing gradient occurs when the gradient becomes smaller than expected. During back-propagation, the gradient shrinks as it is propagated backwards, so the earlier layers learn more slowly than the later layers of the network. Once this happens, the weight updates of the earlier layers become so small that they are effectively insignificant.

Exploding Gradients:

An exploding gradient is the opposite of a vanishing gradient: it occurs when the gradient becomes too large, which makes the model unstable. In this scenario the model weights grow until they are eventually represented as NaN. This problem can be mitigated by techniques that reduce the model's complexity or limit the size of the updates, such as gradient clipping.

Hidden Units in Deep Learning

Hidden units are the intermediate computing units of a feed-forward neural network. Feed-forward networks can be seen as cascaded squashed linear functions. The inputs feed into a layer of hidden units, which can feed into layers of more hidden units, which eventually feed into the output layer. Each of the hidden units is a squashed linear function of its inputs.

Neural networks of this type can have any real numbers as inputs, and they produce a real number as output.

For regression, it is typical for the output units to be a linear function of their inputs. For classification, it is typical for the output to be a sigmoid function of its inputs (because there is no point in predicting a value outside of [0, 1]). For the hidden layers, there is no point in having their output be a linear function of their inputs, because a linear function of a linear function is a linear function; adding the extra layers would give no added functionality. The output of each hidden unit is therefore a squashed linear function of its inputs.

Associated with a network are the parameters of all of the linear functions. These parameters can be tuned simultaneously to minimize the prediction error on the training examples.
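The NumPy sketch below (illustrative layer sizes and random weights, not from the original notes) shows this structure: each hidden unit is a squashed (sigmoid) linear function of its inputs, and the output is linear for regression or a sigmoid for classification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, w2, b2, task="classification"):
    # Hidden layer: a squashed linear function of the inputs.
    h = sigmoid(W1 @ x + b1)
    # Output layer: linear for regression, sigmoid for classification.
    z = w2 @ h + b2
    return sigmoid(z) if task == "classification" else z

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # 4 real-valued inputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # 3 hidden units
w2, b2 = rng.normal(size=3), 0.0

print(forward(x, W1, b1, w2, b2))               # a value in [0, 1]
```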
Architecture Design

Supervised deep learning


Convolutional neural networks

A CNN is a multilayer neural network that was biologically inspired by the animal
visual cortex. The architecture is particularly useful in image-processing applications.
The first CNN was created by Yann LeCun; at the time, the architecture focused on
handwritten character recognition, such as postal code interpretation. As a deep
network, early layers recognize features (such as edges), and later layers recombine
these features into higher-level attributes of the input.

The LeNet CNN architecture is made up of several layers that implement feature extraction and then classification. The input image is divided into receptive fields that feed into a convolutional layer, which then extracts features from
the input image. The next step is pooling, which reduces the dimensionality of the
extracted features (through down-sampling) while retaining the most important
information (typically, through max pooling). Another convolution and pooling step is
then performed that feeds into a fully connected multilayer perceptron. The final
output layer of this network is a set of nodes that identify features of the image (in
this case, a node per identified number). You train the network by using back-
propagation.
The use of deep layers of processing, convolutions, pooling, and a fully connected
classification layer opened the door to various new applications of deep learning
neural networks. In addition to image processing, the CNN has been successfully
applied to video recognition and various tasks within natural language processing.
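As an illustration only (assuming PyTorch is available; this is a LeNet-style sketch of the layer sequence described above, not LeCun's exact network), the convolution, pooling, convolution, pooling, and fully connected classification stages look like this:

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # feature extraction from a 32x32 input
            nn.ReLU(),
            nn.MaxPool2d(2),                  # pooling: down-sampling
            nn.Conv2d(6, 16, kernel_size=5),  # second convolution stage
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(      # fully connected multilayer perceptron
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),       # one output node per identified class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

net = LeNetStyle()
out = net(torch.zeros(1, 1, 32, 32))          # dummy 32x32 grayscale image
print(out.shape)                              # torch.Size([1, 10])
```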

Recurrent neural networks

The RNN is one of the foundational network architectures from which other deep
learning architectures are built. The primary difference between a typical multilayer
network and a recurrent network is that rather than completely feed-forward
connections, a recurrent network might have connections that feed back into prior
layers (or into the same layer). This feedback allows RNNs to maintain memory of
past inputs and model problems in time.

RNNs consist of a rich set of architectures (we'll look at one popular topology called
LSTM next). The key differentiator is feedback within the network, which could
manifest itself from a hidden layer, the output layer, or some combination thereof.
LSTM networks

The LSTM was created in 1997 by Hochreiter and Schmidhuber, but it has grown in popularity in recent years as an RNN architecture for various applications. You'll find LSTMs in products that you use every day, such as smartphones. IBM applied LSTMs in IBM Watson® for milestone-setting conversational speech recognition.

The LSTM departed from typical neuron-based neural network architectures and
instead introduced the concept of a memory cell. The memory cell can retain its
value for a short or long time as a function of its inputs, which allows the cell to
remember what's important and not just its last computed value.

The LSTM memory cell contains three gates that control how information flows into
or out of the cell. The input gate controls when new information can flow into the
memory. The forget gate controls when an existing piece of information is forgotten,
allowing the cell to remember new data. Finally, the output gate controls when the
information that is contained in the cell is used in the output from the cell. The cell
also contains weights, which control each gate. The training algorithm, commonly
BPTT, optimizes these weights based on the resulting network output error.
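A single step of an LSTM memory cell can be written down directly from this description. The NumPy sketch below (random placeholder weights, illustrative sizes) shows the input, forget, and output gates controlling the cell state c and the hidden output h:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # Each gate is a squashed linear function of the input and previous output.
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])   # output gate
    g = np.tanh(p["Wg"] @ x + p["Ug"] @ h_prev + p["bg"])   # candidate value
    c = f * c_prev + i * g            # memory cell: keep old info, admit new info
    h = o * np.tanh(c)                # output gate decides what leaves the cell
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
p = {k: rng.normal(size=(n_hid, n_in)) for k in ["Wi", "Wf", "Wo", "Wg"]}
p.update({k: rng.normal(size=(n_hid, n_hid)) for k in ["Ui", "Uf", "Uo", "Ug"]})
p.update({k: np.zeros(n_hid) for k in ["bi", "bf", "bo", "bg"]})

h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, p)
print(h, c)
```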
Unsupervised deep learning

Self-organized maps

Self-organized map (SOM) was invented by Dr. Teuvo Kohonen in 1982 and was
popularly known as the Kohonen map. SOM is an unsupervised neural network that
creates clusters of the input data set by reducing the dimensionality of the input.
SOMs vary from the traditional artificial neural network in quite a few ways.

The first significant variation is that weights serve as a characteristic of the node itself. After the inputs are normalized, a random input is chosen first. Random weights close to zero are assigned to each feature of the input record; these weight vectors represent the output nodes. The Euclidean distance between each of these output nodes and the input is calculated, and the node with the least distance is declared the most accurate representation of the input: it is marked as the best matching unit, or BMU. With the BMU as the center point, the weights of the other units within a radius around it are also updated, with the adjustment decreasing with distance from the BMU; this radius is shrunk over time.

Next, in an SOM, no activation function is applied, and because there are no target labels to compare against, there is no concept of calculating error and back-propagation.
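A hedged NumPy sketch of one SOM training step, following the description above (finding the BMU by Euclidean distance and pulling the weights of nearby nodes toward the input; the grid size, learning rate, and radius are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
grid = rng.uniform(-0.05, 0.05, size=(10, 10, 3))   # 10x10 map, 3-feature inputs

def som_step(x, grid, lr=0.5, radius=2.0):
    # Euclidean distance between the input and every node's weight vector.
    dists = np.linalg.norm(grid - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)   # best matching unit
    # Update the BMU and its neighbours; influence decays with grid distance.
    rows, cols = np.indices(dists.shape)
    grid_dist = np.sqrt((rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2)
    influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    grid += lr * influence[..., None] * (x - grid)
    return bmu

x = rng.random(3)                # one (normalized) input record
print(som_step(x, grid))         # grid coordinates of the BMU for this input
```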
Autoencoders
Though the history of when autoencoders were invented is hazy, the first known
usage of autoencoders was found to be by LeCun in 1987. This variant of an ANN is
composed of 3 layers: input, hidden, and output layers.

First, the input layer is encoded into the hidden layer using an appropriate encoding
function. The number of nodes in the hidden layer is much less than the number of
nodes in the input layer. This hidden layer contains the compressed representation
of the original input. The output layer aims to reconstruct the input layer by using a
decoder function.
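As a sketch (assuming PyTorch; the layer sizes are illustrative), the three-layer structure described above, with a bottleneck hidden layer and a decoder that reconstructs the input:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_inputs=784, n_hidden=32):
        super().__init__()
        # Encoder: the hidden layer has far fewer nodes than the input layer.
        self.encoder = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())
        # Decoder: reconstructs the input from the compressed representation.
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_inputs), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(8, 784)                       # a batch of flattened images
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
print(loss.item())
```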

Back-Propagation and Other Differentiation Algorithms

When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The inputs x provide the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often
simply called backprop, allows the information from the cost to then flow backwards
through the network, in order to compute the gradient.
Computing an analytical expression for the gradient is straightforward, but
numerically evaluating such an expression can be computationally expensive. The
back-propagation algorithm does so using a simple and inexpensive procedure.

The term back-propagation is often misunderstood as meaning the whole learning


algorithm for multi-layer neural networks. Actually, back-propagation refers only to
the method for computing the gradient, while another algorithm, such as stochastic
gradient descent, is used to perform learning using this gradient. Furthermore,
back-propagation is often misunderstood as being specific to multi-layer neural networks, but in principle it can compute derivatives of any function (for some functions, the correct response is to report that the derivative of the function is undefined). Specifically, we will describe how to compute the gradient ∇x f(x, y) for an arbitrary function f, where x is a set of variables whose derivatives are desired, and y is an additional set of variables that are inputs to the function but whose derivatives are not required. In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, ∇θ J(θ). Many machine learning tasks involve computing other derivatives, either as part of the learning process, or to
analyze the learned model. The back-propagation algorithm can be applied to
these tasks as well, and is not restricted to computing the gradient of the cost
function with respect to the parameters. The idea of computing derivatives by
propagating information through a network is very general, and can be used to
compute values such as the Jacobian of a function f with multiple outputs. We restrict our description here to the most commonly used case, where f has a single output.

Computational Graphs

Node
Here, we use each node in the graph to indicate a variable. The variable may be a
scalar, vector, matrix, tensor, or even a variable of another type

To formalize our graphs, we also need to introduce the idea of an operation . An


operation is a simple function of one or more variables. Our graph language is
accompanied by a set of allowable operations. Functions more complicated than the
operations in this set may be described by composing many operations together.

Without loss of generality, we define an operation to return only a single output


variable. This does not lose generality because the output variable can have multiple
entries, such as a vector. Software implementations of back-propagation usually
support operations with multiple outputs, but we avoid this case in our description
because it introduces many extra details that are not important to conceptual
understanding.
Construction of computational graph
If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y. We sometimes annotate the output node with the name of the operation applied, and at other times we omit this label when the operation is clear from context.

Chain Rule of Calculus


The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

Let x be a real number, and let f and g both be functions mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then the chain rule states that

dz/dx = (dz/dy) · (dy/dx)
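A quick numeric illustration with made-up functions: with y = g(x) = x² and z = f(y) = sin(y), the chain rule gives dz/dx = cos(x²) · 2x, which a finite-difference check confirms.

```python
import numpy as np

g = lambda x: x ** 2            # y = g(x)
f = lambda y: np.sin(y)         # z = f(y)

x = 1.3
analytic = np.cos(g(x)) * 2 * x                      # (dz/dy) * (dy/dx)
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
print(analytic, numeric)                             # the two values agree
```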
Recursively Applying the Chain Rule to Obtain
Backprop
Using the chain rule, it is straightforward to write down an algebraic expression for
the gradient of a scalar with respect to any node in the computational graph that
produced that scalar. However, actually evaluating that expression in a computer
introduces some extra considerations.

Specifically, many subexpressions may be repeated several times within the overall
expression for the gradient. Any procedure that computes the gradient will need to choose
whether to store these subexpressions or to recompute them several times. An example of
how these repeated subexpressions arise is given in figure 6.9. In some cases, computing
the same subexpression twice would simply be wasteful. For complicated graphs, there can
be exponentially many of these wasted computations, making a naive implementation of
the chain rule infeasible. In other cases, computing the same subexpression twice could be
a valid way to reduce memory consumption at the cost of higher runtime.
We first present a version of the back-propagation algorithm that specifies the actual gradient computation directly (algorithm 6.2, along with algorithm 6.1 for the associated forward computation), in the order in which it will actually be done and according to the recursive application of the chain rule. One could either directly perform these computations or
view the description of the algorithm as a symbolic specification of the computational
graph for computing the back-propagation. However, this formulation does not make explicit
the manipulation and the construction of the symbolic graph that performs the gradient
computation. Such a formulation is presented below in section 6.5.6, with algorithm 6.5,
where we also generalize to nodes that contain arbitrary tensors.

Back-Propagation Computation in a Fully-Connected MLP

To clarify the above definition of the back-propagation computation, let us consider the specific graph associated with a fully-connected multi-layer MLP.

Algorithm 6.3 first shows the forward propagation, which maps the parameters to the supervised loss L(ŷ, y) associated with a single (input, target) training example (x, y), with ŷ the output of the neural network when x is provided as input.

Algorithm 6.4 then shows the corresponding computation to be done for applying the
back-propagation algorithm to this graph.

Algorithms 6.3 and 6.4 are demonstrations that are chosen to be simple and
straightforward to understand. However, they are specialized to one specific
problem.

Modern software implementations are based on the generalized form of back-


propagation described in section 6.5.6 below, which can accommodate
any computational graph by explicitly manipulating a data structure for
representing symbolic computation.
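The following NumPy sketch plays the role of algorithms 6.3 and 6.4 for a one-hidden-layer network with a sigmoid hidden layer and an MSE loss (a simplified stand-in, not the book's exact pseudocode): the forward pass stores the intermediate values, and the backward pass recursively applies the chain rule to obtain the gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    W1, b1, W2, b2 = params
    a1 = W1 @ x + b1
    h = sigmoid(a1)                 # hidden activations
    y_hat = W2 @ h + b2             # linear output
    return y_hat, (x, a1, h)

def backward(y_hat, y, cache, params):
    W1, b1, W2, b2 = params
    x, a1, h = cache
    # The gradient of the MSE loss flows backwards from the output.
    grad_out = 2 * (y_hat - y)                 # dL/dy_hat
    grad_W2 = np.outer(grad_out, h)
    grad_b2 = grad_out
    grad_h = W2.T @ grad_out                   # chain rule into the hidden layer
    grad_a1 = grad_h * h * (1 - h)             # derivative of the sigmoid
    grad_W1 = np.outer(grad_a1, x)
    grad_b1 = grad_a1
    return grad_W1, grad_b1, grad_W2, grad_b2

rng = np.random.default_rng(0)
params = (rng.normal(size=(3, 2)), np.zeros(3),
          rng.normal(size=(1, 3)), np.zeros(1))
x, y = rng.normal(size=2), np.array([1.0])

y_hat, cache = forward(x, params)              # forward propagation
grads = backward(y_hat, y, cache, params)      # back-propagation
print([g.shape for g in grads])
```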

Regularization for Deep Learning

Parameter Norm Penalties
Regularization has been used for decades prior to the advent of deep learning.
Linear models such as linear regression and logistic regression allow simple,
straightforward, and effective regularization strategies. Many regularization
approaches are based on limiting the capacity of models, such as neural
networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)    (7.1)

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term, Ω, relative to the standard objective function J(X; θ). Setting α to 0 results in no regularization. Larger values of α correspond to more regularization. When our training algorithm minimizes the regularized objective function J̃, it will decrease both the original objective J on the training data and some measure of
the size of the parameters θ (or some subset of the parameters). Different
choices for the parameter norm Ω can result in different solutions being
preferred. In this section, we discuss the effects of the various norms when used
as penalties on the model parameters. Before delving into the regularization
behaviour of different norms, we note that for neural networks, we typically
choose to use a parameter norm penalty Ω that penalizes only the weights of the
affine transformation at each layer and leaves the biases unregularized. The
biases typically require less data to fit accurately than the weights. Each weight
specifies how two variables interact. Fitting the weight well requires observing
both variables in a variety of conditions. Each bias controls only a single
variable. This means that we do not induce too much variance by leaving the
biases unregularized. Also, regularizing the bias parameters can introduce a
significant amount of underfitting. We therefore use the vector w to indicate all
of the weights that should be affected by a norm penalty, while the vector θ
denotes all of the parameters, including both w and the unregularized
parameters. In the context of neural networks, it is sometimes desirable to use a
separate penalty with a different α coefficient for each layer of the network.
Because it can be expensive to search for the correct value of multiple hyper
parameters, it is still reasonable to use the same weight decay at all layers just to
reduce the search space.
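A minimal NumPy sketch of equation (7.1) with an L² penalty Ω(θ) = ½‖w‖² on the weights of a linear model, leaving the bias unregularized as discussed above (the L² choice, the toy data, and the value of α are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

alpha = 0.1                       # strength of the norm penalty
w, b = np.zeros(5), 0.0
lr = 0.05

for _ in range(500):
    err = X @ w + b - y
    grad_w = 2 * X.T @ err / len(y) + alpha * w   # data term + penalty gradient
    grad_b = 2 * np.mean(err)                     # bias left unregularized
    w -= lr * grad_w
    b -= lr * grad_b

# Regularized objective J~ = J + alpha * 0.5 * ||w||^2
J_tilde = np.mean((X @ w + b - y) ** 2) + alpha * 0.5 * w @ w
print(J_tilde, np.linalg.norm(w))
```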
Norm penalties as constrained optimization

From chapter 4’s section 4, we know that to minimize any


function under some constraints, we can construct a
generalized Lagrangian function containing the objective
function along with the penalties. Suppose we
wanted Ω(θ) < k, then we could construct the following
Lagrangian:

L(θ, α; X, y) = J(θ; X, y) + α (Ω(θ) − k)

We get the optimal θ by solving the Lagrangian. If Ω(θ) > k,


then the weights need to be compensated highly and
hence, α should be large to reduce its value below k.
Likewise, if Ω(θ)<k, then the norm shouldn’t be reduced
too much and hence, α should be small. This is now similar
to the parameter norm penalty regularized objective
function as both of them encourage lower values of the
norm. Thus, parameter norm penalties naturally impose a
constraint, like the L²-regularization, defining a
constrained L²-ball. Larger α implies a smaller constrained
region as it pushes the values really low, hence, allowing a
small radius and vice versa. The idea of constraints over
penalties is important for several reasons. Large penalties
might cause non-convex optimization algorithms to get
stuck in local minima due to small values of θ, leading to
the formation of so-called dead cells, as the weights
entering and leaving them are too small to have an impact.
Constraints don’t enforce the weights to be near zero,
rather being confined to a constrained region.

Another reason is that constraints induce higher stability. With high learning rates, we might encounter a large weight that leads to a large gradient, which could iteratively lead to numerical overflow in the value of θ. Constraints, along with reprojection (onto the corresponding ball), prevent the weights from becoming too large, thus maintaining stability.

A final suggestion made by Hinton was to restrict the


individual column norms of the weight matrix rather than
the Frobenius norm of the entire weight matrix, so as to
prevent any hidden unit from having a large weight. The
idea here is that if we restrict the Frobenius norm, it
doesn’t guarantee that the individual weights would be
small, just their norm. So, we might have large weights
being compensated by extremely small weights to make
the overall norm small. Restricting each hidden unit
individually gives us the required guarantee.
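Hinton's suggestion can be implemented as a reprojection step applied after each gradient update. The NumPy sketch below (the limit k is illustrative, and it assumes a layout in which each column of the weight matrix holds the weights entering one hidden unit) rescales any column whose norm exceeds the constraint:

```python
import numpy as np

def project_column_norms(W, k=1.0):
    # Constrain each column (the weights entering one hidden unit) to norm <= k.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, k / np.maximum(norms, 1e-12))
    return W * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)) * 3.0
W_proj = project_column_norms(W, k=1.0)
print(np.linalg.norm(W_proj, axis=0))   # every column norm is now at most 1
```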

3. Regularization and Under-Constrained Problems

Underdetermined problems are those problems that have infinitely many solutions. For example, a logistic regression problem with linearly separable classes that has w as a solution will also have 2w as a solution, and so on. In some machine learning problems, regularization is therefore necessary. For example, many algorithms (e.g., PCA) require the inversion of XᵀX, which might be singular. In such a case, we can use a regularized form instead: (XᵀX + αI) is guaranteed to be invertible.

Regularization can also solve underdetermined problems. For example, the Moore-Penrose pseudoinverse defined earlier,

X⁺ = lim(α→0) (XᵀX + αI)⁻¹ Xᵀ,

can be seen as performing linear regression with L²-regularization, in the limit of vanishing regularization.
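A short NumPy sketch of this idea (the data is made up): when XᵀX is singular, adding αI makes the system solvable, and the result is the L²-regularized (ridge) linear-regression solution.

```python
import numpy as np

rng = np.random.default_rng(0)
# A duplicated column makes X'X singular, so the problem is underdetermined.
x0 = rng.normal(size=(20, 1))
X = np.hstack([x0, x0])
y = X[:, 0] + 0.01 * rng.normal(size=20)

alpha = 1e-3
XtX = X.T @ X
w = np.linalg.solve(XtX + alpha * np.eye(2), X.T @ y)   # (X'X + aI)^-1 X'y
print(np.linalg.matrix_rank(XtX), w)   # rank 1, yet a unique regularized solution
```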

4. Dataset Augmentation

Having more data is the most desirable way to improve a machine learning model's performance. In many cases, it is relatively easy to artificially generate data. For a classification task, we want the model to be invariant to certain types of transformations, and we can generate the corresponding (x, y) pairs by transforming (for example, translating) the input x. But for certain problems, like density estimation, we can't apply this directly unless we have already solved the density estimation problem.

However, caution needs to be exercised while augmenting data to make sure that the class doesn't change. For example, if the labels contain both "b" and "d", then horizontal flipping would be a bad idea for data augmentation. Adding random noise to the inputs is another form of data augmentation, while adding noise to hidden units can be seen as doing data augmentation at multiple levels of abstraction.

Finally, when comparing machine learning models, we


need to evaluate them using the same hand-designed data
augmentation schemes or else it might happen that
algorithm A outperforms algorithm B, just because it was
trained on a dataset which had more / better data
augmentation.
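A small NumPy sketch of label-preserving augmentation for an image-classification task (the transforms and sizes are illustrative): translated and noisy copies of an input x keep the same label y, while the horizontal flip is made optional because it is unsafe for classes such as "b" versus "d".

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, y, allow_flip=True):
    # Generate new (x, y) pairs via transformations that should not change the class.
    samples = [(np.roll(x, shift=1, axis=1), y),            # small translation
               (x + 0.05 * rng.normal(size=x.shape), y)]    # random input noise
    if allow_flip:                    # unsafe for classes like "b" vs "d"
        samples.append((np.fliplr(x), y))
    return samples

image = rng.random((28, 28))
augmented = augment(image, y=3, allow_flip=False)
print(len(augmented))                 # 2 extra training pairs for this image
```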

5. Noise Robustness

For some models, adding noise with infinitesimal variance to the inputs is equivalent to imposing a penalty on the norm of the weights. Noise added to the hidden units is very important and is discussed later under Dropout. Noise can even be added to the weights. This has several interpretations. One of them is that adding noise to the weights is a stochastic implementation of Bayesian inference over the weights, where the weights are considered uncertain, with the uncertainty modelled by a probability distribution. It can also be interpreted as a more traditional form of regularization that ensures stability in learning.

For example, in the linear regression case, we want to learn the mapping y(x) for each feature vector x by reducing the mean squared error. Now suppose a zero-mean, unit-variance Gaussian random noise, ϵ, is added to the weights. We still want to learn the appropriate mapping by reducing the mean squared error. Minimizing the loss after adding noise to the weights is equivalent to adding another regularization term, which makes sure that small perturbations in the weight values don't affect the predictions much, thus stabilising training.

Sometimes we may have wrong output labels, in which case maximizing p(y | x) may not be a good idea. In such a case, we can add noise to the labels by assigning a probability of (1 − ϵ) that the label is correct and a probability of ϵ that it is not; in the latter case, all the other labels are equally likely. Label smoothing regularizes a model with k softmax outputs by assigning the correct classification target a probability of (1 − ϵ) and each of the remaining (k − 1) classes a probability of ϵ / (k − 1).
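A minimal NumPy sketch of label smoothing for k softmax outputs (the value of ϵ is illustrative): the hard one-hot target is replaced by (1 − ϵ) on the correct class and ϵ/(k − 1) spread over the remaining classes.

```python
import numpy as np

def smooth_labels(correct_class, k, eps=0.1):
    target = np.full(k, eps / (k - 1))   # probability eps shared by the wrong classes
    target[correct_class] = 1.0 - eps    # correct class keeps probability 1 - eps
    return target

print(smooth_labels(correct_class=2, k=5))   # [0.025 0.025 0.9 0.025 0.025]
```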

6. Semi-Supervised Learning

P(x, y) denotes the joint distribution of x and y, i.e., for a training sample x we also have a label y. P(x) denotes the marginal distribution of x, i.e., just the training examples without any labels. In semi-supervised learning, we use both P(x, y) (some labelled samples) and P(x) (unlabelled samples) to estimate P(y | x) (since we want to predict the class, given the training sample). We want to learn some representation h = f(x) such that samples which are closer in the input space have similar representations, and a linear classifier in the new space achieves better generalization error.

Instead of separating the supervised and unsupervised


criteria, we can instead have a generative model
of P(x) (or P(x, y)) which shares parameters with the
discriminative model. The idea is to share the
unsupervised/generative criterion with the supervised
criterion to express a prior belief that the structure
of P(x) (or P(x, y)) is connected to the structure of P(y|x),

which is expressed by the shared parameters.

Multitask Learning

The idea is to improve generalization error by pooling together examples from multiple tasks. Similar to how more data leads to better generalization, using a part of the model for several different tasks constrains that part to learn good values. There are two types of model parameters:

o Task-specific: these parameters benefit only from that particular task.
o Generic, shared across all tasks: these are the ones which benefit from learning through the various tasks.

Multitask learning leads to better generalization when there is actually some relationship between the tasks, which is indeed often the case in deep learning, where some of the factors that explain the variation observed in the data are shared across the different tasks.
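As a sketch (assuming PyTorch; the sizes and the two example tasks are illustrative), the split between shared and task-specific parameters maps directly onto a shared trunk with one output head per task:

```python
import torch
import torch.nn as nn

class MultitaskNet(nn.Module):
    def __init__(self, n_inputs=16, n_shared=32):
        super().__init__()
        # Generic parameters: shared across all tasks.
        self.shared = nn.Sequential(nn.Linear(n_inputs, n_shared), nn.ReLU())
        # Task-specific parameters: each head benefits only from its own task.
        self.head_a = nn.Linear(n_shared, 1)     # e.g. a regression task
        self.head_b = nn.Linear(n_shared, 3)     # e.g. a 3-class classification task

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

net = MultitaskNet()
out_a, out_b = net(torch.rand(4, 16))
print(out_a.shape, out_b.shape)    # torch.Size([4, 1]) torch.Size([4, 3])
```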
