UNIT 2
A linear model computes f(x; w, b) = xᵀw + b.
What is a cost function?
The cost function measures the difference, or error, between the values a model predicts and the actual values at its current parameters, expressed as a single real number.
For example, for a simple linear model Y = mX + c, the cost function summarizes how far the predicted Y values fall from the observed ones across the training set.
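As a concrete illustration, here is a minimal sketch (assuming a mean-squared-error cost, which these notes do not name explicitly) of computing the cost of the linear model Y = mX + c with NumPy:

import numpy as np

# Toy data: X are the inputs, Y are the actual (observed) values.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.1, 4.9, 7.2, 8.8])

def mse_cost(m, c, X, Y):
    """Mean squared error between predictions m*X + c and actual values Y."""
    predictions = m * X + c
    return np.mean((predictions - Y) ** 2)

print(mse_cost(2.0, 1.0, X, Y))  # a single real number summarizing the error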
Exploding Gradient:
The exploding gradient problem arises when gradients grow very large as they are propagated backward through many layers or time steps, producing unstable weight updates; common remedies include gradient clipping and careful weight initialization.
Convolutional neural networks (CNN)
A CNN is a multilayer neural network that was biologically inspired by the animal
visual cortex. The architecture is particularly useful in image-processing applications.
The first CNN was created by Yann LeCun; at the time, the architecture focused on
handwritten character recognition, such as postal code interpretation. As a deep
network, early layers recognize features (such as edges), and later layers recombine
these features into higher-level attributes of the input.
The LeNet CNN architecture is made up of several layers that implement feature
extraction and then classification. The image is divided into
receptive fields that feed into a convolutional layer, which then extracts features from
the input image. The next step is pooling, which reduces the dimensionality of the
extracted features (through down-sampling) while retaining the most important
information (typically, through max pooling). Another convolution and pooling step is
then performed that feeds into a fully connected multilayer perceptron. The final
output layer of this network is a set of nodes that identify features of the image (in
this case, a node per identified number). You train the network by using back-
propagation.
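The following sketch, written with PyTorch (an assumption; the notes name no framework), shows a LeNet-style stack of convolution, pooling, and fully connected layers for 32×32 grayscale digit images:

import torch
import torch.nn as nn

class LeNet(nn.Module):
    """LeNet-style CNN: two conv+pool stages, then an MLP classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # extract low-level features (edges)
            nn.Tanh(),
            nn.MaxPool2d(2),                  # down-sample, keep strongest responses
            nn.Conv2d(6, 16, kernel_size=5),  # recombine into higher-level features
            nn.Tanh(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),       # one output node per digit
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNet()(torch.randn(1, 1, 32, 32))  # one 32x32 grayscale image
print(logits.shape)  # torch.Size([1, 10])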
The use of deep layers of processing, convolutions, pooling, and a fully connected
classification layer opened the door to various new applications of deep learning
neural networks. In addition to image processing, the CNN has been successfully
applied to video recognition and various tasks within natural language processing.
Recurrent neural networks (RNN)
The RNN is one of the foundational network architectures from which other deep
learning architectures are built. The primary difference between a typical multilayer
network and a recurrent network is that rather than completely feed-forward
connections, a recurrent network might have connections that feed back into prior
layers (or into the same layer). This feedback allows RNNs to maintain memory of
past inputs and model problems in time.
RNNs consist of a rich set of architectures (we'll look at one popular topology called
LSTM next). The key differentiator is feedback within the network, which could
manifest itself from a hidden layer, the output layer, or some combination thereof.
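A minimal sketch of this feedback (plain NumPy, with tanh as an assumed activation) shows the hidden state h carrying memory of past inputs from one time step to the next:

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (feedback)
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)  # memory of past inputs
for x_t in rng.normal(size=(5, input_size)):  # a sequence of 5 inputs
    # The previous hidden state feeds back into the same layer.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    print(h)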
LSTM networks
The LSTM was created in 1997 by Hochreiter and Schmidhuber, but it has grown in
popularity in recent years as an RNN architecture for various applications. You'll find
LSTMs in products that you use every day, such as smartphones. IBM applied
LSTMs in IBM Watson® for milestone-setting conversational speech recognition.
The LSTM departed from typical neuron-based neural network architectures and
instead introduced the concept of a memory cell. The memory cell can retain its
value for a short or long time as a function of its inputs, which allows the cell to
remember what's important and not just its last computed value.
The LSTM memory cell contains three gates that control how information flows into
or out of the cell. The input gate controls when new information can flow into the
memory. The forget gate controls when an existing piece of information is forgotten,
allowing the cell to remember new data. Finally, the output gate controls when the
information that is contained in the cell is used in the output from the cell. The cell
also contains weights, which control each gate. The training algorithm, commonly
BPTT, optimizes these weights based on the resulting network output error.
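As a sketch of these three gates (plain NumPy; the gate equations follow the standard LSTM formulation rather than anything specific to these notes), one step of a memory cell looks like:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x] to the stacked gate pre-activations."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0 * n:1 * n])  # input gate: when new information flows in
    f = sigmoid(z[1 * n:2 * n])  # forget gate: when old information is dropped
    o = sigmoid(z[2 * n:3 * n])  # output gate: when the cell contributes to output
    g = np.tanh(z[3 * n:4 * n])  # candidate value for the memory cell
    c = f * c_prev + i * g       # cell retains its value for a short or long time
    h = o * np.tanh(c)
    return h, c

n_in, n_hid = 3, 4
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4 * n_hid, n_hid + n_in))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)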
Unsupervised deep learning
Self-organizing maps
The self-organizing map (SOM) was invented by Dr. Teuvo Kohonen in 1982 and is
popularly known as the Kohonen map. A SOM is an unsupervised neural network that
creates clusters of the input data set by reducing the dimensionality of the input.
SOMs vary from the traditional artificial neural network in quite a few ways.
The first significant variation is that weights serve as a characteristic of the node
itself rather than of a connection. After the inputs are normalized, each output node is
initialized with random weights close to zero, one per feature of the input record, so
each node's weight vector is a candidate representation of an input. A random input
record is then chosen, and the Euclidean distance between it and each output node's
weight vector is calculated. The node with the least distance is declared the most
accurate representation of the input and is marked as the best matching unit, or BMU.
The weights of the BMU and of the nodes within a neighborhood radius around it are
then updated toward the input, with nodes closer to the BMU updated more strongly;
this radius shrinks as training proceeds, so the clusters gradually stabilize.
Next, in a SOM, no activation function is applied, and because there are no target
labels to compare against, there is no concept of calculating error or of
backpropagation.
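A minimal sketch of one SOM training step (NumPy; the learning rate and radius schedule are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(2)
n_nodes, n_features = 10, 4

# Nodes live on a 1-D grid; each node's weight vector characterizes it.
weights = rng.uniform(-0.05, 0.05, size=(n_nodes, n_features))
positions = np.arange(n_nodes)

def som_step(x, weights, lr, radius):
    # 1. BMU: the node whose weights are closest (Euclidean) to the input.
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
    # 2. Neighborhood: nodes near the BMU on the grid get stronger updates.
    influence = np.exp(-((positions - positions[bmu]) ** 2) / (2 * radius ** 2))
    # 3. Move weights toward the input, scaled by proximity to the BMU.
    weights += lr * influence[:, None] * (x - weights)
    return weights

lr, radius = 0.5, 3.0
for x in rng.normal(size=(100, n_features)):
    weights = som_step(x, weights, lr, radius)
    lr *= 0.99      # the learning rate decays
    radius *= 0.99  # the neighborhood radius is shrunk over time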
Autoencoders
Though the history of when autoencoders were invented is hazy, the first known use
is attributed to LeCun in 1987. This variant of an ANN is composed of three layers:
an input layer, a hidden layer, and an output layer.
First, the input layer is encoded into the hidden layer using an appropriate encoding
function. The number of nodes in the hidden layer is much less than the number of
nodes in the input layer. This hidden layer contains the compressed representation
of the original input. The output layer aims to reconstruct the input layer by using a
decoder function.
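A sketch of such a three-layer autoencoder (PyTorch, with illustrative layer sizes) makes the bottleneck explicit:

import torch
import torch.nn as nn

input_dim, hidden_dim = 784, 32  # hidden layer is far smaller than the input

autoencoder = nn.Sequential(
    nn.Linear(input_dim, hidden_dim),  # encoder: compress the input
    nn.Sigmoid(),
    nn.Linear(hidden_dim, input_dim),  # decoder: reconstruct the input
    nn.Sigmoid(),
)

x = torch.rand(16, input_dim)                     # a batch of flattened images
reconstruction = autoencoder(x)
loss = nn.functional.mse_loss(reconstruction, x)  # compare output to input
loss.backward()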
Back-propagation
In learning algorithms, the gradient we most often need is that of the cost function
with respect to the parameters. More generally, the back-propagation algorithm
computes a gradient ∇ₓf(x, y), where x is a set of variables whose derivatives are
desired, and y is an additional set of variables that are inputs to the function but
whose derivatives are not required. Many learning tasks involve computing other
derivatives, either as part of the learning process or to analyze the learned model.
The back-propagation algorithm can be applied to these tasks as well and is not
restricted to computing the gradient of the cost function with respect to the
parameters. The idea of computing derivatives by propagating information through a
network is very general and can be used to compute values such as the Jacobian of a
function f with multiple outputs. We restrict our description here to the most
commonly used case, where f has a single output.
Computational Graphs
Node
Here, we use each node in the graph to indicate a variable. The variable may be a
scalar, vector, matrix, tensor, or even a variable of another type.
Let x be a real number, and let f and g both be functions mapping from a real
number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then the
chain rule states that

dz/dx = (dz/dy)(dy/dx)
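A quick numeric check of the chain rule (Python; f, g, and the test point are arbitrary choices for illustration):

# Chain rule check for z = f(g(x)) with f(y) = y**2 and g(x) = 3*x + 1.
def g(x):  return 3 * x + 1
def f(y):  return y ** 2

def dg_dx(x): return 3.0
def df_dy(y): return 2 * y

x = 2.0
y = g(x)

analytic = df_dy(y) * dg_dx(x)  # dz/dx = (dz/dy)(dy/dx) = 2*7*3 = 42
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)  # central difference
print(analytic, numeric)        # both are approximately 42.0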
Recursively Applying the Chain Rule to Obtain
Backprop
Using the chain rule, it is straightforward to write down an algebraic expression for
the gradient of a scalar with respect to any node in the computational graph that
produced that scalar. However, actually evaluating that expression in a computer
introduces some extra considerations.
Specifically, many subexpressions may be repeated several times within the overall
expression for the gradient. Any procedure that computes the gradient will need to choose
whether to store these subexpressions or to recompute them several times. An example of
how these repeated subexpressions arise is given in figure 6.9. In some cases, computing
the same subexpression twice would simply be wasteful. For complicated graphs, there can
be exponentially many of these wasted computations, making a naive implementation of
the chain rule infeasible. In other cases, computing the same subexpression twice could be
a valid way to reduce memory consumption at the cost of higher runtime.
We begin with a version of the back-propagation algorithm that specifies the
actual gradient computation directly (algorithm 6.2, along with algorithm 6.1 for the
associated forward computation), in the order in which it will actually be done and
according to the recursive application of the chain rule. One could either directly
perform these computations or
view the description of the algorithm as a symbolic specification of the computational
graph for computing the back-propagation. However, this formulation does not make explicit
the manipulation and the construction of the symbolic graph that performs the gradient
computation. Such a formulation is presented below in section 6.5.6, with algorithm 6.5,
where we also generalize to nodes that contain arbitrary tensors.
Algorithm 6.3 first shows the forward propagation, which maps parameters to the
supervised loss L(ŷ, y) associated with a single (input, target) training
example (x, y), with ŷ the output of the neural network when x is provided as
input.
Algorithm 6.4 then shows the corresponding computation to be done for applying the
back-propagation algorithm to this graph.
Algorithms 6.3 and 6.4 are demonstrations that are chosen to be simple and
straightforward to understand. However, they are specialized to one specific
problem.
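In the spirit of algorithms 6.3 and 6.4 (a sketch only: a one-hidden-layer network with sigmoid units and squared-error loss, all illustrative choices), forward and backward propagation for a single training example can be written as:

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
x = rng.normal(size=3)   # input
y = np.array([1.0])      # target
W1, b1 = rng.normal(scale=0.1, size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(scale=0.1, size=(1, 4)), np.zeros(1)

# Forward propagation: map parameters to the loss L(y_hat, y).
a1 = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ a1 + b2)
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Back-propagation: apply the chain rule from the loss back to each parameter.
delta2 = (y_hat - y) * y_hat * (1 - y_hat)  # dL/d(output pre-activation)
grad_W2 = np.outer(delta2, a1)
grad_b2 = delta2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # propagate error to the hidden layer
grad_W1 = np.outer(delta1, x)
grad_b1 = delta1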
Parameter norm penalties
Many regularization approaches limit the capacity of models by adding a parameter
norm penalty Ω(θ) to the objective function, giving the regularized objective
J̃(x; θ) = J(x; θ) + αΩ(θ), where α ∈ [0, ∞) is a hyperparameter that weights the
relative contribution of the norm penalty term, Ω, relative to the standard
objective function J(x; θ). Setting α to 0 results in no regularization. Larger
values of α correspond to more regularization. When our
training algorithm minimizes the regularized objective function J̃, it will
decrease both the original objective J on the training data and some measure of
the size of the parameters θ (or some subset of the parameters). Different
choices for the parameter norm Ω can result in different solutions being
preferred. In this section, we discuss the effects of the various norms when used
as penalties on the model parameters. Before delving into the regularization
behaviour of different norms, we note that for neural networks, we typically
choose to use a parameter norm penalty Ω that penalizes only the weights of the
affine transformation at each layer and leaves the biases unregularized. The
biases typically require less data to fit accurately than the weights. Each weight
specifies how two variables interact. Fitting the weight well requires observing
both variables in a variety of conditions. Each bias controls only a single
variable. This means that we do not induce too much variance by leaving the
biases unregularized. Also, regularizing the bias parameters can introduce a
significant amount of underfitting. We therefore use the vector w to indicate all
of the weights that should be affected by a norm penalty, while the vector θ
denotes all of the parameters, including both w and the unregularized
parameters. In the context of neural networks, it is sometimes desirable to use a
separate penalty with a different α coefficient for each layer of the network.
Because it can be expensive to search for the correct value of multiple hyper
parameters, it is still reasonable to use the same weight decay at all layers just to
reduce the search space.
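A sketch of this convention (NumPy, assuming an L² penalty and a single α for all layers) adds αw to the weight gradients while leaving the bias gradients untouched:

import numpy as np

alpha = 1e-3  # weight decay coefficient (illustrative value)

def regularized_gradients(grads, params):
    """Add the L2 penalty gradient alpha * w to weights only, never to biases."""
    out = {}
    for name, g in grads.items():
        if name.startswith("W"):   # weights of the affine transformations
            out[name] = g + alpha * params[name]
        else:                      # biases stay unregularized
            out[name] = g
    return out

params = {"W1": np.ones((4, 3)), "b1": np.zeros(4)}
grads = {"W1": np.zeros((4, 3)), "b1": np.zeros(4)}
print(regularized_gradients(grads, params)["W1"][0])  # [0.001 0.001 0.001]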
Other regularization strategies covered in this unit:
Norm penalties as constrained optimization
Data augmentation
Noise robustness
Semi-supervised learning
Multitask learning