Assignment_14_Modern_AI
Assignment_14_Modern_AI
Ameen Aazam
EE23BTECH11006
Unrolled across time steps, an RNN is trained like a feed forward structure. The hidden layer updates
at each step based on the current input and the previous hidden state:
zt = gz (Wz,z zt−1 + Wx,z xt ) (2)
ŷt = gy (Wz,y zt ) (3)
Training uses backpropagation through time (BPTT), where gradients are computed across all time
steps. For example, the gradient for Wz,z is:
T
∂L X ∂zt
= −2(yt − ŷt )gy′ (iny,t )Wz,y (4)
∂Wz,z t=1
∂W z,z
One of the biggest challenges when training Recurrent Neural Networks (RNNs), is the vanishing
and exploding gradient problem. Learning long term dependencies can be difficult due to gradients
that shrink or grow exponentially. We find that basic RNNs are less suitable for tasks which require
information retention for long sequences, due to this limitation. To overcome these issues, advanced
architectures, such as LSTMs, help RNNs adapt to tasks where such information retention over long
sequences is necessary.
RNN limitations are overcome by Lstm networks with introduction of memory cells and gates. For
information flow, LSTM has its gates : forget (ft ), input (it ), output (ot ). Key equations governing
LSTMs are:
ft = σ(Wx,f xt + Wz,f zt−1 ) (5)
it = σ(Wx,i xt + Wz,i zt−1 ) (6)
ot = σ(Wx,o xt + Wz,o zt−1 ) (7)
ct = ft ⊙ ct−1 + it ⊙ tanh(Wx,c xt + Wz,c zt−1 ) (8)
zt = tanh(ct ) ⊙ ot (9)
The mechanisms which prevent gradient decay make LSTMs able to tolerate long term dependencies,
which is useful for speech recognition as well handwriting analysis etc.
Text and images can be modeled with unsupervised learning using unlabeled data creating generative
models. Object recognition features and data samples probability distribution estimation are the
object of algorithms. Latent variables z, related to x, such as stroke thickness or color are represented
by joint models like PW (x, z).
• Probabilistic PCA: A simple generative model : A simple kind of model is probabilistic
principal components analysis (PPCA) where latent variable, z ∼ N (0, I) and observed data
x generated by x ∼ N (W z, σ 2 I). The overall likelyhood can be given by the following:
Z
PW (x) = PW (x, z) dz = N (x; 0, W W T + σ 2 I) (10)
Gradient methods and the EM algorithm both allow us to achieve W . PW (x) produces new
samples that can be used to generate new samples and low probability observations can be
flagged as an anomaly.
• Autoencoders : An autoencoder consists of:
Encoder f : x → ẑ (11)
Decoder g : ẑ → x (12)
2
P
This ties to minimise the reconstruction error j ∥xj − g(f (xj ))∥ . Linear autoencoder
uses a shared weight matrix W , learning the top m principal components.
In complex models the posterior distribution is approximated by variational methods. The
variational lower bound L is defined as:
L(x, Q) = log P (x) − DKL (Q(z)∥P (z | x)) (13)
Maximizing L approximates log P (x) using variational learning. The decoder g(z) models
log P (x|z), and the encoder f (x) defines parameters of Q.
• Deep Autoregressive Models : It is an autoregressive model in the sense that the prediction
of elements of the data vector x is based on other elements of that data vector, without
latent variables. Classical AR models are linear–Gaussian, for example. An alternative
to replacing linear models with deep networks is a deep autoregressive model, used in
DeepMind’s WaveNet application for example which generates speech from raw acoustic
signals via a nonlinear AR model.
• Generative Adversarial Networks (GANs) : A Generative Adversarial Network (GAN)
consists of two networks:
– Generator : Generates map from random values from a latent space z to generate
samples x resemling the distribution PW (x). It normally takes a unit Gaussian input.
– Discriminator : A classifier that tells between real data (from training set) and fake data
(generated by generator).
GANs are implicit generators that produce samples without explicit probability. Both train
at the same time, the generator tries to fool the discriminator and the discriminator classifies
the input correctly. If the generator perfectly replicates the training distribution, as they do
in the equilibrium state, it effectively removes any information about the source distribution.
• Unsupervised Translation : Unsupervised translation is a translation task that takes struc-
tured inputs x and produces structured outputs y without any paired input (x, y) data.
Supervised translation, traditionally, involves paired examples (e.g., human translated sen-
tences). Nevertheless, obtaining such pairs is seldom possible for tasks like converting a
night scene photo to daytime.
GANs are used to solve unsupervised translation problems, whereby we train one generator
to learn the distribution of reversed y given x, and another to learn x given y. With this
framework, it is able to generate plausible outputs without the need of specific pairs enabling
translation in a wide range of domains.
2
2.2 Transfer Learning and Multitask Learning
One very interesting but practical problem we call Transfer Learning, which suggests using knowledge
from one task to speed up performance in another. Copying weights from a pretrained model to a new
model, and fine tuning within this new model with data belonging to the new task. Faster learning
is made possible by the use of high quality models such as ResNet-50 or ROBERTA. Application
of transfer learning is highly critical for simulations and real world scenarios. Multitask Learning
reduces the training time of the model and allows models to understand and perform better by running
multiple objectives at the same time.
3 Applications
3.1 Computer Vision
AlexNet’s 2012 ImageNet competition breakthrough, which served as the core milestone initiating
deep learning revolutionizing computer vision, employed convolutional neural networks (CNNs),
since the 2D Cartesian arrays, or matrices, prescribed in CNNs naturally lend themselves to the
operations performed on 2D images. Today top-5 error rate for CNN in agriculture quality assessment
or self driving cars is below 2%.
NLP tasks such as machine translation and speech recognition have been improved greatly by deep
learning by allowing systems to be trained as one function and are more efficient and accurate than
there ever were before, and uses techniques like word embeddings.
Deep reinforcement learning is a combination of both deep learning and reinforcement learning
that finds an agent optimal behaviour through a reward signal. Despite its success in Atari and Go,
obtaining these properties consistently is difficult and it remains a thorny research area for which
there are limited commercial applications.