
Assignment – 3

Short Type Questions

1. What is the bi-directional property in recurrent neural networks?


- The bi-directional property in recurrent neural networks (RNNs)
refers to the ability of the network to consider both past and
future context when making predictions or processing sequences.
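
A minimal PyTorch sketch (sizes are illustrative assumptions, not from the assignment): a bidirectional LSTM runs one recurrence forward and one backward over the sequence and concatenates their hidden states, so each output carries both past and future context.

```python
import torch
import torch.nn as nn

# Forward and backward hidden states are concatenated at each time step,
# so the output feature size is 2 * hidden_size.
rnn = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True)
x = torch.randn(5, 1, 8)   # (sequence length, batch, features)
out, _ = rnn(x)
print(out.shape)           # torch.Size([5, 1, 32])
```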
2. Differentiate between the LSTM and GRU architectures.

LSTM:
- Stands for Long Short-Term Memory.
- LSTM units have three gates: input gate, forget gate, and output gate. These gates control the flow of information within the LSTM cell.

GRU:
- Stands for Gated Recurrent Unit.
- GRU units have two gates: update gate and reset gate. The update gate combines the functionalities of the LSTM's input and forget gates, and the reset gate controls the information flow in the cell.
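
The gate-count difference shows up directly in parameter counts. A small PyTorch check (the sizes are illustrative assumptions): an LSTM layer carries four weight blocks (input, forget, and output gates plus the candidate cell), while a GRU carries three (reset, update, candidate), so the GRU is roughly three quarters the size.

```python
import torch.nn as nn

I, H = 10, 20          # illustrative input and hidden sizes
lstm = nn.LSTM(I, H)   # 4 weight blocks
gru  = nn.GRU(I, H)    # 3 weight blocks

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm))  # 2560 = 4 * (H*I + H*H + 2*H)
print(count(gru))   # 1920 = 3 * (H*I + H*H + 2*H)
```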
3. List out the advantages of convolutional neural networks (CNNs).
- Spatial Hierarchical Feature Learning: CNNs can automatically learn
hierarchical representations of input data, capturing spatial
dependencies in images through their convolutional and pooling
layers.
- Translation Invariance: CNNs can recognize patterns regardless of
their position in the input image, thanks to the local receptive
fields and weight sharing in convolutional layers.
- Parameter Sharing: By sharing parameters across the input space,
CNNs have fewer parameters compared to fully connected
networks, making them more efficient and less prone to
overfitting.
- Feature Hierarchy: CNNs learn features at multiple levels of
abstraction, starting from simple edges and textures in lower layers
to complex patterns and objects in higher layers.
- Robustness to Variations: CNNs can handle variations in input data,
such as changes in lighting, rotation, scale, and occlusion, making
them suitable for tasks like object detection and recognition.
- Local Connectivity: CNNs exploit local connectivity to capture
spatial patterns effectively, allowing them to focus on relevant
regions of the input while ignoring irrelevant information.
- Effective Feature Extraction: CNNs excel at automatically extracting
meaningful features from raw data, reducing the need for manual
feature engineering and improving performance on tasks like
image classification and segmentation.
4. Answer the following:
1) Explain the convolution operation?
2) What is a pooling operation with CNNs?

- 1) Convolution Operation: In the context of Convolutional Neural Networks (CNNs), the convolution operation involves applying a filter
(also known as a kernel) to an input image to produce a feature map.
The filter is a small matrix of weights that slides over the input image,
computing the dot product between its values and the values of the
corresponding region in the input. This process is repeated for every
possible position of the filter over the input image, resulting in a
feature map that highlights different patterns or features present in
the input. Mathematically, the convolution operation can be
represented as follows:

(I * K)(x, y) = Σ_{i=1}^{m} Σ_{j=1}^{n} I(x − i, y − j) · K(i, j)

where:
o I is the input image.
o K is the filter/kernel.
o (x, y) represents the spatial coordinates of the output feature map.
o (i, j) represents the spatial coordinates within the filter.
o m and n are the dimensions of the filter/kernel.
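
As a concrete sketch, here is a minimal NumPy implementation of the sliding dot product described above. Note that deep-learning libraries implement this un-flipped form (strictly a cross-correlation) and call it convolution.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no-padding) 2-D convolution as used in CNNs."""
    m, n = kernel.shape
    H, W = image.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # dot product of the kernel with the region it currently covers
            out[x, y] = np.sum(image[x:x + m, y:y + n] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, -1.0],
                   [1.0, -1.0]])   # a simple vertical-edge filter
print(conv2d(image, kernel))       # 3x3 feature map
```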

2) Pooling Operation with CNNs: Pooling is a downsampling operation commonly used in CNNs to reduce the spatial dimensions (width and height) of feature maps while preserving important information. The pooling operation operates independently on each feature map/channel of the input. There are different types of pooling operations, the two most common being max pooling and average pooling:

o Max Pooling: In max pooling, for each region of the input feature map, the maximum value is retained,
and the rest are discarded. This helps to preserve
the most salient features within each region.
o Average Pooling: In average pooling, the average
value of all the elements in each region of the input
feature map is computed. This provides a smoother
downsampling compared to max pooling.
5. Explain the following pooling operations with CNNs:
1) Max-pooling
2) Average pooling
- 1) Max Pooling: In max pooling, for each region of the input
feature map, the maximum value is retained, and the rest are
discarded. This helps to preserve the most salient features within
each region.
- 2) Average Pooling: In average pooling, the average value of all
the elements in each region of the input feature map is
computed. This provides a smoother downsampling compared
to max pooling.
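
A minimal NumPy sketch of both pooling modes on a single feature map (the 4x4 values are made up for illustration):

```python
import numpy as np

def pool2d(fmap, size=2, mode="max"):
    """Non-overlapping pooling over (size x size) regions of one feature map."""
    out = np.zeros((fmap.shape[0] // size, fmap.shape[1] // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = fmap[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 1., 2.],
                 [7., 2., 9., 0.],
                 [4., 8., 3., 5.]])
print(pool2d(fmap, mode="max"))      # [[6. 4.] [8. 9.]]
print(pool2d(fmap, mode="average"))  # [[3.75 2.25] [5.25 4.25]]
```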
6. Explain filters in CNNs. What is the reason for using a large number of filters in CNNs?
- Filters in CNNs: Filters are small matrices of weights typically sized
3x3 or 5x5 (though other sizes are possible) that are convolved with
the input data.
- During the convolution operation, the filter slides across the input
data, computing the dot product between the filter weights and the
corresponding input values at each position. This process extracts
features from the input.
Reason for using a large number of filters:

- CNNs typically consist of multiple layers, with each layer having multiple filters.
- Using a large number of filters allows the network to learn a
diverse set of features at different levels of abstraction. Earlier
layers in the network tend to learn simple features such as edges
and corners, while deeper layers learn more complex features and
patterns.

- By increasing the number of filters in each layer, the network can learn to detect a wide variety of features, making it more robust and capable of capturing intricate patterns in the data.

- The large number of filters also provides the network with redundancy and robustness against variations and noise in the input data, enhancing its generalization ability.
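
A quick PyTorch illustration of the filter-to-feature-map relationship (shapes are illustrative assumptions): each filter produces one feature map, so the number of output channels equals the number of filters.

```python
import torch
import torch.nn as nn

# 32 filters of size 3x3 applied to a 3-channel (RGB) image.
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
x = torch.randn(1, 3, 28, 28)   # (batch, channels, height, width)
print(conv(x).shape)            # torch.Size([1, 32, 26, 26])
```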

Long Type Questions

1. Discuss and explain the different types of RNN architectures.


- Vanilla RNNs:
o Vanilla RNNs are the basic form of RNN architecture. They
process sequential data by recursively applying the same set of
weights to each input and the hidden state from the previous
time step.
o Vanilla RNNs suffer from the vanishing gradient problem,
where gradients diminish exponentially as they propagate
backward in time, leading to difficulties in learning long-range
dependencies.
- Long Short-Term Memory (LSTM) Networks:
o LSTM networks were introduced to address the vanishing
gradient problem in vanilla RNNs and to capture long-term
dependencies more effectively.
o LSTM units incorporate a memory cell, which allows them to
store information over long sequences by controlling the flow
of information through gated mechanisms, including input,
forget, and output gates.
- Gated Recurrent Units (GRUs):
o GRUs are similar to LSTMs in that they address the vanishing
gradient problem and capture long-term dependencies, but
they have a simpler architecture with fewer parameters.
o GRUs combine the forget and input gates of LSTMs into a
single update gate, which controls the flow of information into
the memory cell and the hidden state simultaneously.
- Bidirectional RNNs:
o Bidirectional RNNs process sequential data in both forward
and backward directions, allowing them to capture contextual
information from past and future inputs simultaneously.
o Bidirectional RNNs consist of two separate RNN layers, one
processing the input sequence in the forward direction and the
other processing it in the backward direction.
- Deep RNNs:
o Deep RNNs extend the depth of vanilla RNNs, LSTMs, or GRUs
by stacking multiple recurrent layers on top of each other.
o Deep RNNs can capture hierarchical representations of
sequential data, with each layer learning increasingly abstract
features or representations.
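
All of these variants are available off the shelf. A minimal PyTorch sketch (sizes are illustrative assumptions) showing how each architecture above is instantiated:

```python
import torch.nn as nn

I, H = 16, 32                                    # illustrative sizes
vanilla  = nn.RNN(I, H)                          # plain tanh recurrence
lstm     = nn.LSTM(I, H)                         # gated memory cell
gru      = nn.GRU(I, H)                          # simpler gating, fewer parameters
bi_lstm  = nn.LSTM(I, H, bidirectional=True)     # forward + backward passes
deep_gru = nn.GRU(I, H, num_layers=3)            # stacked recurrent layers
```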
2. Describe the LSTM architecture in detail. Also write the mathematical equations involved with LSTM.
- Long Short-Term Memory (LSTM) networks are a type of recurrent
neural network (RNN) architecture designed to address the
vanishing gradient problem and capture long-range dependencies
in sequential data. LSTMs achieve this by incorporating a memory
cell and various gating mechanisms that control the flow of
information through the cell.
o Memory Cell: At the core of an LSTM unit is a memory cell, which stores information over time and allows the network to maintain long-term dependencies. The memory cell has an internal state c_t (the cell state) that can be read from, written to, and reset by various gates.
o Gates: LSTMs use three types of gates to control the flow of information: the input gate, the forget gate, and the output gate. Each gate takes input from the current input x_t and the previous hidden state h_{t−1}, and has its own set of weights and biases.
o Mathematical Equations: The behavior of an LSTM unit is governed by the following equations:
o Forget gate: f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
o Input gate: i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
o Candidate cell state: c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
o Cell state update: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
o Output gate: o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
o Hidden state: h_t = o_t ⊙ tanh(c_t)
- Here, σ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, ⊙ denotes element-wise multiplication, W_f, W_i, W_c, W_o are weight matrices, b_f, b_i, b_c, b_o are bias vectors, and [h_{t−1}, x_t] denotes the concatenation of the previous hidden state and the current input.
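
These equations translate almost line for line into code. A minimal NumPy sketch of a single LSTM time step (the weight shapes and the dict layout are illustrative choices of this sketch, not from the assignment):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W and b are dicts keyed by gate name 'f', 'i', 'c', 'o'."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(W['f'] @ z + b['f'])         # forget gate
    i = sigmoid(W['i'] @ z + b['i'])         # input gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])   # candidate cell state
    c = f * c_prev + i * c_tilde             # cell state update (element-wise)
    o = sigmoid(W['o'] @ z + b['o'])         # output gate
    h = o * np.tanh(c)                       # hidden state
    return h, c

D, H = 4, 3                                  # illustrative input/hidden sizes
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(H, H + D)) for g in 'fico'}
b = {g: np.zeros(H) for g in 'fico'}
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```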
3. Distinguish between standard CNNs, dilated CNNs, and causal CNNs using appropriate examples.
Standard CNNs:
- Standard CNNs consist of convolutional layers followed by pooling layers, fully connected layers, and activation functions.
- They use regular convolutions with a fixed filter size and stride to capture spatial features from input images.
- Examples: LeNet-5, AlexNet, VGGNet, ResNet, etc.

Dilated CNNs:
- Dilated CNNs introduce dilation rates in the convolutional layers, allowing for an increased receptive field without sacrificing resolution.
- They use dilated convolutions, where the filter is applied over a larger area with gaps between the kernel elements.
- Dilated CNNs are effective for capturing multi-scale features and handling images with large spatial extents.
- Examples: WaveNet for audio generation, DeepLab for semantic segmentation.

Causal CNNs:
- Causal Convolutional Neural Networks are designed for sequential data processing, where the output at each time step depends only on the past and not the future.
- They incorporate causal convolutions, which restrict the filter's receptive field to only look at past inputs.
- Causal CNNs are crucial for tasks like time series prediction, natural language processing, and speech recognition.
- Examples: WaveNet for text-to-speech synthesis, WaveNet autoencoders for audio generation.
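
A minimal PyTorch sketch of the three convolution styles on a short 1-D sequence (the sizes are illustrative; causal behavior is obtained here by left-padding, as in WaveNet-style models):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 10)                              # (batch, channels, time)

standard = nn.Conv1d(1, 1, kernel_size=3)              # sees t-1, t, t+1
dilated  = nn.Conv1d(1, 1, kernel_size=3, dilation=2)  # sees t-2, t, t+2

# Causal: pad (kernel_size - 1) zeros on the left only, so each output
# depends on the current and past inputs but never on future ones.
causal = nn.Conv1d(1, 1, kernel_size=3)
y_causal = causal(F.pad(x, (2, 0)))

print(standard(x).shape)   # torch.Size([1, 1, 8])
print(dilated(x).shape)    # torch.Size([1, 1, 6])
print(y_causal.shape)      # torch.Size([1, 1, 10])
```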
4. Explain the following CNNs parameters in details:
1) Filter-size
2) Stride
3) Padding
4) Dilation rate
- Filter Size:
o The filter size, also known as kernel size, refers to the
dimensions of the sliding window that moves across the input
data during the convolution operation.
o It determines the spatial extent of the local features that the
convolutional layer can detect.
o Common filter sizes include 3x3, 5x5, and 7x7. Smaller filter
sizes capture fine-grained details, while larger filter sizes
capture more global features.
Stride:
o The stride defines the step size at which the filter moves
across the input data during convolution.
o It determines the amount of overlap between adjacent
regions that the filter processes.
o A larger stride results in a smaller output volume spatially,
while a smaller stride produces a larger output volume.
o Stride affects the spatial dimensions of the output feature
map, with larger strides reducing the spatial size.
Padding:
o Padding refers to the technique of adding additional border
pixels around the input data before applying convolution.
o It helps preserve the spatial dimensions of the input volume
and ensures that the output feature map has the same spatial
size as the input.
o Padding can be "valid" (no padding) or "same" (adding
padding evenly to all sides) depending on whether the input
size needs to be preserved.
Dilation Rate:
o The dilation rate controls the spacing between the kernel
elements or receptive field of the convolutional filter.
o It determines how many pixels are skipped between each
element of the filter when convolving with the input.
o A dilation rate of 1 means no dilation (standard convolution),
while a dilation rate greater than 1 creates gaps between the
filter elements.
o Dilated convolutions increase the receptive field without
increasing the number of parameters, enabling the model to
capture larger context.
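
The four parameters combine into a single output-size formula. A small sketch (floor division, matching the convention used by common frameworks; the example sizes are illustrative):

```python
def conv_output_size(n, k, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution over an input of size n."""
    effective_k = dilation * (k - 1) + 1     # dilation widens the kernel
    return (n + 2 * padding - effective_k) // stride + 1

print(conv_output_size(32, 3))              # 30: no padding shrinks the map
print(conv_output_size(32, 3, padding=1))   # 32: "same" padding preserves size
print(conv_output_size(32, 3, stride=2))    # 15: stride downsamples
print(conv_output_size(32, 3, dilation=2))  # 28: larger effective kernel
```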
5. Given 1-dimensional input data [1, 2, 3, 4, 5, 6, 7] and a CNN filter [1, 1, 0].
Give the output of convolution operation for the following
scenarios:
1) Without padding
2) With padding
- Let's compute the convolution output for the given input data and filter:
Given:
Input data: [1, 2, 3, 4, 5, 6, 7]
Filter: [1, 1, 0]
1) Without Padding:
- With no padding, the output size will be reduced due to the
filter's movement across the input data.
- The output size can be computed using the formula:
Output size = (Input size - Filter size + 1)
- Here, the input size is 7 and the filter size is 3.
- So, the output size will be (7 - 3 + 1) = 5.
- We perform the convolution operation by sliding the filter across
the input data and taking dot products at each step:
Output[0] = (1*1) + (2*1) + (3*0) = 3
Output[1] = (2*1) + (3*1) + (4*0) = 5
Output[2] = (3*1) + (4*1) + (5*0) = 7
Output[3] = (4*1) + (5*1) + (6*0) = 9
Output[4] = (5*1) + (6*1) + (7*0) = 11

- So, the convolution output without padding is [3, 5, 7, 9, 11].


2) With Padding:
- With padding, we add zeros to the input data before performing
the convolution operation.
- The amount of padding required depends on the size of the filter.
Since the filter size is 3, we add one zero on each side of the input
data.
- The padded input data becomes [0, 1, 2, 3, 4, 5, 6, 7, 0].
- Now, the output size will remain the same as the input size, i.e.,
7.
- We perform the convolution operation as before:

Output[0] = (0*1) + (1*1) + (2*0) = 1
Output[1] = (1*1) + (2*1) + (3*0) = 3
Output[2] = (2*1) + (3*1) + (4*0) = 5
Output[3] = (3*1) + (4*1) + (5*0) = 7
Output[4] = (4*1) + (5*1) + (6*0) = 9
Output[5] = (5*1) + (6*1) + (7*0) = 11
Output[6] = (6*1) + (7*1) + (0*0) = 13

- So, the convolution output with padding is [1, 3, 5, 7, 9, 11, 13].
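
A quick NumPy check of the hand computation above; np.correlate performs exactly this un-flipped sliding dot product (np.convolve would flip the filter first):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
w = np.array([1, 1, 0])

print(np.correlate(x, w, mode='valid'))             # [ 3  5  7  9 11]
print(np.correlate(np.pad(x, 1), w, mode='valid'))  # [ 1  3  5  7  9 11 13]
```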


6. Given an input sequence consisting of 20 words and 20 CNN filters
of size 3.
Assuming the stride = 1, answer the following:
1) Total number of parameters with CNNs.
2) Size of the convolved output without padding.
3) Size of the convolved output with padding.
4) Sizes of the convolved output if the stride is 2.
- To answer these questions, let's consider the following:

1) Total number of parameters with CNNs:

- Each filter in a CNN has a weight parameter for each element in the filter.

- Since each filter has a size of 3, there are 3 weight parameters per filter, plus 1 bias parameter.

- Therefore, the total number of parameters for 20 CNN filters is: (3 weights + 1 bias) * 20 = 80 parameters.

2) Size of the convolved output without padding:

- Given that the input sequence consists of 20 words and the filter
size is 3 with a stride of 1.

- Without padding, the size of the convolved output is: (input size -
filter size + 1) = (20 - 3 + 1) = 18.

3) Size of the convolved output with padding:

- If we assume padding of 1 on each side of the input sequence, the padded input size becomes 22.

- With the same filter size of 3 and a stride of 1, the size of the convolved output is: (22 - 3 + 1) = 20. "Same" padding therefore preserves the original input length of 20.

4) If the stride is 2:

- With a stride of 2, the convolved output size changes.

- Without padding, the size of the convolved output is: floor((input size - filter size) / stride) + 1 = floor((20 - 3) / 2) + 1 = 9.

- With padding of 1 on each side, the padded input size becomes 22, so the output size is: floor((22 - 3) / 2) + 1 = 10.

In summary:

1) Total number of parameters with CNNs: 80 parameters.

2) Size of the convolved output without padding: 18.

3) Size of the convolved output with padding: 20.

4) Size of the convolved output with a stride of 2: 9 without padding, 10 with padding.
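
These counts can be checked in a few lines of PyTorch, assuming a 1-channel input (each of the 20 words represented by a single number rather than an embedding vector, which is the reading the 80-parameter count implies):

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 20)   # (batch, channels, sequence length)

valid   = nn.Conv1d(1, 20, kernel_size=3, stride=1)             # no padding
same    = nn.Conv1d(1, 20, kernel_size=3, stride=1, padding=1)  # pad 1 each side
stride2 = nn.Conv1d(1, 20, kernel_size=3, stride=2, padding=1)

print(sum(p.numel() for p in valid.parameters()))  # 80 = (3 + 1) * 20
print(valid(x).shape)    # torch.Size([1, 20, 18])
print(same(x).shape)     # torch.Size([1, 20, 20])
print(stride2(x).shape)  # torch.Size([1, 20, 10])
```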

Name - Aniketa Das


Reg. no. - 2201030030
Branch - IOT & CS
