
Advantages of Transfer Learning

1. Saves Time:

Faster training as the base model is already trained.

2. Requires Less Data:

Works well with smaller datasets.

3. Improved Accuracy:

Leverages pre-trained features for better generalization.

Challenges of Transfer Learning


1. Domain Mismatch:

The pre-trained model may not perform well if the new task domain is too
different.

2. Overfitting:

Risk of overfitting if fine-tuning is not done carefully.

Unit 4
Introduction to Natural Language Processing (NLP)

What is NLP?
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that
focuses on enabling computers to understand, interpret, and respond to human
language in a meaningful way. It combines computational linguistics, machine
learning, and deep learning techniques to process and analyze text and speech
data.

Key Components of NLP


1. Syntax:

Focuses on the arrangement of words in sentences to ensure grammatical correctness.

Example: Parsing a sentence to identify parts of speech (e.g., nouns, verbs).

2. Semantics:

Involves understanding the meaning of words, phrases, and sentences.

Example: Resolving word ambiguities and interpreting word meanings.

3. Pragmatics:

Analyzes the context of language to derive meaning.

Example: Understanding sarcasm or idiomatic expressions.

4. Morphology:

Studies the structure of words and their components (e.g., prefixes, suffixes).

Example: Understanding how "walked" is derived from "walk."

5. Phonetics and Phonology:

Focuses on speech sounds and their patterns for spoken language processing.

Key Tasks in NLP


1. Text Preprocessing:

Cleaning and preparing raw text data for analysis.

Includes tokenization, stemming, lemmatization, and stopword removal (see the NLTK sketch after this list).

2. Tokenization:

Splitting text into smaller units (e.g., words, sentences).

Example: "I love AI." → ["I", "love", "AI"]

3. Part-of-Speech (POS) Tagging:

Assigning grammatical categories to words.

Example: "Cats run fast." → [("Cats", "Noun"), ("run", "Verb"), ("fast",


"Adverb")]

Reinforcement Learning and Deep Learning 199


4. Named Entity Recognition (NER):

Identifying entities like names, locations, dates in text.

Example: "Elon Musk founded SpaceX in 2002." → [("Elon Musk",


"Person"), ("SpaceX", "Organization"), ("2002", "Date")]

5. Sentiment Analysis:

Determining the sentiment (positive, negative, neutral) of a text.

Example: "The movie was fantastic!" → Positive.

6. Machine Translation:

Translating text from one language to another.

Example: "Hello, world!" → "Bonjour, le monde!"

7. Text Summarization:

Generating concise summaries of large text documents.

Example: Condensing an article into key points.

8. Speech Recognition:

Converting spoken language into text.

Example: Transcribing a podcast.

9. Language Generation:

Creating human-like text responses.

Example: Chatbots generating replies.
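As a minimal sketch of several of these tasks (assuming the NLTK package and its standard data files are installed; exact resource names can vary slightly across NLTK versions):

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

text = "The cats are running quickly through the gardens."
tokens = word_tokenize(text)                                    # tokenization
pos_tags = nltk.pos_tag(tokens)                                 # POS tagging
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]  # stopword removal
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t.lower()) for t in filtered]    # lemmatization

print(pos_tags)
print(lemmas)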

Applications of NLP
1. Chatbots and Virtual Assistants:

NLP powers systems like Siri, Alexa, and Google Assistant.

2. Search Engines:

Google and Bing use NLP for understanding queries and ranking results.

3. Sentiment Analysis:



Businesses use NLP to analyze customer reviews and social media
sentiment.

4. Translation Tools:

Tools like Google Translate rely on NLP for accurate language translation.

5. Healthcare:

Extracting insights from medical records and assisting in diagnosis.

6. Document Summarization:

Summarizing lengthy legal or research documents.

Approaches to NLP
1. Rule-Based Methods:

Relies on manually crafted rules and dictionaries for language processing.

Example: Grammar-checking systems.

2. Machine Learning:

Uses statistical models and labeled datasets for training.

Example: Naive Bayes, Support Vector Machines (SVMs).

3. Deep Learning:

Employs neural networks for feature extraction and language modeling.

Example: Recurrent Neural Networks (RNNs), Transformers (e.g., BERT, GPT).

Challenges in NLP
1. Ambiguity:

Words and phrases often have multiple meanings depending on context.

Example: "I saw her duck" (ambiguous meaning).

2. Context Understanding:

Difficulty in understanding long-range dependencies in text.



Example: Resolving pronouns in complex sentences.

3. Sarcasm and Irony:

Challenging to detect non-literal language.

Example: "Yeah, great job!" (might be sarcastic).

4. Domain-Specific Language:

Text in specialized fields (e.g., medicine, law) requires domain expertise.

5. Low-Resource Languages:

Limited data for less widely spoken languages.

Popular NLP Libraries and Frameworks


1. NLTK (Natural Language Toolkit):

Comprehensive library for text processing in Python.

2. spaCy:

Optimized for production-ready NLP tasks like NER, POS tagging.

3. Hugging Face Transformers:

State-of-the-art library for transformer models like BERT, GPT.

4. Gensim:

Specialized in topic modeling and word embedding techniques.

5. TextBlob:

Simplified library for text analysis.
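For example, spaCy performs POS tagging and named entity recognition in a few lines (a sketch assuming the small English model en_core_web_sm has been installed with "python -m spacy download en_core_web_sm"):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk founded SpaceX in 2002.")

# POS tags for every token
print([(token.text, token.pos_) for token in doc])

# Named entities with their labels
print([(ent.text, ent.label_) for ent in doc.ents])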

Future of NLP
1. Improved Contextual Understanding:

More advanced models like GPT-4 and BERT improve context handling.

2. Multilingual NLP:

Better support for low-resource languages.



3. Real-Time Applications:

Faster NLP models enabling real-time translation and summarization.

4. Ethical NLP:

Addressing biases and ensuring fairness in language models.

Vector Space Model (VSM) of Semantics

What is the Vector Space Model?


The Vector Space Model (VSM) is a mathematical model used to represent
words, phrases, or documents as vectors in a multi-dimensional space. It is widely
used in natural language processing (NLP) to quantify and compare semantic
meaning.

Key Concepts of VSM


1. Vector Representation:

Words or documents are represented as vectors in a high-dimensional space.

Dimensions are typically derived from features like terms, context, or co-occurrence frequencies.

2. Semantic Similarity:

The semantic similarity between words or documents is computed based on the closeness of their vectors in the space.

3. Applications:

Information retrieval.

Document clustering.

Word meaning analysis.

Steps in the Vector Space Model



1. Text Representation
Terms as Dimensions:

Each unique term in the vocabulary becomes a dimension in the vector space.

Vector Creation:

A document or word is represented as a vector with values corresponding to term occurrences or importance.


Mathematical Representation
1. Word Representation:



Consider three words: "cat", "dog", "fish".

Feature dimensions: [mammal, aquatic, pet].

Vector representation:

"cat" → [1, 0, 1]

"dog" → [1, 0, 1]

"fish" → [0, 1, 0]

2. Document Representation:

Vocabulary: ["AI", "machine", "learning"].

Two documents:

D1 : "AI and machine learning"

D2 : "machine learning applications"

Term frequency:

D1: [1, 1, 1]

D2: [0, 1, 1]
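Similarity between such vectors is usually measured with the cosine of the angle between them. A small NumPy sketch (the helper function is illustrative, not from a particular library):

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means identical direction, 0.0 means orthogonal (no shared terms)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([1, 1, 1])  # "AI and machine learning"
d2 = np.array([0, 1, 1])  # "machine learning applications"
print(cosine_similarity(d1, d2))  # approximately 0.816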

Advantages of VSM
1. Simple and Effective:

Provides a straightforward way to represent and compare text.

2. Language Agnostic:

Works on any text dataset after preprocessing.

3. Supports Various Applications:

Widely used in search engines, recommendation systems, and text classification.

Limitations of VSM
1. High Dimensionality:



Representing large vocabularies results in sparse and high-dimensional
vectors.

2. No Contextual Understanding:

Fails to capture word meanings based on context (e.g., "bank" as a riverbank vs. a financial bank).

3. Assumes Independence:

Assumes terms are independent, ignoring word order and syntax.

Modern Extensions to VSM


1. Word Embeddings:

Dense vector representations that capture semantic meaning based on context.

Examples: Word2Vec, GloVe, FastText.

2. Contextual Models:

Context-sensitive embeddings using deep learning.

Examples: BERT, GPT, Transformer-based models.
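As an illustration of how such contextual models are used in practice, the Hugging Face pipeline API wraps a pre-trained model behind a single call (a sketch that downloads a default English sentiment model on first use):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The movie was fantastic!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]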

Applications of Vector Space Model


1. Information Retrieval:

Search engines rank documents based on similarity to a query.

2. Document Clustering:

Grouping similar documents into clusters.

3. Semantic Analysis:

Measuring similarity between words or phrases.

4. Recommender Systems:

Suggesting similar items based on textual descriptions.



Conclusion
The Vector Space Model is a foundational concept in NLP and information
retrieval. Although limited in its ability to capture contextual semantics, it forms the
basis for many modern advancements like word embeddings and transformer-
based models.

Word Vector Representations


Word vector representations are mathematical representations of words as
vectors of real numbers. They capture semantic and syntactic properties of words
based on their usage in a given corpus. Below, we'll delve into key methods for
creating word vector representations and their evaluations and applications.

1. Continuous Skip-Gram Model

Objective
The Skip-Gram model, introduced as part of Word2Vec, aims to predict the
context (surrounding words) given a target word.

Architecture
Input: A single target word (e.g., "dog").

Output: Probabilities of context words within a defined window size around the target word.

Core Idea: Words that appear in similar contexts will have similar vector
representations.

Training Steps
1. Input Representation:

Represent the input word as a one-hot vector.

The vocabulary size determines the length of the vector.

2. Projection Layer:

Map the input one-hot vector into a dense vector representation using a weight matrix W.

3. Output Layer:

Use another weight matrix to compute probabilities of all words in the vocabulary being context words.

Apply a softmax function to normalize these probabilities.

4. Optimization:

Minimize the loss function using negative log likelihood or sampled variants like negative sampling or hierarchical softmax to handle large vocabularies efficiently.

Advantages
Captures semantic similarity well.

Performs better with large datasets.

2. Continuous Bag-of-Words Model (CBOW)

Objective
The CBOW model predicts a target word based on its context words.

Architecture
Input: Context words (a set of surrounding words).

Output: A single target word.

Core Idea: Words in similar contexts are likely to have similar meanings.

Training Steps
1. Input Representation:

Represent context words using one-hot vectors.

2. Projection Layer:



Compute the average (or sum) of the vectors of the context words.

Map this average into a dense representation using a weight matrix.

3. Output Layer:

Similar to Skip-Gram, a second weight matrix predicts the target word using softmax probabilities.

4. Optimization:

Minimize the loss function (negative log likelihood).

Advantages
Faster to train than Skip-Gram.

Suitable for smaller datasets.
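Both objectives are implemented in the gensim library, where the sg flag switches between Skip-Gram and CBOW. A minimal sketch on a toy corpus (assuming gensim 4.x):

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects Skip-Gram, sg=0 selects CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["cat"][:5])           # dense vector for "cat"
print(skipgram.wv.most_similar("cat"))  # nearest neighbours in the embedding space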

3. GloVe (Global Vectors for Word Representation)

Objective
GloVe is a count-based method that constructs word vectors using the co-
occurrence statistics of words in a corpus.

Core Idea
Words that co-occur frequently in a corpus will have similar representations. For
example:

P("ice"/"cold") is high because "ice" and "cold" co-occur frequently.

Key Features
Matrix Construction:

Create a co-occurrence matrix X, where each element Xij represents the frequency of word j in the context of word i.

Matrix Factorization:

Solve for dense word vectors by factorizing the co-occurrence matrix.

Objective Function:



GloVe minimizes the weighted least-squares difference between word-vector dot products and the logarithms of the corresponding co-occurrence counts, as shown in the formula below.
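In symbols, the objective from the GloVe paper is

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where V is the vocabulary size, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms, and f is a weighting function that limits the influence of very frequent co-occurrences.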

Advantages
Combines local (context-based) and global (corpus-wide) information.

Efficient for large corpora.

4. Evaluations of Word Embeddings


Evaluating word embeddings ensures that they effectively capture meaningful
relationships between words.

a. Word Similarity
Compare the similarity of embedding vectors against human judgments of word similarity.

b. Analogy Tasks
Test vector arithmetic on relations such as "king" - "man" + "woman" ≈ "queen".

c. Downstream Tasks
Evaluate embeddings based on their performance in tasks like:

Text classification.

Sentiment analysis.



Machine translation.

5. Applications

a. Word Similarity and Relatedness


Search engines: Improve query relevance.

Thesaurus generation: Identify synonyms and related terms.

b. Analogy Reasoning
Knowledge extraction: Identify relationships in large datasets.

Question answering systems.
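A short sketch of analogy reasoning with pre-trained GloVe vectors loaded through gensim's downloader (the model name refers to the gensim-data package and is fetched on first use, roughly 100 MB):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman is closest to queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))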

c. Sentiment Analysis
Represent words in sentiment analysis models to classify text polarity.

d. Machine Translation
Word embeddings help align representations of similar words across
languages.

e. Document Clustering and Classification


Represent documents as combinations of word vectors (e.g., using TF-IDF
weighted averaging).

Use embeddings for clustering and topic modeling.

f. Chatbots and Conversational AI


Generate meaningful responses by leveraging semantic similarities.

Comparison of Methods
Feature           | Skip-Gram               | CBOW                     | GloVe
Context           | Target -> Context       | Context -> Target        | Global co-occurrence
Data Requirement  | Large                   | Moderate                 | Large
Training Speed    | Slower                  | Faster                   | Efficient
Output Quality    | High for rare words     | High for frequent words  | Combines global and local

Deep Learning for Computer Vision


Deep learning has revolutionized computer vision by enabling complex tasks like
image segmentation, object detection, and automatic image captioning. These
tasks leverage neural networks such as convolutional neural networks (CNNs) and
advanced architectures like transformers.

1. Image Segmentation

What is Image Segmentation?


Image segmentation involves dividing an image into multiple regions or objects,
assigning a label to every pixel based on its category.

Types:

1. Semantic Segmentation:

Classifies each pixel into a category (e.g., sky, car, road).

2. Instance Segmentation:

Identifies and separates individual objects (e.g., detecting each car


separately).

Deep Learning Architectures for Image Segmentation


1. Fully Convolutional Networks (FCN):

Replaces fully connected layers with convolutional layers for pixel-wise


predictions.

2. U-Net:

Symmetric encoder-decoder architecture with skip connections.



Widely used for medical imaging tasks.

3. Mask R-CNN:

Extends Faster R-CNN for instance segmentation by predicting a mask for each detected object.

4. DeepLab:

Utilizes atrous (dilated) convolutions for capturing context at multiple


scales.

Applications of Image Segmentation


1. Medical Imaging:

Tumor detection, organ segmentation.

2. Autonomous Vehicles:

Lane detection, object recognition.

3. Satellite Imagery:

Land cover classification.

Example: U-Net for Semantic Segmentation

from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, concatenate

def unet(input_size=(128, 128, 3)):
    inputs = Input(input_size)

    # Encoder: convolution + pooling stages
    c1 = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
    p1 = MaxPooling2D((2, 2))(c1)

    c2 = Conv2D(128, (3, 3), activation='relu', padding='same')(p1)
    p2 = MaxPooling2D((2, 2))(c2)

    # Decoder: upsample and merge with the matching encoder features (skip connection)
    u1 = UpSampling2D((2, 2))(p2)
    m1 = concatenate([u1, c2])
    c3 = Conv2D(64, (3, 3), activation='relu', padding='same')(m1)

    # One output channel with sigmoid for binary pixel-wise labels
    # (this simplified network stops at half the input resolution;
    # a full U-Net mirrors every encoder stage in the decoder)
    outputs = Conv2D(1, (1, 1), activation='sigmoid')(c3)

    model = Model(inputs, outputs)
    return model

model = unet()
model.summary()

2. Object Detection

What is Object Detection?


Object detection involves identifying and localizing objects within an image by
drawing bounding boxes around them and classifying each object.

Deep Learning Architectures for Object Detection


1. Faster R-CNN:

Combines region proposal networks (RPNs) with CNNs for faster object
detection.

2. YOLO (You Only Look Once):

A single-shot detection model that predicts bounding boxes and class probabilities simultaneously.

Versions: YOLOv3, YOLOv4, YOLOv5, YOLOv8.

3. SSD (Single Shot MultiBox Detector):

Detects objects in images in a single pass.

4. Vision Transformers (ViT):



Emerging models that utilize transformer architectures for object detection
tasks.

Applications of Object Detection


1. Autonomous Vehicles:

Pedestrian detection, obstacle recognition.

2. Retail:

Inventory monitoring, checkout systems.

3. Healthcare:

Identifying abnormalities in medical images.

Example: YOLO for Object Detection

from ultralytics import YOLO

# Load a pre-trained YOLO model (any supported checkpoint name works;
# the weights are downloaded automatically on first use)
model = YOLO("yolov8n.pt")

# Perform object detection on an image
results = model("image.jpg")

# Display the first result with bounding boxes drawn
results[0].show()

3. Automatic Image Captioning

What is Automatic Image Captioning?


Image captioning involves generating a textual description for a given image by
understanding its content.

Deep Learning Architectures for Image Captioning



1. Encoder-Decoder Model:

Encoder: A CNN (e.g., ResNet, Inception) extracts features from the image.

Decoder: An RNN (e.g., LSTM) generates captions based on the encoded features.

2. Attention Mechanism:

Allows the model to focus on specific parts of the image while generating
each word.

3. Vision-Language Transformers:

Models like CLIP and BLIP utilize transformers for improved image-text
understanding.

Applications of Image Captioning


1. Accessibility:

Assisting visually impaired individuals by describing images.

2. Social Media:

Automated hashtag generation, content descriptions.

3. E-Commerce:

Product descriptions for catalog images.

Example: Image Captioning with CNN-LSTM

from keras.applications import InceptionV3
from keras.models import Model
from keras.layers import LSTM, Dense, Embedding, Input, add

# Load a pre-trained CNN (e.g., InceptionV3) as the encoder;
# dropping the classification head leaves 2048-dimensional image features
cnn_model = InceptionV3(weights='imagenet')
cnn_model = Model(cnn_model.input, cnn_model.layers[-2].output)

# Define LSTM-based decoder
image_features = Input(shape=(2048,))
caption_input = Input(shape=(None,))
embedding = Embedding(input_dim=10000, output_dim=256)(caption_input)
lstm = LSTM(256)(embedding)

# Merge the image features with the caption representation
# before predicting the next word over a 10,000-word vocabulary
image_dense = Dense(256, activation='relu')(image_features)
merged = add([image_dense, lstm])
decoder_output = Dense(10000, activation='softmax')(merged)

# Combine encoder and decoder
captioning_model = Model([image_features, caption_input], decoder_output)
captioning_model.summary()

Comparison of Tasks

Task               | Objective                                  | Output                        | Key Models
Image Segmentation | Label each pixel in an image               | Mask (pixel-level labels)     | FCN, U-Net, Mask R-CNN, DeepLab
Object Detection   | Identify and localize objects in an image  | Bounding boxes + class labels | Faster R-CNN, YOLO, SSD, ViT
Image Captioning   | Generate textual descriptions for images   | Sentences or phrases          | Encoder-Decoder, Attention, Transformers

Conclusion
Deep learning has enabled significant advancements in computer vision tasks like
image segmentation, object detection, and automatic image captioning. These
tasks find applications in autonomous vehicles, healthcare, e-commerce, and
accessibility technologies. Modern architectures, including transformers, continue
to push the boundaries of these applications.

Image Generation with Generative Adversarial Networks (GANs)



What is a GAN?
A Generative Adversarial Network (GAN) is a type of deep learning model that
generates realistic images, videos, or other data. It consists of two neural
networks:

1. Generator:

Produces synthetic images from random noise.

2. Discriminator:

Distinguishes between real and fake images.

The generator and discriminator compete in a zero-sum game:

The generator tries to create images that fool the discriminator.

The discriminator improves at identifying fake images.

How GANs Work


1. Random Noise:

The generator takes a random noise vector as input.

2. Synthetic Image:

The generator creates a fake image from the noise.

3. Real vs. Fake:

The discriminator evaluates whether an image is real (from the dataset) or fake (from the generator).

4. Feedback:

The discriminator's feedback helps the generator improve.

Applications of GANs
1. Image Generation:

Creating realistic faces, artwork, or objects.

Example: StyleGAN generates high-quality facial images.



2. Data Augmentation:

Expanding datasets for training models.

3. Super-Resolution:

Enhancing the resolution of low-quality images.

4. Text-to-Image:

Generating images based on textual descriptions (e.g., DALL·E).

Example: GAN for Image Generation

import tensorflow as tf
from tensorflow.keras.layers import Dense, Reshape, Flatten, LeakyReLU
from tensorflow.keras.models import Sequential

# Generator model: maps a 100-dimensional noise vector to a 28x28 image
def build_generator():
    model = Sequential([
        Dense(256, input_dim=100),
        LeakyReLU(0.2),
        Dense(512),
        LeakyReLU(0.2),
        Dense(1024),
        LeakyReLU(0.2),
        Dense(28 * 28 * 1, activation="tanh"),
        Reshape((28, 28, 1))
    ])
    return model

# Discriminator model: classifies 28x28 images as real (1) or fake (0)
def build_discriminator():
    model = Sequential([
        Flatten(input_shape=(28, 28, 1)),
        Dense(512),
        LeakyReLU(0.2),
        Dense(256),
        LeakyReLU(0.2),
        Dense(1, activation="sigmoid")
    ])
    return model

# Compile GAN
generator = build_generator()
discriminator = build_discriminator()
discriminator.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Freeze the discriminator inside the combined model so that
# only the generator is updated when training the GAN end to end
gan = Sequential([generator, discriminator])
discriminator.trainable = False
gan.compile(optimizer="adam", loss="binary_crossentropy")

Training GANs:

Train the discriminator and generator alternately.

Use techniques like label smoothing and gradient clipping to stabilize training (a minimal training-step sketch follows).
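A minimal sketch of one alternating training step, assuming the generator, discriminator, and gan models defined above and a batch of real images real_images scaled to [-1, 1] (the real batch itself is a placeholder here):

import numpy as np

batch_size = 64
noise = np.random.normal(0, 1, (batch_size, 100))
fake_images = generator.predict(noise)

# 1. Train the discriminator on real and fake batches
#    (label smoothing: real labels set to 0.9 instead of 1.0)
d_loss_real = discriminator.train_on_batch(real_images, 0.9 * np.ones((batch_size, 1)))
d_loss_fake = discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))

# 2. Train the generator through the combined GAN model
#    (discriminator weights are frozen, so only the generator is updated;
#    the generator is rewarded when its fakes are labelled as real)
noise = np.random.normal(0, 1, (batch_size, 100))
g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))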

Video-to-Text with LSTM Models

What is Video-to-Text Conversion?


Video-to-text involves generating descriptive captions or summaries for a video
by understanding its temporal and spatial features.

How It Works
1. Feature Extraction:

Use a CNN (e.g., ResNet, Inception) to extract spatial features from video
frames.



2. Sequence Modeling:

Use an LSTM to process the extracted features over time.

3. Text Generation:

Use an LSTM decoder or Transformer to generate textual captions.

Steps for Video-to-Text


1. Extract Frames:

Split the video into individual frames.

2. Feature Extraction:

Pass each frame through a pre-trained CNN to extract features.

3. Sequence Processing:

Input the sequence of features into an LSTM for temporal modeling.

4. Caption Generation:

Generate captions frame by frame.

Applications of Video-to-Text
1. Video Summarization:

Generate summaries for educational or surveillance videos.

2. Accessibility:

Create descriptive captions for visually impaired individuals.

3. Content Recommendation:

Annotate video content for better indexing.

Example: Video-to-Text with LSTM

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Step 1: Feature Extraction (using a pre-trained CNN)
cnn = tf.keras.applications.InceptionV3(weights='imagenet', include_top=False, pooling='avg')
video_frames = [...]  # Extracted and preprocessed video frames (placeholder)
frame_features = [cnn(frame) for frame in video_frames]

# Step 2: Define LSTM-based Sequence Model
# (each time step receives a 2048-dimensional frame feature)
def build_video_to_text_model(vocab_size):
    model = Sequential([
        LSTM(256, return_sequences=True, input_shape=(None, 2048)),
        Dense(256, activation='relu'),
        Dense(vocab_size, activation='softmax')
    ])
    return model

video_to_text_model = build_video_to_text_model(vocab_size=10000)

# Step 3: Compile and Train
# (captions must be prepared as word-index sequences aligned with the frames)
video_to_text_model.compile(optimizer='adam', loss='categorical_crossentropy')
video_to_text_model.fit(frame_features, captions, epochs=10, batch_size=32)

Challenges in Video-to-Text
1. Temporal Dependencies:

Capturing long-term dependencies across video frames.

Solution: Use advanced models like transformers (e.g., ViT).

2. Dataset Complexity:

Requires large labeled datasets with diverse scenes and captions.



3. Multimodal Understanding:

Combining visual and contextual understanding is challenging.

Comparison of GANs and LSTMs for Vision


Aspect       | GANs                                | Video-to-Text (LSTM)
Primary Task | Generate realistic images/videos    | Generate descriptive video captions
Input        | Random noise                        | Video frames
Output       | Synthetic images                    | Textual descriptions
Key Models   | DCGAN, StyleGAN, CycleGAN           | CNN-LSTM, Transformers
Applications | Image generation, super-resolution  | Video summarization, accessibility

Conclusion
GANs excel in generating realistic images and videos, finding applications in
data augmentation, content creation, and super-resolution tasks.

LSTM-based video-to-text models focus on converting sequential video data into meaningful textual captions, widely used in accessibility tools, video summarization, and media indexing.

Advanced architectures like transformers are increasingly improving the performance of both tasks.

Attention Models for Computer Vision Tasks

What are Attention Models?


Attention models are neural network architectures that allow the model to focus on
the most relevant parts of the input data while performing a task. Initially
introduced in natural language processing (NLP), attention mechanisms have been
successfully adapted for computer vision tasks, enabling more efficient and
accurate feature extraction and analysis.

Key Concepts of Attention in Vision



1. Spatial Attention:

Focuses on specific regions of an image.

Example: Highlighting a cat's face in an image while ignoring the background.

2. Channel Attention:

Identifies important feature maps in a CNN.

Example: Prioritizing color or texture channels for image classification.

3. Temporal Attention:

Applies to video analysis, focusing on key frames over time.

Example: Detecting a specific action in a video clip.

4. Self-Attention:

Calculates relationships between all parts of an input to understand dependencies.

Widely used in transformers for capturing global context.

Attention Mechanisms in Computer Vision

1. Self-Attention
Computes attention scores between every pair of input elements.

Example: Vision Transformers (ViTs) use self-attention to model relationships between image patches.

2. Spatial Attention
Focuses on specific spatial regions of an image.

Example: Convolutional Block Attention Module (CBAM) applies spatial attention to highlight relevant areas.

3. Channel Attention
Determines which feature maps (channels) are important.



Example: Squeeze-and-Excitation (SE) blocks apply channel attention.

4. Multi-Head Attention
Divides the input into multiple subspaces and computes attention for each
subspace.

Example: Multi-head self-attention in Vision Transformers.
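At the core of both self-attention and multi-head attention is the scaled dot-product operation. A minimal TensorFlow sketch (the function name and toy shapes are illustrative, not from a specific library):

import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # Attention scores between every pair of positions
    scores = tf.matmul(q, k, transpose_b=True)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    # Scale, normalize to a distribution, then take a weighted sum of the values
    weights = tf.nn.softmax(scores / tf.sqrt(d_k), axis=-1)
    return tf.matmul(weights, v), weights

# Toy usage: 4 image patches, each embedded in 8 dimensions, attending to each other
patches = tf.random.normal((1, 4, 8))
output, attn = scaled_dot_product_attention(patches, patches, patches)
print(output.shape, attn.shape)  # (1, 4, 8) (1, 4, 4)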

Key Architectures Using Attention in Vision

1. Vision Transformers (ViTs)


Overview:

Applies self-attention to image patches.

Treats images as sequences, similar to words in NLP.

How it Works:

An image is divided into patches, each represented as a vector.

Self-attention layers process these patches to capture global dependencies.

Applications:

Image classification, object detection, segmentation.

2. Convolutional Block Attention Module (CBAM)


Overview:

Combines spatial and channel attention mechanisms.

Enhances feature extraction in CNNs.

How it Works:

Channel Attention: Learns important feature maps.

Spatial Attention: Highlights relevant spatial regions.

Applications:

Improves CNN-based tasks like classification and segmentation.



3. SENet (Squeeze-and-Excitation Network)
Overview:

Introduces channel attention to CNNs.

How it Works:

Squeezes feature maps to a global descriptor.

Excites important channels by reweighting them.

Applications:

Image classification, object detection.
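A minimal Keras sketch of a squeeze-and-excitation block (the helper function and shapes below are illustrative):

import tensorflow as tf
from tensorflow.keras import layers

def se_block(feature_map, reduction=16):
    channels = feature_map.shape[-1]
    # Squeeze: one global descriptor value per channel
    s = layers.GlobalAveragePooling2D()(feature_map)
    # Excitation: learn per-channel weights in [0, 1]
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    # Reweight the original channels
    return layers.Multiply()([feature_map, s])

# Toy usage inside a functional model
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = se_block(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()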

4. DETR (DEtection TRansformer)


Overview:

Combines transformers with CNNs for object detection.

How it Works:

Uses self-attention to predict object bounding boxes and labels.

Applications:

Object detection tasks.

5. Attention U-Net
Overview:

Adds attention gates to U-Net for medical image segmentation.

How it Works:

Highlights relevant regions of interest in the feature maps.

Applications:

Medical imaging, tumor segmentation.

Applications of Attention Models in Vision


1. Image Classification:



Vision Transformers (ViTs) achieve state-of-the-art performance by
capturing global context.

2. Object Detection:

DETR uses attention to predict bounding boxes and object classes.

3. Image Segmentation:

Attention U-Net and CBAM enhance segmentation tasks by focusing on relevant regions.

4. Action Recognition in Videos:

Temporal attention highlights important frames for action detection.

5. Super-Resolution:

Attention mechanisms improve the generation of high-resolution images.

6. Anomaly Detection:

Focuses on unusual regions in images or videos.

Example: Vision Transformer (ViT) for Image Classification

import tensorflow as tf
from tensorflow.keras.layers import Dense, LayerNormalization, MultiHeadAttention
from tensorflow.keras.models import Model

class VisionTransformer(Model):
    def __init__(self, num_patches, projection_dim, num_heads, transformer_units, num_classes):
        super(VisionTransformer, self).__init__()
        self.num_patches = num_patches
        self.projection_dim = projection_dim
        # Learnable class token and position embeddings
        self.class_token = self.add_weight(shape=(1, 1, projection_dim), initializer="random_normal")
        self.position_embedding = self.add_weight(shape=(1, num_patches + 1, projection_dim), initializer="random_normal")
        self.multi_head_attention = MultiHeadAttention(num_heads=num_heads, key_dim=projection_dim)
        self.dense_proj = tf.keras.Sequential([Dense(units, activation="relu") for units in transformer_units])
        self.layer_norm = LayerNormalization(epsilon=1e-6)
        self.classifier = Dense(num_classes)

    def call(self, patches):
        batch_size = tf.shape(patches)[0]
        # Prepend the class token and add position information
        class_token = tf.broadcast_to(self.class_token, [batch_size, 1, self.projection_dim])
        patches = tf.concat([class_token, patches], axis=1)
        patches += self.position_embedding
        for _ in range(2):  # Two transformer layers
            attention_output = self.multi_head_attention(patches, patches)
            patches = self.layer_norm(patches + attention_output)
            proj_output = self.dense_proj(patches)
            patches = self.layer_norm(patches + proj_output)
        # Classify from the class-token position
        return self.classifier(patches[:, 0])

# Initialize and compile the ViT model
num_patches = 16 * 16
projection_dim = 64
num_heads = 4
transformer_units = [projection_dim * 2, projection_dim]
num_classes = 10

vit_model = VisionTransformer(num_patches, projection_dim, num_heads, transformer_units, num_classes)
vit_model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])

Advantages of Attention Models in Vision



1. Captures Global Context:

Self-attention captures long-range dependencies.

2. Improves Feature Extraction:

Spatial and channel attention focus on relevant regions.

3. Versatility:

Applicable to various tasks: classification, detection, segmentation.

4. Handles Complex Relationships:

Multi-head attention captures diverse feature representations.

Challenges of Attention Models


1. Computational Complexity:

Self-attention scales quadratically with input size, making it resource-intensive for high-resolution images.

2. Large Datasets:

Attention-based models require large datasets for training.

3. Lack of Spatial Inductive Bias:

Unlike CNNs, transformers lack inherent spatial understanding, requiring more data.

Conclusion
Attention mechanisms have significantly advanced computer vision, enabling
state-of-the-art performance in tasks like image classification, object detection,
and segmentation. While models like Vision Transformers and DETR lead the
way, hybrid approaches combining CNNs with attention mechanisms (e.g., CBAM,
SENet) continue to be effective for resource-constrained applications.
