1. Presented by
21P27
1BI21CS05 Hitesh Gupta
Under the Guidance of
Prof. Prathima M G
Assistant Professor
Dept. of Computer Science and Engineering
“Vision Transformers with Hierarchical Attention”
2. AGENDA
• Introduction
• Literature Survey
• Existing System
• Problem Statement
• Proposed System
• Architecture
• Methodology
• Results
• Applications
• Conclusion
• References
3. INTRODUCTION
Vision Transformers (ViTs) are powerful for vision tasks but suffer from high
computational costs on high-resolution images because of global self-attention.
To address this, we introduce Hierarchical Mixed-Resolution Self-Attention,
which first computes attention locally within small windows and then globally
across summarized tokens. Implemented in the new HAT-Net architecture, this
approach significantly improves both efficiency and accuracy, yielding faster
inference and higher performance.
4. Literature Review
1. Hierarchical Vision Transformer with Deformable Embedding for Fine-Grained Visual Recognition
   Authors: Z. Gu, Y. Zhang, B. Ni, M. Zhang. Publication: ICCV 2021.
   Adopted Technique: Introduces deformable patch embedding to better capture local details and semantic structure.
   Limitations: Increased model complexity and training cost.
2. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
   Authors: Z. Liu, Y. Lin, Y. Cao, et al. Publication: ICCV 2021.
   Adopted Technique: Uses a shifted windowing scheme for efficient hierarchical self-attention.
   Limitations: Struggles with very long-range dependencies.
3. Pyramid Vision Transformer (PVT): A Versatile Backbone for Dense Prediction
   Authors: W. Wang, E. Xie, X. Li. Publication: CVPR 2021.
   Adopted Technique: Introduces spatial-reduction attention and a pyramid architecture for dense prediction tasks.
   Limitations: Reduction in spatial resolution can affect fine-grained feature learning.
4. Convolutional Vision Transformer (CvT): A Hierarchical Vision Transformer Incorporating Convolutional Projections
   Authors: H. Wu, B. Xiao, N. Codella. Publication: ICCV 2021.
   Adopted Technique: Combines convolutional layers with transformers to capture both local and global features.
   Limitations: Added convolutional operations increase training and inference time.
5. Literature Review (contd.)
5. HRViT: High-Resolution Vision Transformer with Dynamic Token Sparsification
   Authors: G. Cheng, Y. Qiao. Publication: ICCV 2021.
   Adopted Technique: Applies token sparsification for efficient high-resolution image processing.
   Limitations: May compromise performance in tasks needing dense token representation.
6. Twins: Revisiting the Design of Spatial Attention in Vision Transformers
   Authors: X. Chu, Z. Tian, Y. Wang. Publication: ICCV 2021.
   Adopted Technique: Separates local and global attention modules for better spatial feature extraction.
   Limitations: Computational overhead remains due to dual attention pathways.
7. Hierarchical Vision Transformer with Local and Global Representation
   Authors: B. Li, Y. Cao, H. Xu. Publication: NeurIPS 2022.
   Adopted Technique: Combines local and global feature encoding hierarchically.
   Limitations: Requires careful tuning of attention splits between local and global levels.
8. BoTNet: Bottleneck Transformers for Visual Recognition
   Authors: A. Srinivas, T.-Y. Lin, N. Parmar. Publication: NeurIPS 2021.
   Adopted Technique: Inserts transformers into ResNet bottlenecks for long-range dependency modeling.
   Limitations: Limited scalability due to integration within CNN backbones.
6. Literature Review (contd.)
9. Neural Architecture Transformer (NAT): Accurate and Compact Hierarchical Networks
   Authors: Y. Chen, C. Ge, J. Liu. Publication: ICCV 2021.
   Adopted Technique: Learns hierarchical vision architectures automatically through search and distillation.
   Limitations: Architecture search increases computation during training.
10. CrossFormer: Cross-scale Vision Transformer for Image Classification and Object Detection
   Authors: Q. Wang, Z. Shen, X. Li. Publication: NeurIPS 2022.
   Adopted Technique: Designs cross-scale attention to integrate multi-resolution feature maps.
   Limitations: Complexity increases with scale depth and token interaction strategy.
7. Existing System
Sensor Type:
Utilizes RGB cameras in combination with hierarchical Vision Transformer (ViT)
architectures to process visual input from the environment.
Perception Approach:
Applies multi-level self-attention mechanisms to extract both local and global
features from input images, enabling improved semantic understanding and
object recognition.
8. Feature Representation:
Employs hierarchical attention blocks to refine spatial feature maps, offering
enhanced context awareness compared to traditional CNNs.
Positional Accuracy:
Relies on positional encodings and attention-based spatial modeling, which can
capture object relationships well, but may lack geometric precision in cluttered
scenes.
9. Limitations:
While effective in structured environments, performance can degrade in complex
or occluded scenes due to the lack of depth information and sensitivity to
lighting conditions. Real-time implementation is also challenged by high
computational requirements.
Integration in Autonomous Systems:
Best suited for semantic-level tasks like lane-marking detection, object
classification, and pedestrian intent prediction, and is often used in
conjunction with LiDAR or radar for full-scene understanding and robust 3D
perception.
10. Problem Statement
Existing Vision Transformers struggle to efficiently balance local detail
preservation and global context modeling in high-resolution vision tasks due to
the high computational cost of global self-attention.
11. Proposed System
The proposed Hierarchical Vision Transformer (HVT) architecture addresses the
computational challenges of standard Vision Transformers through a novel dual-
attention framework that efficiently balances local detail preservation with global
context modeling.
13. 1. Preprocessing:
• Input images are divided into small fixed-size patches, each treated as a
token (similar to ViT).
• The initial patch tokens are embedded using linear projection layers to form
the input token sequence.
• Positional encodings are added to preserve spatial relationships between
image patches.
14. 2. Token Management (Hierarchical Attention Modeling):
• Local Attention Stage:
Patch tokens are grouped into grids. Self-attention is applied within each grid to
model fine-grained, local dependencies and generate discriminative local
features.
• Global Attention Stage:
Tokens are downsampled and merged into larger ones. Self-attention is applied
across these merged tokens to capture long-range, global dependencies.
• A hierarchical structure reduces token count gradually, improving efficiency
without sacrificing detail.
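To make the efficiency gain concrete, the rough sketch below compares the number of token-pair interactions in full global self-attention with the local-plus-global scheme described above. The grid size of 8 and pooling ratio of 4 are illustrative assumptions for this estimate, not values taken from HAT-Net.

```python
# Back-of-the-envelope comparison of attention costs, counting only
# token-pair interactions in the attention map. All numbers are
# illustrative assumptions, not measurements from HAT-Net.

def global_attention_cost(num_tokens: int) -> int:
    # Full self-attention compares every token with every other token: O(N^2).
    return num_tokens ** 2

def hierarchical_attention_cost(num_tokens: int, grid: int = 8, pool: int = 4) -> int:
    # Local stage: tokens are split into windows of grid*grid tokens,
    # so attention is quadratic only within each window.
    local = num_tokens * (grid * grid)
    # Global stage: tokens are average-pooled by `pool` along each spatial
    # axis, then full attention runs on the much smaller merged token set.
    merged = num_tokens // (pool * pool)
    global_ = merged ** 2
    return local + global_

if __name__ == "__main__":
    n = 56 * 56  # e.g. a 224x224 image with 4x4 patches -> 3136 tokens
    print("global only :", global_attention_cost(n))
    print("hierarchical:", hierarchical_attention_cost(n))
```

In this toy estimate the hierarchical scheme is roughly forty times cheaper than pure global attention for a 56x56 token map, which is the kind of saving the gradual token reduction is designed to deliver.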
15. 3. Feature Update:
• Outputs from both local and global attention stages are aggregated to create a
unified feature map.
• Residual connections and feed-forward MLP layers refine these features.
• The final hierarchical features are used for downstream tasks like image
classification, object detection, and segmentation.
16. Methodology/Algorithm
1. Preprocessing
Image Patch Embedding
• Input images are divided into fixed-size non-overlapping patches.
• Each patch is flattened and passed through a linear projection layer to create
patch embeddings.
• Positional encodings are added to preserve spatial arrangement and enable
attention-based modeling.
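As a concrete illustration of this preprocessing step, the following PyTorch sketch turns an image into a sequence of patch embeddings with learnable positional encodings. The patch size, embedding dimension, and the use of a strided convolution as the linear projection are assumptions for the example, not values taken verbatim from HAT-Net.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and linearly project each one."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=256):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable positional encodings preserve the spatial arrangement.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, C, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)      # (B, N, C) token sequence
        return x + self.pos_embed

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 256])
```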
17. 2. Hierarchical Multi-Head Self-Attention (H-MHSA)
Local Attention
• Tokens are grouped into small grids (e.g., 8×8 patches).
• Within each grid, self-attention is computed independently to model local
dependencies, capturing fine-grained spatial patterns.
• Local attention outputs are reshaped back to the original spatial layout and
added back through residual connections.
• For each grid, queries (Q), keys (K), and values (V) are computed by linear projection.
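A minimal PyTorch sketch of this local stage follows, assuming single-head attention, an 8x8 grid, and a channel width of 256 for illustration (HAT-Net itself uses multi-head attention and stage-dependent widths).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGridAttention(nn.Module):
    """Self-attention computed independently inside each G x G grid of tokens."""

    def __init__(self, dim=256, grid=8):
        super().__init__()
        self.grid = grid
        self.scale = dim ** -0.5
        # Queries, keys, and values are obtained by linear projection.
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, H, W, C)
        B, H, W, C = x.shape
        g = self.grid
        # Partition the feature map into non-overlapping g x g windows.
        win = x.view(B, H // g, g, W // g, g, C)
        win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, g * g, C)
        q, k, v = self.qkv(win).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = self.proj(attn @ v)
        # Reshape back to the original spatial layout and add a residual.
        out = out.view(B, H // g, W // g, g, g, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return x + out

y = LocalGridAttention()(torch.randn(2, 56, 56, 256))
print(y.shape)  # torch.Size([2, 56, 56, 256])
```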
18. Global Attention
• Local features are downsampled (using average pooling) to form fewer merged
tokens.
• Self-attention is computed across these tokens to model global dependencies in
a computationally efficient way.
• Global features are then aligned with local features spatially and combined for
enhanced context representation.
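A matching sketch of the global stage, under the same illustrative assumptions; the pooling ratio of 4 and the nearest-neighbour upsampling used to realign global features with the local map are choices made for the example, not specified by the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalPooledAttention(nn.Module):
    """Self-attention over average-pooled (merged) tokens, realigned with the input."""

    def __init__(self, dim=256, pool=4):
        super().__init__()
        self.pool = pool
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        # Downsample with average pooling to form fewer, merged tokens.
        merged = F.avg_pool2d(x.permute(0, 3, 1, 2), self.pool)   # (B, C, H/p, W/p)
        t = merged.flatten(2).transpose(1, 2)                      # (B, N', C)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = self.proj(attn @ v).transpose(1, 2)                  # (B, C, N')
        out = out.reshape(B, C, H // self.pool, W // self.pool)
        # Upsample so global features align spatially with the local features.
        out = F.interpolate(out, size=(H, W), mode="nearest").permute(0, 2, 3, 1)
        return x + out

y = GlobalPooledAttention()(torch.randn(2, 56, 56, 256))
print(y.shape)  # torch.Size([2, 56, 56, 256])
```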
19. 3. Token Fusion & Feature Aggregation
Feature Fusion
• Local and global attentive features are combined using a trainable projection
layer.
• This aggregated feature map captures both fine details and overall scene
context.
Feed-Forward Network (MLP)
• Each transformer block ends with a multilayer perceptron to enhance feature
representation.
• Residual connections and activation functions (SiLU) improve convergence
and generalization.
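A small sketch of this fusion and feed-forward step, assuming the local and global streams are concatenated and merged by a trainable linear projection and that the MLP uses an expansion ratio of 4; both are assumptions for illustration, only the SiLU activation is stated above.

```python
import torch
import torch.nn as nn

class FuseAndMLP(nn.Module):
    """Fuse local and global attentive features, then refine them with an MLP."""

    def __init__(self, dim=256, mlp_ratio=4):
        super().__init__()
        # Trainable projection that merges the two feature streams.
        self.fuse = nn.Linear(dim * 2, dim)
        self.norm = nn.LayerNorm(dim)
        # Feed-forward network with SiLU activation.
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.SiLU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, local_feat, global_feat):     # both: (B, N, C)
        fused = self.fuse(torch.cat([local_feat, global_feat], dim=-1))
        # Residual connection around the MLP aids convergence.
        return fused + self.mlp(self.norm(fused))

out = FuseAndMLP()(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 196, 256])
```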
20. 4. Network Output & Task-Specific Heads
Classification & Prediction
• After the final transformer stages, global average pooling is applied.
• The resulting feature vector is passed through a fully connected layer for
classification tasks.
• For detection/segmentation, appropriate task-specific heads are attached,
using the extracted multi-scale hierarchical features.
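For the classification path, a minimal head matching the description above: global average pooling over the final tokens followed by a fully connected layer. The 1000-class output size is an assumption for the example.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Global average pooling over tokens followed by a linear classifier."""

    def __init__(self, dim=256, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):                    # tokens: (B, N, C) from the last stage
        pooled = self.norm(tokens).mean(dim=1)    # global average pooling
        return self.fc(pooled)                    # class logits

logits = ClassificationHead()(torch.randn(2, 196, 256))
print(logits.shape)  # torch.Size([2, 1000])
```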
21. • State Update & Initialization
Measurement Update
For tracklets with matched measurements, use the decentralized Kalman filter
outputs to refine state estimates.
New Tracklets
Any unmatched radar detections spawn new tracklets, each recorded with its
creation time for future validation.
23. Applications
Image Classification
• HAT-Net provides strong global and local feature representation, making it
highly effective for fine-grained image recognition tasks (e.g., identifying vehicle
types, species classification, product recognition in retail).
Object Detection
• The hierarchical attention structure improves detection accuracy, particularly in
complex scenes with overlapping or occluded objects—ideal for applications like
autonomous driving, surveillance, and smart city monitoring.
24. Semantic Segmentation
• Useful in autonomous navigation, robotics, and medical imaging where precise
pixel-level understanding of scenes is critical (e.g., lane detection, tumor
boundary identification).
25. Conclusion
In conclusion, the hierarchical attention mechanism introduced in HAT-Net
plays a crucial role in enhancing feature representation for visual understanding.
By combining local detail with global context, HAT-Net enables more accurate
and robust perception across diverse vision tasks. This capability is especially
important in applications like autonomous driving, robotics, and smart
surveillance, where reliable scene interpretation is essential for safe and
intelligent decision-making.
26. References
[1] Z. Gu, Y. Zhang, B. Ni, and M. Zhang, Hierarchical Vision Transformer with
Deformable Embedding for Fine-Grained Visual Recognition.
[2] Z. Liu, Y. Lin, Y. Cao, et al., Swin Transformer: Hierarchical Vision Transformer
using Shifted Windows.
[3] W. Wang, E. Xie, and X. Li, Pyramid Vision Transformer (PVT): A Versatile
Backbone for Dense Prediction without Convolutions.
27. [4] H. Wu, B. Xiao, and N. Codella, Convolutional Vision Transformer (CvT): A
Hierarchical Vision Transformer Incorporating Convolutional Projections.
[5] G. Cheng and Y. Qiao, HRViT: High-Resolution Vision Transformer with
Dynamic Token Sparsification.
[6] X. Chu, Z. Tian, and Y. Wang, Twins: Revisiting the Design of Spatial
Attention in Vision Transformers.