Presented by
21P27
1BI21CS05 Hitesh Gupta
Under the Guidance of
Prof. Prathima M G
Assistant Professor
Dept. of Computer Science and Engineering
“Vision Transformers with Hierarchical Attention”
AGENDA
• Introduction
• Literature Survey
• Existing System
• Problem Statement
• Proposed System
• Architecture
• Methodology
• Results
• Applications
• Conclusion
• References
INTRODUCTION
Vision Transformers (ViTs) are powerful for vision tasks but suffer from the high
computational cost of global self-attention when handling high-resolution images.
To address this, we introduce Hierarchical Mixed-Resolution Self-Attention, which
first computes attention locally within small windows and then globally across
summarized tokens. Implemented in the new HAT-Net architecture, this approach
significantly improves both efficiency and accuracy, achieving faster inference and
higher performance.
Literature Review

1. Hierarchical Vision Transformer with Deformable Embedding for Fine-Grained Visual Recognition
   Authors: Z. Gu, Y. Zhang, B. Ni, M. Zhang; Publication: ICCV 2021; Year: 2021
   Technique: Introduces deformable patch embedding to better capture local details and semantic structure.
   Limitation: Increased model complexity and training cost.

2. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
   Authors: Z. Liu, Y. Lin, Y. Cao, et al.; Publication: ICCV 2021; Year: 2021
   Technique: Uses a shifted-windowing scheme for efficient hierarchical self-attention.
   Limitation: Struggles with very long-range dependencies.

3. Pyramid Vision Transformer (PVT): A Versatile Backbone for Dense Prediction
   Authors: W. Wang, E. Xie, X. Li; Publication: CVPR 2021; Year: 2021
   Technique: Introduces spatial-reduction attention and a pyramid architecture for dense prediction tasks.
   Limitation: Reduction in spatial resolution can affect fine-grained feature learning.

4. Convolutional Vision Transformer (CvT): A Hierarchical Vision Transformer Incorporating Convolutional Projections
   Authors: H. Wu, B. Xiao, N. Codella; Publication: ICCV 2021; Year: 2021
   Technique: Combines convolutional layers with transformers to capture both local and global features.
   Limitation: Added convolutional operations increase training and inference time.
5. HRViT: High-Resolution Vision Transformer with Dynamic Token Sparsification
   Authors: G. Cheng, Y. Qiao; Publication: ICCV 2021; Year: 2022
   Technique: Applies token sparsification for efficient high-resolution image processing.
   Limitation: May compromise performance in tasks needing dense token representation.

6. Twins: Revisiting the Design of Spatial Attention in Vision Transformers
   Authors: X. Chu, Z. Tian, Y. Wang; Publication: ICCV 2021; Year: 2021
   Technique: Separates local and global attention modules for better spatial feature extraction.
   Limitation: Computational overhead remains due to dual attention pathways.

7. Hierarchical Vision Transformer with Local and Global Representation
   Authors: B. Li, Y. Cao, H. Xu; Publication: NeurIPS; Year: 2022
   Technique: Combines local and global feature encoding hierarchically.
   Limitation: Requires careful tuning of attention splits between local and global levels.

8. BoTNet: Bottleneck Transformers for Visual Recognition
   Authors: A. Srinivas, T.-Y. Lin, N. Parmar; Publication: NeurIPS; Year: 2021
   Technique: Inserts transformers into ResNet bottlenecks for long-range dependency modeling.
   Limitation: Limited scalability due to integration within CNN backbones.
9. Neural Architecture Transformer (NAT): Accurate and Compact Hierarchical Networks
   Authors: Y. Chen, C. Ge, J. Liu; Publication: ICCV 2021; Year: 2022
   Technique: Learns hierarchical vision architectures automatically through search and distillation.
   Limitation: Architecture search increases computation during training.

10. CrossFormer: Cross-scale Vision Transformer for Image Classification and Object Detection
    Authors: Q. Wang, Z. Shen, X. Li; Publication: NeurIPS; Year: 2022
    Technique: Designs cross-scale attention to integrate multi-resolution feature maps.
    Limitation: Complexity increases with scale depth and token interaction strategy.
Existing System
Sensor Type:
Utilizes RGB cameras in combination with hierarchical Vision Transformer (ViT)
architectures to process visual input from the environment.
Perception Approach:
Applies multi-level self-attention mechanisms to extract both local and global
features from input images, enabling improved semantic understanding and
object recognition.
Feature Representation:
Employs hierarchical attention blocks to refine spatial feature maps, offering
enhanced context awareness compared to traditional CNNs.
Positional Accuracy:
Relies on positional encodings and attention-based spatial modeling, which can
capture object relationships well, but may lack geometric precision in cluttered
scenes.
Limitations:
While effective in structured environments, performance can degrade in complex
or occluded scenes due to lack of depth information and sensitivity to lighting
conditions. Real-time implementation is also challenged by high computational
requirements.
Integration in Autonomous Systems:
Best suited for semantic-level tasks like lane marking detection, object
classification, and pedestrian intent prediction, often used in conjunction with
LiDAR or radar for full-scene understanding and robust 3D perception.
Problem Statement
Existing Vision Transformers struggle to efficiently balance local detail
preservation and global context modeling in high-resolution vision tasks due to
the high computational cost of global self-attention.
Proposed System
The proposed Hierarchical Vision Transformer (HVT) architecture addresses the
computational challenges of standard Vision Transformers through a novel dual-
attention framework that efficiently balances local detail preservation with global
context modeling.
Architecture
1. Preprocessing:
• Input images are divided into small fixed-size patches, each treated as a
token (similar to ViT).
• The initial patch tokens are embedded using linear projection layers to form
the input token sequence.
• Positional encodings are added to preserve spatial relationships between
image patches.
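The slides do not include code, so the following is a minimal PyTorch sketch of this preprocessing step. The 4-pixel patch size, 64-dimensional embedding, and learned positional embedding are illustrative assumptions, not values from the HAT-Net paper.

```python
# Illustrative sketch (not the authors' code): patch tokenization plus a
# learned positional embedding, as described in the bullets above.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=4, in_chans=3, embed_dim=64):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution with kernel = stride = patch size is equivalent
        # to flattening non-overlapping patches and applying a linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # (B, N, C) token sequence
        return x + self.pos_embed              # add positional encoding

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                            # torch.Size([2, 3136, 64])
```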
2. Token Management (Hierarchical Attention Modeling):
• Local Attention Stage:
Patch tokens are grouped into grids. Self-attention is applied within each grid to
model fine-grained, local dependencies and generate discriminative local
features.
• Global Attention Stage:
Tokens are downsampled and merged into larger ones. Self-attention is applied
across these merged tokens to capture long-range, global dependencies.
• A hierarchical structure reduces token count gradually, improving efficiency
without sacrificing detail.
3. Feature Update:
• Outputs from both local and global attention stages are aggregated to create a
unified feature map.
• Residual connections and feed-forward MLP layers refine these features.
• The final hierarchical features are used for downstream tasks like image
classification, object detection, and segmentation.
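To make the gradual token reduction concrete, here is a small sketch of such a pyramid layout. The four stage widths, the 2x2 patch-merging operator, and the placeholder where the attention blocks would run are assumptions for illustration only, not the paper's configuration.

```python
# Sketch of the hierarchical layout: token count shrinks stage by stage while
# channel width grows; stage depths and widths below are assumed values.
import torch
import torch.nn as nn

class PatchMerge(nn.Module):
    """Halve H and W (quarter the tokens) and double the channels."""
    def __init__(self, dim):
        super().__init__()
        self.reduce = nn.Conv2d(dim, 2 * dim, kernel_size=2, stride=2)

    def forward(self, x, hw):                        # x: (B, N, C)
        B, _, C = x.shape
        H, W = hw
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.reduce(x)                           # (B, 2C, H/2, W/2)
        return x.flatten(2).transpose(1, 2), (H // 2, W // 2)

dims = [64, 128, 256, 512]                           # assumed stage widths
x, hw = torch.randn(2, 56 * 56, dims[0]), (56, 56)   # tokens from patch embedding
for i, dim in enumerate(dims):
    print(f"stage {i}: tokens={x.shape[1]}, channels={x.shape[2]}")
    # ... hierarchical attention (H-MHSA) blocks of stage i would run on x here ...
    if i < len(dims) - 1:
        x, hw = PatchMerge(dim)(x, hw)               # quarter tokens, double width
```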
Methodology/Algorithm
1. Preprocessing
Image Patch Embedding
• Input images are divided into fixed-size non-overlapping patches.
• Each patch is flattened and passed through a linear projection layer to create
patch embeddings.
• Positional encodings are added to preserve spatial arrangement and enable
attention-based modeling.
2. Hierarchical Multi-Head Self-Attention (H-MHSA)
Local Attention
• Tokens are grouped into small grids (e.g., 8×8 patches).
• Within each grid, self-attention is computed independently to model local
dependencies, capturing fine-grained spatial patterns.
• Local attention outputs are reshaped back to the original spatial layout and added
with residual connections.
• For each grid, calculate queries (Q), keys (K), and values (V) by linear projection.
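As a hedged illustration of the local stage described above, the sketch below restricts single-head self-attention to non-overlapping 8x8 grids of tokens, with Q, K, and V obtained by a linear projection. The class name, grid size, and single-head simplification are assumptions, not the authors' implementation.

```python
# Minimal sketch of local attention within grids; residual addition and
# multi-head splitting are left to the caller for brevity.
import torch
import torch.nn as nn

class LocalGridAttention(nn.Module):
    def __init__(self, dim, grid=8):
        super().__init__()
        self.grid, self.scale = grid, dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)      # queries, keys, values
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, hw):                   # x: (B, H*W, C)
        B, N, C = x.shape
        H, W = hw
        g = self.grid
        # partition the token map into non-overlapping g x g windows
        x = x.reshape(B, H // g, g, W // g, g, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, g * g, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = self.proj(attn @ v)
        # restore the original (B, H*W, C) spatial layout
        out = out.reshape(B, H // g, W // g, g, g, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return out

y = LocalGridAttention(64)(torch.randn(2, 56 * 56, 64), (56, 56))
print(y.shape)                                  # torch.Size([2, 3136, 64])
```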
Global Attention
• Local features are downsampled (using average pooling) to form fewer
merged tokens.
• Self-attention is computed across these tokens to model global dependencies in
a computationally efficient way.
• Global features are then aligned with local features spatially and combined for
enhanced context representation.
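The global stage can be sketched in the same spirit: local features are average-pooled into a small set of summary tokens, self-attention runs over those tokens, and the result is broadcast back to full resolution and added to the local features. The pool size and the additive fusion below are illustrative assumptions.

```python
# Sketch of global attention over pooled "summary" tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalPooledAttention(nn.Module):
    def __init__(self, dim, pool=8):
        super().__init__()
        self.pool, self.scale = pool, dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, hw):                         # x: (B, H*W, C) local features
        B, N, C = x.shape
        H, W = hw
        x2d = x.transpose(1, 2).reshape(B, C, H, W)
        g = F.avg_pool2d(x2d, self.pool)              # (B, C, H/p, W/p) summary tokens
        g = g.flatten(2).transpose(1, 2)
        q, k, v = self.qkv(g).chunk(3, dim=-1)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        g = self.proj(attn @ v)                       # global context per summary token
        # align with the full-resolution local features and fuse additively
        g2d = g.transpose(1, 2).reshape(B, C, H // self.pool, W // self.pool)
        g2d = F.interpolate(g2d, size=(H, W), mode='nearest')
        return x + g2d.flatten(2).transpose(1, 2)

out = GlobalPooledAttention(64)(torch.randn(2, 56 * 56, 64), (56, 56))
print(out.shape)                                      # torch.Size([2, 3136, 64])
```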
3. Token Fusion & Feature Aggregation
Feature Fusion
• Local and global attentive features are combined using a trainable projection
layer.
• This aggregated feature map captures both fine details and overall scene
context.
Feed-Forward Network (MLP)
• Each transformer block ends with a multilayer perceptron to enhance feature
representation.
• Residual connections and activation functions (SiLU) improve convergence
and generalization.
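A possible sketch of this fusion-plus-MLP step is shown below, assuming a concatenation-based trainable fusion projection, pre-norm placement, and a 4x expansion ratio; none of these choices are specified on the slides.

```python
# Sketch of one block tail: fuse local and global attentive features with a
# trainable projection, then refine with a SiLU MLP; residuals throughout.
import torch
import torch.nn as nn

class FusionFFNBlock(nn.Module):
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)           # trainable fusion projection
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.SiLU(),                                # SiLU activation as on the slide
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x, local_feat, global_feat):    # all (B, N, C)
        fused = self.fuse(torch.cat([local_feat, global_feat], dim=-1))
        x = x + fused                                 # residual over the attention path
        x = x + self.mlp(self.norm(x))                # residual over the feed-forward path
        return x

B, N, C = 2, 3136, 64
out = FusionFFNBlock(C)(torch.randn(B, N, C), torch.randn(B, N, C), torch.randn(B, N, C))
print(out.shape)                                      # torch.Size([2, 3136, 64])
```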
4. Network Output & Task-Specific Heads
Classification & Prediction
• After the final transformer stages, global average pooling is applied.
• The resulting feature vector is passed through a fully connected layer for
classification tasks.
• For detection/segmentation, appropriate task-specific heads are attached,
using the extracted multi-scale hierarchical features.
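For the classification path, a minimal head might look like the sketch below; the 512-dimensional final stage and 1000 output classes are assumptions (ImageNet-style), not values stated on the slides.

```python
# Sketch of the classification head: global average pooling over the final
# tokens followed by a fully connected layer.
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    def __init__(self, dim=512, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):                  # tokens: (B, N, C) from the last stage
        x = self.norm(tokens).mean(dim=1)       # global average pooling over tokens
        return self.fc(x)                       # class logits

logits = ClassifierHead()(torch.randn(2, 49, 512))
print(logits.shape)                             # torch.Size([2, 1000])
```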
• State Update & Initialization
Measurement Update
For tracklets with matched measurements, use the decentralized Kalman filter
outputs to refine state estimates.
New Tracklets
Any unmatched radar detections spawn new tracklets, each recorded with its
creation time for future validation.
Results
Applications
Image Classification
• HAT-Net provides strong global and local feature representation, making it
highly effective for fine-grained image recognition tasks (e.g., identifying vehicle
types, species classification, product recognition in retail).
Object Detection
• The hierarchical attention structure improves detection accuracy, particularly in
complex scenes with overlapping or occluded objects—ideal for applications like
autonomous driving, surveillance, and smart city monitoring.
Semantic Segmentation
• Useful in autonomous navigation, robotics, and medical imaging where precise
pixel-level understanding of scenes is critical (e.g., lane detection, tumor
boundary identification).
Conclusion
In conclusion, the hierarchical attention mechanism introduced in HAT-Net
plays a crucial role in enhancing feature representation for visual understanding.
By combining local detail with global context, HAT-Net enables more accurate
and robust perception across diverse vision tasks. This capability is especially
important in applications like autonomous driving, robotics, and smart
surveillance, where reliable scene interpretation is essential for safe and
intelligent decision-making.
References
[1] Z. Gu, Y. Zhang, B. Ni, and M. Zhang, Hierarchical Vision Transformer with
Deformable Embedding for Fine-Grained Visual Recognition.
[2] Z. Liu, Y. Lin, Y. Cao, et al., Swin Transformer: Hierarchical Vision Transformer
using Shifted Windows.
[3] W. Wang, E. Xie, and X. Li, Pyramid Vision Transformer (PVT): A Versatile
Backbone for Dense Prediction without Convolutions.
[4] H. Wu, B. Xiao, and N. Codella, Convolutional Vision Transformer (CvT): A
Hierarchical Vision Transformer Incorporating Convolutional Projections.
[5] G. Cheng and Y. Qiao, HRViT: High-Resolution Vision Transformer with
Dynamic Token Sparsification.
[6] X. Chu, Z. Tian, and Y. Wang, Twins: Revisiting the Design of Spatial
Attention in Vision Transformers.
THANK YOU