EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications

Muhammad Maaz1, Abdelrahman Shaker1, Hisham Cholakkal1, Salman Khan1,2, Syed Waqas Zamir3, Rao Muhammad Anwer1, Fahad Shahbaz Khan1
1 Mohamed bin Zayed University of Artificial Intelligence   2 Australian National University   3 Inception Institute of Artificial Intelligence

1. Introduction
models for mobile devices. While being easy to train and efficient in encoding local image details, these aforementioned light-weight CNNs do not explicitly model global interactions between pixels.

The introduction of self-attention in vision transformers (ViTs) [8] has made it possible to explicitly model this global interaction; however, this typically comes at the cost of slow inference because of the self-attention computation [24]. This becomes an important challenge for designing a lightweight ViT variant for mobile vision applications.

The majority of existing works employ CNN-based designs in developing efficient models. However, the convolution operation in CNNs inherits two main limitations: first, it has a local receptive field and is thereby unable to model global context; second, the learned weights are stationary at inference time, making CNNs inflexible to adapt to the input content. While both of these issues can be alleviated with transformers, they are typically compute intensive. A few recent works [29, 43] have investigated designing lightweight architectures for mobile vision tasks by combining the strengths of CNNs and ViTs. However, these approaches mainly focus on optimizing the parameters and incur higher multiply-add (MAdds) operations, which restricts high-speed inference on mobile devices. The MAdds are higher since the complexity of the attention block is quadratic with respect to the input size [29]. This becomes further problematic due to the multiple attention blocks in the network architecture. Here, we argue that the model size, parameters, and MAdds should all be small with respect to resource-constrained devices when designing a unified mobile architecture that effectively combines the complementary advantages of CNNs and ViTs (see Fig. 1).

Contributions. We propose a new light-weight architecture, named EdgeNeXt, that is efficient in terms of model size, parameters, and MAdds, while being superior in accuracy on mobile vision tasks. Specifically, we introduce a split depth-wise transpose attention (SDTA) encoder that effectively learns both local and global representations to address the issue of limited receptive fields in CNNs without increasing the number of parameters and MAdd operations. Our proposed architecture shows favorable performance in terms of both accuracy and latency compared to state-of-the-art mobile networks on various tasks, including image classification, object detection, and semantic segmentation. Our EdgeNeXt backbone with 5.6M parameters and 1.3G MAdds achieves 79.4% top-1 ImageNet-1K classification accuracy, which is superior to its recently introduced MobileViT counterpart [29], while requiring 35% fewer MAdds. For object detection and semantic segmentation tasks, the proposed EdgeNeXt achieves higher mAP and mIOU with fewer MAdds and a comparable number of parameters, compared to all the published lightweight models in the literature.

2. Related Work

In recent years, designing lightweight, hardware-efficient convolutional neural networks for mobile vision tasks has been well studied in the literature. Current methods focus on designing efficient versions of convolutions for low-powered edge devices [17, 19]. Among these methods, MobileNet [17] is the most widely used architecture, which employs depth-wise separable convolutions [5]. On the other hand, ShuffleNet [44] uses channel shuffling and low-cost group convolutions. MobileNetV2 [33] introduces an inverted residual block with a linear bottleneck, achieving promising performance on various vision tasks. ESPNetv2 [30] utilizes depth-wise dilated convolutions to increase the receptive field of the network without increasing the network complexity. Hardware-aware neural architecture search (NAS) has also been explored to find a better trade-off between speed and accuracy on mobile devices [16, 36]. Although these CNNs are faster to train and infer on mobile devices, they lack global interaction between pixels, which limits their accuracy.

Recently, Dosovitskiy et al. [8] introduced a vision transformer architecture based on the self-attention mechanism [38] for vision tasks. Their proposed architecture utilizes large-scale pre-training data (e.g., JFT-300M), extensive data augmentations, and a longer training schedule to achieve competitive performance. Later, DeiT [37] proposed to integrate a distillation token into this architecture and to train only on the ImageNet-1K [32] dataset. Since then, several variants of ViTs and hybrid architectures have been proposed in the literature, adding image-specific inductive bias to ViTs to obtain improved performance on different vision tasks [9, 11, 35, 39, 45].

ViT models achieve competitive results for several visual recognition tasks [8, 24]. However, it is difficult to deploy these models on resource-constrained edge devices because of the high computational cost of multi-headed self-attention (MHA). There has been recent work on designing lightweight hybrid networks for mobile vision tasks that combine the advantages of CNNs and transformers. MobileFormer [4] employs parallel branches of MobileNetV2 [33] and ViT [8] with a bridge connecting both branches for local-global interaction. Mehta et al. [29] view transformers as convolutions and propose a MobileViT block for local-global image context fusion. Their approach achieves superior performance on image classification, surpassing previous light-weight CNNs and ViTs using a similar parameter budget.

Although MobileViT [29] mainly focuses on optimizing parameters and latency, MHA remains the main efficiency bottleneck in this model, especially for the number of MAdds and the inference time on edge devices. The complexity of MHA in MobileViT is quadratic with respect to the input size, which is the main efficiency bottleneck
given the nine attention blocks in the MobileViT-S model. In this work, we strive to design a new light-weight architecture for mobile devices that is efficient in terms of both parameters and MAdds, while being superior in accuracy on mobile vision tasks. Our proposed architecture, EdgeNeXt, is built on the recently introduced ConvNeXt [25], which modernizes the ResNet [14] architecture following ViT design choices. Within EdgeNeXt, we introduce an SDTA block that combines depth-wise convolutions with adaptive kernel sizes along with transpose attention in an efficient manner, obtaining an optimal accuracy-speed trade-off.

3. EdgeNeXt

The main objective of this work is to develop a lightweight hybrid design that effectively fuses the merits of ViTs and CNNs for low-powered edge devices. The computational overhead in ViTs (e.g., MobileViT [29]) is mainly due to the self-attention operation. In contrast to MobileViT, the attention block in our model has linear complexity with respect to the input spatial dimension, O(N·d²), where N is the number of patches and d is the feature/channel dimension. The self-attention operation in our model is applied across channel dimensions instead of the spatial dimension. Furthermore, we demonstrate that with a much lower number of attention blocks (3 versus 9 in MobileViT), we can surpass their performance mark. In this way, the proposed framework can model global representations with a limited number of MAdds, which is a fundamental criterion for low-latency inference on edge devices. To motivate our proposed architecture, we present two desirable properties.

a) Encoding the global information efficiently. The intrinsic characteristic of self-attention to learn global representations is crucial for vision tasks. To inherit this advantage efficiently, we use cross-covariance attention to incorporate the attention operation across the feature channel dimension instead of the spatial dimension within a relatively small number of network blocks. This reduces the complexity of the original self-attention operation from quadratic to linear in terms of the number of tokens and implicitly encodes the global information effectively.

b) Adaptive kernel sizes. Large-kernel convolutions are known to be computationally expensive, since the number of parameters and FLOPs increases quadratically as the kernel size grows. Although a larger kernel helps increase the receptive field, using such large kernels across the whole network hierarchy is expensive and sub-optimal. We propose an adaptive kernel size mechanism to reduce this complexity and capture different levels of features in the network. Inspired by the hierarchy of CNNs, we use smaller kernels at the early stages and larger kernels at the latter stages in the convolution encoder blocks. This design choice is optimal, as the early stages in a CNN usually capture low-level features and smaller kernels are suitable for this purpose, whereas in later stages of the network large convolutional kernels are required to capture high-level features [42]. We explain our architectural details next.

Overall Architecture. Fig. 2 illustrates an overview of the proposed EdgeNeXt architecture. The main ingredients are two-fold: (1) an adaptive N×N Conv. encoder, and (2) a split depth-wise transpose attention (SDTA) encoder. Our EdgeNeXt architecture builds on the design principles of ConvNeXt [25] and extracts hierarchical features at four different scales across the four stages. The input image of size H×W×3 is passed through a patchify stem layer at the beginning of the network, implemented using a 4×4 non-overlapping convolution followed by a layer norm, which results in H/4 × W/4 × C1 feature maps. Then, the output is passed to a 3×3 Conv. encoder to extract local features. The second stage begins with a downsampling layer implemented using a 2×2 strided convolution that reduces the spatial size by half and increases the channels, resulting in H/8 × W/8 × C2 feature maps, followed by two consecutive 5×5 Conv. encoders. Positional encoding (PE) is added before the SDTA block in the second stage only. We observe that PE is important for dense prediction tasks (e.g., object detection and segmentation), while adding it in all stages increases the latency of the network. Hence, we add it only once in the network to encode the spatial location information. The output feature maps are further passed to the third and fourth stages to generate H/16 × W/16 × C3 and H/32 × W/32 × C4 dimensional features, respectively.

Convolution Encoder. This block consists of depth-wise separable convolutions with adaptive kernel sizes. It can be defined by two separate parts: (1) a depth-wise convolution with an adaptive k×k kernel, where we use k = 3, 5, 7, and 9 for stages 1, 2, 3, and 4, respectively; and (2) two point-wise convolution layers that enrich the local representation, together with standard Layer Normalization [2] (LN) and Gaussian Error Linear Unit [15] (GELU) activation for non-linear feature mapping. Finally, a skip connection is added to make information flow across the network hierarchy. This block is similar to the ConvNeXt block, but the kernel sizes are dynamic and vary depending on the stage. We observe that adaptive kernel sizes in the Conv. encoder perform better than static kernel sizes (Table 7). The Conv. encoder can be represented as follows:

    x_{i+1} = x_i + Linear_G(Linear(LN(Dw(x_i)))),                      (1)

where x_i denotes the input feature maps of shape H×W×C, Linear_G is a point-wise convolution layer followed by GELU, Dw is a k×k depth-wise convolution, LN is a normalization layer, and x_{i+1} denotes the output feature maps of the Conv. encoder.
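To make Eq. (1) concrete, a minimal PyTorch sketch of such a Conv. encoder block is given below. This is an illustrative re-implementation rather than the authors' released code; the class name, the 4x expansion ratio of the point-wise layers, and the constructor arguments are our own choices.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Eq. (1): x_{i+1} = x_i + Linear_G(Linear(LN(Dw(x_i))))."""
    def __init__(self, dim, kernel_size=7, expan_ratio=4):
        super().__init__()
        # k x k depth-wise convolution; k is adaptive (3, 5, 7, 9 for stages 1-4)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)                       # LN over the channel dimension
        self.pwconv1 = nn.Linear(dim, expan_ratio * dim)    # point-wise conv as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expan_ratio * dim, dim)

    def forward(self, x):                                   # x: (B, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                           # to (B, H, W, C) for LN/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                           # back to (B, C, H, W)
        return residual + x                                 # skip connection
```

Instantiating this block with kernel_size 3, 5, 7, and 9 in stages 1-4 reproduces the adaptive-kernel behavior described above.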
Figure 2. Top Row: The overall architecture of our framework is a stage-wise design. Here, the first stage downsamples the input image
to 1/4th resolution using 4 × 4 strided convolution followed by three 3 × 3 Convolution (Conv.) encoders. In stages 2-4, 2 × 2 strided
convolutions are used for downsampling at the start, followed by N ×N Convolution and the Split depth-wise Transpose Attention (SDTA)
encoders. Bottom Row: We present the design of the Conv. encoder (left) and the SDTA encoder (right). The Conv. encoder uses
N ×N depth-wise convolutions for spatial mixing followed by two pointwise convolutions for channel mixing. The SDTA Encoder splits
the input tensor into B channel groups and applies 3 × 3 depth-wise convolutions for multi-scale spatial mixing. The skip connections
between branches increase the overall receptive field of the network. The branches B3 and B4 are progressively activated in stages 3 and
4, increasing the overall receptive field in the deeper layers of the network. Within the proposed SDTA, we utilize Transpose Attention
followed by a light-weight MLP that applies attention to feature channels and has linear complexity with respect to the input image.
SDTA Encoder. The proposed split depth-wise transpose attention (SDTA) encoder has two main components: the first learns an adaptive multi-scale spatial representation through split depth-wise convolutions, while the second part implicitly encodes global image representations. The first part of our encoder is kept light-weight, with a constrained number of parameters and MAdds. Also, we use an adaptive number of subsets per stage to allow effective and flexible feature encoding. In our SDTA encoder, we split the input tensor of size H×W×C into s subsets; each subset is denoted x_i and has the same spatial size with C/s channels, where i ∈ {1, 2, ..., s} and C is the number of channels. Each feature-map subset (except the first subset) is passed to a 3×3 depth-wise convolution, denoted d_i, and the output is denoted y_i. Also, the output of d_{i-1}, denoted y_{i-1}, is added to the feature subset x_i and then fed into d_i. The number of subsets s is adaptive based on the stage number t, where t ∈ {2, 3, 4}. We can write y_i as

    y_i = x_i                      if i = 1,
    y_i = d_i(x_i + y_{i-1})       if 1 < i ≤ s,                        (2)

where each depth-wise operation d_i, as shown in the SDTA encoder in Fig. 2, receives the feature maps output from all previous splits, and the outputs of all branches are concatenated back to C channels.
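A minimal sketch of this split depth-wise branch (Eq. (2)) is shown below; it is our own illustrative implementation (the class name and the assumption that C is divisible by s are ours), with the number of splits s chosen per stage by the caller.

```python
import torch
import torch.nn as nn

class SplitDepthwiseBranch(nn.Module):
    """Multi-scale split depth-wise convolutions of Eq. (2)."""
    def __init__(self, dim, splits=4):
        super().__init__()
        self.splits = splits
        width = dim // splits                      # C/s channels per subset
        # one 3x3 depth-wise convolution d_i for every subset except the first
        self.dwconvs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, groups=width)
            for _ in range(splits - 1)
        )

    def forward(self, x):                          # x: (B, C, H, W)
        xs = torch.chunk(x, self.splits, dim=1)    # x_1, ..., x_s
        ys = [xs[0]]                               # y_1 = x_1 (no convolution)
        for i in range(1, self.splits):
            ys.append(self.dwconvs[i - 1](xs[i] + ys[-1]))   # y_i = d_i(x_i + y_{i-1})
        return torch.cat(ys, dim=1)                # concatenate branches back to C channels
```

Because branch i sees the outputs of all earlier branches, the effective receptive field grows with i, which is the multi-scale behavior illustrated in Fig. 2.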
    Layer                   Output Size  #Layers (n)  Kernel   Output Channels (XXS / XS / S)
    Image                   256×256      1            -        -
    Stem                    64×64        1            4×4      24 / 32 / 48
    Conv. Encoder           64×64        3            3×3      24 / 32 / 48
    Downsampling            32×32        1            2×2      48 / 64 / 96
    Conv. Encoder           32×32        2            5×5      48 / 64 / 96
    SDTA Encoder            32×32        1            -        48 / 64 / 96
    Downsampling            16×16        1            2×2      88 / 100 / 160
    Conv. Encoder           16×16        8            7×7      88 / 100 / 160
    SDTA Encoder            16×16        1            -        88 / 100 / 160
    Downsampling            8×8          1            2×2      168 / 192 / 304
    Conv. Encoder           8×8          2            9×9      168 / 192 / 304
    SDTA Encoder            8×8          1            -        168 / 192 / 304
    Global Average Pooling  1×1          1            -        -
    Linear                  1×1          1            -        1000 / 1000 / 1000
    Model MAdds                                               0.3G / 0.5G / 1.3G
    Model Parameters                                          1.3M / 2.3M / 5.6M

Table 1. EdgeNeXt architectures. Description of the models' layers with respect to output size, kernel size, and output channels, repeated n times, along with the models' MAdds and parameters. The number of output channels for the small (S), extra-small (XS), and extra-extra-small (XXS) models is chosen to match the number of parameters of the counterpart MobileViT models. We use adaptive kernel sizes in the Conv. encoder to reduce the model complexity and capture different levels of features. Also, we pad the output size of the last stage to be able to apply the 9×9 filter.
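For quick reference when building the three variants programmatically, the per-stage configuration in Table 1 can be written as a small Python structure; the dictionary below is a hypothetical encoding (field names are ours, not taken from the released code).

```python
# Stage layout shared by all variants (read off Table 1 and Table 7, row 3),
# plus per-variant channel widths C1-C4; names are illustrative only.
STAGE_LAYOUT = dict(
    conv_blocks=[3, 2, 8, 2],    # Conv. encoders per stage
    sdta_blocks=[0, 1, 1, 1],    # one SDTA encoder closes stages 2-4
    kernel_sizes=[3, 5, 7, 9],   # adaptive depth-wise kernel size per stage
)

EDGENEXT_CHANNELS = {
    "XXS": [24, 48, 88, 168],
    "XS":  [32, 64, 100, 192],
    "S":   [48, 96, 160, 304],
}
```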
The second component of the SDTA encoder encodes the global image context. Since standard multi-headed self-attention (MSA) is expensive at these resolutions, we instead use transposed attention [1], which has linear complexity: the dot-product operation of the MSA is applied across channel dimensions instead of the spatial dimension, which allows us to compute the cross-covariance across channels and generate attention feature maps that carry implicit knowledge about the global representations. Given a normalized tensor Y of shape H×W×C, we compute query (Q), key (K), and value (V) projections using three linear layers, yielding Q = W^Q Y, K = W^K Y, and V = W^V Y, each of dimension HW×C, where W^Q, W^K, and W^V are the projection weights for Q, K, and V, respectively. Then, an L2 norm is applied to Q and K before computing the cross-covariance attention, as it stabilizes the training. Instead of applying the dot-product between Q and K^T along the spatial dimension, i.e., (HW×C)·(C×HW), we apply the dot-product across the channel dimensions between Q^T and K, i.e., (C×HW)·(HW×C), producing a C×C softmax-scaled attention score matrix. To obtain the final attention maps, we multiply the scores by V. The transposed attention operation can thus be expressed as

    Attention(Q, K, V) = V · softmax(Q^T K).                            (3)
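Below is a compact PyTorch sketch of this transposed (cross-covariance) attention, written for a single head; it is an illustrative re-implementation (the fused qkv projection, the output projection, and the choice of normalization axis are our assumptions), not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransposedAttention(nn.Module):
    """Cross-covariance attention: a C x C attention map instead of HW x HW."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)   # W^Q, W^K, W^V fused into one projection
        self.proj = nn.Linear(dim, dim)      # output projection (our addition)

    def forward(self, y):                    # y: (B, N, C), the normalized tensor Y, N = H*W
        q, k, v = self.qkv(y).chunk(3, dim=-1)              # each (B, N, C)
        q = F.normalize(q, dim=1)            # L2 norm on Q and K (axis choice is ours)
        k = F.normalize(k, dim=1)
        attn = (q.transpose(-2, -1) @ k).softmax(dim=-1)    # softmax(Q^T K): (B, C, C)
        out = v @ attn                       # V * softmax(Q^T K): (B, N, C)
        return self.proj(out)
```

Since the attention matrix is C×C rather than (HW)×(HW), the cost grows linearly with the number of tokens, which is what keeps the SDTA encoder inexpensive at higher input resolutions.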
4. Experiments

We evaluate image classification on ImageNet-1K [32] and, following the literature [17, 29], report top-1 accuracy on its validation set for all classification experiments. For object detection, we use the COCO [22] dataset, which provides approximately 118k training and 5k validation images. For segmentation, we use the Pascal VOC 2012 dataset [10], which provides almost 10k images with semantic segmentation masks. Following the standard practice in [29], we also use extra data and annotations from [22] and [13].
    Framework   Model          Date          Input   Params↓   MAdds↓   Top1↑
    ConvNets    MobileNetV2    CVPR 2018     224²    6.9M      585M     74.7
                ShuffleNetV2   ECCV 2018     224²    5.5M      597M     74.5
                MobileNetV3    ICCV 2019     224²    5.4M      219M     75.2
    ViTs        T2T-ViT        ICCV 2021     224²    6.9M      1.80G    76.5
                DeiT-T         ICML 2021     224²    5.7M      1.25G    72.2
    Hybrid      MobileFormer   CoRR 2021     224²    9.4M      214M     76.7
                ViT-C          NeurIPS 2021  224²    4.6M      1.10G    75.3
                CoaT-Lite-T    ICCV 2021     224²    5.7M      1.60G    77.5
                MobileViT-S    ICLR 2022     256²    5.6M      2.01G    78.4
                EdgeNeXt-S     Ours          224²    5.6M      965M     78.8
                EdgeNeXt-S     Ours          256²    5.6M      1.30G    79.4

Table 2. Classification performance comparison of our proposed EdgeNeXt model with state-of-the-art lightweight fully convolutional, transformer-based, and hybrid models on the ImageNet-1K validation set. Our model outperforms the SOTA models and achieves a better trade-off between accuracy and compute (i.e., parameters and multiply-add (MAdds) operations).
For detection and segmentation tasks, we finetune EdgeNeXt following settings similar to [29] and report the mean average precision (mAP) at IOU of 0.50-0.95 and the mean intersection over union (mIOU), respectively. The experiments are run on four A100 GPUs, with an average training time of ∼36 and ∼7 hours for detection and segmentation, respectively.

We also report the latency of our models on the NVIDIA Jetson Nano (https://developer.nvidia.com/embedded/jetson-nano-developer-kit) and an NVIDIA A100 40GB GPU. For Jetson Nano, we convert all the models to TensorRT (https://github.com/NVIDIA/TensorRT) engines and perform inference in FP16 mode with a batch size of 1. For A100, similar to [25], we use PyTorch v1.8.1 with a batch size of 256 to measure the latency.
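As an illustration of this GPU latency protocol (a batch of 256, eval mode, CUDA synchronization around the timed region), a minimal timing sketch is given below; the warm-up and iteration counts are our own choices and are not taken from the paper.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, batch_size=256, resolution=256, warmup=10, iters=50):
    """Average forward-pass time per batch on the current CUDA device."""
    device = torch.device("cuda")
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    for _ in range(warmup):                 # warm-up iterations (excluded from timing)
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()                # wait for all kernels before reading the clock
    return (time.time() - start) / iters
```

The Jetson Nano numbers instead come from TensorRT FP16 engines run with a batch size of 1, as stated above.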
4.3. Image Classification

Table 2 compares our proposed EdgeNeXt model with previous state-of-the-art fully convolutional (ConvNets), transformer-based (ViTs), and hybrid models. Overall, our model demonstrates a better accuracy-versus-compute (parameters and MAdds) trade-off than all three categories of methods (see Fig. 1).

Comparison with ConvNets. EdgeNeXt surpasses lightweight ConvNets by a formidable margin in top-1 accuracy with a similar number of parameters (Table 2). Normally, ConvNets have fewer MAdds than transformer and hybrid models because there is no attention computation; however, they lack a global receptive field. For instance, EdgeNeXt-S has higher MAdds than MobileNetV2 [33], but it obtains a 4.1% gain in top-1 accuracy with fewer parameters. Also, our EdgeNeXt-S outperforms ShuffleNetV2 [28] and MobileNetV3 [16] by 4.3% and 3.6%, respectively, with a comparable number of parameters.

Comparison with ViTs. Our EdgeNeXt outperforms recent ViT variants on the ImageNet-1K dataset with fewer parameters and MAdds. For example, EdgeNeXt-S obtains 78.8% top-1 accuracy, surpassing T2T-ViT [41] and DeiT-T [37] by 2.3% and 6.6% absolute margins, respectively.

Comparison with Hybrid Models. EdgeNeXt outperforms MobileFormer [4], ViT-C [39], and CoaT-Lite-T [7] with fewer parameters and fewer MAdds (Table 2). For a fair comparison with MobileViT [29], we train our model at an input resolution of 256×256 and show consistent gains for different model sizes (i.e., S, XS, and XXS) with fewer MAdds and faster inference on edge devices (Table 3). For instance, our EdgeNeXt-XXS model achieves 71.2% top-1 accuracy with only 1.3M parameters, surpassing the corresponding MobileViT model by 2.2%. Our EdgeNeXt-S model attains 79.4% accuracy on ImageNet with only 5.6M parameters, a margin of 1.0% over the corresponding MobileViT-S model. This demonstrates the effectiveness and the generalization of our design.

4.4. Inference on Edge Devices

We compute the inference time of our EdgeNeXt models on the NVIDIA Jetson Nano edge device and compare it with the state-of-the-art MobileViT [29] model (Table 3). All the models are converted to TensorRT engines and inference is performed in FP16 mode. Our models attain lower latency on the edge device with similar parameters, fewer MAdds, and higher top-1 accuracy. Table 3 also lists the inference time on an A100 GPU for both MobileViT and EdgeNeXt models. It can be observed that our EdgeNeXt-XXS model is ∼34% faster than the MobileViT-XXS model on the A100, compared to only ∼8% faster on the Jetson Nano, indicating that EdgeNeXt better utilizes advanced hardware than MobileViT.

4.5. Object Detection

We use EdgeNeXt as a backbone in SSDLite and finetune the model on the COCO 2017 dataset [22] at an input resolution of 320×320. The difference between SSD [23] and SSDLite is that the standard convolutions are replaced with separable convolutions in the SSD head. The results are reported in Table 4.
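The separable convolutions mentioned above factor a standard k×k convolution into a depth-wise step followed by a point-wise step. The sketch below shows this generic replacement (the BatchNorm/ReLU6 choices follow common SSDLite-style practice and are our assumption, not a detail taken from the paper).

```python
import torch.nn as nn

def separable_conv(in_ch, out_ch, kernel_size=3, stride=1):
    """Depth-wise + point-wise replacement for a standard convolution in a detection head."""
    return nn.Sequential(
        # depth-wise: one k x k filter per input channel
        nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU6(inplace=True),
        # point-wise: 1 x 1 convolution mixes channels
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
    )
```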
    Model           Date        Input   Params↓   MAdds↓   Top1↑   Jetson↓   A100↓
    MobileViT-XXS   ICLR 2022   256²    1.3M      364M     69.0    21.0 ms   216 µs
    MobileViT-XS    ICLR 2022   256²    2.3M      886M     74.8    35.1 ms   423 µs
    MobileViT-S     ICLR 2022   256²    5.6M      2.01G    78.4    53.0 ms   559 µs
    EdgeNeXt-XXS    Ours        256²    1.3M      261M     71.2    19.3 ms   142 µs
    EdgeNeXt-XS     Ours        256²    2.3M      538M     75.0    31.6 ms   227 µs
    EdgeNeXt-S      Ours        256²    5.6M      1.30G    79.4    48.8 ms   332 µs

Table 3. Comparison of different variants of EdgeNeXt with the counterpart MobileViT models. The last two columns list the latency in ms and µs on the Jetson Nano and A100 devices, respectively. Our EdgeNeXt models provide higher accuracy with lower latency for each model size.
EdgeNeXt consistently outperforms MobileNet backbones and gives competitive performance compared to the MobileViT backbone. With ∼38% fewer MAdds and comparable parameters, EdgeNeXt achieves 27.9 box AP, which is 0.2 points higher than MobileViT.

    Model               Params↓   MAdds↓   mAP↑
    MobileNetV1         5.1M      1.3G     22.2
    MobileNetV2         4.3M      800M     22.1
    MobileNetV3         5.0M      620M     22.0
    MobileViT-S         5.7M      3.4G     27.7
    EdgeNeXt-S (ours)   6.2M      2.1G     27.9

Table 4. Comparison with the state of the art on COCO object detection. EdgeNeXt improves over previous approaches.

4.6. Semantic Segmentation

We use EdgeNeXt as the backbone in DeepLabv3 [3] and finetune the model on the Pascal VOC [10] dataset at an input resolution of 512×512. DeepLabv3 uses dilated convolutions in a cascade design along with spatial pyramid pooling to encode multi-scale features, which are useful for encoding objects at multiple scales. Our model obtains 80.2 mIOU on the validation set, providing a 1.1-point gain over MobileViT with ∼36% fewer MAdds.

    Model               Params↓   MAdds↓   mIOU↑
    MobileNetV1         11.1M     14.2G    75.3
    MobileNetV2         4.5M      5.8G     75.7
    MobileViT-S         5.7M      13.7G    79.1
    EdgeNeXt-S (ours)   6.5M      8.7G     80.2

Table 5. Comparison with the state of the art on VOC semantic segmentation. Our model provides a reasonable gain over previous approaches.
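The spatial pyramid pooling that DeepLabv3 relies on can be illustrated with the simplified module below: parallel dilated convolutions plus an image-level pooling branch, concatenated and projected. This is a generic sketch for illustration (the module name and dilation rates are our choices), not the exact segmentation head used in these experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleASPP(nn.Module):
    """Parallel dilated convolutions capture context at multiple scales."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +                        # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)    # dilated 3x3 branches
             for r in rates]
        )
        self.pool_branch = nn.Conv2d(in_ch, out_ch, 1)             # image-level context branch
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        pooled = self.pool_branch(F.adaptive_avg_pool2d(x, 1))     # global average pooling
        feats.append(F.interpolate(pooled, size=x.shape[-2:], mode="nearest"))
        return self.project(torch.cat(feats, dim=1))               # fuse the multi-scale features
```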
5. Ablations

In this section, we ablate different design choices in our proposed EdgeNeXt model.

SDTA encoder and adaptive kernel sizes. Table 6 shows the importance of SDTA encoders and adaptive kernel sizes in our proposed architecture. Replacing SDTA encoders with convolution encoders degrades the accuracy by 1.1%, indicating their usefulness in our design. When we fix the kernel size to 7 in all four stages of the network, it further reduces the accuracy by 0.4%. Overall, our proposed design provides an optimal speed-accuracy trade-off. We also ablate the contributions of the SDTA components (e.g., adaptive branching and positional encoding) in Table 6. Removing adaptive branching and positional encoding slightly decreases the accuracy.

                            Model                    Top1↑   Latency↓
    Base                    EdgeNeXt-S               79.4    332 µs
    Different Components    w/o SDTA Encoders        78.3    265 µs
                            + w/o Adaptive Kernels   77.9    301 µs
    SDTA Components         w/o Adaptive Branching   79.3    332 µs
                            + w/o PE                 79.2    301 µs

Table 6. Ablation on different components of EdgeNeXt and the SDTA encoder design. The results show the benefits of SDTA encoders and adaptive kernels in our design. Further, adaptive branching and positional encoding (PE) in our SDTA module are required to obtain good accuracy.

Hybrid design. Table 7 ablates different hybrid design choices for our EdgeNeXt model. Motivated by MetaFormer [40], we replace all convolutional modules in the last two stages with SDTA encoders. The results show superior performance when all blocks in the last two stages are SDTA blocks, but this increases the latency (row 2 vs. 3). Our hybrid design, where we use an SDTA module as the last block in the last three stages, provides an optimal speed-accuracy trade-off.

    Model Configuration                       Top1↑   Latency↓
    1: Conv=[3, 3, 9, 0], SDTA=[0, 0, 0, 3]   79.3    303 µs
    2: Conv=[3, 3, 0, 0], SDTA=[0, 0, 9, 3]   79.7    393 µs
    3: Conv=[3, 2, 8, 2], SDTA=[0, 1, 1, 1]   79.4    332 µs

Table 7. Ablation on the hybrid architecture of Conv. and SDTA encoders. Using one SDTA encoder as the last block in the last three stages provides an optimal accuracy-latency trade-off.

Table 8 provides an ablation of the importance of using SDTA encoders at different stages of the network. It is noticeable that progressively adding an SDTA encoder as the last
block of the last three stages improves the accuracy with some loss in inference latency. However, in row 4 we obtain the best trade-off between accuracy and speed, where the SDTA encoder is added as the last block in the last three stages of the network. Further, we notice that adding a global SDTA encoder to the first stage of the network is not helpful, as the features there are not yet mature.

Table 9. Ablation on using SDTA at the start and the end of each stage in EdgeNeXt. The results show that it is generally beneficial to use SDTA at the end of each stage.

Activation and normalization. EdgeNeXt uses GELU activation and layer normalization throughout the network. We found that the current PyTorch implementations of GELU and layer normalization are not optimal for high-speed inference. To this end, we replace GELU with Hard-Swish and layer norm with batch norm and retrain our models. Fig. 3 indicates that this reduces the accuracy slightly but reduces the latency by a large margin.
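A minimal way to express this swap in code is sketched below; wiring it into each block (and handling the NCHW versus NHWC layouts of BatchNorm2d and LayerNorm) is left to the surrounding model code, and the retraining mentioned above is still required.

```python
import torch.nn as nn

def norm_and_act(channels, deploy_friendly=False):
    """Return (normalization, activation): GELU + LayerNorm as in the default EdgeNeXt,
    or Hard-Swish + BatchNorm for the faster variant discussed above."""
    if deploy_friendly:
        # BatchNorm2d expects NCHW tensors and can be folded into the preceding conv at inference time
        return nn.BatchNorm2d(channels), nn.Hardswish()
    # nn.LayerNorm normalizes the last dimension, so NCHW feature maps must be
    # permuted to NHWC before applying it (as in the Conv. encoder sketch earlier)
    return nn.LayerNorm(channels), nn.GELU()
```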
Figure 3. Ablation on the effect of using different activation functions and normalization layers on the accuracy and latency of our network variants. Using Hard-Swish activation and batch normalization instead of GELU and layer normalization significantly improves the latency at the cost of some loss in accuracy.

6. Qualitative Results

Figs. 4 and 5 show the qualitative results of the EdgeNeXt-S detection and segmentation models, respectively. Our model can precisely detect and segment objects in various views.

Figure 5. Qualitative results of the EdgeNeXt-S segmentation model on the unseen COCO validation dataset. The model is trained on the VOC dataset with 20 segmentation classes. (a) shows the predicted semantic segmentation mask, where black represents background pixels. (b) displays the predicted masks on top of the original images. (c) shows the color encodings of all VOC classes for the displayed segmentation masks. Our model provides high-quality segmentation masks on unseen COCO images.

7. Conclusion

The success of transformer models comes with a higher computational overhead compared to CNNs. The self-attention operation is the major contributor to this overhead, which makes vision transformers slow on edge devices compared to CNN-based mobile architectures. In this paper, we introduce a hybrid design consisting of convolution and efficient self-attention-based encoders to jointly model
local and global information effectively, while being efficient in terms of both parameters and MAdds on vision tasks, with superior performance compared to state-of-the-art methods. Our experimental results show promising performance for different variants of EdgeNeXt, which demonstrates the effectiveness and the generalization ability of the proposed model.

References

[1] Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. In Advances in Neural Information Processing Systems, 2021.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[4] Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-Former: Bridging MobileNet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[5] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
[6] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. RandAugment: Practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems, 2020.
[7] Zihang Dai, Hanxiao Liu, Quoc Le, and Mingxing Tan. CoAtNet: Marrying convolution and attention for all data sizes. In Advances in Neural Information Processing Systems, 2021.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[9] Stéphane d'Ascoli, Hugo Touvron, Matthew L Leavitt, Ari S Morcos, Giulio Biroli, and Levent Sagun. ConViT: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, 2021.
[10] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[11] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[12] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2Net: A new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(2):652–662, 2019.
[13] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2011.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
[15] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[16] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[17] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[18] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In The European Conference on Computer Vision, 2016.
[19] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[20] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 2021.
[21] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In The European Conference on Computer Vision, 2014.
[23] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In The European Conference on Computer Vision, 2016.
[24] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[25] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[26] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[28] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In The European Conference on Computer Vision, 2018.
[29] Sachin Mehta and Mohammad Rastegari. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In International Conference on Learning Representations, 2022.
[30] Sachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh Hajishirzi. ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[31] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 1992.
[32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[33] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[34] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[35] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[36] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[37] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 2021.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
[39] Tete Xiao, Piotr Dollar, Mannat Singh, Eric Mintun, Trevor Darrell, and Ross Girshick. Early convolutions help transformers see better. In Advances in Neural Information Processing Systems, 2021.
[40] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. MetaFormer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[41] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[42] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In The European Conference on Computer Vision, 2014.
[43] Haokui Zhang, Wenze Hu, and Xiaoyu Wang. EdgeFormer: Improving light-weight ConvNets by learning from vision transformers. arXiv preprint arXiv:2203.03952, 2022.
[44] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[45] Jingkai Zhou, Pichao Wang, Fan Wang, Qiong Liu, Hao Li, and Rong Jin. ELSA: Enhanced local self-attention for vision transformer. arXiv preprint arXiv:2112.12786, 2021.