EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications

Muhammad Maaz1, Abdelrahman Shaker1, Hisham Cholakkal1, Salman Khan1,2, Syed Waqas Zamir3, Rao Muhammad Anwer1, Fahad Shahbaz Khan1
1 Mohamed bin Zayed University of Artificial Intelligence   2 Australian National University   3 Inception Institute of Artificial Intelligence

1. Introduction
models for mobile devices. While being easy to train and efficient in encoding local image details, these aforementioned light-weight CNNs do not explicitly model global interactions between pixels.

The introduction of self-attention in vision transformers (ViTs) [8] has made it possible to explicitly model this global interaction; however, this typically comes at the cost of slow inference because of the self-attention computation [24]. This becomes an important challenge for designing a lightweight ViT variant for mobile vision applications.

The majority of existing works employ CNN-based designs in developing efficient models. However, the convolution operation in CNNs inherits two main limitations: first, it has a local receptive field and is thereby unable to model global context; second, the learned weights are stationary at inference time, making CNNs inflexible to adapt to the input content. While both of these issues can be alleviated with transformers, they are typically compute intensive. A few recent works [29, 43] have investigated designing lightweight architectures for mobile vision tasks by combining the strengths of CNNs and ViTs. However, these approaches mainly focus on optimizing the parameters and incur higher multiply-add (MAdds) operations, which restricts high-speed inference on mobile devices. The MAdds are higher since the complexity of the attention block is quadratic with respect to the input size [29]. This becomes further problematic due to the multiple attention blocks in the network architecture. Here, we argue that the model size, parameters, and MAdds should all be small with respect to resource-constrained devices when designing a unified mobile architecture that effectively combines the complementary advantages of CNNs and ViTs (see Fig. 1).

Contributions. We propose a new light-weight architecture, named EdgeNeXt, that is efficient in terms of model size, parameters, and MAdds, while being superior in accuracy on mobile vision tasks. Specifically, we introduce a split depth-wise transpose attention (SDTA) encoder that effectively learns both local and global representations to address the issue of limited receptive fields in CNNs without increasing the number of parameters and MAdd operations. Our proposed architecture shows favorable performance in terms of both accuracy and latency compared to state-of-the-art mobile networks on various tasks, including image classification, object detection, and semantic segmentation. Our EdgeNeXt backbone with 5.6M parameters and 1.3G MAdds achieves 79.4% top-1 ImageNet-1K classification accuracy, which is superior to its recently introduced MobileViT counterpart [29], while requiring 35% fewer MAdds. For object detection and semantic segmentation tasks, the proposed EdgeNeXt achieves higher mAP and mIOU with fewer MAdds and a comparable number of parameters, compared to all the published lightweight models in the literature.

2. Related Work

In recent years, designing lightweight, hardware-efficient convolutional neural networks for mobile vision tasks has been well studied in the literature. Current methods focus on designing efficient versions of convolutions for low-powered edge devices [17, 19]. Among these methods, MobileNet [17] is the most widely used architecture, which employs depth-wise separable convolutions [5]. On the other hand, ShuffleNet [44] uses channel shuffling and low-cost group convolutions. MobileNetV2 [33] introduces an inverted residual block with a linear bottleneck, achieving promising performance on various vision tasks. ESPNetv2 [30] utilizes depth-wise dilated convolutions to increase the receptive field of the network without increasing the network complexity. Hardware-aware neural architecture search (NAS) has also been explored to find a better trade-off between speed and accuracy on mobile devices [16, 36]. Although these CNNs are faster to train and infer on mobile devices, they lack global interaction between pixels, which limits their accuracy.

Recently, Dosovitskiy et al. [8] introduced a vision transformer architecture based on the self-attention mechanism [38] for vision tasks. Their proposed architecture utilizes large-scale pre-training data (e.g., JFT-300M), extensive data augmentations, and a longer training schedule to achieve competitive performance. Later, DeiT [37] proposed to integrate a distillation token into this architecture and to train only on the ImageNet-1K [32] dataset. Since then, several variants of ViTs and hybrid architectures have been proposed in the literature, adding image-specific inductive bias to ViTs to obtain improved performance on different vision tasks [9, 11, 35, 39, 45].

ViT models achieve competitive results for several visual recognition tasks [8, 24]. However, it is difficult to deploy these models on resource-constrained edge devices because of the high computational cost of multi-headed self-attention (MHA). There has been recent work on designing lightweight hybrid networks for mobile vision tasks that combine the advantages of CNNs and transformers. MobileFormer [4] employs parallel branches of MobileNetV2 [33] and ViT [8] with a bridge connecting both branches for local-global interaction. Mehta et al. [29] view transformers as convolutions and propose a MobileViT block for local-global image context fusion. Their approach achieves superior performance on image classification, surpassing previous light-weight CNNs and ViTs using a similar parameter budget.

Although MobileViT [29] mainly focuses on optimizing parameters and latency, MHA remains the main efficiency bottleneck in this model, especially for the number of MAdds and the inference time on edge devices. The complexity of MHA in MobileViT is quadratic with respect to the input size, which is the main efficiency bottleneck
given the nine attention blocks in the MobileViT-S model. In this work, we strive to design a new light-weight architecture for mobile devices that is efficient in terms of both parameters and MAdds, while being superior in accuracy on mobile vision tasks. Our proposed architecture, EdgeNeXt, is built on the recently introduced ConvNeXt [25], which modernizes the ResNet [14] architecture following ViT design choices. Within EdgeNeXt, we introduce an SDTA block that combines depth-wise convolutions with adaptive kernel sizes along with transpose attention in an efficient manner, obtaining an optimal accuracy-speed trade-off.

3. EdgeNeXt

The main objective of this work is to develop a lightweight hybrid design that effectively fuses the merits of ViTs and CNNs for low-powered edge devices. The computational overhead in ViTs (e.g., MobileViT [29]) is mainly due to the self-attention operation. In contrast to MobileViT, the attention block in our model has linear complexity with respect to the input spatial dimension, O(N·d²), where N is the number of patches and d is the feature/channel dimension. The self-attention operation in our model is applied across channel dimensions instead of the spatial dimension. Furthermore, we demonstrate that with a much lower number of attention blocks (3 versus 9 in MobileViT), we can surpass their performance mark. In this way, the proposed framework can model global representations with a limited number of MAdds, which is a fundamental criterion for low-latency inference on edge devices. To motivate our proposed architecture, we present two desirable properties.

a) Encoding the global information efficiently. The intrinsic characteristic of self-attention to learn global representations is crucial for vision tasks. To inherit this advantage efficiently, we use cross-covariance attention to incorporate the attention operation across the feature channel dimension instead of the spatial dimension within a relatively small number of network blocks. This reduces the complexity of the original self-attention operation from quadratic to linear in terms of the number of tokens and implicitly encodes the global information effectively.

b) Adaptive kernel sizes. Large-kernel convolutions are known to be computationally expensive, since the number of parameters and FLOPs increases quadratically as the kernel size grows. Although a larger kernel helps increase the receptive field, using such large kernels across the whole network hierarchy is expensive and sub-optimal. We propose an adaptive kernel size mechanism to reduce this complexity and capture different levels of features in the network. Inspired by the hierarchy of CNNs, we use smaller kernels at the early stages and larger kernels at the latter stages in the convolution encoder blocks. This design choice is optimal, as the early stages in a CNN usually capture low-level features and smaller kernels are suitable for this purpose, whereas in later stages of the network large convolutional kernels are required to capture high-level features [42]. We explain our architectural details next.

Overall Architecture. Fig. 2 illustrates an overview of the proposed EdgeNeXt architecture. The main ingredients are two-fold: (1) an adaptive N×N Conv. encoder, and (2) a split depth-wise transpose attention (SDTA) encoder. Our EdgeNeXt architecture builds on the design principles of ConvNeXt [25] and extracts hierarchical features at four different scales across the four stages. The input image of size H×W×3 is passed through a patchify stem layer at the beginning of the network, implemented using a 4×4 non-overlapping convolution followed by a layer norm, which results in H/4 × W/4 × C1 feature maps. Then, the output is passed to a 3×3 Conv. encoder to extract local features. The second stage begins with a downsampling layer implemented using a 2×2 strided convolution that reduces the spatial size by half and increases the channels, resulting in H/8 × W/8 × C2 feature maps, followed by two consecutive 5×5 Conv. encoders. Positional encoding (PE) is added before the SDTA block in the second stage only. We observe that PE is important for dense prediction tasks (e.g., object detection and segmentation), while adding it in all stages increases the latency of the network. Hence, we add it only once in the network to encode the spatial location information. The output feature maps are further passed to the third and fourth stages to generate H/16 × W/16 × C3 and H/32 × W/32 × C4 dimensional features, respectively.

Convolution Encoder. This block consists of depth-wise separable convolutions with adaptive kernel sizes. It can be defined by two separate parts: (1) a depth-wise convolution with an adaptive k×k kernel, where we use k = 3, 5, 7, and 9 for stages 1, 2, 3, and 4, respectively; and (2) two point-wise convolution layers that enrich the local representation, together with standard Layer Normalization [2] (LN) and Gaussian Error Linear Unit [15] (GELU) activation for non-linear feature mapping. Finally, a skip connection is added to make information flow across the network hierarchy. This block is similar to the ConvNeXt block, but the kernel sizes are dynamic and vary depending on the stage. We observe that adaptive kernel sizes in the Conv. encoder perform better than static kernel sizes (Table 7). The Conv. encoder can be represented as follows:

    x_{i+1} = x_i + Linear_G(Linear(LN(Dw(x_i)))),                      (1)

where x_i denotes the input feature maps of shape H×W×C, Linear_G is a point-wise convolution layer followed by GELU, Dw is a k×k depth-wise convolution, LN is a normalization layer, and x_{i+1} denotes the output feature maps of the Conv. encoder.
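To make Eq. (1) concrete, a minimal PyTorch sketch of such a Conv. encoder block is given below. This is an illustrative re-implementation rather than the authors' released code; the class name, the 4x expansion ratio of the point-wise layers, and the constructor arguments are our own choices.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Eq. (1): x_{i+1} = x_i + Linear_G(Linear(LN(Dw(x_i))))."""
    def __init__(self, dim, kernel_size=7, expan_ratio=4):
        super().__init__()
        # k x k depth-wise convolution; k is adaptive (3, 5, 7, 9 for stages 1-4)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)                       # LN over the channel dimension
        self.pwconv1 = nn.Linear(dim, expan_ratio * dim)    # point-wise conv as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expan_ratio * dim, dim)

    def forward(self, x):                                   # x: (B, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                           # to (B, H, W, C) for LN/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                           # back to (B, C, H, W)
        return residual + x                                 # skip connection
```

Instantiating this block with kernel_size 3, 5, 7, and 9 in stages 1-4 reproduces the adaptive-kernel behavior described above.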
Figure 2. Top Row: The overall architecture of our framework is a stage-wise design. Here, the first stage downsamples the input image
to 1/4th resolution using 4 × 4 strided convolution followed by three 3 × 3 Convolution (Conv.) encoders. In stages 2-4, 2 × 2 strided
convolutions are used for downsampling at the start, followed by N ×N Convolution and the Split depth-wise Transpose Attention (SDTA)
encoders. Bottom Row: We present the design of the Conv. encoder (left) and the SDTA encoder (right). The Conv. encoder uses
N ×N depth-wise convolutions for spatial mixing followed by two pointwise convolutions for channel mixing. The SDTA Encoder splits
the input tensor into B channel groups and applies 3 × 3 depth-wise convolutions for multi-scale spatial mixing. The skip connections
between branches increase the overall receptive field of the network. The branches B3 and B4 are progressively activated in stages 3 and
4, increasing the overall receptive field in the deeper layers of the network. Within the proposed SDTA, we utilize Transpose Attention
followed by a light-weight MLP that applies attention to feature channels and has linear complexity with respect to the input image.
SDTA Encoder. The proposed split depth-wise transpose attention (SDTA) encoder has two main components: the first learns an adaptive multi-scale spatial representation through split depth-wise convolutions, while the second part implicitly encodes global image representations. The first part of our encoder is kept light-weight, with a constrained number of parameters and MAdds. Also, we use an adaptive number of subsets per stage to allow effective and flexible feature encoding. In our SDTA encoder, we split the input tensor of size H×W×C into s subsets; each subset is denoted x_i and has the same spatial size with C/s channels, where i ∈ {1, 2, ..., s} and C is the number of channels. Each feature-map subset (except the first subset) is passed to a 3×3 depth-wise convolution, denoted d_i, and the output is denoted y_i. Also, the output of d_{i-1}, denoted y_{i-1}, is added to the feature subset x_i and then fed into d_i. The number of subsets s is adaptive based on the stage number t, where t ∈ {2, 3, 4}. We can write y_i as

    y_i = x_i                      if i = 1,
    y_i = d_i(x_i + y_{i-1})       if 1 < i ≤ s,                        (2)

where each depth-wise operation d_i, as shown in the SDTA encoder in Fig. 2, receives the feature maps output from all previous splits, and the outputs of all branches are concatenated back to C channels.
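A minimal sketch of this split depth-wise branch (Eq. (2)) is shown below; it is our own illustrative implementation (the class name and the assumption that C is divisible by s are ours), with the number of splits s chosen per stage by the caller.

```python
import torch
import torch.nn as nn

class SplitDepthwiseBranch(nn.Module):
    """Multi-scale split depth-wise convolutions of Eq. (2)."""
    def __init__(self, dim, splits=4):
        super().__init__()
        self.splits = splits
        width = dim // splits                      # C/s channels per subset
        # one 3x3 depth-wise convolution d_i for every subset except the first
        self.dwconvs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, groups=width)
            for _ in range(splits - 1)
        )

    def forward(self, x):                          # x: (B, C, H, W)
        xs = torch.chunk(x, self.splits, dim=1)    # x_1, ..., x_s
        ys = [xs[0]]                               # y_1 = x_1 (no convolution)
        for i in range(1, self.splits):
            ys.append(self.dwconvs[i - 1](xs[i] + ys[-1]))   # y_i = d_i(x_i + y_{i-1})
        return torch.cat(ys, dim=1)                # concatenate branches back to C channels
```

Because branch i sees the outputs of all earlier branches, the effective receptive field grows with i, which is the multi-scale behavior illustrated in Fig. 2.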
    Layer                   Output Size  #Layers (n)  Kernel   Output Channels (XXS / XS / S)
    Image                   256×256      1            -        -
    Stem                    64×64        1            4×4      24 / 32 / 48
    Conv. Encoder           64×64        3            3×3      24 / 32 / 48
    Downsampling            32×32        1            2×2      48 / 64 / 96
    Conv. Encoder           32×32        2            5×5      48 / 64 / 96
    SDTA Encoder            32×32        1            -        48 / 64 / 96
    Downsampling            16×16        1            2×2      88 / 100 / 160
    Conv. Encoder           16×16        8            7×7      88 / 100 / 160
    SDTA Encoder            16×16        1            -        88 / 100 / 160
    Downsampling            8×8          1            2×2      168 / 192 / 304
    Conv. Encoder           8×8          2            9×9      168 / 192 / 304
    SDTA Encoder            8×8          1            -        168 / 192 / 304
    Global Average Pooling  1×1          1            -        -
    Linear                  1×1          1            -        1000 / 1000 / 1000
    Model MAdds                                               0.3G / 0.5G / 1.3G
    Model Parameters                                          1.3M / 2.3M / 5.6M

Table 1. EdgeNeXt architectures. Description of the models' layers with respect to output size, kernel size, and output channels, repeated n times, along with the models' MAdds and parameters. The number of output channels for the small (S), extra-small (XS), and extra-extra-small (XXS) models is chosen to match the number of parameters of the counterpart MobileViT models. We use adaptive kernel sizes in the Conv. encoder to reduce the model complexity and capture different levels of features. Also, we pad the output size of the last stage to be able to apply the 9×9 filter.
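For quick reference when building the three variants programmatically, the per-stage configuration in Table 1 can be written as a small Python structure; the dictionary below is a hypothetical encoding (field names are ours, not taken from the released code).

```python
# Stage layout shared by all variants (read off Table 1 and Table 7, row 3),
# plus per-variant channel widths C1-C4; names are illustrative only.
STAGE_LAYOUT = dict(
    conv_blocks=[3, 2, 8, 2],    # Conv. encoders per stage
    sdta_blocks=[0, 1, 1, 1],    # one SDTA encoder closes stages 2-4
    kernel_sizes=[3, 5, 7, 9],   # adaptive depth-wise kernel size per stage
)

EDGENEXT_CHANNELS = {
    "XXS": [24, 48, 88, 168],
    "XS":  [32, 64, 100, 192],
    "S":   [48, 96, 160, 304],
}
```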
The second component of the SDTA encoder encodes the global image context. Since standard multi-headed self-attention (MSA) is expensive at these resolutions, we instead use transposed attention [1], which has linear complexity: the dot-product operation of the MSA is applied across channel dimensions instead of the spatial dimension, which allows us to compute the cross-covariance across channels and generate attention feature maps that carry implicit knowledge about the global representations. Given a normalized tensor Y of shape H×W×C, we compute query (Q), key (K), and value (V) projections using three linear layers, yielding Q = W^Q Y, K = W^K Y, and V = W^V Y, each of dimension HW×C, where W^Q, W^K, and W^V are the projection weights for Q, K, and V, respectively. Then, an L2 norm is applied to Q and K before computing the cross-covariance attention, as it stabilizes the training. Instead of applying the dot-product between Q and K^T along the spatial dimension, i.e., (HW×C)·(C×HW), we apply the dot-product across the channel dimensions between Q^T and K, i.e., (C×HW)·(HW×C), producing a C×C softmax-scaled attention score matrix. To obtain the final attention maps, we multiply the scores by V. The transposed attention operation can thus be expressed as

    Attention(Q, K, V) = V · softmax(Q^T K).                            (3)
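Below is a compact PyTorch sketch of this transposed (cross-covariance) attention, written for a single head; it is an illustrative re-implementation (the fused qkv projection, the output projection, and the choice of normalization axis are our assumptions), not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransposedAttention(nn.Module):
    """Cross-covariance attention: a C x C attention map instead of HW x HW."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)   # W^Q, W^K, W^V fused into one projection
        self.proj = nn.Linear(dim, dim)      # output projection (our addition)

    def forward(self, y):                    # y: (B, N, C), the normalized tensor Y, N = H*W
        q, k, v = self.qkv(y).chunk(3, dim=-1)              # each (B, N, C)
        q = F.normalize(q, dim=1)            # L2 norm on Q and K (axis choice is ours)
        k = F.normalize(k, dim=1)
        attn = (q.transpose(-2, -1) @ k).softmax(dim=-1)    # softmax(Q^T K): (B, C, C)
        out = v @ attn                       # V * softmax(Q^T K): (B, N, C)
        return self.proj(out)
```

Since the attention matrix is C×C rather than (HW)×(HW), the cost grows linearly with the number of tokens, which is what keeps the SDTA encoder inexpensive at higher input resolutions.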
4. Experiments

We evaluate image classification on ImageNet-1K [32] and, following the literature [17, 29], report top-1 accuracy on its validation set for all classification experiments. For object detection, we use the COCO [22] dataset, which provides approximately 118k training and 5k validation images. For segmentation, we use the Pascal VOC 2012 dataset [10], which provides almost 10k images with semantic segmentation masks. Following the standard practice in [29], we also use extra data and annotations from [22] and [13].
    Framework   Model          Date          Input   Params↓   MAdds↓   Top1↑
    ConvNets    MobileNetV2    CVPR 2018     224²    6.9M      585M     74.7
                ShuffleNetV2   ECCV 2018     224²    5.5M      597M     74.5
                MobileNetV3    ICCV 2019     224²    5.4M      219M     75.2
    ViTs        T2T-ViT        ICCV 2021     224²    6.9M      1.80G    76.5
                DeiT-T         ICML 2021     224²    5.7M      1.25G    72.2
    Hybrid      MobileFormer   CoRR 2021     224²    9.4M      214M     76.7
                ViT-C          NeurIPS 2021  224²    4.6M      1.10G    75.3
                CoaT-Lite-T    ICCV 2021     224²    5.7M      1.60G    77.5
                MobileViT-S    ICLR 2022     256²    5.6M      2.01G    78.4
                EdgeNeXt-S     Ours          224²    5.6M      965M     78.8
                EdgeNeXt-S     Ours          256²    5.6M      1.30G    79.4

Table 2. Classification performance comparison of our proposed EdgeNeXt model with state-of-the-art lightweight fully convolutional, transformer-based, and hybrid models on the ImageNet-1K validation set. Our model outperforms the SOTA models and achieves a better trade-off between accuracy and compute (i.e., parameters and multiply-add (MAdds) operations).
For detection and segmentation tasks, we finetune EdgeNeXt following settings similar to [29] and report the mean average precision (mAP) at IOU of 0.50-0.95 and the mean intersection over union (mIOU), respectively. The experiments are run on four A100 GPUs, with an average training time of ∼36 and ∼7 hours for detection and segmentation, respectively.

We also report the latency of our models on the NVIDIA Jetson Nano (https://developer.nvidia.com/embedded/jetson-nano-developer-kit) and an NVIDIA A100 40GB GPU. For Jetson Nano, we convert all the models to TensorRT (https://github.com/NVIDIA/TensorRT) engines and perform inference in FP16 mode with a batch size of 1. For A100, similar to [25], we use PyTorch v1.8.1 with a batch size of 256 to measure the latency.
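As an illustration of this GPU latency protocol (a batch of 256, eval mode, CUDA synchronization around the timed region), a minimal timing sketch is given below; the warm-up and iteration counts are our own choices and are not taken from the paper.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, batch_size=256, resolution=256, warmup=10, iters=50):
    """Average forward-pass time per batch on the current CUDA device."""
    device = torch.device("cuda")
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    for _ in range(warmup):                 # warm-up iterations (excluded from timing)
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()                # wait for all kernels before reading the clock
    return (time.time() - start) / iters
```

The Jetson Nano numbers instead come from TensorRT FP16 engines run with a batch size of 1, as stated above.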
4.3. Image Classification

Table 2 compares our proposed EdgeNeXt model with previous state-of-the-art fully convolutional (ConvNets), transformer-based (ViTs), and hybrid models. Overall, our model demonstrates a better accuracy-versus-compute (parameters and MAdds) trade-off than all three categories of methods (see Fig. 1).

Comparison with ConvNets. EdgeNeXt surpasses lightweight ConvNets by a formidable margin in top-1 accuracy with a similar number of parameters (Table 2). Normally, ConvNets have fewer MAdds than transformer and hybrid models because there is no attention computation; however, they lack a global receptive field. For instance, EdgeNeXt-S has higher MAdds than MobileNetV2 [33], but it obtains a 4.1% gain in top-1 accuracy with fewer parameters. Also, our EdgeNeXt-S outperforms ShuffleNetV2 [28] and MobileNetV3 [16] by 4.3% and 3.6%, respectively, with a comparable number of parameters.

Comparison with ViTs. Our EdgeNeXt outperforms recent ViT variants on the ImageNet-1K dataset with fewer parameters and MAdds. For example, EdgeNeXt-S obtains 78.8% top-1 accuracy, surpassing T2T-ViT [41] and DeiT-T [37] by 2.3% and 6.6% absolute margins, respectively.

Comparison with Hybrid Models. EdgeNeXt outperforms MobileFormer [4], ViT-C [39], and CoaT-Lite-T [7] with fewer parameters and fewer MAdds (Table 2). For a fair comparison with MobileViT [29], we train our model at an input resolution of 256×256 and show consistent gains for different model sizes (i.e., S, XS, and XXS) with fewer MAdds and faster inference on edge devices (Table 3). For instance, our EdgeNeXt-XXS model achieves 71.2% top-1 accuracy with only 1.3M parameters, surpassing the corresponding MobileViT model by 2.2%. Our EdgeNeXt-S model attains 79.4% accuracy on ImageNet with only 5.6M parameters, a margin of 1.0% over the corresponding MobileViT-S model. This demonstrates the effectiveness and the generalization of our design.

4.4. Inference on Edge Devices

We compute the inference time of our EdgeNeXt models on the NVIDIA Jetson Nano edge device and compare it with the state-of-the-art MobileViT [29] model (Table 3). All the models are converted to TensorRT engines and inference is performed in FP16 mode. Our models attain lower latency on the edge device with similar parameters, fewer MAdds, and higher top-1 accuracy. Table 3 also lists the inference time on an A100 GPU for both MobileViT and EdgeNeXt models. It can be observed that our EdgeNeXt-XXS model is ∼34% faster than the MobileViT-XXS model on the A100, compared to only ∼8% faster on the Jetson Nano, indicating that EdgeNeXt better utilizes advanced hardware than MobileViT.

4.5. Object Detection

We use EdgeNeXt as a backbone in SSDLite and finetune the model on the COCO 2017 dataset [22] at an input resolution of 320×320. The difference between SSD [23] and SSDLite is that the standard convolutions are replaced with separable convolutions in the SSD head. The results are reported in Table 4.
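The separable convolutions mentioned above factor a standard k×k convolution into a depth-wise step followed by a point-wise step. The sketch below shows this generic replacement (the BatchNorm/ReLU6 choices follow common SSDLite-style practice and are our assumption, not a detail taken from the paper).

```python
import torch.nn as nn

def separable_conv(in_ch, out_ch, kernel_size=3, stride=1):
    """Depth-wise + point-wise replacement for a standard convolution in a detection head."""
    return nn.Sequential(
        # depth-wise: one k x k filter per input channel
        nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU6(inplace=True),
        # point-wise: 1 x 1 convolution mixes channels
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
    )
```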
    Model           Date        Input   Params↓   MAdds↓   Top1↑   Jetson↓   A100↓
    MobileViT-XXS   ICLR 2022   256²    1.3M      364M     69.0    21.0 ms   216 µs
    MobileViT-XS    ICLR 2022   256²    2.3M      886M     74.8    35.1 ms   423 µs
    MobileViT-S     ICLR 2022   256²    5.6M      2.01G    78.4    53.0 ms   559 µs
    EdgeNeXt-XXS    Ours        256²    1.3M      261M     71.2    19.3 ms   142 µs
    EdgeNeXt-XS     Ours        256²    2.3M      538M     75.0    31.6 ms   227 µs
    EdgeNeXt-S      Ours        256²    5.6M      1.30G    79.4    48.8 ms   332 µs

Table 3. Comparison of different variants of EdgeNeXt with the counterpart MobileViT models. The last two columns list the latency in ms and µs on the Jetson Nano and A100 devices, respectively. Our EdgeNeXt models provide higher accuracy with lower latency for each model size.
EdgeNeXt consistently outperforms MobileNet backbones and gives competitive performance compared to the MobileViT backbone. With ∼38% fewer MAdds and comparable parameters, EdgeNeXt achieves 27.9 box AP, which is 0.2 points higher than MobileViT.

    Model               Params↓   MAdds↓   mAP↑
    MobileNetV1         5.1M      1.3G     22.2
    MobileNetV2         4.3M      800M     22.1
    MobileNetV3         5.0M      620M     22.0
    MobileViT-S         5.7M      3.4G     27.7
    EdgeNeXt-S (ours)   6.2M      2.1G     27.9

Table 4. Comparison with the state of the art on COCO object detection. EdgeNeXt improves over previous approaches.

4.6. Semantic Segmentation

We use EdgeNeXt as the backbone in DeepLabv3 [3] and finetune the model on the Pascal VOC [10] dataset at an input resolution of 512×512. DeepLabv3 uses dilated convolutions in a cascade design along with spatial pyramid pooling to encode multi-scale features, which are useful for encoding objects at multiple scales. Our model obtains 80.2 mIOU on the validation set, providing a 1.1-point gain over MobileViT with ∼36% fewer MAdds.

    Model               Params↓   MAdds↓   mIOU↑
    MobileNetV1         11.1M     14.2G    75.3
    MobileNetV2         4.5M      5.8G     75.7
    MobileViT-S         5.7M      13.7G    79.1
    EdgeNeXt-S (ours)   6.5M      8.7G     80.2

Table 5. Comparison with the state of the art on VOC semantic segmentation. Our model provides a reasonable gain over previous approaches.
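The spatial pyramid pooling that DeepLabv3 relies on can be illustrated with the simplified module below: parallel dilated convolutions plus an image-level pooling branch, concatenated and projected. This is a generic sketch for illustration (the module name and dilation rates are our choices), not the exact segmentation head used in these experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleASPP(nn.Module):
    """Parallel dilated convolutions capture context at multiple scales."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +                        # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)    # dilated 3x3 branches
             for r in rates]
        )
        self.pool_branch = nn.Conv2d(in_ch, out_ch, 1)             # image-level context branch
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        pooled = self.pool_branch(F.adaptive_avg_pool2d(x, 1))     # global average pooling
        feats.append(F.interpolate(pooled, size=x.shape[-2:], mode="nearest"))
        return self.project(torch.cat(feats, dim=1))               # fuse the multi-scale features
```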
5. Ablations

In this section, we ablate different design choices in our proposed EdgeNeXt model.

SDTA encoder and adaptive kernel sizes. Table 6 shows the importance of SDTA encoders and adaptive kernel sizes in our proposed architecture. Replacing SDTA encoders with convolution encoders degrades the accuracy by 1.1%, indicating their usefulness in our design. When we fix the kernel size to 7 in all four stages of the network, it further reduces the accuracy by 0.4%. Overall, our proposed design provides an optimal speed-accuracy trade-off. We also ablate the contributions of the SDTA components (e.g., adaptive branching and positional encoding) in Table 6. Removing adaptive branching and positional encoding slightly decreases the accuracy.

                            Model                    Top1↑   Latency↓
    Base                    EdgeNeXt-S               79.4    332 µs
    Different Components    w/o SDTA Encoders        78.3    265 µs
                            + w/o Adaptive Kernels   77.9    301 µs
    SDTA Components         w/o Adaptive Branching   79.3    332 µs
                            + w/o PE                 79.2    301 µs

Table 6. Ablation on different components of EdgeNeXt and the SDTA encoder design. The results show the benefits of SDTA encoders and adaptive kernels in our design. Further, adaptive branching and positional encoding (PE) in our SDTA module are required to obtain good accuracy.

Hybrid design. Table 7 ablates different hybrid design choices for our EdgeNeXt model. Motivated by MetaFormer [40], we replace all convolutional modules in the last two stages with SDTA encoders. The results show superior performance when all blocks in the last two stages are SDTA blocks, but this increases the latency (row 2 vs. 3). Our hybrid design, where we use an SDTA module as the last block in the last three stages, provides an optimal speed-accuracy trade-off.

    Model Configuration                       Top1↑   Latency↓
    1: Conv=[3, 3, 9, 0], SDTA=[0, 0, 0, 3]   79.3    303 µs
    2: Conv=[3, 3, 0, 0], SDTA=[0, 0, 9, 3]   79.7    393 µs
    3: Conv=[3, 2, 8, 2], SDTA=[0, 1, 1, 1]   79.4    332 µs

Table 7. Ablation on the hybrid architecture of Conv. and SDTA encoders. Using one SDTA encoder as the last block in the last three stages provides an optimal accuracy-latency trade-off.

Table 8 provides an ablation of the importance of using SDTA encoders at different stages of the network. It is noticeable that progressively adding an SDTA encoder as the last
block of the last three stages improves the accuracy with some loss in inference latency. However, in row 4 we obtain the best trade-off between accuracy and speed, where the SDTA encoder is added as the last block in the last three stages of the network. Further, we notice that adding a global SDTA encoder to the first stage of the network is not helpful, as the features there are not yet mature.

Table 9. Ablation on using SDTA at the start and the end of each stage in EdgeNeXt. The results show that it is generally beneficial to use SDTA at the end of each stage.

Activation and normalization. EdgeNeXt uses GELU activation and layer normalization throughout the network. We found that the current PyTorch implementations of GELU and layer normalization are not optimal for high-speed inference. To this end, we replace GELU with Hard-Swish and layer norm with batch norm and retrain our models. Fig. 3 indicates that this reduces the accuracy slightly but reduces the latency by a large margin.
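A minimal way to express this swap in code is sketched below; wiring it into each block (and handling the NCHW versus NHWC layouts of BatchNorm2d and LayerNorm) is left to the surrounding model code, and the retraining mentioned above is still required.

```python
import torch.nn as nn

def norm_and_act(channels, deploy_friendly=False):
    """Return (normalization, activation): GELU + LayerNorm as in the default EdgeNeXt,
    or Hard-Swish + BatchNorm for the faster variant discussed above."""
    if deploy_friendly:
        # BatchNorm2d expects NCHW tensors and can be folded into the preceding conv at inference time
        return nn.BatchNorm2d(channels), nn.Hardswish()
    # nn.LayerNorm normalizes the last dimension, so NCHW feature maps must be
    # permuted to NHWC before applying it (as in the Conv. encoder sketch earlier)
    return nn.LayerNorm(channels), nn.GELU()
```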
Figure 3. Ablation on the effect of using different activation functions and normalization layers on the accuracy and latency of our network variants. Using Hard-Swish activation and batch normalization instead of GELU and layer normalization significantly improves the latency at the cost of some loss in accuracy.

6. Qualitative Results

Figs. 4 and 5 show the qualitative results of the EdgeNeXt-S detection and segmentation models, respectively. Our model can precisely detect and segment objects in various views.

Figure 5. Qualitative results of the EdgeNeXt-S segmentation model on the unseen COCO validation dataset. The model is trained on the VOC dataset with 20 segmentation classes. (a) shows the predicted semantic segmentation mask, where black represents background pixels. (b) displays the predicted masks on top of the original images. (c) shows the color encodings of all VOC classes for the displayed segmentation masks. Our model provides high-quality segmentation masks on unseen COCO images.

7. Conclusion

The success of transformer models comes with a higher computational overhead compared to CNNs. The self-attention operation is the major contributor to this overhead, which makes vision transformers slow on edge devices compared to CNN-based mobile architectures. In this paper, we introduce a hybrid design consisting of convolution and efficient self-attention-based encoders to jointly model
local and global information effectively, while being efficient in terms of both parameters and MAdds on vision tasks, with superior performance compared to state-of-the-art methods. Our experimental results show promising performance for different variants of EdgeNeXt, which demonstrates the effectiveness and the generalization ability of the proposed model.

References

[1] Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. In Advances in Neural Information Processing Systems, 2021.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[4] Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-Former: Bridging MobileNet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[5] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
[6] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. RandAugment: Practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems, 2020.
[7] Zihang Dai, Hanxiao Liu, Quoc Le, and Mingxing Tan. CoAtNet: Marrying convolution and attention for all data sizes. In Advances in Neural Information Processing Systems, 2021.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[9] Stéphane d'Ascoli, Hugo Touvron, Matthew L Leavitt, Ari S Morcos, Giulio Biroli, and Levent Sagun. ConViT: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, 2021.
[10] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[11] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[12] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2Net: A new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(2):652–662, 2019.
[13] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2011.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
[15] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[16] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[17] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[18] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In The European Conference on Computer Vision, 2016.
[19] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[20] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 2021.
[21] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In The European Conference on Computer Vision, 2014.
[23] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In The European Conference on Computer Vision, 2016.
[24] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[25] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[26] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[28] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In The European Conference on Computer Vision, 2018.
[29] Sachin Mehta and Mohammad Rastegari. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In International Conference on Learning Representations, 2022.
[30] Sachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh Hajishirzi. ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[31] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 1992.
[32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[33] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[34] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[35] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[36] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[37] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 2021.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
[39] Tete Xiao, Piotr Dollar, Mannat Singh, Eric Mintun, Trevor Darrell, and Ross Girshick. Early convolutions help transformers see better. In Advances in Neural Information Processing Systems, 2021.
[40] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. MetaFormer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[41] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[42] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In The European Conference on Computer Vision, 2014.
[43] Haokui Zhang, Wenze Hu, and Xiaoyu Wang. EdgeFormer: Improving light-weight ConvNets by learning from vision transformers. arXiv preprint arXiv:2203.03952, 2022.
[44] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[45] Jingkai Zhou, Pichao Wang, Fan Wang, Qiong Liu, Hao Li, and Rong Jin. ELSA: Enhanced local self-attention for vision transformer. arXiv preprint arXiv:2112.12786, 2021.